Introduction

Traditional Chinese murals, as vital carriers of Chinese civilization, hold profound cultural significance. These murals, typically found on the walls of temples, grottoes, and tombs, document the religious beliefs, lifestyles, and aesthetic pursuits of ancient societies. They also embody the spiritual heritage of the Chinese nation, a legacy spanning millennia. However, these invaluable artworks are susceptible to damage and deterioration due to natural erosion and the passage of time. This situation underscores the critical importance of their preservation and inpainting. Figure 1 shows the artwork Mural from Fengguo Temple, painted by anonymous artists of the Yuan Dynasty. The mural’s unique style and texture render manual inpainting exceptionally challenging.

Fig. 1: Examples of Mural from Fengguo Temple, painted by anonymous artists of the Yuan Dynasty.

These artworks are characterized by their complex compositions, distinctive artistic style, and significant physical degradation.

Deep learning-based image inpainting techniques now constitute a primary approach within computer vision for recovering missing information. They exhibit considerable promise in the digital preservation of cultural heritage1. Unlike conventional physical inpainting, this non-invasive approach enables high-fidelity content generation and texture reconstruction in damaged regions while preserving the integrity of the original artifact. This characteristic is particularly valuable for traditional Chinese murals, given their structural complexity, material fragility, and irreplaceability. Moreover, this technological domain is continuously advancing. In a seminal study, Yu et al. introduced the gated convolution method2. They observed that standard convolutional networks fail to differentiate between valid pixels and invalid values in damaged regions. This deficiency often results in inpainting marred by color distortion and structural artifacts. To overcome this limitation, they designed a learnable gating mechanism that allows the network to selectively process features from valid areas while disregarding corrupted ones. This innovation significantly enhanced the model’s ability to handle large, irregularly shaped defects, producing results with more coherent structures and smoother textural transitions.

The research focus in both academia and industry has recently shifted towards a new paradigm: diffusion models. Prominent examples, such as Latent Diffusion Models3 by Rombach et al. and the RePaint4 by Lugmayr et al., have established a novel generative pathway. The core principle of this approach is an iterative denoising process that transforms random noise into high-quality content coherent with the surrounding image context. This method yields content with exceptional photorealism and detail. It also overcomes critical issues of training instability and mode collapse, which are inherent in Generative Adversarial Networks. These advancements establish diffusion models as the current state of the art in image generation and inpainting.

However, the inpainting of traditional Chinese murals presents unique difficulties, as they are often characterized by complex compositions and abstract meanings. Figure 2 illustrates this with a mural from the Kizil Grottoes in Xinjiang. The artistic style of these works is defined by an intricate interplay of color, brushwork, and material texture. Thus, the primary challenge is to faithfully restore original details while preserving the distinctive artistic style. The key challenges in Chinese mural inpainting therefore encompass the following:

Fig. 2: The Mural is the artwork Mural from the Kizil Grottoes, from the Kucha Kingdom period, created by anonymous artists.

Characterized by their unique use of color, abstract motifs, and distinct shading techniques, these murals represent an aesthetic system different from traditional Central Plains styles.

A primary challenge in image inpainting is the tendency to conflate structural reconstruction with stylistic information from a reference image5. This entanglement impedes the simultaneous achievement of structural accuracy and style fidelity, thereby compromising the overall quality and controllability of the inpainting.

The artistic style of traditional Chinese murals emerges from a complex interplay of color, brushwork, and material texture. Traditional CNN style extractors, limited by their local receptive fields, often fail to capture these global characteristics. Consequently, the inpainted results frequently lack the distinctive historical aesthetic and material qualities of the original artwork6.

An optimal inpainting process requires a dynamic allocation of generative focus across different denoising stages, initially prioritizing structure before shifting to style. In contrast, existing methods typically employ static fusion weights. This rigidity allows premature style influence to disrupt structural formation in the early stages and impedes the meticulous refinement of texture in the later stages, ultimately compromising the fidelity of the inpainting to the artistic integrity of the original artwork. Similarly, in other image generation tasks such as low-light enhancement, studies have also shown that dynamic guidance is crucial for context enrichment and detail enhancement7, further highlighting the necessity of introducing dynamic control in mural inpainting.

To overcome the aforementioned limitations, we propose the Decoupled Conditional Adaptive Time dynamic Fusion Diffusion inpainting Method (DCADif). This novel diffusion model framework is specifically tailored for the inpainting of traditional Chinese murals. Its core innovation lies in the fine-grained decoupling and dynamic control of a mural's structural and stylistic attributes.

The main contributions of this work are as follows:

We propose a Decoupled Condition Encoder that employs parallel pathways to extract distinct representations: structural information from line art and stylistic features from reference images. This architectural separation facilitates independent and precise control over both attributes, thereby providing a robust framework for high-fidelity mural inpainting.

We introduce the SwinStyle Encoder to overcome the inherent limitations of traditional methods in characterizing the complex style of murals. This component is specifically engineered to effectively capture the distinctive historical aesthetic and material qualities of the original artwork.

A Time step-based Adaptive Feature Fusion (TAFF) module is proposed, which prioritizes structural accuracy during the initial stages of denoising and later enhances the inpainting of style and texture, thereby yielding a result that is highly faithful to the original artwork.

The physical inpainting of traditional murals is a highly specialized scientific discipline that demands extensive expertise and technical skill from professional conservators. This process encompasses several key stages, including structural consolidation, surface cleaning, pigment re-adhesion, filling of lacunae, and inpainting of lost areas. For example, stabilization may involve applying specialized adhesives to consolidate flaking pigment layers. To address lost pictorial areas while preserving historical authenticity, conservators may employ techniques such as ‘tratteggio’ (inpainting with discernible lines) or ‘filling without painting’. However, physical interventions are often irreversible. They also risk introducing the conservator’s subjective style, potentially compromising the original artistic intent. Furthermore, in cases of extensive or severe damage, the efficacy of physical inpainting is severely limited.

Advances in digital technology have established non contact digital inpainting as a crucial alternative and supplement to physical methods. Initial digital techniques primarily involved the manual application of tools like the clone stamp by expert operators. While this approach avoided direct physical intervention, it suffered from inefficiency and subjectivity, as the outcome was highly contingent upon the operator's artistic proficiency. These limitations spurred the development of algorithms based on texture synthesis, such as PatchMatch8, for image inpainting tasks.

Deep generative models have driven significant advancements in image inpainting. Two primary paradigms have emerged: Generative Adversarial Networks and Diffusion Models. GANs, compared to earlier methods, are characterized by more sophisticated architectural designs. The LaMa model9, for instance, utilizes Fast Fourier Convolutions to leverage global contextual information, achieving exceptional performance on large, irregular inpainting tasks. Concurrently, Diffusion Models have become the dominant paradigm, owing to their superior generation quality and training stability. Large-scale, pre-trained models such as the Latent Diffusion Model (LDM)3 can be adapted for this purpose. When fine-tuned or integrated with modules like ControlNet10, they can adhere to existing structures while generating highly realistic and diverse content. These advanced technologies offer promising avenues for the inpainting of ancient paintings.

The iterative denoising process of diffusion models enables the generation of highly detailed and photorealistic images. To harness this generative capability for specific tasks, researchers have developed various guidance and control techniques. PnP Diffusion11, for example, introduced a plug and play method for injecting external feature maps to guide generation. This approach enables flexible control over structure and appearance without necessitating model retraining. Similarly, InstructPix2Pix12 demonstrated the capacity for complex semantic modifications by enabling image editing via natural language instructions. Such guidance techniques have found direct applications in artistic creation and inpainting. DiffEdit13, for instance, performs semantic modifications on specific image regions based on textual descriptions, thus offering new approaches for restoring incomplete content.

Despite these significant advances in controllability, the application of such methods to traditional Chinese murals presents considerable challenges. The profound stylistic diversity and compositional complexity of these artworks often lead to a critical trade off. Existing models frequently struggle to reconcile structural fidelity with stylistic consistency, resulting in undesirable artifacts such as style drift or structural distortion.

The pursuit of more precise content control in inpainting has spurred the integration of semantic and stylistic guidance into the generative process. A seminal development in this domain is the Contrastive Language Image Pre training (CLIP) model14. CLIP aligns images and text within a shared semantic space through training on vast image-text datasets. This alignment enables CLIP to serve as a powerful semantic guide for image generation and editing. Blended Diffusion15, for instance, employs CLIP guidance to perform seamless local edits within specified image regions. Specifically for inpainting tasks, CLIP often functions as a perceptual loss that enforces semantic consistency between the restored region and its surrounding context.

Parallel to semantic control, the precise encoding and transfer of artistic style constitutes a central challenge. The decoupling and control of “style” is a broad and active area of research in generative AI. For example, in the task of stylized image captioning, researchers have explored how to use style embeddings to control the specific style of generated text, rather than merely describing image content16,17,18. Such works demonstrate the significant potential of modeling style as a controllable variable, providing valuable insights for a wide range of style-aware generative tasks, including our mural inpainting. Traditional style transfer methods19 rely on pre-trained VGG networks for style extraction. However, this CNN based approach is often inadequate for capturing the intricate textures and brushwork characteristic of Chinese murals. This limitation motivated a shift towards the Transformer architecture, which excels at modeling long range dependencies. StyTr2 20 pioneered the use of a pure Transformer architecture for arbitrary style transfer, and its success is part of a broader trend. The versatility of Transformer based architectures is evident not only in vision tasks but also in complex language generation problems like unified caption summarization21. In low level vision inpainting, this success is mirrored by models like Uformer22, which have demonstrated a superior ability to reconstruct fine-grained textures while preserving global structural integrity. Precisely preserving and reconstructing structural boundaries is a key challenge for high quality image generation, not only in inpainting but also across other computer vision domains. For instance, in medical image segmentation, researchers have proposed ABANet23, which leverages an attention boundary-aware module to explicitly refine edge features, highlighting a shared pursuit of structural fidelity in cross-domain applications.

Effectively fusing complementary information from different sources is a powerful strategy for enhancing model performance. This idea has been validated in multiple domains; for example, in medical diagnostics, FusionLungNet24 improves diagnostic accuracy by integrating multi-scale features to effectively capture fine-grained pulmonary details. Current fusion strategies for structure and style information typically employ static weighting. This rigid methodology is misaligned with the intrinsically dynamic nature of the generative process, which progresses hierarchically from coarse structural formation to fine grained textural refinement. The absence of a dynamic, stage aware guidance mechanism therefore represents a critical limitation in current style aware inpainting methods.

Methods

Overall Structure

As illustrated in Fig. 3, our proposed model, DCADif, operates on the principle of decoupled feature extraction and adaptive fusion. Its architecture is designed around three core mechanisms:

Fig. 3: Overview of our framework.

a Conditional guided Diffusion Models. b SwinStyle Encoder. c CLIP Sketch Encoder. d Condition Encoder.

Guided Denoising UNet. The fused features are injected into our core UNet to guide the reconstruction.

Decoupled Conditional Encoder. We use a parameter frozen CLIP Sketch Encoder for structure and a bespoke SwinStyle Encoder for style.

Time Step Adaptive Feature Fusion. A dynamic module that adjusts the influence of structure and style based on the denoising timestep.

The following sections will detail each component.

CLIP Sketch Encoder

To achieve decoupled control over structure and style, we designate line art as the exclusive medium for structural information. Line art inherently disentangles a mural's fundamental structure, namely its composition, object contours, and spatial layout, from stylistic attributes such as color, lighting, and material texture. We then employ the pre-trained CLIP Sketch Encoder to extract this purified structural representation.

We employ the CLIP image encoder in a parameter frozen, inference only capacity. This process projects the input line art into a high-dimensional latent space, yielding a compact and semantically rich structural representation. Formally, this initial extraction is performed by the pretrained CLIP encoder \({{\mathcal{E}}}_{CLIP}\), which processes the line art, Isketch, to produce an intermediate backbone feature fL.

$${{\bf{f}}}_{L}={{\mathcal{E}}}_{CLIP}({I}_{sketch})$$
(1)

To ensure the structural representation is sensitive to stylistic context, we generate a structural update vector fΔ via a cross attention mechanism. In this operation, the initial structural feature fL (denoted fstruct in Eqs. (2) and (3)) serves as the Query, while the external style feature fstyle provides the Key and Value. This process, which follows the standard multi head attention mechanism25, can be summarized as:

$${{\bf{f}}}_{\Delta }=\,{\rm{MultiHead}}\,({{\bf{f}}}_{struct},{{\bf{f}}}_{style},{{\bf{f}}}_{style})$$
(2)

Finally, the structural update vector fΔ is combined with the initial structural feature fstruct via element-wise addition, yielding the refined structural feature f′struct. This design, functioning as a residual connection, enables fine-grained, context-aware adjustments while preserving the integrity of the initial structural representation.

$${\left({{{\bf{f}}}^{{\prime} }}_{struct}\right)}_{i,j,c}={\left({{\bf{f}}}_{struct}\right)}_{i,j,c}+{\left({{\bf{f}}}_{\Delta }\right)}_{i,j,c},\,\forall (i,j,c)\in {\mathcal{I}}$$
(3)

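As a minimal illustration, the refinement defined by Eqs. (1)-(3) could be sketched in PyTorch as follows; the module name, embedding width, and head count are illustrative assumptions rather than the authors' exact implementation, and the frozen CLIP encoder is assumed to emit a sequence of token embeddings.

```python
import torch.nn as nn

class SketchRefiner(nn.Module):
    """Hedged sketch of the structural-feature refinement in Eqs. (1)-(3)."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: structural tokens attend to style tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_line, f_style):
        # f_line:  (B, N, dim) tokens from the frozen CLIP encoder, Eq. (1)
        # f_style: (B, M, dim) external style tokens
        f_delta, _ = self.cross_attn(f_line, f_style, f_style)  # Eq. (2)
        return f_line + f_delta                                  # Eq. (3), residual addition
```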

SwinStyle Encoder

To effectively capture the intricate, multi scale stylistic attributes inherent in murals, we employ a Swin Transformer based encoder. This encoder’s primary function is to extract a comprehensive and robust style prior from the reference mural image, yielding the style feature vector, fstyle.

The encoding process begins by converting the input image into a sequence of feature tokens through a standard patch embedding process, which typically involves a convolutional layer followed by a linear projection. These feature tokens are then processed by a multi stage Swin Transformer backbone. This network constructs a hierarchical feature representation through the systematic alternation of Swin Transformer blocks and Patch Merging layers. The Swin Transformer block performs local feature modeling, while the Patch Merging layer downsamples the feature map, thereby expanding the receptive field.

Swin Transformer Block. The block processes the input features Zl−1 through the following sequence of operations:

$${\widehat{{\bf{Z}}}}_{l}=\,{\rm{SMSA}}({\rm{LN}}\,({{\bf{Z}}}_{l-1}))+{{\bf{Z}}}_{l-1}$$
(4)
$${{\bf{Z}}}_{l}=\,{\rm{MLP}}({\rm{LN}}\,({\widehat{{\bf{Z}}}}_{l}))+{\widehat{{\bf{Z}}}}_{l}$$
(5)

where LN denotes Layer Normalization (LayerNorm), SMSA signifies Shifted Window Multi-head Self-Attention, and MLP is a Multi-Layer Perceptron.

Patch Merging. The downsampling operation performed by this layer is formally defined as:

$${{\bf{f}}}_{out}=({{\mathcal{L}}}_{2}\cdot \,{\rm{LN}}\,\cdot {{\mathcal{L}}}_{1}\cdot {\mathcal{R}})({{\bf{f}}}_{in})$$
(6)

where fin represents the input feature map, \({\mathcal{R}}\) denotes the reshaping and concatenation operation, \({{\mathcal{L}}}_{1}\) and \({{\mathcal{L}}}_{2}\) are the first and second linear layers, respectively, and LN denotes Layer Normalization.
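For concreteness, the block in Eqs. (4)-(5) and the merging step in Eq. (6) could be sketched as below; standard multi-head self-attention stands in for the shifted-window attention (SMSA) of a full Swin implementation, and the layer widths are assumptions.

```python
import torch
import torch.nn as nn

class SwinStyleBlock(nn.Module):
    """Sketch of Eqs. (4)-(5); plain self-attention replaces SMSA for brevity."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                      # z: (B, N, C)
        h = self.norm1(z)
        z = z + self.attn(h, h, h)[0]          # Eq. (4)
        return z + self.mlp(self.norm2(z))     # Eq. (5)

class PatchMerging(nn.Module):
    """Sketch of Eq. (6): reshape/concatenate 2x2 neighbours (R), then L1, LN, L2."""
    def __init__(self, dim):
        super().__init__()
        self.l1 = nn.Linear(4 * dim, 2 * dim)  # output widths are assumptions
        self.norm = nn.LayerNorm(2 * dim)
        self.l2 = nn.Linear(2 * dim, 2 * dim)

    def forward(self, f, h, w):                # f: (B, H*W, C)
        b, _, c = f.shape
        f = f.view(b, h, w, c)
        f = torch.cat([f[:, 0::2, 0::2], f[:, 1::2, 0::2],
                       f[:, 0::2, 1::2], f[:, 1::2, 1::2]], dim=-1)
        f = f.view(b, (h // 2) * (w // 2), 4 * c)
        return self.l2(self.norm(self.l1(f)))  # Eq. (6)
```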

This hierarchical architecture, characterized by a progressive increase in channel depth and attention heads, enables the encoder to transition its focus from fine grained textural details in shallow layers to more holistic stylistic patterns in deeper layers. This multi stage process of extraction and abstraction culminates in the generation of a single, highly condensed, and discriminative style vector fS.

$${{\bf{f}}}_{S}=\,{\rm{SwinBackbone}}\,({I}_{dam})$$
(7)

where Idam is the damaged mural image serving as the input into the SwinBackbone network.

Finally, to further enhance the representational power, the backbone style feature fS is fed into a style self-attention module, L-Self Attention. This module computes a style update vector by identifying and amplifying the most salient patterns within the feature itself. This update vector is then integrated with the backbone feature fS via a residual connection, yielding the final, refined style vector, fstyle. This entire refinement operation is formally defined as:

$${{\bf{f}}}_{style}={{\bf{f}}}_{S}+{{\mathcal{M}}}_{LSA}({{\bf{f}}}_{S})$$
(8)

where the L-Self Attention module \({{\mathcal{M}}}_{LSA}\) is a self-attention mechanism designed to refine the feature representation by allowing its most salient stylistic patterns to interact and reinforce one another. It is defined as:

$${{\mathcal{M}}}_{LSA}(X)=Pro\left(Softmax\left(\frac{({XW}_{Q}){({XW}_{K})}^{T}}{\sqrt{{d}_{k}}}\right)({XW}_{V})\right)$$
(9)

where Pro denotes the linear projection layer, which maps the features aggregated from the value vectors back to the original C-dimensional space to ensure that the output can be residually connected with the input. The input X = fS ∈ RN×C is the backbone style feature tensor, where N is the number of tokens and C is the channel dimensionality.

\({{\mathcal{M}}}_{LSA}\) denotes the style self-attention module, and \(\sqrt{{d}_{k}}\) is the scaling factor used to prevent vanishing gradients. \({W}_{Q},{W}_{K}\in {R}^{C\times {d}_{k}}\) and \({W}_{V}\in {R}^{C\times {d}_{v}}\) are the learnable linear projection weight matrices for generating the query, key, and value, respectively.
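The refinement in Eqs. (8)-(9) amounts to a single scaled dot-product self-attention with a residual connection; a hedged single-head sketch is given below, where the projection widths are assumptions.

```python
import torch
import torch.nn as nn

class StyleSelfAttention(nn.Module):
    """Sketch of the L-Self Attention refinement, Eqs. (8)-(9)."""
    def __init__(self, channels=768, d_k=64):
        super().__init__()
        self.w_q = nn.Linear(channels, d_k, bias=False)
        self.w_k = nn.Linear(channels, d_k, bias=False)
        self.w_v = nn.Linear(channels, d_k, bias=False)
        self.proj = nn.Linear(d_k, channels)      # "Pro" in Eq. (9)
        self.scale = d_k ** -0.5

    def forward(self, f_s):                       # f_s: (B, N, C) backbone style tokens
        q, k, v = self.w_q(f_s), self.w_k(f_s), self.w_v(f_s)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        f_update = self.proj(attn @ v)            # M_LSA(f_S)
        return f_s + f_update                     # Eq. (8): f_style = f_S + M_LSA(f_S)
```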

Time Step Adaptive Feature Fusion

Conventional conditional diffusion models typically employ static mechanisms, such as cross attention, to integrate external guidance. This rigid approach is fundamentally misaligned with the dynamic nature of the denoising process, where structural guidance is paramount in early stages and stylistic refinement is critical in later ones. Consequently, this misalignment often results in the corruption of structural integrity by premature stylistic influence, leading to significant inaccuracies in the final reconstruction.

To address this challenge, we introduce a novel adaptive fusion mechanism that emulates the coarse to fine strategy of human inpainting experts. This mechanism dynamically modulates the relative influence of structural and stylistic features as a function of the current denoising timestep, t. This ensures that structural guidance dominates in the early, high noise stages, while stylistic and textural refinement prevails in the later, low noise stages.

As illustrated in Fig. 3 (d), the core of the proposed TAFF module lies in generating a pair of time-dependent weights, ωstruct(t) and ωstyle(t), computed as linear functions of the normalized timestep t/T, where t ∈ {T, . . . , 1} is the diffusion model's current timestep and T is the total number of diffusion steps:

$${\omega }_{struct}(t)=0.1+0.9\times (t/T)$$
(10)
$${\omega }_{style}(t)=0.9-0.1\times (t/T)$$
(11)

As denoising proceeds and t decreases, the structural weight ωstruct(t) decays toward 0.1, while the style weight ωstyle(t) approaches 0.9 and becomes dominant.

Upon obtaining this set of dynamic weights, we perform a weighted fusion of the final structural feature fstruct and the style feature fstyle to yield the time dependent fused conditional feature.

$${{\bf{f}}}_{fu}(t)=\left[\begin{array}{cc}{{\bf{f}}}_{struct}, & {{\bf{f}}}_{style}\end{array}\right]\cdot {\bf{w}}(t)$$
(12)

where \({\bf{w}}(t)={[{\omega }_{struct}(t),{\omega }_{style}(t)]}^{T}\) is the time dependent weight vector.

This fused feature ffu(t) is subsequently injected into the UNet bottleneck layer of the diffusion model to provide dynamic guidance for each denoising step.

Within the UNet denoising process at each time step t, the input noisy latent Xt is processed by the downsampling path to produce a bottleneck feature representation hbo. This bottleneck feature is then modulated by our time adaptive fused feature ffu(t). To ensure dimensional compatibility, ffu(t) is first passed through a linear projection layer ϕ to match the bottleneck dimensionality and is then injected into the network via feature addition to guide the generation process. This guidance step can be formulated as:

$${{\bf{h}}}_{bo}^{{\prime} }({{\bf{X}}}_{t},t)={{\bf{h}}}_{bo}({{\bf{X}}}_{t},t)+\phi ({{\bf{f}}}_{fu}(t))$$
(13)

where hbo(Xt, t) is the original feature extracted from the noisy image Xt by the U-Net bottleneck layer at timestep t. ϕ is a lightweight projection network for feature alignment, and \({{\bf{h}}}_{\mathrm{bo}}^{{\prime} }({{\bf{X}}}_{t},t)\) is the updated bottleneck feature after dynamic conditional guidance.

This strategic temporal decoupling facilitates the reconciliation of macro-structural integrity with fine-grained stylistic details. The result is a reconstruction that achieves a superior degree of fidelity to the original artwork.
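Putting Eqs. (10)-(13) together, the TAFF step could be sketched as follows; the token-shaped bottleneck, the feature widths, and the linear form of ϕ are assumptions made for illustration.

```python
import torch.nn as nn

def taff_weights(t, T):
    """Eqs. (10)-(11): structure dominates at early (high-noise) steps."""
    r = t / T
    return 0.1 + 0.9 * r, 0.9 - 0.1 * r   # (w_struct, w_style)

class TAFF(nn.Module):
    """Sketch of the fusion and bottleneck injection, Eqs. (12)-(13)."""
    def __init__(self, cond_dim=768, bottleneck_dim=1280):
        super().__init__()
        self.phi = nn.Linear(cond_dim, bottleneck_dim)   # projection phi

    def forward(self, h_bo, f_struct, f_style, t, T):
        # h_bo: (B, N, bottleneck_dim); f_struct, f_style: (B, cond_dim)
        w_struct, w_style = taff_weights(t, T)
        f_fu = w_struct * f_struct + w_style * f_style   # Eq. (12)
        return h_bo + self.phi(f_fu).unsqueeze(1)        # Eq. (13), broadcast over tokens
```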

Loss Function

The training is governed by a composite objective function that integrates several loss components to jointly optimize for both structural fidelity and stylistic realism.

L1 Loss. It is particularly effective at preserving high frequency structural details, such as edges and contours. This effectiveness stems from its superior robustness to outliers compared to other pixel level losses. The formula is as follows:

$${L}_{1}=\frac{1}{n}{\sum }_{i=1}^{n}\left|{y}_{i}-f({x}_{i})\right|$$
(14)

where yi denotes the ground truth image and f(xi) is the reconstructed output image.

MSE Loss. It evaluates error by computing the mean of the squared differences between predicted and ground truth values. However, the quadratic nature of this penalty renders MSE highly sensitive to large pixel deviations, which in practice often leads to overly smoothed results that lack the fine textures crucial to artworks. It is formulated as:

$${L}_{MSE}=\frac{1}{n}{\sum }_{i=1}^{n}{({y}_{i}-f({x}_{i}))}^{2}$$
(15)

Perceptual Loss. It does not perform a direct comparison in pixel space. Instead, it computes the feature distance between image patches within the feature space of a pre-trained VGG network. It can be summarized as:

$${L}_{Preceptual}=\frac{1}{n}{\sum }_{i=1}^{n}{({F}_{i}(x)-{F}_{i}(y))}^{2}$$
(16)

where x is the input image and y is the target image, Fi(x) and Fi(y) respectively denote their feature representations at the i-th layer of a pretrained neural network, and n is the number of feature layers.

LPIPS Loss. The Learned Perceptual Image Patch Similarity loss is designed to more accurately reflect human perceptual judgment than traditional perceptual losses. It calculates the distance between deep features of two images, weighted by learned linear layers to better match human perception. The loss is computed as:

$${L}_{LPIPS}=\mathop{\sum }\limits_{l}{W}_{l}{\left|{F}_{l}(x)-{F}_{l}(y)\right|}_{2}^{2}$$
(17)

where x is the generated image and y is the target image, Fl(x) and Fl(y) are the unit normalized feature representations extracted from the l-th layer of a pretrained network, and Wl is a learned weight vector used to scale the contribution of each layer's feature distance.

Total Loss. It is represented as:

$$\begin{array}{r}{L}_{total}={\lambda }_{1}{L}_{{1}_{n}}+{\lambda }_{2}{L}_{MS{E}_{n}}\\ +{\lambda }_{3}{L}_{{1}_{i}}+{\lambda }_{4}{L}_{Pr{e}_{i}}+{\lambda }_{5}{L}_{LPIPS}\end{array}$$
(18)

where \({L}_{{1}_{n}}\) and \({L}_{MS{E}_{n}}\) represent the L1 and Mean Squared Error losses calculated between the predicted noise and the ground truth noise, respectively, while \({L}_{{1}_{i}}\) and \({L}_{Pr{e}_{i}}\) denote the L1 and perceptual losses computed between the denoised image and the ground truth image. The weights λ1, λ2, λ3, λ4, and λ5 are set to 0.5, 0.5, 0.5, 1, and 0.1, respectively.
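As a hedged sketch, the composite objective in Eq. (18) could be assembled as below; the lpips package and the callable vgg_feats returning a list of VGG feature maps are assumptions about tooling, not the authors' exact training code.

```python
import torch.nn.functional as F
import lpips  # third-party LPIPS implementation, assumed available

lpips_fn = lpips.LPIPS(net='vgg')  # backbone choice is an assumption

def total_loss(noise_pred, noise_gt, img_pred, img_gt, vgg_feats,
               lambdas=(0.5, 0.5, 0.5, 1.0, 0.1)):
    l1_n = F.l1_loss(noise_pred, noise_gt)        # L1 between predicted and true noise
    mse_n = F.mse_loss(noise_pred, noise_gt)      # MSE between predicted and true noise
    l1_i = F.l1_loss(img_pred, img_gt)            # L1 between denoised image and ground truth
    feats_p, feats_g = vgg_feats(img_pred), vgg_feats(img_gt)
    perc_i = sum(F.mse_loss(a, b) for a, b in zip(feats_p, feats_g)) / len(feats_p)
    lp = lpips_fn(img_pred, img_gt).mean()        # images expected in [-1, 1]
    l1w, l2w, l3w, l4w, l5w = lambdas
    return l1w * l1_n + l2w * mse_n + l3w * l1_i + l4w * perc_i + l5w * lp
```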

Datasets

MuralVerse-S. We propose a dataset of murals from various regions of China, comprising 1396 extended and cropped images of Dunhuang murals, 2335 images of Gansu murals, 2950 images of Hebei murals, and 1482 images of Inner Mongolia murals, as illustrated in Fig. 4. All images are cropped to a resolution of 256 × 256 and divided into training, validation, and test sets in a ratio of 8:1:1.

Fig. 4: Examples of Mural paintings.

a is Dunhuang murals. b is the line sketch of the Dunhuang murals. c is the real mask. Unlike commonly used synthetic masks, these masks realistically replicate the complex patterns of cracks, fading, and pigment loss that occur over time. Training with such real-world degradation patterns enables our model to generalize more effectively to authentic mural restoration scenarios.

The dataset was curated from images procured from collaborating institutions and digital art databases. The curation process involved a rigorous screening and classification performed by professional artists. Artworks were categorized based on their distinct styles, dynasties, and color palettes to ensure the final dataset's diversity and representativeness.

The data preparation pipeline for each mural initiates with the extraction of its corresponding line art. Subsequently, natural damage is simulated by manually applying masks to the intact ground truth image. This process yields a complete training sample, consisting of the damaged image, the binary mask, and the structural line art, which collectively serve as the input for network training.
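A hedged sketch of this triplet assembly is shown below; the file formats and the convention that mask value 1 marks a damaged pixel are assumptions.

```python
import numpy as np
from PIL import Image

def make_training_sample(gt_path, mask_path, line_path):
    """Assemble (damaged image, mask, line art, ground truth) for one mural."""
    gt = np.asarray(Image.open(gt_path).convert('RGB'), dtype=np.float32) / 255.0
    mask = np.asarray(Image.open(mask_path).convert('L'), dtype=np.float32) / 255.0
    line = np.asarray(Image.open(line_path).convert('L'), dtype=np.float32) / 255.0
    mask = (mask > 0.5).astype(np.float32)[..., None]   # 1 = damaged region (assumed)
    damaged = gt * (1.0 - mask)                         # apply mask to the intact image
    return damaged, mask, line, gt
```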

MaskCLP-S. The dataset is obtained from relevant cooperative research institutions. It comprises 8273 images of Chinese landscape paintings, as illustrated in Fig. 5. The dataset is divided into 7446 training images and 827 testing images, all cropped to 256 × 256. This dataset encompasses a wide range of traditional paintings from various historical dynasties, featuring the unique styles of numerous outstanding artists.

Fig. 5: Demonstrating the generalization capability of our model on the related domain of Chinese landscape paintings.

Row a presents examples from various Qing Dynasty landscape paintings, including works by artists such as Wang Jian, Hua Yan and so on. Although these paintings differ from murals in medium and brushwork techniques, they share a common emphasis on linear structure and stylistic mood. Row b contains the corresponding line art extracted for structural guidance, and Row c shows the degradation masks applied for testing.

Ethics Statement

The dataset used in this study is publicly available and has received the necessary approval for use. All images, videos, and associated personal information are published in accordance with the licensing terms of the dataset, and the researchers have adhered to the terms provided by the dataset’s publisher. Since the dataset is publicly accessible and includes content with the required authorization, we confirm that the individuals involved have provided consent at the time of dataset publication.

Implementation Details

In our experiments, models implemented with PyTorch were trained on NVIDIA H20 GPUs. Prior to training, we employed a series of data augmentation techniques to enhance model performance and robustness. These techniques include resizing, cropping, rotation, flipping, and noise addition. Original painting images were resized to a uniform resolution of 256 × 256. The batch size was set to 32. The models were trained for 2000 epochs. A dynamic learning rate schedule was utilized, which progressively annealed the learning rate throughout the training process to ensure stable convergence.
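The augmentation pipeline described above could look roughly like the following torchvision sketch; the rotation range, crop padding, and noise level are illustrative assumptions, as the paper does not specify them.

```python
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(256, padding=8, padding_mode='reflect'),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
    # Additive Gaussian noise; the 0.01 std is an assumption.
    transforms.Lambda(lambda x: (x + 0.01 * torch.randn_like(x)).clamp(0.0, 1.0)),
])
```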

To rigorously evaluate the model’s generalization ability to unseen artistic styles, we carefully partitioned our dataset. Specifically, we ensured that artworks from the same dynasty or by the same artist did not simultaneously appear in both the training and testing sets. This partitioning strategy, analogous to a leave one dynasty out cross validation, is designed to prevent the model from achieving high scores simply by memorizing specific styles.

Model Complexity and Efficiency

The proposed DCADif is a large scale diffusion model, reflected in its complexity metrics. It comprises a total of 561.68 million parameters, positioning it as a substantial network designed to capture intricate artistic features. The computational cost for a single denoising step on a 256 × 256 input is 486.14 GFLOPs. As a diffusion model, the final image generation is an iterative process. The total end to end inference time, which includes the full sampling loop, was benchmarked on a single NVIDIA H20 GPU at 377.00 ms per image. This corresponds to a throughput of approximately 2.65 FPS. While computationally intensive, this scale is crucial for achieving the high fidelity restoration results demonstrated in our experiments.

Evaluation Metrics

We quantitatively assess the inpainting quality using four standard metrics: PSNR26, SSIM27, LPIPS28, and Gram-matrix style loss19. PSNR and SSIM quantify pixel-level fidelity and structural correspondence, respectively, while LPIPS evaluates perceptual realism by measuring similarity in a deep feature space, and the Gram-matrix style loss specifically assesses artistic style fidelity.
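For reference, these metrics could be computed with standard tooling as sketched below; the LPIPS backbone and the source of the VGG features for the Gram-matrix term are assumptions.

```python
import torch
import torch.nn.functional as F
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # backbone choice is an assumption

def evaluate_pair(pred, gt):
    """pred/gt: HxWx3 uint8 arrays of the restored and ground-truth murals."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1).float().div(127.5).sub(1).unsqueeze(0)
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp

def gram_style_distance(feat_a, feat_b):
    """Gram-matrix style distance between two feature maps (B, C, H, W), e.g. from VGG."""
    def gram(f):
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)
    return F.mse_loss(gram(feat_a), gram(feat_b))
```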

Results

Baselines

For a comprehensive performance evaluation, we benchmark our model against nine representative baseline methods, which are grouped into three distinct categories:

(1) CNN. These methods employ CNN architecture for image inpainting.

AdaIR29: An adaptive image inpainting network that handles diverse degradations through frequency-domain feature modulation and learnable degradation adaptation mechanisms.

CTSDG30: A coupled texture structure decomposition network implementing dual-stream inpainting through task-specific subnetworks.

RFR31: A progressive inpainting framework employing iterative refinement with cascaded recurrent feedback modules.

EC32: A CNN based line art colorization model that uses adaptive normalization to inject structural prior, yielding high-fidelity colorization without diffusion steps.

(2) Transformer. These methods employ Transformer architecture for image inpainting.

MAT33: The pioneering transformer based large hole inpainting framework combining global attention mechanisms with local convolutional features for high resolution inpainting.

PromptIR34: A prompt driven transformer that unifies all image inpainting tasks via textual degradation queries, dispensing with task specific branches.

(3) Diffusion. These methods employ diffusion model architecture for image inpainting.

Strdiffusion35: A lightweight diffusion sampler with momentum based skip for fast, multi scale inpainting.

SDE36: Score based generative framework using stochastic differential equations for high-quality image synthesis and inpainting.

RePaint37: A diffusion inpainting method that couples denoising sampling with mask-aware reverse SDE for structure and texture consistency.

Quantitative Analysis

To validate the efficacy of the proposed DCADif framework, we conducted a comprehensive benchmark against leading state-of-the-art methods using our proprietary MuralVerse-S dataset. This evaluation encompassed both quantitative metrics and qualitative visual comparisons.

We conducted a comprehensive quantitative benchmark of the proposed DCADif model against a diverse set of leading image inpainting methods. This selection intentionally spans multiple architectural paradigms, including CNN based, Transformer based, and Diffusion based approaches.

Table 1 summarizes the quantitative results. To guarantee a rigorous and unbiased comparison, all baseline methods were retrained from scratch on our proprietary MuralVerse-S dataset. The models were evaluated across three distinct ranges of random mask ratios (0.1-10%, 10-20%, and 20-30%) to assess their performance under varying levels of degradation.

Table 1 Comparison results on DCADif

In terms of PSNR, DCADif consistently outperforms all baseline methods. This performance advantage becomes increasingly pronounced at higher mask ratios, demonstrating its superior robustness to severe degradation. This trend is best illustrated by the comparison with its leading Diffusion based competitor, RePaint. In the most challenging scenario, the 20–30% mask ratio, DCADif achieves a performance margin of 0.51 dB.

DCADif demonstrates exceptional performance stability on the SSIM metric, particularly as the degree of image degradation increases. This stability stands in sharp contrast to methods like EC, which exhibit a performance degradation of over 25% at high mask ratios, indicating a critical failure in structural reconstruction. Furthermore, DCADif surpasses the similarly robust RePaint model, achieving a 3.6% relative performance gain under high mask ratios. This margin underscores its superior capacity for maintaining global structural consistency.

DCADif demonstrates a commanding lead on the LPIPS metric, indicating a substantial improvement in perceptual realism. Relative to the second-best performing model, RePaint, DCADif reduces perceptual error by a remarkable 50% to 57%. Notably, this performance gap widens as the level of image degradation increases.

Moreover, its exceptional performance stability across varying levels of inpainting difficulty validates the sophistication and efficiency of our model's design, establishing it as a robust and reliable new benchmark in the field of image inpainting (Fig. 6).

Fig. 6: A visualization of the quantitative comparison results from Table 1.

These line plots track the performance trends of our proposed model (Ours) against baseline methods across varying mask ratios. Each subplot corresponds to one of four metrics: PSNR, SSIM, LPIPS, and Gram matrix style loss.

Qualitative Analysis

As shown in the comparative results in Figs. 7 and 8, our model demonstrates a superior capability to generate visually plausible and high-fidelity results, markedly outperforming other mainstream methods.

Fig. 7: Qualitative comparison with state of the art methods, including SDE, Strdiffusion, PromptIR, AdaIR, and EC.

The figure presents restoration results for two different examples: (a) The first example, and (b) the second example. Rows (a1) and (b1) provide zoomed in views of the regions highlighted by red boxes for detailed comparison. Note that baseline methods often suffer from artifacts, blurriness, or stylistic inconsistencies. In contrast, our DCADif successfully reconstructs both fine textures and clean contours with high fidelity to the original artwork in both examples.

Fig. 8: Qualitative comparison with state of the art methods, including SDE, Strdiffusion, PromptIR, AdaIR, and EC.

The figure presents restoration results for two different examples: (a) The first example, and (b) the second example. Rows (a1) and (b1) provide zoomed in views of the regions highlighted by red boxes for detailed comparison. Note that baseline methods often suffer from artifacts, blurriness, or stylistic inconsistencies. In contrast, our DCADif successfully reconstructs both fine textures and clean contours with high fidelity to the original artwork in both examples.

The task in Fig. 8a requires the reconstruction of significant facial and sartorial details. The magnified inset (a1) highlights the limitations of the baseline methods. CTSDG and MAT, for instance, generate overly smoothed results that fail to recover crucial facial features or the hat’s texture. RFR exhibits catastrophic failure, producing incoherent artifacts with no semantic relevance to a face. While the result from RePaint is plausible, it lacks the requisite sharpness and fine detail of the ground truth (GT). In contrast, DCADif successfully reconstructs the face with well defined features and contours. It also restores the hat’s intricate texture, yielding a final inpainting nearly indistinguishable from the ground truth.

Fig. 8b showcases the inpainting of an ornate headdress, a task defined by its intricate details and fine linework. The magnified inset (b1) underscores the difficulty of preserving such high frequency details. RFR again fails to generate coherent content, producing chaotic artifacts and demonstrating an inability to model the image's underlying structure. CTSDG and MAT render the fine lines as an indistinct blur, thereby compromising the design's structural integrity. Although RePaint captures the overall form, it fails to reproduce the linework with sufficient sharpness, resulting in a loss of definition. In contrast, our model performs exceptionally well, accurately reconstructing both the fine linework and the subtle background color gradations.

These visual comparisons provide empirical validation for our quantitative results. While competing methods often suffer from blurring, artifacting, and structural inconsistencies, DCADif consistently delivers inpaintings that are semantically coherent, rich in detail, and stylistically faithful to the original artwork.

Comparison with Celebrated Traditional Mural Painting

To further assess its generalization and practical utility, the DCADif framework was applied to the inpainting of authentic ancient murals exhibiting natural degradation. Unlike the synthetic masks used for training, these cases feature complex, compound forms of degradation, including color fading, pigment flaking, and structural cracks, posing a formidable test of the model's robustness. The successful outcomes of these inpainting cases underscore the substantial potential of DCADif as a viable tool for digital cultural heritage preservation.

In Fig. 9, we present the inpainting results for three cases of authentic murals. The model's versatility and robustness are evident in its handling of diverse degradation types. It successfully addresses the diffuse mottled stains of ‘the Winged Beast mural’, reconnects the fractured structural lines of ‘the Charging Bull mural’, and reconstructs the holistically deteriorated scene of ‘the Court Ladies and Apsaras mural’. This performance on authentic artifacts validates the efficacy of DCADif as a practical inpainting tool. Furthermore, this success underscores the significant potential for contributions to digital archaeology, museology, and the broader field of cultural heritage preservation.

Fig. 9: Comparison on Traditional Chinese Painting.

The figure displays results from three different examples, shown in rows (a), (b), and (c). ‘GT’ denotes the ground truth image. ‘Line’ represents the corresponding line art. ‘Mask’ indicates the synthetically damaged painting, and ‘Inpainted’ shows the restored image.

Comparison on Diverse Datasets

To further validate our model, we conducted additional experiments on the MaskCLP-S dataset. This dataset encompasses a wide range of traditional paintings from various historical dynasties, featuring the unique styles of numerous outstanding artists.

The experimental results demonstrate that, during the pixel wise decoding of missing regions, the model not only aligns the local brushwork with the style of the original artwork but also, through its adaptive fusion strategy, ensures that the ink tones create a natural transition with the surrounding strokes. As shown in Fig. 10, in a landscape painting by Wang Shigu where approximately 18% of the left mountain ridge was damaged by mildew, the inpainting result not only recovers the fine layers of the ‘hemp fiber strokes’ but also precisely reproduces the texture of the dry brush work, rendering the repaired boundary virtually imperceptible.

Fig. 10: Comparison on Traditional Chinese Painting.

The figure displays results from three different examples, shown in rows (a), (b), and (c). ‘GT’ denotes the ground truth image. ‘Line’ represents the corresponding line art. ‘Mask’ indicates the synthetically damaged painting, and ‘Inpainted’ shows the restored image.

Comparison with Celebrated Traditional Chinese Painting

Digital inpainting must strike a balance between semantics and aesthetics. It must not only precisely restore the physical structure and stylistic characteristics of missing areas but, more importantly, ensure that the result integrates seamlessly with the original artwork. By incorporating a multi level style perception mechanism and global context modeling, our model effectively captures the subtle continuity of artistic intent across broken brushstrokes, a key characteristic of traditional painting, thereby achieving a new equilibrium between visual realism and aesthetics. For our experiments, we cropped sections from the Tang dynasty painting Court Ladies Wearing Flowered Headdresses and a landscape painting by Wang Shigu to create the images for inpainting. Figure 10 presents the qualitative inpainting results of our proposed model on the renowned painting Court Ladies Wearing Flowered Headdresses, with source data details provided in Fig. 11. The experimental results demonstrate that our model achieves a remarkable balance between semantic coherence and aesthetic fidelity. The model transcends basic pixel filling to perform context aware semantic inpainting. For instance, in subplot (b), the model not only recovers the correct color of the dog’s coat but also precisely reconstructs its directional flow, maintaining fine grained textural consistency. Furthermore, subplot (c) highlights the model's exceptional ability to preserve structural integrity, as evidenced by the high fidelity reconstruction of facial contours and edges. Collectively, these results demonstrate that our model is capable of understanding and generating content that is both visually plausible and semantically meaningful within the context of classical Chinese painting.

Fig. 11: The image is one of China's Top Ten Famous Paintings.

This is the artwork Court Ladies Wearing Flowered Headdresses, painted by the Tang Dynasty artist Zhou Fang. The red boxes indicate two representative regions chosen for our inpainting experiments: the lady’s face with her high chignon, and the small dog below. The two panels on the right are the corresponding magnified views of these regions, which serve as the Ground Truth for our experiments. The unique style and fine details of this artwork present a challenging scenario for validating the capabilities of our model.

User Study

We recruited 50 participants, including faculty and graduate students specializing in art. We used several competing models, as well as our own, to restore a set of murals and presented the resulting images to the participants. They were then asked to rate the results based on the following three criteria: (a) Content Consistency: the degree to which the content of the restored image is consistent with the original; (b) Style Fidelity: the extent to which the brushwork, color, and texture reproduce those of the original mural; and (c) Degree: a comprehensive assessment of how well the mural was restored. As shown in Fig. 12, DCADif demonstrates outstanding performance.

Fig. 12: Illustration of the User Study.

A rating scale from 0 (Worst) to 5 (Best) was used for each criterion, where a higher score indicates a more favorable evaluation. The aggregate score for each criterion was computed as follows:

$${\rm{Score}}=\frac{{\sum }_{i=1}^{n}({f}_{i}\cdot {w}_{i})}{P}$$
(19)

where P is the number of participants who answered the question, fi denotes the frequency of the i-th option being selected, and wi represents the weight of the i-th option determined by its ranking.
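A small sketch of Eq. (19) follows; treating the 0-5 rating values themselves as the option weights, and the counts used in the example, are illustrative assumptions.

```python
def user_study_score(option_freqs, option_weights, n_participants):
    """Eq. (19): rank-weighted option frequencies, normalized by participant count."""
    return sum(f * w for f, w in zip(option_freqs, option_weights)) / n_participants

# Illustrative example: 50 participants, counts of each 0-5 rating for one criterion.
score = user_study_score([0, 1, 3, 8, 20, 18], [0, 1, 2, 3, 4, 5], 50)
```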

Effectiveness of Components

To systematically deconstruct the DCADif model and validate its architectural design, we conducted a key ablation study focused on the selection of the Style Encoder and the Sketch Encoder. We evaluated four different combinations of two powerful encoders for extracting style and structure information: CLIP, for its high-level semantic understanding, and SwinStyle, for its ability to capture both local and global visual features. The qualitative results are shown in Fig. 13, and the quantitative results are presented in Table 2.

Fig. 13: Ablation Study of Encoder-Decoder.

Visual ablation study of our encoder design. `GT' denotes the ground truth and `Mask' is the damaged input. The other columns compare different encoder combinations for style and line art extraction, formatted as [Style]/[Line Art], using either a CLIP encoder (C) or our SwinStyle encoder (S).

Table 2 Ablation study on the Style Encoder and Sketch Encoder within the DCADif framework

The experimental results in Fig. 13 demonstrate that the heterogeneous combination, employing SwinStyle as the style encoder and CLIP as the sketch encoder, achieved decisively superior performance. This configuration not only attains the best results across all three key metrics but, more significantly, a comparison with the other combinations provides profound insight into the model's intrinsic logic.

First, when we used SwinStyle to extract style and CLIP to extract the sketch, a catastrophic decline in performance was observed: the PSNR plummeted by over 5.0 dB, and both SSIM and LPIPS deteriorated substantially. This provides compelling evidence that SwinStyle's expertise in precisely parsing the local structure and edge information of line art is irreplaceable by CLIP, while concurrently demonstrating that CLIP's capacity for capturing and encoding high level, abstract style semantics far surpasses that of SwinStyle. This finding clearly delineates the ‘capability boundaries’ of the two encoders, confirming that their assigned roles are both correct and uniquely suited.

Second, the performance of the homogeneous combinations further reinforces our design rationale. When two CLIP encoders were used, the model demonstrated the poorest performance in structural reconstruction despite its stylistic understanding, registering one of the lowest SSIM scores among all combinations. This underscores CLIP's shortcomings in fine grained structural perception. Conversely, when two SwinStyle encoders were used, the model excelled on the SSIM metric, performing nearly on par with the optimal configuration, but showed a noticeable gap in PSNR and LPIPS. This indicates that while SwinStyle is highly effective at processing structural information, it lacks CLIP's ability to associate image content with high level semantic style, resulting in generated textures and details that are less rich and realistic.

In conclusion, CLIP's powerful semantic understanding makes it the ideal choice for extracting style information, while SwinStyle's fine grained perception of visual patterns makes it most effective for parsing structural contours. The ability of DCADif to synergistically process these two distinct information streams is what enables it to achieve state of the art performance in image inpainting tasks.

Effectiveness of Frozen CLIP for Structural Guidance

To justify the use of a frozen CLIP encoder for structure extraction, we balanced the trade-off between fine-tuning for domain adaptation and freezing the weights to preserve robust, general-purpose features. While fine-tuning may enhance specialization, it carries the risk of catastrophic forgetting of the extensive knowledge acquired during large-scale pre-training. To validate this design choice, we conducted an ablation study comparing our full model against a baseline variant devoid of the CLIP encoder. As demonstrated in Table 3, the quantitative results highlight the significant contribution of the frozen CLIP guidance.

Table 3 Ablation study on the structural encoder

The results of the ablation study demonstrate the nuanced role of the frozen CLIP encoder. Regarding pixel-level fidelity, both configurations achieved identical performance, and the perceptual quality (LPIPS) showed no variance. This suggests that the main diffusion model is already capable of handling overall color and texture generation.

A minor discrepancy, however, was observed in the SSIM metric. The baseline model without the encoder reached an SSIM of 0.926, slightly higher than the 0.925 achieved with the frozen CLIP. This implies that while the frozen CLIP provides robust structural priors, it may introduce a level of ‘rigidity’, potentially limiting the model's flexibility to match local ground truth details perfectly.

Despite this slight drop in SSIM, we incorporate the frozen CLIP encoder in the final design. It acts as a critical structural backbone, ensuring stability and preventing major distortions in large missing regions—benefits that extend beyond what standard metrics can measure. Overall, the results confirm that using a frozen, pre-trained encoder is a robust and effective strategy for ensuring structural fidelity in inpainting.

Effectiveness of Loss Function

During the training process, we observed that for diffusion models, relying solely on an L1 loss to constrain the predicted and ground truth noise can sometimes lead to subtle color discrepancies between the final generated image and the ground truth image, thereby compromising the fidelity of the inpainting. To address this, we introduced an additional L1 loss term that directly computes the difference between the generated image and the ground truth image, with the aim of enhancing the pixel level alignment capability. To validate the necessity and effectiveness of this design choice, we conducted a key ablation study. The qualitative results are presented in Fig. 14, and the quantitative results are shown in Table 4.

Fig. 14: Loss Ablation Study.

Visual comparison illustrating the effect of the image space L1 loss. `GT' denotes the ground truth and `Mask' is the damaged input.

Table 4 Ablation study on the L1 loss components

When the L1i loss was removed, a significant decline was observed across all performance metrics. Specifically, the PSNR dropped by a substantial margin of over 1.8 dB, a considerable gap that directly validates the critical role of the L1i loss in color correction and overall fidelity enhancement.

This performance degradation was also evident at the structural and perceptual levels. In the absence of the L1i loss, the SSIM decreased by nearly 5%, indicating that direct image level supervision is crucial for helping the model better reconstruct local structures and ensure the seamless integration of the restored region with its surroundings. The most remarkable change, however, occurred in the LPIPS score, which increased by 84% without the L1i loss. This highlights how the L1i loss effectively aligns the model's optimization objective with the human perceptual space, enabling the generation of more natural-looking images.

Effectiveness of Time Step Ratio

Within the TAFF module, the fusion ratio of sketch and style features is not static but changes dynamically over time. To determine the optimal proportion for model performance, we conducted experiments with various feature fusion ratios. The qualitative results are illustrated in Fig. 15, and the quantitative results are presented in Table 5.

Fig. 15: Visual ablation study on the fusion ratio for the line art and style conditions in the TAFF module.

`GT' denotes the ground truth, and `Mask' is the damaged input. The other columns show inpainting results for different [Line Art Weight] / [Style Weight] ratios.

Table 5 Ablation Study of feature fusion weight

The experimental results indicate that the model achieves comprehensively optimal performance when the weight for sketch guidance is set to 0.1 and the weight for style guidance is set to 0.9. However, a deeper analysis of the performance variations under different weight configurations reveals a more profound synergistic and constraining relationship between the two information sources.

The performance exhibits high sensitivity to this fusion ratio. When we increased the sketch weight from 0.1 to 0.2, although the SSIM decreased only negligibly, the PSNR experienced a drastic drop of nearly 2.0 dB, while the LPIPS worsened by more than threefold.

Upon further increasing the sketch weight to 0.3, this trend of performance degradation continued. Both the PSNR and SSIM metrics continued to fall sharply, indicating that an overreliance on the given sketch information impedes the model’s ability to learn from the data driven style features and generate natural inpainting content that matches the surrounding context.

In summary, a 0.1/0.9 ratio represents the optimal fusion balance for sketch and style guidance. Style guidance serves as the core driving force for generating high-fidelity and photorealistic content, whereas sketch guidance, within the Condition Encoder, should play a subtle, auxiliary corrective role. Overemphasizing external structural constraints severely undermines the model's powerful internally learned generative capabilities. This finding provides a solid experimental basis for the design, validating the effectiveness and rationality of the current weight configuration.

Discussion

We propose an innovative framework for the inpainting of damaged murals, termed DCADif. We employ two encoder modules, the CLIP Sketch Encoder and the SwinStyle Encoder, to learn the deep features of the image in a decoupled and progressive manner. Specifically, within DCADif, we introduce a Time step Adaptive Feature Fusion module. This module deeply couples the denoising process with information injection by dynamically modulating the weights of structural and stylistic features according to the current timestep, thereby prioritizing the establishment of the macro structure in the early denoising stages and the meticulous rendering of micro details in the later stages to achieve a harmonious synthesis of both.

Experimental results demonstrate that DCADif exhibits superior performance in processing murals, particularly in the task of restoring damaged murals, where it shows exceptional capabilities in artistic style preservation and detail inpainting. This validates its effectiveness in the field of cultural heritage preservation and inpainting. Furthermore, the model achieves excellent visual results on a dataset of Chinese paintings, which further substantiates its generalization capability in image inpainting tasks.

Despite the promising results achieved by DCADif in mural inpainting, it is subject to several limitations that offer clear avenues for future research.

First, the model faces challenges when dealing with extremely large, contiguous areas of damage. When critical structural information in a region is completely lost, the line art condition our model relies on becomes unreliable. In such cases, the model may generate content that is visually plausible but historically inaccurate, a phenomenon known as hallucination, which is a critical concern for the rigorous demands of cultural heritage preservation.

Second, as a diffusion based model, the iterative denoising process of DCADif is computationally intensive. This leads to slower inference speeds compared to single pass architectures like GANs, which could be a practical constraint in applications requiring rapid processing or large batch restoration.

Finally, while our SwinStyle Encoder effectively captures artistic style, its ability to reproduce highly specific material textures could be further improved. For instance, the model may not perfectly distinguish between the unique craquelure patterns on aged plaster and the fibrous texture of ancient silk paintings. Developing more refined restoration algorithms specifically tailored to the material characteristics of cultural artifacts remains an important direction for future work.

Recent studies have shown that frequency aware learning can offer finer control over detail generation across different frequency bands38. Incorporating such mechanisms could potentially enhance our model's ability to restore the full spectrum of details in murals, from coarse wall textures to delicate brushstrokes, thereby further advancing the fidelity of the restoration.