Introduction

The murals of temples and monasteries in Shaanxi have a long history and are diverse in type. As non-renewable cultural heritage, they hold significant historical and artistic value. These murals not only incorporate elements of Buddhism, Taoism, and folk beliefs, reflecting the distinctive features of social life, religious beliefs, and aesthetic values of their time, but also represent the development of ancient Chinese painting techniques. Particularly during the Yuan, Ming, and Qing dynasties, they preserved many original works by renowned artists, establishing their irreplaceable importance1.

Owing to the long-term impact of natural weathering and human factors, these murals have suffered various kinds of deterioration, such as cracking, flaking, hollowing, pulverization, fading, discoloration, mildew, smudging, and scratches. There is therefore an urgent need to restore the murals in a way that takes the environment and painting materials into account. However, manual restoration of Shaanxi murals is arduous and complex. Advances in image processing and deep learning have made the digital restoration of mural images an active research topic2.

With the development of artificial intelligence, neural networks are widely used in computer vision tasks3,4. As a widely used paradigm in neural networks, Generative Adversarial Networks (GANs) have been at the forefront of mural restoration, with numerous studies proposing GAN-based improvements for image restoration tasks5,6. These GAN architectures directly learn the data distribution, enabling the generation of highly realistic restorations7,8. However, the recently popularized diffusion models9 have emerged as a powerful alternative, often outperforming traditional GAN approaches in various image generation and restoration tasks10,11,12. Unlike GANs, diffusion models focus on modeling the gradual change of the data distribution over time, a process analogous to physical diffusion.

In the context of mural restoration, diffusion models offer several advantages. They provide more stable training dynamics than GANs, whose learned distributions are well known to suffer from mode collapse and catastrophic forgetting13,14. Furthermore, diffusion models have demonstrated superior performance in capturing fine details and textures15, which is crucial for preserving the aesthetic integrity of historical murals. Their ability to generate diverse, high-quality samples also addresses the mode-collapse criticism often leveled at GANs. Although diffusion models have demonstrated remarkable performance in general image restoration tasks by leveraging only local information, applying them to mural restoration presents unique challenges: an exclusive focus on local texture may prove insufficient. To achieve optimal results, it is imperative to incorporate prior knowledge about the global low-rank structure inherent in mural paintings16.

The crucial prior is low-rankness, which refers to the fact that a signal often contains many repetitive local patterns, so that a local pattern usually has many similar patterns across the whole signal17. The low-rank principle forms the cornerstone of many techniques, e.g., image restoration/inpainting and video saliency detection. In mural painting, low-rankness is generally expressed in three aspects. First, mural paintings typically exhibit large-scale structural coherence that extends beyond local neighborhoods16; global low-rankness effectively captures these long-range dependencies, thereby ensuring that the restored image maintains overall compositional integrity. Second, despite their complexity, mural paintings can often be represented by a relatively small number of basic elements, analogous to the concept of sparsity in signal processing18; the low-rank assumption aligns with this sparse representation, facilitating more computationally efficient and accurate restoration processes19. Third, while local patterns play a crucial role, global low-rankness ensures consistency in texture across the entire mural20, mitigating the risk of localized over-fitting and preserving the overall artistic style. In summary, it is necessary to embed a global low-rank prior in mural restoration.

It is worth noting that the efficacy of low-rank minimization in tensor completion has been corroborated by numerous empirical studies21,22. However, the size of murals is usually arbitrary, and when faced with the restoration of large-scale murals, the time consumption is a major concern. To address the aforementioned challenges, this paper makes the following contributions:

  1. We propose a novel framework called Low-rank Structure Guided Diffusion (LRDiff) that seamlessly integrates global low-rank priors into diffusion models for mural restoration, thereby addressing the performance limitations of existing approaches.

  2. We formulate a rigorous low-rank constraint using Tucker decomposition and impose the low-rank property on the smaller Tucker core tensor, which effectively reduces the time complexity of restoring murals of arbitrary scale.

  3. We introduce an adaptive mechanism that balances the influence of local and global information during the restoration process, ensuring optimal preservation of both fine details and overall compositional structure.

  4. To demonstrate the effectiveness of our method, we conduct extensive experiments on a real-world mural dataset, showing the superior performance of our proposed method compared with state-of-the-art techniques.

Methods

Figure 1 presents an overview of our proposed LRDiff method, which addresses three key aspects: 1) leveraging the low-rank structure of data to mitigate the semantic discrepancy between low-quality and high-quality murals; 2) adaptively balancing global low-rank structure and pixel-wise similarity during the reverse denoising process; 3) obtaining low-rank approximate solutions within a reduced subspace of smaller core tensors. This approach effectively integrates low-rank priors into the diffusion model framework, enabling more efficient and accurate mural restoration while maintaining structural coherence across the image. All the mathematical symbols used are summarized in Table 1. Before presenting our technique, we review DDPMs23 for image inpainting.

Fig. 1: The overview of LRDiff.

Our method incorporates explicit low-rank constraints during restoration, employing orthogonal Tucker decomposition at each step to derive a low-rank approximate closed-form solution for denoising.

Table 1 Basic matrix/tensor notation and symbols

Denoising diffusion probabilistic models with increasing rank

Consider a high-quality image \({{\mathcal{X}}}_{0}\) and its degraded low-quality counterpart μ, where the paired images have the same size, i.e., \({{\mathcal{X}}}_{0},\mu \in {{\mathbb{R}}}^{3\times H\times W}\). Typical DDPMs for inpainting24 involve a forward diffusion process and a reverse denoising process.

Forward diffusion process

In contrast to Generative Adversarial Networks25, which directly learn data distributions, diffusion models focus on learning the evolution of data distributions over time. Let p0 denote the initial distribution representing the data, and let \(t\in [0,T]\) represent a continuous time variable. We consider a diffusion process \({\{{{\mathcal{X}}}_{t}\}}_{t = 0}^{T}\) governed by a mean-reverting stochastic differential equation26, defined as:

$${\rm{d}}{\mathcal{X}}={\theta }_{t}(\mu -{\mathcal{X}}){\rm{d}}t+{\sigma }_{t}{\rm{d}}{\mathcal{W}},\quad {{\mathcal{X}}}_{0} \sim {p}_{0}({\mathcal{X}}),$$
(1)

where the \({\mathcal{W}}\) denotes a standard Wiener process, introducing stochasticity to the differential equation. The time-dependent positive parameters θt and σt characterize the mean-reversion speed and stochastic volatility of the diffusion process, respectively.

These parameters are constrained by the relation \({\sigma }_{t}^{2}/{\theta }_{t}=2{\lambda }^{2}\), ensuring a stationary variance, where λ represents the fixed noise level applied to \({{\mathcal{X}}}_{T}\). This constraint maintains the stability of the diffusion process over time, allowing for consistent stochastic behavior throughout the evolution. Taking \({{\mathcal{X}}}_{0}\) as the initial condition, the SDE in Eq. (1) has a closed-form solution:

$${{\mathcal{X}}}_{t}=\mu +({{\mathcal{X}}}_{0}-\mu ){{\rm{e}}}^{-{\bar{\theta }}_{t}}+\mathop{\int}\nolimits_{0}^{t}{\sigma }_{z}{{\rm{e}}}^{-{\bar{\theta }}_{z:t}}{\rm{d}}{\mathcal{W}}(z),$$
(2)

where \({\bar{\theta }}_{t}=\mathop{\int}\nolimits_{0}^{t}{\theta }_{z}{\rm{d}}z\) is known, \({\bar{\theta }}_{z:t}\) denotes the same integral taken from z to t, and the transition kernel \(p({{\mathcal{X}}}_{t}| {{\mathcal{X}}}_{0})={\mathcal{N}}({{\mathcal{X}}}_{t}| {m}_{t}({\mathcal{X}}),{v}_{t})\) is a Gaussian distribution with mean \({m}_{t}({\mathcal{X}})\) and variance vt given by:

$$\begin{array}{rcl}{m}_{t}({\mathcal{X}})&:= &\mu +({{\mathcal{X}}}_{0}-\mu ){{\rm{e}}}^{-{\bar{\theta }}_{t}},\\ {v}_{t}&:= &{\lambda }^{2}(1-{{\rm{e}}}^{-2{\bar{\theta }}_{t}}).\end{array}$$
(3)

As \(t\to \infty \), the mean mt converges to the low-quality image μ, while the variance vt approaches the stationary variance \({\lambda }^{2}\). This convergence behavior indicates that the forward SDE, described in Eq. (1), progressively diffuses a high-quality image into a low-quality image with additive Gaussian noise of fixed variance. Consequently, the terminal state \({{\mathcal{X}}}_{T}\) is characterized by a Gaussian distribution with predetermined mean and variance, representing the fully degraded image.
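The convergence described above can be checked numerically. The following is a minimal sketch of Eq. (3), assuming small random arrays as stand-ins for the image pair and an illustrative noise level `lam`; the function names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def forward_marginal(x0, mu, theta_bar_t, lam):
    """Mean and variance of X_t given X_0, following Eq. (3)."""
    mean = mu + (x0 - mu) * np.exp(-theta_bar_t)
    var = lam**2 * (1.0 - np.exp(-2.0 * theta_bar_t))
    return mean, var

def sample_xt(x0, mu, theta_bar_t, lam, rng):
    """Draw X_t = m_t + sqrt(v_t) * eps with eps ~ N(0, I)."""
    mean, var = forward_marginal(x0, mu, theta_bar_t, lam)
    eps = rng.standard_normal(x0.shape)
    return mean + np.sqrt(var) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.random((3, 8, 8))   # stand-in for a high-quality image
mu = rng.random((3, 8, 8))   # stand-in for its degraded counterpart
lam = 0.1                    # assumed stationary noise level

# As the integrated rate theta_bar_t grows, the mean approaches mu
# and the variance approaches lam**2.
for theta_bar in (0.0, 1.0, 10.0):
    mean, var = forward_marginal(x0, mu, theta_bar, lam)
    print(theta_bar, float(np.abs(mean - mu).max()), float(var))
```

At `theta_bar_t = 0` the marginal collapses onto the clean image with zero variance, which matches the initial condition of Eq. (2).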

During the forward diffusion process, the stochastic nature of the added noise serves to decorrelate pixels that were initially related, potentially transforming linearly dependent rows or columns into linearly independent ones. Theoretically, the introduction of random values to each element of a matrix significantly increases the probability of the resulting data tensor achieving full rank. This phenomenon can be expressed as:

$$\begin{array}{lll}rank({{\mathcal{X}}}_{0})\;\le\; rank(\mu +{\mathcal{N}}(0,{\lambda }^{2}{\mathcal{I}}))\\\qquad\qquad\;\; \le\;rank(\mu )+rank({\mathcal{N}}(0,{\lambda }^{2}{\mathcal{I}})).\end{array}$$
(4)

Thus, the diffusion process causes the rank of the mural image to increase.
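A small numerical illustration of this rank increase, assuming a synthetic rank-4 matrix as a stand-in for one channel of a structured image (the sizes and noise level are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, lam = 64, 4, 0.1

# Rank-r "clean" matrix built as a product of two thin random factors.
clean = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
# Additive Gaussian noise with standard deviation lam on every entry.
noisy = clean + lam * rng.standard_normal((n, n))

rank_clean = np.linalg.matrix_rank(clean)
rank_noisy = np.linalg.matrix_rank(noisy)
print(rank_clean, rank_noisy)
```

The noisy matrix becomes numerically full rank, while the upper bound in Eq. (4) still holds because the rank of the noise itself is at most n.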

Reverse low-rank denoising process

Following ref. 27, we recover the high-quality image from the terminal state \({{\mathcal{X}}}_{T}\) according to:

$${\rm{d}}{\mathcal{X}}=[f({\mathcal{X}},t)-g{(t)}^{2}{\nabla }_{{\mathcal{X}}}\log {p}_{t}({\mathcal{X}})]{\rm{d}}t+g(t){\rm{d}}\hat{{\mathcal{W}}},$$
(5)

where \({\rm{d}}\hat{{\mathcal{W}}}\) denotes a reverse-time Wiener process and \({{\mathcal{X}}}_{T} \sim {p}_{T}({\mathcal{X}})\). For the forward SDE in Eq. (1), the drift and diffusion coefficients are \(f({\mathcal{X}},t)={\theta }_{t}(\mu -{\mathcal{X}})\) and \(g(t)={\sigma }_{t}\). Here \({p}_{t}({\mathcal{X}})\) represents the marginal probability density function of \({{\mathcal{X}}}_{t}\) at time t. The score function \({\nabla }_{{\mathcal{X}}}\log {p}_{t}({\mathcal{X}})\) is the sole unknown component and is generally intractable. Consequently, SDE-based diffusion models approximate this function by training a time-dependent neural network under a score matching objective.

During the training phase, the high-quality image \({{\mathcal{X}}}_{0}\) is available, enabling the neural network to estimate the conditional score \({\nabla }_{{\mathcal{X}}}\log {p}_{t}({\mathcal{X}}| {{\mathcal{X}}}_{0})\). Specifically, we reparameterize \({{\mathcal{X}}}_{t}\) as \({{\mathcal{X}}}_{t}={m}_{t}({\mathcal{X}})+\sqrt{{v}_{t}}{\epsilon }_{t}\). Subsequently, utilizing Eq. (3), we compute the ground truth score as:

$${\nabla }_{{\mathcal{X}}}\log {p}_{t}({\mathcal{X}}| {{\mathcal{X}}}_{0})=-\frac{{\epsilon }_{t}}{\sqrt{{v}_{t}}}.$$
(6)

In the above formulation, ϵt represents standard Gaussian noise, where \({\epsilon }_{t} \sim {\mathcal{N}}(0,I)\). Following common practice9, we approximate the noise using a conditional time-dependent neural network \({\epsilon }_{\phi }({{\mathcal{X}}}_{t},\mu ,t)\). This noise network takes as input the current state \({{\mathcal{X}}}_{t}\), the condition μ, and the time t, and outputs pure noise.
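The link between the reparameterization and Eq. (6) can be verified directly: the score of a Gaussian with mean m_t and variance v_t, evaluated at the reparameterized sample, equals the negative noise divided by the standard deviation. The sketch below uses random stand-in values for m_t and an assumed v_t.

```python
import numpy as np

def conditional_score(x_t, mean_t, var_t):
    """Gradient of log N(x_t | mean_t, var_t * I) with respect to x_t."""
    return -(x_t - mean_t) / var_t

rng = np.random.default_rng(0)
mean_t = rng.random((3, 4, 4))   # stand-in for m_t from Eq. (3)
var_t = 0.04                     # assumed value of v_t from Eq. (3)
eps = rng.standard_normal(mean_t.shape)

# Reparameterization: X_t = m_t + sqrt(v_t) * eps.
x_t = mean_t + np.sqrt(var_t) * eps
score = conditional_score(x_t, mean_t, var_t)
```

This is why predicting the noise with a network is sufficient to recover the score at training time.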

An alternative approach involves determining the optimal reverse state \({\hat{{\mathcal{X}}}}_{t-1}\) from \({{\mathcal{X}}}_{t}\) at the (t − 1)-th timestep via maximum likelihood learning. This optimization is achieved by minimizing the negative log-likelihood, expressed as:

$${\hat{{\mathcal{X}}}}_{t-1}=\mathop{{\rm{argmin}}}\limits_{{{\mathcal{X}}}_{t-1}}[-\log p({{\mathcal{X}}}_{t-1}| {{\mathcal{X}}}_{t},{{\mathcal{X}}}_{0})].$$
(7)

The closed-form solution of the above objective is:

$$\begin{array}{lll}{\hat{{\mathcal{X}}}}_{t-1}\;=\;\displaystyle\frac{1-{{\rm{e}}}^{-2{\bar{\theta }}_{t-1}}}{1-{{\rm{e}}}^{-2{\bar{\theta }}_{t}}}{{\rm{e}}}^{-{\theta }_{t}^{{\prime} }}({{\mathcal{X}}}_{t}-\mu )\\\qquad\quad +\;\displaystyle\frac{1-{{\rm{e}}}^{-2{\theta }_{t}^{{\prime} }}}{1-{{\rm{e}}}^{-2{\bar{\theta }}_{t}}}{{\rm{e}}}^{-{\bar{\theta }}_{t-1}}({{\mathcal{X}}}_{0}-\mu )+\mu ,\end{array}$$
(8)

where \({\theta }_{t}^{{\prime} }=\mathop{\int}\nolimits_{t-1}^{t}{\theta }_{z}{\rm{d}}z\) and \({\bar{\theta }}_{t}=\mathop{\int}\nolimits_{0}^{t}{\theta }_{z}{\rm{d}}z\).
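Eq. (8) translates into a few lines of array code. The sketch below is an illustration under the assumption that the integrated rates are passed in directly, so that the per-step integral is their difference; the function name is hypothetical.

```python
import numpy as np

def optimal_reverse_mean(x_t, x0, mu, theta_bar_prev, theta_bar_t):
    """Eq. (8): most likely X_{t-1} given X_t and X_0.

    theta_bar_prev and theta_bar_t are the integrals of theta from 0 to
    t-1 and from 0 to t; their difference is the per-step integral theta'_t.
    """
    theta_step = theta_bar_t - theta_bar_prev
    denom = 1.0 - np.exp(-2.0 * theta_bar_t)
    coef_xt = (1.0 - np.exp(-2.0 * theta_bar_prev)) / denom * np.exp(-theta_step)
    coef_x0 = (1.0 - np.exp(-2.0 * theta_step)) / denom * np.exp(-theta_bar_prev)
    return coef_xt * (x_t - mu) + coef_x0 * (x0 - mu) + mu

rng = np.random.default_rng(0)
x0, mu = rng.random((3, 4, 4)), rng.random((3, 4, 4))

# Noise-free state at integrated rate 0.5, stepped back to 0.3.
x_t = mu + (x0 - mu) * np.exp(-0.5)
x_prev = optimal_reverse_mean(x_t, x0, mu, theta_bar_prev=0.3, theta_bar_t=0.5)
```

Two sanity checks follow from the formula: a noise-free state on the mean trajectory of Eq. (3) is mapped back onto that trajectory, and the final step (with the previous integrated rate equal to zero) returns the clean image exactly.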

Time-dependent low-rank structure guidance

As shown in Eq. (8), diffusion-based completion of a mural inherently operates pixel-wise and relies on local pixel information. The image, however, also satisfies global structural priors such as coherent composition, symmetry and patterns, semantic context, perspective and depth, and color harmony. Because the diffusion process increases the rank of the image data, as shown in Eq. (4), our motivation is to recover the global low-rank structure of the image. To incorporate global structural guidance, so that the diffusion process respects the mural’s overall artistic integrity and meaning, we propose the following time-dependent reduced-rank function at timestep t:

$${{\mathcal{X}}}_{t-1}^{* }=\mathop{{\rm{argmin}}}\limits_{{\mathcal{X}}}{\gamma }_{t}\cdot rank({\mathcal{X}})+\parallel {\mathcal{X}}-{\hat{{\mathcal{X}}}}_{t-1}{\parallel }_{F},$$
(9)

where \({\hat{{\mathcal{X}}}}_{t-1}\) is calculated by Eq. (8). The first term in Eq. (9) promotes a globally low-rank structure, and the second term promotes pixel-wise similarity. The weight γt adaptively balances the two optimization directions and is set as:

$${\gamma }_{t}=\frac{1-{{\rm{e}}}^{-2{\bar{\theta }}_{t-1}}}{1-{{\rm{e}}}^{-2{\bar{\theta }}_{t}}}{{\rm{e}}}^{-{\theta }_{t}^{{\prime} }}.$$
(10)

We take the coefficient of the first term in Eq. (8) as the time-dependent weight because it represents exactly the proportion of high-rank noise contained in the pixel-wise optimal solution in the damaged region. The higher this proportion, the stronger the low-rank constraint in the optimization of Eq. (9).
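To see how this weight behaves over the reverse process, the sketch below evaluates Eq. (10) for every timestep. It assumes a piecewise-constant rate, so that the per-step integral equals the rate on that interval; the constant value 0.05 is purely illustrative and is not the schedule used in the experiments.

```python
import numpy as np

def gamma_schedule(theta):
    """gamma_t from Eq. (10) for t = 1..T with piecewise-constant theta.

    theta[t-1] is the assumed value of theta on the interval (t-1, t], so
    theta'_t = theta[t-1] and theta_bar_t is a cumulative sum.
    """
    theta = np.asarray(theta, dtype=float)
    theta_bar = np.concatenate([[0.0], np.cumsum(theta)])   # theta_bar_0 .. theta_bar_T
    t = np.arange(1, theta.size + 1)
    num = 1.0 - np.exp(-2.0 * theta_bar[t - 1])
    den = 1.0 - np.exp(-2.0 * theta_bar[t])
    return num / den * np.exp(-theta[t - 1])

gammas = gamma_schedule(np.full(100, 0.05))   # T = 100 steps, assumed constant rate
print(gammas[0], gammas[-1])
```

Under this assumption the weight vanishes at t = 1 and grows with t, so the low-rank term is strongest at the noisiest states and fades as the reverse process approaches the clean image.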

Through Eq. (9), we adaptively move the target solution of the reverse SDE toward the Pareto front of low-rankness and similarity. As the rank minimization problem is NP-hard, it is usually relaxed into a sum-of-nuclear-norms or L1-norm minimization problem28,29. However, as noted in Sec. ??, the mean-reverting SDE method is independent of the image size, whereas the size of the image data strongly influences rank minimization: the larger the image, the higher the time complexity of minimizing the rank function. We therefore introduce Tucker decomposition30 to reduce the complexity.

Tucker decomposition provides a general factorization of an N-th order tensor into a relatively small core tensor and factor matrices. A third-order tensor \({\mathcal{X}}\in {{\mathbb{R}}}^{{I}_{1}\times {I}_{2}\times {I}_{3}}\) can be expressed as:

$$\begin{array}{lll}{\mathcal{X}}\;\cong \;\mathop{\sum }\limits_{{r}_{1}=1}^{{R}_{1}}\cdots \mathop{\sum }\limits_{{r}_{3}=1}^{{R}_{3}}{g}_{{r}_{1},{r}_{2},{r}_{3}}({{\bf{b}}}_{{r}_{1}}^{(1)}\circ {{\bf{b}}}_{{r}_{2}}^{(2)}\circ {{\bf{b}}}_{{r}_{3}}^{(3)})\\\quad \;=\;{\mathcal{G}}{\times }_{1}{{\bf{B}}}^{(1)}{\times }_{2}{{\bf{B}}}^{(2)}{\times }_{3}{{\bf{B}}}^{(3)}\\\quad\; =\;[\![{\mathcal{G}};{{\bf{B}}}^{(1)},{{\bf{B}}}^{(2)},{{\bf{B}}}^{(3)}]\!],\end{array}$$
(11)

where \({\mathcal{G}}\in {{\mathbb{R}}}^{{R}_{1}\times {R}_{2}\times {R}_{3}}\) is the core tensor and \({{\bf{B}}}^{(n)}=[{{\bf{b}}}_{1}^{(n)},{{\bf{b}}}_{2}^{(n)},\cdots \,,{{\bf{b}}}_{{R}_{n}}^{(n)}]\in {{\mathbb{R}}}^{{I}_{n}\times {R}_{n}}\), n = 1, 2, 3, are the mode-n factor matrices; typically, \({R}_{n}\ll {I}_{n}\). The core tensor models a potentially complex pattern of mutual interaction between the vectors in different modes.
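The two forms of Eq. (11), the sum of weighted rank-one terms and the chain of mode-n products, describe the same tensor. The following sketch checks this on a small random example; the sizes and ranks are illustrative assumptions.

```python
import numpy as np

def tucker_to_tensor(core, factors):
    """Evaluate G x_1 B1 x_2 B2 x_3 B3 for a third-order core tensor."""
    b1, b2, b3 = factors
    return np.einsum("abc,ia,jb,kc->ijk", core, b1, b2, b3)

rng = np.random.default_rng(0)
dims, ranks = (3, 16, 16), (2, 4, 4)   # illustrative sizes only
core = rng.standard_normal(ranks)
factors = [rng.standard_normal((i, r)) for i, r in zip(dims, ranks)]

x = tucker_to_tensor(core, factors)

# The same tensor as an explicit sum of weighted rank-one (outer-product) terms.
x_sum = np.zeros(dims)
for r1 in range(ranks[0]):
    for r2 in range(ranks[1]):
        for r3 in range(ranks[2]):
            outer = np.multiply.outer(
                np.multiply.outer(factors[0][:, r1], factors[1][:, r2]),
                factors[2][:, r3],
            )
            x_sum += core[r1, r2, r3] * outer

# Number of stored values: core plus factors versus the full tensor.
n_tucker = core.size + sum(f.size for f in factors)
print(n_tucker, x.size)
```

When the ranks are much smaller than the dimensions, the core and factors together hold far fewer values than the full tensor, which is what makes operating on the core attractive.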

The Multilinear Singular Value Decomposition, also called the higher-order SVD, can be considered a special form of constrained Tucker decomposition31,32 in which all factor matrices, \({{\bf{B}}}^{(n)}={{\bf{U}}}^{(n)}\in {{\mathbb{R}}}^{{I}_{n}\times {I}_{n}}\), are orthogonal, i.e., \({{\bf{U}}}^{(n){\rm{T}}}{{\bf{U}}}^{(n)}={\bf{I}}\), where \({\bf{I}}\) is the identity matrix. After obtaining the orthogonal matrices \({{\bf{U}}}^{(n)}\) of left singular vectors of the mode-n unfolding \({{\bf{X}}}_{(n)}\), the core tensor \({\mathcal{G}}\) can be computed as:

$${\hat{{\mathcal{G}}}}_{t-1}={\hat{{\mathcal{X}}}}_{t-1}{\times }_{1}{{\bf{U}}}^{(1){\rm{T}}}{\times }_{2}{{\bf{U}}}^{(2){\rm{T}}}{\times }_{3}{{\bf{U}}}^{(3){\rm{T}}}.$$
(12)

The two states \(\{{{\mathcal{X}}}_{t-1}^{* },{\hat{{\mathcal{X}}}}_{t-1}\}\) of the image at the same timestep during the diffusion process can be represented using different linear combinations of the same set of factor matrices.
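A minimal sketch of the higher-order SVD and the core computation in Eq. (12) is given below. It assumes an untruncated decomposition and a row-major unfolding convention; the helper names are hypothetical, and the random tensor merely stands in for a small 3 × H × W image.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def mode_product(x, matrix, mode):
    """Mode-n product of tensor x with a matrix of shape (J, x.shape[mode])."""
    moved = np.moveaxis(x, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(x):
    """Untruncated higher-order SVD: the core of Eq. (12) and its factors."""
    # Left singular vectors of each unfolding; these are square as long as
    # x.shape[n] does not exceed the product of the other dimensions.
    factors = [np.linalg.svd(unfold(x, n), full_matrices=False)[0]
               for n in range(x.ndim)]
    core = x
    for n, u in enumerate(factors):
        core = mode_product(core, u.T, n)
    return core, factors

def reconstruct(core, factors):
    """Map a core tensor back to image space with the given factors."""
    x = core
    for n, u in enumerate(factors):
        x = mode_product(x, u, n)
    return x

rng = np.random.default_rng(0)
x_hat = rng.random((3, 16, 16))   # stand-in for a small 3 x H x W image
core, factors = hosvd(x_hat)
x_rec = reconstruct(core, factors)
```

Because the factors are orthogonal, any modified core can be mapped back with the same `reconstruct` call, which is the sense in which two states share one set of factor matrices.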

Based on Tucker decomposition mentioned above, we propose Tucker rank minimization to adjust the optimal solution so that the supervised generation process is optimized towards both pixel-wise similarity solution and rank reduction of the global structure.

Tucker rank minimization for mural

The Tucker rank of a third-order image \({\mathcal{X}}\in {{\mathbb{R}}}^{3\times H\times W}\) is the 3-tuple (R1, R2, R3) consisting of the dimensions of the different subspaces. If the Tucker decomposition in Eq. (11) holds exactly, it is mathematically defined as:

$$ran{k}_{t}({\mathcal{X}})=\{rank({{\bf{X}}}_{(1)}),rank({{\bf{X}}}_{(2)}),rank({{\bf{X}}}_{(3)})\},$$
(13)

where \({{\bf{X}}}_{(n)}\) is the mode-n unfolding of tensor \({\mathcal{X}}\). For the Tucker format, the sum of nuclear norms of all mode-n unfolding matrices has been developed as a convex surrogate of the Tucker rank28. We can therefore rewrite Eq. (9) as:

$$\mathop{\min }\limits_{{\mathcal{X}}}\frac{{\gamma }_{t}}{3}\mathop{\sum }\limits_{n=1}^{3}\parallel {{\bf{X}}}_{(n)}{\parallel }_{* }+\parallel {\mathcal{X}}-{\hat{{\mathcal{X}}}}_{t-1}{\parallel }_{F}^{2}.$$
(14)
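The Tucker rank of Eq. (13) and the nuclear-norm surrogate in Eq. (14) can both be computed from the mode-n unfoldings. The sketch below builds a synthetic tensor with a known Tucker rank as an illustrative assumption and shows how additive noise pushes every mode to full rank.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def tucker_rank(x):
    """Eq. (13): ranks of the mode-n unfoldings."""
    return tuple(int(np.linalg.matrix_rank(unfold(x, n))) for n in range(x.ndim))

def mean_nuclear_norm(x):
    """Average nuclear norm of the unfoldings, the surrogate term in Eq. (14)."""
    return sum(np.linalg.norm(unfold(x, n), ord="nuc") for n in range(x.ndim)) / x.ndim

rng = np.random.default_rng(0)
core = rng.standard_normal((2, 3, 4))
b1, b2, b3 = (rng.standard_normal((i, r)) for i, r in zip((3, 10, 12), (2, 3, 4)))
x = np.einsum("abc,ia,jb,kc->ijk", core, b1, b2, b3)   # Tucker rank (2, 3, 4)
noisy = x + 0.1 * rng.standard_normal(x.shape)

print(tucker_rank(x), tucker_rank(noisy))
print(mean_nuclear_norm(x), mean_nuclear_norm(noisy))
```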

For the orthonormal Tucker format of a given image, that is, \({\mathcal{X}}=[[{\mathcal{G}};{{\bf{U}}}^{(1)},{{\bf{U}}}^{(2)},{{\bf{U}}}^{(3)}]]\), the Frobenius norms and the Schatten p-norms of \({\mathcal{X}}\) and \({\mathcal{G}}\) are equal:

$$\begin{array}{lll}\parallel {\mathcal{X}}-{\hat{{\mathcal{X}}}}_{t-1}{\parallel }_{F}\;=\;\parallel [\![{\mathcal{G}}-{\hat{{\mathcal{G}}}}_{t-1};{{\bf{U}}}^{(1)},{{\bf{U}}}^{(2)},{{\bf{U}}}^{(3)}]\!]{\parallel }_{F}\\\qquad\qquad\qquad\; =\;\parallel {\mathcal{G}}-{\hat{{\mathcal{G}}}}_{t-1}{\parallel }_{F},\end{array}$$
(15)
$$\parallel {\mathcal{X}}{\parallel }_{Sp}=\parallel {\mathcal{G}}{\parallel }_{Sp},1\,\leqslant\, p < \infty .$$
(16)

Thus, the Frobenius norms can be computed with \({\mathcal{O}}({R}^{3})\) complexity, where R = max{R1, R2, R3}, instead of the usual \({\mathcal{O}}({I}^{3})\) complexity; typically, \(R\ll I\). The Schatten p-norm of an N-th order tensor \({\mathcal{X}}\) is defined as the average of the Schatten norms of its mode-n unfolding matrices, i.e.,

$$\parallel {\mathcal{X}}{\parallel }_{Sp}=\frac{1}{3}\mathop{\sum }\limits_{n=1}^{3}\parallel {{\bf{X}}}_{(n)}{\parallel }_{Sp},$$
(17)

where \(\parallel {{\bf{X}}}_{(n)}{\parallel }_{Sp}={(\sum _{r}{\sigma }_{r}^{p})}^{\frac{1}{p}}\), and σr is the r-th singular value of the unfolding matrix \({{\bf{X}}}_{(n)}\). For p = 1, the Schatten norm reduces to the nuclear norm. Thus, the nuclear norms of the original tensor and the core tensor are equal, i.e.,

$$\mathop{\sum }\limits_{n=1}^{3}\parallel {{\bf{X}}}_{(n)}{\parallel }_{* }=\mathop{\sum }\limits_{n=1}^{3}\parallel {{\bf{G}}}_{(n)}{\parallel }_{* }.$$
(18)
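The invariances in Eqs. (15) and (18) can be confirmed numerically. The sketch below assumes random square orthogonal factors obtained from QR decompositions and a random core; it is a check of the stated property, not part of the method itself.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

rng = np.random.default_rng(0)
core = rng.standard_normal((3, 6, 6))
# Random square orthogonal factor matrices from QR decompositions.
factors = [np.linalg.qr(rng.standard_normal((n, n)))[0] for n in core.shape]
x = np.einsum("abc,ia,jb,kc->ijk", core, *factors)

# Frobenius norm of the full tensor versus the core.
fro_x, fro_g = np.linalg.norm(x), np.linalg.norm(core)
# Nuclear norm of every mode-n unfolding, for the tensor and for the core.
nuc_x = [np.linalg.norm(unfold(x, n), ord="nuc") for n in range(3)]
nuc_g = [np.linalg.norm(unfold(core, n), ord="nuc") for n in range(3)]
print(fro_x, fro_g)
print(nuc_x, nuc_g)
```

Each unfolding of the tensor equals the corresponding unfolding of the core multiplied by orthogonal matrices on both sides, so the singular values, and with them all Schatten norms, are unchanged.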

We employ block coordinate descent for the optimization. The basic idea of block coordinate descent is to optimize one block of variables while keeping the other blocks fixed. We divide the variables into three blocks: \({{\bf{G}}}_{(1)}\), \({{\bf{G}}}_{(2)}\), and \({{\bf{G}}}_{(3)}\). Using the above properties of the core tensor and the original tensor, we can rewrite Eq. (14) as:

$$\mathop{\min }\limits_{{{\bf{G}}}_{(n)}}\frac{{\gamma }_{t}}{2}\left|\left| {{\bf{G}}}_{(n)}\right|\right| _{* }+\frac{1}{2}\left|\left| {{\bf{G}}}_{(n)}-{\hat{{\bf{G}}}}_{(n)}\right|\right| _{F}^{2},$$
(19)

where \({\hat{{\bf{G}}}}_{(n)}\) is the mode-n unfolding of \({\hat{{\mathcal{G}}}}_{t-1}\). The above problem has been shown to admit a closed-form solution33,34. Let \({\hat{{\bf{G}}}}_{(n)}={\bf{U}}\cdot diag(\{{\sigma }_{i}\})\cdot {{\bf{V}}}^{T}\) be the singular value decomposition; the optimal \({{\bf{G}}}_{(n)}\) can then be computed as:

$${{\mathfrak{D}}}_{\displaystyle\frac{{\gamma }_{t}}{2}}({\hat{{\bf{G}}}}_{(n)})={\bf{U}}\cdot diag\left(\left\{{\left({\sigma }_{i}-\displaystyle\frac{{\gamma }_{t}}{2}\right)}_{+}\right\}\right)\cdot {{\bf{V}}}^{T},$$
(20)

where \({t}_{+}\) is the positive part of t, namely, \({t}_{+}=\max (0,t)\). Finally, we average the optimal solutions over the modes to obtain the solution of Eq. (19):

$${{\mathcal{G}}}_{t-1}^{* }=\frac{\mathop{\sum }\nolimits_{n = 1}^{3}{{\rm{fold}}}_{n}({{\mathfrak{D}}}_{\displaystyle\frac{{\gamma }_{t}}{2}}({\hat{{\bf{G}}}}_{(n)}))}{3}.$$
(21)
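The per-mode thresholding step of Eqs. (19) and (20) can be sketched and sanity-checked as follows. The random matrix stands in for one unfolding of the core, and the weight value is an assumption chosen for illustration.

```python
import numpy as np

def svt(matrix, tau):
    """Singular value thresholding, the operator in Eq. (20) with tau = gamma_t / 2."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def objective(g, g_hat, tau):
    """Per-mode objective of Eq. (19) with tau = gamma_t / 2."""
    return tau * np.linalg.norm(g, ord="nuc") + 0.5 * np.linalg.norm(g - g_hat) ** 2

rng = np.random.default_rng(0)
g_hat = rng.standard_normal((6, 18))   # stand-in for one unfolding of the core
gamma_t = 0.8                          # assumed weight
g_star = svt(g_hat, gamma_t / 2.0)

# The thresholded matrix should not be beaten by nearby candidates.
best = objective(g_star, g_hat, gamma_t / 2.0)
others = [objective(g_star + 0.01 * rng.standard_normal(g_hat.shape), g_hat, gamma_t / 2.0)
          for _ in range(20)]
print(best, min(others))
```

Singular values below the threshold are set to zero, so the step lowers the rank only when the weight is large relative to the trailing singular values.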

Using the fixed factor matrices, we obtain the target solution with pixel-wise similarity and global low-rankness simultaneously, which is formulated as:

$${{\mathcal{X}}}_{t-1}^{* }={{\mathcal{G}}}_{t-1}^{* }{\times }_{1}{{\bf{U}}}^{(1)}{\times }_{2}{{\bf{U}}}^{(2)}{\times }_{3}{{\bf{U}}}^{(3)}.$$
(22)
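Chaining Eqs. (12), (20), (21), and (22) gives one guidance step that maps the pixel-wise solution to the low-rank guided target. The following is a hedged sketch of that chain under the assumption of an untruncated higher-order SVD; the function `low_rank_guidance` and its helpers are hypothetical names, not the authors' implementation.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of unfold for a tensor of the given shape."""
    lead = (shape[mode],) + tuple(s for i, s in enumerate(shape) if i != mode)
    return np.moveaxis(matrix.reshape(lead), 0, mode)

def mode_product(x, matrix, mode):
    """Mode-n product of tensor x with a matrix of shape (J, x.shape[mode])."""
    new_shape = x.shape[:mode] + (matrix.shape[0],) + x.shape[mode + 1:]
    return fold(matrix @ unfold(x, mode), mode, new_shape)

def svt(matrix, tau):
    """Singular value thresholding with threshold tau."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def low_rank_guidance(x_hat, gamma_t):
    """Map the pixel-wise solution x_hat to the low-rank guided target."""
    modes = range(x_hat.ndim)
    # Orthogonal factors from the unfoldings of x_hat.
    factors = [np.linalg.svd(unfold(x_hat, n), full_matrices=False)[0] for n in modes]
    core_hat = x_hat
    for n in modes:                                   # Eq. (12)
        core_hat = mode_product(core_hat, factors[n].T, n)
    core_star = sum(                                  # Eqs. (20) and (21)
        fold(svt(unfold(core_hat, n), gamma_t / 2.0), n, core_hat.shape)
        for n in modes
    ) / x_hat.ndim
    x_star = core_star
    for n in modes:                                   # Eq. (22)
        x_star = mode_product(x_star, factors[n], n)
    return x_star

rng = np.random.default_rng(0)
x_hat = rng.random((3, 16, 16))        # stand-in for the output of Eq. (8)
x_star = low_rank_guidance(x_hat, gamma_t=0.5)
```

With a zero weight the step leaves the input unchanged, and with a very large weight every singular value is thresholded away, which brackets the behavior of the adaptive weight between pure pixel-wise similarity and the strongest low-rank constraint.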

Thus, we can optimize ϵϕ via the following training objective:

$${{\mathcal{L}}}_{\beta }^{t}(\phi )=\mathop{\sum }\limits_{t=1}^{T}{\beta }_{t}{\mathbb{E}}[\parallel {{\mathcal{X}}}_{t}-{({\rm{d}}{{\mathcal{X}}}_{t})}_{{\epsilon }_{\phi }}-{{\mathcal{X}}}_{t-1}^{* }{\parallel }_{{\ell }_{1}}],$$
(23)

where βt is a positive weight; \({({\rm{d}}{{\mathcal{X}}}_{t})}_{{\epsilon }_{\phi }}\) denotes the reverse SDE in Eq. (5), whose score is predicted via the noise network \({\epsilon }_{\phi }({{\mathcal{X}}}_{t},\mu ,t)\).

Results

Dataset

To verify the effectiveness of the proposed method, we conducted a restoration experiment on a self-built Shaanxi mural dataset, which includes both murals with artificially added random damage and murals with real damage. The dataset comprises high-definition mural images from Shaanxi, featuring various categories such as landscapes and Buddha statues. A total of 351 images of 768 × 768 resolution were collected, with 320 images used for training and the remaining 31 images reserved for testing. The training set was augmented to 2560 images through techniques including horizontal flipping, vertical flipping, and rotation. For evaluation, we employed several metrics to assess the quality of the restored images. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are pixel-based metrics that measure the fidelity of restored images compared to the original ones. Additionally, the Learned Perceptual Image Patch Similarity (LPIPS) measures perceptual similarity by leveraging deep learning models trained on human judgments, offering a more nuanced assessment of image quality. Fréchet Inception Distance (FID) evaluates the similarity between generated and real images in feature space, using a pre-trained Inception network to extract features and compute distances, thereby providing a metric that reflects higher-level image characteristics beyond pixel-wise differences. This comprehensive evaluation framework ensures a thorough assessment of the proposed method’s performance in restoring damaged murals.
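As a concrete reference for the simplest of these metrics, the sketch below computes PSNR from the mean squared error. It assumes images scaled to the range [0, 1]; the exact evaluation code used for the reported numbers is not specified in the text.

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    reference = np.asarray(reference, dtype=float)
    restored = np.asarray(restored, dtype=float)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value**2 / mse)

reference = np.zeros((3, 8, 8))
restored = np.full((3, 8, 8), 0.1)     # uniform error of 0.1 on every pixel
print(psnr(reference, restored))
```

A uniform error of 0.1 gives a mean squared error of 0.01 and therefore a PSNR of 20 dB, which helps put values above 30 dB in context.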

Detail

We randomly crop 256 × 256 pixel patches from each image as input and set the batch size to 8. During training, the AdamW optimizer is employed over 600 epochs with a momentum of 0.9, a weight decay of 5e-4, and an initial learning rate of 1e-4. A cosine annealing scheduler is used to gradually decrease the learning rate to 0. The noise-prediction network is constructed by removing the group normalization layers and self-attention layers from the U-Net in DDPM9 for inference efficiency. We employ a vanilla conditional network as the noise network ϵϕ. Data augmentation includes random horizontal and vertical flipping and 90-degree rotation. We set the number of timesteps to T = 100 for the diffusion model. All experiments are implemented with the PyTorch framework and run on four NVIDIA 2080 Ti GPUs.
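The learning-rate decay described above can be illustrated with the standard closed-form cosine schedule. This sketch assumes the rate is updated once per epoch, which the text does not state; the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=600, base_lr=1e-4, min_lr=0.0):
    """Standard cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

# Learning rate at the start, midpoint, and end of training.
for epoch in (0, 300, 600):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches zero at the final epoch.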

Comparison

To validate the effectiveness of our proposed method and demonstrate the capacity of low-rank guidance to enhance mural restoration performance, we conducted comprehensive experiments on a test set of 31 murals. We applied random masks to these images and compared our approach with state-of-the-art GAN-based and diffusion-based restoration models, as well as with the open-source API of the Stable Diffusion model fine-tuned with LoRA on our dataset. Tables 2 and 3 present the performance metrics across the various methods, illustrating the superior performance of our approach.

Table 2 Comparison of quantitative results on PSNR and SSIM under varied mask ratios
Table 3 Comparison of quantitative results on LPIPS and FID under varied mask ratios

Our proposed method achieves state-of-the-art performance across three key metrics on the mural dataset. Specifically, we achieved 31.21 dB and 0.952 on the PSNR and SSIM metrics, respectively, both of which quantify detail preservation and structural fidelity. Furthermore, we achieve 11.52 on the FID, indicating excellent performance in preserving the overall image distribution. These results underscore the efficacy of our method in maintaining both fine-grained details and global image characteristics in mural restoration tasks.

Figure 2 presents a comparative analysis of our proposed method against other diffusion-based models, showcasing eight randomly selected images from our mural restoration dataset. The results demonstrate that our approach achieves superior performance in reconstructing detailed textures, yielding more realistic restorations compared to existing methods. Our method exhibits robust performance across various degradation scenarios, including large-area damage, which presents an ill-posed problem, and small-area degradation, which represents a more deterministic restoration problem. In both cases, our approach produces more natural and visually coherent results, effectively addressing the challenges inherent in mural restoration tasks. The visual comparison clearly illustrates the enhanced ability of our method to preserve fine details and maintain overall image integrity, surpassing the capabilities of current state-of-the-art diffusion-based models in the context of mural restoration.

Fig. 2: Visual comparison of mural restoration results.

Input shows the degraded mural and GT displays the corresponding original image. The rows labeled SD-LoRA, IR-SDE, and StrDiffusion present restoration results from existing methods, while the LRDiff row demonstrates the results achieved by our proposed method.

Ablation

Effectiveness of low-rank guidance on performance

To validate the efficacy of low-rank guidance in mural restoration, we conducted experiments using varying numbers of training timesteps as shown in Table 4. Our results demonstrate that tensor rank optimization significantly accelerates the reverse diffusion process. The low-rank guided approach achieves optimal performance at T = 100, outperforming the unguided method. Notably, even at a reduced timestep of T = 80, the low-rank guided reverse diffusion process maintains superior performance compared to its unguided counterpart. These findings underscore the potential of low-rank constraints in enhancing both the efficiency and quality of mural restoration techniques.

Table 4 Quantitative analysis of low-rank guidance

Effectiveness of adaptive mechanism on performance

To validate the efficacy of our adaptive mechanism in Eq. (9), we conducted experiments comparing fixed and dynamic weighting strategies, i.e., using Eq. (10). As shown in Table 5, fixed-weight approaches introduce additional hyperparameters that complicate optimization and reduce interpretability, leading to suboptimal performance. Our adaptive mechanism dynamically balances global low-rank constraints for structural coherence and local Frobenius norm minimization for fine details across diffusion stages. Early restoration prioritizes global consistency, while later stages emphasize local fidelity, naturally reflecting mural restoration workflows.

Table 5 Quantitative analysis of adaptive mechanism

Low-rankness of unfolding tensor

To validate Eq. (4) and justify the application of Eq. (19) for low-rank optimization, we conducted an empirical analysis. We computed the nuclear norm of the unfolding matrix derived from the core tensor following orthogonal Tucker decomposition. Figure 3 illustrates the evolution of this mode-n unfolding matrix as a function of the diffusion timestep. This visualization provides compelling evidence for the theoretical foundations of our proposed approach and elucidates the temporal dynamics of the low-rank structure during the diffusion process.

Fig. 3: Low-rankness of core tensor along the n-th mode.

As the denoising diffusion process progresses, the mural approaches complete restoration, while its orthogonal Tucker core tensor exhibits progressively stronger low-rank characteristics.

Effectiveness of Tucker rank {R1, R2, R3} on performance

Our LRDiff model accommodates various rank configurations. To assess the impact of these configurations on restoration performance and computational efficiency, we conducted the quantitative analysis shown in Table 6, which illustrates the effects of different Tucker ranks on both restoration quality and inference time. As the Tucker rank increases, we observe an improvement in restoration quality, albeit at the cost of increased inference time. Notably, when the Tucker rank is configured as {8, 128, 128}, the model achieves optimal performance across multiple metrics: PSNR reaches 31.98 dB, SSIM attains 0.956, and FID decreases to 11.02. This analysis provides valuable insights into the trade-offs between model complexity and performance, enabling optimal parameter selection for specific mural restoration tasks.

Table 6 Quantitative analysis of Tucker rank

Discussion

In this work, we proposed a Low-rank structure guided diffusion model to virtually restore the deteriorated regions of the ancient murals. Our approach leverages a diffusion model to capture pixel-wise similarities between low-quality and high-quality image pairs. Exploiting the inherent low-rank prior of murals, we propose a novel low-rank guidance method based on orthogonal Tucker decomposition. This method is seamlessly integrated into the time-dependent function of the diffusion model, enhancing restoration performance while significantly reducing computational complexity from \({\mathcal{O}}({I}^{3})\) to \({\mathcal{O}}({R}^{3})\). Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance in mural restoration tasks. The efficacy of the low-rank guidance is rigorously quantified through comprehensive ablation studies. These results not only validate the superiority of our approach but also underscore the potential of incorporating structural priors into diffusion models for mural restoration tasks.

The ability of our proposed method to reconstruct complex structures while maintaining global coherence represents a substantial advancement in the field of mural restoration. By integrating low-rank priors and diffusion models, we provide a robust framework capable of addressing the unique challenges posed by severely damaged murals, thus contributing to the preservation of invaluable cultural heritage.

Our method has two main limitations. First, despite the use of orthogonal Tucker decomposition and the solution of Eq. (19), the reliance on the SVD algorithm can lead to the loss of fine-grained texture details, particularly in regions with smaller singular values. Second, the experimental dataset used in this study is relatively limited in scope, as it does not encompass the diverse mural data from different cultures and historical backgrounds worldwide. Therefore, constructing a larger and more comprehensive dataset remains a critical task for future research in this field.