Introduction

The purpose of automatic mural sketch extraction is to convert a mural image directly into a sketch, which is of great significance for the conservation of murals. The main challenges are twofold: (1) filtering out a wide variety of deteriorations while extracting meaningful lines; (2) keeping the result consistent with experts’ drawing style.

Few efforts have been made in the academic field to overcome these two challenges, owing to the difficulty of extracting sketches directly from murals. Early mural sketch extraction algorithms are usually based on the Canny1 edge detector combined with smoothing or other traditional algorithms. FSG2 proposes a mural line drawing extraction system based on Canny edge detection, but it ignores the double-line problem. To solve the double-line problem, Mural2sketch (M2S)3 adopts a heuristic routing algorithm, using the Canny edge detector to detect stroke contours and a high-frequency enhancement filter to extract the internal information of the strokes. Traditional algorithms usually involve many adjustable parameters, so experts have to tune hyper-parameters frequently to obtain accurate sketches. In many cases, these methods fail to extract meaningful sketches even after specific or complex hyper-parameter settings. Moreover, because they respond wherever the image gradient changes drastically, detectors such as Sobel4 and Canny1 cannot distinguish lines from scratches and therefore cannot separate deteriorated from non-deteriorated areas. L0 smoothing5 and other smoothing techniques5,6 can only filter out minor mural deteriorations such as stains and narrow scratches, but they cannot effectively filter out obvious ones. In summary, these methods neither solve the double-line problem nor filter out deteriorations, and thus fail to convey the aesthetic expression of the artist.

The development of neural networks, especially edge-related convolutional neural networks7,8,9,10,11, provides new possible solutions to the problems of mural sketch extraction. Learning-based line extraction methods12,13,14 are more or less inspired by edge detection methods7,9. Based on HED9, the GSC14 algorithm shows stronger deterioration-filtering ability than traditional methods by enhancing the feature fusion process of HED9. However, the results are still unacceptable compared with experts’ strokes; specifically, the extracted sketch lines are noisy and twined (lacking separation). The subsequent algorithms12,13 are mostly variants of edge detection algorithms, such as BDN12, which is based on DexiNed7. BDN takes DexiNed, a more advanced edge detection algorithm, as its backbone and optimizes the fusion part of the model to enhance the mural extraction output. All the algorithms mentioned above are ineffective at keeping deteriorations out of the extracted sketches.

A question then arises: what is the main cause of the ineffectiveness of the aforementioned methods12,13,14? The reason is that these methods model mural sketch extraction as an application of edge detection, whereas this task is not an edge detection problem. We find that in the design of these methods, input features are encoded into high-level features by operations such as sequential convolutions but are never fully reconstructed, which leads to inadequate deterioration filtering and noisy lines. Specifically, the modifications introduced by these mural sketch extraction methods are adaptive fusion or adding convolutional blocks before fusion. Their results are better than those of the original edge detection algorithms they build on. However, compared with the encoding stage, where features are extracted by strong feature extractors such as VGG15 or Inception16, the reconstruction stage of these methods is too weak, since it is done through simple or modified fusion. It is therefore more appropriate to refer to these methods as mural sketch detectors that depict edge-like pixels without filtering deteriorations out.

If we treat mural sketch extraction as a segmentation task, we find that most segmentation algorithms cannot be used directly. Most of them, such as DeepLab V117, DeepLab V218, DeepLab V3+19, HRNet20, and Panoptic FPN21, set the maximum reconstruction resolution at 1/4 or 1/8 of the original image size, and their convolutional outputs are then upsampled by bilinear interpolation. This makes it impossible for these algorithms to achieve high-accuracy mural sketch extraction. Due to its insufficient receptive field and lack of skip connections, a typical AutoEncoder is also not a preferred alternative. UNet22, a medical segmentation model design, retains enough convolutional layers in its reconstruction stage, making it strong enough to build finer mural lines.

Motivated by the analysis above, we abandon the idea of treating mural sketch extraction as edge detection. We propose to exploit U2Net23 as our backbone and to model mural sketch extraction as a generation problem. Firstly, as a variant of UNet, U2Net23 has a two-level nested U structure, which greatly increases the receptive field and the network depth. The larger receptive field and deeper network enable the model to distinguish between a noisy mural background and lines that are hard to tell apart by untrained individuals, and the retention of the skip connections commonly used in UNet makes the transmission of information in U2Net smoother. Secondly, the sketch line corresponding to a bold line in a mural image can be drawn either as a single line or as the contours around it. An untrained individual finds it difficult to decide which, whereas experts are able to analyze the whole picture and decide how to draw the sketch. Therefore, mural sketch extraction is not an edge detection task that faithfully depicts contours; the network needs to learn the habits and experience of experts. Meanwhile, mural deteriorations such as scratches may be difficult to distinguish from lines, which requires a deeper understanding of the semantics of the mural image. Generative adversarial networks (GANs)24 are proficient at modeling mappings between different domains and are widely used in image synthesis25,26,27,28,29, image captioning30, 3D modeling31, video applications32,33, and image inpainting8,34,35,36. The first intersection of GANs and edge detection was the image inpainting method Edge-Connect8, in which a generator network completes incomplete edges; there, GANs were demonstrated to hallucinate the missing content of incomplete images or the missing lines of incomplete sketches. GANs with additional priors are conditional GANs (CGANs)37. Pix2Pix38, Pix2PixHD39 and SPADE40 generate images using segmentation maps, edge maps, or other priors as guidance. In the same spirit, the generation of a mural sketch is simplified under the guidance of edges: direct sketch generation is transformed into the refinement of edges.

Inspired by Edge-Connect and CGAN, we propose to exploit GANs to gain a deeper insight into murals and to use edges as guidance to improve recall. We additionally propose a novel scaled cross-entropy loss to make the extracted lines thinner.

Experimental results on Dunhuang murals41 demonstrate that our proposed method produces accurate mural sketches. Example results are shown in Fig. 1. Our contributions are summarized below:

  • We propose an edge-guided generative model to generate mural sketches, which not only improves sketch generation recall but also keeps the extracted lines free from mural deteriorations.

  • We propose a novel scaled cross-entropy loss to make the extracted lines thinner so as to fit expert cognition.

  • Our proposed mural sketch generation system achieves higher accuracy than previous state-of-the-art methods on the Dunhuang mural dataset.

Fig. 1: Examples of sketches generated with our method.

The outputs (b, d, f) generated by our method correspond to the Dunhuang mural dataset samples (a, c, e), respectively.

Methods

In this section, we describe our approach from the bottom up. We first introduce our proposed mural sketch generation framework and then present our supervision strategies. As shown in Fig. 2, the framework consists of an edge-guided generator and a multi-scale PatchGAN discriminator. For training supervision, we propose a new loss function called scaled cross-entropy, which reweights and rescales the reconstruction loss commonly used in methods such as U2Net23, HED9 and DSS42 so that the generated mural sketch lines become slimmer.

Fig. 2: Overview of our framework.

Our framework combines an edge-guided mural sketch generator with a multi-scale PatchGAN discriminator for accurate mural sketch generation.

Edge guided generator

The mural sketch generator G follows the architecture of U2Net, which is designed as a nested U-Net. The mural image and the mural edges extracted by Canny are concatenated as the input of generator G; the low and high thresholds of the Canny operator are 100 and 255 in our experiments. The generator G is a two-level nested structure in which every resolution stage is a simplified U-Net followed by max pooling. Hence, G can effectively extract multi-scale intra-stage features within each resolution stage and inter-stage features across resolution stages.
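As a concrete illustration, the following minimal sketch shows how this edge-guided input could be assembled, assuming OpenCV for the Canny operator (with the low/high thresholds of 100 and 255 reported above) and PyTorch tensors for the generator input; the function and variable names are illustrative rather than taken from our implementation.

```python
import cv2
import numpy as np
import torch

def build_generator_input(mural_bgr: np.ndarray) -> torch.Tensor:
    """Concatenate a mural image with its Canny edge map along the channel axis."""
    gray = cv2.cvtColor(mural_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 255)                    # thresholds as reported in the text
    mural = mural_bgr.astype(np.float32) / 255.0         # H x W x 3, in [0, 1]
    edge = edges.astype(np.float32)[..., None] / 255.0   # H x W x 1, binary edge map
    x = np.concatenate([mural, edge], axis=-1)           # H x W x 4
    # NCHW tensor with four channels; generator G consumes this edge-guided input.
    return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)
```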

As illustrated in Fig. 2, the generator consists of two main parts: an encoder and a decoder. The encoder consists of six U-Blocks named E1 to E6. Stages E1, E2, E3 and E4 use standard U-Blocks. After multiple down-sampling steps, the resolution in E5 and E6 is so low that max pooling inside the nested U-Block is no longer meaningful, so the max pooling and upsampling in the nested U-Blocks of E5 and E6 are replaced with dilated convolutions. The decoder stages are symmetrical to the encoder stages; like E5 and E6, the module in D1 is a dilated version of the U-Block because it operates at a relatively low resolution. Six side outputs are generated from stages E6, D1, D2, D3, D4 and D5 by a 3 × 3 convolution layer, up-sampling by factors of 32, 16, 8, 4, 2 and 1, respectively, and a sigmoid function. The pre-sigmoid features of the six side outputs are concatenated and convolved with a 1 × 1 kernel to obtain the final output. All side outputs and the final output provide supervision for the sketch generator.
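A schematic sketch of this side-output and fusion scheme is given below; it is an assumed re-implementation for illustration, not our released code. Each stage emits a 1-channel map through a 3 × 3 convolution, the maps are bilinearly up-sampled to the input resolution, and the pre-sigmoid maps are concatenated and fused by a 1 × 1 convolution. The hypothetical argument `stage_channels` would hold the output channel counts of E6 and D1–D5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputFusion(nn.Module):
    """Side outputs from six stages plus a 1x1-conv fusion, as described above."""
    def __init__(self, stage_channels):
        super().__init__()
        # One 3x3 convolution per stage (E6, D1, ..., D5), each producing a 1-channel map.
        self.side_convs = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=3, padding=1) for c in stage_channels]
        )
        self.fuse = nn.Conv2d(len(stage_channels), 1, kernel_size=1)

    def forward(self, stage_feats, out_size):
        # Up-sample every side map to full resolution (factors 32, 16, 8, 4, 2, 1).
        sides = [
            F.interpolate(conv(f), size=out_size, mode="bilinear", align_corners=False)
            for conv, f in zip(self.side_convs, stage_feats)
        ]
        fused = self.fuse(torch.cat(sides, dim=1))
        # Sigmoid on each side map and on the fused map; all of them are supervised.
        return [torch.sigmoid(s) for s in sides], torch.sigmoid(fused)
```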

In this paper, we provide two instances of our generator: a standard version and a lite version. The channel number in each U-Block of the lite version is lower than in the standard version, and the U-Blocks used in E1 and D1 of the standard version are replaced by normal convolutional layers in the lite version.

Multi-scale PatchGAN discriminator

Previous mural sketch extraction methods, which try to extract every sketch line directly with a single network, apply supervision such as a cross-entropy loss to the side outputs. However, we consider the task of sketch generation, in which many apparent lines are actually scratches. Motivated by pix2pix38, SPADE40 and SE-GAN36, we propose to exploit a multi-scale PatchGAN discriminator to provide auxiliary supervision.

As shown in Fig. 2, three parallel fully convolutional networks with different numbers of layers are used as the discriminator. Their input is the sketch generated by G, and their outputs are three 2-D feature maps of shapes \({\mathbb{R}}^{{h}_{1}\times {w}_{1}}\), \({\mathbb{R}}^{{h}_{2}\times {w}_{2}}\) and \({\mathbb{R}}^{{h}_{3}\times {w}_{3}}\) (where \({h}_{k}\) and \({w}_{k}\) denote the height and width of the k-th PatchGAN discriminator output). Specifically, six convolution layers with kernel size 3 and stride 2 are stacked in Dis1 to capture the feature statistics of input patches. Dis2 is one layer shorter than Dis1, with only five layers; fewer layers yield a smaller receptive field, so the feature statistics captured by Dis2 focus on smaller regions (a quarter of those of Dis1). Dis3 has two layers fewer than Dis1 and thus focuses on regions a quarter of the size of those of Dis2. Note that the receptive fields of the outputs of each PatchGAN discriminator overlap and densely supervise the mural sketch generation process, pushing the generator to generate finer details and to filter out deteriorations in the mural image. At the same time, the combination of large and relatively small receptive fields offered by the three discriminators allows training at higher resolution. To discriminate whether the input is real or fake, we use the hinge loss as the objective function for the generator and discriminators, which can be expressed as

$${L}_{{\rm{G}}}=-{E}_{z\sim {P}_{z}(z)}\left[{D}^{1}\left(G\left(z\oplus e\right)\right)\right]-{E}_{z\sim {P}_{z}(z)}\left[{D}^{2}\left(G\left(z\oplus e\right)\right)\right]-{E}_{z\sim {P}_{z}(z)}\left[{D}^{3}\left(G\left(z\oplus e\right)\right)\right]$$
(1)
$$\begin{array}{l}{L}_{{\rm{D}}}={E}_{y\sim {P}_{\mathrm{data}}(y)}\left[ReLU\left(1-{D}^{1}(y)\right)\right]+{E}_{z\sim {P}_{z}(z)}\left[ReLU\left(1+{D}^{1}\left(G\left(z\oplus e\right)\right)\right)\right]\\ +\,{E}_{y\sim {P}_{\mathrm{data}}(y)}\left[ReLU\left(1-{D}^{2}(y)\right)\right]+{E}_{z\sim {P}_{z}(z)}\left[ReLU\left(1+{D}^{2}\left(G\left(z\oplus e\right)\right)\right)\right]\\ +\,{E}_{y\sim {P}_{\mathrm{data}}(y)}\left[ReLU\left(1-{D}^{3}(y)\right)\right]+{E}_{z\sim {P}_{z}(z)}\left[ReLU\left(1+{D}^{3}\left(G\left(z\oplus e\right)\right)\right)\right]\end{array}$$
(2)

where D1, D2 and D3 denote the PatchGAN discriminators, G is the mural sketch generation network that takes the mural image z and edges e as input, \(\oplus\) denotes concatenation along the channel dimension, and y is the ground-truth sketch image.
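The sketch below illustrates one plausible implementation of the multi-scale PatchGAN supervision in Eqs. (1) and (2), assuming three fully convolutional branches with six, five and four stride-2 convolution layers for Dis1–Dis3; the channel widths and helper names are our own assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_discriminator(in_ch: int = 1, base: int = 64, n_layers: int = 6) -> nn.Sequential:
    """Fully convolutional PatchGAN branch: n_layers stride-2 convs + a 1-channel score map."""
    layers, c = [], in_ch
    for i in range(n_layers):
        nc = min(base * (2 ** i), 512)
        layers += [nn.Conv2d(c, nc, kernel_size=3, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        c = nc
    layers.append(nn.Conv2d(c, 1, kernel_size=3, padding=1))  # patch-wise real/fake scores
    return nn.Sequential(*layers)

# Dis1, Dis2, Dis3: fewer layers -> smaller receptive field, as described above.
discriminators = nn.ModuleList([patch_discriminator(n_layers=n) for n in (6, 5, 4)])

def hinge_d_loss(ds, real, fake):
    # Eq. (2): hinge terms for real and generated sketches, summed over D1-D3.
    return sum(F.relu(1.0 - d(real)).mean() + F.relu(1.0 + d(fake.detach())).mean()
               for d in ds)

def hinge_g_loss(ds, fake):
    # Eq. (1): the generator is pushed to raise every discriminator's score.
    return sum(-d(fake).mean() for d in ds)
```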

Scaled cross entropy

The sketch/non-sketch imbalance in the mural sketch generation task is essentially the same as the edge/non-edge imbalance in edge detection, and it causes the extracted lines to be blurred (verified in the ablation study). We use a weighted cross-entropy loss to address this imbalance:

$${L}_{\mathrm{ce}}=-\mathop{\sum }\limits_{i,j}\left[\alpha \,Y(i,j)\left(1-P(i,j)\right)\log \left(P(i,j)\right)+(1-\alpha )\left(1-Y(i,j)\right)P(i,j)\log \left(1-P(i,j)\right)\right]$$
(3)

where Y(i,j) is the ground-truth label of pixel (i,j) and P(i,j) is the predicted probability of mural sketch at pixel (i,j), with \(P=G(z\oplus e)\). \(\alpha =\frac{\left|{s}^{-}\right|}{\left|{s}^{+}\right|+\left|{s}^{-}\right|}\) and \(1-\alpha =\frac{\left|{s}^{+}\right|}{\left|{s}^{+}\right|+\left|{s}^{-}\right|}\), where \(\left|{s}^{+}\right|\) and \(\left|{s}^{-}\right|\) are the numbers of sketch and non-sketch pixels, respectively. The weighted cross-entropy loss significantly improves the line drawing extraction results, but obvious line sticking remains in areas with dense details. To alleviate this issue, we propose a scaled version of the weighted cross-entropy:

$${L}_{\mathrm{ce}}=-\mathop{\sum }\limits_{i,j}\left[\alpha \,S\left(Y(i,j)\right)\left(1-S\left(P(i,j)\right)\right)\log \left(S\left(P(i,j)\right)\right)+(1-\alpha )\left(1-S\left(Y(i,j)\right)\right)S\left(P(i,j)\right)\log \left(1-S\left(P(i,j)\right)\right)\right]$$
(4)

where S denotes bilinear up-sampling by a factor of 8. The total loss of the mural sketch generator is then

$${L}_{{G}_{\mathrm{total}}}=\,{\beta L}_{\mathrm{ce}}+\gamma {L}_{{\rm{G}}}$$
(5)

where the loss weights \(\beta\) and \(\gamma\) are hyperparameters balancing the loss terms \({L}_{\mathrm{ce}}\) and \({L}_{{\rm{G}}}\). In our experiments, β and γ are set to 1 and 0.1, respectively.
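For clarity, a minimal PyTorch-style sketch of the scaled weighted cross-entropy in Eq. (4) and the total generator loss in Eq. (5) is given below, assuming predictions P and binary ground truth Y of shape (N, 1, H, W); the per-batch estimate of α and the small `eps` constant are our own assumptions rather than details from our implementation.

```python
import torch
import torch.nn.functional as F

def scaled_weighted_ce(pred, target, scale_factor=8, eps=1e-6):
    """Eq. (4): weighted cross-entropy applied to bilinearly up-sampled P and Y."""
    p = F.interpolate(pred, scale_factor=scale_factor, mode="bilinear", align_corners=False)
    y = F.interpolate(target, scale_factor=scale_factor, mode="bilinear", align_corners=False)
    # Class-balancing weight alpha = |s-| / (|s+| + |s-|), estimated from the batch.
    n_pos = y.sum()
    n_neg = y.numel() - n_pos
    alpha = n_neg / (n_pos + n_neg)
    loss = -(alpha * y * (1.0 - p) * torch.log(p + eps)
             + (1.0 - alpha) * (1.0 - y) * p * torch.log(1.0 - p + eps))
    return loss.sum()

def generator_total_loss(l_ce, l_g, beta=1.0, gamma=0.1):
    """Eq. (5): weighted sum of the reconstruction and adversarial terms."""
    return beta * l_ce + gamma * l_g
```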

Results

Datasets and experiment setting

We train and evaluate our proposed mural sketch generation model on the Dunhuang mural-sketch dataset, which was collected and labeled by our colleagues with the help of mural experts. Specifically, we labeled 100 Dunhuang murals, of which 80 mural-sketch pairs form the training set and 20 mural-sketch pairs form the test set. Our models are implemented with PyTorch v1.8.1 and CUDA v11.1 and run on hardware with an Intel Core i7-10700 CPU (2.90 GHz) and an RTX 3090 GPU. We instantiate two models: a standard model and a lite model. The standard version has 178 M parameters, while the lite version has only 34 M. The generator configurations of the standard and lite versions are shown in Table 1.

Table 1 Network config of our standard and lite model

Comparisons

We compare our model with the previous state-of-the-art methods M2S3, GSC14, and BDN12. Figure 3 shows the extraction results for several representative murals. As shown in Fig. 3, both the standard (Ours(S)) and the lite (Ours(L)) versions of our model achieve much better line drawing extraction results than the other methods. With generative modeling and a stronger backbone, our models achieve the cleanest filtering of background deteriorations, and the scaled cross-entropy further improves the separation between sketch strokes. Through hyper-parameter tuning, the traditional algorithm M2S can obtain relatively cleaner line drawings than the deep learning algorithms GSC and BDN, but it performs poorly in terms of line continuity. GSC and BDN suffer from thick lines and poor noise filtering, which lead to severe adhesion of local lines as well as black blocky artifacts. With the stronger backbone and the proposed losses, we obtain thinner and cleaner lines.

Fig. 3: Qualitative comparison of mural sketch extraction results.

On the Dunhuang mural dataset, our method successfully generates accurate mural sketches. Best viewed with zoom-in. From left to right: a input images, b results of M2S, c results of GSC, d results of BDN, e and f results of our lite model and standard model, respectively.

Mural sketch extraction lacks well-established quantitative evaluation metrics. Nevertheless, for reference we report in Table 2 the mean cross-entropy error, peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and root mean square error (RMSE) on the validation set of the Dunhuang mural dataset. For PSNR and SSIM, higher is better; for the cross-entropy error and RMSE, lower is better. As shown in Table 2, our proposed method achieves the best results on all four metrics.

Table 2 Quantitative evaluation on Dunhuang mural dataset
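For reproducibility, the following sketch shows one way these reference metrics could be computed, assuming scikit-image for PSNR/SSIM and grayscale sketches scaled to [0, 1]; it is an assumed evaluation routine for illustration, not the exact script used to produce Table 2.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> dict:
    """Compute CE error, PSNR, SSIM and RMSE between a generated and a ground-truth sketch."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0)
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    ce = float(np.mean(-(gt * np.log(pred + eps) + (1.0 - gt) * np.log(1.0 - pred + eps))))
    return {"CE": ce, "PSNR": psnr, "SSIM": ssim, "RMSE": rmse}
```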

Ablation study

We investigate the effectiveness of our proposed scaled cross-entropy loss compared with other supervision commonly used in previous methods. We trained three otherwise identical models that differ only in the cross-entropy loss they use. As shown in Fig. 4, for every test image the vanilla cross-entropy fails to provide enough supervision for the mural extractor; as a result, the extracted lines are blurred and faint. As the name implies, the weighted cross-entropy reweights the vanilla cross-entropy according to the black/white pixel distribution of the resulting sketch. The results improve considerably when the weighted cross-entropy is used, but the bold-line problem remains obvious. Figure 4d shows the direct output under the supervision of our proposed scaled cross-entropy: the bold-line problem is significantly alleviated. Fewer bold lines mean that more of the extracted lines are separable (not stuck together) and can be used directly by users. This is why our method can generate sketches of thin hair strands, as shown in Fig. 3.

Fig. 4: Qualitative comparison of mural sketch extraction results.

From left to right: a input images, b results of the model with vanilla cross-entropy loss (VCE), c results of the model with weighted cross-entropy loss (WCE), and d results of our model with scaled cross-entropy loss (SCE).

As shown in Fig. 5, our mural sketch generation framework benefits greatly from edge guidance, as validated by the higher recall the guidance brings: compared with the model trained without guidance, the guided scheme reconstructs more details.

Fig. 5: Qualitative comparison of mural sketch extraction results with and without guide.

From left to right: a input images, b results of the model without edge guidance and c results of the model with edge guidance.

As shown in Fig. 6, we also performed tests to verify the effectiveness of the adversarial loss (adv loss) by training a mural sketch extraction model without it. We conclude that the adversarial loss reduces recall slightly but significantly improves the cleanliness of the results. This is a sensible trade-off, because false-positive lines are laborious to clean up, while a slightly reduced recall can be compensated for with a few simple manual strokes.

Fig. 6: Qualitative comparison of mural sketch extraction results with and without adv loss.

From left to right: a input images, b results of the model without adv loss and c results of the model with adv loss.

Generalization study

We verified the generalization ability of our proposed algorithm on murals at other heritage sites, such as tombs, grottoes, and temples. The weather at Dunhuang is dry and it seldom rains, which is very conducive to the preservation of murals; thus most murals remain relatively clean if they are not disturbed by human activity. Compared with Dunhuang, the murals in Tang Dynasty tombs and other grottoes or tombs are difficult to preserve because of the environment: deteriorations such as salt efflorescence, plaster detachment and mildewing are very common on grotto and tomb murals. At the same time, the brush strokes of Dunhuang murals are usually thin, and color is often used to outline objects, whereas most outlines in Tang Dynasty tomb murals are drawn with black strokes. All of this suggests that a model trained on Dunhuang murals should not generalize well to tomb or grotto murals with severe deteriorations. Surprisingly, as shown in Fig. 7, our proposed model exhibits better generalization ability than we expected: it filters out most of the deteriorations on these murals and extracts most of the available lines. Of course, its filtering of some deteriorations is not perfect. This is due to differences in style, i.e., the domain gap between Dunhuang murals and murals from other sites. To narrow the domain gap, we need to enrich the target-style dataset.

Fig. 7: Generalization results of our method on murals with different style.

From top to bottom: mural sketch extraction results for murals at other heritage sites, such as tombs, grottoes and temples. Best viewed with zoom-in.

Discussion

In this paper, we propose a novel high-precision mural sketch generation system based on the strong generative backbone U2Net with edge guidance, trained under the supervision of a scaled cross-entropy loss and multi-scale PatchGAN discriminators. We demonstrate that a stronger backbone and stronger supervision significantly improve mural sketch extraction, and that edges extracted by Canny, used as prior guidance for the sketch generator, effectively improve the recall of sketches. Quantitative and qualitative comparisons demonstrate the superiority of the proposed method, and the generalization study demonstrates its potential for extracting sketches from murals with different styles.

Although significant results have been obtained, further improvement is possible in the following aspects:

The ability to control line width should be improved. First, line width control would allow experts to draw murals more finely, but current solutions struggle to achieve single-pixel line drawing extraction results like the Canny edge extractor. As a result, fine areas of murals often lose aesthetic quality when converted to sketch lines by current methods. Second, thin lines imply lower computational cost at inference time. However, current methods can only process murals with elaborate patterns by enlarging the input image, which is not only time-consuming but also requires a huge computational cost.

A controllable model of brushstroke rules should be built. For different types of cultural heritage, experts have different understandings of the line drawing style. However, current methods cannot flexibly adjust line density and stroke style according to regional semantics or task requirements. There is currently no mapping mechanism between structural semantics and stroke rules, which limits the extensibility of current methods for professional aesthetic control and task customization.

A human-machine collaboration mechanism based on expert feedback should be established. The current model is a one-way generation process, and a closed feedback loop with cultural heritage experts has not been established. In real application scenarios, experts often have stronger image structure perception and historical style judgment, and may put forward specific modification opinions or aesthetic preferences for certain areas of a mural sketch. The lack of an interactive mechanism makes it difficult to dynamically adjust the generation strategy to meet experts’ requirements, which limits the usability and interpretability of the method in practical cultural heritage conservation work.

A multi-modal auxiliary information fusion method needs to be explored. Current mural sketch extraction methods rely entirely on RGB images and edge information for sketch generation, which makes it difficult to extract sketches from faded or blurred murals. In fact, multi-modal cues such as color residue, texture, material and infrared images, which are often used in the restoration of murals, are promising aids for determining the sketch line structure. Current methods have not yet introduced such auxiliary modal information, resulting in a limited ability to model lines in low-contrast or heavily weathered regions.

To address the above challenges, we will carry out in-depth research in the following directions: (1) introducing a multi-level structure guidance mechanism and a line width adjustment module to improve the adaptability of line generation in regions of different density; (2) constructing a region-controllable brushstroke rendering module by combining structural semantic labels and regional texture features; (3) designing a human-machine collaborative feedback mechanism with expert participation, such as reinforcement learning or few-shot tuning, to realize personalized rendering guidance; (4) expanding the input modalities of the system by introducing infrared images, material reflection images or historical restoration images as auxiliary information to enhance line drawing restoration in blurred areas.