Introduction

From the fires of early civilisation to the industrial machines of the modern era, humanity has travelled a remarkable path, and throughout that history it has left behind precious cultural treasures1. Murals, as one of the earliest means of recording information, offer important historical and cultural insights. However, owing to human activity and natural erosion, many early murals have suffered from fading, flaking, mould, and cracks2. Protecting these non-renewable resources and restoring damaged murals have therefore become crucial challenges. Conventional mural restoration is primarily a physical, manual approach that requires experts with comprehensive knowledge of the humanities, history, archaeology, and the arts3; moreover, it risks causing irreversible damage to the murals themselves. Digital restoration of painting images has consequently emerged as an innovative alternative in the field of historical painting protection4. Without altering the original works, this approach uses computer technologies to restore them effectively. The digitally reconstructed painting images not only serve as references for manual restoration but also help build replicable datasets, providing consistent means of transmitting and preserving these cultural heritage items5.

Digital restoration of early paintings with computer technologies aims to recover lost or damaged areas promptly. In recent years, numerous image restoration methods have been developed, which can be classified into two categories: DL-driven approaches and traditional image restoration techniques6. Conventional image restoration techniques primarily rely on two approaches: patch-driven and diffusion-driven7. The central idea of patch-driven techniques is to extract image patches from the undamaged regions and use them to fill the missing areas. Diffusion-driven techniques fill in missing regions by modelling the distribution of image pixels, typically using gradient data and local features to guide the restoration8. Despite a long period of development, traditional image restoration techniques still struggle to meet the needs of digital restoration for early paintings, mainly owing to two limitations: (1) the inability to generate novel content conditioned on existing content, and (2) poor efficiency when restoring murals with large regions of degradation9.

Artificial intelligence (AI) can play a significant part in heritage preservation, not only for its utility in the research of material remains but also for its application in digital restoration10. Among AI techniques, deep learning (DL) has emerged as an effective approach for achieving excellent results in this domain. DL has been described as a method that learns representations of data at multiple levels of abstraction5. Unlike conventional techniques, the DL approach can capture semantic information through end-to-end learning, enabling it to address the challenging inpainting task more efficiently. This development has solidified DL as an effective tool for obtaining both semantic and structural content from images11. Furthermore, DL-driven image inpainting methods are used in industrial visual image processing, artistic creation, face repair, cultural relic restoration, object removal, and the production of special effects for movies and games. They play an essential role in virtual reality technology, image editing, visual image processing, and the preservation of ancient heritage12. The need for precise, non-invasive techniques has driven interest in computational solutions. Advancements in AI and DL offer opportunities to reconstruct missing or damaged regions while preserving original features. Automated digital restoration can thus safeguard cultural heritage for study, display, and future generations.

This article proposes a novel hybrid deep learning-enabled image inpainting model for smart historical artefact restoration, named the HDLIP-SHAR technique. The paper's primary contributions are as follows:

  • The HDLIP-SHAR technique aims to develop a DL model capable of identifying and reconstructing missing or damaged regions in artefact images.

  • Furthermore, adaptive median filtering (AMF) and contrast enhancement are applied to enrich the input image quality.

  • A hybrid SqueezeNet CNN is used to extract deep semantic features from historical artefact images to identify cracks, missing parts, and faded textures.

  • Also, the U-Net method is implemented to segment the image and localise the damaged regions.

  • A transformer-based GAN model is employed to restore and inpaint the missing regions in the imagery.

  • The HDLIP-SHAR methodology is validated using the MuralDH dataset. Comprehensive result analysis demonstrates that the proposed model achieves improved performance over other methods across different metrics.

The structure of this manuscript is as follows: Sect. “Literature review” provides a detailed overview of existing approaches to the restoration of historical artefact images. Section “Methodological framework” details the methodological framework, including the image pre-processing, feature representation, and segmentation pipelines. Section “Experimental result analysis” reports on the study’s experimental design and results. The performance is then meticulously examined through several key measures. This article concludes in Sect. “Conclusion” with a summary of the model’s main contributions.

Literature review

This section presents a summary of prior research on the restoration of historical artefact images. Zhong et al.13 proposed a multiscale image inpainting technique based on the denoising diffusion probabilistic method (DDPM), specifically enhanced for wadang pattern reconstruction. In the early inpainting stage, a fusion mechanism integrates semantic information from the input image with intermediate outputs to achieve refined inpainting results. In14, a DL-driven conditional inpainting method is presented for restoring anatomically correct image data in artefact-affected regions. The reconstruction comprises a dual-phase process: DL-driven detection of common interpolation (INT) and double-structure (DS) artefacts, followed by conditional inpainting of the artefact regions. The inpainting is conditioned on patient-specific imaging information to ensure structurally consistent outcomes. Kniaz et al.15 presented an innovative method for 3D inpainting of partly damaged images employing a diffusion neural network (NN), named Restore3D. The approach utilises principles of diffusion-based procedures to iteratively refine and reconstruct lost parts of the 3D wireframe. By incorporating temporal characteristics into the inpainting procedure, Restore3D efficiently captures spatial relationships and complex details, resulting in a more holistic restoration than traditional methods. Zhang et al.1 introduced a Coordinated Attention Aggregation Transformation (CAAT) GAN architecture with U-Net discriminators. The generator obtains contextual data from distant areas via a CAAT block, increasing flexibility and improving content interpretation in missing regions, thereby restoring the original texture and colour. Pajila et al.16 proposed a DL-driven inpainting method that leverages GANs and CNNs to improve image reconstruction precision. The method is intended to preserve contextual integrity, enhance fine details, and improve overall visual quality. Combined with state-of-the-art feature extractors, the method obtains improved restoration results on different databases. In17, an image inpainting framework is introduced for restoring the antique paintings of Myanmar. The framework is separated into lacuna removal and crack removal, and the classification of lacuna and crack loss is automated through image pre-processing and segmentation. Crack loss is restored with pixel-neighbourhood transfer, while lacuna restoration uses coherent transport together with patch-driven nearest-neighbour similarity colour-filling techniques. Fan et al.18 evaluated the degree of facial loss in Tang Dynasty female terracotta figurines. They employed the Globally and Locally Consistent Image Completion (GLCIC) method to restore the standardised form of these figurines, ensuring that the reconstructed region is both locally and globally consistent with the original imagery. To address the problem of limited information and blurred facial features in the figurines, the research enhanced this method using data augmentation, local improvement methods, and guided filtering. Hu et al.19 presented a Convolutional Super-Resolution GAN (ConvSRGAN) architecture specifically designed for super-resolution of Chinese landscape paintings.
This method utilises various Enhanced Adaptive Residual Models (EARMs) to build a hierarchical feature extractor for images, integrating an Enhanced Higher-Frequency Retention Mechanism (EHRM) that uses an Adaptive Deep Convolution Block (ADCB) to capture refined higher-frequency information across multiple stages. Integrating the Multiscale Structural Similarity (MS-SSIM) loss with traditional losses ensures that the outputs remain faithful to the texture and structure of the original imagery. Chen et al.20 proposed a model, Adaptive Feature Fusion and U-Net (AFFU), that integrates a Self-Guided Module (SGM), Adaptive Multi-feature Fusion (AMF), and an Information Transfer Mechanism (ITM). Zhang et al.21 introduced a Semantic-Aware Dehazing Network (SDNet) that integrates a semantic prior as a colour constraint to guide scene reconstruction. The network employs a Densely Connected Block (DCB) to capture both global and local data and uses Adaptive Feature Fusion (AFF) to integrate shallow and deep features. Zhang and Chen22 introduced a Multistage Decoding Network (MSDN) technique that employs a multi-decoder design, layer-wise feature integration, and improved feature mapping to reduce the data loss inherent in self-encoding networks. Sharma, Singh, and Garg23 proposed a model that employs a Forgery Localisation Transformer Network (FLTNet), integrating a CNN Encoder and Transformer-based Attention (TBA) for feature extraction and global feature correlation capture. Furthermore, the CNN Decoder is designed to enable effective, real-time localisation of forgeries.

Zhang et al.24 improved robust RGBT object tracking by employing Cross-modality Cross-region Dual-stage Attention (CCDA) and Multiscale Intra-region Feature Fusion (MIFF) techniques. The proposed MGNet improves mutual guidance between regions of different modalities. Maitin et al.25 evaluated the efficiency of DL-based Virtual Image Inpainting (VII) for reconstructing missing architectural elements in images of ruined Greek temples. The evaluation employs both objective metrics (mathematical measurements) and expert visual perception (EVP) to assess reconstruction quality. Zhang et al.26 utilised the Early Region Feature Enhancement (ERFE) model, integrating Frequency-aware Self-region Feature Enhancement (FSFE) and Cross-attention Cross-region Feature Enhancement (CCFE) to enhance regional features in both modalities. Furthermore, a Bidirectional Multistage Feature Fusion (BMFF) module with Complementary Feature Extraction Attention (CFEA), incorporating Unidirectional Mixed Attention (UMA) and Context Focused Attention (CFA), enables effective cross-modal data exchange. Wang et al.27 achieved virtual restoration of mould-affected painted cultural relics using a 3D CNN artefact restoration network. The model utilises Near-Infrared (NIR) and spatial feature learning to reconstruct visible-spectrum reflectance. Zhang et al.28 proposed a Wavelet-Based Physically Guided Normalisation Dehazing Network (WBPGNDN). It employs Physically Guided Normalisation (PGN) to restore pixel similarity in haze-free images and Wavelet Decomposition (WD) in the feature domain to capture long-range dependencies. Jenifer and Devaki29 introduced an Adaptive Multiscale Attention-based DenseNet (AMA-DeNet) approach that integrates the Hybrid Chimp Grasshopper Optimisation Algorithm (HCGOA) and precisely localises anomalies utilising a Cascaded Variational Autoencoder (CVA) model. Li et al.30 aimed to restore degraded mural imagery by employing Denoising Diffusion Probabilistic Models (DDPM), using an improved U-Net-based noise estimation network enhanced with a Hybrid Attention Mechanism (hAM) and Atrous Spatial Pyramid Pooling (ASPP) to preserve fine details and structural integrity in mural restoration. Merizzi et al.31 used the Deep Image Prior (DIP) inpainting approach, which utilises an untrained Convolutional Neural Network (CNN) to fit the available image information progressively. The method integrates both visible and infrared data to enhance contextual understanding and reduce artefacts. Guan et al.1 proposed a methodology utilising a GAN integrated with a Deformable Convolution-based Generator (DCG), global stabilisation strategies, and an adaptive loss function, while also conserving structural and colour consistency. Wang et al.32 restored damaged patterns in ancient Chinese silk by utilising Context-Encoder (CE)-based image inpainting techniques. Furthermore, models such as LISK, Multi-Attention Deep Fusion (MADF), and Multi-Encoder Dual Feature Enhancement (MEDFE) are used for reconstructing missing or degraded regions. Zhao et al.33 implemented Diffusion Models (DMs) to progressively refine damaged regions through iterative denoising. A heterogeneous U-Net (UNet) architecture improved with a Pixel Space Augmentation Block (PSAB) and a Dual Channel Attention Block (DCAB) is also used to improve spatial detail recovery and channel-wise feature representation.
Zeng et al.34 implemented Hyperspectral Imaging (HSI) integrated with Super-Pixel Segmentation (SPS) and Support Vector Machine-Markov Random Field (SVM-MRF) classification. Colour correction is performed using the Commission Internationale de l'Eclairage (CIE) Standard Colourimetric System, followed by restoration with CNNs. Cao et al.35 restored large and irregular missing regions in images by employing a Generative Adversarial Network (GAN) model incorporating an Encoder-Decoder Network (EDN). The approach combines a Wavelet Downsampling (WDS) module and a Frequency Integrated Learning (FIL) module with attention mechanisms (AMs) to effectively fuse high- and low-frequency data. Xu, Zhang, and Zhang36 utilised a Visual Attention Mechanism (VAM) methodology by integrating Virtual-Real Fusion (VRF) techniques with real-world scene reconstruction. Yan, Chai, and Li37 presented a methodology using AI and Internet of Things (IoT) technologies. The ArtDiff model also integrated a modified U-Net (UNet) for crack detection with an Edge-Guided Restoration (EGR) approach and a Diffusion Model (DM) technique. Sumathi and Uma38 utilised DL-based image inpainting techniques for restoring degraded or obstructed regions. CNN and GAN are used for eliminating lesions, tumours, and metallic implants, thereby improving diagnostic clarity and overall image quality in modalities such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Positron Emission Tomography (PET), ultrasound, and X-ray imaging. A literature review of methods for accurately identifying and reconstructing missing or damaged portions of artefact images is presented in Table 1.

Table 1 Comparison analysis of related works.

Though existing studies are effective at restoring historical artefacts, they primarily focus on texture and structural reconstruction, often neglecting accurate colour consistency and semantic coherence in the restored artefact images. Furthermore, several VII and SDNet techniques fail to fully capture global context and cross-modal interactions, resulting in incomplete reconstructions. Moreover, the MSDN and FLTNet approaches improve feature integration and global correlation, but they exhibit limited adaptability to diverse artefact types and spectral data. Additionally, MGNet, ERFE, and BMFF techniques effectively handle cross-modal data but are limited to tracking tasks rather than artefact restoration. Also, WBPGNDN and 3D CNN approaches struggle to preserve fine structural details in degraded cultural images. Furthermore, various methods lack generalisation for heterogeneous image restoration scenarios. The research gap is the lack of a combined framework that simultaneously preserves global context, semantic consistency, spectral features, and fine structural details across diverse historical artefacts.

Methodological framework

The HDLIP-SHAR model presents an intelligent hybrid DL model for restoring and inpainting historical artefacts. Figure 1 presents the workflow of the HDLIP-SHAR method. As shown in Fig. 1, the entire process comprises several primary stages: image pre-processing, feature extraction, segmentation, and inpainting reconstruction.

Fig. 1
figure 1

Workflow of HDLIP-SHAR approach.

  • Stage 1: Image Pre-processing: In the first stage, AMF and contrast enhancement are applied to remove noise, suppress artefacts, and enhance the visibility of fine details in historical images. These steps ensure that the input images are of high quality and suitable for further analysis.

  • Stage 2: Feature Extraction: Following pre-processing, a hybrid SqueezeNet CNN method is utilised to extract rich semantic attributes from the pre-processed artefact images. This stage captures detailed structural, textural, and contextual patterns needed for precise identification of cracks, faded textures, and missing regions.

  • Stage 3: Segmentation: The next stage comprises the U-Net model, which performs semantic segmentation to localise and delineate the damaged areas in the artefact images. The segmentation outcome provides accurate region boundaries for subsequent inpainting.

  • Stage 4: Inpainting and Reconstruction: Finally, a Transformer-based GAN is employed to restore and reconstruct the missing regions. The GAN-based reconstruction guarantees realistic texture generation while maintaining the original artistic and stylistic consistency of the historical artefacts.

Image pre-processing techniques

In this step, AMF and contrast enhancement are used as image pre-processing methods to enhance the quality of the input images39. This process ensures that both local and global image features are effectively highlighted, thereby strengthening the network's ability to learn meaningful patterns. Unlike standard pre-processing methods, this approach efficiently preserves fine details while enhancing the visibility of degraded regions. It also mitigates the risk of losing critical structural information compared to conventional filtering or normalisation models, allowing the model to achieve more accurate and visually coherent restoration results.

AMF

The AMF takes the image as input and removes noise. It operates on a local window \(T_{ab}\) centred at location \(\left(a,b\right)\); depending on the local statistics at that location, the window dimensions can vary before filtering is applied. Each output pixel is set to the median of the pixel values within the window, and the image borders are padded with zeros. The filter then produces a single output value that replaces the pixel at \(\left(a,b\right)\), where \(S\) denotes the window size, which may grow up to \(S_{max}\).

$$Z_{min} = \text{minimum pixel value in } T_{ab}$$
(1)
$$Z_{max} = \text{maximum pixel value in } T_{ab}$$
(2)
$$Z_{med} = \text{median pixel value in } T_{ab}$$
(3)
$$Z_{ab} = \text{pixel value at coordinates } (a, b)$$
(4)
$$S_{max} = \text{maximum allowed size of } T_{ab}$$
(5)

The AMF thus preserves image detail while eliminating impulse noise from 2D signals without introducing edge blur.
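The behaviour described above can be prototyped in a few lines of NumPy; the sketch below is illustrative rather than the paper's exact implementation, and the function name, default window limit, and zero-padding choice are assumptions.

```python
import numpy as np

def adaptive_median_filter(image, s_max=7):
    """Adaptive median filtering of a grayscale image (a sketch).

    The window T_ab centred at (a, b) grows from 3x3 up to s_max x s_max
    until the local median Z_med is not an impulse (Z_min < Z_med < Z_max);
    the centre pixel is kept if it is not itself an impulse, otherwise it is
    replaced by Z_med. Borders are zero-padded, as described above.
    """
    pad = s_max // 2
    padded = np.pad(image.astype(float), pad, mode="constant")
    out = image.astype(float).copy()
    h, w = image.shape
    for a in range(h):
        for b in range(w):
            size = 3
            while size <= s_max:
                r = size // 2
                window = padded[a + pad - r:a + pad + r + 1,
                                b + pad - r:b + pad + r + 1]
                z_min, z_max, z_med = window.min(), window.max(), np.median(window)
                z_ab = padded[a + pad, b + pad]
                if z_min < z_med < z_max:          # median is not an impulse
                    out[a, b] = z_ab if z_min < z_ab < z_max else z_med
                    break
                size += 2                           # enlarge T_ab and retry
            else:
                out[a, b] = z_med                   # fall back to the last median
    return out
```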

Histogram equalisation

Histogram equalisation (HE) is an extensively applied image enhancement model that considerably increases image contrast by adjusting the grayscale value distribution. It is particularly effective for scenes with inherently low contrast40. The model redistributes grayscale pixel values towards a uniform distribution, thereby increasing local or global contrast, improving visual quality, and revealing fine detail.

Assuming grayscale pixel intensities in the range \([0, L-1]\), histogram equalisation proceeds in the following stages:

Compute the cumulative distribution function (CDF) of the grayscale values, where \(h\left(r_{j}\right)\) denotes the normalised histogram count of grey level \(r_{j}\):

$$c\left(r_{k}\right)=\sum\limits_{j=0}^{k}h\left(r_{j}\right)$$
(6)

Map the grayscale value \(r_{k}\) to the equalised grayscale value \(s_{k}\):

$$\:{s}_{k}=\lfloor\left(L-1\right)\cdot\:c\left({r}_{k}\right)\rfloor$$
(7)
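Equations (6) and (7) translate directly into code; a minimal NumPy sketch, assuming an 8-bit grayscale image (\(L = 256\)) and a normalised histogram \(h\) (function and variable names are illustrative).

```python
import numpy as np

def histogram_equalisation(image, L=256):
    """Histogram equalisation following Eqs. (6)-(7) (a sketch).

    h is the normalised histogram, c the cumulative distribution function,
    and each grey level r_k is mapped to s_k = floor((L - 1) * c(r_k)).
    """
    hist, _ = np.histogram(image.flatten(), bins=L, range=(0, L))
    h = hist / image.size                        # normalised histogram h(r_j)
    c = np.cumsum(h)                             # CDF, Eq. (6)
    s = np.floor((L - 1) * c).astype(np.uint8)   # grey-level mapping, Eq. (7)
    return s[image]                              # apply the mapping per pixel
```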

Hybrid deep feature extraction approach

In this section, a hybrid SqueezeNet CNN is employed as a feature representation model to capture the semantic features of historical artefact images and recognise cracks, missing parts, and faded textures41. This process enables effective extraction without losing accuracy. It also effectively captures both fine-grained and high-level semantic features, which are crucial for detecting cracks, missing regions, and faded textures in historical artefacts. It also requires fewer computational resources and fewer parameters than heavier CNNs, thus facilitating faster training and inference. This efficiency, combined with robust feature representation, makes it better suited for processing delicate artefact images while maintaining restoration precision.

SqueezeNet is a compact CNN model that aims to achieve high accuracy while reducing computational cost. Its innovative feature lies in the use of "Fire" modules, comprising squeeze layers (\(1\times 1\) convolutions) to reduce the input channels and expand layers (\(1\times 1\) and \(3\times 3\) convolutions) to increase the output channels. This design significantly reduces the number of parameters compared with a standard CNN, improving computational and memory efficiency. In addition, SqueezeNet utilises techniques such as ReLU activation and global average pooling to reduce model complexity and size. The SqueezeNet configuration begins with a convolutional layer that takes the extracted feature subset as input, followed by max pooling; Fire modules 2-4 are then executed and max pooling is applied again. The pooled output is passed to Fire modules 5-8, followed by a further max-pooling (downsampling) step. The resulting features are fed into Fire_9 and a final convolutional layer. Lastly, global average pooling, a softmax activation function, and a categorical cross-entropy loss function are applied at the output layer.
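For reference, the sketch below shows a standard SqueezeNet Fire module in PyTorch (a \(1\times 1\) squeeze convolution followed by parallel \(1\times 1\) and \(3\times 3\) expand convolutions whose outputs are concatenated); the channel sizes are illustrative, and the attention-augmented variant described next is not reproduced here.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Standard SqueezeNet Fire module (a sketch with illustrative sizes)."""

    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)    # 1x1 squeeze
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.squeeze(x))
        # Concatenate the 1x1 and 3x3 expand branches along the channel axis.
        return torch.cat([self.act(self.expand1x1(x)),
                          self.act(self.expand3x3(x))], dim=1)

# Example: a 128-channel feature map squeezed to 16 and expanded back to 2 * 64.
x = torch.randn(1, 128, 56, 56)
print(Fire(128, 16, 64)(x).shape)   # torch.Size([1, 128, 56, 56])
```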

This standard form of SqueezeNet reduces the channel counts in the convolutional layers, potentially leading to a loss of information. This bottleneck can hinder the network's ability to fully process complex information. A novel Scaled Dot-Product Attention-enabled SqueezeNet (SDPA-SqueezeNet) method is presented to address this problem. The SDPA-SqueezeNet architecture takes the extracted feature subset and feeds it into a Conv2D layer, which convolves the input with parameterised filters to produce a feature map, and applies Batch Normalisation to standardise the inputs to each layer. Batch Normalisation addresses the problem of internal covariate shift, which arises because the distribution of data fed into each layer changes during training as the parameters of the preceding layers are updated. The Fire_1 components are then executed to establish channel-wise and spatial connections in the input information while reducing model size and time complexity. Following this, a max-pooling operation reduces the spatial size while retaining significant information, and Batch Normalisation and ReLU activation are applied to improve the feature representation. The Fire components, containing expand and squeeze layers, effectively capture channel-wise and spatial connections in the input data, which are vital to an effective feature extractor. The structure integrates average pooling to preserve spatial information during downscaling and scaled dot-product attention layers to enhance the model's ability to focus on essential attributes. Next, after a further Fire module and max pooling, the network aggregates the features through global pooling across the spatial dimensions. Finally, fully connected (FC) layers process the extracted attributes for classification, and an enhanced loss function is used to compute the discrepancy between actual and predicted labels, thereby guiding the model's optimisation throughout training.

Scale dot product attention

Scaled dot-product attention captures long-range dependencies in sequential input information and is expressed in Eq. (8).

$$Att\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
(8)

where \(Q\), \(K\), and \(V\) denote the query, key, and value matrices, respectively, and \(d_{k}\) is the dimensionality of the key vectors.
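Equation (8) can be implemented in a few lines; a minimal NumPy sketch with illustrative matrix shapes.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, Eq. (8): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

# Example with 4 tokens, key dimension 8, value dimension 16.
Q = np.random.rand(4, 8); K = np.random.rand(4, 8); V = np.random.rand(4, 16)
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 16)
```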

Enhanced loss function

The loss function measures the discrepancy between the true classes and the probabilities predicted by the model. The method employs a fused activation function for efficiency, combining two activation functions: the log-sigmoid and the hyperbolic tangent. The hybrid enhanced loss function is defined in Eqs. (9) and (10).

$$\:f\left(t\right)=\frac{1}{1+{e}^{-t}}+\frac{{e}^{t}-{e}^{-t}}{{e}^{t}+{e}^{-t}}\:$$
(9)
$$=\frac{\left(e^{t}+e^{-t}\right)+\left(1+e^{-t}\right)\left(e^{t}-e^{-t}\right)}{\left(1+e^{-t}\right)\left(e^{t}+e^{-t}\right)}$$
(10)

where \(\frac{1}{1+e^{-t}}\) denotes the log-sigmoid activation function and \(\frac{e^{t}-e^{-t}}{e^{t}+e^{-t}}\) the hyperbolic tangent activation function. This fused activation is then passed through the Tversky loss, as in Eqs. (11) and (12), with \(\beta = 1/2\).

$$f^{{\prime}{\prime}}\left(t\right)=\mathrm{TverskyLoss}\left(f\left(t\right)\right)$$
(11)
$$=\frac{p\widehat{p}}{p\widehat{p}+\beta\left(1-p\right)\widehat{p}+\left(1-\beta\right)p\left(1-\widehat{p}\right)}\,f\left(t\right)$$
(12)

The SDPA-SqueezeNet output is represented as \(S_{Sdpa}^{Out}\).
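The fused activation of Eq. (9) and the Tversky term of Eq. (12) can be prototyped as follows. Because the exact composition of the two terms is not fully specified above, the final combination in this sketch is an assumption, with \(\beta = 1/2\), binary targets \(p\), and predictions \(\hat{p}\).

```python
import numpy as np

def fused_activation(t):
    """Eq. (9): sigmoid-style term plus hyperbolic tangent."""
    return 1.0 / (1.0 + np.exp(-t)) + np.tanh(t)

def tversky_index(p, p_hat, beta=0.5, eps=1e-7):
    """Tversky index with beta = 1/2 (equivalent to the Dice coefficient)."""
    tp = np.sum(p * p_hat)                 # true positives
    fp = np.sum((1 - p) * p_hat)           # false positives
    fn = np.sum(p * (1 - p_hat))           # false negatives
    return tp / (tp + beta * fp + (1 - beta) * fn + eps)

def enhanced_loss(p, p_hat, logits):
    """Illustrative composition of Eqs. (9)-(12): a Tversky-based loss
    modulated by the fused activation of the logits (assumption)."""
    return float((1.0 - tversky_index(p, p_hat)) * fused_activation(logits).mean())
```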

U-Net architecture

This method is utilised to segment images and localise the damaged regions42. The model performs precise pixel-level segmentation, making it effective for localising damaged areas in historical artefact images. Its encoder-decoder architecture with skip connections adequately preserves spatial information while capturing contextual features. Furthermore, U-Net provides superior localisation accuracy compared to standard CNNs, and its efficiency and precision make it particularly appropriate for restoration tasks where detailed damage detection is crucial. Figure 2 represents the structure of the U-Net.

Fig. 2
figure 2

Structure of U-Net.

The typical U-Net structure comprises three parts: the encoder, the bottleneck, and the decoder. During the encoder phase, convolutional layers are applied consecutively to the input imagery, and max-pooling layers are used to reduce the dimensionality of the feature maps. The bottleneck stage is the middle layer that links the encoder and decoder, where the smallest but most informative feature maps are obtained. During the decoder stage, spatial upscaling is applied to enlarge the feature map sizes, after which convolutional layers are used again to extract additional information. Finally, skip connections link the feature maps at each level of the encoder to the corresponding layer of the decoder. This reduces information loss and enhances segmentation outcomes.

In the proposed model, two additional blocks extend the typical U-Net structure. First, an attention block is used alongside the standard U-Net to achieve accurate segmentation. This block supports a better understanding of the region of interest (RoI) and enhances segmentation outcomes. Attention layers compute the significance of each pixel, giving more weight to important ones and generating a weight distribution, which is then applied to the decoder feature maps together with the corresponding feature maps from the encoder.

Second, residual block configurations are employed. The primary purpose is to improve the U-Net architecture by integrating residual blocks into the convolutional blocks, enabling feature maps to propagate to the innermost levels and reducing gradient vanishing. A residual block comprises two or more convolutional layers with a residual connection between its input and output, permitting a direct input-output link. Residual blocks are deployed only in the encoder path to extract additional attributes; they are not used in the decoder path, allowing the model to run faster.
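The two additions can be sketched in PyTorch roughly as follows; the attention gate follows the common Attention U-Net formulation and the residual block is a plain two-convolution block with an identity shortcut, so the channel sizes and exact placement are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Weights encoder skip features (x) with a decoder gating signal (g).

    x and g are assumed to share the same spatial size and may differ in channels.
    """

    def __init__(self, ch_x, ch_g, ch_mid):
        super().__init__()
        self.wx = nn.Conv2d(ch_x, ch_mid, kernel_size=1)
        self.wg = nn.Conv2d(ch_g, ch_mid, kernel_size=1)
        self.psi = nn.Sequential(nn.Conv2d(ch_mid, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x, g):
        alpha = self.psi(torch.relu(self.wx(x) + self.wg(g)))  # per-pixel weights
        return x * alpha                                       # re-weighted skip features

class ResidualBlock(nn.Module):
    """Two convolutions with an identity shortcut, used in the encoder path."""

    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual (identity) connection
```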

Transformer-based GAN model

In the last step, a transformer-based GAN model is employed to restore and inpaint missing image areas. This model is chosen for its ability to capture long-range dependencies and global context. The framework also ensures realistic texture synthesis and colour consistency, while the transformer improves feature interaction across distant regions. Furthermore, the approach effectively preserves structural integrity and semantic consistency, compared to conventional CNN-based inpainting. Its combination of global attention and adversarial learning also enables high-quality restoration with visually plausible results.

GAN network

The GAN is presented for image generation. The generator system \(\left(GEN\right)\) in a GAN creates a novel sample \(\left(GEN\left(t\right)\right)\) from a random latent vector \(t\), i.e., \(N:t\to GEN\left(t\right)\), where \(t\in\mathbb{R}^{d}\) is sampled from a distribution with density \(p_{v}\) and \(d\) represents the dimension of \(t\)43. When the probability distributions of the generated samples \(\left(GEN\left(t\right)\right)\) and the actual samples \(\left(x_{data}\right)\) are given by \(p_{model}\) and \(P_{data}\), respectively, the goal is for \(p_{model}\) to converge to \(P_{data}\). The training of the generator system is assisted by the discriminator system \(\left(DIS\right)\), whose goal is to distinguish between generated samples and actual samples. The output of the discriminator system is interpreted as the probability that a sample is real. Therefore, the discriminator attempts to achieve \(DIS\left(x_{data}\right)\approx 1,\) \(\forall x_{data}\sim p_{data}\), and \(DIS\left(GEN\left(t\right)\right)\approx 0,\) \(\forall t\sim p_{v}\), where \(P_{data}\) denotes the probability distribution of the actual samples. Conversely, the generator attempts to fool the discriminator and to achieve \(DIS\left(GEN\left(t\right)\right)\approx 1,\) \(\forall t\sim p_{v}\). GAN training is therefore a min-max game, with the primary objective:

$$\mathcal{L}_{GAN}(GEN,DIS)=\underset{GEN}{\min}\;\underset{DIS}{\max}\Big(\mathbb{E}_{x_{data}\sim p_{data}}\big[\log DIS\left(x_{data}\right)\big]+\mathbb{E}_{t\sim p_{v}}\big[\log\big(1-DIS\left(GEN\left(t\right)\right)\big)\big]\Big)$$
(13)

where \(\mathcal{L}_{GAN}(GEN, DIS)\) denotes the adversarial loss objective.

Assuming unlimited training data and unrestricted capacity for GEN and DIS:

  • The objective \(\mathcal{L}_{GAN}(GEN, DIS)\) is equivalent to the Jensen-Shannon divergence between \(P_{data}\) and \(p_{model}\), and a global optimum (Nash equilibrium) is reached when \(p_{model}\) matches \(P_{data}\).

  • If, at every iteration, \(DIS\) is allowed to reach its optimum and \(GEN\) is then updated to reduce \(\mathcal{L}_{GAN}(GEN, DIS)\), the model distribution will eventually converge to \(P_{data}\).

The discriminator system is typically a CNN model and serves as a binary classifier. Its final layer is a sigmoid function that produces the probability of the real class. The discriminator is needed only during training. The generator system is generally an up-sampling CNN that takes a small-scale input and produces a larger-scale output. The generator model is adapted appropriately for various types of applications and data, for example, encoder-decoder based networks.
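The min-max objective of Eq. (13) is realised in practice by alternating discriminator and generator updates; a minimal PyTorch sketch, assuming a generator GEN, a sigmoid-terminated discriminator DIS, and a latent dimension d (all names are illustrative).

```python
import torch
import torch.nn.functional as F

def gan_step(GEN, DIS, opt_g, opt_d, x_data, d=100):
    """One alternating update of Eq. (13), with the usual non-saturating generator loss."""
    t = torch.randn(x_data.size(0), d)        # latent samples t ~ p_v
    fake = GEN(t)

    # Discriminator update: push DIS(x_data) -> 1 and DIS(GEN(t)) -> 0.
    opt_d.zero_grad()
    pred_real, pred_fake = DIS(x_data), DIS(fake.detach())
    loss_d = F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) + \
             F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake))
    loss_d.backward()
    opt_d.step()

    # Generator update: fool the discriminator, i.e. push DIS(GEN(t)) -> 1.
    opt_g.zero_grad()
    pred_fake = DIS(fake)
    loss_g = F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```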

Transformer network

The transformer network employs a self-attention mechanism (SAM). The transformer was initially introduced for machine translation, and both the encoder and decoder can be built from transformer blocks. First, high-level features are computed from the input and combined with a positional embedding to generate k-dimensional high-level features for the transformer. A transformer block in the encoder contains a multi-head attention (MHA) unit followed by a feed-forward component, with residual connections and normalisation. A transformer block in the decoder additionally comprises a masked MHA unit. Each transformer block converts an input \((u\in\mathbb{R}^{k})\) into output \((v\in\mathbb{R}^{k})\) high-level features, and the block is repeated \(N\times\) in both the encoder and decoder. MHA is essentially the concatenation of the outputs of several independent scaled dot-product attention (SDPA) heads, followed by a linear layer; the use of multiple attention heads enables attributes to be extracted from diverse salient features. An input \(u\) to a transformer block is fed to the Query (QY), Key (KE), and Value (VL) layers, implemented as linear layers with weights \(W_{QY}\), \(W_{KE}\), and \(W_{VL}\), respectively. The result of the SDPA unit is evaluated as:

$$Att\left(QY,KE,VL\right)=\mathrm{softmax}\left(\frac{QY\,KE^{T}}{\sqrt{d_{k}}}\right)VL$$
(14)

where \(d_{k}\) refers to the dimensionality of \(QY\) and \(KE\). Masking in the SDPA component is employed only in the decoder. The feed-forward unit comprises a linear layer with a ReLU activation function, followed by an additional linear layer.

The Vision Transformer (ViT) is a type of transformer model that accepts images as input. It splits each image into patches, and the high-level features extracted from the patches are used as input to the transformer. The transformer encoder comprises \(N\) transformer blocks. ViT additionally includes a class embedding, which is used as input to a multi-layer perceptron (MLP) head to generate scores for the different labels. Although transformer and ViT models were initially introduced for image classification and machine translation tasks, they are now widely employed for a range of computer vision (CV) problems, including transformer-based GANs for image and video synthesis, as used in this work.
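A common way to form the patch tokens described above is a strided convolution followed by a learnable class token and positional embedding; a minimal PyTorch sketch with illustrative sizes, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project them to k-dimensional tokens."""

    def __init__(self, img_size=224, patch=16, in_ch=3, k=768):
        super().__init__()
        n = (img_size // patch) ** 2
        self.proj = nn.Conv2d(in_ch, k, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, k))        # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, k))    # positional embedding

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)     # (B, N, k) patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos    # prepend class token

print(PatchEmbedding()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 197, 768])
```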

Transformer-based GAN model

A transformer-based GAN comprises a generator and a discriminator. The paper presents a Window-based Self-Attention Block (WSAB) comprising a Feed-Forward Network (FFN), Window-based Multi-head Self-Attention (W-MSA), and two layer-normalisation layers44. The generator implements a U-shaped architecture consisting of an encoder, a bottleneck, and a decoder. An input image is denoted \(\text{I}\in\mathbb{R}^{H\times W\times 3}\). First, the generator uses a projection layer containing \(3\times 3\) convolutional (conv) layers and LeakyReLU activations to extract the shallow feature \(I_{0}\in\mathbb{R}^{H\times W\times C}\). Next, four encoder stages perform deep feature extraction on \(I_{0}\). Each stage comprises two sequential WSABs and spatial reduction layers, implemented as \(4\times 4\) conv layers with stride two that downscale the spatial dimensions of the feature map and double the channel size. Accordingly, the \(i^{th}\) stage of the encoder output is represented as \(X_{i}\in\mathbb{R}^{\frac{H}{2^{i}}\times\frac{W}{2^{i}}\times 2^{i}C}\), where \(i=0,1,2,3\) corresponds to the four stages. Subsequently, \(X_{3}\) passes through a bottleneck containing two WSABs. Then, following the U-Net design, a symmetrical decoder comprising four stages is used, each consisting of two WSABs and one deconvolution layer. Likewise, the feature map of the \(i^{th}\) decoder stage is represented by \(X_{i}^{\prime}\in\mathbb{R}^{\frac{H}{2^{i}}\times\frac{W}{2^{i}}\times 2^{i+1}C}\). The deconvolution layer is a bilinear interpolation followed by a \(3\times 3\) conv layer.

To reduce information loss due to spatial reduction during encoding, a residual connection is used to fuse features between the encoder and decoder paths. Lastly, after processing by the decoder, the feature maps are passed through a \(3\times 3\) conv layer to generate the residual image \(I^{\prime}\in\mathbb{R}^{H\times W\times 3}\). The recovered image is obtained as \(I_{R}=I_{LQ}+I^{\prime}\), where \(I_{LQ}\) denotes the low-quality input.

The discriminator's purpose is to differentiate the recovered images from the ground-truth high-quality images. A patch-level GAN is considerably more effective than an image-level GAN at capturing high-resolution, detailed information in imagery. Hence, the adversarial training follows an image-patch scheme, as in PatchGAN, with a Transformer-driven discriminator. More explicitly, the discriminator uses a structure similar to the generator's encoder, followed by a \(3\times 3\) conv layer. The recovered image \(I_{R}\in\mathbb{R}^{H\times W\times 3}\), together with the ground-truth image \(I\in\mathbb{R}^{H\times W\times 3}\), is processed by the presented Transformer-based discriminator to generate the prediction map \(P\in\mathbb{R}^{N\times N\times 1}\). Algorithm 1 illustrates the Transformer-based GAN model.
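A WSAB of the kind described above (two layer normalisations, W-MSA, and an FFN with residual connections) can be sketched as follows; the window partitioning, head count, and FFN expansion ratio are assumptions in the spirit of Swin-style blocks rather than the paper's exact settings, and a recent PyTorch with batch-first multi-head attention is assumed.

```python
import torch
import torch.nn as nn

class WSAB(nn.Module):
    """Window-based self-attention block: LN -> W-MSA -> +res, LN -> FFN -> +res."""

    def __init__(self, ch, window=8, heads=4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(ch)
        self.norm2 = nn.LayerNorm(ch)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(ch, 4 * ch), nn.GELU(), nn.Linear(4 * ch, ch))

    def forward(self, x):                 # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into non-overlapping w x w windows of tokens.
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(B * (H // w) * (W // w), w * w, C)
        n = self.norm1(t)
        t = t + self.attn(n, n, n)[0]                      # W-MSA with residual connection
        t = t + self.ffn(self.norm2(t))                    # FFN with residual connection
        # Merge the windows back into the (B, C, H, W) layout.
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(B, C, H, W)

print(WSAB(32)(torch.randn(1, 32, 64, 64)).shape)          # torch.Size([1, 32, 64, 64])
```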

figure a

Algorithm 1: GAN technique.

Experimental result analysis

This section examines the outcomes produced by the HDLIP-SHAR model, which is validated using the MuralDH dataset45, a comprehensive set of high-quality images for the digital restoration of Dunhuang paintings. The technique is simulated using Python 3.6.5 on a PC with an i5-8600k, 250GB SSD, GeForce 1050Ti 4GB, 16GB RAM, and 1 TB HDD. Parameters include a learning rate of 0.01, ReLU activation, 50 epochs, 0.5 dropout, and a batch size of 5.

The dataset also includes pre-processed images curated to support research in computer vision, digital art restoration, and cultural heritage preservation. It is divided into subsets comprising high-resolution mural images, degraded mural segmentation data, and images processed for super-resolution research. The collection, intended to support the development and testing of digital restoration methods, aims to bridge conventional art with advanced technologies, ensuring the longevity and availability of these invaluable cultural treasures. Figure 3 denotes the sample images, and Fig. 4 illustrates the predicted output of various methods on applied images.

Fig. 3
figure 3

Sample images.

Fig. 4
figure 4

Predicted output of various methods on applied images.

Table 2; Fig. 5 present the comparative MSE results obtained by the HDLIP-SHAR technique under several mural images. The results indicate that the HDLIP-SHAR technique reaches effective results. With ID_001848, the HDLIP-SHAR technique achieved a lower MSE of 0.0156, while the AOT, EC, LaMa, and LID-MIR models attained higher MSEs of 0.0483, 0.0325, 0.0679, and 0.0287, respectively. In addition, with ID_004277, the HDLIP-SHAR model has a lower MSE of 0.0112, while the AOT, EC, LaMa, and LID-MIR techniques have greater MSEs of 0.0216, 0.0536, 0.0602, and 0.0579, respectively.
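The image-quality metrics reported in this section (MSE, RMSE, PSNR, SSIM, and LPIPS) can be reproduced with standard libraries; a sketch assuming scikit-image (0.19 or later) and the lpips package, with illustrative variable names and RGB images scaled to [0, 1].

```python
import numpy as np
import torch
import lpips
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def quality_metrics(restored, reference):
    """MSE, RMSE, PSNR, SSIM, and LPIPS between two RGB images in [0, 1]."""
    mse = mean_squared_error(reference, restored)
    psnr = peak_signal_noise_ratio(reference, restored, data_range=1.0)
    ssim = structural_similarity(reference, restored, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a.transpose(2, 0, 1)[None]).float() * 2 - 1
    lp = lpips.LPIPS(net="alex")(to_t(restored), to_t(reference)).item()
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)), "PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```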

Table 2 MSE outcome of HDLIP-SHAR approach with existing models under various mural images.
Fig. 5
figure 5

MSE outcome of the HDLIP-SHAR approach under various mural images.

Table 3; Fig. 6 describe the comparative RMSE outcome achieved by the HDLIP-SHAR method under various mural images. The outcome specifies that the HDLIP-SHAR model accomplishes effective results. With ID_001848, the HDLIP-SHAR approach achieved a lower RMSE of 0.1249, while the AOT, EC, LaMa, and LID-MIR techniques achieved higher RMSEs of 0.2197, 0.1803, 0.2606, and 0.1694, respectively. Moreover, with ID_004277, the HDLIP-SHAR model achieved the lowest RMSE of 0.1058, while the AOT, EC, LaMa, and LID-MIR techniques achieved the highest RMSEs of 0.1469, 0.2314, 0.2454, and 0.2406, respectively.

Table 3 RMSE outcome of the HDLIP-SHAR approach with existing models under various mural images.
Fig. 6
figure 6

RMSE outcome of the HDLIP-SHAR approach under various mural images.

Table 4; Fig. 7 show the comparative PSNR results obtained by the HDLIP-SHAR model across various mural images. The outcomes imply that the HDLIP-SHAR approach is effective. With ID_001848, the HDLIP-SHAR method achieved a higher PSNR of 66.20dB, while the AOT, EC, LaMa, and LID-MIR methodologies achieved lower PSNRs of 61.30dB, 63.01dB, 59.81dB, and 63.55 dB, respectively. Also, for ID_004277, the HDLIP-SHAR model achieved the highest PSNR of 67.64dB, while the AOT, EC, LaMa, and LID-MIR technologies achieved the lowest PSNRs of 64.79dB, 60.84dB, 60.33dB, and 60.51 dB, respectively.

Table 4 PSNR outcome of HDLIP-SHAR approach with existing models under various mural images.
Fig. 7
figure 7

PSNR outcome of the HDLIP-SHAR approach under various mural images.

Table 5; Fig. 8 present the SSIM results of the HDLIP-SHAR technique compared with existing models across various mural images. The figure shows that the AOT model attains the lowest performance, with the lowest SSIM values. The EC and LID-MIR methods achieve somewhat higher SSIM values, and the LaMa model achieves substantial SSIM values. However, the HDLIP-SHAR technique achieves the best performance, with superior SSIM values of 0.971, 0.991, 0.969, 0.933, 0.903, and 0.899 for ID_001848, ID_002224, ID_002796, ID_003162, ID_003574, and ID_004277, respectively.

Table 5 SSIM outcome of HDLIP-SHAR approach with existing models under various mural images.
Fig. 8
figure 8

SSIM outcome of the HDLIP-SHAR approach under various mural images.

Table 6; Fig. 9 emphasise the LPIPS results of the HDLIP-SHAR model compared with existing methods on several mural images. The figure shows that the AOT approach presents the weakest performance, with the highest LPIPS values. The EC and LID-MIR methods achieve slightly lower LPIPS values, and the LaMa approach attains moderate LPIPS values. However, the HDLIP-SHAR model achieves the best performance with the smallest LPIPS of 0.0164, 0.0050, 0.0118, 0.0346, 0.0376, and 0.0559 for ID_001848, ID_002224, ID_002796, ID_003162, ID_003574, and ID_004277, respectively.

Table 6 LPIPS outcome of HDLIP-SHAR approach with existing models under various mural images.
Fig. 9
figure 9

LPIPS outcome of HDLIP-SHAR approach under various mural images.

The comparative PSNR outcomes of the HDLIP-SHAR technique are reported in Table 7; Fig. 10. The results show that the HDLIP-SHAR technique outperforms existing methods. The LID-MIR model shows the worst PSNR values, with a minimum of 58.71dB, a maximum of 63.55dB, and an average of 61.13dB. At the same time, the EC and LaMa approaches attain slightly improved and closely matched PSNR values, with a minimum of 58.76dB, a maximum of 64.15dB, and an average of 61.46dB. Meanwhile, the AOT model demonstrates reasonable PSNR values, with a minimum of 61.05dB, a maximum of 64.79dB, and an average of 62.91dB. Likewise, the MSDN, FLTNet, and MGNet models attained lower PSNR values. Nevertheless, the HDLIP-SHAR technique yields improved performance, with a minimum PSNR of 61.42dB, a maximum of 67.76dB, and an average of 64.59dB.

Table 7 PSNR outcome of HDLIP-SHAR approach with existing models.
Fig. 10
figure 10

PSNR outcome of HDLIP-SHAR approach with existing models.

The comparative SSIM results for the HDLIP-SHAR model are reported in Table 8; Fig. 11. The results show that the HDLIP-SHAR method achieves higher performance than existing models. The MSDN, FLTNet, and MGNet methods exhibit lower SSIM values, while the AOT technique shows the worst SSIM values, with a minimum of 0.850, a maximum of 0.978, and an average of 0.914. Meanwhile, the LaMa approach produces reasonable SSIM values, ranging from 0.861 to 0.980, with an average of 0.921. The EC and LID-MIR models achieve slightly higher and closely comparable minimum, maximum, and average SSIM values. However, the HDLIP-SHAR technique achieves enhanced performance, with SSIM values ranging from 0.899 to 0.991 and an average of 0.945.

Table 8 SSIM outcome of HDLIP-SHAR approach with existing models.
Fig. 11
figure 11

SSIM outcome of HDLIP-SHAR approach with existing models.

The comparative LPIPS results for the HDLIP-SHAR methodology are reported in Table 9; Fig. 12. The results show that the HDLIP-SHAR methodology achieves superior performance compared to existing models. The LID-MIR methodology shows LPIPS values with a minimum of 0.0090, a maximum of 0.0659, and an average of 0.0375. Meanwhile, the LaMa model shows reasonable LPIPS values, ranging from 0.0091 to 0.0663, with an average of 0.0377. The EC and AOT methodologies reach slightly higher LPIPS values in terms of minimum, maximum, and average. Likewise, the MSDN, FLTNet, and MGNet methodologies attained lower results. The HDLIP-SHAR methodology yields LPIPS values ranging from 0.0112 to 0.0690, with an average of 0.0401.

Table 9 LPIPS outcome of HDLIP-SHAR approach with existing models.
Fig. 12
figure 12

LPIPS outcome of HDLIP-SHAR approach with existing models.

The time outcomes of the HDLIP-SHAR approach are compared with other techniques in Table 10; Fig. 13. The results show that the HDLIP-SHAR approach achieves the lowest time of 90.00s. In contrast, the AOT, EC, LaMa, and LID-MIR technologies require higher times of 110.04s, 151.07s, 145.23s, and 162.00s, respectively. Thus, the HDLIP-SHAR model demonstrates superior restoration and enhancement performance for smart historical artefact images.

Table 10 Time outcome of HDLIP-SHAR approach with existing models.
Fig. 13
figure 13

Time outcome of HDLIP-SHAR approach with existing models.

Table 11 specifies the ablation study analysis of the HDLIP-SHAR methodology. The baseline setup achieves an average PSNR of 62.52 dB, which increases to 63.22 dB with contrast enhancement and to 63.90 dB when segmentation is added. The HDLIP-SHAR model using pre-processing, feature extraction, and segmentation achieves the highest PSNR of 64.59 dB, highlighting the cumulative benefit of all modules.

Table 11 Ablation study analysis of the HDLIP-SHAR methodology.

Table 12 portrays the computational efficiency evaluation of the HDLIP-SHAR approach46. Conventional methods, namely CNN, ResNet, ViT, and PSMNet, required 22.8 G to 32.2 G FLOPs, 2295 M to 4533 M GPU memory, and 2.79 to 5.15 s of inference time. In contrast, the HDLIP-SHAR model significantly reduced computational cost with only 10.11 G FLOPs, 755 M GPU usage, and an inference time of 1.08 s, emphasising superior speed and resource efficiency.

Table 12 Comparison of computational efficiency, including FLOPs, GPU usage, and inference time.

Conclusion

In this manuscript, the HDLIP-SHAR methodology is presented to design a DL model for identifying and reconstructing missing or damaged sections of artefact images. To achieve this, the HDLIP-SHAR methodology first employed AMF and contrast enhancement to enhance the quality of the input images. Furthermore, a hybrid SqueezeNet CNN architecture is implemented, which extracts deep semantic features from historical artefact images to recognise cracks, missing parts, and faded textures. Afterwards, the U-Net method is employed for image segmentation and to localise the damaged regions. Lastly, the transformer-driven GAN technique is utilised to restore and inpaint the missing areas of the image. The comparative analysis of the HDLIP-SHAR model demonstrated superior performance with an average PSNR of 64.59 dB, SSIM of 0.945, and LPIPS of 0.0401, outperforming other methods on the MuralDH dataset. The limitations include the reliance on the availability of high-quality reference images, and restoration accuracy remains limited for severely degraded artefacts. Highly complex patterns and intricate textures also affect the model's performance, and processing very large or high-resolution images can be computationally intensive, affecting scalability. Moreover, the method concentrates on visual restoration without integrating historical or material context that could improve authenticity. Improving robustness across diverse artefact types and incorporating multimodal data sources could further enhance restoration reliability.