Introduction

Inscription images play a crucial role in preserving historical knowledge, promoting the art of calligraphy, and safeguarding cultural heritage. However, compared with natural images, inscription images often have a simple background and limited contextual information. The lack of suitable references poses a significant challenge for traditional methods to achieve complete and accurate restoration. Currently, most existing methods rely on generative adversarial networks (GANs)1,2,3,4,5, which perform restoration by learning unimodal image features. However, these approaches often ignore the inherent structural information of Chinese characters, resulting in redundant or incorrect strokes (as shown in Fig. 1a) as well as missing strokes or disordered components (as shown in Fig. 1b). A critical challenge lies in achieving structurally coherent and accurate restoration when continuous and referable texture information is lacking.

Fig. 1: Examples of improperly restored images.
figure 1

a Preservation of meaningless strokes. b Disordered structure. The first row displays the original images, while the second row shows the corresponding incorrect inpainting results.

To address these limitations, some researchers have incorporated skeleton priors6,7 into their models. Although these approaches improve inpainting performance to a certain degree, they still depend on skeleton information extracted from inscription images. Moreover, they require a large and diverse dataset for reference, which limits their applicability in real-world scenarios. To this end, we propose a structure-guided restoration method inspired by human perception of Chinese character composition, which explicitly incorporates glyph structure to enhance inpainting quality under missing information.

Through long-term reading, the brain becomes familiar with the structural relationships inherent in Chinese characters. Even if a character’s image is partially damaged, the brain can reconstruct its correct form through structural inference. Inspired by this human ability, we propose that incorporating structural priors into existing restoration models may equip them with analogous inferential capabilities for damaged inscription character inpainting. To achieve this, we must establish a model that represents the association between the structure of Chinese characters and visual features. Motivated by the great success of the vision-language model CLIP in text recognition and detection tasks8,9,10,11, we employ the CLIP model to obtain Chinese character structural information. For this purpose, we decompose Chinese characters into components and their spatial combination relationships, forming ideographic description sequences (IDS) as a source of structural information. Following this line, we pretrain a CLIP model to achieve cross-modal alignment between Chinese character images (visual modality) and their corresponding structural text (i.e., IDS, textual modality). The pretrained CLIP can therefore provide structural priors that compensate for the insufficient visual features of damaged inscription character images, producing more plausible restorations.

To faithfully reconstruct the original inscription character style, including stroke morphology and thickness, we introduce a style embedding module to enhance the original visual features. In particular, instead of trying to disentangle style and content in Chinese characters, it matches styles via similarity and uses a linear classifier for style prediction and selection. Subsequently, the chosen style features are integrated into the original features to enhance stylistic representation.

Unlike natural images, inscription images often have a compromised glyph structure and lack continuous textures or color transitions, which leads to insufficient feature information for restoration. Moreover, the limited inscription image data and the diversity of font styles make it difficult to directly transfer traditional image restoration methods to such tasks. Therefore, we propose the glyph structure-guided inpainting network (CINet), which leverages the spatial and structural relationships between Chinese character components for restoration. Specifically, we establish a deep association between the image and structural components using the CLIP model, and map the IDS latent-space representation from the CLIP text encoder as prior information, which guides the Chinese character structural branch (CSB) of CINet to generate a complete glyph structure representation. Additionally, we establish an interaction between the damaged image features and the structural information of Chinese characters, which enhances the inpainting branch (IB) in understanding the character structure. Furthermore, we introduce a style embedding module to maintain consistency in the restoration style.

Our contributions can be summarized as follows:

  1. (1)

    We propose a learning paradigm for acquiring the structural representation of Chinese characters. By decomposing a Chinese character into a series of components and structural sequences, the CLIP model is employed to perform cross-modal alignment between the inscription image and the structural sequence. Through this alignment, the CLIP text encoder gains the ability to model Chinese character structures and to supply structural priors when the image is compromised.

  2. (2)

    We propose a dual-branch glyph structure-guided inpainting network (CINet). The two branches collaborate through feature sharing, interaction, and fusion, strengthening the synergy between the two modalities and enhancing the network’s sensitivity to glyph structures and restoration performance.

  3. (3)

    We introduce a style embedding module to enhance the network’s sensitivity to different Chinese character styles. Experimental results show that the CINet addresses varying levels of damage and minimizes data dependency, making it particularly suitable for inpainting inscription images.

Methods

Overview of related methods

Image inpainting involves reconstructing damaged regions by utilizing contextual cues and surrounding features. Deep learning has driven the development of numerous image inpainting methods, including local completion based on convolutional neural networks12, GANs1, and diffusion models13,14. Zhu et al.14 recently introduced GSDM, a two-stage diffusion framework guided by global structure, which integrates structure prediction and content generation for improved text-image inpainting. Although existing methods have achieved notable progress in content generation, they remain limited in structural modeling, especially in capturing long-range dependencies and complex structural relationships.

To address this, recent studies have increasingly adopted self-attention mechanisms to enhance global structure perception. Self-attention, a key component of transformer architecture15, has proven to be highly effective in modeling global dependencies and has shown substantial success in image inpainting tasks. As a result, there has been growing research16,17,18 focused on enhancing transformers to improve reconstruction quality. One notable approach, proposed by Deng et al.17, is the Tformer network, which leverages Transformer modules. This network adopts a U-Net-like architecture and integrates an innovative linear attention mechanism with a gating module. This design reduces the computational complexity of traditional self-attention while preserving the ability to model long-range dependencies, greatly improving image inpainting and supporting large-scale and real-time tasks. A similar framework, Uformer, proposed by Wang et al.19, replaces global self-attention with a local enhanced window Transformer module and introduces a learnable multiscale restoration module, which ensures high-quality image restoration details while alleviating the computational burden. To strike a balance between computational complexity and restoration quality, Huang et al.20 developed a method that enhances Transformer performance in high-resolution image restoration. They designed a cross-channel attention mechanism to model global dependencies, implementing sparse attention distribution by replacing Softmax with ReLU, thus mitigating performance bottlenecks due to high computational complexity. To further enhance the preservation of restoration details, Chen et al.18 proposed the \(\text{M}\times \text{T}\) framework, combining Mamba with Transformer to leverage their synergy for improving detail recovery and ensuring global semantic consistency in image inpainting tasks.

In addition to self-attention, researchers have explored various other attention mechanisms to further enhance image inpainting. Cheng et al.21 introduced a lightweight framework that incorporates an attention module into group convolutions. This model uses a rotation mechanism to assign attention weights between groups, enhancing the interaction between global and local information, making it especially suitable for resource-limited applications. Wang et al.22 proposed three attention networks aimed at boosting image restoration performance. Chen et al.23, aiming to process global information more effectively, reduced the resolution of damaged images and used a U-Net-like architecture for global feature capture. They also added a second branch with multiscale channel attention for local restoration and fused the outputs of both branches to improve the final restoration quality.
The research above highlights the pivotal role of attention mechanisms in image restoration. By modeling the relationships between global and local features, attention mechanisms effectively leverage contextual information to restore damaged regions. While natural images provide rich information with diverse colors and textures that supports the robust use of attention mechanisms, inscription images typically feature simpler backgrounds with fewer usable features, making inpainting more challenging. Therefore, to address the lack of contextual information in inscription images, it is essential to introduce additional reference data during the inpainting process to improve restoration quality.

Due to their age, many crucial glyph structures in inscription images may have been damaged. The goal of inpainting is to restore the complete inscription image, even when structural information is missing. This requires a deep understanding of Chinese character structure and ensures that the restored characters maintain the same style as the original. To provide a comprehensive overview, the following section discusses restoration methods for inscription images, extending the related research on calligraphy and document image restoration. Existing work primarily relies on image features for inpainting. For instance, Sun et al. proposed the RubGAN model3, which utilizes a dual discriminator design: one focuses on detailed information, while the other captures global features. By working together, these discriminators help the generator produce restoration results with richer details and more coherent structures. Chen et al.24 enhanced the GAN framework by incorporating dilated convolutions25 into the generator, expanding the receptive field and improving the model’s feature extraction capabilities. While these methods are effective for lightly damaged inscription images, they fall short when dealing with severe damage, as relying solely on image features provides insufficient information, leading to reduced restoration performance. To overcome this limitation, some researchers6,7,26 have attempted to incorporate additional prior information to improve inpainting performance. Shi et al.7 proposed a method that uses character skeletons as priors to restore real-world inscription images. Built on a GAN framework, this method leverages multiscale feature fusion to enhance detail restoration. Li et al.26 introduced a network similar to font style transfer, incorporating template images to provide structural information and using a style encoder for style consistency. Shi et al.6 proposed a parallel-task framework for denoising inscription images, where image and skeleton features are fused using spatial and channel attention mechanisms and reinjected to preserve glyph structures. Song et al.27 incorporated a self-attention mechanism into the GAN generator to better capture global information, employing multiple loss functions to improve handwritten Chinese character inpainting. These methods improve restoration accuracy by extracting skeleton information or using attention mechanisms to enhance the utilization of image features. However, they are fundamentally constrained by their reliance on image features, limiting their ability to restore glyph integrity when the image data quality is poor. In addition, variational autoencoders (VAEs)28, commonly used for image inpainting and reconstruction, encode images into a latent space for progressive reconstruction. Pathak et al. further proposed the context encoder network (CENet)29, which combines VAEs with generative adversarial networks (GANs) to improve performance. A recent study by Zhao et al.30 proposed a cross-autoencoder framework for inscription image inpainting, which employs dilated convolutions and channel attention for parallel feature encoding, and uses shared-parameter decoders optimized with multiple loss functions to improve inpainting performance. In related research, Zhang et al.4 expanded the dataset by modeling noise in calligraphy images and used GANs to remove noise patches. 
Souibgui et al.5 proposed a document restoration network based on a conditional GAN with a U-Net architecture, designed to handle watermarks, ink stains, and uneven backgrounds in document images. Lugo-Torres et al.31 applied a CycleGAN framework2 to address uneven backgrounds in document images (e.g., stains and creases), improving the readability of the documents. In summary, while existing methods for inscription image inpainting have advanced, relying solely on image features is insufficient for restoring glyph integrity and accuracy. Therefore, integrating glyph structure information into restoration networks is essential for improving the performance of inpainting methods, especially when dealing with severely damaged inscription images.

Overall architecture of the proposed method

To enhance the model’s ability to extract discriminative features in severely damaged scenarios, we propose the CINet, a cross-modal glyph structure-guided inpainting network. As illustrated in Fig. 2, the CINet consists of a backbone network (EB), an inpainting branch (IB), a Chinese character structural branch (CSB), and a pretrained text encoder (ETEX). The CINet integrates the character structure information from the CSB into the IB through a cross-attention mechanism, allowing the IB to focus on damaged areas and thereby achieving more accurate restoration results. ETEX is derived from a CLIP model pretrained on large-scale data. The structural vector serves as a cross-modal structural prior, learned from extensive pretraining data rather than from the limited samples of the current inpainting task. Moreover, this prior is already well aligned with the visual features of complete Chinese characters during pretraining and demonstrates strong invariance to font style variations. As a result, even with a limited number of training examples, CINet can reliably obtain accurate structural information via the CSB branch, thereby significantly improving the feature representation and inpainting performance of the IB branch. To clarify this process, we formalize it as follows.

$$\hat{x}={IB}\left({E}_{{\rm{B}}}\left({x}_{n}\right),{CSB}\left({x}_{n}\right),{S}_{{\rm{emb}}}\right)$$
(1)

where \({x}_{n}\) represents the damaged image, while \({IB}\), \({E}_{\text{B}}\), and \({CSB}\) denote the functions describing the IB, EB, and CSB, respectively. \({S}_{\text{emb}}\) represents the style embedding.

Fig. 2: Overall framework of CINet.
figure 2

The architecture of CINet, which consists of an inpainting branch and a Chinese structure branch for structural awareness.

Pre-training CLIP for glyph structure representation

Chinese characters have significant structural characteristics, and their glyphs are composed of multiple components arranged according to specific spatial relationships (such as left and right, up and down). For example, the character “构” consists of the components “木”, “勹”, and “厶”, arranged in a left-right and semi-enclosing structure. To formalize the structural representation of Chinese characters, the Unicode standard defines IDS. IDS consists of structural symbols (e.g., ⿰ and ⿹ to represent structures) and component symbols (e.g., 木, 勹, and 厶 to represent components) defined by Unicode. It encodes Chinese character components and their hierarchical relationships, thereby achieving the standardized decomposition of glyph structures.
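For illustration, the sketch below shows how such an IDS could be represented and tokenized in code. The lookup table and helper function are hypothetical, used only to demonstrate the decomposition described above; the full Unicode IDS inventory is much larger.

```python
# Minimal sketch: representing a Chinese character by its ideographic
# description sequence (IDS). The decomposition of "构" follows the text above:
# a left-right structure (⿰) whose right part is a semi-enclosure (⿹勹厶).
ids_table = {  # hypothetical lookup table, not the full Unicode inventory
    "构": "⿰木⿹勹厶",
}

def tokenize_ids(char: str) -> list[str]:
    """Split an IDS string into structural symbols and component symbols."""
    return list(ids_table[char])

print(tokenize_ids("构"))  # ['⿰', '木', '⿹', '勹', '厶']
```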

For inscription image inpainting, although different fonts exhibit significant visual variations, their fundamental structures typically remain consistent. However, relying solely on visual features makes models susceptible to style interference, hindering the learning of unified and stable structural representations. To address this, we introduce a structure-aware cross-modal alignment mechanism inspired by CLIP. Although CLIP is effective in semantic tasks such as retrieval and classification, it lacks structural modeling capabilities, limiting its suitability for structure-sensitive tasks involving Chinese characters. To improve structural understanding, we propose an alignment paradigm that replaces natural language with sequential representations of Chinese character structures. These sequences are then aligned with character images during training, guiding the model to learn style-independent structural representations and improving its ability to model glyph structures. In particular, when training the CLIP model, we input Chinese character images and ideographic description sequences (IDS) into the image encoder and text encoder, respectively, as shown in Fig. 3. Specifically, we adopt the ResNet-50 architecture as the image encoder to extract the image feature vector (\(I\)), and use a Transformer-based text encoder that models sequential dependencies and projects its output through a linear layer to match the dimensionality of \(I\), resulting in the text feature vector (\(T\)).

Fig. 3: CLIP model for Chinese character recognition.
figure 3

The model comprises two encoders: an image encoder for character images and a text encoder for ideographic description sequences (IDS).

We apply contrastive loss optimization to align a batch of \(M\) image feature vectors and text vectors. The specific loss function is as follows, where the first term represents the image-to-text loss, while the second term represents the text-to-image loss.

$${L}_{{\rm{clip}}}=-\frac{1}{2M}\left(\mathop{\sum }\limits_{i=1}^{M}log \frac{exp \left({I}_{i}\cdot {T}_{i}\right)}{{\sum }_{j=1}^{M}exp \left({I}_{i}\cdot {T}_{j}\right)}+\mathop{\sum }\limits_{j=1}^{M}log \frac{exp \left({T}_{j}\cdot {I}_{j}\right)}{{\sum }_{i=1}^{M}exp \left({T}_{j}\cdot {I}_{i}\right)}\right)$$
(2)

By training on large-scale image–IDS pairs, the CLIP model learns to align the visual features of complete Chinese characters with their structural representations. This alignment can establish a stable structural prior after training. On the one hand, the representation of IDS sequences remains invariant to font style variations and exhibits strong consistency. On the other hand, it is independent of the data scale in downstream restoration tasks. Consequently, even when training data is limited, the model can rely on the structural vector produced by the text encoder to provide high-quality structural guidance for the inpainting network.
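A minimal PyTorch sketch of the symmetric contrastive objective in Eq. (2) is given below. It assumes L2-normalized embeddings and, following Eq. (2) literally, omits the learnable temperature used in standard CLIP.

```python
import torch
import torch.nn.functional as F

def clip_structure_loss(image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image contrastive loss (Eq. 2).

    image_feats, text_feats: (M, D) batches of image and IDS embeddings,
    assumed to be L2-normalized so the dot product acts as a cosine similarity.
    """
    logits = image_feats @ text_feats.t()              # (M, M) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # first term of Eq. (2)
    loss_t2i = F.cross_entropy(logits.t(), targets)    # second term of Eq. (2)
    return 0.5 * (loss_i2t + loss_t2i)
```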

After training, the inference process is illustrated in Fig. 4. The CLIP model first encodes the Chinese character image using the image encoder to obtain its image feature vector (\(I\)). Simultaneously, the \(K\) predefined candidate Chinese characters are decomposed into IDS and encoded by the ETEX into a series of textual feature vectors \(T=\left\{{T}_{1},{T}_{2},\ldots ,{T}_{K}\right\}\). By computing the similarity between the image and textual representations, the model outputs the index \({k}^{* }\) corresponding to the character category with the highest similarity. Accordingly, the corresponding textual feature \({T}_{{k}^{* }}\) represents the structural information of the Chinese character. To clarify, we provide the following formula.

$${k}^{* }=\mathop{{\rm{arg}}\max }\limits_{k\in \left\{1,2,\ldots ,K\right\}}S\left(I,{T}_{k}\right)$$
(3)

where \(S\) denotes the cosine similarity calculation.
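The inference step of Eq. (3) reduces to a cosine-similarity argmax over the \(K\) candidate structure embeddings, as sketched below; the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize_character(image_feat: torch.Tensor, candidate_text_feats: torch.Tensor):
    """Select the candidate IDS whose embedding best matches the image (Eq. 3).

    image_feat: (D,) image-encoder output I.
    candidate_text_feats: (K, D) text-encoder outputs T_1..T_K for the K candidates.
    Returns the index k* and the corresponding structural feature T_{k*}.
    """
    sims = F.cosine_similarity(image_feat.unsqueeze(0), candidate_text_feats, dim=-1)
    k_star = int(sims.argmax())
    return k_star, candidate_text_feats[k_star]
```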

Fig. 4: The inference process of Chinese character recognition.
figure 4

Chinese character categories are determined by computing the cosine similarity between the output vectors of the image and text encoders.

Backbone network

ResNet32, renowned for its innovative residual connection design, excels in both global and local feature extraction and is widely used in visual tasks33,34. Therefore, we select ResNet-50 as the backbone for feature extraction.

$${F}_{{\rm{bon}}}={{\rm{ResNet}}}_{50}\left({x}_{n}\right)$$
(4)

where \({F}_{{\rm{bon}}}\) represents the extracted backbone features.

Next, three \(3\times 3\) convolutions are applied to obtain the initial feature \({F}_{\text{img\_}1}\) of the inpainting branch and the initial feature \({F}_{\text{tex\_}1}\) of the structural branch, respectively.

Feature sharing module

Benefiting from the introduction of the CSB, the IB becomes more sensitive to glyph perception, thereby significantly improving the inpainting results. Notably, the IB focuses on restoring pixel-level details, while the CSB emphasizes the semantic features of the overall glyph structure. To align the features with the needs of different tasks, we design a feature sharing module (FSM) to address potential inconsistencies, as shown by the pink dashed box in Fig. 2. First, the FSM applies three convolutional layers (with a kernel size of \(3\times 3\)) to each task branch to extract task-specific features. Then, the extracted features are concatenated using the concatenation operation (Cat). Finally, a \(3\times 3\) convolutional layer is applied to the concatenated features to fuse them, resulting in the sharing feature \({F}_{\text{fus}}\) as shown in Eq. (5), which can enhance the interaction and correlation between the initial features of the CSB and IB branches.

$${F}_{{\rm{fus}}}={{Conv}}_{3\times 3}\left({Cat}\left({F}_{{\rm{img\_}}1},{F}_{{\rm{tex\_}}1}\right)\right)$$
(5)
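A possible PyTorch realization of the FSM is sketched below; the channel count and the absence of activations between convolutions are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class FeatureSharingModule(nn.Module):
    """Sketch of the FSM: three 3x3 convolutions per branch, concatenation,
    and a final 3x3 convolution producing the shared feature F_fus (Eq. 5)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.img_convs = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)])
        self.tex_convs = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)])
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_img_1: torch.Tensor, f_tex_1: torch.Tensor) -> torch.Tensor:
        f_img = self.img_convs(f_img_1)                     # task-specific features (IB side)
        f_tex = self.tex_convs(f_tex_1)                     # task-specific features (CSB side)
        return self.fuse(torch.cat([f_img, f_tex], dim=1))  # F_fus
```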

Character structure branch

The CSB, consisting of a cross-attention module (CATEX) and a text decoder (DTEX) as shown in Fig. 2, is responsible for extracting high-quality glyph structure information, helping the IB perceive the semantic features of the glyphs more sensitively during the restoration process. Therefore, obtaining and utilizing cross-modal glyph structure information is a critical task. Inspired by the cross-modal learning between images and texts in the CLIP model, we first pretrain a CLIP model for Chinese character image recognition. Then, we apply the ETEX from the pretrained CLIP model to obtain Chinese character structure representations that constrain the output features of the CSB. This process forces the CSB to extract key glyph information from the damaged images. Finally, the glyph representation is passed to the IB through a cross-attention mechanism, enabling the IB to incorporate glyph constraints and ensure the accuracy of the restored character structure.

In the CSB, the process of glyph structure extraction is as follows. First, the CATEX is used to obtain an enhanced feature \({F}_{\text{tex\_}2}\) from the initial feature \({F}_{\text{tex\_}1}\) and the sharing feature \({F}_{\text{fus}}\) as shown in Eq. (6). Then, the DTEX is applied to map the enhanced feature \({F}_{\text{tex\_}2}\) to Chinese character structure feature (\({T}_{\text{out}}\)). Finally, the similarity between the output feature of ETEX and \({T}_{\text{out}}\) is calculated to obtain structural information.

$${F}_{{\rm{tex}\_}2}=C{A}_{\text{TEX}}\left({Q}_{1},{K}_{1}{,V}_{1}\right)={Softmax}\left(\frac{{Q}_{1}{K}_{1}^{T}}{\sigma}\right){V}_{1}$$
(6)

where \({Q}_{1}={{\rm{W}}}_{1}^{Q}{F}_{\text{tex\_}1}\), \({K}_{1}={{\rm{W}}}_{1}^{K}{F}_{{\rm{fus}}}\), \({V}_{1}={{\rm{W}}}_{1}^{V}{F}_{{\rm{fus}}}\), with \({{\rm{W}}}_{1}^{Q}\), \({{\rm{W}}}_{1}^{K}\) and \({{\rm{W}}}_{1}^{V}\) denoting learnable weight matrices, and \(\sigma\) is the scaling factor.
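The cross-attention of Eq. (6) (and likewise CAIMG and CAIMG_TEX introduced later) can be sketched as follows; a single head and flattened spatial tokens are assumed for brevity.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention (Eq. 6): queries come from one branch,
    keys and values from another feature (e.g., F_fus or T_out)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** 0.5  # scaling factor sigma

    def forward(self, f_query: torch.Tensor, f_kv: torch.Tensor) -> torch.Tensor:
        # f_query: (B, N, dim) query tokens; f_kv: (B, M, dim) key/value tokens
        q, k, v = self.w_q(f_query), self.w_k(f_kv), self.w_v(f_kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # e.g., F_tex_2 = CA_TEX(F_tex_1, F_fus)
```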

Since the CSB relies on features from the damaged character image, it is limited in accurately capturing the complete structural information. To improve the representation of Chinese character structures, we leverage the ETEX output from the pre-trained CLIP model as structural prior knowledge, which constrains the prediction vector from DTEX and guides the CSB branch toward generating a more complete and accurate glyph structure. The DTEX is based on a Transformer architecture15, as illustrated in Fig. 5. Specifically, \({F}_{\text{tex\_}2}\) is first processed by a multihead self-attention mechanism (MSA), followed by a feed-forward network and a normalization layer. Finally, a fully connected layer outputs the Chinese character structure prediction vector (\({T}_{\text{out}}\)).
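A rough sketch of DTEX consistent with Fig. 5 is shown below; the hidden dimension, head count, output dimension, and the mean pooling before the fully connected layer are all assumptions rather than the exact design.

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Sketch of D_TEX: multihead self-attention, a feed-forward network with
    normalization, and a fully connected head producing T_out."""
    def __init__(self, dim: int = 256, heads: int = 8, out_dim: int = 512):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, out_dim)

    def forward(self, f_tex_2: torch.Tensor) -> torch.Tensor:
        # f_tex_2: (B, N, dim) enhanced structure-branch tokens
        x = self.norm1(f_tex_2 + self.msa(f_tex_2, f_tex_2, f_tex_2)[0])
        x = self.norm2(x + self.ffn(x))
        return self.fc(x.mean(dim=1))  # pooled structure prediction vector T_out
```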

Fig. 5: Text decoder of the Chinese character structure branch.
figure 5

It generates the structural vector (Tout) of Chinese characters.

The loss function \({L}_{\text{tex}}\) of the CSB consists of two components: the cross-entropy loss \({L}_{\text{rec}}\) and the mean squared error loss \({L}_{\text{dis}}\). Specifically, we utilize the \({E}_{\text{out}}\) provided by ETEX from CLIP as the target. By optimizing the similarity between the \({T}_{\text{out}}\) from DTEX and \({E}_{\text{out}}\), we aim to bring \({T}_{\text{out}}\) and \({E}_{\text{out}}\) closer within the feature space. Moreover, we apply \({L}_{\text{dis}}\) to constrain the distance between the \({T}_{\text{out}}\) and \({E}_{\text{out}}\). The detailed formulas (7)–(9) are as follows:

$${L}_{{\rm{rec}}}=-\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}\log \left(\frac{exp \left({T}_{{\rm{out}}}^{n}\cdot {E}_{{\rm{out}}}^{{y}_{n}}\right)}{\mathop{\sum }\nolimits_{j=1}^{C}exp \left({T}_{{\rm{out}}}^{n}\cdot {E}_{{\rm{out}}}^{j}\right)}\right)$$
(7)
$${L}_{\text{dis}}=\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}{{||}{E}_{{\rm{out}}}^{n}-{T}_{{\rm{out}}}^{n}{||}}_{2}^{2}$$
(8)
$${L}_{{\rm{tex}}}={L}_{{\rm{rec}}}+0.01\cdot {L}_{{\rm{dis}}}$$
(9)

where \(N\) represents the number of samples in a batch. \(C\) indicates the number of Chinese character categories, while \({E}_{{\rm{out}}}^{{y}_{n}}\) refers to the feature vector corresponding to the ground-truth character label \({y}_{n}\).
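The CSB objective of Eqs. (7)–(9) can be written compactly as below; the tensor shapes are assumptions consistent with the notation above.

```python
import torch
import torch.nn.functional as F

def csb_loss(t_out: torch.Tensor, e_out_all: torch.Tensor, labels: torch.Tensor,
             lambda_dis: float = 0.01) -> torch.Tensor:
    """L_tex = L_rec + 0.01 * L_dis (Eqs. 7-9).

    t_out:     (N, D) structure vectors predicted by D_TEX.
    e_out_all: (C, D) CLIP text-encoder features for all C character categories.
    labels:    (N,) ground-truth character indices y_n.
    """
    logits = t_out @ e_out_all.t()                                # (N, C) similarities
    l_rec = F.cross_entropy(logits, labels)                       # Eq. (7)
    l_dis = ((e_out_all[labels] - t_out) ** 2).sum(dim=1).mean()  # Eq. (8)
    return l_rec + lambda_dis * l_dis                             # Eq. (9)
```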

Inpainting branch

The inpainting branch (IB) is responsible for restoring the structure, style, and details of Chinese characters. It comprises an image decoder (DIMG), two cross-attention mechanisms (CAIMG and CAIMG_TEX), and a style embedding module as shown in Fig. 2. First, the IB branch performs a feature interaction between the initial feature Fimg_1 and the sharing feature \({F}_{\text{fus}}\) through the cross-attention (CAIMG), obtaining an enhanced feature representation Fimg_2 as shown in Eq. (10). Then, the CAIMG_TEX is employed to inject the cross-modal glyph structure feature extracted from the CSB into the IB, helping it focus on the key parts of the restoration and ensuring the accuracy of the restored Chinese character structure. The feature \({F}_{\text{img}\_3}\) generated by the CAIMG_TEX is expressed by Eq. (11).

$${F}_{{\rm{img\_}}2}={Softmax}\left(\frac{{Q}_{2}{{K}_{2}}^{T}}{\sigma }\right){V}_{2}$$
(10)

where \({Q}_{2}={{\rm{W}}}_{2}^{Q}{F}_{\text{img}\_1}\), \({K}_{2}={{\rm{W}}}_{2}^{K}{F}_{{\rm{fus}}}\), \({V}_{2}={{\rm{W}}}_{2}^{V}{F}_{{\rm{fus}}}\), with \({{\rm{W}}}_{2}^{Q}\), \({{\rm{W}}}_{2}^{K}\) and \({{\rm{W}}}_{2}^{V}\) denoting learnable weight matrices.

$${F}_{{\rm{img\_}}3}={Softmax}\left(\frac{{Q}_{3}{{K}_{3}}^{T}}{\sigma }\right){V}_{3}$$
(11)

where \({Q}_{3}={{\rm{W}}}_{3}^{Q}{F}_{\text{img}\_2}\), \({K}_{3}={{\rm{W}}}_{3}^{K}{T}_{{\rm{out}}}\), \({V}_{3}={{\rm{W}}}_{3}^{V}{T}_{{\rm{out}}}\), with\({{\rm{W}}}_{3}^{Q}\), \({{\rm{W}}}_{3}^{K}\) and \({{\rm{W}}}_{3}^{V}\) representing learnable weight matrices.

Due to the difficulty of decoupling the style and content features of Chinese characters from the latent feature (i.e., \({F}_{\text{img}\_3}\)), we do not adopt the decoupling approach35,36. Instead, we use the style embedding to enhance the latent feature representation, where style loss optimization encourages the style embedding module to learn more discriminative style representations, as shown in Fig. 6. Specifically, we assign a unique index to each style category and convert the style index into a style embedding matrix (\({I}_{\text{emb}}\)) through an embedding layer. We then compute the similarity by performing a dot product between \({I}_{\text{emb}}\) and the transformed image features \({F}_{\text{img}\_3}^{{\prime} }\) derived from the original features \({F}_{\text{img}\_3}\). The formula is as follows:

$${S}_{\text{dot}}\left({I}_{\text{emb}},{F}_{\text{img}\_3}^{{\prime} }\right)={I}_{\text{emb}}\cdot \left({AvgPool}\left({Sigmoid}\left({F}_{\text{img}\_3}\right)\right)\right)$$
(12)
Fig. 6: Structure of the style embedding module.
figure 6

It is used to capture the style representation of the image.

The dot product \({S}_{\text{dot}}\) is a widely used and effective similarity measure in deep learning. It is adopted in the self-attention mechanism of Transformer models15, as well as in various tasks such as sentence transformation and sentiment intensity modeling37. Finally, the computed similarity is passed through a fully connected layer to predict the font style, as follows:

$${I}_{{\rm{style}}}={FC}\left({S}_{\text{dot}}\left({I}_{\text{emb}},{F}_{\text{img}\_3}^{{\prime} }\right)\right)$$
(13)

where \({I}_{\text{style}}\) corresponds to the logits output by the model, and \({FC}\left(\cdot \right)\) represents the output of the fully connected layer.
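The style embedding module of Eqs. (12)–(13) can be sketched as follows; the embedding dimension and exact layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class StyleEmbeddingModule(nn.Module):
    """Sketch of the style embedding module: a learnable embedding per style
    (I_emb), dot-product similarity with pooled image features (Eq. 12), and a
    fully connected layer predicting the style logits I_style (Eq. 13)."""
    def __init__(self, num_styles: int, dim: int = 256):
        super().__init__()
        self.emb = nn.Embedding(num_styles, dim)    # rows form the matrix I_emb
        self.fc = nn.Linear(num_styles, num_styles)

    def forward(self, f_img_3: torch.Tensor):
        # f_img_3: (B, dim, H, W) latent features from CA_IMG_TEX
        f_prime = torch.sigmoid(f_img_3).mean(dim=(2, 3))  # AvgPool(Sigmoid(F_img_3)), (B, dim)
        s_dot = f_prime @ self.emb.weight.t()              # Eq. (12): similarity to each style
        i_style = self.fc(s_dot)                           # Eq. (13): style logits
        return i_style, self.emb.weight                    # logits and I_emb (used to build S_emb)
```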

We apply the multiclass cross-entropy loss \({L}_{\text{style}}\) to constrain the style prediction, as shown in the formula (14):

$${L}_{\text{style}}=-\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}log \left(\frac{exp \left({I}_{\text{style}}^{n}\cdot {I}_{\text{true}}^{{t}_{n}}\right)}{{\sum }_{j=1}^{{C}^{{\prime} }}exp \left({I}_{\text{style}}^{n}\cdot {I}_{\text{true}}^{j}\right)}\right)$$
(14)

where the numerator represents the similarity score between the predicted style feature and the reference vector of its ground-truth category \({t}_{n}\), and the denominator is the sum of similarity scores between the predicted style feature and the style vectors of all style categories \({C}^{{\prime} }\).

The DIMG is a hierarchical feature fusion network, and its key design is the multilevel feature fusion module (MFM), inspired by ref. 38. MFM integrates both standard convolution and dilated convolution, effectively extracting multiscale features, as shown in Fig. 7. The DIMG begins feature fusion from the deepest layers and progressively fuses toward the shallower layers. After each fusion level, the predicted results are resized to the original dimensions through a \(3\times 3\) convolution and an interpolation operation. This design of multiscale supervision can significantly enhance the restoration performance. The decoding process is represented as follows:

$${\hat{x}}_{0}=\left\{{{Conv}}_{3\times 3}\left({{MFM}}_{l}\left({F}_{{\rm{bon}}}^{l},\left({F}_{\text{img\_}3}+{S}_{{\rm{emb}}}\right)\right)\right)|l=3,2,1,0\right\}$$
(15)

where \({\hat{x}}_{0}\) represents the final restoration result. \({S}_{\text{emb}}\) is obtained by transforming the dimensions of \({I}_{\text{emb}}\).
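One plausible form of the MFM, combining a standard and a dilated \(3\times 3\) convolution as described above, is sketched below; the channel layout, activation choice, and upsampling strategy are assumptions rather than the exact design of ref. 38.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFM(nn.Module):
    """Sketch of a multilevel feature fusion module: a standard-conv branch and
    a dilated-conv branch over the concatenated skip and decoder features."""
    def __init__(self, skip_ch: int, dec_ch: int, out_ch: int, dilation: int = 2):
        super().__init__()
        in_ch = skip_ch + dec_ch
        self.std = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.dil = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        self.merge = nn.Conv2d(2 * out_ch, out_ch, 3, padding=1)

    def forward(self, skip_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # Upsample the decoder feature to the skip feature's resolution, concatenate,
        # then fuse the multiscale responses of the two convolution branches.
        dec_feat = F.interpolate(dec_feat, size=skip_feat.shape[-2:], mode="bilinear", align_corners=False)
        x = torch.cat([skip_feat, dec_feat], dim=1)
        return self.merge(torch.cat([torch.relu(self.std(x)), torch.relu(self.dil(x))], dim=1))
```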

Fig. 7: The image decoder (DIMG) of the inpainting branch.
figure 7

The decoder mainly comprises multiple feature fusion modules (MFM) and an output layer, producing the final restored image.

We employ a multiscale supervision strategy to optimize the restoration process, applying \({L}_{1}\) loss on the outputs of each intermediate layer. The final loss function \({L}_{\text{multi}}\) can be expressed as follows:

$${L}_{\text{multi}}=\mathop{\sum }\limits_{l=0}^{3}{w}_{l}{{\rm{||}}{\hat{x}}_{l}-x{\rm{||}}}_{1}$$
(16)

where \({\hat{x}}_{0}\) is the final restoration output, \({\hat{x}}_{1}\), \({\hat{x}}_{2}\), and \({\hat{x}}_{3}\) are the outputs of the intermediate layers, and \({w}_{l}\) denotes the weight assigned to each scale.

In summary, the overall loss \({L}_{\text{total}}\) for training CINet consists of three components, defined as follows:

$${L}_{\text{total}}={L}_{\text{tex}}+{L}_{\text{style}}+{L}_{\text{multi}}$$
(17)
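The multiscale and total objectives (Eqs. 16–17) are straightforward to express in code, as below; the weights follow the values reported later in the experimental settings, and their ordering over scales is an assumption. Each intermediate output is assumed to be already resized to the resolution of the ground-truth image, as described above.

```python
import torch

def multiscale_l1_loss(preds, target, weights=(0.5, 0.3, 0.2, 0.1)) -> torch.Tensor:
    """L_multi (Eq. 16): weighted L1 loss over the decoder outputs x_hat_0..x_hat_3."""
    loss = torch.zeros((), device=target.device)
    for w, pred in zip(weights, preds):
        loss = loss + w * torch.mean(torch.abs(pred - target))
    return loss

def total_loss(l_tex, l_style, l_multi) -> torch.Tensor:
    """L_total (Eq. 17): the three terms are combined with unit weights."""
    return l_tex + l_style + l_multi
```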

Results

Datasets and experimental settings

Due to the lack of publicly available inscription image datasets, we constructed a real-damaged inscription dataset (DID) by collecting images from a public database1. However, the types of damaged inscription images in this dataset are relatively limited and do not cover more severe degradation that may occur in practical scenarios. In addition, the dataset is relatively small, making it insufficient to evaluate the model’s generalization ability. To address this issue, we further constructed two synthetic datasets, namely the Damaged Printed Chinese Characters Dataset (DPCCD) and the Damaged Handwritten Chinese Characters Dataset (DHWD), to assist in analyzing the model’s performance under various conditions. Furthermore, we introduced a Printed Chinese Characters Dataset (PCCD) to train the CLIP model.

We gathered inscription images from different dynasties to construct the DID, resulting in a dataset of 3295 inscription images, as shown in Fig. 8a. We use 2608 images for training and 687 for testing.

Fig. 8: Examples of real and synthetic datasets.
figure 8

a Examples of real-inscription dataset. b Examples of the synthetic dataset with various corrosion and damage shapes.

To construct the DPCCD, we collected 10 unseen styles (not part of PCCD) to generate images for 3755 Chinese character categories, totaling 37,550 images. We introduced various damage types and adjusted the damaged areas to create diverse damaged images, as shown in Fig. 8b. Of these, 31,920 images are used for training and 5630 for testing.

We collected 37,423 images of handwritten Chinese characters from 10 authors from HWDB1.139, forming the DHWD. We then generated damaged images by applying various damage types and adjusting the damage areas. A total of 31,841 images are used for training and 5582 for testing.

We collected 120 styles and 3755 Chinese character categories (GB2312 level-1 characters) from a public database2 to construct the PCCD, resulting in 450,600 Chinese character images. The training set includes 110 styles, with 100 styles randomly selected to generate samples for 1126 Chinese character categories. Additionally, 2629 Chinese character categories are covered by all 110 styles. The total number of training samples amounts to 401,790 images. The test set contains 10 unseen styles, with 11,260 Chinese character images generated as test samples.

We use several evaluation metrics to assess model performance: PSNR and SSIM evaluate pixel-level differences and structural integrity, LPIPS measures perceptual differences, and FID quantifies the difference in data distribution between the reconstructed and ground-truth images. To evaluate the model’s ability to restore styles, we train a style classifier on the test set and use it to calculate style scores (StSc)40 for the inpainted images.

The model is implemented using the PyTorch framework and trained on an i9-14900KF processor with an NVIDIA GeForce RTX 4090D (24 GB) GPU. The IB uses the Adam optimizer41 with a learning rate of \(2\times 1{0}^{-4}\), while the CSB uses the Adadelta optimizer42 with a learning rate of 1. Specifically, the DPCCD and DHWD use \(128\times 128\) inputs with a batch size of 32, while the DID uses \(256\times 256\) inputs with a batch size of 8. The coefficients \({w}_{l}\) in the loss function \({L}_{\text{multi}}\) are set to \(\left[0.5,0.3,0.2,0.1\right]\). To provide a more detailed description of the training process, we present the implementation pseudo code in Algorithm 1.

Algorithm 1

Pseudo code of CINet method

Input: training dataset \({D}_{\text{data}}={\left\{{x}_{i},{y}_{i}\right\}}_{i=1}^{{N}_{\text{data}}}\), batch size \(N=8\), and 200 training epochs

Output: inpainting image \(\hat{x}\)

1. Randomly initialize the model parameters \(\theta\).
2. For \(i=1\) to epoch do
3.   \(\left\{{x}_{N},{y}_{N}\right\}\leftarrow\) Sample\(\left({D}_{\text{data}},N\right)\)
4.   \({F}_{\text{bon}}\leftarrow {E}_{\text{B}}\left({x}_{N}\right)\)
5.   \({F}_{\text{fus}},{F}_{\text{img\_}1},{F}_{\text{tex\_}1}\leftarrow {FSM}\left({F}_{\text{bon}}\right)\)
6.   \({F}_{\text{tex\_}2}\leftarrow C{A}_{\text{TEX}}\left({W}_{1}^{Q}{F}_{\text{tex\_}1},{W}_{1}^{K}{F}_{\text{fus}},{W}_{1}^{V}{F}_{\text{fus}}\right)\), \({F}_{\text{img\_}2}\leftarrow C{A}_{\text{IMG}}\left({W}_{2}^{Q}{F}_{\text{img\_}1},{W}_{2}^{K}{F}_{\text{fus}},{W}_{2}^{V}{F}_{\text{fus}}\right)\)
7.   \({T}_{\text{out}}\leftarrow {D}_{\text{TEX}}\left({F}_{\text{tex\_}2}\right)\), \({E}_{\text{out}}\leftarrow {E}_{\text{TEX}}\left({y}_{N}\right)\)
8.   \({F}_{\text{img\_}3}\leftarrow C{A}_{\text{IMG}\_\text{TEX}}\left({W}_{3}^{Q}{F}_{\text{img\_}2},{W}_{3}^{K}{T}_{\text{out}},{W}_{3}^{V}{T}_{\text{out}}\right)\)
9.   \(\hat{x}\leftarrow {D}_{\text{img}}\left({F}_{\text{bon}},{F}_{\text{img\_}3},{S}_{\text{emb}}\right)\), \(\hat{y}\leftarrow {T}_{\text{out}}\times {E}_{\text{out}}\)
10.  Update CSB
11.  Update IB
12. end for
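The loop below restates Algorithm 1 as a PyTorch-style sketch; all sub-module and attribute names (model.e_b, model.fsm, model.csb_loss, and so on) are hypothetical, and the optimizer settings follow the values given above (Adam with lr 2e-4 for the IB, Adadelta with lr 1 for the CSB).

```python
import torch

def train_cinet(model, dataloader, epochs: int = 200):
    """Training loop following Algorithm 1 (sub-module names are hypothetical)."""
    ib_opt = torch.optim.Adam(model.ib_parameters(), lr=2e-4)
    csb_opt = torch.optim.Adadelta(model.csb_parameters(), lr=1.0)
    for epoch in range(epochs):
        for x_n, y_n, style_idx, x_gt in dataloader:    # damaged image, character label, style index, ground truth
            f_bon = model.e_b(x_n)                      # backbone features
            f_fus, f_img_1, f_tex_1 = model.fsm(f_bon)  # initial and shared features
            f_tex_2 = model.ca_tex(f_tex_1, f_fus)      # structure-branch cross-attention
            f_img_2 = model.ca_img(f_img_1, f_fus)      # inpainting-branch cross-attention
            t_out = model.d_tex(f_tex_2)                # predicted structure vector
            with torch.no_grad():
                e_out = model.e_tex(y_n)                # frozen CLIP text-encoder prior
            f_img_3 = model.ca_img_tex(f_img_2, t_out)  # inject glyph structure into the IB
            i_style, s_emb = model.style_embed(f_img_3) # style logits and style embedding
            preds = model.d_img(f_bon, f_img_3, s_emb)  # multiscale restored images

            l_tex = model.csb_loss(t_out, e_out, y_n)                                    # Eqs. (7)-(9)
            l_ib = model.style_loss(i_style, style_idx) + model.multi_loss(preds, x_gt)  # Eqs. (14), (16)

            csb_opt.zero_grad()
            ib_opt.zero_grad()
            (l_tex + l_ib).backward()
            csb_opt.step()                              # update CSB
            ib_opt.step()                               # update IB
```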

1. https://www.9610.com/index.htm
2. https://www.foundertype.com/index.php/FindFont/index

Comparison to state-of-the-art methods

In the inpainting of damaged Chinese characters, the absence of strokes and components significantly hinders the recovery of complete structural information when relying solely on the degraded image. To address this limitation, we utilize the known target character category during the restoration process and employ a trained structural feature extraction model to obtain a complete structural representation of the character. This representation is subsequently fused with image features to guide the inpainting of missing regions. Unlike conventional cross-style recognition methods, our approach focuses on extracting stable and precise structural features rather than depending on the model’s generalization ability to unseen fonts. To ensure the reliability of structural representations, training data should feature clear strokes, standardized structures, and high legibility. The PCCD dataset, composed of high-quality printed Chinese characters with complete and regular structures, provides ideal structural templates for the model. In contrast, handwritten or inscription-based datasets often exhibit considerable structural distortion and noise, making them less suitable for learning stable structural features and potentially degrading restoration performance.

Although the core of our method lies in leveraging stable structural features, we also evaluate the CLIP model across various font styles to gain a more comprehensive understanding of its performance. Specifically, we train the CLIP model using the PCCD, and the pretrained text encoder generates high-quality Chinese character glyph structure representations that are independent of character style. To evaluate the robustness of the CLIP model in recognizing Chinese characters across different font styles, we conduct tests on 1126 Chinese character categories and 10 font styles (both seen and unseen). Notably, there is no overlap between these test sets and the training set. The average recognition rate for the 10 seen font styles is 99.876%, while the rate for the 10 unseen font styles is 97.158% (as shown in Table 1), indicating only a slight drop in performance. Among the unseen fonts, one style achieves a recognition accuracy of approximately 82%, while all others exceed 90%. Overall, the model demonstrates strong robustness to unseen font styles. Even in extreme cases where generalization is limited, reliable structural information can still be retrieved during the inpainting process by leveraging the known character category.

Table 1 Recognition accuracy of 10 unseen styles (%)

In damaged character scenarios, we rely on the known character category to obtain a complete structural representation from the pretrained model. Accordingly, the model must be capable of extracting consistent structural features across different font styles. To validate this capability, we select the same character rendered in multiple font styles, extract its structural features using the trained model, and project the resulting features into a two-dimensional space using t-SNE, as illustrated in Fig. 9. The t-SNE visualization shows that different font styles of the same character form tight clusters in the feature space, indicating that the structural representations extracted by the model remain stable and consistent across font variations. This property ensures that accurate structural information can still be retrieved from the model to support the inpainting process, even when the input image is damaged, provided that the character category is known.

Fig. 9: t-SNE visualization of feature distributions for the same Chinese character.
figure 9

We use different colors and shapes to distinguish Chinese character classes. Same-class samples cluster tightly in the feature space, indicating robust structural representations by the model.

Because inscription image inpainting is a small-sample problem, we evaluate the performance of various methods under reduced data conditions from two perspectives: the insufficient number of glyph instance samples (G) and style samples (S). We analyze the applicability of different models in small-sample scenarios.

First, reducing G refers to decreasing the number of damaged images with diverse styles under the same IDS conditions. Table 2 summarizes the performance of different methods as G decreases by a step of 2, from 5 to 1 (i.e., 5 → 3 → 1), keeping S constant (in this case, S = 2629). As shown in Table 2, the evaluation metrics for all methods generally exhibit a downward trend as G decreases, indicating that glyph structural features are crucial to restoration performance. With only one instance, other models struggle to learn sufficient structural features, resulting in suboptimal restoration. In contrast, CINet performs the best in this scenario, demonstrating its ability to effectively compensate for the lack of information caused by a single glyph sample and showing better adaptability to scarce glyphs. Specifically, compared to the second-place methods, our model outperforms CENet by 0.4442 dB in PSNR; outperforms GSDM by 0.0046 and 0.0205 in SSIM and StSc, respectively; and reduces LPIPS and FID by 0.0051 and 0.6358, respectively.

Table 2 Impact of reducing glyph instances on performance of the DPCCD dataset

Second, the reduction of S refers to the decrease in degraded images with identical styles but different IDS. Table 3 reports the performance of various methods when G is 5, and the number of S decreases in an approximately halving pattern, from 1315 to 329 (i.e., 1315 → 657 → 329). As shown in Table 3, with the reduction of S, the evaluation metrics for all methods generally show a downward trend, indicating that style information also influences restoration performance. However, our model, CINet, still maintains optimal performance compared to all other methods. When S = 329, although our method is 0.0167 lower than the best-performing method in SSIM, it achieves a 0.0823 dB and 0.0202 increase in PSNR and StSc over the second-best, along with reductions of 0.0010 and 0.0700 in LPIPS and FID, respectively. This shows that even with insufficient style information, our method can maintain excellent performance. As shown in Fig. 10, CINet effectively restores the image, while other models (e.g., DE-GAN, FD-Net, RubGAN, and CENet) produce artifacts. Moreover, compared to models such as RCRN, Tformer, and GSDM, CINet demonstrates greater precision in restoring glyph structures (as highlighted by the red boxes), while better preserving details and avoiding structural blurring or loss.

Table 3 Impact of reducing style samples on performance of the DPCCD dataset.
Fig. 10: Restoration results of different methods on damaged DPCCD images.
figure 10

The red boxes highlight regions for the comparison of local inpainting results across methods.

To more clearly and intuitively demonstrate the performance of CINet under limited G and S samples, we present Table 4 and Fig. 11. Table 4 shows the inpainting effects of CINet as G and S decrease. It is evident that even with a minimal number of glyph categories and style samples (G = 1, S = 329), our model still outperforms most other models tested at G = 5 and S = 2629 (as shown in Table 2). This proves the effectiveness and applicability of our method under small-scale conditions. Figure 11a illustrates the variation trend of S. Specifically, Table 4 divides S into three groups, with G remaining consistent within each group, highlighting the variations in metrics across these groups. Figure 11b illustrates the variation trend of G. Here, Table 4 divides G into four groups, with S remaining unchanged within each group, demonstrating how the metrics change across the groups. Figure 11 visually demonstrates that CINet can maintain satisfactory results even with insufficient S and G samples, showcasing stronger robustness, particularly in SSIM, LPIPS, and StSc. Specifically, when G changes (from 5 to 1), SSIM, LPIPS, and StSc vary smoothly. PSNR decreases slightly (with a maximum change of 1.1614 dB), while FID increases slightly (with a maximum change of 1.8070). When S changes (from 2629 to 329), SSIM, LPIPS, and StSc show better robustness (with maximum changes of 0.0349, 0.0233, and 0.0275, respectively). PSNR decreases moderately (with a maximum change of 2.6331 dB), and FID increases (with a maximum change of 3.2777).

Table 4 Impact of reducing glyph instances and style samples on CINet performance.
Fig. 11: Impact of G and S samples on CINet performance.
figure 11

a Impact of seen style samples (S) on performance. b Impact of seen glyph instance samples (G) on performance.

Table 5 reports the influence of degradation levels on the effectiveness of various inpainting methods. For slight degradation (10–20%), the loss of image information is relatively small, allowing most methods to effectively leverage the remaining features, thus achieving better performance. However, under severe degradation (30–40%), significant loss of image information results in poor performance for methods that rely solely on image features. In comparison, our method consistently produces higher-quality results across all degradation levels. Notably, under conditions of severe degradation, CINet compensates for the loss of image information by utilizing glyph structure information, ensuring superior robustness. Specifically, under slight degradation conditions, most methods show good performance in terms of PSNR and SSIM; however, CINet achieves higher PSNR and SSIM values, with notable improvements in LPIPS and FID. For example, under 20% degradation on the DHWD dataset, CINet attains a PSNR of 28.5654 dB and an StSc of 0.8705, outperforming all other methods. CINet also achieves an FID score of 6.9545, which is 2.4368 lower than the second-best method, GSDM. Under high degradation conditions, the performance of most methods (e.g., RubGAN and FD-Net) significantly declines, while CINet maintains relatively stable performance. Moreover, CINet shows notable advantages in LPIPS and FID. For example, on the DPCCD dataset with 40% degradation, CINet achieves an FID of 6.3836, significantly outperforming the second-best method, Tformer (15.2614). Additionally, its LPIPS and StSc scores are 0.0820 and 0.9014, respectively, demonstrating clear superiority over other methods. To summarize, CINet delivers excellent performance across different degradation levels (10–40%) on the DPCCD and DHWD datasets, showcasing remarkable adaptability to complex scenes. Figure 12 shows the restoration results of the DPCCD and DHWD datasets. Our method restores the glyph structure more accurately, while other methods exhibit distortion and artifacts. Furthermore, we use Grad-CAM for visualization (as shown in Fig. 13). We overlay the heat maps generated by different methods onto the damaged images to highlight the areas the model focuses on. Red areas indicate high attention, while blue represents low attention. In CINet, the red-highlighted areas are concentrated on the glyph structure or missing parts, demonstrating that the model correctly focuses on the restoration areas, thereby achieving better results. In contrast, other models (e.g., Tformer) fail to adequately focus on the glyph structure, or their highlighted areas do not concentrate on glyph-related regions (e.g., CIDG, RCRN), leading to glyph distortion.

Table 5 Comparison of different methods under varying degradation ratios (10–40%) on damaged DPCCD and DHWD datasets.
Fig. 12: Restoration results of different methods on DPCCD/DHWD.
figure 12

a Visualization of different methods on DPCCD. b Visualization of different methods on DHWD.

Fig. 13: Visualization of Grad-CAM results on DPCCD.
figure 13

Red areas indicate high attention, and blue areas indicate low attention.

In Table 6, we report the performance of various methods on the DID dataset. Our method outperforms others across most metrics, with the exception of FID, where it ranks second, demonstrating significant advantages in real-inscription image restoration. This makes our model more suitable for real-world scenarios. Specifically, although CINet’s FID is 16.2541, which is only 0.4457 higher than the best-performing Tformer (15.8084), it surpasses both GSDM and Tformer in PSNR, SSIM, LPIPS, and StSc. CINet achieves SSIM and StSc values of 0.9564 and 0.9534, respectively, outperforming Tformer by 0.0115 and 0.0305. It also outperforms the second-best method, GSDM, by 0.9711 dB in PSNR. Furthermore, CINet has the lowest LPIPS score of 0.0370, which is significantly better than other methods, such as CIDG (LPIPS = 0.0628) and Charformer (LPIPS = 0.2935). Figure 14 displays the restoration results, where CINet ensures consistent restoration of the glyph structure, while comparison models (e.g., Tformer) retain unnecessary strokes. GSDM performs well in handling slight degradations but is less effective at removing spurious strokes, often resulting in the retention of incorrect structural information. Figure 15 presents the Grad-CAM visualization results, showcasing how CINet effectively focuses on the glyph structure, distinguishing real from fake features. In contrast, other models (e.g., Tformer and CIDG) fail to concentrate on the glyph regions accurately and instead focus on irrelevant areas, leaving meaningless strokes in the restored images.

Table 6 Comparison of performance on real-inscription images.
Fig. 14: Inpainting results of different methods on DID.
figure 14

Each row shows, from left to right: Damaged image, CIDG, DE-GAN, CycleGAN, RubGAN, FD-Net, RCRN, Charformer, Uformer, Tformer, CENet, GSDM, Ours and Ground Truth.

Fig. 15: Grad-CAM visualization results on DID.
figure 15

Red areas indicate high attention, while blue areas indicate low attention.

Ablation study

To evaluate the impact of different components on restoration performance, we analyze the performance fluctuations after removing each module on the DID dataset, as shown in Table 7. The results indicate that the removal of any module negatively affects CINet’s performance, underscoring the importance of module cooperation in achieving high performance. When removing a single component, the performance degradation is relatively minor, with significant fluctuations occurring only in specific metrics. For example, removing the FSM module leads to more noticeable increases in LPIPS and FID, which rise by 0.0018 and 3.5012, respectively, while its effect on SSIM is negligible. In contrast, removing the Semb has a more substantial impact on StSc, reducing it from 0.9534 to 0.9301. When multiple modules are removed together, the performance degradation is much more pronounced, highlighting strong synergistic effects among the modules. For example, in the CINet-FSM-CATEX-CAIMG-Semb-ETEX or CINet-FSM-CATEX-CAIMG-Semb configurations, all metrics show a significant decline, and the magnitude of this degradation is far greater than when removing a single module. Specifically, when all modules are removed, PSNR decreases by 1.2043 dB, and FID increases by 21.2852. It is noteworthy that even when all components are removed, the CSB branch, relying solely on CAIMG_TEX to transfer glyph information to the IB branch, still performs reasonably well, achieving a PSNR of 20.5643 dB and an FID of 37.5393. These results outperform most of the comparison methods (as shown in Table 6), highlighting the critical role of glyph information in inpainting inscription images.

Table 7 Ablation experiments of different components.

To evaluate the network’s adaptability and restoration performance under different damage shapes, we simulate circular, rectangular, and square damage patterns at a 20% degradation level on the DHWD and DPCCD datasets, with results shown in Table 8. The results are visualized in Fig. 16. The findings indicate that CINet exhibits robust restoration capabilities, showing minimal sensitivity to different damage shapes. This reflects the model’s strong generalizability across various damage patterns. Specifically, on the DHWD dataset, our model demonstrates consistent performance with only minor fluctuations across all metrics. The maximum variations are 0.0027 for SSIM, 0.0057 for LPIPS, and 0.0052 for StSc. On the DPCCD dataset, CINet similarly shows excellent robustness to different damage shapes. Notably, the maximum fluctuations are 0.0044 for SSIM (ranging from 0.9501 to 0.9545), 0.0027 for LPIPS (from 0.0365 to 0.0392), 0.4527 for FID (from 2.4913 to 2.9440), and 0.0059 for StSc (from 0.9728 to 0.9787).

Fig. 16: Restoration results of our method on DHWD/DPCCD for different damage shapes.
figure 16

The left side of the dashed line shows examples of different damage shapes, and the right side shows the restoration results of our method.

More visualization results

To evaluate the generalization capability of the proposed method across different Chinese calligraphy styles, we perform restoration experiments during the testing stage on Yan Zhenqing’s Epitaph of Guo Xuji (Yan script), Ouyang Xun’s Jiuchenggong Liquan Inscription (Ou script), and Liu Gongquan’s Diamond Sutra (Liu script) (see Fig. 17). The DID training set contains only Yan script samples, with no Ou or Liu script samples. Moreover, for the Yan script, the training and testing sets are strictly non-overlapping. The visualization results demonstrate that CINet can accurately inpaint glyphs and maintain style consistency in both unseen fonts (Ou and Liu scripts) and seen fonts with different samples (Yan script), evidencing its effectiveness and robustness under varying font style scenarios.

Fig. 17: Visualization of restoration results for Yan, Ou, and Liu scripts.
figure 17

The first row shows the damaged images, and the second row shows the results obtained using our proposed method. a Visualization of the testing results for Liu scripts. b Visualization of the testing results for Ou scripts. c Visualization of the testing results for Yan scripts.

To further validate the advantages and limitations of the proposed method, we select a 40% degradation ratio from the DPCCD dataset as the testing scenario, which provides a challenging setting and effectively evaluates the model’s inference ability under conditions of missing information. As shown in Table 5, under the 40% damage condition, CINet still achieves the best performance in all indicators. However, the visualization results (see Figs. 18, 19) show that some characters are still not fully restored. Specifically, samples with successful inpainting (see Fig. 18) generally retain most of their crucial structural elements, enabling the model to accurately infer missing parts based on existing structural information. In contrast, failed inpainting cases (see Fig. 19) often involve substantial loss of core strokes, characters with numerous strokes, or overlapping components, which significantly increases the restoration difficulty and leads to missing strokes, structural distortions, or incorrect completions, as highlighted by the green boxes. This indicates that while the proposed method performs well overall, it still faces challenges and has potential for improvement in inpainting Chinese characters with severe key structure loss and multiple overlapping components.

Fig. 18: Visualization of successful restoration examples.
figure 18

a DPCCD image with 40% degradation ratio. b Ground-truth images. c Inpainting results of our method.

Fig. 19: Visualization of failed restoration examples.
figure 19

a DPCCD images with 40% degradation ratio. b Ground-truth images. c Inpainting results of our method.

Table 8 Impact of different mask shapes on CINet performance

Discussion

We propose CINet, a specialized image inpainting network built with an in-depth understanding of Chinese character structures. CINet adopts a dual-branch architecture. The first branch, the CSB, generates high-quality representations of Chinese characters. Specifically, we utilize the text encoder from the CLIP model, pretrained for Chinese character image recognition, to provide additional supervisory information to the CSB. The second branch, the IB, incorporates a cross-attention mechanism to inject key glyph information, guiding the inpainting task. This enables the model to be more sensitive to glyph features and compensates for the limitations of relying solely on degraded image feature extraction. The cross-modal design of CINet also effectively addresses the challenge of insufficient inscription image data. Additionally, to improve the network’s ability to capture and preserve style, we integrate a style embedding module. This enhances the model’s precision in maintaining style consistency during inpainting. We demonstrate the superiority of CINet across multiple datasets, especially in scenarios with limited data and complex degradation, showing greater robustness and adaptability.