Introduction

Chinese calligraphy, recognized by UNESCO as an Intangible Cultural Heritage, embodies over three millennia of Chinese cultural essence. Spanning from oracle bone script to regular and running scripts, this art form conveys unique aesthetic values and profound cultural connotations through the rhythmic interplay of brush and ink1. However, the natural degradation of physical carriers like paper and the diminishing number of master calligraphers place many invaluable inscriptions at risk of irreversible damage or loss. For instance, in celebrated works such as the “Preface to the Poems Composed at the Orchid Pavilion” (Lanting Xu), approximately 15% of the characters are estimated to exhibit missing strokes owing to their long transmission history. This underscores the critical need for the digitization of calligraphy, a task central to cultural heritage preservation. Such digitization must not only achieve high-fidelity reproduction of diverse artistic styles but also maintain the intricate structural integrity of Chinese characters.

Despite advancements in deep learning-based image generation2, existing calligraphy generation methods encounter persistent challenges, primarily in achieving robust style control and ensuring structural fidelity. Firstly, while models like CalliGAN3 have made strides by employing Long Short-Term Memory (LSTM) networks to encode component sequences, their sequential modeling approach exhibits inherent limitations. For characters with more than 4 components, the stroke breakage rate in generated characters increases by as much as 37%, highlighting the difficulty in capturing global structural dependencies. Secondly, prevalent methods often rely on one-hot encoding for style representation, which discretizes styles and struggles to capture the nuanced evolution within a single calligrapher’s repertoire across different periods4. For example, the stylistic divergence between Yan Zhenqing’s early work, “Multiple Treasure Pagoda Stele” and his later “Yan Qingli Stele” cannot be smoothly interpolated or adequately represented by existing models. These issues primarily stem from two fundamental limitations: (1) the inadequacy of traditional recurrent architectures in modeling the complex two-dimensional spatial relationships inherent in Chinese character structures, and (2) the lack of fine-grained control and continuous representation capabilities for calligraphic styles.

To address these bottlenecks, this paper proposes Calliformer, a novel disentangled style transfer framework for Chinese calligraphy generation, characterized by three key innovations. Firstly, we design a Structure-Aware Transformer encoder that introduces a structural attention bias mechanism. By decomposing Chinese characters into a graph-like structure where nodes represent components and edges denote their spatial relationships, our model employs a dynamically generated bias matrix to guide the self-attention mechanism. This focuses computational resources on critical structural areas, such as the interaction between left and right components in a “⿰” (left-right) structure, thereby enhancing structural coherence. Secondly, inspired by the success of perceptual feature extraction in style transfer, we replace discrete one-hot style labels with a content-aware style encoder. This module utilizes a pre-trained network to map reference calligraphy images into a continuous style space, enabling the generation of nuanced and stylistically faithful results. Finally, we introduce and will release CCTS-2025, a comprehensive calligraphy dataset annotated with explicit Chinese character structural relationships. This dataset comprises more than 47,700 character images from 35 renowned calligraphers, with each character annotated with its component sequence (based on CNS5), a structural decomposition code, and a distinct style label.

The contributions of this study offer significant practical implications. It provides a high-fidelity generation scheme crucial for the digital restoration of damaged historical inscriptions. Furthermore, it supports the development of adaptive font generation systems for personalized calligraphy education and facilitates the creation of novel, stylistically diverse fonts for the digital cultural and creative industries. Experimental results on our CCTS-2025 dataset demonstrate the superiority of our approach: it achieves a Mean Squared Error (MSE) of 12.91, a marked improvement over benchmark models such as CalliGAN (19.49) and zi2zi (26.02), while incurring only an 18.7% increase in model parameters.

The remainder of this paper is organized as follows: Section II reviews related work. Section III details our proposed methodology. Section IV presents comprehensive experimental results and analyses. Section V concludes the paper and suggests future research directions.

Related work

Our work, framing calligraphy generation as a conditional image-to-image translation task, builds upon and diverges from several key lines of inquiry, primarily concerning Chinese calligraphy generation, style transfer, and the application of Transformer architectures in vision.

Chinese calligraphy generation

Early methods for font generation often relied on stroke-based approaches, which required extensive manual effort to define stroke libraries6. More recent deep learning methods have framed the problem as image-to-image translation. Models like zi2zi7 and Rewrite8 adopted Generative Adversarial Networks (GANs) to transfer styles between fonts.

Specifically for Chinese calligraphy, CalliGAN introduced a significant paradigm by encoding sequences of predefined character components using LSTMs, coupled with a U-Net generator for multi-style output. The core concept involves decomposing characters into components and leveraging sequence modeling to capture stroke-order features. However, this paradigm exhibits two significant limitations. Firstly, the chain-like structure of LSTMs imposes a strict sequential assumption, which struggles with the inherently two-dimensional topology of Chinese characters (e.g., the triangular “品” structure or the encompassing “国” structure), often leading to structural errors. Secondly, the component encoding relies on a fixed, predefined dictionary, restricting its ability to handle non-standard character forms or nuanced stroke variations found in diverse calligraphic styles. Another line of work, like Style-Aware GAN, focuses on style disentanglement but still relies on standard CNNs that may not adequately model complex structural rules9.

In contrast, our approach models Chinese characters with a structure-aware framework. A Structure-Aware Transformer encoder is guided by a dynamically generated bias matrix, which directs attention toward critical structural interactions. This method moves beyond simple sequential modeling to improve structural fidelity while preserving generative flexibility. Although Diffusion models have demonstrated remarkable performance in general image generation in recent years, GAN-based frameworks remain the most relevant technical baselines for tasks involving fine-grained structure and style control specific to Chinese calligraphy. Therefore, this paper primarily focuses on introducing improvements and conducting comparisons based on these approaches.

Fig. 1

Architecture of the Calliformer model. The proposed Calliformer employs an encoder-decoder framework (U-Net Generator) for calligraphy image synthesis. It integrates dedicated branches: a Graph Transformer-based Structure Encoder (Es, c) to process character component and structural information derived from the CCTS-2025 dataset indexed by the Unicode, and a Content-aware Style Encoder (Estyle) to extract artistic style from a reference calligraphy image. An internal image encoder (Ei) within the U-Net processes a standard character image for content. These structural (vstruct), style (vstyle), and content (vi) features are fused to guide the generation process. A discriminator provides adversarial feedback and performs style analysis (regression and classification). Key training objectives indicated include component and structure losses, pixel-wise reconstruction loss, adversarial loss, and style regression loss.

Style transfer

In the domain of style transfer, early methods often focused on matching statistical properties of feature activations from pre-trained networks, such as in the work of Gatys et al. A significant advancement came with Adaptive Instance Normalization (AdaIN)10, which demonstrated that style could be effectively transferred by aligning the mean and variance of content features with those of style features in real-time. This line of work established the power of using deep features from networks like VGG or ResNet to represent artistic style.

However, many conditional generation models, particularly in the font and calligraphy domain like CalliGAN3, still resort to simpler representations such as one-hot encoding. This approach treats styles as discrete, isolated categories, failing to capture the subtle continuity and shared characteristics among different calligraphers’ works. It also limits the model’s ability to generalize to new, unseen styles without retraining.

To address this, our work adopts a feature-based style encoding approach. We employ a pre-trained network to extract a style vector directly from a reference image. This method provides a rich, continuous representation of style, moving beyond the limitations of discrete labels and enabling more flexible and accurate style control.

Transformers in vision generation

The Vision Transformer (ViT) demonstrated that a pure Transformer architecture, with minimal CNN-based priors, can achieve state-of-the-art results in image classification11. This has spurred a wave of research applying Transformers to various computer vision tasks, including generation12. For character generation, GlyphGAN13 leveraged self-attention to model the global structure of characters. However, the direct adoption of the standard Transformer architecture faces challenges when applied to the highly structured nature of Chinese characters. There exist explicit, hierarchical structural constraints between Chinese character components (e.g., “⿱” for top-bottom, “⿴” for surrounding structures), whereas the standard self-attention mechanism treats all positional relationships agnostically, without inherent bias towards these established rules. Recent advancements have demonstrated the power of Graph Neural Networks (GNNs) in explicitly modeling the underlying structure and relationships within visual data, offering a potent alternative to the implicit spatial hierarchies learned by traditional CNNs. This is particularly relevant for tasks requiring a deep understanding of complex compositions, such as calligraphy generation14,15.

Although these methods were developed for classification under domain shift, their core principle—using explicit graph structures to analyze and represent complex visual grammars—provides valuable insights for generative tasks. Inspired by this line of research, our paper proposes a Structure-Aware Attention mechanism, which extends the standard scaled dot-product attention by incorporating a learnable structural bias, \({B_{struct}}\). This core idea is formulated in Eq. (1).

$${\text{StructAttn}}(Q,K,V,{B_{struct}})={\text{Softmax}}\left( {\frac{{Q{K^T}}}{{\sqrt {{d_k}} }}+{B_{struct}}} \right)V$$
(1)

Here, \({B_{struct}}\) represents a dynamic bias matrix derived from structural embeddings that encode the predefined relationships between components. This allows the model to automatically strengthen attention within structurally related regions and suppress irrelevant interactions, leading to improved structural integrity in the generated characters. Experimental results on our CCTS-2025 dataset validate this design, demonstrating improved generation quality.

Method

This section elaborates on the proposed Calliformer model, detailing its architecture, the encoding mechanisms for structure and style, and the training objectives.

Architecture of Calliformer

As illustrated in Fig. 1, Calliformer generates Chinese calligraphy by factorizing character representation into three key components: content, structure, and style. The model comprises three main modules: a Graph Transformer-based Structure Encoder (Es, c), a Content-aware Style Encoder (Estyle), and a U-Net based Generator, which are trained adversarially with a Discriminator that also performs style analysis.

(1) Inputs.

The model takes three types of data as input:

  1. The Unicode code point of the target character, used to retrieve predefined components and structural information.

  2. A reference calligraphy image for style extraction, or a pre-extracted style code from a specific calligraphy image.

  3. A standard font image (e.g., SimSun) serving as the content guide.

(2) Structure-aware transformer encoder.

This module processes the character’s component sequence and structural code retrieved from the CCTS-2025 dataset. As detailed in Section III-B, it produces a structural encoding \({v_{struct}}\) that captures the topological layout of the character.

(3) Content-aware style encoder.

As detailed in Section III-C, Estyle extracts a style vector vstyle from the reference image using a pre-trained ResNet backbone, capturing the essential stylistic attributes of the calligrapher.

(4) U-net generator.

The core of the image synthesis process is a U-Net Generator. It consists of an image encoder and an image decoder. The Image Encoder (Ei) takes the standard character image as input. It performs a series of downsampling convolutions to extract content-related features vi. These features capture the basic shape and identity of the character.

At the bottleneck of the U-Net, the content features vi are concatenated with the structural encoding vstruct (from Es, c) and the style embedding vstyle (from Estyle). This fused representation combines information about what character to write (vi), how its components should be arranged (\({v_{struct}}\)), and in what artistic style (\({v_{style}}\)).
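As a minimal sketch of this fusion step (the tensor widths below are placeholders, not the values in Table 1), the two 1-D codes can be broadcast to the bottleneck resolution and concatenated with the content features along the channel axis:

```python
import torch

# Placeholder dimensions for illustration only; the real widths follow Table 1.
B, d_model = 8, 256
v_i = torch.randn(B, 512, 1, 1)        # content features at the image-encoder bottleneck
v_struct = torch.randn(B, d_model)     # structural encoding from E_{s,c}
v_style = torch.randn(B, 512)          # style embedding from E_style

# Broadcast the 1-D codes to the bottleneck resolution and concatenate along channels.
fused = torch.cat(
    [v_i, v_struct.view(B, d_model, 1, 1), v_style.view(B, 512, 1, 1)], dim=1
)  # shape (B, 512 + 256 + 512, 1, 1), passed on to the image decoder
```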

The Image Decoder then takes this combined feature vector and progressively upsamples it through transposed convolutions and convolutional layers to generate the final stylized calligraphy image. Skip connections, inherent to the U-Net architecture, link feature maps from Ei to corresponding layers in the decoder, facilitating the preservation of fine-grained details from the input standard character image.

The architecture of the image encoder and decoder is shown in Table 1. All 8 encoder layers use the same 5 × 5 convolution kernel, batch normalization, a LeakyReLU activation with a slope of 0.2, and a stride of 2. Decoder layers L1 to L7 use the same 5 × 5 deconvolution kernel, a ReLU activation, and batch normalization. Decoder layer L8 uses the hyperbolic tangent activation function and a dropout layer with a drop rate of 0.5.

Table 1 Architecture of the image encoder and decoder.
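The PyTorch sketch below shows one way to realize the encoder and decoder layers described above. The channel widths, padding, and the placement of dropout in layer L8 are assumptions, since the text specifies only the kernel size, stride, normalization, and activations:

```python
import torch.nn as nn

def enc_block(in_ch, out_ch):
    # Encoder layer as described: 5x5 convolution, stride 2, batch norm, LeakyReLU(0.2).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def dec_block(in_ch, out_ch, last=False):
    # Decoder layers L1-L7: 5x5 transposed convolution, batch norm, ReLU.
    # Final layer L8: dropout (rate 0.5) and a tanh activation (ordering assumed).
    if last:
        return nn.Sequential(
            nn.Dropout(0.5),
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
            nn.Tanh(),
        )
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                           padding=2, output_padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```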

Discriminator and style analysis

A discriminator network is employed for adversarial training and style consistency. It takes either a real calligraphy image or a generated image as input. The Discriminator performs the real/fake classification for the adversarial loss.

This modular architecture allows Calliformer to disentangle content, structure, and style, providing a controllable framework for generating diverse and high-quality Chinese calligraphy. The interactions between these modules are guided by a comprehensive loss function, which will be detailed in the Loss function design section.

Fig. 2

Architecture of the graph transformer-based structure and component encoder.

Structure-aware transformer encoder

The Graph Transformer-based Structure and Component Encoder (\({E_{s,c}}\)) is engineered to precisely capture the topological relationships among a character’s constituent components and their individual identities. As depicted in Fig. 2, this encoder processes a target character, first by deriving its component sequence \({S_c}=\{ {c_1},{c_2}, \ldots ,{c_{{L_c}}}\}\) and a structural code \({S_s}\), where \({L_c}\) is the component sequence length. We use the Chinese Standard Interchange Code, which defines 517 components and 13 structures covering most Chinese characters, to build the CCTS-2025 dataset. Given a character’s Unicode code point h, we obtain its component sequence and structural code via a lookup in the CCTS-2025 dataset, as illustrated in Fig. 3. These codes are concatenated and then fed into a specialized structure-encoding Transformer module.

Fig. 3

Examples of component sequences. The first and second characters share the same component code k1 = 46, and the second and third characters share the same codes k2 = 48 and k3 = 81. The first and second characters have the same structure (left-right), and the second and third characters have the same structure (top-bottom).

Overall architecture of the structure encoding module

The core structure encoding module operates through the following steps:

Step 1: Input Embedding.

Component Embedding. The component identifier sequence \({S_c}\) is transformed into a sequence of dense vectors \({H_c} \in {{\mathbb{R}}^{{L_c} \times {d_{model}}}}\) using a component embedding function \({E_{comp}}:{{\mathbb{Z}}^{{L_c}}} \to {{\mathbb{R}}^{{L_c} \times {d_{model}}}}\). Each \({c_i}\) is mapped from a vocabulary of size \({V_c}\) to a \({d_{model}}\)-dimensional vector.

Structure Embedding. Similarly, the structural code \({S_s}\) is embedded into \({H_s} \in {{\mathbb{R}}^{{L_s} \times {d_{model}}}}\) by a structure embedding function \({E_{struct}}:{{\mathbb{Z}}^{{L_s}}} \to {{\mathbb{R}}^{{L_s} \times {d_{model}}}}\).

Both embedding functions handle padding for variable-length inputs.

Step 2: Sequence Combination and Positional Encoding.

The embedded component sequence \({H_c}\) and structure sequence \({H_s}\) are concatenated along the sequence dimension to form a unified hidden sequence \(H=[{H_c} \oplus {H_s}] \in {{\mathbb{R}}^{({L_c}+{L_s}) \times {d_{model}}}}\), where \(\oplus\) denotes concatenation. Then learnable positional encodings \(PE \in {{\mathbb{R}}^{({L_c}+{L_s}) \times {d_{model}}}}\) are added to H to provide the model with information about the relative or absolute positions of elements, resulting in the initial input sequence for the Transformer layers: \({X_0}=H+PE\). Let \(L={L_c}+{L_s}\) be the total length of this combined sequence.
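A minimal sketch of Steps 1 and 2 is given below. The vocabulary sizes follow the 517 components and 13 structures of CCTS-2025; the maximum sequence length, padding index, and \(d_{model}\) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StructureInputEmbedding(nn.Module):
    # Sketch of Steps 1-2: component/structure embedding plus learnable positional encoding.
    def __init__(self, num_components=517, num_structures=13,
                 d_model=256, max_len=16, pad_idx=0):
        super().__init__()
        self.comp_embed = nn.Embedding(num_components + 1, d_model, padding_idx=pad_idx)
        self.struct_embed = nn.Embedding(num_structures + 1, d_model, padding_idx=pad_idx)
        # Learnable positional encodings for the combined sequence (assumes L_c + L_s <= max_len).
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))

    def forward(self, comp_ids, struct_ids):
        # comp_ids: (B, L_c) integer component codes; struct_ids: (B, L_s) structure codes.
        h = torch.cat([self.comp_embed(comp_ids), self.struct_embed(struct_ids)], dim=1)
        return h + self.pos_embed[:, : h.size(1), :]   # X_0 = H + PE
```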

Step 3: Dynamic Structural Bias Generation.

A dedicated module, the Dynamic Structural Bias Generator (\({G_{bias}}\)), processes the original component sequence \({S_c}\) and structure code \({S_s}\) to dynamically compute a structural bias matrix \({B_{struct}} \in {{\mathbb{R}}^{{N_h} \times L \times L}}\), where \({N_h}\) is the number of attention heads. This matrix injects prior structural knowledge of Chinese characters into the attention mechanism, as detailed in the next part.

Step 4: Attention Masking.

A causal attention mask \({M_{attn}} \in {{\mathbb{R}}^{L \times L}}\) is typically generated (e.g., an upper triangular matrix where \({M_{attn}}[i,j]= - \infty\) if \(j>i\) and 0 otherwise). This mask prevents attention heads from attending to subsequent positions, enforcing a directional processing flow. This mask is broadcastable to the shape \([{N_h} \times L \times L]\) when applied.

Step 5: Structural Transformer Encoder Layers.

The sequence \({X_0}\) is processed by a stack of \({N_L}\) identical encoder layers. Each layer l takes the output \({X_{l - 1}}\) from the previous layer (with \({X_0}\) as the input to the first layer) and refines the representation using the structure-aware attention mechanism, incorporating \({B_{struct}}\) and \({M_{attn}}\).

Step 6: Output Representation.

After \({N_L}\) layers, the final output sequence is \({X_{{N_L}}} \in {{\mathbb{R}}^{L \times {d_{model}}}}\). The representation of the first token \({X_{{N_L}}}[0,:]\) is extracted as the global structural encoding \({v_{struct}} \in {{\mathbb{R}}^{{d_{model}}}}\). This vector is then typically reshaped (e.g., to \({{\mathbb{R}}^{{d_{model}} \times 1 \times 1}}\)) for integration with subsequent image generation networks.

Dynamic structural bias generation

The Dynamic Structural Bias Generator (DSBG) module is the key innovation that instills character-specific structural understanding into the Transformer. This module pioneers a sophisticated, two-stage process to inject explicit structural grammar into the self-attention mechanism, moving beyond generic attention patterns. The process is detailed below and illustrated conceptually in Fig. 4.

Stage 1: Component Weight Inference.

First, for each character, we infer a weight for every component based on the character’s primary structural type. This is not a static assignment but a dynamic process guided by learnable parameters.

  1. Learnable Split Ratios. The module maintains a learnable parameter, split_ratios, for each of the 13 fundamental structural types defined in our CCTS-2025 dataset (e.g., ⿰, ⿱, ⿴). These scalar ratios are initialized with empirically derived values (e.g., 0.5 for a left-right structure) but are fine-tuned during training.

  2. Component Partitioning. For a given character with M components and a specific structural type, the corresponding split_ratio is used to calculate a split_point. This dynamically partitions the ordered component sequence into two groups (e.g., the “left” group and “right” group for a ⿰ structure).

  3. Heuristic Weight Assignment. Based on the structural type, we assign pre-defined, heuristic weights to the components in each group. For instance, in a left-right structure (⿰), components before the split_point might receive a weight of 1.0, while those after receive 0.5. This creates a position-aware weight vector, pos_weights, for the character’s components.

Fig. 4

Detailed workflow of the dynamic structural bias generation for the character “好” (hǎo). This two-stage process translates the high-level structural code “⿰” into a nuanced pos_weights vector, which then seeds the creation of multiple factor matrices. A learnable combination of these factors produces the final, detailed attention bias.

This stage allows the model to learn the optimal partitioning of components for different structural patterns, a crucial step for handling complex character layouts.
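The sketch below illustrates Stage 1 for a single character. The clamping of the learnable ratio and the weight values 1.0/0.5 follow the left-right example in the text; everything else is an assumption:

```python
import torch
import torch.nn as nn

class ComponentWeightInference(nn.Module):
    # Stage 1 sketch: one learnable split ratio per structural type (13 in CCTS-2025).
    def __init__(self, num_structures=13):
        super().__init__()
        self.split_ratios = nn.Parameter(torch.full((num_structures,), 0.5))

    def forward(self, struct_type: int, num_components: int) -> torch.Tensor:
        # Partition the ordered component sequence at a learnable split point.
        ratio = self.split_ratios[struct_type].clamp(0.1, 0.9)
        split_point = max(1, int(torch.round(ratio * num_components).item()))
        # Heuristic weights, e.g. 1.0 before the split point and 0.5 after (left-right case).
        pos_weights = torch.full((num_components,), 0.5)
        pos_weights[:split_point] = 1.0
        return pos_weights

# Example: "好" (⿰ structure, 2 components) yields pos_weights = [1.0, 0.5] at initialization.
```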

Stage 2: Multi-Factor Bias Matrix Construction.

Second, using the inferred pos_weights, the DSBG constructs the final bias matrix, Bstruct, through a learnable linear combination of multiple structural factors.

  1. Structural Parameter Projection. The character’s structural code is passed through a learnable embedding layer and a linear projection, yielding three distinct, head-specific control parameters for each attention head: horizon, vertical, and region. These parameters govern the strength of different structural biases.

  2. Factor Matrix Generation. Three factor matrices are generated: a Regional attention matrix, a Vertical suppression matrix, and an Active region matrix.

  3. Final Bias Computation. The final component bias matrix for each attention head is a learned linear combination of these factors.

This composite bias is then integrated into the full attention matrix, providing a rich, multi-faceted structural signal that guides the model to generate structurally coherent characters. The entire process is formally described in Table 2.

Table 2 Dynamic structural bias generating algorithm.
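Because Table 2 is not reproduced here, the following sketch shows one plausible realization of Stage 2, building on the Stage-1 pos_weights above: the head-specific parameters (horizon, vertical, region) act as learned coefficients over three factor matrices. The concrete factor definitions below are assumptions; only the overall combination scheme follows the text.

```python
import torch
import torch.nn as nn

class DynamicStructuralBias(nn.Module):
    # Stage 2 sketch: head-specific parameters scale three factor matrices derived from
    # the Stage-1 pos_weights. The factor forms are illustrative assumptions.
    def __init__(self, num_structures=13, d_embed=32, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.struct_embed = nn.Embedding(num_structures, d_embed)
        self.to_params = nn.Linear(d_embed, 3 * num_heads)

    def forward(self, struct_type: torch.Tensor, pos_weights: torch.Tensor) -> torch.Tensor:
        # struct_type: (B,) structure codes; pos_weights: (B, L) component weights.
        B, L = pos_weights.shape
        params = self.to_params(self.struct_embed(struct_type)).view(B, 3, self.num_heads)
        horizon, vertical, region = params.unbind(dim=1)                 # each (B, H)

        regional = pos_weights.unsqueeze(2) * pos_weights.unsqueeze(1)   # mutual emphasis
        suppress = -torch.abs(pos_weights.unsqueeze(2) - pos_weights.unsqueeze(1))
        active = (pos_weights > 0.75).float().unsqueeze(2).expand(-1, -1, L)

        factors = torch.stack([regional, suppress, active], dim=1)       # (B, 3, L, L)
        coeffs = torch.stack([horizon, vertical, region], dim=1)         # (B, 3, H)
        return torch.einsum('bfij,bfh->bhij', factors, coeffs)           # B_struct: (B, H, L, L)
```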

Structural transformer encoder layer and attention

Each of the \({N_L}\) encoder layers implements a structure-aware version of the standard Transformer encoder layer.

The Structure-Aware Multi-Head Self-Attention module instantiates our proposed mechanism (1) across \({N_h}\) parallel heads. For each head h, the input X is projected to \({Q_h},{K_h},{V_h} \in {{\mathbb{R}}^{L \times {d_k}}}\). Building upon Eq. (1), the attention scores are computed by incorporating the structural bias Bstruct and a standard attention mask \({M^{\prime}_{attn}}\):

$${\text{Score}}{{\text{s}}_h}=\frac{{{Q_h}K_{h}^{T}}}{{\sqrt {{d_k}} }}+{B_{struct}}[h,:,:]+{M^{\prime}_{attn}}$$
(2)

Attention weights are obtained via softmax: \({A_h}={\text{softmax}}({\text{Score}}{{\text{s}}_h}).\) The addition of the per-head bias \({B_{struct}}[h,:,:]\) directly infuses structural priors into the score computation before the softmax normalization. The output for head h is \({O_h}={A_h}{V_h}\). Outputs from all heads are concatenated, \({O_{concat}}=[{O_1} \oplus \ldots \oplus {O_{{N_h}}}]\), and then passed through a final linear projection \({W_O}\):

$${\text{MultiHeadAttn}}(X,{B_{struct}},{M^{\prime}_{attn}})={O_{concat}}{W_O}.$$

By stacking these layers, the module progressively refines the component and structure representations, leveraging the injected structural bias at each attention step to learn a rich, structure-aware encoding \({v_{struct}}\) for the input character.
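A compact PyTorch sketch of the structure-aware multi-head attention defined by Eqs. (1) and (2) is shown below; the model width and head count are placeholders:

```python
import math
import torch
import torch.nn as nn

class StructureAwareMHA(nn.Module):
    # Multi-head self-attention with the additive structural bias of Eqs. (1)-(2).
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, b_struct, attn_mask=None):
        # x: (B, L, d_model); b_struct: (B, H, L, L); attn_mask: (L, L) additive mask or None.
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k) + b_struct   # Eq. (2)
        if attn_mask is not None:
            scores = scores + attn_mask                                     # broadcast over heads
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, self.h * self.d_k)
        return self.w_o(out)                                                # O_concat W_O
```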

Content-aware style encoder

Our framework employs a Style Encoder (Estyle) module to extract a high-level, disentangled representation of calligraphic style from a given reference calligraphy image. Instead of relying on discrete labels, this module captures the continuous and nuanced visual characteristics of a given calligraphic style.

For this task, we adopt a standard ResNet-34 architecture16,17, pre-trained on the ImageNet dataset18. We made a deliberate design choice to freeze the weights of this network during our training process.

The rationale for this approach is grounded in foundational work on neural style transfer and perceptual feature extraction. Pioneering research demonstrated that deep convolutional neural networks (CNNs), trained for object recognition on large-scale natural image datasets, learn a powerful hierarchy of visual features19. The lower layers of such networks capture basic pictorial elements like edges and textures, while deeper layers abstract away specific object identities to represent more complex textural and stylistic patterns.

The process is straightforward: the reference image is passed through the convolutional layers of the ResNet-34. We extract the output of the final average pooling layer, which produces a single feature vector. This vector, vstyle, serves as the style embedding for the input image. It encapsulates high-level stylistic information, such as stroke texture, weight, and overall composition, learned by the network.
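A minimal sketch of this encoder, assuming a recent torchvision release for the pretrained ResNet-34 weights, is:

```python
import torch.nn as nn
from torchvision import models

class StyleEncoder(nn.Module):
    # Frozen ImageNet-pretrained ResNet-34; the style vector is the output of the final
    # average pooling layer (512-D).
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        for p in self.features.parameters():
            p.requires_grad = False                                     # keep the backbone frozen

    def forward(self, reference_image):
        # reference_image: (B, 3, H, W); grayscale calligraphy can be replicated to 3 channels.
        return self.features(reference_image).flatten(1)                # v_style: (B, 512)
```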

This approach offers several advantages over the one-hot encoding used in prior work like CalliGAN. First, it creates a continuous style space, allowing for smooth interpolation between different styles. Second, by using a pre-trained network, it benefits from rich feature hierarchies learned from a massive dataset, enabling it to capture subtle details that define a calligraphic style. Finally, it is highly efficient as the backbone network’s weights are frozen during training, adding minimal computational overhead.

Discriminator

To enforce the generation of realistic and high-fidelity character images, we employ a conditional discriminator within our Generative Adversarial Network (GAN) framework. Specifically, we adopt the PatchGAN architecture originally proposed for image-to-image translation20.

Unlike a traditional discriminator that outputs a single scalar value classifying the entire image as real or fake, the PatchGAN discriminator operates on a different principle. It processes the input image convolutionally and outputs an N × N matrix, where each element corresponds to a decision on the authenticity of a specific overlapping “patch” or local region of the image. The final loss is then computed by averaging the responses across all patches.

PatchGAN is particularly well suited to the task of high-resolution character generation. The perceptual quality of a synthesized character is determined not only by its global plausibility but also by fine-grained, high-frequency details: the sharpness of stroke edges, the subtle texture of the ink, and the precise geometric forms of radicals. A standard discriminator might be fooled by an image with a correct overall shape but blurry or artifact-laden strokes. In contrast, the PatchGAN architecture acts as a “local texture and structure critic”. By penalizing unrealistic patches anywhere in the image, it compels the generator to produce convincing details across the entire canvas, effectively guiding the model to master the intricacies of calligraphic rendering. This local focus is instrumental in achieving the crisp, detailed results that our method produces. As shown in Table 3, the proposed discriminator has 5 layers. BN denotes a batch normalization layer.

Table 3 Architecture of the proposed discriminator.
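Since Table 3 is not reproduced here, the sketch below shows a generic 5-layer PatchGAN-style conditional discriminator; the channel widths, kernel sizes, and input channel count are assumptions:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # 5-layer PatchGAN-style critic: outputs an N x N map of real/fake decisions,
    # one per overlapping image patch. Channel widths and kernels are assumptions.
    def __init__(self, in_ch=2, base=64):
        super().__init__()
        def block(cin, cout, stride=2, bn=True):
            layers = [nn.Conv2d(cin, cout, 4, stride, 1)]
            if bn:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_ch, base, bn=False),         # layer 1
            *block(base, base * 2),                # layer 2 (with BN)
            *block(base * 2, base * 4),            # layer 3 (with BN)
            *block(base * 4, base * 8, stride=1),  # layer 4 (with BN)
            nn.Conv2d(base * 8, 1, 4, 1, 1),       # layer 5: patch-wise logits
        )

    def forward(self, standard_img, calligraphy_img):
        # Conditional input: the standard character concatenated with the real/generated image.
        x = torch.cat([standard_img, calligraphy_img], dim=1)
        logits = self.net(x)                       # (B, 1, N, N) patch decisions
        return logits.mean(dim=[1, 2, 3])          # average over patches for the adversarial loss
```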

Loss function design

This section details the comprehensive loss function system designed for Calliformer. The system aims to achieve a balance between precise structural generation, faithful style transfer, and robust component-wise accuracy. It builds upon a GAN framework, incorporating the structure and component encoder (\({E_{s,c}}\)), the style encoder (\({E_{style}}\)), a U-Net based Generator (G), and a Discriminator (D) with an adversarial head (\({D_{adv}}\)) and a style regression head (\({D_{style}}\)).

Notation:

x: input standard character image (providing content and base structure).

y: Real calligraphy image from the dataset.

\({v_{struct}}={E_{s,c}}(x)\): Structure and component code, a vector representing character components and their layout.

\({v_{style}}={E_{style}}(y)\): Style code, the output of the style encoder.

\(\hat {y}=G\left( {x,{v_{struct}},{v_{style}}} \right)\): Generated calligraphy image.

\({D_{adv}}\left( {x,y} \right)\): Output of the discriminator, predicting if an image is real or fake.

  1. Adversarial Loss (\({\mathcal{L}_{adv}}\)). This loss aligns the distribution of generated calligraphy with that of real calligraphy, conditioned on both style and component information. The Generator aims to produce images that the Discriminator cannot distinguish from real samples, while the Discriminator learns to differentiate them21. For the Generator:

    $${\mathcal{L}_{ad{v_G}}}= - {{\mathbb{E}}_{x,{v_{struct}},{v_{style}}}}[\log {D_{adv}}(x,G(x,{v_{struct}},{v_{style}}))].$$

    For the Discriminator (adversarial head \({D_{adv}}\)):

    \(\begin{aligned} {\mathcal{L}_{ad{v_D}}}= & - {{\mathbb{E}}_{x,y}}[\log {D_{adv}}(x,y)] \\ \quad & - {{\mathbb{E}}_{x,{v_{struct}},{v_{style}}}}[\log (1 - {D_{adv}}(x,G\left( {x,{v_{struct}},{v_{style}}} \right)))]. \\ \end{aligned}\)

  2. Pixel-wise Reconstruction Loss (\({\mathcal{L}_{pixel}}\)). An L1 loss between the generated image and the ground truth to enforce content and structural preservation, which helps stabilize training:

    $${\mathcal{L}_{pixel}}={{\mathbb{E}}_{x,y,{v_{struct}},{v_{style}}}}\left[ {{{\left\| {G(x,{v_{struct}},{v_{style}}) - y} \right\|}_1}} \right].$$

    The L1 norm is preferred over L2 for better preservation of sharp stroke edges.

  3. Style Regression Loss (\({\mathcal{L}_{style}}\)). This novel loss ensures that the generated image \(\hat {y}\) embeds the target style \({v_{style}}\) in a way that the style encoder can accurately retrieve it. It also trains \({D_{style}}\) and \({E_{style}}\) to be consistent. For the Generator:

    \({\mathcal{L}_{styl{e_G}}}={{\mathbb{E}}_{x,y}}\left[ {{{\left\| {{D_{style}}(x,\hat {y}) - {E_{style}}(y)} \right\|}_1}} \right].\)

    This forces G to generate images whose style, as perceived by \({D_{style}}\), matches the input style code \({v_{style}}={E_{style}}(y)\). This trains \({D_{style}}\) to accurately extract style codes from real images, and simultaneously refines \({E_{style}}\) to produce codes that \({D_{style}}\) can robustly map to.

  4. Component Consistency Loss (\({\mathcal{L}_{cyc}}\)). Maintains reversibility in the component encoding space. The component structure extracted from the generated image should match the original input component code.

    \({\mathcal{L}_{cyc}}={{\mathbb{E}}_{x,{y_s},{v_{struct}}}}\left[ {{{\left\| {{E_{s,c}}(G(x,{v_{struct}},{v_{style}})) - {v_{struct}}} \right\|}_1}} \right],\)  

    this trains G to respect the input component structure and \({E_{s,c}}(x)\) to be robust to G’s variations.

  5. Total Loss Functions. The overall training involves optimizing the parameters of the Generator, the Discriminator, \({E_{s,c}}\), and \({E_{style}}\). The total losses are weighted sums of the individual terms. For the Generator:

    $${\mathcal{L}_G}={\lambda _{ad{v_G}}}{L_{ad{v_G}}}+{\lambda _{pixel}}{L_{pixel}}+{\lambda _{style}}{L_{style}}+{\lambda _{cyc}}{L_{cyc}}.$$

    For the Discriminator:

    $${\mathcal{L}_D}={\lambda _{ad{v_D}}}({L_{ad{v_{Dreal}}}}+{L_{ad{v_{Dfake}}}}).$$

    The hyperparameters \({\lambda _*}\) balance the contribution of each loss term. This comprehensive design systematically addresses the challenges of structure-style decoupling, fine-grained style control, and component accuracy in calligraphy generation.
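The sketch below assembles the generator objective from the four terms above. The λ values and the binary cross-entropy realization of the non-saturating adversarial term are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def generator_loss(d_adv_fake, d_style_fake, fake_img, real_img, v_struct_pred, v_struct,
                   v_style, lambda_adv=1.0, lambda_pixel=100.0, lambda_style=1.0, lambda_cyc=1.0):
    # Weighted sum of the four generator objectives; the lambda values are placeholders.
    l_adv = F.binary_cross_entropy_with_logits(d_adv_fake, torch.ones_like(d_adv_fake))
    l_pixel = F.l1_loss(fake_img, real_img)            # pixel-wise L1 reconstruction
    l_style = F.l1_loss(d_style_fake, v_style)         # style regression consistency
    l_cyc = F.l1_loss(v_struct_pred, v_struct)         # component/structure consistency
    return (lambda_adv * l_adv + lambda_pixel * l_pixel
            + lambda_style * l_style + lambda_cyc * l_cyc)
```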

Experiments

To validate the effectiveness of our proposed model, Calliformer, we conducted a series of comprehensive experiments. This section first details the dataset and our annotation process, followed by the experimental setup, training protocols, and evaluation metrics. We then present a quantitative and qualitative comparison against two classic baseline models, zi2zi and CalliGAN, to demonstrate our model’s superior performance in generation quality and style fidelity. Finally, we report an ablation study and a human subjective evaluation.

Dataset and preprocessing

Dataset source and annotation

Our experiments are conducted on the high-resolution Chinese calligraphy dataset first introduced in the CalliGAN paper. This dataset contains thousands of character images across 7 distinct calligraphy styles from the regular script (楷书) category, covering masterpieces from renowned calligraphers like Yan Zhenqing and Liu Gongquan.

While we use the same image set, a key contribution of our work lies in the extensive and meticulous annotation process we established to create the explicit structural priors required by our model. This enriched dataset, which we refer to as CCTS-2025, enables a disentangled representation of content, structure, and style. The annotations, stored in a structured JSON file for each image, include:

  1. Component Sequence: We decomposed each Chinese character into a sequence of its fundamental components (e.g., radicals and other sub-structures). This provides the explicit structural information for our Graph Transformer encoder.

  2. Style Label: A unique style label was assigned to each image, corresponding to its source calligrapher. This label guides the training of the Style Encoder.

  3. Positional Information: We also encoded the positional and relational information for each component, which is used to generate the dynamic structural attention bias in our model.

To ensure the accuracy and reliability of these structural annotations, we implemented a rigorous protocol. The annotation was performed by three graduate students with expertise in Chinese linguistics. They followed a strict set of guidelines, primarily adhering to the decomposition standards in the Table of General Standard Chinese Characters (《通用规范汉字表》). For rare characters, historical lexicographical sources like the Kangxi Dictionary were consulted to maintain consistency. Each character was annotated independently by all three annotators.

To validate the quality of this process, we calculated the Inter-Annotator Agreement (IAA) using Fleiss’ Kappa (κ), achieving a score of κ = 0.92. This high level of agreement indicates the robustness of our annotation guidelines and the consistency of the collected data. Any disagreements were resolved by a majority vote or, in rare cases, adjudicated by a senior researcher. This meticulous process ensures that the structural information fed into our model is both accurate and reliable.
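For reference, Fleiss' κ can be computed directly from the per-character annotation counts, as in the following sketch (the shape of the 3-annotator count matrix is the only assumption):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j]: number of annotators assigning character i to decomposition category j.
    Each row sums to the number of annotators (three in our protocol)."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Observed per-item agreement.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    # Chance agreement from marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return float((p_i.mean() - p_e) / (1.0 - p_e))
```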

Experimental setup

Implementation details

Our model, Calliformer, was implemented using PyTorch. All experiments were performed on a server equipped with four NVIDIA RTX 4090 GPUs. To accelerate training, we utilized Automatic Mixed Precision (AMP). The training process for each model ran for 40 epochs with a batch size of 64.

We used the Adam optimizer for both the generator and discriminator, with β₁=0.5 and β₂=0.999. The learning rate was initially set to 0.0002 for the first 20 epochs and then decayed by 50% for the remaining 20 epochs. This setup ensures stable convergence and is consistent with the training practices of the baseline models.
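A minimal sketch of this optimization setup is shown below; the model objects are stand-ins, and the scheduler (a single halving at epoch 20) is one straightforward realization of the stated schedule:

```python
import torch
import torch.nn as nn

generator = nn.Linear(1, 1)       # stand-in for the Calliformer generator
discriminator = nn.Linear(1, 1)   # stand-in for the discriminator

# Adam with beta1 = 0.5, beta2 = 0.999; lr = 2e-4 for the first 20 epochs, halved afterwards.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[20], gamma=0.5)
sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[20], gamma=0.5)

scaler = torch.cuda.amp.GradScaler()  # automatic mixed precision, as used during training
```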

Evaluation metrics

To provide a thorough and objective assessment of our model’s performance, we employed two widely-used metrics for image generation quality:

  • Mean Squared Error (MSE, ↓): Measures the average squared difference between the pixels of the generated image and the ground truth image. A lower MSE indicates better pixel-level accuracy.

  • Structural Similarity Index (SSIM, ↑): Compares the structural information, luminance, and contrast between two images. A higher SSIM score (closer to 1) suggests that the generated image is structurally more similar to the ground truth, which aligns better with human perception22.
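A minimal evaluation sketch using scikit-image's SSIM implementation is shown below; whether the reported scores are computed on [0, 1] or [0, 255] pixel values is not specified in the text, so the data range here is an assumption:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate_pair(generated: np.ndarray, ground_truth: np.ndarray):
    # Grayscale character images as float arrays in [0, 1], shape (H, W).
    mse = float(np.mean((generated - ground_truth) ** 2))
    ssim_score = float(ssim(generated, ground_truth, data_range=1.0))
    return mse, ssim_score
```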

Quantitative comparison

We compared Calliformer against two influential baseline models: zi2zi, a pioneering GAN-based font generation model, and CalliGAN, a state-of-the-art method specifically designed for calligraphy generation. To ensure a fair comparison, we implemented both baselines in PyTorch and trained them in the same environment.

Table 4 Performance comparison with baseline models.

The results in Table 4 clearly demonstrate the superiority of our proposed model.

  • Calliformer significantly outperforms both baselines on both metrics. Compared to CalliGAN, our model reduces the MSE by 33.8% and improves the SSIM by 15.7%.

  • This improvement can be attributed to our model’s core design. Unlike CalliGAN, which uses a simple one-hot vector for style and a component encoder for structure, Calliformer employs a more powerful Transformer-based architecture. This allows the model to better capture long-range dependencies between strokes and more effectively integrate the detailed structural and component information from our JSON annotations. The attention mechanism is crucial for generating coherent and complex characters with high fidelity.

Qualitative comparison

In addition to numerical metrics, we provide a visual comparison of the results generated by Calliformer and the baseline models.

Fig. 5

Qualitative comparison of generated characters. For each column, the images are (from top to bottom): Input Character, CalliGAN, and Calliformer (Ours). Red rectangles highlight the benefits brought by the proposed design, which generates calligraphic images closer to the ground truth.

As shown in Fig. 5, Calliformer generates characters with sharper details and more accurate stroke structures. For instance, in characters with complex intersections or fine-tipped strokes, zi2zi often produces blurry or broken results. While CalliGAN improves upon this, it can sometimes fail to capture subtle stylistic nuances. In contrast, Calliformer consistently produces images that are not only structurally correct but also stylistically closer to the ground truth. This is particularly evident in the rendering of hooks and corners, where our model’s attention to structural detail excels.

Ablation study

To systematically evaluate the technical contributions of individual modules, we designed seven ablation configurations. Starting from a basic version of our model, we progressively added our main components: the Structure-Aware Transformer encoder and the Content-aware Style Encoder.

Table 5 Ablation study of Calliformer components.

As shown in Table 5, the results from the ablation study highlight the following:

  (1) Baseline (U-Net only): Our baseline, a simple U-Net structure similar to pix2pix, achieves reasonable but limited performance.

  (2) Adding the Structure-Aware Transformer: Integrating our Transformer-based structure encoder significantly improves performance, reducing MSE and increasing SSIM. This confirms that explicitly modeling the spatial relationships between components is crucial for structural integrity.

  (3) Integrating the Content-aware Style Encoder (Full Model): The most significant performance gain comes from replacing the simple one-hot style input with our ResNet-based style encoder and combining it with the structure encoder. This addition leads to a marked drop in MSE and a clear increase in SSIM, demonstrating that the combination of a rich, continuous style representation and a powerful structural prior enables the model to generate higher-quality and more faithful calligraphy. This synergistic effect confirms the value of our core architectural innovations.

Human subjective evaluation

To evaluate the model’s ability to not only generate structurally sound characters but also to faithfully capture the nuanced stylistic differences between various calligraphy masters, we conducted a comprehensive human subjective evaluation. This study was designed to assess whether outputs from our CalliFormer are perceptually preferred over those from the strong baseline, CalliGAN, when rendering distinct styles within the same script category.

Study design and procedure

We recruited 15 participants, comprising two groups: 5 certified calligraphy experts from the Chinese Calligraphers Association and 10 advanced calligraphy enthusiasts with extensive practice backgrounds.

The experiment was a double-blind, two-alternative forced-choice (2-AFC) preference test. The stimuli were generated from our dataset which contains 7 distinct styles from the Regular Script (楷书) category, including masterpieces from renowned calligraphers such as Yan Zhenqing and Liu Gongquan. For each of the 7 styles, we generated 20 common Chinese characters. In each trial, participants were shown three images on a calibrated display: a ground-truth character image establishing the target style, and two generated images of the same character—one from CalliFormer and one from CalliGAN. This protocol yielded a total of 2,100 independent preference data points (15 participants × 7 styles × 20 characters).

Results and analysis

The results, summarized in Table 6, demonstrate a clear and resounding preference for CalliFormer. Across all 2,100 trials, CalliFormer was preferred in 91.2% of the cases, a statistically significant margin that validates its superior ability to synthesize high-fidelity calligraphy.

Table 6 Human preference rate (%) of CalliFormer vs. CalliGAN across regular script styles.

Crucially, our model demonstrated exceptional performance on styles renowned for their rigorous structural logic. For the bold and majestic style of Yan Zhenqing and the bony, sinewy style of Liu Gongquan, CalliFormer achieved preference rates of 93.3% and a peak of 95.7%, respectively. This strongly suggests that our structure-aware Transformer architecture accurately captures the fundamental principles that define these canonical styles, a task where the baseline often falters by introducing subtle structural inconsistencies.

Limitations and future work

While our proposed CalliFormer framework demonstrates significant advancements in generating high-fidelity regular script calligraphy, we acknowledge its current limitations and identify several promising directions for future research. These areas not only address the model’s current constraints but also align with the broader goal of achieving more versatile and creative calligraphy generation.

Limitations

Our current work is primarily constrained by three factors. First, the reliance on a predefined, component-based structural prior, while highly effective for Regular Script (楷书), fundamentally limits the model’s applicability to more fluid and interconnected scripts like Running (行书) and Cursive (草书), where character structures are dynamic rather than fixed. Second, while our style encoding captures the general essence of a calligrapher’s style, it does not yet achieve the precision required to replicate the subtlest nuances of brushwork, such as ink feathering or pressure variations. Finally, generation errors can still occur for characters with exceptionally complex structures or a high stroke count, indicating room for improvement in handling extreme cases.

Future work

To address these challenges, we are actively pursuing the following research directions, which directly build upon the findings of this paper:

Extending to Fluid Scripts via Dynamic and Sequential Priors: Our next immediate goal is to overcome the static nature of our current structural priors. We propose a two-stage approach. First, we will explore data-driven structural priors, using self-supervised methods on large-scale datasets to learn the dynamic relationships between strokes and components directly from calligraphic data. Second, we plan to transition from component-level modeling to stroke-level sequence modeling. By treating calligraphy generation as a sequential process, similar to how a human calligrapher writes, we can naturally model the continuous flow and stroke linkages characteristic of Running and Cursive scripts.

Enhancing Fidelity and Creativity with Diffusion Models: As the next major phase of our research, we plan to integrate Denoising Diffusion Probabilistic Models (DDPMs) into our framework. We hypothesize that the iterative refinement process inherent to diffusion models is exceptionally well-suited to capturing the fine-grained textures and delicate details of calligraphy, such as ink dispersion on Xuan paper and the subtle variations in brushstrokes. While GANs excel at generating globally coherent structures, DDPMs offer superior performance in textural and detail synthesis. Our future work will investigate hybrid GAN-diffusion architectures to leverage the strengths of both, aiming not only for photorealistic replication but also for enabling novel creative applications, such as interpolating between styles or generating entirely new, aesthetically compelling calligraphic forms.

Dataset and Style Representation Enrichment: Concurrently, we will continue to expand our CCTS-2025 dataset to include a wider array of master calligraphers and standard styles. This will provide a richer foundation for all our modeling efforts. Furthermore, we will investigate more flexible and disentangled style encoding methods to move beyond mere style replication towards a system capable of genuine artistic exploration.

Conclusion

This study introduced CalliFormer, a modular deep learning framework designed to enhance both the structural accuracy and stylistic fidelity of Chinese calligraphy generation. We demonstrated that our key innovation, the Dynamic Structural Bias Generator (DSBG), significantly improves the structural integrity of generated characters by dynamically injecting explicit structural priors into the Transformer architecture. When combined with our style encoding and adversarial training modules, the framework produces high-quality results that are both structurally sound and stylistically consistent.

A notable finding of our work is the significant synergistic effect among these modules, where their integrated performance surpasses the sum of their individual contributions. This insight offers valuable guidance for the design of other complex generative systems. In summary, the methodology proposed in this paper not only contributes a robust and effective framework to the field of computational art but also paves the way for a rich research agenda aimed at the digital preservation, creative inheritance, and artistic exploration of this profound cultural heritage.