Abstract
Art’s unique style and creativity are essential in defining a work’s identity, conveying emotions, and shaping audience perception. Recent advancements in diffusion models have revolutionized art design, animation, and gaming, particularly in generating original artwork and visual identities. However, traditional creative processes face challenges such as slow innovation, high costs, and limited scalability. Consequently, deep learning has emerged as a promising solution for enhancing painting-style creative design. In this paper, we present the Painting-Style Design Assistant Network (PDANet), a groundbreaking network architecture designed for advanced style transformation. Our work is supported by the Painting-42 dataset, a meticulously curated collection of 4,055 artworks from 42 illustrious Chinese painters, capturing the aesthetic nuances of Chinese painting and offering invaluable design references. Additionally, we introduce a lightweight Identity-Net, designed to enhance large-scale text-to-image (T2I) models by aligning internal knowledge with external control signals. This innovative Identity-Net seamlessly integrates image prompts into the U-Net encoder, enabling the generation of diverse and consistent images. Through extensive quantitative and qualitative evaluations, our approach has demonstrated superior performance compared to existing methods, producing high-quality, versatile content with broad applicability across various creative domains. Our work not only advances the field of AI-driven art but also offers a new paradigm for the future of creative design. The code and data are available at https://github.com/aigc-hi/PDANet.
Introduction
Chinese painting is a traditional art with a history of more than 3,000 years, rich in content and diverse in subject matter1,2,3,4,5. This ancient art form encompasses many styles, divided among different dynasties and schools, and reflects the characteristics of traditional Chinese culture6. Compared to the labor-intensive and time-consuming nature of conventional hand-painting, computer-assisted image generation has led to a paradigm shift in efficiency and innovation7. Chinese painting embodies distinct aesthetic principles, particularly its use of scattered perspective, freehand brushwork, ink diffusion, and the artistic concept of “white space”. These unique artistic features pose significant challenges for conventional image generation models; to address them, our algorithm incorporates a cross-scale attention mechanism specifically designed to model these characteristics. By leveraging such technology, artists can quickly produce complex effects, saving time and costs while expanding the boundaries of artistic expression and unlocking new creative possibilities8. Generative Adversarial Networks (GANs) have long been a dominant approach in this area, revolutionizing computer-assisted image generation9. However, despite their impressive capabilities, GANs often struggle to maintain stable image quality, producing blurry or distorted results due to imperfect control over content generation10. In addition, GANs are susceptible to mode collapse, which limits the diversity of generated images and can severely restrict their applicability in real-world scenarios11.
Recent breakthroughs in generative art have opened up new avenues for artistic expression, including large-scale models such as Stable Diffusion (SD)12, Midjourney13, and DALL-E 314. Among them, SD has shown remarkable strength in generating Chinese landscape paintings, creating high-quality works with complex details and coherent structures. In particular, the artworks generated by SD faithfully capture the texture and layering characteristic of traditional ink paintings, demonstrating its ability to imitate the nuances of this ancient art form15. The success of SD can be attributed to its approach of combining style reference images with descriptive prompts to provide precise control over style and content16. This enables artists to create works that meet specific artistic expectations while also exploring new creative possibilities17. In addition, SD’s automatic generation process significantly shortens creation time, allowing artists to focus on conceptualization and creative exploration18. By using digital technology, SD promotes the seamless combination of traditional Chinese painting and modern technology, fostering innovation while protecting artistic heritage19.
This paper introduces the Painting-42 dataset, a comprehensive collection of images and text annotations designed specifically for Chinese painting styles. Building on this dataset, we present a groundbreaking model, the painting-style design assistant network (PDANet), characterized by lightweight training parameters and the ability to guide text inputs toward design references that meet predefined criteria. PDANet targets the challenges of text-to-image (T2I) generation, which synthesizes images from text prompts by leveraging large-scale data and powerful computing resources. While T2I generation has achieved high synthesis quality, it relies heavily on carefully crafted prompts and lacks flexible user control, often failing to reflect the user’s ideas accurately. This limitation leads to imprecise control and unstable results for non-expert users. To address this challenge, we propose Identity-Net, a lightweight model that learns effectively from relatively small datasets. Identity-Net enhances the pre-trained T2I diffusion model by providing supplementary guidance. We hypothesize that a compact adapter model can perform this role effectively by mapping control information onto the internal knowledge of the T2I model instead of learning a completely new generation function. In doing so, Identity-Net enables more flexible and controllable T2I generation, allowing users to produce high-quality images that accurately reflect their ideas.
In summary, the primary contributions of this article are as follows.
- We present the Painting-42 dataset, a comprehensive collection of 4,055 works by 42 renowned Chinese painters from various historical periods. This high-quality dataset is specifically tailored to capture the intricacies of Chinese painting styles.
- We propose PDANet, a lightweight architecture for style transformation that pioneers the application of diffusion models in creative style design. PDANet excels in faithfully emulating the distinct styles of specific Chinese painters, enabling the rapid generation of painting-style designs directly from text inputs.
- We introduce Identity-Net, a straightforward and efficient method that aligns the internal knowledge of T2I models with external control signals at minimal expense. Through comprehensive artistic evaluations conducted in a user study, we demonstrate improved user preference and validate the effectiveness of our approach.
Related work
The advent of deep generative models has ushered in a new era of digital creation in the field of painting, with computer algorithms pioneering new approaches in the field of generative art20. However, controllable painting style generation remains a relatively underexplored area, hampered by the scarcity of relevant data and the limited adaptability of existing design frameworks16. To address this challenge, we propose PDANet, a new method for generating paintings with a specific style using a latent diffusion model21. By leveraging the power of deep learning, PDANet can produce high-quality, stylistically unique paintings that are both beautiful and controllable.
Generative adversarial networks
The convergence of artificial intelligence and artistic expression has inaugurated a new era of innovative applications in painting and art. Recent breakthroughs in computer-aided design have concentrated on developing sophisticated rendering and texture synthesis algorithms, particularly in the domain of image stylization. The advent of neural network-based style transfer methods, which synthesize images that amalgamate the style of one image with the content of another22, has paved the way for novel creative possibilities. This paradigm has garnered considerable attention, sparking further investigation into deep learning-based techniques, notably through the use of convolutional neural networks (CNNs)23,24. The efficacy of these approaches has been extensively demonstrated in various applications, showcasing their potential to transform the field of artistic expression. Recently, researchers have investigated the incorporation of GANs into image style transfer25,26,27,28, thereby expanding the frontiers of this field. GANs, a groundbreaking deep learning paradigm primarily employed for unsupervised learning and creative data generation, comprise two essential components: the generator G and the discriminator D29. The generator G learns to produce realistic novel data instances by mapping random noise to approximate the real-world data distribution, while the discriminator D is trained to distinguish between authentic data and synthetic data produced by the generator. This interplay forms a dynamic game during training, driving both components to continually improve30. Ultimately, the objective is for the generator to produce high-fidelity samples that are indistinguishable from real data. Beyond image style transfer, GANs have found diverse applications in image restoration, music composition, video synthesis, and numerous other fields, underscoring their pivotal role in advancing generative modeling and pushing the boundaries of artificial intelligence.
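To make this adversarial game concrete, the following minimal PyTorch sketch shows one alternating training step for a generator G and a discriminator D; the networks, optimizers, and latent dimension are placeholders for illustration and do not correspond to any specific model discussed above.

```python
import torch
import torch.nn as nn

# Minimal sketch of one adversarial training step, assuming G maps noise to
# images and D outputs a real/fake logit; both are placeholder nn.Modules.
def gan_step(G, D, real_images, opt_g, opt_d, latent_dim=128):
    bce = nn.BCEWithLogitsLoss()
    batch = real_images.size(0)
    device = real_images.device
    noise = torch.randn(batch, latent_dim, device=device)

    # Discriminator update: distinguish real images from G's samples.
    fake_images = G(noise).detach()
    d_loss = bce(D(real_images), torch.ones(batch, 1, device=device)) + \
             bce(D(fake_images), torch.zeros(batch, 1, device=device))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator into labeling fakes as real.
    g_loss = bce(D(G(noise)), torch.ones(batch, 1, device=device))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```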
The proposal of the sketch and paint GAN (SAPGAN) is a pioneering contribution in this research field30, which plays an important role in generating sketches and paintings. This groundbreaking work introduced new concepts to the field, especially in the field of Chinese painting generation. Notably, SAPGAN represents the first end-to-end model designed to unconditionally create Chinese paintings, thus bridging the gap between artificial intelligence and art. The framework contains two basic components: a sketching-GAN that generates outlines and a paint-GAN that applies colors based on these outlines. By integrating these components, SAPGAN provides a powerful theoretical framework and practical methods for computer-generated artistic content. This technological breakthrough marks an important milestone in interdisciplinary research and highlights the potential of artificial intelligence (AI) to revolutionize artistic creation and push the boundaries of creative expression.
Painting style transfer
Style transfer entails the process of migrating stylistic elements from one image to another while preserving the content of the source image. This technique involves learning the intricate details of the target image and adapting them to the source image31. Recently, the application of generative artificial intelligence for text-guided image generation has gained significant traction, particularly with the advancements in large-scale diffusion models32,33,34,35. Notable examples include Glide36, Cogview37, Imagen38, Make-a-scene39, ediffi40, and Raphael41. These approaches often employ self-distillation techniques in conjunction with adapters for guided generation, requiring minimal additional training while maintaining the fixed parameters of the original model. In the realm of image style transfer, GANs provide a robust framework. For instance, Pix2Pix was one of the earliest applications of GANs to image pair translation tasks, effectively achieving style transfer. Subsequent studies have further refined style transfer using GANs and their variants. CycleGAN, for example, handles unpaired data, while architectures such as Anycost GAN optimize resource utilization on diverse computing platforms. These advancements underscore the versatility and ongoing development of GAN-based techniques for optimizing style transfer effects.
Moreover, we explore an inversion-based style transfer method (InST) that harnesses the capabilities of diffusion models to conceptualize painting styles through a trainable textual representation. InST leverages inverse mapping techniques to accurately and efficiently extract and transfer artistic style characteristics from a source image. This approach enables the synthesis of novel stylized content without the need for elaborate textual guidance, operating effectively even with a single painting as a reference. By doing so, InST streamlines the style transfer process, allowing for greater flexibility and creative control.
Specifically, Gaussian Splatting excels in efficient 3D scene representation, enabling real-time rendering and stylistic propagation within 3D environments42,43,44. In tasks such as Text-to-3D generation, Gaussian Splatting has demonstrated the ability to combine text conditioning with style transfer, allowing controlled generation of stylized 3D scenes. However, as our work centers on 2D Chinese painting generation, we did not include Gaussian Splatting in the main technical comparison due to its current focus on 3D representation. Nevertheless, we fully recognize its potential for future extensions of our framework toward 3D artistic scene generation and style transfer.
Diffusion models
Diffusion models45 have made significant progress in image synthesis by generating images from Gaussian noise via iterative denoising methods, which are based on rigorous physical principles governing the diffusion process and its reversal46,47. Recently, image diffusion models have attracted widespread attention in image generation32,48. Latent diffusion models (LDM)12 perform the diffusion step within the latent image space49, significantly reducing computational overhead. In the field of T2I generation, diffusion models leverage pre-trained language models such as CLIP50 to encode textual inputs into latent vectors, resulting in state-of-the-art image synthesis results. It is worth noting that SD is an upscaled implementation of LDM, while Imagen38 adopts a pyramid structure to directly diffuse pixels without involving the latent image, thus providing a unique approach to image synthesis.
Several recent approaches40,51,52 have been proposed to modify cross-attention maps in pre-trained T2I models, enabling the steering of the generation process without requiring additional training. A notable advantage of these methods is their seamless integration with existing models. For instance, AFA53 dynamically adjusts the contributions of multiple diffusion models based on various states, leveraging their strengths while suppressing their weaknesses. This approach differs from static parameter merging methods, offering a more adaptive and effective way to combine the capabilities of different models. In parallel efforts, Zhang et al.54 have explored the use of task-specific control networks to facilitate conditional generation using pre-trained T2I models. This line of research has shown promise in enabling more precise control over the generation process. Furthermore, diffusion models have been successfully applied to a wide range of scenarios, including pose-guided person image synthesis55, talking faces56, virtual dressing57, and story generation58. These applications demonstrate the versatility and potential of diffusion models in various domains. Therefore, it is essential to explore the potential of diffusion models in the synthesis of artistic style, a promising area of research that could benefit from the capabilities of these models.
Proposed method
Preliminary
The T2I diffusion model, specifically SD12, plays a crucial role in modeling the dynamic behavior and stable state distribution of complex systems by drawing an analogy between image generation and ink diffusion in water. This model comprises two primary processes, leveraging an auto-encoder and a modified UNet denoiser59. In the initial phase, SD employs an auto-encoder trained to encode images into a compact latent space, followed by reconstruction. This process enables the model to capture the essential features and patterns in the input data. The subsequent phase utilizes a modified UNet denoiser to directly refine this latent space, effectively removing noise and generating high-quality images. This streamlined approach can be formulated as follows:
where \({\mathbf{Z}}_{t}=\sqrt{\overline{\alpha }_{t}}\,{\mathbf{Z}}_0+\sqrt{1-\overline{\alpha }_{t}}\,{\mathbf{e}}\) represents the noise-processed feature map at step t, with \({\mathbf{e}} \sim {\mathscr {N}}(0,{\mathbf{I}})\) denoting the noise component. \({\mathbf{C}}\) represents the conditional information, and \({\mathbf{e}}_{\theta }\) is the UNet denoiser, parameterized by \(\theta\).
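As an illustration of this forward noising step, the snippet below adds noise to a clean latent \({\mathbf{Z}}_0\) at a sampled step t using a precomputed cumulative schedule; the linear beta schedule and T = 1000 steps are assumptions made purely for the sketch.

```python
import torch

# Sketch of the forward noising step:
#   Z_t = sqrt(alpha_bar_t) * Z_0 + sqrt(1 - alpha_bar_t) * eps
# A linear beta schedule with T = 1000 steps is assumed for illustration.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product, i.e. alpha_bar_t

def add_noise(z0: torch.Tensor, t: torch.Tensor):
    """z0: clean latents (B, C, H, W); t: integer timesteps (B,)."""
    eps = torch.randn_like(z0)
    a = alpha_bar.to(z0.device)[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps
    return zt, eps  # eps is the regression target for the denoiser
```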
During the inference process, the initial latent map \({\mathbf{Z}}_T\) is generated from a random Gaussian distribution. Given \({\mathbf{Z}}_T\), the denoiser \({\mathbf{e}}_{\theta }\) predicts noise estimates at each step t, conditioned on the conditional information \({\mathbf{C}}\). By subtracting these noise estimates, the latent map becomes increasingly refined, allowing recovery of the underlying signal. After T iterations, the resulting purified latent feature \(\hat{{\mathbf{Z}}}_0\) is fed into the decoder for image generation. In the conditional setup, SD leverages a pre-trained CLIP50 text encoder to embed the textual input into a sequence \({\mathbf{y}}\). This embedded sequence is then incorporated into the denoising process through cross-attention, enabling the model to effectively integrate textual information into the image generation process.
Here, \(\phi (\cdot )\) and \(\tau (\cdot )\) are two trainable embeddings. \({\mathbf{W}}_Q\), \({\mathbf{W}}_K\), and \({\mathbf{W}}_V\) are trainable projection matrices.
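For completeness, the cross-attention used to inject the text embedding can be written in the standard scaled dot-product form below, using the symbols defined above; this reconstruction assumes the usual formulation of latent diffusion models.

```latex
\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})
  = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V},
\qquad
\mathbf{Q} = \mathbf{W}_Q\,\phi(\mathbf{Z}_t),\quad
\mathbf{K} = \mathbf{W}_K\,\tau(\mathbf{y}),\quad
\mathbf{V} = \mathbf{W}_V\,\tau(\mathbf{y}).
```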
Identity-net design
We propose a multimodal diffusion model for the T2I architecture, in order to address the limitations of existing models in the generation of complex scenes. As illustrated in the upper part of Fig. 1, relying solely on individual text inputs in PDANet can lead to instability and hinder accurate structural guidance for image synthesis. This issue is not due to the model’s generation capabilities, but rather the challenge of providing precise generation guidance via textual inputs, which requires seamless alignment of internal knowledge from SD with external control. To overcome this limitation, we propose integrating multimodal inputs, which can be achieved at a low computational cost. This approach reduces the reliance on single conditions by enabling adapters to extract guidance features from various types of conditions. The pre-trained SD model keeps its parameters fixed, generating images based on input text features and additional guidance features. Using multimodal input, our model can effectively capture the complexities of the scene and provide more accurate structural guidance for image synthesis.
Our proposed Identity-Net, illustrated in the lower part of Fig. 1, is designed to be simple, lightweight, and efficient. It comprises four feature extraction blocks and three downsampling blocks to adjust the feature resolution. Initially, the conditional input has a resolution of \(512\times 512\), which is downsampled to \(64\times 64\) using the pixel unshuffle operation60. At each scale, a convolutional layer and two residual blocks (RB) are employed to extract the conditional feature \({\mathbf{F}}_{c}^k\). This process yields a multi-scale conditional feature \({\mathbf{F}}_{c}=\{{\mathbf{F}}_c^1, {\mathbf{F}}_c^2, {\mathbf{F}}_c^3, {\mathbf{F}}_c^4\}\), whose dimensions match those of the intermediate features \({\mathbf{F}}_{enc}=\{{\mathbf{F}}_{enc}^1, {\mathbf{F}}_{enc}^2, {\mathbf{F}}_{enc}^3, {\mathbf{F}}_{enc}^4\}\) in the encoder of the UNet denoiser. The features \({\mathbf{F}}_c\) and \({\mathbf{F}}_{enc}\) are subsequently combined at each scale. The process of conditional feature extraction and fusion can be formulated as follows:
where \({\mathbf{C}}\) represents the conditional input, and \({\mathscr {F}}_{AD}\) denotes the Identity-Net. Our proposed Identity-Net exhibits strong generalization capabilities and accommodates diverse structural controls, such as sketches, depth maps, semantic segmentation maps, and key poses. The condition maps corresponding to these modalities are fed directly into task-specific adapters to extract condition features, denoted as \({\mathbf{F}}_c\).
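A minimal PyTorch sketch of this adapter structure is given below. The channel widths, residual-block design, and strided downsampling convolutions are illustrative assumptions chosen to mirror the description (pixel unshuffle from \(512\times 512\) to \(64\times 64\), one convolution plus two residual blocks at each of four scales); they do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class IdentityNetSketch(nn.Module):
    """Illustrative adapter: pixel-unshuffle the 512x512 condition map to 64x64,
    then extract features at four scales (three downsamplings in between)."""
    def __init__(self, cond_ch=3, chans=(320, 640, 1280, 1280)):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)  # 512x512x3 -> 64x64x192
        prev = cond_ch * 8 * 8
        blocks, downs = [], []
        for i, ch in enumerate(chans):
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, padding=1),
                ResidualBlock(ch), ResidualBlock(ch)))
            downs.append(nn.Conv2d(ch, ch, 3, stride=2, padding=1)
                         if i < len(chans) - 1 else nn.Identity())
            prev = ch
        self.blocks, self.downs = nn.ModuleList(blocks), nn.ModuleList(downs)

    def forward(self, cond):
        x = self.unshuffle(cond)
        feats = []
        for block, down in zip(self.blocks, self.downs):
            x = block(x)
            feats.append(x)   # F_c^k, matched to the UNet encoder feature F_enc^k
            x = down(x)
        return feats          # later added to the corresponding encoder features
```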
In T2I tasks, CLIP plays a pivotal role in bridging the gap between textual descriptions and image representations. By leveraging both text encoders and image encoders, CLIP is trained using contrastive learning on a vast dataset of text-image pairs, thereby aligning patterns in the feature space. In the context of SD, employing a pre-trained CLIP text encoder to extract text embeddings from input text enables guidance of the denoising process. However, relying solely on text guidance may not sufficiently capture the intricate aesthetic characteristics inherent in Chinese painting. To mitigate this limitation, an additional image encoder is introduced to provide finer details, albeit necessitating parameter fine-tuning to adapt to painting-style design tasks. Our proposed PDANet method addresses these challenges by training adapters using frozen CLIP and SD, focusing on learning parameters specifically within linear layers, layer normalization, and cross-attention layers. This approach yields promising results without requiring extensive retraining, thereby streamlining the process.
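In a hypothetical training script, this selective freezing could be configured as sketched below; the module-name filters ("attn2", "norm", "to_k", "to_v") are assumptions about which sub-modules correspond to the cross-attention, normalization, and linear layers mentioned above.

```python
import itertools
import torch

# Sketch: freeze the pre-trained SD UNet and CLIP encoders, keep the adapter
# (Identity-Net) trainable, and optionally unfreeze only cross-attention,
# layer-norm, and linear projection parameters inside the UNet.
# The name filters below are illustrative assumptions, not exact module names.
def configure_trainables(unet, text_encoder, image_encoder, adapter,
                         tune_cross_attention=True, lr=1e-5):
    for p in itertools.chain(unet.parameters(),
                             text_encoder.parameters(),
                             image_encoder.parameters()):
        p.requires_grad_(False)                 # frozen backbones

    for p in adapter.parameters():              # lightweight adapter is trained
        p.requires_grad_(True)

    if tune_cross_attention:
        for name, p in unet.named_parameters():
            if any(key in name for key in ("attn2", "norm", "to_k", "to_v")):
                p.requires_grad_(True)

    trainable = [p for p in itertools.chain(unet.parameters(), adapter.parameters())
                 if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```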
Model optimization
During optimization, we maintain fixed parameters in SD while focusing on optimizing the Identity-Net. Each training sample consists of a triplet: the original image \({\mathbf{X}}_0\), the condition map \({\mathbf{C}}\), and the text prompt y. The optimization procedure mirrors that of SD. Specifically, starting with an image \({\mathbf{X}}_0\), we encode it into the latent space \({\mathbf{Z}}_0\) using the auto-encoder’s encoder. Subsequently, we randomly select a time step t from the interval [0, T] and introduce corresponding noise to \({\mathbf{Z}}_0\), yielding \({\mathbf{Z}}_t\). Mathematically, our Identity-Net is optimized as follows:
Where \({\mathbb {E}}\) denotes the expectation over multiple samples. Here, \({\mathbf{Z}}_{0}\) represents the initial latent variable, \(t\) is the time step or iteration count, \({\mathbf {F}}_c\) signifies conditional features or contextual information, and \(\epsilon \sim {\mathscr {N}}(0,1)\) indicates that the noise \(\epsilon\) is sampled from a standard normal distribution. The term \(||\cdot ||_2^2\) denotes the squared Euclidean norm, used to measure the error between the predicted noise \(\epsilon _{\theta }\) and the true noise \(\epsilon\). The model output \(\epsilon _{\theta }({\mathbf{Z}}_{t},t,\tau ({\mathbf{y}}),{\mathbf {F}}_c)\) depends on the parameters \(\theta\), the latent variable \({\mathbf{Z}}_{t}\) at time \(t\), the embedding \(\tau ({\mathbf{y}})\) of the label \({\mathbf{y}}\), and the conditional features \({\mathbf {F}}_c\).
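A compact sketch of one optimization step under this objective is shown below; it reuses the illustrative add_noise helper and adapter from the earlier snippets, and the down_block_additional_residuals keyword follows the diffusers-style interface as an assumption about how \({\mathbf {F}}_c\) is injected into the UNet encoder.

```python
import torch
import torch.nn.functional as F

def identity_net_loss(unet, adapter, z0, cond_map, text_emb, T=1000):
    """One optimization step of the objective above: predict the injected noise
    from (Z_t, t, tau(y), F_c) and regress it with a squared L2 loss."""
    t = torch.randint(0, T, (z0.size(0),), device=z0.device)
    zt, eps = add_noise(z0, t)          # forward noising (see earlier sketch)
    f_c = adapter(cond_map)             # multi-scale conditional features F_c
    # Interface assumed: a diffusers-style conditional UNet that accepts
    # additional encoder residuals and returns an object with a .sample field.
    eps_pred = unet(zt, t, encoder_hidden_states=text_emb,
                    down_block_additional_residuals=f_c).sample
    return F.mse_loss(eps_pred, eps)
```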
In the diffusion model, embedding time as a condition during sampling is crucial for effective guidance. Through our experiments, we have observed that incorporating time embedding into the adapter significantly enhances its guidance capabilities. However, this approach requires the adapter’s involvement in every iteration, which contradicts our goal of simplicity and compactness. To address this issue, we employ strategic training methods to optimize the adapter’s performance. Specifically, we segment the DDIM inference sampling process into three stages: early, middle, and late stages. We introduce guidance information at each of these stages to investigate its impact on the results. Interestingly, our findings reveal that adding guidance during the middle and late stages has a minimal impact on the results, suggesting that the primary content of the generated output is largely determined in the early sampling stage. Consequently, if t is sampled from the later stages, the guidance information tends to be ignored during training, leading to suboptimal performance. To bolster the adapter’s training, we adopt a non-uniform sampling strategy to increase the likelihood of t falling within the early sampling stage. We utilize a cubic function, \(t=(1-(\frac{t}{T})^3)\times T,\ t\in U(0,T)\), where t is sampled from a uniform distribution. This distribution helps address the issue of weak guidance observed with uniform sampling, particularly in tasks such as color control. The cubic sampling strategy effectively mitigates these weaknesses, leading to improved overall performance of the diffusion model.
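The cubic timestep sampling can be implemented directly from the formula above, as in the following sketch (T = 1000 is again an assumed number of diffusion steps):

```python
import torch

def sample_timesteps_cubic(batch_size: int, T: int = 1000) -> torch.Tensor:
    """Draw u ~ U(0, T), then map it through t = (1 - (u/T)^3) * T so that the
    sampled timesteps concentrate in the early (high-noise) stage, where the
    adapter's guidance has the strongest influence on the generated content."""
    u = torch.rand(batch_size) * T
    t = (1.0 - (u / T) ** 3) * T
    return t.long().clamp(0, T - 1)
```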
Inference stage
During training, we exclusively optimize Identity-Net while maintaining the parameters of the pre-trained diffusion model fixed. The Identity-Net is trained on the dataset containing image-text pairs, adhering to the same training objective as the original SD:
In addition, we randomly exclude image conditions during the training stage to facilitate guidance in the inference stage without relying on classifiers:
Here, we nullify the CLIP image embedding simply by setting it to zero when the image condition is excluded.
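A sketch of this random condition dropout during training is given below; the per-sample drop probability is an assumed hyperparameter.

```python
import torch

def maybe_drop_image_condition(image_embeds: torch.Tensor,
                               drop_prob: float = 0.1) -> torch.Tensor:
    """Randomly zero out the CLIP image embedding per sample so the model also
    learns the unconditional branch needed for classifier-free guidance.
    drop_prob is an illustrative hyperparameter."""
    keep = (torch.rand(image_embeds.size(0), device=image_embeds.device)
            >= drop_prob).float()
    return image_embeds * keep.view(-1, *([1] * (image_embeds.dim() - 1)))
```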
Since the text cross-attention and image cross-attention are separate, we can also modify the weighting of the image condition during the inference stage:
where \(\lambda\) is a weighting factor; when \(\lambda =0\), the model reduces to the original T2I diffusion model.
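In a decoupled cross-attention layer, this weighting amounts to adding the image-conditioned attention output, scaled by \(\lambda\), to the text-conditioned one. The sketch below assumes separate key and value projections for the text and image tokens; it illustrates the weighting rule rather than the exact layer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualCrossAttentionSketch(nn.Module):
    """Illustrative decoupled cross-attention: the text branch and the image
    branch attend separately, and the image branch is scaled by lambda."""
    def __init__(self, dim, ctx_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_img = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_img = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, x, text_tokens, image_tokens, lam: float = 1.0):
        q = self.to_q(x)
        z_txt = F.scaled_dot_product_attention(
            q, self.to_k_txt(text_tokens), self.to_v_txt(text_tokens))
        z_img = F.scaled_dot_product_attention(
            q, self.to_k_img(image_tokens), self.to_v_img(image_tokens))
        return z_txt + lam * z_img   # lam = 0 recovers the original T2I model
```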
Experiment and analysis
Experimental environment
All experiments were conducted on a single NVIDIA RTX 4090 GPU (24GB VRAM), paired with an Intel i9-14900K CPU (32 cores), 64GB DDR5 RAM, and a 1.8TB NVMe SSD. The software environment included Ubuntu 22.04.3 LTS, CUDA 11.7, cuDNN 8.5, and PyTorch 2.0.0. Under this setup, the peak GPU memory usage remained below 18.3GB during training, confirming the model’s efficiency on standard workstation hardware without requiring multi-GPU clusters.
In our experiments, the Stable Diffusion 1.5 base model enabled full-parameter fine-tuning with a peak memory usage of only 18.3GB, making it highly suitable for single-GPU deployment. Moreover, its U-Net architecture has been widely validated in art-related tasks, including art style transfer. Textual Inversion learns pseudo-word embeddings from 3-5 images to map visual concepts into text space for zero-shot adaptation61, while language-diffusion synergy integrates pretrained language models with diffusion architectures to enhance semantic alignment and photorealism, demonstrating efficient customization and high-fidelity generation38.
Painting-42
Collection. Despite the existence of numerous art datasets, such as VisualLink62, Art500k63, and Artemis64, that facilitate AI learning, there is a scarcity of datasets specifically focused on Chinese paintings from distinct historical eras with unique styles or techniques. To advance the development of Chinese painting within the AI learning field, it is essential to construct more accurate and comprehensive datasets tailored to these specific contexts. To address this gap, we meticulously curated a dataset of 4,055 Chinese paintings spanning various historical periods and diverse artistic styles, sourced from online platforms and artist albums. The resolution distribution of these paintings is detailed in Fig. 2. To ensure the stability and accuracy of the data, and to alleviate issues such as image blur, detail loss, distortion, or noise amplification, we standardized all paintings to a consistent resolution of \(512 \times 512\). This resolution is consistent with the training dataset specifications of the SD model, ensuring model stability and output quality; deviations from it could compromise the fidelity of the generated outputs. Throughout the standardization process, we placed significant emphasis on preserving the distinct painting styles, compositions, and aesthetic features characteristic of each historical era. Larger paintings were carefully cropped and segmented to ensure that each segment captured the essence of the original artwork. By doing so, we aimed to create a high-quality dataset that facilitates the development of AI models capable of generating authentic and diverse Chinese paintings.
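A minimal preprocessing sketch along these lines is shown below; resizing the shorter side to 512 pixels and tiling the longer dimension into non-overlapping crops are our illustrative assumptions, not the exact cropping policy applied to the dataset.

```python
from PIL import Image

def standardize_painting(path: str, size: int = 512):
    """Resize the shorter side to `size` pixels, then tile the longer dimension
    into adjacent size x size crops so each segment retains a coherent
    composition. Simple non-overlapping tiling is assumed for illustration."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    w, h = img.size
    crops = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            crops.append(img.crop((left, top, left + size, top + size)))
    return crops
```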
Electronic replicas of Chinese paintings often fail to capture the intricate details and nuances found in the originals. To address this limitation, we meticulously adjusted the image parameters to enhance the painting features while minimizing noise. This process involved careful screening to remove low-resolution and redundant works, resulting in a curated collection of 4,055 high-quality Chinese painting images. These images cover various categories and styles of ancient and modern Chinese art, including blue-green landscapes, golden and blue-green landscapes, fine brush painting, flower and bird painting, etc. They also feature characteristic techniques such as plain drawing and texture strokes, showcasing the unique artistic traditions of Chinese painting. Each image was selected to ensure it authentically represents its respective era, enabling comparisons between painters and their unique styles or techniques. This curated dataset serves as valuable material for advancing machine learning research and facilitating the preservation and innovation of Chinese painting art in the digital age. Notably, the dataset includes works from 42 renowned Chinese artists that span seven different dynasties in Chinese history, providing a comprehensive representation of the evolution of Chinese painting styles. This initiative marks the creation of the first Chinese painting style dataset tailored specifically for T2I tasks. The distribution among dynasties within the dataset is visualized in Fig. 3, illustrating the breadth and depth of the collection. By making this dataset available, we hope to contribute to the development of more sophisticated machine-learning models that can appreciate and generate Chinese paintings with greater accuracy and nuance.
Copyright compliance was a fundamental consideration in the construction of the Painting-42 dataset. The dataset primarily consists of ancient Chinese paintings that are in the public domain, sourced from open-access museum collections and archives. According to the Copyright Law of the People’s Republic of China (2020 Amendment)65, copyright protection for artworks expires 50 years after the death of the author. As the dataset exclusively includes works that have far exceeded this protection period, their use is fully compliant with legal requirements. No copyrighted modern artworks were included, ensuring that all images in the dataset are legally available for academic and research purposes.
Labels. To streamline the T2I generation process, we opted to extract elements and features of Chinese paintings from painting names and reannotate them using natural language, rather than annotating the paintings directly. The annotation workflow is illustrated in Fig. 4. Initially, we used BLIP266 for annotation and applied a filtering method (CapFilt) as an experimental approach. CapFilt leverages dataset guidance to filter out noisy, automatically generated image captions. However, after conducting a manual review and comparison, we identified areas where this method could be improved, particularly in accurately identifying the freehand or abstract expression techniques common in Chinese painting. To improve annotation quality, we consulted experts in Chinese art history and Chinese painting and refined the annotations for these types of images through detailed manual adjustments. Key enhancements included appending keywords such as “Chinese painting”, “era”, and “author” to each image description, highlighting core stylistic features of Chinese painting. Furthermore, we categorized the distinct painting techniques associated with painters from different eras, ensuring the model could accurately distinguish and recognize the expressive techniques characteristic of each era’s painters. Ultimately, we organized these annotated text-image pairs into a structured JSON file for seamless integration into our T2I generation pipeline.
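For illustration, one entry of such a JSON file might look like the following; the field names are hypothetical and only indicate how an image is paired with its caption and the appended style keywords.

```python
import json

# Hypothetical structure of one annotated text-image pair; the field names
# ("file_name", "caption", "dynasty", "artist") are illustrative, not the
# released schema of the Painting-42 annotation file.
entry = {
    "file_name": "tang_yin_0001.jpg",
    "caption": "Chinese painting, Ming Dynasty, by Tang Yin, "
               "a bird perched on a plum branch, delicate brushwork",
    "dynasty": "Ming",
    "artist": "Tang Yin",
}

with open("painting42_annotations.json", "w", encoding="utf-8") as f:
    json.dump([entry], f, ensure_ascii=False, indent=2)
```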
The image processing workflow for Painting-42 involves several steps. Initially, raw images were captioned using BLIP2 with CapFilt to generate adaptive captions, which were then meticulously validated and corrected. Additionally, manual cropping and enhancement techniques were applied to the original images to produce uniformly sized \(512 \times 512\) images along with corresponding labels.
Quantitative evaluation
All stylized reference images used for demonstration and comparison (including those in qualitative comparisons with DALL-E 3 and Midjourney) were strictly drawn from the held-out test set and were never involved in the model’s training process.
Our objective in assisted design is to ensure that the Chinese paintings generated by our model exhibit harmonious layouts, vibrant colors, delicate brushwork, and a cohesive composition that balances diversity with stability, all guided by the provided theme description. To achieve this goal, we have chosen metrics that primarily assess the coherence between text descriptions and the aesthetic quality of the generated images. To evaluate the alignment between text and images, we employ a suite of metrics, including CLIP Score70, LPIPS71, FID72, Visual Text Consistency (VTC), Accurate Style Learning (ASL), Aesthetics Preference (AP), and Creativeness. CLIP Score offers a comprehensive evaluation that considers factors such as color fidelity, texture, contrast, and clarity, translating these evaluations into numerical scores to assess overall image quality. LPIPS utilizes deep feature extraction and similarity calculation to provide a nuanced assessment of image quality and similarity. FID evaluates the quality and diversity of generated images by comparing the distribution of generated images with that of real images in a feature space. VTC is a crucial criterion for assessing model performance, as it ensures that the generated image accurately represents the details specified in the input text. ASL evaluates the model’s ability to authentically replicate the style of a specific artist. AP assesses the model’s capacity to generate visually appealing images while adhering to essential artistic principles. Creativeness evaluates imaginative image generation that showcases a unique style.
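For reproducibility, the automatic metrics in this suite can be computed with standard open-source implementations. The sketch below uses torchmetrics for CLIP Score and FID and the lpips package for LPIPS; the specific packages, CLIP backbone, and Inception feature dimension are our assumptions for illustration rather than the exact configuration used in our experiments.

```python
import torch
import lpips                                               # pip install lpips
from torchmetrics.multimodal.clip_score import CLIPScore   # pip install torchmetrics
from torchmetrics.image.fid import FrechetInceptionDistance

# Sketch only: assumes `gen` and `real` are uint8 image tensors (N, 3, H, W)
# with values in [0, 255] and `prompts` is the list of text descriptions.
def evaluate(gen: torch.Tensor, real: torch.Tensor, prompts: list[str]):
    clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    clip_score = clip_metric(gen, prompts)

    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real, real=True)
    fid.update(gen, real=False)

    lpips_fn = lpips.LPIPS(net="alex")                      # perceptual distance
    to_unit = lambda x: x.float() / 127.5 - 1.0             # LPIPS expects [-1, 1]
    lpips_val = lpips_fn(to_unit(gen), to_unit(real)).mean()

    return {"CLIP": clip_score.item(),
            "FID": fid.compute().item(),
            "LPIPS": lpips_val.item()}
```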
We conducted a comprehensive comparative analysis of our PDANet method against state-of-the-art models, including DALL-E 3, Midjourney, Midjourney + reference, DreamWorks Diffusion, and PuLID-FLUX, and present empirical evidence of our advances. To ensure a thorough evaluation across diverse categories and difficulty levels of text descriptions, we randomly selected eight prompts from PartiPrompts73, a dataset comprising over 1,200 English prompts. For each prompt, we generated 50 painting-style images, and the final scores were averaged over all generated images to provide a robust assessment. The results, summarized in Table 1, showcase PDANet’s impressive performance. Notably, PDANet attained a CLIP Score of 0.8147, the highest among all evaluated methods, demonstrating its exceptional ability to generate images that align with the input text descriptions. In terms of LPIPS, PDANet scored 0.5519, substantially surpassing the other models and underscoring its strength in producing visually coherent images. PDANet also achieved an FID of 2037, the best score among all models, demonstrating its strong performance in generated image quality. In the VTC, ASL, AP, and Creativeness evaluations, 60.6%, 68.2%, 53.0%, and 56.1% of users selected images generated by PDANet, significantly surpassing the proportion of users choosing images from the other models.
Prompts serve as controls randomly selected from a partial prompt corpus, incorporating style image and depth images as painting-style conditions. Our method is compared with images generated by state-of-the-art models, including DALL-E 3, Midjourney, Midjourney + reference, DreamWorks Diffusion, and PuLID-FLUX.
These findings collectively underscore PDANet’s remarkable capability to align generated images with input text descriptions, confirming its state-of-the-art status in producing painting-style images. The empirical evidence presented in this study validates the effectiveness of our proposed approach and highlights its potential for applications in T2I synthesis.
Qualitative analysis
In this study, we extract seven key prompts from the PartiPrompts corpus to generate images in various Chinese painting styles. We then apply these prompts to multiple models, including DALL-E 3, Midjourney, Midjourney + reference, DreamWorks Diffusion, PuLID-FLUX, and our PDANet model, to evaluate their performance.
Through systematic experiments and analyses, as illustrated in Fig. 5, we evaluated various methods for generating images in the style of Chinese paintings. Although the DALL-E 3 method displays certain features of this style, the artistic quality of its generated images is somewhat constrained. In contrast, the Midjourney and Midjourney + reference methods tend to produce more realistic images rather than true ink art renderings. The DreamWorks Diffusion and PuLID-FLUX methods generate images rich in detail, yet they struggle to replicate the stylistic traits of traditional artists accurately. Our proposed model framework utilizes a multimodal approach, integrating multiple modalities to extract and simulate essential stylistic elements of Chinese paintings, such as brushstroke features, visual rhythm, and color palette. Compared to these other methods, our approach offers enhanced learning capabilities and greater generation stability, enabling it to capture the complex and subtle characteristics of the Chinese painting style, resulting in more accurate and aesthetically pleasing images.
As illustrated in Fig. 6, we conducted comparative experiments against several well-established approaches, including LoRA (Low-Rank Adaptation), ControlNet, and IP-Adapter, to comprehensively evaluate the efficiency, controllability, and generation quality of our proposed method. The results reveal the following key observations: LoRA significantly improves training efficiency due to its parameter-efficient design, but it exhibits clear limitations in maintaining style consistency and fine-grained detail fidelity, especially in challenging generation tasks involving complex human poses. IP-Adapter demonstrates strong controllability but often suffers from reduced diversity and noticeable detail loss in generated images, limiting its applicability in scenarios requiring high-fidelity generation. ControlNet offers precise conditioning but imposes higher computational overhead and resource consumption during training and inference. In contrast, our proposed framework achieves a better balance between image quality, style consistency, and efficiency. It consistently generates photorealistic and style-consistent results while maintaining manageable computational costs, making it more practical for real-world deployment.
This study conclusively demonstrates the superiority of PDANet in comprehending and interpreting deep-level cue words, thereby establishing a novel theoretical and practical foundation for stylization generation in the T2I field. By leveraging its advanced capabilities, PDANet sets a new benchmark for T2I synthesis, enabling the generation of highly stylized and contextually relevant images that accurately capture the essence of the input text.
Our framework is style-agnostic and applicable to diverse art forms. While this study focuses on Chinese paintings due to dataset accessibility and their complex textures, the model design does not embed any bias toward Chinese art. Modules like the style attention layer and multi-scale fusion are based on generic visual feature modeling, enabling adaptation to various textures, compositions, and brushstroke patterns. By replacing or expanding the dataset, the framework can be fine-tuned for other styles such as Western oil paintings or modern digital art without architectural changes.
Ablation study
In this part, we aim to utilize several low-cost and simple adapters to extract more controllability from the SD model without affecting its original network topology and generation ability. We therefore focus on studying the manner of injecting multiple conditions and the complexity of our Identity-Net adapter. The results are summarized in Table 2, where PDANet (Ours) demonstrates excellent performance. PDANet’s CLIP Score is 0.8844, the highest among all evaluated configurations. In terms of LPIPS, PDANet scored 0.4763, far exceeding the other configurations. In terms of FID, PDANet scored 774, the best among all methods, demonstrating its outstanding performance in generated image quality. In the VTC, ASL, AP, and Creativeness evaluations, 62.1%, 65.2%, 63.6%, and 59.1% of users selected images generated by the PDANet method, far exceeding the proportion of users who selected images from the other configurations.
As shown in Fig. 7, we conducted a comprehensive evaluation of model performance using style images for ablation studies, and evaluated multiple methods for generating image styles in response to different cue words. The results demonstrate that PDANet exhibits exceptional proficiency in interpreting diverse prompt instructions, particularly excelling with complex and lengthy descriptions. It consistently generates Chinese painting-style images that faithfully match the given instructions, maintaining high fidelity in visual style while achieving notable strides in visual text consistency, accurate style learning, aesthetics preference, and Creativeness. Notably, PDANet’s performance is characterized by its ability to deliver consistently satisfactory results, showcasing its robustness and reliability in generating high-quality images that meet the desired standards. The model’s capacity to accurately interpret and respond to varied prompt instructions underscores its potential for applications in T2I synthesis, where the ability to generate diverse and contextually relevant images is paramount.
We first performed the ablation study to verify the effectiveness of each core module and determine the optimal combination of components and hyperparameters. After finalizing the model configuration based on the ablation results, we conducted the quantitative evaluation (Sec. 4.3) to benchmark the optimized model against baselines.
User study
The Figure presents questionnaire cases evaluating visual text consistency (VTC), accurate style learning (ASL), aesthetics preference (AP) and creativeness. In this case, (A) represents DALL-E 3, (B) represents Midjourney, (C) stands for Midjourney + reference, (D) depicts DreamWorks Diffusion, (E) depicts PuLID-FLUX and (F) represents PDANet, our proposed method.
In this user research evaluation, we conducted a comprehensive analysis of the images generated by the model across four key dimensions: visual text consistency, accurate style learning, aesthetics preference, and creativeness. To assess the model’s performance, users were asked to select the work that best aligned with these evaluation criteria from images generated by six different models. Visual text consistency was a crucial aspect of the evaluation, as it assesses the model’s ability to maintain thematic coherence in design creation, similar to evaluating a designer’s skill in this regard. This dimension forms a fundamental criterion for evaluating model performance, as it ensures that the generated images accurately reflect the content specified in the input text. By prioritizing visual text consistency, we can guarantee that the model produces images that are not only aesthetically pleasing but also contextually relevant and faithful to the original text.
Secondly, the assessment of accurate style learning aims at evaluating the model’s capability to faithfully replicate the style of a particular artist. This is achieved by measuring the similarity between the images generated by the model and the original works of the artist. In terms of aesthetics preference, a comprehensive quality assessment of the generated Chinese painting style images is conducted based on artistic principles, including composition, arrangement of elements, and visual expression. This dimension examines whether the model can produce visually appealing images while adhering to fundamental artistic principles.
Finally, in evaluating the creativeness, we focus on striking a balance between creativity and stability to generate imaginative images with unique styles. Our experimental results reveal that incorporating reference images as control conditions tends to constrain the model’s creativity, often compromising the delicate balance between creativity and stability in the generated images. To comprehensively assess the performance of PDANet in terms of style creativity and stability, we conducted a comparative analysis with state-of-the-art models, including DALL-E 3, Midjourney, Midjourney + reference, DreamWorks Diffusion, and PuLID-FLUX. This evaluation enables us to rigorously examine the strengths and weaknesses of PDANet in achieving a harmonious balance between creativity and stability.
Our research sample includes 52 survey questionnaires from 12 cities in China, among which 35 participants have art education backgrounds and a profound appreciation for Chinese painting art. This ensured the evaluation’s professionalism and reliability. As illustrated in Fig. 8, we labeled the generated images from six models as follows: (A) DALL-E 3, (B) Midjourney, (C) Midjourney + reference, (D) DreamWorks Diffusion, (E) PuLID-FLUX and (F) PDANet (Ours).
The user study comparing PDANet with state-of-the-art methods reveals user preferences. Here, we introduce four metrics for evaluating generative models in design tasks: visual text consistency, accurate style learning, aesthetics preference and creativeness. The questionnaire content corresponding to these metrics is detailed in Fig. 7.
In terms of visual text consistency, the image generated by PDANet more accurately captured the essence of the keywords provided, while Models (A), (B), (C), (D) and (E) exhibited varying degrees of deviation in replicating the art style. This disparity highlights the superiority of PDANet in terms of style fidelity. In terms of accurate style learning, we assessed each model’s capacity to learn and reproduce the style inspired by the works of the renowned Ming Dynasty painter Tang Yin. Characterized by exquisite elegance and delicate, serene brushwork, Tang Yin’s paintings are a hallmark of Chinese art. Notably, PDANet demonstrated a high level of consistency with Tang Yin’s style in terms of visual aesthetics and brushstroke quality, underscoring its ability to capture the nuances of artistic expression. In evaluating aesthetics preference, option F consistently exhibits a cohesive and visually striking expression throughout the creative process, showcasing a deep interpretation of the prompt “a painting of a bird, the paint by tang yin.” This option aligns closely with the respondents’ aesthetic expectations and demonstrates superior artistic merit compared to the alternatives. Ultimately, we conducted a thorough analysis of the stylistic effects of each model in generating various images, emphasizing content and style creativity. Our findings indicated that the first five models struggled to accurately capture the unique characteristics of Chinese paintings from specific periods, exhibiting notable limitations in creativeness, particularly concerning brushstrokes. In contrast, PDANet consistently delivered creative and high-quality outcomes, effectively embodying the aesthetic nuances of painting styles, composition, and imagery from diverse eras.
As illustrated in Fig. 9, the statistical data reveal that PDANet has garnered significant user preference and widespread recognition for its graphical consistency, stylistic precision, and aesthetic appeal. The results indicate that PDANet’s output is considered more visually appealing and style-consistent than the other options, emphasizing its ability to meet the aesthetic expectations of respondents. The user preference results reveal that, across the four dimensions of VTC, ASL, AP, and Creativeness, 60.6%, 68.2%, 53.0%, and 56.1% of users favored option F, respectively. This underscores PDANet’s advantage in maintaining stylistic stability in images. Such a strong preference highlights PDANet’s superiority in generating stable and visually appealing images that align with user aesthetic expectations.
Generated showcase
Generated images in various styles using the same depth image and reference images from different painters, following the method outlined in this paper. These images show a strong identity of each painter’s style, and each style is unique and recognizable at a glance, for example, the painting styles of Shen Zhou, Tang Yin, and Wang Hui.
Figure 10 shows the Chinese paintings generated by the proposed method, rendered in the painting styles of Shen Zhou, Tang Yin, and Wang Hui. These generated works strictly follow the aesthetic principles of their respective styles while achieving great richness and refinement in detail. Specifically, the outline lines, modeling techniques, and precise coloring in each picture are displayed clearly and vividly, fully demonstrating the unique artistic character of each style and accurately capturing the core charm of traditional Chinese painting.
It is worth noting that although these three Chinese paintings were created from the same depth image, the final results are each unique, demonstrating excellent stylized painting capability. The shapes of the trees and the distribution of leaves are accurately depicted, forming a good visual hierarchy and sense of space. The successful application of this method demonstrates its feasibility in practice: it not only reduces the technical difficulty of the creative process but also significantly improves creation efficiency, allowing designers to focus more on creativity and inspiration.
Discussion
In future work, we plan to incorporate Reinforcement Learning from Human Feedback (RLHF)74 by collaborating with art experts in the field of Chinese painting. These experts can not only evaluate the generated works but also provide structured feedback based on artistic principles and visual features, guiding the model toward better alignment with authentic artistic styles. Additionally, this process can help us identify potential gaps in stylistic imitation, such as inaccuracies in ink techniques or compositional elements. We believe this human-in-the-loop evaluation will significantly enhance the interpretability, controllability, and credibility of our system in real-world artistic applications.
Conclusion
In conclusion, this paper underscores the crucial role of painting style and creativity in defining the essence and uniqueness of art, as these elements reflect the artist’s personality and emotional tone, shaping audience perception, and enhancing the meaning and value of the work. The growing popularity of diffusion models in art design, animation, and gaming highlights their potential in original painting creation, poster design, and VI design. However, traditional creative processes are hindered by challenges such as slow innovation, high reliance on manual efforts, high costs, and limitations in large-scale replication. To address these challenges, we propose PDANet, a novel approach for style transformation that leverages the meticulously curated Painting-42 dataset, comprising 4,055 works by 42 renowned Chinese painters from various periods. Using this dataset, PDANet grasps the aesthetic intricacies of Chinese painting, providing users with rich design references. Furthermore, we introduce a lightweight Identity-Net for large-scale T2I models, which aligns internal knowledge with external control signals, enhancing the T2I model’s capabilities. The trainable Identity-Net inputs image prompts into the U-Net encoder to generate new, diverse, and stable images. Our extensive quantitative experiments and qualitative analyses demonstrate that our approach surpasses current state-of-the-art methods, delivering high-quality generated content with broad applicability. This generative solution represents a significant advancement over traditional computer-aided design practices, offering a more efficient and innovative approach to creative design through deep learning. Using the power of AI, our approach has the potential to revolutionize the creative industry, enabling artists, designers, and developers to produce high-quality content with unprecedented ease and efficiency.
Data availability
The code and data are available at https://github.com/aigc-hi/PDANet.
References
Cross-Zamirski, J. O. et al. Label-free prediction of cell painting from Brightfield images. Sci. Rep. 12, 10001 (2022).
Xu, X. A fuzzy control algorithm based on artificial intelligence for the fusion of traditional Chinese painting and AI painting. Sci. Rep. 14, 17846 (2024).
Tirandaz, Z., Foster, D. H., Romero, J. & Nieves, J. L. Efficient quantization of painting images by relevant colors. Sci. Rep. 13, 3034 (2023).
Nakauchi, S. et al. Universality and superiority in preference for chromatic composition of art paintings. Sci. Rep. 12, 4294 (2022).
Lambert, F. E. et al. Layer separation mapping and consolidation evaluation of a fifteenth century panel painting using terahertz time-domain imaging. Sci. Rep. 12, 21038 (2022).
Chen, N. & Li, H. Innovative application of Chinese landscape culture painting elements in AI art generation. Applied Mathematics and Nonlinear Sciences (2024).
Elasri, M., Elharrouss, O., Al-Maadeed, S. & Tairi, H. Image generation: A review. Neural Process. Lett. 54, 4609–4646 (2022).
Zeng, W., Zhu, H.-l., Lin, C. & Xiao, Z.-y. A survey of generative adversarial networks and their application in text-to-image synthesis. Electronic Research Archive (2023).
Yang, Q., Bai, Y., Liu, F. & Zhang, W. Integrated visual transformer and flash attention for lip-to-speech generation GAN. Sci. Rep. 14, 4525 (2024).
He, Y., Li, W., Li, Z. & Tang, Y. Gluegan: Gluing two images as a panorama with adversarial learning. In 2022 International Conference on Human-Machine Systems and Cybernetics (IHMSC) (2022).
Li, M., Lin, L., Luo, G. & Huang, H. Monet style oil painting generation based on cyclic generation confrontation network. J. Electron. Imaging 33, 120 (2024).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695 (2022).
Borji, A. Generated faces in the wild: Quantitative comparison of stable diffusion, midjourney and dall-e 2. arXiv preprint arXiv:2210.00586 (2022).
Ramesh, A. et al. Zero-shot text-to-image generation. In International conference on machine learning, pp. 8821–8831 (PMLR, 2021).
Wu, Y., Zhou, Y. & Xu, K. A scale-arbitrary network for Chinese painting super-resolution. In 2023 ACM Symposium on Applied Computing (2023).
Wang, W., Huang, Y. & Miao, H. Research on artistic style transfer of chinese painting based on generative adversarial network. In 2023 International Conference on Artificial Intelligence and Information Technology (ACAIT) (2023).
Cheng, Y., Huang, M. & Sun, W. VR-based line drawing methods in Chinese painting. In 2023 International Conference on Virtual Reality (ICVR) (2023).
Xu, H., Chen, S. & Zhang, Y. Magical brush: A symbol-based modern Chinese painting system for novices. In 2023 ACM Symposium on Applied Computing (2023).
Yang, G. & Zhou, H. Teaching Chinese painting color based on intelligent image processing technology. Applied Mathematics and Nonlinear Sciences (2023).
Goodfellow, I. et al. Generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 2672–2680 (2014).
Ramesh, A. et al. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
Gatys, L. A., Ecker, A. S. & Bethge, M. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423 (2016).
Nikulin, Y. & Novak, R. Exploring the neural algorithm of artistic style. arXiv preprint arXiv:1602.07188 (2016).
Novak, R. & Nikulin, Y. Improving the neural algorithm of artistic style. arXiv preprint arXiv:1605.04603 (2016).
Li, C. & Wand, M. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, 702–716 (Springer, 2016).
Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232 (2017).
Liu, Y., Qin, Z., Wan, T. & Luo, Z. Auto-painter: Cartoon image generation from sketch by using conditional Wasserstein generative adversarial networks. Neurocomputing 311, 78–87 (2018).
Zhao, H., Li, H. & Cheng, L. Synthesizing filamentary structured images with GANs. arXiv preprint arXiv:1706.02185 (2017).
Goodfellow, I. et al. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
Xue, A. End-to-end Chinese landscape painting creation using generative adversarial networks. In Proceedings of the IEEE/CVF Winter conference on applications of computer vision, pp. 3863–3871 (2021).
Gatys, L. A., Ecker, A. S. & Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015).
Dhariwal, P. & Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34, 8780–8794 (2021).
Shen, F. et al. Imaggarment-1: Fine-grained garment generation for controllable fashion design. arXiv preprint arXiv:2504.13176 (2025).
Shen, F. & Tang, J. Imagpose: A unified conditional framework for pose-guided person generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024).
Shen, F. et al. Long-term talkingface generation via motion-prior conditional diffusion model. arXiv preprint arXiv:2502.09533 (2025).
Nichol, A. et al. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
Ding, M. et al. Cogview: Mastering text-to-image generation via transformers. Adv. Neural Inf. Process. Syst. 34, 19822–19835 (2021).
Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022).
Gafni, O. et al. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, 89–106 (Springer, 2022).
Balaji, Y. et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022).
Xue, Z. et al. Raphael: Text-to-image generation via large mixture of diffusion paths. Adv. Neural Inf. Process. Syst. 36, 41693 (2024).
Kerbl, B., Kopanas, G., Leimkühler, T. & Drettakis, G. 3D gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42, 139–140 (2023).
Wu, G. et al. 4D Gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 20310–20320 (2024).
Chen, Z., Wang, F., Wang, Y. & Liu, H. Text-to-3d using Gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21401–21412 (2024).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265 (PMLR, 2015).
Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32 (2019).
Kingma, D., Salimans, T., Poole, B. & Ho, J. Variational diffusion models. Adv. Neural Inf. Process. Syst. 34, 21696–21707 (2021).
Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883 (2021).
Radford, A. et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763 (PMLR, 2021).
Feng, W. et al. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032 (2022).
Hertz, A. et al. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
Wang, C. et al. Ensembling diffusion models via adaptive feature aggregation. arXiv preprint arXiv:2405.17082 (2024).
Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023).
Shen, F. et al. Advancing pose-guided image synthesis with progressive conditional diffusion models. In The Twelfth International Conference on Learning Representations (2023).
Wang, C. et al. V-express: Conditional dropout for progressive training of portrait video generation. arXiv preprint arXiv:2406.02511 (2024).
Shen, F. et al. Imagdressing-v1: Customizable virtual dressing. arXiv preprint arXiv:2407.12705 (2024).
Shen, F. et al. Boosting consistency in story visualization with rich-contextual conditional diffusion models. arXiv preprint arXiv:2407.02482 (2024).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234–241 (Springer, 2015).
Shi, W. et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883 (2016).
Gal, R. et al. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022).
Seguin, B., Striolo, C., diLenardo, I. & Kaplan, F. Visual link retrieval in a database of paintings. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part I 14, pp. 753–767 (Springer, 2016).
Mao, H., Cheung, M. & She, J. Deepart: Learning joint representations of visual arts. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1183–1191 (2017).
Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M. & Guibas, L. J. Artemis: Affective language for visual art. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11569–11579 (2021).
Standing Committee of the National People’s Congress. Copyright law of the people’s republic of china (2020 amendment). http://www.npc.gov.cn (2020). Accessed: 2025-03-06.
Li, J., Li, D., Xiong, C. & Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900 (PMLR, 2022).
Betker, J. et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2, 8 (2023).
Jaruga-Rozdolska, A. Artificial intelligence as part of future practices in the architect’s work: Midjourney generative tool as part of a process of creating an architectural form. Architectus, 95–104 (2022).
Guo, Z., Wu, Y., Chen, Z., Chen, L. & He, Q. Pulid: Pure and lightning id customization via contrastive alignment. arXiv preprint arXiv:2404.16022 (2024).
Hessel, J., Holtzman, A., Forbes, M., Bras, R. L. & Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021).
Ghazanfari, S., Garg, S., Krishnamurthy, P., Khorrami, F. & Araujo, A. R-LPIPS: An adversarially robust perceptual similarity metric. arXiv preprint arXiv:2307.15157 (2023).
Chong, M. J. & Forsyth, D. Effectively unbiased FID and Inception Score and where to find them. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6070–6079 (2020).
Yu, J. et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
Author information
Contributions
Y.F.Z.: writing-original draft and editing; Z.Q.L.: writing-review; Y.R.Q.: writing-original draft; X.N.W.: writing-original draft. All authors gave final approval for publication.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhao, Y., Liang, Z., Qiu, Y. et al. A novel flexible identity-net with diffusion models for painting-style generation. Sci Rep 15, 27896 (2025). https://doi.org/10.1038/s41598-025-12434-4
DOI: https://doi.org/10.1038/s41598-025-12434-4