Introduction

Chinese traditional landscape painting stands as one of the foundational pillars of traditional visual culture, blending unique artistic techniques, aesthetic philosophies, and cultural symbolism. Originating during the Wei and Jin dynasties, flourishing in the Sui and Tang, and maturing in the Song dynasty, landscape painting evolved beyond the depiction of natural scenery into a medium for expressing cosmological views, reflections on life, and philosophical thought. Within the tradition of literati painting, it became an important vehicle for intellectuals to convey ideals, express emotions, and engage in meditative contemplation. The aesthetic principle of “valuing spirit over form” reflects a core tenet of Chinese culture: harmony between humans and nature.

Unlike Western painting traditions that emphasize linear perspective and realistic coloration, Chinese landscape painting employs ink and wash techniques to construct multi-layered, ethereal visual spaces. Through variations in ink tone, brushwork density, and compositional balance between void and substance, artists evoke intricate interactions among mountains, rivers, and clouds, ultimately forming a spiritual realm beyond visual realism. The spatial construction eschews strict geometric perspective, instead relying on techniques such as texture strokes, ink washes, and blank-leaving to establish depth and hierarchy. Scenes often feature mist-shrouded peaks and cascading waterfalls, unified in structure while imbued with the artist’s inner emotional and philosophical expression.

At the micro level, Chinese landscape painting emphasizes the integration of detailed brushwork with holistic composition. Elements such as moss dots, water ripples, and tree branches are rendered through nuanced ink modulation and brush control to convey the spirit of nature. Techniques like feibai (flying white), pima cun, and mi dian cun not only reflect the artist’s understanding of natural textures but also capture the flow of internal energy and rhythm1. This distinct visual language constructs a semi-abstract system that simultaneously captures realism and symbolism, making landscape painting a projection of the artist’s spiritual world. Due to its complex composition, abstract aesthetics, and multi-layered semantics, Chinese landscape painting not only poses unique and challenging tasks for computational image generation but also provides a rich research context for understanding AI’s role in cultural creation.

In recent years, text-to-image (T2I) generation has progressed rapidly, enabling computer-aided artistic creation. Leveraging large-scale paired text-image datasets, T2I models have achieved remarkable success in generating semantically aligned, high-fidelity natural images. Across disciplines, T2I has become a critical intersection of human-computer interaction (HCI), visual computing, and digital humanities2,3,4. Among various approaches, diffusion models have emerged as a dominant paradigm. Notable examples include DALL-E25, Imagen6, and Stable Diffusion7, which embed language into the image synthesis pipeline to enable end-to-end generation from textual descriptions. Compared to early GAN (generative adversarial network) methods, diffusion models generate images via iterative denoising, offering more stable training and higher-resolution details, which enhances their potential for simulating complex artistic styles and layered brushwork. DALL-E2 integrates CLIP and diffusion to enhance cross-modal alignment; Imagen uses a large-scale language model (T5) for semantic encoding; and Stable Diffusion introduces latent-space diffusion to reduce computation while maintaining quality. However, despite strong performance in general tasks, these models still face limitations in complex artistic scenarios, abstract concepts, and traditional style consistency.

To improve structure preservation and semantic controllability, several controllable generation methods have been proposed. T2I Adapter8 integrates external adapters for fusing edge maps, depth maps, and other conditional inputs; ControlNet9 explicitly injects structural priors into the U-Net backbone to enhance layout fidelity. Additionally, fine-grained editing methods have emerged, including Prompt-to-Prompt10 for constraint-based semantic editing via attention reweighting, InstructPix2Pix11 for instruction-based image modification, and SDEdit12 for refining images through noise injection and denoising. While these approaches improve controllability and interaction, they still face challenges in modeling artistic styles and recovering layered brushstroke textures.

In the Chinese context, models such as Tongyi Wanxiang13 by Alibaba Cloud, RAPHAEL14 by SenseTime, and WenXin 4.5 Turbo15 by Baidu have improved alignment between Chinese semantics and visual output, showing potential for generating Chinese-style imagery. However, these models are primarily trained on modern image styles and lack fine-grained structural and stylistic modeling for abstract traditional forms like shanshui. Research specifically focused on Chinese landscape painting synthesis remains scarce, partly due to the modeling difficulty and the lack of effective structure-aware generation frameworks. Such technical limitations imply that artists may experience divergent outcomes when applying AI to their creative practices, depending on their resources and skills. Against this backdrop, the AI divide reflects disparities among artists in accessing and utilizing AI tools, whereas the traditional digital divide focuses on inequalities in information technology and internet access. Understanding the distinction between these concepts helps evaluate the impact of technological change on the distribution of artistic cultural capital.

Several efforts have been made toward the generative modeling of traditional Chinese painting. Existing research spans style transfer16,17,18,19, cross-modal generation guided by poetry20, and text-to-image synthesis. For example, Sun et al.21 proposed a method based on VQGAN, using classical Chinese poetry as textual input to enhance the cultural ambiance of generated paintings. However, due to limitations inherent in GANs, such as mode collapse and poor texture resolution, their method struggles to reproduce the nuanced brushwork and ink texture characteristic of landscape paintings. To address these limitations, Peng et al.22 introduced the Fine-grained Hierarchical Semantic Adapter (FHS-adapter), which guides the diffusion process to better preserve artistic structure and semantics. While this approach improves image fidelity and style consistency, it requires high-quality paired datasets and precise text-image alignment and remains sensitive to training instability. In sum, research on the generation of Chinese landscape paintings remains scarce. This limitation arises not only from the intrinsic abstraction, compositional complexity, and stylistic subtlety of landscape painting but also from the unequal distribution of access to AI resources and skills among artists. To conceptualize such disparities, we adopt the notion of AI capital, which may be understood as an extension of Bourdieu’s framework of cultural capital. Specifically, AI capital refers to the capacity of artists to acquire, comprehend, and effectively utilize AI tools, and it operates in conjunction with economic, social, and symbolic forms of capital.

To address these challenges, we propose LFMDiff (LoRA Flow MMD Diffusion), a novel generation framework tailored for the synthesis of traditional Chinese landscape paintings. Our model integrates Hierarchical Local LoRA23, which performs block-wise low-rank adaptation across feature dimensions, enabling fine-grained control and improved generalization in low-data regimes while mitigating overfitting. We further incorporate dynamic rank and dynamic alpha mechanisms24, using a learnable gating strategy25 to prune inactive channels and adaptively adjust model capacity, thereby enhancing sparsity, interpretability, and efficiency. Additionally, we introduce a Conditional Normalizing Flow26 module in conjunction with a kernel-based multi-scale maximum mean discrepancy (MMD)27 objective. The adversarial component aligns generated outputs with prior stylistic distributions, while the multi-scale MMD captures style consistency at different levels of detail, thereby faithfully reproducing the layered brush aesthetics and spatial hierarchies unique to Chinese landscape art.

The main contributions of this paper are:

  • We propose LFMDiff, a novel diffusion-based model tailored for Chinese landscape painting generation, capable of fine-grained detail control and the creation of culturally rich, emotionally resonant artworks.

  • We construct CTLPD, a high-quality dataset containing 1449 images for text-conditioned Chinese landscape painting generation, providing a valuable benchmark for future research.

  • We conduct comprehensive qualitative and quantitative evaluations of LFMDiff, demonstrating its superior performance over several state-of-the-art baselines.

Methods

Overview

Our proposed LFMdiff framework is designed with four key components during the training phase, as illustrated in Fig. 1. (a) Text controller module: this module leverages Hierarchical Local LoRA to perform low-rank adaptation in subspaces of the text encoder. It incorporates dynamic rank and dynamic alpha mechanisms, where a learnable gating function controls each channel adaptively. (b) Diffusion generation module: this module encodes image features and generates images conditioned on text prompts, following a standard diffusion-based generative process. (c) Distribution alignment module: we introduce a conditional adversarial flow to enhance latent space modeling and integrate multi-scale MMD to measure distributional discrepancies across different scales. Additionally, a discriminator provides adversarial feedback to encourage the realism and naturalness of generated images. (d) Loss fusion module: this component combines three types of losses from different branches (MSE loss, multi-scale MMD loss, and GAN loss) for joint backpropagation, effectively guiding the overall optimization of the system.

Fig. 1: Schematic diagram of the LFMdiff network architecture.

The framework is divided into two stages: the training phase and the inference phase, with the primary innovations introduced during training. The training process consists of four modules: a Text controller module. b Diffusion generation module. c Distribution alignment module, and d Loss fusion module.

During inference, we utilize the trained Text Encoder and denoising UNet to perform conditional sampling. Specifically, given an input text prompt, the Text Encoder (enhanced with LoRA fine-tuning) generates a text embedding to guide the diffusion model in progressively denoising from initial Gaussian noise to a target latent representation. This latent is then decoded into the final image using the VAE decoder. Notably, the inference process does not rely on the discriminator or loss functions, retaining only the forward generation path, which ensures both efficiency and high image quality.

Text controller module

To enhance the fine-grained semantic guidance of text in Chinese landscape painting generation, we propose a text control module that integrates hierarchical local LoRA with a dynamic rank and scaling factor mechanism, as illustrated in Fig. 2. Building upon the original text encoder, this module further refines the modulation between text embeddings and the diffusion module, enabling both local detail awareness and global style alignment.

Fig. 2: Illustration of the text control module.

This figure illustrates the text control module used in the Transformer encoder, integrating Hierarchical Subspace Low-Rank Adaptation (LoRA) with dynamic rank selection and gating. The module is inserted into the query, key, and value (Q/K/V) projections of multi-head attention and the first linear layer of the multilayer perceptron (MLP). a Each LoRA branch contains two learnable low-rank matrices. The rank of each branch is dynamically determined during training. A binary gating vector controls channel-wise pruning by selecting active dimensions, represented through a diagonal matrix. A scaling factor adjusts the contribution strength of each branch. b Redundant LoRA channels are pruned by setting their corresponding gating values to zero. This enables structural sparsity by deactivating uninformative parameters. c For branches that remain active, additional rank capacity is allocated to improve expressiveness and representation ability.

We utilize pretrained models such as CLIP or T5 to convert the input text T into semantic embedding representations:

$${\rm{e}}=TextEncoder(T)\in {R}^{L\times d}$$
(1)

where RL×d denotes the dimensions of the embedding matrix e, which is of size L × d; each row corresponds to the embedding vector of a token.

The original LoRA (Low-Rank Adaptation) method adds a pair of low-rank matrices, U and V, to the original weight matrix \({\rm{W}}\in {R}^{{{\rm{d}}}_{out}\times {{\rm{d}}}_{in}}\), satisfying the following condition:

$$\begin{array}{r}{W}_{eff}=W+\alpha \cdot UV\end{array}$$
(2)

where α denotes the scaling factor, and rank(U) = rank(V) = r, where r represents the rank, i.e., the dimensionality of the low-rank approximation.
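The update in Equation (2) can be sketched in a few lines. The following NumPy illustration uses dimensions and an initialization that are our own illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 48, 4, 0.1     # illustrative sizes, not the paper's

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
U = np.zeros((d_out, r))                   # up-projection, zero-initialized
V = rng.standard_normal((r, d_in)) * 0.01  # down-projection, small random init

W_eff = W + alpha * (U @ V)                # Eq. (2): low-rank additive update

# With U initialized to zero, training starts exactly at the pretrained model.
assert np.allclose(W_eff, W)

# After training moves U away from zero, the update's rank never exceeds r.
U = rng.standard_normal((d_out, r))
assert np.linalg.matrix_rank(U @ V) <= r
```

Because only U and V are trained, the number of updated parameters scales with r(d_out + d_in) rather than d_out · d_in.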

The traditional LoRA method primarily performs low-rank approximations in the attention layers or linear layers, thereby reducing the number of parameters that need to be updated. However, Chinese landscape painting, in contrast to typical Western oil paintings or photographs, often exhibits highly hierarchical features both in local brushstrokes and overall structure. For instance, the layering of mountains and clouds, the extent of ink blending, and the treatment of empty space all present multi-layered characteristics. To better capture these hierarchical and spatial relationships, we propose the introduction of a local block-based low-rank structure. The input features are divided into several sub-blocks, each corresponding to an independent LoRA factor, thereby forming local representations:

$${W}_{eff}(x)=Wx+\alpha \cdot \mathop{\sum }\limits_{i=1}^{N}({U}_{i}{V}_{i}\cdot {M}_{i})x$$
(3)

where \(x\in {R}^{{d}_{in}}\) denotes the input feature vector, W is the weight matrix defined in Equation (2), \({{\rm{U}}}_{i}\in {R}^{{d}_{out}\times {\rm{r}}}\) and \({{\rm{V}}}_{i}\in {R}^{r\times {d}_{in}}\) represent the low-rank matrices of the i-th local sub-block, \({M}_{i}\in {R}^{{d}_{in}\times {d}_{in}}\) indicates the mask corresponding to the i-th sub-block, which is used to extract the relevant region from the input, and N refers to the total number of sub-blocks.
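Equation (3) can be made concrete with a small sketch. Here we assume the input dimension is split into N equal contiguous sub-blocks; the partition scheme and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, N, r, alpha = 32, 32, 4, 2, 0.1   # illustrative sizes
block = d_in // N

W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

out = W @ x                                     # base path: W x
for i in range(N):
    U_i = rng.standard_normal((d_out, r)) * 0.01
    V_i = rng.standard_normal((r, d_in)) * 0.01
    M_i = np.zeros((d_in, d_in))                # diagonal mask for sub-block i
    M_i[i * block:(i + 1) * block, i * block:(i + 1) * block] = np.eye(block)
    out += alpha * (U_i @ V_i @ (M_i @ x))      # Eq. (3): local low-rank branch
```

Each branch only sees the masked portion of the input, so different branches can specialize in different feature regions.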

In traditional LoRA, the rank r of the low-rank matrices is typically manually set to a fixed constant, and the scaling factor α is also defined as a constant (e.g., 1 or 0.1). However, in complex scenarios, the difficulty of feature expression and the distribution of details exhibit significant variability, such as the varying complexities of regions like ridges, ink wash, clouds, and blank space in Chinese landscape paintings. To address this, we propose the introduction of a learnable rank control mechanism and scaling mechanism, enabling the model to allocate appropriate capacity between easily learned and more challenging features.

We replace the original fixed scaling factor α with a separate, learnable parameter vector α = [α1, α2, . . . , αN] for each sub-module:

$${W}_{eff}(x)=Wx+\mathop{\sum }\limits_{i=1}^{N}{\alpha }_{i}({U}_{i}{V}_{i}\cdot {M}_{i})x$$
(4)

Each αi ∈ R is updated through gradients during training, learning the importance of the current sub-block, and a regularization term L1(α) can be introduced to enforce sparsity.

For the adaptive control of rank, either hard gating or soft gating can be employed, where a subset of LoRA channels is gradually “activated” or “frozen” during training, thereby dynamically adjusting the effective rank. The following simplified representation can be referenced:

$$\begin{array}{r}{{\rm{Z}}}_{i}={U}_{i}{V}_{i}\in {R}^{d\times d},{G}_{i}\in {R}^{d\times d}\end{array}$$
(5)

where Zi denotes the low-rank matrix product of the i-th branch, representing the transformation in the local subspace, and Gi is the corresponding gating matrix used to dynamically control the activation state of this branch.

By incorporating this into Equation (4), we obtain the final text control module, a hierarchically-aware, dynamically adjustable multi-branch LoRA module, formulated as:

$$\begin{array}{r}{W}_{eff}(x)=Wx+\mathop{\sum }\limits_{i=1}^{N}{\alpha }_{i}(x)\cdot (({Z}_{i}\odot {G}_{i}(x)\cdot {M}_{i})x)\end{array}$$
(6)

where ⊙ denotes element-wise multiplication, and Gi(x) represents input-dependent gating, applied either at the channel level or the element level.
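A channel-level variant of the gated update in Equation (6) can be sketched as follows. The sub-block masks are omitted for brevity, and the gates are drawn at random here rather than learned, so this is a sketch of the mechanism only:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, r_max = 16, 3, 8                      # illustrative sizes

W = rng.standard_normal((d, d))
x = rng.standard_normal(d)

out = W @ x
active_ranks = []
for i in range(N):
    U = rng.standard_normal((d, r_max)) * 0.01
    V = rng.standard_normal((r_max, d)) * 0.01
    gate = (rng.random(r_max) > 0.5).astype(float)  # binary channel gate G_i
    alpha_i = 0.1                                   # learnable per-branch scale
    Z = (U * gate) @ V          # zeroed gates prune rank channels, so the
    active_ranks.append(int(gate.sum()))  # effective rank of this branch shrinks
    out += alpha_i * (Z @ x)
```

In training, a differentiable relaxation (e.g. Gumbel-Softmax, as used in the implementation details) would replace the random binary draw so the gates can receive gradients.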

Diffusion generation module

The primary function of the diffusion generation module is to encode image features and facilitate text-guided image synthesis. Specifically, the input image I is first encoded into a latent variable z0 through a Variational Autoencoder (VAE):

$${{\rm{z}}}_{0}=VA{E}_{encoder}(I)$$
(7)

During the training phase, noise is then added to the latent variable z0 according to a predefined noise schedule (forward process), resulting in a noised latent variable zt at a specific time step t:

$${{\rm{z}}}_{{\rm{t}}}=\sqrt{{\bar{\alpha }}_{t}}{z}_{0}+\sqrt{1-{\bar{\alpha }}_{t}}\cdot \epsilon,\quad \epsilon \sim {\mathcal{N}}(0,I)$$
(8)

where \({\bar{\alpha }}_{t}\) denotes the noise attenuation coefficient (cumulative product) in the diffusion process, which controls how much of the original information is retained at the current time step. The ϵ represents random noise sampled from a standard normal distribution.

Subsequently, the U-Net learns a reverse diffusion process by receiving the noised latent variable zt, the time step t, and the text embedding e, and predicting the noise \(\hat{{\epsilon }_{\theta }}\), where the subscript θ denotes the model’s learnable parameters:

$$\hat{{\epsilon }_{\theta }}={\rm{UNet}}({z}_{t},t,e)$$
(9)

The training objective is to minimize the mean squared error (MSE) between the ground-truth noise ϵ and the predicted noise \(\hat{{\epsilon }_{\theta }}\):

$${L}_{{\rm{diffusion}}}={E}_{{z}_{0},\epsilon,t}[| | \epsilon -\hat{{\epsilon }_{\theta }}({z}_{t},t,e)| {| }^{2}]$$
(10)
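The forward noising and training objective (Equations (8)–(10)) can be sketched with a toy noise predictor standing in for the U-Net. The linear schedule and all sizes below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative product \bar{alpha}_t

z0 = rng.standard_normal((4, 8, 8))      # latent from the VAE encoder (Eq. 7)
t = 500
eps = rng.standard_normal(z0.shape)      # ground-truth noise

# Eq. (8): noised latent at step t
zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps

# Eqs. (9)-(10): a stand-in "U-Net" predicts the noise; MSE is the objective
eps_pred = np.zeros_like(zt)             # dummy predictor for illustration
loss = np.mean((eps - eps_pred) ** 2)
```

Since alpha_bar decreases monotonically toward zero, larger t retains less of z0 and injects more noise, exactly as the attenuation coefficient described above.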

Distribution alignment module

A single KL divergence (or similar conventional distributional distance) may be insufficient for accurately assessing the discrepancy between the model-generated distribution and the target style distribution. It often fails to capture the intricate characteristics of high-dimensional distributions, leading to overly “averaged” generative results. This deficiency makes it difficult to reproduce the higher-order structures and localized details intrinsic to Chinese landscape painting, such as the expressive brushwork and intentional use of negative space. To address this limitation, a Conditional Adaptive Flow (for example, a lightweight Conditional Normalizing Flow) can be integrated with a multi-scale MMD based on kernel methods. On one hand, the adversarial training objective helps align the generated distribution with the prior style distribution in a more targeted and stylistically faithful manner. On the other hand, the multi-scale MMD enables the model to capture the multi-level, fine-grained characteristics of brush techniques and compositional style across varying scales within Chinese landscape painting.

The lightweight adversarial adaptive flow defines a conditionally invertible transformation fθ, which maps the noise-perturbed latent variable \(\tilde{{z}_{t}}\) and the text encoding e jointly to an auxiliary space ω, and subsequently performs an inverse transformation back to the original space:

$$\omega ={{\rm{f}}}_{\theta }(\tilde{{z}_{t}}| e),\tilde{{z}_{t}}={f}_{\theta }^{-1}(\omega | e)$$
(11)

Adversarial learning is employed to align the model-generated distribution ω in the auxiliary space with the target prior distribution ω* as closely as possible. A lightweight discriminator Dφ is introduced to distinguish whether a given ω originates from the true reference distribution ω*. The adversarial loss is defined as:

$${L}_{{\rm{GAN}}}={E}_{{\omega }^{* }}[\log {D}_{\varphi }({\omega }^{* })]+{E}_{\omega }[\log (1-{D}_{\varphi }(\omega ))]$$
(12)

To align both local textures and global structural patterns, we introduce a family of multi-scale kernel functions \({\{{k}_{s}\}}_{s = 1}^{S}\) to impose distributional constraints between the generated and real distributions at multiple levels. The resulting multi-scale MMD loss is defined as:

$${L}_{{\rm{MMD}}}=\mathop{\sum }\limits_{s=1}^{S}({E}_{\omega,{\omega }^{{\prime} }}[{k}_{s}(\omega,{\omega }^{{\prime} })]-2{E}_{\omega,{\omega }^{* }}[{k}_{s}(\omega,{\omega }^{* })]+{E}_{{\omega }^{* },{\omega }^{* {\prime} }}[{k}_{s}({\omega }^{* },{\omega }^{* {\prime} })])$$
(13)

where the summation \(\mathop{\sum }\nolimits_{{\rm{s}} = 1}^{S}\) denotes multi-scale aggregation: the MMD is computed independently at each scale and the results are combined through a weighted sum.
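The multi-scale MMD of Equation (13) reduces to summing a standard kernel MMD estimate over several bandwidths. A schematic version with RBF kernels follows; the bandwidths are illustrative, and the per-scale weights are fixed to one:

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    """Pairwise RBF kernel between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def multiscale_mmd(gen, real, sigmas=(0.5, 1.0, 2.0)):
    """Eq. (13): one (biased) MMD^2 estimate per scale, summed over scales."""
    loss = 0.0
    for s in sigmas:
        loss += (rbf_kernel(gen, gen, s).mean()
                 - 2 * rbf_kernel(gen, real, s).mean()
                 + rbf_kernel(real, real, s).mean())
    return loss

rng = np.random.default_rng(0)
same = rng.standard_normal((64, 8))
assert multiscale_mmd(same, same) < 1e-8   # identical samples: zero discrepancy
assert multiscale_mmd(same, same + 2.0) > 0.0  # shifted distribution: positive
```

Small bandwidths penalize local texture mismatches while large bandwidths penalize global distributional shifts, which is the intuition behind aligning both brushstroke detail and overall composition.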

Loss fusion module

During training, the original diffusion reconstruction objective Ldiffusion can be integrated with the adversarial loss LGAN and the multi-scale MMD constraint LMMD to form the final composite loss:

$$\begin{array}{r}L={L}_{{\rm{diffusion}}}+{\lambda }_{1}{L}_{{\rm{GAN}}}+{\lambda }_{2}{L}_{{\rm{MMD}}}\end{array}$$
(14)

where λ1 and λ2 are weighting coefficients used to balance the adversarial objective, the MMD objective, and the core diffusion training objective.
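Equation (14) is a straightforward weighted sum. The λ values below follow the weights reported later in the implementation details (0.1 and 0.05), though the per-batch loss values are invented for illustration:

```python
def total_loss(l_diffusion, l_gan, l_mmd, lam1=0.1, lam2=0.05):
    """Eq. (14): fuse the three branch losses for joint backpropagation."""
    return l_diffusion + lam1 * l_gan + lam2 * l_mmd

# example per-batch values (illustrative only)
assert abs(total_loss(0.42, 1.30, 0.07) - 0.5535) < 1e-9
```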

Ethics statement

This study involving human participants was conducted in accordance with the principles of the Declaration of Helsinki. Ethical approval was obtained from the Ethics Committee of Xi’an University of Architecture and Technology. As the study only involved the online evaluation of computer-generated images and did not include the collection of sensitive personal data, the committee determined that a specific approval number was not required. Written informed consent was obtained from all participants prior to their involvement in the study.

Results

Dataset

Publicly available image-text paired datasets that explicitly describe both style and content remain scarce, particularly for traditional Chinese landscape painting, which is often characterized by visual ambiguity and indistinct boundaries. To address this gap, we construct a new dataset named CTLPD (Chinese Traditional Landscape Painting Dataset), consisting of 1449 paintings with diverse stylistic representations and their corresponding textual descriptions, aiming to advance controllable generation in this domain. All crawled images were sourced from public-domain collections or used with explicit permission to ensure compliance with copyright and ethical standards.

For image pre-processing, we incorporate 2192 images from the publicly released dataset by Alice Xue28 and collect an additional 5000 Chinese landscape paintings from online sources via web crawling, all of which were sourced from public-domain collections or used with explicit permissions. To ensure data quality, we manually filter out images that are non-landscape, presented on fans, have decorative borders, or exhibit low resolution or poor clarity. Each painting is first resized so that its shorter side is scaled to 512 pixels while preserving the original aspect ratio. For paintings with an aspect ratio less than 1.5, we apply center cropping to obtain 512 × 512 pixel images. For paintings exceeding this aspect ratio, we adopt a sliding window approach with a stride of 256 pixels to segment them into multiple 512 × 512 patches.
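The cropping rules above can be expressed as a small helper. The function below computes crop boxes under those rules; the exact rounding and boundary handling are our assumptions:

```python
def patch_boxes(w, h, size=512, stride=256, max_ratio=1.5):
    """Crop boxes after scaling the shorter side to `size`, per the rules above."""
    scale = size / min(w, h)
    w, h = round(w * scale), round(h * scale)   # resized dimensions
    long_side = max(w, h)
    if long_side / size < max_ratio:            # near-square: single center crop
        left, top = (w - size) // 2, (h - size) // 2
        return [(left, top, left + size, top + size)]
    boxes = []                                  # elongated: sliding window
    for off in range(0, long_side - size + 1, stride):
        if w >= h:                              # horizontal scroll
            boxes.append((off, 0, off + size, size))
        else:                                   # vertical scroll
            boxes.append((0, off, size, off + size))
    return boxes

# A 2:1 handscroll yields three overlapping 512x512 patches at stride 256.
assert len(patch_boxes(1024, 512)) == 3
# A near-square painting yields one centered crop.
assert patch_boxes(600, 512) == [(44, 0, 556, 512)]
```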

For text pre-processing, each image is input into caption generation models such as ChatGPT and Doubao. These models are prompted to describe specific visual elements and stylistic attributes, generating textual captions of no more than 77 characters, thus forming image-text pairs for Chinese landscape paintings.

To further improve the quality of the training data, we conduct a systematic cleaning and filtering process on the text-image pairs. First, we apply perceptual hashing (e.g., pHash) for images and text-based hashing techniques to identify and remove duplicate entries, reducing redundancy that may hinder model learning. Second, we eliminate anomalous samples, such as corrupted image files, empty captions, or clearly mismatched image-text pairs, to ensure dataset validity. Finally, we randomly sample approximately 1% of the dataset for manual inspection, verifying the correctness of the pairings and addressing any inconsistencies missed by automated procedures. This multi-stage process helps ensure the accuracy and consistency of the training data.
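The deduplication step can be illustrated with a minimal average-hash (aHash) in pure Python. A production pipeline would use a proper pHash implementation (e.g. the imagehash library); this sketch only conveys the idea of hashing-then-deduplicating:

```python
def average_hash(gray, hash_size=8):
    """aHash of a 2-D grayscale image (list of rows): threshold block means."""
    h, w = len(gray), len(gray[0])
    bh, bw = h // hash_size, w // hash_size
    means = []
    for i in range(hash_size):
        for j in range(hash_size):
            block = [gray[r][c]
                     for r in range(i * bh, (i + 1) * bh)
                     for c in range(j * bw, (j + 1) * bw)]
            means.append(sum(block) / len(block))
    overall = sum(means) / len(means)
    return tuple(v > overall for v in means)   # 64-bit perceptual fingerprint

def dedup(images):
    """Keep the first image for each distinct hash."""
    seen, kept = set(), []
    for img in images:
        code = average_hash(img)
        if code not in seen:
            seen.add(code)
            kept.append(img)
    return kept
```

Near-duplicates map to identical (or nearly identical) fingerprints, so comparing hashes is far cheaper than comparing full images pairwise.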

Implementation details

Our experiments are based on the pretrained Stable Diffusion v2.1 model, utilizing the CLIP ViT-L model as the image and text encoder. We build our model using the HuggingFace Diffusers library and apply LoRA (Low-Rank Adaptation) to enable lightweight fine-tuning of the diffusion model, ensuring both parameter efficiency and scalability. The total number of trainable parameters in our method is approximately 0.93M, significantly reducing storage and training overhead.

During training, the images are first resized proportionally and center-cropped to a fixed size of 512 × 512, with a random horizontal flipping augmentation strategy applied. The images are then normalized to a mean of 0.5 and a standard deviation (STD) of 0.5. We employ the AdamW optimizer with an initial learning rate of 1e-6, a cosine learning rate schedule, and a 100-step warm-up phase. Training is conducted over 10 epochs with a batch size of 16, and a gradient clipping threshold of 1.0 is applied to maintain training stability. The weight decay is set to 0.01.
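The learning-rate schedule can be written as a small function. The base LR and warm-up length match the settings above, while the exact cosine shape (decaying to zero) is our assumption:

```python
import math

def lr_at(step, total_steps, base_lr=1e-6, warmup=100):
    """Cosine learning-rate schedule with a linear warm-up phase."""
    if step < warmup:
        return base_lr * step / warmup            # linear warm-up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert lr_at(0, 1000) == 0.0                      # starts at zero
assert abs(lr_at(100, 1000) - 1e-6) < 1e-18       # peak LR right after warm-up
assert abs(lr_at(1000, 1000)) < 1e-18             # decays to ~0 at the end
```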

To enhance the expressiveness of the latent space, we incorporate a lightweight Conditional Adversarial Flow module to model stylistic guidance over the noise latent variables. In addition, a multi-scale MMD constraint is applied to encourage consistency in both local brushstroke patterns and global composition across the generated latent space. The adversarial loss and MMD loss are integrated into the overall training objective with weights of 0.1 and 0.05, respectively.

We further enable Hierarchical Local LoRA, where the input features are divided into eight local sub-blocks, each responsible for learning distinct style representations. A dynamic rank and dynamic α-gating mechanism is also introduced, with the maximum adjustable rank set to 32. We use Gumbel-Softmax to enable dynamic activation of sub-blocks during training.

Training is conducted on a single NVIDIA RTX 4090 GPU, and the entire process is integrated with Weights & Biases (WandB) for real-time monitoring of training metrics. During inference, we adopt DDIM sampling with 50 steps and a guidance scale of 7.5 to generate images for both subjective and objective evaluation.

Evaluation metrics

In this study, we adopt three evaluation metrics: CLIP-T Score29, Learned Perceptual Image Patch Similarity (LPIPS)30, and Fréchet inception distance (FID)31, to assess the perceptual similarity and overall quality of the generated landscape paintings.

The CLIP-T Score is designed to evaluate the relevance between the generated image and the input text by computing the cosine similarity between image and text feature vectors extracted using CLIP. This metric reflects how well the generated image aligns with the given textual prompt. A higher CLIP-T Score indicates stronger semantic alignment and generally higher image quality. In this study, we normalize the CLIP-T Score within the range [−1, 1].
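The CLIP-T computation amounts to a cosine similarity between the two embeddings. A schematic version follows; in practice the feature vectors would come from the pretrained CLIP image and text encoders:

```python
import numpy as np

def clip_t_score(img_feat, txt_feat):
    """Cosine similarity between CLIP image and text embeddings; lies in [-1, 1]."""
    a = np.asarray(img_feat, dtype=float)
    b = np.asarray(txt_feat, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([0.3, -1.2, 0.7])
assert abs(clip_t_score(v, v) - 1.0) < 1e-12    # identical features align fully
assert abs(clip_t_score(v, -v) + 1.0) < 1e-12   # opposite features score -1
```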

The LPIPS metric aims to measure the perceptual similarity between the generated and reference images. It leverages a pretrained convolutional neural network (e.g., AlexNet or VGG) to extract multi-layer visual features and calculates the distance between these feature representations. As LPIPS is more closely aligned with human visual perception, it effectively captures differences in texture and structure. Lower LPIPS scores indicate a smaller perceptual discrepancy between generated and reference images, implying higher visual fidelity. In our evaluation, LPIPS scores are also normalized within the range [−1, 1].

In addition, the FID is employed to assess the overall distributional quality and diversity of generated images. FID compares the feature distributions of generated and real images in the embedding space of an Inception network, computing their Fréchet distance in high-dimensional space. Lower FID scores reflect a closer match in distribution, suggesting higher realism and diversity in generated samples. These metrics collectively provide a comprehensive evaluation of model performance.

Ablation study

We designed and conducted a series of ablation studies to systematically evaluate the individual contributions and synergistic effects of key components in the proposed method. Quantitatively, six comparative configurations were implemented on the CTLPD dataset, involving different combinations of modules including Hierarchical Local LoRA, Dynamic Rank with Adaptive Alpha Mechanism, and Conditional Adversarial Flow with Multi-Scale MMD. Their performances were assessed using CLIP-T, LPIPS, and FID, as summarized in Table 1.

Table 1 Quantitative comparison of ablation studies to investigate the contribution of each individual component

Qualitatively, we further illustrate the visual differences in generated images under various configurations in Fig. 3, analyzing dimensions such as structural fidelity, color coherence, and style preservation. Together, these experiments provide comprehensive quantitative and qualitative validation of the contribution of each module to the model’s overall performance.

Fig. 3: Visual ablation comparison across different model components.

Each column corresponds to a different model; each row represents a distinct prompt. No post-processing or filtering was applied. All models were conditioned on the same prompt text.

The structure of the ablation settings is as follows:

  • Case 1 uses only Hierarchical Local LoRA, which partitions the input feature dimensions and applies low-rank adaptation.

  • Case 2 includes only the dynamic rank and adaptive alpha mechanism.

  • Case 3 incorporates only the conditional adversarial flow and Kernel-based multi-scale MMD.

  • Case 4 builds upon Case 1 by adding the dynamic rank and adaptive alpha mechanism.

  • Case 5 builds upon Case 1 by incorporating the conditional adversarial flow and multi-scale MMD.

  • Case 6 represents the full version of our proposed model, LFMdiff.

For clarity, A denotes Hierarchical Local LoRA, B denotes Dynamic Rank with Adaptive Alpha, and C denotes Conditional Adversarial Flow with Kernel-based multi-scale MMD.

Comparison with existing methods

We conducted both quantitative and qualitative comparisons of several state-of-the-art text-to-image generation models. For the quantitative analysis, we systematically evaluated multiple models, including SDXL32, DALL-E2, GLIDE33, Taiyi34, ControlNet, P+35, T2I-Adapter, CCLAP36 (mainstream diffusion models); Tongyi Wanxiang, RAPHAEL, WenXin 4.5 Turbo (Chinese text-to-image models); LlamaGen-XL37, OpenMAGVIT2 (Luo et al., 2024, arXiv:2409.04410), and DALL·E Mini (Dayma et al., 2021, GitHub repository: https://github.com/borisdayma/dalle-mini) (autoregressive models); and our proposed approach, using three widely adopted metrics: CLIP-T, LPIPS, and FID. The results are summarized in Table 2. For the qualitative analysis, we showcase the image outputs generated by these models based on the same textual input in the context of Chinese landscape painting. The comparison focuses on visual style, structural fidelity, and detail rendering, as illustrated in Figs. 4 and 5. (Note: due to the lack of an open-source implementation of the FHS-Adapter model, we were unable to reproduce its results under consistent experimental settings; it is therefore not included in this comparative study.) This evaluation provides a comprehensive view of each model’s practical performance in terms of semantic alignment, image quality, and style fidelity.

Table 2 Comparison with recent state-of-the-art text-to-image generation models
Fig. 4: Visual comparison between LFMdiff and state-of-the-art diffusion models.

This figure shows a visual comparison of landscape images generated by different diffusion-based models, arranged by prompt (rows 1–8) and model (columns a–h). Each model receives the same textual prompt describing a traditional Chinese landscape scene. No post-processing or filtering was applied to any of the outputs. a SDXL: a general-purpose text-to-image diffusion model capable of producing diverse styles. b DALL-E3: OpenAI's third-generation image generation model, trained on broad multilingual data. c GLIDE: a text-guided diffusion model focused on naturalistic image synthesis. d Taiyi: a Chinese-trained diffusion model with text-image alignment optimized for traditional aesthetics. e ControlNet: a structure-aware model that enables guided generation through edge or layout conditioning. f CCLAP: a layout-preserving model designed specifically for Chinese cultural content generation. g Ours: images generated by our proposed LFMdiff model, which introduces low-frequency modulation and hierarchical LoRA-based control to enhance brushstroke fidelity, spatial composition, and ink style consistency. h GT: the final column displays ground truth reference paintings, sourced from historical Chinese artworks in the public domain. LFMdiff stands for low-frequency modulated diffusion. The eight columns from a to h correspond to different generation methods and the reference. Each row visualizes the outputs for a distinct textual prompt.

Fig. 5: Visual comparison between LFMdiff, Chinese text-to-image generation models, and autoregressive models.

This figure shows a qualitative comparison of generated Chinese landscape paintings from multiple diffusion-based models. Each column (1–8) represents a unique textual prompt describing a traditional Chinese landscape composition. Each row corresponds to a model output or reference image. No post-processing was applied, and all models were conditioned on the same textual input. a Tongyi Wanxiang: a proprietary image generation model developed by Alibaba. b RAPHAEL: a diffusion model from Baidu with improved visual fidelity and layout structure. c WenXin 4.5 Turbo: an optimized variant of Baidu’s WenXin model for fast generation. d LlamaGen-XL: a large-scale autoregressive model for high-resolution image generation. e OpenMAGVIT2: an open-source autoregressive model with outputs resembling natural images. f DALL-E Mini: a lightweight autoregressive model with recognizable semantics but limited style control. g Ours: generation results from the proposed LFMdiff model, which uses low-frequency modulation and hierarchical attention guidance to enhance brushwork style, compositional balance, and ink expressiveness. h GT: ground truth reference images, obtained from digitized historical Chinese landscape paintings in the public domain. These serve as artistic benchmarks for comparison.

Two-tier human evaluation

To comprehensively evaluate the performance of different models in the task of generating traditional Chinese landscape paintings, we designed a two-tiered human subjective evaluation framework that assesses both the perceptual experience of general users and the artistic standards of professional painters. Based on the relatively low LPIPS values in Table 2 and the qualitative quality of generated images shown in Figs. 4 and 5, we selected four representative models—SDXL, DALL-E3, RAPHAEL, and our proposed model—for comparative assessment. Each model contributed 8 generated images, resulting in a total of 32 samples, with an emphasis on evaluating image quality and style fidelity.

For user evaluation, we recruited N = 57 participants, none of whom had a background in Chinese painting or fine arts. Detailed demographic information, including age, gender, education level, and nationality, is provided in Supplementary Table 1. The evaluation was conducted via an online questionnaire system. Participants rated each image across the following three dimensions:

  • Perceived Visual Quality: Clarity and aesthetics of the image, composition coherence, and presence of any structural errors or visual artifacts.

  • Semantic Consistency: Whether the image accurately conveys the semantic content described in the input prompt, including elements such as mountains, water, pines, and pavilions.

  • Style Fidelity: Whether the image reflects the stylistic features of traditional Chinese ink painting, such as ink variation, use of blank space, and compositional layout—distinct from photographs or Western oil paintings, and embodying the essence of “ink and brush” aesthetics.

Participants rated each image using a five-point Likert scale, defined as follows: 5 (Excellent), 4 (Good), 3 (Fair), 2 (Poor), 1 (Very Poor; severely inconsistent).

To ensure methodological transparency, we adopted a qualitative thematic analysis approach. Specifically, the open-ended comments from participants were analyzed following a three-stage coding procedure (open coding→axial coding→selective coding). A coding scheme was developed collaboratively by two independent coders, and intercoder reliability was calculated (Cohen’s κ = 0.82), indicating strong agreement. NVivo software was employed for data management and coding. Representative verbatim excerpts from participants’ comments are reported in Supplementary Table 2 to illustrate typical perceptions and provide empirical grounding for our interpretation.
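The intercoder reliability statistic above can be reproduced from the two coders' parallel label sequences. A minimal sketch of Cohen's κ, with purely illustrative label names:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement).
    Assumes at least two distinct categories occur, so chance
    agreement is strictly below 1."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Proportion of items on which the coders agree.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal category frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    cats = set(freq_a) | set(freq_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in cats)
    return (p_obs - p_chance) / (1.0 - p_chance)
```

Values near 1 indicate agreement well beyond chance; the κ = 0.82 reported above falls in the range conventionally read as strong agreement.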

To systematically analyze the subjective performance of the models, we implemented a structured data processing pipeline. First, we constructed a unified data table where each row corresponds to a single evaluation record, containing the image ID, associated model, user ID, and scores across the three dimensions. Based on this, we computed the mean opinion score (MOS) and standard deviation (STD) for each image and each dimension to assess both the average subjective quality and the consistency of user ratings. We then aggregated the MOS scores across the 8 images of each model to obtain model-level average subjective scores for each evaluation dimension. The aggregated results, represented in the form of mean ± std, are reported in Table 3. To test for statistically significant differences among models in each subjective dimension, we conducted Friedman tests (a non-parametric alternative to ANOVA) on the MOS values. If the Friedman test indicated significance (p < 0.05), we further applied Dunn’s post-hoc multiple comparisons to identify specific differences between model pairs. Finally, to visually interpret the dispersion and central tendency of user ratings, we employed boxplots, as shown in Fig. 6, which intuitively depict the comparative performance of the models across evaluation criteria.
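A condensed sketch of such a pipeline, assuming a long-format rating table with hypothetical column names; the Dunn's post-hoc step (available, e.g., in scikit-posthocs) is omitted:

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare

def mos_and_friedman(df):
    """df: long-format ratings for one dimension, with columns
    ['model', 'image_id', 'user_id', 'score'] (names hypothetical).
    Returns per-model MOS statistics (mean +/- std over image-level
    MOS) and a Friedman test over users' per-model mean ratings."""
    # Image-level mean opinion score, then model-level aggregation.
    img_mos = df.groupby(['model', 'image_id'])['score'].mean()
    model_stats = img_mos.groupby(level='model').agg(['mean', 'std'])
    # Friedman test: each user is a block, each model a treatment.
    per_user = df.groupby(['user_id', 'model'])['score'].mean().unstack('model')
    stat, p = friedmanchisquare(*[per_user[m] for m in per_user.columns])
    return model_stats, stat, p
```

The returned `model_stats` table corresponds to the mean ± std entries of Table 3, and `p < 0.05` is the trigger for the pairwise post-hoc comparisons described above.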

Table 3 Model-level mean opinion scores (MOS) and rating variability across perceptual dimensions
Fig. 6: Subjective MOS score distribution by model.

This figure illustrates the distribution of mean opinion scores (MOS) for four models (SDXL, DALL-E3, RAPHAEL, and ours) across three evaluation metrics: perceived visual quality, semantic consistency, and style fidelity. SDXL: represented by blue boxes. DALL-E3: represented by orange boxes. RAPHAEL: represented by green boxes. Ours: represented by red boxes.

While the user study provided meaningful insights, we acknowledge that the relatively modest sample size and reliance on an online questionnaire may limit generalizability. Thus, conclusions should be interpreted as indicative rather than definitive.

For expert evaluation, we invited three experts with extensive professional backgrounds in the field of traditional Chinese landscape painting to conduct independent evaluations of the generated images. The expert panel consisted of a professor from a fine arts academy, a senior-level traditional ink painting artist, and a curator with over a decade of experience in exhibition planning and artistic evaluation. All three possess >10 years of experience in both theoretical research and practical creation within the domain of Chinese ink painting. The evaluation was structured around three professional dimensions:

  • Brushwork expressiveness: assessing whether the image demonstrates layered brushstrokes and traditional techniques such as texture strokes, including axe-cut, hemp-fiber, variations in ink tones, and water stain rendering.

  • Composition and spatial perspective: evaluating whether the composition adheres to the “three distances” (san yuan fa) principle unique to Chinese landscape painting, with clear focal hierarchies and a balance of void and solid spaces.

  • Ink aesthetic and artistic conception: examining whether the image embodies the traditional aesthetics of “vitality and spirit” and “expressive intentionality” (xieyi), including natural use of negative space and an overall sense of ethereal, serene landscape imagery.

Each image was rated by the three experts on a 5-point Likert scale, where 5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, and 1 = Very Poor (severely inconsistent with expectations). To enhance transparency, experts were also encouraged to provide short qualitative remarks, which were subsequently coded using the same three-stage procedure as the user evaluation. Representative quotations are reported in Supplementary Table 3.

To ensure robustness and representativeness, the median score across the three experts was taken as the final rating for each image on each dimension. Subsequently, for each model, the median scores of its 8 images under each dimension were averaged to obtain the model-level subjective performance score, as presented in Table 4. To assess whether significant differences exist among models, we conducted Friedman tests on the image-level scores across the four models in each evaluation dimension. For cases where the p-value indicated statistical significance (p < 0.05), we further applied Dunn’s post-hoc tests for pairwise comparisons. In addition, we visualized the distribution and central tendency of expert ratings across models using boxplots, which intuitively highlight the differences in subjective performance across evaluation dimensions, as shown in Fig. 7.
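The aggregation above, a median over the three experts followed by a mean over a model's eight images, can be sketched as follows (array shape assumed):

```python
import numpy as np

def expert_model_score(ratings):
    """ratings: array of shape (n_images, n_experts) holding Likert
    scores for one model on one dimension. The per-image rating is the
    median across experts (robust to a single outlying judgment); the
    model-level score is the mean of those medians."""
    per_image = np.median(ratings, axis=1)
    return float(per_image.mean())
```

Taking the median first means one dissenting expert cannot shift an image's rating, while the final mean still reflects how the model performs across all eight samples.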

Table 4 Average median scores per model in professional expert evaluation
Fig. 7: Expert evaluation of models across artistic dimensions.

This figure presents expert evaluation scores for four models (SDXL, DALL-E3, RAPHAEL, and ours) across three artistic dimensions: brushwork, composition and spatial perspective, and ink aesthetic. Brushwork: represented by blue bars. Composition and spatial perspective: represented by orange bars. Ink aesthetic: represented by green bars.

Discussion

As shown in Table 1, Case 1 employs only the Hierarchical Local LoRA module (denoted as A), achieving strong performance in both FID (76.951) and LPIPS (0.604), alongside a relatively high CLIP-T score (0.320). These results indicate that this module positively contributes to both perceptual image quality and semantic alignment. In contrast, Case 2 incorporates only the dynamic rank and dynamic alpha mechanisms (denoted as B). Despite their adaptive capabilities, this configuration yields the lowest performance across all three metrics (CLIP-T: 0.297, LPIPS: 0.689, FID: 85.945), suggesting limited standalone effectiveness. Case 3 integrates the Conditional Adversarial Flow and multi-scale MMD components (denoted as C), which slightly improves CLIP-T (0.305) and FID (82.809) compared to Case 2. This indicates that these modules facilitate better distribution alignment and structural fidelity. Case 4 builds upon Case 1 by adding the dynamic mechanisms (A + B), resulting in a marked improvement in LPIPS (0.633) and a reduction in FID (77.443), demonstrating that the dynamic mechanisms enhance visual quality when combined with LoRA. Similarly, Case 5, which combines A and C, shows consistently strong results in LPIPS (0.651) and CLIP-T (0.310), outperforming the standalone C configuration. Finally, Case 6 integrates all three components (A + B + C), achieving the best performance across all metrics: CLIP-T improves to 0.334, LPIPS drops to 0.438, and FID significantly decreases to 61.544. These findings validate the complementary nature and synergistic benefits of the proposed modules.

As illustrated in Fig. 3, the first column demonstrates that Case 1 exhibits strong capabilities in capturing architectural edges, contours, and structural hierarchies, especially in black-and-white line drawing styles (e.g., rows 3 and 10), where the reconstruction is relatively clear. This suggests that the local LoRA mechanism contributes to improved structural reconstruction, particularly in capturing local details and linear layering within the generated images. However, this configuration reveals notable shortcomings in natural color transitions, stylistic consistency, and overall structural coherence. Color images often appear overly synthetic and lack realism (e.g., rows 2 and 4). In the second column, improvements in color saturation and style consistency are observed, particularly in the rendering of landscapes and lighting effects (rows 5 and 6). Nonetheless, this setup shows a decline in structural accuracy, with certain architectural or scene details becoming blurred or distorted, suggesting a deficiency in maintaining stable spatial structures. The third column yields more natural color palettes and stylized effects, especially in terms of texture and brushstroke realism, achieving a closer resemblance to ground truth images (e.g., rows 4, 7, and 9). Yet, this configuration demonstrates weaker structural control, as some samples exhibit spatial misalignment or perspective distortion, indicating that this module alone is insufficient for capturing complete spatial structural features. When the Hierarchical Local LoRA is combined with the dynamic rank and alpha mechanisms, both structural clarity and stylistic coherence are improved. Images in the fourth column display enhanced architectural stability and color harmony (e.g., rows 3, 6, and 10), although some local regions, including landscape edges or fine textures, still lack sharpness. 
With the integration of Conditional Adversarial Flow, the fifth column shows further enhancement in image layering and texture detail (e.g., rows 4, 5, and 7), although its performance in color control and global consistency remains slightly inferior to that of the complete model. When all three modules (A, B, and C) are integrated, the generated results (sixth column) achieve the best overall performance in terms of structural fidelity, color accuracy, and stylistic consistency.

As shown in Table 2, the proposed LFMdiff (Ours) model demonstrates strong overall performance across all three evaluation metrics, with notable advantages in both CLIP-T and FID, and competitive results in LPIPS, which measures perceptual similarity. The semantic consistency score (CLIP-T) of our method reaches 0.334, significantly outperforming all baseline models, including WenXin 4.5 Turbo (0.247), CCLAP (0.234), and ControlNet (0.226). This indicates that the images generated by LFMdiff are more semantically aligned with the input text, demonstrating superior text-image correspondence. In terms of perceptual similarity, our method achieves an LPIPS score of 0.438. Although this does not rank in the top five, LFMdiff still shows strong capability in preserving ink wash details and capturing stylistic features characteristic of traditional Chinese landscape painting. It is worth noting that models such as SDXL (0.011) and ControlNet (0.026), despite achieving the lowest LPIPS scores (indicating better perceptual similarity in pixel space), tend to generate overly smooth images, lacking the diversity and expressive brushwork of traditional ink painting. In contrast, models like Taiyi (0.831), GLIDE (0.969), and T2I-Adapter (0.981) perform poorly on this metric, suggesting weaker perceptual quality. Regarding overall image quality, LFMdiff achieves the lowest FID score of 61.544, indicating that the distribution of its generated images is closest to that of real Chinese landscape paintings in terms of structural fidelity and distribution realism. In summary, although LFMdiff does not achieve the top performance on the LPIPS metric, it exhibits clear advantages on both CLIP-T and FID, reflecting a well-balanced trade-off among semantic consistency, stylistic fidelity, and image diversity.
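CLIP-T is commonly computed as the mean cosine similarity between paired, L2-normalized CLIP text and image embeddings. A sketch over precomputed embeddings, with the encoder calls assumed to happen upstream:

```python
import numpy as np

def clip_t(text_emb, image_emb):
    """Mean cosine similarity between paired text and image embeddings,
    each of shape (n_pairs, emb_dim). The CLIP encoders producing the
    embeddings are assumed to run upstream; this is only the scoring
    step, so higher values mean tighter text-image alignment."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(t * i, axis=1)))
```

Because both embedding sets are normalized before the dot product, the score is bounded in [-1, 1] and insensitive to embedding magnitude, which makes per-model averages directly comparable.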

As shown in Fig. 4, images generated by SDXL and DALL-E3 demonstrate certain advantages in resolution and overall visual appeal. However, their stylistic tendencies lean toward Western oil painting or digital illustration, lacking the abstract expression and “artistic conception” characteristic of traditional Chinese ink painting. Their brushstrokes appear overly realistic or digitally rendered, and the handling of ink tones is often unnatural. GLIDE yields generally poor results, with outputs frequently exhibiting loose composition and monotonous color schemes. Some samples resemble children’s drawings or are overly abstract, failing to capture the expressive style and spatial depth fundamental to Chinese painting. Taiyi, a Chinese-pretrained model, aligns more closely with Chinese aesthetics in terms of color and some visual elements. Nevertheless, its compositions tend to be overly simplistic, lacking structural complexity, and the textures are coarse. For instance, the representation of trees and rocks is overly generalized, with insufficient variation in brush and ink techniques. ControlNet exhibits strong structural control and can produce well-defined contours of mountains and architectural elements. However, the resulting images resemble contour maps or computational renderings, lacking the fluidity and expressive spontaneity of ink painting, and often display a mechanical or patterned appearance. CCLAP is capable of generating images with rich color palettes and complex visual elements. Yet, its outputs frequently diverge from the subtle and elegant aesthetic of traditional landscape painting, instead showing a strong decorative tendency or exaggerated coloration, more reminiscent of modern collage or illustration styles. In contrast, our method (Ours) achieves the most convincing results in terms of ink painting style, hierarchical composition, and spatial arrangement.
The generated images exhibit hallmark features of Chinese painting techniques, including feibai, jimo (ink accumulation), xuanran (gradual wash), and liubai (intentional blank space), which are applied in the rendering of elements like rocks, forests, clouds, water, and architecture. The overall composition conveys a dynamic sense of rhythm and vitality, achieving a balance between form and spirit, and most closely resembles authentic artworks.

As illustrated in Fig. 5, although mainstream large-scale models such as Tongyi Wanxiang and RAPHAEL are capable of generating high-resolution images with rich details, their visual styles tend to resemble Western oil paintings or digital illustrations. These models typically emphasize realistic lighting and physical structure in the depiction of mountains, trees, and water bodies. However, their brushstrokes are highly representational, lacking the abstract expression and ink aesthetics characteristic of traditional Chinese landscape painting. For instance, in columns 1 and 4, the images generated by RAPHAEL exhibit a strong sense of volume and contrast, but the overall color saturation is relatively high, which does not align with the Chinese aesthetic preference for subtlety, elegance, and serenity. WenXin 4.5 Turbo, a model fine-tuned for Chinese-language inputs, demonstrates a better understanding of landscape-related semantics, with more traditional elements such as pine trees, pavilions, terraces, and mist appearing in the generated images. However, it still suffers from loose spatial composition and ambiguous layering. In columns 3 and 6, for example, the shapes of rocks appear rigid, and the handling of mist and distant views lacks the dynamic interplay between solidity and emptiness that defines classical Chinese landscape techniques. In comparison, autoregressive models such as LlamaGen-XL, OpenMAGVIT2, and DALL-E Mini exhibit distinct stylistic tendencies. LlamaGen-XL leans toward photorealism, often lacking the artistic abstraction of ink painting, and in some cases (e.g., columns 2 and 4) introduces modern elements inconsistent with the intended cultural context. OpenMAGVIT2 and DALL-E Mini can partially imitate the grayscale tonalities of ink wash, but their compositions remain simplistic, with homogeneous mountain and tree forms, insufficient layering, and a lack of brush-and-ink expressiveness.
In contrast, our method (Ours) exhibits superior performance across multiple dimensions, particularly in generation quality and artistic style fidelity. The generated images follow traditional compositional principles, with a natural transition from foreground to middle ground to background, adhering to the “Three Distances” perspective approach commonly employed in Chinese landscape painting. In terms of detailed depiction, as seen in columns 5 and 8, our model produces well-articulated brushwork in trees and rocks, with effective use of blank space for water bodies, and natural applications of ink wash techniques such as feibai, jimo, and pomo (splashed ink). The overall imagery conveys a strong sense of vitality and rhythm, achieving a harmonious integration of form and meaning, and delivering a rich expression of artistic conception.

As shown in Table 3 and Fig. 6, the evaluated models exhibit distinct differences in perceptual quality from the perspective of general users. RAPHAEL demonstrates the best performance in Perceived Visual Quality, with a mean score of 4.07. The corresponding boxplot indicates a high median and a concentrated interquartile range, suggesting strong and consistent visual detail rendering. SDXL achieves the highest score in Semantic Consistency (mean = 4.11), with a tightly grouped distribution, reflecting its strong alignment between text and image content. Although DALL-E3 attains comparable mean scores across dimensions, its boxplots reveal more dispersed ratings with the presence of low outliers, indicating less stable performance. In contrast, the Ours model excels in Style Fidelity, achieving the highest mean score of 3.98, with a tight and elevated boxplot distribution, highlighting its superior capability to reproduce the stylistic features of traditional Chinese landscape painting. Furthermore, Ours achieves a Semantic Consistency score of 4.09, closely approaching the best-performing SDXL, demonstrating robust semantic understanding. Although its Visual Quality score is slightly lower than those of RAPHAEL and SDXL, the difference is marginal, and Ours maintains high stability overall. These observations are further supported by statistical significance testing. For the Visual Quality dimension, the Friedman test yields a statistic of 2.130 with a p-value of 0.5458, indicating no significant differences among models in this dimension; user ratings were generally consistent across models. In Semantic Consistency, the Friedman statistic is 10.714 (p = 0.0134), suggesting a significant difference among models. However, subsequent Dunn’s post-hoc tests reveal that none of the pairwise comparisons reach statistical significance (all p > 0.05), implying that the overall difference may result from the joint contribution of multiple models.
In Style Fidelity, the Friedman statistic is 8.217 with a p value of 0.0417, indicating significant differences. Dunn’s post-hoc analysis further shows that Ours significantly outperforms DALL-E3 (p = 0.037969) in this dimension, while no significant differences are observed with other models. In summary, the Ours model exhibits a statistically significant advantage in Style Fidelity, underscoring its strong ability to faithfully reproduce traditional Chinese landscape painting styles. It also performs stably in Semantic Consistency and comparably in Visual Quality, demonstrating balanced and competitive overall performance in the task of traditional Chinese landscape image generation.

As shown in Table 4 and Fig. 7, the expert evaluation was conducted across three core dimensions of traditional Chinese landscape painting (brushwork, composition and perspective, and ink aesthetic) to assess each model’s capability in artistic fidelity. Overall, the Ours model demonstrated strong performance across multiple dimensions, highlighting its adaptability and effectiveness in replicating traditional artistic features. In the Composition & Perspective dimension, Ours achieved the highest mean score of 4.25, with the boxplot indicating a concentrated distribution in the upper quartile. Statistical analysis revealed that Ours significantly outperformed DALL-E3 (p = 0.0288) and SDXL (p = 0.0037), and showed no significant difference from RAPHAEL (p = 1.0). This suggests that Ours aligns well with classical principles such as the “three distances” and exhibits clear structural hierarchy and spatial layering. For Brushwork, RAPHAEL led with a mean score of 4.5, significantly higher than DALL-E3 (p = 0.0098). Ours followed closely with a mean score of 4.0, and although the difference was not statistically significant compared to the other models, its boxplot showed low variance, indicating consistent rendering of traditional techniques such as texture strokes and ink-wash gradients. In terms of Ink Aesthetic, RAPHAEL again achieved the highest mean score (4.125), followed by Ours (3.75). Although the Friedman test did not indicate statistical significance across models (p = 0.0954), Ours demonstrated stable performance, with a compact boxplot suggesting reliable delivery of artistic mood and atmospheric subtlety. Taken together, Ours stands out as the top-performing model in composition, while also delivering solid and consistent results in brush technique and aesthetic expression.
With all three dimensions rated as “Good” or “Excellent”, Ours exhibits outstanding overall suitability for the task of traditional Chinese landscape painting generation, balancing structural composition and artistic authenticity with notable effectiveness.

Despite the significant progress achieved in Chinese landscape painting generation, several limitations and challenges remain. Although the CTLPD dataset encompasses a relatively wide range of styles and themes, it still falls short in capturing the full diversity of human artistic creation. Rare styles such as ruled-line painting and blue-and-green landscape are underrepresented, which may impair the model’s generalization capability in these specific genres. Additionally, the current evaluation metrics, FID, LPIPS, and CLIP-T, are primarily designed for natural image perception and fail to adequately reflect the aesthetic criteria intrinsic to traditional East Asian art, which include qiyun shengdong (vitality and spirit resonance) and xushi xiangsheng (the interplay of void and substance). Future research should explore the integration of evaluation indicators grounded in art cognition and aesthetic theory.

Chinese landscape painting is not merely a visual art form but also a vessel of profound philosophical thought, cultural symbolism, and historical value. The introduction of LFMdiff represents an advancement in generative modeling for this domain, while also offering potential contributions to the digital preservation and dissemination of cultural heritage. In terms of cultural inheritance, LFMdiff can support the recreation of ancient painting styles and the restoration of damaged artworks, providing intelligent tools for institutions such as the Palace Museum and the Dunhuang Academy. In art education, LFMdiff may serve as a pedagogical aid to help beginners grasp compositional logic and brush-ink techniques, thereby facilitating digital instruction and creative reinterpretation of traditional practices. It is important to note, however, that such educational impact is not solely determined by technical capability; factors such as learners’ prior art background, age, and level of digital literacy also play a role. Furthermore, the controllable generation mechanisms introduced by our model could offer a technical foundation for the development of interactive cultural products, suggesting possible pathways for expanding the expressive boundaries of traditional art in contemporary contexts.

Although user evaluations highlighted the model’s performance in style fidelity and expert assessments confirmed its reliability in structural and compositional aspects, future research could benefit from triangulation by integrating qualitative findings with industry reports, usage data, or broader survey results. Such an approach would help assess the generalizability of these observations beyond individual subjective perceptions, thereby enhancing the robustness and applicability of the conclusions.