Abstract
In the context of educational informatization and cultural heritage preservation, painting instruction in ethnic universities faces challenges such as difficulty in technique transmission, cross-linguistic barriers, and insufficient personalization. This study proposes a painting art rendering system based on deep learning and machine translation, establishing an integrated framework of “technique transmission – style rendering – cultural interpretation – personalized guidance.” The system employs an improved generative adversarial network to render eight ethnic painting styles automatically and introduces a visual-context Transformer to map painting terminology semantically across ethnic languages. Validation was conducted on a multimodal dataset comprising 12,000 artworks and 5,000 terminology entries. The style rendering module achieved an F1 score of 92.3%, an 8.7% improvement over traditional models, while the terminology mapping module reached a semantic matching rate of 89.6%, an increase of 6.2%. Ablation experiments indicated that the collaborative operation of the two modules enhanced overall performance by 11.5%. Teaching experiments showed that students using the system improved by 18.4%, 25.4%, and 17.6% in technique mastery, cultural understanding, and creative innovation, respectively, significantly outperforming traditional instruction. The study contributes a collaborative teaching framework, innovative modules for intelligent rendering and cross-linguistic interpretation, and empirical validation of their educational value, providing a practical approach to the digital preservation of ethnic painting techniques and to cross-cultural communication.
Introduction
Ethnic painting arts carry the historical memory, values, and aesthetic characteristics of China’s diverse ethnic groups, such as the mineral pigment techniques of Tibetan Thangka, the symmetrical aesthetics of Miao silver ornaments, and the color symbolism systems of Yi lacquerware1. However, painting instruction in ethnic universities faces three major challenges: the transmission of techniques relies heavily on oral teaching and limited original works; symbols and terminology depend on ethnic language backgrounds, making systematic comprehension difficult for students; and standardized teaching models constrain personalization and creative development.
Deep learning techniques for image generation and style transfer can automate the rendering process from sketches to finished works, enhancing the visualization of technical instruction2. Transformer-based translation models can overcome cross-linguistic barriers, enabling semantic mapping of terminology and cultural symbols. Educational psychology research indicates that “contextualized learning” and “personalized feedback” are crucial in art education, and immersive rendering with deep learning combined with cross-linguistic cultural interpretation aligns well with these pedagogical needs. Existing studies largely focus on single technologies: although Convolutional Neural Network (CNN)-based style transfer can achieve image transformation, it falls short in restoring detailed textures in ethnic paintings; similarly, machine translation research for ethnic languages is often limited to everyday communication and lacks support for specialized terminology and cultural symbols.
This study introduces three main innovations: (1) Painting Rendering Module Based on an Improved Generative Adversarial Network (GAN): enables precise recognition and automated rendering of eight ethnic painting styles and provides visual feedback on technical details. (2) Cross-Linguistic Painting Terminology Mapping Module Based on a Transformer: integrates ethnic languages with a professional terminology database to achieve cross-linguistic interpretation of cultural symbols, addressing language gaps in teaching. (3) Dynamic-Weight Bidirectional Coupling Mechanism: unlike the static fusion used in conventional cross-modal models, this mechanism dynamically adjusts the fusion coefficient \(\alpha\) within the range 0.5 to 0.9, using a Style Complexity Perceptor to control the balance between visual injection (R→T) and terminology feedback (T→R). This dynamic adjustment improves feature compatibility in cross-ethnic style fusion scenarios.
Literature review
The core objective of painting instruction in ethnic universities is to unify “technique transmission” and “cultural identity.” Hui (2022) noted that traditional “lecture–copying” approaches had left students’ understanding of patterns at a superficial level, making it difficult to grasp underlying cultural logic3. Huang and Fang (2024) identified gaps in ethnic language and terminology as the primary constraints on cultural comprehension4. In terms of resource development, Mohamed (2024) proposed digital museums that preserved high-resolution details but were not integrated into teaching5. Kan (2025) experimented with Virtual Reality (VR) to enhance immersion, but high costs and the lack of cross-linguistic interpretation limited its classroom applicability6.
In the field of deep learning–based image generation and style transfer, recent research has made significant progress, providing a technological foundation for the intelligent rendering of ethnic paintings. Huang and Jafari (2023) proposed an enhanced balancing GAN that effectively improved the realism of generated images7. Hindarto and Handayani (2024) applied CycleGAN to overcome unpaired-data limitations, enabling ordinary photos to be transformed into Thangka-style images8. He and Tao (2025) introduced a CNN-based style transfer approach that separated content and style features, but handling complex patterns often led to texture discontinuities and detail distortion9. Luo and Ren (2025) attempted to optimize ethnic painting rendering with improved GANs but covered only two ethnic styles and did not integrate teaching applications, which limited classroom utility10. Meanwhile, several high-impact studies in recent years have further advanced both the theory and applications of style transfer. For example, Hu and Zhang (2025) applied StyleGAN3, which achieves stable, high-fidelity style generation through phase-consistency improvements, offering new approaches for reproducing brushstrokes and texture details in ethnic paintings11. Jiang et al. (2025) introduced a hybrid contrastive learning mechanism within the Contrastive Unpaired Translation (CUT) framework, significantly enhancing style consistency under unpaired-data conditions12. Zhao et al. (2025) combined diffusion models with Transformers to achieve painting style synthesis with stronger semantic control, producing results that outperform traditional GAN models in both texture granularity and color expression13. These studies provide both theoretical inspiration and a technical foundation for the improved IGAN model proposed in this research.
In the field of cross-linguistic and cross-cultural interpretation, machine translation technologies have provided essential support for the cross-lingual mapping of ethnic painting terminology. Li (2025) developed a Mandarin–Tibetan Thangka terminology dictionary; however, it was limited in scope and overlooked cultural contextual differences14. Recent research explored the collaborative application of deep learning and machine translation, primarily in cross-modal content generation. Banar (2023) combined style transfer and translation to generate multilingual descriptions of artworks, yet technical skill support was lacking15. In educational settings, Liu and Zhu (2025) built an intelligent teaching system integrating deep learning and translation models to provide bilingual feedback on student assignments, but its design targeted STEM subjects and did not accommodate the visual and creative characteristics of painting arts16. Moreover, recent studies highlight the importance of jointly modeling visual and textual semantics. For example, Zhang (2024) proposed the Vision-Language Pretraining (VLP) framework17, and Gao (2024) developed multimodal Transformer models18. These studies provide key references for designing the visual context fusion layer in this research.
In summary, while prior research has advanced image generation, style transfer, and cross-linguistic translation, a collaborative framework that integrates deep learning–based rendering with machine translation–based cross-lingual interpretation is still lacking. This gap is especially evident in ethnic painting education, where challenges persist in technique inheritance, language barriers, and shallow cultural understanding. To address this, the present study develops an intelligent teaching system that combines visual style generation with semantic mapping. This system aims to support the digital preservation and cross-cultural dissemination of ethnic painting.
Research model
Deep learning-based painting rendering module
The core technology of deep learning is the neural network, whose design is inspired by the structure and information transmission patterns of the human brain19,20. The human brain functions as a complex system for information interaction, and deep learning emulates this by assigning each connection between neurons an independent weight parameter. Numerous interconnected neurons together form a large-scale network of parameters21,22,23. A critical step in deep learning is training these networks to optimize the weights of all neurons.
The simplest form of a neural network typically consists of an input layer, one or more hidden layers, and an output layer24. The input layer receives the data to be processed, which can include diverse types such as images, audio, or other preprocessed information. The hidden layers, connected to the input layer, process and transform the incoming signals. These layers often consist of multiple levels and represent the most structurally complex part of the network, containing the largest number of neurons. The output layer receives the processed information from the hidden layers and produces the final result of the trained network.
The structure of a neural network is illustrated in Fig. 1. The hidden layers can comprise multiple levels, and all neurons between successive layers are fully connected, a pattern known as a fully connected network.
The style feature extraction submodule is responsible for capturing the multi-level stylistic features of ethnic paintings25. Its network architecture is based on ResNet-50 combined with a spatial attention mechanism. ResNet-50 extracts features hierarchically through five convolutional stages (conv1 to conv5): lower layers capture fine details such as edges and textures, while higher layers identify style semantics. To emphasize critical features and suppress redundancy, a spatial attention mechanism is introduced. This mechanism uses pixel-wise importance weighting to enhance responses to key painting regions while reducing background noise. The attention weights are computed as follows:
where s denotes the feature map output from ResNet-50, W1 and W2 are weight matrices, b1 and b2 are bias terms, σ is the sigmoid activation function, and A(s) represents the final attention weight map.
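To make the attention step concrete, the following PyTorch sketch re-weights a ResNet-50 feature map with a pixel-wise attention mask. The two 1 × 1 convolutions stand in for W1/b1 and W2/b2; the intermediate ReLU and the channel sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Pixel-wise attention over a ResNet-50 feature map (illustrative sketch)."""
    def __init__(self, channels: int = 2048, hidden: int = 256):
        super().__init__()
        self.w1 = nn.Conv2d(channels, hidden, kernel_size=1)      # plays the role of W1, b1
        self.w2 = nn.Conv2d(hidden, 1, kernel_size=1)             # plays the role of W2, b2

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.w2(torch.relu(self.w1(s))))        # attention map A(s), shape (B, 1, H, W)
        return s * a                                              # emphasize key painting regions, suppress background

features = torch.randn(2, 2048, 7, 7)                             # stand-in for ResNet-50 conv5 features
weighted = SpatialAttention()(features)                           # (2, 2048, 7, 7)
```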
To ensure cultural adaptation, a technical rule database is incorporated, storing specific standards for each ethnic painting style. These rules are encoded as feature constraint vectors R and fused with the ResNet-extracted image features to obtain the final style feature vector F:
where FResNet is the feature vector extracted by ResNet-50, α is the fusion coefficient (ranging from 0.6 to 0.8, optimized experimentally), and R is the technical rule constraint vector.
The intelligent rendering module is based on a GAN framework, consisting of a generator and a discriminator, enabling the transformation of student sketches into images with ethnic painting styles26. The generator builds upon the traditional U-Net encoder–decoder framework by incorporating residual blocks and a self-attention mechanism, enabling both fine-grained texture preservation and global style consistency.
The rendering proceeds in three stages (a minimal sketch follows this list):

1. Sketch Feature Encoding: The sketch is converted into a content feature vector C, preserving the main outlines and spatial layout.

2. Style Feature Injection: The style feature vector F is fused with the content vector C to form a hybrid feature H.

3. Image Generation: The hybrid feature H is input into the decoder, which gradually restores the resolution through residual blocks to produce the final image.
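The sketch below illustrates the three stages in simplified PyTorch form: a small encoder produces content features C, a projected style vector F is added to form the hybrid feature H, and a residual decoder restores resolution. Layer counts, channel widths, and the additive fusion are assumptions for illustration; the full IGAN additionally uses skip connections and self-attention as described above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class SketchToStyleGenerator(nn.Module):
    """Encoder -> style injection -> residual decoder (simplified sketch)."""
    def __init__(self, style_dim: int = 256, ch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, ch, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.style_proj = nn.Linear(style_dim, ch * 2)            # map style vector F into feature space
        self.decoder = nn.Sequential(
            ResidualBlock(ch * 2), ResidualBlock(ch * 2),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch * 2, 3, 3, padding=1), nn.Tanh())

    def forward(self, sketch, style_vec):
        c = self.encoder(sketch)                                  # content features C
        f = self.style_proj(style_vec)[:, :, None, None]          # broadcast style F over spatial dims
        h = c + f                                                 # hybrid feature H (additive fusion, assumed)
        return self.decoder(h)                                    # rendered image

g = SketchToStyleGenerator()
out = g(torch.randn(2, 1, 128, 128), torch.randn(2, 256))         # -> (2, 3, 128, 128)
```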
The discriminator’s core function is to distinguish between “real ethnic paintings” and generator-rendered images. Its loss function is a multi-objective formulation combining adversarial loss, style loss, and content loss. The discriminator adopts a 70 × 70 PatchGAN architecture, performing image authenticity assessment at the patch level to enhance both detail perception and discriminative accuracy.
where \(\mathcal{L}_{adv}\) is the adversarial loss, implemented using WGAN-GP to prevent mode collapse. D(x) is the discriminator output for a real image x, G(z) is the generator output for noise z, \(\hat{x}\) is an interpolated sample between real and generated images, and γ is the gradient penalty coefficient, which constrains the discriminator’s gradient norm to maintain training stability (a sketch of the penalty term follows the list below).
- \(\mathcal{L}_{style}\) enforces consistency between the generated image and real works in the style feature space.
- \(\mathcal{L}_{content}\) ensures correspondence between the generated image and the input sketch in the content feature space.
- λ1 and λ2 are weighting coefficients that balance the contributions of the different loss terms.
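As a concrete reference for the adversarial term, the following sketch computes the WGAN-GP gradient penalty on interpolated samples \(\hat{x}\), with the gradient penalty coefficient γ set to the value 10 given in the training procedure below; the stand-in critic is only a placeholder for the 70 × 70 PatchGAN discriminator.

```python
import torch
import torch.nn as nn

def gradient_penalty(discriminator, real, fake, gamma: float = 10.0):
    """WGAN-GP penalty on interpolated samples x_hat, as used in L_adv."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(discriminator(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gamma * ((grad_norm - 1.0) ** 2).mean()

critic = nn.Conv2d(3, 1, 4, stride=2, padding=1)            # stand-in for the 70 x 70 PatchGAN critic
real, fake = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
penalty = gradient_penalty(critic, real, fake)              # critic loss adds D(fake) - D(real) to this term
```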
The style vector and rule constraint vector are injected at different stages of the generator: the style vector into the mid-level features of the encoder, and the rule vector into the style reconstruction layer of the decoder. The style vector is linearly mapped to a 256-dimensional space and normalized layer by layer to modulate the local color and brushstroke distribution of the generated image. The rule vector is processed through a fully connected layer and added residually to the skip connection outputs, ensuring that the painting’s structure and compositional rules are preserved. The generator and discriminator topologies are illustrated in Fig. 2, and the detailed layer configurations are provided in Table 1.
During training, the generator and discriminator are jointly updated via adversarial optimization. The generator aims to minimize a composite loss function, which includes adversarial loss, style loss, content loss, and regularization terms. The discriminator, in contrast, maximizes classification accuracy to enhance convergence stability. This architecture ensures a clear hierarchical design and transparent parameter configuration, providing a standardized technical foundation for performance replication and cross-study validation.
Cross-language interpretation module via machine translation
The cross-language interpretation module is based on an improved Transformer architecture, designed to enable mapping among ethnic languages, Mandarin, and painting-specific terminology while incorporating cultural context27,28. Its core innovation lies in the integration of a visual context fusion layer, which leverages the style feature vectors from the rendering module to assist translation29. This design effectively addresses the ambiguity of technical terms that often arises when traditional models rely solely on textual input. The improved Transformer structure is defined as follows:
Encoder: The input consists of ethnic-language term vectors \(E_T\). Simultaneously, visual vectors \(E_V\), derived from the style feature vectors, are introduced. The two vectors are weighted and fused to form a hybrid vector \(E_{TV}\), as shown in Eq. (6):
where \(W_V\) is the weight matrix for the visual vector, and \(b_V\) is the bias term. The fused vector is then fed into multi-head attention layers and feed-forward networks to extract deep semantic representations.
The loss function for this module combines cross-entropy loss with semantic similarity loss:
where \(\mathcal{L}_{CE}\) is the cross-entropy loss, measuring the discrepancy between the model-generated translation and the reference translation; \(\mathcal{L}_{sim}\) is the semantic similarity loss, calculated as the cosine similarity between the generated translation and the visual feature vector to ensure semantic alignment with the painting context; and λ3 is a weighting coefficient, set to 0.5 in the experiments.
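A minimal sketch of this combined objective is shown below. It assumes the generated translation has already been pooled into a sentence embedding projected to the same dimension as the visual style vector, treats 1 − cosine similarity as the semantic term, and uses padding index 0 as an assumption.

```python
import torch
import torch.nn.functional as F

def translation_loss(logits, target_ids, sent_embedding, visual_vec, lambda3: float = 0.5):
    """Cross-entropy over target tokens plus a cosine-based semantic similarity term."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         target_ids.reshape(-1), ignore_index=0)      # 0 = assumed padding id
    sim = F.cosine_similarity(sent_embedding, visual_vec, dim=-1).mean()
    return ce + lambda3 * (1.0 - sim)                                  # smaller when translation aligns with visual context

loss = translation_loss(torch.randn(2, 12, 1000), torch.randint(1, 1000, (2, 12)),
                        torch.randn(2, 256), torch.randn(2, 256))
```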
Style transfer network
The style transfer network, originally proposed by Gatys et al., employs VGGNet to achieve fusion and transfer of content and style features. The network defines two loss functions: the content loss and the style loss, which are combined in a weighted manner to form the final loss function30,31. By iteratively minimizing this loss function during training, the network produces images with the desired style transformation. The architecture of the Visual Context Transformer translation module is shown in Fig. 3.
To ensure correct loss computation, the training network structure must be consistent with that of VGGNet32,33,34,35,36. The detailed training process of the rendering network is illustrated in Fig. 4.
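For reference, the following sketch reproduces the Gatys-style objective with a pre-trained VGG-19 from torchvision: Gram matrices of selected layers give the style loss, and a conv4_2 feature distance gives the content loss. The layer indices and weights α, β are common choices, not values taken from this study, and inputs are assumed to be ImageNet-normalized tensors.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg = vgg19(weights="IMAGENET1K_V1").features.eval()       # downloads ImageNet weights on first use
content_layer, style_layers = 21, [0, 5, 10, 19, 28]       # conv4_2 and conv1_1..conv5_1 (common choice)

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)             # normalized Gram matrix

def vgg_features(x, layers):
    feats, out = {}, x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in layers:
            feats[i] = out
    return feats

def transfer_loss(generated, content_img, style_img, alpha=1.0, beta=1e3):
    """Weighted content + style loss on ImageNet-normalized image tensors (B, 3, H, W)."""
    fg = vgg_features(generated, style_layers + [content_layer])
    fc = vgg_features(content_img, [content_layer])
    fs = vgg_features(style_img, style_layers)
    l_content = F.mse_loss(fg[content_layer], fc[content_layer])
    l_style = sum(F.mse_loss(gram(fg[i]), gram(fs[i])) for i in style_layers)
    return alpha * l_content + beta * l_style

loss = transfer_loss(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```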
Bidirectional coupling mechanism
To achieve deep collaboration between the painting style rendering module and the cross-lingual translation module, this study designs a bidirectional coupling mechanism. It aligns cross-modal semantics and complements features through visual feature injection and terminology rule feedback.
Visual context injection
The style feature vector from the rendering module, \(F_s \in \mathbb{R}^{B\times d_s}\) (where \(B\) is the batch size and \(d_s = 512\) is the style dimension), is linearly projected into the latent space of the translation encoder. It is then fused with the input ethnic terminology embedding \(E_T \in \mathbb{R}^{B\times L\times d_t}\). Ethnic terminology refers to specialized technique names, pattern symbols, or material terms used in the traditional paintings of different ethnic groups. These terms reflect artistic techniques and carry rich cultural information. For example, “gold-line painting” in Tibetan Thangka, “paired brocade patterns” in Miao silver ornaments, and “color-layered floral painting” in Yi lacquerware are all typical ethnic terms. The fusion process is defined as:
\(\text{Repeat}(F_s, L)\) duplicates the style vector along the sequence dimension to match the terminology sequence length \(L\), and \(W_v \in \mathbb{R}^{d_t\times d_t}\) is a learnable projection matrix. This operation ensures that the translation model processes ethnic terminology under visual context constraints, effectively reducing semantic ambiguity.
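A compact sketch of this injection is given below: the style vector is repeated along the sequence length, passed through a learnable projection standing in for W_v, and added to every token embedding. The single linear projection from d_s = 512 to the terminology dimension (assumed to be 256 here) compresses the two projection steps described above into one for brevity.

```python
import torch
import torch.nn as nn

def inject_visual_context(E_T, F_s, W_v: nn.Linear):
    """Fuse terminology embeddings E_T (B, L, d_t) with a style vector F_s (B, d_s)."""
    B, L, _ = E_T.shape
    F_rep = F_s.unsqueeze(1).expand(B, L, F_s.size(-1))   # Repeat(F_s, L) along the sequence
    return E_T + W_v(F_rep)                               # hybrid embedding under visual context

W_v = nn.Linear(512, 256)                                  # d_s = 512 -> assumed d_t = 256
E_TV = inject_visual_context(torch.randn(4, 20, 256), torch.randn(4, 512), W_v)
```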
Terminology rule feedback
During the output stage of the translation module, cultural-technical rule vectors \(R_t \in \mathbb{R}^{B\times d_r}\) (where \(d_r = 256\)) are extracted and fed back into the decoding stage of the generator in the rendering module. Here, technical rules refer to the creative conventions of specific ethnic paintings, including principles of composition, color schemes, stroke order, and symbol usage. For instance, Tibetan Thangka layouts follow a “central-axis symmetry” principle for figure arrangement, while Miao silver ornaments follow “symmetrical repetition” in pattern arrangement. Encoding these rules as feature constraint vectors allows the model to maintain the ethnic style and cultural characteristics of the artwork during generation. The fusion process is defined as follows:
where \(F_{\text{res}}\) is the visual style feature extracted by ResNet, \(W_r \in \mathbb{R}^{d_r\times d_s}\) is the mapping matrix, and \(\alpha\) is the fusion coefficient (set to 0.7). This mechanism constrains the style generation process using semantic rules, improving the fidelity of cultural symbols and technical features.
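Since the fusion equation itself is not reproduced above, the sketch below assumes a convex-combination form consistent with the description of α = 0.7: the projected rule vector W_r R_t is blended with the ResNet style feature F_res.

```python
import torch
import torch.nn as nn

def rule_feedback_fusion(F_res, R_t, W_r: nn.Linear, alpha: float = 0.7):
    """Blend ResNet style features F_res (B, d_s) with projected rule vectors R_t (B, d_r)."""
    return alpha * F_res + (1.0 - alpha) * W_r(R_t)       # assumed convex-combination form

W_r = nn.Linear(256, 512)                                  # d_r = 256 -> d_s = 512
fused = rule_feedback_fusion(torch.randn(4, 512), torch.randn(4, 256), W_r)
```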
In summary, the bidirectional coupling mechanism forms a closed-loop visual–language collaborative system within the model architecture. Visual context enhances language comprehension, while semantic rule feedback optimizes style generation, enabling dynamic coupling of technical features and cultural semantics.
Training procedure
To ensure reproducibility and stability in training the improved IGAN and the visual-context Transformer for multi-ethnic painting rendering and cross-lingual terminology mapping, this study provides a detailed training procedure and parameter configuration.
In the rendering module, WGAN-GP (Wasserstein GAN with Gradient Penalty) is adopted to prevent mode collapse and maintain training stability. For each training iteration, the discriminator is updated five times while the generator is updated once. The gradient penalty coefficient is set to \(\gamma = 10\) to constrain gradient norms and balance adversarial training. The optimizer is Adam with hyperparameters \(\beta_1 = 0.5\) and \(\beta_2 = 0.999\). The initial learning rate is \(2\times 10^{-4}\), linearly decayed to zero after 100 iterations. The batch size is 16, with a total of 200 training iterations.
To improve temporal stability and model robustness, an exponential moving average (EMA) with a smoothing factor of 0.999 is applied to the generator parameters during training. All experiments are conducted with a fixed random seed (seed = 42) to ensure reproducibility. The model checkpoint achieving the peak F1 score and style similarity on the validation set is saved as the final model for testing and teaching experiments.
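The skeleton below ties these settings together: fixed seed 42, Adam with β1 = 0.5 and β2 = 0.999, five critic updates per generator update, the gradient penalty from the earlier sketch, and an EMA copy of the generator with decay 0.999. The tiny stand-in networks, toy data, and two-step loop exist only so the skeleton runs; the real schedule, models, and additional style/content/rule losses follow the configuration given in this section.

```python
import copy
import random

import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(42); np.random.seed(42); random.seed(42)        # fixed seed (42)

# Tiny stand-ins so the skeleton runs; the real models are the IGAN generator and PatchGAN critic.
generator = nn.Sequential(nn.Conv2d(1, 3, 3, padding=1), nn.Tanh())
discriminator = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))

def gradient_penalty(D, real, fake, gamma=10.0):                  # same form as the WGAN-GP sketch above
    eps = torch.rand(real.size(0), 1, 1, 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    g = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return gamma * ((g.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
ema_g = copy.deepcopy(generator)                                  # EMA copy of the generator

def update_ema(ema, model, decay=0.999):
    with torch.no_grad():
        for pe, p in zip(ema.parameters(), model.parameters()):
            pe.mul_(decay).add_(p, alpha=1 - decay)

for step in range(2):                                             # two toy steps; real run: 200 iterations, batch size 16
    sketches, reals = torch.randn(4, 1, 64, 64), torch.randn(4, 3, 64, 64)
    for _ in range(5):                                            # 5 critic updates per generator update
        fakes = generator(sketches).detach()
        d_loss = (discriminator(fakes).mean() - discriminator(reals).mean()
                  + gradient_penalty(discriminator, reals, fakes))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    fakes = generator(sketches)
    g_loss = -discriminator(fakes).mean()                         # plus style/content/rule losses in the full model
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    update_ema(ema_g, generator)
```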
Loss function weights configuration:

- Adversarial loss \(L_{\text{adv}}\): \(\lambda_{\text{adv}} = 1.0\).
- Style loss \(L_{\text{style}}\): \(\lambda_{\text{style}} = 2.0\).
- Content loss \(L_{\text{content}}\): \(\lambda_{\text{content}} = 1.5\).
- Identity preservation loss \(L_{\text{id}}\): \(\lambda_{\text{id}} = 0.5\).
- Total variation regularization \(L_{\text{TV}}\): \(\lambda_{\text{TV}} = 0.1\).
- Technique rule constraint loss \(L_{\text{rule}}\): \(\lambda_{\text{rule}} = 0.8\).
In the Transformer translation module, AdamW is used with \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), a learning rate of \(1\times 10^{-4}\), and a batch size of 32 for 100 training iterations. The loss function combines the cross-entropy loss \(L_{\text{CE}}\) and the semantic similarity loss \(L_{\text{sim}}\) with a fixed weight ratio of 1:0.5.
Experimental design and performance evaluation
Datasets collection
To achieve deep rendering of multi-ethnic painting styles and cross-lingual understanding, this study constructed a multi-ethnic painting dataset containing both image and text modalities. The dataset includes images, terminology, and cultural annotation texts, covering eight ethnic painting styles and multilingual technical terms.
Data categories and sample distribution
The image data primarily consist of three parts: (i) original ethnic paintings (8 categories), (ii) student course sketches, and (iii) cross-lingual painting terminology corpora. Table 2 presents the category details and sample distribution.
To prevent training bias caused by sample-size imbalance, class resampling and weighted balancing strategies were applied. During training, underrepresented ethnic categories (e.g., Dai and Uyghur) underwent data augmentation, including rotation, flipping, and brightness perturbation. Additionally, class-balanced weights \(\omega_c \in [0.8, 1.2]\) were incorporated into the loss function to ensure balanced learning across categories.
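One way to realize these two strategies is sketched below with torchvision transforms and a class-weighted loss; the per-class counts are illustrative, and mapping inverse class frequency into the stated [0.8, 1.2] range is an assumption about how ω_c is derived.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Augmentation applied to under-represented categories (e.g., Dai and Uyghur samples).
augment = T.Compose([
    T.RandomRotation(15),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2),        # brightness perturbation
])

# Illustrative per-class sample counts; weights are clipped to the stated [0.8, 1.2] range.
counts = torch.tensor([3000.0, 2800.0, 1500.0, 1200.0, 900.0, 800.0, 700.0, 600.0])
w = (counts.mean() / counts).clamp(0.8, 1.2)
criterion = nn.CrossEntropyLoss(weight=w)  # class-weighted loss term for the 8 style categories
```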
Annotation procedure and consistency assessment
The annotation process was conducted in two stages. In the first stage, five experts in ethnic art and linguistics jointly established annotation guidelines and performed preliminary labeling of style categories, technical features, and terminology semantics. In the second stage, a separate expert group conducted verification and cross-checking, with 10% of samples randomly selected for consistency evaluation.
Annotation consistency was measured using Cohen’s κ coefficient and Krippendorff’s α coefficient:

\(\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad \alpha = 1 - \frac{D_o}{D_e}\)

where \(p_o\) is the observed agreement rate, \(p_e\) is the expected agreement rate by chance, and \(D_o\) and \(D_e\) denote the observed and expected disagreements, respectively. Statistical results indicated high annotation consistency, with Cohen’s κ = 0.87 and Krippendorff’s α = 0.89, both exceeding the 0.80 threshold. Samples with consistency below 0.75 were re-evaluated until consensus was reached.
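For reference, Cohen’s κ for two annotators can be computed directly from the observed and chance agreement rates, as in this small self-contained example with toy style labels.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same samples (nominal labels)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n                     # observed agreement
    cats = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labelling six samples with style categories.
a = ["thangka", "miao", "miao", "yi", "thangka", "yi"]
b = ["thangka", "miao", "yi", "yi", "thangka", "yi"]
print(round(cohen_kappa(a, b), 3))   # 0.75 for this toy example
```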
Data splitting strategy
Following the principle of data balance, a stratified sampling strategy was used to divide the dataset, maintaining the proportion of each ethnic category in the training, validation, and test sets. The final split was 70%:20%:10%, resulting in:
- Training set: 16,800 images, 42,000 text entries.
- Validation set: 4,800 images, 12,000 text entries.
- Test set: 2,400 images, 6,000 text entries.
The random seed was fixed at 42 to ensure experimental reproducibility.
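A stratified 70/20/10 split with a fixed seed can be reproduced with scikit-learn as sketched below; the placeholder labels merely stand in for the MEPD style categories.

```python
from sklearn.model_selection import train_test_split

labels = ["thangka"] * 40 + ["miao"] * 30 + ["yi"] * 30       # placeholder style labels for MEPD samples
idx = list(range(len(labels)))

# 70% train, then split the remaining 30% into 20% validation / 10% test, stratified by style.
train_idx, rest_idx, _, y_rest = train_test_split(
    idx, labels, test_size=0.30, stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=1 / 3, stratify=y_rest, random_state=42)
```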
Ethics and usage guidelines
This study strictly adhered to research ethics and data compliance requirements. All images and text samples were sourced with authorization or publicly licensed access. Specifically:
- Ethnic art samples were obtained from legally licensed educational and museum digital resources, with explicit permission from original authors or copyright holders.
- Student sketches were collected with informed consent, with all personal identifiers and distinctive signatures removed.
- Ethnic cultural terminology and symbol explanations were reviewed by academic experts to prevent inappropriate use of sensitive religious or cultural symbols.
Data were used exclusively for academic research and teaching, with no commercial redistribution permitted. Additionally, the study followed a “cultural sensitivity” principle: religious symbols, traditional taboos, and spiritual imagery were blurred or masked during data processing and visualization to respect the original cultural context.
Experimental environment
To ensure efficiency, stability, and accuracy in handling multimodal data for the deep learning–based painting rendering and machine translation system in ethnic university contexts, a high-performance computing environment was established. Hardware and software configurations are summarized in Table 3.
The system is built on the Ubuntu 22.04 LTS operating system, with deep learning frameworks PyTorch 2.0.1 and TensorFlow 2.12.0, leveraging CUDA 12.0 and cuDNN 8.9.2 for GPU acceleration. All experiments used a fixed random seed of 42 to ensure training stability and reproducibility.
Considering copyright restrictions on some ethnic artworks, the full dataset is not publicly available. However, an anonymized and culturally reviewed subset, MEPD-mini, is provided, containing 1,000 images and 2,000 terminology entries, for model reproduction and teaching validation. Additionally, the following resources are made available: pre-trained weights for the improved IGAN and the Visual Context Transformer models (.pth format), data preprocessing and inference scripts, and training/testing logs (including loss curves and performance statistics for 50 key epochs).
All resources will be uploaded to the project homepage (tentatively GitHub and OpenDataLab) after the study is accepted, along with model usage instructions and a data license (CC BY-NC 4.0).
For classroom deployment in ethnic higher education contexts, the system’s inference performance has been optimized. On an NVIDIA A100 80GB GPU, the average inference speed is approximately 12.4 FPS at 512 × 512 resolution, with a memory footprint of roughly 9.8 GB per batch (Batch Size = 8). The system also runs stably on an NVIDIA RTX 4090 (24GB), achieving an average rendering latency below 0.35 s per image.
A lightweight version, optimized with FP16 mixed-precision inference and model distillation, supports single-sample inference on standard GPUs or high-performance laptops, suitable for real-time classroom demonstrations and personalized feedback. Recommended classroom configurations are as follows: batch size: 4–8; inference threads: 8; GPU memory: ≥12 GB; rendering resolution: 512 × 512 for teaching demonstrations, 768 × 768 for research presentations.
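A minimal sketch of FP16 inference for classroom demonstrations is shown below; the one-layer stand-in generator is a placeholder for the trained IGAN generator, and autocast is only enabled when a CUDA GPU is available.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = nn.Sequential(nn.Conv2d(1, 3, 3, padding=1), nn.Tanh()).to(device).eval()  # stand-in for the trained IGAN generator

sketch = torch.randn(1, 1, 512, 512, device=device)          # one 512 x 512 student sketch
with torch.no_grad():
    if device == "cuda":
        with torch.autocast(device_type="cuda", dtype=torch.float16):  # FP16 mixed-precision inference
            rendered = generator(sketch)
    else:
        rendered = generator(sketch)                          # full-precision fallback on CPU
```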
Parameters setting
To ensure efficient training and accurate output of the deep learning rendering and machine translation modules, key parameters were optimized via repeated tuning on the validation set using a controlled variable approach, considering the characteristics of the MEPD dataset and model architectures (e.g., adversarial training in the improved GAN and multi-head attention in the enhanced Transformer). The final parameter configurations are listed in Table 4.
Evaluation metrics design
Precision: Defined as the proportion of pixel regions in the rendered image that conform to the target ethnic painting style, relative to the total rendered area: \(\text{Precision} = \frac{TP_{area}}{TP_{area} + FP_{area}}\), where TP_area denotes the number of pixels in correctly rendered style feature regions, and FP_area the number of pixels incorrectly rendered outside the style feature regions.
Recall: Defined as the proportion of target ethnic painting style feature types successfully covered in the rendered image, relative to the total number of annotated style feature types: \(\text{Recall} = \frac{TP_{type}}{TP_{type} + FN_{type}}\), where TP_type is the number of correctly covered style feature types, and FN_type is the number of style feature types not covered.
F1 Score: The harmonic mean of precision and recall, balancing redundancy avoidance and completeness: \(F1 = \frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision} + \text{Recall}}\).
Fréchet Inception Distance (FID): Quantifies the difference between the feature distributions of generated and real images. Features are extracted from the final 2,048-dimensional pooling layer of a pre-trained Inception-V3 network, and the Fréchet distance between the two distributions is computed. Lower values indicate closer distributions. The computation follows the standard procedure described by Heusel et al.
LPIPS (Learned Perceptual Image Patch Similarity): Evaluates perceptual-level style consistency. Local patch features are extracted from layers conv1–conv5 of a pre-trained AlexNet, and the Euclidean distance between generated and real images is calculated with layer-wise weighting. Lower LPIPS values indicate higher perceptual similarity.
Style Similarity: Calculated via multi-dimensional weighted fusion as StyleSimilarity = 0.4 × (1 − LPIPS) + 0.3 × SSIM + 0.3 × StyleFeatureCosSim. Here, 1 − LPIPS maps LPIPS to the [0,1] range; SSIM denotes the structural similarity index; and StyleFeatureCosSim is the cosine similarity of style feature vectors extracted from the conv4 layer of a ResNet-50 network, normalized from [−1,1] to [0,1]. The final score is scaled to 0–100%, with a threshold of 0.8 indicating successful style matching.
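The composite score can be computed directly from the three sub-metrics, as in the following sketch; the example input values are illustrative, not measured results.

```python
def style_similarity(lpips_value, ssim_value, style_cos_sim):
    """Composite score: 0.4*(1 - LPIPS) + 0.3*SSIM + 0.3*normalized style cosine similarity."""
    cos01 = (style_cos_sim + 1.0) / 2.0                      # map cosine similarity from [-1, 1] to [0, 1]
    score = 0.4 * (1.0 - lpips_value) + 0.3 * ssim_value + 0.3 * cos01
    return 100.0 * score                                     # percentage; >= 80 counts as a style match

print(style_similarity(lpips_value=0.12, ssim_value=0.89, style_cos_sim=0.85))  # ~89.7
```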
SSIM (Structural Similarity Index): Measures structural consistency between generated images and input sketches. Using an 11 × 11 Gaussian-weighted window (\(\sigma = 1.5\)), luminance, contrast, and structure similarities are combined as \(\text{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}\), where \(\mu\) denotes the pixel mean, \(\sigma\) the standard deviation, \(\sigma_{xy}\) the covariance, \(C_1 = (0.01\times 255)^2\), and \(C_2 = (0.03\times 255)^2\). Values range from 0 to 1; higher values indicate greater structural consistency.
PSNR (Peak Signal-to-Noise Ratio): Evaluates pixel-level fidelity based on mean squared error (MSE): PSNR = 10×log₁₀((2⁸−1)²/MSE) for 8-bit images. Higher PSNR values indicate higher pixel-level accuracy.
Semantic Matching: Combines cosine similarity and BERTScore as \(\text{SemanticMatch} = 0.5\times \text{CosSim} + 0.5\times \text{BERTScore}_{F1}\). Here, CosSim is computed from 768-dimensional sentence embeddings extracted with a pre-trained multilingual BERT model (bert-base-multilingual-cased, fine-tuned on ethnic painting terminology), scaled to 0–100%. BERTScore_F1 measures token-level matching, 2 × (precision × recall)/(precision + recall), with IDF weighting applied. A threshold of 0.5 is used to determine effective semantic matching.
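Assuming the sentence embeddings and the BERTScore F1 have been computed externally, the weighted combination reduces to the following sketch; clamping the cosine similarity to [0, 1] is an assumption about how it is scaled.

```python
import torch
import torch.nn.functional as F

def semantic_match(emb_hyp, emb_ref, bert_score_f1):
    """SemanticMatch = 0.5 * CosSim + 0.5 * BERTScore-F1, reported as a percentage."""
    cos = F.cosine_similarity(emb_hyp, emb_ref, dim=-1).clamp(0, 1)   # 768-d mBERT sentence embeddings
    return 100.0 * (0.5 * cos + 0.5 * bert_score_f1)

score = semantic_match(torch.randn(768), torch.randn(768), bert_score_f1=0.82)
```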
Terminology Accuracy (TA): The proportion of correctly translated painting-specific terms among all terms: \(TA = \frac{Correct_{term}}{Total_{term}}\), where Correct_term is the number of correctly translated terms, and Total_term is the total number of terms.
Cultural Interpretation Completeness (CI): The proportion of translations that include all three key elements—“technical rules,” “symbolic meanings,” and “usage scenarios”:
where Completeinfo is the total number of key information items included, and Totalexplanation is the total number of cultural explanations.
Comparative experimental design
To comprehensively validate system performance, four comparative experiments were designed, all conducted on the MEPD test set, with controlled variables and test schemes as follows:
1. Baseline Traditional Deep Learning Rendering Models: CNN-Style, VGG-19, and a basic GAN were used to evaluate the performance of conventional rendering models.

2. Single Machine Translation Models: BERT-Transformer, mBERT, and Google Translate were employed to assess translation accuracy and cultural explanation of ethnic painting terminology.

3. Non-Collaborative Fusion Models: Ablation models were constructed by removing either the “visual context fusion layer” or the “technical constraint vector” to examine the contribution of the collaborative mechanism.

4. Proposed System: Tested for adaptability on single ethnic styles and cross-ethnic fused styles.
Performance evaluation
To further verify the generation quality and detail fidelity of the improved IGAN model in ethnic decorative pattern rendering, this study combines quantitative indicators with qualitative visual analysis (Fig. 5). Figure 5 takes traditional ethnic decorative patterns as the core test object, selecting representative scroll patterns and geometric patterns as input prototypes. It presents the original line draft, the outputs of the improved IGAN model (in two classic ethnic color schemes, red–gold and gold–red), the outputs of the best-performing baseline model, and corresponding reference images of real ethnic decorative patterns. All test samples use identical input parameters and the same experimental environment to ensure reproducibility of the results.
The comparison in Fig. 5 suggests that the improved IGAN model is significantly better than the traditional baseline methods in pattern edge clarity, structural integrity, and style color matching. It accurately reproduces the curling rhythm of scroll patterns and the symmetry of geometric patterns; the generated patterns have smooth, continuous lines and are free of the edge blurring and structural breaks common in baseline outputs. Through its color mapping mechanism, the model achieves harmonious color combinations consistent with ethnic aesthetic conventions. To evaluate robustness, two boundary cases were also tested: restoration of the multi-layer nested structure of complex scroll patterns, and color overflow control under highly saturated color schemes. Diagnostic analysis shows that the model’s limitations in these extreme scenarios stem from two sources: (1) insufficient learning of hierarchical features in ultra-complex patterns, and (2) a color constraint mechanism that needs strengthening for highly saturated schemes. Subsequent research will focus on optimizing the extraction of hierarchical pattern features and improving color-adaptive adjustment algorithms to further enhance the model’s stability and adaptability in complex ethnic pattern generation tasks.
Comparative results of traditional rendering models
This experiment was conducted on 2,000 student sketches representing eight ethnic groups in the MEPD test set. The performance of the improved IGAN was compared with traditional rendering models (CNN-Style, VGG-19, basic GAN), considering metrics including rendering accuracy, efficiency, hardware adaptability, and teaching applicability. The results are presented in Fig. 6.
The experimental results demonstrate that the improved IGAN outperforms traditional models across all evaluated metrics. Compared with the basic GAN, IGAN achieves approximately a 5% improvement in both precision and recall, and an F1 score increase of 6.1%, reflecting higher rendering accuracy and coverage. Its style similarity reaches 91.0%, significantly higher than CNN-Style (80.0%), VGG-19 (82.0%), and basic GAN (85.0%), approaching expert-level quality and effectively reproducing the complex details and stylistic features of ethnic paintings.
To comprehensively evaluate the performance advantages of the proposed improved IGAN in ethnic painting style rendering, this study selected traditional style transfer models (CNN-Style, VGG-19), a baseline generative model (vanilla GAN), and current state-of-the-art models (StyleGAN3, CUT, diffusion models) as baselines. The comparison was conducted across four dimensions: distribution similarity between generated and real images (FID), perceptual similarity (LPIPS), Style Similarity, and generation efficiency (speed). The results are summarized in Table 5.
As shown in Table 5, the improved IGAN outperforms all baseline models on key metrics (FID, LPIPS): FID decreased by 1.9 (lower values indicate closer alignment between generated and real image distributions), and LPIPS decreased by 0.04 (lower values indicate higher perceptual similarity). Additionally, it maintains generation efficiency, with speed only slightly slower than CNN-Style and VGG-19, but much faster than StyleGAN3 and diffusion models, making it suitable for real-time rendering in ethnic college classrooms.
Comparative results of single translation models
This experiment validates the performance of the improved Transformer translation module using a test dataset comprising 5,000 painting-specific terms from three languages: Tibetan, Miao, and Yi. The terms span three categories: techniques, symbols, and materials. Performance was compared with BERT-Transformer, mBERT, and Google Translate, focusing on semantic matching, TA, and cultural interpretation completeness. The results are presented in Fig. 7.
As shown in Fig. 7, the improved model clearly outperforms the baselines across all metrics. For semantic matching, BERT-Transformer, mBERT, and Google Translate achieved 83.4%, 85.2%, and 85.1%, respectively, while the improved model reached 89.6%, an increase of roughly 4 percentage points, demonstrating more precise understanding in complex contexts. In terms of TA, mBERT slightly outperformed BERT-Transformer, whereas Google Translate reached only 82.5%; the improved model achieved 90.3%, handling specialized terms such as “duijin” and “butterfly patterns” effectively. For cultural interpretation completeness, Google Translate scored lowest, BERT-Transformer and mBERT reached 76.5% and 78.3%, respectively, and the improved model attained 88.7%, markedly improving the transmission of technical, symbolic, and material cultural information and highlighting its advantages in ethnic painting translation.
To systematically verify the effectiveness of the improved Transformer in cross-lingual mapping of ethnic painting terminology, additional baselines were included beyond BERT-Transformer, mBERT, and Google Translate. These included visual–language pre-trained models (VLP), multimodal Transformers, and ethnic language-specific models (TibetanBERT). The comparison was conducted across three key dimensions: semantic matching rate, terminology accuracy (TA), and cultural interpretation completeness (CI). The results are summarized in Table 6.
As shown in Table 6, the improved Transformer achieves clear superiority in semantic matching rate, TA, and CI. Compared with the best baseline (the multimodal Transformer), the semantic matching rate improved by 2.5 percentage points, TA by 5.1 percentage points, and CI by 6.2 percentage points. Its core advantage lies in integrating style feature vectors from the rendering module, enabling a “term–visual–cultural” triadic cross-lingual mapping. For example, when translating “paired brocade patterns,” the model preserves the semantic accuracy of “symmetrical pattern” while supplementing the arrangement rules (“left-right symmetry, head-to-tail alignment”) and the pattern’s cultural attributes as a core Miao silver ornament motif.
Ablation study on module collaboration mechanism
An ablation experiment was conducted on the task of “Miao silver ornament sketch rendering + Miao language terminology translation” to evaluate the performance differences between the full system and models with disrupted collaboration mechanisms. This experiment aimed to quantify the contribution of bidirectional information flow between modules to rendering quality and terminology translation accuracy. Results are presented in Fig. 8.
The ablation results indicate that the bidirectional collaboration mechanism—comprising the visual context fusion layer (rendering → translation) and the technical constraint vector transfer (translation → rendering)—is critical for end-to-end performance. With full collaboration, rendering (F1 92.3%) and translation (semantic matching 89.6%, TA 90.3%, cultural interpretation completeness 88.7%) achieve optimal results. Removing the rendering → translation mechanism leads to decreased translation performance, demonstrating that visual context supports semantic interpretation. Omitting the translation → rendering mechanism degrades style and feature restoration in rendering. When both mechanisms are removed, rendering F1 drops to 85.6% and translation semantic matching falls to 82.8%, yielding the poorest overall performance. This confirms the essential role of bidirectional information interaction in enhancing image generation and terminology translation performance.
To quantitatively assess the effect of the bidirectional coupling mechanism, an extended ablation study was conducted on the “Miao Silver Pattern Sketch Rendering + Miao Language Terminology Translation” task, incorporating additional metrics—FID, LPIPS, SSIM, PSNR—and significance testing. Results are summarized in Table 7. The full model achieves the best performance: rendering F1 reaches 92.3 ± 0.8%, FID is 28.2 ± 1.1 (significantly lower than other variants, p < 0.01), LPIPS is 0.12 ± 0.02 (p < 0.01), SSIM is 0.89 ± 0.03, PSNR is 28.5 ± 0.6 dB, and semantic matching score is 89.6 ± 0.7% (p < 0.01).
Removing the visual injection pathway (R→T) causes the semantic matching score to drop to 84.2 ± 0.9% (p < 0.01) and the rendering F1 to 89.1 ± 1.0% (p < 0.05), indicating that visual context is critical for terminology understanding. Removing the terminology feedback pathway (T→R) increases FID to 37.5 ± 1.6 (p < 0.01) and LPIPS to 0.21 ± 0.03 (p < 0.01), highlighting the key role of semantic rule constraints in reproducing style details. When both pathways are removed, system performance drops to its lowest (F1 = 85.6 ± 1.4%, FID = 40.2 ± 1.8), with highly significant differences from the full model (p < 0.01), confirming the essential contribution of bidirectional interaction.
Ethnic style adaptability experiment
This experiment evaluated the system’s adaptability to ethnic cultural styles on the MEPD dataset, including both single-ethnic painting styles and one set of cross-ethnic fused styles. The impact of sample size differences and style complexity on rendering performance was analyzed to provide guidance for applying the system to diverse ethnic painting courses in minority colleges. Results are shown in Fig. 9.
As shown in Fig. 9, the proposed system demonstrates stable performance on single-ethnic painting styles, with F1 scores consistently above 90% and style similarity ranging from 88% to 93%. Performance is higher for larger datasets (e.g., Tibetan Thangka and Miao silver ornaments, F1 ≈ 93–94%) and slightly lower for smaller datasets or more complex patterns (e.g., Dai paper-cut, Uyghur wood carving, Mongolian rock painting, F1 ≈ 88–90%). For cross-ethnic fusion styles (Thangka + Uyghur wood carving), F1 drops to 87.3% and style similarity to 86%, indicating that multi-style fusion increases the difficulty of rendering and feature preservation, highlighting the need for optimized cross-style feature extraction and fusion mechanisms.
Practical validation in teaching scenarios
The system was tested in a painting course at a minority university with two classes of 40 students each (2023 cohort). One served as the experimental group (system-assisted teaching) and the other as the control group (traditional teaching) over a 16-week period. Course content included Tibetan Thangka, Miao silver ornaments, and Yi lacquerware. Evaluation metrics included technique tests, cultural interpretation, artwork grading, and student satisfaction surveys. Results are shown in Fig. 10.
Figure 10 shows that the proposed system significantly improved learning outcomes in ethnic painting instruction. The experimental group achieved an average technique test score of 85.6 (vs. 72.3 in the control group) and a technique mastery rate of 80.2% (25.2 percentage points higher), indicating that the rendering function enhanced understanding and acquisition of painting techniques. Cultural interpretation scores reached 82.4 (vs. 65.7), demonstrating effective cross-language transmission of technical and cultural terminology. Artwork grading averaged 88.3 (vs. 75.1), with a cross-ethnic style creation rate of 35% (vs. 10%), highlighting the system’s support for technique application and creative innovation.
Discussion
The proposed improved IGAN combined with a visually enhanced Transformer exhibits outstanding performance in ethnic painting rendering and cross-language interpretation. Its advantages are threefold: (1) Accurate Feature Capture: The spatial attention mechanism and multi-objective loss function enable precise extraction of key features, balancing style reproduction and content integrity, effectively addressing detail blurring in traditional models. (2) Cross-Modal Semantic Association: The visual context fusion layer and cultural context library link images and terminology, compensating for the neglect of cultural information in generic translation tools. (3) Bidirectional Collaboration Mechanism: The closed-loop interaction of style feature transmission and terminology constraint feedback enhances overall technical and cultural adaptability. System limitations include lower F1 scores for small-sample styles, reduced performance in cross-ethnic fusion due to conflicting colors and brushstroke characteristics, and the 14.2 GB GPU memory requirement, which may not be feasible for all institutions. In teaching applications, the system combines accurate rendering and cross-language interpretation to support technique inheritance, cultural education, and cross-ethnic creative production.
Conclusion
Research contribution
This study makes significant contributions in theory, technology, and practice: (1) Theoretical Contribution: Proposes a “deep learning rendering – machine translation collaboration” framework for ethnic painting instruction, establishing links between visual style and cross-language semantics, and incorporating situational learning and personalized feedback concepts, offering new perspectives for ethnic cultural education. (2) Technical Contribution: Designs an improved GAN rendering model and a visual-context Transformer translation model, enhancing style reproduction and terminology cultural interpretation accuracy, with significant synergistic effects on overall system performance. (3) Practical Contribution: Constructs a multimodal dataset of eight ethnic painting styles and multilingual terminology, and validates system advantages in technique mastery, cultural understanding, and creative innovation through teaching experiments, providing solid support for digital preservation and educational innovation of ethnic painting.
Future works and research limitations
Despite the achievements, limitations remain. Future research should focus on three directions: (1) Research Scope: Expand coverage of ethnic languages and painting styles, collect more artworks and corpora, and optimize cross-ethnic fusion style rendering algorithms to meet diverse creative needs. (2) Technical Aspect: Apply model compression, distillation, and lightweight optimization to reduce hardware requirements, improve rendering efficiency in complex scenarios, and enhance adaptability on standard GPUs or edge devices. (3) Teaching Application: Leverage educational big data to analyze student learning behavior, build personalized learning profiles, and enable precise instructional feedback and collaborative teacher-student creation. Future studies should continue to advance data expansion, model optimization, and refined cultural representation to enhance system generalizability and pedagogical value.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author Safrizal Shahir on reasonable request via e-mail safrizal@usm.my.
References
Xing, L., Razak, H. A. & Noh, L. The aesthetic development of traditional Chinese landscape painting in contemporary landscape painting. Art Perform. Lett. 4 (7), 53–63 (2023).
Karadağ, İ. Transforming sketches into realistic images: leveraging machine learning and image processing for enhanced architectural visualization. Sakarya Univ. J. Sci. 27 (6), 1209–1216 (2023).
Hui, J. et al. Research on Art teaching practice supported by virtual reality (VR) technology in the primary schools. Sustainability 14 (3), 1246 (2022).
Huang, Y. & Fang, F. ‘I feel a sense of solidarity when speaking Teochew’: unpacking family language planning and sustainable development of Teochew from a multilingual perspective. J. Multiling. Multicultural Dev. 45 (5), 1375–1391 (2024).
Mohamed, U. Integrating digital techniques/technologies in developing Egyptian museums (case study: Alexandria library museums, Alexandria city). Sohag Eng. J. 4 (1), 34–47 (2024).
Kan, Y. Research on the integration of multimodal large language models (MLLM) and augmented reality (AR) for smart navigation with real-time cross-language interaction and cognitive load balancing strategies. Int. J. High Speed Electron. Syst. 2540752 (2025).
Huang, G. & Jafari, A. H. Enhanced balancing GAN: Minority-class image generation. Neural Comput. Appl. 35 (7), 5145–5154 (2023).
Hindarto, D. & Handayani, E. T. E. Revolution in image data collection: CycleGAN as a dataset generator. Sinkron: Jurnal dan Penelitian Teknik Informatika 8 (1), 444–454 (2024).
He, J. & Tao, H. Applied research on innovation and development of blue Calico of Chinese intangible cultural heritage based on artificial intelligence. Sci. Rep. 15 (1), 12829 (2025).
Luo, X. & Ren, S. Research on the application of computer visual technology to enhance cultural inheritance and national identity in ethnic pattern design courses. J. Combin. Math. Combin. Comput. 127, 3109–3126 (2025).
Hu, W. & Zhang, Y. Research on artistic pattern generation for clothing design based on style transfer. J. Combin. Math. Combin. Comput. 127, 4539–4550 (2025).
Jiang, W. et al. CycleH-CUT: an unsupervised medical image translation method based on cycle consistency and hybrid contrastive learning. Phys. Med. Biol. 70 (5), 055014 (2025).
Zhao, Y. et al. A novel flexible identity-net with diffusion models for painting-style generation. Sci. Rep. 15 (1), 27896 (2025).
Li, Y. Tracking on the edge of Chinese discourses: drawing the trajectories of Pema Tseden’s Tibetan filmmaking in the PRC. Q. Rev. Film Video. 42 (3), 837–871 (2025).
Banar, N., Daelemans, W. & Kestemont, M. Transfer learning for the visual arts: the multi-modal retrieval of iconclass codes. ACM J. Comput. Cult. Herit. 16 (2), 1–16 (2023).
Liu, Y. & Zhu, C. The use of deep learning and artificial intelligence-based digital technologies in art education. Sci. Rep. 15 (1), 15859 (2025).
Zhang, J. et al. Vision-language models for vision tasks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 46 (8), 5625–5644 (2024).
Gao, X. et al. Multimodal visual-semantic representations learning for scene text recognition. ACM Trans. Multimedia Comput. Commun. Appl. 20 (7), 1–18 (2024).
Kufel, J. et al. What is machine learning, artificial neural networks and deep learning?—Examples of practical applications in medicine. Diagnostics 13 (15), 2582 (2023).
Kaveh, M. & Mesgari, M. S. Application of meta-heuristic algorithms for training neural networks and deep learning architectures: A comprehensive review. Neural Process. Lett. 55 (4), 4519–4622 (2023).
Iqbal, S., Qureshi, N. & Li, A. On the analyses of medical images using traditional machine learning techniques and convolutional neural networks. Arch. Comput. Methods Eng. 30 (5), 3173–3233 (2023).
Mehmood, F., Ahmad, S. & Whangbo, T. K. An efficient optimization technique for training deep neural networks. Mathematics 11 (6), 1360 (2023).
Tao, S. et al. Deep-learning-based amplitude variation with angle inversion with multi-input neural networks. Processes 12 (10), 2259 (2024).
Wang, Y. et al. Temperature prediction of lithium-ion battery based on artificial neural network model. Appl. Therm. Eng. 228, 120482 (2023).
Zhao, Q. & Zhang, R. Classification of painting styles based on the difference component. Expert Syst. Appl. 259, 125287 (2025).
Chang, Y. H. et al. Color face image generation with improved generative adversarial networks. Electronics 13 (7), 1205 (2024).
Shubham, M. Breaking language barriers: advancements in machine translation for enhanced cross-lingual information retrieval. J. Electr. Syst. 20 (9s), 2860–2875 (2024).
Bai, Y. & Lei, S. Cross-language dissemination of Chinese classical literature using multimodal deep learning and artificial intelligence. Sci. Rep. 15 (1), 21648 (2025).
Zhang, J. et al. Knowledge translator: cross-lingual course video text style transform via imposed sequential attention networks. Electronics 14 (6), 1213 (2025).
Chen, Y. et al. UPST-NeRF: universal photorealistic style transfer of neural radiance fields for 3D scenes. IEEE Trans. Vis. Comput. Graph. 31 (4), 2045–2057 (2024).
Kong, X. et al. Exploring the temporal consistency of arbitrary style transfer: a channelwise perspective. IEEE Trans. Neural Networks Learn. Syst. 35 (6), 8482–8496 (2023).
Shah, S. R. et al. Comparing inception V3, VGG 16, VGG 19, CNN, and ResNet 50: a case study on early detection of a rice disease. Agronomy 13 (6), 1633 (2023).
Rastogi, D., Johri, P. & Tiwari, V. Augmentation based detection model for brain tumor using VGG 19. Int. J. Comput. Digit. Syst. 13 (1), 1–1 (2023).
Kokkula, A., Sekhar, P. C. & Naidu, T. M. P. Modified VGG-19 deep learning strategies for Parkinson’s disease diagnosis: a comprehensive review and novel approach. Integr. Biomedical Res. 8 (3), 1–11 (2024).
Qawasmeh, B., Oh, J. S. & Kwigizile, V. Comparative analysis of AlexNet, ResNet-50, and VGG-19 performance for automated feature recognition in pedestrian crash diagrams. Appl. Sci. 15 (6), 2928 (2025).
Sahoo, P. K. et al. An improved VGG-19 network induced enhanced feature pooling for precise moving object detection in complex video scenes. IEEE Access 12, 45847–45864 (2024).
Author information
Contributions
Suyimeng Wang: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation. Safrizal Shahir: Writing—review and editing, visualization, supervision, project administration, funding acquisition. Muhammad Uzair Ismail: Conceptualization, methodology, software, validation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.