Introduction

Ethnic painting arts carry the historical memory, values, and aesthetic characteristics of China’s diverse ethnic groups, such as the mineral pigment techniques of Tibetan Thangka, the symmetrical aesthetics of Miao silver ornaments, and the color symbolism systems of Yi lacquerware1. However, painting instruction in ethnic universities faces three major challenges: the transmission of techniques relies heavily on oral teaching and limited original works; symbols and terminology depend on ethnic language backgrounds, making systematic comprehension difficult for students; and standardized teaching models constrain personalization and creative development.

Deep learning techniques for image generation and style transfer can automate the rendering process from sketches to finished works, enhancing the visualization of technical instruction2. Transformer-based translation models can overcome cross-linguistic barriers, enabling semantic mapping of terminology and cultural symbols. Educational psychology research indicates that “contextualized learning” and “personalized feedback” are crucial in art education, and immersive rendering with deep learning combined with cross-linguistic cultural interpretation aligns well with these pedagogical needs. Existing studies largely focus on single technologies: although Convolutional Neural Network (CNN)-based style transfer can achieve image transformation, it falls short in restoring detailed textures in ethnic paintings; similarly, machine translation research for ethnic languages is often limited to everyday communication and lacks support for specialized terminology and cultural symbols.

This study introduces three main innovations: (1) Painting Rendering Module Based on Improved Generative Adversarial Network (GAN): Enables precise recognition and automated rendering of eight ethnic painting styles, providing visual feedback on technical details. (2) Cross-Linguistic Painting Terminology Mapping Module Based on Transformer: Integrates ethnic languages and a professional terminology database to achieve cross-linguistic interpretation of cultural symbols, addressing language gaps in teaching. (3) Dynamic Weight Bidirectional Coupling Mechanism: Unlike static fusion in conventional cross-modal models, this mechanism adjusts the fusion coefficient \(\:\alpha\:\) dynamically, ranging from 0.5 to 0.9. It controls the balance between visual injection (R→T) and terminology feedback (T→R) using a Style Complexity Perceptor. This dynamic adjustment improves feature compatibility in cross-ethnic style fusion scenarios.

Literature review

The core objective of painting instruction in ethnic universities is to unify “technique transmission” and “cultural identity.” Hui (2022) noted that traditional “lecture–copying” approaches had left students’ understanding of patterns at a superficial level, making it difficult to grasp underlying cultural logic3. Huang and Fang (2024) identified gaps in ethnic language and terminology as the primary constraints on cultural comprehension4. In terms of resource development, Mohamed (2024) proposed digital museums that preserved high-resolution details but were not integrated into teaching5. Kan (2025) experimented with Virtual Reality (VR) to enhance immersion, but high costs and the lack of cross-linguistic interpretation limited its classroom applicability6.

In the field of deep learning–based image generation and style transfer, recent research has made significant progress, providing a technological foundation for the intelligent rendering of ethnic paintings. Huang and Jafari (2023) proposed GANs that effectively improved the realism of generated images7. Hindarto and Handayani (2024) developed CycleGANs to overcome unpaired data limitations, enabling ordinary photos to be transformed into Thangka-style images8. He and Tao (2025) introduced a CNN-based style transfer approach that separated content and style features, but handling complex patterns often led to texture discontinuities and detail distortion9. Luo and Ren (2025) attempted to optimize ethnic painting rendering with improved GANs but covered only two ethnic styles and did not integrate teaching applications, which limited classroom utility10. Meanwhile, several high-impact studies in recent years have further advanced both the theory and applications of style transfer. For example, Hu and Zhang (2025) proposed StyleGAN3, which achieves stable, high-fidelity style generation through phase consistency improvements, offering new approaches for reproducing brushstrokes and texture details in ethnic paintings11. Jiang et al. (2025) introduced a contrastive learning mechanism within the Contrastive Unpaired Translation (CUT) framework, significantly enhancing style consistency under unpaired data conditions12. Zhao et al. (2025) combined Diffusion Models with Transformers to achieve painting style synthesis with stronger semantic control, producing results that outperform traditional GAN models in both texture granularity and color expression. These studies provide both theoretical inspiration and a technical foundation for the improved IGAN model proposed in this research13.

In the field of cross-linguistic and cross-cultural interpretation, machine translation technologies have provided essential support for the cross-lingual mapping of ethnic painting terminology. Li (2025) developed a Mandarin–Tibetan Thangka terminology dictionary; however, it was limited in scope and overlooked cultural contextual differences14. Recent research explored the collaborative application of deep learning and machine translation, primarily in cross-modal content generation. Banar (2023) combined style transfer and translation to generate multilingual descriptions of artworks, yet technical skill support was lacking15. In educational settings, Liu and Zhu (2025) built an intelligent teaching system integrating deep learning and translation models to provide bilingual feedback on student assignments, but its design targeted STEM subjects and did not accommodate the visual and creative characteristics of painting arts16. Moreover, recent studies highlight the importance of jointly modeling visual and textual semantics. For example, Zhang (2024) proposed the Vision-Language Pretraining (VLP) framework17, and Gao (2024) developed multimodal Transformer models18. These studies provide key references for designing the visual context fusion layer in this research.

In summary, while prior research has advanced image generation, style transfer, and cross-linguistic translation, a collaborative framework that integrates deep learning–based rendering with machine translation–based cross-lingual interpretation is still lacking. This gap is especially evident in ethnic painting education, where challenges persist in technique inheritance, language barriers, and shallow cultural understanding. To address this, the present study develops an intelligent teaching system that combines visual style generation with semantic mapping. This system aims to support the digital preservation and cross-cultural dissemination of ethnic painting.

Research model

Deep learning-based painting rendering module

The core technology of deep learning is the neural network, whose design is inspired by the structure and information transmission patterns of the human brain19,20. The human brain functions as a complex system for information interaction, and deep learning emulates this structure: each connection between neurons carries an independent weight parameter, and numerous interconnected neurons together form a large-scale parameterized network21,22,23. A critical step in deep learning is training these networks to optimize the weight parameters of all neurons.

The simplest form of a neural network typically consists of an input layer, one or more hidden layers, and an output layer24. The input layer receives the data to be processed, which can include diverse types such as images, audio, or other preprocessed information. The hidden layers, connected to the input layer, process and transform the incoming signals. These layers often consist of multiple levels and represent the most structurally complex part of the network, containing the largest number of neurons. The output layer receives the processed information from the hidden layers and produces the final result of the trained network.

The structure of a neural network is illustrated in Fig. 1. The hidden layers can comprise multiple levels, and all neurons between successive layers are fully connected, a pattern known as a fully connected network.

Fig. 1
figure 1

Neural network architecture.

The style feature extraction submodule is responsible for capturing the multi-level stylistic features of ethnic paintings25. Its network architecture is based on ResNet-50 combined with a spatial attention mechanism. ResNet-50 extracts features hierarchically through five convolutional stages (conv1 to conv5): lower layers capture fine details such as edges and textures, while higher layers identify style semantics. To emphasize critical features and suppress redundancy, a spatial attention mechanism is introduced. This mechanism uses pixel-wise importance weighting to enhance responses to key painting regions while reducing background noise. The attention weights are computed as follows:

$$\:A\left(s\right)=\sigma\:\left({W}_{2}\cdot\:ReLU\left({W}_{1}\cdot\:s+{b}_{1}\right)+{b}_{2}\right)$$
(1)

where s denotes the feature map output from ResNet-50, W1 and W2 are weight matrices, b1 and b2 are bias terms, σ is the sigmoid activation function, and A(s) represents the final attention weight map.

To ensure cultural adaptation, a technical rule database is incorporated, storing specific standards for each ethnic painting style. These rules are encoded as feature constraint vectors R and fused with the ResNet-extracted image features to obtain the final style feature vector F:

$$\:F=\alpha\:\cdot\:{F}_{ResNet}+\left(1-\alpha\:\right)\cdot\:R$$
(2)

where FResNet is the feature vector extracted by ResNet-50, α is the fusion coefficient (ranging from 0.6 to 0.8, optimized experimentally), and R is the technical rule constraint vector.
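To make the computation in Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch of the style feature extraction submodule. The class and tensor names (e.g., `StyleFeatureExtractor`, `rule_vec`) and the choice of α = 0.7 (within the 0.6–0.8 range reported above) are illustrative assumptions rather than the authors' implementation; only the ResNet-50 backbone, the pixel-wise attention of Eq. (1), and the weighted fusion of Eq. (2) follow the description in the text.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class SpatialAttention(nn.Module):
    """Pixel-wise attention A(s) = sigmoid(W2 · ReLU(W1 · s + b1) + b2), Eq. (1)."""

    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        # 1x1 convolutions play the role of W1/W2 applied at every spatial position
        self.w1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.w2 = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.w2(torch.relu(self.w1(s))))  # (B, 1, H, W) attention map
        return s * a                                         # re-weight the feature map


class StyleFeatureExtractor(nn.Module):
    """ResNet-50 features + spatial attention, fused with rule vectors via Eq. (2)."""

    def __init__(self, style_dim: int = 512, alpha: float = 0.7):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # conv1..conv5 stages
        self.attention = SpatialAttention(channels=2048)
        self.project = nn.Linear(2048, style_dim)  # map pooled features to the style space
        self.alpha = alpha

    def forward(self, image: torch.Tensor, rule_vec: torch.Tensor) -> torch.Tensor:
        s = self.features(image)                     # (B, 2048, H/32, W/32)
        s = self.attention(s)                        # emphasise key painting regions
        f_resnet = self.project(s.mean(dim=(2, 3)))  # global pooling -> (B, style_dim)
        # Eq. (2): weighted fusion with the technical-rule constraint vector R
        return self.alpha * f_resnet + (1.0 - self.alpha) * rule_vec


# Usage sketch: rule_vec would come from the technical rule database (random here).
extractor = StyleFeatureExtractor()
style_vec = extractor(torch.randn(2, 3, 224, 224), torch.randn(2, 512))  # (2, 512)
```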

The intelligent rendering module is based on a GAN framework, consisting of a generator and a discriminator, enabling the transformation of student sketches into images with ethnic painting styles26. The generator builds upon the traditional U-Net encoder–decoder framework by incorporating residual blocks and a self-attention mechanism, enabling both fine-grained texture preservation and global style consistency.

  1. Sketch Feature Encoding: The sketch is converted into a content feature vector C, preserving the main outlines and spatial layout.

  2. Style Feature Injection: The style feature vector F is fused with the content vector C to form a hybrid feature H (a minimal sketch of this step follows the list):

$$H=C\odot F+C+F$$
(3)

  3. Image Generation: The hybrid feature H is input into the decoder, which gradually restores the resolution through residual blocks to produce the final image.
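The style injection step of Eq. (3) can be sketched as follows, assuming the content map C and a spatially broadcast style map F share the same channel dimension; the function and tensor names are illustrative.

```python
import torch


def inject_style(content: torch.Tensor, style_vec: torch.Tensor) -> torch.Tensor:
    """Hybrid feature H = C ⊙ F + C + F (Eq. 3).

    content:   (B, C, H, W) encoder feature map of the sketch
    style_vec: (B, C) style feature vector, broadcast over spatial positions
    """
    f = style_vec[:, :, None, None].expand_as(content)  # broadcast F to (B, C, H, W)
    return content * f + content + f                    # element-wise product plus residual terms


# Example: 512-channel mid-level features at 32 x 32 resolution
c = torch.randn(4, 512, 32, 32)
f = torch.randn(4, 512)
h = inject_style(c, f)  # (4, 512, 32, 32), fed into the decoder
```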

The discriminator’s core function is to distinguish between “real ethnic paintings” and generator-rendered images. Its loss function is a multi-objective formulation combining adversarial loss, style loss, and content loss. The discriminator adopts a 70 × 70 PatchGAN architecture, performing image authenticity assessment at the patch level to enhance both detail perception and discriminative accuracy.

$$\:{\mathcal{L}}_{total}={\mathcal{L}}_{adv}+{\lambda\:}_{1}\cdot\:{\mathcal{L}}_{style}+{\lambda\:}_{2}\cdot\:{\mathcal{L}}_{content}$$
(4)

where \(\:{\mathcal{L}}_{adv}\) is the adversarial loss, implemented using WGAN-GP to prevent mode collapse:

$$\mathcal{L}_{adv}={E}_{x\sim {p}_{data}}\left[D\left(x\right)\right]-{E}_{z\sim {p}_{z}}\left[D\left(G\left(z\right)\right)\right]+\gamma \cdot {E}_{\widehat{x}\sim {p}_{\widehat{x}}}\left[{\left({\parallel {\nabla }_{\widehat{x}}D\left(\widehat{x}\right)\parallel }_{2}-1\right)}^{2}\right]$$
(5)

where D(x) is the discriminator output for a real image x, G(z) is the generator output for noise z, \(\widehat{x}\) is an interpolated sample between real and generated images, and γ is the gradient penalty coefficient, which constrains the discriminator’s gradient norm and maintains training stability.

  • \(\:{\mathcal{L}}_{style}\) enforces consistency between the generated image and real works in the style feature space.

  • \(\:{\mathcal{L}}_{content}\) ensures correspondence between the generated image and the input sketch in the content feature space.

  • λ1 and λ2 are weighting coefficients that balance the contributions of different loss terms.
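The following sketch shows how the multi-objective loss of Eqs. (4)–(5) could be assembled in PyTorch, including the WGAN-GP gradient penalty. The function names are illustrative, the critic loss is written in the form conventionally minimised by the discriminator, and the λ values mirror those reported later in the training procedure; the style and content terms are passed in as precomputed scalars.

```python
import torch


def gradient_penalty(discriminator, real, fake, gamma: float = 10.0) -> torch.Tensor:
    """WGAN-GP term: gamma * E[(||∇_x̂ D(x̂)||_2 - 1)^2] on interpolated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_hat = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gamma * ((grad_norm - 1.0) ** 2).mean()


def discriminator_loss(D, real, fake):
    """Critic objective (minimised by D): E[D(G(z))] - E[D(x)] + gradient penalty."""
    return D(fake.detach()).mean() - D(real).mean() + gradient_penalty(D, real, fake)


def generator_loss(D, fake, style_loss, content_loss, lam_style=2.0, lam_content=1.5):
    """Eq. (4): adversarial term plus weighted style and content terms (minimised by G)."""
    adv = -D(fake).mean()
    return adv + lam_style * style_loss + lam_content * content_loss
```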

The style vector and rule constraint vector are injected at different stages of the generator: the style vector into the mid-level features of the encoder, and the rule vector into the style reconstruction layer of the decoder. The style vector is linearly mapped to a 256-dimensional space and normalized layer by layer to modulate the local color and brushstroke distribution of the generated image. The rule vector is processed through a fully connected layer and added residually to the skip connection outputs, ensuring that the painting’s structure and compositional rules are preserved. The generator and discriminator topologies are illustrated in Fig. 2, and the detailed layer configurations are provided in Table 1.

Fig. 2
figure 2

Detailed architecture of the improved IGAN rendering module.

Table 1 GAN model layer configuration details.

During training, the generator and discriminator are jointly updated via adversarial optimization. The generator aims to minimize a composite loss function, which includes adversarial loss, style loss, content loss, and regularization terms. The discriminator, in contrast, maximizes classification accuracy to enhance convergence stability. This architecture ensures a clear hierarchical design and transparent parameter configuration, providing a standardized technical foundation for performance replication and cross-study validation.

Cross-language interpretation module via machine translation

The cross-language interpretation module is based on an improved Transformer architecture, designed to enable mapping among ethnic languages, Mandarin, and painting-specific terminology while incorporating cultural context27,28. Its core innovation lies in the integration of a visual context fusion layer, which leverages the style feature vectors from the rendering module to assist translation29. This design effectively addresses the ambiguity of technical terms that often arises when traditional models rely solely on textual input. The improved Transformer structure is defined as follows:

Encoder: The input consists of ethnic language term vectors ET. Simultaneously, visual vectors EV, derived from style feature vectors, are introduced. These two types of vectors are weighted and fused to form a hybrid vector ETV, as shown in Eq. (6):

$$\:{E}_{TV}={E}_{T}+{W}_{V}\cdot\:{E}_{V}+{b}_{V}$$
(6)

where WV is the weight matrix for the visual vector, and bV is the bias term. The fused vector is then fed into multi-head attention layers and feed-forward networks to extract deep semantic representations.

The loss function for this module combines cross-entropy loss with semantic similarity loss:

$$\:{\mathcal{L}}_{trans}={\mathcal{L}}_{CE}+{\lambda\:}_{3}\cdot\:{\mathcal{L}}_{sim}$$
(7)

where \(\:{\mathcal{L}}_{CE}\) is the cross-entropy loss, measuring the discrepancy between the model-generated translation and the reference translation; \(\:{\mathcal{L}}_{sim}\) is the semantic similarity loss, calculated using cosine similarity between the generated translation and the visual feature vector to ensure semantic alignment with the painting context; λ3 is a weighting coefficient, set to 0.5 in experiments.
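A minimal sketch of the combined translation loss in Eq. (7), with λ3 = 0.5 as stated above. The mean-pooling of decoder states, the padding index, and the projection `proj` mapping decoder states to the visual feature dimension are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def translation_loss(logits, targets, dec_states, visual_vec, proj, lam3: float = 0.5):
    """L_trans = L_CE + lam3 * L_sim (Eq. 7).

    logits:     (B, L, V) decoder output distribution over the target vocabulary
    dec_states: (B, L, d) decoder hidden states of the generated translation
    visual_vec: (B, d_v)  style feature vector from the rendering module
    proj:       nn.Linear mapping decoder states to the visual dimension d_v (assumed)
    """
    ce = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=0)  # padding id 0 assumed
    sent = proj(dec_states.mean(dim=1))                   # mean-pool, then project to d_v
    sim = F.cosine_similarity(sent, visual_vec, dim=-1)   # alignment with the painting context
    l_sim = (1.0 - sim).mean()                            # higher similarity -> lower loss
    return ce + lam3 * l_sim
```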

Style transfer network

The style transfer network, originally proposed by Gatys et al., employs VGGNet to achieve fusion and transfer of content and style features. The network defines two loss functions: the content loss and the style loss, which are combined in a weighted manner to form the final loss function30,31. By iteratively minimizing this loss function during training, the network produces images with the desired style transformation. The architecture of the Visual Context Transformer translation module is shown in Fig. 3.

Fig. 3
figure 3

Architecture of the visual context transformer translation module.
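As a concrete reference for the Gatys-style formulation described above, the following is a minimal sketch of VGG-based content and style losses using Gram matrices. The layer indices and loss weights are common illustrative choices, not the exact configuration used in this study.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG-19 feature extractor; content/style features are taken from its conv layers
vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYER = 21                  # conv4_2 in torchvision's indexing (assumed choice)
STYLE_LAYERS = [0, 5, 10, 19, 28]   # conv1_1 .. conv5_1 (assumed choice)


def vgg_features(x, layers):
    feats, h = {}, x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in layers:
            feats[i] = h
    return feats


def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # normalised Gram matrix


def gatys_loss(generated, content_img, style_img, w_content=1.0, w_style=10.0):
    layers = set(STYLE_LAYERS) | {CONTENT_LAYER}
    g, c, s = (vgg_features(x, layers) for x in (generated, content_img, style_img))
    content_loss = F.mse_loss(g[CONTENT_LAYER], c[CONTENT_LAYER])
    style_loss = sum(F.mse_loss(gram(g[i]), gram(s[i])) for i in STYLE_LAYERS)
    return w_content * content_loss + w_style * style_loss
```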

To ensure correct loss computation, the training network structure must be consistent with that of VGGNet32,33,34,35,36. The detailed training process of the rendering network is illustrated in Fig. 4.

Fig. 4
figure 4

Training process of the rendering network.

Bidirectional coupling mechanism

To achieve deep collaboration between the painting style rendering module and the cross-lingual translation module, this study designs a bidirectional coupling mechanism. It aligns cross-modal semantics and complements features through visual feature injection and terminology rule feedback.

Visual context injection

The style feature vector from the rendering module, \(\:{F}_{s}\in\:{\mathbb{R}}^{B\times\:{d}_{s}}\) (where \(\:B\) is the batch size and \(\:{d}_{s}=512\) is the style dimension), is linearly projected into the latent space of the translation encoder. It is then fused with the input ethnic terminology embedding \(\:{E}_{T}\in\:{\mathbb{R}}^{B\times\:L\times\:{d}_{t}}\). Ethnic terminology refers to specialized technique names, pattern symbols, or material terms used in the traditional paintings of different ethnic groups. These terms reflect artistic techniques and carry rich cultural information. For example, “gold-line painting” in Tibetan Thangka, “paired brocade patterns” in Miao silver ornaments, and “color-layered floral painting” in Yi lacquerware are all typical ethnic terms. The fusion process is defined as:

$$\:{E}_{T}^{{\prime\:}}=\text{LayerNorm}({E}_{T}+\text{MLP}(\text{Repeat}({F}_{s},L))\cdot\:{W}_{v})$$

\(\:\text{Repeat}({F}_{s},L)\) duplicates the style vector along the sequence length to match the terminology sequence length \(\:L\), and \(\:{W}_{v}\in\:{\mathbb{R}}^{{d}_{t}\times\:{d}_{t}}\) is a learnable projection matrix. This operation ensures that the translation model processes ethnic terminology under visual context constraints, effectively reducing semantic ambiguity.

Terminology rule feedback

During the output stage of the translation module, cultural-technical rule vectors \(\:{R}_{t}\in\:{\mathbb{R}}^{B\times\:{d}_{r}}\) (where \({d}_{r}=256\)) are extracted and fed back into the decoding stage of the generator in the rendering module. Here, technical rules refer to the creative conventions of specific ethnic paintings, including principles of composition, color schemes, stroke order, and symbol usage. For instance, Tibetan Thangka layouts follow a “central-axis symmetry” principle for figure arrangement, while Miao silver ornaments follow “symmetrical repetition” in pattern arrangement. Encoding these rules as feature constraint vectors allows the model to maintain the ethnic style and cultural characteristics of the artwork during generation. The fusion process is defined as follows:

$$\:{F}_{f}=\alpha\:\cdot\:{F}_{res}+(1-\alpha\:)\cdot\:{W}_{r}{R}_{t}$$

where \(\:{F}_{\text{res}}\) is the visual style feature extracted by ResNet, \(\:{W}_{r}\in\:{\mathbb{R}}^{{d}_{r}\times\:{d}_{s}}\) is the mapping matrix, and \(\:\alpha\:\) is the fusion coefficient (set to 0.7). This mechanism constrains the style generation process using semantic rules, improving the fidelity of cultural symbols and technical features.

In summary, the bidirectional coupling mechanism forms a closed-loop visual–language collaborative system within the model architecture. Visual context enhances language comprehension, while semantic rule feedback optimizes style generation, enabling dynamic coupling of technical features and cultural semantics.
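A compact sketch of the two coupling directions defined above is given below. The dimensions (d_s = 512, d_r = 256) and the fixed α = 0.7 follow the text, while the module name, MLP depth, and terminology embedding dimension d_t are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn


class BidirectionalCoupling(nn.Module):
    """Visual context injection (R→T) and terminology rule feedback (T→R)."""

    def __init__(self, d_s=512, d_t=512, d_r=256, alpha=0.7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_s, d_t), nn.ReLU(), nn.Linear(d_t, d_t))
        self.w_v = nn.Linear(d_t, d_t, bias=False)   # learnable projection W_v
        self.norm = nn.LayerNorm(d_t)
        self.w_r = nn.Linear(d_r, d_s, bias=False)   # rule mapping W_r
        self.alpha = alpha

    def inject_visual(self, e_t, f_s):
        """E'_T = LayerNorm(E_T + MLP(Repeat(F_s, L)) · W_v)."""
        b, seq_len, _ = e_t.shape
        f_rep = f_s.unsqueeze(1).expand(b, seq_len, -1)   # Repeat(F_s, L)
        return self.norm(e_t + self.w_v(self.mlp(f_rep)))

    def feed_back_rules(self, f_res, r_t):
        """F_f = α · F_res + (1 − α) · W_r R_t."""
        return self.alpha * f_res + (1.0 - self.alpha) * self.w_r(r_t)


coupling = BidirectionalCoupling()
e_t = torch.randn(2, 16, 512)   # terminology embeddings (B, L, d_t)
f_s = torch.randn(2, 512)       # style vector from the rendering module
r_t = torch.randn(2, 256)       # cultural-technical rule vector
e_t_prime = coupling.inject_visual(e_t, f_s)
f_fused = coupling.feed_back_rules(torch.randn(2, 512), r_t)
```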

Training procedure

To ensure reproducibility and stability in training the improved IGAN and the visual-context Transformer for multi-ethnic painting rendering and cross-lingual terminology mapping, this study provides a detailed training procedure and parameter configuration.

In the rendering module, WGAN-GP (Wasserstein GAN with Gradient Penalty) is adopted to prevent mode collapse and maintain training stability. For each training iteration, the discriminator is updated 5 times while the generator is updated once. The gradient penalty coefficient is set to \(\:\gamma\:=10\) to constrain gradient norms and balance adversarial training. The optimizer is Adam with hyperparameters \(\:{\beta\:}_{1}=0.5\), \(\:{\beta\:}_{2}=0.999\). The initial learning rate is \(\:2\times\:{10}^{-4}\), which is linearly decayed to zero after 100 iterations. Batch size is 16, with a total of 200 training iterations.

To improve temporal stability and model robustness, exponential moving average (EMA) is applied to the generator parameters during training, with a smoothing factor of 0.999. All experiments are conducted with a fixed random seed (\(\:seed=42\)) to ensure reproducibility. The model checkpoint achieving the peak F1 score and style similarity on the validation set is saved as the final model for testing and teaching experiments.
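The update schedule and EMA smoothing described above can be summarised in the following minimal sketch. The tiny generator/discriminator and random data are stand-ins for the architecture of Fig. 2 / Table 1, the critic loss omits the gradient penalty for brevity, and in practice the full composite loss with the weights listed below would be used.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(42)                      # fixed seed, as reported
N_CRITIC, EMA_DECAY = 5, 0.999

# Tiny stand-in networks; the real G/D follow the IGAN design described earlier.
G = nn.Sequential(nn.Conv2d(1, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2))            # PatchGAN-like critic map

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
g_ema = copy.deepcopy(G).eval()            # EMA copy used for validation/testing


@torch.no_grad()
def update_ema(model, ema_model, decay=EMA_DECAY):
    for p, p_ema in zip(model.parameters(), ema_model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)


for step in range(3):                                  # a few illustrative iterations
    sketch, real = torch.randn(4, 1, 64, 64), torch.randn(4, 3, 64, 64)
    for _ in range(N_CRITIC):                          # 5 critic updates per iteration
        fake = G(sketch).detach()
        d_loss = D(fake).mean() - D(real).mean()       # plus gradient penalty in practice
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
    fake = G(sketch)
    g_loss = -D(fake).mean()                           # plus style/content/rule terms
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    update_ema(G, g_ema)                               # smooth generator parameters (0.999)
```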

Loss function weights configuration:

  • Adversarial loss \(\:{L}_{\text{adv}}\): \(\:{\lambda\:}_{\text{adv}}=1.0\).

  • Style loss \(\:{L}_{\text{style}}\): \(\:{\lambda\:}_{\text{style}}=2.0\).

  • Content loss \(\:{L}_{\text{content}}\): \(\:{\lambda\:}_{\text{content}}=1.5\).

  • Identity preservation loss \(\:{L}_{\text{id}}\): \(\:{\lambda\:}_{\text{id}}=0.5\).

  • Total variation regularization \(\:{L}_{\text{TV}}\): \(\:{\lambda\:}_{\text{TV}}=0.1\).

  • Technique rule constraint loss \(\:{L}_{\text{rule}}\): \(\:{\lambda\:}_{\text{rule}}=0.8\).

In the Transformer translation module, AdamW is used with \(\:{\beta\:}_{1}=0.9\), \(\:{\beta\:}_{2}=0.999\), a learning rate of \(\:1\times\:{10}^{-4}\), and a batch size of 32 for 100 training iterations. The loss function combines cross-entropy loss \(\:{L}_{\text{CE}}\) and semantic similarity loss \(\:{L}_{\text{sim}}\) with a fixed weight ratio of 1:0.5.

Experimental design and performance evaluation

Datasets collection

To achieve deep rendering of multi-ethnic painting styles and cross-lingual understanding, this study constructed a multi-ethnic painting dataset containing both image and text modalities. The dataset includes images, terminology, and cultural annotation texts, covering eight ethnic painting styles and multilingual technical terms.

Data categories and sample distribution

The dataset primarily consists of three parts: (i) original ethnic paintings (8 categories), (ii) student course sketches, and (iii) cross-lingual painting terminology corpora. Table 2 presents the category details and sample distribution.

Table 2 Ethnic painting categories and sample distribution.

To prevent training bias caused by sample size imbalance, class resampling and weighted balancing strategies were applied. During training, underrepresented ethnic categories (e.g., Dai and Uyghur) underwent data augmentation, including rotation, flipping, and brightness perturbation. Additionally, class-balanced weights \(\:{\omega\:}_{c}\in\:[0.8,\:1.2]\) were incorporated into the loss function to ensure balanced learning across multiple categories.
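A sketch of the resampling and weighting strategy described above, assuming a torchvision-style augmentation pipeline; the transform parameters, the placeholder class counts, and the linear mapping of inverse frequency into ω_c ∈ [0.8, 1.2] are illustrative assumptions.

```python
import torch
from torchvision import transforms

# Augmentation applied to under-represented categories (e.g., Dai, Uyghur)
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2),   # brightness perturbation
    transforms.ToTensor(),
])

# Class-balanced weights: rarer classes receive weights closer to 1.2, frequent ones to 0.8
counts = torch.tensor([6000., 5500., 4000., 2500., 1800., 1500., 1400., 1300.])  # placeholder counts
inv = counts.sum() / counts
omega = 0.8 + 0.4 * (inv - inv.min()) / (inv.max() - inv.min())   # scaled into [0.8, 1.2]
criterion = torch.nn.CrossEntropyLoss(weight=omega)  # weights shown here on a generic classification loss
```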

Annotation procedure and consistency assessment

The annotation process was conducted in two stages. In the first stage, five experts in ethnic art and linguistics jointly established annotation guidelines and performed preliminary labeling of style categories, technical features, and terminology semantics. In the second stage, a separate expert group conducted verification and cross-checking, with 10% of samples randomly selected for consistency evaluation.

Annotation consistency was measured using Cohen’s κ coefficient (Kappa) and Krippendorff’s α coefficient (Alpha):

$$\:\kappa\:=\frac{{p}_{o}-{p}_{e}}{1-{p}_{e}},\alpha\:=1-\frac{{D}_{o}}{{D}_{e}}$$

where \(\:{p}_{o}\) is the observed agreement rate, \(\:{p}_{e}\) is the expected agreement rate by chance, and \(\:{D}_{o}\) and \(\:{D}_{e}\) denote the observed and expected disagreements, respectively. Statistical results indicated high annotation consistency, with Cohen’s κ = 0.87 and Krippendorff’s α = 0.89, both exceeding the 0.80 threshold. Samples with consistency below 0.75 were re-evaluated until consensus was reached.
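For reference, Cohen's κ can be computed directly from the definition above; the toy label lists are placeholders, and Krippendorff's α follows analogously from observed and expected disagreement.

```python
from collections import Counter


def cohen_kappa(a, b):
    """κ = (p_o − p_e) / (1 − p_e) for two annotators' category labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)         # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Toy example with two annotators labelling eight samples
ann1 = ["thangka", "miao", "yi", "thangka", "dai", "miao", "yi", "thangka"]
ann2 = ["thangka", "miao", "yi", "miao",    "dai", "miao", "yi", "thangka"]
print(round(cohen_kappa(ann1, ann2), 3))
```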

Data splitting strategy

Following the principle of data balance, a stratified sampling strategy was used to divide the dataset, maintaining the proportion of each ethnic category in the training, validation, and test sets. The final split was 70%:20%:10%, resulting in:

  • Training set: 16,800 images, 42,000 text entries.

  • Validation set: 4,800 images, 12,000 text entries.

  • Test set: 2,400 images, 6,000 text entries.

The random seed was fixed at 42 to ensure experimental reproducibility.
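The stratified 70/20/10 split with a fixed seed can be reproduced with scikit-learn as sketched below; `labels` denotes the ethnic category of each image and is a placeholder array.

```python
from sklearn.model_selection import train_test_split

# Sample indices and their ethnic-category labels (placeholder arrays, 24,000 images total)
indices = list(range(24000))
labels = ["thangka"] * 6000 + ["miao"] * 6000 + ["yi"] * 4000 + ["other"] * 8000

# First carve out 70% for training, stratified by category
train_idx, rest_idx, _, rest_lab = train_test_split(
    indices, labels, train_size=0.7, stratify=labels, random_state=42)

# Split the remaining 30% into validation (20%) and test (10%), i.e. a 2:1 ratio
val_idx, test_idx = train_test_split(
    rest_idx, train_size=2 / 3, stratify=rest_lab, random_state=42)
```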

Ethics and usage guidelines

This study strictly adhered to research ethics and data compliance requirements. All images and text samples were sourced with authorization or publicly licensed access. Specifically:

  • Ethnic art samples were obtained from legally licensed educational and museum digital resources, with explicit permission from original authors or copyright holders.

  • Student sketches were collected with informed consent, with all personal identifiers and distinctive signatures removed.

  • Ethnic cultural terminology and symbol explanations were reviewed by academic experts to prevent inappropriate use of sensitive religious or cultural symbols.

Data were used exclusively for academic research and teaching, with no commercial redistribution permitted. Additionally, the study followed a “cultural sensitivity” principle: religious symbols, traditional taboos, and spiritual imagery were blurred or masked during data processing and visualization to respect the original cultural context.

Experimental environment

To ensure efficiency, stability, and accuracy in handling multimodal data for the deep learning–based painting rendering and machine translation system in ethnic university contexts, a high-performance computing environment was established. Hardware and software configurations are summarized in Table 3.

Table 3 Experimental environment configuration.

The system is built on the Ubuntu 22.04 LTS operating system, with deep learning frameworks PyTorch 2.0.1 and TensorFlow 2.12.0, leveraging CUDA 12.0 and cuDNN 8.9.2 for GPU acceleration. All experiments used a fixed random seed of 42 to ensure training stability and reproducibility.

Considering copyright restrictions on some ethnic artworks, the full dataset is not publicly available. However, an anonymized and culturally reviewed subset, MEPD-mini, is provided, containing 1,000 images and 2,000 terminology entries, for model reproduction and teaching validation. Additionally, the following resources are made available: pre-trained weights for the improved IGAN and the Visual Context Transformer models (.pth format), data preprocessing and inference scripts, and training/testing logs (including loss curves and performance statistics for 50 key epochs).

All resources will be uploaded to the project homepage (tentatively GitHub and OpenDataLab) after the study is accepted, along with model usage instructions and a data license (CC BY-NC 4.0).

For classroom deployment in ethnic higher education contexts, the system’s inference performance has been optimized. On an NVIDIA A100 80GB GPU, the average inference speed is approximately 12.4 FPS at 512 × 512 resolution, with a memory footprint of roughly 9.8 GB per batch (Batch Size = 8). The system also runs stably on an NVIDIA RTX 4090 (24GB), achieving an average rendering latency below 0.35 s per image.

A lightweight version, optimized with FP16 mixed-precision inference and model distillation, supports single-sample inference on standard GPUs or high-performance laptops, suitable for real-time classroom demonstrations and personalized feedback. Recommended classroom configurations are as follows: batch size: 4–8; inference threads: 8; GPU memory: ≥12 GB; rendering resolution: 512 × 512 for teaching demonstrations, 768 × 768 for research presentations.

Parameters setting

To ensure efficient training and accurate output of the deep learning rendering and machine translation modules, key parameters were optimized via repeated tuning on the validation set using a controlled variable approach, considering the characteristics of the MEPD dataset and model architectures (e.g., adversarial training in the improved GAN and multi-head attention in the enhanced Transformer). The final parameter configurations are listed in Table 4.

Table 4 Key model hyperparameters.

Evaluation metrics design

Precision: Defined as the proportion of pixel regions in the rendered image that conform to the target ethnic painting style relative to the total rendered area. Its calculation formula is as follows:

$$Precision=\frac{{TP}_{area}}{{TP}_{area}+{FP}_{area}}\times 100\text{\%}$$
(8)

TParea denotes the number of pixels in correctly rendered style feature regions, while FParea represents the number of pixels incorrectly rendered outside the style feature regions.

Recall: Defined as the proportion of target ethnic painting style feature types successfully covered in the rendered image relative to the total number of annotated style feature types. The calculation formula is as follows:

$$\:Recall=\frac{{TP}_{type}}{{TP}_{type}+{FN}_{type}}\times\:100\text{\%}$$
(9)

where TPtype is the number of correctly covered style feature types, and FNtype is the number of style feature types not covered.

F1 Score: The harmonic mean of precision and recall, balancing redundancy avoidance and completeness:

$$F1=\frac{2\times Precision\times Recall}{Precision+Recall}\times 100\text{\%}$$
(10)
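A small sketch of how Eqs. (8)–(10) could be evaluated, with pixel-level masks for precision and type-level sets for recall; the mask and type inputs are placeholders.

```python
import numpy as np


def pixel_precision(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Eq. (8): correctly rendered style pixels over all rendered style pixels."""
    tp_area = np.logical_and(pred_mask, gt_mask).sum()
    fp_area = np.logical_and(pred_mask, ~gt_mask).sum()
    return 100.0 * tp_area / (tp_area + fp_area)


def type_recall(pred_types: set, gt_types: set) -> float:
    """Eq. (9): covered style feature types over all annotated types."""
    tp = len(pred_types & gt_types)
    fn = len(gt_types - pred_types)
    return 100.0 * tp / (tp + fn)


def f1(precision: float, recall: float) -> float:
    """Eq. (10): harmonic mean of precision and recall (in percent)."""
    return 2 * precision * recall / (precision + recall)


# Toy example
pred = np.zeros((4, 4), bool); pred[:2] = True   # rendered style region
gt = np.zeros((4, 4), bool); gt[:3] = True       # annotated style region
p = pixel_precision(pred, gt)                                            # 100.0
r = type_recall({"gold_line", "halo"}, {"gold_line", "halo", "lotus"})   # ≈ 66.7
print(p, r, f1(p, r))
```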

Fréchet Inception Distance (FID): Quantifies the difference between the feature distributions of generated and real images. Features are extracted from the 2,048-dimensional final pooling layer of a pre-trained Inception-V3 network, and the Fréchet distance between the two distributions is computed. Lower values indicate closer distributions. The computation follows the standard procedure described by Heusel et al.

LPIPS (Learned Perceptual Image Patch Similarity): Evaluates perceptual-level style consistency. Local patch features are extracted from layers conv1–conv5 of a pre-trained AlexNet, and the Euclidean distance between generated and real images is calculated with layer-wise weighting. Lower LPIPS values indicate higher perceptual similarity.

Style Similarity: Calculated via multi-dimensional weighted fusion as follows: StyleSimilarity = 0.4 × (1 − LPIPS) + 0.3 × SSIM + 0.3 × StyleFeatureCosSim. Here, \(1-\text{LPIPS}\) maps LPIPS to the [0,1] range; SSIM denotes the structural similarity index; StyleFeatureCosSim represents the cosine similarity of style feature vectors extracted from the conv4 stage of a ResNet-50 network, normalized from [−1,1] to [0,1]. The final score is scaled to 0–100%, with a threshold of 0.8 (i.e., 80%) indicating successful style matching.
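A sketch of the composite style similarity score, assuming the `lpips` package and scikit-image are available; the ResNet-50 conv4-stage feature extraction and the normalisation steps follow the description above, while the variable names and input conventions are illustrative.

```python
import torch
import torch.nn.functional as F
import lpips                                   # pip install lpips
from skimage.metrics import structural_similarity as ssim
from torchvision.models import resnet50, ResNet50_Weights

lpips_fn = lpips.LPIPS(net="alex")             # perceptual distance on AlexNet features
resnet = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
conv4 = torch.nn.Sequential(*list(resnet.children())[:7])   # up to layer3 (conv4 stage)


def style_similarity(gen_t, real_t, gen_np, real_np) -> float:
    """StyleSimilarity = 0.4·(1 − LPIPS) + 0.3·SSIM + 0.3·StyleFeatureCosSim, in percent.

    gen_t / real_t:   float tensors in [-1, 1], shape (1, 3, H, W)
    gen_np / real_np: uint8 arrays (H, W, 3) for SSIM
    """
    with torch.no_grad():
        d = lpips_fn(gen_t, real_t).item()                        # lower = more similar
        fg, fr = conv4(gen_t).flatten(1), conv4(real_t).flatten(1)
        cos = F.cosine_similarity(fg, fr).item()
    s = ssim(gen_np, real_np, channel_axis=-1, data_range=255)
    cos01 = (cos + 1.0) / 2.0                                     # map [-1, 1] -> [0, 1]
    return 100.0 * (0.4 * (1.0 - d) + 0.3 * s + 0.3 * cos01)
```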

SSIM (Structural Similarity Index): Measures structural consistency between generated images and input sketches. Using an 11 × 11 Gaussian-weighted window (\(\sigma =1.5\)), luminance (l), contrast (c), and structure (s) similarities are combined as:

$$SSIM(x,y)=\frac{(2{\mu }_{x}{\mu }_{y}+{C}_{1})(2{\sigma }_{xy}+{C}_{2})}{({\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1})({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2})}$$

where \(\mu\) denotes the pixel mean, \(\sigma\) the standard deviation, \({\sigma }_{xy}\) the covariance, \({C}_{1}=(0.01\times 255{)}^{2}\), and \({C}_{2}=(0.03\times 255{)}^{2}\). Values range from 0 to 1; higher values indicate greater structural consistency.

PSNR (Peak Signal-to-Noise Ratio): Evaluates pixel-level fidelity based on mean squared error (MSE): PSNR = 10×log₁₀((2⁸−1)²/MSE) for 8-bit images. Higher PSNR values indicate higher pixel-level accuracy.

Semantic Matching: Combines cosine similarity and BERT-Score metrics as follows: \(\:\text{SemanticMatch}=0.5\times\:\text{CosSim}+0.5\times\:{\text{BERT-Score}}_{F1}\). Here, CosSim is computed from 768-dimensional sentence embeddings extracted using a pre-trained multilingual BERT model (bert-base-multilingual-cased, fine-tuned on ethnic painting terminology), scaled to 0–100%. BERT-ScoreF1 measures token-level matching: 2×(precision × recall)/(precision + recall) with IDF weighting applied. A threshold of 0.5 is used to determine effective semantic matching.
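A sketch of the cosine-similarity half of the semantic matching score, using mean-pooled embeddings from `bert-base-multilingual-cased` via Hugging Face Transformers (the fine-tuned checkpoint is not reproduced here). The BERT-Score F1 term would be computed separately and combined with the 0.5/0.5 weighting described above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

NAME = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(NAME)
enc = AutoModel.from_pretrained(NAME).eval()


@torch.no_grad()
def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pooled 768-d embedding over non-padding tokens."""
    batch = tok(text, return_tensors="pt", truncation=True)
    hidden = enc(**batch).last_hidden_state          # (1, L, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)


def cos_sim_score(candidate: str, reference: str) -> float:
    """CosSim component of SemanticMatch, scaled to 0–100%."""
    a, b = sentence_embedding(candidate), sentence_embedding(reference)
    return 100.0 * F.cosine_similarity(a, b).item()


# SemanticMatch = 0.5 * CosSim + 0.5 * BERT-Score_F1 (the latter computed externally)
print(cos_sim_score("gold-line painting technique", "gold line drawing method"))
```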

Terminology Accuracy (TA): The proportion of correctly translated painting-specific terms among all terms:

$$\:TA=\frac{{Correct}_{term}}{{Total}_{term}}\times\:100\text{\%}$$
(11)

where Correctterm is the number of correctly translated terms, and Totalterm is the total number of terms.

Cultural Interpretation Completeness (CI): The proportion of translations that include all three key elements—“technical rules,” “symbolic meanings,” and “usage scenarios”:

$$\:CI=\frac{{Complete}_{info}}{3\times\:{Total}_{explanation}}\times\:100\text{\%}$$
(12)

where Completeinfo is the total number of key information items included, and Totalexplanation is the total number of cultural explanations.

Comparative experimental design

To comprehensively validate system performance, four comparative experiments were designed, all conducted on the MEPD test set, with controlled variables and test schemes as follows:

  1. 1.

    Baseline Traditional Deep Learning Rendering Models: CNN-Style, VGG-19, and basic GAN were used to evaluate the performance of conventional rendering models.

  2. 2.

    Single Machine Translation Models: BERT-Transformer, mBERT, and Google Translate were employed to assess translation accuracy and cultural explanation of ethnic painting terminology.

  3. 3.

    Non-Collaborative Fusion Models: Ablation models were constructed by removing either the “visual context fusion layer” or the “technical constraint vector” to examine the contribution of the collaborative mechanism.

  4. 4.

    Proposed System: Tested for adaptability on single ethnic styles and cross-ethnic fused styles.

Performance evaluation

To further verify the generation quality and detail fidelity of the improved IGAN model in ethnic decorative pattern rendering tasks, this study combines quantitative indicators with qualitative visual analysis (Fig. 5). Figure 5 uses traditional ethnic decorative patterns as the test objects, selecting representative scroll (curly-grass) patterns and geometric patterns as input prototypes. It presents the original line drafts, the results generated by the improved IGAN model (in two classic ethnic color schemes, red–gold and gold–red), the outputs of the best-performing baseline model, and the corresponding real decorative pattern references. All test samples use identical input parameters and the same experimental environment to ensure reproducibility.

Fig. 5
figure 5

Comparison of the generation results of ethnic decorative patterns, demonstrating the differences between the improved IGAN model and the baseline method.

The comparison in Fig. 5 shows that the improved IGAN model clearly outperforms the traditional baseline methods in pattern edge clarity, structural integrity, and style-consistent color matching. It accurately restores the curling rhythm of scroll patterns and the symmetry of geometric patterns; the generated patterns have smooth, continuous lines and avoid the edge blurring and structural breaks common in the baseline outputs, while the color mapping mechanism produces harmonious color schemes consistent with ethnic aesthetic conventions. To evaluate robustness, two typical boundary cases were also tested: restoring the multi-layer nested structure of complex scroll patterns, and controlling color overflow under highly saturated color schemes. Diagnostic analysis indicates that the model’s limitations in these extreme scenarios stem from two sources: (1) insufficient learning of hierarchical features in highly complex patterns, and (2) a color constraint mechanism that needs strengthening for high-saturation schemes. Subsequent research will optimize hierarchical pattern feature extraction and improve the color-adaptive adjustment algorithm to further enhance stability and adaptability in complex ethnic decorative pattern generation tasks.

Comparative results of traditional rendering models

This experiment was conducted on 2,000 student sketches representing eight ethnic groups in the MEPD test set. The performance of the improved IGAN was compared with traditional rendering models (CNN-Style, VGG-19, basic GAN), considering metrics including rendering accuracy, efficiency, hardware adaptability, and teaching applicability. The results are presented in Fig. 6.

Fig. 6
figure 6

Comparison of style rendering performance between traditional rendering models and the proposed system.

The experimental results demonstrate that the improved IGAN outperforms traditional models across all evaluated metrics. Compared with the basic GAN, IGAN achieves approximately a 5% improvement in both precision and recall, and an F1 score increase of 6.1%, reflecting higher rendering accuracy and coverage. Its style similarity reaches 91.0%, significantly higher than CNN-Style (80.0%), VGG-19 (82.0%), and basic GAN (85.0%), approaching expert-level quality and effectively reproducing the complex details and stylistic features of ethnic paintings.

To comprehensively evaluate the performance advantages of the proposed improved IGAN in ethnic painting style rendering, this study selected traditional style transfer models (CNN-Style, VGG-19), a baseline generative model (vanilla GAN), and current state-of-the-art models (StyleGAN3, CUT, diffusion models) as baselines. The comparison was conducted across four dimensions: distribution similarity between generated and real images (FID), perceptual similarity (LPIPS), Style Similarity, and generation efficiency (speed). The results are summarized in Table 5.

Table 5 Performance comparison of different rendering models in ethnic painting style transfer tasks.

As shown in Table 5, the improved IGAN outperforms all baseline models on key metrics (FID, LPIPS): FID decreased by 1.9 (lower values indicate closer alignment between generated and real image distributions), and LPIPS decreased by 0.04 (lower values indicate higher perceptual similarity). Additionally, it maintains generation efficiency, with speed only slightly slower than CNN-Style and VGG-19, but much faster than StyleGAN3 and diffusion models, making it suitable for real-time rendering in ethnic college classrooms.

Comparative results of single translation models

This experiment validates the performance of the improved Transformer translation module using a test dataset comprising 5,000 painting-specific terms from three languages: Tibetan, Miao, and Yi. The terms span three categories: techniques, symbols, and materials. Performance was compared with BERT-Transformer, mBERT, and Google Translate, focusing on semantic matching, TA, and cultural interpretation completeness. The results are presented in Fig. 7.

Fig. 7
figure 7

Cross-language interpretation performance: single translation models vs. proposed system.

As shown in Fig. 7, the improved model clearly outperforms the baseline models across all metrics. For semantic matching, BERT-Transformer, mBERT, and Google Translate achieved 83.4%, 85.2%, and 85.1%, respectively, while the improved model reached 89.6%, an increase of approximately 4 percentage points, demonstrating more precise understanding in complex contexts. In terms of TA, mBERT slightly outperforms BERT-Transformer, whereas Google Translate achieves only 82.5%. The improved model reaches 90.3%, effectively handling specialized terms such as “duijin” and “butterfly patterns.” Regarding cultural interpretation completeness, Google Translate scores lowest, BERT-Transformer and mBERT reach 76.5% and 78.3%, respectively, while the improved model attains 88.7%, significantly improving the transmission of technical, symbolic, and material cultural information and highlighting its advantages in ethnic painting translation.

To systematically verify the effectiveness of the improved Transformer in cross-lingual mapping of ethnic painting terminology, additional baselines were included beyond BERT-Transformer, mBERT, and Google Translate. These included visual–language pre-trained models (VLP), multimodal Transformers, and ethnic language-specific models (TibetanBERT). The comparison was conducted across three key dimensions: semantic matching rate, terminology accuracy (TA), and cultural interpretation completeness (CI). The results are summarized in Table 6.

Table 6 Performance comparison of different translation models for cross-language mapping of ethnic painting terminology.

As shown in Table 6, the improved Transformer achieves clear superiority in semantic matching rate, TA, and CI. Compared with the best baseline (the multimodal Transformer), the semantic matching rate improved by 2.5 percentage points, TA by 5.1 percentage points, and CI by 6.2 percentage points. Its core advantage lies in integrating style feature vectors from the rendering module, enabling a “term–visual–cultural” triadic cross-lingual mapping. For example, when translating “paired brocade patterns,” the model preserves the semantic accuracy of “symmetrical pattern” while supplementing the arrangement rules (“left-right symmetry, head-to-tail alignment”) and the cultural attributes of core Miao silver ornament patterns.

Ablation study on module collaboration mechanism

An ablation experiment was conducted on the task of “Miao silver ornament sketch rendering + Miao language terminology translation” to evaluate the performance differences between the full system and models with disrupted collaboration mechanisms. This experiment aimed to quantify the contribution of bidirectional information flow between modules to rendering quality and terminology translation accuracy. Results are presented in Fig. 8.

Fig. 8
figure 8

Ablation study results of module collaboration mechanism.

The ablation results indicate that the bidirectional collaboration mechanism—comprising the visual context fusion layer (rendering → translation) and the technical constraint vector transfer (translation → rendering)—is critical for end-to-end performance. With full collaboration, rendering (F1 92.3%) and translation (semantic matching 89.6%, TA 90.3%, cultural interpretation completeness 88.7%) achieve optimal results. Removing the rendering → translation mechanism leads to decreased translation performance, demonstrating that visual context supports semantic interpretation. Omitting the translation → rendering mechanism degrades style and feature restoration in rendering. When both mechanisms are removed, rendering F1 drops to 85.6% and translation semantic matching falls to 82.8%, yielding the poorest overall performance. This confirms the essential role of bidirectional information interaction in enhancing image generation and terminology translation performance.

To quantitatively assess the effect of the bidirectional coupling mechanism, an extended ablation study was conducted on the “Miao Silver Pattern Sketch Rendering + Miao Language Terminology Translation” task, incorporating additional metrics—FID, LPIPS, SSIM, PSNR—and significance testing. Results are summarized in Table 7. The full model achieves the best performance: rendering F1 reaches 92.3 ± 0.8%, FID is 28.2 ± 1.1 (significantly lower than other variants, p < 0.01), LPIPS is 0.12 ± 0.02 (p < 0.01), SSIM is 0.89 ± 0.03, PSNR is 28.5 ± 0.6 dB, and semantic matching score is 89.6 ± 0.7% (p < 0.01).

Removing the visual injection pathway (R→T) causes the semantic matching score to drop to 84.2 ± 0.9% (p < 0.01) and the rendering F1 to 89.1 ± 1.0% (p < 0.05), indicating that visual context is critical for terminology understanding. Removing the terminology feedback pathway (T→R) increases FID to 37.5 ± 1.6 (p < 0.01) and LPIPS to 0.21 ± 0.03 (p < 0.01), highlighting the key role of semantic rule constraints in reproducing style details. When both pathways are removed, system performance drops to its lowest (F1 = 85.6 ± 1.4%, FID = 40.2 ± 1.8), with highly significant differences from the full model (p < 0.01), confirming the essential contribution of bidirectional interaction.

Table 7 Ablation study results for the bidirectional coupling mechanism.

Ethnic style adaptability experiment

This experiment evaluated the system’s adaptability to ethnic cultural styles on the MEPD dataset, including both single-ethnic painting styles and one set of cross-ethnic fused styles. The impact of sample size differences and style complexity on rendering performance was analyzed to provide guidance for applying the system to diverse ethnic painting courses in minority colleges. Results are shown in Fig. 9.

Fig. 9
figure 9

Rendering performance of the proposed system across different ethnic painting styles.

As shown in Fig. 9, the proposed system demonstrates stable performance on single-ethnic painting styles, with F1 scores consistently above 90% and style similarity ranging from 88% to 93%. Performance is higher for larger datasets (e.g., Tibetan Thangka and Miao silver ornaments, F1 ≈ 93–94%) and slightly lower for smaller datasets or more complex patterns (e.g., Dai paper-cut, Uyghur wood carving, Mongolian rock painting, F1 ≈ 88–90%). For cross-ethnic fusion styles (Thangka + Uyghur wood carving), F1 drops to 87.3% and style similarity to 86%, indicating that multi-style fusion increases the difficulty of rendering and feature preservation, highlighting the need for optimized cross-style feature extraction and fusion mechanisms.

Practical validation in teaching scenarios

The system was tested in a painting course at a minority university with two classes of 40 students each (2023 cohort). One served as the experimental group (system-assisted teaching) and the other as the control group (traditional teaching) over a 16-week period. Course content included Tibetan Thangka, Miao silver ornaments, and Yi lacquerware. Evaluation metrics included technique tests, cultural interpretation, artwork grading, and student satisfaction surveys. Results are shown in Fig. 10.

Fig. 10
figure 10

Practical teaching validation results.

Figure 10 shows that the proposed system significantly improved learning outcomes in ethnic painting instruction. The experimental group achieved an average technique test score of 85.6 (vs. 72.3 in the control group) and a technique mastery rate of 80.2% (25.2 percentage points higher), indicating that the rendering function enhanced understanding and acquisition of painting techniques. Cultural interpretation scores reached 82.4 (vs. 65.7), demonstrating effective cross-language transmission of technical and cultural terminology. Artwork grading averaged 88.3 (vs. 75.1), with a cross-ethnic style creation rate of 35% (vs. 10%), highlighting the system’s support for technique application and creative innovation.

Discussion

The proposed improved IGAN combined with a visually enhanced Transformer exhibits outstanding performance in ethnic painting rendering and cross-language interpretation. Its advantages are threefold: (1) Accurate Feature Capture: The spatial attention mechanism and multi-objective loss function enable precise extraction of key features, balancing style reproduction and content integrity, effectively addressing detail blurring in traditional models. (2) Cross-Modal Semantic Association: The visual context fusion layer and cultural context library link images and terminology, compensating for the neglect of cultural information in generic translation tools. (3) Bidirectional Collaboration Mechanism: The closed-loop interaction of style feature transmission and terminology constraint feedback enhances overall technical and cultural adaptability. System limitations include lower F1 scores for small-sample styles, reduced performance in cross-ethnic fusion due to conflicting colors and brushstroke characteristics, and the 14.2 GB GPU memory requirement, which may not be feasible for all institutions. In teaching applications, the system combines accurate rendering and cross-language interpretation to support technique inheritance, cultural education, and cross-ethnic creative production.

Conclusion

Research contribution

This study makes significant contributions in theory, technology, and practice: (1) Theoretical Contribution: Proposes a “deep learning rendering – machine translation collaboration” framework for ethnic painting instruction, establishing links between visual style and cross-language semantics, and incorporating situational learning and personalized feedback concepts, offering new perspectives for ethnic cultural education. (2) Technical Contribution: Designs an improved GAN rendering model and a visual-context Transformer translation model, enhancing style reproduction and terminology cultural interpretation accuracy, with significant synergistic effects on overall system performance. (3) Practical Contribution: Constructs a multimodal dataset of eight ethnic painting styles and multilingual terminology, and validates system advantages in technique mastery, cultural understanding, and creative innovation through teaching experiments, providing solid support for digital preservation and educational innovation of ethnic painting.

Future works and research limitations

Despite the achievements, limitations remain. Future research should focus on three directions: (1) Research Scope: Expand coverage of ethnic languages and painting styles, collect more artworks and corpora, and optimize cross-ethnic fusion style rendering algorithms to meet diverse creative needs. (2) Technical Aspect: Apply model compression, distillation, and lightweight optimization to reduce hardware requirements, improve rendering efficiency in complex scenarios, and enhance adaptability on standard GPUs or edge devices. (3) Teaching Application: Leverage educational big data to analyze student learning behavior, build personalized learning profiles, and enable precise instructional feedback and collaborative teacher-student creation. Future studies should continue to advance data expansion, model optimization, and refined cultural representation to enhance system generalizability and pedagogical value.