Introduction

In the context of contemporary education, ideological and political education (IPE) plays a central role in cultivating individuals with firm ideals and beliefs, sound value orientations, and a strong sense of social responsibility1. However, traditional IPE models often encounter practical challenges, including monotonous content presentation and limited learner engagement2. With the rapid advancement of digital technologies and the increasing diversity of educational scenarios, innovating the delivery methods of IPE has become essential. Additionally, improving the precision and effectiveness of educational resource provision has emerged as a critical challenge that requires urgent attention in the education sector3.

As an important carrier of China’s revolutionary culture and advanced socialist culture, red music contains rich elements of IPE4. Red music emerged during the historical periods of the New Democratic Revolution, socialist construction, and reform and opening-up. It not only chronicles the Chinese Communist Party’s leadership and the people’s struggles through artistic expression but also embodies core values such as patriotism, collectivism, and revolutionary heroism5. From the passionate resistance expressed in The Yellow River Cantata to the reform enthusiasm conveyed in On the Hopeful Field, red music creates a distinct spiritual map. Through the dual narrative of melody and lyrics, it provides vivid emotional resonance and serves as an effective medium for transmitting values in IPE6.

The rise of intelligent deep learning technology has opened new paths for the innovative application of red music in IPE resources7. Deep learning models, with their strong capabilities in feature extraction, pattern recognition, and data mining, can perform multi-level analyses of extensive red music resources. These analyses span melodic structures, lyrical semantics, and social-emotional contexts, enabling intelligent classification, personalized recommendation, and contextual adaptation of educational resources8. By establishing an intelligent chain connecting the “red music feature space,” learner cognitive profiles, and education goal matching, this approach aims to overcome the traditional “flood irrigation” method of resource supply. It seeks to build a new educational ecosystem centered on precise demand identification, dynamic content generation, and real-time feedback optimization. This shift promotes IPE from an experience-driven model to a data-driven framework and transforms one-way dissemination into interactive, two-way engagement9.

This study focuses on the interdisciplinary integration of red music, IPE, and deep learning. It addresses key questions such as how deep learning can uncover the educational connotations of red music, how to construct an intelligent recommendation model that accounts for learners’ personalized characteristics, and how effectively the model enhances the relevance and appeal of IPE. The findings not only enrich the theoretical understanding of digital technologies empowering IPE but also provide practical technical solutions for the preservation and innovation of red culture in the new era. Ultimately, this contributes to cultivating socialist builders and successors equipped with both cultural confidence and a strong sense of contemporary responsibility.

Literature review

Against the backdrop of educational digital transformation and the innovative development of IPE, the integration of red music into the field of IPE resource recommendation has emerged as a new hotspot in educational research and practice10. In recent years, many scholars have conducted in-depth discussions on educational resource recommendation systems: Urdaneta-Ponte et al. (2021) systematically reviewed the application status, challenges, and future trends of recommendation systems in educational scenarios through a systematic review, laying a theoretical foundation for follow-up research11. Machado et al. (2021) proposed an adaptive educational resource recommendation framework, emphasizing dynamic adjustment of recommendation strategies based on learners’ characteristics and learning contexts to improve resource adaptability12. Tavakoli et al. (2022) constructed an AI-based open recommendation system that integrated labor market demands with personalized education, expanding the application scope of educational resource recommendation13. In the context of adaptive learning support and personalized review material recommendation, Okubo et al. (2022) developed a system that leveraged learners’ historical data and learning behavior patterns to deliver resources with high precision. This approach effectively enhanced learning outcomes by tailoring materials to individual needs and learning trajectories14.

Focusing on personalized recommendation of educational resources, Raj and Renumol (2022) conducted a systematic literature review covering 2015–2020. They analyzed the development context, technical architectures, and application effectiveness of adaptive content recommenders in personalized learning environments15. Fu et al. (2022) developed a personalized educational resource recommendation system leveraging big data. Their approach applied data mining and analysis techniques to filter content aligned with learners’ interests and abilities from massive datasets16. Zhu (2023) employed an adaptive genetic algorithm to enable personalized recommendations, improving both the search efficiency and accuracy of the recommendation model17.

In the specific context of ideological and political (I&P) course resources, Xu and Chen (2023) proposed a targeted recommendation system that integrated IPE objectives, students’ cognitive levels, and red cultural backgrounds, providing customized resource services for I&P teaching18. Beyond education, Bhaskaran and Marappan (2023) optimized recommendation systems for public machine learning datasets by refining modeling and analysis methods, which enhanced both the accuracy and reliability of recommendation outcomes19. Gm et al. (2024) provided a comprehensive review of the applications of digital recommendation systems in personalized learning, covering system architecture, technical support, and practical effect evaluation. This offered multidimensional references for constructing a recommendation model for red music-based IPE resources20.

Overall, existing research has made significant progress in the theoretical foundations, technical implementations, and practical applications of educational resource recommendation systems. However, a gap remains in the personalized recommendation of IPE resources integrated with red music. Most studies have not fully examined the unique cultural connotations, emotional significance, and IPE elements embedded in red music, making it difficult to achieve precise alignment between recommended resources and deeper educational objectives. Additionally, the analysis and modeling of learners’ characteristics in red music IPE contexts remain incomplete. There is a lack of a personalized index system that adequately reflects students’ cognition, emotional resonance, and value identification within red culture. Therefore, developing an intelligent deep learning model for recommending IPE resources based on red music requires addressing key technical challenges, including effective feature extraction from red music, construction of detailed learner cognitive profiles, and optimization of the recommendation algorithm, all while building upon existing research insights. This aims to fill the research gaps in this field and provide strong support for the innovative development of IPE in the new era. The comparative results of this study against existing educational resource recommendation methods are presented in Table 1.

Table 1 Comparison of existing educational resource recommendation methods.

Research methodology

This study develops an IPE resource recommendation model integrated with red music, employing a comprehensive research framework that combines multimodal processing, graph neural networks, and reinforcement learning techniques.

For multimodal feature extraction from red music resources, cross-modal alignment and hierarchical feature fusion strategies are adopted21,22. For the audio modality, a temporal-frequency attention mechanism is introduced on top of Mel spectrum analysis23. First, the time-frequency representation \(S(t,f)\) is obtained via the short-time Fourier transform (STFT); a dual-channel attention module then computes the attention weights \(A_{T}\) along the time dimension and \(A_{F}\) along the frequency dimension:

$$A_{T}(t)=\frac{\exp\left(\mathbf{W}_{T}\cdot \mathrm{MLP}\left(\sum_{f}S(t,f)\right)\right)}{\sum_{t^{\prime}}\exp\left(\mathbf{W}_{T}\cdot \mathrm{MLP}\left(\sum_{f}S(t^{\prime},f)\right)\right)}$$
(1)
$$A_{F}(f)=\frac{\exp\left(\mathbf{W}_{F}\cdot \mathrm{MLP}\left(\sum_{t}S(t,f)\right)\right)}{\sum_{f^{\prime}}\exp\left(\mathbf{W}_{F}\cdot \mathrm{MLP}\left(\sum_{t}S(t,f^{\prime})\right)\right)}$$
(2)
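
To make Eqs. (1)–(2) concrete, the following is a minimal PyTorch sketch of the dual-channel time-frequency attention; the MLP width and depth are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class DualChannelAttention(nn.Module):
    """Dual-channel attention over an STFT magnitude spectrogram
    S of shape (batch, T, F), sketching Eqs. (1)-(2)."""
    def __init__(self, hidden=64):
        super().__init__()
        # mlp_* and w_* play the roles of MLP(.) and W_T / W_F.
        self.mlp_t = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.w_t = nn.Linear(hidden, 1, bias=False)
        self.mlp_f = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.w_f = nn.Linear(hidden, 1, bias=False)

    def forward(self, S):
        # Eq. (1): softmax over time of W_T . MLP(sum_f S(t, f)).
        t_sum = S.sum(dim=2, keepdim=True)                   # (B, T, 1)
        A_T = torch.softmax(self.w_t(self.mlp_t(t_sum)), 1)  # (B, T, 1)
        # Eq. (2): softmax over frequency of W_F . MLP(sum_t S(t, f)).
        f_sum = S.sum(dim=1).unsqueeze(-1)                   # (B, F, 1)
        A_F = torch.softmax(self.w_f(self.mlp_f(f_sum)), 1)  # (B, F, 1)
        # Weighted fusion produces the enhanced audio feature X_audio.
        return S * A_T * A_F.transpose(1, 2)

X_audio = DualChannelAttention()(torch.rand(8, 200, 1025))   # (8, 200, 1025)
```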

Weighted fusion then generates the enhanced audio feature \(\mathbf{X}_{audio}\)24. For the text modality, dynamic word-vector representations from the Bidirectional Encoder Representations from Transformers (BERT) model are combined with knowledge graph embedding: knowledge elements related to red music, such as historical events, figures, and spiritual connotations, are integrated into the semantic representation, and graph convolutional networks (GCNs) carry out high-level semantic aggregation, yielding a text feature \(\mathbf{X}_{text}\) rich in IPE elements25. Figure 1 shows the structure of the GCNs.

Fig. 1
figure 1

Structure of GCNs.

In Fig. 1, a cross-modal interaction module built on a gated recurrent unit (GRU) is used during multimodal fusion. The gating mechanism adaptively adjusts the fusion weights of audio and text information, enabling dynamic integration of the two modalities:

$$\mathbf{z}=\sigma(\mathbf{W}_{xz}\mathbf{X}_{audio}+\mathbf{W}_{hz}\mathbf{h}_{t-1})$$
(3)
$$\mathbf{r}=\sigma(\mathbf{W}_{xr}\mathbf{X}_{audio}+\mathbf{W}_{hr}\mathbf{h}_{t-1})$$
(4)
$$\tilde{\mathbf{h}}=\tanh(\mathbf{W}_{xh}\mathbf{X}_{audio}+\mathbf{r}\odot \mathbf{W}_{hh}\mathbf{h}_{t-1})$$
(5)
$$\mathbf{h}_{t}=(1-\mathbf{z})\odot \mathbf{h}_{t-1}+\mathbf{z}\odot \tilde{\mathbf{h}}$$
(6)
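
A minimal PyTorch sketch of the gated cross-modal fusion in Eqs. (3)–(6) follows; \(\mathbf{z}\) and \(\mathbf{r}\) are the update and reset gates defined in the next paragraph, and the feature dimension is an illustrative assumption.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Gated recurrent fusion of audio and text features, a sketch of
    Eqs. (3)-(6); not the exact configuration used in this study."""
    def __init__(self, dim=256):
        super().__init__()
        self.W_xz = nn.Linear(dim, dim, bias=False)
        self.W_hz = nn.Linear(dim, dim, bias=False)
        self.W_xr = nn.Linear(dim, dim, bias=False)
        self.W_hr = nn.Linear(dim, dim, bias=False)
        self.W_xh = nn.Linear(dim, dim, bias=False)
        self.W_hh = nn.Linear(dim, dim, bias=False)

    def forward(self, x_audio, h_prev):
        z = torch.sigmoid(self.W_xz(x_audio) + self.W_hz(h_prev))         # Eq. (3)
        r = torch.sigmoid(self.W_xr(x_audio) + self.W_hr(h_prev))         # Eq. (4)
        h_tilde = torch.tanh(self.W_xh(x_audio) + r * self.W_hh(h_prev))  # Eq. (5)
        return (1 - z) * h_prev + z * h_tilde                             # Eq. (6)
```

Under the reading that the hidden state carries the text-side information (e.g., initialized from \(\mathbf{X}_{text}\)), iterating this update yields the fused feature \(\mathbf{X}_{music}\).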

\(\mathbf{z}\) is the update gate and \(\mathbf{r}\) is the reset gate; iterating these updates outputs the fused feature \(\mathbf{X}_{music}\)26. Learner profile construction combines dynamic cognitive diagnosis with personalized preference modeling27. Based on the Deterministic Input, Noisy “And” Gate (DINA) model, this study assesses students’ mastery of I&P knowledge, quantifying learners’ cognitive states across I&P knowledge points such as party history, revolutionary spirit, and socialist core values using multidimensional item response theory:

$$P(Y_{ij}=1\mid \boldsymbol{\theta}_{i},\mathbf{a}_{j},\mathbf{b}_{j})=\prod_{k=1}^{K}I(\theta_{ik}a_{jk}-b_{jk}>0)$$
(7)

\(Y_{ij}\) indicates student \(i\)’s response to question \(j\), \(\boldsymbol{\theta}_{i}\) is the student’s ability vector, and \(\mathbf{a}_{j}\) and \(\mathbf{b}_{j}\) are the question discrimination and difficulty parameters, respectively19. Combined with students’ learning behavior sequences (clickstream data, dwell time, interaction frequency), a temporal preference prediction model is constructed using a Transformer encoder-decoder architecture; its self-attention mechanism captures long-range dependencies in learning behavior, and a situational awareness module is introduced28. External factors such as learning time, device type, and network environment are encoded as a context vector \(\mathbf{C}\), and finally the cognitive diagnosis results and preference features are integrated to generate the dynamic learner profile \(\mathbf{X}_{user}\)29. A minimal sketch of the DINA response rule is given below; the design of the multimodal architecture is shown in Fig. 2.
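
As a concrete illustration, the following sketch evaluates the conjunctive response rule of Eq. (7) for one student-question pair; all ability and parameter values are hypothetical.

```python
import numpy as np

def dina_response(theta, a, b):
    """Conjunctive rule of Eq. (7): the response is correct only if
    theta_ik * a_jk - b_jk > 0 holds for every knowledge point k."""
    return float(np.all(theta * a - b > 0))

# Hypothetical values for three I&P knowledge points
# (party history, revolutionary spirit, socialist core values).
theta = np.array([0.8, 0.6, 0.9])  # student ability vector theta_i
a = np.array([1.0, 1.0, 1.0])      # question discrimination a_j
b = np.array([0.5, 0.5, 0.4])      # question difficulty b_j
print(dina_response(theta, a, b))  # 1.0 -> predicted correct
```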

Fig. 2
figure 2

Design of multi-modal architecture.

In Fig. 2, the multimodal architecture of this study integrates multiple technologies to construct a framework capable of collecting various types of information. In addition, the recommendation model employs heterogeneous information networks (HINs) to construct a complex graph comprising multiple types of nodes and relationships, including red music resources, learners, I&P knowledge points, and educational objectives, enabling the system to capture rich interactions and dependencies across diverse entities30. A meta-path-guided heterogeneous graph attention network (HGAT) is adopted for node representation learning. For a given meta-path \(P\), the attention coefficient between nodes \(u\) and \(v\) is calculated as:

$$e_{uv}^{P}=\frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}_{P}^{T}\left[\mathbf{W}_{P}\mathbf{h}_{u}\,\Vert\,\mathbf{W}_{P}\mathbf{h}_{v}\right]\right)\right)}{\sum_{v^{\prime}\in N_{u}^{P}}\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}_{P}^{T}\left[\mathbf{W}_{P}\mathbf{h}_{u}\,\Vert\,\mathbf{W}_{P}\mathbf{h}_{v^{\prime}}\right]\right)\right)}$$
(8)
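
A minimal PyTorch sketch of the meta-path attention coefficients in Eq. (8) follows; the symbols are defined in the next paragraph, and the embedding dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaPathAttention(nn.Module):
    """Attention over one meta-path P, per Eq. (8); a sketch,
    not the full multi-head HGAT."""
    def __init__(self, in_dim=64, out_dim=64):
        super().__init__()
        self.W_P = nn.Linear(in_dim, out_dim, bias=False)   # W_P
        self.a_P = nn.Parameter(torch.randn(2 * out_dim))   # a_P

    def forward(self, h_u, h_neighbors):
        # h_u: (d,) target node; h_neighbors: (n, d) neighbors N_u^P.
        wu = self.W_P(h_u).expand(h_neighbors.size(0), -1)  # repeat W_P h_u
        wv = self.W_P(h_neighbors)                          # W_P h_v for all v
        # e_uv^P: softmax of LeakyReLU(a_P^T [W_P h_u || W_P h_v]).
        scores = F.leaky_relu(torch.cat([wu, wv], dim=1) @ self.a_P)
        alpha = torch.softmax(scores, dim=0)
        return alpha @ wv   # attention-weighted neighbor aggregation
```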

\(\mathbf{h}_{u}\) and \(\mathbf{h}_{v}\) are node embedding vectors, and \(\mathbf{a}_{P}\) is a meta-path-specific attention parameter. Information from different meta-paths is aggregated through a multi-head attention mechanism to generate network embedding representations of resources and users31. For recommendation decision-making, the recommendation score function is optimized by combining the Bayesian personalized ranking (BPR) loss with the logical constraints of the curriculum knowledge graph:

$$L=-\sum_{(u,i,j)\in D}\ln \sigma\left(\hat{y}_{uij}\right)+\lambda \Vert \Theta \Vert_{2}^{2}$$
(9)

\(\sigma\) is the Sigmoid function, \(\hat{y}_{uij}\) is user \(u\)’s predicted preference for resource \(i\) over resource \(j\), and \(\Theta\) denotes the model parameters, which are optimized iteratively by stochastic gradient descent32. A minimal sketch of this BPR objective is given below.
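
The following sketch implements the BPR objective of Eq. (9) under the common convention \(\hat{y}_{uij}=\hat{y}_{ui}-\hat{y}_{uj}\); the regularization weight is an assumed value.

```python
import torch

def bpr_loss(y_ui, y_uj, params, lam=1e-4):
    """Bayesian personalized ranking loss, Eq. (9): push the observed
    resource i to score above the unobserved resource j for user u."""
    y_uij = y_ui - y_uj                            # \hat{y}_{uij}
    loss = -torch.log(torch.sigmoid(y_uij)).sum()  # -sum ln sigma(.)
    reg = sum((p ** 2).sum() for p in params)      # ||Theta||_2^2
    return loss + lam * reg
```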

During the model optimization stage, a hierarchical reinforcement learning framework is applied. The upper-level policy network formulates recommendation strategies based on macro-educational goals, such as value shaping and knowledge mastery. The lower-level execution network then fine-tunes specific recommendation content according to users’ real-time feedback, including learning completion and emotional response data33. The policy network \(\mu_{\theta}(s)\) and the value network \(Q_{\omega}(s,a)\) are optimized using the twin delayed deep deterministic policy gradient (TD3) algorithm, by minimizing the mean squared error loss:

$$L_{\omega}=\mathbb{E}_{s,a,r,s^{\prime}}\left[\left(Q_{\omega}(s,a)-y\right)^{2}\right]$$
(10)
$$y=r+\gamma \min_{i=1,2} Q_{\omega_{i}^{\prime}}\left(s^{\prime},\mu_{\theta^{\prime}}\left(s^{\prime}\right)\right)$$
(11)

\(\gamma\) is the discount factor, and \(\theta^{\prime}\) and \(\omega^{\prime}\) are the target network parameters; training stability is improved by the soft-update mechanisms \(\theta^{\prime}\leftarrow \tau\theta+(1-\tau)\theta^{\prime}\) and \(\omega^{\prime}\leftarrow \tau\omega+(1-\tau)\omega^{\prime}\) (\(0<\tau\ll 1\))34. Simultaneously, the recommended results are validated against the logical rules of the curriculum knowledge graph; semantic constraints ensure that the recommended content aligns with both the knowledge system and the value orientation of IPE35,36. A minimal sketch of the TD3 target and soft update is given below.
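
The following sketch shows the clipped double-Q target of Eq. (11) and the soft-update rule; the network objects and the \(\tau\) value are illustrative assumptions.

```python
import torch

def td3_target(r, s_next, gamma, actor_target, critic_targets):
    """Eq. (11): y = r + gamma * min_{i=1,2} Q'_i(s', mu'(s'))."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        q1, q2 = (Q(s_next, a_next) for Q in critic_targets)
        return r + gamma * torch.min(q1, q2)

def soft_update(target_net, online_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', with 0 < tau << 1."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```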

By integrating these multiple technologies, the proposed Graph Convolutional Networks–Transformer–Heterogeneous Information Networks (GCNs-Transformer-HINs) model achieves deep semantic mining and personalized, precise recommendation of red music IPE resources. The model continuously refines its recommendation strategies through a dynamic feedback mechanism, creating a closed-loop ecosystem of “data-driven analysis → intelligent decision-making → effect evaluation → strategy iteration.” This approach provides robust theoretical and technical support for the efficient utilization of IPE resources. The overall workflow of the proposed model in this study is illustrated in Fig. 3.

Fig. 3
figure 3

The overall process of the proposed research model.

As shown in Fig. 3, the workflow systematically integrates several key modules: multimodal feature extraction of red music, multimodal fusion, dynamic learner profile construction, HINs representation learning, and recommendation strategy optimization. By jointly modeling audio and textual features and combining them with learners’ cognitive and behavioral data, the system builds personalized dynamic profiles. A heterogeneous graph attention network is then used to explore the complex relationships between resources and users. Recommendation decisions are dynamically optimized through hierarchical reinforcement learning, while semantic constraints from the knowledge graph and user feedback ensure both accuracy and educational value. The result is an efficient, intelligent, and personalized recommendation loop aligned with IPE objectives.

Table 2 presents the challenges faced by traditional IPE resource recommendation and the corresponding solutions offered by the proposed model.

Table 2 Challenges in traditional IPE resource recommendation and the corresponding solutions.

This study extends beyond the development of technical models to actively advance both red culture education and IPE theory. It employs multimodal deep learning methods that integrate historical context, emotional expression, and textual semantics of red music. This approach overcomes the limitations of traditional single-modality analyses and enables more precise capture of the cultural and ideological meanings embedded in red music, facilitating a deeper understanding of how it conveys ideological messages. By combining dynamic cognitive diagnosis with personalized learner profiles, the model reflects learners’ cognitive states and emotional responses in real time. This shifts IPE from static knowledge delivery to interactive, adaptive learning and illustrates a practical application of the “internalization–externalization” theory from educational psychology. Additionally, reinforcement learning is used to optimize recommendation strategies, creating a closed-loop feedback system that aligns IPE content with individual cognitive development. This enriches research on the dynamic mechanisms of learning motivation and behavioral regulation in ideological transformation. The proposed framework improves the efficiency and accuracy of educational resource recommendations while providing new perspectives and methodological support for the advancement of red culture education and IPE theory.

Experimental design and performance evaluation

Datasets collection and experimental environment

This study utilized the China Red Music Digital Resource Database (CRMRD) as the primary experimental dataset. Supported by the Ministry of Culture and Tourism of China, this publicly accessible database (http://www.crmrd.cn) contains over 3,000 red music works spanning from 1921 to the present. It covers historical periods including the revolutionary war, socialist construction, and reform and opening-up. The dataset includes rich multimodal annotations: audio files (MP3 format, 44.1 kHz sampling rate), lyrics text, creative backgrounds (e.g., historical events, composers’ biographies), IPE labels (such as patriotism, collectivism, and revolutionary spirit), and user interaction data (learning duration, favorites/likes). Lyrics texts are annotated using a combination of manual labeling and BERT-based named entity recognition to extract structured information on events, characters, and emotional vocabulary. Audio data are analyzed and annotated with musicological features, including melodic modes and rhythmic patterns, by a professional music analysis team. The dataset is divided into a training set (2,000 pieces), a validation set (500 pieces), and a test set (500 pieces), supporting multimodal feature extraction, learner profile construction, and recommendation model training. Its authority and richness provide a reliable foundation for model evaluation and verification.

The experimental environment was established on a high-performance computing platform with the following hardware: Intel Xeon Gold 6240 CPU (2.6 GHz, 24 cores), NVIDIA Tesla V100 GPU (32 GB VRAM), 128 GB RAM, and 2 TB SSD storage, enabling large-scale parallel processing and deep learning training. The software environment was based on Python (v3.8, https://www.python.org/downloads/release/python-380/) and PyTorch (v2.0, https://pytorch.org/get-started/pytorch-2-x/), with Compute Unified Device Architecture (CUDA v11.8, https://developer.nvidia.com/cuda-11-8-0-download-archive) and cuDNN (v8.6, https://developer.nvidia.com/rdp/cudnn-archive) for GPU acceleration. Multimodal data processing relied on Librosa (v0.9.2, https://github.com/librosa/librosa/releases) for audio analysis, Hugging Face Transformers (v4.2.2, https://huggingface.co/transformers/v4.2.2/installation.html) for natural language processing, and NetworkX (v2.8.4, https://networkx.org/documentation/stable/release/release_2.8.4.html) for graph neural networks. Recommendation model training employed a distributed strategy, with parameter synchronization and optimization managed via PyTorch Lightning (v1.6.5, https://lightning.ai/docs/pytorch/1.6.5/). The entire experimental workflow was tracked and managed using MLflow (v2.3.0, https://newreleases.io/project/github/mlflow/mlflow/release/v2.3.0) for data version control, training logs, and hyperparameter tuning. This environment ensures both computational efficiency and algorithmic scalability, meeting the complex requirements of multimodal feature extraction, heterogeneous network modeling, and reinforcement learning-based optimization while maintaining stable and efficient model training (https://doi.org/10.5281/zenodo.10421362).
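
As an illustration of the audio preprocessing that precedes the time-frequency attention, the following sketch extracts the STFT and Mel representations with Librosa at the dataset’s 44.1 kHz sampling rate; the filename and frame parameters are hypothetical.

```python
import librosa
import numpy as np

# Load one red-music track ("yellow_river.mp3" is a hypothetical file).
y, sr = librosa.load("yellow_river.mp3", sr=44100)

# STFT magnitude spectrogram S(t, f) used by the dual-channel attention.
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Mel spectrogram in dB as the base audio representation.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)
print(S.shape, mel_db.shape)  # (1025, T) and (128, T)
```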

Parameters setting

When constructing an intelligent deep learning model for recommending IPE resources integrated with red music, the rational setting of hyperparameters is crucial to model performance. Different hyperparameters directly affect the model’s learning efficiency, generalization ability, and recommendation accuracy. To achieve optimal performance, this study carefully adjusted and optimized the key hyperparameters; the specific settings are presented in Table 3.

Table 3 Model parameter settings.

Performance evaluation

Model effectiveness evaluation

To thoroughly assess the effectiveness of the intelligent deep learning model for recommending IPE resources integrated with red music, this study developed an evaluation framework with three key dimensions: recommendation accuracy, generalization ability, and educational adaptability. Recommendation accuracy was measured using standard metrics, including Accuracy, Precision, Recall, and F1-score. Using these metrics, the proposed model was compared against several baseline approaches: graph convolutional networks (GCNs), Transformer-based models, HINs, Collaborative Filtering (CF), BERT-based recommendation models, and Term Frequency–Inverse Document Frequency (TF-IDF) models. As shown in Fig. 4, the proposed model consistently outperformed all baseline models across multiple recommendation performance metrics. Each point in the figure indicates the improvement achieved by the proposed model relative to the corresponding baseline, demonstrating its superior capacity to provide accurate, generalizable, and pedagogically relevant recommendations.

Fig. 4
figure 4

Evaluation results of model recommendation effectiveness.

(a-d show the evaluation of improvement effects on Accuracy, Recall, Precision, and F1-score, respectively)

As shown in Fig. 4, the proposed model consistently demonstrated stable superiority over baseline models across different training epochs in terms of accuracy improvement. At 200 epochs, the model outperformed GCNs, Transformer, and HINs by 30.7%, 26.2%, and 31.0%, respectively. Improvements over CF, BERT, and TF-IDF also exceeded 25%. As training progressed to 1,000 epochs, the improvement rates fluctuated slightly but generally increased, with the most notable gains observed for GCNs (34.9%), BERT (34.2%), and HINs (33.1%). This indicates that the model can fully leverage multimodal feature fusion and dynamic optimization strategies to maintain a high accuracy advantage over long-term training.

For Recall, the model performed particularly well at 600 epochs, achieving improvements of 34.1% over HINs, 33.8% over GCNs, and 31.2% over BERT, significantly surpassing other epochs. This suggests that in the mid-training stage, the model effectively captures learners’ latent preferences, enhancing resource coverage. Even at the early stage of training (200 epochs), the model achieved substantial improvements in Recall, outperforming HINs by 30.2% and GCNs by 29.5%, with improvements of approximately 27%–28% over CF and BERT. These results indicate that the model can significantly enhance coverage performance from the very beginning of training.

In terms of Precision, the model achieved notable improvements over all baseline models. The most pronounced gains were observed at 400 and 1,000 epochs: Precision increased by 33.9% and 34.1% over BERT, 34.8% and 35.0% over HINs, and more than 31% over GCNs. Even at 200 epochs, gains were already substantial, with Precision rising by 28.6% over HINs, 26.1% over CF, and 27.8% over BERT. These results demonstrate that the model can effectively filter out irrelevant or low-relevance recommendations even before it has fully converged.

The F1-score, reflecting the balance between Precision and Recall, remained stable across epochs. At 400 epochs, improvements were 33.6% for Transformer and 31.2% for BERT, both at high levels. At 200 epochs, the F1-score improvement for Transformer reached 33.5%, indicating that the model can balance recommendation accuracy and coverage even at an early stage. By 1,000 epochs, the F1-score improvements were 31.2% for HINs, 29.7% for CF, and 30.8% for BERT, demonstrating sustained performance balance after long-term training without bias toward a single metric.

Overall, the proposed model achieved consistent improvements of over 23% across core recommendation accuracy metrics, with some indicators reaching as high as 35%. These results confirm that integrating red music features and optimizing the deep learning architecture significantly enhances the efficiency and precision of IPE resource recommendations. They also highlight the advantages of multimodal data fusion and intelligent algorithms in improving educational resource matching.

Model generalization ability evaluation

Generalization ability is evaluated through cross-validation error and test set loss. Five-fold cross-validation is used to reduce data bias, ensuring model stability under different data distributions. To reflect the characteristics of IPE scenarios, two metrics are introduced: the Educational Match Score (EMS) and the Emotional Resonance Score (ERS). EMS quantifies the fit between recommended resources and IPE objectives through expert scoring and knowledge graph semantic matching. ERS is based on sentiment analysis of user comments, using a BERT sentiment classification model to calculate the proportion of positive emotional feedback triggered by resources. The equations are as follows:

$$EMS=\frac{\sum_{j=1}^{M}\mathrm{match}(e_{j},E_{target})}{M}$$
(12)
$$ERS=\frac{\sum_{s=1}^{S}\mathrm{positive}(s)}{S}$$
(13)

\(e_{j}\) is an I&P element of a recommended resource, \(E_{target}\) is the target education element, and \(M\) is the number of resource elements. \(S\) is the total number of user comments, and \(\mathrm{positive}(s)\) indicates whether comment \(s\) carries a positive emotional tag. A minimal sketch of these two metrics is given below; Fig. 5 displays the evaluation results of model generalization ability.
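
The following sketch computes Eqs. (12)–(13), with simple set membership standing in for the expert and knowledge-graph matching; all values are hypothetical.

```python
def ems(resource_elements, target_elements):
    """Educational Match Score, Eq. (12): fraction of a resource's
    I&P elements that match the target education elements."""
    matched = sum(1 for e in resource_elements if e in target_elements)
    return matched / len(resource_elements)

def ers(comment_labels):
    """Emotional Resonance Score, Eq. (13): proportion of comments
    labeled positive by the BERT sentiment classifier (1 = positive)."""
    return sum(comment_labels) / len(comment_labels)

print(ems(["patriotism", "collectivism", "heroism"],
          {"patriotism", "heroism"}))  # ~0.667
print(ers([1, 1, 0, 1]))              # 0.75
```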

Fig. 5
figure 5

Evaluation results of model generalization ability.

(a and b show the improvement effects on educational match degree and emotional resonance degree, respectively)

As shown in Fig. 5, in the comparative evaluation tailored to IPE scenarios, the proposed model demonstrates significant differentiated advantages. Compared with traditional recommendation schemes and similar models in the education field, it achieves substantial improvements in the core dimensions that meet the needs of IPE. The data indicate that the model consistently improves educational scenario adaptability and user emotional interaction by more than 18%. In certain dimensions, the optimization effect is even more pronounced, with the highest improvement rate approaching 29%.

This improvement intuitively reflects the model’s ability to deeply mine the connotations of red music in IPE. By integrating the historical context, melodic emotions, and lyrical semantics of red music, the model more accurately captures the relationships between educational resources and IPE objectives. This enables dynamic alignment of educational content with learners’ training needs throughout the recommendation process. Meanwhile, the model’s fine-grained characterization of learners’ personalized features enables it to keenly identify different users’ emotional response patterns to red music. Through intelligent adjustments of recommendation strategies, it strengthens the positive feedback between resource delivery and emotional resonance.

Experimental results further demonstrate that the model’s advantages extend beyond overall performance improvements. They are also evident in the close integration of recommendation outcomes with IPE scenarios. For both classic red songs with revolutionary historical themes and contemporary main melody works, the model achieves efficient alignment between resource value and user needs. This is accomplished through differentiated feature extraction and tailored recommendation logic. Such improvements not only represent technical optimization but also illustrate the innovative integration of red cultural resources with intelligent algorithms in educational contexts, providing a measurable foundation for the digital transformation of IPE.

Training and testing accuracy and loss curves

To visually present the convergence speed and performance stability of the model during training, this study tracked accuracy and loss over 1,000 training epochs. The specific results are shown in Fig. 6.

Fig. 6
figure 6

The accuracy and loss changes of the model in this study.

As illustrated in Fig. 6, the proposed model’s training accuracy gradually increased from 0.923 to 0.95, while the testing accuracy rose from 0.907 to 0.927. This demonstrates that the model achieves strong performance during both training and testing and maintains high generalization capability. Simultaneously, the training loss decreased from 0.121 to 0.052, and the testing loss declined from 0.153 to 0.090, showing a steady reduction in errors and stable training without overfitting. Overall, these curves confirm that the model converges quickly, continuously improves during training, and exhibits excellent stability and reliability.

Table 4 presents representative test samples along with the corresponding model recommendation results.

Table 4 Representative test samples and corresponding model recommendation results.

Table 4 presents a selection of representative test samples along with their corresponding model recommendation results, demonstrating the model’s precision and relevance in personalized recommendation. For instance, Sample 001, based on the user’s preference for high-energy melodies and keywords such as “revolution” and “struggle,” was recommended The Yellow River Cantata and On the Hopeful Field. Sample 002, considering the user’s browsing behavior related to anti-Japanese war resources, received a recommendation for Railway Guerrilla, which has high emotional relevance. Moreover, the model can intelligently adjust recommendations according to users’ preferences for rhythm and emotional expression (Sample 003) or address learners’ weaker knowledge areas (Sample 004). This illustrates the model’s ability to efficiently align educational resources with user needs through multimodal feature integration and personalized learner profiling.

Model efficiency evaluation

To assess the model’s practical runtime efficiency, inference time and resource consumption were tested using a mainstream server equipped with an NVIDIA Tesla V100 GPU. The detailed results are presented in Table 5.

Table 5 The efficiency comparison results of different models.

Table 5 illustrates that the proposed model achieves a well-balanced and practical performance in terms of efficiency. The inference time for a single recommendation is 28 ms, which is considerably faster than Transformer (40 ms) and BERT (45 ms), slightly better than GCNs (35 ms) and HINs (32 ms), and only marginally slower than lightweight models such as CF (15 ms) and TF-IDF (10 ms). The model contains 15.4 M parameters, representing a moderate size and substantially reducing computational load compared with BERT’s 110 M parameters. GPU memory usage is 3.6 GB, meeting the requirements of mainstream servers and some edge computing devices, while outperforming Transformer (4.2 GB) and BERT (8.5 GB). CPU inference time is 180 ms, faster than GCNs (220 ms) and Transformer (260 ms), ensuring responsive performance. Training a single epoch takes 35 s, and five-fold cross-validation completes in 9 h, making it suitable for medium- to large-scale datasets. Overall, the model combines high recommendation performance with low computational demands and fast inference, demonstrating strong practical applicability.

Discussion

Driven by the dual goals of digitally transforming IPE and preserving red culture, the intelligent deep learning model for recommending IPE resources integrated with red music introduces innovative advances in educational resource delivery through a multi-technology integration approach. The model begins with multimodal feature extraction from red music. Audio emotional rhythms are analyzed using short-time Fourier transform and time-frequency attention mechanisms. For example, rhythmic patterns are mapped to the fighting emotions conveyed in The Yellow River Cantata. Lyrics are semantically processed through a combination of bidirectional encoder representations and GCNs. This enables the extraction of IPE-related keywords, such as “reform” and “struggle,” from works like On the Hopeful Field. These processes form a cross-modal feature space where artistic characteristics and educational elements are closely intertwined. Although this study focuses on Chinese red music, the cross-modal feature modeling and fusion approach can be applied to other emotion-driven educational domains. Examples include courses on the Anti-Japanese War, folk music education, or dramatic literature. The approach is feasible because emotion–semantic couplings commonly exist across modalities such as music, video, and text. Adapting the model to a new domain requires constructing relevant knowledge graphs and emotion-label systems. However, for subjects with limited emotional content or hard-to-quantify artistic features—such as higher mathematics or formal logic—the benefits of multimodal modeling may be reduced. In such cases, domain-specific structured feature modeling may be necessary. At the learner profiling level, the model combines cognitive diagnosis based on the deterministic input–noisy “AND” (DINA) model with Transformer-based temporal modeling. This quantifies students’ mastery of knowledge points, including party history and revolutionary spirit. It also captures the dynamics of the learning environment through a context-aware module, such as fragmented learning behaviors during mobile study. The result is a three-dimensional learner profile that integrates static capabilities, dynamic behaviors, and environmental variables. These profiles provide precise anchors for personalized recommendation.

The profiling mechanism is scalable to cross-cultural and multilingual educational environments, particularly for the international dissemination of red culture or IPE. Implementation strategies include the following. First, the existing Chinese knowledge graph can be extended or replaced with multilingual versions, and cross-language embedding models can be introduced to ensure semantic consistency. Second, localized emotion-label systems can be constructed for different cultural contexts. For example, in English-language settings, keywords such as “freedom” and “justice” can be added. Third, culturally specific learning behavior features can be incorporated into the context-aware module, such as collective learning preferences or ritual participation, to improve the cultural adaptability of recommendations. It is important to note that semantic shifts and variations in emotional expression across languages may reduce feature-matching accuracy. To mitigate this, alignment mechanisms and cross-cultural data augmentation strategies should be applied during model training.

During learner profiling, all behavioral data are anonymized, and participants provide informed consent to ensure privacy protection and ethical compliance. The recommendation model is built around HINs. It employs meta-path guided heterogeneous graph attention mechanisms to capture complex correlations among red music resources, learners, and I&P knowledge points—for example, the semantic path “work → historical event → educational goal.” Bayesian personalized ranking combined with hierarchical reinforcement learning enables dual closed-loop optimization, aligning macro-level educational goals with micro-level user feedback. Experimental results show that the model outperforms baseline approaches across multiple dimensions. Precision increases by 28.5%, recall by 25.3%, educational match degree by up to 29%, and emotional resonance by 27%. Among younger users, preference for new-era main melody works rises by 32%, demonstrating that the model effectively enhances IPE engagement and affinity through technological empowerment.

It is important to note that the reported 23%–35% performance improvement mainly stems from IPE tasks focused on red music. This gain depends on the strong emotional features and clear educational objectives present in these resources. For IPE subtopics with weaker emotional drivers or less explicit educational content—such as integrity education or legal knowledge—the model may not achieve similar improvements. Future studies could expand its applicability to other IPE domains using techniques like emotion-enhanced content generation or cross-modal contextual reconstruction.

The study also shows that the effectiveness of multimodal feature fusion arises from the synergy between musical art and IPE. Features such as melodic excitement (e.g., high-frequency band proportion) and the emotional polarity of lyrics (e.g., positive vocabulary like “struggle” and “dedication”) jointly enhance emotional engagement. At the same time, dynamic cognitive diagnosis of learner profiles (e.g., real-time identification of knowledge gaps) combined with reinforcement learning–optimized recommendation strategies (e.g., adjusting resource difficulty based on completion) creates an adaptive ecosystem of “need identification → content generation → effect feedback.” This approach provides theoretical support for cross-cultural or multilingual applications of red culture education. However, in low-resource language settings, sparse emotion labeling and cultural differences in educational semantics may reduce recommendation effectiveness. Potential solutions include semi-supervised cross-lingual transfer learning, constructing cross-cultural emotion lexicons, and incorporating localized expert annotation. The model also demonstrates remarkable robustness in recommending resources across historical periods. On the test set, loss is 23.6% lower than that of traditional models, indicating its ability to capture the temporal continuity of red music, such as the evolving semantic representation of “patriotism” across different works. This provides a strong technical foundation for the innovative inheritance of red culture. However, the current experiments mainly involve high school and university students and rely on large, well-annotated red music and educational metadata. In contexts such as primary education or adult learning, variations in cognitive level, interest, and emotional receptivity may influence recommendation performance. Likewise, small-scale or sparse datasets may reduce model stability and generalization. Future work should test the model across more diverse demographics, educational stages, and dataset sizes to delineate its applicability and improve generalizability.

In summary, this study not only validates the effectiveness and novelty of integrating red music into IPE resource recommendation but also offers a practical pathway for extending multimodal, emotion-driven educational recommendation systems across domains and cultures. Nevertheless, optimal performance remains highly dependent on quantifiable emotional features, comprehensive knowledge graphs, and detailed learner profiles. In settings with weak emotional cues, substantial cultural differences, or limited data, the model may not perform as effectively as it does in the red music–themed IPE context.

Conclusion

Research contribution

The intelligent deep learning model developed in this study achieves a deep integration of red music with IPE resource recommendation. Theoretically, it introduces evaluation metrics for educational matching and emotional resonance, and establishes an educational logic framework linking “artistic features → cognitive profile → educational objectives,” overcoming the technical limitations of traditional recommendation models. Methodologically, it addresses the challenges of extracting I&P elements from red music and aligning them with learners’ dynamic needs through multimodal feature extraction, heterogeneous network modeling, and hierarchical reinforcement learning. In practice, the model demonstrates substantial improvements in recommendation accuracy and educational adaptability—up to 29%—and enhances student engagement in I&P teaching pilots. Overall, this approach offers a reusable technical paradigm for the digital preservation of red culture and the innovation of IPE, with strong potential for cross-disciplinary applications.

Future works and research limitations

Although this study has achieved technological breakthroughs in recommending red music IPE resources, there remains room for improvement. Currently, the model has limited capability in mining performance aspects of red music, such as visual elements in choruses and symphonies, as well as users’ physiological feedback, including EEG signals and eye-tracking data. Future work could incorporate video visual feature analysis and multimodal affective computing to enable a more quantitative evaluation of users’ emotional resonance. In addition, the model’s real-time recommendation efficiency under large-scale concurrent usage needs enhancement. Algorithmic complexity can be reduced through techniques such as model compression and distributed training. Another limitation lies in the underdeveloped tracking of learners’ long-term value identification. Follow-up research could establish a dynamic, evolving I&P literacy evaluation system through longitudinal data collection. Future studies also aim to broaden cross-cultural application scenarios and explore adaptive adjustments of red music education in international contexts. Simultaneously, the integration with educational practice can be strengthened, promoting the combined design of intelligent recommendation technology, I&P courses, and social practice. This approach will more comprehensively support the digital, personalized, and global development of IPE in the new era.