Abstract
This study addresses a fundamental question in music psychology: which specific, dynamic acoustic features predict human listeners’ emotional responses along the dimensions of valence and arousal. Our primary objective was to develop and validate an interpretable computational model that can serve as a tool for testing and advancing theories of music cognition. Using the publicly available DEAM dataset, which contains 1,802 music excerpts with continuous valence-arousal ratings, we developed a novel, theory-guided neural network. The model integrates a convolutional pathway for local spectral analysis with a Transformer pathway for capturing long-range temporal dependencies. Critically, its learning process is constrained by established principles from music psychology to enhance its psychological plausibility. A core finding from an analysis of the model’s attention mechanisms was that distinct acoustic patterns drive the two emotional dimensions: rhythmic regularity and spectral flux emerged as strong predictors of arousal, whereas harmonic complexity and musical mode were key predictors of valence. To validate our analytical tool, we confirmed that the model significantly outperformed standard baselines in predictive accuracy, achieving a Concordance Correlation Coefficient (CCC) of 0.67 for valence and 0.73 for arousal. Furthermore, an ablation study demonstrated that the theory-guided constraints were essential for this superior performance. Together, these findings provide robust computational evidence for the distinct roles of temporal and spectral features in shaping emotional perception. This work demonstrates the utility of interpretable machine learning as a powerful methodology for testing and refining psychological theories of music and emotion.
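For readers wishing to reproduce the evaluation metric reported above, the Concordance Correlation Coefficient penalizes both poor correlation and systematic bias between predicted and observed ratings. The following is a minimal NumPy sketch of the standard CCC definition, assuming predicted and observed ratings are supplied as one-dimensional arrays; the function and variable names are illustrative and are not taken from the study's code.

import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Standard CCC between two continuous rating series (e.g., per-frame arousal)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()  # population variances
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    # CCC = 2*cov / (var_x + var_y + (mean_x - mean_y)^2)
    return 2.0 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Hypothetical usage with ground-truth annotations and model predictions:
# ccc_arousal = concordance_correlation_coefficient(arousal_annotations, arousal_predictions)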
Data availability
The data that support the findings of this study are openly available in the MediaEval Database for Emotional Analysis in Music (DEAM) at https://cvml.unige.ch/databases/DEAM/.
Author information
Authors and Affiliations
Contributions
Y.G. and C.S. contributed equally to this work. Y.G. conceptualized the study, developed the methodology, implemented the software, and wrote the original draft. C.S. acquired the funding, performed validation of the experimental results, and contributed significantly to the writing, review, and editing of the manuscript. Y.F. assisted with data curation and visualization. J.L. provided supervision, project administration, and critically reviewed the manuscript. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gu, Y., Shao, C., Li, J. et al. Interpretable deep learning reveals distinct spectral and temporal drivers of perceived musical emotion. Sci Rep (2026). https://doi.org/10.1038/s41598-025-34238-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-025-34238-2


