Multimodal generative adversarial networks for piano fingering correction and performance expressiveness modeling through audio-visual feature fusion
  • Article
  • Open access
  • Published: 26 March 2026

  • Junyu Li1 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing
  • Neuroscience

Abstract

Piano performance analysis demands sophisticated understanding of both acoustic outputs and physical gestures, yet existing computational approaches typically treat these modalities in isolation. This study presents a novel framework that leverages generative adversarial networks to jointly model piano fingering correction and performance expressiveness through integrated audio-visual analysis. We develop a hierarchical attention-based fusion mechanism that learns adaptive correspondences between acoustic events and hand movements, combined with a dual-stream GAN architecture that concurrently handles fingering classification and expressiveness assessment. Experiments on a multimodal dataset comprising 3,847 performances demonstrate that our approach achieves 89.7% frame-level fingering accuracy and maintains correlation coefficients exceeding 0.85 for dynamic expressiveness features, substantially outperforming single-modality baselines. The system provides near real-time feedback with 180ms processing latency per second of audio, enabling practical deployment in interactive music education environments. These results validate the efficacy of cross-modal deep learning for capturing the coupled relationship between biomechanical execution and artistic interpretation in skilled musical performance.
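
As a rough orientation for readers unfamiliar with cross-modal attention, the sketch below shows one way audio and hand-movement feature streams can be fused with attention in PyTorch. It is a minimal hypothetical illustration: the module name, feature dimensions, and the choice of nn.MultiheadAttention are assumptions made here for exposition, not the paper's actual architecture (which is specified layer by layer in the Supplementary Material).

```python
# Illustrative sketch only: cross-modal attention fusion between an audio stream
# and a visual (hand-keypoint) stream. Names, dimensions, and the use of
# nn.MultiheadAttention are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Lets the audio stream attend over the visual stream and vice versa,
    then projects the two context vectors back to a single fused embedding."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T, d_model) frame-level acoustic embeddings
        # visual: (batch, T, d_model) frame-level hand-pose embeddings
        a_ctx, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_ctx, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        return self.out(torch.cat([a_ctx, v_ctx], dim=-1))  # fused (batch, T, d_model)

if __name__ == "__main__":
    fusion = CrossModalFusion()
    audio = torch.randn(2, 100, 256)   # e.g. 100 frames of acoustic features
    visual = torch.randn(2, 100, 256)  # e.g. 100 frames of keypoint features
    print(fusion(audio, visual).shape)  # torch.Size([2, 100, 256])
```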

Data availability

The complete multimodal dataset comprising 3,847 performances (215 h of synchronized audio-visual recordings) is available through a structured access protocol designed to balance reproducibility with privacy protection. The Supplementary Material provides all resources necessary for independent replication and is organized as follows: (a) complete feature extraction source code with inline comments, specifying software dependencies (Python 3.9, PyTorch 2.0.1, MediaPipe 0.10.3) and random seeds (seed = 42 for all experiments), enabling researchers to replicate our entire preprocessing and training pipeline from raw recordings to final evaluation; (b) configuration files in JSON format listing all hyperparameters for each model component, including learning rate schedules, loss weights (λ_recon, λ_adv, λ_expr), augmentation parameters, and discriminator update frequency rules; (c) evaluation scripts that reproduce all quantitative results reported in Tables 5, 6, and 7, with expected output values provided for verification; (d) a summary table documenting the detailed architecture specification of every layer in both generator and discriminator networks, including initialization schemes, normalization choices, and dropout placement; (e) sample input-output pairs from 20 representative test performances, showing raw audio-visual features alongside model predictions and ground-truth annotations, to facilitate qualitative inspection without requiring the full dataset. Full dataset access requires completion of a data use agreement acknowledging participant privacy protections and restrictions on redistribution, in accordance with the ethics approval granted by Nanjing Xiaozhuang University (IRB-NJXZU-MUS-2024-018). Researchers may request full dataset access by contacting the corresponding author at lijunyu1891299@163.com with a brief description of intended use and institutional affiliation. We commit to responding to access requests within 10 business days. This tiered access approach follows recommendations for responsible sharing of human performance data while maximizing research utility.
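
For orientation, the snippet below sketches what a configuration file of the kind described in item (b) might look like and how it could be round-tripped in Python. Only the fixed random seed (42) is taken from the text above; every other key and value is a placeholder assumption, not the hyperparameters shipped with the Supplementary Material.

```python
# Hypothetical example of a JSON configuration of the kind described in item (b).
# Only the random seed (42) comes from the Data availability text; all other
# keys and values are placeholder assumptions for illustration.
import json

example_config = {
    "seed": 42,
    "optimizer": {"lr": 1e-4, "schedule": "cosine_decay"},
    "loss_weights": {"recon": 1.0, "adv": 0.5, "expr": 0.5},   # λ_recon, λ_adv, λ_expr
    "augmentation": {"pitch_shift_semitones": 1, "time_stretch_pct": 5},
    "discriminator": {"updates_per_generator_step": 1},
}

# Round-trip through JSON exactly as a configuration file would be stored and read.
config_json = json.dumps(example_config, indent=2)
cfg = json.loads(config_json)
print(cfg["loss_weights"])   # {'recon': 1.0, 'adv': 0.5, 'expr': 0.5}
```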

Abbreviations

GAN: Generative Adversarial Network
MFCC: Mel-Frequency Cepstral Coefficients
LSTM: Long Short-Term Memory
CNN: Convolutional Neural Network
BiLSTM: Bidirectional Long Short-Term Memory
TCN: Temporal Convolutional Network
HMM: Hidden Markov Model
cGAN: Conditional Generative Adversarial Network
STFT: Short-Time Fourier Transform
MSE: Mean Squared Error
IOI: Inter-Onset Interval
RGB: Red-Green-Blue
GPU: Graphics Processing Unit

Acknowledgements

The author acknowledges the contributions of the 45 pianists who participated in the recording sessions, and the three professional piano instructors who provided expert annotations for the dataset. The author also thanks the technical staff at Nanjing Xiaozhuang University for their assistance with the recording infrastructure setup.

Funding

Not applicable.

Author information

Authors and Affiliations

  1. School of Music, Nanjing Xiaozhuang University, Nanjing, 211171, Jiangsu, China

    Junyu Li

Contributions

Junyu Li conceived and designed the study, developed the methodology, implemented the experimental framework, conducted data collection and analysis, and wrote the manuscript. The author takes full responsibility for the integrity and accuracy of the work presented.

Corresponding author

Correspondence to Junyu Li.

Ethics declarations

Competing interests

The author declares no competing interests.

Ethics approval and consent to participate

This study was approved by the Research Ethics Committee of Nanjing Xiaozhuang University, School of Music (Reference Number: IRB-NJXZU-MUS-2024-018). All participants provided written informed consent prior to enrollment in the recording sessions. The study was conducted in accordance with the Declaration of Helsinki and relevant national regulations of the People’s Republic of China. Participants were informed of their right to withdraw from the study at any time without consequence, and all data were anonymized to protect participant privacy.

Consent for publication

The author has reviewed the manuscript and consents to its publication. No identifiable information regarding participants has been included in this manuscript. All participants provided consent for their anonymized performance data to be used for research purposes and potential publication.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (ZIP)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, J. Multimodal generative adversarial networks for piano fingering correction and performance expressiveness modeling through audio-visual feature fusion. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44473-w

  • Received: 22 December 2025

  • Accepted: 11 March 2026

  • Published: 26 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-44473-w

Keywords

  • Piano performance analysis
  • Multimodal fusion
  • Generative adversarial networks
  • Fingering correction
  • Performance expressiveness
  • Audio-visual learning