Exploring vision transformers for deep feature extraction and classification in video genre recognition for digital media
  • Article
  • Open access
  • Published: 22 March 2026

  • Fawaz Khaled Alarfaj^1 &
  • Anam Naz^2

Scientific Reports (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational science
  • Computer science

Abstract

The generation of video content for television production using automation and artificial intelligence is now commonplace, and computer vision techniques play a significant role in classifying and analyzing large volumes of multimedia content. This study develops an intelligent framework for TV genre classification using deep learning and advanced transformer-based models. Traditional machine learning relies on handcrafted features and cannot capture the complex spatio-temporal and acoustic relationships present in modern media. To address these limitations, the study applies state-of-the-art vision transformers to two standard datasets in the domain. First, a static image dataset is analyzed using the Pyramid Vision Transformer (PvT), which captures multi-scale spatial and contextual information across diverse TV scenes. Second, a multimodal audio–video dataset is analyzed using the Multimodal Attention and Invariant Vision–Audio Representation Transformer (MAiVAR-T), which captures temporal dependencies and integrates acoustic features, including mel-spectrogram, chroma, waveform, and energy patterns. Empirical analysis demonstrates that the PvT and MAiVAR-T models achieve accuracies of 97% and 98%, respectively, outperforming baseline deep learning models. The study highlights the value of multimodal transformers for automated genre classification in television and digital media production.
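
To make the described pipeline concrete, the sketch below shows one plausible way to extract the acoustic descriptors named above (waveform, mel-spectrogram, chroma, and energy) with librosa, and to score TV-scene frames with a Pyramid Vision Transformer backbone from the timm library. This is a minimal sketch under stated assumptions, not the authors' implementation: the model name pvt_v2_b2, the ten genre classes, and all hyperparameters are illustrative choices, and the MAiVAR-T fusion stage is not reproduced.

    # Minimal sketch, not the authors' code. Assumes librosa, timm, and torch
    # are installed; "pvt_v2_b2" and num_classes=10 are illustrative choices.
    import librosa
    import timm
    import torch

    def acoustic_features(path, sr=22050):
        """Extract the audio descriptors named in the abstract."""
        y, sr = librosa.load(path, sr=sr, mono=True)                  # raw waveform
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # mel-spectrogram
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # chroma
        energy = librosa.feature.rms(y=y)                             # frame-wise energy
        return {"waveform": y, "mel": mel, "chroma": chroma, "energy": energy}

    # Pyramid Vision Transformer backbone for static TV-scene frames.
    model = timm.create_model("pvt_v2_b2", pretrained=True, num_classes=10)
    model.eval()
    frames = torch.randn(4, 3, 224, 224)   # stand-in for a batch of video frames
    with torch.no_grad():
        logits = model(frames)             # shape (4, 10): per-frame genre scores

In a full multimodal pipeline, frame-level scores would be pooled over time and fused with the acoustic features before the final genre prediction, broadly analogous to the attention-based fusion that MAiVAR-T performs.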

Data availability

Datasets are freely available at: (1) Dataset 1: https://universe.roboflow.com/tv-production. (2) Dataset 2: https://github.com/jwehrmann/lmtd.

Code availability

Code with sample dataset: https://zenodo.org/records/18950832.


Acknowledgements

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261384].

Funding

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261384].

Author information

Authors and Affiliations

  1. Department of Management Information Systems School of Business, King Faisal University, Al Ahsa, Saudi Arabia

    Fawaz Khaled Alarfaj

  2. Department of Information Technology, University of Sargodha, Punjab, Pakistan

    Anam Naz

Authors
  1. Fawaz Khaled Alarfaj
  2. Anam Naz

Contributions

F.K.A. conceived the study and designed the research. F.K.A. secured the funding, supervised the work, and led the manuscript writing. A.N. implemented the models, prepared the data, and conducted the experiments. A.N. produced the results and figures. Both authors analyzed and interpreted the findings, revised the manuscript, and approved the final version.

Corresponding author

Correspondence to Fawaz Khaled Alarfaj.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Alarfaj, F.K., Naz, A. Exploring vision transformers for deep feature extraction and classification in video genre recognition for digital media. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45087-y


  • Received: 23 June 2025

  • Accepted: 17 March 2026

  • Published: 22 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-45087-y


Keywords

  • Deep learning
  • Computer vision
  • Multimedia analysis
  • Feature extraction
  • Vision transformers
  • Waveform
  • Acoustics analysis
  • Management information system