Abstract
Automated, artificial intelligence-based generation of video content has become commonplace in television production, and computer vision techniques play a significant role in classifying and analyzing large volumes of multimedia content. This study develops an intelligent framework for TV genre classification using deep learning and advanced transformer-based models. Traditional machine learning relies on handcrafted features and cannot capture the complex spatio-temporal and acoustic relationships present in modern media. To address these limitations, the study applies state-of-the-art vision transformers to two standard datasets from the domain. First, a static-image dataset is analyzed with the Pyramid Vision Transformer (PvT), which captures multi-scale spatial and contextual information across diverse TV scenes. Second, a multimodal audio–video dataset is analyzed with the Multimodal Attention and Invariant Vision–Audio Representation Transformer (MAiVAR-T), which models temporal dependencies and integrates acoustic features, including mel-spectrograms, chroma, waveforms, and energy patterns. Empirical analysis demonstrates that the PvT and MAiVAR-T models achieve accuracies of 97% and 98%, respectively, outperforming baseline deep learning models. These results highlight the role of multimodal transformers in improving automated genre classification for television and digital media production.
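As a rough illustration of the two pipelines summarized above, the minimal sketch below pairs a PVT-v2 backbone (via the timm library) with librosa-based extraction of the acoustic descriptors named in the abstract. The model variant, genre count, and synthetic inputs are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact pipeline): a PVT backbone for
# static-image genre classification plus the acoustic features named in
# the abstract. Model variant, class count, and inputs are assumptions.
import librosa
import numpy as np
import timm
import torch

NUM_GENRES = 10  # hypothetical number of TV genre classes

# Pyramid Vision Transformer (PVT-v2) with a fresh genre-classification
# head; pretrained=True downloads ImageNet weights from the model hub.
model = timm.create_model("pvt_v2_b2", pretrained=True, num_classes=NUM_GENRES)
model.eval()

# Classify one normalized frame (batch of 1, 3 x 224 x 224 stand-in).
frame = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(frame)
predicted_genre = logits.argmax(dim=1).item()

def acoustic_features(waveform: np.ndarray, sr: int = 22050) -> dict:
    """Compute the acoustic descriptors listed in the abstract."""
    return {
        "waveform": waveform,
        "mel": librosa.feature.melspectrogram(y=waveform, sr=sr),
        "chroma": librosa.feature.chroma_stft(y=waveform, sr=sr),
        "energy": librosa.feature.rms(y=waveform),  # RMS energy envelope
    }

# Five seconds of synthetic audio stand in for a real clip's soundtrack.
audio = np.random.randn(22050 * 5).astype(np.float32)
feats = acoustic_features(audio)
print(predicted_genre, {k: np.shape(v) for k, v in feats.items()})
```

In a multimodal setup such as MAiVAR-T, descriptors of this kind would be fused with the visual stream; here they are shown only to make the feature vocabulary of the abstract concrete.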
Data availability
Datasets are freely available at: (1) Dataset 1: https://universe.roboflow.com/tv-production. (2) Dataset 2: https://github.com/jwehrmann/lmtd.
Code availability
Code with sample dataset: https://zenodo.org/records/18950832.
Acknowledgements
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261384].
Funding
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261384].
Author information
Contributions
F.K.A. conceived the study and designed the research. F.K.A. secured the funding, supervised the work, and led the manuscript writing. A.N. implemented the models, prepared the data, and conducted the experiments. A.N. produced the results and figures. Both authors analyzed and interpreted the findings, revised the manuscript, and approved the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alarfaj, F.K., Naz, A. Exploring vision transformers for deep feature extraction and classification in video genre recognition for digital media. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45087-y
DOI: https://doi.org/10.1038/s41598-026-45087-y