Abstract
Automated, artificial intelligence-based generation of video content has become commonplace in television production, and computer vision techniques play a significant role in classifying and analyzing large volumes of multimedia content. This study develops an intelligent framework for TV genre classification using deep learning and advanced transformer-based models. Traditional machine learning relies on handcrafted features and cannot capture the complex spatio-temporal and acoustic relationships present in modern media. To address these limitations, the study applies state-of-the-art vision transformers to two standard datasets from the domain. First, a static-image dataset is analyzed with the Pyramid Vision Transformer (PvT), which captures multi-scale spatial and contextual information across diverse TV scenes. Second, a multimodal audio–video dataset is analyzed with the Multimodal Attention and Invariant Vision–Audio Representation Transformer (MAiVAR-T), which models temporal dependencies and integrates acoustic features, including mel-spectrograms, chroma, waveforms, and energy patterns. Empirical analysis demonstrates that the PvT and MAiVAR-T models achieve accuracies of 97% and 98%, respectively, outperforming baseline deep learning models. These results highlight the role of multimodal transformers in improving automated genre classification for television and digital media production.
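As a rough illustration of the two pipelines summarized above, the minimal sketch below pairs a PVT-v2 backbone (via the timm library) with librosa-based extraction of the acoustic descriptors named in the abstract. The model variant, genre count, and synthetic inputs are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact pipeline): a PVT backbone for
# static-image genre classification plus the acoustic features named in
# the abstract. Model variant, class count, and inputs are assumptions.
import librosa
import numpy as np
import timm
import torch

NUM_GENRES = 10  # hypothetical number of TV genre classes

# Pyramid Vision Transformer (PVT-v2) with a fresh genre-classification
# head; pretrained=True downloads ImageNet weights from the model hub.
model = timm.create_model("pvt_v2_b2", pretrained=True, num_classes=NUM_GENRES)
model.eval()

# Classify one normalized frame (batch of 1, 3 x 224 x 224 stand-in).
frame = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(frame)
predicted_genre = logits.argmax(dim=1).item()

def acoustic_features(waveform: np.ndarray, sr: int = 22050) -> dict:
    """Compute the acoustic descriptors listed in the abstract."""
    return {
        "waveform": waveform,
        "mel": librosa.feature.melspectrogram(y=waveform, sr=sr),
        "chroma": librosa.feature.chroma_stft(y=waveform, sr=sr),
        "energy": librosa.feature.rms(y=waveform),  # RMS energy envelope
    }

# Five seconds of synthetic audio stand in for a real clip's soundtrack.
audio = np.random.randn(22050 * 5).astype(np.float32)
feats = acoustic_features(audio)
print(predicted_genre, {k: np.shape(v) for k, v in feats.items()})
```

In a multimodal setup such as MAiVAR-T, descriptors of this kind would be fused with the visual stream; here they are shown only to make the feature vocabulary of the abstract concrete.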
Data availability
Datasets are freely available at: (1) Dataset 1: https://universe.roboflow.com/tv-production. (2) Dataset 2: https://github.com/jwehrmann/lmtd.
Code availability
Code with sample dataset: https://zenodo.org/records/18950832.
Acknowledgements
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261384].
Funding
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU261384].
Author information
Contributions
F.K.A. conceived the study and designed the research. F.K.A. secured the funding, supervised the work, and led the manuscript writing. A.N. implemented the models, prepared the data, and conducted the experiments. A.N. produced the results and figures. Both authors analyzed and interpreted the findings, revised the manuscript, and approved the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alarfaj, F.K., Naz, A. Exploring vision transformers for deep feature extraction and classification in video genre recognition for digital media. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45087-y
DOI: https://doi.org/10.1038/s41598-026-45087-y