A novel multi-module neural networks strategy of human emotion recognition in the human-robot interaction

Zaman, Khalid; Islam, Ammad Ul; Zengkang, Gan; Bilal, Muhammad; Alharbi, Ayman; Shah, Sayyed Mudassar; Asghar, Sohail; Wang, Hongzhao

doi:10.1038/s41598-026-40798-8

Article
Open access
Published: 28 February 2026

Khalid Zaman¹,²,³,⁴,
Ammad Ul Islam⁵,
Gan Zengkang³,
Muhammad Bilal⁶,
Ayman Alharbi⁷,
Sayyed Mudassar Shah⁸,
Sohail Asghar⁹ &
Hongzhao Wang¹,²

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Human emotion recognition (HER) has drawn considerable attention for its applications in security, intelligent customer service, healthcare, education, human-robot interaction (HRI), and adaptive system training. This research designs a novel Multi-Module Neural Networks (MMNNs) architecture for HER, aimed at practical applications. To identify human emotions, the model incorporates MobileNetV3, Vision Transformer (ViT), RegNet and SE-ResNeXt into a deep ensemble classification structure. To improve HER performance, the MMNNs are integrated with Transfer Learning (TL) to train the CNNs. The MMNNs classification model is trained by combining features from the four models using feature pooling. The key novelty of the model is a DEtection TRansformer (DETR) that enhances the CNN learning block. It consists of a CNN that learns a low-dimensional feature representation, an encoder-decoder transformer, and a simple Feed Forward Network (FFN) that outputs the final detection prediction, which ultimately boosts face recognition efficiency and accuracy. The MMNNs results are validated on AffectNet, CK+ and a custom-made dataset (CMD), achieving accuracies of 91.07%, 87.03% and 96.98%, respectively; data augmentation further increases these to 95.09%, 89.15% and 98.13%, respectively.
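The feature-pooling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each of the four backbones yields a channels x height x width feature map, that pooling means global average pooling, and that the pooled vectors are concatenated before classification; the abstract does not confirm these details. The function names, channel counts, and random stand-in feature maps are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List

FeatureMap = List[List[List[float]]]  # channels x height x width


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Collapse each channel's H x W grid to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmap
    ]


def fuse_features(branch_maps: Dict[str, FeatureMap]) -> List[float]:
    """Pool every branch, then concatenate the pooled vectors in name order."""
    fused: List[float] = []
    for name in sorted(branch_maps):
        fused.extend(global_average_pool(branch_maps[name]))
    return fused


def random_map(channels: int, size: int, rng: random.Random) -> FeatureMap:
    """Stand-in for a backbone output; real maps would come from pretrained networks."""
    return [
        [[rng.random() for _ in range(size)] for _ in range(size)]
        for _ in range(channels)
    ]


rng = random.Random(0)
# Channel counts below are illustrative assumptions, not values from the paper.
branch_maps = {
    "mobilenetv3": random_map(4, 7, rng),
    "vit": random_map(6, 7, rng),
    "regnet": random_map(5, 7, rng),
    "se_resnext": random_map(8, 7, rng),
}
fused = fuse_features(branch_maps)
print(len(fused))  # 4 + 6 + 5 + 8 = 23 pooled features
```

In a full pipeline the fused vector would feed a classification head that outputs emotion classes; concatenation is only one plausible way to combine the branches, and a weighted sum or attention over branches would be equally consistent with the abstract's wording.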

Data availability

The dataset used in this research is available at https://github.com/123456789khalid/Human-Emotion-HE-.git.

Abbreviations

MMNNs: Multi-Module Neural Networks
HER: Human Emotion Recognition
HRI: Human-Robot Interaction
HCI: Human-Computer Interaction
ER: Emotion Recognition
DNNs: Deep Neural Networks
CNNs: Convolutional Neural Networks
TL: Transfer Learning
DETR: DEtection TRansformer
FFN: Feed Forward Network
CMD: Custom-Made Dataset
IIMT Lab: Institute of Intelligent Manufacturing Technology Laboratory
ViLT: Vision and Language Transformers
NMS: Non-Maximal Suppression
DACL: Deep Attention Centre Loss
STN: Spatial Transformation Network
FER: Facial Expression Recognition
MLP: Multilayer Perceptron
RNNs: Recurrent Neural Networks
LSTM: Long Short-Term Memory
BDBNs: Boosted Deep Belief Networks
DCNN: Deep Convolutional Neural Network
DTN: Deep Temporal Network
DSN: Deep Spatial Network
SE: Squeeze-and-Excitation
GRU: Gated Recurrent Units
IRB: Inverted Residual Block
CK+: Extended Cohn-Kanade

Acknowledgements

The authors extend their appreciation to Umm Al-Qura University, Saudi Arabia for funding this research work through grant number: 26UQU4290339GSSR01.

Funding

This research work was funded by Umm Al-Qura University, Saudi Arabia under grant number: 26UQU4290339GSSR01.

Author information

Authors and Affiliations

School of Software, Northwestern Polytechnical University, Xi’an, Shaanxi, 710129, China
Khalid Zaman & Hongzhao Wang
China Aviation International and Investment CO., LTD, Beijing, China
Khalid Zaman & Hongzhao Wang
Institute of Intelligent Manufacturing Technology, Shenzhen Polytechnic University, Shenzhen, 518000, Guangdong, China
Khalid Zaman & Gan Zengkang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, Guangdong, China
Khalid Zaman
Institute for Advanced Study in Nuclear Energy & Safety College of Physics and Optoelectronic Engineering, Shenzhen University, Shenzhen, China
Ammad Ul Islam
School of Engineering, Nanfang College Guangzhou, Guangzhou, 510970, China
Muhammad Bilal
Computer and Network Engineering Department, College of Computing, Umm Al-Qura University, Mecca, 24231, Saudi Arabia
Ayman Alharbi
School of Civil and Transportation Engineering, Shenzhen University, Guangdong, Shenzhen, China
Sayyed Mudassar Shah
Institute for Advanced Study in Nuclear Energy & Safety College of Physics, Shenzhen University, Shenzhen, China
Sohail Asghar

Contributions

Khalid Zaman, Ammad Ul Islam and Sayyed Mudassar Shah contributed equally to this work and are co-first authors.

Corresponding authors

Correspondence to Gan Zengkang, Muhammad Bilal or Hongzhao Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics declarations

Written informed consent was obtained from all participants for the publication of identifiable images in all figures in this manuscript. The authors confirm that informed consent was obtained from all subjects and/or their legal guardian(s) prior to their participation in the study.

Compliance with Guidelines and Regulations

All methods involving human participants and/or human tissue samples were carried out in accordance with the relevant ethical guidelines and regulations. The study was approved by the Institute of Intelligent Manufacturing Technology, Shenzhen Polytechnic University, Shenzhen, Guangdong 518000, China, with the approval number given as “Supported by the Post-Doctoral Foundation Project of Shenzhen Polytechnic University (Grant No. 6024331021K)”. All participants provided informed consent prior to inclusion in the study.

Approval by institutional committee

We have provided details of the institution that approved the experimental protocols, including the name of the committee and any relevant approval numbers.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zaman, K., Islam, A.U., Zengkang, G. et al. A novel multi-module neural networks strategy of human emotion recognition in the human-robot interaction. Sci Rep (2026). https://doi.org/10.1038/s41598-026-40798-8

Received: 10 August 2025
Accepted: 16 February 2026
Published: 28 February 2026
DOI: https://doi.org/10.1038/s41598-026-40798-8