Abstract
Falling poses a significant health risk to the elderly, often resulting in severe injuries if not promptly addressed. As the global population ages, the frequency of falls increases along with the associated financial burden. Hence, early detection is crucial for initiating timely medical interventions and minimizing physical, social, and economic harm. With the growing demand for safety monitoring of older adults, particularly those living alone, effective fall detection has become increasingly important for supporting independent living. In this study, we propose a novel deep learning architecture and an optimization algorithm for human fall direction recognition. We developed four novel residual block and self-attention networks, named the residual block deep convolutional neural network (3-RBNet), 5-RBNet, 7-RBNet, and 9-RBNet self-attention models. The models were trained on enhanced images, and deep features were extracted from the self-attention layer. The 7-RBNet and 9-RBNet self-attention models demonstrated superior accuracy and precision rates, leading us to exclude the 3-RBNet self-attention model from further analysis. To optimize feature selection and improve classification performance while reducing computational costs, we employed the tree seed algorithm on the self-attention features of the 7-RBNet and 9-RBNet models. Experiments using the proposed method were performed on a human fall dataset collected from Soonchunhyang University, South Korea. The optimized 7-RBNet and 9-RBNet models achieved maximum accuracies of 92.5% and 93.2%, respectively. Compared with recent techniques, our approach improved accuracy and precision.
Introduction
Falls are a significant cause of severe injury among elderly people worldwide, hindering their ability to live comfortably and independently. Statistics indicate that falls are the leading cause of injury-related deaths in individuals aged 80 and older. A United Nations study1 in 2017 found that there were 962 million people aged 60 years or older, representing 13% of the global population. According to the World Health Organization (WHO), the number of elderly individuals is expected to reach 1.2 billion by 2025, more than double by 2050 (2.1 billion), and triple (3.1 billion) by 21002. Approximately 2.8 million older adults annually experience emergency health issues related to falls. Adults over 65 years are particularly vulnerable to life-threatening injuries caused by falls. For those aged > 85 years, falls account for approximately two-thirds of all injury-related deaths. Approximately 20% of falls result in severe injuries, such as hip fractures and head trauma3. In the United States alone, 29,668 residents aged > 65 years died from fall-related injuries in 2016, corresponding to 61.6 deaths per 100,000 people4.
Most falls occur at home because of common hazards such as poor lighting, clutter, obstructed pathways, slippery floors, pets, and unstable furniture5,6. Elderly individuals with neurological conditions such as dementia and epilepsy are more susceptible to falls and related injuries than the average elderly population7,8,9,10,11. In addition, the cultural tendency of elderly individuals in Western societies to live independently of their family members contributes to fall-related injuries. While falls are not always life threatening, those occurring in cluttered environments can result in concussion, hemorrhage, and other severe health risks, leading to unfortunate outcomes12,13. In the absence of fall detection technology, emergency services often do not respond promptly, resulting in severe consequences. Many surveillance systems have been developed to meet the need for constant monitoring by nurses and support staff members. However, creating an environment that is completely fall proof is challenging. Therefore, fall detection and rescue services are essential for ensuring the safety of the elderly12,14,15,16,17,18,19. As a result, there is an increasing need for intelligent detection and prevention systems. Numerous fall detection and monitoring systems have been reviewed from various perspectives20,21,22. The challenges, issues, and advancements in fall detection and prevention have been discussed in23,24,25,26.
Fall detection refers to identifying the occurrence of a fall from the provided data, whereas fall recognition involves identifying the specific type of fall, such as a forward or backward fall, or a fall from a sitting, standing, or lying position11,14. Identifying the type of fall is crucial for mounting an appropriate response. For example, a fall from a standing position may be more serious than a fall from a sitting or lying position, depending on the circumstances10,17. Knowing the specific type of fall allows for effective responses to potential complications. In this study, we use the term “fall detection” to encompass both general fall detection and activity recognition systems without explicitly distinguishing between detection and recognition systems.
In recent years, deep learning has been widely applied in various fields27,28. Specifically, in the domain of fall recognition, deep learning methods have demonstrated greater effectiveness than traditional approaches, such as threshold-based algorithms29,30,31,32,33. Machine learning techniques are also prevalent in this area34,35,36,37, falling under the broader category of artificial intelligence38,39. Although a personal emergency response system (PERS)40 can be effective, it becomes useless when an individual cannot reach the button or is unconscious41,42. To address this limitation, passive monitoring techniques have been introduced to accurately detect falls without user interaction43. Recent advancements in deep learning and signal processing, particularly in the application of various optical sensors, have attracted significant attention from researchers44. Breakthroughs in smart video surveillance have led to the development of effective monitoring systems that rely on semantic data extraction techniques, such as pose estimation, human detection, anomaly detection, and motion tracking45. The combination of high-quality RGB cameras, extensive datasets, and enhanced computational power, in the context of deep neural networks, has contributed to the growing use of AI in smart surveillance systems28.
This study has several key motivations. Falls are a leading cause of injury and death among the elderly, and the ability to quickly detect and respond to falls can save lives. Additionally, falls may indicate underlying health issues; therefore, early detection can support timely intervention and prevention efforts. This study leverages deep learning techniques, specifically convolutional neural networks (CNNs), to enhance fall detection and classification accuracy. CNNs are well-suited for image-based classification tasks and have demonstrated promising results in previous studies on fall detection.
Major contributions
In this study, we present a novel deep learning architecture for human fall direction classification. The model aims to classify fall events into four distinct class labels: non-fall, back-fall, side-fall, and forward-fall, covering a wide range of scenarios. The process begins with data preprocessing, followed by the development of self-attention deep learning models. These models extract relevant features that are further optimized using a tree seed algorithm for feature selection. The extracted and selected features were passed to different machine learning classifiers for the final classification. The primary aim of this study was to develop a lightweight CNN-based model that can be applied to human fall direction datasets. The focus was on addressing the limitations of existing CNN architectures by introducing hybrid preprocessing techniques and lightweight models for deep feature extraction. This ensures that nonredundant and valuable features are captured without losing critical information in the sample images.
The key contributions of this research are as follows:
- Frame preprocessing techniques are applied to resize the frames, enhancing performance efficiency.
- Four models are proposed based on residual blocks and self-attention mechanisms: the 3-residual-block deep convolutional neural network (3-RBNet), 5-RBNet, 7-RBNet, and 9-RBNet self-attention models. The models are trained on the proposed dataset.
- Deep features are extracted from the self-attention layer for classification.
- The tree seed algorithm is applied to select the optimal features from those extracted by the proposed models, enhancing accuracy and reducing computational time.
Literature review
Adhikari et al.46 developed a fall detection system using video images captured by a Kinect RGB-Depth camera. The system applies a CNN to classify fall events and activities of daily living (ADL). They created their own dataset, which was divided into 73% for training and 27% for testing. The dataset was recorded in various indoor settings using the activities of different individuals, resulting in a total of 21,499 images. The system, which was designed to detect falls in situations involving only one individual, achieved an overall accuracy of 74%, exhibiting a sensitivity of 99% when the user was lying down; however, for the crawling, bending, and sitting positions, the sensitivity decreased significantly. This system operates in a controlled environment and utilizes basic deep learning techniques, whereas a fusion-based approach can provide better results in diverse settings.
Li et al.47 applied a fall detection system using CNNs to a video surveillance setting. The CNN is directly applied to each video frame to learn features related to human shape deformation, thereby detecting ADL and fall events. The University of Rzeszow fall detection (URFD) dataset was used for this purpose. The performance of the system was assessed using tenfold cross-validation, with each fold containing 850 test images, achieving an average specificity and accuracy of 99.98% and an average sensitivity of 100%. However, because the dataset had a consistent background, color scheme, and environment, system performance could be affected by changes in the background and foreground. In addition, this system was not tested for real-life falls among elderly individuals, in various environments. Yhdego et al.48 presented a machine learning technique that utilizes pretrained kinematic models on annotated accelerometry datasets. The accelerometer data were converted into images using a continuous wavelet transform, and a deep CNN was trained on these images using transfer learning. The open URFD dataset, which included 40 sequences of normal activities and 30 sequences of falls, was used in this study. The dataset was split in an 80:20 ratio for training and testing. The proposed system achieved an accuracy of 96.43%.
Yu et al.49 employed a CNN in an application using the background subtraction method to extract human body silhouettes. The CNN was applied to the preprocessed silhouettes to identify movements, such as bending, lying down, standing, and sitting. In this system, the background was subtracted using Codebook, a common technique for separating moving objects from the background. A custom dataset with 3,216 postures, including 769 sits, 833 bends, 804 stands, and 810 lies, was utilized to test the system. This approach achieved an accuracy of 96.88%, outperforming traditional machine learning systems. Santos et al.50 proposed a CNN-based deep learning method for fall detection. This approach was designed to operate within a fog computing environment and the Internet of Things (IoT). The model architecture consisted of three convolutional, two max-pooling, and three fully connected layers. The system was evaluated on three open datasets and compared with the current research. The datasets were divided in an 80:20 ratio for training and testing. The system achieved an accuracy of 93.8%.
Hwang et al.51 proposed a method for analyzing continuous motion data from depth cameras using 3D-CNNs. They employed data augmentation to overcome overfitting. The system used five random trials, extracting 240 videos for training and 24 videos for evaluation from the TST fall detection dataset, resulting in improved classification accuracies of 92.4% and 96.9%, respectively. Zhou and Komuro52 utilized a 3D convolutional residual block with a variational autoencoder (VAE) and a region extraction technique to enhance the accuracy of fall detection. This unsupervised learning technique used reconstruction errors to identify ADL and fall actions. The system was tested using le2i fall detection (Le2i FD) and high-quality simulated fall dataset (HQFD). An accuracy of 88.7% was achieved using the proposed unsupervised restricted latent (Res)-VAE model.
Proposed method
Dataset acquisition
Robust data collection is critical to this research. The dataset used in this study was collected by Soonchunhyang University, South Korea. To address the limitations of existing fall detection datasets, we created a new dataset that included four distinct fall directions: non-fall, back-fall, side-fall, and forward-fall, covering a wide range of scenarios. For data collection, we used eight cameras to capture 4K video at 60 frames per second (fps). The cameras were strategically positioned at various heights and angles, ranging from 1.5 to 3 m above the ground, to ensure comprehensive coverage of the recording environment. The experiments were conducted in both indoor and outdoor settings. The indoor environments included hospitals, nursing facilities, and homes, designed to replicate real-world conditions with different room layouts and lighting. Outdoor settings such as parks and streets were also included to provide a complete range of scenarios. The high-resolution videos were converted into individual frames for direct analysis and manipulation. To optimize the training process, we reduced the image dimensions, balancing computational efficiency and accuracy. This careful approach in data collection and preprocessing ensured the reliability and effectiveness of our dataset for fall detection research. A sample of the dataset is shown in Fig. 1.
Proposed methodology
In this section, we propose a novel deep learning approach for the classification of human fall directions, as shown in Fig. 2. The methodology comprises several key steps: data preprocessing, feature extraction using four deep self-attention models, feature selection using tree seed optimization, and classification. The four proposed deep self-attention models were trained on human fall images, features were extracted from the self-attention layer, and classification was performed using support vector machine (SVM) models. The tree seed algorithm (TSA) was applied to the extracted features to select the best ones, which were then passed to the cubic and quadratic SVMs for final classification.
Video frames preprocessing
Preprocessing is a crucial step in image processing. Frames were first extracted from the video data; the pixel values were then normalized, the images were resized, and local contrast was enhanced. In this study, the extracted video frames were resized from their original dimensions of 3840 × 2160 pixels to 320 × 240 pixels during preprocessing. These normalized frames were subsequently used for training the 3-RBNet, 5-RBNet, 7-RBNet, and 9-RBNet self-attention models53.
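As an illustration of this preprocessing step, the sketch below extracts frames with OpenCV, downscales them from 4K to 320 × 240, and normalizes pixel values to [0, 1]. The file path and helper name are placeholders rather than the authors' code, and the local-contrast enhancement mentioned above is omitted for brevity.

```python
# Illustrative frame-extraction and preprocessing sketch (OpenCV/NumPy);
# the video path and function name are assumptions, not the authors' code.
import cv2
import numpy as np

def preprocess_video(video_path, size=(320, 240)):
    """Extract frames, resize from 4K (3840x2160) to 320x240, and normalize to [0, 1]."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)  # downscale
        frames.append(frame.astype(np.float32) / 255.0)                # normalize pixels
    cap.release()
    return np.stack(frames) if frames else np.empty((0, size[1], size[0], 3))

# frames = preprocess_video("fall_clip.mp4")  # shape: (num_frames, 240, 320, 3)
```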
Proposed residual network
A residual block is a key architectural component that incorporates a skip connection alongside the standard feedforward path. The purpose of a residual block is to let the network model residual functions rather than learning the input mapping directly. The residual block is mathematically expressed as

\(\mathcal{R}\left(j\right)=\varphi \left(\varnothing Convolution\left(j\right)+j\right)\)
where \(\varnothing Convolution \left(j\right)\) represents the output of the convolutional operation applied to input \(j\), and \(\varphi\) is the activation function. In this study, four customized CNNs were developed based on multiple residual blocks. The details of the proposed networks are provided in Table 1.
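As a concrete illustration of this residual mapping, the following minimal PyTorch sketch computes the activation of the convolutional output plus the skip connection; the filter count and 3 × 3 kernels are illustrative and not the exact RBNet settings.

```python
# Minimal PyTorch sketch of the residual mapping phi(Convolution(j) + j);
# filter counts and kernel sizes are illustrative, not the exact RBNet values.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, j):
        # Output = activation(Convolution(j) + j), i.e. the block learns a residual.
        return self.act(self.conv(j) + j)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```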
Proposed 3-RBNet
The proposed 3-RBNet architecture comprises three residual blocks with 78 layers, 89 connections, and approximately 11.9 million parameters. Each residual block includes four parallel sets of layers connected at the end by an additional layer. The model begins with an input layer that accepts images of size 224 × 224 × 3, followed by a convolutional layer with 64 filters of size 3 × 3 and a stride of 2 × 2. This is followed by the first residual block, which contains four parallel sets of layers. The first layer in this block is a convolutional layer with 64 filters of size 2 × 2 and a stride of 1 × 1, followed by a batch normalization layer to enhance model convergence and stability during training. Another convolutional layer with 64 filters of size 2 × 2 and stride 1 × 1 is then applied using the rectified linear unit (ReLU) activation function to introduce nonlinearity, and an additional batch normalization layer mirrors the configuration of the initial layers. The subsequent residual blocks follow a structure similar to the first but with variations in depth and filter size, as detailed in Table 2. The final residual block includes a convolutional layer with 512 filters of size 2 × 2 and a stride of 1 × 1, followed by a ReLU activation layer. The architecture incorporates global average pooling to reduce the multidimensional feature map to a one-dimensional array, which is then flattened and used as the input to the self-attention layer. At the end of the network, a fully connected softmax layer is added for classification. The model was trained on the selected datasets, and self-attention was employed to extract deep features, resulting in feature vectors of size N × 512. The proposed architecture of 3-RBNet is shown in Fig. 3.
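One way to read the "four parallel sets of layers connected at the end by an additional layer" is as a multi-branch residual block whose branch outputs are merged by element-wise addition. The PyTorch sketch below follows that interpretation; the branch composition, the 3 × 3 filters (the paper specifies 2 × 2), and the summation-based merge are assumptions rather than the exact 3-RBNet design.

```python
# Hedged interpretation of a block with four parallel branches merged by an
# addition layer; filter sizes and branch composition are assumptions.
import torch
import torch.nn as nn

def branch(channels):
    """One parallel set: conv -> BN -> conv (ReLU) -> BN, following the block description."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(channels),
    )

class ParallelResidualBlock(nn.Module):
    def __init__(self, channels, num_branches=4):
        super().__init__()
        self.branches = nn.ModuleList([branch(channels) for _ in range(num_branches)])

    def forward(self, x):
        # The merging layer sums the outputs of the four parallel paths.
        return torch.stack([b(x) for b in self.branches]).sum(dim=0)

x = torch.randn(1, 64, 112, 112)
print(ParallelResidualBlock(64)(x).shape)  # torch.Size([1, 64, 112, 112])
```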
Proposed 5-RBNet
The proposed 5-RBNet architecture features five residual blocks comprising 123 layers, 142 connections, and approximately 12.4 million parameters. Each residual block includes four parallel sets of layers connected by an additional layer. The network starts with an input layer that processes 224 × 224 × 3 images. The initial convolutional layer has 64 filters, each of size 2 × 2 and stride 2 × 2. This is followed by the first residual block, which also contains four parallel sets of layers. The initial convolutional layer in this block uses 64 filters of size 2 × 2 and stride 1 × 1. This is followed by another convolutional layer with the same filter size and stride, along with ReLU activation and batch normalization layers. The subsequent parallel layers in the block follow a similar structure. A max-pooling layer is then applied to extract the maximum values from the feature maps, thereby reducing the dimensionality of the data. The remaining four residual blocks have designs similar to the first but with variations in depth, stride, and filter size, as detailed in Table 3. The final residual block includes a convolutional layer with 1024 filters of size 2 × 2 and a stride of 1 × 1, followed by a ReLU activation layer. To complete the network, global average pooling (GAP) is performed, followed by flattening, self-attention, and fully connected (FC) layers. Finally, a softmax layer is added for classification. The 5-RBNet was trained on selected datasets utilizing self-attention to extract prominent features, resulting in feature vectors of size N × 1024. The proposed network architecture of 5-RBNet is shown in Fig. 4.
Proposed 7-RBNet
The proposed 7-RBNet architecture features seven residual blocks comprising 123 layers, 142 connections, and approximately 14.3 million parameters. Each residual block includes four parallel sets of layers connected by an additional layer. The initial input layer processes images of size 224 × 224 × 3. The first convolutional layer contains 64 filters with strides of 2 × 2. This is followed by the first residual block, which also comprises four parallel groups of layers. The first convolutional layer in this block uses 64 filters of size 3 × 3 and stride 2 × 2. This is followed by another convolutional layer with the same filter size and stride, along with ReLU activation and batch normalization layers. The ensuing parallel layers in the block follow a similar structure. A max-pooling layer is then applied to extract the maximum values from the feature maps, thereby reducing the dimensionality of the data. The remaining six residual blocks are similar in design to the first but with variations in depth, stride, and filter size, as detailed in Table 4. The final residual block includes a convolutional layer with 1024 filters of size 2 × 2 and a stride of 1 × 1, followed by a ReLU activation layer. To complete the network, GAP is performed, followed by flattening, self-attention, and FC layers. Finally, a softmax layer is added for classification. The 7-RBNet was trained on selected datasets utilizing self-attention to extract prominent features, resulting in feature vectors of size N × 1024.
Proposed 9-RBNet
The proposed 9-RBNet architecture features nine residual blocks, 214 layers, 249 connections, and approximately 23.8 million parameters, as illustrated in Fig. 5. Each residual block is composed of four parallel sets of layers connected by an additional layer. A self-attention layer is incorporated after the sequence of residual blocks. The network accepts input images of size 224 × 224 × 3. The model starts with a convolutional layer using 32 filters of size 2 × 2 and a stride of 2 × 2. This is followed by the first residual block, which includes four parallel sets of layers. The initial convolutional layer within this block contains 32 filters of size 2 × 2 and stride of 1 × 1. This is followed by a batch normalization layer and another convolutional layer with 32 filters of size 2 × 2 and stride 1 × 1. A ReLU activation function is applied to introduce nonlinearity, and batch normalization is used to enhance model stability, accelerate convergence, and improve performance during training. The remaining three parallel sets of layers in the residual block have the same parameters as the first set, with an additional layer connecting all the parallel layers. A max-pooling layer with a pool size of 5 × 5 and stride of 1 × 1 is then applied to reduce dimensionality by selecting the maximum value from each window of the input feature map, effectively condensing the data. The next convolutional layer has 64 filters of size 3 × 3 and stride 2 × 2. The remaining eight residual blocks follow the same structure as the first but vary in depth and filter size, as detailed in Table 5. After the final residual block, a convolutional layer with 1280 filters of size 2 × 2 and stride 1 × 1 is applied, followed by a ReLU activation layer. A GAP layer is then used to reduce the feature map to a single average value per feature, followed by a flattening layer to convert the 2D feature map into a 1D vector. This vector is processed using a self-attention layer. Finally, FC, softmax, and classification layers are added to complete the network. The model was trained on selected datasets with prominent features extracted using self-attention to generate feature vectors of dimension N × 1280. The proposed network architecture of 9-RBNet is shown in Fig. 5.
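To make the feature-extraction stage concrete, the following is a minimal PyTorch sketch of scaled dot-product self-attention applied to the flattened GAP descriptor. The reshaping of the 1280-dimensional vector into 20 tokens of 64 dimensions is our assumption, since the paper does not specify how the flat vector is arranged before attention.

```python
# Hedged sketch of the self-attention stage on the flattened GAP vector;
# the 20 x 64 token split is an assumption, not the paper's exact layout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorSelfAttention(nn.Module):
    def __init__(self, feat_dim=1280, tokens=20):
        super().__init__()
        assert feat_dim % tokens == 0
        self.tokens, self.dim = tokens, feat_dim // tokens
        self.qkv = nn.Linear(self.dim, 3 * self.dim)   # query/key/value projections

    def forward(self, f):                              # f: (N, 1280)
        t = f.view(-1, self.tokens, self.dim)          # (N, 20, 64)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)
        out = attn @ v                                 # (N, 20, 64)
        return out.flatten(1)                          # back to (N, 1280) deep features

feats = VectorSelfAttention()(torch.randn(8, 1280))
print(feats.shape)  # torch.Size([8, 1280]) -> the N x 1280 feature matrix
```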
Feature selection using the tree seed algorithm
The tree seed algorithm (TSA) is modeled after the natural process of trees scattering seeds that ultimately grow into new trees54. In this analogy, the search space of the optimization problem corresponds to the surface on which the trees stand, with the positions of the seeds and trees representing candidate solutions. The search efficiency is heavily influenced by the selection of optimal locations for these seeds and trees. The optimization process is governed by two mathematical equations: the first enhances the local search by exploiting the location of the best tree in the population, whereas the second promotes exploration using the locations of the seed-producing tree and a randomly selected tree. Consequently, a new seed is generated for the \(i\)th tree as follows:

\({W}_{ij}={Z}_{ij}+{\alpha }_{ij}\left({Y}_{j}-{Z}_{bj}\right)\)

\({W}_{ij}={Z}_{ij}+{\alpha }_{ij}\left({Z}_{ij}-{Z}_{bj}\right)\)
where \({W}_{ij}\) is the \(j\)th dimension of the seed produced for the \(i\)th tree; \({Z}_{ij}\) is the \(j\)th dimension of the \(i\)th tree; \({Y}_{j}\) is the \(j\)th dimension of the best tree position; \({Z}_{bj}\) is the \(j\)th dimension of a randomly selected \(b\)th tree, with \(b\) and \(i\) denoting different indices; and \({\alpha }_{ij}\) is a random scaling factor drawn from the range \(\left[-1, 1\right]\). The critical part of these equations is deciding which of the two determines the position of the new seed. To manage this selection, a search tendency (ST) parameter in the range \(\left[0, 1\right]\) is employed: when a random number drawn for the seed is smaller than ST, the first (exploitation) equation is used; otherwise, the second (exploration) equation is applied. A higher ST value indicates a strong local search capability and faster convergence, whereas a lower ST suggests slower convergence but a more robust global search. Thus, the ST parameter regulates the balance between exploitation and exploration in TSA. The optimization starts from initial tree positions defined as follows:

\({Z}_{ij}={F}_{j,min}+{b}_{ij}\left({I}_{j,max}-{F}_{j,min}\right)\)

where \({F}_{j,min}\) represents the lower bound; \({I}_{j,max}\) represents the upper bound; and \({b}_{ij}\) is a random number in the range \(\left[0, 1\right]\) drawn for each location and dimension. Here, \(Z\) represents the number of trees in the population, and \(Y\) represents the best solution in the current population. The best-selected features of TSA are identified by applying a maximum function to the fitness values of the trees as follows:

\(Y={max}\left\{f\left({Z}_{i}\right)\right\}, \quad i=1, 2, \dots , Z\)
The optimally selected features are then passed to the fitness function, and this procedure is repeated until the error is minimized. Once the error is appropriately reduced, the selected features are fed into a supervised learning classifier for the final classification.
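For illustration, the sketch below implements a tree-seed-style wrapper for feature selection in NumPy/scikit-learn. The fitness proxy (3-fold KNN error), the 0.5 threshold that turns continuous positions into a feature mask, and all parameter values (population size, seeds per tree, ST, iterations) are our assumptions; the paper's exact TSA configuration and fitness function may differ.

```python
# Illustrative (and deliberately small) tree seed optimization for feature selection.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    """Classification error with the selected feature subset (lower is better)."""
    idx = np.where(mask > 0.5)[0]
    if idx.size == 0:
        return 1.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return 1.0 - cross_val_score(clf, X[:, idx], y, cv=3).mean()

def tree_seed_select(X, y, n_trees=10, n_seeds=5, iters=30, st=0.3, rng=None):
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    trees = rng.random((n_trees, d))                       # Z: tree positions in [0, 1]
    costs = np.array([fitness(t, X, y) for t in trees])
    best = trees[costs.argmin()].copy()                    # Y: best tree so far
    for _ in range(iters):                                 # note: O(trees*seeds*iters) fitness calls
        for i in range(n_trees):
            for _ in range(n_seeds):
                b = rng.integers(n_trees)                  # randomly selected b-th tree
                alpha = rng.uniform(-1, 1, d)              # scaling factor in [-1, 1]
                if rng.random() < st:                      # exploitation (toward best tree)
                    seed = trees[i] + alpha * (best - trees[b])
                else:                                      # exploration
                    seed = trees[i] + alpha * (trees[i] - trees[b])
                seed = np.clip(seed, 0, 1)
                c = fitness(seed, X, y)
                if c < costs[i]:                           # keep seed if it improves the tree
                    trees[i], costs[i] = seed, c
        best = trees[costs.argmin()].copy()
    return np.where(best > 0.5)[0]                         # indices of selected features

# selected = tree_seed_select(features, labels); reduced = features[:, selected]
```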
Results and discussion
Results and analysis
In this section, the results obtained using the proposed methodology are presented. The proposed dataset was divided in an 80:20 ratio, with 80% used for training and the remaining 20% used for testing, as shown in Table 6. To ensure reliability, we implemented tenfold cross-validation throughout the experimental process. The training process involved manually selecting key hyperparameters: a learning rate of 0.0001, 100 epochs, a mini-batch size of 64, and stochastic gradient descent (SGD) as the optimizer. These settings were carefully selected to optimize training and improve model performance. To further improve classification performance, we applied various SVM models to the extracted features, including linear SVM (LSVM), cubic SVM (CSVM), quadratic SVM (QSVM), fine Gaussian SVM (FGSVM), medium Gaussian SVM (MGSVM), and coarse Gaussian SVM (CGSVM). We employed TSA to fine-tune the hyperparameters of these models to ensure optimal performance. The performance of each classifier was evaluated using a comprehensive set of metrics: accuracy, precision, sensitivity, computational time, and area under the curve (AUC). These metrics offer insight into the effectiveness and efficiency of the models in accurately distinguishing between the different classes. The experiments were conducted using MATLAB R2024a on a desktop computer equipped with 64 GB of RAM, a 2 TB SSD, and a 16 GB NVIDIA GeForce RTX 4070 Ti Super graphics card.
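The experiments themselves were run in MATLAB R2024a; the snippet below is only a rough scikit-learn analogue of the SVM variants applied to the extracted deep features, with kernel settings chosen by us to approximate the MATLAB classifier templates (polynomial degree 2 for quadratic, degree 3 for cubic, RBF for the Gaussian variants, which differ only in kernel scale).

```python
# Hedged scikit-learn analogue of the SVM variants used for final classification;
# kernel parameters are approximate equivalents, not the MATLAB settings.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

svm_variants = {
    "LSVM":  SVC(kernel="linear"),
    "QSVM":  SVC(kernel="poly", degree=2),   # quadratic SVM
    "CSVM":  SVC(kernel="poly", degree=3),   # cubic SVM
    "MGSVM": SVC(kernel="rbf"),              # medium Gaussian SVM
}

def evaluate(X, y):
    """10-fold cross-validated accuracy for each SVM on the selected deep features."""
    return {name: cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=10).mean()
            for name, clf in svm_variants.items()}

# scores = evaluate(selected_features, labels)
```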
Proposed results for 3-RBNet
In this subsection, the classification results for human fall direction are evaluated using the four residual-based models: the 3-RBNet, 5-RBNet, 7-RBNet, and 9-RBNet self-attention models. Different machine learning classifiers were employed, including LSVM, MGSVM, QSVM, FGSVM, CSVM, and CGSVM. The classification results for the proposed 3-RBNet on the human fall direction dataset are detailed in Table 7a. The results indicate that a maximum accuracy of 88% was achieved using QSVM, with a computational time of 1.5 s. Additional metrics were also calculated, including a precision rate of 88%, a sensitivity rate of 88%, and an AUC value of 0.97. LSVM achieved an accuracy of 87.6% with a computational time of 2 s, a sensitivity rate of 87.5%, a precision rate of 87.6%, and an AUC of 0.97. Hence, these two models differed in accuracy and computational time by 0.4% and 0.5 s, respectively. The third-best accuracy of 86.5% was achieved using CSVM with a computational time of 5 s, which is longer than that of QSVM and LSVM. FGSVM achieved an accuracy of 80.1% with a computational time of 6 s. Its accuracy was lower than that of QSVM, and the computational time difference between QSVM and FGSVM was 4.5 s, the largest among the compared models. The confusion matrix for QSVM with 3-RBNet is shown in Fig. 6a.
Proposed results for 5-RBNet
The classification results for the proposed 5-RBNet on the human fall direction dataset are detailed in Table 7b. The results show that the highest accuracy of 92% was achieved using the CSVM classifier with a computational time of 1.2 s. Other key metrics were also calculated, including a precision rate of 91.75%, a sensitivity rate of 91.7%, and an AUC value of 0.98. LSVM achieved the second-best accuracy of 91.8% with a computational time of 2.5 s, a sensitivity rate of 91.7%, a precision rate of 92%, and an AUC of 0.98. The difference between these two models in accuracy and computational time was 0.2% and 1.3 s, respectively. The third-best accuracy of 91% was achieved using QSVM with a computational time of 1.7 s, which is longer than that of CSVM. MGSVM achieved an accuracy of 90.9% with a computational time of 7 s. Its accuracy was lower than that of CSVM, and the computational time difference between CSVM and MGSVM was 5.8 s, the largest among the compared models. The confusion matrix for CSVM with 5-RBNet is shown in Fig. 6b.
Proposed results for 7-RBNet
The classification results for the proposed 7-RBNet on the human fall direction dataset are detailed in Table 7c. The results show that the highest accuracy of 92.2% was achieved using the CSVM classifier with a computational time of 2.5 s. Other key metrics were also calculated, including a precision rate of 92.1%, a sensitivity rate of 92.1%, and an AUC of 0.98. LSVM achieved the second-best accuracy of 91% with a computational time of 4 s, a sensitivity rate of 91%, a precision rate of 91.02%, and an AUC of 0.98. The difference between these two models in accuracy and computational time was 1.2% and 1.5 s, respectively. The third-best accuracy of 90.9% was achieved using QSVM with a computational time of 3 s, which is longer than that of CSVM. FGSVM achieved an accuracy of 80.8% with a computational time of 6 s. Its accuracy was lower than that of CSVM, and the computational time difference between CSVM and FGSVM was 3.5 s, the largest among the compared models. The confusion matrix for CSVM with 7-RBNet is shown in Fig. 6c.
Proposed results for 9-RBNet
The classification results for the proposed 9-RBNet on the human fall direction dataset are presented in Table 7d. The results show that the highest accuracy of 92.6% was achieved using the QSVM classifier with a computational time of 3 s. Other key metrics were also calculated, including a precision rate of 92.6%, a sensitivity rate of 92.7%, and an AUC value of 0.98. LSVM achieved the second-best accuracy of 91.9% with a computational time of 3.8 s, a sensitivity rate of 91.8%, a precision rate of 91.9%, and an AUC of 0.98. The difference between these two models in accuracy and computational time was 0.7% and 0.8 s, respectively. The third-best accuracy of 91.8% was achieved by MGSVM and CGSVM with computational times of 4.6 and 6.3 s, respectively, both longer than that of QSVM. FGSVM obtained an accuracy of 71.4% with a computational time of 6.3 s. Its accuracy was lower than that of QSVM, and the computational time difference between QSVM and FGSVM was 3.3 s, the largest among the compared models. The confusion matrix for QSVM with 9-RBNet is shown in Fig. 6d.
Analysis of model accuracy
The performance results of the four proposed models, the 3-RBNet, 5-RBNet, 7-RBNet, and 9-RBNet self-attention models, on the human fall direction dataset are presented in Table 7a–d, and the corresponding confusion matrices are shown in Fig. 6a–d. The analysis shows that the 3-RBNet self-attention model did not achieve the same level of accuracy as the 7-RBNet and 9-RBNet self-attention models. Specifically, the 9-RBNet self-attention architecture performed best on the proposed dataset. Consequently, the features extracted from the 7-RBNet and 9-RBNet self-attention architectures were selected for further optimization using the tree seed algorithm. This optimization was designed to maintain high precision, accuracy, and AUC values while reducing computational time.
Proposed tree seed optimization results
To improve accuracy and computational time, the proposed tree seed algorithm was applied to the features extracted from the self-attention layer of the 7-RBNet model. Without optimization, the 7-RBNet model achieved its best accuracy of 92.2% with a computational time of 2.5 s using CSVM, as presented in Table 7c. After applying TSA optimization to the 7-RBNet features, the model achieved an accuracy of 92.5% with a computational time of 1.1 s, along with sensitivity and precision rates of 92.3% and 92.2%, respectively. The differences in accuracy and computational time for 7-RBNet with and without optimization were therefore 0.3% and 1.4 s, corresponding to a reduction in computational time of approximately 56%. The results for 7-RBNet are presented in Table 8a, and the confusion matrix is shown in Fig. 7a.
The results of 9-RBNet with the tree seed algorithm on the human fall direction dataset are presented in Table 8b. The best accuracy of 93.2% was achieved using QSVM with a computational time of only 1 s. The other key metrics included a precision rate of 93.2%, a sensitivity rate of 93.1%, and an AUC value of 0.98. LSVM obtained the second-best accuracy of 92% with a computational time of 0.9 s, along with a sensitivity rate of 91.9%, a precision rate of 91.9%, and an AUC of 0.98. The third-best accuracy of 91.9% was achieved using MGSVM with a computational time of 2.1 s, which is longer than that of QSVM. As noted in the analysis above, the 9-RBNet model without optimization achieved its best accuracy of 92.6% with a computational time of 3 s using QSVM (Table 7d). The differences in accuracy and computational time for 9-RBNet with and without optimization were therefore 0.6% and 2 s, respectively; that is, the computational time was reduced to approximately one-third of its original value. The performance of the other classifiers also improved in terms of accuracy and computational time after applying the TSA algorithm. The confusion matrix for 9-RBNet with TSA and QSVM is shown in Fig. 7b.
Comparison with state-of-the-art methodologies
While direct comparison with other studies is limited by differences in datasets and class definitions, an indicative comparison of the proposed method with existing techniques is presented in Table 9 for general context. Our proposed deep learning model, optimized with the tree seed algorithm, achieved an accuracy of 93.2% with the quadratic SVM on our own dataset collected at Soonchunhyang University, South Korea. The referenced methods of Navdeep Kaur et al.42, Na Zhu et al.39, Jorge D. Cardenas et al.41, and Wenfeng Pang et al.40 are based on a hybrid Haar cascade model, deep neural networks, a combination of CNN and LSTM, and convolutional neural networks, with accuracies of 89.2%, 86%, 92.1%, and 81.1%, respectively. Our approach achieved higher accuracy while utilizing eight cameras and four classes.
Conclusion
In this study, we developed four novel residual block and self-attention models, named the 3-RBNet, 5-RBNet, 7-RBNet, and 9-RBNet self-attention models, together with feature selection using the tree seed algorithm for fall direction classification. The proposed models classify fall events into four categories: non-fall, back-fall, side-fall, and forward-fall. A frame-preprocessing technique was applied to resize the frames used for training the proposed architectures. Based on the initial results, the 3-RBNet, 5-RBNet, 7-RBNet, and 9-RBNet self-attention models obtained accuracies of 88%, 92%, 92.2%, and 92.6%, respectively. The tree seed optimization algorithm was then applied to the 7-RBNet and 9-RBNet models to select the optimal features for classification, yielding accuracies of 92.5% and 93.2% on the human fall image dataset, respectively. From the detailed experimental process, the following conclusions are drawn:
- The inclusion of a self-attention layer in 5-RBNet, 7-RBNet, and 9-RBNet significantly enhanced accuracy and precision.
- Optimizing deep feature extraction and hyperparameters for the 7-RBNet and 9-RBNet models improved accuracy and precision while reducing computation time.
A current limitation of our proposed model is its dependence on manually set hyperparameters, which may affect generalizability across different datasets. In future work, we aim to integrate a dynamic hyperparameter tuning strategy to enhance both adaptability and overall performance. Furthermore, we intend to enhance the model by integrating an inverted bottleneck architecture with a self-attention mechanism to improve the efficiency of fall direction classification. Because self-attention offers valuable insights into model behavior, we also aim to explore interpretability by analyzing attention weights to identify which time steps or temporal features the model prioritizes. This interpretability analysis will enhance the transparency and explainability of our system, particularly in real-world applications.
Data availability
The study’s data are available upon request from the corresponding author.
References
Jeyashree, G., Padmavathi, S. & Shanthini, D. 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT). 1223–1229 (IEEE).
Khoddam, H., Eshkevar laji, S., Nomali, M., Modanloo, M. & Keshtkar, A. A. Prevalence of malnutrition among elderly people in Iran: protocol for a systematic review and meta-analysis. JIRM 8, e15334 (2019).
Berg, R. L. & Cassells, J. S. The second fifty years: Promoting health and preventing disability (National Academies Press (US), 1992).
Burns, E. Deaths from falls among persons aged ≥ 65 years—United States, 2007–2016. 67 (2018).
Lord, S. R., Menz, H. B. & Sherrington, C. Home environment risk factors for falls in older people and the efficacy of home modifications. Age Ageing 35, 55–59 (2006).
Vallabh, P. & Malekian, R. Fall detection monitoring systems: a comprehensive review. J. Appl. Int. Comput. 9, 1809–1833 (2018).
Wang, P., Chen, C.-S. & Chuan, C.-C. 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE). 309–315 (IEEE).
Doulamis, N. 2010 3rd International Symposium on Applied Sciences in Biomedical and Communication Technologies (ISABEL 2010). 1–5 (IEEE).
Rawashdeh, O. et al. 2012 IEEE International Conference on Electro/Information Technology. 1–7 (IEEE).
Homann, B. et al. The impact of neurological disorders on the risk for falls in the community dwelling elderly: A case-controlled study. Med. Eng. Phys. 3, e003367 (2013).
Nooruddin, S., Islam, M. M. & Sharna, F. A. An IoT based device-type invariant fall detection system. J. Internet Thing 9, 100130 (2020).
Erden, F., Velipasalar, S., Alkar, A. Z. & Cetin, A. Sensors in assisted living: A survey of signal and image processing methods. Med. Eng. Phys. 33, 36–44 (2016).
Khan, S. S. & Hoey, J. Review of fall detection techniques: A data availability perspective. Med. Eng. Phys. 39, 12–22 (2017).
Prajapati, T., Bhatt, N. & Mistry, D. A survey paper on wearable sensors based fall detection. 115 (2015).
Yu, X. HealthCom 2008–10th International Conference on e-health Networking, Applications and Services. 42–47 (IEEE).
Chaudhuri, S., Thompson, H. & Demiris, G. Fall detection devices and their use with older adults: A systematic review. J. Geriatr. Phys. Ther. 37, 178–196 (2014).
Delahoz, Y. S. & Labrador, M. A. J. S. Survey on fall detection and fall prevention using wearable and external sensors. Sensors 14, 19806–19842 (2014).
Zhang, Z., Conly, C. & Athitsos, V. Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments. 1–7.
Debes, C. et al. Monitoring activities of daily living in smart homes: Understanding human behavior. Nature 33, 81–94 (2016).
Mubashir, M., Shao, L. & Seed, L. A survey on fall detection: Principles and approaches. Nature 100, 144–152 (2013).
Igual, R., Medrano, C. & Plaza, I. Challenges, issues and trends in fall detection systems. Biomed. Eng. 12, 66 (2013).
Islam, M. M. et al. A review on fall detection systems using data from smartphone sensors. Sensors 24, 569–576 (2019).
Koshmak, G., Loutfi, A. & Linden, M. J. Challenges and issues in multisensor fusion approach for fall detection. Sensors 2016, 6931789 (2016).
Xu, T., Zhou, Y. & Zhu, J. New advances and challenges of fall detection systems: A survey. Appl. Sci. 8, 418 (2018).
Khel, M. A. B. & Ali, M. 2019 2nd International Conference on Advancements in Computational Sciences (ICACS). 1–8 (IEEE).
Ren, L. & Peng, Y. J. Research of fall detection and fall prevention technologies: A systematic review. Nature 7, 77702–77722 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Khan, A. et al. Human gait recognition using deep learning and improved ant colony optimization. 70 (2022).
Colon, L. N. V., DeLaHoz, Y. & Labrador, M. 2014 IEEE Latin-America Conference on Communications (LATINCOM). 1–7 (IEEE).
Hsieh, S.-L., Yang, C.-T. & Li, H.-J. 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2373–2377 (IEEE).
Rakhman, A. Z. & Nugroho, L. E. 2014 The 1st International Conference on Information Technology, Computer, and Electrical Engineering. 99–104 (IEEE).
Aguiar, B., Rocha, T., Silva, J. & Sousa, I. 2014 IEEE International Symposium on Medical Measurements and Applications (MeMeA). 1–6 (IEEE).
Wu, F., Zhao, H., Zhao, Y. & Zhong, H. J. Development of a wearable-sensor-based fall detection system. Int. J. Telecommun. Appl. 2015, 576364 (2015).
Vallabh, P., Malekian, R., Ye, N. & Bogatinoska, D. C. 2016 24th international conference on software, telecommunications and computer networks (SoftCOM). 1–9 (IEEE).
Rahaman, A., Islam, M. M., Islam, M. R., Sadi, M. S. & Nooruddin, S. J. Developing iot based smart health monitoring systems: A review. Rev. Intell. 33, 435–440 (2019).
Khan, A., Pin, K., Aziz, A., Han, J. W. & Nam, Y. J. S. Optical coherence tomography image classification using hybrid deep learning and ant colony optimization. Sensors 23, 6706 (2023).
De Miguel, K., Brunete, A., Hernando, M. & Gambao, E. J. S. Home camera-based fall detection system for the elderly. Sensors 17, 2864 (2017).
Yodpijit, N., Sittiwanchai, T. & Jongprasithporn, M. 2017 3rd International Conference on Control, Automation and Robotics (ICCAR). 547–550 (IEEE).
Casilari-Pérez, E. & García-Lagos, F. A comprehensive study on the use of artificial neural networks in wearable fall detection systems. Expert Syst. Appl. 138, 112811 (2019).
Adhikari, K., Bouchachia, H. & Nait-Charif, H. J. Deep learning based fall detection using simplified human posture. Int. J. Comput. Syst. 13, 255–260 (2019).
Sarabia-Jácome, D., Usach, R., Palau, C. E. & Esteve, M. Highly-efficient fog-based deep learning AAL fall detection system. J. Internet Thing 11, 100185 (2020).
Thakur, N. & Han, C. Y. A study of fall detection in assisted living: Identifying and improving the optimal machine learning method. Networks 10, 39 (2021).
Han, Q. et al. A two-stream approach to fall detection with MobileVGG. Sensors 8, 17556–17566 (2020).
Chen, W., Jiang, Z., Guo, H. & Ni, X. J. S. Fall detection based on key points of human-skeleton using openpose. Sensors 12, 744 (2020).
Wu, X. et al. Applying deep learning technology for automatic fall detection using mobile sensors. Sensors 72, 103355 (2022).
Şengül, G. et al. Deep learning based fall detection using smartwatches for healthcare applications. Sensors 71, 103242 (2022).
Adhikari, K., Bouchachia, H. & Nait-Charif, H. 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA). 81–84 (IEEE).
Li, X., Pang, T., Liu, W. & Wang, T. 2017 10th international congress on image and signal processing, biomedical engineering and informatics (CISP-BMEI). 1–6 (IEEE).
Yhdego, H. et al. 2019 Spring Simulation Conference (SpringSim). 1–12 (IEEE).
Yu, M., Gong, L. & Kollias, S. Proceedings of the 19th ACM international conference on multimodal interaction. 416–420.
Santos, G. L. et al. Accelerometer-based human fall detection using convolutional neural networks. Sensors 19, 1644 (2019).
Hwang, S., Ahn, D., Park, H. & Park, T. Proceedings of the Second International Conference on Internet-of-Things Design and Implementation. 343–344.
Zhou, J. & Komuro, T. 2019 IEEE International Conference on Image Processing (ICIP). 3372–3376 (IEEE).
Haider, I. et al. Crops Leaf Disease Recognition From Digital and RS Imaging Using Fusion of Multi Self-Attention RBNet Deep Architectures and Modified Dragonfly Optimization. (2024).
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218176) and the Soonchunhyang University Research Fund.
Author information
Authors and Affiliations
Contributions
A. K. contributed to developing the proposed model and preparing the manuscript; J. Y. K., C. K., M. A. K., and H. S. contributed to revising the manuscript; J. W. and Y. N. supervised the work and reviewed the manuscript. All authors have approved the manuscript and agree with its submission.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Khan, A., Kim, JY., Kim, C. et al. Human fall direction recognition in the indoor and outdoor environment using multi self-attention RBnet deep architectures and tree seed optimization. Sci Rep 15, 28475 (2025). https://doi.org/10.1038/s41598-025-11031-9