Introduction

Vision is among the most crucial senses and is vital for day-to-day activities. The World Health Organisation (WHO) reports that around 2.2 billion individuals suffer from blindness or impaired vision1. Assistive technologies have been developed to support visually challenged people, yet although these users need independence and mobility, much of this technology is either unavailable or expensive. They therefore require economical, lightweight, versatile, real-time applications that work with greater precision. Technological advancements enable systems to examine and perceive surroundings as individuals do, utilizing computer vision (CV)2. For visually challenged individuals, CV provides access to visual data. Object detection (OD) is a vital CV task that enables the acquisition of intellectual capabilities3. OD mainly aims to detect and classify multiple objects in imagery and videos. Moreover, OD is considered a vital component of CV, tackling object classification in imagery, which enables applications such as content-based image retrieval (CBIR) and augmented reality4. Advancements in technology have resulted in the growing acceptance of smart devices, such as IoT devices, among the general public5.

This acceptance can be attributed to their capability to offer greater accessibility, efficacy, and functionality in several aspects of everyday living; their incorporation of intelligent sensors, connectivity, and sophisticated analytics allows real-time data collection and transmission6. This has increased the number of devices connected to the network. Because data is mainly generated at the network edge while demanding processing happens in central cloud environments, an adaptable, connected framework is needed to bridge the gap and allow edge and cloud environments to operate seamlessly together7. Full exploitation of this continuum is expected to support latency-critical applications8. Furthermore, deep learning (DL) is presently being used in numerous IoT-connected applications, such as CV, where real-time processing is essential. For these applications, IoT devices produce extensive data that is examined by artificial intelligence (AI) approaches. Nevertheless, the more intricate the model, the greater the computational resources needed for training and inference9. This is due to the large number of parameters that make up the model, as well as the millions of computations required for training and inference. Currently, DL-driven methods are leveraged for OD to assist visually impaired persons (VIP)10.

Key contributions

This paper proposes an intelligent Feature Fusion with Dynamic Graph Convolutional Recurrent Network for Robust Object Detection (FFDGCRN-ROD) model, designed to support disabled individuals within the smart IoT-edge-cloud environment. The main contributions of this paper are listed below:

  • An intelligent FFDGCRN-ROD model is introduced to assist individuals with disabilities in a smart IoT-edge-cloud environment.

  • To mitigate the noise and improve image quality, an advanced image pre-processing pipeline is developed by utilizing resizing, normalization, and enhancement. This facilitates improved accuracy in feature extraction. This process strengthens the overall performance in smart IoT edge applications.

  • The Faster R-CNN model is integrated to enable reliable OD and precise localization of targets in real time. It also improves detection speed and accuracy in intricate environments. This integration supports efficient object detection on behalf of individuals with disabilities. The model is optimized for deployment in smart IoT edge-cloud systems.

  • A fusion of feature extraction models, namely CapsNet, SqueezeNet, and Inception V3, is employed for capturing multi-level and complementary feature representations. This integration also improves the model’s capability to recognize diverse and complex patterns. It also enhances robustness and discriminative power in OD tasks. The approach ensures high accuracy in varying real-time IoT edge scenarios.

  • The DA-GCRN methodology is employed for accurate classification and detection. The technique also facilitates efficient modelling of spatial-temporal dependencies in feature data. This method additionally strengthens the detection of subtle patterns and relationships. The network improves reliability and precision in smart edge-based environments.

  • The FFDGCRN-ROD approach introduces a novel incorporation of CapsNet, SqueezeNet, and InceptionV3 models for enriched multi-level feature extraction. It utilizes DA-GCRN for capturing spatial-temporal dynamics for precise object classification. Integration with Faster R-CNN ensures real-time detection and localization. The end-to-end framework is optimized for smart IoT edge-cloud environments, targeting individuals with disabilities.

  • Extensive experiments demonstrate that the proposed model outperforms current advanced methodologies in terms of robustness and accuracy.

Organization of the work

This paper is structured to deliver a comprehensive understanding of the research. Section 2 inspects the literature relevant to the study. Section 3 details the methodological technique. Section 4 provides the outcomes of the experimental evaluations, showcasing the effectiveness and efficiency of the presented model. Ultimately, Sect. 5 summarises the research’s significant findings and contributions.

Related works on indoor OD for individuals with disabilities

In11, the authors presented an enhanced assistive technology for blind people and OD utilizing a Hiking optimizer algorithm (EATBP-ODHOA) method. The motive of this method is to design an efficient OD approach for VIP through sophisticated DL models. The image pre-processing step primarily uses an adaptive bilateral filtering (ABF) model. Moreover, the R-CNN methodology was utilized for detecting objects. In12, the authors proposed an innovative OD model for disabled persons utilizing a state-of-the-art DL method with a sparrow search optimizer (ODSDP-ADLMSSO) model. This model aims to improve the OD technique for VIP. Moreover, the YOLO version 7 approach is employed to detect and classify objects from images. Kumar et al.13 introduced a device with advanced technology that would allow VIPs to perform their own tasks rather than depending on others. For OD, this application leverages image processing methods, and for voice output, it utilizes speech synthesis. This technology focuses on providing VIP with real-time vocal or audio data regarding objects captured in their mobile cameras. Biradar and Gull14 presented a novel OD approach termed YOLOv8 alongside Spiking EfficientDet (Yv8SED), which enhances detection precision, minimizing incorrect classification with specific importance on small objects. The presented approach delivers extraordinary OD outcomes with decreased cost and time, which makes it a robust model for diverse OD missions. The segmentation procedure is enhanced by SegNet, which proficiently segments objects from images. Abidi et al.15 introduced an application, depending on the gradient support vector boosting-driven crossover golden jackal (GSB-CGJ) model. The introduced application aims to guide VIP, avoid accidents, and generate real-time feedback. Next, OD is effectively performed through the GSB-CGJ technique. Hebbar and Pullela16 provided a thorough analysis of developments in assistive technology personalized for VIPs with a particular emphasis on the function of the IoT in improving their quality of life. This study incorporates a detailed assessment of various studies and papers regarding wearable devices, navigation solutions, and IoT-based OD methods, highlighting the importance of user-centred design and possible advances in detection accuracy. Malla et al.17 focused on the VIPs, who encounter significant issues in social mobility because of neurological or physiological conditions that cause blindness. Despite a dependence on external support, there is an increasing preference for autonomy among these individuals. Given this, a groundbreaking device, SBS, is introduced to improve their mobility-related problems. The SBS is a cutting-edge adaptive device which is developed to solve everyday navigation problems faced by VIPs. Alagarsamy et al.18 designed a device that assists blind people in identifying various objects in their surroundings by employing the YOLO version 3 model integrated with R-CNN.

This includes various methodologies to design an application that rapidly detects diverse objects in VIP environments and guides users through audio output; a CNN named YOLO identifies the objects. The developed technique is more efficient and precise than other models at detecting objects. Alazeb et al.19 presented a scene recognition framework by utilizing kernel convolution, UNet segmentation, discrete wavelet transform (DWT), and deep belief networks (DBN) incorporated with AlexNet for accurate multi-OD and labelling in robotic environments. Shawly and Alsheikhy20 proposed an automated epilepsy prediction system using a novel attention module (NAM) integrated into a convolutional neural network (CNN), combined with the Fourier transform for feature extraction, principal component analysis (PCA) for dimensionality reduction, and an optimized stochastic gradient descent with the Adam optimizer to enhance learning from EEG signals. Xu et al.21 developed a Temporal-Frequency-Spatial Transformer (TFST) technique integrated with sub-domain knowledge alignment (SDKA) to improve the accuracy and generalization of major depressive disorder detection from EEG data by incorporating fine-grained frequency features and enhancing domain alignment. Bu et al.22 introduced OD-YOLO, an enhanced OD model based on You Only Look Once version 8 nano (YOLOv8n), integrating deformable convolutional networks, a hybrid attention transformer, a dynamic head, and the Optimised Intersection over Union (OIoU) loss for improving small OD in remote sensing images. Wang et al.23 presented the YOLO-Deformable Convolution with Efficient Channel Attention and Bidirectional Feature Pyramid Network (YOLO-DBS) approach for improving underwater image pre-processing and enhancing real-time detection accuracy and efficiency. Lin and Leng24 proposed a Scale Selection Network (SSN) with a landmark guided scale attention network and a reversible scale semantic flow preserving strategy for mitigating computational redundancy and accelerating multi-scale OD in remote sensing images. Apostolidis and Papakostas25 analyzed the security, robustness, and performance of the YOLO OD models, including YOLOX variants, under adversarial attacks and image corruption to guide optimal model selection for real-world applications. Yuan et al.26 developed a distributed Edge-Cloud collaborative framework for unmanned aerial vehicle (UAV) object detection using the Edge-Embedded Lightweight (E2L) algorithm with attention mechanism (AM) and a fuzzy neural network. Kim et al.27 developed Cirrus, an end-to-end system with combinational point sampling for low-latency and energy-efficient 3D OD on outdoor Light Detection and Ranging (LiDAR) data, preserving accuracy while mitigating input size and processing overhead. Subramanian et al.28 presented an audio-visual object detection system using TensorFlow Lite and FogBus on Raspberry Pi, incorporating multimodal data fusion with edge and fog computing to enable real-time, accurate scene description for visually impaired users.

Though the existing studies are efficient in the OD task for individuals with disabilities, they still exhibit various limitations. For instance, multiple models encounter challenges due to low lighting, high occlusion, and background clutter, affecting detection accuracy. The dependence on single-modal data (either visual or audio) restricts robustness in dynamic environments. Some methodologies achieve high accuracy but are computationally intensive, making real-time deployment on low-power edge devices difficult. Furthermore, generalization poses a constraint due to limited adaptability across diverse environmental conditions and targets. Additionally, segmentation accuracy for small or overlapping objects remains suboptimal. The research gap lies in effectively balancing detection precision, computational efficiency, and multimodal integration to ensure real-time, on-device performance for practical OD in VIP applications.

Methodological approach

In this manuscript, an FFDGCRN-ROD technique is presented for individuals with disabilities. It concentrates on developing an intelligent OD framework for individuals with disabilities, employing smart technology to allow real-time monitoring and assistive decision-making. To achieve this, the study encompasses image pre-processing, OD process, feature representation, and classification processes. Figure 1 represents the general process of the FFDGCRN-ROD model.

Fig. 1

General process of FFDGCRN-ROD technique.

Dataset description

This study utilizes a publicly available indoor OD dataset to assess the effectiveness of the proposed model. The dataset consists of 6,642 object instances distributed across multiple categories. Among them, Cabinet Door has the highest representation with 3,890 counts, followed by Refrigerator Door (879), Door (562), and Window (482). Smaller but notable categories include Chair (223), Table (248), and Cabinet (208). Rarely occurring objects are Opened Door (90), Pole (36), and Couch (24). This varied distribution highlights the dominance of door-related objects while also including a mix of furniture and structural elements29. In addition, common augmentation techniques such as horizontal flipping, random rotation, and scaling were applied to enrich data diversity; a minimal augmentation sketch is given after Fig. 2. These augmentations assist in improving the robustness of the model, specifically for underrepresented classes. Figure 2 depicts the sample images.

Fig. 2

Sample images.
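As an illustration of the augmentation step described above, a minimal pipeline can be written with torchvision transforms; the flip probability, rotation range, and scale range below are illustrative assumptions rather than the exact settings used in this study.

```python
from torchvision import transforms

# Illustrative augmentation pipeline; probability and ranges are assumptions.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # horizontal flipping
    transforms.RandomRotation(degrees=15),                  # random rotation
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),   # random scaling
])
```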

Pre-processing techniques

Primarily, the image pre-processing phase involves resizing, normalization, and image enhancement to remove noise and enhance image quality30. This stage consists of a series of methods that process raw images to improve their quality for analysis, and it may include noise reduction, normalization, contrast enhancement, and colour correction. The aim is to enhance the clarity of the image and make its attributes more salient for the subsequent processing stages. Two distinct pre-processing pipelines are applied to the input images, depending on the requirements of the feature extraction methods.

Image resizing

Resizing adjusts the image size, standardizes dimensions for ML methods, and maintains the aspect ratio unless a change is intended. Here, the input images are re-scaled and standardized before being provided to the CNN-based detector, thereby optimizing processing time.

Normalization

Normalization is the most effective and widely used standardization method; it helps prevent features with large value ranges from dominating the learning process, so that all feature samples share a common scale. In this work, Min-Max normalization is applied, scaling the pixel values of image \(I\) to a particular range, typically \(\left[0,1\right]\). Here, \(I_{\min}\) and \(I_{\max}\) refer to the image's minimum and maximum pixel values, respectively.

$$I_{norm}=\frac{I-I_{\min}}{I_{\max}-I_{\min}}$$
(1)
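As a concrete illustration of the resizing and Min-Max normalization steps, the following sketch uses OpenCV and NumPy; the 224×224 target size is an assumption made only for illustration.

```python
import cv2
import numpy as np

def preprocess(image, size=(224, 224)):
    """Resize an image and apply Min-Max normalization (Eq. 1)."""
    resized = cv2.resize(image, size).astype(np.float32)   # standardize dimensions
    i_min, i_max = resized.min(), resized.max()
    return (resized - i_min) / (i_max - i_min + 1e-8)      # scale pixel values to [0, 1]
```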

Image enhancement

Image enhancement improves image quality through operations such as sharpening, contrast adjustment, and noise reduction, yielding more distinct features. These pre-processing stages are crucial for achieving consistent results, particularly when preparing images for feature extraction. Contrast-limited adaptive histogram equalization (CLAHE) improves image contrast by applying histogram equalization (HE) within local tiles, increasing local contrast and detail while limiting noise amplification and preserving image quality. It adjusts pixel intensity according to the local histogram as follows:

$$Output\left(x,y\right)=\min\left(\max\left(\frac{Input\left(x,y\right)-Min_{local}}{Max_{local}-Min_{local}}\times Clip_{factor},\:0\right),255\right)$$
(2)

Here, \(Input(x,y)\) refers to the original pixel value, while \(Min_{local}\) and \(Max_{local}\) represent the minimum and maximum intensities within the local tile. The clip factor limits contrast enhancement, and the output is constrained to the valid interval of 0 to 255.
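A minimal sketch of this enhancement step uses OpenCV's built-in CLAHE; the clip limit and tile-grid size below are illustrative assumptions.

```python
import cv2

def enhance_contrast(gray_image, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply CLAHE to a grayscale image: local histogram equalization
    with clipping to limit noise amplification (Eq. 2)."""
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    return clahe.apply(gray_image)   # output stays in the valid 0-255 range
```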

Indoor OD process

For OD, the FFDGCRN-ROD approach utilizes Faster R-CNN to automatically recognize and locate specific targets in the images31. Faster R-CNN generates accurate region proposals and is selected for its high precision and efficiency in identifying objects of varying scales in complex settings. Unlike traditional models or earlier R-CNN versions, it integrates region proposal and classification in a single network. Its region proposal network (RPN) permits accurate and fast localization of objects. Compared with methods such as SSD or YOLO, Faster R-CNN delivers better recognition results, mainly in conditions requiring higher localization accuracy.

Its robustness across different datasets makes it a strong option for real-time applications. Faster R-CNN is well known for its high accuracy. It includes two components: the RPN and a detection network. The Faster R-CNN pipeline contains the following phases:

1) Base network: A pre-trained CNN extracts features from the input images.

2) RPN: It generates candidate object regions by sliding a window across the feature maps, using predefined anchor boxes.

3) Region of Interest (RoI) pooling: Each candidate region is divided into fixed-size sub-regions and mapped onto fixed-size feature maps.

4) Object classification network: Fully connected (FC) layers take the RoI pooling output as input and produce class probabilities for every candidate region.

5) Bounding box regression: Regression refines the bounding-box coordinates of every candidate region.

To perform OD, the RPN creates candidate regions. The RPN stage involves two main formulations: one to calculate the coordinates of the anchor boxes and the other to compute the overlap between the predicted bounding boxes and the ground-truth boxes.

$$\begin{array}{c}{x}_{anchor}={x}_{center}-\frac{{w}_{anchor}}{2}\\{y}_{anchor}={y}_{center}-\frac{{h}_{anchor}}{2}\\{w}_{anchor}=width_{anchor}\\{h}_{anchor}=height_{anchor}\end{array}$$
(3)

\(x_{anchor}\) and \(y_{anchor}\) are the top-left coordinates of the anchor box, while \(x_{center}\) and \(y_{center}\) are the coordinates of its central point. \(w_{anchor}\) and \(h_{anchor}\) represent the width and height, and \(width_{anchor}\) and \(height_{anchor}\) indicate the predefined width and height.

$$L_{bbox}=\sum_{i}L_{smooth}\left(t_{i}-t_{i}^{\prime}\right)\cdot 1_{i}$$
(4)

Here, \(t_{i}\) denotes the predicted bounding-box offset, \(t_{i}^{\prime}\) the corresponding ground-truth bounding-box offset, and \(1_{i}\) an indicator that anchor \(i\) is positive.
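A minimal NumPy sketch of the bounding-box regression loss in Eq. (4), with a smooth L1 function applied to the offset differences and summed over positive anchors only; the threshold beta = 1.0 is the conventional choice and an assumption here.

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    """Smooth L1: quadratic for small offsets, linear for large ones."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < beta,
                    0.5 * abs_diff ** 2 / beta,
                    abs_diff - 0.5 * beta)

def bbox_regression_loss(t_pred, t_true, positive_mask):
    """L_bbox of Eq. (4): smooth L1 over the offsets of positive anchors."""
    diff = t_pred - t_true                       # t_i - t_i'
    per_anchor = smooth_l1(diff).sum(axis=-1)    # sum over the 4 box coordinates
    return per_anchor[positive_mask].sum()       # indicator 1_i keeps positives only
```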

$$\:{F}_{roi}=RoIpooling\left({F}_{conv},p\right)$$
(5)

\(F_{conv}\) denotes the convolutional feature maps, and \(p\) denotes the input region proposal. The RoI pooling layer maps RoIs of different dimensions onto fixed-size features.

This phase performs pooling over RoIs of varying dimensions, producing fixed-size RoI features.

$$\:{F}_{cls}=softmax\left({W}_{cls}\cdot\:{F}_{roi}+{b}_{cls}\right)$$
(6)

The bounding box regression equation is given below:

$$\:{F}_{reg}={W}_{reg}\cdot\:{F}_{roi}+{b}_{reg}$$
(7)

Here, \(W_{cls}\), \(b_{cls}\), \(W_{reg}\), and \(b_{reg}\) are learned parameters. The softmax function converts the classification output into class probabilities. By combining the detection network and the RPN, the model achieves accurate OD and has demonstrated significant performance improvements on many benchmark datasets.
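For illustration, the detection stage can be sketched with the off-the-shelf Faster R-CNN implementation in torchvision; the ResNet-50 FPN backbone and pre-trained weights below are assumptions for demonstration, not the exact configuration used in this work.

```python
import torch
import torchvision

# Faster R-CNN = backbone + RPN + RoI heads; pre-trained weights for illustration.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]          # placeholder pre-processed image in [0, 1]
with torch.no_grad():
    detections = model(images)

boxes = detections[0]["boxes"]    # proposals refined by bounding-box regression (Eq. 7)
labels = detections[0]["labels"]  # class predictions from the softmax head (Eq. 6)
scores = detections[0]["scores"]  # confidence of each detection
```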

Fusion of feature representation model

Next, the fusion of CapsNet, SqueezeNet, and Inceptionv3 models is utilized for feature extraction. CapsNet is chosen for its strength in capturing spatial hierarchies and pose information, which helps recognize overlapping or deformed objects. SqueezeNet is effective due to its lightweight architecture, facilitating efficient computation with fewer parameters and making it ideal for edge deployment. Inceptionv3 is chosen for its capability to handle diverse object scales through multi-level convolutions, improving feature richness. This integration enhances the robustness and generalization of the model across complex scenes. The models are trained using appropriate batch processing, epochs, categorical cross-entropy loss, dropout for regularisation, and weight initialization techniques to ensure efficient training and good convergence.
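A minimal sketch of feature-level fusion, assuming each backbone is wrapped as a module that returns a flat feature vector; simple concatenation is used here as the fusion operator, since the exact operator is not specified above. Torchvision's squeezenet1_1 and inception_v3, together with a custom CapsNet, could serve as the three backbones.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenates the feature vectors produced by three backbone networks."""
    def __init__(self, capsnet, squeezenet, inception):
        super().__init__()
        self.capsnet = capsnet        # spatial hierarchies and pose information
        self.squeezenet = squeezenet  # lightweight features for edge deployment
        self.inception = inception    # multi-scale features

    def forward(self, x):
        feats = [self.capsnet(x), self.squeezenet(x), self.inception(x)]
        return torch.cat(feats, dim=1)   # fused multi-level representation
```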

CapsNet architecture

In DL applications, CNNs have become a standard method for analyzing image data. A CNN is a particular kind of NN framework in which FC neurons are fed by specifically designed layers32, including convolutional layers, pooling layers, and flattening layers. Beyond these typical layers, many specialized layers are intended to improve data handling in several ways. By combining these layers, a CNN can successfully examine image-based data and perform classification tasks with enhanced precision.

CapsNet is an NN approach that uses capsule layers made up of groups of neurons. Unlike classic CNNs, these groupings inside capsules permit a detailed representation of distinct properties, so a CapsNet can identify finer details in image data. A standard CapsNet improves on the convolutional layers' outputs by processing them further through capsule layers. In this work, the CapsNet features have also been enhanced with several modifications, such as dynamic routing between the primary and class capsule layers. These modifications help compute additional quantities, such as the probability that particular features are present. CapsNet was used as the DL module because, compared with classic CNNs, it is better suited to analyzing image data since it explicitly accounts for the spatial hierarchy. In addition, it has fewer parameters than classic CNNs, which decreases the risk of overfitting.

SqueezeNet architecture

SqueezeNet is a lightweight CNN designed to achieve high precision with significantly fewer parameters, making it ideal for applications that require computational efficiency33. It minimizes model size through three major approaches: 1x1 convolutions for parameter reduction, Fire modules for effective feature extraction, and delayed downsampling for preserving spatial information. These design choices enable SqueezeNet to maintain AlexNet-level precision with roughly 50x fewer parameters, making it particularly suitable for classification tasks. The Fire module incorporates squeeze and expand layers. The squeeze layer uses 1x1 convolutions to reduce the number of input channels, resulting in considerably lower computational cost. The expand layer then combines 1x1 and 3x3 convolutions, ensuring a balance between parameter efficiency and strong feature extraction. This design guarantees that the model can capture the fine-grained details that are important for precise classification.

Unlike conventional CNN architectures that apply pooling layers early, this model delays these operations to deeper layers. This helps preserve higher-resolution spatial information that is important for classification, since subtle variations in colour, shape, and size frequently distinguish categories. The network further utilizes global average pooling (GAP) rather than FC layers, which reduces the number of parameters while maintaining a strong feature representation. The result is a method that is both computationally lightweight and able to extract the discriminative features essential for detection. By utilizing effective feature extraction, SqueezeNet proves to be a well-suited method for classification, offering a good balance between interpretability, precision, and speed. Figure 3 signifies the architecture of SqueezeNet.

Fig. 3

SqueezeNet architecture.
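A minimal PyTorch sketch of the Fire module described above; the channel sizes are chosen by the caller and are not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: a 1x1 squeeze layer followed by parallel
    1x1 and 3x3 expand layers whose outputs are concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))               # reduce channel count
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```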

Inceptionv3 architecture

Inception-v3 is a deep CNN image classifier that integrates numerous structural improvements to increase precision and efficiency34. Large convolutions, such as 5x5, are replaced by stacks of smaller ones, such as two 3x3 convolutions. Factorized convolution enables the model to reduce computation cost without altering the receptive field. To further improve performance, asymmetric convolution, such as factorizing an nxn convolution into 1xn and nx1 convolutions, decreases the total parameter count. Other prominent features include an auxiliary classifier, a small network inserted at intermediate layers that serves as a regularizer to mitigate the vanishing-gradient problem and assist convergence during training. For training, the model uses the categorical cross-entropy loss, specified by:

$$L=-\sum_{i=1}^{N}y_{i}\log\left(x_{i}\right)$$
(8)

where \(y_{i}\) denotes the true label and \(x_{i}\) the predicted probability for class \(i\). To prevent the model from becoming overly confident in its predictions, label smoothing is applied. It adjusts the ground-truth labels by distributing a small probability mass to the incorrect class labels.

$$y_{i}^{\prime}=y_{i}\left(1-\epsilon\right)+\frac{\epsilon}{N}$$
(9)

Here, \(\epsilon\) helps avoid overfitting by making the model less overconfident in its predictions. These optimizations make Inceptionv3 a very effective method for complex image classification tasks.
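A short NumPy sketch of the label-smoothing rule in Eq. (9) combined with the cross-entropy loss of Eq. (8); the smoothing factor epsilon = 0.1 is an illustrative assumption.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing (Eq. 9): remove probability mass epsilon from the true
    class and spread it uniformly over all N classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / n_classes

def cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Categorical cross-entropy (Eq. 8) with (smoothed) target distributions."""
    return -np.sum(target_probs * np.log(pred_probs + eps), axis=-1)
```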

Classification with DA-GCRN

Lastly, the FFDGCRN-ROD method employs the DA-GCRN technique to precisely detect and classify objects for visually impaired people35. The spatial and temporal dependencies in sequential data are effectively captured by this method, which is considered crucial for accurate OD and classification in dynamic environments. The domain adaptation capability across diverse subjects and conditions assists the generalization of the model. Unlike conventional models, this methodology demonstrates excellence in efficiently handling discrepancies in data distribution, thus improving robustness and mitigating errors in real-world scenarios. Additionally, the contextual relationships are effectively learned by the recurrent structure of the model, thereby enhancing detection accuracy and reliability in assistive applications.

The DA-GCRN method used here performs spatio-temporal prediction and is based on a sequence-to-sequence structure with an encoder-decoder framework. Both the encoder and decoder comprise adaptive graph convolutional recurrent units (AGCRUs). The AGCRUs incorporate DAGC operations into the GRU, a simpler variant of the LSTM, considerably increasing computational efficiency while maintaining performance. The DAGC replaces the multi-layer perceptron layer of the standard GRU, expanding the model's ability to capture complex dependencies and spatio-temporal relationships in dynamic time series.

The AGCRU is the primary element for spatio-temporal modelling, successfully capturing temporal and spatial dependencies in the input graph topology. The key to adaptive graph convolution lies in how it uses the dynamic graph structure \(\tilde{P}\), which characterizes spatial correlations and is described as follows:

$$H=\sigma\left(X\star_{G}\varTheta\right)=\sigma\left(\sum_{k=0}^{K}\tilde{P}^{k}XW_{k}\right)$$
(10)

where \(X\in\mathbb{R}^{N\times C}\) and \(H\in\mathbb{R}^{N\times h}\) denote the input and output of the graph convolution operation \(\left(\star_{G}\right)\), respectively, \(\varTheta\) and \(W_{k}\in\mathbb{R}^{N\times C\times h}\) denote the \(k\)th-order kernel parameters, \(\sigma\) is the activation function, and \(\tilde{P}^{k}\in\mathbb{R}^{N\times N}\) characterizes the dynamic topological relationship among the nodes in graph \(G\). The computation procedure of the AGCRU, obtained by incorporating adaptive GC into GRU components, is as follows:

$$\:{u}_{t}=Sigmoid\left(\left[{X}_{t},\:{H}_{t-1}\right]{\star\:}_{G}{\varTheta\:}_{u}+{b}_{u}\right)$$
(11)
$$\:{r}_{t}=Sigmoid\left(\left[{X}_{t},\:{H}_{t-1}\right]{\star\:}_{G}{\varTheta\:}_{r}+{b}_{r}\right)$$
(12)
$$\:{C}_{t}=Tanh\left(\left[{X}_{t},\:\left({r}_{t}\odot\:{H}_{t-1}\right)\right]{\star\:}_{G}{\varTheta\:}_{C}+{b}_{C}\right)$$
(13)
$$\:{H}_{t}={u}_{t}\odot\:{H}_{t-1}+(1-{u}_{t})\odot\:{C}_{t}$$
(14)

Here, \(C\), \(u\), and \(r\) denote the candidate state, update gate, and reset gate of the AGCRU component, respectively. \(\odot\) signifies the Hadamard product, while \(\varTheta_{\{u,r,C\}}\in\mathbb{R}^{K\times(C+h)\times h}\) are the learnable gate parameters. \([X_{t},H_{t-1}]\) and \(H_{t}\) represent the input and output at time step \(t\). In addition to the observation \(X_{t}\), the AGCRU requires auxiliary input from the dynamic topological structure \(\tilde{P}\).
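A compact PyTorch sketch of the AGCRU cell defined by Eqs. (10)-(14); for simplicity, the graph-convolution weights here are shared across nodes rather than node-specific, and the dynamic adjacency \(\tilde{P}\) is passed in as an argument.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Eq. (10): H = sigma( sum_k P~^k X W_k ), with weights shared across nodes."""
    def __init__(self, in_dim, out_dim, k_order=2):
        super().__init__()
        self.k_order = k_order
        self.weights = nn.Parameter(torch.randn(k_order + 1, in_dim, out_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, p_tilde):
        # x: (N, C) node features, p_tilde: (N, N) dynamic adjacency
        out, x_k = 0.0, x
        for k in range(self.k_order + 1):
            out = out + x_k @ self.weights[k]   # P~^k X W_k
            x_k = p_tilde @ x_k                 # propagate one more hop
        return out + self.bias

class AGCRUCell(nn.Module):
    """GRU cell whose linear maps are replaced by adaptive graph convolutions
    (Eqs. 11-14)."""
    def __init__(self, in_dim, hidden_dim, k_order=2):
        super().__init__()
        self.gc_u = AdaptiveGraphConv(in_dim + hidden_dim, hidden_dim, k_order)
        self.gc_r = AdaptiveGraphConv(in_dim + hidden_dim, hidden_dim, k_order)
        self.gc_c = AdaptiveGraphConv(in_dim + hidden_dim, hidden_dim, k_order)

    def forward(self, x_t, h_prev, p_tilde):
        xh = torch.cat([x_t, h_prev], dim=-1)
        u_t = torch.sigmoid(self.gc_u(xh, p_tilde))    # update gate, Eq. (11)
        r_t = torch.sigmoid(self.gc_r(xh, p_tilde))    # reset gate, Eq. (12)
        c_t = torch.tanh(self.gc_c(torch.cat([x_t, r_t * h_prev], dim=-1), p_tilde))  # Eq. (13)
        return u_t * h_prev + (1.0 - u_t) * c_t        # new hidden state, Eq. (14)
```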

Evaluation metrics

Accuracy, precision, recall, F1-score, and G-measure are the traditional performance measures used to validate the model. They allow the prediction outcomes to be analyzed through the confusion matrix, where \(TP\) and \(TN\) signify true positives and negatives, and \(FP\) and \(FN\) signify false positives and negatives. These metrics assess the model's ability to generalize across diverse image data types.

Accuracy

Accuracy is a prevalent statistic for assessing how precise a classifier is. As given in Eq. (15), it is calculated as the ratio of correctly classified instances to all instances in the dataset.

$$\:Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\times\:100$$
(15)

Precision

Precision is the ratio of correctly classified positive images to all images classified as positive. The precision \(\left(Pr\right)\) is stated in Eq. (16):

$$Precision=\frac{TP}{TP+FP}\times 100$$
(16)

Recall

Recall determines how many of the actual positive images in the dataset were correctly identified. Equation (17) offers its formulation:

$$\:Recall=\frac{TP}{TP+FN}\times\:100$$
(17)

F1-score

A high F1-score indicates that the model has strong predictive ability, as it is the harmonic mean of precision and recall; on their own, precision and recall are insufficient to evaluate performance. The F1-score is defined in Eq. (18).

$$\:F1-score=\frac{2\times\:recall\times\:precision}{recall+precision}$$
(18)

G-measure

It is a performance metric that balances recall and precision by taking their geometric mean. It offers a single score that reflects the trade-off between these two measures, ensuring neither is neglected.

$$G\text{-}measure=\sqrt{precision\times recall}$$
(19)
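A small helper that computes Eqs. (15)-(19) directly from confusion-matrix counts (shown for the binary case; per-class counts can be aggregated the same way).

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Evaluation metrics of Eqs. (15)-(19) from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp) * 100
    recall    = tp / (tp + fn) * 100
    f1        = 2 * precision * recall / (precision + recall)
    g_measure = np.sqrt(precision * recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1, "g_measure": g_measure}
```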

Results and discussion

The experimental study of the FFDGCRN-ROD model is conducted on the indoor OD dataset. The technique is simulated using Python 3.6.5 on a PC with an i5-8600K CPU, 16GB RAM, a GeForce 1050Ti 4GB GPU, a 250GB SSD, and a 1TB HDD. Parameters include a learning rate of 0.01, ReLU activation, 50 epochs, 0.5 dropout, and a batch size of 5.

Figure 4 shows the classifier outcome of the FFDGCRN-ROD methodology on the 80:20 split. Figure 4a and b demonstrate the confusion matrices with accurate detection of each class. Figure 4c exhibits the results on the 80% training phase (TRAPH) for the various classes under diverse metrics. Figure 4d illustrates the results on the 20% testing phase (TESPH) for the various classes under the same metrics.

Fig. 4

80:20 of (a, b) confusion matrix, and (c, d) TRAPH/TESPH.

Table 1 and Fig. 5 display the OD results of the FFDGCRN-ROD method at 80%:20%. On the 80% TRAPH, the FFDGCRN-ROD model attains an \(\:acc{u}_{y}\), \(\:pre{c}_{n}\), \(\:rec{a}_{l}\), \(\:{F1}_{Score}\), and \(\:{G}_{Measure}\) of 99.57%, 93.57%, 88.13%, 90.09%, and 90.46%, respectively. Likewise, on the 20% TESPH, the FFDGCRN-ROD model attains an \(\:acc{u}_{y}\), \(\:pre{c}_{n}\), \(\:rec{a}_{l}\), \(\:{F1}_{Score}\), and \(\:{G}_{Measure}\) of 99.65%, 93.85%, 92.25%, 92.85%, and 92.95%, respectively.

Table 1 OD of FFDGCRN-ROD method on 80:20.
Fig. 5

Average values of FFDGCRN-ROD method under 80:20.

Figure 6 describes the training (TRAN) and validation (VALD) accuracy of the FFDGCRN-ROD model on the 80:20 split over 200 epochs. Both curves progressively rise and steadily converge, denoting that the method is learning effectively. The VALD accuracy consistently exceeds the TRAN accuracy, indicating that the methodology is not over-fitting and generalizes well to unseen data. The fluctuations in accuracy are due to the task's intricacy, but the general upward trend reveals the robust performance and stability of the model in differentiating between the object classes.

Fig. 6

\(\:Acc{u}_{y}\) curve of FFDGCRN-ROD technique under 80:20.

Figure 7 describes the TRAN and VALD loss of the FFDGCRN-ROD approach on the 80:20 split over 200 epochs. Both curves depict a persistent downward trend, denoting that the model efficiently reduces error while learning. The VALD loss remains marginally lower than the training loss across most epochs, signifying good generalization and no signs of over-fitting. While some fluctuations are observed, the model becomes gradually more reliable and stable at differentiating between the object classes as training progresses.

Fig. 7

Loss curve of FFDGCRN-ROD technique on 80:20.

In Fig. 8, the precision-recall (PR) curve analysis of the FFDGCRN-ROD approach on the 80:20 split provides insights into its performance by charting precision against recall for each class. The figure demonstrates that the FFDGCRN-ROD approach consistently yields high PR values across multiple classes, confirming the proficiency of the FFDGCRN-ROD technique.

Fig. 8

PR curve of FFDGCRN-ROD technique under 80:20.

In Fig. 9, the ROC curve of the FFDGCRN-ROD approach on 80:20 is examined. The outcomes suggest that the FFDGCRN-ROD approach yields increased ROC values for each class, demonstrating a significant ability to differentiate between them. This persistent pattern over several classes implies the efficacious performance of the FFDGCRN-ROD model on class prediction.

Fig. 9

ROC curve of FFDGCRN-ROD technique under 80:20.

Figure 10 displays the classifier outcome of the FFDGCRN-ROD technique on the 70:30 split. Figure 10a and b exhibit the confusion matrices with accurate recognition of every class. Figure 10c represents the results on the 70% TRAPH for the various class labels under dissimilar metrics. Figure 10d illustrates the results on the 30% TESPH for the various classes under the same metrics.

Fig. 10

70:30 of (a, b) confusion matrix, and (c, d) TRAPH/TESPH.

Table 2 and Fig. 11 portray the OD results of the FFDGCRN-ROD technique under 70%:30%. On the 70% TRAPH, the FFDGCRN-ROD model attains an \(\:acc{u}_{y}\), \(\:pre{c}_{n}\), \(\:rec{a}_{l}\), \(\:{F1}_{Score}\), and \(\:{G}_{Measure}\) of 99.57%, 93.89%, 86.15%, 88.56%, and 89.24%, respectively. Further, on the 30% TESPH, the FFDGCRN-ROD model attains an \(\:acc{u}_{y}\), \(\:pre{c}_{n}\), \(\:rec{a}_{l}\), \(\:{F1}_{Score}\), and \(\:{G}_{Measure}\) of 99.63%, 96.33%, 91.49%, 93.59%, and 93.75%, respectively.

Table 2 OD of FFDGCRN-ROD methodology under 70:30.
Fig. 11

Average values of FFDGCRN-ROD methodology under 70:30.

Figure 12 portrays the TRAN and VALD accuracy of the FFDGCRN-ROD methodology on the 70:30 split over 200 epochs. Both curves steadily increase and progressively converge, indicating that the model is learning effectively. The VALD accuracy consistently remains slightly higher than the TRAN accuracy, indicating that the model is not over-fitting and generalizes well to unseen data. The fluctuations in accuracy are due to the complexity of the task, but the overall upward trend reveals the robust performance and stability of the model in discriminating between the object classes.

Fig. 12

\(\:Acc{u}_{y}\) curve of FFDGCRN-ROD technique under 70:30.

Figure 13 exemplifies the TRAN and VALD loss of the FFDGCRN-ROD methodology on the 70:30 split over 200 epochs. Both curves show steady downward trends, indicating that the model minimizes error well during learning. The VALD loss remains marginally lower than the training loss across most epochs, implying good generalization and no signs of over-fitting. Although some fluctuations are observed, the model becomes progressively more stable and reliable at differentiating between the object classes as training progresses.

Fig. 13

Loss curve of FFDGCRN-ROD technique under 70:30.

In Fig. 14, the precision-recall (PR) curve analysis of the FFDGCRN-ROD model on 70:30 provides insights into its outcomes by charting Precision against Recall for every class. The figure shows that the FFDGCRN-ROD model consistently achieves higher PR values across multiple classes. The persistent improvement in each class reveals the efficacy of the FFDGCRN-ROD in the classification process.

Fig. 14

PR curve of FFDGCRN-ROD technique on 70:30.

In Fig. 15, the ROC curve of the FFDGCRN-ROD approach on 70:30 is inspected. The results indicate that the FFDGCRN-ROD model achieves increased ROC values across all classes, demonstrating a significant ability to distinguish between them. This dependable pattern of maximal ROC values on multiple classes implies the efficacious outcomes of the FFDGCRN-ROD model on class prediction.

Fig. 15

ROC curve of FFDGCRN-ROD technique under 70:30.

Table 3 and Fig. 16 depict the comparative analysis of the FFDGCRN-ROD model with existing models20,21,36,37. The table values indicate that the methodologies, including NAM, PCA, TFST, Yolo-V8, Yolo-V5 large, MR-CNN, YOLO-inception, CADNet, DICSSD300, and AttenRetina, achieved lower performance across various metrics. At the same time, the FFDGCRN-ROD model obtained the maximum performance with \(\:acc{u}_{y}\), \(\:pre{c}_{n}\), \(\:rec{a}_{l}\), and \(\:{F1}_{score}\) of 99.65%, 93.85%, 92.25%, and 92.85%, respectively.

Table 3 Comparative study of FFDGCRN-ROD model with existing techniques.
Fig. 16

Overall performance of FFDGCRN-ROD model with existing techniques.

Table 4 and Fig. 17 compare the time complexity (TC) of the FFDGCRN-ROD methodology with the existing approaches. The outcomes indicate that the FFDGCRN-ROD model accomplishes the best performance with a TC of 11.95 s.

Table 4 TC outcome of FFDGCRN-ROD approach with existing methods.
Fig. 17

TC outcome of FFDGCRN-ROD approach with existing methods.

Among the existing methodologies, NAM, PCA, TFST, Yolo-V8, Yolo-V5 large, MR-CNN, YOLO-inception, CADNet, DICSSD300, and AttenRetina exhibited higher TCs of 20.09 s, 22.45 s, 19.34 s, 18.46 s, 30.18 s, 29.40 s, 25.71 s, 25.08 s, 26.81 s, and 28.66 s, respectively. These results confirm the superiority of the proposed model over other recent studies.

Table 5 specifies the ablation study of the FFDGCRN-ROD methodology. The base DA-GCRN model achieved an \(\:acc{u}_{y}\) of 97.36%, \(\:pre{c}_{n}\) of 90.94%, \(\:rec{a}_{l}\) of 89.81%, and \(\:{F1}_{Score}\) of 90.00%. Integrating the CapsNet model slightly enhanced the outcomes, achieving an \(\:acc{u}_{y}\) of 97.90%, \(\:pre{c}_{n}\) of 91.67%, \(\:rec{a}_{l}\) of 90.33%, and \(\:{F1}_{Score}\) of 90.70%. With SqueezeNet, the model illustrated further improvement, achieving an \(\:acc{u}_{y}\) of 98.47%, \(\:pre{c}_{n}\) of 92.28%, \(\:rec{a}_{l}\) of 90.94%, and \(\:{F1}_{Score}\) of 91.47%. Adding InceptionV3 provided even better results, with an \(\:acc{u}_{y}\) of 99.05%, \(\:pre{c}_{n}\) of 93.05%, \(\:rec{a}_{l}\) of 91.60%, and \(\:{F1}_{Score}\) of 92.27%. The complete model attained the best performance, achieving an \(\:acc{u}_{y}\) of 99.65%, \(\:pre{c}_{n}\) of 93.85%, \(\:rec{a}_{l}\) of 92.25%, and \(\:{F1}_{Score}\) of 92.85%, highlighting the efficiency of intelligent feature fusion in improving robust object detection.

Table 5 Performance comparison through ablation study of diverse components integrated with the DA-GCRN model.

Conclusion

This paper proposes the FFDGCRN-ROD method for OD to assist individuals with disabilities in the smart IoT-edge-cloud environment. The study has developed an intelligent OD framework for individuals with disabilities utilizing smart technology to allow real-time monitoring and assistive decision-making. Primarily, the image pre-processing step encompasses resizing, normalization, and image enhancement to eliminate noise and improve image quality. For the OD process, the FFDGCRN-ROD method utilizes the Faster R-CNN technique to automatically identify and locate specific targets in the imagery. Furthermore, the fusion models, namely CapsNet, SqueezeNet, and Inceptionv3, are utilized for feature extraction. Finally, the FFDGCRN-ROD methodology implements the DA-GCRN technique to precisely detect and classify objects for visually impaired individuals. The comparative analysis of the FFDGCRN-ROD methodology demonstrated a superior accuracy of 99.65% over existing methods when evaluated on the indoor OD dataset. The limitations of the FFDGCRN-ROD methodology include high computational complexity and the requirement for large labelled datasets. Deployment on resource-constrained devices may be affected, and the technique may also face challenges in adapting to diverse real-world environments with varying lighting and occlusion conditions, affecting detection accuracy. The remaining research gap lies in developing lightweight, adaptive models that maintain high accuracy while being efficient for on-device processing. Real-world deployment challenges include ensuring low latency, managing power consumption, and maintaining robustness against noise and environmental variability. Addressing these issues is significant for practical assistive technology applications.