Introduction

One of the significant building blocks of any smart city is the intelligent transportation system (ITS). Indeed, information and communication technologies (ICT) can benefit road infrastructure1. Technology continues to evolve even as advanced ITS solutions are deployed2. Testing of these technologies on public roads has already started in several nations, and serious efforts continue to mandate and regulate these near-future technologies3. Various new applications will be enabled as the penetration of interconnected and autonomous vehicles in traffic surges. Unmanned aerial vehicles (UAVs), or drones, were long utilized by the military for many services. In recent times4, there has been a drastic increase in the use of UAVs in other sectors, such as delivery of goods and services, precision agriculture, and security and surveillance. Automation of the entire transportation system cannot be attained through vehicle automation alone5. Indeed, other elements of the road and end-to-end transportation system, namely rescue teams, support teams, road surveys, and traffic police, should also be automated. These elements can be automated with reliable and smart UAVs6. Because the transportation system is complex, detecting vehicles in drone images is significant: it can help locate vehicles stuck in disaster zones or rugged terrain and support traffic and parking-lot management7.

Vehicular detection from drone images complements on-road vehicular detection and is advantageous for driver assistance models8. Moreover, vehicle detection is the initial step in several traffic surveillance tasks. With the current rise of deep learning (DL) and artificial intelligence (AI), there has been an increasing trend toward using convolutional neural networks (CNNs) to extract information from image and video streams9. However, although CNNs are regarded as a promising method for semantic segmentation, classification, and detection, aerial images have various peculiarities that distinguish them from traditional kinds of images10. For instance, objects are viewed from multiple viewpoints and altitudes; therefore, a single class has several representation patterns to be learned11. In addition, various classes may share similar appearances, particularly at higher altitudes. ITSs are substantial for developing smart cities, offering transformative merits for managing road infrastructure and traffic flow12. With the swift enhancement of technology, there is a growing requirement for integrating novel techniques such as UAV imagery and deep reinforcement learning for vehicle recognition13. As urban areas become more complex, efficient and adaptive ITS solutions are crucial to optimize traffic management, improve safety, and increase overall transportation effectiveness14. The continuous evolution of these technologies drives the requirement for rigorous testing and refinement to confirm their practical applicability and regulatory compliance in real-world settings15.

Contribution of the study

This study presents a flying foxes optimization with deep learning-based vehicle detection and classification model on aerial images (FFODL-VDCAI) technique for ITS applications. The main objective of the FFODL-VDCAI technique is to automatically and accurately classify vehicles that exist in aerial images. Three primary processes are involved in the presented FFODL-VDCAI technique. Initially, the FFODL-VDCAI approach utilizes YOLO-GD (GhostNet and Depthwise convolution) for vehicle detection, where YOLO-GD uses a lightweight GhostNet in place of the backbone network of YOLO-v4 and replaces conventional convolutions with depthwise separable and pointwise convolutions. Next, the FFO technique is used to tune the hyperparameters of the GhostNet model. Finally, a deep Q-network (DQN)-based reinforcement learning technique is used to classify detected vehicles effectively. A comprehensive simulation analysis of the FFODL-VDCAI methodology is conducted on a UAV image dataset. The significant contributions of the FFODL-VDCAI methodology are listed below:

  • The FFODL-VDCAI technique presents a vehicle detection technique utilizing YOLO integrated with GD for improved accuracy and effectiveness in UAV imagery. The model optimizes vehicle detection by using advanced DL methods. The contribution lies in combining YOLO with GD to refine the detection capabilities.

  • The FFO approach is employed to fine-tune the hyperparameters of the vehicle detection method, significantly enhancing its performance and adaptability. Utilizing FFO improves the precision and effectiveness of the technique in vehicle detection. The key contribution is the application of FFO for optimizing hyperparameters, which enhances the overall efficiency of the method.

  • The FFODL-VDCAI model incorporates the DQN approach to classify recognized vehicles, implementing reinforcement learning to improve the classification accuracy. By employing DQN, the model dynamically enhances its capability to discriminate between diverse kinds of vehicles, paving the way to more precise and reliable classification.

  • The FFODL-VDCAI approach combines YOLO for vehicle detection, FFO for hyperparameter tuning, and DQN for classification into a unified framework, integrating cutting-edge models to improve performance across all phases. The novelty lies in its seamless incorporation of these advanced methodologies, optimizing every phase (detection, tuning, and classification) to significantly improve vehicle detection and classification from UAV imagery.

Related works

Zhao et al.16 present a lightweight detection methodology that depends on an enhanced form of YOLO-v5. A coordinate attention mechanism is introduced to improve the network's feature extraction and its detection and recognition capability. Non-maximum suppression is utilized to mitigate false detections and omissions when identifying congested targets. In17, the authors introduced a new vehicle detection method termed PVIDNet, a traffic control method related to the Brazilian Traffic Code, and a lightweight proposal technique for the PVIDNet method leveraging an activation function to decrease the execution time of the presented method. Jagannathan et al.18 devised a novel approach to classify vehicle types. A Gaussian mixture model and adaptive histogram equalization (AHE) are applied to enhance the quality of the gathered vehicle images and to discover vehicles in the denoised images. Afterwards, the Weber Local Descriptor and the Steerable Pyramid Transform extract feature vectors from the identified vehicles. Lastly, the extracted features are presented as input for vehicle classification. Pustokhina et al.19 modelled a DL-based vehicle license plate recognition (VLPR) method named the OKM-CNN approach. This technique has three stages: license plate (LP) detection and localization, performed through Connected Component Analysis (CCA) and Improved Bernsen Algorithm (IBA) methods; LP segmentation, using the OKM clustering approach; and LP number recognition, utilizing the CNN method. Han et al.20 designed a vehicle-detection approach with a CNN-based object detector. The authors devised the DRFBNet300 technique, which includes a Deeper Receptive Field Block (DRFB) element that enriches the feature maps to find smaller objects in drone images. Finally, a Split Image Processing (SIP) technique is used to enhance accuracy.

In21, the focus is on developing an automated vehicle detection approach for drone images. Firstly, vehicle datasets for target recognition are built. Afterwards, a novel YOLO-v3 vehicle recognition configuration is devised according to the characteristics of drone imagery: the targeted vehicles are dense and small. Lastly, the presented structure is tested with three datasets: VEDAI, CAR, and COWC. Moshayedi et al.22 introduced a new low-altitude vehicle speed detector system utilizing drones for RS. To this aim, the authors first determined the optimal field of view of the Raspberry Pi camera in outdoor and indoor situations by altering its angle and height; then, two MobileNet-SSD DL model variants were embedded in the Raspberry Pi 4B processor and tested with physical cars at various speeds. At last, the authors applied the system in a real environment by changing the angle and height. Wang23 devises a vehicle image detection approach utilizing DL on drone video. First, HSV colour-space brightness translation operations are executed on the new samples to increase sample diversity and adaptability to various lighting conditions. Lyu et al.24 propose an improved Faster R-CNN method for recognizing small deer in thermal images, utilizing a Feature Pyramid Network (FPN) and various residual networks (ResNet18 to ResNet152) to enhance the accuracy of feature extraction and detection. Ewers et al.25 aim to improve drone search missions in wilderness areas using deep reinforcement learning, optimizing flight paths based on a probability distribution map to crucially enhance search effectiveness and efficiency compared with conventional methodologies. In26, a novel reinforcement learning-controlled Grey Wolf Optimization-Archimedes Optimization Algorithm (QGA) model is evaluated on 22 benchmark functions and applied to determine optimal, collision-free UAV flight paths in a 3D environment. Kumar et al.27 evaluate a UAV-based spraying system in cotton fields by employing imaging methods such as a Laser Droplet Analyzer and ImageJ, optimize it with response surface methodology, and utilize a hybrid GWO-ANN technique for predictive evaluation of deposition. Makrigiorgis et al.28 present the AirCam-RTM framework, which integrates road segmentation and vehicle detection.

Existing drone-based vehicle detection and classification techniques encounter various challenges, including restricted scalability across diverse environments and high computational complexity. Models such as the improved YOLO-v5, PVIDNet, and DRFBNet300 may face limitations with real-world discrepancies in traffic scenarios, image quality, and computational needs. OKM-CNN and deep reinforcement learning models also struggle with license plate discrepancies, image quality, and dependence on probability maps. Moreover, UAV-based spraying systems may have problems adapting to various field conditions and ensuring precise deposition. These limitations underscore the requirement for more versatile and efficient solutions capable of handling several environmental conditions, image qualities, and lighting scenarios.

The proposed model

This paper introduces an automated FFODL-VDCAI methodology for the vehicle detection and classification process in the ITS platform. The proposed FFODL-VDCAI methodology involves three main procedures: YOLO-GD-based vehicle detection, FFO-based hyperparameter tuning, and DQN-based vehicle classification. Figure 1 illustrates the workflow of the FFODL-VDCAI methodology.

Fig. 1
figure 1

Workflow of FFODL-VDCAI approach.

Stage I: YOLO-GD based vehicle detection

Primarily, the FFODL-VDCAI technique uses the YOLO-GD model for vehicle detection purposes. The network mainly includes feature extraction, result prediction, and feature fusion29. GhostNet and depthwise convolution are selected in the YOLO-GD model for their efficiency and performance merits. GhostNet's design lessens computational overhead by producing more feature maps with fewer parameters, improving both speed and accuracy. Depthwise convolution, utilized in place of standard convolutions, further mitigates computational complexity by applying filters separately to every input channel, thus reducing the number of operations and parameters. This integration allows YOLO-GD to attain high detection accuracy while maintaining a lightweight and efficient architecture crucial for real-time vehicle detection tasks. Figure 2 illustrates the structure of the YOLO-GD model.

Fig. 2
figure 2

Structure of YOLO-GD model.

YOLO-GD adopts a lightweight feature extraction model in which depthwise and pointwise convolutional layers replace the standard convolution function, efficiently minimizing the computational overhead. GhostNet replaces the CSPDarknet53 backbone in the feature extraction phase. GhostNet aims to provide feature maps cheaply: a primary convolution produces intrinsic feature maps, and cheap linear transformations of these intrinsic maps then produce the additional ghost feature maps, extracting the crucial information in the new features at the lowest overhead.
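As a minimal, hedged illustration (not the authors' implementation), the following PyTorch sketch shows how a standard convolution can be replaced by the depthwise-plus-pointwise pair described above; channel sizes and activation choices are assumptions:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise + pointwise pair standing in for a standard k x k convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1 x 1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

For a 3 × 3 kernel, this reduces the multiply count per spatial position from roughly in_ch · out_ch · 9 to in_ch · 9 + in_ch · out_ch, which is the overhead reduction the text refers to.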

The GhostNet backbone is mainly built from \(G_{-}\)bottleneck blocks, where \(s\) shows the stride size and "×" denotes how many times a block is repeated. Each \(G_{-}\)bottleneck primarily consists of two Ghost modules. If \(stride = 1\), the first Ghost module is exploited as an expansion layer to increase the channel count, and the second reduces the number of channels to match the shortcut path. The input and output of the Ghost modules are interconnected by a shortcut. ReLU non-linearity and batch normalization (\(BN\)) are utilized after the first layer, whereas only BN is used after the second layer. When \(stride = 2\), the shortcut path exploits a depthwise convolution with \(stride = 2\) for down-sampling and a pointwise convolution layer for channel adjustment.
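A simplified sketch of the Ghost module and G-bottleneck follows, assuming the standard GhostNet formulation (half the output channels from a primary convolution, half from a cheap depthwise "ghost" transform); the layer sizes and ratios are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Half the output channels come from a primary 1 x 1 convolution, half
    from a cheap depthwise 'ghost' transform of those intrinsic maps."""
    def __init__(self, in_ch, out_ch, relu=True):
        super().__init__()
        init_ch = out_ch // 2
        act = nn.ReLU(inplace=True) if relu else nn.Identity()
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), act)
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, init_ch, 3, padding=1, groups=init_ch, bias=False),
            nn.BatchNorm2d(init_ch), act)

    def forward(self, x):
        intrinsic = self.primary(x)
        return torch.cat([intrinsic, self.cheap(intrinsic)], dim=1)

class GhostBottleneck(nn.Module):
    """stride=1: two Ghost modules with an identity shortcut.
    stride=2: depthwise down-sampling in between; the shortcut then uses a
    stride-2 depthwise conv plus a pointwise conv for channel adjustment."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.expand = GhostModule(in_ch, mid_ch)                 # ReLU + BN
        self.down = (nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1,
                               groups=mid_ch, bias=False)
                     if stride == 2 else nn.Identity())
        self.project = GhostModule(mid_ch, out_ch, relu=False)  # BN only
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                          groups=in_ch, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.project(self.down(self.expand(x))) + self.shortcut(x)
```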

During the feature fusion and result prediction phases, spatial pyramid pooling \(\left( {SPP} \right)\) is inserted at the network output to enlarge the receptive field of the network and extract spatial features of dissimilar dimensions. \(SPP\) increases the robustness of the method to object variability and spatial layout:

$$ SPP = C\left( {f^{5 \times 5} MaxPool\left( F \right) + f^{9 \times 9} MaxPool\left( F \right) + f^{13 \times 13} MaxPool\left( F \right) + F} \right). $$
(1)

where \(F\) denotes the feature map, \(C\) symbolizes the concatenation function, \(f^{5 \times 5}\) means \(5 \times 5\) filters, and \(MaxPool\) represents the max-pooling function. The Path Aggregation Network (PANet) integrates features among the three resultant network layers, capturing geometric data from the bottom of the network and contour data from the top of the network. The YOLO head predicts confidence, class, and coordinate data simultaneously by setting the filter count of the convolution function. The \(3 \times 3\) convolutional layers are exchanged with \(1 \times 1\) pointwise and \(3 \times 3\) depthwise separable convolution layers to minimize the overhead.
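The SPP operation in Eq. (1) can be sketched directly, assuming stride-1 max pooling with "same" padding so that all four tensors keep the spatial size of \(F\):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Eq. (1): max-pool F at 5x5, 9x9 and 13x13 (stride 1, 'same' padding)
    and concatenate the three results with F along the channel axis."""
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes)

    def forward(self, f):
        return torch.cat([f] + [pool(f) for pool in self.pools], dim=1)

# A 512-channel map becomes 4 * 512 = 2048 channels; spatial size is unchanged.
x = torch.randn(1, 512, 13, 13)
print(SPP()(x).shape)  # torch.Size([1, 2048, 13, 13])
```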

Utilizing its advanced architecture, this technique integrates various mechanisms for addressing false alarms. The use of a lightweight feature extraction network, namely GhostNet, together with depthwise separable convolutions, assists in mitigating computational overhead and enhancing feature extraction accuracy. GhostNet's ghost feature maps and the efficient employment of G-bottleneck layers assist in extracting significant features with minimal redundancy, thereby enhancing detection precision. Furthermore, incorporating Spatial Pyramid Pooling (SPP) improves the capacity of the approach to handle objects at several scales and spatial layouts, which assists in discriminating between true objects and background noise. The PANet additionally refines feature integration and spatial data processing across diverse layers, paving the way to more precise object localization and classification. By optimizing these procedures and reducing computational redundancies, YOLO-GD efficiently mitigates the likelihood of false alarms in vehicle recognition.

Stage II: hyperparameter tuning using the FFO model

The FFO methodology is exploited to fine-tune the parameter values of the GhostNet model. The flying fox is the largest bat species30. The FFO method is appropriate for hyperparameter tuning due to its capability to effectively explore and exploit the search space, paving the way to optimal settings for complex techniques. FFO replicates the foraging behaviour of flying foxes, employing local and global search strategies to avert local minima and improve convergence. Its adaptability and robustness make it efficient for tuning hyperparameters in high-dimensional spaces, where conventional techniques may face computational complexity and convergence issues. Compared with other approaches, such as grid search or random search, FFO presents a more systematic and intelligent technique for hyperparameter optimization, resulting in enhanced model performance and reduced computational costs. Its balance between exploration and exploitation ensures that the hyperparameter tuning procedure is comprehensive and effective, making it a valuable tool for improving the accuracy and robustness of ML techniques. Figure 3 specifies the architecture of the FFO technique.

Fig. 3
figure 3

Overall structure of FFO technique.

Flying foxes' movement in space relies on observation of the environment, since they cannot echolocate. They return to their habitat tree after feeding at night. Flying foxes seek the coolest tree to rest on to protect themselves from rising morning heatwaves. Often, those that are first to claim a tree with an adequate temperature can be suffocated by other members crowding around them and die.

The FFO technique starts with an arbitrary set of initial positions for the flying foxes. Each position is represented by the vector \(x = \left( {x_{1} , \ldots ,x_{m} } \right)\), which has \(m\) dimensions. Then, the objective function assesses the solution at that position. Consequently, the flying foxes search for the coolest tree to ensure survival in intense heat.

Since flying foxes look for the nearest cool tree or follow each other's paths, it is assumed that once the habitat tree cannot provide a sufficiently low temperature, a flying fox moves to a different tree to escape the high temperature:

$$ x_{i,j}^{t + 1} = x_{i,j}^{t} + a \cdot rand \cdot \left( {cool_{j} - x_{i,j}^{t} } \right) $$
(2)

where \(x_{i}^{0} \sim U\left( {x_{{{\text{min}}}} ,x_{{\text{max }}} } \right)\), \(x_{i,j}^{t}\) denotes the \(j\)-th component of the \(i\)-th flying fox at iteration \(t\), \(a\) represents a constant value, \(rand \sim U\left( {0,1} \right)\), and \(cool_{j}\) is the \(j\)-th component of the position of the FF located in the tree with the lowest temperature. Equation (2) is applied if \(\left| {f\left( {cool} \right) - f\left( {x_{i} } \right)} \right| > \frac{{\delta_{1} }}{2}\), where \(cool\) denotes the position vector of the flying fox located in the coolest place, i.e., the best solution, and the \(\delta_{1}\) parameter equals the longest possible distance at which two flying foxes are considered near one another. Once a flying fox approaches the tree with minimum temperature \(\left( {\left| {f\left( {cool} \right) - f\left( {x_{i} } \right)} \right| \le \frac{{\delta_{1} }}{2}} \right)\), it searches the nearby space to prevent suffocation:

$$ nx_{i,j}^{t + 1} = x_{i,j}^{t} + rand_{1,j} \cdot \left( {cool_{j} - x_{i,j}^{t} } \right) + rand_{2,j} \cdot \left( {x_{{R_{1} j}}^{t} - x_{{R_{2} j}}^{t} } \right) $$
(3)
$$ x_{i,j}^{t + 1} = \left\{ {\begin{array}{*{20}l} {nx_{i,j}^{t + 1} ,} \hfill & {if\;{ }j = k{ }\;or\;{ }rnd_{j} \ge pa} \hfill \\ {x_{i,j}^{t} ,} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(4)

where \(rand \sim U\left( {0,1} \right)\), \(rnd_{j}\) denotes a random number within \([0,1]\), \(x_{{R_{1} }}^{t}\) and \(x_{{R_{2} }}^{t}\) denote two arbitrary members of the existing population, and \(pa\) is a probability constant. Finally, \(k\) is randomly selected in \(\left\{ {1,2, \ldots ,m} \right\}\) and guarantees that a minimum of one component of \(nx_{i,j}^{t + 1}\) is carried into \(x_{i,j}^{t + 1}\), ensuring that there is no duplication between the original and the new solutions.

If a flying fox finds a tree with a lower temperature, the new position is accepted as its solution; if not, it returns to its current position.
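A hedged NumPy sketch of the movement rules in Eqs. (2)-(4) follows; the constants a, delta1, and pa are illustrative values, and f stands for the objective function:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffo_move(x, cool, f, pop, a=0.5, delta1=1.0, pa=0.85):
    """One movement step for a single flying fox x (Eqs. 2-4)."""
    m = x.size
    if abs(f(cool) - f(x)) > delta1 / 2:
        # Eq. (2): far from the coolest tree, move toward it.
        return x + a * rng.random(m) * (cool - x)
    # Eqs. (3)-(4): near the coolest tree, search nearby to avoid suffocation.
    r1, r2 = pop[rng.choice(len(pop), size=2, replace=False)]
    nx = x + rng.random(m) * (cool - x) + rng.random(m) * (r1 - r2)
    keep_new = rng.random(m) >= pa        # rnd_j >= pa in Eq. (4)
    keep_new[rng.integers(m)] = True      # j == k: at least one component moves
    return np.where(keep_new, nx, x)
```

The candidate returned here is then accepted only if it improves the objective, per the paragraph above.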

Several causes result in the deaths of flying foxes. For example, a flying fox might end up in a very remote area with high temperatures while seeking the coolest tree. In that case, it is replaced: a replacement list \(\left( {RL} \right)\) holds the \(NL\) best unique solutions found so far. An arbitrary integer \(n \in \left[ {2,NL} \right]\) is generated, and the position of the new flying fox is given as follows:

$$ x_{i,j}^{t + 1} = \frac{{\Sigma_{k = 1}^{n} RL_{k,j}^{t} }}{n} $$
(5)

In Eq. (5), \(RL_{k}^{t}\) denotes the \(k\text{th}\) FF in \(RL\) at iteration \(t\). Equation (5) increases the probability of identifying a suitable area.

Also, they might die from being suffocated by other members of the population:

$$ pD = \frac{nc - 1}{{population\; size}} $$
(6)

In Eq. (6), \(nc\) denotes the count of FFs whose objective function value is close to the optimum result. Dead flying foxes are also replaced through genetic crossover, which mates two flying foxes. The first step involves arbitrarily choosing two parents from the population, ensuring that they are distinct:

$$ offspring1 = L \cdot R_{1} + \left( {1 - L} \right) \cdot R_{2} $$
(7)
$$ offspring2 = L \cdot R_{2} + \left( {1 - L} \right) \cdot R_{1} $$

where \(R_{1}\) and \(R_{2}\) denote distinct, arbitrarily chosen population members, and \(L\) is a random value generated in \([0,1]\).
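The replacement mechanisms in Eqs. (5) and (7) can be sketched as follows (a hedged sketch: RL and pop are arrays of position vectors, and the choice between the two replacement routes is left to the surrounding algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)

def replace_from_list(RL):
    """Eq. (5): average n randomly chosen replacement-list entries, n in [2, NL]."""
    n = rng.integers(2, len(RL) + 1)
    rows = RL[rng.choice(len(RL), size=n, replace=False)]
    return rows.mean(axis=0)

def crossover(pop):
    """Eq. (7): blend two distinct random parents with a random weight L."""
    i, j = rng.choice(len(pop), size=2, replace=False)
    L = rng.random()
    offspring1 = L * pop[i] + (1 - L) * pop[j]
    offspring2 = L * pop[j] + (1 - L) * pop[i]
    return offspring1, offspring2
```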

The FFO technique uses a fitness function to achieve better classification results. It takes a positive value, with smaller values indicating better efficacy of a candidate solution. Here, the fitness function is the classification error rate to be minimized, as presented in Eq. (8).

$$ \begin{aligned} fitness\left( {x_{i} } \right) & = Classifier\;Error\;Rate\left( {x_{i} } \right) \\ & = \frac{no. of\; misclassified\; instances}{{Total\; no. of\; instances}} \times 100 \\ \end{aligned} $$
(8)
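In the hyperparameter-tuning setting, the fitness of Eq. (8) is simply the validation error of the detector trained with a candidate's hyperparameters. A minimal sketch follows, in which train_and_evaluate is a hypothetical helper (not from the paper) that trains the GhostNet-based detector with the given settings and returns validation predictions:

```python
import numpy as np

def fitness(candidate, val_labels):
    """Eq. (8): classification error rate (%) of a hyperparameter candidate."""
    lr, momentum = candidate                               # illustrative hyperparameters
    preds = train_and_evaluate(lr=lr, momentum=momentum)   # hypothetical helper
    misclassified = np.sum(preds != val_labels)
    return 100.0 * misclassified / len(val_labels)         # smaller is better
```

FFO then minimizes this function over the candidate positions, with each flying fox encoding one hyperparameter configuration.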

Stage III: DQN-based vehicle classification

Finally, the DQN technique is used to classify detected vehicles. A DQN is an end-to-end RL agent that exploits a deep neural network (DNN) to map the connection between states and actions, corresponding to the Q-table (QT) in Q-learning (QL)31. The DQN approach is selected for classification tasks because it can learn and make decisions directly from high-dimensional input data utilizing DL models. Unlike conventional classification approaches, DQN implements reinforcement learning to optimize policies through experience replay and target networks, which enhances learning stability and performance. Its capability for handling complex environments and dynamically adapting to new data makes it specifically efficient for tasks needing complex classification strategies. Furthermore, the robustness of the DQN model to noisy data and its capacity to learn from past interactions give it a crucial edge over conventional classifiers.

The QL technique produces a table for computing the expected future rewards for every state and action; in particular, the rows indicate states and the columns denote actions. The DQN agent exploits a CNN to recognize the local spatial correlations present in successive game frames. One severe insufficiency of traditional QL agents lies in the QT they exploit to map the relationship between actions and states: QL agents with enormous state and action spaces suffer from the curse of dimensionality. Hence, the DQN exploits a deep CNN to approximate the optimal Q-function (QF) instead of using a QT.

Experience replay eliminates the correlations among observations and smooths the training series by randomizing the data, thereby modifying the data distribution. Iterative updating minimizes the correlation between the target and the \({\text{Q}}\) values by periodically updating the Q-values toward the target values. First, experience replay stores the agent's experience at each step of the process to construct a group of memories with a specific number of experiences. The Q-network is then trained by updating the parameters \(\theta_{{\text{i}}}\) at iteration \({\text{i}}\) to minimize the MSE in the Bellman equation. Thus, the loss function \(L_{i} \left( {\theta_{i} } \right)\), which changes at each iteration \(i\), is expressed as follows:

$$ L_{i} = E_{s,a,r} \left[ {\left( {E_{{s^{\prime}}} \left[ {y|s,{ }a} \right] - Q\left( {s,{ }a;\theta_{i} } \right)} \right)^{2} } \right] $$
(9)

Differentiating the loss function with respect to the weights yields:

$$ \nabla_{{\theta_{i} }} L\left( {\theta_{i} } \right) = E_{{s,a,r,s^{\prime}}} \left[ {\left( {r + \gamma \mathop {\max }\limits_{{a^{\prime}}} Q\left( {s^{\prime},{ }a^{\prime};\theta_{i}^{ - } } \right) - Q\left( {s,{ }a;\theta_{i} } \right)} \right)\nabla_{{\theta_{i} }} Q\left( {s,{ }a;\theta_{i} } \right)} \right] $$
(10)

The target optimizer is essential to the definition and differentiation procedure of the QF. Moreover, the familiar QL update can be recovered by updating the weights after every time-step, replacing the expectation with a single sample, and setting \(\theta_{{\text{i}}}^{ - } = \theta_{{{\text{i}} - 1}}\). Fig. 4 represents the architecture of the DQN.

Fig. 4
figure 4

Structure of DQN.

In the DQN training process, two variations of QL are introduced to ensure that the trained DNN model does not diverge32. In particular, besides experience replay, QL employs a separate target network to generate the targets in the QL update task, and these alterations improve the reliability of DQN. This method introduces a delay between the moment a Q-value is updated and the moment the update affects the targets, an effect that reduces the probability of divergence or oscillations in the DNN parameters.
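A compact PyTorch sketch of the update described by Eqs. (9) and (10), assuming a replay buffer that yields (state, action, reward, next state, done) batches; network architecture and hyperparameters are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: regress Q(s, a; theta) toward r + gamma * max_a' Q(s', a'; theta^-)."""
    s, a, r, s2, done = batch
    with torch.no_grad():                      # targets come from the frozen network
        y = r + gamma * target_net(s2).max(dim=1).values * (1.0 - done)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)        # MSE of the Bellman error, Eq. (9)
    optimizer.zero_grad()
    loss.backward()                            # gradient step of Eq. (10)
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    """Periodically copy theta -> theta^- to stabilize the targets."""
    target_net.load_state_dict(q_net.state_dict())
```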

Results and discussion

This section tests the vehicle detection and classification results of the FFODL-VDCAI technique on the Stanford and PSU datasets33,34. Tables 1 and 2 illustrate the details of the Stanford and PSU datasets. Each scene is captured utilizing a 4K camera mounted on a 3DR Solo quadcopter hovering approximately 80 m above various intersections on a university campus. The videos, processed for distortion and stabilization, have a specified resolution and comprise annotated targets with classes and their trajectories in time and space. Moreover, images were obtained employing a 3DR Solo drone equipped with a GoPro Hero 4 camera in an outdoor environment at a PSU parking lot. Videos recorded by the drone were utilized for extracting frames, which were manually labelled. Images not containing cars were excluded from the dataset, and the training/testing split was performed randomly. Figure 5 shows sample UAV images. Figure 6 demonstrates the original and detected images.

Table 1 Stanford dataset description.
Table 2 PSU dataset description.
Fig. 5
figure 5

Sample UAV images.

Fig. 6
figure 6

Images (a) original, (b) detected.

Table 3 and Fig. 7 examine the average precision (AP) results of the FFODL-VDCAI technique on the Stanford and PSU datasets35. The outcome indicates that the FFODL-VDCAI technique achieves enhanced performance on both databases. For instance, on the Stanford database, the FFODL-VDCAI technique reaches a higher AP of 20.12%, whereas the Faster RCNN, YOLO-v3, and YOLO-v4 models obtain lower APs of 19.30%, 13.50%, and 17.50%, respectively. Besides, on the PSU database, the FFODL-VDCAI technique attains a superior AP of 95.43%, whereas the Faster RCNN, YOLO-v3, and YOLO-v4 approaches obtain lower APs of 71%, 91.90%, and 94.30%, correspondingly.

Table 3 AP analysis of the FFODL-VDCAI approach on two datasets.
Fig. 7
figure 7

AP outcome of FFODL-VDCAI approach on two datasets.

Table 4 and Fig. 8 provide a comparative average recall (AR) examination of the FFODL-VDCAI technique with other methods on the Stanford database. The outcome implies that the YOLOv3 (320 × 320) model performs poorly with the lowest AR values. Along with that, the Faster R-CNN (Inceptionv2), Faster R-CNN (Resnet50), and YOLOv4 (320 × 320) models obtain closer AR values. However, the FFODL-VDCAI technique reaches effectual results with increased \(AR^{max = 1}\), \(AR^{max = 10} ,\) and \(AR^{max = 100}\) values of 17.45%, 20.01%, and 19.95%, respectively.

Table 4 AR outcome of the FFODL-VDCAI approach with other models on the Stanford dataset.
Fig. 8
figure 8

AR analysis of the FFODL-VDCAI approach on the Stanford datasets.

In Table 5 and Fig. 9, a comparative AR examination of the FFODL-VDCAI technique with existing methodologies on the PSU dataset is provided. The performance of the various methodologies is compared based on their AR at different thresholds. Faster R-CNN with Inceptionv2 achieved \(AR^{max = 1}\), \(AR^{max = 10} ,\) and \(AR^{max = 100}\) values of 6.20, 41.50, and 70.80, respectively. Faster R-CNN with ResNet50 had scores of 6.40, 41.50, and 67.20, respectively. YOLOv3 (320 × 320) recorded values of 6.00, 42.20, and 81.00, while YOLOv4 (320 × 320) reached 6.80, 47.10, and 95.50. The FFODL-VDCAI model outperformed the others with higher \(AR^{max = 1}\), \(AR^{max = 10} ,\) and \(AR^{max = 100}\) values of 7.97%, 48.45%, and 97.16%, demonstrating superior performance across all thresholds. These experimental values imply that the YOLOv3 (320 × 320) technique yields the lowest \(AR^{max = 1}\) value, while the Faster R-CNN (Inceptionv2), Faster R-CNN (Resnet50), and YOLOv4 (320 × 320) models obtain closer AR values.

Table 5 AR outcome of FFODL-VDCAI approach with other models on the PSU dataset.
Fig. 9
figure 9

AR analysis of the FFODL-VDCAI approach on the PSU datasets.

Table 6 and Fig. 10 examine the average IoU (AIoU) outcomes of the FFODL-VDCAI technique on the Stanford and PSU datasets. The results show that the FFODL-VDCAI technique achieves enhanced performance on both databases. For instance, on the Stanford database, the FFODL-VDCAI methodology achieves a superior AIoU of 92.03%, whereas the Faster RCNN, YOLOv3, and YOLOv4 approaches obtain lower AIoUs of 48.80%, 82.50%, and 90.40%, correspondingly. Moreover, on the PSU dataset, the FFODL-VDCAI approach attains a higher AIoU of 96.15%, whereas the Faster RCNN, YOLO-v3, and YOLO-v4 models reach lower AIoUs of 95.50%, 92.80%, and 91.30%, correspondingly.

Table 6 AIoU outcome of FFODL-VDCAI approach with other models on two datasets.
Fig. 10
figure 10

AIoU analysis of FFODL-VDCAI approach on two datasets.

In Table 7 and Fig. 11, a comparative AP analysis of the FFODL-VDCAI approach with existing models on the PSU dataset is provided. The experimental values indicate that the Faster R-CNN (Resnet50) approach obtains the weakest performance with the lowest AP values. At the same time, the Faster R-CNN (Inceptionv2), YOLOv3 (320 × 320), and YOLOv4 (320 × 320) models obtain closer AP values. However, the FFODL-VDCAI methodology reaches effectual results with maximal small, medium, and large values of 0.98, 0.74, and 0.82, respectively.

Table 7 AP outcome of the FFODL-VDCAI approach with other models on the PSU dataset.
Fig. 11
figure 11

AP outcome of the FFODL-VDCAI approach on the PSU dataset.

Table 8 and Fig. 12 provide a comparative AP outcome of the FFODL-VDCAI approach with existing models on the Stanford dataset. The experimental values imply that the Faster R-CNN (Resnet50) method reaches minimal performance with the lowest AP values. It is followed by the Faster R-CNN (Inception-v2), YOLOv3 (320 × 320), and YOLOv4 (320 × 320) models, which obtain closer AP values. However, the FFODL-VDCAI technique gains effectual outcomes with increased small, medium, and large values of 0.10, 0.15, and 0.69.

Table 8 AP outcome of the FFODL-VDCAI approach with other models on the Stanford dataset.
Fig. 12
figure 12

AP analysis of FFODL-VDCAI approach on Stanford dataset.

These results confirm that the FFODL-VDCAI technique achieves better performance than existing models in vehicle detection and classification.

Conclusion

This article presents an automated FFODL-VDCAI technique for the vehicle detection and classification process in the ITS environment. The presented FFODL-VDCAI technique involves three main procedures: YOLO-GD-based vehicle detection, FFO-based hyperparameter tuning, and DQN-based vehicle classification. Here, YOLO-GD uses a lightweight GhostNet in place of the backbone network of YOLOv4 and replaces conventional convolutions with depthwise separable and pointwise convolutions. Next, the FFO technique is exploited to tune the hyperparameters of the GhostNet model. Finally, the DQN method is used to classify the identified vehicles effectively. An extensive simulation analysis is performed on UAV image datasets to validate the enhanced vehicle classification outcomes of the FFODL-VDCAI technique. The comprehensive validation of the FFODL-VDCAI methodology exhibited superior AIoU values of 96.15% and 92.03% on the PSU and Stanford datasets, respectively.

The FFODL-VDCAI model, while advancing vehicle detection and classification, encounters multiple limitations. The YOLO-GD-based detection may face difficulty with accuracy in highly congested or occluded scenarios. Computational demands and the sensitivity of the optimization to initial conditions could limit the FFO-based hyperparameter tuning. Furthermore, the DQN-based classification may suffer from long training times and extensive labelled-data requirements. Another limitation of the FFODL-VDCAI model is its potential difficulty scaling to massive datasets with many vehicle types, which may impact its robustness and generalization. Future studies should concentrate on improving detection accuracy in challenging environments, optimizing the hyperparameter tuning procedure for efficiency, and enhancing the generalization abilities of the classification method. Exploring hybrid methods and transfer learning methodologies could also be beneficial in addressing these limitations. Future work should also develop scalable solutions for various vehicle types and environmental conditions. Moreover, integrating real-time processing capacities and exploring the incorporation of multi-modal data sources could improve the technique's performance and applicability in practical scenarios. Additionally, vehicle classification effectiveness can be improved by utilizing ensemble DL methods to combine diverse classifiers, thereby enhancing accuracy and robustness through several learning perspectives.