Abstract
Sustainability is the success factor of the Industry 5.0 era, in which industries focus on customer-centric development. The exponential growth of smart cities paves the way for opportunities to develop various automated, customer-centric solutions, and automation is the backbone of sustainable smart city development. The proposed work is one such sustainable solution, enabling driver-free operation through Autonomous Electric Vehicles (AEVs). This research presents a groundbreaking approach to AEVs that addresses the unique challenges of unstructured roadways in developing countries and smart cities. By integrating multiple deep learning models in a cascaded architecture, this work creates a comprehensive system capable of handling the diverse and challenging road conditions found in countries like India. The core innovation lies in a unified framework that simultaneously processes lane boundaries and critical objects at 6 frames per second on resource-constrained hardware, with intelligent prioritization of safety features. Performance metrics are strong: 97.26% accuracy for lane detection using DeepLabv3+, 0.92 mAP for object detection, and 0.83 mAP for pothole detection, both achieved with YOLOv9. The successful implementation on a custom-built electric vehicle platform demonstrates the commercial viability of this approach, potentially bridging the adoption gap for autonomous technology in developing economies worldwide.
Introduction
The development of autonomous vehicles over the last century has had a profound impact on the present and the future. The introduction of vision-controlled self-driving automobiles in the 1980s was a pivotal moment, establishing the groundwork for today’s technology1. In unstructured road environments, such as those commonly found in developing countries, the lack of consistent lane markings, the presence of unexpected obstacles, and poor road surface conditions pose significant challenges to autonomous driving systems. Unlike well-structured highways where rules and physical features are predictable, unstructured environments demand higher levels of perception and decision-making from autonomous vehicles. Major engineering companies and automobile manufacturers are investing heavily, logging millions of kilometers with self-driving car prototypes. However, various technical and non-technical difficulties must be overcome before commercialization. Technical issues include complex software, real-time data processing, and thorough testing and validation. Non-technical challenges of autonomous driving include customer acceptance, insurance administration, and ethical and moral issues2. Among the essential tasks, lane detection is critical for maintaining safe navigation paths when lane markings are faded, missing, or irregular. Similarly, object detection, including recognizing pedestrians, animals, non-standard vehicles, and irregularly placed obstacles, is vital to ensure real-time responsiveness and prevent accidents. Autonomous driving requires the networking and synchronization of radar, ultrasonic sensors, and optical cameras. These innovations can transform several industries. Self-driving cars might replace business vehicles for delivery and transportation, allowing employees to spend their travel time productively. Self-driving vehicles could decrease accidents by 80 percent by 2040. Fully autonomous vehicles will increase ride-sharing and car-sharing; these services reduce the production of iron, steel, polymers, and cement, lowering global energy consumption and carbon emissions. Road testing and accident data can provide insight into Autonomous Electric Vehicle (AEV) safety, which remains a major focus. Manufacturers test millions of kilometers and fail at varying rates, and AEV accidents typically involve other drivers. This emphasizes the necessity of improving the AEV’s ability to recognize and eliminate external hazards3. Electric automobiles also support the development of autonomous cars by eliminating long-term vehicular pollution. Understanding electric vehicle characteristics, suppliers, types, and benefits is essential for autonomous automobiles4. They are quieter, cheaper to operate, and reduce greenhouse gas emissions; electric automobiles powered by renewable energy do not pollute. Vehicle-to-grid (V2G) technology may support grid stability and renewable energy integration in electric automobiles5. Electric automobiles include BEVs and PHEVs. Battery Electric Vehicles (BEVs) charge at charging stations. Battery swapping, replacing an exhausted EV battery with a fully charged one, is a popular alternative to charging: at any standard battery charging station, EV drivers may exchange their exhausted batteries for fully charged ones6. Plug-in hybrid electric vehicles (PHEVs) include a battery and an internal combustion engine, allowing propulsion by the engine or external charging. Both meet different driving needs and recharging infrastructures7.
Electric cars and self-driving technology can improve transportation safety, efficiency, and sustainability, and AEVs can enhance traffic flow, energy use, and overall transportation8. The development of autonomous electric vehicles nevertheless encounters daunting challenges. Reliable autonomous driving systems are hard to build: AI systems need better computer vision, sensor fusion, machine learning, and decision-making to grasp complicated real-world settings, and self-driving electric cars must be safe against both accidents and attacks9. In summary, advancements in self-driving vehicle technology are expected to transform the automotive industry, enabling more efficient and secure transportation solutions. Therefore, developing a robust system capable of accurately detecting lanes, objects, and potholes simultaneously is crucial for achieving safe and sustainable autonomous driving in these unpredictable conditions. This research directly addresses these challenges through a cascaded deep learning framework specifically tailored for unstructured road scenarios.
Contributions of the paper
-
This work illustrates the viability of using deep learning algorithms for feature identification and vehicle control, paving the way for next-generation self-driving cars.
-
The paper incorporates advanced deep learning algorithms, such as YOLOv9, DeepLabv3+, the UNet encoder-decoder, SSD MobileNet V2, and VGG16, to integrate multiple perception features into autonomous electric vehicles.
-
The cascaded multitasking model combines lane detection and object detection, enabling simultaneous detection of objects and lanes.
-
The proposed work demonstrates superior object detection performance, advances lane detection with DeepLabv3+, and enhances pothole detection with YOLOv9, contributing to the advancement of autonomous electric vehicles.
Organization of the paper
“Introduction” provides an overview of earlier research related to autonomous electric vehicles. “Proposed methodology” outlines the system concept, the architecture of the autonomous electric vehicle, and the mathematical modeling techniques used to implement the proposed approach. “Experimentation, results, and analysis” provides an in-depth analysis of the experimental setup, the conducted tests, and the results obtained from the proposed system, including the validation of the prototype. “Conclusion and further work” encapsulates the conclusions drawn from the research findings and summarizes the major takeaways and inferences of the study.
Related work
The most crucial aspects of Deep Learning (DL), encompassing recent advances in network architectures and methodologies, are examined in10. In the domain of autonomous vehicles, Ref.11 proposed an autonomous electric vehicle framework for unstructured roadways using semantic segmentation and convolutional neural networks for accurate lane detection. However, this work focused solely on lane boundary recognition and did not incorporate object or pothole detection.
Conventional approaches apply edge detection, Hough transforms, and color filtering to extract lane lines from images, but these methods show limitations in complex environments with faded markings, poor lighting, or occlusions. Ref.12 developed a more robust lane detection approach using OpenCV techniques optimized for urban driving.
Recent developments in lane detection apply deep learning approaches such as Convolutional Neural Networks (CNNs) for semantic segmentation of lanes. Ref.13 reviews various CNN architectures, such as SCNN and ENet, that can identify lanes under challenging conditions such as shadows and curves. These methods outperform traditional approaches in terms of accuracy and adaptability.
Alongside camera-based systems, combining LiDAR and GPS data processing can greatly improve reliability. Lim14 proposed a sensor fusion framework that integrates vision and point-cloud data to improve lane identification in low-light or occluded scenarios.
The authors of15 use Random Sample Consensus (RANSAC) for normal lane detection. When the road scene is complex and includes roadside encroachments, they apply a CNN before and after the RANSAC step. This pre- and post-processing improves the RANSAC approach, yielding better performance than conventional line detection algorithms such as plain RANSAC and the Hough transform.
End-to-end lane detection is proposed by the authors of16 with a fast algorithm running at 50 fps. The approach can handle a variable number of lanes as well as lane changes, and it was applied to the TuSimple dataset with competitive results.
Deep learning-based lane recognition systems developed in17 further enhanced lane tracking capabilities under unstructured conditions but similarly lacked integration with obstacle or anomaly detection modules. For pothole detection, Ref.18 introduced a YOLO-based algorithm tailored to Indian roads, achieving 0.76 accuracy, and19 later improved detection rates by evaluating various YOLO models, determining YOLOv4-tiny as achieving 78.7% accuracy. Nonetheless, these approaches treated pothole identification as a standalone task, without addressing broader navigation features such as lane following or dynamic obstacle avoidance.
Vision-based pothole detection methods using image segmentation and edge detection were explored in20, while21 investigated deep neural network models for vehicle and pedestrian detection. Object recognition studies such as22 demonstrated the utility of MobileNet for SSD object detection, enabling efficient multiclass identification, but remained disconnected from lane or surface anomaly detection pipelines.
Additionally, Ref.23 compared DeepLabv3+ models for road-boundary estimation and highlighted ResNet backbones as superior for pre-trained feature extraction, yet did not extend to multitasking scenarios. While multi-task perception frameworks, such as Tesla’s proprietary HydraNet architecture, illustrate the potential for unified scene understanding, they require powerful compute infrastructure and are not optimized for unstructured or resource-constrained deployments.
Modern AEVs use multi-task learning with end-to-end pipelines that simultaneously detect lanes, segment objects, and estimate drivable areas. The Spatial CNN (SCNN) introduced in24 propagates spatial information along both the vertical and horizontal directions, which is especially useful for curved lanes. Similar methods such as ENet-SAD (Self Attention Distillation) are designed for lightweight embedded applications.
Although current research often treats electric vehicle and autonomous driving topics separately, accurate lane detection can also improve electric vehicle efficiency by maintaining lane discipline and reducing unnecessary lane changes. In this way, AEVs can minimize energy consumption and extend battery life, creating an intersection between computer vision and battery management systems.
In contrast, the proposed work addresses these gaps by designing a cascaded multitasking model that concurrently detects lanes, objects, and potholes in real time on computationally limited hardware. By integrating these capabilities, the proposed work directly targets the complex operational conditions often encountered on developing-country roadways.
Research gap
-
Existing studies for pothole detection often utilize older versions of object detection algorithms; adopting the latest architectures (e.g., YOLOv9) could yield significant improvements in real-time accuracy.
-
Despite the availability of deep learning-based methods for lane detection, pothole detection, and object recognition individually, there is a notable research gap in developing a unified system that combines all these tasks cohesively.
-
Few works attempt a cascaded multitasking model that performs simultaneous feature extraction and decision-making within a single lightweight pipeline suitable for embedded systems.
-
Simultaneous detection and response to multiple roadway features in a single frame remains an underexplored area that could enhance situational awareness and navigation safety.
-
Extensive validation and evaluation of multitask autonomous frameworks, particularly under real-world unstructured conditions such as those found in Indian roadways, are limited and urgently needed to ensure robustness and adaptability.
Proposed methodology
Proposed study
Figure 1 describes the methodology of the entire work in the form of a flowchart. The first step is to collect the datasets for building the models. The models for detection and control are built in the second step and are then cross-validated on real-time videos shot on Hyderabad roads. The fourth step is to build the car prototype using the hardware components, the fifth step is to build the cascaded multitasking model, and the final step is to deploy all the models and run the car.
Dataset description
The dataset utilized for constructing all the models was acquired from Kaggle, a few datasets were also sourced from Google images, and others were captured using a smartphone’s camera.
-
For object detection, 12,000 images were collected across 5 classes (Stop Sign, Traffic Light, Crosswalk, Speed Limit Sign, Vehicles): 2000 images for each of the first four classes and 4000 images for the Vehicles class, given the complexity of its different vehicle types.
-
For Lane Detection, a total of 5000 images of real-time lanes were collected.
-
For Pothole Detection, 2000 images were collected and annotated for training the YOLOv9 model.
Image preprocessing and augmentation
Before model training, all collected images were subjected to a standard preprocessing pipeline (a minimal code sketch is given after the list below). The preprocessing involved:
Resizing: all images were resized to 320\(\times\)320 pixels for object detection models (YOLOv9, SSD MobilenetV2) and 224\(\times\)224 pixels for image classification and lane detection models (VGG16, DeepLabv3+).
Normalization: pixel values were normalized to a range between 0 and 1 to facilitate faster convergence during training.
Data augmentation: to enhance model generalization and mimic diverse road conditions, the following augmentation techniques were applied:
Random horizontal flipping with a probability of 0.5.
Random rotation up to ±15 degrees.
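The following is a minimal sketch of this preprocessing and augmentation pipeline, assuming a TensorFlow/Keras workflow; the function name, size dictionary, and task keys are illustrative and not taken from the original implementation.

```python
import tensorflow as tf

# Sketch of the preprocessing/augmentation pipeline described above.
# Size keys and the function name are illustrative, not from the original code.
IMG_SIZES = {"detection": (320, 320), "classification_or_lane": (224, 224)}

def preprocess(image, task="classification_or_lane", training=True):
    """Resize, normalize to [0, 1], and optionally augment one image tensor."""
    image = tf.image.resize(image, IMG_SIZES[task])
    image = tf.cast(image, tf.float32) / 255.0                  # normalize to [0, 1]
    if training:
        image = tf.image.random_flip_left_right(image)          # horizontal flip, p = 0.5
        # Random rotation up to +/-15 degrees (15/360 of a full turn).
        rotate = tf.keras.layers.RandomRotation(15.0 / 360.0)
        image = rotate(image[None, ...], training=True)[0]
    # For segmentation, the same geometric transforms must also be applied to the mask.
    return image
```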
Hardware design approach
The hardware design approach for the self-driving autonomous car project involved using several components to build the car from scratch. The first step was to assemble the car chassis and ensure proper balance and weight distribution. Once the chassis was assembled, the sensors, microcontroller, and microprocessor were mounted and fixed onto the surface of the car. The camera module was integrated with the Raspberry Pi 4b, since it was the critical component used to capture visual data, which was then processed by the software algorithms. The Arduino Mega 2560 R3 and Raspberry Pi 4b were used as the microcontroller and microprocessor, respectively, to run the deep learning algorithms. The L298N motor driver module was used to control the speed and movement of the car, while a power bank provided the necessary power supply. Figure 2 shows the circuit design of the prototype in Fritzing.
Figure 3 shows the simulation model of how the car is controlled in Proteus. The Arduino sends a command to the motor driver, which can spin the motor in both clockwise and anticlockwise directions. Furthermore, using PWM input, the user can make the motor spin at the desired speed. The simulation was done to verify the speed and direction control of the prototype. Figure 4 displays the developed prototype of the autonomous electric car.
Hardware constraints and optimization strategies
The implementation of the system on a Raspberry Pi 4b and Arduino Mega 2560 R3 introduced significant hardware limitations due to restricted computational resources, memory bandwidth, and thermal management constraints. Raspberry Pi 4b, despite being one of the most powerful SBCs in its class, has limited GPU capabilities compared to a full-fledged GPU server. To ensure real-time inference, lightweight model architectures were prioritized.
Optimization techniques such as quantization (reducing model weights from 32-bit floating-point to 8-bit integers) and model pruning (removing redundant parameters) were considered. Although aggressive pruning led to minor drops in model accuracy, it resulted in substantial speed improvements.
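As an illustration of the quantization step, the sketch below shows standard TensorFlow Lite post-training INT8 quantization of a trained Keras model; the model path and calibration data are placeholder assumptions, not the project’s actual files.

```python
import numpy as np
import tensorflow as tf

# Illustrative post-training INT8 quantization of a trained Keras model for the
# Raspberry Pi. The model path and calibration data are placeholder assumptions;
# in practice, ~100 real preprocessed frames should be used for calibration.
model = tf.keras.models.load_model("lane_deeplabv3plus.h5")

def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype("float32")]   # placeholder frames

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()

with open("lane_deeplabv3plus_int8.tflite", "wb") as f:
    f.write(tflite_model)
```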
Inference time was a critical parameter monitored during deployment. After optimization, the YOLOv9 model achieved an average inference speed of approximately 6 frames per second (FPS), while the DeepLabv3+ model achieved around 5.5 FPS on real-time video streams. These FPS values were found sufficient for low-speed autonomous navigation tasks typical in urban or semi-urban settings. The trade-off between model complexity and real-time responsiveness was carefully balanced to maximize detection accuracy while maintaining operational feasibility on resource-constrained hardware. Additionally, latency and resource utilization measurements were recorded. On the Raspberry Pi 4b, the cascaded multitask model achieved an average per-frame latency of approximately 180-200 ms, corresponding to around 5-6 FPS. CPU utilization during inference was around 75% when using the cascaded model, compared to approximately 90% CPU utilization when running lane detection and object detection models sequentially without integration. These quantitative measurements substantiate the improved computational efficiency and resource management achieved by the proposed approach.
Build models for detection and control
Image classification (VGG16 model)
As shown in Fig. 5, 13 convolutional and 3 fully connected layers make up VGG-16. The network generates 1000 classes from a 224 \(\times\) 224 RGB image. The first 13 convolutional layers retrieve information from the input image. Convolutional layer 1 has 64 3 \(\times\) 3 filters and a stride of one. The third layer is a max pooling layer with a pool size of 2 \(\times\) 2 and a stride of 2, halving the feature maps’ spatial dimensions. The following convolutional blocks each contain two convolutional layers followed by a max pooling layer, following a similar pattern. The first set has 128 3 \(\times\) 3 filters, while the second has 256. These sets’ max pooling layers have 2 \(\times\) 2 pool sizes and strides of 2, lowering the feature map spatial dimensions again.
VGG 16 architecture
Equation (1) is used to calculate max pooling. Three fully connected layers receive the flattened output from the final convolutional layer. Dropout layers and rectified linear unit (ReLU) activation functions are used after each fully connected layer to avoid overfitting.
Equation (2) is the mathematical formula used for Dropout. Tiny 3 \(\times\) 3 filters and several convolutional layers allow the network to learn complex properties from the input image. Max pooling layers reduce feature map spatial dimensions, decreasing network parameters and limiting overfitting.
Equation (3) corresponds to the mathematical expression defining the Rectified Linear Unit (ReLU) activation function: it returns x if x is positive and 0 otherwise. Finally, the fully connected layers map the extracted features to the output classes16.
Multi-image classification using VGG16: The VGG16 classification model utilized 12,000 images. After importing the images as arrays, they were scaled to 224 \(\times\) 224 and split into 80% train and 20% test. Using a transfer learning approach, a pretrained VGG16 model was imported from ImageNet and its final fully connected layers were removed. The object detection dataset was used to fine-tune the model, and a categorical cross-entropy loss function and a Softmax activation function were added to the final layer for multiclass classification. Accuracy, precision, recall, and F1 score metrics were used to evaluate the classification performance. The trained and optimized model predicted the classes of new, unseen images.
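A minimal transfer-learning sketch of this setup is given below, assuming a Keras workflow; the dense-layer size and dropout rate are illustrative choices, while the frozen ImageNet backbone, Softmax output, and categorical cross-entropy loss follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Transfer-learning sketch for the traffic-sign classifier: ImageNet-pretrained
# VGG16 without its top layers plus a new Softmax head. The dense-layer size and
# dropout rate are illustrative; class count and learning rate follow the text.
NUM_CLASSES = 5   # Stop Sign, Traffic Light, Crosswalk, Speed Limit Sign, Vehicles

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                              # freeze pretrained layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                            # dropout against overfitting
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```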
Object detection (YOLOv9 model)
YOLO algorithm
As illustrated in Fig. 6, YOLO divides the input image into the required number of grid cells, and each grid cell predicts a number of bounding boxes and confidence scores, which indicate how well the predicted box matches the input image25. Non-maximal suppression (NMS) is an essential technique used in YOLO models to identify and remove inaccurate or superfluous bounding boxes21. In NMS, the Intersection over Union (IoU) is calculated between the selected bounding box and each of the remaining bounding boxes26. Bounding boxes that have a high IoU with the selected bounding box are removed.
Equation (4) is the mathematical formula used to calculate the IoU between two bounding boxes, where AoU is the Area of Union and AoO is the Area of Overlap. Figure 7 explains the methodology for the YOLOv9 model.
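The following short sketch shows how the IoU of Eq. (4) and the NMS filtering described above can be computed; the 0.5 IoU threshold is an illustrative value, not one reported in this work.

```python
# IoU of Eq. (4): Area of Overlap (AoO) divided by Area of Union (AoU),
# followed by the NMS filtering described above (0.5 is an illustrative threshold).
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)                    # AoO
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap                                  # AoU
    return overlap / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence boxes, dropping boxes that overlap them too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```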
Pothole detection using YOLOv9
The methodology for detecting potholes with YOLOv9 consists of multiple phases. The first step is to create a dataset of images with potholes. The annotated images should be kept in a YOLOv9-compatible directory structure, with each image having an associated text file providing the annotations. Each image’s annotation contained bounding box values of the potholes. The dataset was separated into two sections: training and testing. The training set had 80% of the images, whereas the test set contained the remaining 20%. The YOLOv9 model was then trained on the annotated dataset. It is an improvement over earlier versions of YOLO. Using a transfer learning technique, the YOLOv9 algorithm was trained on the training set of pothole images. To identify potholes in the test set, the pre-trained YOLOv9 model was fine-tuned on pothole images. The model was further cross-validated on videos shot on Indian roads using smartphones. For real-time pothole identification, the YOLOv9 model was implemented on a Raspberry Pi.
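For reference, the sketch below illustrates the YOLO-style annotation layout assumed above, with one text file of normalized bounding boxes per image; the directory names and helper function are hypothetical.

```python
from pathlib import Path

# Sketch of the YOLO-style annotation layout assumed above: one .txt file per
# image, each line holding "class x_center y_center width height" with values
# normalized to [0, 1]. Directory names and the helper are hypothetical.
def load_labels(label_path):
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes

dataset_root = Path("pothole_dataset")
for split in ("train", "test"):                     # the 80/20 split described above
    for label_file in sorted((dataset_root / split / "labels").glob("*.txt")):
        print(label_file.stem, load_labels(label_file))
```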
YOLOv9 was selected over previous YOLO versions based on its superior performance in balancing accuracy and computational efficiency. Tiny YOLOv3 achieved a mean Average Precision (mAP) of 0.76 for pothole detection as reported in22, while YOLOv4-tiny achieved a mAP of 78.7% as discussed in26. However, both models, although effective, required higher computational resources than were feasible for real-time deployment of the proposed work on low-power hardware.
YOLOv5 introduced improvements in inference speed but still exhibited moderate resource requirements. In contrast, YOLOv9 incorporated architectural enhancements such as hybrid task cascades and improved anchor-free mechanisms, achieving a higher mAP with reduced model complexity. In preliminary evaluations, YOLOv9 achieved a mAP of 0.92 for object detection and processed frames at approximately 6 FPS on the Raspberry Pi 4b, outperforming earlier versions under the same hardware constraints. These factors made YOLOv9 an optimal choice for the proposed cascaded multitasking framework targeting unstructured road environments.
Object detection using SSD model
As Fig. 8 describes, the SSD approach is based on a fully convolutional neural network (FCN) architecture. The two main parts of the network are a base network for feature extraction and a detection network for object detection. The base network is typically a pre-trained classification network such as VGG-16 or ResNet-50. In the SSD MobileNet architecture, batch normalization is applied after the activation of every layer to improve training stability and accelerate convergence.
Equation (5) is the formula used to calculate batch normalization. Where, NO is the normalized output, X is the input, \(\mu\) is the mean, \(\sigma\) is the variance, S is the scale, O is the offset, \(\epsilon\) is the epsilon.
The detection network is a set of convolutional layers that take the feature map from the base network as input and produce a group of bounding boxes and class probabilities for every object in the image. The algorithm also employs a method called Non-Maximum Suppression (NMS) to remove unwanted bounding boxes and improve the accuracy of the detections. NMS works by selecting the bounding box with the highest confidence score, and then suppressing all other boxes with a high degree of overlap with the selected box.
where: AS is the size of the anchor box. smin is the minimum anchor size. smax is the maximum anchor size. i is the index of the anchor box. NA is the number of anchor boxes.
The anchor box size is calculated as given in Eq. (6) by linearly interpolating between the minimum and maximum anchor sizes. The anchor box’s index, i, determines its interpolation position. The anchor box is smallest if \(i = 0\). The anchor box is the largest if \(i = NA - 1\). The SSD algorithm’s ability to recognize objects of various sizes and aspect ratios in a single run of the network is one of its main features. This is achieved using multiple feature maps of different resolutions, allowing the network to detect objects at different scales27.
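A small sketch of the anchor-size interpolation in Eq. (6) is shown below; the default s_min, s_max, and number of anchors follow common SSD practice and are assumptions rather than values reported here.

```python
# Anchor sizes linearly interpolated between s_min and s_max, following Eq. (6).
# Default values mirror common SSD practice and are assumptions here.
def anchor_sizes(s_min=0.2, s_max=0.9, num_anchors=6):
    if num_anchors == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * i / (num_anchors - 1)
            for i in range(num_anchors)]

print(anchor_sizes())   # approx. [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```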
Equation (7) is the mathematical formula for the Softmax Activation function which is used as the output layer in the SSD MobileNet V2 architecture.
Figure 9 explains the methodology followed for building the SSD Model.
Lane detection using DeepLabv3+
Figure 10 illustrates the architecture of the DeepLabv3+ model. DeepLabv3+ uses a fully convolutional neural network (FCN) architecture, which allows end-to-end training and inference on images of arbitrary size. Features are extracted from a pre-trained ResNet backbone network. An encoder-decoder network and a feature extractor are the two primary parts of the model. The encoder-decoder network is responsible for generating the final segmentation map from the features extracted by the feature extractor. Batch normalization is one of the encoder’s most essential operations; it normalizes the activations of each layer across a mini-batch. A distinguishing feature is the adoption of an extra module, the Atrous Spatial Pyramid Pooling (ASPP). The ASPP module consists of atrous convolutions with various dilation rates, allowing the network to gather context information at multiple scales. By employing different dilation rates, the ASPP module can collect both local and global context information, which enhances segmentation accuracy26. A probability map is the network’s final output, in which each pixel is assigned a likelihood of belonging to one of the object categories. The probabilities are then thresholded to produce a binary segmentation mask, where each pixel is labeled as either belonging to the object of interest or not.
UNet model: Similar to the DeepLabv3+ architecture, the UNet architecture consists of an encoder-decoder structure. The encoder component comprises multiple convolutional layers with an increasing number of filters and downsampling max-pooling layers. The decoder component comprises upsampling layers followed by convolutional layers that integrate the encoder features via skip connections28.
The formula for the skip connection is shown in Eq. (8). In the formula, the encoder output (E) and decoder output (D) are added to generate the decoder layer’s final output (O). This contributes to integrating multi-scale features and enhancing the model’s localization accuracy. Finally, a 1 \(\times\) 1 convolutional layer with a sigmoid activation function generates a single-channel binary mask representing the predicted lane markings. When trained for 50 epochs, the UNet model gave a test accuracy of 91.2%.
Equation (9) is the formula used for the Sigmoid activation function.
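A minimal sketch of one such decoder stage, using the additive skip connection of Eq. (8) and the 1 \(\times\) 1 sigmoid output layer of Eq. (9), is shown below; the filter counts and single encoder/decoder level are illustrative simplifications of the full UNet.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Minimal sketch of one UNet decoder stage with the additive skip connection of
# Eq. (8) (O = E + D) and the 1x1 sigmoid output of Eq. (9). Filter counts and
# the single encoder/decoder level are illustrative simplifications.
def decoder_block(decoder_in, encoder_feat, filters):
    d = layers.UpSampling2D(size=(2, 2))(decoder_in)
    d = layers.Conv2D(filters, 3, padding="same", activation="relu")(d)
    return layers.Add()([encoder_feat, d])          # skip connection: O = E + D

inputs = tf.keras.Input(shape=(224, 224, 3))
e1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling2D()(e1)
bottleneck = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
d1 = decoder_block(bottleneck, e1, 32)
mask = layers.Conv2D(1, 1, activation="sigmoid")(d1)   # single-channel binary lane mask
model = tf.keras.Model(inputs, mask)
```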
Methodology for lane detection models
Figure 11 explains the methodology followed for building the lane detection model. The OpenCV library is used to read and edit the images, and TensorFlow and Keras are used as the machine learning libraries. The lane images and labels are loaded first. The labels, i.e., the segmented images, are created using Adobe Photoshop: the lane in the image is colored pink and the background blue. Images are resized to 224 \(\times\) 224 pixels, the machine learning model’s input size. To create a binary mask of the lane markings, the preprocessed data is trained on the DeepLabv3+ algorithm with an 80%/20% train-test split and binary cross-entropy as the loss function. A color variable is created to represent the overlay on the detected lanes; white is (255, 255, 255) in BGR format. Before preprocessing for the model, each frame is scaled to (224, 224) pixels. Preprocessed frames are fed into the pre-trained models for lane recognition. The model generates a probability map, which is thresholded at 0.5 to create a binary lane-marking mask. Color thresholding of the resized frame creates a white mask. The binary lane mask and white mask are merged using a bitwise OR. Applying the combined mask to the resized frame creates the lane overlay image.
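The overlay step described above can be sketched as follows; lane_model stands for the trained segmentation model, and the white-color threshold bounds are illustrative assumptions.

```python
import cv2
import numpy as np

# Sketch of the overlay step described above. `lane_model` stands for the trained
# DeepLabv3+ segmentation model; the white-color bounds are illustrative.
def lane_overlay(frame, lane_model):
    resized = cv2.resize(frame, (224, 224))
    prob_map = lane_model.predict(resized[None, ...] / 255.0)[0, ..., 0]
    lane_mask = (prob_map > 0.5).astype(np.uint8) * 255     # threshold probability map

    # White mask from color thresholding of the resized frame (BGR order)
    white_mask = cv2.inRange(resized, (200, 200, 200), (255, 255, 255))

    combined = cv2.bitwise_or(lane_mask, white_mask)         # bitwise OR of the masks
    overlay = resized.copy()
    overlay[combined > 0] = (255, 255, 255)                  # paint detected lane white
    return overlay
```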
Training and hyperparameter tuning
This subsection details the training configurations and hyperparameter choices for all models to improve transparency and reproducibility.
YOLOv9 (object and pothole detection): a pre-trained YOLOv9 model was fine-tuned using an initial learning rate of 0.001 and a batch size of 16. Training was conducted for 100 epochs, employing a stepwise learning rate decay strategy that reduced the learning rate by a factor of 0.1 after the validation loss plateaued for 10 epochs. Early stopping was applied if the validation loss did not improve for 15 consecutive epochs.
SSD MobileNetV2 (object detection): for SSD MobileNetV2, also initialized with pre-trained weights, the same initial learning rate of 0.001 was used, with a batch size of 16. Training was performed for approximately 60 epochs until validation loss convergence. Early stopping was similarly employed.
DeepLabv3+ (lane detection): a ResNet backbone was fine-tuned using the Adam optimizer. An initial learning rate of 0.0001 was set to gently update the pre-trained layers. The model was trained for 50 epochs, with early stopping triggered based on Intersection-over-Union (IoU) improvements on the validation set.
UNet (lane detection): the UNet model was trained from scratch without transfer learning. A learning rate of 0.0001 was used with a batch size of 16. Training was run for 50 epochs, with the model achieving its best validation accuracy around epoch 45, confirming the chosen stopping point.
VGG16 (image classification for traffic signs): ImageNet pre-trained weights were used for the VGG16 model, which was then fine-tuned on the custom dataset. A small learning rate of \(1\times 10^{-4}\) was used to prevent overfitting and catastrophic forgetting. The model was trained for 30 epochs with a batch size of 16.
Hyperparameter tuning strategy: initially, hyperparameters were set based on commonly accepted practices for each model architecture. Minor adjustments were made via manual tuning: higher (0.01) and lower (0.0001) learning rates were tried, and batch sizes were varied (8 vs. 16) to observe their impact on validation accuracy. The final values maximized performance on the validation set without overfitting.
All these configurations are provided to ensure transparency and to facilitate reproducibility of the reported results.
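For the Keras-based models, the learning-rate decay and early-stopping behavior described above correspond to standard callbacks, sketched below with the stated factor and patience values; the commented training call and dataset names are placeholders.

```python
import tensorflow as tf

# Callbacks matching the schedule described above (for the Keras-based models):
# stepwise LR decay by a factor of 0.1 after a 10-epoch validation-loss plateau,
# and early stopping after 15 epochs without improvement. The YOLOv9 training
# used the equivalent settings within its own framework.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=10),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                     restore_best_weights=True),
]

# Placeholder usage (train_ds / val_ds are hypothetical tf.data pipelines):
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```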
Building a cascaded multi-tasking model
The cascaded multitasking model follows a step-by-step process. First, an image is captured from a webcam that acts as the input source. These frames are processed by the lane detection model, specifically the DeepLabv3+ algorithm27. The model analyzes the frames and generates a tracking mask that highlights the regions corresponding to the detected lanes. The YOLOv9-based stage of the cascade detects objects simultaneously: it analyzes the frames, recognizes scene items, and provides bounding box coordinates, class indices, and confidence scores for each object. The cascade model combines the outputs of the lane detection and object detection models to detect lanes and objects simultaneously and provides a visual representation of both in a single frame. Importantly, the cascade model makes intelligent judgments and controls the autonomous electric car using the combined information. It signals the Arduino board through a serial connection when a lane is found28; if no lane is found, a different signal is delivered. This lets the vehicle respond to lane availability. The cascaded model also considers stop signs and traffic signals; when these items are spotted, the model sends the Arduino instructions to trigger the corresponding actions, for example halting the car until the stop sign is out of sight. The cascaded model also helps reduce graphics processing unit (GPU) usage and achieve higher frames per second (FPS).
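A condensed sketch of this cascaded control loop is given below; the serial port, single-character command bytes, and the lane_model/object_model wrappers are illustrative placeholders rather than the exact protocol used in the prototype.

```python
import cv2
import serial

# Condensed sketch of the cascaded control loop. `lane_model` and `object_model`
# are placeholders for the trained DeepLabv3+ and YOLOv9 stages; the port name
# and single-character commands are illustrative, not the prototype's protocol.
arduino = serial.Serial("/dev/ttyACM0", 9600, timeout=1)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    lane_mask = lane_model.segment(frame)           # DeepLabv3+ stage: binary lane mask
    detections = object_model.detect(frame)         # YOLOv9 stage: boxes, classes, scores

    classes = {d["class"] for d in detections}
    if "stop sign" in classes or "traffic light" in classes:
        arduino.write(b"S")                          # priority: stop for sign/signal
    elif lane_mask.any():
        arduino.write(b"F")                          # lane found: move forward
    else:
        arduino.write(b"H")                          # no lane: halt

cap.release()
arduino.close()
```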
Comparative analysis with other multitask frameworks
To position the proposed cascaded framework relative to existing multitask approaches, a comparative analysis is provided here.
Tesla’s HydraNet serves as a prominent example of an industry-grade multi-task network. It simultaneously performs various vision tasks such as object detection, lane detection, and traffic light recognition within a unified architecture29. However, HydraNet requires significant computational resources and vast annotated datasets for training and deployment, making it best suited for high-performance hardware environments and extensive infrastructure support.
In contrast, the proposed cascaded multi-tasking model adopts a modular two-stage design: one stage specialized for lane detection and the other for object and pothole detection. This design enables each sub-model (DeepLabv3+ for lane detection and YOLOv9 for object/pothole detection) to be trained independently, simplifying dataset requirements and reducing training complexity.
Unlike end-to-end multitask frameworks that impose heavy GPU and memory demands, the proposed cascaded approach achieves efficient real-time performance (around 6 FPS) on low-power hardware like the Raspberry Pi 4b, with significantly reduced computational overhead30. This lightweight nature makes the proposed framework more practical for deployment in resource-constrained and unstructured road environments, such as those found in developing countries.
Moreover, while academic multitask frameworks often focus primarily on structured urban roads or prioritize specific tasks (e.g., lane detection alone), the proposed model uniquely integrates lanes, objects, and potholes concurrently, offering a holistic solution for the unpredictable conditions characteristic of unstructured roadways.
Thus, compared to both industrial and academic alternatives, the proposed cascaded framework offers a flexible, resource-efficient, and scalable solution specifically tailored to the demands of unstructured, real-world road scenarios.
Experimentation, results, and analysis
Experimental setup
A Raspberry Pi 4b and an Arduino Mega 2560 R3 were used to develop the autonomous electric car prototype. The datasets were created using Kaggle, Google Images, and images captured with a smartphone. YOLOv9, DeepLabv3+, the UNet encoder-decoder, SSD MobileNet V2, and VGG16 were trained, cross-validated, and then deployed on the Raspberry Pi. The car chassis, camera module, motor driver module, and power bank were combined to gather, process, and act on data. The experiments aimed to achieve accurate detection of features, including traffic signs, lanes, objects, and potholes, offering an affordable autonomous car.
Object detection model using YOLOv9
Figure 12 below depicts the traffic sign model being cross-validated on real-time videos shot on the roads of Hyderabad.
Pothole detection using YOLOv9
Figure 13 below shows the YOLOv9 model being cross-validated on real-time footage taken on Indian roads. It successfully detects all the potholes, making the proposed prototype ready for Indian roads.
Object detection using SSD
Figure 14 below shows the SSD object detection model being cross-validated on real-time videos captured on the roads of Hyderabad.
Lane detection using DeepLabv3+
Figure 15 below depicts the lane detection model being cross-validated on real-time videos captured on the roads of Hyderabad. It successfully created a white overlay on the lanes.
Performing trial runs in prototype
Figure 16 below illustrates how the prototype is successfully detecting the stop sign using the SSD object detection model.
Figure 17 below shows how the prototype is detecting the lanes using the lane detection model.
Figures 18 and 19 indicate how the cascaded model simultaneously detects both lanes and objects while prioritizing stop signs over other objects.
Figure 20 displays the prototype setup of the proposed concept.
Quantitative results from prototype trials
To substantiate the prototype’s performance, quantitative metrics were recorded during controlled field tests conducted on an internal campus road.
During a trial involving 30 on-road obstacles (including pedestrians and static objects like traffic cones), the prototype successfully detected and appropriately responded to 28 obstacles, resulting in a 93.3% obstacle detection and avoidance rate.
The vehicle encountered 10 stop signs during the trial and correctly recognized and obeyed all of them, achieving a 100% stop sign recognition rate.
For lane keeping, the prototype maintained lane alignment with an average lateral deviation of approximately 10 cm from the centerline over a 2 km route, as determined through post-analysis of recorded videos.
In the case of pothole detection, the prototype was evaluated on a road segment containing 5 actual potholes. The system successfully detected 4 potholes, missing one shallow pothole, corresponding to an 80% pothole detection rate.
Failures were also noted: two obstacles were not responded to in time, and one minor pothole was missed. These observations are being used to guide further improvements in model sensitivity and system response latency.
These quantitative results illustrate that the cascaded perception framework and control system achieved reliable real-world performance, supporting the viability of the proposed autonomous electric vehicle concept under unstructured road conditions.
Comparison of all the models
Table 1 below numerically contrasts the performance of all the models in terms of accuracy and FPS.
Relative comparison
As shown in Table 1, the YOLOv9 model achieves 92.3% accuracy at 6 FPS for object detection, whereas the VGG16 model for image classification reaches only 78.87% accuracy at 1.6 FPS.
Based on the results shown in Table 1, the VGG16 model, which performed image classification to detect traffic signs, was found to be very slow and gave lower accuracy than the YOLOv9 model when integrated with the hardware.
This direct numerical comparison clearly demonstrates YOLOv9’s superior performance over VGG16 in both detection accuracy and inference speed. The YOLOv9 model also achieved the best object detection accuracy compared to SSD MobileNetV2, can process frames at a higher FPS, and is comparatively more efficient than other existing models for pothole detection. The DeepLabv3+ model gave better results and a higher FPS than the UNet model for lane detection.
Table 2 presents the evaluation metrics of the DeepLabv3+ model as the number of samples increases. 1000 samples from the dataset were used for testing the model. As shown in the table, the evaluation metrics of the model improve as the number of samples increases, indicating its overall effectiveness in correctly classifying the images.
Table 3 lists the evaluation metrics of the UNet model. 1000 lane samples from the dataset were used for evaluating the model. As the number of samples increases, the model’s evaluation metrics improve, demonstrating its efficacy in classifying the images; however, the results were not better than those of the DeepLabv3+ model.
Table 4 provides the evaluation metrics of the VGG16 model. 2000 samples of road signs were used for assessment. Although slight improvements are observed with more samples, overall accuracy and speed remained lower compared to YOLOv9, highlighting the need for more robust solutions for real-time object detection.
Figure 21 visually compares all the image classification models. It further supports the conclusion that DeepLabv3+ outperformed the other evaluated models and proved to be the best image classification model among them.
The Confusion Matrix of all the image classification models is plotted in Fig. 22. It was observed that DeepLabv3+ had the least false positives and false negatives compared to the VGG16 and UNet models.
Equations (10), (11), (12) and (13) denote the mathematical formulas for the F1 score, precision, recall, and accuracy, respectively. These metrics are used to assess the performance of the models, and performance improves as the F1 score approaches 1. Precision is the ratio of true positives to the sum of true positives and false positives, while recall is the ratio of true positives to the sum of true positives and false negatives. The F1 score is the harmonic mean of precision and recall23.
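These metrics can be computed directly from confusion-matrix counts, as in the short sketch below; the example counts are illustrative and are not taken from the reported experiments.

```python
# Precision, recall, F1, and accuracy as defined in Eqs. (10)-(13);
# the example counts below are illustrative only.
def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

print(classification_metrics(tp=96, fp=4, tn=890, fn=10))
```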
Table 5 compares the performance metrics of the DeepLabv3+, UNet, and VGG16 models, and it is evident from the table that DeepLabv3+ produced better results for image classification.
Analysis
The results cover the developed models for sign detection, lane detection, pothole detection, and object detection, together with the cascaded multi-tasking model that combines lane detection (DeepLabv3+) with object detection (YOLOv9) so that objects and lanes are detected simultaneously, giving priority to pedestrian and stop-sign detection in a single frame. In the end, the cascaded model was deployed on the Raspberry Pi, detecting lanes and objects simultaneously at around 6 FPS. It was found that the object detection model (YOLOv9) outperformed the image classification model (VGG16) in terms of both accuracy and FPS. DeepLabv3+ surpassed the UNet algorithm for lane detection in terms of F1 score, precision, recall, and accuracy, achieving 97.2% accuracy, a precision of 0.9626, a recall of 1.0, and an F1 score of 0.98.
The superior performance of DeepLabv3+ over UNet in lane detection tasks can be attributed to key architectural differences. DeepLabv3+ employs an Atrous Spatial Pyramid Pooling (ASPP) module, which captures contextual information at multiple scales by applying dilated convolutions with different rates. This allows DeepLabv3+ to better detect lanes that vary in shape, size, and perspective, particularly in complex and unstructured environments.
Moreover, DeepLabv3+ integrates an encoder–decoder structure where the encoder extracts dense features at various resolutions, and the decoder refines these features to produce high-resolution segmentation masks31. This architectural refinement ensures better preservation of spatial information critical for lane boundaries.
In contrast, while UNet also utilizes an encoder-decoder framework with skip connections, it primarily focuses on spatial resolution recovery but lacks explicit mechanisms to aggregate multi-scale contextual information. Consequently, UNet is more prone to misclassifications in scenarios with occlusions, faded lane markings, or variable lighting, all of which are common in unstructured road conditions.
These architectural enhancements enable DeepLabv3+ to achieve higher F1 score, precision, recall, and accuracy metrics compared to UNet, as observed in the experimental evaluations. DeepLabv3+ also proved superior to the UNet and VGG16 models for image classification tasks on the same measures. The literature survey revealed that Tiny YOLOv3 achieved a mAP of 0.76 for pothole detection and Tiny YOLOv4 achieved a mAP of 0.78, whereas the YOLOv9 model built for this study achieved a higher mAP of 0.83.
Environmental robustness analysis
To evaluate the robustness of the proposed perception models under different environmental conditions, additional experiments and simulations were conducted.
When tested on sample nighttime images, the lane detection model exhibited a decrease in accuracy of approximately 10% compared to daytime conditions, primarily due to reduced visibility of lane markings under low illumination32. Object detection models, particularly YOLOv9, maintained reliable detection of larger objects such as vehicles and stop signs at night; however, smaller or dark-colored obstacles were occasionally missed33.
To simulate adverse weather, rain-effect filters were artificially applied to a subset of images. Under simulated heavy rain, the lane detection model’s segmentation confidence significantly dropped, and pothole detection accuracy also declined, as rain streaks and puddles occluded or distorted key features.
Some resilience of the models can be attributed to the data augmentation strategies applied during training, which included random brightness adjustments. These augmentations improved model adaptability to moderate lighting variations, such as those encountered during dusk or dawn.
Nevertheless, it is candidly acknowledged that the system was not explicitly trained on heavy rain or fog conditions. Handling such extreme scenarios remains challenging and is identified as an important direction for future work, potentially involving sensor fusion approaches (e.g., combining camera data with LiDAR or radar inputs) to enhance perception reliability.
This analysis provides a transparent account of the prototype’s robustness across varied environments, clarifying its strengths in moderate conditions and limitations under severe weather.
Conclusion and further work
In this research, a prototype of an autonomous electric vehicle was successfully built; it detects features using deep learning models deployed on a Raspberry Pi and controls the car through an Arduino Mega based on the features detected. Running separate models to detect different features individually consumed more GPU resources, provided lower FPS, and could not detect all features simultaneously in one frame. The cascaded multitasking model helped detect all features in one frame with good FPS and reduced GPU consumption. The study also showed that YOLOv9 is a state-of-the-art object detection model. In this work, DeepLabv3+ proved to be a better image segmentation algorithm than UNet and a better image classification algorithm than the UNet and VGG16 models. The research also highlighted the importance of selecting appropriate models for specific tasks within autonomous driving systems, particularly for pothole detection, where the greater precision achieved by YOLOv9 in this work indicates its ability to detect potholes with better accuracy. These findings contribute to the advancement of safer and more efficient vehicle navigation in autonomous driving systems.
To provide better insight into model performance, additional visualizations are included, namely a confusion matrix and a bar chart comparing key metrics such as precision, recall, and F1 score across the different models. The confusion matrix highlights how accurately each object class is detected and where potential misclassifications occur (for instance, distinguishing potholes from shadows). The bar chart facilitates an intuitive comparison of model strengths, such as YOLOv9’s superior precision and recall relative to SSD, and DeepLabv3+’s advantage over UNet in segmentation tasks. These figures complement the quantitative tables and enhance readers’ understanding of the performance differences.
Regarding the scalability of the proposed system to full-scale autonomous vehicles, it is recognized that transitioning from a prototype to a production-level AEV would necessitate several upgrades. Firstly, more powerful computing platforms such as NVIDIA Drive AGX or equivalent hardware would be required to achieve higher FPS and handle the increased data throughput from multiple high-resolution cameras. Secondly, the integration of additional sensors such as LiDAR and radar would provide redundancy and improve perception robustness, particularly in adverse conditions. The proposed modular cascaded architecture is naturally suited for such expansions, as individual modules (e.g., object detection) can be extended to handle multiple camera inputs without altering the fundamental pipeline. To operate safely at highway speeds, the models would need to process higher-resolution inputs and maintain low latency. Furthermore, safety-critical systems in real-world vehicles typically employ sensor fusion and redundancy; future iterations of the proposed system could parallelize vision-based and LiDAR-based detections to cross-validate decisions. These considerations outline a clear and feasible pathway from the current prototype toward full-scale deployment in complex real-world driving environments.
In future work, exploring hybrid models like VGG-UNet or VGG-DeepLabv3+ for image classification and image segmentation could ensure even better results. Developing a multitask learning neural network model capable of simultaneously performing multiple perception tasks, as demonstrated by Tesla’s HydraNet, holds promise for further advancements in this field.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Bimbraw, K. Autonomous Cars: Past, Present and Future - A Review of the Developments in the Last Century, the Present Scenario and the Expected Future of Autonomous Vehicle Technology. (2015). https://doi.org/10.5220/0005540501910198.
Autonomous Cars: Research Results, Issues, and Future Challenges. IEEE J. Mag. | IEEE Xplore (2019). https://ieeexplore.ieee.org/document/8457076.
Safety challenges and analysis of autonomous electric vehicle development: Insights from on-road testing and accident reports. Int. J. Sci. Eng. Appl. (2023). https://doi.org/10.7753/ijsea1206.1001
Li, Z., Khajepour, A. & Song, J. A comprehensive review of the key technologies for pure electric vehicles. Energy (2019). https://doi.org/10.1016/j.energy.2019.06.077.
State of the art and trends in electric and hybrid electric vehicles. IEEE J. Mag. | IEEE Xplore (2021). https://ieeexplore.ieee.org/abstract/document/9422914.
Li, G., Song, Z. & Fu, Q. A new method of image detection for small datasets under the framework of YOLO network. In IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC). Chongqing, China. Vol. 2018. 1031–1035. https://doi.org/10.1109/IAEAC.2018.8577214 (2018).
Gupta, A., Anpalagan, A., Guan, L. & Khwaja, A. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array (2021). https://doi.org/10.1016/j.array.2021.100057
Artificial intelligence applications in the development of autonomous vehicles: A survey. IEEE J. Mag. | IEEE Xplore (2020). https://ieeexplore.ieee.org/abstract/document/9016391.
Deep learning for safe autonomous driving: Current challenges and future directions. IEEE J. Mag. | IEEE Xplore (2021). https://ieeexplore.ieee.org/abstract/document/9284628
Alzubaidi, L. et al. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data (2021). https://doi.org/10.1186/s40537-021-00444-8
A deep learning based autonomous electric vehicle on unstructured road conditions. In IEEE Conference Publication | IEEE Xplore (2022). https://ieeexplore.ieee.org/document/9794498.
Kadu, R. K., Assudani, P. J., Jaiswal, M., Bist, D., & Tickoo, A. Road lane detection system for self-driving cars. Int. J. Next-Gen. Comput.12(5). https://doi.org/10.47164/ijngc.v12i5.466 (2021).
Yang, Y. A review of lane detection in autonomous vehicles. J. Adv. Eng. Technol.1(4). https://doi.org/10.62177/jaet.v1i4.130 (2022).
Lim, K. L., & Bräunl, T. A methodological review of visual road recognition procedures for autonomous driving applications. arXiv preprint arXiv:1905.01635 (2019).
Kim, J. & Lee, C. Robust lane detection based on convolutional neural network and random sample consensus. Neural Netw. 95, 94–102. https://doi.org/10.1016/j.neunet.2017.08.005 (2017).
Neven, D., De Brabandere, B., Georgoulis, S., Proesmans, M. & Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In IEEE Intelligent Vehicles Symposium. 286–291. https://doi.org/10.1109/IVS.2018.8500485 (2018).
Research on Lane Detection Method based on Deep Learning. In VDE Conference Publication | IEEE Xplore (2022). https://ieeexplore.ieee.org/document/9788695.
Deep learning based detection of potholes in Indian roads using YOLO. In IEEE Conference Publication | IEEE Xplore (2020). https://ieeexplore.ieee.org/document/9112424.
Park, S.-S., Tran, V.-T. & Lee, D.-E. Application of various YOLO models for computer vision-based real-time pothole detection. Appl. Sci.11(23), 11229. https://doi.org/10.3390/app112311229 (2021).
Pothole detection using computer vision and learning. IEEE J. Mag. | IEEE Xplore. https://ieeexplore.ieee.org/document/8788687 (2020).
Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey. IEEE J. Mag. | IEEE Xplore (2021). https://ieeexplore.ieee.org/document/9440863.
Real-time object detection using pre-trained deep learning models MobileNet-SSD. In Proceedings of 2020 6th International Conference on Computing and Data Engineering. ACM Other conferences. https://doi.org/10.1145/3379247.3379264
Das, S., Fime, A. A., Siddique, N. & Hashem, M. M. A. Estimation of road boundary for intelligent vehicles based on DeepLabv3+ architecture. IEEE Access 9, 121060–121075. https://doi.org/10.1109/ACCESS.2021.3107353 (2021).
Pan, X., Shi, J., Luo, P., Wang, X., & Tang, X. Spatial as deep: Spatial CNN for traffic lane detection. In AAAI Conference on Artificial Intelligence. 7276–7283 (2018).
Diwan, T., Anirudh, G., & Tembhurne, J. V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. (2022). https://doi.org/10.1007/s11042-022-13644-y
Yu, H., Che, M., Yu, H. & Zhang, J. Development of weed detection method in soybean fields utilizing improved DeepLabv3+ platform. Agronomy (2022). https://doi.org/10.3390/agronomy12112889.
Biswas, D., Su, H., Wang, C.-Y., Stevanovic, A. & Wang, W. An automatic traffic density estimation using single shot detection (SSD) and MobileNet-SSD. Phys. Chem. Earth Parts A/B/C (2019). https://doi.org/10.1016/j.pce.2018.12.001
Chen, H., Lin, H. & Yao, M. Improving the efficiency of encoder-decoder architecture for pixel-level crack detection. IEEE Access 7, 186657–186670. https://doi.org/10.1109/ACCESS.2019.2961375 (2019).
Ondruš, J., Kolla, E., Vertaľ, P. & Šarić, Ž. How do autonomous cars work? Transport. Res. Proc. (2020). https://doi.org/10.1016/j.trpro.2020.02.049.
Akimoto Junichiro, O. et al. Impacts of ride and car-sharing associated with fully autonomous cars on global energy consumptions and carbon dioxide emissions, ideas.repec.org (2022). https://ideas.repec.org/a/eee/tefoso/v174y2022ics0040162521007435.html.
Analysis of various object detection techniques for self-driving cars. In IEEE Conference Publication | IEEE Xplore (2021). https://ieeexplore.ieee.org/document/9545034
Bhagavath, P. et al. Swappable battery data management system. In AI Techniques for Renewable Source Integration and Battery Charging Methods in Electric Vehicle Applications (Eds. Angalaeswari, S. et al.). 15–36 (IGI Global, 2023). https://doi.org/10.4018/978-1-6684-8816-4.ch002
Raju, N. K. K., Khatua, A., Tarun, S. & Monica Subashini, M. Breast cancer classification using ensemble approach, machine learning and deep learning. In 2022 International Conference on Futuristic Technologies (INCOFT), Belgaum, India. 1–8 (2022). https://doi.org/10.1109/INCOFT55651.2022.10094372.
Acknowledgements
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU252227].
Funding
This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU252227].
Author information
Authors and Affiliations
Contributions
All authors contributed equally to the conceptualization, formal analysis, investigation, methodology, and writing and editing of the original draft. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.