Abstract
Ensuring safe transportation requires a comprehensive understanding of driving behaviors and road safety to mitigate traffic crashes, reduce risks, and enhance mobility. This study introduces an AI-driven machine learning (ML) framework for traffic crash severity prediction, utilizing a large-scale dataset of over 2.26 million records. By integrating human, crash-specific, and vehicle-related factors, the model improves predictive accuracy and reliability. The methodology incorporates feature engineering; clustering techniques (K-Means and HDBSCAN); oversampling methods (RandomOverSampler, SMOTE, Borderline-SMOTE, and ADASYN) to address class imbalance; and Correlation-Based Feature Selection (CFS) with Recursive Feature Elimination (RFE) for optimal feature selection. Among the evaluated classifiers, the Extra Trees (ET) ensemble model demonstrated superior performance, achieving 96.19% accuracy and an F1-score (macro) of 95.28%, ensuring a well-balanced prediction system. The proposed framework provides a scalable, AI-powered solution for traffic safety, offering actionable insights for intelligent transportation systems (ITS) and accident prevention strategies. By leveraging advanced ML and feature selection techniques, this approach enhances traffic risk assessment and enables data-driven decision-making.
Introduction
Traffic crashes remain a significant worldwide problem, killing millions of people every year and placing an extensive strain on communities. Their effects extend beyond the loss of human lives and injuries, straining economies and healthcare systems and complicating urban transportation. Road traffic crashes impose a substantial economic burden through healthcare expenses, property damage compensation, and reduced workforce productivity, making traffic safety essential for governments and urban planners. The psychological strain experienced by accident victims, together with the social effects on families and communities, demonstrates why strong crash prevention strategies need immediate implementation. Predicting traffic crashes represents a powerful approach to reducing road accidents because it enables organizations to deploy preventive measures ahead of time. Accurate prediction of traffic crashes reduces fatalities while enabling better traffic management and improved infrastructure development. Statistical models and rule-based approaches demonstrate limitations when processing large-scale, real-time crash data; they struggle to capture complex traffic dynamics because they rely on simplified assumptions and restricted feature interactions. As presented in1, the authors studied roundabout crash severity in Jordan by evaluating multiple contributing elements, including weather conditions, lighting conditions, vehicle characteristics, road geometry, and driver demographics. The research team used rule-based classifiers and RF models to discover important risk elements affecting crash severity and property-damage-only incidents, and the evidence-based findings help policymakers enhance roundabout traffic safety programs. The authors of2 studied fatal crash risks across highways, collector roads, and local roads in Thailand by applying DT, RF, XGBoost, and GB ML models. The research analyzes crash severity factors to create targeted safety measures for different road environments, addressing speeding alongside alcohol use and lighting conditions. The research in3 examines Indian expressway crash severity by studying 2747 recorded incidents from three major expressways. It utilizes Multinomial Logit (MNL), DT, and RF models to identify critical factors that affect crash severity across fatal, severe, minor, and property-damage-only (PDO) incidents, with the goal of improving traffic safety through enhanced speed enforcement, better road infrastructure design, and improved lane discipline.
ML has proven itself as an advanced predictive tool for traffic crashes through its ability to process extensive datasets, capture complex patterns, and execute real-time analysis. ML models deliver better results when analyzing traffic conditions alongside driver behavior, environmental factors, and accident-prone areas. The authors in4 studied various ANN models with different activation functions and optimizers to enhance prediction accuracy, leading them to select the ADAM optimizer and SOFTMAX activation function as the optimal pairing. The Apriori algorithm serves to uncover important factors that contribute to fatal crashes, and the ML-based method enables predictive crash analysis that informs specific road safety measures. The authors in5 utilized Decision Trees and RF classifiers to analyze and predict road traffic accidents based on existing traffic accident data. The study aims to address the homogeneity issue that affects road safety surveys and accident analysis, incorporating elements ranging from weather and lighting conditions to accident severity patterns over time in order to evaluate data mining approaches for enhancing prediction accuracy. Implementing ML-driven crash prediction still faces multiple unresolved issues that restrict its full potential. A main obstacle is the limited availability of consistent traffic accident data, which is also challenging to obtain. The large number of dimensions in crash datasets requires substantial memory capacity and processing speed for real-time implementation to be feasible. The quality of real-time accident data suffers because incomplete or missing records reduce information content and degrade model precision. The imbalance of data in crash datasets6 creates prediction bias because accident-related features rarely appear in the dataset, so models are trained primarily on non-severe crashes and become less effective at detecting severe or fatal crashes. The unbalanced class distribution causes models to become more reactive to common crash types while underperforming when detecting important yet infrequent accident scenarios. The use of inconsistent data formats that include “NA” or “null” and empty strings creates misinterpretations that deteriorate model performance. Examining crash types together with severity levels is an essential method for forecasting and minimizing fatal accidents. Classifying crashes into three severity levels with ML models enables the identification of dangerous areas and key accident patterns. Developing early-warning systems and targeted road safety measures becomes possible by analyzing impact force, vehicle speed, weather conditions, and driver behavior. Predictive models that integrate crash severity analysis enable traffic management authorities to make better decisions through safety intervention deployment, speed limit adjustments, and road infrastructure enhancement aimed at reducing fatal accidents.
To address these challenges, this study aims to analyze recent advancements in ML-based crash prediction, evaluate the effectiveness of feature selection techniques, and propose optimized modeling approaches for improving data quality, computational efficiency, and real-time crash prediction accuracy. By tackling issues related to low data availability, high dimensionality, and data inconsistency, this research contributes to developing more reliable and scalable traffic crash prediction models, paving the way for safer and more efficient transportation systems. The contributions of this paper are as follows:
- Developing an enhanced Triple Merge Dataset by integrating multiple traffic datasets, resulting in a large-scale collection of traffic records.
- Providing a comprehensive analysis of recent crash severity prediction methods.
- Applying advanced feature engineering, including temporal, environmental, and location-based attributes, to improve predictive performance.
- Conducting baseline experiments with recent ML classification models.
- Implementing K-Means and HDBSCAN clustering to segment crash locations and enhance classification performance.
- Applying oversampling techniques (RandomOverSampler, SMOTE, Borderline-SMOTE, and ADASYN) to address class imbalance and improve classification fairness.
- Utilizing CFS with RFE to optimize feature selection.
While several previous studies have individually applied data balancing techniques, clustering algorithms, or feature selection to crash prediction, our study introduces a comprehensive, multi-stage AI-driven framework that unifies these techniques into a cohesive pipeline applied to a large-scale, triple-merged dataset. This dataset combines human-related, crash-specific, and vehicle-specific features—an integration not commonly addressed together in the literature. Additionally, we evaluate the impact of clustering (K-Means, HDBSCAN) and oversampling methods (RandomOverSampler, SMOTE, Borderline-SMOTE, ADASYN) in tandem with advanced feature generation and hybrid feature selection using CFS with RFE. To our knowledge, this is among the few studies that offer an end-to-end performance comparison across all these dimensions using over 2 million records, yielding highly accurate and generalizable results for real-world crash severity prediction. To improve clarity, accessibility, and reader comprehension, Table 1 summarizes the acronyms used in this paper.
This paper is structured as follows: Section "Literature review" presents the literature review, analyzing recent methods for crash severity prediction. Section "Proposed methodology" details the proposed methodology, including data preprocessing steps. Section "Baseline ML classification" introduces baseline ML classification. Section "Feature generation" focuses on feature generation to enhance model accuracy. Section "Crash severity prediction enhancement using clustering methods" applies clustering methods (K-Means and HDBSCAN) for further prediction improvements. Section "Oversampling" explores oversampling techniques to balance the dataset. Section "Feature selection" implements feature selection using CFS and RFE. Section "Discussion" discusses the results and comparisons with recent studies. Section "Conclusion and future works" concludes the study and outlines future research directions.
Literature review
Traffic accidents create worldwide difficulties, generating substantial human casualties and significant economic consequences. The analysis and prediction of traffic crashes serve to detect high-risk elements, which helps activate safety measures and supports effective policy choices. Analyzing traffic severity using ML-based methods delivers essential findings that enhance traffic safety while minimizing fatal accident numbers. Multiple ML techniques have been broadly implemented to forecast crash severity by evaluating variables such as speed, road conditions, and driving behaviors. This review examines contemporary methods to analyze and forecast worldwide traffic crashes and their severities.
The researchers in7 introduced an accurate prediction model that addresses both the accuracy problems and the limited generalization of traditional parametric safety performance functions (SPFs). The study establishes an ML system to produce precise multi-area automobile crash predictions and identifies the most important characteristics to help traffic management departments address crashes. A hybrid feature-selection-based ML classification approach was created by the researchers in8 to achieve accurate road traffic accident injury severity prediction. This research analyzes Pakistan National Highway N-5 traffic accident data while using the Boruta Algorithm to select important attributes that serve as inputs for ML classifiers. The authors of9 evaluated different feature selection techniques for their use in disease risk prediction through ML applications. Their research provides an extensive review of feature selection techniques, presenting their benefits and limitations for predicting disease risk from patient genetic information. The authors of10 enhanced ML algorithm performance for driver behavior classification through feature selection methods. The method seeks to identify essential driver behavior elements that enhance model precision while decreasing overfitting and reducing computational needs. The research findings demonstrate that RF and KNN achieve superior classification results under these evaluation conditions.
The authors in11 established predictive ML models for analyzing road car accident factors. The research examines traffic accident statistics to determine severity and to count casualties and vehicles. The study utilizes Apache Spark big data analysis methods to process heterogeneous information while examining four main ML algorithms: DT, RF, multinomial LR, and naïve Bayes. The authors in12 present an innovative approach to predicting driver injury severity, combining the strengths of DNNs for feature extraction and RF for classification. A predictive system for road crash detection was developed by13 by analyzing real-time data on road characteristics, land areas, vehicle telemetry, driver inputs, and weather conditions under low visibility. The research fills a knowledge gap by combining ensemble models and imbalance learning algorithms to study real-time impacts on low-visibility crash prediction accuracy. The authors of14 created a real-time system for Collision Avoidance Systems (CAS) that combines driver inputs and vehicle dynamics with weather conditions alongside physiological signs and multiple real-time data sources to boost prediction accuracy. The system applies SMOTE for class imbalance treatment combined with RF and PCA as feature selection methods. As proposed in15, a prediction framework was developed to enhance road safety through lane change analysis for autonomous and connected mobility systems. The proposed framework utilizes feature learning to detect important lane change determinants while maintaining high prediction precision.
One of the recent methodologies in predicting crash severities was presented in16, where the authors proposed a framework that detects lane-changing dangers by integrating driver-specific behavioral patterns. The LGBM algorithm achieves superior accuracy in lane-changing risk prediction compared to other ML methods. Furthermore, the authors of17 established a new approach for validating and assessing the safety of perception-based CSP functions that serve as PCS activation triggers. The research develops an approach to build test scenarios through unsupervised ML, minimizing testing costs and maintaining CSP function reliability and accuracy. The authors of18 explored automobile price determinants using LASSO and stepwise selection regression algorithms. The research creates a predictive model through multiple linear regression, which utilizes selected features to help automobile manufacturers, consulting firms, and consumers better understand pricing patterns for informed decisions. In addition, the authors of19 investigated feature selection as an essential ML method that helps avoid overfitting while improving model performance. The study demonstrates that choosing the right features remains crucial for achieving better predictive models while minimizing computational complexity in ML applications.
As presented in20, highway dangerous lane change prediction is applied. A comprehensive analysis of vehicle trajectory data reveals the important elements that lead to dangerous driving behavior. The results highlight the critical impact of single-vehicle behavior, the mutual effects between changing vehicles and their neighbors, and the specific acceleration dynamics of target lane vehicles when measuring lane change dangers. The authors of21 proposed a real-time hybrid XGBoost model to overcome the limitations of current secondary crash prediction models. The model achieves better accuracy by merging estimates of primary crash, secondary crash initiation, and secondary crash occurrence probabilities. The real-time model leverages essential real-time traffic data, such as average traffic volume and occupancy, to make secondary crash predictions over short periods. As presented in22, the authors developed precise predictions regarding road accident severity within New Zealand. The research combines RF and XGBoost ML models to process recent accident data and identify important factors driving accident severity. The study uses explainable AI techniques to reveal how models predict outcomes and to identify which factors most significantly affect results, including road conditions and vehicle participation. As presented in23, the authors applied ML methods to forecast the severity levels of highway crashes throughout Saudi Arabia, focusing on the Qassim Province. The research develops three ML methods to classify crash injury severity and uses Shapley additive explanations (SHAP) to interpret factors and establish their relative importance in causing crash severity. The research seeks to support policy development by creating safety mitigation approaches that minimize traffic accidents while reducing their impact. The authors of24 investigated the elements that determine the occurrence and intensity of RTC among the productive age group (15 to 44) in Al-Ahsa, Saudi Arabia. The research reveals important road safety policy development insights by identifying driver behaviors and crash categories, aiming to minimize RTC incidents among Saudi Arabia’s young, productive population and promote highway safety. The authors of25 applied ML models to identify factors contributing to driver-related crashes on Highway 15 in Saudi Arabia. The study aims to predict crash probabilities and develop strategies to improve road safety in Saudi Arabia and similar regions by analyzing road features, traffic flow, and driver behavior. One of the recent research methodologies was presented in26, where a Transformer-based architecture is applied to estimate traffic accident driver injury severity levels. The model processes textual accident descriptions and structured accident data to make accurate human injury severity predictions from incomplete information.
Using several ML models and SHAP analysis to identify important risk factors such as alcohol use, road type, and rider characteristics, the study presented in27 sought to predict motorcycle accident injury severity and offers a framework for focused injury prevention. To address imbalanced data for traffic crash severity prediction, the authors suggested a KT-Boost algorithm enriched with SMOTE variations in28, achieving strong performance and using SHAP to highlight influential aspects including gender, age, and road conditions. The authors of29 concentrated on determining the best-performing machine learning model for predicting driver injury severity based on 693 crash cases in India, with LightGBM outperforming other models and collision type, vehicle type, and accident cause identified as important predictors. Particularly for motorcycle-involved crashes, the authors created CNN-based models in30 to assess intersection crash severity in Thailand and utilized SHAP to analyze the results, exposing important parameters including intersection type, time of day, and highway classification. Using SHAP for interpretation and emphasizing instance hardness and region of competence to improve crash severity prediction on unbalanced datasets, the authors of31 presented a Bayesian-optimized Dynamic Ensemble Selection framework incorporating classifiers such as CatBoost and LightGBM.
Although numerous studies have applied machine learning techniques to crash severity prediction, several limitations persist. Many works depend on single-source datasets, which may ignore the interaction of human, vehicle, and crash-specific elements. Others ignore data imbalance entirely or use oversampling without methodically assessing model performance. Few studies have also examined how integrated clustering or contextual feature generation could uncover latent crash trends. By means of a unified, multi-stage architecture combining three traffic-related datasets, extensive feature engineering, hybrid clustering, and targeted oversampling, this paper attempts to close these gaps. For traffic safety management, the proposed architecture and model not only raise predictive accuracy but also improve the interpretability and practical utility of model outputs. Table 2 presents a comprehensive review and comparison of significant research methodologies for analyzing and predicting crash severities.
Proposed methodology
The proposed Traffic Data Analysis Framework is a highly sophisticated and systematic pipeline designed to integrate multi-source traffic datasets, optimize feature engineering, and leverage advanced ML and clustering techniques to predict crash severities accurately. By merging data from the Traffic People, Traffic Crashes, and Traffic Vehicles datasets, the framework ensures a comprehensive and enriched feature space that captures intricate relationships between human behavior, accident characteristics, and vehicle attributes. As presented in Fig. 1, the methodology follows a multi-stage hierarchical approach, starting with data collection and preprocessing, where rigorous data cleaning, transformation, and encoding techniques are applied to ensure data integrity and consistency. The Dual Merge (Traffic People and Crashes) and Triple Merge (Traffic People, Crashes, and Vehicles) enable progressive information fusion, enhancing predictive accuracy by integrating contextual insights. Feature generation and selection refine the dataset by extracting high-impact variables, reducing dimensionality, and mitigating redundant information. During the baseline phase, eight ML models (LR, RF, XGBoost, LGBM, GB, ET, CatBoost, and DT) are trained on each merged dataset without clustering or oversampling to benchmark performance under realistic, imbalanced conditions. Standard performance metrics, including accuracy, precision, recall, F1-score, and AUC-ROC, are used to evaluate the models during training and testing. This phase clarifies the effectiveness of each algorithm, enabling informed model choices and additional optimization in later stages.
The Triple Merge dataset consistently produced the best predictive accuracy when all baseline results were compared. Consequently, feature generation is performed only on this dataset, producing new contextual features to improve crash severity prediction. The top three classifiers from the baseline phase (RF, ET, and DT) are then carried forward and combined with two clustering methods—K-Means and HDBSCAN—to discover latent spatial patterns, and the addition of cluster assignments as input features enhances model performance. To address class imbalance, oversampling techniques such as RandomOverSampler, SMOTE, ADASYN, and Borderline-SMOTE are applied in the final stage of the pipeline. However, because of the high computational cost of oversampling such a vast dataset, this step is confined to the best-performing classifier—the ET Classifier—which outperformed the other models throughout all phases, including the clustering step. This approach lets the framework preserve methodological integrity while applying oversampling in a targeted, resource-efficient way. Evaluating classifiers on imbalanced data in earlier phases also helped determine their inherent discriminative capability before the top models were chosen for deeper development. We employed RandomOverSampler as a baseline reference method that duplicates samples from the minority class without creating synthetic instances, providing a useful benchmark to evaluate the added benefit of more sophisticated algorithms. Including this simpler method allows us to compare improvements in classification performance across different resampling strategies more transparently. The framework systematically identifies the most effective predictive model for crash severity classification by integrating automated best-model selection. This multi-layered ML approach, supported by comprehensive feature engineering and data fusion, offers unparalleled accuracy and generalizability in predicting crash outcomes. The framework’s modularity allows for scalability and adaptability, making it a robust decision-support tool for traffic safety agencies, urban planners, and policymakers aiming to implement proactive accident mitigation strategies and improve road safety intelligence. This advanced predictive framework contributes to enhanced traffic incident management and fosters a data-driven safety paradigm that can be integrated into innovative transportation systems and intelligent accident prevention mechanisms.
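As a hedged illustration of the targeted oversampling stage, the sketch below compares the four resampling strategies on the training split only; it assumes the imbalanced-learn package and the train/test variables produced by the 80/20 split described later, and is not the authors' exact code.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score

samplers = {
    "RandomOverSampler": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}

for name, sampler in samplers.items():
    # Resample only the training data so no synthetic samples leak into the test set.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = ExtraTreesClassifier(random_state=42, n_jobs=-1)
    clf.fit(X_res, y_res)
    print(name, "macro-F1:",
          round(f1_score(y_test, clf.predict(X_test), average="macro"), 4))
```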
Dataset collection
The data collection process in this framework involves gathering traffic-related datasets from multiple sources to ensure a comprehensive representation of accident factors. The Traffic People Dataset32, Traffic Crashes Dataset33, and Traffic Vehicles Dataset34 are acquired from official transportation and accident reporting agencies of Illinois State in the USA. These datasets include demographic information, crash circumstances, and vehicle-specific attributes, providing a rich foundation for predictive modeling. Table 3 presents a complete description of the three traffic datasets, where the number of records and features in each dataset are identified.
As explained in Table 3, the first dataset (Traffic_People) provides detailed insights into individuals involved in traffic incidents, focusing on factors like driver behavior, safety equipment usage, injury outcomes, and demographic influences. It supports analysis of crash causes, effectiveness of safety measures (e.g., airbags, seatbelts), and emergency response efficiency. By capturing data on driver actions, physical conditions, BAC levels, and pedestrian-related factors, the dataset helps improve traffic safety policies, reduce injuries, and enhance road infrastructure planning. The second dataset (Traffic_Crashes) provides comprehensive details about traffic crashes, with nearly 900,000 entries. It includes crash identifiers (CRASH_RECORD_ID, CRASH_DATE) and environmental factors like WEATHER_CONDITION and LIGHTING_CONDITION. In addition, geographic information (LATITUDE, LONGITUDE, LOCATION) enables spatial and temporal analysis. The third dataset (Traffic_Vehicles) provides detailed information about individual vehicles involved in traffic crashes with vehicle specifics (MAKE, MODEL, VEHICLE_YEAR) and operational data (TRAVEL_DIRECTION, MANEUVER). In addition, the dataset supports descriptions of crash causes, vehicle safety, and the role of specific vehicle types or configurations in traffic incidents, particularly for commercial and hazardous material transportation. A full description of the dataset feature types, whether they are numerical or categorical, is presented in Fig. 2.
Data preprocessing
Data cleaning
The large-scale and multi-source nature of 3,784,263 records with 148 features makes data cleaning a critical preprocessing step to ensure accuracy, consistency, and completeness. Data cleaning is applied to handle missing or incorrect entries, remove duplicate records, standardize data formats, and align feature distributions across the merged datasets. This step is essential to optimize the final dataset before proceeding to the feature engineering and model training phases, leading to more reliable and interpretable crash severity predictions. The following steps explore the main parameters for each dataset before and after the data-cleaning processes:
The data cleaning stage on the Traffic datasets consists of three key steps to enhance data quality, improve model efficiency, and ensure reliable predictions. Given the dataset’s large size (3,784,263 records and 148 features), removing inconsistencies and irrelevant information is essential. The following steps are applied (a minimal pandas sketch is shown after the list):
1. Remove null features: Certain features in the dataset may contain a high percentage of missing values (NULLs), making them unusable for effective predictive modeling. Features with an excessive proportion of missing data (above 70%) are removed, as their inclusion may introduce noise and degrade model performance. This step ensures that only informative and well-populated attributes are retained for analysis.
2. Remove unimportant features to reduce dimensionality: High-dimensional datasets often contain redundant, irrelevant, or low-variance features that do not contribute significantly to predictive accuracy; such features are dropped to reduce dimensionality.
3. Remove null records: After eliminating NULL features, individual records (rows) with missing values are identified. In cases where a record has critical missing information across key features, it is removed to maintain data integrity. This step helps to prevent inaccuracies in the ML model caused by incomplete or inconsistent data points.
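The following minimal pandas sketch illustrates these three steps; the 70% null cutoff follows the text above, while treating constant (single-valued) columns as unimportant is an illustrative assumption rather than the paper's exact criterion.

```python
import pandas as pd

def clean_traffic_frame(df: pd.DataFrame, null_threshold: float = 0.70) -> pd.DataFrame:
    # 1. Remove features whose share of missing values exceeds the threshold.
    null_ratio = df.isna().mean()
    df = df.drop(columns=null_ratio[null_ratio > null_threshold].index)

    # 2. Remove low-information features (here: constant columns) to reduce dimensionality.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    df = df.drop(columns=constant_cols)

    # 3. Remove records that still contain missing values in the remaining features.
    return df.dropna()
```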
Dual merge process
The main objective of the Dual Merge process is to create a more comprehensive and enriched dataset by integrating human-related attributes from the Traffic People Dataset with incident-specific details from the Traffic Crashes Dataset. This merging process is essential for capturing the interaction between individual characteristics and accident circumstances, which plays a crucial role in predicting crash severities. By combining both datasets, the framework aims to:
1. Enhance data contextualization: The Traffic People Dataset contains demographic and behavioral information (e.g., age, gender, driver history), while the Traffic Crashes Dataset includes accident-related features (e.g., crash time, location, severity, and contributing factors).
2. Improve predictive accuracy: By integrating people-related attributes with crash data, the model can identify hidden patterns and correlations that significantly impact crash severity, ultimately improving predictive performance.
3. Establish a stronger feature representation: Merging datasets increases the number of available features, enabling a more robust feature selection and engineering process. This helps the ML models differentiate high-risk and low-risk cases more effectively.
Triple merge process
The main objective of the Triple Merge process is to create a fully integrated and enriched dataset by combining human-related attributes (Traffic People Dataset), accident details (Traffic Crashes Dataset), and vehicle-specific information (Traffic Vehicles Dataset). This step is critical for building a comprehensive data representation that enhances the model’s ability to predict crash severities with higher accuracy and reliability. By incorporating vehicle-related factors into the previously merged dataset, the framework achieves the following:
1. Complete crash contextualization: When merged with human and crash-related data, the dataset fully represents the three critical dimensions influencing traffic accidents: driver behavior, crash dynamics, and vehicle characteristics.
2. Improved feature engineering and representation: Integrating vehicle-related variables allows for more effective feature generation by identifying interactions between driver profiles, accident scenarios, and vehicle safety factors.
3. Refinement of crash severity predictions: The combination of human, crash, and vehicle attributes allows ML models to capture more complex relationships that contribute to crash severity. This helps in distinguishing high-risk cases from minor accidents with greater precision.
The overall data preprocessing stage for Traffic datasets is explored in Table 4.
Based on the data presented in Table 4, the final data size for Traffic_People is (515,685 records and 16 features), while the dataset size for the dual merge of People and Crashes datasets is (195,573 records and 53 features). Finally, the dataset size for the triple merge dataset is (2,263,315 records and 44 features) as presented in Table 5.
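To make the merging steps concrete, the sketch below shows an illustrative pandas implementation; the file names are hypothetical, and CRASH_RECORD_ID is used as the join key, as stated later in the paper.

```python
import pandas as pd

# Hypothetical file names; the actual source files are described in Table 3.
people = pd.read_csv("Traffic_People.csv")
crashes = pd.read_csv("Traffic_Crashes.csv")
vehicles = pd.read_csv("Traffic_Vehicles.csv")

# Dual merge: human-related attributes joined with crash-specific attributes.
dual = people.merge(crashes, on="CRASH_RECORD_ID", how="inner")

# Triple merge: vehicle-specific attributes added to the dual-merged records.
triple = dual.merge(vehicles, on="CRASH_RECORD_ID", how="inner")

print(dual.shape, triple.shape)
```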
Feature engineering and encoding
Feature engineering is a crucial step in the data preprocessing pipeline, aimed at enhancing predictive performance by refining, transforming, and selecting features most relevant to the target variable. For the Traffic_People dataset, the target feature is “INJURY_CLASSIFICATION,” since this dataset contains information only about the individuals involved in accidents. After applying the dual merge, additional accident-specific details are introduced, such as crash location, time, road conditions, and contributing factors, so the target feature changes to “CRASH_TYPE”. This aims to predict overall crash characteristics, incorporating human and crash-related factors. After applying the final stage of the triple merge (People + Crashes + Vehicles), the dataset becomes fully enriched with vehicle-related attributes. Despite this additional information, the target feature remains “CRASH_TYPE”, as vehicle characteristics directly impact crash outcomes but do not change the fundamental objective of predicting accident types. Including vehicle data strengthens the model’s ability to identify patterns linking driver behavior, crash conditions, and vehicle involvement, leading to more precise and reliable crash severity predictions. By applying feature engineering, transformation, and encoding, the framework maximizes data quality and ensures that predictive models can accurately capture patterns and trends in crash severity analysis. The feature engineering and encoding process for the three datasets involves transforming raw features into more informative variables that enhance ML model performance. In the Traffic_People dataset, the “AGE” feature is categorized into age groups to reduce its dimensionality, as follows:
The target feature “INJURY_CLASSIFICATION” has five classes: “NO INDICATION OF INJURY”, “NONINCAPACITATING INJURY”, “REPORTED, NOT EVIDENT”, “INCAPACITATING INJURY”, and “FATAL”. Their distribution exhibits severe class imbalance, which is a critical issue in multi-class classification. We have therefore grouped the five classes into a binary classification as follows:
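A minimal sketch of both transformations is shown below, assuming pandas and the people DataFrame from the preprocessing stage. The AGE bin boundaries are hypothetical, since the exact group edges are not reproduced here; the binary grouping follows the text (any reported injury versus no indication of injury).

```python
import pandas as pd

# Hypothetical AGE bin edges for illustration; the paper's exact group
# boundaries are not reproduced in this excerpt.
age_bins = [0, 17, 25, 40, 60, 120]
age_labels = ["<18", "18-25", "26-40", "41-60", "60+"]
people["AGE_GROUP"] = pd.cut(people["AGE"], bins=age_bins, labels=age_labels)

# Binary target: "NO INDICATION OF INJURY" versus any reported injury.
injury_mapping = {
    "NO INDICATION OF INJURY": "NO INJURY",
    "NONINCAPACITATING INJURY": "INJURY",
    "REPORTED, NOT EVIDENT": "INJURY",
    "INCAPACITATING INJURY": "INJURY",
    "FATAL": "INJURY",
}
people["INJURY_CLASSIFICATION"] = people["INJURY_CLASSIFICATION"].map(injury_mapping)
```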
For the dual merge dataset, after merging the people and crashes datasets, a crash severity index is computed on the merged data using impact speed, road conditions, and visibility:
where:

- \(Impact_{Speed}\) represents the velocity at impact.
- \(Road_{HazardFactor}\) is a risk factor (0–1 scale) based on weather, road type, and obstacles.
- \(Visibility_{Index}\) (0–1) represents clear vs. poor visibility conditions.
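The exact functional form of the index is not reproduced in this excerpt; the sketch below shows one plausible, purely illustrative way of combining the three stated components, and the weighting is an assumption rather than the authors' published formula.

```python
# Hypothetical illustration only: the index combines impact speed with a road
# hazard factor (0-1) and a visibility index (0-1); the exact formula used in
# the paper is not reproduced here.
def crash_severity_index(impact_speed: float,
                         road_hazard_factor: float,
                         visibility_index: float) -> float:
    # Higher speed and higher hazard raise the index; better visibility lowers it.
    return impact_speed * (1.0 + road_hazard_factor) * (1.0 - 0.5 * visibility_index)

print(crash_severity_index(impact_speed=60.0, road_hazard_factor=0.4, visibility_index=0.8))
```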
Data splitting
The dataset is split into 80% training data and 20% testing data across all experimental phases to ensure a robust and unbiased evaluation of the ML models. The training set trains the models, allowing them to learn patterns and relationships between features and the target variable. The testing set remains unseen during training and serves as an independent validation set to assess the model’s ability to generalize to new, unseen data. The 80:20 ratio is chosen to support better model learning and to reduce model variance.
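A minimal scikit-learn sketch of this split is shown below; the triple DataFrame and the stratification on the target are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

X = triple.drop(columns=["CRASH_TYPE"])   # feature matrix from the (assumed) merged DataFrame
y = triple["CRASH_TYPE"]                  # target variable

# 80/20 split; stratify keeps the class ratio identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
```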
Baseline ML classification
During the baseline experiments, all eight ML models (LR, RF, XGBoost, LGBM, CatBoost, GB, ET, and DT) were trained with default hyperparameters in order to evaluate their raw predictive capacity objectively.
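The sketch below illustrates such a baseline loop with default hyperparameters; it assumes the scikit-learn, XGBoost, LightGBM, and CatBoost packages and the train/test split defined earlier, and is not the authors' exact code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Encode string class labels as integers (required by some boosting libraries).
le = LabelEncoder()
y_train_enc, y_test_enc = le.fit_transform(y_train), le.transform(y_test)

baseline_models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "LGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "GB": GradientBoostingClassifier(),
    "ET": ExtraTreesClassifier(),
    "DT": DecisionTreeClassifier(),
}

for name, model in baseline_models.items():
    model.fit(X_train, y_train_enc)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test_enc, pred):.4f}, "
          f"macro-F1={f1_score(y_test_enc, pred, average='macro'):.4f}")
```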
Experiment 1: baseline results for traffic people dataset
A baseline analysis of the Traffic_People dataset included eight ML models: LR, RF, XGBoost, LGBM, CatBoost, GB, ET, and DT. LR functions as an interpretable benchmark model that provides an essential performance reference. RF, ET, and DT represent tree-based models that effectively handle structured data while identifying non-linear relationships. XGBoost, LGBM, and GB represent boosting algorithms that excel at high-accuracy tasks and feature importance analysis and work effectively with large datasets. The CatBoost model excels at dealing with unprocessed categorical features. The framework combines these models to balance simplicity and interpretability with high-performance predictive capability, which makes it appropriate for traffic crash injury classification. Based on the data description and distribution presented in Table 5 after data preprocessing, the target feature “INJURY_CLASSIFICATION” is presented in Table 6 as follows:
In highly imbalanced datasets such as35, and in the Traffic_People dataset in particular, where the majority class (NO INDICATION OF INJURY) comprises 92.36% of the records, relying solely on accuracy can be misleading. A model that predicts the majority class most of the time would appear to have high accuracy but would fail to correctly classify the minority class (INJURY), which is critical in crash severity analysis. To ensure a balanced evaluation, the macro F1-score is used, as it provides an equal-weighted average of the F1-scores of both classes, preventing the model from being biased toward the majority class. Unlike accuracy, the macro F1-score considers both precision and recall, making it particularly effective in addressing class imbalance by ensuring that the model correctly identifies non-injury cases and improves its ability to detect injury cases. A higher macro F1-score indicates that the model distinguishes between injury and non-injury cases more effectively. This leads to a fairer, more reliable, and more generalizable real-world traffic crash severity analysis model. By prioritizing the macro F1-score, the model ensures that both injury and non-injury cases are treated equally, enhancing the overall efficiency and practicality of the model for decision-making in traffic safety. As presented in Table 7, the overall results of experiment 1 on the Traffic_People dataset are explored.
Based on the results presented in Table 7, the highest three ML models based on F1-Score (Macro) are ET with 60.26%, RF with 59.99%, and XGBoost with 59.60%, indicating their relative effectiveness in handling the imbalanced Traffic People Dataset. However, despite these models achieving the highest F1-scores among the tested classifiers, the overall results remain low, suggesting that the models struggle to classify the minority class (INJURY cases) effectively. The ET classifier performs the best, likely due to its higher randomness in tree construction, which enhances generalization. The confusion matrix and ROC curve for the best ML model (ET Classifier) are explained in Fig. 3.
As presented in Fig. 3, the Confusion Matrix for the ET classifier shows that the model correctly classifies 98.40% of injury cases (high recall for the minority class) but struggles with the majority class, misclassifying 83.21% of non-injury cases as injuries (high false positive rate), leading to poor precision. The ROC Curve, with an AUC-ROC score of 0.72, indicates moderate predictive performance but still lacks strong discrimination between injury and non-injury cases. These results suggest that while the model effectively detects injuries, it lacks balance in handling class distributions.
Experiment 2: baseline results for dual merge dataset
Based on the results of Experiment 1, the models struggled to effectively classify the minority class (INJURY cases) due to the imbalanced nature of the dataset. The dual merge process was implemented to address this limitation by combining the Traffic_People dataset with the Traffic_Crashes dataset, integrating human-related attributes with crash-specific factors to enhance the dataset’s feature representation and improve classification performance. As a result of this merge, the target feature was changed from “INJURY_CLASSIFICATION” to “CRASH_TYPE”, now representing the overall crash severity rather than individual injury status. After performing data preprocessing as outlined in Section "Data preprocessing", the target feature “CRASH_TYPE” is presented in Table 8.
As presented in Table 9, the overall results of experiment 2 on the dual merge dataset are explored. The highest three ML models based on F1-Score (Macro) are ET with 93.27%, RF with 92.43%, and CatBoost with 89.78%, a significant improvement over Experiment 1. The ET model achieved the highest F1-score, leveraging greater feature randomness, which improved generalization and class balance, resulting in a strong AUC-ROC of 92.076%. The RF model followed closely, with an exceptionally high recall of 97.82%, making it highly effective in detecting injury-related crashes, though slightly less precise than ET. CatBoost ranked third, outperforming the other boosting models due to its efficient handling of categorical data, ensuring a well-balanced precision-recall tradeoff. Additionally, overall accuracy improved across all models, with ET achieving 94.95%, RF 94.34%, and CatBoost 92.36%, demonstrating that including crash-related attributes alongside human factors provided better context for injury classification.
Analysis of the best ML model (ET Classifier) through the confusion matrix and ROC curve is explained in Fig. 4. On the dual merge dataset, the Confusion Matrix shows better results than Experiment 1, accurately classifying 98.09% of injury cases while correctly identifying 86.06% of non-injury cases, leading to improved performance across both classes. The false positive rate (13.94%) has improved, indicating superior precision in crash-type discrimination. The ROC Curve, with an AUC-ROC score of 0.98, demonstrates high discriminative ability between injury and non-injury cases. The dual merge approach strengthened classification precision and generalization ability, making ET a highly effective model for predicting crash severity.
Experiment 3: baseline results for triple merge dataset
In Experiment 3, the Triple Merge process integrates the Traffic People Dataset with the Traffic Crashes Dataset and the Traffic Vehicles Dataset, following the success of the Dual Merge in Experiment 2. This final merging step brings vehicle-specific elements into the model, such as safety features, mechanical conditions, and prior crash involvement, because these variables influence crash severity outcomes. The objective of Experiment 3 is to refine model accuracy, recall, and F1-score through enriched data that describes the interrelationships between human elements, crash mechanics, and vehicle characteristics. By integrating vehicle-related information, the model can achieve better generalization, enhance injury risk prediction, and generate a more detailed traffic crash severity analysis. After merging the Traffic_People, Traffic_Crashes, and Traffic_Vehicles datasets, the target feature remains “CRASH_TYPE”, as vehicle-related attributes continue to influence crash severity. However, the total number of records significantly increased to 2,263,315 after the final merge, providing a more extensive and comprehensive dataset for analysis, as shown in Table 10. The merge process uses “CRASH_RECORD_ID” as the standard key across all three datasets, ensuring proper alignment and integration of human, crash, and vehicle-related features. This expanded dataset enhances feature richness and predictive power, allowing models to better understand the relationships between driver behavior, crash dynamics, and vehicle safety factors.
To explore the overall pipeline of the Triple Merge dataset, Fig. 5 provides a step-by-step visualization of the methodological framework used for crash severity prediction. The Triple Merge dataset is applied to baseline experiments that employ eight ML models (LR, RF, XGBoost, LightGBM, CatBoost, GB, ET, and DT) to assess initial performance without balancing. This evaluation identifies the top three classifiers, which are chosen for further work. Feature generation is then used to enrich the dataset with new, context-rich variables. The three classifiers (RF, ET, and DT) are combined with two clustering algorithms (K-Means and HDBSCAN) to organize the data into meaningful patterns that enhance learning. The best-performing model from this stage is then chosen for oversampling to address class imbalance. Finally, feature selection is performed to refine the model by retaining only the most relevant features for prediction.
As presented in Table 11, the highest three models based on F1-Score (Macro) for the Triple Merge dataset are ET with 93.697%, RF with 93.071%, and DT with 91.486%, demonstrating a substantial improvement over Experiment 2 in both classification accuracy and overall model performance. Including vehicle-related attributes in the Triple Merge dataset enhanced predictive capabilities by capturing complex interactions between driver characteristics, crash conditions, and vehicle safety features. The ET classifier achieved the highest F1-score and accuracy with 94.95%, excelling in generalization and class balance, making it the most effective classifier. The RF classifier followed closely with 94.42% accuracy, benefiting from strong recall at 97.70%, ensuring more injury-related crashes were correctly identified. The DT classifier also improved significantly, reaching an accuracy of 92.96%, with strong precision (95.19%) and recall (94.85%). Compared to Experiment 2, where class balance had already improved, the Triple Merge dataset enhanced accuracy by providing a more balanced distribution between the majority and minority classes, reducing misclassification bias. These results confirm that integrating vehicle-related data significantly improves model reliability, with ensemble tree-based models proving the most effective for highly accurate crash severity classification.
As presented in Fig. 6, the Confusion Matrix for the ET classifier in Experiment 3 shows 98.42% of injury cases correctly classified (high recall) and 86.50% of non-injury cases correctly identified, indicating a strong balance between sensitivity and specificity. However, the false positive rate remains at 13.50%, suggesting that while the model effectively detects injuries, some non-injury cases are still misclassified. The ROC Curve, with an AUC-ROC score of 0.94, shows a slight reduction compared to Experiment 2 (AUC-ROC = 0.98). The possible reason for the reduction in AUC-ROC between Experiment 3 and Experiment 2 is that adding vehicle-related features in the Triple Merge dataset introduced more complexity and feature interactions, which may have added noise or redundancy that slightly impacted the classifier’s ability to separate the classes ideally. Additionally, while class balance improved, including more diverse features may have increased overlap in decision boundaries, leading to a marginal drop in overall discrimination capability. Despite this reduction, the model still maintains strong predictive performance, confirming the effectiveness of the Triple Merge dataset in improving overall classification accuracy.
To ensure the statistical reliability and robustness of the baseline classification results, the evaluation metrics were computed over several runs using different random seeds, allowing the variance in model performance to be captured. Along with mean accuracy, precision, recall, F1-score, and AUC-ROC, the reported results also include standard deviations across trials. Moreover, paired t-tests and other statistical significance tests were used to evaluate whether observed performance differences between the top-performing models (ET, RF, and DT) were meaningful. This comparative framework supports the reproducibility of the framework and helps increase the credibility of model selection decisions.
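A hedged sketch of this robustness check is given below; the number of seeds, the compared model pair, and the use of scipy's paired t-test are illustrative assumptions rather than the exact experimental protocol.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

et_scores, rf_scores = [], []
for seed in range(5):                      # number of repeated runs is illustrative
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y)
    for model, scores in [(ExtraTreesClassifier(random_state=seed), et_scores),
                          (RandomForestClassifier(random_state=seed), rf_scores)]:
        model.fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te), average="macro"))

print(f"ET macro-F1: {np.mean(et_scores):.4f} ± {np.std(et_scores):.4f}")
print(f"RF macro-F1: {np.mean(rf_scores):.4f} ± {np.std(rf_scores):.4f}")
print("Paired t-test (ET vs. RF):", ttest_rel(et_scores, rf_scores))
```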
As explored in Table 12, the baseline experiment results show a consistent improvement in classification performance as datasets were merged, with the Triple Merge dataset achieving the highest accuracy and F1-score (macro) across all models. The ET classifier consistently recorded the best results in all three experiments, demonstrating its superior generalization and robustness in handling large datasets. ET achieved 92.23% accuracy with an F1-score of 60.263% on the Traffic_People dataset, improved significantly to 94.95% accuracy with an F1-score of 93.265% on the Dual Merge dataset, and reached its highest performance on the Triple Merge dataset with 94.95% accuracy and an F1-score of 93.697%. These results confirm that ET is the most effective classifier for crash severity prediction, benefiting from the progressive enrichment of features across all dataset configurations.
Feature generation
Handling a large-scale dataset with 2,263,315 records and 44 features presents significant challenges in computational complexity, model training efficiency, and feature selection36,37. With such vast data, ensuring optimal preprocessing, memory management, and algorithm scalability becomes critical. The dataset contains diverse attributes from three merged sources, requiring advanced feature engineering techniques to extract the most relevant patterns. Balancing model accuracy, training time, and generalization is essential, as overfitting, processing overhead, and feature redundancy can negatively impact predictive performance. Feature generation is therefore a vital step, producing new, meaningful features that boost the predictive accuracy of ML models. The Triple Merge experiment benefits from engineered features that help the model better understand complex variable relationships beyond its existing human, crash, and vehicle-related attributes. Feature generation supplies new and transformed features and reduces redundancy, which helps models detect patterns more effectively, reduce bias, and enhance generalization. This step plays a crucial role in improving classification accuracy and F1-score and in minimizing false positives and false negatives, thus producing a more dependable and interpretable crash severity prediction model.
To enhance the performance of the Triple Merge experiment, several new features were generated to provide richer contextual insights into crash severity predictions. From CRASH_DATE, additional time-based features such as HOUR, DAY, MONTH, DAY-OF-WEEK, SEASON, PEAK-HOUR, and WEEKEND were extracted to capture temporal patterns and traffic density variations. From VEHICLE_YEAR, a new feature VEHICLE_AGE was derived to assess the impact of vehicle condition on crash severity. Additionally, using CRASH_TYPE and DAMAGE, the DAMAGE_LEVEL feature was created to provide a more refined representation of vehicle impact severity. Furthermore, by utilizing LONGITUDE, LATITUDE, and CRASH_DATE, an external API (http://api.zippopotam.us) was used to fetch the ZIP_CODE of the crash location, while historical weather conditions at the time of the crash were predicted based on geolocation and time data. These newly generated features significantly enrich the dataset, allowing ML models to capture more complex interactions and improve overall prediction accuracy. The following Pseudocode explains the process of extracting the ZIP_CODE of the crash record and then predicting the WEATHER condition from the ZIP_CODE, LONGITUDE, LATITUDE, and CRASH_DATE.
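The original pseudocode figure is not reproduced in this excerpt; the following Python sketch outlines the described workflow as a hedged illustration. lookup_zip_code() and estimate_weather() are hypothetical stand-ins for the external ZIP-code lookup (the text names http://api.zippopotam.us) and the historical weather estimation, and no exact API signature is implied.

```python
import pandas as pd

def lookup_zip_code(latitude: float, longitude: float) -> str:
    """Placeholder for the reverse lookup of the crash location's ZIP code
    via an external geocoding service (hypothetical helper)."""
    raise NotImplementedError("replace with the external API call")

def estimate_weather(zip_code: str, latitude: float, longitude: float,
                     crash_date: pd.Timestamp) -> str:
    """Placeholder for estimating the historical weather condition at the
    crash time and place (hypothetical helper)."""
    raise NotImplementedError("replace with a historical-weather source")

def enrich_with_zip_and_weather(df: pd.DataFrame) -> pd.DataFrame:
    # Derive ZIP_CODE from the coordinates, then estimate WEATHER from
    # ZIP_CODE, LONGITUDE, LATITUDE, and CRASH_DATE, as described in the text.
    df = df.copy()
    df["ZIP_CODE"] = [lookup_zip_code(lat, lon)
                      for lat, lon in zip(df["LATITUDE"], df["LONGITUDE"])]
    df["WEATHER_ESTIMATED"] = [estimate_weather(z, lat, lon, d)
                               for z, lat, lon, d in zip(df["ZIP_CODE"], df["LATITUDE"],
                                                         df["LONGITUDE"], df["CRASH_DATE"])]
    return df
```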

Based on the previous pseudocode for estimating weather from the CRASH_DATE and locations of traffic crashes, we generated a simulated map of Illinois State in the USA, as presented in Fig. 7, where crash positions are identified by red and blue points such that:
- Red points: indicate higher temperatures (warmer areas). The deeper the red color, the hotter the temperature at that location.
- Blue points: indicate lower temperatures (colder areas). The deeper the blue color, the colder the temperature at that location.
Incorporating the ZIP_CODE and weather-related elements into the crash severity prediction system aims to improve the spatial and environmental contextualization of crash events. The analysis of localized social and infrastructure vulnerabilities made possible by the ZIP code helps to better understand how injury and death outcomes in motor vehicle crashes are influenced38. Likewise, including weather conditions helps capture external risk factors influencing crash severity, especially when their accuracy is ensured by statistical and geospatial validation methods39. These characteristics improve the model’s capacity to represent real-world crash dynamics more accurately.
To summarize the feature generation step, we applied a thorough feature generation approach to improve the predictive capability of the ML models. Ten new features were created from pre-existing characteristics to provide deeper temporal, geographic, and contextual understanding, as Table 13 shows. By encoding elements including time of day, vehicle age, impact severity, and exact crash location, these engineered features seek to expose latent trends connected to crash severity. This expanded feature space enables more precise classification and stronger model performance.
As explained in Table 14, after applying the feature generation step on the Triple Merge dataset, the number of features in the newly generated dataset increased from 44 to 54, reflecting the newly generated features that enhance the dataset’s predictive capabilities.
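A minimal sketch of the time- and vehicle-based derivations listed above is shown below; the peak-hour windows and the season mapping are assumptions for illustration, and the API-based ZIP/weather features follow the earlier sketch.

```python
import pandas as pd

def add_generated_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    dt = pd.to_datetime(df["CRASH_DATE"])

    # Temporal features derived from CRASH_DATE.
    df["HOUR"] = dt.dt.hour
    df["DAY"] = dt.dt.day
    df["MONTH"] = dt.dt.month
    df["DAY_OF_WEEK"] = dt.dt.dayofweek
    df["WEEKEND"] = (df["DAY_OF_WEEK"] >= 5).astype(int)
    df["SEASON"] = df["MONTH"].map({12: "winter", 1: "winter", 2: "winter",
                                    3: "spring", 4: "spring", 5: "spring",
                                    6: "summer", 7: "summer", 8: "summer",
                                    9: "fall", 10: "fall", 11: "fall"})
    df["PEAK_HOUR"] = df["HOUR"].isin([7, 8, 9, 16, 17, 18]).astype(int)  # assumed windows

    # Vehicle age from VEHICLE_YEAR (assumed to be numeric).
    df["VEHICLE_AGE"] = dt.dt.year - df["VEHICLE_YEAR"]
    return df
```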
In this phase, we comprehensively evaluate the impact of feature generation on model performance by applying the three best-performing ML classifiers from Experiment 3—ET, RF, and DT—to the Triple Merge dataset after feature generation. These experiments aim to assess whether the enriched dataset improves classification accuracy, enhances generalization, and reduces misclassification errors, thereby further refining the predictive capabilities of the selected classifiers in traffic crash severity analysis. As presented in Table 15, the results obtained after applying feature generation on the Triple Merge dataset show an overall improvement in model performance, particularly in accuracy, precision, and F1-score (macro). The top-performing model remains the ET classifier, achieving the highest accuracy (95.17%) and F1-score (93.97%), confirming its superior ability to handle large feature spaces and capture complex interactions within the dataset. The RF classifier follows closely with an accuracy of 94.53% and an F1-score of 93.21%, maintaining its strong performance, especially in recall (97.76%), which indicates its effectiveness in correctly identifying injury-related crashes. While slightly lower in accuracy (93.01%) and F1-score (91.54%), the DT classifier still demonstrates a balanced precision-recall tradeoff, highlighting its ability to classify crash severity with reasonable effectiveness.
As explored in Fig. 8, the Confusion Matrix for the ET classifier after feature generation demonstrates a further improvement in classification performance, with 98.51% of injury cases correctly identified (high recall) and 87.00% of non-injury cases accurately classified, reflecting a strong balance between sensitivity and specificity. The false positive rate has slightly decreased to 13.00%, showing that the model has become more precise in distinguishing crash severity levels. The ROC Curve, with an AUC-ROC score of 0.99, shows highly improved performance compared to previous experiments. This enhancement is directly attributed to the newly generated features, which provided richer contextual information about time, vehicle condition, crash impact, and environmental factors. The significant increase in AUC-ROC indicates that the model’s ability to distinguish between injury and non-injury crashes has substantially improved, demonstrating the effectiveness of feature generation in refining the ET classifier’s predictive capabilities.
Crash severity prediction enhancement using clustering methods
In this stage, two clustering methods, K-Means and HDBSCAN, are applied to the Triple Merge dataset to improve the performance of the RF, ET, and DT classification models. The primary purpose of clustering is to organize crash instances into groups according to their underlying patterns so that models can identify similarities and variations between groups. HDBSCAN automatically determines clusters of different densities, making it suitable for data of varying complexity, while K-Means divides the data into a predefined number of clusters. Through this clustering preprocessing, classification models receive better training conditions within homogeneous groups, producing cleaner boundaries and leading to improved prediction accuracy, better recall rates, and enhanced model generalization in crash severity classification.
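The sketch below illustrates how cluster labels can be appended as input features; k = 3 for K-Means follows the configuration reported later in this section, while the use of the spatial coordinates as clustering input, the hdbscan package, and its minimum cluster size are assumptions for illustration.

```python
from sklearn.cluster import KMeans
import hdbscan

# Cluster on the crash coordinates (assumed clustering input).
coords = triple[["LATITUDE", "LONGITUDE"]].dropna()

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
triple.loc[coords.index, "KMEANS_CLUSTER"] = kmeans.fit_predict(coords)

hdb = hdbscan.HDBSCAN(min_cluster_size=500)                              # assumed parameter
triple.loc[coords.index, "HDBSCAN_CLUSTER"] = hdb.fit_predict(coords)    # -1 marks noise

# The cluster labels are then used as additional input features for RF, ET, and DT.
```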
As presented in Table 16, integrating K-Means and HDBSCAN clustering with the Triple Merge dataset further enhanced classification performance across all models. The ET classifier consistently achieved the highest accuracy and F1-score, reinforcing its position as the best-performing model. Under K-Means clustering, the ET classifier recorded an accuracy of 95.46% and an F1-score of 94.34%, while HDBSCAN slightly improved ET’s performance to 95.49% accuracy and an F1-score of 94.37%, demonstrating the impact of density-based clustering on refining class separation. The RF classifier also showed strong results, achieving 94.81% accuracy with HDBSCAN and 94.74% with K-Means, indicating that clustering helped optimize feature distribution and improve classification boundaries. The DT classifier exhibited a moderate increase in performance, with accuracy improving to 93.35% under HDBSCAN, showing that clustering also helps reduce misclassification errors in more interpretable models. Overall, HDBSCAN outperformed K-Means in most cases, particularly in F1-score and recall, suggesting that its flexibility in detecting variable-density clusters contributed to better class separation. These results confirm that applying clustering before classification enhances model accuracy and generalization, particularly for ensemble-based classifiers such as ET and RF.
As presented in Fig. 9, the confusion matrices for the ET classifier with K-Means and HDBSCAN clustering demonstrate an overall improvement in classification performance, with both methods significantly enhancing the model’s ability to distinguish between injury and non-injury crashes. With K-Means clustering, the model achieved a true positive rate of 98.69%, correctly identifying most injury cases, but misclassified 12.41% of non-injury cases as injuries, indicating a slight bias toward predicting injuries. HDBSCAN clustering yielded slightly better results, with a higher true positive rate (98.72%) and a lower false negative rate (1.28%), suggesting improved accuracy in classifying injury cases. Additionally, the false positive rate decreased slightly to 12.40%, indicating a better overall balance. The comparison between the two clustering methods shows that HDBSCAN marginally outperformed K-Means, likely because its ability to detect variable-density clusters makes it more effective at capturing localized crash patterns. Despite these improvements, the false positive rate remains above 12%, suggesting that further refinements in feature selection or clustering parameters may be needed to enhance precision. Accordingly, HDBSCAN is the preferred clustering approach for improving the ET classifier’s predictive capability in crash severity classification.
Across the six clustering experiments, K-Means clustering was applied with a fixed number of clusters (k = 3) in all three configurations involving the RF, ET, and DT classifiers. In contrast, HDBSCAN was used with its inherent ability to determine the number of clusters automatically from data density, without specifying a fixed cluster count. Internal clustering validation metrics, such as the silhouette score, were not explicitly computed or reported in any of the experiments. Instead, the focus was on evaluating the added value of cluster-based features through standard classification performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC.
Clustering was integrated in this paper to enrich the dataset with latent spatial patterns that might enhance crash severity prediction, not to obtain an optimal unsupervised segmentation. The effectiveness of clustering was confirmed indirectly through the performance of the downstream machine learning models, since it functioned as a feature engineering enhancement rather than a standalone analysis goal. Therefore, traditional internal validation measures were not used; the influence on classification results was the ultimate criterion of clustering usefulness.
Oversampling
To further enhance the classification performance of the Triple Merge dataset, oversampling techniques were applied to address class imbalance in the target feature “CRASH_TYPE” and improve the model’s ability to correctly classify injury-related crashes. Since the ET classifier was identified as the best-performing classifier across all experiments, it was selected as the base model for evaluating the impact of oversampling. Oversampling is a crucial technique in handling imbalanced datasets40 as it aims to balance the dataset by increasing the representation of the minority class (injury cases), reducing misclassification bias, and improving recall and F1-score. In this study, four different oversampling methods were implemented:
1. RandomOverSampler: Generates duplicate instances of the minority class to balance the dataset.
2. SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic data points by interpolating between existing minority class samples.
3. BorderlineSMOTE: Focuses on generating synthetic samples near the decision boundary, improving the model’s ability to distinguish between classes.
4. ADASYN (Adaptive Synthetic Sampling): Generates synthetic samples based on the density of the minority class, prioritizing areas where the class imbalance is more severe.
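For reference, a minimal sketch of applying these four samplers with the imbalanced-learn library is shown below; the function name, default parameters, and random seed are illustrative only and not taken from the study’s code.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

def resample_all(X_train, y_train):
    """Apply the four oversamplers to the training split (illustrative defaults)."""
    samplers = {
        "RandomOverSampler": RandomOverSampler(random_state=42),
        "SMOTE": SMOTE(random_state=42),
        "BorderlineSMOTE": BorderlineSMOTE(random_state=42),
        "ADASYN": ADASYN(random_state=42),
    }
    # Each entry maps the technique name to a balanced (X_resampled, y_resampled) pair.
    return {name: s.fit_resample(X_train, y_train) for name, s in samplers.items()}
```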
The ET classifier—identified as the best performer—was fine-tuned using Grid Search with fivefold cross-validation to optimize key hyperparameters, including the number of estimators, maximum depth, and feature selection strategy, in the later phases involving clustering and oversampling. To guarantee a robust performance assessment across imbalanced classes, the evaluation metrics comprised accuracy, precision, recall, F1-score, AUC-ROC, and confusion matrices.
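A hedged sketch of such tuning with scikit-learn’s GridSearchCV is given below; the parameter grid values are placeholders, since the exact search space is not reported here.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 400],   # number of trees (placeholder values)
    "max_depth": [None, 20, 40],       # maximum tree depth (placeholder values)
    "max_features": ["sqrt", "log2"],  # feature selection strategy at each split
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_grid,
    cv=5,               # fivefold cross-validation
    scoring="f1_macro",
    n_jobs=-1,
)
# search.fit(X_train, y_train) would then expose the tuned model as search.best_estimator_.
```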
The objective of the oversampling step is to increase the representation of injury-related crash cases. This allows the classification models to generalize better across both classes, reduce misclassification errors, and improve recall and F1-scores, which are crucial for evaluating model performance in imbalanced settings. Before applying oversampling, stratified k-fold cross-validation with k = 5 was conducted to ensure a fair distribution of both classes in each training and validation fold. This stratification ensures that the model’s performance is evaluated across multiple training subsets, reducing the risk of overfitting and providing a more robust and reliable assessment of classification accuracy. By integrating oversampling with stratified k-fold cross-validation, the approach improves class balance and enhances model generalization, leading to better predictive performance in crash severity classification.
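The sketch below illustrates the intended evaluation protocol under stated assumptions: oversampling (SMOTE here) is applied only to the training portion of each stratified fold so the validation fold remains untouched. It is a simplified approximation of the procedure described above, assuming NumPy arrays for X and y.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

def evaluate_with_oversampling(X, y, n_splits=5):
    """Stratified k-fold evaluation with SMOTE applied inside each training fold only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        # Resample the training fold; the validation fold keeps its original distribution.
        X_res, y_res = SMOTE(random_state=42).fit_resample(X[train_idx], y[train_idx])
        model = ExtraTreesClassifier(random_state=42).fit(X_res, y_res)
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx]), average="macro"))
    return float(np.mean(scores))
```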
Table 17 presents the results of applying the four oversampling techniques to the best-performing ET classifier, demonstrating a significant improvement in classification performance, particularly in recall, F1-score (macro), and AUC-ROC, and confirming the effectiveness of synthetic data generation in balancing the dataset. SMOTE achieved the highest F1-score (94.57%) and strong recall (93.63%), making it the best-performing technique, as it generates synthetic samples by interpolating between existing minority class instances, improving model generalization. BorderlineSMOTE performed similarly (F1-score = 94.57%), demonstrating that placing synthetic samples near decision boundaries enhances classification effectiveness. ADASYN achieved the highest AUC-ROC score (99.20%), indicating excellent discrimination between injury and non-injury cases, though its F1-score of 94.56% was slightly lower, suggesting that some synthetic samples might introduce minor noise. RandomOverSampler, while improving recall, performed the weakest of the four techniques (F1-score = 93.99%), as it duplicates existing samples rather than generating new patterns. Overall, oversampling significantly enhanced classification performance, with SMOTE and BorderlineSMOTE emerging as the most effective techniques for crash severity classification, ensuring better generalization and improved predictive capability when applied to the ET classifier.
Figure 10 shows the confusion matrices for the four oversampling techniques, revealing a clear improvement in handling class imbalance, particularly by reducing false positives while maintaining high recall. These results confirm that oversampling successfully enhanced class balance, reducing bias toward the majority class and improving overall classification accuracy.
Feature selection
As the final step in optimizing the Triple Merge dataset, a Feature Selection process was applied to enhance model performance by removing redundant and less informative features, improving efficiency, and increasing classification accuracy. Feature selection aims to identify the most relevant features contributing to crash severity prediction while eliminating unnecessary attributes that may introduce noise or computational overhead. We employed a hybrid approach combining CFS with RFE to achieve this. The CFS evaluates feature dependencies, selecting attributes that maximize relevance to the target variable while minimizing redundancy among themselves. Meanwhile, RFE iteratively eliminates the least important features, ensuring that only the most significant predictors remain. This combined approach offers several advantages, including enhanced model interpretability, reduced overfitting, and improved computational efficiency. By eliminating irrelevant or highly correlated features, the classification models trained on the Triple Merge dataset experience a boost in accuracy, F1 score, and overall stability, making the crash severity prediction process more reliable and robust.
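The sketch below approximates this hybrid filter-wrapper idea: a simple correlation-based filter (standing in for CFS) removes highly inter-correlated features, and scikit-learn’s RFE then retains the strongest predictors using an Extra Trees estimator. The 0.9 correlation threshold and the target feature count are illustrative assumptions, not the study’s reported settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

def hybrid_select(X: pd.DataFrame, y, n_features=37, corr_threshold=0.9):
    """Correlation-based redundancy filter followed by RFE (illustrative sketch)."""
    # Step 1: drop one feature from each highly inter-correlated pair (redundancy filter).
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    X_filtered = X.drop(columns=to_drop)
    # Step 2: recursively eliminate the least important remaining features.
    rfe = RFE(ExtraTreesClassifier(random_state=42), n_features_to_select=n_features)
    rfe.fit(X_filtered, y)
    return X_filtered.columns[rfe.support_].tolist()
```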
As shown in Fig. 11, a performance curve depicts the relationship between the number of selected features and the ET classifier’s classification accuracy. Accuracy improves as the number of selected features increases, reaching peak performance between 35 and 40 features (above 0.96 accuracy). Beyond this point, accuracy declines slightly, indicating diminishing returns and potential redundancy. This highlights the importance of selecting an optimal feature set to balance model complexity and predictive performance.
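A curve of this kind can be reproduced, under assumed settings, with scikit-learn’s RFECV, which records a cross-validated score for every candidate feature count; the snippet below is a sketch, not the study’s exact procedure.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Cross-validated accuracy at each number of retained features (eliminating one per step).
selector = RFECV(ExtraTreesClassifier(random_state=42), step=1, cv=5, scoring="accuracy")
# After calling selector.fit(X, y):
#   selector.cv_results_["mean_test_score"] gives the accuracy curve (scikit-learn >= 1.0),
#   selector.n_features_ gives the feature count at the peak of that curve.
```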
Table 18 lists the top-10 feature subsets with their accuracies. The highest accuracy was achieved with 37 features (96.19%), indicating that this feature set provided the best balance of information and generalization. As the number of features decreased slightly, accuracy remained high but fluctuated, with 36 and 35 features achieving 96.14% and 96.06% accuracy, respectively.
As presented in Table 19, the 37-feature set achieved the highest accuracy (96.19%) with strong performance across all metrics: precision of 95.98%, recall of 98.77%, and F1-score of 95.29%. The AUC-ROC score of 99.28% indicates excellent classification ability. This feature selection improved model efficiency by reducing redundancy while maintaining high accuracy and generalization.
Figure 12 shows the confusion matrix for the best-performing 37-feature set. The confusion matrix demonstrates the effectiveness of the feature selection process in refining classification performance on the Triple Merge dataset. The model achieves a high true positive rate (98.77%), correctly identifying most instances of the positive class, while maintaining a relatively low false positive rate (10.08%). The false negative rate (1.23%) is minimal, indicating that the model rarely misclassifies positive instances as negatives. These results confirm that the dataset imbalance has been successfully addressed, leading to an optimal balance between the two classes and enhancing overall classification accuracy.
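For completeness, the rates quoted above can be derived from a binary confusion matrix as in the brief sketch below; the function name is illustrative.

```python
from sklearn.metrics import confusion_matrix

def severity_rates(y_true, y_pred):
    """Return (TPR, FPR, FNR) for a binary crash-severity prediction task."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # true positive rate (recall), ~98.77% in Fig. 12
    fpr = fp / (fp + tn)  # false positive rate, ~10.08% in Fig. 12
    fnr = fn / (fn + tp)  # false negative rate, ~1.23% in Fig. 12
    return tpr, fpr, fnr
```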
Table 20 summarizes the most important features selected by both CFS and RFE techniques. This comparison highlights the overlapping and unique contributions of each method in identifying the key variables influencing crash severity.
Discussion
The results obtained from this study demonstrate the effectiveness of the proposed ML framework for traffic crash severity prediction. The study systematically applied a multi-stage data integration approach, beginning with the Triple Merge dataset, which combines the Traffic People, Traffic Crashes, and Traffic Vehicles datasets. Integrating human, crash, and vehicle-related attributes enriched the dataset, providing a more holistic representation of accident scenarios. This comprehensive dataset facilitated improved predictive accuracy and robustness in crash severity classification. The study tackled data quality challenges by applying rigorous data cleaning, feature selection, and encoding techniques, ensuring consistency and completeness in the dataset. The progressive dataset merging process significantly enhanced predictive performance by integrating relevant attributes. At the same time, feature engineering introduced meaningful transformations such as crash severity index calculations, temporal features, and vehicle condition parameters. These feature enhancements gave the ML models richer contextual information, enabling better discrimination between crash types. The baseline ML experiments revealed that ensemble models, particularly ET, RF, and DT, consistently outperformed other classifiers. The Triple Merge dataset exhibited superior classification performance across all models, validating the hypothesis that integrating driver, crash, and vehicle-specific data enhances prediction accuracy. The best-performing model, the ET classifier, achieved an accuracy of 95.17% and an F1 score of 93.97% after feature generation, confirming its robustness in handling complex feature interactions.
Clustering techniques such as K-Means and HDBSCAN were applied to improve classification accuracy further. The results indicated that HDBSCAN outperformed K-Means in refining model performance, particularly in distinguishing between minor and severe crashes. Additionally, oversampling techniques were employed to mitigate class imbalance in the dataset. The SMOTE and Borderline-SMOTE techniques yielded the highest F1 scores, effectively improving model generalization while reducing misclassification biases. Feature selection improved model efficiency by reducing dimensionality while retaining high-impact variables. The best feature subset (Feature Set 37) achieved an accuracy of 96.19%, precision of 95.98%, recall of 98.77%, and an AUC-ROC of 99.28%, indicating optimal model performance. This step confirmed that selecting the most relevant features minimizes noise and redundancy, leading to better interpretability and computational efficiency.
An ablation study was carried out to better understand the individual contributions of the framework’s core components. Starting from the baseline model trained on the original Triple Merge dataset, we progressively combined clustering (K-Means and HDBSCAN), feature selection (CFS and RFE), and oversampling (RandomOverSampler). Each configuration’s performance was measured using accuracy, F1-score, and AUC-ROC. Clustering notably improved model generalization by capturing latent patterns in crash sites, while feature selection improved interpretability and decreased overfitting. Oversampling, applied to the best-performing classifier (ET), substantially mitigated class imbalance. Taken together, the three methods produced the best predictive performance, verifying the additive value of every pipeline component.
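A hedged outline of how such an ablation could be organized programmatically is given below; the configuration labels and the run_ablation helper are hypothetical and do not correspond to the study’s actual code, and each configuration is assumed to supply the (X, y) pair produced by its combination of preprocessing steps.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def run_ablation(configurations):
    """Score each preprocessing configuration with the same ET classifier.

    `configurations` maps a label such as 'baseline', '+clustering',
    '+clustering+selection', or '+clustering+selection+oversampling'
    to the (X, y) pair produced by that combination of steps.
    """
    results = {}
    for label, (X, y) in configurations.items():
        scores = cross_val_score(ExtraTreesClassifier(random_state=42), X, y,
                                 cv=5, scoring="f1_macro")
        results[label] = float(scores.mean())
    return results
```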
A comparative analysis with recent studies, as presented in Table 21, highlights that the proposed framework outperforms existing models in terms of predictive accuracy, feature integration, and dataset scale. Unlike previous studies that focus on single-source datasets or limited feature representations, our approach integrates multiple datasets, incorporates feature engineering, and applies advanced clustering and oversampling techniques to enhance crash severity prediction. The model’s high recall and AUC-ROC scores validate its effectiveness in real-world crash prediction applications. In conclusion, this study presents a robust and scalable ML framework for accurate crash severity prediction. Integrating multi-source datasets, feature selection, clustering, and oversampling techniques significantly enhances classification performance. These results contribute to the development of data-driven traffic safety policies and accident mitigation strategies, ensuring safer and more intelligent transportation systems.
As presented in Table 18, our model integrates multi-source traffic crash data for a broader scope. Additionally, we employ advanced feature selection (CFS-RFE), clustering (K-Means, HDBSCAN), and oversampling techniques (SMOTE, ADASYN), ensuring better balance and predictive power. While the authors of10 rely on traditional feature selection and RF/KNN, our ensemble approach (ET, RF, DT) achieves 96.19% accuracy on a much larger and more diverse dataset, making it more robust and adaptable for real-world crash prediction. The results of this study can make a real difference for stakeholders seeking to improve road safety. Traffic managers can use the model to identify where and when severe crashes are most likely to occur, helping them take targeted actions such as adjusting traffic signals, increasing patrols, or improving road design in high-risk areas. The results can also guide more effective public awareness campaigns and better traffic policies by revealing how factors such as driver behavior, vehicle type, and road conditions combine to increase crash severity. The modular design allows the framework to be integrated into smart transportation systems, supporting better planning decisions and faster responses to collisions. In short, this work offers practical tools and insights that can help communities make their roadways safer and more responsive to real-world problems.
Conclusion and future works
This study presents a comprehensive and data-driven approach to traffic crash prediction, demonstrating significant advancements in classification accuracy, feature engineering, and dataset optimization. Unlike conventional studies that rely on limited datasets and basic feature selection techniques, our model leverages an extensive dataset of more than 2.26 million traffic records, ensuring a robust and generalizable framework. The research began by constructing the Triple Merge dataset, integrating multiple data sources to enhance predictive insights. Feature generation techniques were applied to extract critical temporal, spatial, and environmental attributes, such as weather conditions, peak-hour indicators, and vehicle age, enriching the dataset with context-aware information. Clustering methods (K-Means and HDBSCAN) were introduced to further refine data representation, enabling better crash pattern segmentation. Applying oversampling techniques (SMOTE, Borderline-SMOTE, ADASYN, and RandomOverSampler) effectively mitigated class imbalance, a common limitation in real-world crash data. The feature selection phase, utilizing CFS and RFE, further optimized the dataset by removing redundant and less informative features. Among all applied classification models, the ET classifier emerged as the best-performing model, achieving an outstanding accuracy of 96.19% and surpassing recent state-of-the-art models. The superior performance of our model can be attributed to the combination of extensive feature engineering, robust dataset balancing, and advanced selection techniques, setting a new benchmark for traffic crash analysis. Although the proposed framework exhibits robust predictive capabilities, its generalizability may be restricted by the use of region-specific data, so real-time and multi-regional datasets should be incorporated in future work. Due to computational restrictions, oversampling was applied only to the top-performing model (ET); extending it to other models might provide further insights. Another limitation is the absence of contextual factors such as infrastructure, driver behavior, and weather; future datasets should therefore include them. Finally, incorporating AutoML or dynamic ensemble techniques could further improve model adaptability and efficiency.
Data availability
The experimental data supporting the findings of this study are available from the following sources: https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d/about_data, https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data, and https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3/about_data.
References
Ashqar, H. I., Alhadidi, T. I., Elhenawy, M. & Jaradat, S. Factors affecting crash severity in roundabouts: A comprehensive analysis in the Jordanian context. Transp. Eng. 17, 100261. https://doi.org/10.1016/j.treng.2024.100261 (2024).
Champahom, T. et al. Tree-based approaches to understanding factors influencing crash severity across roadway classes: A Thailand case study. IATSS Res. 48, 464–476. https://doi.org/10.1016/j.iatssr.2024.09.001 (2024).
Kumar, P., Jain, J. K. & Singh, G. Analysing crash severity on expressways in India: Statistical and machine learning models. Proc. Institut. Civil Eng. – Transp. https://doi.org/10.1680/jtran.24.00071 (2025).
Das, S., Das, C., Sarma, R., Talukdar, P., Barman, A. & Hubballi, R. Machine learning based approach for predicting the impact of time of day on traffic accidents. In Proceedings of the 2023 26th international conference on computer and information technology (ICCIT), 13–15 Dec. 2023, pp. 1–5 (2023).
Goswami, N.G., Sharma, P., Arora, A. & Singh, N. Traffic improvisation by identifying the accident locations using ML/Data mining approaches. In Proceedings of the 2022 4th international conference on advances in computing, communication control and networking (ICAC3N), 16–17 Dec. 2022, pp. 127–132 (2022).
Bazarnovi, S. & Mohammadian, A. Addressing imbalanced data in predicting injury severity after traffic crashes: A comparative analysis of machine learning models. Proc. Comput. Sci. 238, 24–31. https://doi.org/10.1016/j.procs.2024.05.192 (2024).
Pan, G., Wang, G., Wei, H., Chen, Q. & Zhang, A. Development of an automated global crash prediction model with adaptive feature selection of deep neural networks. IEEE Trans. Industr. Inf. 20, 12010–12020. https://doi.org/10.1109/TII.2024.3413355 (2024).
Zhang, S., Khattak, A., Matara, C. M., Hussain, A. & Farooq, A. Hybrid feature selection-based machine learning Classification system for the prediction of injury severity in single and multiple-vehicle accidents. PLoS ONE 17, e0262941. https://doi.org/10.1371/journal.pone.0262941 (2022).
Pudjihartono, N., Fadason, T., Kempa-Liehr, A. W. & O’Sullivan, J. M. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinform. 2, 927312. https://doi.org/10.3389/fbinf.2022.927312 (2022).
Bouhsissin, S., Sael, N., Benabbou, F. & Soultana, A. Enhancing machine learning algorithm performance through feature selection for driver behavior classification. Indonesian J. Electr. Eng. Comput. Sci. 35(1), 354–365. https://doi.org/10.11591/ijeecs.v35.i1.pp354-365 (2024).
Pourroostaei Ardakani, S. et al. Road car accident prediction using a machine-learning-enabled data analysis. Sustainability 15, 5939. https://doi.org/10.3390/su15075939 (2023).
Acı, Ç. İ., Mutlu, G., Ozen, M. & Acı, M. Enhanced multi-class driver injury severity prediction using a hybrid deep learning and random forest approach. Appl. Sci. 15. https://doi.org/10.3390/app15031586 (2025).
Elamrani Abou Elassad, Z., Elamrani Abou Elassad, D. & Mousannif, H. Imbalance-learning road crash assessment under reduced visibility settings: A proactive multicriteria decision-making system. J. Ambient Intell. Smart Environ. 16(2), 215–240. https://doi.org/10.3233/ais-230127 (2024).
Elamrani Abou Elassad, Z., Mousannif, H. & Al Moatassime, H. A real-time crash prediction fusion framework: An imbalance-aware strategy for collision avoidance systems. Transp. Res. Part C: Emerg. Technol. 118, 102708. https://doi.org/10.1016/j.trc.2020.102708 (2020).
Zhang, Y., Shi, X., Zhang, S. & Abraham, A. A XGBoost-based lane change prediction on time series data using feature engineering for autopilot vehicles. IEEE Trans. Intell. Transp. Syst. 23, 19187–19200. https://doi.org/10.1109/TITS.2022.3170628 (2022).
Zhang, Y., Chen, Y., Gu, X., Sze, N. N. & Huang, J. A proactive crash risk prediction framework for lane-changing behavior incorporating individual driving styles. Accid. Anal. Prev. 188, 107072. https://doi.org/10.1016/j.aap.2023.107072 (2023).
Putter, R., Neubohn, A., Leschke, A. & Lachmayer, R. Predictive vehicle safety—Validation strategy of a perception-based crash severity prediction function. Appl. Sci. 13, 6750. https://doi.org/10.3390/app13116750 (2023).
Selvaratnam, S., Yogarajah, B., Jeyamugan, T. & Ratnarajah, N. Feature selection in automobile price prediction: An integrated approach. In Proceedings of the 2021 international research conference on smart computing and systems engineering (SCSE), 16–16 Sept. 2021, pp. 106–112 (2021).
Pandit, A., Gupta, A., Bhatia, M. & Gupta, S. C. Filter based feature selection anticipation of automobile price prediction in azure machine learning. In Proceedings of the 2022 international conference on machine learning, big data, cloud and parallel computing (COM-IT-CON), 26–27 May 2022, pp. 256–262 (2022).
Chen, T., Shi, X. & Wong, Y. D. Key feature selection and risk prediction for lane-changing behaviors based on vehicles’ trajectory data. Accid. Anal. Prev. 129, 156–169. https://doi.org/10.1016/j.aap.2019.05.017 (2019).
Li, P. & Abdel-Aty, M. A hybrid machine learning model for predicting Real-Time secondary crash likelihood. Accid. Anal. Prev. 165, 106504. https://doi.org/10.1016/j.aap.2021.106504 (2022).
Ahmed, S., Hossain, M. A., Ray, S. K., Bhuiyan, M. M. I. & Sabuj, S. R. A study on road accident prediction and contributing factors using explainable machine learning models: analysis and performance. Transp. Res. Interdiscip. Perspect. 19, 100814. https://doi.org/10.1016/j.trip.2023.100814 (2023).
Aldhari, I. et al. Severity prediction of highway crashes in Saudi Arabia using machine learning techniques. Appl. Sci. 13, 233. https://doi.org/10.3390/app13010233 (2023).
Islam, M. K., Gazder, U., Akter, R. & Arifuzzaman, M. Involvement of road users from the productive age group in traffic crashes in Saudi Arabia: An investigative study using statistical and machine learning techniques. Appl. Sci. 12, 6368. https://doi.org/10.3390/app12136368 (2022).
Akin, D. et al. Identifying causes of traffic crashes associated with driver behavior using supervised machine learning methods: Case of highway 15 in Saudi Arabia. Sustainability 14, 16654. https://doi.org/10.3390/su142416654 (2022).
Jiang, Y. et al. Analyzing crash severity: Human injury severity prediction method based on transformer model. Vehicles 7, 5. https://doi.org/10.3390/vehicles7010005 (2025).
Santos, K., Firme, B., Dias, J. P. & Amado, C. Analysis of motorcycle accident injury severity and performance comparison of machine learning algorithms. Transp. Res. Rec. 2678, 736–748. https://doi.org/10.1177/03611981231172507 (2023).
Aziz, K., Chen, F. & Khattak, A. A novel bayesian optimized-combined kernel & tree boost approach for road traffic crash severity analysis. Int. J. Civil Eng. https://doi.org/10.1007/s40999-025-01108-x (2025).
Sorum, N. G. & Pal, D. Identification of the best machine learning model for the prediction of driver injury severity. Int. J. Inj. Contr. Saf. Promot. 31, 360–375. https://doi.org/10.1080/17457300.2024.2335478 (2024).
Sunkpho, J., Se, C., Wipulanusat, W. & Ratanavaraha, V. SHAP-based convolutional neural network modeling for intersection crash severity on Thailand’s highways. IATSS Res. 49, 27–41. https://doi.org/10.1016/j.iatssr.2024.12.003 (2025).
Aziz, K. et al. Road traffic crash severity analysis: A Bayesian-optimized dynamic ensemble selection guided by instance hardness and region of competence strategy. IEEE Access 12, 139540–139559. https://doi.org/10.1109/ACCESS.2024.3465489 (2024).
Chicago Data Portal, Traffic Crashes – People. Available online: https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d/about_data (accessed on 15 Jan 2025).
Chicago Data Portal, Traffic Crashes – Crashes. Available online: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data (accessed on 15 Jan 2025).
Chicago Data Portal, Traffic Crashes – Vehicles. Available online: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3/about_data (accessed on 15 Jan 2025).
Ogungbire, A. & Pulugurtha, S. S. Effectiveness of data imbalance treatment in weather-related crash severity analysis. Transp. Res. Rec. 2678, 88–105. https://doi.org/10.1177/03611981241239962 (2024).
Wang, Y., Xiong, R., Yu, H., Bao, J. & Yang, Z. A semantic embedding methodology for motor vehicle crash records: A case study of traffic safety in Manhattan Borough of New York City. J. Transp. Saf. Secur. 14, 1913–1933. https://doi.org/10.1080/19439962.2021.1994681 (2022).
Hossain, S. & Valles, D. Traffic safety through machine learning: A study of crash severity factors. In Proceedings of the 2024 IEEE World AI IoT Congress (AIIoT), 29–31 May 2024, pp. 0016–0023 (2024).
Jean, R. et al. The association between home and crash site social vulnerability on injury and mortality after motor vehicle crashes: Implications for traffic policy. J. Surg. Res. 302, 568–577. https://doi.org/10.1016/j.jss.2024.07.056 (2024).
Effati, M. & Atrchian, C. Considering the reliability of police-reported weather information on freeways traffic crash severity analysis: Proposing a mixed statistical and geospatial solution. Transp. Dev. Econ. https://doi.org/10.1007/s40890-024-00218-w (2024).
Morris, C. & Yang, J. J. Effectiveness of resampling methods in coping with imbalanced crash data: Crash type analysis and predictive modeling. Accid. Anal. Prev. 159, 106240. https://doi.org/10.1016/j.aap.2021.106240 (2021).
Acknowledgements
The authors thank the Deanship of Graduate Studies and Scientific Research at Jouf University.
Funding
This research paper has no funding.
Author information
Contributions
Conceptualization, A.M.M., B.A. and A.S.A.; methodology, H.A., M.E., M.K. and M.T.; data curation, A.M.M., E.H. and M.E.; formal analysis, A.S.A. and A.M.M.; investigation, H.A., M.E. and B.A.; resources, A.M.M. and A.S.A.; supervision, E.T., M.T., A.M.M. and M.E.; writing—original draft, A.M.M., M.E. and B.A.; writing—review and editing, A.M.M., M.E. and H.A. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mostafa, A.M., Aldughayfiq, B., Tarek, M. et al. AI-based prediction of traffic crash severity for improving road safety and transportation efficiency. Sci Rep 15, 27468 (2025). https://doi.org/10.1038/s41598-025-10970-7