Introduction

Speeding is a dangerous driving behavior that poses a significant risk to traffic safety and leads to fatalities and serious injuries. In the United States, speeding-related fatalities account for about a quarter of all traffic-related deaths, and approximately 30% of these fatalities result from single-vehicle crashes caused by speeding1. The issue is not limited to the US; speeding is a major contributing factor in traffic fatalities and injuries worldwide. The National Highway Traffic Safety Administration (NHTSA) defines a speeding-related crash as one in which the police officer determines that excessive speed, surpassing the posted speed limit, or engaging in a race played a role, or in which the driver was charged with a speeding-related offense1. A speeding-related fatality is defined as any death resulting from such a crash1. Figure 1 illustrates the trends in speeding-related fatalities and fatalities per 100 million vehicle miles traveled (VMT). Relative to the 9,592 speeding-related fatalities recorded in 2019, fatalities rose by 17% in 2020 (11,258 lives lost) and by 23% in 2021 (11,780 lives lost)1. These numbers equate to an average of 30 to 32 lives lost every day on US roads in 2020-21, lives that could have been saved through practical countermeasures. Speeding therefore poses a serious safety challenge, for the reasons outlined below.

Fig. 1 Speeding-related fatalities and fatalities per 100 M VMT.

Speeding is a pervasive safety challenge worldwide, contributing to a considerable number of traffic fatalities and injuries each year and imposing societal losses in high-, medium-, and low-income countries alike. According to the NHTSA, speeding was a factor in 26% of all traffic fatalities in the United States in 2019. This statistic underscores the need to address speeding as a major public safety concern.

Single-vehicle crashes often involve drivers losing control due to excessive speed, particularly under risky combinations of factors such as curves, wet or icy roads, and adverse weather conditions. High speeds reduce the driver’s ability to react to unexpected obstacles, sharp turns, or sudden changes in road conditions. Additionally, the kinetic energy at impact grows with the square of velocity, so high-speed crashes lead to more severe injuries or fatalities. Factors contributing to the prevalence of speeding-related single-vehicle crashes include the following:

  • Driver Behavior: Risk-taking behaviors, such as aggressive driving and overestimating driving skills, are common characteristics among those who speed on the road.

  • Road Conditions: Poor road maintenance, lack of signage, and hazardous weather conditions can contribute to the risks associated with speeding.

  • Vehicle Factors: High-performance vehicles (high horsepower combined with modern safety technology such as airbags and side curtains) may encourage drivers to exceed speed limits, while older vehicles might lack modern safety features that mitigate the impact of high-speed crashes.

  • Environmental Factors: Rural roads, which are often less maintained and have higher speed limits, see a higher incidence of speeding-related single-vehicle accidents compared to urban areas.
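
The quadratic dependence of impact energy on speed noted above can be made concrete with a short worked example. The two speeds used here (60 and 70 mph) are illustrative values chosen for this sketch, not figures from the study data, and the vehicle mass m is assumed to be the same in both cases:

$$E_k=\frac{1}{2}mv^{2}\quad\Rightarrow\quad\frac{E_k\left(70\ \text{mph}\right)}{E_k\left(60\ \text{mph}\right)}=\left(\frac{70}{60}\right)^{2}\approx 1.36$$

That is, an increase in speed of about 17% corresponds to roughly 36% more kinetic energy to be dissipated in a crash.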

The consequences of speeding-related single-vehicle crashes extend beyond the immediate physical harm to the driver and passengers. These crashes also impose significant economic costs on society, including medical expenses, lost productivity, property damage, and emergency response costs. According to a study by IIHS, speeding-related crashes cost an estimated $52 billion annually in the United States.

Given the heightened concern surrounding the risky behavior of speeding, it is crucial to investigate the factors that contribute to injury severities in speeding-related crashes. Speeding, whether it involves exceeding the speed limit or variations in speed, is strongly associated with the risk of crashes and the resulting injury severities. This is primarily due to the excessive kinetic energy dissipated upon impact between vehicles, between a vehicle and a fixed object, or in rollover crashes. Various factors have been shown to influence the choice of excessive or inappropriate speed, including age, gender, attitudes, safety perception, alcohol consumption, number of occupants in the vehicle, roadway geometry, road-surface conditions, type of vehicle, vehicle power, and ambient weather conditions2,3,4,5.

In this study, unlike previous studies, we analyzed speeding-related crashes as a data sample in their own right rather than representing speeding solely through an indicator variable. We measured speeding as the difference between the estimated travel speed (provided by the police officer) and the posted speed limit for the corresponding road segment. Other studies typically utilized ‘speeding’ as an indicator variable obtained from police-reported crash databases. Furthermore, unlike most previous studies6,7,8,9, we established a baseline scenario in which the difference between the travel speed and the posted speed limit is zero. This allowed us to compare the effects of various factors on driver injury severities against that baseline.

Our study focused on developing models for driver injury severity, considering unobserved factors related to independent variables from police-reported crash databases across all states in the country. These factors include roadway-related and crash-specific variables readily available in the database, such as weather conditions, lighting, surface type, roadway classes, roadway characteristics, speed limit, travel speed, roadside information, driver age and gender, physical condition, maneuvers, violation history, vehicle make, model, model year, displacement, and vehicle type. We utilized the Crash Report Sampling System (CRSS) database, which contains sampled crash data from across the entire country, to analyze the potential severity of injuries in single-vehicle crashes related to speeding within a multivariate severity modeling framework. Our analysis employed machine learning algorithms to identify contributing factors to driver injury severity levels, and subsequently utilized advanced econometric techniques that allowed the mean and variance of the random parameters to vary across observations in the dataset. This approach better accounted for unobserved heterogeneity in the dataset extracted from the CRSS database system. Temporal and spatial instability was not the focus of this study; the understanding gained here motivates a future study that explores these aspects.

In recent years, machine learning (ML) methods have been widely used for feature selection. While ML methods were once considered black-box approaches, recent research on interpreting ML models has made them more transparent and interpretable. Among the various ML methods, Extreme Gradient Boosting (XGBoost) and Random Forest (RF) have emerged as the most frequently employed algorithms in traffic crash studies10,11,12,13. To interpret the results of machine learning models, many recent studies have used SHAP (SHapley Additive exPlanations)14,15,16. SHAP attributes a model’s output to its input features, often presented visually, and thereby provides insight into how different features influence the model.

The application of random parameter logit models, also known as mixed logit models, has garnered significant attention due to their ability to incorporate mixing distributions. These models are considered a promising alternative approach in injury severity analysis. One advantage of this approach is its consideration of unobserved heterogeneity by allowing the effects of certain variables to vary across observations. This makes the mixed logit model superior to fixed-parameter models17. Previous research studies have demonstrated the capability of mixed logit models in considering unobserved effects, such as driver behavior, roadway characteristics, and environmental factors18,19,20,21,22,23,24,25.

The paper is organized as follows: the next section provides a review of the existing literature on driver injury severity in speeding-related crashes, as well as methodological literature on injury severity modeling. It also describes machine learning algorithms and the approach for incorporating heterogeneity. This is followed by details of the CRSS data and the empirical setting. Finally, we explain the results of the machine learning algorithm and econometric modeling, including the marginal effects of the variables. The paper concludes with a summary of our findings.

Literature review

Several studies have identified speeding as a key contributing factor to severe injuries, using crash data, naturalistic driving data, and driver surveys at the state or country level. In 2022, Se et al.26 researched the causes of severe injuries in crashes in Thailand, revealing that factors such as vehicle type, road conditions, and regional location consistently influenced the likelihood of severe injuries in both speeding and non-speeding incidents. This study found that the use of restraints and certain vehicle types was key in reducing injury severity, while factors like alcohol influence significantly increased the risk. Das et al.9 analyzed Louisiana crash data from 2010 to 2016 to identify patterns leading to speeding-related motorcycle crashes, resulting in six distinct clusters associated with high crash likelihood. These clusters include single motorcycle crashes on low-volume two-lane roadways, older motorcyclists involved in undivided curve crashes, various intersection crashes, fatal left turn intersection incidents, lane splitting sideswipes on busy roads, and crashes on open country at-grade locations during wet pavement conditions. In 2021, Perez et al.8 analyzed naturalistic driving data and driver questionnaire responses to understand the factors affecting speeding likelihood, finding that both age and gender are significant influencers. Their results indicated that younger drivers (16–24 years old) are 1.5 times more likely to speed than those 80 or older, males are more likely to speed than females, and the likelihood of speeding is higher in lower speed limit zones, with odds 9.5 times greater in 10–20 mph zones compared to those over 60 mph. Islam and Mannering27 investigated driving too fast for rainy conditions among male and female drivers in Florida. 
Another study by Islam and Mannering21 found that exceeding the speed limit by more than 10 mi/hr was a consistent, temporally stable predictor of driver injuries in crashes where aggressive driving was identified. Kong et al.28 extracted naturalistic driving data collected by the Safety Pilot Model Deployment (SPMD) program to evaluate the relationship between trip/driving/roadway features and speeding behavior. Hong et al.7 evaluated the effects of socio-demographic factors on motorcycle speeding behavior by developing linear network autocorrelation models. In contrast to findings that males are more likely to speed, Fernandes et al.29 concluded that female drivers exhibited riskier behavior in terms of speed violations. In 2019, Cheng et al.30 determined the factors that influence speeding violation behavior using electronic law enforcement data obtained from the public security administration of Wujiang.

Yadav and Velega31 explored the effect of different Blood Alcohol Concentration (BAC) levels on drivers’ speed and their ability to avoid crashes in case of sudden events by simulating driving experiments in rural and urban conditions. Chen et al.6 used principal component analysis and hierarchical clustering to determine the contributing factors to severe crash injuries in China. Ma et al.32 developed a partial proportional odds model to identify the contributing factors to crash injury severity using crash records collected on rural two-lane highways in China. Moreover, Abegaz et al.33 developed a generalized ordered logit/partial proportional odds model to investigate the contributing factors to crash severities using crash data collected from June 2012 to July 2013 on one of Ethiopia’s main and busiest highways. In 2013, Hassan and Al-Falah34 developed a logistic regression model to categorize crashes into fatal and non-fatal crashes and identify the contributing factors to them. Council et al.35 developed a speeding-related crash typology to identify the crash, vehicle, and driver features that increase the probability of speeding-related crashes.

A summary of the key variables used by previous studies, as well as their increasing or decreasing effects on speeding crash severity and frequency, is provided in Table 1.

Table 1 Summary of previous studies on speeding behavior.

Data description

Single-vehicle crashes involving driving above or within the speed limit were extracted from the Crash Report Sampling System (CRSS) database for the period from January 1, 2016, to December 31, 2018. CRSS is a sample of police-reported crashes involving all types of motor vehicles, pedestrians, and cyclists, ranging from property damage-only crashes to those that result in fatalities. CRSS obtains its data from a nationally representative probability sample selected from the estimated 6 to 7 million police-reported crashes that occur annually. Whether a vehicle was driving within or exceeding the speed limit was determined from the difference between its travel speed and the posted speed limit: a zero difference meant the vehicle was within the speed limit, while a positive difference indicated it was exceeding the speed limit. This method provided a clear and measurable way to categorize driving behavior relative to speed limits. Records from several CRSS data files were aggregated and filtered: the crash, person, vehicle, violation, and distraction files; vpicdecode, which offers vehicle specifications derived from the Vehicle Identification Number (VIN); vpictrailerdecode, which provides similar information for trailers; and pbtype, which provides crash details for pedestrians, bicyclists, and individuals on personal conveyances. The crash, vehicle, person, violation, and distraction files were linked based on case and vehicle numbers. 
The pbtype and person files were integrated, excluding information related to pedestrians, bicyclists, and individuals on personal conveyances. The vpicdecode and vpictrailerdecode files were combined to obtain detailed car information. Throughout the merging process, the unique crash case number and associated vehicle numbers were validated to ensure a high-quality merged dataset. Figure 2 illustrates the data processing and linkage process which resulted in the final data sample for modeling.

Fig. 2 Stepwise flowchart for data linkage process for final sampled datasets.

After merging, the data was filtered to include only single-vehicle crashes and driver information, considering the total number of vehicles involved in a crash and the seating position of the person. The filtering process for driving within the speed limit (where the difference between travel speed and speed limit was zero) resulted in a dataset of 3,450 crashes. For driving exceeding the speed limit (where the difference was positive), there were 2,299 crashes over the three-year analysis period (2016–2018).
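
The linkage and filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration on made-up records, not the authors' processing code: the field names (`CASENUM`, `VEH_NO`, `VE_TOTAL`, `TRAV_SP`, `SPD_LIM`, `SEAT`, `INJ`) and the decision to drop records with a negative speed difference are assumptions made for the sketch.

```python
# Hypothetical miniature tables; field names and values are illustrative only.
crash = [
    {"CASENUM": 1, "VE_TOTAL": 1},
    {"CASENUM": 2, "VE_TOTAL": 2},
    {"CASENUM": 3, "VE_TOTAL": 1},
]
vehicle = [
    {"CASENUM": 1, "VEH_NO": 1, "TRAV_SP": 55, "SPD_LIM": 55},
    {"CASENUM": 2, "VEH_NO": 1, "TRAV_SP": 40, "SPD_LIM": 35},
    {"CASENUM": 2, "VEH_NO": 2, "TRAV_SP": 30, "SPD_LIM": 35},
    {"CASENUM": 3, "VEH_NO": 1, "TRAV_SP": 70, "SPD_LIM": 55},
]
person = [
    {"CASENUM": 1, "VEH_NO": 1, "SEAT": "driver", "INJ": "no injury"},
    {"CASENUM": 1, "VEH_NO": 1, "SEAT": "passenger", "INJ": "possible injury"},
    {"CASENUM": 2, "VEH_NO": 1, "SEAT": "driver", "INJ": "no injury"},
    {"CASENUM": 2, "VEH_NO": 2, "SEAT": "driver", "INJ": "suspected minor injury"},
    {"CASENUM": 3, "VEH_NO": 1, "SEAT": "driver", "INJ": "suspected serious injury"},
]


def link_and_filter(crash, vehicle, person):
    """Link tables on case/vehicle numbers; keep drivers in single-vehicle crashes."""
    n_vehicles = {c["CASENUM"]: c["VE_TOTAL"] for c in crash}
    veh_by_key = {(v["CASENUM"], v["VEH_NO"]): v for v in vehicle}
    linked = []
    for p in person:
        key = (p["CASENUM"], p["VEH_NO"])
        # Validate the linkage: skip person rows without a matching crash/vehicle.
        if key not in veh_by_key or p["CASENUM"] not in n_vehicles:
            continue
        # Keep only single-vehicle crashes and the driver seating position.
        if n_vehicles[p["CASENUM"]] != 1 or p["SEAT"] != "driver":
            continue
        v = veh_by_key[key]
        diff = v["TRAV_SP"] - v["SPD_LIM"]
        if diff < 0:
            continue  # assumption: only zero and positive differences are analysed
        linked.append({
            "CASENUM": p["CASENUM"],
            "SPEED_DIFF": diff,
            "GROUP": "within" if diff == 0 else "exceeding",
            "INJ": p["INJ"],
        })
    return linked


sample = link_and_filter(crash, vehicle, person)
```

Here the two-vehicle crash and the passenger record are dropped, leaving one driver in the "within" group and one in the "exceeding" group.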

Injury severity was defined as the level of injury sustained by the most severely injured driver in each crash. In the single-vehicle crashes, the five driver injury severity levels (fatal, suspected serious injury, suspected minor injury, possible injury, and no injury/property damage only) were aggregated into three levels: severe injury crashes (37% of the exceeding-the-speed-limit sample vs. 20% of the within-the-speed-limit sample), minor injury crashes (33% vs. 32%, respectively), and no injury crashes (30% vs. 48%, respectively); see Fig. 3. Figure 3 highlights the importance of estimating the effects of factors influencing the severity levels of such crashes, as injury severities are higher for drivers exceeding the speed limit than for those within the speed limit.

Fig. 3 Proportion of driver injury severity in single-vehicle crashes involving driving exceeding the speed limit and within the speed limit in the USA (2016-18).

When examining different types of road facilities, it was observed that two-way undivided roadways accounted for 47% of crashes involving vehicles traveling at the speed limit and 52% of crashes beyond the speed limit. In comparison, two-way divided roadways with a positive median barrier accounted for 25% of crashes at the speed limit and 18% of crashes beyond the speed limit. Furthermore, two-way roadways with an unprotected median accounted for 11% of crashes at the speed limit and 9% of crashes beyond the speed limit.

Table 2 presents the descriptive statistics for the variables found statistically significant in the models. These variables are categorized into temporal, weather, crash, vehicle, roadway, and driver characteristics.

Table 2 Descriptive statistics of key variables in the models.

SD = Standard Deviation.

Methodology

Previously, researchers have developed machine learning and econometric models to compare their performance in predicting crash severities36,37. They concluded that each of these methods outperforms the other in different scenarios. However, these studies are not comprehensive, since they did not attempt to explore both severity prediction and driving behavior using machine learning and econometric models together. The crash information collected and reported by police officers is limited compared with the wealth of information present at the crash scene. To account for unobserved factors (those not captured in the reported crash data), injury severity modeling using a mixed logit with heterogeneity in means and variances has been found promising for uncovering the factors that contribute to crashes and crash outcomes, with a focus on driving behaviors. In this study, machine learning and advanced econometric models were implemented to take advantage of the strengths of both approaches. First, XGBoost and Random Forest, with SHAP values, were applied. A recent study by Hasan et al.38 emphasized the effectiveness of XGBoost and Random Forest algorithms in identifying the importance of variables. Second, econometric models were employed to quantify the effect of the variables (marginal effects) on driver injury severity in single-vehicle crashes. Figure 4 shows the steps covered by the machine learning models and the remainder of the process with the advanced econometric model. 
The link between the machine learning and econometric models is the selection of top variables by SHAP values from the two machine learning algorithms (Random Forest and XGBoost); the overlapping variables, along with some others, were then considered when estimating the random parameters multinomial logit with heterogeneity in means and variances. In general, a significant advantage of many machine learning algorithms is their ability to autonomously identify pertinent features from a broad array of variables. This capability is instrumental in pinpointing key predictors without the need to predefine a model grounded in theoretical assumptions. Furthermore, machine learning techniques serve as a complement to, rather than a replacement for, traditional econometric models. In this context, the machine learning algorithms provide a list of important variables (say, the top 10 or 15) to consider in the econometric modeling framework, where driver injury severity is modeled with three simultaneous equations. The algorithms do not guarantee that those variables will be statistically significant; however, after checking their descriptive statistics, these variables are considered first in the econometric modeling rather than trying arbitrary variables in the model (a form of context-driven variable selection). In many instances, simplified variables, rather than the more finely classified variables identified by the machine learning algorithms, stand out as statistically significant in the econometric modeling. These techniques have the potential to reveal patterns and relationships that may not be immediately evident or are typically overlooked in conventional models.

Fig. 4 Methodology to analyze speeding behavior with machine learning and advanced econometric model.

Machine learning models

XGBoost is an ensemble technique built on gradient-boosted decision trees, an approach initially proposed in 2002 by Friedman39. The XGBoost algorithm has achieved promising results in many research fields40,41,42,43. It consists of a sequence of decision trees in which every tree learns from the prior trees and affects the trees that follow. XGBoost as an additive model of tree functions can be formulated as follows44:

$$\hat {y}_{i}^{{\left( t \right)}}=\mathop \sum \limits_{{k=1}}^{t} {f_k}\left( {{x_i}} \right)=\hat {y}_{i}^{{\left( {t - 1} \right)}}+{f_t}\left( {{x_i}} \right)$$
(1)

where \({\widehat{y}}_{i}^{\left(t\right)}\) is the estimated crash severity after t iterations, k indexes the additive trees, t is the number of iterations (one tree is added per iteration), \({f}_{k}\left({x}_{i}\right)\) is the kth tree function for variables \({x}_{i}\), \({\widehat{y}}_{i}^{(t-1)}\) is the predicted response value from the previous iteration, and \({f}_{t}\left({x}_{i}\right)\) is the tree function added at the tth iteration.

The objective function for minimizing the loss \(l\left({y}_{i},{\widehat{y}}_{i}\right)\) can be shown as follows:

$$Obj=\sum_{i=1}^{n}l\left({y}_{i},{\widehat{y}}_{i}\right)+\sum_{k=1}^{t}{\Omega}\left({f}_{k}\right)$$
(2)
$${\Omega}\left({f}_{t}\right)=\gamma T+\frac{1}{2}\lambda{\sum}_{j=1}^{T}{\omega}_{j}^{2}$$
(3)

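where \({\Omega}\left({f}_{t}\right)\) is the regularization term that penalizes model complexity to prevent overfitting, T is the number of leaves in the tree, \({\omega}_{j}\) is the score of the jth leaf (so the sum in Eq. 3 is the squared L2 norm of the leaf scores), \(\gamma\) and \(\lambda\) are regularization hyperparameters, and n is the total number of crashes in the sample data.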
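
To make Eqs. 1 to 3 concrete, the following is a minimal sketch of the additive prediction and the regularized objective for a tiny, hand-built ensemble. It is not the XGBoost library and does not fit trees; the two-leaf "stumps", the squared-error loss, and all numeric values are assumptions chosen for illustration.

```python
def omega(leaf_weights, gamma, lam):
    """Eq. (3): gamma * T + 0.5 * lambda * sum of squared leaf scores."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)


def make_stump(threshold, left_w, right_w):
    """Hypothetical two-leaf tree on one numeric feature."""
    def f(x):
        return left_w if x < threshold else right_w
    f.leaf_weights = [left_w, right_w]
    return f


def predict(trees, x):
    """Eq. (1): the prediction after t rounds is the running sum of tree outputs."""
    y_hat = 0.0
    for f in trees:
        y_hat = y_hat + f(x)  # y_hat^(t) = y_hat^(t-1) + f_t(x)
    return y_hat


def objective(trees, xs, ys, gamma, lam):
    """Eq. (2), using squared-error loss as an assumed choice of l."""
    loss = sum((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys))
    penalty = sum(omega(f.leaf_weights, gamma, lam) for f in trees)
    return loss + penalty


# Illustrative ensemble and data (made-up values).
trees = [make_stump(10.0, 0.0, 1.0), make_stump(20.0, 0.0, 0.5)]
xs, ys = [5.0, 15.0, 25.0], [0.0, 1.0, 2.0]
obj = objective(trees, xs, ys, gamma=1.0, lam=2.0)
```

In this toy case the loss term is 0.25 and the penalty is 5.25, which shows how larger values of gamma and lambda make additional leaves and large leaf scores more costly.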

Random Forest, also known as the decision forest technique, is an ensemble learning method initially created by Tin Kam Ho45. The algorithm comprises a set of individual decision trees functioning as an ensemble. Each decision tree predicts a class, and the class with the most votes is selected as the final model prediction. Random Forest can be used for both classification and regression problems, making it one of the most widely used algorithms.
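
The voting step can be illustrated with a short sketch. The three "trees" below are hand-written rules with hypothetical feature names, standing in for fitted decision trees; the sketch omits how the trees are grown (bootstrap samples and random feature subsets) and shows only the majority vote.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class that receives the most votes across the trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for fitted trees; rules and feature names are made up.
trees = [
    lambda x: "severe" if x["rollover"] else "no injury",
    lambda x: "severe" if not x["belted"] else "no injury",
    lambda x: "minor" if x["speed_diff"] > 10 else "no injury",
]

example = {"rollover": True, "belted": False, "speed_diff": 15}
predicted = forest_predict(trees, example)
```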

In this study, SHAP is used to interpret the machine learning results by indicating the contribution of each feature to the model output. Let us assume that \({x}_{i}\) is the ith sample, \({x}_{ij}\) is the jth feature of the ith sample, \({y}_{i}\) is the predicted value for the ith sample in the model, and \(\bar{y}\) is the baseline of the whole model. The SHAP value can be calculated using the following equation46:

$${y}_{i}=\bar{y}+f\left({x}_{i},1\right)+f\left({x}_{i},2\right)+\dots +f({x}_{i},k)$$
(4)

where \(f({x}_{i},1)\) is the SHAP value for the first feature in the ith sample and k is the number of features.
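
The additive decomposition in Eq. 4 can be checked on a toy model by computing exact Shapley values by enumeration. This is an illustrative sketch rather than the SHAP library: the model, the sample, and the convention of replacing "absent" features with a fixed reference value are all assumptions, and practical implementations for tree ensembles use much faster algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, reference):
    """Exact Shapley values; absent features are set to their reference value."""
    k = len(x)

    def value(subset):
        z = [x[j] if j in subset else reference[j] for j in range(k)]
        return model(z)

    phi = []
    for j in range(k):
        others = [m for m in range(k) if m != j]
        total = 0.0
        for size in range(k):
            for s in combinations(others, size):
                weight = factorial(size) * factorial(k - size - 1) / factorial(k)
                total += weight * (value(set(s) | {j}) - value(set(s)))
        phi.append(total)
    return phi


def model(z):
    """Hypothetical score with one main effect and one interaction."""
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, reference)
baseline = model(reference)
```

For this example the interaction between the second and third features is split equally between them, and the baseline plus the three contributions reproduces the model output for `x`.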

Finally, for evaluating the performance of the developed machine learning models, three commonly used evaluation metrics, including accuracy, precision, and F-1 score, were used. The considered evaluation metrics are defined as follows47:

$$Accuracy=\frac{(TP+TN)}{(TP+TN+FP+FN)}*100$$
(5)
$$Precision=\frac{TP}{(TP+FP)}*100$$
(6)
$$F-1 \,score=\frac{\left(2TP\right)}{(2TP+FP+FN)}*100$$
(7)

where TP (true positives) are positive samples correctly assigned to the positive category, TN (true negatives) are negative samples correctly assigned to the negative category, FP (false positives) are negative samples wrongly assigned to the positive category, and FN (false negatives) are positive samples wrongly assigned to the negative category.
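
As a quick numerical check of Eqs. 5 to 7, the sketch below evaluates the three metrics from a set of confusion counts. The counts are made up for illustration, and the function treats a single positive-versus-negative split; how the counts were aggregated across the three severity classes in the study is not assumed here.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and F-1 score (in percent) as defined in Eqs. 5-7."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp) * 100
    f1 = 2 * tp / (2 * tp + fp + fn) * 100
    return accuracy, precision, f1


# Hypothetical counts, not taken from the study's confusion matrices.
acc, prec, f1 = binary_metrics(tp=40, tn=30, fp=10, fn=20)
```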

Econometric models

A random parameter logit model with heterogeneity in means and variances was estimated to account for possible heterogeneity in the dataset, focusing on single-vehicle crashes involving exceeding the speed limit and driving within the speed limit in the US. Restricting the variable effect to be the same across observations (i.e., the standard mixed logit formulation) may cause biased estimates and erroneous inferences. Conventional crash databases extracted from police reports cover only part of the wealth of information originating from the crash scene. Given these limitations, the “heterogeneity” models are intended to provide more accurate inferences by explicitly accounting for observation-specific variations in the effects of influential factors (as unobserved heterogeneity). Extensive discussion of and justification for accounting for unobserved heterogeneity in crash data modeling are provided by Mannering et al.48. If unobserved heterogeneity is ignored and the effects of observable variables are restricted to be the same across all observations, the model will be misspecified and the estimated parameters will, in general, be biased and inefficient, which could in turn lead to erroneous inferences and predictions.

In this study, a random parameter multinomial logit model that accounts for possible heterogeneity in the means and variances of the random parameters has been utilized to address the possible unobserved heterogeneity in the single-vehicle crash data due to exceeding the speed limit and driving within the speed limit in the analysis period of three years of 2016 to 2018 (inclusive). The injury severity of drivers in single-vehicle crashes is considered with possible injury outcomes of no injury, minor injury (possible injury and non-incapacitating injury), and severe injury (incapacitating injury and fatality). Following the recent work, the modeling approach starts by defining a function that determines injury severity,

$${S_{in}}={{{\varvec{\beta}}}_{{i}}}{{{\varvec{X}}}_{in}}+{\varepsilon _{in}}$$
(8)

where \(S_{in}\) is an injury-severity function determining the probability of injury-severity outcome i in single-vehicle crash n, \({\varvec{X}}_{in}\) is a vector of explanatory variables that affect single-vehicle crash injury-severity level i, \({\varvec{\beta}}_{i}\) is a vector of estimable parameters, and \(\varepsilon_{in}\) is the error term. If this error term is assumed to be generalized extreme value distributed, a standard multinomial logit model results49:

$${{{P}}_{{n}}}\left( i \right)={{ }}\frac{{{{EXP}}\left[ {{\varvec{\beta}_{{i}}}{{\varvec{X}}_{{{in}}}}} \right]}}{{\sum\limits_{{\forall {{I}}}} {{{EXP}}\left[ {{\varvec{\beta}_{{i}}}{{\varvec{X}}_{{{in}}}}} \right]} }}{{ }}$$
(9)

where \(P_{n}(i)\) is the probability that single-vehicle crash n will result in driver-injury severity outcome i, and I is the set of the three injury-severity outcomes. The following extension of Eq. 9 allows one or more parameter estimates in the vector \({\varvec{\beta}}_{i}\) to vary across crashes (i.e., across observations)50:

$${{P}}_{{{n}}}^{{}}\left( i \right)=\int {\frac{{EXP\left( {{\varvec{\beta}_i}{{\varvec{X}}_{in}}} \right)}}{{\sum\limits_{{\forall I}} {EXP\left( {{\varvec{\beta}_i}{{\varvec{X}}_{in}}} \right)} }}} f\left( {{\varvec{\beta}_i}|{{\mathbf{\varphi }}_i}} \right)d{\varvec{\beta}_i}$$
(10)

where \(f({\varvec{\beta}}_{i}|{\mathbf{\varphi}}_{i})\) is the density function of \({\varvec{\beta}}_{i}\), \({\mathbf{\varphi}}_{i}\) is a vector of parameters describing the density function (mean and variance), and all other terms are as previously defined.
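
The fixed-parameter probability in Eq. 9, which Eq. 10 integrates over the parameter density, can be sketched as follows. The outcome labels follow the three severity levels used in this study, but the coefficient and variable values are hypothetical and chosen only so that the result is easy to verify by hand.

```python
import math


def mnl_probabilities(betas, x_by_outcome):
    """Eq. (9): P_n(i) = exp(beta_i . X_in) / sum over outcomes of exp(beta . X)."""
    utilities = {
        i: sum(b * x for b, x in zip(betas[i], x_by_outcome[i])) for i in betas
    }
    # Subtracting the maximum utility avoids overflow and leaves the ratios unchanged.
    u_max = max(utilities.values())
    exp_u = {i: math.exp(u - u_max) for i, u in utilities.items()}
    denom = sum(exp_u.values())
    return {i: e / denom for i, e in exp_u.items()}


# Hypothetical parameters: one explanatory variable per outcome.
betas = {"no injury": [0.0], "minor injury": [0.0], "severe injury": [math.log(2.0)]}
x_by_outcome = {"no injury": [1.0], "minor injury": [1.0], "severe injury": [1.0]}
probs = mnl_probabilities(betas, x_by_outcome)
```

With these values the exponentiated utilities are 1, 1, and 2, so the three probabilities are 0.25, 0.25, and 0.5.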

To account for the possibility of unobserved heterogeneity in the means and variances of parameters, let \({\varvec{\beta}}_{in}\) be a vector of estimable parameters that varies across single-vehicle crashes, defined as follows (a similar formulation has been used in other injury severity contexts20,21,22,23,24,25,51,52):

$${\varvec{\beta}_{in}}={\beta _i}+{\Theta _{in}}{{\varvec{Z}}_{in}}+{\sigma _{in}}EXP\left( {{\psi _{in}}{{\varvec{W}}_{in}}} \right){\nu _{in}}$$
(11)

where \(\beta_{i}\) is the mean parameter estimate across all single-vehicle crashes, \({\varvec{Z}}_{in}\) is a vector of crash-specific explanatory variables that captures heterogeneity in the mean that affects injury-severity level i, \(\Theta_{in}\) is a corresponding vector of estimable parameters, \({\varvec{W}}_{in}\) is a vector of crash-specific explanatory variables that captures heterogeneity in the standard deviation \(\sigma_{in}\) with corresponding parameter vector \(\psi_{in}\), and \(\nu_{in}\) is a disturbance term.

During model estimation, several density functions (for example, uniform, triangular, lognormal, and Weibull) were empirically evaluated for the term \(f({\varvec{\beta}}_{i}|{\mathbf{\varphi}}_{i})\). However, the normal distribution was found to be statistically superior to the others and was used in model estimation (this finding is consistent with past work20,21,51,53). The model estimations used simulated maximum likelihood with 1,000 Halton draws54,55,56.
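
The simulation idea behind Eqs. 10 and 11 can be sketched for a single normally distributed random parameter. This is a minimal illustration of simulating choice probabilities with Halton draws, not the estimation code used in the study: the utility specification and every coefficient value are hypothetical, only one random parameter is drawn (with base 2), and no likelihood maximization is performed.

```python
import math
from statistics import NormalDist


def halton(index, base):
    """index-th element (index >= 1) of the Halton sequence for a prime base."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result


def random_parameter(beta_mean, theta, z, sigma, psi, w, nu):
    """Eq. (11) for a scalar parameter: mean shift via z, variance shift via w."""
    return beta_mean + theta * z + sigma * math.exp(psi * w) * nu


def simulated_probabilities(x_severe, z, w, n_draws=1000):
    """Approximate Eq. (10) by averaging logit probabilities over Halton draws.

    Outcomes are ordered (no injury, minor injury, severe injury); all
    coefficient values below are hypothetical.
    """
    to_normal = NormalDist().inv_cdf
    sums = [0.0, 0.0, 0.0]
    for r in range(1, n_draws + 1):
        nu = to_normal(halton(r, 2))  # quasi-random standard normal draw
        beta = random_parameter(0.5, 0.3, z, 0.8, 0.2, w, nu)
        utilities = [0.0, -0.2, beta * x_severe]
        u_max = max(utilities)
        exp_u = [math.exp(u - u_max) for u in utilities]
        denom = sum(exp_u)
        for i in range(3):
            sums[i] += exp_u[i] / denom
    return [s / n_draws for s in sums]


sim_probs = simulated_probabilities(x_severe=1.0, z=1.0, w=0.0)
```

Halton sequences cover the unit interval more evenly than pseudo-random numbers, which is why they are commonly used to reduce the number of draws needed; with several random parameters, a different prime base is typically used for each.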

Results and discussions

This section first identifies risk factors using machine learning, ranked by SHAP values, and then presents the likelihood ratio tests on injury severity and the estimated model parameters for exceeding the speed limit and driving within the speed limit in the USA.

Identification of top risky factors in speeding with machine learning

Two classification models, XGBoost and Random Forest, were used to model driver injury severity for vehicles driving within and beyond the speed limit. Afterward, SHAP values were employed to identify the top risk variables associated with the developed models. In this study, 75% of the data were randomly selected as the training set, and the remaining 25% were used to test the models. Moreover, the Synthetic Minority Over-sampling Technique (SMOTE) algorithm was applied to deal with class imbalance in the dataset. One way of addressing imbalanced classes is to oversample the minority class. This can be done by duplicating examples in the minority class, although duplicated examples do not add any new information to the model. SMOTE instead synthesizes new minority-class examples by interpolating between existing minority examples and their nearest neighbors, which is a type of data augmentation for the minority class57.
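
The interpolation step at the core of SMOTE-style oversampling can be sketched as follows. This is a simplified illustration rather than the implementation used in the study: it handles numeric features only, and the sample points, the neighbor count `k`, and the random seed are assumptions made for the example.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
synthetic = smote_like(minority, n_new=5)
```

Each synthetic point lies on the line segment between two existing minority samples, so it adds variation without leaving the region the minority class already occupies.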

Machine learning estimation results

As mentioned earlier, three evaluation measures, accuracy, precision, and F-1 score, were selected for evaluating the performance of the machine learning models. The Random Forest models outperformed the XGBoost models for vehicles driving both within and beyond the speed limit. For vehicles driving beyond the speed limit, the Random Forest model achieved an accuracy of 60.3%, a precision of 60.0%, and an F-1 score of 60.0%, while the XGBoost model achieved an accuracy of 59.5%, a precision of 59.3%, and an F-1 score of 59.3%. For vehicles driving within the speed limit, the Random Forest model achieved the best accuracy of 73.3%, a precision of 73.0%, and an F-1 score of 72.8%, while the XGBoost model achieved an accuracy, precision, and F-1 score of 71.7%, 71.5%, and 71.3%, respectively. Figures 5 and 6 present the confusion matrices for the developed XGBoost and Random Forest models, respectively.

Fig. 5

Confusion matrix of XGBoost model (a) within the speed limit (top), (b) beyond the speed limit (bottom).

Fig. 6

Confusion matrix of random forest model (a) within the speed limit (top), (b) exceeding the speed limit (bottom).
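As an illustration of how the three evaluation measures relate to a confusion matrix, the sketch below computes accuracy, precision, and F-1 score for a three-class problem. The matrix entries are hypothetical and do not reproduce Figs. 5 or 6, and macro-averaging across classes is an assumption, since the averaging scheme is not stated here.

```python
def classification_metrics(cm):
    """cm[i][j] = number of cases with true class i that were predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total

    precisions, f1_scores = [], []
    for c in range(n_classes):
        true_positive = cm[c][c]
        predicted_as_c = sum(cm[r][c] for r in range(n_classes))
        actually_c = sum(cm[c])
        precision = true_positive / predicted_as_c if predicted_as_c else 0.0
        recall = true_positive / actually_c if actually_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    # Macro-average: every class counts equally (an assumption of this sketch).
    macro_precision = sum(precisions) / n_classes
    macro_f1 = sum(f1_scores) / n_classes
    return accuracy, macro_precision, macro_f1


# Hypothetical counts; rows and columns are no injury, minor injury, severe injury.
example_cm = [
    [50, 10, 5],
    [8, 40, 12],
    [4, 9, 45],
]
acc, prec, f1 = classification_metrics(example_cm)
print(f"accuracy={acc:.3f} precision={prec:.3f} f1={f1:.3f}")
```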

Top variables

Figures 7 and 8 show the top ten variables and their SHAP values for the XGBoost and Random Forest models. Based on the XGBoost model, shoulder and lap belt used (REST), no rollover (ROLLRNA), tree (TRE), body type other (BDYTYPOTHER), drinking (DRK), right or left roadside departure (ROR), newer vehicles (VAGE5), dry surface (DRY), driver age below 26 (DAGE25), and male driver (ML) were the top contributing factors for vehicles driving beyond the speed limit. For vehicles driving within the speed limit, the top factors were shoulder and lap belt used (REST), no rollover (ROLLRNA), on the roadway (RLRDOR), body type other (BDYTYPOTHER, such as buses, motor homes, snowmobiles, and farm vehicles), two-way, not divided roadway (W2UD), driver age below 26 (DAGE25), one occupant (OCCS1), crash year 2018 (Y18), live animal (ANML), and male driver (ML).

Regarding the Random Forest model results, shoulder and lap belt used (REST), no rollover (ROLLRNA), tree (TRE), rollover (RLLO), body type other (BDYTYPOTHER), motorcycle (MC), rollover, unknown type (ROLLUN), drinking (DRK), right or left roadside departure (ROR), and south region of the country (REGION3) were the leading variables for driving beyond the speed limit. Moreover, shoulder and lap belt used (REST), no rollover (ROLLRNA), on the roadway (RLRDOR), live animal (ANML), rollover (RLLO), body type other (BDYTYPOTHER), newer vehicles (VAGE5), driver age below 26 (DAGE25), and motorcycle (MC) were the top variables for vehicles driving within the speed limit. Per the CRSS user manual, the South (Region 3) includes Maryland, Delaware, Washington, DC, West Virginia, Virginia, Kentucky, Tennessee, North Carolina, South Carolina, Georgia, Florida, Alabama, Mississippi, Louisiana, Arkansas, Oklahoma, and Texas.

Fig. 7

SHAP values for top 10 variables by driver injury severity in XGBoost, (a) within the speed limit, (b) beyond the speed limit.

Fig. 8

SHAP values for top 10 variables by driver injury severity in random forest, (a) within the speed limit, (b) beyond the speed limit.

Notably, the top three variables were the same for the XGBoost and Random Forest models in driving beyond the speed limit. Similarly, the top three variables were the same for both models in driving within the speed limit.
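The ranking idea behind SHAP values can be illustrated with an exact Shapley-value computation on a toy scoring function. This is a sketch only: it does not use the SHAP library or the fitted XGBoost and Random Forest models, and the function `toy_severity_score`, its coefficients, and the three indicator features are hypothetical. Features outside a coalition are set to a baseline value, which is one common convention and an assumption here.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their observed value; the rest take the baseline.
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_severity_score(features):
    # Hypothetical score using indicators [unrestrained, animal, tree].
    unrestrained, animal, tree = features
    return 0.8 * unrestrained - 0.5 * animal + 0.3 * unrestrained * tree


observed = [1, 0, 1]
reference = [0, 0, 0]
phi = shapley_values(toy_severity_score, observed, reference)
# Rank features by absolute contribution, as in a top-variables plot.
ranking = sorted(range(len(phi)), key=lambda j: abs(phi[j]), reverse=True)
print(phi, ranking)
```

The contributions sum to the difference between the score for the observed case and the score for the baseline, which is the property that makes such values comparable across variables.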

Test for differences in severity for beyond and within speed limit

After extensively testing for differences in injury severity between exceeding the speed limit and driving within the speed limit, it was determined that statistically significant differences existed between the injury severity data for the two scenarios. This was confirmed by a series of likelihood-ratio tests. To test these differences further, additional likelihood ratio tests were run as50,

$${X}^{2}=-2\left[LL\left({\beta}_{Within\,Speed\,Limit,\,Exceeding\,Speed\,Limit}\right)-LL\left({\beta}_{Exceeding\,Speed\,Limit}\right)\right]$$
(12)
$${X}^{2}=-2\left[LL\left({\beta}_{Exceeding\,Speed\,Limit,\,Within\,Speed\,Limit}\right)-LL\left({\beta}_{Within\,Speed\,Limit}\right)\right]$$
(13)

where LL(β_{Within Speed Limit, Exceeding Speed Limit}) is the log-likelihood at convergence of a model containing converged parameters based on the beyond-speed-limit data while using the within-speed-limit data, and LL(β_{Exceeding Speed Limit}) is the log-likelihood at convergence of the model using the beyond-speed-limit data, with parameters no longer restricted to the beyond-speed-limit converged parameters as is the case for LL(β_{Within Speed Limit, Exceeding Speed Limit}). Using the converged parameters of the beyond-speed-limit model as starting values and applying them to the within-speed-limit data gave X² = 35.716. With 23 degrees of freedom, this gave a χ² confidence level of more than 95.9% that the null hypothesis that injury severity is the same in both scenarios can be rejected. Similarly, using the converged parameters of the within-speed-limit model as starting values and applying them to the beyond-speed-limit data gave X² = 69.028. With 23 degrees of freedom, this gave a χ² confidence level of more than 99.9% that the null hypothesis that injury severity is the same in both scenarios (within and beyond the speed limit) can be rejected.
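The likelihood ratio statistic and its confidence level can be sketched as follows. The log-likelihood values passed to `likelihood_ratio_statistic` are hypothetical placeholders, since the individual log-likelihoods are not listed here; only the two test statistics and the 23 degrees of freedom come from the text. The chi-square tail probability is computed with a series expansion of the incomplete gamma function so that the sketch needs only the standard library.

```python
import math


def likelihood_ratio_statistic(ll_restricted, ll_unrestricted):
    """X^2 = -2 [LL(restricted) - LL(unrestricted)]."""
    return -2.0 * (ll_restricted - ll_unrestricted)


def chi_square_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularised lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-16 and n < 10000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# Hypothetical log-likelihoods, for illustration of the statistic only.
example_statistic = likelihood_ratio_statistic(-1010.0, -1000.0)

# Reported test statistics with 23 degrees of freedom.
confidence_first = 1.0 - chi_square_sf(35.716, 23)
confidence_second = 1.0 - chi_square_sf(69.028, 23)
print(example_statistic, confidence_first, confidence_second)
```

Both reported statistics exceed the 95% critical value for 23 degrees of freedom, and the second also exceeds the 99.9% critical value; small differences from the stated confidence levels may arise from rounding of the reported statistics.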

Econometric model estimation results

Mixed logit models with heterogeneity in means and variances for single-vehicle crashes involving driving within the speed limit and exceeding the speed limit are presented in Tables 3 and 4, respectively. The comparison of the marginal effects of these two models is presented in Table 5. The overall statistical fit of the models is indicated by McFadden pseudo-R-squared values of 0.176 for driving within the speed limit and 0.129 for exceeding the speed limit. We note that the constant term specific to minor injury was found to be the only statistically significant random parameter, and it is normally distributed in both models.

For driving within the speed limit, the mean (-2.828) and standard deviation (5.082) of the random parameter (i.e., the constant specific to minor injury) imply that the intercept for minor injury is greater than zero for 28.89% of the target crashes (i.e., single-vehicle crashes involving driving within the speed limit), which tends to increase the likelihood of minor injury for those crashes. The mean of the random parameter varied by whether the crashes occurred on straight segments: in this model, crashes on straight segments decreased the mean of the random parameter, making minor injury less likely. The variance of the constant for minor injury was a function of clear weather. Crashes in clear weather were less likely to involve minor injury, making the variance of the constant specific to minor injury smaller.

Likewise, for exceeding the speed limit, the mean (-4.684) and standard deviation (11.541) of the random parameter (i.e., the constant specific to minor injury) imply that the intercept for minor injury is greater than zero for 34.24% of the target crashes (i.e., single-vehicle crashes involving exceeding the speed limit), which tends to increase the likelihood of minor injury for those crashes. The mean of the random parameter varied by whether the drivers were restrained: in this model, restraint use decreased the mean of the random parameter, making minor injury less likely. The variance of the constant for minor injury was again a function of clear weather. Crashes in clear weather were more likely to involve minor injury, making the variance of the constant specific to minor injury larger.
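The shares quoted for the random parameter follow from the normal distribution: with mean μ and standard deviation σ, the share of crashes with a positive intercept is 1 − Φ((0 − μ)/σ). A minimal sketch of that arithmetic, using the reported means and standard deviations (the function name `share_above_zero` is introduced here for illustration):

```python
from statistics import NormalDist


def share_above_zero(mean, std_dev):
    """Share of a normally distributed parameter that lies above zero."""
    return 1.0 - NormalDist(mu=mean, sigma=std_dev).cdf(0.0)


# Constant specific to minor injury in each model (values reported in the text).
within_share = share_above_zero(-2.828, 5.082)
exceeding_share = share_above_zero(-4.684, 11.541)
print(f"within: {within_share:.2%}, exceeding: {exceeding_share:.2%}")
```

Run as written, this should agree with the 28.89% and 34.24% reported above to the stated rounding.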

In addition to parameter estimates for the mixed logit with heterogeneity in means and variances, estimated marginal effects are also included in Table 3. Marginal effects indicate the effect a one-unit increase in an explanatory variable has on the injury-outcome probabilities.
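For indicator variables, a marginal effect of this kind can be read as the average change in each injury-outcome probability when the indicator switches from zero to one. The sketch below illustrates that calculation with a fixed-parameters multinomial logit for simplicity; the coefficients, variable names, and crash records are hypothetical and are not the estimates in Tables 3 or 4, and the random parameter is omitted.

```python
import math

OUTCOMES = ("no injury", "minor injury", "severe injury")

# Hypothetical coefficients; "no injury" is the base outcome with zero utility.
COEFS = {
    "no injury": {},
    "minor injury": {"const": -0.4, "tree": 0.3},
    "severe injury": {"const": -1.0, "tree": 0.6, "restrained": -0.9},
}


def outcome_probabilities(crash):
    """Multinomial logit probabilities for one crash (dict of indicator values)."""
    utils = []
    for outcome in OUTCOMES:
        beta = COEFS[outcome]
        u = beta.get("const", 0.0)
        u += sum(b * crash.get(var, 0) for var, b in beta.items() if var != "const")
        utils.append(u)
    top = max(utils)
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]


def average_marginal_effect(crashes, indicator):
    """Average change in each outcome probability when `indicator` goes 0 -> 1."""
    sums = [0.0] * len(OUTCOMES)
    for crash in crashes:
        p_on = outcome_probabilities({**crash, indicator: 1})
        p_off = outcome_probabilities({**crash, indicator: 0})
        for k in range(len(OUTCOMES)):
            sums[k] += p_on[k] - p_off[k]
    return [s / len(crashes) for s in sums]


# Hypothetical crash records.
crashes = [
    {"tree": 1, "restrained": 1},
    {"tree": 0, "restrained": 0},
    {"tree": 1, "restrained": 0},
]
effects = average_marginal_effect(crashes, "restrained")
print(dict(zip(OUTCOMES, effects)))
```

Because the probabilities sum to one, the marginal effects across the three outcomes sum to zero, which is why an increase for one severity level is matched by decreases elsewhere.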

Table 3 Model results of mixed logit with heterogeneity in means and variance in single-vehicle crashes involving driving within the speed limit in the US, 2016–18.
Table 4 Model results of mixed logit with heterogeneity in means and variance in single-vehicle crashes involving exceeding the speed limit in the US, 2016–18.
Table 5 Comparison of marginal effects of driver injuries for driving within and beyond the speed limit.

Temporal characteristics

In comparison to other periods, single-vehicle crashes involving exceeding the speed limit between midnight and 6 AM were more likely to result in severe injuries (with an average marginal effect of 0.0095) but less likely to result in minor injuries. On the other hand, single-vehicle crashes involving driving within the speed limit between 10 AM and 4 PM were more likely to result in minor injuries but less likely to result in severe injuries (with a marginal effect of 0.0138 for minor injuries vs. -0.0056 for severe injuries). Similarly, an opposite effect was observed for minor injury crashes by roadway alignment (with a marginal effect of -0.0016 on curved segments vs. 0.0224 on straight segments). Additionally, single-vehicle crashes in 2017 involving driving within the speed limit were more likely to result in severe injuries (with a marginal effect of 0.0090) but less likely to result in minor injuries.

Environmental characteristics

In comparison to other weather conditions, clear weather conditions were more likely to result in minor injury crashes (with an average marginal effect of 0.0749), but less likely to result in severe or no injury crashes for drivers within the speed limit. Conversely, single-vehicle crashes during daylight conditions, where drivers were within the speed limit, were likely to increase the occurrence of severe and minor injuries.

Crash characteristics

Compared to other crash types, single-vehicle rollover crashes involving drivers within the speed limit were more likely to result in minor injuries (average marginal effect of 0.0464). A vehicle running off-road from the left side while driving within the speed limit was more likely to result in severe injuries (average marginal effect of 0.0241). However, when drivers exceeded the speed limit, vehicles running off-road were even more likely to result in severe injuries (average marginal effect of 0.0827). Notably, single-vehicle crashes involving collisions with trees were more likely to result in severe injuries when exceeding the speed limit than when driving within the speed limit (average marginal effect of 0.0281 for exceeding the speed limit vs. 0.0244 for driving within the speed limit).

Vehicle characteristics

In comparison to other vehicle types, single-vehicle crashes involving motorcycles had a marginal effect on severe injuries that was twice as large when exceeding the speed limit as when driving within the speed limit (marginal effect of 0.0040 for exceeding the speed limit vs. 0.0020 for driving within the speed limit). Single-vehicle crashes with vehicle displacement volumes between 5,000 and 10,000 cc were more likely to result in severe injuries when exceeding the speed limit (average marginal effect of 0.0054). Moreover, for newer vehicles manufactured within 5 years of the crash, the marginal effect on minor injuries was 1.8 times larger when exceeding the speed limit than when driving within the speed limit (average marginal effect of 0.0243 for exceeding the speed limit vs. 0.0135 for driving within the speed limit).

Roadway characteristics

Compared to other types of roadways, single-vehicle crashes on non-interstate roadways involving drivers within the speed limit were more likely to result in minor injuries (average marginal effect of 0.0041). Single-vehicle crashes on two-way roadways with a positive median barrier involving drivers within the speed limit were more likely to result in severe injuries (average marginal effect of 0.0219). In contrast, single-vehicle crashes on two-way undivided roadways involving drivers exceeding the speed limit were more likely to result in severe injuries (average marginal effect of 0.0307). Regarding roadway alignment, single-vehicle crashes on curved roadways involving drivers within the speed limit were more likely to result in severe injuries (average marginal effect of 0.0098), whereas single-vehicle crashes on straight roadways involving drivers exceeding the speed limit were more likely to result in severe injuries (average marginal effect of 0.0165). It is worth noting that, for single-vehicle crashes on dry surfaces, the marginal effect on severe injuries was about twice as large when exceeding the speed limit as when driving within the speed limit (average marginal effect of 0.0507 vs. 0.0258), whereas the difference in minor injuries was smaller (average marginal effect of 0.0093).

Driver characteristics

Male drivers involved in single-vehicle crashes were found to be more likely to experience severe injuries than female drivers (with an average marginal effect of 0.2591 vs. 0.0945) when exceeding the speed limit. Among different age groups, drivers between 35 and 45 years old involved in single-vehicle crashes were more likely to experience severe injuries when exceeding the speed limit. In comparison to other risky behaviors, reckless drivers involved in single-vehicle crashes were more likely to experience severe injuries when driving within the speed limit. Drivers who were not intoxicated and exceeded the speed limit in single-vehicle crashes were less likely to experience severe injuries (average marginal effect of -0.0560). Unrestrained drivers who exceeded the speed limit in single-vehicle crashes were more likely to experience severe injuries (average marginal effect of 0.0051). Importantly, restrained drivers in single-vehicle crashes were less likely to experience severe injuries when exceeding the speed limit relative to driving within the speed limit (average marginal effect of -0.1576 for exceeding the speed limit vs. -0.1405 for driving within the speed limit).

Out-of-sample simulation

In the context of this study, from a pragmatic point of view, the aggregate shift in injury severity between the two driving behaviors (driving within the speed limit and exceeding it) is of particular interest. Looking at the overall injury severity percentages for driving within the speed limit, the distribution of crashes among severity levels was 47.8% no injury, 32.3% minor injury, and 20.0% severe injury. For crashes involving exceeding the speed limit, this distribution shifted to 30.2% no injury, 32.6% minor injury, and 37.2% severe injury. Thus, while the percentage of crashes resulting in minor driver injuries was nearly unchanged, there was a shift from no injuries to severe injuries (this is also reflected in Fig. 3). To explore this issue further with the estimated models, the parameters from the exceeding-the-speed-limit model were used to predict injury severity for crashes involving exceeding the speed limit (using the actual characteristics of those crashes), and these predictions were compared with predictions based on the estimated within-the-speed-limit parameters applied to the same exceeding-the-speed-limit data. Please note that using the within-the-speed-limit estimated parameters with the exceeding-the-speed-limit data constitutes an out-of-sample prediction. While such prediction is trivial for fixed-parameters models, it is much more complex for random parameters models since the full distribution of the random parameters must be considered when predicting. As discussed extensively by24, the prediction undertaken here is done by simulation, in this case using the same procedure as was used with the simulated maximum likelihood estimation. This predictive comparison provides an aggregate assessment of how overall injury severity probabilities differ between the two driving behaviors while controlling for actual crash characteristics.

In this case, the within-the-speed-limit parameters predict that 34.0% of crashes involving exceeding the speed limit result in no injury instead of the estimated 29.4% (observed 30.2%), 23.7% result in minor injury instead of the estimated 32.5% (observed 32.6%), and 42.4% result in severe injury as opposed to the estimated 38.0% (observed 37.2%). As mentioned above, in looking at Fig. 2 and comparing driving within the speed limit and exceeding the speed limit, the observed crashes exceeding the speed limit have a lower percentage of no injuries, a higher percentage of severe injuries, and a slightly higher percentage of minor injuries. The predictive comparison also shows that the within-the-speed-limit parameters predict the same trend and are very close to the magnitude that was observed. This suggests that some of the observed injury-proportion trends shown in Fig. 2 are due to the specific characteristics of the observed crashes, but that fundamental differences between the two driving behaviors in the influence that crash characteristics have on injury probabilities are also playing a role. Some of these changes may be the result of improvements in vehicle safety features over time (lane-departure warning systems), roadway improvements (centerline and edge-line rumble stripes, cable median barriers, guard rails, etc.), or temporal changes in the risk profiles of individuals observed to be involved in crashes, as discussed in21,24,27.
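The simulation-based prediction described above can be sketched as follows. For each crash, the random constant for minor injury is drawn repeatedly from its normal distribution, the logit probabilities are computed for each draw, and the results are averaged over draws and then over crashes. This is an illustration under stated assumptions: the fixed utility components in `crashes` are hypothetical, pseudo-random draws are used rather than whatever draw scheme the estimation used, and the heterogeneity in the mean and variance of the random parameter is omitted. Only the mean and standard deviation of the constant are taken from the within-the-speed-limit model reported earlier.

```python
import math
import random


def softmax(utils):
    top = max(utils)
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_crash_probabilities(fixed_utils, const_mean, const_sd, rng, n_draws=500):
    """Average logit probabilities over draws of the random minor-injury constant."""
    totals = [0.0, 0.0, 0.0]
    for _ in range(n_draws):
        minor_constant = rng.gauss(const_mean, const_sd)
        utils = [fixed_utils[0], fixed_utils[1] + minor_constant, fixed_utils[2]]
        for k, p in enumerate(softmax(utils)):
            totals[k] += p
    return [t / n_draws for t in totals]


def predicted_shares(crashes, const_mean, const_sd, seed=1):
    """Aggregate predicted shares of no, minor, and severe injury over all crashes."""
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for fixed_utils in crashes:
        probs = simulated_crash_probabilities(fixed_utils, const_mean, const_sd, rng)
        for k, p in enumerate(probs):
            sums[k] += p
    return [s / len(crashes) for s in sums]


# Hypothetical fixed parts of the utilities (no, minor, severe) for three crashes.
crashes = [(0.0, 1.2, -0.3), (0.0, 0.8, 0.5), (0.0, 1.5, -1.0)]
shares = predicted_shares(crashes, const_mean=-2.828, const_sd=5.082)
print([round(s, 3) for s in shares])
```

In an out-of-sample comparison of the kind described here, the same crash records would be passed through this routine twice, once with each model's parameters, and the two sets of aggregate shares would then be compared.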

Conclusions

In this study, we analyzed factors contributing to single-vehicle crashes in the US, considering both driving within the speed limit and exceeding the speed limit. We utilized two machine learning algorithms (XGBoost and Random Forest) and a random parameters logit model with heterogeneity in mean and variance. The crash data were extracted from the national CRSS crash database system, covering the period from 2016 to 2018. We focused on three levels of crash injury severity: no injury, minor injury (combining possible and non-incapacitating injury), and severe injury (combining incapacitating and fatal injury). The estimated models incorporated a wide range of factors, including temporal, environmental, vehicular, crash, roadway, and driver characteristics.

In terms of the machine learning models, we identified several common variables that were relevant to both driving within and exceeding the speed limits. These variables included restraint system usage by the drivers, no rollover, vehicle body type as “other,” drivers in the age group younger than 25, and male drivers for the XGBoost models. For the Random Forest models, the common variables included restraint system usage by the drivers, no rollover, right and left road departure, rollover, and vehicle body type as “other.”

On the other hand, the estimated econometric models captured a total of 31 variables that significantly influenced driver injury severity in single-vehicle crashes for both scenarios. Out of these variables, five showed shared effects on both driving within and exceeding the speed limits. Some variables that were significant in driving within the speed limit model did not hold statistical significance in exceeding the speed limit model. The common variables included collisions with trees, motorcycle involvement, newer vehicles (manufactured within five years of the crash), dry road surface conditions, and restraint system usage by the drivers. The effect of these statistically significant variables indicated that the severity of driver injuries was higher when exceeding the speed limit compared to driving within the speed limit. Furthermore, certain driver characteristics, such as being male or female, belonging to the age group of 35 to 45 years, and driving without restraint usage, increased the likelihood of severe injury in single-vehicle crashes when exceeding the speed limit. Potential effective countermeasures include roadway design improvements, such as warning signs on risky road sections with higher speed limits, reassessment of advisory speed limits, and the implementation of appropriate crash cushions around trees. Additionally, drivers with a history of repeated speeding offenses should receive professional training to mitigate the risk of crashes. The findings of this research, based on a data-driven approach, contribute to the expanding body of knowledge in speed safety research.

In future analyses, it will be interesting to explore speeding behavior by vehicle type, as has been done for distracted driving58. Moreover, we plan to explore psychological, location-related, and motivating risk factors using the theory of planned behavior, and to apply machine learning and econometric models within a similar framework to explore speeding behavior quantitatively59,60,61. This effort aims to capture an important aspect of safety research and enhance our understanding of integrating these two methodologies for more effective future research and the development of new insights to support better countermeasures.