Introduction

Traffic accidents pose a significant threat, causing human injuries and substantial financial and environmental losses1. The World Health Organization (WHO) reports 1.19 million deaths and 50 million injuries from traffic accidents each year2. WHO also reports that traffic accidents are the leading cause of death for children and young people aged 5–29 years3. Moreover, WHO has stated that 93% of global road traffic deaths occur in low- and middle-income countries, which have about 60% of the world’s vehicles. Since road traffic accidents cost countries nearly 3% to 5% of their gross domestic product3, investment in road safety can pay for itself by saving the direct and indirect costs of accidents, which in turn enhances social, economic, and environmental sustainability. According to statistics released by the WHO, countries that have taken measures to raise road safety have successfully reduced road accidents3. In this regard, many researchers have conducted extensive studies on identifying accident prone areas, predicting the severity of accidents, and reducing the probability and severity of accidents. To understand accident patterns and accident prone areas, and to predict the severity of accidents, innovative techniques must analyse various parameters and provide an accurate result4,5. Recently, advanced computer science techniques such as data mining have been used to predict the severity of accidents5. In this approach, data on the accidents, vehicle type, speed, weather conditions, and time are considered, and a model for predicting the severity of accidents is then generated using machine learning algorithms. With these algorithms, accident prone areas and the severity of accidents can be predicted more accurately, and measures can be taken to reduce accident frequency6. 
For example, based on this prediction, measures can be taken to improve road infrastructure, limit the speed of vehicles, and raise awareness about safe driving practices in accident prone areas. Certainly, much attention has been recently paid to the problem of identifying accident hotspots and predicting the severity of accidents. For example, Ghaffari et al. introduced a method based on reliability to identify accident hotspots and compared the results with the results of Frequency and Empirical Bayesian methods, utilizing simulated data7. They found their introduced method to be better in prediction than the two others. Xu and Tao used the principal component clustering to identify accident hotspots on a road in China8. This technique can be employed to evaluate and quantify the safety levels of various roads by extracting principal components and conducting clustering for accident hotspots. Using the Classification and Regression Tree as a data mining method, Tavakoli Kashani et al. evaluated the factors affecting the injuries caused by road accidents due to fatigue and drowsiness within seven years in three provinces of Iran and identified several factors. Geometric factors were found to have an impact on them9. The accident data that was clustered was also classified and analyzed according to various parameters, including rural and urban areas, residential and non-residential areas, and the drivers’ gender. Agrawal et al., Hajela et al., and Karami and Johansson identified accident hotspots using the DBSCAN clustering method. DBSCAN clustering requires two primary parameters, which are Eps and MinPts10,11,12,13. Finding the optimal value of these parameters is a rigorous task. Consequently, researchers are always trying to find different developed models to improve DBSCAN clustering. For instance, Akbari and Unland found a way to determine the initial parameters of DBSCAN automatically14. 
The idea was based on a statistical technique for outlier detection, namely the empirical rule. However, the exact method used for determining the values of the initial parameters was not identified in the DBSCAN method. Numerous studies have clustered and identified hotspots using the K-means clustering method. Puspitasari et al. identified accident prone areas of highways in Indonesia using K-means clustering and finally analyzed each cluster separately to find the hotspots15. Anderson also clustered accidents using K-means clustering. This method is unreliable for identifying hotspots due to a lack of detection of outliers. It may put outliers in a cluster and cause clustering errors. Wan et al. employed the Spatial Agglomerative Hierarchical clustering method to cluster and identify hotspots of pick-up/drop-off passengers from a taxi16. This method provided good accuracy in this study and managed to identify hotspots accurately. Khosravi et al. determined hotspots of Dehbala road using K-means, K-medoids, and DBSCAN clustering. The findings indicated that climate, especially windy or rainy weather, and the road’s geometrical aspects, such as slope and curvature, have a significant impact on accidents17. Accident prediction models enable various agencies to predict the severity of accidents that may occur in the future. Hence, the severity or the frequency of accidents can be reduced by accident prediction18. Beshah and Hill divided accidents into four classes: damage, injury, serious injury, and fatal, and then predicted the severity of accidents using KNN (K-Nearest Neighbor), DT (Decision Tree), and Naive Bayes algorithms19. It was concluded that the KNN algorithm had the best performance with an overall accuracy of 81%. Selvy et al. used KNN, RF, DT, and Logistic Regression (LR) for real-time crash prediction, determining the frequency and severity of crashes. 
Results proved that classification accuracy obtained from Random Forest (RF) is 96% surpassing that of other classification methods20. Xing et al. evaluated the collision risk of unconstrained vehicle motions at the toll plaza diverging area using the LR model and five non-parametric models, including KNN, ANN (Artificial Neural Network), SVM (Support Vector Machines), DT, and RF21. They used these methods to examine the relationship between influencing factors and vehicle collisions. The result showed that KNN had the best prediction performance among the other methods. Iranitalab and Khattak also used KNN, MNL (Multi-Nomial Logit), and SVM to predict traffic crash severity22. The predictive rates demonstrated that KNN outperformed the other methods in terms of effectiveness. Ijaz et al. worked on a three-year accident data report of Pakistan to predict the severity of injury-causing accidents using classification methods, i.e., Decision jungle (DJ), RF, and DT. Results revealed that DJ outperformed the DT and RF by an overall accuracy of 83.7%. Spearman correlation analysis showed that factors such as lighting conditions and shiny weather conditions were more likely to worsen injury severity23. Amiri et al. compared the Intelligent Genetic Algorithm with ANN by investigating fixed object crashes among elderly drivers24. Compared to ANN, the Intelligent Genetic Algorithm was more capable of predicting high-severity crashes. ANN was more accurate in predicting low-severity crashes, but it failed to detect more severe ones. The light condition was identified as the most significant factor, followed by right and left shoulders. Numerous studies have been conducted to identify accident prone areas and predict the severity of accidents, but more research is needed due to differences in results. 
There is a research gap in advancing methodological approaches for accident analysis to obtain novel insights that can support road safety interventions and intelligent transportation systems. The occurrence and severity of accidents may be influenced by various factors, including the safety qualifications of vehicles, the driving attributes of drivers, and the ride quality of the roads. The police now keep records of the properties of accidents, especially their locations. The current state of the art lacks comprehensive analysis of, and insight into, localized factors contributing to road accidents in Middle Eastern countries. Research is still needed to explore the relationships between different environmental factors and to apply advanced methods to the examination of road crash hotspots in Iran, with the aim of improving infrastructure sustainability. To narrow these knowledge gaps, the current work aims to identify accident prone areas by considering environmental and road geometry properties and applying hierarchical clustering methods. Traffic accident data from a case study conducted by Yazd Traffic Agency in Iran, covering 5 years starting from 2014, was used. The present research provides the following key contributions: (1) identification of accident prone areas through the application of Agglomerative Hierarchical and BIRCH clustering methods, individually or in combination; (2) analysis of accidents in each hotspot to identify the determining parameters leading to accidents; (3) use of two different machine learning classification techniques, K-Nearest Neighbor (KNN) and Random Forest (RF), to predict the severity of accidents; and (4) use of data collected from the field to identify the causes of accidents in accident prone areas, by utilizing a machine learning methodology.

Data and methods

Study area

This study utilized data from accidents that occurred on the Yazd-Kerman road between 2014 and 2018. The data included 465 accidents during this period. Yazd-Kerman road is a quite crowded transit road and a part of an important route that connects the capital of Iran (Tehran) to the most important southern port, Bandar Abbas. The study focused on 107 km of Yazd-Kerman road in Yazd Province. The accident coordinates on this road were recorded using GPS receivers and were referenced using WGS1984 reference ellipsoid and UTM projection systems, as shown in Fig. 1. The data was provided by the Roads and Transportation Organization of Yazd Province and included various pieces of information, such as geographic coordinates, time and date of the accident, the cause of the accident, vehicle type, and alignment. The primary focus of this study is to examine how various geometric and environmental factors contribute to the frequency and severity of accidents. Table 1 outlines the factors that were considered in this study.

Fig. 1

Study area (created by ArcGIS 10.4 software http://www.esri.com).

Table 1 Factors investigated in this research.

The factors taken into account by this study are introduced briefly:

Climate: this factor describes the weather conditions at the time and place of the accident, which include clear, cloudy, rainy, snowy, stormy, sandstorm, foggy, or dusty conditions. This data was extracted from Wunderground.com.

Lighting: this factor describes the lighting condition of the road in four modes: day, night, sunrise, and sunset. Each accident falls into one of these time intervals.

Slope: this factor includes the road slope in three modes of upward, downward, and zero slopes (level), which were extracted from the road’s center-line levels from the DEM (Digital Elevation Model) of the area.

Alignment: this factor describes the properties of the road’s layout (i.e. curve radius, which can be effective on accident positions).

The data used in this article includes various parameters such as the cause of the accident, type of vehicle, climate, slope, date and time of the accident, and alignment. The four most critical parameters in the study area were selected by an expert and incorporated into the model. Lighting is divided into four classes, namely day, night, sunrise, and sunset, based on the recorded time of the accident; the assignment also accounts for how the times of sunrise and sunset vary across months and seasons. Slopes between -2% and +2% were classified as level, slopes of more than 2% as an upgrade, and slopes lower than -2% as a downgrade. These intervals were selected based on the average slope of the road. To classify the alignment, the curves of the road were identified and their radius and curvature were obtained; the alignment was then classified as either curve or straight. As mentioned before, the data was provided by the Roads and Transportation Organization of Yazd Province and was originally recorded by the police. Because this data is produced exclusively by the police, it is very difficult, if not impossible, to validate it. One method to assess the accuracy of accident data (only with respect to the number of accidents, not their location and time) is to compare it with hospital records, as demonstrated in studies by Curry et al.25 and Soltani et al.18. These studies have shown that linked data can significantly enhance data quality for accident research. This comparison may help to reveal potential discrepancies in the data; however, accidents without injuries leave no hospital record and cannot be checked in this way.
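The slope classes described above can be expressed as a small rule. The following Python sketch is only an illustration: the function name `classify_slope` and the parameter `level_band` are hypothetical, and it assumes that slopes of exactly ±2% count as level.

```python
def classify_slope(slope_percent: float, level_band: float = 2.0) -> str:
    """Map a longitudinal slope (in percent) to one of three slope classes."""
    if slope_percent > level_band:
        return "upgrade"      # steeper than +2%
    if slope_percent < -level_band:
        return "downgrade"    # steeper than -2%
    return "level"            # within the band, boundaries included (assumption)


# Hypothetical slope values along the road centre line.
print([classify_slope(s) for s in (-3.5, -2.0, 0.4, 2.0, 2.1)])
```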

In order to identify accident hotspots, this study utilized the BIRCH clustering and Agglomerative Hierarchical algorithms, which were implemented using Python. The outputs were displayed using ArcGIS 10.4 software (http://www.esri.com). In addition, KNN and RF classification algorithms were utilized to predict the severity of accidents, which were implemented using RapidMiner software. The dataset was divided into 80% for calibration and 20% for validation. These methods were selected because they offer promising alternatives for obtaining comprehensive and accurate insights into the data, ensuring robust clustering results and reliable predictions of accident severity. Below are brief explanations of the mentioned methods.
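As an illustration of the 80%/20% division, the sketch below splits a list of records into a calibration set and a validation set. It assumes a simple random split, which the text does not specify; the function name and the seed are hypothetical. The 465 records used here simply mirror the number of accidents and stand in for the full classification dataset.

```python
import random


def split_calibration_validation(records, validation_share=0.2, seed=42):
    """Shuffle a copy of the records and split it into two lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_validation = round(len(shuffled) * validation_share)
    return shuffled[n_validation:], shuffled[:n_validation]


calibration, validation = split_calibration_validation(range(465))
print(len(calibration), len(validation))
```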

Clustering

Clustering analysis is the process of dividing a heterogeneous population into several homogeneous subsets or clusters. In other words, it is a process by which a set of objects can fall into separate groups or clusters26. In this study, Agglomerative and BIRCH hierarchical clustering algorithms have been used to cluster the accidents.

Agglomerative hierarchical

Agglomerative Hierarchical clustering is a bottom-up clustering method in which clusters have sub-clusters. This clustering merges observations or clusters with the least distance (most similarity) to form a new cluster. This process continues until only one cluster remains. The output of Agglomerative Hierarchical can be a dendrogram that can be cut according to the research requirements to reach the desired number of clusters27. Figure 2 shows the process of this algorithm as a flowchart28.

Fig. 2

Agglomerative Hierarchical algorithm flowchart28.

To perform the calculations related to this clustering method, two distance (similarity) criteria are needed: (1) the distance between pairs of observations, and (2) the distance between clusters. In this research, Euclidean distance was used for the first criterion, and complete linkage, which takes the distance between two clusters to be the largest distance between any pair of their members, was used for the second29. In Eq. (1), \(\text{d}\) is the Euclidean distance, (\({\text{x}}_{1}\),\({\text{y}}_{1}\)) is the coordinate of the first point and (\({\text{x}}_{2}\),\({\text{y}}_{2}\)) is the coordinate of the second point.

$$d=\sqrt{{({x}_{2}-{x}_{1})}^{2}+{({y}_{2}-{y}_{1})}^{2}}$$
(1)
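The bottom-up merging with Euclidean distance and complete linkage can be sketched in a few lines of plain Python. This is a naive illustration for small inputs, not the implementation used in the study; the function names and the sample coordinates are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two 2-D points, d = sqrt(dx^2 + dy^2)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def complete_linkage(cluster_a, cluster_b, points):
    """Complete linkage: the largest pairwise distance between two clusters."""
    return max(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerative(points, n_clusters):
    """Start from singletons and merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return clusters


# Hypothetical coordinates: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
print(agglomerative(points, 3))
```

Each returned cluster is a list of indices into `points`. Stopping the loop at a chosen number of clusters corresponds to cutting the dendrogram at that level.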

BIRCH

The BIRCH clustering algorithm is a hierarchical clustering algorithm designed to work with big data. It is also user-friendly and easy to implement. The algorithm is built on the CF (clustering feature) tree, a tree-structured summary of the data from which clusters are created. BIRCH first scans the data and compresses it into small summaries held in memory. It then clusters these summaries using a hierarchical algorithm or another algorithm of choice. Finally, it refines the clusters to make them less scattered and more homogeneous. Because BIRCH does not cluster the raw data points directly, it is often used together with other clustering algorithms. Its primary purposes are to reduce computation time and the number of data scans, to identify dense areas, and to mitigate noise30. Figure 3 shows the process of this algorithm as a flowchart31.

Fig. 3

The flowchart of BIRCH clustering31.

The input parameters of this algorithm include:

  • Threshold: the maximum radius a sub-cluster may have after absorbing a new sample; if merging the sample into the closest sub-cluster would exceed this radius, a new sub-cluster is started. The default value of the threshold is 0.5 and it should be as low as possible in the beginning.

  • Branching factor: the maximum number of sub-clusters in each node. If a new sample causes this number to be exceeded, the node is split. The default value is 50.

  • N_clusters: the number of clusters returned after the final clustering step.
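To make the role of the threshold concrete, the sketch below keeps a flat list of clustering features (count, linear sum, and sum of squares) and absorbs each point into the nearest sub-cluster only if the resulting radius stays within the threshold. It is a simplified illustration under stated assumptions: the CF tree itself, and therefore the branching factor and the final clustering step, are deliberately left out, and the class and function names are hypothetical.

```python
import math


class ClusteringFeature:
    """CF summary of a sub-cluster: count, linear sum, and sum of squared norms."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.squared_sum = sum(v * v for v in point)

    def centroid(self):
        return [v / self.n for v in self.linear_sum]

    def radius_if_added(self, point):
        """Radius the sub-cluster would have after absorbing the point."""
        n = self.n + 1
        ls = [a + b for a, b in zip(self.linear_sum, point)]
        ss = self.squared_sum + sum(v * v for v in point)
        variance = ss / n - sum((v / n) ** 2 for v in ls)
        return math.sqrt(max(variance, 0.0))   # guard against rounding below zero

    def add(self, point):
        self.n += 1
        self.linear_sum = [a + b for a, b in zip(self.linear_sum, point)]
        self.squared_sum += sum(v * v for v in point)


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def build_subclusters(points, threshold):
    """Single scan: absorb each point into the nearest CF or start a new one."""
    subclusters = []
    for point in points:
        if subclusters:
            nearest = min(subclusters, key=lambda cf: distance(cf.centroid(), point))
            if nearest.radius_if_added(point) <= threshold:
                nearest.add(point)
                continue
        subclusters.append(ClusteringFeature(point))
    return subclusters


# Hypothetical points forming two compact groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
summaries = build_subclusters(points, threshold=0.5)
print([(cf.n, cf.centroid()) for cf in summaries])
```

Because the three CF components are additive, a point can be absorbed without revisiting earlier points, which is what lets the data be summarized in one scan. Lowering the threshold yields more, tighter sub-clusters.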

To calculate the goodness of the mentioned clustering algorithms, according to Eq. (2), the Silhouette index has been used. Its value ranges from -1 to 1.

$$Silhouette\; index=\frac{(b-a)}{max(a,b)}$$
(2)

where:

\(a\) is the average intra-cluster distance, i.e. the average distance between a point and the other points in its own cluster;

\(b\) is the average nearest-cluster distance, i.e. the average distance between a point and the points of the closest cluster to which it does not belong.
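A minimal Python sketch of Eq. (2) follows. It assumes the usual per-point definition, in which \(a\) is the mean distance from a point to the other members of its own cluster and \(b\) is the mean distance to the members of the nearest other cluster, with the index averaged over all points. The function name and the sample data are hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    def dist(p, q):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))

    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:                      # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: the first labelling matches the two groups, the second does not.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own. Comparing the index across candidate numbers of clusters is one way to select that number.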

Classification

Classification is a machine learning method that learns to assign a class label to input data. Because of the importance of extracting information from accident data, classification can be a helpful tool in road safety research. An accident severity prediction model enables various agencies to estimate the severity of a reported accident or of an accident that may happen at a specific location in the future22. Because an accident results from many interacting factors, it is impossible to predict with certainty which severity class (damage, injury, or fatal) a given accident will belong to. However, the severity of accidents can be predicted to some extent. In this study, the RF and KNN classification algorithms have been used to predict the severity of accidents.

K-nearest neighbor (KNN)

The KNN method is a supervised algorithm first developed and applied by Cover and Hart. It is based on the principle that "similar samples in a data set are often adjacent". The decision rule is to assign an observation to the class with the highest number of votes among its K nearest neighbors32. KNN parameters include the number of neighbors (K) and the distance function33. The best choice of K depends on the data. In general, larger values of K reduce the effect of noise on the classification but blur the boundaries between classes. One of the simplest ways to select K is to try a range of values and compare them using a validation criterion, such as overall accuracy32. Consistent with the majority of studies, Euclidean distance has been used in this research as the distance function. The process of this classification algorithm is shown in Fig. 4.
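The decision rule and the selection of K over a range of candidates can be sketched as follows. This is a plain-Python illustration rather than the RapidMiner workflow used in the study; the function names, feature values, and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Return the majority label among the k training samples nearest to the query."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tie, most_common keeps first-seen order, so the nearer neighbour wins.
    return votes.most_common(1)[0][0]


def overall_accuracy(train_x, train_y, test_x, test_y, k):
    """Share of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def best_k(train_x, train_y, valid_x, valid_y, candidates):
    """Pick the candidate K with the highest validation accuracy (first one on ties)."""
    return max(candidates,
               key=lambda k: overall_accuracy(train_x, train_y, valid_x, valid_y, k))


# Hypothetical two-feature samples with hypothetical severity labels.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["damage", "damage", "damage", "injury", "injury", "injury"]
valid_x = [(0.5, 0.5), (5.5, 5.5)]
valid_y = ["damage", "injury"]

print(knn_predict(train_x, train_y, (0.5, 0.5), k=3))
print(best_k(train_x, train_y, valid_x, valid_y, range(1, 4)))
```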

Fig. 4

KNN processing flowchart34.

Random Forest (RF)

The Random Forest algorithm is an ensemble (hybrid) classification algorithm and one of the most widely used machine learning algorithms35. The term “Random Forest” is derived from random decision forests, first proposed by Ho in 1995 and further developed by Amit and Geman36. The RF algorithm is composed of a group of decision trees; therefore, this method includes all the basic concepts of the decision tree. In other words, the RF algorithm combines several decision trees to make more accurate predictions37. It aggregates the votes from the different decision trees to decide the final class of the test object38. The flowchart in Fig. 5 shows the process of this classification algorithm39.
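The vote aggregation can be illustrated with a deliberately reduced forest. In the sketch below, each "tree" is a one-split decision stump trained on a bootstrap sample and a single randomly chosen feature. This is a simplifying assumption made for brevity, not the configuration used in the study, and all names and data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature, minimising training errors."""
    values = sorted({row[feature] for row in rows})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2            # midpoint between neighbouring values
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:                            # the sample has one value for this feature
        label = majority(labels)
        return feature, float("inf"), label, label
    return (feature,) + best[1:]


def fit_forest(rows, labels, n_trees=15, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Aggregate the votes of all trees and return the majority class."""
    votes = [left if row[feature] <= threshold else right
             for feature, threshold, left, right in forest]
    return majority(votes)


# Hypothetical two-feature samples with hypothetical severity labels.
rows = [(0, 0), (0.5, 1), (1, 0.5), (0.2, 0.8), (5, 5), (5.5, 6), (6, 5.5), (5.2, 5.8)]
labels = ["damage"] * 4 + ["injury"] * 4
forest = fit_forest(rows, labels)
print(predict(forest, (0.4, 0.6)), predict(forest, (5.6, 5.4)))
```

A full Random Forest grows deeper trees and draws a fresh random subset of features at every split, but the final step, a majority vote over the trees, has the same form as `predict` here.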

Fig. 5

RF processing flowchart39.

Classification accuracy metrics

The confusion matrix is a commonly used tool for assessing the validity of classification methods, and it can be applied to binary and multi-class classification problems40. It is a table that summarizes the performance of a classification algorithm41. An example of a binary confusion matrix is shown in Table 2.

Table 2 Typical binary confusion matrix.

The confusion matrix counts the predicted and actual values by displaying them in a matrix. According to Table 2, the entries of the confusion matrix are defined as follows:

  • True positive value (TP): the number of correct predictions when the actual class is positive.

  • False positive value (FP): the number of incorrect predictions when the actual class is negative, i.e. negative samples predicted as positive.

  • True negative value (TN): the number of correct predictions when the actual class is negative.

  • False negative value (FN): the number of incorrect predictions when the actual class is positive, i.e. positive samples predicted as negative.

Some indices can be used to evaluate the accuracy and validity of the prediction results. The indices used in this study are defined below:

Overall accuracy (OA): is used to compare system performance. According to Eq. (3), OA determines the ratio of correctly predicted samples to all samples, which shows how correctly a classifier can predict the samples.

$$OA=\frac{TP+TN}{TP+TN+FP+FN}$$
(3)

Precision: determines how many samples are classified correctly in each class based on the predicted labels. This criterion is calculated according to Eq. (4).

$$Precision=\frac{TP}{TP+FP}$$
(4)

Kappa: compares the existing classification algorithm with a random classification algorithm and explains to what extent the existing classification algorithm performed better than a random algorithm according to Eq. (5)42.

$$Kappa=\frac{2\times (TP\times TN-FN\times FP)}{\left(TP+FP\right)\times \left(FP+TN\right)+(TP+FN)\times (FN+TN)}$$
(5)

The recall criterion is the ratio of the correctly predicted samples to the number of all real samples in the class, which is calculated according to Eq. (6)42.

$$Recall=\frac{TP}{TP+FN}$$
(6)
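The four indices can be computed directly from the entries of a binary confusion matrix. The sketch below uses the standard definitions, with precision as TP/(TP + FP) and recall as TP/(TP + FN), together with the Kappa expression of Eq. (5). The function name and the counts are hypothetical and are not taken from the study's tables.

```python
def binary_metrics(tp, fp, tn, fn):
    """Overall accuracy, precision, recall, and Cohen's kappa from binary counts."""
    total = tp + fp + tn + fn
    return {
        "overall_accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),     # correct share of predicted positives
        "recall": tp / (tp + fn),        # detected share of actual positives
        "kappa": 2 * (tp * tn - fn * fp)
                 / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)),
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

For these counts the overall accuracy is 0.85 while the chance agreement is 0.5, so Kappa is (0.85 - 0.5)/(1 - 0.5) = 0.7, which matches the value returned by the closed-form expression.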

Results and discussions

Investigation of accident prone areas

For the Agglomerative Hierarchical algorithm, the number of clusters must be specified in advance; merging then continues until that number of clusters is reached. The optimal number of clusters for Yazd-Kerman road was determined to be 18 using the silhouette validation index. Then, clusters with sharp differences in the number of accidents were selected based on natural breaks. Clusters 14 and 16 had the highest number of accidents with 114 and 87, respectively, in the Agglomerative Hierarchical algorithm. Figure 6 shows the Silhouette validation index, Fig. 8 illustrates the clustering output, and Fig. 10 shows the number of crashes per cluster.

Fig. 6

Silhouette Index for Agglomerative Hierarchical Clustering (created by Excel software http://www.microsoft.com).

Fig. 7

Silhouette index for BIRCH clustering (created by Excel software http://www.microsoft.com).

Fig. 8

Output of Agglomerative Hierarchical Clustering (created by ArcGIS 10.4 software http://www.esri.com).

Fig. 9

Output of BIRCH clustering (created by ArcGIS 10.4 software http://www.esri.com).

Fig. 10

Accident per cluster in Agglomerative Hierarchical Clustering (created by Rapidminer software docs.rapidminer.com).

The BIRCH hierarchical algorithm requires the input of parameters such as threshold (T), branching factor (B), and the desired number of clusters (n). Through trial and error, a threshold value of T = 0.02 was determined as the optimal starting point for this study. The default value of the branching factor, B = 50, was employed. The optimal number of clusters was determined using the silhouette validation index, and 18 clusters were identified as being appropriate for the BIRCH hierarchical method. Clusters 3 and 4 had the highest number of accidents, with 54 and 64 accidents, respectively. The Silhouette validation index diagram is shown in Fig. 7, the clustering output is represented in Fig. 9, and the number of crashes per cluster is depicted in Fig. 11.

Fig. 11

Accident per cluster in BIRCH clustering (created by Rapidminer software docs.rapidminer.com).

Clusters with a high number of accidents can be used to identify accident prone areas of Yazd-Kerman road. By joining clusters 14 and 16 of the Agglomerative Hierarchical algorithm and clusters 3 and 4 of BIRCH clustering, the accident prone areas were determined, which are shown in Fig. 12.

Fig. 12

Final accident prone areas (created by ArcGIS 10.4 software http://www.esri.com).

In accident prone area 1, shown in Fig. 12, the old road passes the front side of a resting area (point 3 in Fig. 13). Due to the increase in traffic volume over the years, a decision was made to construct a new road and convert the existing road into a one-way road. At the resting area (Abolfazl Mosque), route A was connected to route B, and in the opposite direction, route C was connected to route D. Analysis of the accident data, as shown in Fig. 14, showed that most accidents in this accident prone area occurred during sunrise and sunset, as well as at night-time, when drivers may face reduced awareness levels and poor peripheral vision.

Fig. 13

Field investigation of the accident prone area, cluster one (created by http://www.maps.google.com).

Fig. 14

Statistics of cluster one’s parameters (created by Excel software http://www.microsoft.com).

Field investigation revealed that in the direction from C to D, special topographical conditions characterized by a negative longitudinal slope, insufficient lighting at the curve, and inadequate road signage made it difficult for drivers to see the D path. As a result, drivers mistakenly perceived the extension of path C as A, which caused their vehicles to leave the road and overturn in this hazardous situation.

Furthermore, it was observed that the non-standard turning ramp at point 2 did not provide a sufficient length for the acceleration lane, which created turbulence in traffic flow and increased the risk of accidents. Additionally, the improper separation of paths between high-speed traffic and drivers intending to stop at Abolfazl mosque or leave the place and enter the road has resulted in severe accidents at point 3.

The overall findings indicate that the primary cause of accidents in the first accident prone area was inadequate geometric design for providing access to the main road for mosque-goers. Changing the mosque’s land use is clearly not feasible. Therefore, to reduce the probability of accidents, it is essential to rectify the road’s geometric design. The U-turns at the entrance and exit of this accident prone area should be redesigned, and the use of traffic signs should be optimized to improve clarity and awareness for drivers unfamiliar with the road layout at that spot. Enhanced lighting can also significantly improve drivers’ situational awareness. Additionally, a well-designed access-mobility level is crucial in managing the speed of vehicles leaving the main road to stop at the resting area and parking lots, and vice versa.

In the second accident prone area (depicted in Fig. 12), the causes of accidents were entirely different from those in the first area. Statistical data analysis (Fig. 15) indicates a strong correlation between the identified accident hotspots and the area’s weather conditions. Based on field studies of the region and information obtained from social resources, it was discovered that the area experiences sudden rainfall during the summer. Due to various factors, including poor vegetation, proximity of the mountain to the road, and the slight slope of the land, a large area of the region is susceptible to low-speed flooding, which causes soil erosion and sedimentation of silt and clay from the mountainside to the roadside (as shown in Fig. 16).

Fig. 15

Statistics of cluster two’s parameters (created by Excel software http://www.microsoft.com).

Fig. 16

Field investigation of the accident prone area, cluster two (created by http://www.maps.google.com).

Furthermore, the existence of a high-speed wind corridor in this part of the road results in the rapid formation of sandstorms, which decreases driving visibility due to increased wind speeds. Consequently, the majority of accidents in this area occur when a mass of soil passes over the Yazd-Kerman road during a sandstorm.

Moreover, the unsafe geometric design of the road in this area has significantly increased the risk of accidents. Inadequate consideration of the stopping sight distance in the design of the vertical curve and the presence of a horizontal curve immediately after the unsafe vertical curve have made this section of the road more hazardous, particularly at night or during a sandstorm when horizontal visibility is reduced. As a result, numerous vehicles have lost control and left the roadway.

The slope of alignments in the vertical curve can be reduced to improve the sight distance and enhance safety in the second accident prone area. Additionally, to further increase road safety in this area, it is advisable to widen the road by paving the unpaved shoulder, planting trees along the path, increasing the number of reflective signs, and using shy bars on the road borders. These measures are expected to reduce the probability of accidents in that hotspot.

Other factors, such as vehicle quality in Iran, increase the severity of accidents. The Pride model, manufactured by SAIPA Automaker Company, is 21% more prone to accidents than other cars in Iran. Research has shown that accidents involving SAIPA’s Pride car, the most affordable vehicle in Iran, are likely to result in higher road traffic injuries at the accident scene43. However, because this issue is specific to Iran, it has been excluded from the current research. Nonetheless, vehicle quality is an essential factor that should be considered in any study aimed at reducing the severity of accidents.

Accident severity

In this study, the RF and KNN classification algorithms were employed to predict the severity of accidents along the road. To accomplish this, the data was divided into four classes, as shown in Table 3. Class 0 contains accident-free points and was selected based on the features listed in Table 1 along the path. Class 1 includes damage accidents, Class 2 comprises injury accidents, and Class 3 encompasses fatal accidents. By dividing the data into these classes, the classification algorithms were able to more accurately predict the severity of accidents based on their occurrence history.

Table 3 Number of accidents per class.

To predict the severity of accidents using the KNN algorithm, the K parameter and the distance function must be determined. In this study, the Euclidean distance function was utilized, as it has been shown to perform well in previous research. During the training phase, values of K ranging from 1 to 12 were tested. Ultimately, a value of K = 2 was selected for the Yazd-Kerman road, as it led to the best overall accuracy, which is consistent with previous studies42,44. As demonstrated in Fig. 17, this model achieved an accuracy of over 71% by examining the two nearest neighbors. The confusion matrix is presented in Table 4. Table 5 shows the validation criteria for this method using the test dataset.

Fig. 17

Overall Accuracy of KNN (created by Excel software http://www.microsoft.com).

Table 4 Confusion matrix of KNN.
Table 5 Accuracy of KNN.

According to Fig. 17, increasing the K value decreases overall accuracy. The reason is that the features that are spatially close to each other possess close attributes because they all have similar slopes, climates, alignment, and lighting. In contrast, those accidents that are far from the intended feature have different attributes to that feature because they are in different locations with different attributes. In this study, as illustrated in Fig. 17, analyzing two nearest neighbors can yield an accuracy exceeding 71%. By using more essential attributes, a higher level of accuracy can be achieved.

The RF algorithm randomly selects several predictive variables, which are a subset of the total variables. To predict the severity of accidents, this algorithm requires two critical parameters: the number of trees (n-tree) and the number of variables randomly sampled at each split (MTRY). During the training process, the first parameter was determined, as displayed in Fig. 18. The second parameter was initialized based on the first parameter during the algorithm’s implementation. Figure 18 indicates that constructing 15 decision trees enables the RF algorithm to attain an overall accuracy of 60%. The confusion matrix is presented in Table 6. Table 7 displays the validation criteria for this method using the test dataset.

Fig. 18

Overall Accuracy of RF (created by Excel software http://www.microsoft.com).

Table 6 Confusion matrix of RF.
Table 7 Accuracy of RF.
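To make the roles of the two RF parameters concrete, the following is a minimal sketch of a random forest in plain Python. It is not the implementation used in the study: the toy data and all function names are hypothetical, `mtry` is read here as the number of features considered at each split (an assumption about how MTRY was used), and the trees use Gini impurity without pruning. Each tree is grown on a bootstrap sample, and the forest predicts by majority vote over its `n_tree` trees.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Return the (feature, threshold) with lowest weighted Gini, or None if no split helps."""
    best, best_score = None, gini(y)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive observed values
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, mtry, rng):
    """Grow an unpruned tree; every node looks at `mtry` randomly chosen features."""
    if len(set(y)) == 1:
        return y[0]  # pure leaf (labels must not be tuples in this sketch)
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return Counter(y).most_common(1)[0][0]  # majority-class leaf
    f, t = split
    left = [(row, label) for row, label in zip(X, y) if row[f] <= t]
    right = [(row, label) for row, label in zip(X, y) if row[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], mtry, rng),
            build_tree([r for r, _ in right], [l for _, l in right], mtry, rng))

def predict_tree(node, row):
    """Follow the splits until a leaf (a bare label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_tree, mtry, seed=0):
    """Train `n_tree` trees, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_tree):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Hypothetical, well-separated toy data with two features and two classes.
X = [(0, 1), (1, 3), (2, 0), (3, 2), (0, 2), (1, 0), (2, 3), (3, 1), (1, 1), (2, 2),
     (7, 8), (8, 10), (9, 7), (10, 9), (7, 9), (8, 7), (9, 10), (10, 8), (8, 8), (9, 9)]
y = [0] * 10 + [1] * 10
test_X, test_y = [(1, 2), (9, 8)], [0, 1]

# Sweep the number of trees, analogous to the tuning summarized in Fig. 18.
for n_tree in (1, 5, 15):
    forest = fit_forest(X, y, n_tree=n_tree, mtry=1)
    hits = sum(predict_forest(forest, row) == label for row, label in zip(test_X, test_y))
    print(f"n_tree={n_tree:2d}  overall accuracy={hits / len(test_y):.2f}")
```

Because the toy classes are cleanly separable, the sweep says nothing about the accuracy reported in the study; it only shows where n-tree and MTRY enter the procedure.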

A comparison of the outcomes of the RF and KNN algorithms showed that KNN provides higher accuracy than RF. This suggests that accidents occurring close to each other share nearly identical characteristics, so the severity of an accident can be predicted from the attributes of its two nearest neighboring accidents. The confusion matrices in Tables 4 and 6 show that both methods accurately predicted accident-free points, which had desirable attributes such as appropriate climate (clear and cloudy), standard slope (within ± 3%), level alignment, and sufficient lighting (Day). This result highlights the essential impact of these four attributes on accidents on the Yazd-Kerman road. However, because of the small amount of training data, the models struggled to predict fatal accidents in the test data.
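The way overall accuracy and per-class scores follow from a confusion matrix can be sketched as below. The matrix values are hypothetical and are not those of Tables 4 or 6, and the choice of precision and recall as the per-class criteria is an assumption; the function name `per_class_metrics` is likewise illustrative.

```python
def per_class_metrics(cm):
    """Compute overall accuracy plus per-class precision and recall.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    overall = sum(cm[i][i] for i in range(n)) / total
    metrics = {}
    for c in range(n):
        tp = cm[c][c]                             # correctly predicted as class c
        actual = sum(cm[c])                       # all samples whose true class is c
        predicted = sum(cm[r][c] for r in range(n))  # all samples predicted as c
        metrics[c] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return overall, metrics

# Hypothetical 4-class matrix (0 = accident-free, 1 = damage, 2 = injury, 3 = fatal).
cm = [
    [40, 5, 4, 1],
    [6, 30, 12, 2],
    [5, 10, 24, 1],
    [2, 3, 4, 1],
]

overall, metrics = per_class_metrics(cm)
print(f"overall accuracy: {overall:.3f}")
for c, m in metrics.items():
    print(f"class {c}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

In this made-up matrix the rare class 3 has low recall even though overall accuracy looks acceptable, which is why per-class scores are worth reporting alongside overall accuracy when one class has few samples.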

The KNN algorithm is particularly notable for its capacity to predict the damage and injury levels of accidents, for which it provides more accurate predictions than the RF algorithm. Several similar studies, including refs. 21 and 22, have also reported that the KNN method outperforms other algorithms in predicting accident severity. Therefore, KNN has the potential to become a promising tool for predicting the severity of accidents on roads.

However, the KNN algorithm does not necessarily perform better than the RF algorithm in accident analysis. The performance of each algorithm depends on the specific dataset and the problem being solved. KNN is a simple and intuitive algorithm that can work well for certain types of accident analysis problems. It operates by finding the K nearest neighbors of a given data point and using their labels to predict the label of the new point. This can be effective when the data has a clear structure and the nearest neighbors are likely to have similar labels. RF, on the other hand, is a more complex algorithm capable of accommodating a broader array of data structures and relationships among variables. It functions by building a large number of decision trees and combining their predictions into a final prediction. This can be effective when the relationships between variables are complex and the number of features is large. Ultimately, the choice of algorithm depends on the specific problem and the characteristics of the dataset, so it is important to try multiple algorithms and compare their performance to determine which one works best for a given problem.

Conclusions and recommendations

The present study examined the application of machine learning algorithms and spatial analyses to identify the accident prone areas along the Yazd-Kerman road. This research successfully assessed the results of clustering, revealing the fundamental factors contributing to accidents in order to guide future practical interventions. Additionally, a framework was established to predict the severity of accidents along the road and to suggest strategies for risk reduction. The following conclusions were drawn from the current research.

Agglomerative Hierarchy and BIRCH clustering algorithms were able to identify clusters with a significant number of accidents. Each method identified two clusters with a high concentration of accidents. By identifying the intersection of these clusters, two areas were determined as being particularly susceptible to accidents.

In the first area, the main reasons for accidents were the presence of a resting area (Abolfazl mosque) that caused traffic chaos, insufficient lighting at curves, and improper road signage. Most accidents occurred during sunrise and sunset, which coincide with prayer times, suggesting that the mosque is a possible cause of accidents in this area. In the second accident prone area, the main causes of accidents were the downward slope and the presence of a high-speed wind corridor perpendicular to the road direction, which limited the driver’s visibility.

The machine learning techniques applied in the present research were able to predict the occurrence and severity of accidents, which supports efforts to mitigate the severity of accidents and their outcomes. Two classification algorithms, Random Forest and K-Nearest Neighbor, were able to classify and predict accident severity. The K-Nearest Neighbor approach was more effective than the Random Forest technique, recording an overall accuracy of 71% as opposed to 60%.

The results indicated that both algorithms were capable of correctly identifying more than 80% of accident-free points, demonstrating a strong correlation between road and environmental characteristics and the occurrence of accidents. These results highlight the importance of using classification algorithms to predict the severity of accidents, leading to the implementation of preventive measures to reduce the occurrence and severity of accidents, in addition to enabling the efficient allocation and use of construction resources.

Accurately identifying the locations of accidents poses a challenge for studies in this field. To achieve reliable results, the error in accident coordinates should not exceed a few meters. In practice, however, the precise position of an accident cannot always be determined, so the police can record only an approximate position, and such records may be misleading. The same inaccuracy applies to the recorded time of the accident.

Research limitations and future research

The results of this study could have been more satisfactory if additional attributes had been available. If hospital records are available, validating the number of injury accidents against hospital reports would be a useful direction for future research. Furthermore, by including additional attributes, such as traffic density and road surface conditions, the classification algorithms might yield more precise predictions and could potentially identify the factors that contribute to accident severity with higher accuracy. Also, due to data limitations, this study did not consider the type of vehicle (such as trucks, sedans, and motorbikes); taking vehicle types into account is recommended. It is worth mentioning that the number of accidents for each vehicle type should be sufficiently large; otherwise, the prediction will not be reliable.