Introduction

Wildfire outbreaks and burned areas have increased rapidly as climate change has brought more frequent hot and dry weather to landscapes shaped by anthropocentric land use1,2,3 (Jones et al., 2022). In particular, UNEP has projected approximately 30% more wildfires by the middle of the twenty-first century than at current levels (UNEP, 2022), raising red flags for wildfire authorities globally. Meteorological factors such as spring dryness, low rainfall, and strong winds are known to sharply lower the moisture content of forests, which in turn increases the risk of wildfires; because a growing share of outbreaks is caused by anthropogenic factors, global wildfire management has become more complicated4,5,6.

In South Korea (Republic of Korea), wildfires are among the most common types of disaster. In 2022, more than 740 wildfire events occurred across the country, affecting approximately 24,000 ha7. Notably, the wildfire that broke out in Uljin and Samcheok in March 2022 was the longest on record, taking approximately 213 h to extinguish, and it caused large-scale property and ecosystem damage, including the loss of more than 20,000 ha of forest in those areas alone7. From 2014 to 2023, South Korea averaged around 567 wildfire outbreaks per year, with an estimated damage area of about 4,000 ha, and both the annual number of wildfires and the number of large-scale outbreaks increased4,8. Wildfire ignition in South Korea is driven mainly by human activities rather than natural causes: accidental fires started by mountain climbers/visitors or cemetery visitors, incineration, cigarette fires, and the spread of building fires account for 63% of all wildfire cases7,9. A prediction and response system that comprehensively considers not only weather information but also socioeconomic and anthropogenic factors has therefore become indispensable9,10.

Once ignited, a wildfire spreads widely and destroys forest resources and ecosystems, causing medium- to long-term environmental problems such as greenhouse gas emissions and soil erosion11,12. Furthermore, because it causes loss of life and property for nearby residents, along with enormous economic losses in the tourism and forest industries, developing scientific wildfire prediction models and preparing prevention and suppression strategies has become an urgent task. Against this background, attempts have been made to automatically estimate the probability of wildfire outbreaks from large volumes of meteorological, forest, and population data, and to introduce machine learning techniques to improve model performance4,13. Previous academic approaches to wildfires can be divided into spatial susceptibility studies, which aim to identify areas vulnerable to wildfires, and statistical and simulation model studies, which quantify the risk of wildfire outbreaks at specific points in time14,15. In the former, earlier studies focused mainly on identifying ‘hot spots’ with high outbreak risk by considering forest and terrain characteristics, climber/visitor accessibility, and land use types through geographic information systems (GIS), and they made important contributions to regional wildfire risk mapping9,12. In the statistics and simulation field, by contrast, models that estimate the number of wildfire outbreaks in a specific period or season using probabilistic approaches were attempted first16,17,18,19,20, and since the 2000s, machine learning-based models have gained attention along with the development of information and communication technology (ICT) and the advent of the big data era13. For example, artificial neural networks (ANNs), support vector machines (SVMs), and ensemble techniques are known to be appropriate for detecting complex patterns in large amounts of data and maximizing prediction performance21,22. In particular, ensemble-based models such as XGBoost and Random Forest have shown high prediction accuracy in numerous wildfire studies by processing data that include meteorological, geographical, and anthropogenic factors15,21,23,24.

However, most of these studies were limited to creating a ‘fire risk index’ or ‘fire susceptibility map,’ and in practice, relatively few models can predict the probability of a wildfire outbreak on a specific date for a specific regional unit. Additionally, it has been pointed out that even when AI-based models show high predictive accuracy, their black-box nature makes it difficult to explain which factors increase the likelihood of wildfire outbreaks, or the direction of their effect, during the prediction process25,26. Overseas, numerous machine learning studies on wildfires have achieved high accuracy at various spatiotemporal scales, and factor analysis using Explainable AI (XAI) techniques has been actively conducted in Europe25,27,28. It is therefore necessary to develop a machine learning-based wildfire prediction model that considers diverse variables while ensuring interpretability. This study thus aims to construct and evaluate a machine learning model that predicts the ‘daily probability’ of wildfire outbreaks and to improve its interpretability. Specifically, we target the Si (cities) and Gun (counties) of Gangwon State, which has a high frequency of wildfires in South Korea; train and validate a machine learning-based prediction model by integrating information on (1) weather (e.g., temperature, precipitation, humidity, and wind speed), (2) forests (e.g., forest growing stock and forest floor ratio), and (3) socioeconomic factors (e.g., agricultural land use ratio and cemetery land use ratio); and quantify the contributions of the main variables through (4) SHapley Additive exPlanations (SHAP) analysis, to interpret the model prediction process transparently. This study is expected to help lay the foundation for more sophisticated prevention and suppression strategies for forest fire response in South Korea, and to suggest the feasibility of a standard prediction model that can be expanded to a nationwide scale in the future.

Study area and data

Scope of study

The Republic of Korea is a heavily forested country, with approximately 63% of its land area covered by forests, most of which lie in temperate and cool temperate climate zones2. This study targeted Gangwon State, which has the highest percentage of forested area in South Korea and has experienced the highest frequency and magnitude of wildfire outbreaks, as shown in Fig. 1. Gangwon State is located in the northeastern region of South Korea, at a latitude of 37°02’–38°37’ north and a longitude of 127°05’–129°22’ east29. Administratively, it is one of the largest provinces in South Korea, with a total area of approximately 16,875 km², accounting for approximately 16.8% of the total land area of the country. In 2023, its population was around 1.5 million; its population density is lower than the national average, most of its area is mountainous, and approximately 81.4% of its total area is forest29.

Fig. 1

Map of Gangwon State with topographic features and historical wildfire occurrences. The map was generated using QGIS v3.34 (https://www.qgis.org/).

On the eastern side of Gangwon State, the Taebaek Mountains run north–south, creating mountainous terrain with large elevation differences; broad-leaved forests dominate the landscape, and coniferous forests predominate at higher elevations. The Taebaek Mountains also define the boundary between the Yeongdong and Yeongseo regions and produce marked differences in climate between them. Specifically, the Yeongdong region is influenced by a maritime climate due to its proximity to the eastern coast of Korea and is relatively warm and humid, while the Yeongseo region is closer to a continental climate, with a large annual range. The mountainous areas have a mountain climate, leading to complex weather phenomena29. This climatic and topographical heterogeneity brings about regional differences in wildfire outbreaks within Gangwon State: large wildfires have been reported to occur frequently in the Yeongdong region due to dry spring winds (winds between Yangyang-gun and Gangneung-si), while wildfires in the Yeongseo region tend to spread gradually over a relatively long period14,30.

Gangwon State thus combines terrain that is structurally vulnerable to wildfire outbreaks with a high level of anthropogenic activity. Accordingly, a detailed prediction model that considers local characteristics is essential for preventing and responding to wildfires efficiently. For instance, wildfires in coniferous forests on steep slopes are highly likely to spread rapidly to other areas, and areas near agricultural lands or cemeteries may be at higher risk of ignition due to relatively high levels of activity by climbers/visitors. Furthermore, coastal areas near tourist destinations or recreational areas may be more likely to experience wildfire outbreaks because tourist crowds may increase the likelihood of illegal burning or carelessness30. Therefore, it is necessary to comprehensively identify the natural factors (e.g., climate, topography, and forest structure) and anthropogenic factors (e.g., population density, land use, and ratios of agricultural lands and cemeteries) that increase the likelihood of wildfire outbreaks.

Data description

Here, we collected wildfire cases and the factors influencing wildfire outbreaks in Gangwon State for the decade from 2013 to 2022 at the administrative district level (Si/Gun), and used them to analyze the basic status and build the model. The influencing factors of wildfire outbreaks can be categorized into meteorological, forest-related, and socioeconomic factors, which are listed under [1], [2], and [3], respectively, in Supplementary Table 1. First, the meteorological data comprise Maximum Temperature (TMAX), Relative Humidity (RHM), Precipitation, and Mean Wind Speed (WSMN), for which we used daily observations provided by the Korea Meteorological Administration (KMA). Among them, precipitation was coded as a discrete variable (RNCD) with four levels (0–3) of effectiveness, as it was assumed to have the greatest effect in preventing wildfires on rainy days31. Because not all administrative regions had their own automatic weather station (AWS), we interpolated missing meteorological data from nearby stations. The forest-related data include Coniferous Forest Cover Ratio (CFRT), Private Forest Ownership Ratio (PFRT), and Forest Growing Stock Volume (FGSV), which were obtained from the Korean Statistical Information Service (KOSIS) and the Korea Forest Service (KFS). CFRT reflects a fuel characteristic that determines the extent of wildfire spread, and FGSV indirectly indicates the amount of combustible material. PFRT, which reflects forest ownership, was included as an additional variable because it was assumed to be related to strategies such as wildfire management and restricted access to mountains.

As for socioeconomic factors, Agricultural Land Use Ratio (ALRT), Cemetery Land Use Ratio (CLRT), Forest-Adjusted Population Density (FPDN), and Yeongdong Region Indicator (YDCD) were considered as independent variables. Using data provided by KOSIS and KFS, we calculated the ratio of agricultural land area and of cemetery land area to the total area of each Si and Gun (ALRT and CLRT, respectively), and FPDN was calculated by multiplying the forest area of each Si and Gun by its population density. For the Yeongdong Region Indicator (YDCD), we coded 1 for Yeongdong and 0 otherwise. These variables reflect the following points: the impact on wildfires of anthropogenic ignition, such as paddy field burning and carelessness of cemetery visitors; the need to consider forest area and population simultaneously; and the differences in wildfires between the Yeongdong and Yeongseo regions. Finally, we collected wildfire outbreak data from the Forest Fire Statistical Yearbook and Forest Fire Damage Register4,7 provided by KFS, and coded days with and without wildfire outbreaks as 1 and 0, respectively (FOCR). The aforementioned data were preprocessed and scaled as described in this study and used for model training.

Methods

Our research method comprises three stages: data collection, model training and evaluation, and analysis and visualization. First, after collecting the wildfire outbreak data and independent variables (e.g., meteorological, forest-related, and socioeconomic factors) for the target sites from 2013 to 2023, we preprocessed the data before inputting them into the machine learning models. Here, we also reconstructed all variables on a daily basis: meteorological variables were assigned estimated daily values per Si/Gun using interpolation, and static or annual variables (e.g., CFRT, CLRT, FPDN) were forward-filled to represent constant daily values across the year. This process produced the final dataset in the form of “(Per Si/Gun) × (Per date).” Next, we trained five machine learning models (i.e., Logistic Regression, XGBoost, LightGBM, Random Forest, and Extra Trees) with the data from 2013 to 2022, and optimized their performance through parameter tuning. Finally, we evaluated the prediction accuracy of the models with the held-out 2023 data as a test dataset. In the evaluation process, we not only compared quantitative metrics such as Accuracy and Recall but also used SHapley Additive exPlanations (SHAP) values to improve the interpretability of the models. We used machine learning packages such as scikit-learn in an Anaconda environment with Python 3.9. The overall flow of the study is shown in Fig. 2.

Fig. 2

Workflow of the study, from data collection to analysis and visualization. The diagram illustrates the study’s process, starting from key factors and sub-level variables to the final prediction. Blue boxes represent the major factors or primary tasks within the model. White boxes denote the lower-level variables or sub-tasks associated with these main factors. The red box, labeled “Forest Fire Occurrence”, indicates the final dependent variable that the model aims to predict.

Data preprocessing

The data collected here are categorized into meteorological, forest-related, and socioeconomic data and wildfire outbreak records, which require different preprocessing because their collection methods and units differ. For the meteorological data in particular, direct observations were missing in several cases due to errors in the collection devices or the uneven distribution of stations across regions. We replaced missing measurements for a specific area with the average of the values collected at neighboring stations, and interpolated missing values for a specific date from the trend over consecutive days. As for precipitation, KMA provides daily cumulative precipitation (mm/day), but preliminary analysis revealed that the raw precipitation values contributed little to wildfire prediction when applied to the model. We therefore coded precipitation into four ordinal levels: 3 (on the day of rain), 2 (the next day), 1 (two days later), and 0 (other days). This coding reflects the decreasing impact of precipitation on fire risk as time elapses after a rain event. In addition, because meteorological, forest-based, and demographic variables were considered together in this study, the variables have different units and numerical ranges. Hence, standardization is needed to keep the model from being sensitive to particular scales; specifically, we used StandardScaler from the scikit-learn package, which rescales each variable to a mean of 0 and a standard deviation of 1 (Eq. 1, where \(\upmu\) and \(\upsigma\) are the mean and standard deviation of the variable \(\text{X}\)).

$${\text{X}}{\prime}=\frac{\text{X}-\upmu }{\upsigma }$$
(1)
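The rainfall coding and the standardization in Eq. 1 can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names `rain_code` and `standardize`, the 0 mm cut-off for a rainy day, and the sample values are all assumptions, and the study itself applied scikit-learn's StandardScaler.

```python
from statistics import fmean, pstdev


def rain_code(precip_mm, wet_threshold=0.0):
    """Return the RNCD level (3, 2, 1, or 0) for each day of a precipitation series.

    wet_threshold (mm) is an assumed cut-off for counting a day as rainy.
    """
    codes = []
    days_since_rain = None  # stays None until the first rainy day
    for amount in precip_mm:
        if amount > wet_threshold:
            days_since_rain = 0
        elif days_since_rain is not None:
            days_since_rain += 1
        if days_since_rain is None or days_since_rain > 2:
            codes.append(0)
        else:
            codes.append(3 - days_since_rain)
    return codes


def standardize(values):
    """Apply Eq. 1 using the population standard deviation, as StandardScaler does."""
    mu = fmean(values)
    sigma = pstdev(values)  # a constant column (sigma == 0) would need special handling
    return [(v - mu) / sigma for v in values]


daily_precip = [0.0, 5.2, 0.0, 0.0, 0.0, 1.0, 3.0, 0.0]  # made-up values in mm/day
print(rain_code(daily_precip))
print(standardize([10.0, 20.0, 30.0]))
```

In this sketch a new rainy day resets the code to 3, so consecutive wet days all receive the highest level.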

Model selection and development

To predict daily wildfire outbreaks, we compared representative machine-learning techniques. We considered Logistic Regression, which has traditionally been used frequently because it is relatively simple and easy to explain, together with four tree-based ensemble models: Random Forest32, XGBoost33, Extra Trees34, and LightGBM35. These models predict wildfire outbreaks as a binary classification (0: non-outbreak, 1: outbreak) and differ in predictive accuracy and interpretability.

Logistic Regression is a traditional technique that estimates the probability of wildfire outbreaks from a linear combination of input variables, and it has been used in South Korea to derive a wildfire risk index with meteorological factors as input variables36. Unlike ordinary linear regression, it uses a sigmoid function to output a probability between 0 and 1, \(\text{p}=1/(1+{\text{e}}^{-\text{z}})\). Here, \(\text{z}\) is a linear combination of the independent variables, and the sign of each regression coefficient (\(\upbeta\)) indicates the direction in which the corresponding variable affects the probability of a wildfire outbreak.
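As a small illustration of how the sigmoid turns the linear combination z into a probability, consider the sketch below. The function name `outbreak_probability` and the coefficient values are hypothetical and chosen only for demonstration; they are not fitted values from this study.

```python
import math


def outbreak_probability(features, coefficients, intercept=0.0):
    """Sigmoid of the linear combination z = intercept + sum(beta_k * x_k)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Numerically stable sigmoid: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients for standardized [TMAX, RHM]; not fitted values from the study.
beta = [0.4, -0.9]
dry_hot_day = [1.0, -1.5]
humid_cool_day = [-1.0, 1.5]
print(outbreak_probability(dry_hot_day, beta), outbreak_probability(humid_cool_day, beta))
```

With a positive coefficient on TMAX and a negative one on RHM, the hot, dry day receives the higher probability, which is the kind of directional reading that the coefficient signs allow.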

XGBoost and LightGBM are both ensemble models in the Boosting family, in which each new learner concentrates on the cases that the earlier learners predicted incorrectly. XGBoost33 trains trees in stages, with each tree fitted to compensate for the errors of the preceding trees. It learns quickly, includes regularization techniques to prevent overfitting, and performs well on large datasets.

Random Forest and Extra Trees are both ensemble models built from numerous decision trees. Random Forest32 draws random subsets of the training data during learning and randomly selects candidate variables at each split to construct each tree. Such Bagging-based randomization lowers the overfitting risk of a single tree and ensures stable prediction performance by exploring a wide range of variable combinations. Extra Trees34 adds further randomness to node splitting: the candidate split thresholds are themselves chosen at random at each split. Because this extra randomization tends to lower variance at the cost of a possible increase in bias, appropriate parameter tuning is essential. LightGBM35, the second Boosting model, uses a histogram-based data structure and a Leaf-wise Growth strategy, resulting in higher learning speed and accuracy. It was designed to handle large-scale, high-dimensional data efficiently and can achieve high predictive power on imbalanced data through simple parameter adjustments.

The aforementioned models took the preprocessed independent variables as input in the training phase, and their parameters were updated to minimize the log loss (negative log-likelihood) of the binary classification of the dependent variable, wildfire outbreaks. After training, we input the 2023 data, binarized the predicted probability of wildfire outbreaks with a threshold of 0.5, and compared the result with the actual observations to evaluate the classification performance of the final model.
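For reference, the binary classification loss and the 0.5 threshold described above can be written as follows. This is the standard textbook form, given here as an assumption about the objective rather than as the exact loss implemented in each library; \(\text{n}\) (the number of Si/Gun-day records), \({\text{y}}_{\text{i}}\) (the observed FOCR label), and \({\widehat{\text{p}}}_{\text{i}}\) (the predicted outbreak probability) are notation introduced only for this illustration.

$$\mathcal{L}=-\frac{1}{\text{n}}\sum_{\text{i}=1}^{\text{n}}\left[{\text{y}}_{\text{i}}\,\text{log}\,{\widehat{\text{p}}}_{\text{i}}+(1-{\text{y}}_{\text{i}})\,\text{log}(1-{\widehat{\text{p}}}_{\text{i}})\right]$$

$${\widehat{\text{y}}}_{\text{i}}=\left\{\begin{array}{ll}1,& {\widehat{\text{p}}}_{\text{i}}\ge 0.5\\ 0,& \text{otherwise}\end{array}\right.$$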

Data imbalance: oversampling technique

In the study of wildfire outbreaks, the so-called imbalanced data problem, in which “non-outbreak (zero)” cases vastly outnumber “wildfire outbreak (one)” cases, is a serious issue31,37. In particular, when data are organized by day and by Si and Gun, most records have no wildfire, and outbreak cases represent less than 5% of the total. If a model is trained with so few cases of one category (wildfire outbreaks: 1), its evaluation can be misleading: in an extreme case, the model achieves very high accuracy even though it predicts no outbreak (0) for every event. Recall then drops dramatically, leaving more room for false negatives (FN), where the model misses real wildfire outbreaks.

To address this data imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE), an oversampling technique38. Rather than simply replicating minority-class data at random, SMOTE generates new synthetic data points from the vector differences between neighboring samples within the same class, which can mitigate overfitting (Eq. 2). In Eq. 2, \({\text{x}}_{\text{i}}\) is a minority-class sample, \({\text{x}}_{\text{j}}\) is one of its nearest minority-class neighbors, and \(\uplambda\) is drawn uniformly from (0, 1). Because generating excessive synthetic data with SMOTE can distort the classification boundaries, we limited resampling to a 1:1 ratio between the two classes.

$${{\text{x}}_{\text{i}}}{\prime} = {\text{x}}_{\text{i}}+\uplambda ({\text{x}}_{\text{j}}-{\text{x}}_{\text{i}}),\quad \uplambda \sim \text{U}(0, 1)$$
(2)
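The interpolation in Eq. 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function names, the neighbor count `k`, and the toy points are hypothetical, and a library implementation would typically be used in practice.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point via Eq. 2 from a list of minority-class vectors."""
    rng = rng or random.Random()
    x_i = rng.choice(minority)
    # k nearest minority neighbours of x_i (excluding x_i itself), by Euclidean distance
    others = [x for x in minority if x is not x_i]
    neighbours = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
    x_j = rng.choice(neighbours)
    lam = rng.random()  # lambda drawn uniformly from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]


def oversample_to_balance(minority, n_majority, k=5, seed=0):
    """Add synthetic minority points until the two classes reach a 1:1 ratio."""
    rng = random.Random(seed)
    n_needed = n_majority - len(minority)
    synthetic = [smote_sample(minority, k, rng) for _ in range(n_needed)]
    return minority + synthetic


# Toy minority class with three 2-D points (made-up values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(len(oversample_to_balance(toy_minority, n_majority=10, k=2)))
```

Each synthetic point lies on the line segment between a real minority sample and one of its neighbors, so no point falls outside the region already spanned by the minority class.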

Hyperparameter tuning

Machine learning models have their own hyperparameters, and different hyperparameter settings can significantly affect prediction performance33. For example, in Random Forest, it is possible to adjust the number of trees (n_estimators), the maximum tree depth (max_depth), and the minimum number of samples required to split a node (min_samples_split). XGBoost and LightGBM support additional parameters, including learning_rate, maximum depth, and L1/L2 regularization coefficients (reg_alpha, reg_lambda).

Here, we first performed a random search over an intuitively chosen range and then iteratively narrowed the range around promising combinations. In this process, an excessively large max_depth led to overfitting of the training data, and too many trees (n_estimators) increased the computational cost without guaranteeing a practical improvement in model performance. We therefore limited n_estimators to a range of 50–300 and max_depth to 20 or less, to avoid overfitting while maximizing model performance. To ensure robust evaluation of hyperparameter combinations, we used fivefold cross-validation during this tuning process.
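A random search with fivefold cross-validation can be sketched as below. The bounds on n_estimators (50–300) and max_depth (at most 20) come from the text; everything else is an assumption made for illustration, including the function names, the lower bound on max_depth, the learning_rate range, and the caller-supplied `score_fn`, which stands in for training and scoring a real model on each fold.

```python
import random


def sample_params(rng):
    """Draw one candidate configuration at random."""
    return {
        "n_estimators": rng.randint(50, 300),           # range stated in the text
        "max_depth": rng.randint(2, 20),                # upper bound from the text; lower bound assumed
        "learning_rate": 10 ** rng.uniform(-3, -0.5),   # assumed log-uniform range
    }


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def random_search(score_fn, n_samples, n_iter=20, k=5, seed=0):
    """Return the sampled configuration with the best mean cross-validated score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        scores = [score_fn(params, tr, va) for tr, va in kfold_indices(n_samples, k, seed)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Narrowing the search, as described above, would amount to re-running `random_search` with tighter bounds in `sample_params` around the best configurations found so far.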

Model performance evaluation

The model used here functions as a classifier of wildfire outbreak status, and we evaluated it with a combination of Accuracy, Recall, and Area Under the ROC Curve (AUC), which are representative metrics for binary classification models. Accuracy is the percentage of all events that the model classifies correctly, and Recall is the percentage of actual wildfire events that the model correctly predicts, as shown in Eqs. 3 and 4, respectively. Here, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

$$\text{Accuracy} =\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$
(3)
$$\text{Recall} =\frac{\text{TP}}{\text{FN}+\text{TP}}$$
(4)
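Eqs. 3 and 4 can be computed directly from the confusion counts, as in the sketch below. The function names and the data are made up for illustration; the example uses a degenerate model that always predicts "no outbreak" to show why Accuracy alone is misleading on imbalanced data.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, TN, FP, FN after binarizing probabilities at the threshold."""
    tp = tn = fp = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0  # ">=" at the boundary is an assumption
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Eq. 3: share of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Eq. 4: share of actual outbreaks that were detected."""
    return tp / (tp + fn)


# Made-up example: 5 outbreak days out of 100, and a model that always predicts "no outbreak".
labels = [1] * 5 + [0] * 95
always_zero = [0.0] * 100
tp, tn, fp, fn = confusion_counts(labels, always_zero)
print(accuracy(tp, tn, fp, fn), recall(tp, fn))
```

The always-negative model reaches an Accuracy of 0.95 while detecting none of the outbreaks, which is why Recall is reported alongside Accuracy.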

AUC is a metric that comprehensively evaluates a model’s classification ability regardless of the threshold setting21,38, and is defined in Eq. 5, where TPR and FPR are the true and false positive rates at threshold t. An AUC close to 0.5 indicates performance equivalent to random classification, and an AUC close to 1 indicates a near-perfect classifier. A model with a high AUC and excellent Recall can generally be regarded as having stable classification performance while missing few real outbreak cases.

$$\text{AUC} ={\int }_{\infty }^{-\infty }\text{TPR}(\text{t})\,\text{d}(\text{FPR}(\text{t}))$$
(5)
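The area in Eq. 5 is equal to the probability that a randomly chosen outbreak case receives a higher score than a randomly chosen non-outbreak case, with ties counted as one half. The sketch below uses that pairwise formulation; the function name `auc` and the sample scores are illustrative assumptions, and the pairwise loop is meant for small examples rather than large datasets.

```python
def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because only the ordering of the scores matters, the result does not depend on the classification threshold.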

One limitation of machine learning-based models is their black-box nature: even when they produce good results, it is difficult to explain those results25,26. If it is unclear which factors contribute most to the predicted outcomes, the practical usefulness of a model may decrease. Here, we used SHAP (SHapley Additive exPlanations) values to improve interpretability. SHAP derives from the concept of the Shapley value in game theory and quantifies the contribution of each independent variable to the predicted results39. The contribution of a variable is calculated by averaging, over all combinations of the other variables, the difference between the predicted values obtained with and without that variable, as expressed in Eq. 6 (ref. 27), where N is the set of all p variables, S is a subset that excludes variable m, and v(S) is the prediction obtained using only the variables in S. The disadvantage of this technique is that computing SHAP values is computationally expensive, whereas its advantage is that the results can be visualized intuitively. The SHAP analysis was applied to the four machine learning models other than Logistic Regression.

$${\varphi }_{\text{m}}(\text{v}) =\sum_{\text{S}\subseteq \text{N}\setminus \{\text{m}\}}\frac{|\text{S}|!(\text{p}-|\text{S}|-1)!}{\text{p}!}[\text{v}(\text{S}\cup \{\text{m}\})-\text{v}(\text{S})]$$
(6)
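Eq. 6 can be evaluated exactly when the number of variables is small, as the sketch below shows. The value function `toy_value` and its numbers are invented for illustration and are unrelated to the fitted models; practical SHAP implementations rely on approximations or tree-specific algorithms because the number of subsets grows exponentially with p.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a value function defined on frozensets."""
    p = len(players)
    phi = {}
    for m in players:
        others = [q for q in players if q != m]
        total = 0.0
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! from Eq. 6
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {m}) - value(s))
        phi[m] = total
    return phi


# Hypothetical additive effects plus one RHM-TMAX interaction (illustrative numbers only).
MAIN = {"RHM": 0.6, "TMAX": 0.3, "WSMN": 0.1}


def toy_value(subset):
    total = sum(MAIN[name] for name in subset)
    if "RHM" in subset and "TMAX" in subset:
        total += 0.2
    return total


print(shapley_values(list(MAIN), toy_value))
```

In this toy case the interaction term is split equally between RHM and TMAX, and the contributions sum to the value of the full variable set minus the value of the empty set.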

Results and discussion

Variable contribution analysis

Figure 3 presents the correlation matrix of the main variables used in this study. First, the low correlation coefficients between FOCR (status of wildfire outbreaks) and the other variables indicate the limits of predicting wildfire outbreaks from any single independent variable. Nevertheless, the correlation coefficients of RHM and RNCD with FOCR are −0.10 and −0.06, respectively, which are larger in magnitude than those of the other variables. Although some correlations exist between variables (e.g., a positive correlation between TMAX and RNCD, and a negative correlation between FGSV and ALRT, as shown in Supplementary Table 2), the Variance Inflation Factor (VIF) of every variable is lower than 10 (equivalently, every Tolerance is higher than 0.1), indicating that multicollinearity among the independent variables is not serious28. To further assess the individual predictive power of the variables with respect to forest fire occurrence (FOCR), we also calculated the Mutual Information (MI) metric. The MI values between each independent variable and FOCR revealed that the meteorological factors (TMAX, RHM, WSMN, and RNCD) had particularly high MI values, suggesting their key role in predicting wildfire outbreaks.

Fig. 3

Correlation matrix of variables included in the predictive models.

To identify the relationships between variables more specifically, Principal Component Analysis (PCA) was also performed, and the results were visualized as a biplot in Supplementary Fig. 1. As shown in Table 1, PC1 is an axis with high contributions from anthropogenic and forest variables such as ALRT, CLRT, and PFRT, whereas PC2 is an axis with high contributions from natural environmental factors such as CFRT, FGSV, and RHM. For TMAX (Maximum Temperature), a lower negative value was found on PC2 than on PC1, indicating that meteorological and forest variables interacted with each other. The arrows for ALRT and PFRT extend in different directions, showing a substantial contrast between the proportions of agricultural land and private forest land. This implies that wildfires may be influenced by interactions between anthropogenic ignition sources (e.g., agricultural and cemetery-related activities) and forest management practices (e.g., private or national forests).

Table 1 Loadings of Principal Components 1 and 2.

Model performance

Logistic Regression, Random Forest, XGBoost, Extra Trees, and LightGBM models were trained to predict the probability of wildfire outbreaks, and their performance was evaluated by comparing predictions on the 2023 test data with the actual wildfire outbreak records. As shown in Table 2, the LightGBM and Extra Trees models achieved the highest Accuracy, 0.786 and 0.767, respectively, while the other models ranged from 0.73 to 0.75. For Recall, Random Forest performed best (0.828), and the other models also performed well (0.77–0.80). Recall is important in disaster response because it indicates the percentage of actual wildfire outbreaks that the model predicts as “outbreaks” rather than misses.

Table 2 Performance evaluation of predictive models using multiple metrics.

Additionally, the models’ overall classification capabilities were assessed using the AUC (Area Under the ROC Curve) metric shown in Fig. 4. The Extra Trees model showed the highest value of 0.839, followed by Random Forest (0.836), LightGBM (0.835), XGBoost (0.834), and Logistic Regression (0.831). In summary, the ensemble models showed good overall predictive power for wildfire outbreaks after appropriate parameter tuning, and the tree-based ensemble techniques were superior for detailed predictions on a daily basis. Although Logistic Regression is a relatively simple model, its predictions for Gangwon State at the Si and Gun level were not significantly inferior to those of the ensemble models, confirming that it remains a strong model.

Fig. 4

ROC curves and AUC scores of wildfire prediction models.

In addition to these quantitative metrics, the confusion matrix shown in Supplementary Fig. 2 provides a more detailed view of how accurately each model classifies wildfire outbreaks and non-outbreaks. In the confusion matrix, the lower left (FN) area represents the actual wildfire outbreaks missed by the model, while the lower right (TP) area indicates the actual outbreaks that were correctly predicted. In the confusion matrix for the Random Forest model, FN is relatively small (11 cases), confirming that the model is strong at detecting actual wildfire outbreaks. The consistently high recall observed across all models is paramount for wildfire detection; minimizing missed events, despite a trade-off in precision, prioritizes the identification of actual wildfires.

SHAP analysis

Figure 5 shows the SHAP values calculated for a subset of the 2023 test dataset (Figs. 1, 2, 3, 4) for the four ensemble-based models; the influence and direction of each variable are discussed below. The left side of the figure shows the SHAP distribution per sample for each variable (beeswarm plot), and the right side shows a bar plot of the mean absolute SHAP value.

Fig. 5

SHAP analysis of the ensemble model (Left: Beeswarm plot; Right: Summary bar plot).

First, relative humidity (RHM) appears to be a key determinant of the likelihood of wildfire outbreaks, with the highest mean SHAP values across all four models (e.g., approximately +0.75 in XGBoost and +1.13 in LightGBM). In the SHAP distribution, low relative humidity conditions (blue dots) show positive SHAP values, increasing the predicted likelihood of wildfire outbreaks, while high relative humidity conditions (red dots) show chiefly negative SHAP values, implying that they suppress wildfire risk. This finding is consistent with previous studies40 showing that lower humidity dries out forest fuels, making them more vulnerable to ignition.

The precipitation code (RNCD) confirmed that wildfire risk is lower for a certain period after rainfall. Its mean SHAP value was high in the LightGBM model (+0.25), and higher RNCD values (more recent rainfall) clearly tended to suppress the likelihood of wildfire outbreaks (negative SHAP). Maximum temperature (TMAX) was also one of the main influencing variables, with mean SHAP values of +0.14 in the XGBoost model and +0.21 in the LightGBM model. As the temperature increases, the SHAP values become positive, indicating that the risk of wildfire outbreaks increases under high-temperature conditions. Mean wind speed (WSMN) showed an overall negative distribution of SHAP values, indicating that higher wind speeds slightly decreased the predicted probability of wildfire outbreaks. However, considering that localized high winds or instantaneous maximum wind speeds can promote the spread of wildfires10, a more detailed analysis of wind speed data should be conducted in the future.

The SHAP values of the forest-related variables CFRT and FGSV were mostly positive, implying that the models tended to predict a higher likelihood of wildfire outbreaks in areas with denser coniferous forests or larger forest growing stock volumes, a tendency that may also reflect the risk of wildfire spread once a fire occurs. In contrast, PFRT had negative SHAP values ranging from −0.03 to −0.06, suggesting that private forests may be managed relatively systematically or may be subject to more preventive measures such as access control.

As for anthropogenic factors, both ALRT and CLRT showed positive mean SHAP values, which is consistent with the finding of an earlier study that embers from daily activities such as agricultural work and cemetery visits contributed to wildfire ignition31. The SHAP values of forest-adjusted population density (FPDN) also ranged from +0.02 to +0.14, indicating a higher wildfire risk when sufficient fuel (forest) and anthropogenic ignition sources (population) exist simultaneously (Hong et al., 2018). Finally, the Yeongdong Region indicator (YDCD) has an overall low SHAP value and is biased in the negative direction, reflecting the relatively lower number of wildfire outbreaks observed in the Yeongdong Region within the study area.

Although this study adopted only SHAP to ensure interpretability, future work may benefit from a comparative analysis with other XAI techniques, such as LIME41, to explore the trade-offs in explanatory performance and domain suitability.

Conclusion

We proposed a procedure that combines machine learning ensemble models with an oversampling technique (SMOTE) to build a year-round, daily predictive model of wildfire outbreaks in Gangwon State, South Korea, and that interprets the main influencing factors through SHAP analysis. First, we compared diverse models, including Logistic Regression, Random Forest, XGBoost, Extra Trees, and LightGBM, and found that the tree ensemble algorithms achieved high predictive performance (a maximum AUC of 0.839 and a maximum Recall of 0.828). These results suggest that the ensemble models effectively captured the complexity of wildfire outbreak patterns arising from the meteorological, forest-related, and anthropogenic variables involved. At the same time, the Logistic Regression model, despite its relatively simple structure, maintained competitive predictive performance, confirming that it remains a strong baseline model.
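
To illustrate the oversampling step, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the neighbour count `k=2`, the seed, and the two-feature fire-day vectors are all assumptions made for illustration; the study's actual implementation and parameters are not given here.

```python
import random
from typing import List, Sequence

Point = List[float]


def smote(minority: Sequence[Point], n_new: int, k: int = 2, seed: int = 0) -> List[Point]:
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic: List[Point] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical fire-day samples: [relative humidity (%), max temperature (deg C)].
fire_days = [[25.0, 22.0], [30.0, 24.0], [28.0, 19.0], [35.0, 26.0]]
new_points = smote(fire_days, n_new=6)
for point in new_points:
    print([round(v, 2) for v in point])
```

Because every synthetic point is interpolated between two observed minority samples, it stays within the range of the observed fire-day data. Oversampling of this kind is normally applied to the training split only, so that evaluation metrics such as AUC and Recall are computed on unaltered data.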

Second, the SHAP analysis reaffirmed that meteorological factors such as RHM, RNCD, and TMAX were crucial for wildfire outbreaks. Furthermore, forest-related characteristics (coniferous forest cover and forest growing stock volume) and anthropogenic factors (agricultural land and cemetery land use ratios and forest-adjusted population density) also contributed substantially to the predictions. This implies that a model that comprehensively considers these variables can provide a more refined estimate of wildfire ignition potential than a simple model considering only meteorological factors. This finding indicates the need to monitor and manage anthropogenic factors, in addition to the meteorological and forestry sectors, to prevent and respond to wildfires in South Korea.

Third, this study is novel in that it used year-round data rather than being limited to the specific seasons when wildfire outbreaks are more common. This approach is in line with the need for a predictive model applicable throughout the year, as wildfire outbreaks in South Korea increasingly occur year-round. However, the data used here were limited to the Si and Gun (city and county) level, which makes it difficult to capture localized meteorological or topographical variations; in addition, the meteorological data relied on limited data from meteorological stations, which constrains prediction accuracy. Nevertheless, the ensemble-based wildfire outbreak prediction model and SHAP-oriented interpretation procedure proposed here can help practitioners in wildfire authorities intuitively understand why the ignition risk is higher in a particular area on a particular day. Future research that incorporates remote sensing data, such as high-resolution satellite imagery, together with ultra-short-term forecasts, and that applies deep-learning techniques such as RNNs or CNNs22,42,43, could enable the development of more sophisticated models. These advancements could potentially capture localized meteorological and topographical effects and even estimate the potential scale of wildfire damage. Furthermore, a model that predicts the damage scale by separately categorizing large wildfire cases or multiple ignition events would be expected to support both pre-suppression strategy and real-time response.

In conclusion, this study offers a methodological approach that seeks to increase both prediction accuracy and transparency by using machine learning techniques to predict the likelihood of daily wildfire outbreaks in Gangwon State, South Korea, and by using explainable AI (XAI) techniques to interpret the internal operating principles of the model. This approach can be extended to a national scale in the future, and by combining it with various data types such as floating population and ultra-short-term forecasts, it can make a practical contribution to disaster management decision-making.