Abstract
Timely and accurate forest fire monitoring is vital for curbing the spread of fires and reducing ecological and economic losses. Although the development of drones and remote sensing has advanced vision-based fire monitoring research, existing methods still face key challenges in complex natural environments: insufficient detection of small-scale fire sources and smoke (key indicators of early fires), and high false detection rates under environmental interference (such as backgrounds with similar textures). These limitations severely restrict the reliability and practicality of monitoring systems. To address these challenges, this paper proposes a deeply optimized model based on the YOLOv8 architecture. The model adopts an innovative multi-module collaborative design (see the Methods section for the specific structure), aiming to significantly improve detection accuracy and robustness for small targets and complex interference scenarios while remaining efficient enough for real-time warning. Validation results show that our method outperforms the baseline model in accuracy (a 4.7% improvement in mAP) and false detection rate, demonstrating its effectiveness in addressing the gaps in existing research.
Introduction
Fire is one of the worst natural disasters facing mankind. 2025 began with raging mountain fires in southern California, USA, unprecedented in the state's history and arguably one of the worst disasters in the history of the United States; the California Department of Forestry and Fire Protection's website indicates that more than 16,000 structures and more than 55,000 acres (222 square kilometers) of land were burned. In 2023, massive wildfires on the Hawaiian island of Maui killed at least 115 people. In regional Australia, fires have killed more than 800 people and caused a total of $1.6 billion in damage since 1851¹. Regarding the fire susceptibility of China's forests, high and medium–high grades predominate in Northeast, Southwest and East China, medium–low grades predominate in Central and South China, and low and very low grades predominate in North China and Northwest China, with forest fires accounting for 85.84 per cent of all fires in the winter and spring seasons².
The frequent occurrence of forest fires has led to severe economic and ecological losses, including damage to forest trees, threats to biodiversity, soil degradation and water source destruction. The released greenhouse gases and toxic substances also exacerbate air pollution and climate warming. With the development of society and the expansion of human activities, fires have shown characteristics of increasing frequency, expanding scale and intensified nonlinear propagation, significantly increasing the complexity of emergency response.
Although the decline in hardware costs and the improvement in GPU computing power have promoted the application of intelligent visual systems in fire management (such as high-precision fire point detection, fire simulation and thermal anomaly monitoring), traditional monitoring methods still have significant limitations: The cost and risk of manual inspection are high; the spatial resolution of satellite remote sensing is insufficient; and the coverage of ground sensors is limited. These deficiencies have restricted the timeliness and accuracy of fire prevention and response.
Unmanned aerial vehicle (UAV) remote sensing, with its ability to achieve centimeter-level hyperspectral imaging and adaptability to complex terrains, provides a new path to break through existing technical bottlenecks. The current core research demand lies in: Deeply integrating the UAV platform with computer vision technologies (such as deep learning algorithms), in order to optimize the spatiotemporal accuracy and resource allocation efficiency of the entire fire management cycle, and fill the technical gap from risk warning to post-disaster assessment.
Contemporary object detection methodologies diverge into two principal technical paradigms: Dual-stage and single-stage architectures, distinguished by their inherent trade-offs between computational efficiency and detection precision. The dual-stage framework, historically predominant in early research, employs a hierarchical process comprising: (1) region proposal generation through algorithms like Selective Search or neural Region Proposal Networks (RPNs), and (2) subsequent region-wise feature extraction for classification and bounding box regression. Seminal models in this category include Faster R-CNN3 with its RPN-driven proposal mechanism, Mask R-CNN4 integrating RoIAlign for pixel-level instance segmentation, and Cascade R-CNN5 utilizing multi-stage refinement to enhance localization accuracy. These architectures achieve superior performance in complex scenarios (e.g., 53.7% mAP on the COCO dataset for Cascade R-CNN) by iteratively optimizing candidate regions, particularly excelling in small object detection and occlusion handling. However, their computational complexity (typically requiring 200+ ms per inference on standard GPUs) limits real-time deployment.
In contrast, single-stage detectors such as the YOLO6 series (YOLOv17 pioneering grid-based prediction, YOLOv58 refining the architecture and training pipeline), SSD9 variants (R-SSD10 enhancing multi-scale feature fusion, F-SSD11 advancing multi-scale feature fusion through bidirectional feature integration and channel attention mechanisms to achieve better small-object accuracy at comparable computational cost), and RetinaNet12, which addresses class imbalance via focal loss, streamline detection into a unified pipeline. These models prioritize inference speed (e.g., YOLOv5 achieving 140 FPS on Tesla T4 GPUs) through architectural innovations like spatial pyramid pooling and depthwise separable convolutions, albeit with marginally reduced accuracy (4–6% lower mAP compared to dual-stage counterparts). This methodological evolution expands application horizons: dual-stage systems dominate medical imaging requiring sub-millimeter precision, while single-stage architectures power real-time applications like autonomous drones and surveillance systems.
Recent advancements in computer vision have catalyzed a paradigm shift in wildfire detection. HE Nailei et al. proposed an SSD-based algorithm for forest fire feature recognition that can quickly identify early forest fire images with a low missed-detection rate while balancing detection speed and accuracy, helping forest staff deal with fires in a timely manner and providing a technical reference for early forest fire prevention13. To detect forest fires as early as possible, ZHENG Yanrui et al. proposed YOLO-SCW, a detection model that takes forest fire smoke as its main target: built on YOLOv7, it introduces an SPD-Conv layer, adds a coordinate attention module to the pyramid pooling part of the detection head, and adopts the WIoU bounding box loss function, which both reduces the loss of small-target features during feature extraction and strengthens the model's focus on the target while suppressing background interference. The proposed architecture addresses multi-scale feature degradation through pyramidal attention mechanisms, concurrently mitigating background interference via spatial-channel gating operations. By implementing Wise-IoU (WIoU) bounding box regression loss with dynamic gradient allocation, the model achieves 17% faster convergence and 2.3% higher mAP50 in smoke plume detection compared to conventional IoU variants. This synergistic optimization enhances feature discriminability across wildfire scenarios (forest edge vs. canopy fires) while maintaining real-time inference capability (32 FPS on Jetson Xavier NX edge devices), demonstrating robust generalization across infrared–visible cross-modal datasets14. The target detection model for urban–forest interface fires proposed by WANG Zhe et al. can accurately monitor such fires and locate their spatial distribution. The model introduces the coordinate attention (CA) mechanism into the YOLOv5s backbone, which enhances the perception of direction and location information, improving overall performance and enabling accurate localization of the ignition points of urban–forest interface fires15. To enhance the efficiency of early fire detection and mitigate disaster losses, scholars both domestically and internationally have carried out a series of improvement studies on target detection algorithms. Regarding model architecture optimization, the Li Deng16 team upgraded YOLOv5s by integrating an attention mechanism and a feature fusion layer, effectively enhancing the extraction of flame and smoke features, and employed an improved loss function to strengthen the model's generalization performance. The Yunusov17 research group innovatively combined the YOLOv8 pre-trained model with the TranSDet architecture, significantly improving the accuracy and response speed of forest fire recognition.
For the technological breakthrough in the unmanned aerial vehicle (UAV) fire monitoring scenario, the Yangyang Zheng18 team proposed a dual optimization strategy: Replacing traditional convolution in the backbone network with GSConv to reconstruct the Bottleneck module, and enhancing the network’s expressive ability through the GBFPN multi-scale feature fusion module; concurrently, introducing the BiFormer attention mechanism and the Inner-MPDIoU loss function to reduce the model’s parameter quantity while enhancing the target capture ability in complex environments. The Yun19 team developed the CPDA (Channel Prior Expansion Attention) module, effectively addressing the adaptability limitations of traditional manual feature extraction methods in complex scenarios.
In response to the deployment requirements of edge computing devices, the Lei20 research group significantly reduced the computational load of the model through depthwise separable convolution and Ghost convolution techniques. They combined dynamic upsampling and coordinate attention mechanisms to enhance feature capture and adopted the Distance-IoU loss function to optimize the accuracy of detection boxes. These innovative improvements form a technical closed loop encompassing feature extraction, model lightweighting, and loss function optimization, providing significant theoretical support for the practical application of unmanned aerial vehicle (UAV) fire monitoring systems. However, as shown in Fig. 1, the high uncertainty in the size of fire targets in UAV or remote sensing images (e.g., early fires may occupy only a few pixels) and the complex background of forest scenes (e.g., tree shadows, cloud cover, terrain undulation) make traditional image processing algorithms (e.g., threshold segmentation, edge detection) susceptible to interference and prone to high rates of false positives and missed detections. Solving these problems is very important, because timely and efficient detection of forest fires has a major impact on timely fire suppression.
In order to solve the above problems, we need to develop a more efficient and accurate forest fire identification and monitoring method to achieve more efficient and accurate detection requirements. We propose a forest fire identification and monitoring method based on YOLOv8. This study focuses on optimizing the YOLOv8 algorithm to enhance target detection accuracy and recall performance while reducing missed detections, enabling effective real-time application across diverse research environments. The primary contributions of this work are summarized as follows:
The EMA (Efficient Multi-scale Attention) module employs a multi-branch convolutional architecture to facilitate cross-dimensional feature fusion. By integrating a channel-wise attention mechanism with trainable parameters (α and β weights optimized via SoftMax normalization), the module adaptively recalibrates the contribution of multi-scale features, prioritizing salient patterns across spatial and channel dimensions.
To optimize cross-layer feature fusion efficiency, this research proposes a Global Attention Mechanism (GAM) that integrates a dual-branch architecture for simultaneous spatial and channel-wise dependency modeling. By synchronously recalibrating feature importance across spatial coordinates and channel dimensions through learnable attention weights, GAM enhances inter-layer feature interaction while preserving original feature map fidelity.
To address occlusion-induced detection failures, Gold-YOLO employs multi-branch feature fusion mechanisms that enhance environmental adaptability through cross-modal feature complementarity. By implementing feature space decoupling, scale-aware convolution, and dynamic spatial aggregation, the architecture achieves balanced optimization of detection precision and computational efficiency, establishing a robust technical framework for real-time detection systems in complex visual environments.
In order to solve the feature misalignment problem in traditional FPN, the adaptive spatial feature fusion head ASFF (Adaptive Spatial Feature Fusion) is introduced, which especially improves the detection accuracy in dense scenes.
The structure of this article is outlined as follows: The second part elaborates on the baseline YOLOv8 architecture in detail and introduces our enhanced version. The third part discusses the experimental methods, including dataset preparation and performance evaluation, together with a comprehensive discussion of the empirical findings. The last part synthesizes key insights and summarizes the contributions of this study to research on forest fire monitoring.
Related work
YOLOv8 is a classic member of the YOLO series. It draws on the design strengths of previous generations and comprehensively improves the model structure of YOLOv5 while retaining YOLOv5's engineering simplicity and ease of use, and it supports image classification, object detection, and instance segmentation tasks. In this paper, we adopt the YOLOv8 model as a baseline and enhance it. The YOLOv8 architecture employs a hierarchical structure comprising four core modules: An input preprocessing layer for data standardization, a backbone network (CSPDarknet-based) for hierarchical feature extraction, a feature pyramid neck (FPN + PANet configuration) for multi-scale fusion, and an output layer implementing detection head predictions. This systematic organization, as illustrated in Fig. 2, enables efficient spatial-semantic modeling across varying object scales.
The main function of the input layer is to receive the original image data and apply preprocessing procedures, including dimension normalization and mosaic augmentation, to ensure consistent dimensions and enhance the generalization ability of the model. The backbone maintains structural consistency with YOLOv5 and consists of three core components: Convolutional layers for preliminary feature mapping, the C2f module for cross-stage partial connections, and the SPPF (Spatial Pyramid Pooling Fast) unit for multi-scale feature aggregation, jointly extracting hierarchical patterns from visual data. The C2f module provides richer gradient flow: it improves gradient propagation by integrating cross-layer connections, integrates high-level features with contextual information, enhances information flow in the feature extraction network, and improves detection accuracy. Unlike the Spatial Pyramid Pooling (SPP) module used in previous YOLO versions, the SPPF module improves the network structure and feature extraction, giving the model good detection performance for targets of all sizes; it uses three successive pooling operations to reduce computational complexity and is faster than traditional SPP while still aggregating features at multiple scales, thereby expanding the receptive field.
The intermediate connection module serves as a critical bridge between the feature extraction network and the task-specific output layer, enabling enhanced utilization of visual patterns captured by the base network while facilitating multidimensional feature integration. The architecture employs a dual-directional feature aggregation mechanism combining top-down and bottom-up information pathways, which effectively combines multi-scale feature representations through bidirectional cross-layer interactions. This synergistic combination of vertical and horizontal information flows demonstrates superior computational efficiency and accelerated processing speed compared to conventional approaches.
The output layer generates the final detection results. YOLOv8 uses an anchor-free design and a decoupled head, which allows each branch to focus on its specific features and thereby improves the accuracy of the overall model.
Methods
Improved YOLOv8 detection algorithm
YOLOv8 continues the single-stage detection framework of the YOLO (You Only Look Once) series, with a number of optimizations in model architecture, training strategy and deployment efficiency, with real-time performance and high precision as its core advantages. Despite good performance in many respects, it still faces a great challenge in small target detection in complex scenes, because the neural network's feature extraction for such targets is insufficient, which affects detection effectiveness. To cope with this challenge, this study adjusts the model structure, as shown in Fig. 3. These improvements are built on YOLOv8 for accurate detection of forest fires. The specific improvements are as follows: The EMA module is introduced to dynamically adjust the contribution of features at different scales; the GAM attention mechanism is introduced to improve the efficiency of cross-layer feature interaction; the neck structure is replaced with Gold-YOLO, which demonstrates strong environmental adaptability and achieves an optimal balance between detection accuracy and inference speed; and the Adaptive Spatial Feature Fusion head ASFF is introduced, which solves the feature misalignment problem of the traditional FPN.
EMA
To enable the model to prioritize salient image features while attenuating irrelevant background information, thereby enhancing detection performance and generalization, we use an attention mechanism as a key architectural component that facilitates selective feature emphasis. The Efficient Multi-scale Attention module (EMA)21 is a new spatial learning approach and an optimized multi-scale attention module that performs well across a variety of visual tasks. It designs multi-scale parallel sub-networks to establish short- and long-range dependencies, and the parallel sub-structure helps the network avoid excessive sequential processing and large depth. The overall structure is shown in Fig. 4; it effectively enhances the model's perception across a range of scales while maintaining high efficiency.
The architectural framework contains several components: 'X Avg Pool' and 'Y Avg Pool' denote one-dimensional horizontal and vertical global pooling, respectively; Conv denotes convolution; Matmul denotes matrix multiplication; Group Norm denotes normalization; Reweight denotes weight redistribution; groups denotes grouped convolution; and Sigmoid and Softmax are the activation functions.
In the feature grouping domain, the input feature map is strategically divided into g sub-features for different semantic information extraction while maintaining the g<<C relationship.
The input to the EMA module is divided into several groups and then processed through different branches: one branch performs global pooling, the other uses a 3 × 3 convolution to capture multi-scale features; the output features of the two branches are modulated by a sigmoid function and a normalization operation and are merged through the cross-channel interaction module.
This approach avoids channel dimensionality reduction caused by the use of convolution by reshaping some of the channel dimensions into bulk dimensions, and achieves the goal of preserving the information on each channel and reducing the computational overhead by not only constructing local cross-channel interactions in each parallel sub-network, but also by fusing the output features using a cross-space learning approach. In particular, the channel weights can be recalibrated by using 2D global average pooling to encode the global information, as shown in Eq. 1:
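Consistent with the symbol definitions below, Eq. (1) corresponds to the standard two-dimensional global average pooling over each channel:

\[
z_{c} = \frac{1}{W \times H}\sum\limits_{m = 1}^{W} {\sum\limits_{n = 1}^{H} {x_{c} \left( {m,n} \right)} }
\]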
Here, \(z_{c}\) represents the global feature descriptor of the c-th channel, \(x_{c}\) represents the two-dimensional matrix of the c-th channel in the input feature map, \(W\) represents the width of the feature map, \(H\) represents the height of the feature map, \(m\) represents the spatial index in the width direction, and \(n\) represents the spatial index in the height direction.
The EMA architecture integrates a cross-spatial learning mechanism that facilitates the aggregation of multidimensional cross-spatial information, enabling comprehensive feature integration and enhanced network feature extraction capabilities.
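As a concrete illustration, the following PyTorch sketch follows the EMA structure described above (feature grouping, a 1 × 1 branch fed by the X/Y average pooling, a 3 × 3 branch, and cross-spatial matrix multiplication). It is a simplified reference sketch rather than our exact module, and the grouping factor is illustrative.

```python
import torch
import torch.nn as nn


class EMA(nn.Module):
    """Efficient Multi-scale Attention (sketch): grouped channels, a 1x1 branch fed by
    X/Y average pooling, a 3x3 branch, and cross-spatial matmul-based re-weighting."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> (h, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> (1, w)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(c, c)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, ch, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)                    # split channels into groups
        # 1x1 branch: directional pooling -> shared 1x1 conv -> sigmoid gating.
        x_h = self.pool_h(g)                                        # (bg, c, h, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)                    # (bg, c, w, 1)
        y = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(y, [h, w], dim=2)
        branch1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch captures local multi-scale context.
        branch2 = self.conv3x3(g)
        # Cross-spatial learning: each branch's pooled descriptor re-weights the other branch.
        a1 = self.softmax(self.gap(branch1).flatten(2).transpose(1, 2))   # (bg, 1, c)
        a2 = self.softmax(self.gap(branch2).flatten(2).transpose(1, 2))
        m1 = torch.bmm(a1, branch2.flatten(2))                            # (bg, 1, h*w)
        m2 = torch.bmm(a2, branch1.flatten(2))
        weights = (m1 + m2).reshape(b * self.groups, 1, h, w).sigmoid()
        return (g * weights).reshape(b, ch, h, w)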
GAM
The Global Attention Mechanism (GAM)22 is a lightweight attention mechanism that combines a channel attention sub-module (shown in Fig. 5) and a spatial attention sub-module (shown in Fig. 6); the overall module structure is shown in Fig. 7. Traditional attention mechanisms can struggle with long sequences because they focus only on local contextual information. GAM enhances deep neural networks by introducing a global attention distribution that exploits global contextual interactions to better capture key information across the entire input, significantly improving the model's ability to process complex information. By reducing information loss, it strengthens the global interaction representation and amplifies the depth and detail of image analysis, improving the performance of deep neural networks.
GAM introduces a 3D permutation with a multilayer perceptron (MLP) for channel attention, as well as a convolutional spatial attention sub-module, to minimize information loss and amplify global features; the process is represented by Eqs. (2)–(4).
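Consistent with this description and the symbol definitions that follow, Eqs. (2)–(4) can be written as (\(\sigma\) denotes the sigmoid function, \(\delta\) the activation inside the MLP, assumed here to be ReLU, and \(\otimes\) element-wise multiplication):

\[
M_{c} \left( {F_{1} } \right) = \sigma \left( {w_{2} \,\delta \left( {w_{1} F_{1} + b_{1} } \right) + b_{2} } \right)
\]

\[
F_{2} = M_{c} \left( {F_{1} } \right) \otimes F_{1}
\]

\[
F_{3} = M_{s} \left( {F_{2} } \right) \otimes F_{2}
\]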
Here, \(F_{1}\) is the input feature map, \(F_{2}\) is the output feature map of the channel attention sub-module, \(w_{1}\), \(w_{2}\) and \(b_{1}\), \(b_{2}\) are the weights and bias terms of the multilayer perceptron (MLP), \(M_{c}\) is the channel attention function, \(F_{3}\) is the output feature map of the GAM attention, and \(M_{s}\) is the spatial attention function.
GAM significantly improves the performance of deep neural networks and also helps balance recognition speed and accuracy. Introducing global attention into deep neural networks not only increases the model's sensitivity to different image regions but also improves its ability to detect and accurately locate the target. GAM efficiently integrates spatial and channel information through its novel 3D-permutation MLP channel sub-module and convolutional spatial attention sub-module, and we integrate the GAM module into the YOLOv8 network architecture to further improve the accuracy and response speed of target recognition.
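A minimal PyTorch sketch of such a GAM block is shown below (a channel MLP applied after a 3D permutation, followed by 7 × 7 convolutional spatial attention). The reduction ratio and kernel size follow common GAM implementations and are assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn


class GAM(nn.Module):
    """Global Attention Mechanism (sketch): channel attention via a permuted MLP,
    then spatial attention via two 7x7 convolutions; both gates are sigmoids."""

    def __init__(self, channels: int, rate: int = 4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // rate),
            nn.ReLU(inplace=True),
            nn.Linear(channels // rate, channels),
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // rate, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels // rate),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // rate, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Channel attention: permute to (B, H*W, C) so the MLP mixes channels at each position.
        att = self.channel_mlp(x.permute(0, 2, 3, 1).reshape(b, h * w, c))
        att = att.reshape(b, h, w, c).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(att)                 # F2 = Mc(F1) * F1
        # Spatial attention: 7x7 convolutions produce a per-pixel, per-channel gate.
        x = x * torch.sigmoid(self.spatial(x))     # F3 = Ms(F2) * F2
        return x
```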
GOLD-YOLO
In a real forest fire environment, forest fire targets are not easy to detect due to the complexity of the background. The neck structure handles the fusion of the feature vectors extracted by the backbone network, but YOLOv8 still suffers from limited information fusion in the neck: each layer can only fully integrate features from its neighboring layers, and information from the other layers can only be obtained indirectly through 'recursion'. This process usually leads to a large loss of small-scale information, which results in misdetections and omissions, especially for fires and smoke of different scales and irregular shapes. To solve this problem, we adopt a new feature aggregation and distribution mechanism, Gold-YOLO23, as shown in Fig. 8, which discards the FPN structure used in the neck of the traditional YOLO family and employs a gather-and-distribute mechanism that improves multi-scale feature fusion in the neck. Incorporating this mechanism into the neck of YOLOv8 integrates features from layers B2, B3, B4, and B5 of the backbone network and retains high-resolution features for small-target detection, improving the ability to detect targets of different sizes without significantly increasing latency.
This gather-and-distribute architecture operates through three specialized components: The Feature Alignment Module (FAM) aligns multi-layer features, the Information Fusion Module (IFM) integrates them through an attention mechanism to synthesize global context, and the Information Injection Module (Inject) propagates these global descriptors back to the individual layers through residual mapping. This design resolves the information loss problem of traditional FPN-based necks by maintaining hierarchical consistency, improves the efficiency of cross-scale fusion, and strengthens the recognition of small objects by preserving the propagation of detailed features.
In addition to this, the GD mechanism has two branches for dealing with features of small and large objects: Low-GD and High-GD. In a forest fire scenario with a complex background, where the morphology of the fires and smoke changes over time, this structural design effectively improves the model’s ability to cope with this challenge.
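The official Gold-YOLO neck is considerably more elaborate (separate Low-GD and High-GD branches with transformer-based fusion); the schematic sketch below only illustrates the gather-and-distribute idea of align, fuse, and inject under simplified assumptions (1 × 1 projections and bilinear resizing), not the exact modules used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatherDistributeSketch(nn.Module):
    """Schematic gather-and-distribute: align backbone features (e.g. B2..B5) to one
    resolution, fuse them into a global descriptor, and inject it back into every level."""

    def __init__(self, in_channels: list[int], fused: int = 256):
        super().__init__()
        self.align = nn.ModuleList(nn.Conv2d(c, fused, 1) for c in in_channels)    # FAM-style projection
        self.fuse = nn.Conv2d(fused * len(in_channels), fused, 1)                  # IFM-style fusion
        self.inject = nn.ModuleList(nn.Conv2d(fused, c, 1) for c in in_channels)   # Inject back

    def forward(self, feats: list[torch.Tensor]) -> list[torch.Tensor]:
        # Gather: resize every level to the spatial size of a middle level.
        target = feats[len(feats) // 2].shape[-2:]
        aligned = [F.interpolate(proj(f), size=target, mode="bilinear", align_corners=False)
                   for proj, f in zip(self.align, feats)]
        # Fuse the aligned features into a single global descriptor.
        global_feat = self.fuse(torch.cat(aligned, dim=1))
        # Distribute: broadcast the global descriptor back to each level and add it residually.
        outs = []
        for proj, f in zip(self.inject, feats):
            g = F.interpolate(global_feat, size=f.shape[-2:], mode="bilinear", align_corners=False)
            outs.append(f + proj(g))
        return outs
```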
DETECT_ASFF
In forest fire detection, the background is complex and targets may appear at different locations and sizes. To address the challenge of varying detection scales, this paper improves the YOLOv8 detection head with Adaptive Spatial Feature Fusion (ASFF)24 to form the Detect_ASFF detection head, which adapts to the multi-scale features of forest fire scenes and achieves adaptive fusion. The structure of this module is depicted in Fig. 9; it significantly improves the scale invariance of features, spatially filters conflicting information, suppresses inconsistencies between features of different scales, and improves detection accuracy while adding little additional inference overhead.
The core of ASFF is to adaptively learn spatial fusion weights for the feature maps at each scale; the key steps are rescaling and adaptive fusion. In the rescaling step, for ASFF-1 the level-3 feature map undergoes 3 × 3 max pooling (stride = 2) followed by a 3 × 3 convolution (stride = 2) and the level-2 feature map undergoes a 3 × 3 convolution (stride = 2), yielding \(X^{3 \to 1}\) and \(X^{2 \to 1}\); for ASFF-2, the level-3 feature map undergoes a 3 × 3 convolution (stride = 2) and the level-1 feature map undergoes a 1 × 1 convolution followed by upsampling by a factor of 2, yielding \(X^{3 \to 2}\) and \(X^{1 \to 2}\); for ASFF-3, \(X^{2 \to 3}\) and \(X^{1 \to 3}\) are obtained by applying a 1 × 1 convolution to the level-2 feature map and upsampling by a factor of 2, and by applying a 1 × 1 convolution to the level-1 feature map and upsampling by a factor of 4. In the adaptive fusion stage, taking ASFF-3 as an example, the features from the different layers are multiplied by the weight parameters \(\alpha^{3}\), \(\beta^{3}\) and \(\gamma^{3}\) and summed to obtain the new fused feature ASFF-3, as shown in the following Eq. (5):
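With the notation defined below, Eq. (5) takes the standard ASFF weighted-sum form:

\[
y_{ij}^{l} = \alpha_{ij}^{l} \cdot X_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot X_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot X_{ij}^{3 \to l}
\]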
Here, \(y_{ij}^{l}\): The feature value at position \(\left( {{\text{i}},j} \right)\) and level \(l\) in the newly obtained feature map is the result of weighted fusion of features from different layers.
\(\alpha_{ij}^{l}\), \(\beta_{ij}^{l}\), \(\gamma_{ij}^{l}\): Corresponding to the weight parameters when extracting features from different layers for fusion, they are used to balance the contribution of features from different layers to the fusion result. The superscript \(l\) can represent the corresponding layer, and the subscript \(\left( {{\text{i}},j} \right)\) indicates the position on the feature map.
\(X_{ij}^{1 \to l}\), \(X_{ij}^{2 \to l}\), \(X_{ij}^{3 \to l}\): Respectively represent the feature values at position \(\left( {{\text{i}},j} \right)\) passed from different original layers to the target layer \(l\), and are the basic features to be fused.
To ensure that the weights sum to 1, the weight maps are generated by a softmax layer; at each position \(\left( {x,y} \right)\) the weight maps must satisfy the following Eqs. (6) and (7):
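Writing \(\lambda_{i} \left( {x,y} \right)\) for the learned control map of the i-th level (an auxiliary symbol introduced here only for illustration), Eqs. (6) and (7) can be expressed as a softmax normalization and the resulting constraint:

\[
W_{i} \left( {x,y} \right) = \frac{{e^{{\lambda_{i} \left( {x,y} \right)}} }}{{\sum\nolimits_{k = 1}^{3} {e^{{\lambda_{k} \left( {x,y} \right)}} } }}
\]

\[
\sum\limits_{i = 1}^{3} {W_{i} \left( {x,y} \right)} = 1,\quad W_{i} \left( {x,y} \right) \ge 0
\]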
Here, \(W_{i} \left( {x,y} \right)\) represents the i-th weight at position \(\left( {x,y} \right)\).
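A compact PyTorch sketch of this adaptive fusion step for a single output level is given below; it assumes the three rescaled inputs \(X^{1 \to l}\), \(X^{2 \to l}\), \(X^{3 \to l}\) already share the same resolution and channel count (the rescaling step described above), and the intermediate channel width is illustrative.

```python
import torch
import torch.nn as nn


class ASFFFusion(nn.Module):
    """Adaptive spatial feature fusion for one level (sketch): predict per-pixel weights
    alpha, beta, gamma with a softmax so they sum to 1, then blend the three inputs."""

    def __init__(self, channels: int, inter: int = 16):
        super().__init__()
        self.compress = nn.ModuleList(nn.Conv2d(channels, inter, 1) for _ in range(3))
        self.weight_levels = nn.Conv2d(inter * 3, 3, 1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor, x3: torch.Tensor) -> torch.Tensor:
        # Compress each rescaled input, then predict three per-pixel fusion weights.
        w = torch.cat([c(x) for c, x in zip(self.compress, (x1, x2, x3))], dim=1)
        w = torch.softmax(self.weight_levels(w), dim=1)    # Eqs. (6)-(7): weights sum to 1
        alpha, beta, gamma = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return alpha * x1 + beta * x2 + gamma * x3          # Eq. (5)
```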
Experimental results and analysis
The experimental process is mainly divided into three stages: dataset generation, model training and target detection, as shown in Fig. 10.
Dataset and preprocessing
In this study, a comprehensive assessment of the proposed model was carried out using a large public forest fire UAV remote sensing dataset (M4SFWD)25, which we supplemented with newly added data.
This multimodal wildfire dataset integrates satellite remote sensing imagery with UAV-captured aerial samples, encompassing 8,089 annotated images with dual-class labels (Fire/Smoke). The collection demonstrates comprehensive environmental diversity across four critical dimensions: Terrain variations (mountainous/plain/coastal), meteorological conditions (humidity: 25–85%), illumination ranges (50–120 klux), and fire intensity scales (1–5 classification). Original image resolutions span from 1280 × 720 to 1480 × 684 pixels, with non-standard aspect ratios adaptively resized to 640 × 640 through letterbox transformation during preprocessing. Through stratified sampling, the dataset is partitioned into training (5,662 samples), validation (1,618), and test (809) sets following a 7:2:1 ratio, ensuring proportional representation of rare fire events (< 3% occurrence rate) across all subsets. Our dataset has been made publicly available on the Zenodo platform, with the following DOI link: https://doi.org/10.5281/zenodo.16208516. An example of the dataset is shown in Table 1.
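For illustration, a letterbox resize consistent with the preprocessing described above can be written as follows; the gray padding value and the OpenCV-based implementation are assumptions rather than the exact pipeline used.

```python
import cv2
import numpy as np


def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114):
    """Resize to size x size while preserving aspect ratio; pad the remainder."""
    h, w = img.shape[:2]
    scale = min(size / h, size / w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    canvas = np.full((size, size, 3), pad_value, dtype=resized.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    # scale and offsets are needed to map ground-truth boxes into the padded image
    return canvas, scale, (left, top)
```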
By analyzing the labels of the dataset, a total of 23,969 labels were found, as shown in Fig. 11, of which 13,106 are Fire labels and 10,863 are Smoke labels.
Experimental platform
The enhanced architecture proposed in this study, along with all baseline models, was implemented on a GPU-accelerated computing platform for wildfire detection and monitoring applications. Table 2 shows the experimental setup.
A systematic hyperparameter optimization protocol was implemented to balance model capacity and generalization performance. The training regimen employed a batch size of 16 across 300 epochs, with convergence dynamics revealing three distinct phases: Rapid loss reduction (0–50 epochs), gradual refinement (50–250 epochs), and eventual stabilization beyond 250 epochs. Quantitative analysis of the loss curve (Fig. 12) demonstrates convergence plateau achievement at 250 epochs, with subsequent iterations (250–300 epochs) maintaining stable optimization margins below 0.5% fluctuation threshold, indicating effective prevention of overfitting through early stopping mechanisms.
After the validation of the training cycle, the model demonstrated good convergence and training stability, a conclusion supported by the smoothness of the training loss curve. Based on the experiments conducted, the final determined hyperparameter configuration is detailed in Table 3.
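For reference, a baseline training run with the hyperparameters reported above can be launched through the Ultralytics API roughly as follows; the dataset YAML name is a placeholder, and the improved model would instead be built from a custom model configuration containing the added modules.

```python
from ultralytics import YOLO

# Baseline YOLOv8 training with batch size 16, 300 epochs and 640x640 inputs.
# "m4sfwd.yaml" is a placeholder for the dataset configuration file.
model = YOLO("yolov8n.pt")
model.train(data="m4sfwd.yaml", epochs=300, batch=16, imgsz=640)
```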
Evaluation index
Precision (P), Recall (R), F1 score, and mean average precision (mAP) are essential metrics for evaluating models. To compare the results of various deep learning-based forest fire monitoring models, we use them as evaluation metrics.
Precision is defined with respect to the prediction results and indicates how many of the samples predicted as positive are truly positive, while recall is defined with respect to the original samples and indicates how many of the positive examples in the samples were predicted correctly, as shown in Eqs. (8) and (9):
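In standard form, Eqs. (8) and (9) are:

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
\]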
where TP (true positives) is the number of targets predicted correctly, FN (false negatives) is the number of targets incorrectly classified as another class or the background, and FP (false positives) is the number of other features predicted as the target.
The F1 score is a key indicator for evaluating the reliability of models in forest fire monitoring tasks, compelling the models to strike a balance between "reducing false alarms" and "avoiding missed alarms", as shown in Eq. (10):
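Eq. (10) is the harmonic mean of precision and recall:

\[
F_{1} = \frac{2 \times P \times R}{P + R}
\]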
The mean average precision (mAP) is the core evaluation metric for measuring the overall performance of a target detection model, as shown in Eq. (11):
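With \(N\) denoting the number of classes (here \(N = 2\): Fire and Smoke) and \(AP_{i}\) the area under the precision-recall curve of class \(i\), Eq. (11) is:

\[
mAP = \frac{1}{N}\sum\limits_{i = 1}^{N} {AP_{i} } ,\qquad AP_{i} = \int_{0}^{1} {P_{i} \left( R \right)\,dR}
\]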
We mainly compare model performance in terms of mAP (0.5) and mAP (0.5:0.95). mAP (0.5) is the mean average precision with the intersection-over-union (IoU) threshold set to 0.5, and mAP (0.5:0.95) denotes the mAP computed over a range of IoU thresholds from 0.5 to 0.95.
Comparative experiment
The accuracy of our model detection has been improved through a series of improvements to YOLOv8. To further validate the performance of the model, we compare it on the same dataset with the latest base algorithm of the YOLO family, YOLOv12, as well as well-established methods such as Faster R-CNN, YOLOv8, YOLOv8 CBAM and YOLOv8 MLCA.
We trained each model for 300 epochs and compared the mAP (0.5) of our method with that of the other methods; the comparison results are shown in Fig. 13. For every model, mAP (0.5) increases sharply in the first 25 epochs and then stabilizes after 275 epochs. Our method clearly reaches a higher plateau, finally stabilizing at 0.834.
As can be seen from the figure above, the detection performance of Faster R-CNN is much inferior to that of the YOLO series. To verify the stability of the results, we conducted further comparisons, as shown in Table 4:
From the table data, the mAP50 of our method is significantly superior to that of the comparison models (p < 0.05). Moreover, the performance of Faster R-CNN is significantly lower than that of the other models, so it is not included in the subsequent experimental comparisons.
To ensure the accuracy of the experiment and reduce experimental error, we save the best-performing weight file from training in the same experimental environment and use its resulting mAP(0.5), mAP(0.5:0.95), precision, recall, and F1 scores for the comparison experiment. As shown in Table 5, the precision and recall of detection with our approach are improved. It is obvious from the data that our method improves precision and recall by 0.032 and 0.044, the F1 score by 0.038, and mAP(0.5) and mAP(0.5:0.95) by 0.047 and 0.041, respectively, compared to YOLOv8. The significant improvement in mAP(0.5:0.95) indicates that our method is more robust under more stringent IoU thresholds, with clear advantages in localization accuracy.
Compared with YOLOv8 CBAM, the precision and recall are improved by 0.020 and 0.053 respectively, the F1 score is improved by 0.038, and the mAP(0.5) and mAP(0.5:0.95) are improved by 0.050 and 0.045 respectively. The improvement in recall is especially prominent, indicating that the model effectively mitigates the missed-detection problem.
Compared to the YOLOv8 MLCA, our method achieves 0.008 and 0.066 improvements in precision and recall, respectively, with a 0.043 increase in the F1 score, and 0.047 and 0.039 improvements in mAP(0.5) and mAP(0.5:0.95), respectively. The significant optimization of recall further validates the model’s enhanced coverage of low-visibility targets.
Compared with YOLOv12, the newest model in the YOLO series, the precision and recall are improved by 0.048 and 0.038, respectively, the F1 score increases by 0.040, and the mAP (0.5) and mAP (0.5:0.95) are improved by 0.042 and 0.040, respectively. The across-the-board improvement in mAP (0.5:0.95) demonstrates that our method has stronger overall detection capability in complex scenes, with both missed detections and false detections reduced.
As shown in Fig. 14, the visualization results clearly show that the average precision (AP) values of the present method are higher than those of the other compared models in both FIRE and SMOKE detection tasks. Whether it is the performance of a single category or the overall average index, the present method leads with a large advantage, which fully verifies its effectiveness and superiority in the target detection task.
To systematically evaluate the performance differences among various detection methods, this study designed a multi-scenario comparative experiment, as shown in Fig. 15.
In the smoke detection tasks under complex environments (Scenarios 1 and 4), although existing algorithms can achieve target recognition, their confidence indicators are significantly lower than those of this method. Experimental data show that this method maintains more stable detection confidence under conditions of overlapping smoke features and dynamic interference and maintains high accuracy in different complex flame recognition tasks, verifying its technical advantages in complex scenarios.
Regarding the issue of missed detections for the target (Scenarios 3 and 4), the baseline model YOLOv8 exhibits obvious detection omission phenomena during continuous monitoring. The improved CBAM-MLCA model and the YOLOv12 architecture achieve complete target coverage, but their mean confidence values are lower than those of this method. This indicates that this method effectively improves recognition reliability while ensuring detection integrity through feature extraction optimization.
In the scenarios of long-distance small target detection (Scenarios 2 and 3), both this method and YOLOv12 achieve a 100% recall rate. It is worth noting that the standard deviation of confidence of this method is smaller than that of the comparison model under complex background interference, indicating that it has a more stable feature representation ability under different observation distances and background complexities.
The comprehensive experimental results confirm that this detection framework demonstrates significant performance advantages in various complex fire scenarios through multi-scale feature fusion and attention mechanism optimization. Compared with existing methods, this solution has a lower average missed detection rate and simultaneously increases the mean detection confidence level, providing a more reliable technical solution for fire monitoring in complex environments.
Ablation experiment
To systematically evaluate individual module contributions in the proposed architecture, five component-wise ablation studies were performed under identical environmental and parametric conditions using YOLOv8 as the baseline framework. These sequential module integration tests enabled quantitative evaluation of each enhancement component’s efficacy in wildfire detection scenarios through controlled variable isolation. The results of the experiments are shown in Table 6.
Compared with the original YOLOv8 model, when only EMA is introduced, precision is improved by 2.3%, mAP(0.5) is improved by 0.6%, and recall decreases slightly; the overall model is improved.
With the introduction of GAM, Precision improved by 0.7%, Recall decreased by 1.5% and mAP(0.5) improved by 0.4%. These findings confirm that the overall accuracy of detection was improved with the addition of the GAM model.
The original neck structure of YOLOv8 was then replaced with Gold-YOLO, which resulted in a 1.5% decrease in precision, a 9.4% increase in recall, and a 3.3% improvement in mAP(0.5). After replacing the neck structure, although the precision was slightly reduced, the recall and mAP(0.5) were significantly improved, and the overall performance of the model was significantly enhanced.
Finally, replacing the original detection Head with ASFF resulted in a 1.7% increase in precision, a 3.2% decrease in recall, and a 0.5% increase in mAP(0.5). Experimental verification shows that the integration technology of ASFF detection head significantly improves the accuracy rate of target recognition.
Experimental results conclusively establish the proposed framework’s superiority over baseline models through multi-dimensional evaluations. The ablation study visualizations (Fig. 16) demonstrate significantly improved prediction fidelity compared to conventional approaches, with detection outputs exhibiting enhanced congruence to actual environmental conditions.
Comparative evaluations with the baseline YOLOv8 model were conducted through visual analysis of randomly sampled test cases (Fig. 17). In Scenario (a), the original model failed to detect the upper-left smoke plume adjacent to the fire while generating false positives, whereas our method achieved precise localization. For Scenarios (b)(c), although both approaches identified fire and smoke targets, the proposed architecture exhibited higher confidence scores in prediction reliability, demonstrating superior detection robustness across multi-scale wildfire manifestations.
Conclusion
To address limitations in existing wildfire detection methodologies, this study develops an optimized YOLOv8-based architecture designed to improve both precision and inference speed in forest fire monitoring systems. By introducing a multi-module collaborative optimization strategy, the model significantly improves the detection of fires and smoke, especially in complex natural scenes and for small targets. The core improvements and experimental results are analyzed as follows:
(1) EMA is introduced to dynamically adjust the weighting of multi-scale features, improving the stability and generalization ability of training and enhancing the robustness of the model.

(2) GAM is embedded to strengthen the global perception ability of the model, effectively distinguishing interference that resembles the target and reducing the false detection rate.

(3) The integration of Gold-YOLO within the feature aggregation network substantially elevates detection recall, particularly enhancing small-scale object recognition accuracy by optimizing multi-scale feature fusion, while simultaneously strengthening model resilience in cluttered environmental conditions through adaptive attention weighting mechanisms.

(4) The ASFF detection head is adopted to adaptively adjust the weights of feature maps at different scales, enhancing the fusion effect for small-sized targets.
The experimental results show that our method achieves significant improvements over the baseline YOLOv8: mAP (0.5) increases by 4.7%, and precision and recall improve by 3.2% and 4.4%, respectively. These gains are reflected above all in robustness against complex terrain and environmental interference (such as backgrounds with similar color textures) and in the detection of small-scale fire/smoke targets (often key indicators of early-stage fires). This validates the effectiveness of the proposed multi-module collaborative optimization strategy, particularly the model's core advantages in suppressing false positives and reducing false negatives. The high inference speed of our method is another significant advantage, providing technical support for the urgent need for real-time forest fire monitoring and early warning. Its practical application potential lies in deployment in forest area surveillance cameras, unmanned aerial vehicle inspection systems, or satellite remote sensing image processing platforms, forming a faster and more accurate early fire warning network that can win precious time for fire suppression and reduce ecological and economic losses.
Although positive results have been achieved, this study still has limitations:
(1) Insufficient detection of semi-transparent/diffuse smoke: The current model has relatively low detection accuracy for semi-transparent (such as mist-like) or highly diffuse smoke targets. These targets have blurred features, unclear boundaries, and low distinguishability from the background, posing a challenge to mainstream detectors that rely on clear visual features.

(2) Robustness to extreme weather conditions remains to be verified: The experiments are mainly based on existing public datasets and some self-collected data. The performance of the model under extremely harsh weather conditions (such as heavy rain, thick fog, sandstorms) has not been fully validated yet. This is an unresolved issue that must be addressed in actual deployment.

(3) Computational resource requirements: Although the inference speed is faster, the model complexity of our method is higher than that of the original YOLOv8. As a result, the training process of our method requires more computational resources, which may limit its direct deployment on resource-constrained edge devices.
Given these limitations, future research will focus primarily on the following unresolved issues:
(1) Enhance the ability to detect complex smoke: Focus on researching how to integrate multimodal information (such as possible infrared features) or design more sophisticated feature extraction modules to improve the perception of semi-transparent and diffuse smoke.

(2) Enhanced adaptability to extreme environments: Collect a fire dataset covering a wider range of extreme weather conditions and explore robustness optimization strategies for the model under these conditions.

(3) Model lightweighting and edge deployment: Explore techniques such as model pruning, quantization, or knowledge distillation to reduce the size of the model while maintaining accuracy, making it more suitable for resource-constrained embedded or edge computing devices.
Our method, through innovative architecture optimization, has effectively enhanced the accuracy and speed of forest fire monitoring. It particularly demonstrates significant advantages in suppressing false alarms, reducing missed detections, and improving robustness for small targets and complex scenarios. It has clear practical application prospects. At the same time, we are also fully aware of its limitations in specific smoke type detection and adaptability to extreme environments. Solving these unresolved issues will be the key direction of future research, with the aim of ultimately achieving a more reliable and universal intelligent forest fire early warning system.
Data availability
The dataset used in this study is publicly available on the Zenodo platform. It includes the full training, validation, and test splits of the extended M4SFWD dataset (incorporating both synthetic and real UAV remote sensing imagery of forest fires and smoke), along with detailed annotations and preprocessing guidelines. The dataset can be accessed via the following DOI: https://doi.org/10.5281/zenodo.16208516.
References
Summary of Major Bush Fires in Australia Since 1851. Romsey Australia [2010-10-29].
Zhang, G., Ci, X., Yang, X., Jiang, C., Sun, Z. & Meng, H. Study on spatio-temporal distribution characteristics and susceptibility analysis of forest fire. For. Grassl. Resour. Res. (5), 48–55 (2023).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV) 2980–2988 (2017).
Cai, Z. & Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) 6154–6162 (2018).
Redmon, J.et al. You only look once: Unified, real-time object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 779-788(2016).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified real-time object detection. arXiv:1506.02640 (2015).
Jocher, G. YOLOv5 [online] Available: https://github.com/ultralytics/yolov5 (2020).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y. et al. SSD: Single shot MultiBox detector. arXiv:1512.02325 (2015).
Jeong, J., Park, H. & Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv:1705.09587 (2017).
Li, Z. & Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv:1712.00960 (2017).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV) 2980–2988 (2017).
Nailei, H. E., Jinsheng, Z. & Wenshu, L. Forest fire image recognition based on deep learning multi-target detection technology. J. Nanjing For. Univ. (Nat. Sci. Edit.) 48(3), 207–218. https://doi.org/10.12302/j.issn.1000-2006.202205025 (2024).
Yanrui, Z., Linjian, Y., Shuguang, L. & Yongju, Z. Deep learning-based forest fire smoke detection. For. Resour. Manag. 4, 150–160 (2023).
Wang, Z., Li, X., Yang, D. & Liu, D. Fire detection model of wildland-urban interface based on YOLOv5s. China Saf. Sci. J. 33(6), 152–158 (2023).
Deng, L., Zhou, J. & Liu, Q. Improving YOLOv5s algorithm for detecting flame and smoke. IEEE Access 12, 126568–126576. https://doi.org/10.1109/ACCESS.2024.3442309 (2024).
Yunusov, N., Islam, B. M. S., Abdusalomov, A. & Kim, W. Robust forest fire detection method for surveillance systems based on you only look once version 8 and transfer learning approaches. Processes 12, 1039 (2024).
Zheng, Y., Tao, F., Gao, Z. & Li, J. FGYOLO: An integrated feature enhancement lightweight unmanned aerial vehicle forest fire detection framework based on YOLOv8n. Forests 15, 1823. https://doi.org/10.3390/f15101823 (2024).
Yun, B., Zheng, Y., Lin, Z. & Li, T. FFYOLO: A lightweight forest fire detection model based on YOLOv8. Fire 7, 93 (2024).
Lei, L., Duan, R., Yang, F. & Xu, L. Low complexity forest fire detection based on improved YOLOv8 network. Forests 15, 1652. https://doi.org/10.3390/f15091652 (2024).
Ouyang, D., He, S., Zhan, J., Guo, H., Huang, Z., Luo, M. L. & Zhang, G. L. Efficient multi-scale attention module with cross-spatial learning. arXiv:2305.13563 (2023).
Liu, Y., Shao, Z. & Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv:2112.05561 (2021).
Wang, C., He, W., Nie, Y., Guo, J., Liu, C., Wang, Y. & Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. In Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA (2023).
Liu, S., Huang, D. & Wang, Y. Learning spatial fusion for single-shot object detection. arXiv:1911.09516 (2019).
Guanbo, W. Multiple scenarios, multiple weather conditions, multiple lighting conditions and multiple wildfire objects Synthetic Forest Wildfire Dataset (M4SFWD) IEEE Dataport https://doi.org/10.21227/m9kz-bw61 (2024).
Acknowledgements
This work was supported in part by the Basic Scientific Research Business Fund Project of Universities in Hebei Province under Grant 2022QNJS03, and in part by the Project of Zhang-jiakou Science and Technology Bureau under Grant 2421007B.
Funding
The Basic Scientific Research Business Fund Project of Universities in Hebei Province, 2022QNJS03, the Project of Zhang-jiakou Science and Technology Bureau, 2421007B.
Author information
Contributions
Conceptualization, Y. Zheng and P. Guo; methodology, Y. Zheng; software, P. Guo; validation, X. Tian and Y. Ye; formal analysis, X.X.; investigation, Y. Ye; resources, Y. Ye; writing—original draft preparation, P. Guo; writing—review and editing, Y. Zheng; project administration, Y. Ye; funding acquisition, Y. Ye. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zheng, Y., Guo, P., Tian, X. et al. A forest fire identification and monitoring model based on improved YOLOv8. Sci Rep 15, 37018 (2025). https://doi.org/10.1038/s41598-025-17893-3