Introduction

Building fires may cause casualties and extensive damage due to their rapid spread and the difficulty of extinguishment1,2. A statistical report from the United States attributes 74% of all building fire-related deaths to the lack of timely fire warnings3. Timely fire warnings enable the rapid activation of fire suppression systems and leave sufficient time for personnel evacuation; they are therefore critical for minimizing casualties in building fires.

Fire detection methods are employed to achieve timely fire warnings and minimize fire-related damage. Traditional fire detection methods predominantly rely on flame, gas, and temperature sensors, which are typically mounted on the ceiling to maximize their coverage area and detection capability4,5. These sensors introduce detection delays because time is required for heat or smoke to reach them6,7, which increases the response time of the fire warning. Such delays can affect evacuation and the timely activation of fire suppression systems, reducing the effectiveness of fire warnings8. Therefore, more efficient detection methods are needed for timely fire warnings.

To overcome the limitations of traditional fire detection methods, scholars have increasingly focused on developing deep learning-based fire detection techniques9,10,11. This approach leverages indoor security cameras, enabling more timely and direct fire warnings from surveillance video12. The proposed method is closely integrated with these cameras: the real-time video they capture provides the scene information from which the deep learning model extracts fire-related features, so the method can analyze the latest frames in real time and quickly issue an alarm when fire features are detected.

Deep learning-based fire detection technology relies on high-quality datasets2 for effective model training13, as both the quality and quantity of the dataset are crucial to ensuring the accuracy of the model’s detection14. However, owing to the dangerous nature of fire incidents, real fire images are extremely difficult to obtain, resulting in a serious shortage of real image data for model training. Given these issues, artificial intelligence-generated content (AIGC) technologies provide a viable solution by generating fire images for model training. AIGC technologies have been widely applied in fields such as text, image, audio, and code generation15,16. For instance, DALL-E and Midjourney, as advanced text-to-image technologies17, offer methods for generating synthetic images18,19. In addition, Generative Adversarial Networks (GANs), such as StackGAN and StyleGAN, have shown the capability to produce realistic and diverse synthetic images20,21. Therefore, generative artificial intelligence offers a way to expand fire image datasets with realistic synthetic images, thereby improving the performance of the detection model.

To leverage indoor security cameras for the detection of building fires, the following two challenges need to be addressed:

  1. (1)

    The scarcity of real fire image samples. After a fire starts, the environment at the scene is complex and it is difficult to obtain a large number of real images of the fire scene. This limitation hinders the development of robust detection models. Employing generative AI methods to construct effective image datasets can augment dataset diversity, thereby facilitating the training of more accurate and reliable detection models.

  2. (2)

    The efficiency of detection models based on deep learning is insufficient to meet the demands of timely building fire warnings. Building fire detection requires accurate detection of flame or smoke in surveillance images while ensuring detection efficiency to provide accurate fire warnings in various scenarios. Therefore, a detection method for building fires must be developed based on existing deep learning models to enable real-time, timely fire warnings.

For challenge (1), the current approach to obtaining training datasets relies on screenshots from indoor security cameras and images downloaded online as primary data sources22. This approach is limited by insufficient data diversity and inconsistent data quality. Chen collected a multi-modal video dataset using drones23. While this method can capture real flame images, it faces challenges in ensuring image quality and clarity due to environmental interference and camera limitations. Furthermore, attempts to augment the sample size through horizontal, vertical, and random flips often lead to excessive similarity among samples, which directly impacts model training accuracy24. In addition, researchers often collect fire images from search engines such as Google25,26,27. However, high similarity in image backgrounds reduces sample diversity28. Therefore, a dataset construction method is needed that can generate a large number of diverse images to provide training samples for fire detection models, thereby improving their detection performance.

For challenge (2), deep learning-based fire detection has achieved a direct mapping from data input to fire detection results, ensuring the reliability and accuracy of video-based detection methods in practical applications29. The YOLO (You Only Look Once) network model is widely utilized in numerous real-time object detection applications due to its simplicity, efficiency, and adaptability30. The YOLO models have evolved through various versions up to the YOLOv10 model31. However, these higher versions, specifically the YOLOv9 and YOLOv10 models, have high computational complexity and resource demands, making them unsuitable for large-scale video detection tasks32,33,34. Numerous studies indicate that the YOLOv8 model is highly adaptable in large-scale fire detection tasks35. The object detection speed and efficiency of the YOLOv8 model are better aligned with the requirements of building fire detection compared to the YOLOv9 and YOLOv10 models. Despite these improvements, existing fire detection models still require further enhancement in terms of accuracy and efficiency.

To address the above issues, a multi-object detection method through AIGC is proposed to improve building fire warning capability. First, a fire image generation workflow is designed using Midjourney software, where fire-related keywords are extracted to generate diverse fire images. The detection accuracy of the model trained on the AIGC dataset is compared against that of the model trained on the real image dataset to evaluate whether the AIGC dataset is effective. Next, the MLCA mechanism is introduced to enhance feature detection, and the feature fusion layer is replaced to improve the model’s detection efficiency and accuracy. The multi-object detection model is evaluated through performance comparison and ablation experiments. Finally, three real fire cases are analyzed to demonstrate the method’s efficiency for timely fire warnings. The outcomes of this study can be used to offer timely fire warnings, thereby enhancing personnel evacuation and rescue efforts in building fires.

Framework

The research framework of this study is shown in Fig. 1, which consists of three components as follows:

  1. (1)

    Dataset construction. First, a fire image generation workflow is designed using Midjourney software, where fire-related keywords are extracted and used to generate diverse fire images. Second, the constructed dataset is validated to demonstrate that the AIGC-based method can expand the number of samples across different fire scenarios.

  2. (2)

    Multi-object detection model. The MLCA mechanism is introduced, and the feature fusion layers are replaced to enhance the feature fusion capability. Subsequently, detection performance comparison and ablation experiments are performed to demonstrate the effectiveness of the multi-object detection model.

  3. (3)

    Case study. Three cases are analyzed to verify the effectiveness of the proposed method. The study also demonstrates that the model can effectively detect fires using video captured by indoor security cameras, highlighting its practical applicability in real scenarios.

Fig. 1

The framework of this study.

Dataset construction through AIGC

Construction workflow of the fire image dataset

During the evolution of a building fire, the fire causes dynamic damage to the building structure; this process not only affects fire spread but also significantly changes the main features in the image, so a detailed description of this dynamic damage process is of great significance for constructing fire image datasets36,37. Building fire accidents are sudden and dangerous, so acquiring images of real fire scenes poses a significant safety risk. When a fire incident occurs, factors such as the type and size of the building, the location of the fire source, and the building materials affect the appearance and quality of the captured images. AIGC therefore effectively expands the dataset by synthesizing fire images, providing enough training data to optimize detection model performance and addressing the scarcity of fire image data.

Fig. 2

The workflow for constructing the AIGC-based fire image dataset.

In this paper, a construction workflow for the building fire image dataset based on Midjourney V5.239 is designed to provide sufficient sample data for training fire detection models. Figure 2 shows the workflow for constructing the AIGC-based fire image dataset. First, the scope of the generated images is clarified by specifying fire scenarios and detailed requirements. Then, a descriptive text is created to comprehensively describe different building fire scenarios based on these requirements. The descriptive text consists of two key variables: Variable 1, representing the different fire scenarios, and Variable 2, representing dynamic fire features (e.g., changes in the morphology of the flames and smoke). Combining Variable 1 and Variable 2 creates diverse inputs, which are processed by Midjourney to produce a series of building fire scene images. For Variable 1, nine high-risk fire scenarios were chosen to define the environments in which fires occur; these scenarios were identified based on factors such as fire incident frequency, building type, and fire impact. For Variable 2, the evolution and morphological characteristics of flames and smoke were refined into 18 categories across five dimensions to capture their dynamics, as summarized in Table 1. In Midjourney, keyword groups are formed by combining the above variables and nesting them within {}. These keyword groups are arranged and combined according to predefined rules, enabling the batch generation of building fire images that meet the target requirements. This workflow not only ensures high realism and diversity in the generated fire images but also provides more accurate training data for the fire detection model.

Table 1 Keywords and explanations for Variable 2.
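
As a minimal illustration of this keyword-combination step, the Python sketch below enumerates prompt strings from example keyword lists; the scenario and feature keywords, and the prompt template itself, are hypothetical stand-ins for the full Variable 1 and Variable 2 lists (--ar is Midjourney’s standard aspect-ratio parameter).

```python
from itertools import product

# Hypothetical stand-ins for Variable 1 (fire scenarios) and Variable 2
# (dynamic flame/smoke features); the paper's full keyword lists are richer.
scenarios = ["warehouse interior", "supermarket aisle", "residential kitchen"]
features = [
    "small flickering flame at floor level",
    "thick black smoke spreading along the ceiling",
    "intense flames engulfing shelves",
]

# One prompt per scenario/feature pair, mimicking the {}-style keyword nesting.
prompts = [
    f"surveillance camera view of a {scene}, {feature}, photorealistic --ar 16:9"
    for scene, feature in product(scenarios, features)
]

for prompt in prompts:
    print(prompt)
```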

Through the workflow designed for this paper, 2000 building fire images were generated within 8 h, with the images being clear and accurately reflecting the details of the building fire scene. Figure 3 illustrates the generated fire images. The generated images encompass various fire scenarios, building types, and environmental conditions. Flames are depicted with varying intensities, accompanied by smoke effects that differ in both density and distribution. These details capture the intricate characteristics of fire incidents, reflecting the diversity inherent in real fire scenarios. Such variations are crucial for training detection models. This dataset serves as an invaluable resource for training fire detection models, enhancing their ability to detect fire under different lighting and weather conditions.

Fig. 3

The generated images of building fires based on the construction workflow in this study.

AIGC dataset validation

Validation experiment

The validation experiments aim to determine whether the performance of models trained on the AIGC-generated dataset is consistent with that of models trained on the real fire image dataset. To achieve this, the AIGC-generated dataset and the real image dataset are used to train the models separately. The validation of the AIGC dataset consists of two experiments: evaluating the detection performance of the two models on their respective validation sets (Experiment 1) and comparing their precision on the same real image dataset (Experiment 2). For Experiment 1, the 2000 AIGC-generated fire images and 16,800 real fire images were divided into training and validation sets in an 80:20 ratio. Both datasets were used to train the YOLOv5 model, and the models’ evaluation metrics were compared. Experiment 2 focused on comparing the detection precision of the two models on a dataset of 1000 real fire images. These two experiments aimed to analyze the impact of different datasets on the performance of fire detection models.
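
A minimal sketch of this split-and-train step is given below, assuming YOLO-format annotations and the ultralytics package; the directory names, the dataset config fire.yaml, and the checkpoint are placeholders (the paper’s Experiment 1 trained YOLOv5, whereas the sketch shows a generic ultralytics training call).

```python
import random
import shutil
from pathlib import Path

from ultralytics import YOLO  # assumes the ultralytics package is installed


def split_dataset(image_dir: str, label_dir: str, out_dir: str,
                  train_ratio: float = 0.8, seed: int = 0) -> None:
    """Shuffle image/label pairs and copy them into train/val folders (80:20)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_ratio)
    for subset, files in (("train", images[:n_train]), ("val", images[n_train:])):
        img_dest = Path(out_dir) / subset / "images"
        lbl_dest = Path(out_dir) / subset / "labels"
        img_dest.mkdir(parents=True, exist_ok=True)
        lbl_dest.mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy(img, img_dest / img.name)
            label = Path(label_dir) / (img.stem + ".txt")  # YOLO-format annotation
            if label.exists():
                shutil.copy(label, lbl_dest / label.name)


if __name__ == "__main__":
    split_dataset("aigc_fire/images", "aigc_fire/labels", "datasets/aigc_fire")
    # 'fire.yaml' is a hypothetical dataset config pointing at the train/val
    # folders above and declaring the two classes (flame, smoke).
    model = YOLO("yolov8n.pt")
    model.train(data="fire.yaml", epochs=100, imgsz=640)
```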

In this study, the experimental platform is built on a Windows 10 operating system with a 12th Gen Intel Core(TM) i7-12700H CPU and an NVIDIA GeForce RTX 3090 GPU. The platform uses Python 3.840 as the programming language and PyTorch as the framework. This paper evaluates model performance in terms of both accuracy and inference speed. Precision measures the proportion of correctly detected objects among all detected objects, recall evaluates the ability of the model to detect actual objects, and the F1 score combines precision and recall to assess the overall quality of the model. Detection accuracy is further evaluated by mAP@0.5 (the mean average precision calculated at an IoU threshold of 0.5), while inference speed is evaluated by FPS (frames per second), which is critical for real-time detection. These metrics provide a comprehensive and scientific basis for evaluating the models trained on different datasets.
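
For reference, precision, recall, and the F1 score can be computed from per-class true-positive, false-positive, and false-negative counts at a fixed IoU threshold; the counts in this sketch are made up purely for illustration.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts at a fixed IoU threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Illustrative counts only (not the paper's results).
p, r, f1 = detection_metrics(tp=98, fp=8, fn=2)
print(f"precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
```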

Validation results

The model trained on the AIGC dataset is denoted as the AIGC model, and the model trained on the real dataset is denoted as the real image model.

Experiment 1 compared the evaluation metrics between models trained on the AIGC dataset and those trained on the real image dataset. The results are detailed in Table 2, which demonstrates the superior performance of the AIGC model in comparison to the real image model. Specifically, both models achieve a recall of 98.0%, indicating that the two models are equally effective in detecting fire characteristics. However, the AIGC model achieves 4.9% higher precision than the real image model. This higher precision suggests that the AIGC model generates fewer false positives, thereby improving the reliability of fire warnings. In addition, a comparison of mAP@0.5 reveals the advantage of the AIGC model. For mAP@0.5 (flame), the AIGC model achieves 85.2%, surpassing the real image model’s 73.3% by 11.9%. For mAP@0.5 (smoke), the AIGC model achieves 94.7%, outperforming the real image model by 6.2%. Overall, the AIGC model achieves a mAP@0.5 of 90.0%, which is 9.1% higher than the real image model’s 80.9%. The F1 score further highlights the advantage of the AIGC model, reaching 85.0% compared with 75.0% for the real image model. Despite being trained on a smaller dataset, the AIGC model consistently outperforms the real image model across various evaluation metrics, indicating that AIGC-generated data can effectively complement real image datasets in training fire detection models.

Table 2 Evaluation metrics between the models trained on the AIGC and the real image datasets.

Experiment 2 compares the precision of both models on the same image dataset. Figure 4 shows that the AIGC model and the real image model achieve precisions of 86.1% and 87.7%, respectively, on this dataset. The experiments show that the precision of the AIGC model is only 1.6% lower than that of the real image model, demonstrating the effectiveness of the AIGC dataset in detecting real fires and the reliability of the AIGC-generated dataset. The minimal performance gap between the two models in Experiment 2 further demonstrates the effectiveness and reliability of the AIGC model for real fire detection tasks.

Fig. 4

Precision comparison of the AIGC and the real image models on the same image dataset.

In summary, the results of these validation experiments show that the AIGC model outperforms the real image model in key evaluation metrics. Although the AIGC model was trained on a smaller dataset, it demonstrated greater precision and reliability in detecting real fire images. The results from both experiments highlight the significant potential of AIGC-generated datasets to improve the performance of fire detection models, demonstrating their ability to complement real image datasets.

Multi-object detection model

Model development

Applying the YOLOv8 model in building fire scenarios still faces several challenges despite it demonstrating high efficiency and accuracy in detecting flames and smoke. First, complex backgrounds and varying lighting conditions in building fire scenes pose challenges to feature extraction and object detection. Second, the performance of the YOLOv8 model in detecting objects of varying scales, particularly smaller flames and fine smoke, requires further enhancement. Moreover, the high computational complexity of the YOLOv8 model may hinder its application in real-time surveillance systems.

To address these challenges and improve the detection performance of the detection model in building fire scenarios, this paper develops a multi-object detection model based on the YOLOv8 model. The approach aims to improve the model’s ability to detect flames and smoke through structural modifications and algorithm optimization. The overall architecture of the multi-object detection model is presented in Fig. 5, and the architectures of the sub-modules are shown in Fig. 6.

Fig. 5

The overall architecture of the multi-object detection model.

In the backbone sub-module, the fire image is processed by a series of convolutions to extract fire feature information. An attention mechanism is introduced at the 10th layer of the model to focus on channel and spatial information and thereby enhance the backbone’s ability to extract flame and smoke features. In recent years, attention mechanisms have been proven effective for object detection in many studies35, such as large separable kernel attention (LSKA)36, the convolutional block attention module (CBAM)40,41,42,43,44, and coordinate attention (CoordAtt). However, these approaches fall short in terms of feature extraction, computational efficiency, and generalization ability. The Mixed Local Channel Attention (MLCA) mechanism improves performance by reducing information loss and amplifying global interactive representations, especially in complex environments. Therefore, the MLCA mechanism is introduced into the backbone of the model in this paper to capture important information in images and enhance the model’s feature representation capabilities. With the MLCA mechanism in the backbone, precision increases by 2.2% and mAP@0.5 improves by 1.7% relative to the YOLOv8 model, demonstrating the significant impact of the MLCA mechanism on feature extraction. Additionally, the C2f and Conv modules in the neck of the YOLOv8 model are replaced with VOV-GSCSP and GSConv, thereby reducing FLOPs. The neck of the multi-object detection model balances complexity and accuracy, achieving higher computational efficiency while effectively improving detection accuracy for flame and smoke. It mainly contains concat, upsampling, VOV-GSCSP, and GSConv sub-modules. VOV-GSCSP fuses parameters from different backbone layers into different detection layers, greatly improving the feature fusion capability of the network. With the MLCA mechanism in the backbone and the replaced feature fusion layers in the neck, precision reaches 95.7% and mAP@0.5 reaches 96.4%, both higher than the YOLOv8 model.
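
The MLCA module used here follows the cited work; as a rough illustration of the underlying idea only (mixing a global channel descriptor with a coarse local spatial descriptor, using an ECA-style 1D convolution over channels), a simplified PyTorch sketch is shown below. It is not the authors’ implementation, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedLocalChannelAttentionSketch(nn.Module):
    """Simplified illustration of mixed local/global channel attention."""

    def __init__(self, channels: int, local_size: int = 5, k: int = 3):
        super().__init__()
        self.local_size = local_size
        # ECA-style 1D convolution over the channel dimension (global branch).
        self.channel_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Global branch: per-channel weights from global average pooling.
        g = F.adaptive_avg_pool2d(x, 1).view(b, 1, c)
        g = torch.sigmoid(self.channel_conv(g)).view(b, c, 1, 1)
        # Local branch: coarse spatial weights from a small pooled grid.
        l = F.adaptive_avg_pool2d(x, self.local_size).mean(dim=1, keepdim=True)
        l = F.interpolate(torch.sigmoid(l), size=(h, w), mode="nearest")
        # Mix both attention maps with the input features.
        return x * g * l


if __name__ == "__main__":
    feat = torch.randn(2, 256, 40, 40)  # e.g. a backbone feature map
    print(MixedLocalChannelAttentionSketch(256)(feat).shape)  # (2, 256, 40, 40)
```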

Fig. 6

The sub-module architecture of the multi-object detection model.

Model validation

Two experiments were used to evaluate the developed multi-object detection model: a comparison of detection performance against the YOLOv8 model and an ablation experiment assessing the impact of each sub-module of the architecture.

Detection performance

Figure 7 presents the change curves of the evaluation metrics mAP@0.5 and precision for the original YOLOv8 model and the multi-object detection model. In Fig. 7a, the mAP@0.5 curve shows the overall detection accuracy of both models. The multi-object detection model outperforms YOLOv8 during the early training phase. It maintains high mAP@0.5 values across all iterations and converges faster initially, indicating that the developed model learns flame and smoke features more quickly and comprehensively. In Fig. 7b, compared to the YOLOv8 model, the multi-object detection model exhibits higher precision throughout the training process, indicating a reduction in the false alarm rate. This highlights the greater robustness and accuracy of the developed model when dealing with complex fire detection scenarios.

Fig. 7

The comparison of mAP@0.5 and precision between the YOLOv8 and multi-object detection model.

Evaluation metrics between the YOLOv8 model and the multi-object detection model are presented in Table 3. Although the FPS of the multi-object detection model is lower than that of the YOLOv8 model, it still surpasses the frame processing capability of indoor security cameras. This indicates that the model is suitable for timely fire warnings in building fires. In addition, the multi-object detection model shows significant improvement in detection accuracy on all metrics. The improvement in the F1 score further highlights the balance between precision and recall, ensuring reliable detection performance. These results indicate that the multi-object detection model successfully balances detection accuracy and computational efficiency. This balance of accuracy and efficiency ensures the applicability of the model in building fire detection.

Table 3 Evaluation metrics comparison between the YOLOv8 and the multi-object detection model.
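
As a rough way to check the frame-rate comparison for a trained detector, a simple timing loop such as the following can be used; fire_best.pt is a hypothetical weights file and the random frame merely stands in for camera input.

```python
import time

import numpy as np
from ultralytics import YOLO  # assumes the ultralytics package is installed

model = YOLO("fire_best.pt")  # hypothetical trained weights
frame = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)  # dummy frame

# Warm-up runs so that model loading and CUDA initialisation are excluded.
for _ in range(10):
    model(frame, verbose=False)

n = 100
start = time.time()
for _ in range(n):
    model(frame, verbose=False)
print(f"approx. {n / (time.time() - start):.1f} FPS")
```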

Ablation experiments

The ablation experiments were conducted to validate the effect of each improvement on fire recognition performance. The ablation study consisted of three experiments: Experiment I used the original YOLOv8 model, Experiment II introduced the MLCA mechanism into its backbone, and Experiment III introduced the MLCA mechanism into its backbone and replaced the feature fusion layer in its neck. By comparing the results of these experiments, the specific contribution of each improvement to the model performance can be explicitly assessed. The ablation results of the multi-object detection model are shown in Table 4. Comparing Experiment I and Experiment II, precision increases by 2.2% and mAP@0.5 improves by 1.7%. Comparing Experiment I and Experiment III, precision reaches 95.7% and mAP@0.5 (All) achieves the highest value of 96.4%. Experiment III confirms that the model developed in this study is more capable of acquiring fire characteristic information. The results of the ablation experiments confirm that the multi-object detection model has superior detection performance.

Table 4 Ablation experiments of the multi-object detection model.

Case study

Case introduction

In this study, three cases, drawn from news reports and surveillance videos, were selected to evaluate the performance and robustness of the proposed method in detecting fire incidents; the detection challenges and details of each case are listed in Table 5.

Table 5 Introduction of three actual fire cases.

The surveillance video for Case 1 was taken from post-disaster news reports and recorded the course of a major fire in Ningbo, China, in 2019, in which a fire broke out in a warehouse for daily necessities. The footage reveals the complexity of the environment in which the fire started and the transient nature of the fire. Because the smoke concentration in the early stages of the fire was below the activation threshold of the smoke alarm, the fire alarm was delayed by 20 s.

The surveillance video for Case 2 captured a fireworks ignition incident at a supermarket in the United States. At the time of the outbreak, customers were selecting items near a shelf. Because of the chemical properties of the fireworks, the fire spread rapidly. Nevertheless, the smoke alarm was not triggered until 13 s after the fire ignited.

The surveillance video for Case 3 captured a fire incident in Henan Province, China, in 2014, characterized by blurred footage and transient fire development. The smoke concentration at the start of the fire had not yet reached the activation threshold of the smoke alarm, and the fire alarm was only issued 7 s after the fire broke out. Both Case 2 and Case 3 were derived from surveillance video captured by indoor security cameras, and the blurred images present challenges for detection.

Results of the multi-object detection

To further analyze the fire detection process, this section presents an analysis of fire images at different stages of fire development. The increase in video complexity reduced the FPS to 52–71, which was still sufficient for video-based fire detection. As seen from Table 6, the model achieved optimal performance in Case 1 due to the clarity of the video. In Case 2 and Case 3, the relatively blurry videos resulted in a slight increase in detection latency.

In terms of smoke detection, Table 6 showed the results of early fire detection, indicating the effectiveness of detecting smoke at an early stage. This early detection was crucial for preventive action and emphasized the sensitivity of the multi-object detection model to the smoke characteristic. In terms of flame detection, the multi-object model successfully detected flames in both localized and intense stages. The model captured the distinctive characteristics of flames at the initial stage and the stage of intense combustion. This suggested that the model could detect flames at all stages of fire development, providing timely warnings in the early stages of a fire.

In conclusion, the accuracy of the detection and the high FPS proved the validity of the multi-object detection model and the reliability of fire monitoring, ensuring that it met the requirements for effective video fire detection.

Table 6 Detection results for flame and smoke in building fires in three actual fire cases.

The fire detection time results are shown in Table 7. In all three cases, the detection time was kept under 2 s, ensuring that the model provided timely warnings during the early stages of a fire. In Case 1, the fire scene included multiple obstacles, but the multi-object detection model detected the fire within 2 s of the flame appearing. In Case 2, despite the low clarity of the smoke features, smoke was detected 2 s after the fire outbreak. In Case 3, despite the low clarity of the video, the model detected smoke immediately after the fire started, demonstrating the robustness of the detection model.

Table 7 Fire detection time by the proposed method in three actual fire cases.
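
A sketch of how such detection times can be read off a clip is given below: it reports the timestamp of the first frame in which the detector returns any flame or smoke box (case1.mp4 and fire_best.pt are hypothetical file names).

```python
import cv2
from ultralytics import YOLO  # assumes the ultralytics package is installed

model = YOLO("fire_best.pt")             # hypothetical trained weights
cap = cv2.VideoCapture("case1.mp4")      # hypothetical surveillance clip
fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 FPS if unknown

frame_idx, first_detection = 0, None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]
    if len(result.boxes) > 0:  # any flame/smoke box above the confidence threshold
        first_detection = frame_idx / fps
        break
    frame_idx += 1
cap.release()

if first_detection is None:
    print("no fire features detected")
else:
    print(f"first fire feature detected at {first_detection:.1f} s")
```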

The warning times of traditional fire alarms (e.g., smoke sensors) and the developed multi-object detection model were compared, as shown in Fig. 8. In the above three cases, the time required by the traditional fire alarms was 20, 13, and 7 s, respectively, whereas the warning time of the developed multi-object detection model was significantly shorter, at only 2, 2, and 1 s, respectively. These results demonstrated the substantial performance advantages of the multi-object detection model over traditional fire alarms, particularly in terms of detection efficiency. In Case 1, the model detected the fire in 2 s, compared with the 20 s required by the traditional method, a tenfold improvement in efficiency. This highlighted the model’s ability to efficiently identify early fire characteristics, even in complex environments. In Case 2, the developed model improved the efficiency of fire warnings by a factor of 6.5. Similarly, in Case 3, the proposed method achieved a sevenfold improvement in efficiency. The proposed method was therefore at least 6.5 times more efficient at fire warning than traditional fire alarms in these cases. In summary, the developed multi-object detection model enabled the timely detection of fire characteristics in building fires.

Fig. 8

Warning time comparison between the multi-object detection model and the traditional fire alarms in three actual fire cases.

Conclusion

In this paper, a multi-object detection method through AIGC is proposed. The effectiveness of the proposed method is demonstrated through its application to three historical fire videos. Some conclusions are summarized as follows:

  1. (1)

    For the AIGC dataset, the AIGC-generated fire images can be used to expand the dataset of building fire images and address the limitation caused by the serious shortage of real building fire images. The validation experiments indicate that the detection precision of the model trained on AIGC-generated fire images is only 1.6% lower than that of the model trained on the real image dataset.

  2. (2)

    For the detection model, the developed model enhances the detection of flames and smoke in complex environments, enabling more timely and accurate fire warnings. The developed multi-object detection model achieves a mAP@0.5 of 96.4%, representing a 2.6% increase compared to the YOLOv8 model. Additionally, its precision reaches 95.7%, which is 3.1% higher than that of the YOLOv8 model.

  3. (3)

    For the method, the three case studies demonstrate that the proposed detection method can accurately detect building fires within 2 s. The proposed method shows a significant advantage over traditional fire warning methods, being 10 times, 6.5 times, and 7 times more efficient in the three cases, respectively.

  4. (4)

    For application, the method proposed in this paper leverages indoor security cameras for fire detection, enabling more timely and direct fire warnings. It can provide building occupants with more time to evacuate, significantly reducing the risk of fire-related casualties.

The proposed method is limited to data captured by indoor security cameras, which restricts its detection scope primarily to indoor spaces. If outdoor cameras were added, the model could be trained with an AIGC-generated dataset of outdoor images to detect outdoor fires; however, this has not been explored in the present study. In the future, traditional fire detection methods could also be combined with the proposed method to further improve detection efficiency.