Introduction

Metro infrastructure construction is a dynamic systems engineering undertaking with complex phenomena and chaotic characteristics1. Factors such as intricate geological and hydrological settings create formidable challenges for construction and organizational coordination2,3. Safety hazards in subway construction are intricate, concealed, and dynamic, and may lead to financial losses, personal injury, ecological disruption, project delays, and diminished structural integrity4,5. Traditionally, the identification and analysis of safety risks in this context relied on industry specialists, academics, and seasoned project leaders6. As historical data accumulate, their distinctive spatiotemporal attributes, highly nonlinear behavior, and intricate interconnections present formidable hurdles for comprehensive analysis7,8. The perception, analysis, and inference of risks by safety experts are also inevitably affected by cognitive bias and individual subjectivity9. To address these problems, this study applied data mining technology, guided by expert experience, to identify risks automatically from accident data and to explore safety hazards and the rules governing their development and evolution. The study helps to compensate for the subjective limitations of individual practitioners and can thereby improve the level of safety risk management.

In subway construction safety risk management, the analysis of accident cases has been widely used to support the administration and improvement of engineering safety, serving as a valuable resource for mitigating similar hazardous situations and risk incidents10,11,12. At present, researchers mostly focus on individual risk events or specific types of subway accidents, and much of the related work is limited to descriptive statistics of risk incidents in a specific country or region13,14,15. However, risks do not arise suddenly or in isolation16. Many interrelations exist among risk events, and these are often overlooked by single-case analysis17,18. In a specific event, risk factors lead in turn to original risk events, secondary events, and derivative events, forming a complete causal chain of risk transmission. The combination of multiple risk chains forms a risk network19,20. By mining and analyzing sets of causal chains of risk transmission, the interactions among risk factors and events can be revealed across comprehensive, multi-category accident investigations. This approach avoids the limitations inherent in analyzing single risk factors and events, and contributes to more systematic and comprehensive safety risk management.

Therefore, BiLSTM-CRF and CNN models were introduced in this study for entity recognition and relationship extraction in domain accident texts. BiLSTM was used to capture long-term dependencies in sentences. The CRF layer handles the sequence labeling task and enhances named entity recognition. The CNN uses parameter sharing and sparse connections, which reduce the parameter count and lower the computational cost of the model. Together, these models enable the automatic extraction and transformation of subway construction accident reports into causal chains with the structure ‘safety risk factors - risk events’, addressing the over-reliance of risk identification on expert experience and improving the efficiency of domain-specific safety risk recognition. By constructing a set of causal chains for metro construction safety risks, the study reveals, from a multi-case perspective, the intricate impact relationships and risk transmission pathways among risk factors and events in Chinese subway construction over nearly two decades. This approach addresses the problems of limited study cases and single accident categories, and mitigates the individual subjectivity and limitations associated with conventional analysis.

For accident texts of subway construction, we developed and trained an entity recognition model for the domain of subway construction safety risks using a Bidirectional Long Short-Term Memory network combined with Conditional Random Fields (BiLSTM-CRF). The model automatically extracts domain entities, namely safety risk factors and risk events, from accident texts. We also established and trained a causal relationship extraction model based on Convolutional Neural Networks (CNN) to extract causal relationships among domain entities. Together, the two models automatically construct a causal chain of “subway construction safety risk factors - risk events,” thereby revealing general patterns of interaction among risk factors and risk events. To clarify the scope of the research, the following assumptions were made: (1) the BiLSTM-CRF and CNN models can accurately extract safety risk factors and causal relationships from textual data; (2) the event text data represent authentic safety risk scenarios in subway construction; and (3) although not all risk factors are covered, the data are comprehensive enough to support the objectives of this research.

Literature review

Construction safety risks identification based on text mining

The identification of safety risks is a complex systems engineering task. Conventional approaches to safety risk identification concentrated primarily on individual risk factors and specific types of accidents. For subway construction, conventional work focused on particular construction procedures and stages, using expert surveys and interviews, brainstorming sessions, and literature research as methods for risk recognition. These approaches rely heavily on experience and are susceptible to individual cognitive limitations and subjective factors. Fang et al. inferred, based on process control and situational surveys, that nearby pipelines and existing buildings are the primary risk factors during subway construction21. Zhang et al. investigated the interaction between safety risk management performance and the perceived significance of each risk factor by conducting surveys and semi-structured interviews with subway construction workers in the Southeast region22. Shi et al. (2024) used text mining techniques and the DEMATEL-ISM method to identify and evaluate safety risk factors23. Researchers have also proposed subway construction safety risk identification and early warning systems based on construction drawings, the Internet of Things, BIM, and other technological tools, progressively broadening the scope and categories of risk identification. Li et al. introduced a BIM-based subway construction safety risk identification and early warning system that uses engineering parameter information to identify safety risks24. Guo et al. combined BIM with D-S evidence theory to enhance risk management capabilities for complex underground projects25. However, most data sources for risk identification are numerical data collected by construction machinery and image data obtained through optical equipment. As a result, there are few reports on emerging hazard patterns and subtle differences inferred from unstructured and semi-structured textual data such as subway accident reports and records. Furthermore, the limited range of data categories makes it difficult for risk identification to encompass all risk factors and events.

Subway construction safety management based on natural language processing

Natural Language Processing (NLP) enables computers to process and comprehend human language. It is primarily applied in tasks such as text classification, information recommendation, and information extraction26. At present, named entity recognition (NER) in subway construction safety management follows three mainstream approaches: rule-based methods, statistical machine learning methods, and deep learning methods27. Tang et al. used text mining to extract risk data from texts, guiding on-site management of subway construction28. Huo et al. used text mining to extract key features related to subway accidents from raw data and developed a new causal path selection model29; by interrupting causal propagation along these paths, construction safety can be enhanced. Rules are created by experts and scholars in professional fields to meet their own research needs. Li et al. performed entity recognition covering human factors, management, and risks with a BP neural network model30. The recognition results were used to predict potential accident types and to propose safety management measures during construction. Machine learning methods require highly standardized text. In addition, they can operate only on limited data volumes and generally show moderate effectiveness in entity recognition. Deep learning methods have shown excellent performance in NER, further improving the accuracy and efficiency of entity identification31. Zhou et al. developed a double deep Q-network deep reinforcement learning model to predict subway construction safety risks, which helps to enhance safety management at subway construction sites32. Research that applies deep learning techniques to mine and analyze accident investigation reports, safety records, and other accident texts about subway construction remains scarce. With data mining still predominant, entity recognition of safety risks in subway construction is at an early stage.

Research methods

Research framework

The knowledge structure framework of safety risks in subway construction is depicted in Fig. 1. The framework comprises four parts: corpus construction, the entity recognition model, the relationship extraction model, and the construction of a subway construction safety risk network. To ensure the representativeness of the results, knowledge extraction was conducted on 562 accident texts spanning 20 years. The BiLSTM-CRF model was used for entity recognition in safety risks of subway construction, while a convolutional neural network model was employed to extract causal relationships among domain entities. To enable a more comprehensive analysis of safety risk factors and risk events, this research normalized the safety risk factor and risk event entities obtained from named entity recognition and constructed a domain synonym dictionary. Ultimately, this research identified 533 causal relationship triplets in the domain of subway construction safety risks, which served as the basis for constructing a directed unweighted complex network and a case database for safety risks in subway construction.

Fig. 1 Research framework.

BiLSTM-CRF model

The research employed BiLSTM-CRF as the deep learning framework for named entity recognition in safety risks of subway construction. BiLSTM is a type of recurrent neural network that processes a sequence in both directions, which lets it capture long-range dependencies and make fuller use of contextual information. The CRF is a discriminative probabilistic undirected graph model that represents the conditional probability distribution of one set of random variables given another set of random variables33. It is well suited to sequence labeling tasks, ensuring the consistency and contextual relevance of predicted entity labels. The combined BiLSTM-CRF model is therefore suited to text with sequential patterns, where long-range dependencies must be captured and labels must be coherent in context, as in named entity recognition. In this research, the entities in accident texts were labeled, and the optimal global label sequence was obtained under the constraints of the CRF. Because the label assigned to each character takes the labels of the other characters into account, the recognition performance of the model improves. First, the subway construction accident text was converted into a character vector representation on a per-character basis, denoted as \(\{x_{1}, x_{2}, \ldots, x_{n}\}\), where \(x_{t}\) is the vector of the \(t\)-th character and \(t \in [1, n]\), which served as the input to the model. The character vector features used the 100-dimensional “Chinese word vector library” trained on Wikipedia, which covers 16,991 characters. Subsequently, the data were fed into the LSTM network in forward and backward order to obtain the forward and backward hidden vectors (\(\overrightarrow{h_{t}}\) and \(\overleftarrow{h_{t}}\)), which contain semantic information about subway construction accidents. The two vectors were concatenated to form the final output vector \(h_{t}\), which served as the input to the CRF layer. Finally, the CRF layer produced the predicted labels for named entities in safety risks of subway construction, which are presented through the output layer.
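The role of the CRF layer in selecting a globally optimal label sequence can be illustrated with Viterbi decoding. The sketch below is a minimal illustration, not the authors' implementation: the emission scores stand in for values that would be derived from the BiLSTM output vectors, and the label set, scores, and the single transition penalty are hypothetical.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    # best[t][y]: score of the best path that ends with label y at character t.
    best = [{y: emissions[0][y] for y in labels}]
    backpointers = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Trace back from the best final label to recover the full path.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for three characters; in a real model the emission
# scores would come from the concatenated BiLSTM vectors h_t.
LABELS = ["O", "B-RE", "I-RE"]
EMISSIONS = [
    {"O": 1.0, "B-RE": 0.8, "I-RE": 0.1},
    {"O": 0.2, "B-RE": 0.1, "I-RE": 1.0},
    {"O": 1.0, "B-RE": 0.1, "I-RE": 0.3},
]
# A trained CRF scores every label pair; here only the invalid O -> I-RE
# transition is penalized and every other pair defaults to 0 (an assumption).
TRANSITIONS = {("O", "I-RE"): -10.0}

print(viterbi_decode(EMISSIONS, TRANSITIONS, LABELS))
```

Choosing the highest-scoring label for each character independently would give O, I-RE, O, which opens an entity with an inside tag. With the transition penalty, decoding returns B-RE, I-RE, O, showing how neighbouring labels constrain each other.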

Convolutional neural network model

The research used a CNN-based deep learning framework to extract causal relationships between entities in safety risks of subway construction. The CNN model consists of an embedding layer, convolutional layers, a pooling layer, and a fully connected layer. The embedding layer represents each word as a vector by embedding word features and position features. The convolutional layer captures the overall semantic information of sentences34. The pooling layer compresses the output of the convolutional layer using max-pooling to extract significant features, control overfitting, and obtain the feature vector of a sentence35. The fully connected layer integrates the highly abstracted features obtained through multiple convolutions to produce an output probability for each class, that is, the relationship classification result36. The training process of this model is depicted in Fig. 2.

Fig. 2 CNN model.
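The layer sequence described above (embedding, convolution, max-pooling, fully connected output) can be sketched in plain Python. This is an illustrative forward pass only, with no training: the two-dimensional token vectors, filter values, and output weights are made-up numbers, and the three output classes are assumed to correspond to the cause, effect, and co-occurrence relations.

```python
import math
from typing import List

Vector = List[float]


def conv1d(sentence: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one filter over consecutive token vectors and apply ReLU."""
    width = len(kernel)
    feature_map = []
    for start in range(len(sentence) - width + 1):
        total = bias
        for offset in range(width):
            token, weights = sentence[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(token, weights))
        feature_map.append(max(0.0, total))
    return feature_map


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence: List[Vector], kernels: List[List[Vector]],
             weights: List[Vector], biases: List[float]) -> List[float]:
    """Convolution -> max-over-time pooling -> fully connected layer -> softmax."""
    pooled = [max(conv1d(sentence, kernel)) for kernel in kernels]
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Hypothetical inputs: 4 tokens, each a 2-dimensional vector standing in for
# the concatenated word and position features of the embedding layer.
SENTENCE = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8], [0.5, 0.5]]
KERNELS = [
    [[1.0, 0.0], [0.0, 1.0]],    # filter 1, window of 2 tokens
    [[0.5, -0.5], [0.5, 0.5]],   # filter 2, window of 2 tokens
]
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 relation classes x 2 pooled features
BIASES = [0.0, 0.0, 0.0]

print(classify(SENTENCE, KERNELS, WEIGHTS, BIASES))
```

Each filter yields one pooled value per sentence regardless of sentence length, which is why max-pooling gives a fixed-size feature vector for the fully connected layer.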

Model evaluation criteria

The recognition performance of the Metro Construction Safety Risk Named Entity Recognition Model (MCSR-NER-Model) and the Metro Construction Safety Risk Domain Entity Causal Relationship Extraction Model (MCSR-CE-Model) was evaluated with three metrics: Precision (P), Recall (R), and F1 Score37. The definition and formula of each metric are given below.

Precision represents the proportion between the number of correctly identified entities and the total number of identified entities.

$$\:P=\frac{TP}{TP+FP}\times\:100\text{\%}$$

Recall indicates the proportion between the number of correctly identified entities and the total number of pre-labeled entities.

$$\:R=\frac{TP}{TP+FN}\times\:100\text{\%}$$

The F1 Score is the harmonic mean of precision (P) and recall (R).

$$\:{\text{F}}_{1}=\frac{2\text{P}\text{R}}{\text{P}+\text{R}}\times\:100\text{\%}$$

In the above formulas, TP represents the number of samples predicted as positive class that are actually positive, FN represents the number of samples predicted as negative class that are actually positive, and FP represents the number of samples predicted as positive class that are actually negative.
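The three formulas can be computed directly from the counts. The sketch below is a minimal illustration: the function name and the example counts are hypothetical, and returning 0 when a denominator is zero is a convention chosen here rather than something stated in the text.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return precision, recall, and F1 as percentages."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision * 100, recall * 100, f1 * 100


# Hypothetical counts, not results from the paper.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.2f}%  R={r:.2f}%  F1={f1:.2f}%")
```

Because F1 is a harmonic mean, it lies between P and R and sits closer to the lower of the two.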

Experiment and results

Data acquisition

The dataset used in this research consists of 562 accident texts collected from March 2001 to November 2021, including 130 accident reports on subway construction and 432 accident bulletins published by the Ministry of Housing and Urban-Rural Development and news media (Table 1). The dataset contains 1821 informative sentences. In contrast to free text, the accident investigation reports contain extensive descriptions of safety risk factors, accident circumstances, outcomes, impacts, and responsibilities, which facilitate comprehensive analysis and causal chain delineation of accidents, as well as data mining and structured documentation of risk events. The bulletins and notices published by the Ministry of Housing and Urban-Rural Development succinctly describe the causes of subway accidents, the names of risk events, risk outcomes, and their consequences. They help to clarify critical information about subway accidents, such as safety risk factors and accident outcomes.

Table 1 The information of data acquisition.

In the accident texts of subway construction, safety risk factors and risk events are usually described with specialized nouns and phrases, such as “violations in operations” and “overloaded vehicle cargo”. The descriptions of risk events are standardized and uniform, such as “foundation pit collapse” and “objects striking.” Phrases composed of a small number of Chinese characters that can stand alone and convey a state, attribute, or explicit meaning are referred to as entities. Mining these entities makes it possible to pinpoint key information quickly, such as the causes, names, and outcomes of risk events. The entities related to safety risks of subway construction studied in this research fall into two main categories: safety risk factors and risk events (RE). Safety risk factors are the causes that may lead to risk events in subway construction, consisting of human factors (HF), material defects (MD), environmental factors (EF), and technical and management factors (TM)38,39,40,41. Risk events refer to the ultimate results caused by various safety risk factors, including subway accidents, casualties, and equipment losses.

Text preprocessing

First, the collected accident texts of subway construction, which came in various formats such as Word, PDF, images, and web pages, were converted into TXT documents with UTF-8 encoding. Next, each document was reviewed to correct misspellings, language expression errors, and semantic inaccuracies, ensuring precise and correct language usage42. Finally, the accident texts were segmented into individual sentences. A total of 1821 sentences related to subway construction accidents were obtained.

The accident texts of subway construction are unstructured texts written in natural language, so colloquial expressions are inevitable. Stop words are high-frequency, low-value words that carry little meaning; removing them effectively reduces the dimensionality of text features and improves entity recognition. By manually analyzing accident texts of subway construction and reviewing domain-specific literature, the expression patterns of entities such as safety risk factors and risk events were identified. On this basis, a stop word list for accident texts of subway construction was compiled and is presented in Table 2.

Table 2 Stop words list in accident texts of subway construction.

Python code was employed to remove stop words in the 1821 texts to mitigate the impact of stop words on subsequent text mining tasks and enhance the accuracy of entity recognition and relationship extraction.
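Sentence segmentation and stop word removal of this kind can be sketched with the standard library alone. The example below is an illustration under assumptions: the punctuation set, the two stop words, and the sample text are hypothetical, and the actual stop word list is the one in Table 2.

```python
import re
from typing import Iterable, List


def split_sentences(text: str) -> List[str]:
    """Split on common Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。！？!?；;\n]+", text)
    return [part.strip() for part in parts if part.strip()]


def remove_stop_words(sentence: str, stop_words: Iterable[str]) -> str:
    """Delete every stop word occurrence, longest stop words first."""
    for word in sorted(stop_words, key=len, reverse=True):
        if word:
            sentence = sentence.replace(word, "")
    return sentence


# Hypothetical stop words and sample text (not taken from Table 2).
STOP_WORDS = ["的", "了"]
# "Workers violated operating rules, leading to a foundation pit collapse.
#  The accident injured 2 people!"
TEXT = "工人违规操作，导致了基坑坍塌。事故造成2人受伤！"

for sentence in split_sentences(TEXT):
    print(remove_stop_words(sentence, STOP_WORDS))
```

Plain substring removal is the simplest option but can also delete a stop word character that sits inside a meaningful word, so a word-segmented approach may be preferable for larger stop word lists.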

Domain named entity identification

Text sequence annotation

Annotating text sequences related to subway construction safety accidents involves identifying individual risk factors, domain entities in risk events, and the attributes of these entities. The annotation process is documented for reference and analysis. The annotation of accident texts of subway construction follows the “BIO encoding format”43. The quality of the labeled corpus plays a decisive role in the performance of trained deep learning models44. Table 3 shows the definitions of the label categories. The annotation was carried out by trained personnel drawn from domain experts and the research team. After the initial annotation, inter-annotator agreement was ensured through cross-checking among annotators.

Table 3 Definitions of label categories.

The sequence annotation results were stored in TXT documents with UTF-8 encoding. Annotation of 1614 accident texts was completed with Python. Table 4 provides a statistical summary of the annotated quantities for each category of domain entity in the texts. Examples of entity names are also shown.

Table 4 The annotation results of entity sequence in accidents texts of subway construction.
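Character-level BIO annotation can be illustrated as follows. This is a sketch under assumptions: the tag names are formed as B-/I- plus the entity codes used in this paper (HF, MD, EF, TM, RE), the exact label definitions are those of Table 3, and the helper name and sample sentence are hypothetical.

```python
from typing import List, Tuple


def bio_tags(sentence: str, entities: List[Tuple[str, str]]) -> List[str]:
    """Assign one BIO tag per character given (entity_text, label) pairs."""
    tags = ["O"] * len(sentence)
    for text, label in entities:
        if not text:
            continue
        start = sentence.find(text)
        while start != -1:
            end = start + len(text)
            # Only tag spans that do not overlap an already tagged entity.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = f"B-{label}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"
            start = sentence.find(text, end)
    return tags


# "Violations in operations led to a foundation pit collapse" (hypothetical sample).
SENTENCE = "违规操作导致基坑坍塌"
ENTITIES = [("违规操作", "HF"), ("基坑坍塌", "RE")]

for char, tag in zip(SENTENCE, bio_tags(SENTENCE, ENTITIES)):
    print(char, tag)
```

The first character of each entity receives a B- tag, the remaining characters receive I- tags, and characters outside any entity receive O.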

Training data structure and environment configuration

The Chinese named entity recognition model for safety risks of subway construction was built in Python 3.6.5. The deep learning training framework is TensorFlow 1.13.1. The code was developed and run in PyCharm Community Edition 2021.2.3 on a 64-bit Windows 10 operating system. Because the experimental data are relatively small in scale, computations were performed on a CPU, an Intel Core i7-6700HQ @ 2.60 GHz. The initial parameter settings of the MCSR-NER-Model are listed in Table 5.

Table 5 MCSR-NER-Model parameter settings.

After preprocessing, 1614 sentences from accident texts of subway construction were used for model training and validation. The data were shuffled with a shuffle routine to introduce randomness and improve the generalization performance of the neural network, thereby mitigating the effect of text input order on model training. The texts were partitioned into training, cross-validation, and test sets in a 7:1:2 ratio. The training set, containing 1130 sentences, was used for model training to learn text features. The cross-validation set, consisting of 161 sentences, was used for automatic adjustment of model parameters during training. The test set comprises 323 sentences and was used to assess the performance of the trained model.
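The shuffling and 7:1:2 partition can be reproduced in a few lines. The sketch below is illustrative: the function name, the fixed random seed, and the rounding rule are assumptions, chosen so that 1614 sentences yield the set sizes reported above.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_and_split(
    items: Sequence[T],
    ratios: Tuple[float, float] = (0.7, 0.1),
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of the data, then cut it into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder forms the test set
    return train, val, test


# Sentence indices stand in for the 1614 annotated sentences.
train, val, test = shuffle_and_split(range(1614))
print(len(train), len(val), len(test))
```

Assigning the remainder to the test set guarantees that every sentence lands in exactly one subset even when the ratios do not divide the data evenly.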

Analysis of training results

The predictions of the MCSR-NER-Model were assessed with the Python version of conlleval.pl. The evaluation results are presented in Table 6. They cover the overall model performance and the recognition results for individual entity types.

Table 6 Identification results of MCSR-NER-model.
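Entity-level evaluation compares whole entity spans rather than individual character tags. The sketch below is a simplified approximation of that idea, not the conlleval script itself: an entity counts as correct only when its start, end, and type all match, and stray I- tags without a preceding B- tag are simply ignored here. The function names and tag sequences are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_entities(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def count_matches(gold_tags: List[str], pred_tags: List[str]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) at the entity level."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    return tp, len(pred) - tp, len(gold) - tp


# Hypothetical gold and predicted sequences for one sentence.
GOLD = ["B-HF", "I-HF", "O", "B-RE", "I-RE"]
PRED = ["B-HF", "I-HF", "O", "B-RE", "O"]
print(count_matches(GOLD, PRED))
```

In this example the truncated RE entity counts as both a false positive and a false negative, which is why partially recognized entities lower precision and recall at the same time.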

The trained MCSR-NER-Model achieved precision, recall, and F1 values all exceeding 77%. The entity recognition task was made harder by the moderate amount of text processed in this research and the diversity of expressions in the texts. For a named entity recognition task in a specialized domain with a relatively limited amount of textual data, the performance of the MCSR-NER-Model is considered favorable: the model effectively captured the language features and semantic expressions in accident texts of subway construction. Subsequently, the remaining 207 accident-related texts were input into the trained MCSR-NER-Model to identify safety risk factor and risk event entities. A total of 370 entities in safety risks of subway construction were identified.

During entity recognition in the safety risks of subway construction, some specific terms unique to subway construction were not identified completely, such as “shield machine is flooded” and “tunnel face collapsed”. It is therefore necessary to keep enriching the domain-specific vocabulary and to improve the recognition capability of the model by embedding domain-specific dictionaries.

Among the entity types, the risk event (RE) category showed the best results, with the highest F1 value of 85.26%. Several factors contribute to this result. (1) In the training dataset, risk events have the largest number of annotations, and accidents are described more accurately and uniformly than other entity types. (2) The position of risk event entities in the text is relatively fixed, generally at the beginning or the end of the sentence. For example, in “the object strike accident is caused by the fatigue of the construction personnel”, the accident entity “object strike” appears at the beginning, whereas in “Construction personnel did not hang the safety rope according to the regulations, violate the operating procedures during the process of high edge operation, resulting in a high fall accident”, the accident entity appears at the end. (3) Risk event entities are often accompanied by words such as “reason”, “cause”, “lead” and “trigger”. These high-frequency words help to locate risk event entities. Therefore, risk events can be easily identified by the model.

The environmental factor (EF) category showed the lowest recognition performance. This can be attributed to the scarcity of environmental factor entities and their corresponding labels in the training text; the word “gas”, for example, appeared only twice. In addition, some longer expressions of environmental factors failed to be identified. For instance, in “Coupled with the early rain, the soil is soaked and loosened, and once the soil layer above the pipeline is not compacted, it is easy to cause the earth to collapse”, “rain” was correctly identified, while “soil is soaked and loosened” was not recognized.

Domain entity discovery

The MCSR-NER-Model discovered new domain-specific entities, covering safety risk factors and accidents of subway construction, in the 207 accident-related texts. These entities did not appear in the initial 1614 texts used to build the model. The new domain-specific entities are listed in Table 7. Because the textual data on subway construction accidents are limited, the number of new domain entities is relatively small. As accident texts continue to expand, more domain entities can be expected to be identified, and the recognition performance for domain entities should improve.

Table 7 Examples of the fresh domain-specific entities.

By consolidating the pre-annotated domain entities with those produced by the trained MCSR-NER-Model, 1361 entities of safety risks in subway construction were obtained. The risk factors comprise 147 human factor entities, 124 material defect entities, 107 environmental factor entities, and 358 technical and management factor entities. A total of 625 risk event entities were obtained.

Causality extraction about domain entities

Relationship types identification

A supervised learning approach was used for relation extraction, distinguishing three categories of relationship among domain entities: cause, effect, and co-occurrence. The extraction outcomes were represented as triplets (entity1, entity2, relation). The relationships are defined as follows: cause: the occurrence of entity1 is caused by entity2; effect: the occurrence of entity1 leads to the occurrence of entity2; accompany (co-occurrence): entity1 and entity2 frequently occur together. Manual examination of the accident texts showed that only some of the texts containing domain entities exhibit causal relationships. Three cases arise. First, a sentence contains only one domain entity. Second, a sentence contains two entities, but no causal relationship exists between them. Third, a sentence contains three or more entities, among which causal relationships may exist. Considering these cases, a set of rules was formulated to filter out sentences without causality, as depicted in Fig. 3.

Fig. 3 Text selection process of entity causality extraction in safety risks of subway construction.

Text sequence annotation

The text annotation format for relation extraction differs from that for entity recognition. In the training corpus for relation extraction, each sentence occupies its own line and includes the following components: [sentence, relation, head, head_type, head_offset, tail, tail_type, tail_offset]. The meaning of each component is explained in Table 8.

Table 8 Annotation structure description of domain entity causality extraction text.

The domain entities and their positions in the training and testing sentences were obtained with regular expressions, implemented with Python's re library. The annotation results were stored in CSV format.
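An annotation record with these eight fields can be assembled with Python's re and csv modules. The sketch below rests on assumptions: offsets are taken to be the character index of the first occurrence of each entity in the sentence (the precise definition is the one in Table 8), and the helper names and sample sentence are hypothetical.

```python
import csv
import io
import re
from typing import Dict, List, Union

FIELDS = ["sentence", "relation", "head", "head_type", "head_offset",
          "tail", "tail_type", "tail_offset"]

Record = Dict[str, Union[str, int]]


def build_record(sentence: str, relation: str, head: str, head_type: str,
                 tail: str, tail_type: str) -> Record:
    """Locate both entities in the sentence and return one annotation row."""
    head_match = re.search(re.escape(head), sentence)
    tail_match = re.search(re.escape(tail), sentence)
    if head_match is None or tail_match is None:
        raise ValueError("both entities must occur in the sentence")
    return {
        "sentence": sentence, "relation": relation,
        "head": head, "head_type": head_type, "head_offset": head_match.start(),
        "tail": tail, "tail_type": tail_type, "tail_offset": tail_match.start(),
    }


def to_csv(records: List[Record]) -> str:
    """Serialize annotation rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# "Violations in operations led to a foundation pit collapse" (hypothetical sample).
record = build_record("违规操作导致基坑坍塌", "effect", "违规操作", "HF", "基坑坍塌", "RE")
print(to_csv([record]))
```

Escaping the entity text with re.escape keeps punctuation inside an entity name from being read as regular expression syntax.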

Training data structure and hardware construction

This research collected and organized 996 preprocessed sentences containing causal relationships among domain entities from subway construction accident texts for model training and validation. The sentences were shuffled to increase randomness and enhance the generalization performance of the neural network. The data were divided into a training set and a test set in an 8:2 ratio, comprising 798 and 198 sentences, respectively.

To improve the effectiveness of relationship extraction, a supervised learning approach based on a CNN deep learning model was used, and the MCSR-CE-Model was constructed. The model for extracting causal relationships among domain entities was built in Python 3.6.5, with PyTorch as the neural network training framework. The experimental code was developed in PyCharm Community Edition 2021.2.3 on a 64-bit Windows 10 operating system. A GPU was used in this research, an NVIDIA GeForce GTX 960M graphics card. The parameters of the MCSR-CE-Model are presented in Table 9.

Table 9 MCSR-CE-Model parameters setting.

Training results analysis

The extraction results of the MCSR-CE-Model are shown in Table 10.

Table 10 Extraction results of MCSR-CE-Model.

The MCSR-CE-Model achieved high accuracy and recall, and an F1 score of 98.96%, indicating excellent performance. Several factors may explain this result. The research aimed to explore how risk factors influence risk events in subway safety, so the relationships were restricted to causality, and initial text filtering was conducted on this basis, which significantly reduces interference from sentences without causality. In addition, relationship extraction focused on only three types: “cause, effect, and co-occurrence”.

In this section, 996 of the 1614 accident texts, namely those containing causal relationships between entities, were used for training and testing the MCSR-CE-Model. Causal relationships were then extracted from the remaining 207 accident texts, yielding 163 sentences containing causality. During causality extraction, the majority of prediction errors were associated with the “co-occurrence” relationship. For instance, in “The mishandling from worker caused the scaffold to fall onto a gas cylinder, leading to the fire and explosion,” the manually labeled relationship between “fire” and “explosion” is “co-occurrence”, but the model predicted “effect”. The primary reason is the limited training data for “co-occurrence” relationships, which account for only 25 of the 798 training texts. This scarcity prevents the model from effectively learning the structural characteristics of such text. In addition, manual review showed that sentences containing “co-occurrence” differ only minimally from those containing the other two relationships, which makes them harder for the model to learn.

Finally, the causal relationships produced by the relation extraction model were combined with those manually labeled for model training and testing. In total, 1159 causal triplets were obtained, including 319 “cause” relationships, 811 “effect” relationships, and 29 “co-occurrence” relationships. The causality extraction results are broadly consistent with those extracted by industry experts.
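Triplets of this form can be converted into the edges of a directed network by reading the relation label as an edge direction. The sketch below is one possible interpretation, not the authors' procedure: “cause” triplets are reversed so that every edge points from cause to consequence, and co-occurrence triplets are left out of the directed graph, which is an assumption made here. The sample triplets are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (entity1, entity2, relation)


def build_network(triplets: Iterable[Triplet]) -> Dict[str, List[str]]:
    """Return an adjacency list whose edges point from cause to consequence."""
    edges = defaultdict(set)
    for entity1, entity2, relation in triplets:
        if relation == "cause":      # entity1 is caused by entity2
            edges[entity2].add(entity1)
        elif relation == "effect":   # entity1 leads to entity2
            edges[entity1].add(entity2)
        # Co-occurrence has no direction and is skipped (assumption).
    return {source: sorted(targets) for source, targets in edges.items()}


# Illustrative triplets built from entity names mentioned in the text.
TRIPLETS = [
    ("foundation pit collapse", "violations in operations", "cause"),
    ("scaffold fall", "fire", "effect"),
    ("fire", "explosion", "co-occurrence"),
]

print(Counter(relation for _, _, relation in TRIPLETS))
print(build_network(TRIPLETS))
```

Storing targets in a set keeps the network unweighted, since repeated triplets between the same pair of entities collapse into a single edge.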

Construction of domain dictionary

Normalization of domain entities

The diverse origins of accident information lead to different formatting standards, resulting in significant disparities in how similar or related risk factors and events of subway construction are described across accident texts45. This research systematically organized all safety risk factors and events, analyzed the linguistic expressions of each entity, and integrated and summarized similar expressions with reference to national standards on accident classification. The preliminary summary was then refined by subway project managers, construction technical leaders, university researchers, and graduate students. Ultimately, 1361 entities were divided into four major classes of risk factors of subway construction and 56 types of risk events. Specifically, the risk factors comprise human factors (HF) with 13 categories totaling 147 entities, material defect factors (MD) with 13 categories totaling 124 entities, environmental factors (EF) with 14 categories totaling 107 entities, and technical and management factors (TM) with 24 categories totaling 358 entities. Risk events (RE) comprise 56 categories totaling 625 entities. The classification results and codes for the risk factors and risk events of subway construction are given in Tables 11, 12, 13, 14 and 15.

Table 11 Risk factors of subway construction - human factor (part).
Table 12 Risk factors of subway construction - material defect (part).
Table 13 Risk factors of subway construction - environmental factor (part).
Table 14 Risk factors of subway construction - technical and management factor (part).
Table 15 Types of risk events of subway construction.

Construction of domain dictionary

A domain dictionary for the safety risks of subway construction was constructed based on the classification results. This dictionary encompasses four major classes of risk factors of subway construction and 56 types of risk events, totaling 1361 synonymous entity expressions. Owing to space limitations, only part of the domain dictionary is shown in Table 16.

Table 16 Dictionary of subway construction safety risks (part).
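The role of the dictionary in entity normalization can be sketched as a lookup from surface expressions to category codes. The sketch below is a minimal illustration under stated assumptions: the codes EF2, TM8 and RE20 appear in this paper, but the synonym lists, the case-insensitive matching rule, and the names `DOMAIN_DICT`, `build_lookup`, `normalize_entity` and `category_of` are hypothetical and do not reproduce the actual dictionary.

```python
# Hypothetical excerpt: code -> synonymous surface expressions.
# The synonym lists are illustrative assumptions, not dictionary entries.
DOMAIN_DICT = {
    "EF2": ["rain", "rainfall", "heavy rainfall"],
    "TM8": ["over-excavation of foundation pits", "foundation pit over-excavation"],
    "RE20": ["foundation pit collapse", "collapse of the foundation pit"],
}


def build_lookup(domain_dict):
    """Invert the dictionary into {normalized expression: code}."""
    lookup = {}
    for code, expressions in domain_dict.items():
        for expression in expressions:
            lookup[expression.strip().lower()] = code
    return lookup


def normalize_entity(text, lookup):
    """Map an extracted entity string to its code, or None if it is unknown."""
    return lookup.get(text.strip().lower())


def category_of(code):
    """Return the class prefix of a code, e.g. 'EF13' -> 'EF'."""
    return code.rstrip("0123456789")


LOOKUP = build_lookup(DOMAIN_DICT)
print(normalize_entity("Heavy rainfall", LOOKUP))
print(normalize_entity("collapse of the foundation pit", LOOKUP))
print(category_of("EF13"))
```

Exact matching after lower-casing is the simplest possible rule. Expressions that are absent from the dictionary return `None` and would need manual review before being added.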

Results and applications

Subway construction safety risks complex network construction

In this section, the 1159 causal relationship triplets concerning domain entities of subway construction safety risks were normalized based on the domain dictionary constructed in “Construction of domain dictionary”. Normalization yielded 533 causal relationship triplets, from which a directed, unweighted complex network was constructed, termed the Metro Construction Safety Risk Complex Network (MCSRCN). The MCSRCN model comprises 120 nodes and 533 directed arcs, encompassing all identified risk factors and risk event types of subway construction. The MCSRCN model is visualized with Pajek (Fig. 4). Human factors (HF) are represented by yellow nodes, material defect factors (MD) by green nodes, environmental factors (EF) by red nodes, technical and management factors (TM) by blue nodes, and risk events (RE) by pink nodes.

Fig. 4 MCSRCN model.
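The construction step can be sketched as follows: each entity-level causal pair is mapped to category codes, and pairs that map to the same ordered pair of codes collapse into a single arc, which is one plausible reason why normalization reduces the number of triplets. This is a minimal sketch under assumptions, not the authors' implementation. It assumes each triplet has already been oriented as (cause text, effect text), it skips pairs with unmapped entities, and the lookup entries and toy pairs are illustrative only.

```python
from collections import defaultdict


def build_network(causal_pairs, lookup):
    """Build a directed unweighted network from (cause_text, effect_text) pairs.

    Returns the set of arcs (source_code, target_code) and a successor map.
    Pairs containing an entity missing from the lookup are skipped (assumption).
    """
    arcs = set()
    for cause_text, effect_text in causal_pairs:
        source = lookup.get(cause_text.strip().lower())
        target = lookup.get(effect_text.strip().lower())
        if source is None or target is None:
            continue
        arcs.add((source, target))  # duplicates collapse: the network is unweighted
    successors = defaultdict(set)
    for source, target in arcs:
        successors[source].add(target)
    return arcs, dict(successors)


# Illustrative lookup and entity-level pairs (assumptions, not the paper's data).
toy_lookup = {
    "rain": "EF2",
    "rainfall": "EF2",
    "over-excavation": "TM8",
    "foundation pit collapse": "RE20",
    "pit collapse": "RE20",
}
toy_pairs = [
    ("rain", "foundation pit collapse"),
    ("Rainfall", "pit collapse"),          # same arc after normalization
    ("over-excavation", "foundation pit collapse"),
    ("unlisted factor", "pit collapse"),   # skipped: not in the lookup
]

arcs, successors = build_network(toy_pairs, toy_lookup)
nodes = {code for arc in arcs for code in arc}
print(sorted(arcs))
print(len(nodes), "nodes,", len(arcs), "arcs")
```

The arc set can then be exported to a network tool such as Pajek for layout and visualization.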

The distribution of nodes in the MCSRCN was generated based on the random node degree indicator (Fig. 4). The closer a node is to the central position, the greater its influence on the entire network. The clusters of technical and management factor nodes (blue nodes) and human factor nodes (yellow nodes) are clearly closer to the center of the network. This can be attributed to the fact that human factors are the direct causes of most safety accidents in subway construction, while technical and management factors are frequently the crucial indirect causes. These two types of risk factors are extensively described in accident texts of subway construction. Conversely, the material defect nodes (green nodes) and environmental factor nodes (red nodes) lie farther from the center of the network. A possible reason for the former is the improvement of quality management in subway construction: construction materials with quality issues are often rejected before entering the site, which effectively mitigates these risk factors. Environmental factors, in turn, are strongly influenced by regional conditions. For example, “rain (EF2)” and “weak soil conditions (EF3)” have a significant impact on subway construction projects in central and southern regions, whereas their impact on projects in northern regions is relatively small.
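Node degree, the indicator referred to above, can be computed directly from the arc list. The sketch below is a minimal illustration that treats total degree as in-degree plus out-degree. The exact indicator and layout settings used in Pajek are not specified here, so this definition, the function names and the toy arcs are assumptions.

```python
from collections import Counter


def degree_table(arcs):
    """Return {node: (in_degree, out_degree, total_degree)} for a directed arc list."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in arcs:
        out_degree[source] += 1
        in_degree[target] += 1
    nodes = set(in_degree) | set(out_degree)
    return {
        node: (in_degree[node], out_degree[node], in_degree[node] + out_degree[node])
        for node in nodes
    }


def rank_by_degree(arcs):
    """Nodes sorted by total degree (descending), ties broken by code."""
    table = degree_table(arcs)
    return sorted(table, key=lambda node: (-table[node][2], node))


# Toy arcs with placeholder codes (assumption, not taken from the MCSRCN).
toy_arcs = [("HF1", "RE1"), ("TM1", "HF1"), ("TM1", "RE1"), ("EF2", "RE1")]
table = degree_table(toy_arcs)
ranking = rank_by_degree(toy_arcs)

for node in ranking:
    print(node, table[node])
```

For risk factors the out-degree is usually the more informative component, since it counts how many other factors or events a factor can trigger.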

Construction of subway construction accident database

Based on 146 collected accident reports of subway construction, detailed information including the project overview, accident type, accident resolution measures, and accident warnings was documented to establish an accident database for subway construction. The database is presented in Table 17.

During subway construction, project management personnel and construction workers can use the MCSRCN to identify potential risk events associated with the risk factors observed at the construction site. Using the database of safety response measures and accident warnings, they can retrieve historical information on a specific accident type and learn from the resolution measures and warnings of past accidents. Feasible, tailored measures for risk prevention and control can then be proposed based on the actual engineering circumstances. The risk database thus provides a basis for supporting decision-making in safety management. However, on-site safety management is a complex process that also requires the integration of on-site monitoring data and various other methods to ensure comprehensive control.

Table 17 Database of subway construction accident cases.

Recommendations for mitigating subway construction risks

Using the database, recommendations for risk response were developed for the 2008 Xianghu Station foundation pit collapse accident on Hangzhou Metro Line 1. After entity recognition, the risk factors of this project were identified as complex geological conditions (EF13), rainfall (EF2), rushing work (TM1), over-excavation of foundation pits (TM8), and support system defects (MD13). Based on historical experience and the complex risk network of subway construction, these five risk factors were mapped onto the complex network to generate a risk network diagram specific to this project. The diagram displays nodes of varying sizes based on the node degree index, with larger radii corresponding to higher node degrees. Figure 5 indicates that rainfall (EF2) has the highest node degree index, marking it as the risk factor deserving the most attention. By adopting risk response strategies for rainfall, its impact on the construction site can be reduced or eliminated in a timely manner, and the occurrence of eight risk events (RE1, RE2, RE3, RE7, RE9, RE22, RE36, RE46) can be effectively prevented. Such strategies include optimizing drainage around foundation pits, reinforcing monitoring in regions with elevated groundwater levels and challenging geological conditions, applying plastic film on slopes to prevent erosion during heavy rainfall, and intensifying construction safety inspections. Among all potential risk events, four safety risk factors point towards foundation pit collapse (RE20) and earth-rock collapse (RE30). Risk events related to foundation pit collapse and earth-rock collapse can therefore be anticipated under these construction circumstances. The actual case confirms this expectation, since a foundation pit collapse did eventually occur. The risk network of subway construction developed in this study therefore offers practical guidance.

Fig. 5 The risk network diagram of the Hangzhou Metro project.
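The project-level analysis can be sketched as two queries on the arc set: restrict the network to arcs leaving the observed risk factors, then count how many observed factors point to each risk event. The sketch below is illustrative only. The five factor codes are those named for the Hangzhou case, but the arcs are placeholders, because the text does not list which factors point to which events. The function names are likewise assumptions.

```python
from collections import Counter


def project_subnetwork(arcs, observed_factors):
    """Keep only the arcs whose source is a risk factor observed on site."""
    return {(source, target) for source, target in arcs if source in observed_factors}


def rank_factors(sub_arcs):
    """Observed factors ranked by out-degree in the subnetwork (ties by code)."""
    out_degree = Counter(source for source, _ in sub_arcs)
    return sorted(out_degree.items(), key=lambda item: (-item[1], item[0]))


def rank_events(sub_arcs):
    """Risk events ranked by the number of observed factors pointing to them."""
    support = Counter(target for _, target in sub_arcs if target.startswith("RE"))
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


# Placeholder arcs (assumption): not the arcs of the actual MCSRCN.
toy_arcs = {
    ("EF2", "RE1"), ("EF2", "RE2"), ("EF2", "RE3"),
    ("EF13", "RE20"), ("TM1", "RE20"), ("TM8", "RE20"), ("MD13", "RE20"),
    ("TM8", "RE30"),
    ("HF1", "RE5"),  # factor not observed in this project, so it is excluded
}
observed = {"EF13", "EF2", "TM1", "TM8", "MD13"}

sub_arcs = project_subnetwork(toy_arcs, observed)
factor_ranking = rank_factors(sub_arcs)
event_ranking = rank_events(sub_arcs)

print(factor_ranking[0])  # factor with the most outgoing arcs
print(event_ranking[0])   # event supported by the most observed factors
```

An event reached from several observed factors at once is a natural candidate for priority prevention, which corresponds to the reasoning applied to RE20 and RE30 in the case above.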

Taking collapse risk events as a specific case, this study proposed risk response measures based on database analysis. Project managers can search the database for historical cases of foundation pit collapse accidents that occurred during subway projects (Table 18). By analyzing the resolution measures recorded in the subway accident case database, preventive and contingency measures for foundation pit collapse incidents in the Hangzhou Metro project were extracted.

Table 18 Foundation pit collapse accident cases.
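Retrieving historical cases by accident type and pooling their resolution measures can be sketched as a simple filter over case records. The record fields below follow the four kinds of information documented in the database (project overview, accident type, resolution measures, warnings). The class name, field names and placeholder records are hypothetical and do not reproduce entries of Table 17 or Table 18.

```python
from dataclasses import dataclass, field


@dataclass
class AccidentCase:
    """One database record; field names are illustrative assumptions."""
    case_id: str
    project_overview: str
    accident_type: str  # risk event code, e.g. "RE20" for foundation pit collapse
    resolution_measures: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def find_cases(database, event_code):
    """Return all cases whose accident type matches the given risk event code."""
    return [case for case in database if case.accident_type == event_code]


def collect_measures(cases):
    """Pool resolution measures across cases, dropping repeats but keeping order."""
    seen, pooled = set(), []
    for case in cases:
        for measure in case.resolution_measures:
            if measure not in seen:
                seen.add(measure)
                pooled.append(measure)
    return pooled


# Placeholder records (assumption): texts are stand-ins, not real case data.
database = [
    AccidentCase("C001", "placeholder project A", "RE20", ["measure 1", "measure 2"]),
    AccidentCase("C002", "placeholder project B", "RE30", ["measure 3"]),
    AccidentCase("C003", "placeholder project C", "RE20", ["measure 2", "measure 4"]),
]

pit_collapse_cases = find_cases(database, "RE20")
measures = collect_measures(pit_collapse_cases)
print([case.case_id for case in pit_collapse_cases])
print(measures)
```

In practice the pooled measures would still need to be screened against the actual engineering circumstances before being adopted.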

Conclusion

The safety risks of subway construction are complex, covert, and dynamic. As historical data continue to accumulate, their spatiotemporal features, nonlinear characteristics, and coupling effects pose significant challenges for comprehensive analysis1,46. To address this problem, this research applied text mining and natural language processing to the safety risks of subway construction. The MCSR-NER-Model, based on BiLSTM-CRF, was developed and trained specifically for Chinese entity recognition in this domain, and was used to collect and analyze textual data on subway construction accidents. First, entities associated with the risk factors and risk events of subway construction, together with their types, were labeled. Subsequently, the trained model automatically extracted entities from the accident texts, thereby accomplishing the task of risk identification. In addition, entities were normalized by constructing a dictionary of subway construction safety risks. Ultimately, a total of 736 entity expressions related to risk factors were compiled, including 147 human factor entities, 124 material defect factor entities, 107 environmental factor entities, and 358 technical and management factor entities, along with 625 risk event entities. The resulting 1361 domain entities essentially cover the key vocabulary of the 562 accident texts on subway construction. This research refined the types of risk factors and risk events and thus provides data support for research on the assessment of and response to safety risks in subway construction.

This research also constructed and trained the CNN-based MCSR-CE-Model to extract causal relationships among domain entities in the safety risks of subway construction. In total, 1159 causal relationship triplets among domain entities were obtained. These triplets essentially cover the interrelationships and mutual influences among risk factors and risk events in the 562 accident texts. The model clarifies the transmission pathways of risk factors during subway construction and provides better data support for risk response measures.

With the developed MCSR-NER-Model and MCSR-CE-Model, free text about subway construction accidents was automatically transformed into a chain-like structure of causal relationships, which reduces the heavy reliance on expert experience in conventional safety risk identification. With the continuous collection and expansion of textual data, a safety risk database for subway construction was established, containing numerous expressions of risk factor and risk event entities as well as the causal relationships among them. This database can combine extensive historical accident cases with the experience of domain experts to analyze and resolve safety risk management issues during subway construction from a data-driven perspective.

The primary sources of textual data are subway accident reports and subway accident notices, so the types of text recording risk events remain limited. In terms of data mining samples, the accident texts currently used for model training are also limited, which makes it difficult to cover all risk factors and risk event types of subway construction along with their causalities. In addition, this experiment separated entity recognition and causal relationship extraction into two independent processes, so relationship extraction inevitably depends on the results of named entity recognition. In the future, more carriers of accident information, such as internal enterprise construction logs, accident hazard inspection forms, and near-miss records, will be obtained. Based on the model architecture, the automatic extraction of further relationship types, such as synonymy, membership and coupling relationships, will be explored, and the interactions between risk factors and risk event types of subway construction will be mined further. Time is also crucial in risk research, and time series will be introduced to simulate the evolution of risks at different construction stages.