Introduction

The safe operation of gas polyethylene (PE) pipelines is closely tied to public safety and constitutes a crucial component of the national energy security system, serving as the core infrastructure for ensuring a stable gas supply. With the gradual advancement of intelligent and data-driven pipeline management1, an increasing volume of data related to pipeline safety monitoring, risk assessment, and condition monitoring is being generated. These data are of many types and cover a broad spectrum of information, ranging from the operational status of pipeline equipment to maintenance records and inspection reports. Effectively processing and utilizing these multi-source heterogeneous data has emerged as a key challenge for enhancing the integrity, management efficiency, and operational safety of gas PE pipelines. This research aims to establish a systematic approach to managing gas PE pipeline data effectively. By integrating structured, semi-structured, and unstructured data into a unified framework, a multidimensional knowledge graph reflecting the complete life cycle of the pipeline system is constructed using a knowledge extraction model.

Currently, gas PE pipeline data exhibit several notable characteristics. First, the data sources are diverse, spanning sensor data, operation and maintenance logs, and inspection reports, which not only originate from different systems but also vary in format. Second, the data types are highly varied, comprising structured data (e.g., equipment status, sensor readings), semi-structured data (e.g., operation logs, equipment maintenance records), and unstructured data (e.g., text reports, images). Furthermore, the data are continuously updated and modified after collection and are typically time-sensitive, with new data being generated and accumulated as the pipeline’s operational status evolves. Finally, the issue of data heterogeneity is especially prominent, manifested not only in the diversity of data sources but also in differences in data encoding, storage formats, and expression. For instance, sensor data are typically represented as numerical values, while maintenance logs are manually entered as text and often involve terms defined differently across fields, further complicating data processing2,3,4. Additionally, the temporal and cross-domain nature of the data (spanning domains such as pipeline, equipment, and environment), along with its redundancy and inconsistency, imposes higher demands on data processing and analysis. Therefore, achieving data fusion and effective management while mitigating the impact of data heterogeneity has become a core task in addressing gas PE pipeline data challenges. Consequently, constructing a knowledge graph has emerged as an effective solution to this issue5,6,7,8,9. A knowledge graph can convert heterogeneous data from multiple sources into structured, semantically meaningful information, effectively integrating knowledge from different domains to form a comprehensive and systematic pipeline management knowledge system. This not only enhances the operability and interpretability of the data but also provides support for pipeline condition monitoring, risk prediction, and emergency response.

The complexity and diversity of gas polyethylene pipeline data present significant challenges for traditional rule-based and shallow learning methods, particularly in managing complex relational networks, abstract term definitions, and large volumes of unstructured data10,11,12,13,14. Previous studies have demonstrated that rule-based knowledge extraction methods frequently fail to meet the demands of processing high-dimensional data relationships and dynamic changes15. Consequently, in recent years, deep learning techniques—especially lightweight pre-trained models like ALBERT—have emerged as ideal tools for constructing intelligent knowledge graphs due to their efficient parameter utilization and superior feature extraction capabilities16,17,18,19. For instance, Chen et al.20 proposed the ALBERT-BiLSTM-CRF model, which integrates pre-trained language models and association rules to construct a knowledge graph for water conservancy project safety management, achieving a recognition accuracy rate exceeding 85% and aiding in the formulation of safety measures. Yang et al.21 proposed an ALBERT-based knowledge graph for power transformer operation and maintenance, utilizing ALBERT-BiLSTM-CRF for entity recognition, BiLSTM-Attention for relationship extraction, and Neo4j for data storage. Although effective, this method’s generalization ability for multi-source heterogeneous data integration and complex environments still requires improvement. Zhou et al.22 proposed the CausalKG model, combining the Sampling Knowledge Graph (SIKG) and Causality Relationship Knowledge (CRK) with reasoning based on ALBERT to enhance root cause analysis of faults. However, high-dimensional and noisy data challenge inference reliability, requiring further optimization of data noise reduction and model robustness.

As deep learning continues to advance, fusion variants based on BiGRU modules have increasingly become a research focal point in the field of knowledge extraction. Liu et al.23 proposed the BERT-BiGRU model, which integrates BiGRU to enhance textual feature extraction, performing well in classifying textual faults in subway onboard equipment. Zhang et al.24 proposed the SC-BiGRU-CNN model, which combines BiGRU and CNN to extract both global and fine-grained features, thereby improving event detection accuracy. However, this method is primarily focused on a single task and lacks the ability to integrate multi-source and multi-modal data.

Several studies have enhanced the RoBERTa model through optimization, aiming to improve its ability to comprehend contextual information. Xiao et al.25 proposed a robust optimization-based RoBERTa model that integrates WWM techniques, BiLSTM, and CRF for knowledge graph construction in air compressor fault diagnosis. The method demonstrates strong performance in terms of accuracy and recall when extracting specific entities from unstructured data. Gu et al.26 proposed a knowledge mapping framework for automated equipment defects, integrating RoBERTa-BiLSTM for NER (Named Entity Recognition), ALBERT-BiGRU for relationship extraction, and KBGAT for knowledge complementation. Although the method improves data processing efficiency, it demands significant computational resources. The hybrid RoBERTa-BiGRU-CRF method proposed by Tao et al.27 has yielded significant results in knowledge extraction from text data on highway greenway failure cases, particularly in the NER task. However, this method primarily focuses on the NER task and does not incorporate relationship extraction or knowledge inference, which limits its ability to support more complex knowledge graph construction.

Although the methods discussed above have yielded significant results within their respective domains, they suffer from the following limitations:

(1) Most of these methods concentrate on processing a single data source28,29,30,31, such as the method by Yang et al., which lacks the ability to comprehensively handle heterogeneous data from multiple sources;

(2) Tao et al. focus solely on specific tasks (e.g., text categorization or fault diagnosis) and are unable to comprehensively support knowledge graph construction or the extraction of multidimensional information;

(3) The method proposed by Xiao et al. lacks industry-specific knowledge modelling and fails to fully comprehend and integrate information relevant to gas polyethylene pipelines;

(4) The method proposed by Gu et al. relies on a combination of multiple deep learning models, leading to substantial computational resource consumption in large-scale applications.

In order to address the limitations of existing methods, this paper proposes an ALBERT-BiGRU-CRF-based approach for constructing knowledge graphs of gas polyethylene pipelines, which effectively tackles the following challenges:

(1) Integrating unstructured and semi-structured data (e.g., sensor data, equipment status) through annotation techniques, overcoming the limitations associated with relying on a single data type.

(2) Employing the adaptive selection layer and multi-layer feature fusion of the improved ALBERT model, which not only supports the NER task but also facilitates relationship extraction and knowledge inference to enhance the capacity for constructing knowledge graphs.

(3) Enhancing the modeling capabilities of the knowledge graph in gas polyethylene pipelines through the self-constructed, high-quality training set combined with BiGRU, which captures the contextual semantics of industry-specific text.

(4) Reducing the computational resource consumption of the ALBERT-BiGRU model through lightweight design, thereby improving its feasibility for large-scale applications.

Methods

Framework of gas polyethylene pipeline O&M knowledge graph construction

Knowledge graph construction methods primarily include knowledge-driven top-down approaches, data-driven bottom-up approaches, and their combination. This paper adopts a combined top-down and bottom-up approach, integrating domain-specific knowledge from the gas industry to preliminarily define entity concepts and their relationships while continuously adjusting and refining the schema layer during entity and relationship extraction. The overall construction framework is shown in (Fig. 1).

Fig. 1. Gas polyethylene pipeline knowledge graph construction framework.

Step 1: Preprocess the unstructured and semi-structured data from the operation and maintenance records of gas polyethylene pipelines, including sensor data, equipment status, etc., using annotation methods to construct a training dataset containing text and other multi-source data. After performing data cleaning, format conversion, and missing value treatment, data quality and consistency are ensured to provide high-quality training data for subsequent entity and relationship extraction.

Step 2: Perform named entity recognition (NER) using the ALBERT-BiGRU-CRF model. Initially, the ALBERT model learns the semantic features of the text through the bidirectional Transformer architecture, generating semantic vector representations. The BiGRU layer then captures contextual information and manages long-distance dependencies, while the CRF layer optimizes the sequence globally to ensure coherence and accuracy in entity labelling, enabling the extraction of key entities in pipeline O&M data, such as “pipeline segment,” “valve,” and “leakage point.”

Step 3: Perform inter-entity relationship extraction using the ALBERT-BiGRU-Attention model. The ALBERT model generates semantic representations of the entities, and the BiGRU layer further models contextual information. The Attention layer then focuses on the most relevant textual parts for relationship extraction by employing a weighting mechanism that automatically adjusts feature weights, enhancing the model’s ability to capture relationships. This method can accurately extract relationships between entities, such as the “cause” relationship between “pipeline leakage” and “valve,” and the “decision” relationship between “pipeline performance parameters” and “design pressure.”

Step 4: After extracting entities and relationships, semantic triples (<entity1, relationship, entity2>) are generated, stored in CSV format, and imported into the Neo4j graph database. The final knowledge graph constructed in this study contains 4362 entities and 8 relationship types. The nodes represent different O&M entities, while the edges indicate the relationships between these entities. The graph’s visualization not only facilitates a comprehensive review of O&M data but also supports the tracking of potential failure modes through entity-relationship links, helps assess O&M risks, and assists in formulating preventive maintenance strategies. A minimal sketch of the triple serialization step is given below.
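As an illustration of Step 4, the following sketch serializes a few hypothetical triples to CSV in the <entity1, relationship, entity2> form; the entity and relationship names are examples only, not actual extraction output.

```python
import csv

# Hypothetical triples in the form <entity1, relationship, entity2>;
# the names are illustrative, not actual extraction results.
triples = [
    ("pipeline leakage", "cause", "valve defect"),
    ("pipeline performance parameters", "decision", "design pressure"),
    ("PE-80 segment", "belongs to", "medium-pressure network"),
]

# Write the triples to a CSV file that can later be bulk-imported into Neo4j.
with open("triples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["entity1", "relationship", "entity2"])  # header row
    writer.writerows(triples)
```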

Knowledge extraction model construction

Data preprocessing

The construction of the gas polyethylene pipeline operation and maintenance dataset primarily involves two steps. The first step is the structuring of the data. The structured data includes general pipeline information, such as pipeline level, pipeline number, medium, material, operating parameters (e.g., flow rate, pressure, temperature), and historical maintenance records. These fields provide a comprehensive overview of the pipeline’s basic operational status and maintenance history and thus serve as the core structure of the dataset. A total of 3254 entries were collected. In parallel, unstructured and semi-structured data were integrated to compensate for the lack of detail in the structured data. The unstructured data was obtained from images captured by on-site pipeline construction equipment. These images were processed with image recognition technology and annotated by experts to extract operation and maintenance information directly mapped to relevant fields in the structured dataset. A total of 1098 data entries were generated. Figure 2 illustrates the process of integrating unstructured data with structured data.

Fig. 2. Structured data fusion process.

Semi-structured data primarily resides in the textual conclusions of inspection reports. These texts comprise structured components (inspection items and data indicators) and unstructured narrative segments. The BIO tagging scheme is adopted to identify key entities within these texts—such as fault types, maintenance measures, and on-site processing results. The annotation scheme is illustrated in (Fig. 3), which defines three tags: B (begin) denotes the start of an entity, I (inside) marks its continuation, and O (outside) represents elements not associated with any entity. For instance, B-PIP marks the start of a pipeline name and I-PIP its continuation; similarly, B-MAT and I-MAT correspond to the beginning and continuation of material types, respectively, while B-STA and I-STA denote the start and continuation of operating statuses. Tokens labelled as O represent non-entity content. Twelve annotation labels are defined, as summarized in (Table 1). Adopting the BIO annotation scheme, the entity boundaries and categorical information in texts related to gas polyethylene pipelines are effectively delineated. A total of 489 annotated instances were collected to support subsequent knowledge extraction tasks. A toy example of this labelling is sketched below.
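To make the BIO scheme concrete, the following sketch pairs a hypothetical tokenized sentence with labels drawn from the tag set above (B-PIP/I-PIP for pipeline names, B-MAT/I-MAT for materials, B-STA/I-STA for operating statuses); the sentence and token boundaries are invented for illustration.

```python
# A hypothetical tokenized sentence labelled with the BIO scheme described above.
tokens = ["PE-80", "pipeline", "made", "of", "polyethylene", "is", "running", "normally"]
labels = ["B-PIP", "I-PIP", "O", "O", "B-MAT", "O", "B-STA", "I-STA"]

# B-* opens an entity, I-* continues it, and O marks non-entity tokens.
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```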

Table 1 BIO annotation labels.
Fig. 3. Example of BIO annotation.

The second part of dataset construction primarily involves integrating the collected structured, semi-structured, and unstructured data, followed by reclassification in accordance with the CJJ63-2008 (Polyethylene Gas Pipeline Engineering Technical Standard). Based on content characteristics, the data are categorized into four groups—Basic Data, Production Data, Installation Data, and Operational Data—comprising 35 subcategories (as shown in Table 2). Distinct manual annotation guidelines have been formulated for each data category to ensure that the resulting dataset meets general standards while accentuating domain-specific details. The detailed annotation rules are as follows:

(1) For Basic Data—such as fixed fields including pipeline numbers, installation dates, and materials—first, any numerical or date information conforming to the standard format (e.g., “Pipe number: PE-80”) is directly annotated as Basic Data (BD); second, fixed phrases describing the fundamental attributes of the pipeline (e.g., “Material: natural gas”) are annotated in their entirety as Basic Data to ensure both universality and consistency.

(2) Production Data primarily reflects on-site construction and material handling processes. First, any description of construction activities (e.g., the term “laying” in “pipe laying”) should be annotated as Production Data (PD); second, numerical descriptions of on-site equipment status or construction progress (e.g., “progress reached 85%”) should be annotated separately—wherein numerical and percentage data are uniformly classified as production indicators and subsequently incorporated into Production Data.

(3) Installation Data focuses on the installation processes of pipelines and related equipment. First, terms explicitly describing installation operations (e.g., “install” in “installing the valve”) should be annotated as Installation Data (ID); second, technical terms describing the installation objects (e.g., “valve” and “joint”) should be uniformly annotated as Installation Equipment (ID) to differentiate between operational actions and equipment entities, and then incorporated into Installation Data.

(4) Operational Data primarily records pipeline operation, maintenance, and status monitoring information. First, any description of the pipeline’s operating status (e.g., “pipeline unobstructed” or “risk of leakage”) should be annotated in its entirety as Operational Data (OD); second, verbs related to maintenance activities (e.g., “monitor” or “detect”) should be annotated separately as Operational Activities (OA) to clearly distinguish status descriptions from operational actions, and subsequently incorporated into Operational Data.

The formulation of the aforementioned targeted guidelines not only eliminates ambiguity in data annotation but also accentuates the domain-specific characteristics of each data category, thereby providing more precise training samples for subsequent entity extraction tasks based on the ALBERT-BiGRU-CRF model.

Table 2 Classification of dataset.

Named entity recognition

The goal of Named Entity Recognition (NER) is to extract entities with specific meanings from unstructured text, such as pipeline names, material types, working conditions, and other key information, thereby laying the foundation for the construction of a pipeline O&M knowledge graph. In this paper, the ALBERT-BiGRU-CRF model is employed for Named Entity Recognition (NER) in gas polyethylene pipeline operation and maintenance data. The method comprises three key components: the ALBERT layer, the BiGRU layer, and the CRF layer. First, the ALBERT model, as an efficient pre-trained language model, autonomously learns textual features and generates rich semantic vector representations through its bidirectional Transformer architecture. Next, the BiGRU layer further captures the contextual information from pipeline-related texts. Subsequently, the CRF layer is responsible for annotating the sequences and optimizing the global dependencies between the labels to ensure label coherence and accuracy. The structural diagram of the specific entity extraction model is depicted in (Fig. 4).

Fig. 4. Structure of entity extraction model based on ALBERT-BiGRU-CRF.

ALBERT layer

As shown in Fig. 4, ALBERT first converts each word in the sentence into word vectors. These word vectors are generated through the embedding layer of the ALBERT model and fed into the multilayer Transformer Encoder of ALBERT. ALBERT reduces the model’s parameter count through factorized embedding parameterization and cross-layer parameter sharing, thereby improving both the model’s stability and computational efficiency. Like BERT, ALBERT generates character vectors based on contextual information, which effectively addresses the issue of lexical polysemy. For example, in the sentence “The pipeline reached the specified standard in the pressure test,” the term “pressure” refers to the test pressure used to check the pipeline’s sealing. In contrast, in the sentence “The high pressure inside the pipeline may lead to leakage,” “pressure” refers to the actual fluid pressure inside the pipe. The ALBERT module generates accurate semantic representations of the term “pressure” based on different contexts, thereby avoiding ambiguities that may arise in traditional static embedding models. Additionally, ALBERT utilizes the self-attention mechanism to calculate the correlation between each word and the other words in the input sentence at each layer of the Transformer Encoder. Specifically, ALBERT dynamically adjusts the representation of word vectors by calculating the correlation weights between each word and all other words in the sequence, thereby generating context-aware word vectors. These weighted word vectors are then passed to the subsequent encoding layers to achieve more accurate semantic understanding. The vector representation of the attention mechanism is provided in Eq. (1), and the self-attention mechanism is mathematically modelled in Eq. (2).

$$\left\{ \begin{gathered} Q = [Q_{1} ,Q_{2} ,...Q_{N} ] \hfill \\ K = [K_{1} ,K_{2} ,...K_{N} ] \hfill \\ V = [V_{1} ,V_{2} ,...V_{N} ] \hfill \\ \end{gathered} \right.$$
(1)
$$Attention(Q,K,V) = softmax(QK^{T} )V$$
(2)

Where Q, K, and V denote the query, key, and value matrices, respectively; N denotes the length of the input sequence, T refers to the transpose operation, and softmax indicates the softmax operation applied to the resulting score matrix. For each element of the input sequence, a query vector and a key vector are obtained by linear projection of its representation; the dot products between an element’s query vector and the key vectors of all elements in the sequence yield the attention scores that weight the corresponding value vectors.
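For clarity, the following NumPy sketch implements the dot-product self-attention of Eqs. (1)-(2) exactly as written, i.e., without the scaling factor and multi-head structure used in the full ALBERT encoder; the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Dot-product attention as in Eq. (2): softmax(Q K^T) V.
    Q, K and V each have shape (N, d) for a sequence of length N."""
    scores = Q @ K.T           # correlation of every token with every other token
    weights = softmax(scores)  # attention weights per query position
    return weights @ V         # context-aware weighted combination of value vectors

# Toy example: N = 4 tokens with d = 8 dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(self_attention(Q, K, V).shape)  # (4, 8)
```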

ALBERT, a lightweight variant of BERT, retains robust linguistic representation while significantly reducing computational complexity through parameter sharing and factorization techniques. In the ALBERT model, the lower layers (e.g., layers 1 to 4) primarily capture surface-level features, enabling the model to learn the basic meanings of words and fundamental syntactic structures. The middle layers (e.g., layers 4 to 8) deeply learn syntax and sentence relationships. In contrast, the higher layers (e.g., layers 8 to 12) are tasked with extracting more complex semantic information and understanding deeper contextual meanings and reasoning. This paper proposes an improvement strategy based on feature extraction and weighted fusion across different levels to enhance the model’s semantic comprehension while maintaining computational efficiency. ALBERT produces a set of feature vectors \(\:{F}_{i}\) from multiple layers (i.e., \(i \in \{1,3,5,7,9,11\}\)), where each \(\:{F}_{i}\) captures different linguistic characteristics. Specifically, the lower layers (e.g., \(\:{F}_{1}\) and \(\:{F}_{3}\)) encode basic lexical information, the mid-layers (e.g., \(\:{F}_{5}\) and \(\:{F}_{7}\)) capture syntactic structures, and the higher layers (e.g., \(\:{F}_{9}\) and \(\:{F}_{11}\)) extract deep semantic and contextual representations. To combine these heterogeneous features, we adopt a weighted fusion mechanism, as specified in Eq. (3), as follows:

$$F_{{fused}} = \sum\limits_{{i \in L}} {\alpha _{i} } F_{i}$$
(3)

Where \(\:L\)={1,3,5,7,9,11} and \(\:{\alpha\:}_{i}\) are the learned weights reflecting the relative importance of each layer’s features. In our implementation, the weights \(\:{\alpha\:}_{i}\) are computed using a softmax function over a set of raw scores \(\:{s}_{i}\) derived from each feature vector, as shown in Eq. (4):

$$\alpha _{i} = \frac{\exp (s_{i} )}{\sum\limits_{j \in L} \exp (s_{j} )}$$
(4)

Where \(\:{s}_{i}\)=\(\:W{F}_{i}\)+\(\:b\) for some learnable parameters \(\:W\) and \(\:b\). This formulation allows the model to dynamically adjust the contribution of each layer, ensuring that the fused feature \(\:{F}_{fused}\) robustly captures both surface-level and deep semantic information.
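A possible PyTorch realization of the weighted layer fusion in Eqs. (3)-(4) is sketched below. The selected layer indices follow the text; pooling each layer’s token representations to obtain a single raw score s_i is an implementation assumption, since the paper does not fix this detail.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Weighted fusion of selected ALBERT layers (Eqs. (3)-(4)).
    The token pooling used to obtain one score per layer is an assumption."""
    def __init__(self, hidden_size, layer_ids=(1, 3, 5, 7, 9, 11)):
        super().__init__()
        self.layer_ids = layer_ids
        self.score = nn.Linear(hidden_size, 1)  # s_i = W F_i + b

    def forward(self, hidden_states):
        # hidden_states: sequence of per-layer tensors, each (batch, seq_len, hidden)
        F = torch.stack([hidden_states[i] for i in self.layer_ids])  # (L, B, T, H)
        s = self.score(F.mean(dim=2)).squeeze(-1)                    # (L, B) raw scores
        alpha = torch.softmax(s, dim=0)                              # Eq. (4): weights over layers
        return (alpha[..., None, None] * F).sum(dim=0)               # Eq. (3): fused (B, T, H)

# Toy usage with random tensors standing in for ALBERT layer outputs.
layers = tuple(torch.randn(2, 16, 128) for _ in range(13))
print(LayerFusion(128)(layers).shape)  # torch.Size([2, 16, 128])
```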

For example, in the sentence, “A random inspection by radiography revealed excessive defects in the welds of the overhead steel pipes near the lighting system owned by the client. For details, see the radiography sub-item in the report,” the lower layers of the ALBERT model (e.g., layers 1 and 3) can accurately identify key entities such as “radiography,” “lighting system,” “overhead steel pipes,” and “excessive defects.” The intermediate layers (e.g., layers 5 and 7) further extract relationships among these entities. Specifically, “find” indicates a causal relationship between the detection action and the defect, “located” reveals the spatial positioning, and “existed” denotes the current condition of the weld defect. The higher layers (e.g., layers 9 and 11) integrate contextual information to infer that the weld defect was detected by radiographic inspection at a specific site, thereby constructing a comprehensive semantic chain ranging from the detection method and spatial positioning to the defect description. This hierarchical feature extraction and fusion mechanism enhances the model’s robustness in feature capture and accuracy in semantic interpretation when dealing with complex professional texts. Subsequently, features extracted at different layers are fused, and a weighting mechanism is employed to assign weights based on their relative importance, ensuring that each layer contributes effectively to the final model output. The weighted and fused features are then subjected to linear dimensionality reduction to obtain the feature representation of the target category (tc), as illustrated in (Fig. 5).

Fig. 5. ALBERT key-layer fusion process.

BiGRU layer

The Gated Recurrent Unit (GRU) is an enhanced Recurrent Neural Network (RNN) architecture designed to better capture long-term dependencies while mitigating the vanishing and exploding gradient issues through a gating mechanism. In comparison to the traditional Long Short-Term Memory (LSTM) network, GRU features a more compact structure, leading to a reduction in the number of parameters in the model. The structure of the GRU model is depicted in (Fig. 6). The formulas for GRU are provided in Eq. (5) through (8):

Reset gate:

$$r_{t} = \sigma (W_{r} \cdot [h_{{t - 1}} ,x_{t} ])$$
(5)

Update gate:

$$z_{t} = \sigma (W_{z} \cdot [h_{{t - 1}} ,x_{t} ])$$
(6)

Candidate hidden state:

$$\tilde{h}_{t} = \tanh (W \cdot [r_{t} \odot h_{{t - 1}} ,x_{t} ])$$
(7)

Final hidden state:

$$h_{t} = (1 - z_{t} ) \odot h_{{t - 1}} + z_{t} \odot \tilde{h}_{t}$$
(8)
Fig. 6. Structure of the GRU model.

Where: \(r_{t}\) represents the output of the reset gate; \(\sigma\) refers to the Sigmoid activation function; \(W_{r}\) denotes the weight matrix of the reset gate; \(h_{{t - 1}}\) signifies the hidden state from the previous time step; \(x_{t}\) represents the input at the current time step; \(z_{t}\) is the output of the update gate; \(W_{z}\) represents the weight matrix of the update gate; \(\tilde{h}_{t}\) refers to the candidate hidden state; \(W\) is the weight matrix of the candidate hidden state; \(\odot\) denotes element-wise multiplication; \(h_{t}\) is the final hidden state at the current time step.
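The GRU update of Eqs. (5)-(8) can be written compactly as a single time step; the NumPy sketch below follows the equations as printed (bias terms omitted), with illustrative dimensions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU time step following Eqs. (5)-(8); biases omitted as in the equations.
    x_t: (d_in,), h_prev: (d_h,), each weight matrix: (d_h, d_h + d_in)."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)                                 # Eq. (5): reset gate
    z_t = sigmoid(W_z @ concat)                                 # Eq. (6): update gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # Eq. (7): candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                   # Eq. (8): final hidden state

# Toy dimensions: input size 4, hidden size 3.
rng = np.random.default_rng(1)
x_t, h_prev = rng.normal(size=4), np.zeros(3)
W_r, W_z, W = (rng.normal(size=(3, 7)) for _ in range(3))
print(gru_step(x_t, h_prev, W_r, W_z, W))
```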

Following the BiGRU layer, a feature fusion layer is introduced, where the word embedding vectors generated by ALBERT serve as reference information for the BiGRU layer, enhancing its contextual semantic understanding. Notably, the output of the BiGRU layer plays a central role in the fusion process, as it processes temporal information from the input sequence while considering the role and grammatical function of each word in both preceding and succeeding contexts via a bidirectional mechanism. Consequently, the BiGRU layer’s output yields more nuanced and dynamic features for the final text representation, particularly excelling in capturing long-range dependencies and the temporal structure of the text. The feature fusion layer merges the word embedding vectors from the ALBERT output with the output vectors from the BiGRU layer through weighted concatenation, creating a composite feature representation that encompasses both global semantic and temporal dependencies. This feature representation preserves the semantic understanding advantage of ALBERT while also enhancing the flexibility and accuracy of BiGRU in temporal task handling.
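One possible reading of the “weighted concatenation” described above is sketched below in PyTorch, where two learnable scalar weights scale the ALBERT embeddings and the BiGRU outputs before concatenation; the gating form and the dimensions are assumptions, as the paper does not give the exact formula.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse ALBERT embeddings with BiGRU outputs by weighted concatenation.
    The learnable scalar weights are an illustrative assumption."""
    def __init__(self):
        super().__init__()
        self.w_albert = nn.Parameter(torch.tensor(1.0))
        self.w_bigru = nn.Parameter(torch.tensor(1.0))

    def forward(self, albert_emb, bigru_out):
        # albert_emb: (B, T, H1), bigru_out: (B, T, H2); concatenate the weighted streams.
        return torch.cat([self.w_albert * albert_emb, self.w_bigru * bigru_out], dim=-1)

# Toy tensors with assumed hidden sizes for the two streams.
fused = FeatureFusion()(torch.randn(2, 16, 312), torch.randn(2, 16, 256))
print(fused.shape)  # torch.Size([2, 16, 568])
```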

CRF layer

The input to the CRF layer is derived from the contextual feature information provided by the feature fusion layer. In the entity extraction labelling task, for the target sentence X containing tokens x1, x2, x3, …, xi, the recognition and labelling process proceeds as follows: First, the layer calculates the feature-to-label prediction probabilities by learning the dependencies between the feature vectors and the corresponding label results, determining the label for each token xi. Next, a sequence y of predicted labels for sentence X is generated, containing y1, y2, y3, …, yi. Finally, the predicted label sequence y is constrained and adjusted using a labelling strategy based on sentence distribution, yielding the optimal label sequence for the target sentence X. In Fig. 7, the size of the circles is used to approximately represent the weight values of each feature output by the ALBERT-BiGRU model. Without CRF, the scoring model assigns labels based only on the highest score, which may result in the Tc label being incorrectly assigned as ‘B-PIP’ (i.e., the beginning of a pipeline name), contrary to the conventional understanding of a gas pipeline detection report. For this reason, we introduce CRF, whose label decision-making process fully considers the transition probabilities between labels and thus more reasonably reflects sequence dependencies. As shown in (Fig. 7), the CRF model finally selects ‘B-DET’ (the beginning of the detection method) as the label for Tc and corrects the label for T3 to ‘B-PIP’ (the beginning of a pipeline name). This error correction mechanism based on transition probabilities significantly reduces the incidence of incorrect labels and effectively corrects unreasonable labels produced by relying only on score maximisation. The scoring function of the CRF is given in Eq. (9).

$$S(X,Y) = \sum\limits_{i = 0}^{n} A_{y_{i} ,y_{i + 1}} + \sum\limits_{i = 1}^{n} P_{i,y_{i}}$$
(9)

Where A is the label transition matrix, \(A_{y_{i}, y_{i+1}}\) is the score of transitioning from label \(y_{i}\) to label \(y_{i+1}\), \(P_{i, y_{i}}\) denotes the score of assigning label \(y_{i}\) to the i-th word in the sentence, X is the input sequence, and Y is the output label sequence.
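The sequence score of Eq. (9) can be computed directly from an emission matrix P and a transition matrix A, as in the sketch below; the handling of the start transition is an assumption, and the normalization term needed for CRF training is omitted.

```python
import numpy as np

def crf_score(P, A, y, start_tag):
    """Score of a label sequence y under Eq. (9).
    P: (n, num_tags) emission scores; A: (num_tags, num_tags) transition matrix,
    A[i, j] = score of moving from tag i to tag j; start_tag handling is an assumption."""
    emission = sum(P[i, y[i]] for i in range(len(y)))
    transition = A[start_tag, y[0]] + sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emission + transition

# Toy example with 3 tokens and 4 tags.
rng = np.random.default_rng(2)
P, A = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
print(crf_score(P, A, y=[1, 2, 0], start_tag=3))
```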

Fig. 7. Label prediction mechanism of the CRF layer.

Relation extraction

Relation extraction (RE) is a fundamental task in Natural Language Processing (NLP) and a critical component of information extraction. Its primary goal is to identify the relationships between entities, typically noun phrases or pronouns, within a given text. Building upon Named Entity Recognition (NER), relation extraction establishes the foundation for intelligent systems to comprehend natural language and construct knowledge graphs, playing a vital role in enhancing machine semantic understanding. Similar to NER, which necessitates pre-labelling entity types, relation extraction also requires the predefined classification of relationships between entities. In this paper, we define a set of common relationship types, such as “belongs to,” “decision,” and others, specifically tailored for the operational and maintenance text of gas polyethylene pipelines, as presented in (Table 3). By predefining relationship types and integrating them with a deep learning model, this section enables the automatic extraction of relationships between entities in the O&M text of gas polyethylene pipelines, providing essential support for the construction of the subsequent knowledge graph.

Table 3 Examples of entity relationships.

This section employs the ALBERT-BiGRU-Attention model for relationship extraction in operation and maintenance data of gas polyethylene pipelines. The approach comprises three key components: the ALBERT layer, the BiGRU layer, and the Attention layer. Initially, the ALBERT model, as an efficient pre-trained language model, learns the semantic features of the text through its bidirectional Transformer architecture and generates rich semantic vector representations. ALBERT leverages cross-layer parameter sharing and the factorization of embedding matrices to handle contextual information, significantly reducing the number of model parameters and enhancing computational efficiency and stability. Subsequently, the BiGRU layer performs sequence modelling on the features extracted by ALBERT to capture the contextual information in the pipeline operation and maintenance text, along with the complex interactions between entities. Lastly, the Attention layer focuses on the most relevant parts of the text for relationship extraction by using a weighting mechanism, automatically adjusting the weights between different features, thus enhancing the model’s ability to capture significant relationships. The structure diagram of the relationship extraction model is depicted in (Fig. 8).

Fig. 8. Structure of relation extraction model based on ALBERT-BiGRU-Attention.

Compared to the entity extraction model structure, the CRF layer is replaced with the Attention layer, which can more flexibly assign weights to key features in the text, enhancing the model’s ability to capture complex relationships. This substitution also simplifies the model structure, improving both training efficiency and interpretability. The role of the Attention layer is to enable the model to focus on more relevant parts and suppress less important ones when processing complex input sequences. First, each representation vector yn of the input sequence is linearly transformed to produce a new representation vector yn’.

Second, the dot product of each representation vector yn’ with the query vector q is computed to obtain a vector with the same length as the input sequence, representing the attention weights at each position. The attention weights are softmax-normalized to obtain the final weighting coefficients.

Finally, the weighting coefficients are multiplied by the representation vector yi’ of the input sequence, and the results are summed to produce the final output vector. The formula for the Attention layer is expressed in Eq. (10).

$$output = \sum\limits_{i = 1}^{x} softmax\left( {W_{q} \cdot y_{i}^{\prime } } \right) \cdot y_{i}^{\prime }$$
(10)

Where \(W_{q}\) is the linear transformation matrix associated with the query vector q, \(y_{i}^{\prime}\) is the transformed representation vector at the i-th position of the input sequence, and x denotes the length of the input sequence.
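The attention pooling of Eq. (10) can be sketched as follows: each BiGRU output is linearly transformed, scored against a learned query, softmax-normalized, and the weighted vectors are summed into a single relation representation; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention layer of the relation extraction model (Eq. (10)).
    Dimensions are illustrative; the linear transform and query vector are learned."""
    def __init__(self, hidden_size):
        super().__init__()
        self.transform = nn.Linear(hidden_size, hidden_size)  # y_n -> y_n'
        self.query = nn.Parameter(torch.randn(hidden_size))   # query vector q

    def forward(self, y):
        # y: (B, T, H) BiGRU outputs for a sequence of length T.
        y_prime = self.transform(y)                            # (B, T, H)
        scores = y_prime @ self.query                          # (B, T) dot products with q
        weights = torch.softmax(scores, dim=-1)                # normalized attention weights
        return (weights.unsqueeze(-1) * y_prime).sum(dim=1)    # (B, H) pooled output

print(AttentionPooling(256)(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 256])
```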

Experiments

Experimental dataset

The dataset of gas polyethylene pipeline operation and maintenance constructed in this study is proprietary, and the data mainly comes from production and operation enterprises and supervision and inspection agencies. A total of 4,841 O&M data entries on gas polyethylene pipelines were collected through the actual operation data provided by these companies. This dataset encompasses various operation statuses, maintenance records, fault analysis, operational parameters, and other aspects of the pipeline, such as pipeline operation data, pressure, temperature, flow rate, and other key indicators across different time periods. The dataset includes not only structured data but also a significant amount of semi-structured and unstructured data, which offers rich insights into real-world scenarios and lays a solid foundation for subsequent analysis and modelling.

To ensure the dataset’s representativeness and generalization capability, it was partitioned in a 3:1:1 ratio, resulting in 2905 training samples, 968 validation samples, and 968 test samples. During the data partitioning process, the distribution of data in each subset was ensured to align with the overall dataset, thereby mitigating any potential biases introduced by the partitioning. Additionally, to enhance model training performance and assessment accuracy, the dataset partitioning process specifically accounted for several key features of the gas polyethylene pipeline operation and maintenance data. First, the dataset was categorized into four types, including basic data, production data, installation data and operational data. It was further classified based on the operational status of the pipeline equipment, such as regular operation, maintenance, fault, and other operational states. This stratified division ensured that each subset included representative samples of all equipment states, thereby reducing bias in the training set. Second, each data type was annotated based on distinct operational and maintenance scenarios to enable the model to learn the typical operational patterns associated with different pipeline conditions. Particular attention was given to the temporal characteristics of the data to ensure that each subset maintained a sufficient temporal span and continuity, thus preserving the time series correlations inherent in the dataset. Through these targeted divisions, each subset was ensured to represent the comprehensive characteristics of the entire dataset effectively.

During the data labelling process, various entities in the data were classified and labelled based on domain experts’ knowledge and experience. The labelled content primarily includes pipeline status, fault types, maintenance records, and abnormal alarms. These annotations not only enhance data availability but also provide high-quality labelled data for model training. Based on this, a multidimensional feature dataset suitable for gas polyethylene pipeline operation and maintenance data analysis was constructed, providing sufficient training data for subsequent model applications.

Experimental parameter settings

To validate the effectiveness of the proposed model in this study, commonly used evaluation metrics are employed, including precision, recall, and F1 score. The formulas for these metrics are presented below:

$$P = \frac{{T_{p} }}{{T_{p} + F_{p} }}$$
(11)
$$R = \frac{{T_{p} }}{{T_{p} + F_{N} }}$$
(12)
$$F1 = \frac{2 \times P \times R}{P + R}$$
(13)
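For completeness, Eqs. (11)-(13) can be computed from raw counts of true positives, false positives, and false negatives, as in the short sketch below (the counts are hypothetical).

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, per Eqs. (11)-(13)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Hypothetical counts for one entity class.
print(prf1(tp=830, fp=40, fn=30))
```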

The hyperparameter configurations for NER and RE in this experiment are presented in (Table 4).

Table 4 Hyperparameters configuration of NER and RE.

Model performance comparison experiment

Comparison of different entity extraction models

A comparative experiment is conducted to verify the effectiveness of the ALBERT-BiGRU-CRF model presented in this study. The recognition results of various named entity models are shown in (Table 5). The comparison of F1 scores for different entity extraction models is illustrated in (Fig. 9).

Table 5 Recognition results of different entity extraction models.
Fig. 9. Comparison of F1 scores for different entity extraction models.

As shown in Table 5 and Fig. 9, the ALBERT-BiGRU-CRF model demonstrates strong performance in the entity classification task for gas polyethylene pipeline operation and maintenance data, achieving a precision of 95.35%, a recall of 96.51%, and an F1 score of 95.93%. Compared to the traditional BiLSTM and BiLSTM-CRF models, the ALBERT-BiGRU-CRF model significantly enhances classification precision and recall; it also shows a clear improvement over the Word2vec-BiLSTM-CRF and ALBERT-BiLSTM-Self_Att-CRF models, particularly in capturing dependencies in long texts. To further illustrate the model’s superiority, Fig. 10 compares each model’s performance across different entity classes. Figure 10a–c shows that the ALBERT-BiGRU-CRF model achieves the highest precision, recall, and F1 score in most entity categories. Additionally, Fig. 10d highlights that the ALBERT-BiGRU-CRF model outperforms the other four models in terms of average precision, recall, and F1 score, further underscoring its exceptional entity recognition capabilities.

Fig. 10. Comparison of performance of different models at the entity level: Precision (a), Recall (b), F1 (c), and Precision, Recall, and F1 of different models (d).

Additionally, from the loss curves of the entity extraction model in Fig. 11 and the correlation between predicted labels and true labels in (Fig. 12), it can be observed that the designed ALBERT-BiGRU-CRF model demonstrates strong generalization ability on the training data.

Fig. 11. Loss curve of entity extraction models.

Fig. 12. The correlation between predicted and true labels.

Comparison of different relation extraction models

To evaluate the effectiveness of the relational extraction model, the ALBERT-BiGRU-Attention model was compared with several classical deep network models. The specific experimental results are presented in Table 6 and (Fig. 13).

The comparative analysis reveals that the ALBERT-BiGRU-Attention model achieves outstanding performance on the gas polyethylene pipeline operation and maintenance dataset, attaining the highest accuracy and F1 score. Compared to the BERT-BiLSTM-Attention model, the proposed model improves accuracy and F1 score by 1.93% and 1.68%, respectively. We also counted the extracted relationship categories, of which there are seven in total; the number of results for each category is shown in (Table 7).

Table 6 Recognition results of different relation extraction models for entity pairs.
Fig. 13. Comparison of F1 scores for different relation extraction models.

Table 7 Relationship category extraction result.

Construction of the gas polyethylene pipeline O&M data knowledge graph

Based on the above knowledge extraction method, we extracted 4362 entities from the gas polyethylene pipeline operation and maintenance data and identified 8 different relationship types. By linking the extracted entity sequences and relationships, semantic triplets in the form of <entity 1, relationship, entity 2> are generated to capture key knowledge points of pipeline operation and maintenance; the main entities composing these triplets are shown in (Fig. 14). These entity-relationship triplets are stored in CSV format and then imported into the Neo4j graph database in batches using Cypher commands for efficient storage, querying, and visualisation. A knowledge graph comprising 4362 entities was ultimately constructed. Due to the large amount of data, it cannot be displayed in its entirety; the main components of the knowledge graph are shown in (Fig. 15). An illustrative import sketch is given below.
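A hedged example of the batch import step follows, using the official Neo4j Python driver to execute a parameterized Cypher MERGE for each CSV row; the connection details, the generic :Entity label, and the single :REL relationship type carrying a 'type' property are illustrative modelling choices rather than the exact commands used in this study.

```python
import csv
from neo4j import GraphDatabase  # official Neo4j Python driver

# Connection details are placeholders; the :Entity label and :REL relationship
# with a 'type' property are illustrative modelling choices.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = (
    "MERGE (e1:Entity {name: $e1}) "
    "MERGE (e2:Entity {name: $e2}) "
    "MERGE (e1)-[:REL {type: $rel}]->(e2)"
)

with driver.session() as session, open("triples.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # expects header: entity1, relationship, entity2
        session.run(cypher, e1=row["entity1"], rel=row["relationship"], e2=row["entity2"])

driver.close()
```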

In our study, the Neo4j graph database is employed to systematically represent the comprehensive knowledge extracted from gas polyethylene pipeline operation and maintenance (O&M) data as a semantic network. In this network, nodes correspond to various critical entities—such as pipeline equipment, materials, leakage incidents, and maintenance measures—while edges denote the interactions between these entities. For example, a “cause” relationship might link a leakage incident to a specific valve, suggesting that a malfunction or defect in the valve led to the leakage. Similarly, a “decision” relationship could be established between observed performance parameters and design pressure, indicating how operational conditions inform engineering decisions. Additional relationships such as “belongs to” can connect individual equipment nodes to the more extensive pipeline system, and “contain” may reflect the impact of maintenance measures on system integrity. This detailed mapping allows analysts to visualize and interrogate the intricate interplay among diverse entities, thereby facilitating the identification of potential failure modes, conducting robust risk assessments, and formulating proactive maintenance strategies. Figure 16 illustrates the core entities derived from the O&M data, showcasing the practical utility of the semantic network approach in enhancing data analysis and decision-making in pipeline management.

Fig. 14. Main entities comprising gas polyethylene pipeline O&M data.

Fig. 15. Part of the gas polyethylene pipeline O&M knowledge graph.

Fig. 16. Core entities of gas polyethylene pipeline O&M data.

Conclusion

The knowledge graph construction method based on the ALBERT-BiGRU-CRF framework proposed in this study addresses the key challenges of insufficient knowledge relevance and inefficient decision-making in gas polyethylene pipeline operation and maintenance. Focusing on the heterogeneous data characteristics of this field, the study adopts a ‘data governance-knowledge extraction-graph construction’ technical approach. First, the unstructured O&M documents and semi-structured inspection reports are structured and pre-processed, and a standardized training corpus containing entity-relationship annotations is constructed. Then, a hierarchical feature fusion model is developed: a lightweight ALBERT pre-trained model is employed to generate context-aware word vector representations, a bidirectional gated recurrent unit (BiGRU) captures the global temporal features of text sequences, and a conditional random field (CRF) establishes a label transition constraint mechanism, thereby ensuring the accurate extraction of pipeline operation and maintenance knowledge. Finally, the domain knowledge graph is constructed in the Neo4j graph database, revealing the semantic relationships between entities through topology visualization and providing knowledge support for pipeline risk assessment and maintenance decision-making. The experimental results demonstrate that the model significantly improves performance in the entity recognition task for gas polyethylene pipelines, with precision, recall, and F1 scores reaching 95.35%, 96.51%, and 95.93%, respectively. Compared to other models, the proposed model demonstrates higher efficiency while enhancing recognition performance, showcasing strong potential for practical application. The technical framework proposed in this study provides a scalable methodological reference for constructing industrial equipment knowledge graphs; its efficiency and generalization capability demonstrate significant application value in the intelligent operation and maintenance of gas infrastructure.