Background & Summary

Natural gas, a vital component of the global energy mix, is essential for both industrial development and domestic use1. The production of natural gas is a complex process that involves the extraction of hydrocarbons from subterranean reservoirs2. In the Inner Mongolia region, the Shanxi Formation has been identified as a significant reservoir for natural gas, making it a focal point for our study3. However, the presence of formation water, commonly referred to as “fluid accumulation” or “Jiye” in Chinese, can impede gas flow, reduce well productivity, and increase operational costs4. The accurate prediction of natural gas production and the estimation of associated fluid accumulation are crucial for the efficient management of gas fields.

The ability to forecast natural gas yields and the concurrent accumulation of formation water is not only technically challenging but also economically significant5. Accurate predictions enable oil and gas companies to plan production more effectively, optimize resource allocation, and manage the environmental impacts of their operations. However, the complexity of subsurface conditions and the variability in reservoir characteristics make this a difficult task6. It is against this backdrop that the importance of our dataset becomes apparent.

This study presents a compiled dataset from multiple wells in the Inner Mongolia region, with a primary focus on the Shanxi Formation. The dataset encompasses a wide range of parameters, including wellhead pressure7, casing pressure8, daily methanol injection volumes, and cumulative production of gas, water, and oil. Spanning from March 17, 2010, to April 25, 2024, this dataset provides a detailed account of the operational history of these wells, capturing the dynamic interplay between natural gas production and fluid accumulation. The primary objective of this dataset is to enhance the accuracy of predictions regarding natural gas production and associated fluid accumulation. By providing a comprehensive and standardized dataset, we aim to facilitate advanced analytical techniques and modeling approaches that can improve the understanding of production dynamics in gas fields. This, in turn, can lead to more informed decision-making in the planning and management of natural gas production, with positive ramifications for the transportation, marketing, and other industries dependent on a stable supply of natural gas.

To the best of our knowledge, the dataset presented in this study is a valuable resource for researchers, engineers, and policymakers. It contributes to the scientific understanding of gas field operations and also serves as a foundation for developing more effective strategies for the sustainable extraction of natural gas resources.

Methods

Data Collection

Data was collected from a cohort of wells in the Inner Mongolia region. These wells are outfitted with state-of-the-art sensors9 capable of precisely measuring wellhead pressure, casing pressure, and the flow rate of gas. The sensors undergo regular calibration to guarantee the accuracy of the measurements. The data collection process was designed to minimize errors and safeguard the integrity of the data10. The data was collected on a daily basis to capture short-term variations in production parameters.

To ensure data quality, a comprehensive suite of quality control measures was implemented11. Outlier detection algorithms were used to identify and rectify abnormal data points. Data integrity checks were performed to confirm that all requisite parameters were recorded and to detect missing values. Where missing values were detected, interpolation methods12 were applied, chosen according to the characteristics of the data and the production process.

Data Pre-processing

The gathered data went through the following preprocessing procedures. First, missing values within the dataset were detected and addressed. Where the missing values were scattered and the time-series character of the data needed to be preserved, linear interpolation was used. Given that \({y}_{{t}_{i-1}}\) and \({y}_{{t}_{i+1}}\) are the known values neighboring the missing value \({y}_{{t}_{i}}\), the linearly interpolated value \({\widehat{y}}_{{t}_{i}}\) is computed by Equation (1):

$${\widehat{y}}_{{t}_{i}}={y}_{{t}_{i-1}}+\frac{({y}_{{t}_{i+1}}-{y}_{{t}_{i-1}})}{({t}_{i+1}-{t}_{i-1})}({t}_{i}-{t}_{i-1}),$$
(1)
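As an illustration of Equation (1), the following minimal Python sketch fills gaps between known samples. The function name `interpolate_missing` and the use of `None` to mark a missing value are assumptions made for this example, not part of the dataset's tooling. The sketch also generalizes slightly beyond Equation (1) by interpolating across runs of consecutive gaps, and it leaves leading or trailing gaps untouched because they have only one known neighbor.

```python
from typing import List, Optional


def interpolate_missing(t: List[float], y: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate None entries of y between their nearest known neighbors."""
    filled = list(y)
    known = [i for i, value in enumerate(y) if value is not None]
    # Walk over each pair of consecutive known samples and fill what lies between.
    for left, right in zip(known, known[1:]):
        slope = (y[right] - y[left]) / (t[right] - t[left])
        for i in range(left + 1, right):
            filled[i] = y[left] + slope * (t[i] - t[left])
    return filled


# Hypothetical daily wellhead pressure readings with one missing day.
days = [0.0, 1.0, 2.0, 3.0]
pressure = [10.0, None, 14.0, 15.0]
print(interpolate_missing(days, pressure))
```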

Subsequently, the data was normalized so that all features have a comparable scale. This was accomplished with min-max scaling, which maps the data to the fixed interval [0, 1]. For a feature \(x\), the scaled value \({x}_{scaled}\) is determined by Equation (2):

$${x}_{scaled}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$
(2)

Here, \({x}_{min}\) and \({x}_{max}\) represent the minimum and maximum values of the feature \(x\) in the dataset, respectively.
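Equation (2) can be sketched in a few lines of Python. The function name `min_max_scale` is hypothetical, and the handling of a constant feature (where the denominator is zero and Equation (2) is undefined) is an assumption of this example rather than something the source specifies.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale one feature to [0, 1] following Equation (2)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Constant feature: Equation (2) is undefined, so return zeros (assumed convention).
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical casing pressure values.
print(min_max_scale([2.0, 4.0, 6.0]))
```

In practice the minimum and maximum would usually be taken from the training subset only and reused for the testing subset, so that no information from the test data leaks into the scaling.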

Data Annotation

In this dataset, each data record was labeled with the corresponding well ID and date. Supplementary annotations were provided to denote any special events or operational changes that might have influenced the production parameters, such as equipment maintenance or alterations in injection strategies.

To ensure the consistency and accuracy of the annotations, a team of seasoned engineers and data analysts reviewed and verified the annotated data. Any discrepancies or ambiguities were resolved through in-depth discussions and consultations with field experts.

The dataset is organized in a structured manner and is publicly accessible for research purposes. It is partitioned into training and testing subsets using a stratified sampling approach to ensure that each subset contains a representative sample of the data13.
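A stratified partition of this kind can be sketched as follows. This is only an illustration: the choice of stratification variable (here, a per-record label such as the fluid accumulation severity class), the function name `stratified_split`, and the random seed are all assumptions, while the default 0.8 fraction mirrors the 80%/20% ratio reported for the experiments.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each label group separately."""
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_label, key=str):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical severity labels: ten records of class "A" and five of class "B".
labels = ["A"] * 10 + ["B"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because each label group is split on its own, rare classes keep roughly the same share in both subsets, which matters when severe fluid accumulation events are infrequent.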

Dataset statistics

Table 1 provides an overview of the key parameters recorded during the natural gas production process. It covers production time, well characteristics, output volumes, pressure and temperature conditions, and injection volumes. Presenting these fields systematically makes it possible to identify patterns, trends, and potential correlations that inform decisions on optimizing the production process and enhancing overall productivity.

Table 1 Detailed overview of the data fields related to the natural gas production.

The dataset also contains records of fluid accumulation. Because fluid accumulation does not occur every day and its occurrence rate is relatively low, these records are stored separately. Each record gives the time at which fluid accumulation occurred and its degree, which is classified as absence, minimal, mild, moderate, or severe. The severity classification is based on the scores given by on-site engineers. Table 2 presents the key statistics of the dataset:

Table 2 Dataset Statistics. F, A, B, C, D, E denote fluid accumulation, absence, minimal, mild, moderate, and severe, respectively.

LSTM Architecture

The Bidirectional Long Short-Term Memory (Bi-LSTM)14 network utilized in this study consists of an input layer, multiple hidden layers of LSTM cells, and an output layer (as shown in Fig. 1). The input layer takes in the time-series data related to wellhead pressure, casing pressure, and daily methanol injection. Each LSTM cell in the hidden layers contains a memory cell and three gates: the input gate, the forget gate15, and the output gate. The input gate controls the flow of new information into the memory cell, the forget gate determines what information to discard from the cell’s previous state, and the output gate decides what information to output for the current time step. This architecture allows the model to capture both the forward and backward temporal dependencies in the data, making it well-suited for handling the sequential nature of the natural gas production and fluid accumulation data.

Fig. 1 The LSTM architecture introduced in this study.

For the hyperparameters of the introduced LSTM model, a batch size of 32 was chosen. The Adam optimizer was adopted, with a learning rate set to 1e-9. The model depth was specified as 8, the dropout rate was set at 0.5, and the number of epochs was fixed at 200.
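The gating mechanism described above can be made concrete with a deliberately tiny sketch: a single LSTM cell with scalar input and scalar hidden state, run once forward and once backward over a sequence. This is not the model used in the study (which has multiple layers and trained weights). The weight values, the dictionary layout, and the function names `lstm_step`, `run_direction`, and `bidirectional` are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Each gate maps to (input weight, recurrent weight, bias); values here are illustrative.
Weights = Dict[str, Tuple[float, float, float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float, w: Weights) -> Tuple[float, float]:
    """One time step of a scalar LSTM cell; returns the new (hidden, cell) state."""
    def pre_activation(gate: str) -> float:
        w_x, w_h, bias = w[gate]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))         # how much new information enters the cell
    f = sigmoid(pre_activation("forget"))        # how much of the previous cell state is kept
    o = sigmoid(pre_activation("output"))        # how much of the cell state is exposed
    g = math.tanh(pre_activation("candidate"))   # candidate content for the memory cell
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_direction(sequence: Sequence[float], w: Weights) -> List[float]:
    """Run the cell over the sequence from first to last element."""
    h, c = 0.0, 0.0
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


def bidirectional(sequence: Sequence[float], w_fwd: Weights, w_bwd: Weights) -> List[Tuple[float, float]]:
    """Pair the forward hidden state with the backward hidden state at every time step."""
    forward = run_direction(sequence, w_fwd)
    backward = run_direction(list(reversed(sequence)), w_bwd)[::-1]
    return list(zip(forward, backward))


demo_weights: Weights = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.6, 0.2, 0.0),
}
# Hypothetical min-max-scaled wellhead pressure values for four consecutive days.
print(bidirectional([0.2, 0.4, 0.1, 0.9], demo_weights, demo_weights))
```

Each output pair combines a state that has seen only earlier days with one that has seen only later days, which is what lets a bidirectional network use context on both sides of a time step.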

Model Performance Results

In this study, the evaluation metrics utilized are the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE). The RMSE formula is presented as Equation (3):

$$RMSE=\sqrt{\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}},$$
(3)

where \(n\) stands for the overall number of samples or data points, \({y}_{i}\) indicates the actual observed value of the i-th sample, and \({\widehat{y}}_{i}\) represents the corresponding predicted value. The formula for MAPE is given by Equation (4):

$$MAPE=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left|\frac{{y}_{i}-{\widehat{y}}_{i}}{{y}_{i}}\right|\times 100,$$
(4)

where n refers to the number of samples.
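Equations (3) and (4) translate directly into code. The sketch below uses hypothetical function names `rmse` and `mape` and made-up sample values. Note that Equation (4) divides by the observed value, so it is undefined for any record where the observed value is zero; how such records were treated in the study is not stated, and this sketch simply assumes none are present.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Squared Error, Equation (3)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent, Equation (4); assumes no zero in y_true."""
    n = len(y_true)
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n * 100


# Hypothetical observed and predicted daily gas production values.
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(rmse(observed, predicted), mape(observed, predicted))
```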

In the experiments, 80% of the records in the dataset were used as the training set, and the remaining 20% were used as the testing set. The performance of the introduced model on this dataset is presented in Table 3:

Table 3 Bi-LSTM Model Performance Results.

These metrics provide a comprehensive assessment of the Bi-LSTM model’s effectiveness in handling the prediction and detection tasks based on the collected dataset. Moreover, this study has also used random forest (RF), decision tree (DT), and support vector machine (SVM) as the baseline models.

In addition, a permutation feature importance analysis was conducted on the proposed model. For each feature, such as wellhead pressure or casing pressure, its values were shuffled in the test dataset and the resulting changes in RMSE and MAPE were observed. Table 4 shows that shuffling wellhead pressure and casing pressure led to a significant increase in both RMSE and MAPE, indicating their high importance for the model's performance. Daily methanol injection volume has a relatively minor effect.

Table 4 Feature Impact Analysis.
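The permutation procedure described above can be sketched as follows. The model is represented by an arbitrary `predict` callable, and the field names `wellhead_pressure` and `methanol`, the toy model, and the seed are hypothetical stand-ins; only the shuffle-and-rescore logic reflects the procedure in the text. For brevity the sketch reports the change in RMSE only.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(
    predict: Callable[[Row], float],
    rows: List[Row],
    targets: Sequence[float],
    feature: str,
    seed: int = 0,
) -> float:
    """Increase in test RMSE after shuffling one feature column across the test rows."""
    baseline = rmse(targets, [predict(row) for row in rows])
    shuffled = [row[feature] for row in rows]
    random.Random(seed).shuffle(shuffled)
    # Copy each row so the original test data is left unchanged.
    permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
    return rmse(targets, [predict(row) for row in permuted]) - baseline


# Toy stand-in model that depends only on wellhead pressure (illustrative, not the Bi-LSTM).
def toy_predict(row: Row) -> float:
    return 2.0 * row["wellhead_pressure"]


test_rows = [{"wellhead_pressure": float(i), "methanol": float(10 - i)} for i in range(8)]
test_targets = [2.0 * float(i) for i in range(8)]
for name in ("wellhead_pressure", "methanol"):
    print(name, permutation_importance(toy_predict, test_rows, test_targets, name))
```

A feature the model ignores yields no change in error when shuffled, while a feature the model relies on yields a clear increase, which is the pattern Table 4 reports for the pressure features versus methanol injection.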

Data Records

The proposed dataset is publicly available on the ScienceDB16 platform to facilitate researcher access. It consists of 31 .xlsx files: thirty store the natural gas production and liquid level samples, one file per well, and the remaining file contains the wells' statistical information.

The sample file of each well consists of two spreadsheets: one contains the daily natural gas production data and the other contains the liquid level data. The fields in the first spreadsheet are provided in Table 1, and the fields in the second spreadsheet are provided in Table 2.

Technical Validation

To validate the dataset’s quality and reliability, a series of tests were conducted. Data consistency checks were performed to ensure that the data was consistent across different wells and time periods, as shown in Fig. 2. Completeness checks were carried out to verify that all required data was present.

Fig. 2 Comparison between the proposed model and the real natural gas production data records across different wells and time periods.

A separate validation process was dedicated to the annotations. The annotated data was randomly sampled and cross-checked by an independent team of experts who were not involved in the initial annotation process. They verified the accuracy of the well ID, date, time, and event annotations. In cases where errors or inconsistencies were found, the original annotation team was notified and the necessary corrections were made. Additionally, statistical analyses were performed on the annotations to check for any biases or patterns that could potentially affect the analysis. For example, the distribution of special events across different wells and time periods was examined to ensure that there was no over- or under-representation. This comprehensive annotation validation process further enhanced the reliability and usability of the dataset.

Usage Notes

Potential applications

The dataset presented in this study holds significant potential for various applications within the natural gas production domain. It can be utilized for the development and training of advanced machine learning and artificial intelligence models focused on natural gas production prediction. By leveraging the detailed records of wellhead pressure, casing pressure, daily methanol injection, gas production volume, and fluid accumulation information, researchers and industry practitioners can enhance the accuracy of their predictive models. This, in turn, enables more efficient production planning, allowing for optimized resource allocation and timely decision-making regarding well operations.

Furthermore, the dataset can also be used in the field of reservoir engineering for the analysis of reservoir behavior and the assessment of the impact of fluid accumulation on gas production. Engineers can use the data to study the correlations between different parameters and gain a deeper understanding of the underlying physical processes. This knowledge can then be applied to design more effective production strategies and to implement appropriate measures for mitigating the negative effects of fluid accumulation.

Limitations

Despite its potential applications, the dataset has certain limitations. One of the primary limitations is the geographical scope of the data collection. The dataset is sourced from multiple wells in the Inner Mongolia region, specifically targeting the Shanxi Formation. This limited geographical area may restrict the generalizability of the models trained using this dataset to other regions with different geological characteristics. The reservoir properties, fluid compositions, and production behaviors can vary significantly from one region to another, and thus, the models developed based on this dataset may not perform equally well in other locations.

Another limitation is the potential presence of measurement errors and uncertainties in the data. Although efforts have been made to ensure data quality through calibration of sensors and implementation of quality control measures, there is still a possibility of errors in the recorded values. These errors could affect the accuracy of the models trained with the dataset and lead to less reliable predictions and analyses.