A dataset of Natural Gas and Liquid Level for Oil Field Production Prediction in China

Wang, Yanlei; Lian, Jian; Li, Chengjiang

doi:10.1038/s41597-025-05309-w

Data Descriptor
Open access
Published: 23 June 2025

Scientific Data volume 12, Article number: 1071 (2025)

Abstract

Natural gas, a critical resource for national economies and public welfare, plays a significant role in the energy sector. The efficient production of natural gas is often hindered by the presence of formation water, which can adversely affect well productivity and operational efficiency. Accurate prediction of natural gas production and estimation of associated fluid accumulation are therefore paramount for optimizing extraction processes. This study introduces a dataset compiled from multiple wells in the Inner Mongolia region, primarily targeting the Shanxi Formation, with the aim of enhancing the predictive accuracy of natural gas production and associated fluid accumulation. The dataset, spanning from March 17, 2010, to April 15, 2024, includes detailed records of wellhead pressure, casing pressure, daily methanol injection, and cumulative production volumes of gas, water, and oil. By analyzing these parameters, we can identify trends and anomalies over time, which are essential for refining production strategies and mitigating the impact of fluid accumulation on gas wells.

Background & Summary

Natural gas, a vital component of the global energy mix, is essential for both industrial development and domestic use¹. The production of natural gas is a complex process that involves the extraction of hydrocarbons from subterranean reservoirs². In the Inner Mongolia region, the Shanxi Formation has been identified as a significant reservoir for natural gas, making it a focal point for our study³. However, the presence of formation water, commonly referred to as “fluid accumulation” or “Jiye” in Chinese, can impede gas flow, reduce well productivity, and increase operational costs⁴. The accurate prediction of natural gas production and the estimation of associated fluid accumulation are crucial for the efficient management of gas fields.

The ability to forecast natural gas yields and the concurrent accumulation of formation water is not only technically challenging but also economically significant⁵. Accurate predictions enable oil and gas companies to plan production more effectively, optimize resource allocation, and manage the environmental impacts of their operations. However, the complexity of subsurface conditions and the variability in reservoir characteristics make this a difficult task⁶. It is against this backdrop that the importance of our dataset becomes apparent.

This study presents a compiled dataset from multiple wells in the Inner Mongolia region, with a primary focus on the Shanxi Formation. The dataset encompasses a wide range of parameters, including wellhead pressure⁷, casing pressure⁸, daily methanol injection volumes, and cumulative production of gas, water, and oil. Spanning from March 17, 2010, to April 25, 2024, this dataset provides a detailed account of the operational history of these wells, capturing the dynamic interplay between natural gas production and fluid accumulation. The primary objective of this dataset is to enhance the accuracy of predictions regarding natural gas production and associated fluid accumulation. By providing a comprehensive and standardized dataset, we aim to facilitate advanced analytical techniques and modeling approaches that can improve the understanding of production dynamics in gas fields. This, in turn, can lead to more informed decision-making in the planning and management of natural gas production, with positive ramifications for the transportation, marketing, and other industries dependent on a stable supply of natural gas.

To the best of our knowledge, the dataset presented in this study is a valuable resource for researchers, engineers, and policymakers. It contributes to the scientific understanding of gas field operations and serves as a foundation for developing more effective strategies for the sustainable extraction of natural gas resources.

Methods

Data Collection

Data was collected from a cohort of wells in the Inner Mongolia region. These wells are outfitted with state-of-the-art sensors⁹ capable of precisely measuring wellhead pressure, casing pressure, and the flow rate of gas. The sensors undergo regular calibration to guarantee the accuracy of the measurements. The data collection process was designed to minimize errors and safeguard the integrity of the data¹⁰. The data was collected on a daily basis to capture short-term variations in production parameters.

To ensure data quality, a comprehensive suite of quality control measures was implemented¹¹. Outlier detection algorithms were used to identify and rectify abnormal data points. Data integrity checks were performed to confirm that all requisite parameters were recorded and to detect missing values. Where missing values were detected, interpolation methods¹² were employed, chosen according to the characteristics of the data and the production process.
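The text does not name the outlier detection algorithm that was used. As a minimal sketch of what such a check can look like, the following flags values outside the 1.5 × IQR fences. The IQR rule, the function name `flag_outliers`, and the sample readings are all assumptions chosen for illustration, not the authors' method.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is a stand-in here; the paper does not state
    which outlier detection algorithm was applied.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical daily pressure readings with one obvious spike.
readings = [10, 11, 12, 11, 10, 12, 11, 50]
print(flag_outliers(readings))  # [7]
```

A flagged index only marks a candidate; whether a point is corrected, interpolated, or kept is a separate decision that depends on the production context.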

Data Pre-processing

The gathered data went through the following preprocessing procedures. First, the missing values within the dataset were detected and addressed. When the missing values were scattered and the time-series character of the data had to be preserved, linear interpolation was applied. Given that ${y}_{{t}_{i-1}}$ and ${y}_{{t}_{i+1}}$ are the known values neighboring the missing value ${y}_{{t}_{i}}$, the linearly interpolated value ${\widehat{y}}_{{t}_{i}}$ is computed as Equation (1):

$${\widehat{y}}_{{t}_{i}}={y}_{{t}_{i-1}}+\frac{({y}_{{t}_{i+1}}-{y}_{{t}_{i-1}})}{({t}_{i+1}-{t}_{i-1})}({t}_{i}-{t}_{i-1}),$$

(1)
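Equation (1) can be illustrated with a short sketch. It assumes time stamps are numeric (for example, day indices) and that missing values are stored as `None`; the function name `interpolate_missing` is hypothetical. As a small generalization, it uses the nearest known value on each side, which reduces to Equation (1) when the immediate neighbors are known.

```python
def interpolate_missing(times, values):
    """Fill None entries by linear interpolation between known neighbours."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Nearest known value on each side of the gap.
        left = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        right = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
        if left is None or right is None:
            continue  # gap at the edge of the series: leave it unfilled
        slope = (values[right] - values[left]) / (times[right] - times[left])
        filled[i] = values[left] + slope * (times[i] - times[left])
    return filled

days = [0, 1, 2, 3]
pressure = [10.0, None, 14.0, 15.0]
print(interpolate_missing(days, pressure))  # [10.0, 12.0, 14.0, 15.0]
```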

Subsequently, the data was normalized so as to make all features have a comparable scale. This was accomplished by employing standard normalization methods like min-max scaling. Min-max scaling converts the data to a fixed interval [0, 1]. For a feature x, the scaled value x_scaled is determined by Equation (2):

$${x}_{scaled}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$

(2)

Here, x_min and x_max represent the minimum and maximum values of the feature x in the dataset respectively.
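A minimal sketch of Equation (2) follows. The handling of a constant feature (where x_max equals x_min and the formula would divide by zero) is an assumption; the text does not say how that case was treated.

```python
def min_max_scale(xs):
    """Scale a list of numbers to [0, 1] following Equation (2)."""
    x_min, x_max = min(xs), max(xs)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant feature is mapped to 0.0 for every record.
        return [0.0 for _ in xs]
    return [(x - x_min) / span for x in xs]

print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

When a model is evaluated on held-out data, x_min and x_max are usually taken from the training subset only, so that test values do not influence the scaling.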

Data Annotation

In this dataset, each data record was labeled with the corresponding well ID and date. Supplementary annotations were provided to denote any special events or operational changes that might have influenced the production parameters, such as equipment maintenance or alterations in injection strategies.

To ensure the consistency and accuracy of the annotations, a team of seasoned engineers and data analysts reviewed and verified the annotated data. Any discrepancies or ambiguities were resolved through in-depth discussions and consultations with field experts.

The dataset is organized in a structured manner and is publicly accessible for research purposes. It is partitioned into training and testing subsets using a stratified sampling approach to ensure that each subset contains a representative sample of the data¹³.
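The stratification variable is not stated in the text. The sketch below assumes the fluid accumulation severity label is used as the stratum and applies the 80/20 proportion reported for the experiments; the record layout and the field name `severity` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(records, key, train_fraction=0.8, seed=0):
    """Split records so each stratum contributes about train_fraction to training."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    train, test = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)  # random order within the stratum
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical records: 10 labelled "A" and 5 labelled "E".
records = [{"id": i, "severity": "A"} for i in range(10)]
records += [{"id": i, "severity": "E"} for i in range(10, 15)]
train, test = stratified_split(records, key=lambda r: r["severity"])
print(len(train), len(test))  # 12 3
```

Because the records form time series, a random split can place neighboring days in both subsets; a chronological split per well is a common alternative when that matters.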

Dataset statistics

In general, Table 1 provides a comprehensive and detailed overview of the key parameters and data related to the natural gas production process. It encompasses various crucial aspects such as production time, well characteristics, output volumes, pressure and temperature conditions, and injection volumes. This table serves as a vital tool for in-depth analysis and understanding of the complex dynamics and performance of the gas production system. By presenting these data in a systematic manner, it enables us to identify patterns, trends, and potential correlations that are essential for making informed decisions and formulating effective strategies to optimize the production process and enhance overall productivity.

Table 1 Detailed overview of the data fields related to the natural gas production.

The dataset also contains data on fluid accumulation. Since fluid accumulation does not occur every day and its occurrence rate is relatively low, it is stored separately in the dataset. The stored contents are the time at which fluid accumulation occurs and its degree, which is classified into five levels: absence, minimal, mild, moderate, and severe. In this study, the severity of fluid accumulation in the natural gas production process is determined according to the scores given by on-site engineers. Table 2 presents the key statistics of the dataset:

Table 2 Dataset Statistics. F, A, B, C, D, E denote fluid accumulation, absence, minimal, mild, moderate, and severe, respectively.
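For modeling, the five severity codes from the Table 2 caption can be mapped to an ordered type. The sketch below assumes the codes appear as the letters A to E and assigns them the integers 0 to 4; the integer encoding and the names `Severity`, `LABELS`, and `parse_severity` are illustrative choices, not part of the dataset.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Fluid accumulation severity, ordered from least to most severe."""
    A = 0  # absence
    B = 1  # minimal
    C = 2  # mild
    D = 3  # moderate
    E = 4  # severe

LABELS = {
    Severity.A: "absence",
    Severity.B: "minimal",
    Severity.C: "mild",
    Severity.D: "moderate",
    Severity.E: "severe",
}

def parse_severity(code):
    """Convert a letter code such as 'c' or ' C ' into a Severity member."""
    return Severity[code.strip().upper()]

level = parse_severity(" c ")
print(level.name, int(level), LABELS[level])  # C 2 mild
```

An ordered encoding keeps comparisons such as `Severity.E > Severity.B` meaningful, which is convenient when the task is treated as ordinal rather than purely categorical.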

LSTM Architecture

The Bidirectional Long Short-Term Memory (Bi-LSTM)¹⁴ network utilized in this study consists of an input layer, multiple hidden layers of LSTM cells, and an output layer (as shown in Fig. 1). The input layer takes in the time-series data related to wellhead pressure, casing pressure, and daily methanol injection. Each LSTM cell in the hidden layers contains a memory cell and three gates: the input gate, the forget gate¹⁵, and the output gate. The input gate controls the flow of new information into the memory cell, the forget gate determines what information to discard from the cell’s previous state, and the output gate decides what information to output for the current time step. Because a Bi-LSTM processes each sequence once in the forward direction and once in the backward direction and combines the two sets of hidden states, the model can capture both forward and backward temporal dependencies in the data, which makes it well suited to the sequential nature of the natural gas production and fluid accumulation data.
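The gate mechanics described above can be sketched in plain Python. This is a deliberately reduced illustration: it uses a scalar input and a scalar hidden state, whereas the model in the study takes three input features and stacks several layers. The parameter names and the helper `make_params` are hypothetical, and the weights are arbitrary rather than trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_params(w=0.5):
    """Arbitrary illustrative weights; real values come from training."""
    names = ["wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug"]
    params = {name: w for name in names}
    params.update({"bi": 0.0, "bf": 0.0, "bo": 0.0, "bg": 0.0})
    return params

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input and scalar state."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add new information
    h = o * math.tanh(c)     # expose part of the memory as output
    return h, c

def run_lstm(xs, p):
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs

def run_bilstm(xs, p_fwd, p_bwd):
    """Pair forward-pass and backward-pass hidden states per time step."""
    forward = run_lstm(xs, p_fwd)
    backward = run_lstm(xs[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))

states = run_bilstm([0.2, 0.4, 0.1], make_params(), make_params(0.3))
print(len(states), len(states[0]))  # 3 2
```

At each time step the forward state summarizes the past and the backward state summarizes the future, so a downstream layer sees context from both directions.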

For the hyperparameters of the introduced LSTM model, a batch size of 32 was chosen. The Adam optimizer was adopted, with a learning rate set to 1e-9. The model depth was specified as 8, the dropout rate was set at 0.5, and the number of epochs was fixed at 200.
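For reproducibility, the reported hyperparameters can be collected in a single configuration object. The values below are taken from the text; the class and field names are hypothetical, and reading "model depth" as the number of stacked LSTM layers is an assumption.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BiLSTMConfig:
    batch_size: int = 32
    optimizer: str = "adam"
    learning_rate: float = 1e-9  # value as reported in the text
    depth: int = 8               # assumed to mean stacked LSTM layers
    dropout: float = 0.5
    epochs: int = 200

config = BiLSTMConfig()
print(asdict(config))
```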

Model Performance Results

In this study, the evaluation metrics utilized are the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE). The RMSE formula is presented as Equation (3):

$$RMSE=\sqrt{\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}},$$

(3)

where n stands for the overall number of samples or data points. y_i indicates the actual observed value of the i-th sample, and ${\widehat{y}}_{i}$ represents the corresponding predicted value. The formula for MAPE is given by Equation (4):

$$MAPE=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left|\frac{{y}_{i}-{\widehat{y}}_{i}}{{y}_{i}}\right|\times 100,$$

(4)

where n refers to the number of samples.
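Equations (3) and (4) translate directly into code. The sketch below is a plain-Python illustration with made-up values. Raising an error when an observed value is zero is an assumption, since MAPE is undefined in that case and the text does not say how such records were handled.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error, Equation (3)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent, Equation (4)."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an observed value is zero")
    n = len(y_true)
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n * 100

observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(rmse(observed, predicted))  # sqrt(250), about 15.81
print(mape(observed, predicted))  # about 10 percent
```

RMSE is expressed in the units of the target, while MAPE is relative, so the two can rank models differently when observed values span a wide range.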

In the experiments, 80% of the records in the dataset are used as the training set, and the remaining 20% are used as the testing set. The performance of the introduced model on this dataset is presented in Table 3:

Table 3 Bi-LSTM Model Performance Results.

These metrics assess the Bi-LSTM model’s effectiveness in handling the prediction and detection tasks based on the collected dataset. Random forest (RF), decision tree (DT), and support vector machine (SVM) models were also used as baselines.

In addition, a permutation feature importance analysis was conducted on the proposed model. For features such as wellhead pressure and casing pressure, the values were shuffled in the test dataset and the changes in RMSE and MAPE were observed. Table 4 shows that shuffling wellhead pressure and casing pressure led to a significant increase in both RMSE and MAPE, indicating their high importance for the model’s performance. Daily methanol injection volume has a relatively minor effect.

Table 4 Feature Impact Analysis.
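The permutation procedure can be sketched as follows. A toy predictor stands in for the trained Bi-LSTM, and the feature names, values, and function names are hypothetical; only the shuffle-and-compare logic reflects the analysis described above.

```python
import math
import random

def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def permutation_importance(predict, rows, targets, feature, seed=0):
    """Increase in RMSE after shuffling one feature across the test rows."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    # Copy each row, replacing only the chosen feature with a shuffled value.
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return rmse(targets, [predict(r) for r in permuted]) - baseline

# Toy stand-in model: depends on wellhead pressure, ignores methanol.
def toy_predict(row):
    return 2 * row["wellhead_pressure"]

rows = [{"wellhead_pressure": p, "methanol": m}
        for p, m in zip([1, 2, 3, 4, 5, 6], [9, 8, 7, 6, 5, 4])]
targets = [toy_predict(r) for r in rows]

for name in ("wellhead_pressure", "methanol"):
    print(name, permutation_importance(toy_predict, rows, targets, name))
```

A feature the model ignores yields an increase of exactly zero, while a feature the model relies on yields a positive increase; averaging over several shuffles reduces the noise of a single permutation.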

Data Records

The proposed dataset is publicly available on the ScienceDB¹⁶ platform to facilitate researcher access. It consists of 31 .xlsx files: thirty store the natural gas production and liquid level samples, one file per well, and the remaining file contains the wells’ statistical information.

The sample file of each well consists of two spreadsheets: one contains the daily natural gas production data and the other contains the liquid level data. The fields in the first spreadsheet are provided in Table 1, and the fields in the second spreadsheet are provided in Table 2.

Technical Validation

To validate the dataset’s quality and reliability, a series of tests were conducted. Data consistency checks were performed to ensure that the data was consistent across different wells and time periods, as shown in Fig. 2. Completeness checks were carried out to verify that all required data was present.
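The text does not describe how the completeness checks were implemented. One plausible check for daily records, shown here as a sketch under that assumption, is to list the calendar days missing between a well's first and last record; the function name `missing_days` and the sample dates are illustrative.

```python
from datetime import date, timedelta

def missing_days(dates):
    """Return calendar days absent between the first and last recorded date."""
    present = set(dates)
    start, end = min(present), max(present)
    span = (end - start).days + 1
    all_days = (start + timedelta(days=k) for k in range(span))
    return [d for d in all_days if d not in present]

recorded = [date(2010, 3, 17), date(2010, 3, 18), date(2010, 3, 20)]
print(missing_days(recorded))  # [datetime.date(2010, 3, 19)]
```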

A separate validation process was dedicated to the annotations. The annotated data was randomly sampled and cross-checked by an independent team of experts who were not involved in the initial annotation process. They verified the accuracy of the well ID, date, time, and event annotations. In cases where errors or inconsistencies were found, the original annotation team was notified and the necessary corrections were made. Additionally, statistical analyses were performed on the annotations to check for any biases or patterns that could potentially affect the analysis. For example, the distribution of special events across different wells and time periods was examined to ensure that there was no over- or under-representation. This comprehensive annotation validation process further enhanced the reliability and usability of the dataset.

Usage Notes

Potential applications

The dataset presented in this study holds significant potential for various applications within the natural gas production domain. It can be utilized for the development and training of advanced machine learning and artificial intelligence models focused on natural gas production prediction. By leveraging the detailed records of wellhead pressure, casing pressure, daily methanol injection, gas production volume, and fluid accumulation information, researchers and industry practitioners can enhance the accuracy of their predictive models. This, in turn, enables more efficient production planning, allowing for optimized resource allocation and timely decision-making regarding well operations.

Furthermore, the dataset can also be used in the field of reservoir engineering for the analysis of reservoir behavior and the assessment of the impact of fluid accumulation on gas production. Engineers can use the data to study the correlations between different parameters and gain a deeper understanding of the underlying physical processes. This knowledge can then be applied to design more effective production strategies and to implement appropriate measures for mitigating the negative effects of fluid accumulation.

Limitations

Despite its potential applications, the dataset has certain limitations. One of the primary limitations is the geographical scope of the data collection. The dataset is sourced from multiple wells in the Inner Mongolia region, specifically targeting the Shanxi Formation. This limited geographical area may restrict the generalizability of the models trained using this dataset to other regions with different geological characteristics. The reservoir properties, fluid compositions, and production behaviors can vary significantly from one region to another, and thus, the models developed based on this dataset may not perform equally well in other locations.

Another limitation is the potential presence of measurement errors and uncertainties in the data. Although efforts have been made to ensure data quality through calibration of sensors and implementation of quality control measures, there is still a possibility of errors in the recorded values. These errors could affect the accuracy of the models trained with the dataset and lead to less reliable predictions and analyses.

Code availability

The code used for data processing, analysis, and model development is available on Gitee at https://gitee.com/practicing-swordsmanship/jian-lian/tree/master. The code is comprehensively documented, enabling other researchers to reproduce the analysis and build upon the work. It encompasses functions for data cleaning, feature engineering, model training, and evaluation.

References

Faramawy, S., Zaki, T. & Sakr, A. A.-E. Natural gas origin, composition, and processing: A review. Journal of Natural Gas Science and Engineering 34, 34–54, https://doi.org/10.1016/j.jngse.2016.06.030 (2016).
Jimoh, M. O., Arinkoola, A. O., Salawudeen, T. O., Daramola, M. O. 4 - environmental challenges of natural gas extraction and production technologies. In: Rahimpour, M. R., Makarem, M. A., Meshksar, M. (eds.) Advances in Natural Gas: Formation, Processing, and Applications Volume: 1: Natural Gas Formation and Extraction, pp. 75–101. Elsevier, https://doi.org/10.1016/B978-0-443-19215-9.00009-8 (2024).
Xu, L. et al. Origin and isotopic fractionation of shale gas from the shanxi formation in the southeastern margin of ordos basin. Journal of Petroleum Science and Engineering 208, 109189, https://doi.org/10.1016/j.petrol.2021.109189 (2022).
Liu, C. Advances in gas well fluid accumulation modeling. Academic Journal of Science and Technology 5, 169–178, https://doi.org/10.54097/ajst.v5i1.5602 (2023).
Sen, D., Hamurcuoglu, K. I., Ersoy, M. Z., Tunç, K. M. M. & Günay, M. E. Forecasting long-term world annual natural gas production by machine learning. Resources Policy 80, 103224, https://doi.org/10.1016/j.resourpol.2022.103224 (2023).
Lao, T. & Sun, Y. Predicting the production and consumption of natural gas in china by using a new grey forecasting method. Mathematics and Computers in Simulation 202, 295–315, https://doi.org/10.1016/j.matcom.2022.05.023 (2022).
Hari, S., Krishna, S., Patel, M., Bhatia, P. & Vij, R. K. Influence of wellhead pressure and water cut in the optimization of oil production from gas lifted wells. Petroleum Research 7(2), 253–262, https://doi.org/10.1016/j.ptlrs.2021.09.008 (2022).
Yin, F. & Gao, D. Prediction of sustained production casing pressure and casing design for shale gas horizontal wells. Journal of Natural Gas Science and Engineering 25, 159–165, https://doi.org/10.1016/j.jngse.2015.04.038 (2015).
Panda, S., Mehlawat, S., Dhariwal, N., Kumar, A. & Sanger, A. Comprehensive review on gas sensors: Unveiling recent developments and addressing challenges. Materials Science and Engineering: B 308, 117616, https://doi.org/10.1016/j.mseb.2024.117616 (2024).
Gokulakrishnan, D. & Venkataraman, S. Ensuring data integrity: Best practices and strategies in pharmaceutical industry. Intelligent Pharmacy https://doi.org/10.1016/j.ipha.2024.09.010 (2024).
Gong, Y., Liu, G., Xue, Y., Li, R. & Meng, L. A survey on dataset quality in machine learning. Information and Software Technology 162, 107268, https://doi.org/10.1016/j.infsof.2023.107268 (2023).
Caruso, C. & Quarta, F. Interpolation methods comparison. Computers & Mathematics with Applications 35(12), 109–126, https://doi.org/10.1016/S0898-1221(98)00101-1 (1998).
May, R. J., Maier, H. R. & Dandy, G. C. Data splitting for artificial neural networks using som-based stratified sampling. Neural Networks 23(2), 283–294, https://doi.org/10.1016/j.neunet.2009.11.009 (2010).
Liu, G. & Guo, J. Bidirectional lstm with attention mechanism and convolutional layer for text classification. Neurocomputing 337, 325–338, https://doi.org/10.1016/j.neucom.2019.01.078 (2019).
Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: continual prediction with lstm. In: 1999 Ninth International Conference on Artificial Neural Networks ICANN 99. (Conf. Publ. No. 470), vol. 2, pp. 850–855, https://doi.org/10.1049/cp:19991218 (1999).
Lian, J. Natural gas production and liquid level. https://doi.org/10.57760/sciencedb.23211.

Acknowledgements

The authors express their sincere gratitude to the oil companies and field operators in the Inner Mongolia region for their cooperation and support in providing the data.

Author information

These authors contributed equally: Yanlei Wang, Jian Lian, Chengjiang Li.

Authors and Affiliations

Shandong University of Political Science and Law, Jinan, 250357, China
Yanlei Wang
School of Intelligence Engineering, Shandong Management University, No.3500 Dingxiang Road, Jinan, 250357, Shandong, China
Jian Lian
Jinan Campus, Shandong University of Science and Technology, No.17 Shenglizhuang Road, Jinan, 250351, Shandong, China
Chengjiang Li

Authors

Yanlei Wang
Jian Lian
Chengjiang Li

Contributions

J.L. conceived the study and designed the data collection methodology. C.L. was responsible for data collection and annotation. Y.W., J.L. and C.L. conducted the data analysis and model development. All authors contributed to the writing and revision of the manuscript.

Corresponding author

Correspondence to Jian Lian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, Y., Lian, J. & Li, C. A dataset of Natural Gas and Liquid Level for Oil Field Production Prediction in China. Sci Data 12, 1071 (2025). https://doi.org/10.1038/s41597-025-05309-w

Received: 30 January 2025
Accepted: 02 June 2025
Published: 23 June 2025
DOI: https://doi.org/10.1038/s41597-025-05309-w