Abstract
Natural gas, a critical resource for national economies and public welfare, plays a significant role in the energy sector. The efficient production of natural gas is often hindered by the presence of formation water, which can adversely affect well productivity and operational efficiency. Accurate prediction of natural gas production and estimation of associated fluid accumulation are therefore paramount for optimizing extraction processes. This study introduces a dataset compiled from multiple wells in the Inner Mongolia region, primarily targeting the Shanxi Formation, with the aim of enhancing the predictive accuracy of natural gas production and associated fluid accumulation. The dataset, spanning from March 17, 2010, to April 15, 2024, includes detailed records of wellhead pressure, casing pressure, daily methanol injection, and cumulative production volumes of gas, water, and oil. By analyzing these parameters, we can identify trends and anomalies over time, which are essential for refining production strategies and mitigating the impact of fluid accumulation on gas wells.
Similar content being viewed by others
Background & Summary
Natural gas, a vital component of the global energy mix, is essential for both industrial development and domestic use1. The production of natural gas is a complex process that involves the extraction of hydrocarbons from subterranean reservoirs2. In the Inner Mongolia region, the Shanxi Formation has been identified as a significant reservoir for natural gas, making it a focal point for our study3. However, the presence of formation water, commonly referred to as “fluid accumulation” or “Jiye” in Chinese, can impede gas flow, reduce well productivity, and increase operational costs4. The accurate prediction of natural gas production and the estimation of associated fluid accumulation are crucial for the efficient management of gas fields.
The ability to forecast natural gas yields and the concurrent accumulation of formation water is not only technically challenging but also economically significant5. Accurate predictions enable oil and gas companies to plan production more effectively, optimize resource allocation, and manage the environmental impacts of their operations. However, the complexity of subsurface conditions and the variability in reservoir characteristics make this a difficult task6. It is against this backdrop that the importance of our dataset becomes apparent.
This study presents a compiled dataset from multiple wells in the Inner Mongolia region, with a primary focus on the Shanxi Formation. The dataset encompasses a wide range of parameters, including wellhead pressure7, casing pressure8, daily methanol injection volumes, and cumulative production of gas, water, and oil. Spanning from March 17, 2010, to April 25, 2024, this dataset provides a detailed account of the operational history of these wells, capturing the dynamic interplay between natural gas production and fluid accumulation. The primary objective of this dataset is to enhance the accuracy of predictions regarding natural gas production and associated fluid accumulation. By providing a comprehensive and standardized dataset, we aim to facilitate advanced analytical techniques and modeling approaches that can improve the understanding of production dynamics in gas fields. This, in turn, can lead to more informed decision-making in the planning and management of natural gas production, with positive ramifications for the transportation, marketing, and other industries dependent on a stable supply of natural gas.
To our best knowledge, the dataset presented in this study is a valuable resource for researchers, engineers, and policymakers. It not only contributes to the scientific understanding of gas field operations but also serves as a foundation for developing more effective strategies for the sustainable extraction of natural gas resources.
Methods
Data Collection
Data was collected from a cohort of wells in the Inner Mongolia region. These wells are outfitted with state-of-the-art sensors9 capable of precisely measuring wellhead pressure, casing pressure, and the flow rate of gas. The sensors undergo regular calibration to guarantee the accuracy of the measurements. The data collection process was designed to minimize errors and safeguard the integrity of the data10. The data was collected on a daily basis to capture short-term variations in production parameters.
To ensure data quality, a comprehensive suite of quality control measures was implemented11. Outlier detection algorithms were utilized to identify and rectify any abnormal data points. Data integrity checks were performed to confirm that all requisite parameters were recorded and that there were no missing values. In cases where missing values were detected, the interpolation methods12 were employed based on the characteristics of the data and the production process.
Data Pre-processing
The gathered data went through the following preprocessing procedures. Initially, the missing values within the dataset were detected and addressed. When the missing values were scattered and the time-series characteristic of the data needed to be maintained, interpolation methods were taken into account. Given that \({y}_{{t}_{i-1}}\) and \({y}_{{t}_{i+1}}\) are the known values neighboring the missing value \({y}_{{t}_{i}}\), the linearly interpolated value \({\widehat{y}}_{{t}_{i}}\) is computed as Equation (1):
Subsequently, the data was normalized so as to make all features have a comparable scale. This was accomplished by employing standard normalization methods like min-max scaling. Min-max scaling converts the data to a fixed interval [0, 1]. For a feature x, the scaled value xscaled is determined by Equation (2):
Here, xmin and xmax represent the minimum and maximum values of the feature x in the dataset respectively.
Data Annotation
In this dataset, each data record was labeled with the corresponding well ID and date. Supplementary annotations were provided to denote any special events or operational changes that might have influenced the production parameters, such as equipment maintenance or alterations in injection strategies.
To ensure the consistency and accuracy of the annotations, a team of seasoned engineers and data analysts reviewed and verified the annotated data. Any discrepancies or ambiguities were resolved through in-depth discussions and consultations with field experts.
The dataset is organized in a structured manner and is publicly accessible for research purposes. It is partitioned into training and testing subsets using a stratified sampling approach to ensure that each subset contains a representative sample of the data13.
Dataset statistics
In general, Table 1 provides a comprehensive and detailed overview of the key parameters and data related to the natural gas production process. It encompasses various crucial aspects such as production time, well characteristics, output volumes, pressure and temperature conditions, and injection volumes. This table serves as a vital tool for in-depth analysis and understanding of the complex dynamics and performance of the gas production system. By presenting these data in a systematic manner, it enables us to identify patterns, trends, and potential correlations that are essential for making informed decisions and formulating effective strategies to optimize the production process and enhance overall productivity.
In this dataset, there is also data related to fluid accumulation. Since fluid accumulation does not occur every day and its occurrence rate is relatively low, it is stored separately in the dataset. The main stored contents are the time when fluid accumulation occurs and the degree of fluid accumulation. This degree is classified into the following types, including absence, minimal, mild, moderate, and severe. In this study, the classification of fluid accumulation severity in the natural gas production process is determined according to the scores given by on-site engineers. In total, Table 2 presents the key statistics of the dataset:
LSTM Architecture
The Bidirectional Long Short-Term Memory (Bi-LSTM)14 network utilized in this study consists of an input layer, multiple hidden layers of LSTM cells, and an output layer (as shown in Fig. 1). The input layer takes in the time-series data related to wellhead pressure, casing pressure, and daily methanol injection. Each LSTM cell in the hidden layers contains a memory cell and three gates: the input gate, the forget gate15, and the output gate. The input gate controls the flow of new information into the memory cell, the forget gate determines what information to discard from the cell’s previous state, and the output gate decides what information to output for the current time step. This architecture allows the model to capture both the forward and backward temporal dependencies in the data, making it well-suited for handling the sequential nature of the natural gas production and fluid accumulation data.
For the hyperparameters of the introduced LSTM model, a batch size of 32 was chosen. The Adam optimizer was adopted, with a learning rate set to 1e-9. The model depth was specified as 8, the dropout rate was set at 0.5, and the number of epochs was fixed at 200.
Model Performance Results
In this study, the evaluation metrics utilized are the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE). The RMSE formula is presented as Equation (3):
where n stands for the overall number of samples or data points. yi indicates the actual observed value of the i-th sample, and \({\widehat{y}}_{i}\) represents the corresponding predicted value. The formula for MAPE is given by Equation (4):
where n refers to the number of samples.
In the experiments, 80% of the records in the dataset is taken as the training set, and the remaining 20% records are taken as the testing set. And the performance of the introduced model on this dataset is presented in Table 3:
These metrics provide a comprehensive assessment of the Bi-LSTM model’s effectiveness in handling the prediction and detection tasks based on the collected dataset. Moreover, this study has also used random forest (RF), decision tree (DT), and support vector machine (SVM) as the baseline models.
In addition, a permutation feature importance analysis was conducted on the proposed model. For features like wellhead pressure and casing pressure, their values were shuffled in the test dataset and the changes in RMSE and MAPE were observed. Table 4 shows that shuffling wellhead pressure and casing pressure led to a significant increase in both RMSE and MAPE, indicating its high importance for the model’s performance. Daily methanol injection volume has a relatively minor effect.
Data Records
The proposed dataset is publicly available on the ScienceDB16 platform to facilitate researcher access. It consists of 31 documents, thirty of them dedicated to storing the natural gas production and liquid level samples for each well in the form of .xlsx format and the other one containing the wells’ statistical information in the form of .xlsx.
The sample file of each well consists of two spreadsheets, one spreadsheet contains the daily natural gas production data and the other spreadsheet contains the liquid level data. And the fields in the first spreadsheet are provided in Table 1. The fields in the second spreadsheet are provided in Table 2.
Technical Validation
To validate the dataset’s quality and reliability, a series of tests were conducted. Data consistency checks were performed to ensure that the data was consistent across different wells and time periods, as shown in Fig. 2. Completeness checks were carried out to verify that all required data was present.
A separate validation process was dedicated to the annotations. The annotated data was randomly sampled and cross-checked by an independent team of experts who were not involved in the initial annotation process. They verified the accuracy of the well ID, date, time, and event annotations. In cases where errors or inconsistencies were found, the original annotation team was notified and the necessary corrections were made. Additionally, statistical analyses were performed on the annotations to check for any biases or patterns that could potentially affect the analysis. For example, the distribution of special events across different wells and time periods was examined to ensure that there was no over- or under-representation. This comprehensive annotation validation process further enhanced the reliability and usability of the dataset.
Usage Notes
Potential applications
The dataset presented in this study holds significant potential for various applications within the natural gas production domain. It can be utilized for the development and training of advanced machine learning and artificial intelligence models focused on natural gas production prediction. By leveraging the detailed records of wellhead pressure, casing pressure, daily methanol injection, gas production volume, and fluid accumulation information, researchers and industry practitioners can enhance the accuracy of their predictive models. This, in turn, enables more efficient production planning, allowing for optimized resource allocation and timely decision-making regarding well operations.
Furthermore, the dataset can also be used in the field of reservoir engineering for the analysis of reservoir behavior and the assessment of the impact of fluid accumulation on gas production. Engineers can use the data to study the correlations between different parameters and gain a deeper understanding of the underlying physical processes. This knowledge can then be applied to design more effective production strategies and to implement appropriate measures for mitigating the negative effects of fluid accumulation.
Limitations
Despite its potential applications, the dataset has certain limitations. One of the primary limitations is the geographical scope of the data collection. The dataset is sourced from multiple wells in the Inner Mongolia region, specifically targeting the Shanxi Formation. This limited geographical area may restrict the generalizability of the models trained using this dataset to other regions with different geological characteristics. The reservoir properties, fluid compositions, and production behaviors can vary significantly from one region to another, and thus, the models developed based on this dataset may not perform equally well in other locations.
Another limitation is the potential presence of measurement errors and uncertainties in the data. Although efforts have been made to ensure data quality through calibration of sensors and implementation of quality control measures, there is still a possibility of errors in the recorded values. These errors could affect the accuracy of the models trained with the dataset and lead to less reliable predictions and analyses.
Code availability
The code used for data processing, analysis, and model development is available on Gitee at https://gitee.com/practicing-swordsmanship/jian-lian/tree/master. The code is comprehensively documented, enabling other researchers to reproduce the analysis and build upon the work. It encompasses functions for data cleaning, feature engineering, model training, and evaluation.
References
Faramawy, S., Zaki, T. & Sakr, A. A.-E. Natural gas origin, composition, and processing: A review. Journal of Natural Gas Science and Engineering 34, 34–54, https://doi.org/10.1016/j.jngse.2016.06.030 (2016).
Jimoh, M. O., Arinkoola, A. O., Salawudeen, T. O., Daramola, M. O. 4 - environmental challenges of natural gas extraction and production technologies. In: Rahimpour, M. R., Makarem, M. A., Meshksar, M. (eds.) Advances in Natural Gas: Formation, Processing, and Applications Volume: 1: Natural Gas Formation and Extraction, pp. 75–101. Elsevier, https://doi.org/10.1016/B978-0-443-19215-9.00009-8 (2024).
Xu, L. et al. Origin and isotopic fractionation of shale gas from the shanxi formation in the southeastern margin of ordos basin. Journal of Petroleum Science and Engineering 208, 109189, https://doi.org/10.1016/j.petrol.2021.109189 (2022).
Liu, C. Advances in gas well fluid accumulation modeling. Academic Journal of Science and Technology 5, 169–178, https://doi.org/10.54097/ajst.v5i1.5602 (2023).
Sen, D., Hamurcuoglu, K. I., Ersoy, M. Z., Tunç, K. M. M. & Günay, M. E. Forecasting long-term world annual natural gas production by machine learning. Resources Policy 80, 103224, https://doi.org/10.1016/j.resourpol.2022.103224 (2023).
Lao, T. & Sun, Y. Predicting the production and consumption of natural gas in china by using a new grey forecasting method. Mathematics and Computers in Simulation 202, 295–315, https://doi.org/10.1016/j.matcom.2022.05.023 (2022).
Hari, S., Krishna, S., Patel, M., Bhatia, P. & Vij, R. K. Influence of wellhead pressure and water cut in the optimization of oil production from gas lifted wells. Petroleum Research 7(2), 253–262, https://doi.org/10.1016/j.ptlrs.2021.09.008 (2022).
Yin, F. & Gao, D. Prediction of sustained production casing pressure and casing design for shale gas horizontal wells. Journal of Natural Gas Science and Engineering 25, 159–165, https://doi.org/10.1016/j.jngse.2015.04.038 (2015).
Panda, S., Mehlawat, S., Dhariwal, N., Kumar, A. & Sanger, A. Comprehensive review on gas sensors: Unveiling recent developments and addressing challenges. Materials Science and Engineering: B 308, 117616, https://doi.org/10.1016/j.mseb.2024.117616 (2024).
Gokulakrishnan, D. & Venkataraman, S. Ensuring data integrity: Best practices and strategies in pharmaceutical industry. Intelligent Pharmacy https://doi.org/10.1016/j.ipha.2024.09.010 (2024).
Gong, Y., Liu, G., Xue, Y., Li, R. & Meng, L. A survey on dataset quality in machine learning. Information and Software Technology 162, 107268, https://doi.org/10.1016/j.infsof.2023.107268 (2023).
Caruso, C. & Quarta, F. Interpolation methods comparison. Computers & Mathematics with Applications 35(12), 109–126, https://doi.org/10.1016/S0898-1221(98)00101-1 (1998).
May, R. J., Maier, H. R. & Dandy, G. C. Data splitting for artificial neural networks using som-based stratified sampling. Neural Networks 23(2), 283–294, https://doi.org/10.1016/j.neunet.2009.11.009 (2010).
Liu, G. & Guo, J. Bidirectional lstm with attention mechanism and convolutional layer for text classification. Neurocomputing 337, 325–338, https://doi.org/10.1016/j.neucom.2019.01.078 (2019).
Gers, F. A. & Schmidhuber, J. & Cummins, F. Learning to forget: continual prediction with lstm. In: 1999 Ninth International Conference on Artificial Neural Networks ICANN 99. (Conf. Publ. No. 470), vol. 2, pp. 850–8552, https://doi.org/10.1049/cp:19991218 (1999).
Lian, J. Natural gas production and liquid level. https://doi.org/10.57760/sciencedb.23211.
Acknowledgements
The authors express their sincere gratitude to the oil companies and field operators in the Inner Mongolia region for their cooperation and support in providing the data.
Author information
Authors and Affiliations
Contributions
J.L. conceived the study and designed the data collection methodology. C.L. was responsible for data collection and annotation. Y.W., J.L. and C.L. conducted the data analysis and model development. All authors contributed to the writing and revision of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Y., Lian, J. & Li, C. A dataset of Natural Gas and Liquid Level for Oil Field Production Prediction in China. Sci Data 12, 1071 (2025). https://doi.org/10.1038/s41597-025-05309-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-025-05309-w