Introduction

Amid the ongoing restructuring of the global energy system and the escalating pursuit of low-carbon energy sources, natural gas—especially unconventional forms like shale gas—has emerged as a pivotal component in safeguarding national energy supply and promoting the evolution toward a cleaner energy mix.

In recent years, China has accelerated the exploration and development of shale gas, achieving significant breakthroughs in regions such as the Sichuan Basin, Fuling, Changning–Weiyuan, and Zhaotong1,2,3,4. According to statistics from the National Energy Administration, China’s annual shale gas production exceeded 25 billion cubic meters in 2023, with its share in total domestic natural gas output continuing to increase. It has become a primary contributor to natural gas production growth during the 14th Five-Year Plan period (see Fig. 1).

However, shale gas resources are typically characterized by deep burial, complex geological structures, and strong reservoir heterogeneity5, which makes their production behavior significantly different from that of conventional gas reservoirs and poses substantial challenges for accurate production forecasting.

Fig. 1

Annual shale gas production in China (2015–2023), in billion cubic meters.

Traditional shale gas production forecasting methods are largely based on decline curve analysis (DCA), such as Arps models and SEPD curves, whose theoretical foundations rely on statistical or empirical formulations. While these approaches provide useful guidance during the early stages of reservoir development, they generally assume stable flow regimes and simplified production mechanisms. As a result, their prediction accuracy deteriorates significantly under conditions involving strong production fluctuations, gas–water co-production, or frequent well control adjustments, which are commonly observed in deep shale gas wells.

To overcome these limitations, several studies have attempted to enhance early-stage production evaluation by incorporating production dynamics and statistical analysis. Bu Tao et al.6 proposed a rapid estimation method for ultimate recoverable reserves (EUR) based on flowback-phase dynamic data, enabling preliminary assessment of shale gas potential at an early stage. Zhu Yuanchong et al.5, through large-scale statistical feature analysis, investigated the relationships between shale gas productivity and production parameters, highlighting the importance of data-driven methodologies in capturing complex production behaviors.

In parallel, data-driven machine learning approaches have been increasingly adopted in gas engineering applications. Recent studies published in Energy and Applied Energy demonstrated that machine learning models can effectively predict key production-related parameters—such as deliverability in underground natural gas storage systems—offering competitive accuracy with substantially reduced computational cost compared to physics-based simulation7,8. However, most of these approaches rely on static or shallow learning models and fail to explicitly account for temporal dependency and long-range dynamics inherent in gas production processes.

With the rapid advancement of artificial intelligence, deep learning–based time-series prediction models9,10,11,12 have gained increasing attention in oil and gas production forecasting. Among them, Long Short-Term Memory (LSTM) networks have shown strong capability in modeling nonlinear temporal dependencies and have been successfully applied to daily shale gas production prediction across multiple blocks. To address data scarcity and generalization issues, Alolayan et al.9 introduced a transfer learning framework, while Nguyen-Le et al.10 developed a multivariate input strategy combining early production data to improve forecasting performance in the Barnett shale reservoir.

More recently, hybrid deep learning architectures have been explored to further enhance sequence modeling capability. Qiao Songbo et al.13 proposed a hybrid REMD–CNN–Transformer–LSTM framework for complex time-series prediction, demonstrating the advantages of integrating multiple representation mechanisms. Liang et al.14 introduced a BiLSTM–RF–MPA model tailored to the nonstationary characteristics of shale gas production, achieving improved robustness and prediction accuracy. In summary, LSTM-based models capture short-term temporal dependencies effectively but often struggle with long-range dynamics; CNN-based approaches focus on local pattern extraction with limited temporal context; and Transformer-based models excel at global dependency modeling at the cost of increased computational complexity.

Meanwhile, the success of the Transformer architecture in natural language processing and time-series modeling has provided new opportunities for capturing long-range dependencies in complex sequential data. In addition, state-space models (SSMs), such as the recently proposed Mamba architecture, have shown strong potential in efficiently modeling long sequences with linear computational complexity, offering a promising alternative to attention-based mechanisms.

Motivated by these developments, this study proposes a hybrid prediction framework that integrates the global dependency modeling capability of Transformer with the efficient long-range state-space representation of Mamba, while employing the Rabbit Optimization Algorithm (ROA) for automated hyperparameter optimization. The proposed model is systematically evaluated against Transformer–LSTM and standalone Mamba baselines using real production data from deep shale gas wells.

The main objective of this study is to develop a Transformer–Mamba hybrid model for shale gas production forecasting and to systematically evaluate its predictive accuracy and generalization capability using large-scale field production data, with particular emphasis on its ability to capture production dynamics under complex operating conditions. The remainder of this paper is organized as follows: Sect. 2 describes the data and methodology, Sect. 3 presents the results and discussion, and Sect. 4 summarizes the main conclusions and outlines directions for future research.

Research methods

Problem definition and modeling objective

Daily shale gas production forecasting is essentially a multivariate time-series regression problem, which aims to predict future daily gas production trends based on historical production and operational parameters. At time step \(\:t\), the production state of a shale gas well is represented by a multidimensional input vector

$$\:{\mathbf{x}}_{t}=[{x}_{t}^{\left(1\right)},{x}_{t}^{\left(2\right)},\dots\:,{x}_{t}^{\left(n\right)}]$$
(1)

which includes key production variables such as pressure, water production rate, and flowback ratio. The corresponding prediction target is the daily gas production at the next time step, denoted as \(\:{y}_{t+1}\).

In this study, a sliding time-window approach is adopted, where a historical sequence of length \(\:L\),

$$\:\left\{{\mathbf{x}}_{t-L+1},\dots\:,{\mathbf{x}}_{t}\right\}$$
(2)

is used as the model input to perform regression and predict future daily gas production. This formulation enables the model to learn the dynamic mapping between historical operational states and subsequent production responses.
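The sliding-window construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `make_windows` and the toy series are assumptions, and only the Python standard library is used.

```python
def make_windows(features, target, window):
    """Pair each window of `window` consecutive feature rows ending at step t
    with the target value at step t + 1."""
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    X, y = [], []
    # t is the index of the last row inside the window
    for t in range(window - 1, len(features) - 1):
        X.append(features[t - window + 1 : t + 1])
        y.append(target[t + 1])
    return X, y


# Hypothetical toy series: 6 days, 3 features per day
features = [[float(d), float(d) * 10, float(d) * 100] for d in range(6)]
target = [float(d) for d in range(6)]
X, y = make_windows(features, target, window=3)
```

Each element of `X` has shape L × n, matching the input sequence in Eq. (2), and the matching entry of `y` is the next-day target.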

Transformer–Mamba hybrid model architecture

To simultaneously capture short-term fluctuations and long-term evolutionary trends in shale gas production data, a hybrid deep learning model integrating Transformer15,16,17,18,19 and Mamba20,21,22,23,24,25 modules is developed in this study. The overall architecture of the proposed model is illustrated in Fig. 2. The model first employs a one-dimensional convolutional layer to project the original input features into a higher-dimensional space, thereby enhancing the representation of local temporal patterns. Subsequently, a Transformer encoder is introduced, in which the multi-head self-attention mechanism is used to model global dependencies among different time steps and to capture the coupling effects of production parameters across multiple temporal scales.

Following the Transformer module, a Mamba module is incorporated to perform state-space modeling of the time series. Unlike attention-based architectures, Mamba leverages a selective state update mechanism to compress and propagate historical information, enabling efficient modeling of dynamic state evolution in long sequences while maintaining linear computational complexity. This property makes the Mamba module particularly suitable for shale gas production data, which are often characterized by long-term gradual decline superimposed with localized perturbations. Finally, a fully connected layer maps the high-dimensional temporal features to the predicted future daily gas production.

The Mamba component is implemented as a lightweight, Mamba-inspired selective state-space modeling block to capture long-range temporal dependencies efficiently. The state dimension \(\:N\) is set equal to the Transformer embedding dimension (\(\:N=32\)), ensuring consistent feature representation across modules. Continuous-time state evolution is approximated through discrete convolutional operators, providing an efficient discretization of state transitions. Selective information propagation is realized via gated nonlinear transformations and residual connections, enabling adaptive compression and transmission of historical production dynamics.
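For reference, selective state-space blocks of this kind are usually written in the generic form below. It is taken from the broader state-space-model literature, not from the implementation used in this study, and the symbols \(A\), \(B\), \(C\), \(h\), and the step size \(\Delta\) are notation assumed here for illustration:

$$h'(t)=A\,h(t)+B\,x(t),\qquad y(t)=C\,h(t)$$

$$\bar{A}=\exp(\Delta A),\qquad \bar{B}=(\Delta A)^{-1}\left(\exp(\Delta A)-I\right)\Delta B$$

$$h_{t}=\bar{A}\,h_{t-1}+\bar{B}\,x_{t},\qquad y_{t}=C\,h_{t}$$

In a selective (Mamba-style) block, \(\Delta\), \(B\), and \(C\) are computed from the current input \(x_{t}\), so the state update can retain or discard history depending on the data. Because the recurrence carries a fixed-size state \(h_{t}\), the cost grows linearly with the sequence length \(L\).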

Fig. 2

Hybrid model architecture.

To clearly describe the data flow within the model and the interconnections among different modules, the tensor structure of the proposed Transformer–Mamba hybrid model is detailed as follows. Let \(\:B\) denote the batch size, \(\:L\) the length of the sliding time window, and \(\:F\) the number of input features (in this study, \(\:F=3\), corresponding to average casing pressure, daily water production, and flowback ratio). The input tensor of the model can thus be expressed as

$$\mathbf{X}\in\mathbb{R}^{B\times L\times F}$$
(3)

First, a one-dimensional convolutional layer (Conv1D) is applied to perform linear mapping and feature expansion on the input sequence, thereby enhancing the representation of local temporal patterns. The resulting output tensor is given by

$$\mathbf{X}_{c}\in\mathbb{R}^{B\times L\times D_{c}}$$
(4)

where \(\:{D}_{c}\) denotes the dimensionality of the convolutional feature space.

Subsequently, the convolutional output is fed into a Transformer encoder module. Through the multi-head self-attention mechanism, the Transformer models global dependencies across different time steps while preserving the temporal dimension, yielding an output tensor expressed as

$$\mathbf{X}_{t}\in\mathbb{R}^{B\times L\times D_{t}}$$
(5)

where \(\:{D}_{t}\) represents the hidden dimension of the Transformer.

On this basis, a Mamba module is introduced to perform state-space modeling of the time series. By means of a selective state update mechanism, the Mamba module compresses and propagates dynamic information over long sequences, producing an output tensor of the form

$$\mathbf{X}_{m}\in\mathbb{R}^{B\times L\times D_{m}}$$
(6)

where \(\:{D}_{m}\) denotes the state dimension of the Mamba module.

Finally, the temporal features are aggregated along the time dimension (e.g., by selecting the last time step or applying global average pooling) to obtain a fixed-dimensional representation, which is then passed through a fully connected (Dense) layer to generate the prediction. The final model output is expressed as

$$\widehat{\mathbf{y}}\in\mathbb{R}^{B\times 1}$$
(7)

corresponding to the predicted daily gas production at the future time step.
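The two aggregation options mentioned above (last time step or global average pooling) both reduce a B × L × D tensor to B × D before the Dense layer. A minimal sketch on nested Python lists follows; the function names and the toy tensor are illustrative assumptions, and the text does not state which option the final model uses.

```python
def last_step(x):
    """Keep only the final time step of each sequence: [B][L][D] -> [B][D]."""
    return [seq[-1] for seq in x]


def global_average_pool(x):
    """Average each feature over the time axis: [B][L][D] -> [B][D]."""
    return [[sum(col) / len(seq) for col in zip(*seq)] for seq in x]


# Hypothetical tensor with B = 1, L = 2, D = 2
x_m = [[[1.0, 2.0], [3.0, 4.0]]]
pooled_last = last_step(x_m)
pooled_mean = global_average_pool(x_m)
```

Selecting the last step emphasizes the most recent state, whereas average pooling weights all positions in the window equally.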

Benchmark model configuration

To evaluate the effectiveness of the proposed model, two commonly used models are selected as benchmarks: a Transformer–LSTM hybrid model and a standalone Mamba model. The Transformer–LSTM model combines the global attention mechanism of the Transformer with the recurrent memory structure of LSTM14,26,27 to capture both short-term and long-term dependencies in time-series data. The Mamba model is employed to assess the performance of a pure state-space architecture in the shale gas production forecasting task.

The basic structure of the LSTM unit and its gate mechanisms are illustrated in Fig. 3.

Fig. 3

Basic unit of the LSTM network.

The commonly used formulation for the LSTM model is given as follows:

$$f_{t}=\sigma\left(W_{f}\cdot\left[h_{t-1},x_{t}\right]+b_{f}\right)$$
(8)
$$i_{t}=\sigma\left(W_{i}\cdot\left[h_{t-1},x_{t}\right]+b_{i}\right)$$
(9)
$$\tilde{C}_{t}=\tanh\left(W_{C}\cdot\left[h_{t-1},x_{t}\right]+b_{C}\right)$$
(10)
$$C_{t}=f_{t}*C_{t-1}+i_{t}*\tilde{C}_{t}$$
(11)
$$o_{t}=\sigma\left(W_{o}\cdot\left[h_{t-1},x_{t}\right]+b_{o}\right)$$
(12)
$$h_{t}=o_{t}*\tanh\left(C_{t}\right)$$
(13)

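where \(\:{x}_{t}\) denotes the input at time step \(\:t\); \(\:{i}_{t}\), \(\:{f}_{t}\), and \(\:{o}_{t}\) represent the input gate, forget gate, and output gate, respectively; \(\:\tilde{C}_{t}\) and \(\:{C}_{t}\) are the candidate and updated cell states; \(\:{h}_{t}\) is the hidden state; and \(\:W\) and \(\:b\) are the corresponding weight matrices and bias vectors. The symbol σ denotes the sigmoid activation function, tanh denotes the hyperbolic tangent function, and \(*\) denotes element-wise multiplication.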
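To make the gate equations concrete, the sketch below evaluates one LSTM step for a scalar input and a scalar hidden state. It is a didactic simplification under stated assumptions: each gate has two scalar weights and a bias instead of weight matrices, and the `params` layout is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step following Eqs. (8)-(13).

    params[g] = (w_h, w_x, b) for the gates g in "f", "i", "c", "o".
    """
    def affine(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(affine("f"))          # forget gate, Eq. (8)
    i_t = sigmoid(affine("i"))          # input gate, Eq. (9)
    c_tilde = math.tanh(affine("c"))    # candidate cell state, Eq. (10)
    c_t = f_t * c_prev + i_t * c_tilde  # cell state update, Eq. (11)
    o_t = sigmoid(affine("o"))          # output gate, Eq. (12)
    h_t = o_t * math.tanh(c_t)          # hidden state, Eq. (13)
    return h_t, c_t


# With all-zero parameters every gate equals 0.5 and the candidate state is 0
zero_params = {gate: (0.0, 0.0, 0.0) for gate in "fico"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

With zero parameters the previous cell state is simply halved by the forget gate, which makes the role of each equation easy to check by hand.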

For a fair comparison, all models adopt the same input features, data preprocessing procedures, and training–testing split strategy.

ROA-based hyperparameter optimization strategy

To mitigate the uncertainty introduced by manual hyperparameter tuning and to further enhance prediction accuracy and generalization performance, the Rabbit Optimization Algorithm (ROA)28,29,30 is employed to perform global hyperparameter optimization for the proposed model. In ROA, each candidate hyperparameter configuration is treated as an individual “rabbit,” and the optimal solution is searched through an adaptive process that balances global exploration and local exploitation.

In this study, ROA is used to optimize key model hyperparameters, including the learning rate, hidden layer dimensionality, number of attention heads in the Transformer module, and the state dimension of the Mamba module. The optimization objective is defined as the root mean squared error (RMSE) on the validation set. During the iterative optimization process, the positions of individual rabbits are updated based on fitness evaluation, and the algorithm terminates when either the maximum number of iterations is reached or the objective function converges. The optimal hyperparameter configuration obtained by ROA is then adopted for final model training.
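The overall search loop (population of candidates, fitness evaluation, position update, stopping rule) can be sketched generically. The code below is not the published ROA update rule: it swaps in a simple move-toward-best step with shrinking random perturbation. The objective `toy_validation_rmse` is also a hypothetical stand-in for training the model and measuring validation RMSE.

```python
import random


def population_search(objective, bounds, pop_size=8, iterations=30, seed=0):
    """Generic population-based minimiser (a stand-in, NOT the ROA equations)."""
    rng = random.Random(seed)

    def clip(v, lo, hi):
        return max(lo, min(hi, v))

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [objective(p) for p in pop]
    best_idx = min(range(pop_size), key=fitness.__getitem__)
    best, best_fit = list(pop[best_idx]), fitness[best_idx]
    history = [best_fit]
    for it in range(iterations):
        scale = 1.0 - it / iterations  # exploration shrinks over time
        for k in range(pop_size):
            cand = [
                clip(x + rng.random() * (b - x)
                     + scale * rng.gauss(0.0, 0.1) * (hi - lo), lo, hi)
                for x, b, (lo, hi) in zip(pop[k], best, bounds)
            ]
            f = objective(cand)
            if f < fitness[k]:  # greedy replacement of the individual
                pop[k], fitness[k] = cand, f
                if f < best_fit:
                    best, best_fit = list(cand), f
        history.append(best_fit)
    return best, best_fit, history


def toy_validation_rmse(params):
    """Hypothetical smooth surrogate with its minimum at lr = 1e-3, hidden = 32."""
    log10_lr, hidden = params
    return ((log10_lr + 3.0) ** 2 + ((hidden - 32.0) / 32.0) ** 2) ** 0.5


bounds = [(-4.0, -1.0), (8.0, 64.0)]  # log10(learning rate), hidden size
best, best_fit, history = population_search(toy_validation_rmse, bounds)
```

Searching the learning rate on a log scale is a design choice of this sketch. Integer-valued hyperparameters such as the number of attention heads would need rounding before evaluation.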

Model training and evaluation

During model training, a five-fold cross-validation strategy is adopted to evaluate model stability and robustness. In each fold, the dataset is divided into training and validation subsets, with the training subset used for parameter learning and the validation subset used for performance assessment and hyperparameter optimization. After cross-validation, the model is retrained using the optimal hyperparameters and evaluated on an independent test set to assess its generalization capability.
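A five-fold split of sample indices can be sketched as below. The text does not state whether folds are shuffled or contiguous; contiguous folds are assumed here for simplicity, and the helper name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one validation fold, so every fold trains on roughly 80% of the samples and validates on the remaining 20%.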

Model performance is quantitatively evaluated using several commonly adopted regression metrics, including mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²)31. By comparing these metrics across different models, the predictive accuracy, stability, and generalization performance of the proposed Transformer–Mamba model are systematically assessed.

To eliminate scale differences among wells and improve the stability of model training, all input variables and the target variable (daily gas production) were normalized using Min–Max scaling prior to model training. As a result, the model was trained and evaluated in a normalized space. The predicted and observed values of daily gas production are therefore dimensionless and range between 0 and 1. Accordingly, the evaluation metrics reported in this study, including mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE), are dimensionless quantities that reflect relative prediction accuracy rather than absolute production rates. The coefficient of determination (R²) is invariant to linear scaling and thus retains its conventional interpretation.
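A minimal sketch of Min–Max scaling and its inverse is given below. The gas rates are hypothetical, and a single scaler fitted on one series is assumed, because the text does not specify whether scaling was applied per well or globally.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaler."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi


def scale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    return [v * (hi - lo) + lo for v in values]


gas = [12.0, 9.0, 6.0, 3.0]  # hypothetical daily gas rates, 10^4 m^3/d
lo, hi = fit_min_max(gas)
scaled = scale(gas, lo, hi)

# Under this assumption, a normalised RMSE or MAE maps back to physical
# units by multiplying by the target range (hi - lo).
rmse_scaled = 0.05  # hypothetical value
rmse_physical = rmse_scaled * (hi - lo)
```

The same range factor applies to MAE, whereas MSE scales with the square of the range.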

The corresponding formulas are as follows:

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\widehat{y}_{i}\right)^{2}$$
(14)
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\widehat{y}_{i}\right)^{2}}$$
(15)
$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\widehat{y}_{i}\right|$$
(16)
$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_{i}-\widehat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}$$
(17)

where \(\:{y}_{i}\) is the actual value, \(\:{\widehat{y}}_{i}\) is the predicted value, \(\:\bar{y}\) is the mean of the actual values, and \(\:n\) is the total number of samples.
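The four metrics in Eqs. (14)–(17) can be computed directly, as in the short sketch below. The sample values are hypothetical normalized rates, not results from this study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 as defined in Eqs. (14)-(17)."""
    n = len(y_true)
    residuals = [a - p for a, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_y = sum(y_true) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([0.2, 0.4, 0.6, 0.8], [0.25, 0.35, 0.65, 0.75])
```

For these toy values every residual has magnitude 0.05, so RMSE and MAE coincide. In general RMSE penalizes large errors more heavily than MAE.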

Ethical approval and compliance

All methods were carried out in accordance with relevant guidelines and regulations. No human participants or animals were involved in this study. The use of shale gas production data was approved by Southwest Oil & Gasfield Company under a data use agreement.

Results and discussion

Experimental data

The data used in this study are obtained from the production data management system of an oil and gas company and consist of continuous daily production records from multiple deep shale gas wells within a key development block. To ensure data consistency and comparability, all selected wells are completed in the same reservoir formation and adopt similar hydraulic fracturing designs and production schemes. The dataset covers the entire production history since well commissioning and is characterized by a long time span and complex dynamic variations.

The original dataset includes daily gas production, daily water production, average casing pressure, average tubing pressure, and flowback ratio, among other production and operational parameters, with daily gas production serving as the prediction target. After data preprocessing (outlier removal, missing-value imputation, and unit normalization), a sliding time-window approach is employed to construct supervised learning samples, thereby preserving the temporal evolution of production parameters. The flowback ratio characterizes the intensity of the post-fracturing flowback process. In this study, it is defined as the cumulative flowback fluid volume from the start of production at time \(\:{t}_{0}\) to time \(\:t\), divided by the total injected fracturing fluid volume:

$$FR(t)=\frac{V_{\mathrm{fb}}(t)}{V_{\mathrm{inj}}}=\frac{\int_{t_{0}}^{t}q_{\mathrm{fb}}(\tau)\,d\tau}{V_{\mathrm{inj}}}$$
(18)

where \(\:{q}_{\mathrm{fb}}\) is the flowback-fluid volumetric rate, \(\:{V}_{\mathrm{fb}}(t)\) is the cumulative flowback volume from the start of production/flowback \(\:{t}_{0}\) to time \(\:t\), and \(\:{V}_{\mathrm{inj}}\) is the total injected fluid volume.
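With daily records, the integral in the flowback-ratio definition reduces to a running sum. The sketch below assumes one rate value per day (a rectangle rule with a one-day step); the volumes are hypothetical and the helper name is illustrative.

```python
def flowback_ratio(daily_flowback, injected_volume):
    """Cumulative flowback ratio FR(t) from daily flowback volumes."""
    if injected_volume <= 0:
        raise ValueError("injected_volume must be positive")
    ratios, cumulative = [], 0.0
    for q in daily_flowback:
        cumulative += q  # one-day step, so rate times dt equals q
        ratios.append(cumulative / injected_volume)
    return ratios


# Hypothetical well: 10,000 m^3 injected, three days of flowback
fr = flowback_ratio([100.0, 150.0, 250.0], injected_volume=10000.0)
```

Because the numerator is cumulative, FR(t) is non-decreasing in time whenever the daily flowback volumes are non-negative.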

Data preprocessing

The production data used in this study were obtained from a field production management system of a deep shale gas block. Daily gas production, expressed in 10⁴ m³·d⁻¹, is taken as the prediction target, while pressure- and water-related parameters are used as candidate inputs. Prior to modeling, the raw data undergo outlier removal, missing-value interpolation, and unit normalization. To preserve the temporal evolution characteristics of production dynamics, supervised learning samples are constructed using a sliding time window approach.

The early production period is divided into a flowback phase and a subsequent stable production phase. The flowback phase is identified as the period immediately following hydraulic fracturing, during which fracturing fluids are continuously recovered and water production remains relatively high. This phase is characterized by a rapid increase in the flowback ratio and strong fluctuations in gas and water production. Only continuous production data after the onset of stable flowback behavior are retained to ensure consistency in time-series modeling.

In terms of input feature selection, a two-stage strategy combining engineering knowledge and statistical analysis is adopted. Based on shale gas production mechanisms, six candidate features are initially selected, including pressure-related parameters as well as flowback- and water-production-related variables. These features describe shale gas well performance from the perspectives of reservoir energy release, fracture conductivity evolution, and gas–water two-phase flow behavior.

Subsequently, the Spearman rank correlation coefficient is employed to evaluate the monotonic relationships between candidate features and daily gas production, thereby reducing redundancy and multicollinearity among inputs. The Spearman coefficient is defined as:

$$\:\rho\:=1-\frac{6\sum\:{d}_{i}^{2}}{n({n}^{2}-1)}$$
(19)

where \(\:{d}_{i}\) denotes the difference between the ranks of two variables, and \(\:n\) is the number of samples. Unlike Pearson correlation, Spearman correlation measures monotonic dependence rather than linear dependence, making it more suitable for nonlinear production data.

In this study, the correlation strength is classified as follows:

\(\:\mid\:\rho\:\mid\:\ge\:0.7\) indicates strong correlation, \(\:0.3\le\:\mid\:\rho\:\mid\:<0.7\) indicates moderate correlation, and \(\:\mid\:\rho\:\mid\:<0.3\) indicates weak correlation.
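The Spearman coefficient and the strength classes above can be reproduced with a few lines of standard-library Python. This sketch assumes no tied values, which is the condition under which the rank-difference formula is exact. The two series are hypothetical.

```python
def ranks(values):
    """Ranks 1..n in ascending order; assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman coefficient via the rank-difference formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def strength(rho):
    """Map |rho| to the classes used in this study."""
    a = abs(rho)
    return "strong" if a >= 0.7 else "moderate" if a >= 0.3 else "weak"


pressure = [30.0, 28.0, 25.0, 21.0, 20.0]  # hypothetical casing pressure
gas = [10.0, 9.5, 7.0, 7.5, 5.0]           # hypothetical daily gas rate
rho = spearman(pressure, gas)

# Min-max scaling is monotonic, so the ranks and the coefficient are unchanged
scaled_gas = [(g - min(gas)) / (max(gas) - min(gas)) for g in gas]
rho_scaled = spearman(pressure, scaled_gas)
```

Only one pair of ranks is swapped between the two toy series, giving a coefficient of 0.9 and a "strong" classification.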

The complete correlation matrix and classification thresholds are reported in Table 1 and visualized in Fig. 4.

The analysis results show that average casing pressure, daily water production, and flowback ratio exhibit stable and significant correlations with daily gas production and are therefore selected as the final model inputs. From a physical perspective, casing pressure reflects reservoir–wellbore pressure transmission and energy depletion, daily water production characterizes fracture flowback behavior and multiphase flow effects, and the flowback ratio represents fracture cleanliness and early-stage conductivity recovery. The combination of these parameters allows the model to capture both production dynamics and underlying physical mechanisms, thereby improving prediction reliability and engineering interpretability.

The weak correlations between fracturing fluid volume and both daily water and gas production (ρ ≈ 0.05) should be interpreted in the context of the data. The selected wells were completed under similar completion programs, resulting in a relatively narrow distribution of injected fluid volumes, which reduces its statistical leverage. Moreover, instantaneous daily water production during the early production period is affected by choke management, drawdown strategy, and transient wellbore/storage effects, which may decouple daily water rate from total injected volume. Note that min–max normalization to [0,1] does not change the rank ordering of samples and therefore does not affect Spearman correlation coefficients.

Fig. 4

Spearman correlation analysis.

Table 1 Correlation strength.

After data preprocessing and sliding-window sequence construction, all samples from all wells were combined into a unified dataset. A five-fold cross-validation strategy was employed for model training and evaluation, in which approximately 80% of the samples were used for training and the remaining 20% for validation in each fold. The final performance metrics were reported as the mean and standard deviation across the five folds. Unlike a fixed train–validation–test split (e.g., 70–20–10), this k-fold cross-validation scheme was adopted to reduce sampling bias and to obtain more reliable and statistically robust performance estimates under limited data conditions. Similar cross-validation strategies have been successfully applied in data-driven reservoir characterization studies, where ensemble learning models were shown to achieve stable generalization performance when predicting lithofacies and permeability for unseen intervals and wells based on well-log data alone32.

Prediction results and analysis

The preprocessed data were fed into the constructed Transformer-Mamba model, and the model’s hyperparameters were optimized using the Rabbit Optimization Algorithm. The predicted daily gas production was then compared with the actual production data to evaluate the model’s effectiveness. As shown in Fig. 5, the Transformer-Mamba model demonstrates good fitting performance and is capable of accurately forecasting future daily gas production.

Fig. 5

Prediction results of the Transformer-Mamba model for well X.

In the comparative experiments, three model architectures were selected for production forecasting: the baseline Mamba model, a hybrid model (Transformer-LSTM), and the proposed hybrid model (Transformer-Mamba). To ensure fairness in comparison, all three models utilized the same input data format and underwent hyperparameter tuning using the Rabbit Optimization Algorithm.

Figures 6, 7 and 8 collectively present the training and validation loss convergence behaviors of the Transformer–Mamba, Mamba, and Transformer–LSTM models, respectively. For all three models, the training loss decreases monotonically with increasing epochs and stabilizes within approximately 8–10 epochs, indicating efficient and stable optimization. The validation loss follows a consistent downward trend and remains close to the training loss without systematic divergence or late-epoch escalation, providing quantitative evidence against severe overfitting. Among the three models, the Transformer–Mamba model achieves faster convergence and a lower stabilized validation loss, while the standalone Mamba model exhibits slightly larger early-epoch fluctuations, and the Transformer–LSTM model shows smoother convergence but a higher final loss level. Overall, the comparable convergence patterns and bounded training–validation loss gaps demonstrate stable learning behavior and reasonable generalization performance across all models.

Fig. 6

Loss function evolution of the Transformer-Mamba model.

Fig. 7

Loss function evolution of the Mamba model.

Fig. 8

Loss function evolution of the Transformer-LSTM model.

Figures 9 and 10 show the comparison between the predicted results and the actual production data for the Mamba model and the Transformer-LSTM model. It can be observed that both models are capable of effectively capturing the overall trends in the production time series, with their prediction curves closely aligning with the actual data. Among the three models compared, the Transformer-Mamba model demonstrates the best performance in terms of peak prediction, inflection point detection, and fluctuation amplitude reconstruction, followed by the Transformer-LSTM model.

Fig. 9

Prediction results of the Mamba model.

Fig. 10

Prediction results of the Transformer-LSTM model.

To rigorously evaluate predictive accuracy and robustness, four statistical indicators (MAE, RMSE, MSE, and R²) were employed for model comparison. As shown in Fig. 11 and Table 2, the Transformer–Mamba model consistently achieves lower MAE, RMSE, and MSE values and a higher coefficient of determination than the Transformer–LSTM baseline, indicating improved predictive accuracy. This improvement suggests that incorporating Transformer-based global attention enhances the effectiveness of the Mamba-based temporal modeling framework.

The proposed Transformer–Mamba architecture serially integrates the global dependency modeling capability of the Transformer with the dynamic state-space representation of a Mamba-like module, making it well suited for shale gas production forecasting. To further examine whether the observed performance gains are statistically significant, a Wilcoxon signed-rank test was conducted on fold-wise results obtained from five-fold cross-validation. The test results confirm that the proposed model achieves statistically significant improvements over baseline models in terms of MAE, RMSE, and R² (p < 0.05).
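As an illustration of how a fold-wise Wilcoxon signed-rank test can be evaluated exactly for a small number of folds, the sketch below enumerates all sign patterns of the ranked differences. The fold-wise differences are hypothetical, and the text does not state whether a one-sided or two-sided alternative was used. The function assumes non-zero differences with untied magnitudes.

```python
from itertools import product


def wilcoxon_exact(diffs, alternative="two-sided"):
    """Exact signed-rank test; returns (W+, p-value)."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    w_plus = sum(r for r, d in zip(rank, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely
    null = [
        sum(r for r, s in zip(range(1, n + 1), signs) if s)
        for signs in product((0, 1), repeat=n)
    ]
    p_greater = sum(w >= w_plus for w in null) / len(null)
    p_less = sum(w <= w_plus for w in null) / len(null)
    if alternative == "greater":
        return w_plus, p_greater
    if alternative == "less":
        return w_plus, p_less
    return w_plus, min(1.0, 2.0 * min(p_greater, p_less))


# Hypothetical per-fold RMSE differences (baseline minus proposed), all positive
diffs = [0.009, 0.011, 0.007, 0.010, 0.008]
w_plus, p_one_sided = wilcoxon_exact(diffs, alternative="greater")
_, p_two_sided = wilcoxon_exact(diffs)
```

With five paired folds there are only 32 sign patterns, so the smallest attainable exact p-value is 1/32 ≈ 0.031 for a one-sided alternative and 0.0625 for a two-sided one. This is worth keeping in mind when reading significance levels derived from five folds.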

Quantitatively, compared with the standalone Mamba model, the ROA-Transformer–Mamba framework reduces RMSE from 0.0418 to 0.0328 (approximately 21.5%) and MAE from 0.02438 to 0.0174 (approximately 28.6%), while increasing R² from 0.8965 to 0.938. Relative to the Transformer–Mamba model without ROA, ROA-based hyperparameter optimization further reduces RMSE by 14.6% and MAE by 18.3%, demonstrating the effectiveness of ROA in improving generalization performance.
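The relative reductions quoted above follow directly from the reported metric values, as the short check below shows. Only the numbers stated in the text are used; the helper name is illustrative.

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


rmse_drop = relative_reduction(0.0418, 0.0328)   # Mamba vs ROA-Transformer-Mamba
mae_drop = relative_reduction(0.02438, 0.0174)
r2_gain = 0.938 - 0.8965
```

Rounded to one decimal place, the two reductions are 21.5% and 28.6%, and the absolute gain in R² is about 0.04.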

These performance gains are consistently observed across five-fold cross-validation and independent test wells, indicating that the proposed model not only achieves higher prediction accuracy but also exhibits a smaller generalization gap when applied to unseen production data. Overall, the results highlight the strong potential of the proposed framework for engineering applications and its value in supporting data-driven decision-making in shale gas field operations.

Fig. 11

Comparison of evaluation metrics for the three models.

Table 2 Comparison of evaluation metrics.

Discussion

To mitigate the bias introduced by a single random split, model performance was first assessed using five-fold cross-validation on the training set, followed by independent testing on the held-out dataset. Based on the preprocessed shale gas well production data, the predictive performance of the proposed Transformer–Mamba hybrid model was systematically evaluated. The time-series data were divided into training and test sets with a ratio of 8:2, where the training set was used for model parameter learning and the test set was employed to assess generalization performance on unseen data. The compared models are the standalone Mamba model, the Transformer–LSTM model, and the proposed Transformer–Mamba model. All three adopt identical input features and data-splitting strategies, and their key hyperparameters are uniformly optimized using the Rabbit Optimization Algorithm (ROA). The evaluation results indicate that the Transformer–Mamba model achieves the lowest RMSE and MAE values and the highest coefficient of determination (R²) on the test set, demonstrating superior prediction accuracy and stability compared with the benchmark models. Notably, the proposed model exhibits enhanced capability in tracking production trends and capturing peak behavior during periods of pronounced production fluctuations.

A comparison between the predicted production curves and the measured daily gas production further shows that the model not only accurately reproduces the overall production decline trend, but also responds reasonably to local fluctuations and inflection points.

As shown in Fig. 12, when average casing pressure decreases continuously, the predicted daily gas production declines at an accelerated rate, reflecting the controlling effect of reservoir energy depletion on production performance. During periods of relatively high or strongly fluctuating daily water production, the predicted gas rate is correspondingly suppressed or responds with a delay, indicating that the model has learned the influence of gas–water two-phase flow on effective gas transport pathways. In addition, during stages when the flowback ratio increases, the predicted production curve shows a recovery in production or a reduced decline rate, suggesting that the model identifies the positive effect of improved fracture cleanup on the recovery of fracture conductivity. These predictive behaviors are consistent with the typical production mechanisms observed during the flowback and early stable production stages of shale gas wells.

Fig. 12

Input parameter variation curves.

It should be noted that, although the proposed approach is fundamentally data-driven, its predictions are not merely the result of mathematical curve fitting. Instead, the model reflects, to a certain extent, the combined effects of key physical processes in shale gas production, including pressure transmission, fracture conductivity evolution, and multiphase flow effects. Therefore, the proposed model can be regarded as a surrogate representation of complex physical processes, providing an effective engineering tool for shale gas production forecasting in scenarios where detailed physical parameters or high-fidelity numerical simulations are unavailable. Nevertheless, prediction accuracy may be affected under extreme operating conditions, such as prolonged shut-ins, aggressive choke adjustments, or abnormal pressure drops. Addressing these scenarios by incorporating operational-state indicators or uncertainty analysis constitutes an important direction for future research.

Conclusion

Based on long-term production data from multiple wells in a deep shale gas block, this study develops a hybrid deep learning framework integrating the Rabbit Optimization Algorithm (ROA), Transformer, and Mamba modules for daily shale gas production forecasting. Compared with benchmark models such as Mamba and Transformer–LSTM, the proposed Transformer–Mamba model achieves lower RMSE and MAE values and a higher coefficient of determination (\(R^{2}\)) on the test set, demonstrating improved prediction accuracy and generalization capability, particularly during periods of pronounced production fluctuations and inflection points.

From an engineering perspective, the model’s predictive behavior is physically interpretable. The selected input features (average casing pressure, daily water production, and flowback ratio) represent reservoir energy depletion, multiphase flow effects, and fracture flowback dynamics, respectively. The model responses to variations in these parameters are consistent with the typical production mechanisms observed during the flowback and early stable production stages of shale gas wells. This indicates that, although the approach is data-driven, it captures key physical controls on shale gas production rather than relying solely on numerical fitting.

The proposed framework provides a practical data-driven tool for short- to mid-term production forecasting in situations where detailed geological characterization or full-physics numerical simulation is unavailable. Prediction performance may degrade under extreme operating conditions, such as prolonged shut-ins or aggressive choke adjustments. Future work will focus on incorporating operational-state indicators, uncertainty quantification, and transfer learning to further enhance model robustness and cross-block applicability.