Introduction

Subsurface applications for climate mitigation and sustainability are essential to achieving the net-zero emissions target for 2050 set by the Intergovernmental Panel on Climate Change1. Key geo-engineering strategies include the development of enhanced geothermal systems (EGS) for renewable energy generation and the geological storage of carbon dioxide (CO2) to reduce atmospheric greenhouse gas concentrations. The U.S. Geological Survey (USGS) estimates that EGS could provide over 500 GWe of electricity in the western United States alone2. In addition, carbon dioxide sequestration has the potential to store at least 1000 GtCO2 in saline aquifers, with further storage capacity available in depleted oil and gas reservoirs and coal formations3,4. Despite the immense potential to reduce greenhouse gases through these subsurface applications, a key challenge is the induced seismicity that can result from fluid injection operations5,6,7. Fluid injection perturbs in-situ stress fields in the subsurface, potentially reactivating preexisting faults or creating new fractures, either of which can compromise reservoir integrity8. Notable examples include the magnitude 5.7 Prague (2011) and magnitude 5.8 Pawnee (2016) earthquakes in Oklahoma after wastewater injection9,10,11,12, and a magnitude 3.9 earthquake following circulation tests for the EGS project in Vendenheim, France13. These events underscore the critical need for accurate forecasting of induced seismicity to ensure the safe implementation of subsurface technologies.

Accurately forecasting fluid-induced seismicity remains a challenge due to the complex interactions between geological, hydrological, and mechanical factors5,14. Traditional approaches rely on physics-based models to estimate induced seismicity by coupling fluid flow, mechanical deformation, and seismicity rates15,16,17,18. Although these models can capture intricate subsurface interactions, they face limitations in real-world applications. Challenges include uncertainties in fracture geometries, material heterogeneity, and in-situ stress conditions. Moreover, assumptions such as isotropic material properties or idealized fracture networks are often required, reducing predictive accuracy. High computational costs associated with three-dimensional modeling with complex fracture geometries further restrict their use in practical forecasting and operational decision-making15,17. As a result, discrepancies between modeled and observed seismicity frequently occur.

From a statistical perspective, the Epidemic-Type Aftershock Sequence (ETAS) model provides a forecasting approach for both natural and fluid-induced seismicity, based on the assumption that an earthquake can trigger clusters of aftershocks19,20. In particular, nonstationary ETAS models have effectively demonstrated their capability in detecting the impacts of fluid-induced seismicity by employing a nonstationary background rate19,21,22,23. This capability positions ETAS as a valuable tool for generating probabilistic earthquake forecasts. However, determining key parameters, including the timing of peak activity, solely based on statistical analysis has been challenging24. Thus, successful applications of ETAS models to spatiotemporal forecasting of microearthquakes (MEQs) due to fluid injection may be limited.

Data-driven approaches, particularly machine learning, have emerged as powerful complements or alternatives to traditional physics-based and statistical frameworks in a range of geoscientific applications25,26,27,28,29,30,31,32,33,34,35,36,37. These methods do not require detailed prior knowledge of uncertain subsurface properties but instead leverage large datasets from monitoring systems to identify patterns and correlations that can be used for forecasting. For instance, deep learning, with and without physical constraints, was used to forecast the seismicity rate, which was then used to estimate the maximum magnitude of fluid-induced microseismicity38. A bidirectional long short-term memory (LSTM) neural network predicted fluid-induced permeability evolution based on MEQ features, including seismic rate and cumulative logarithmic seismic moment39. In addition, an LSTM model was employed to predict average permeability changes inferred from the seismicity data. Another LSTM model was used to predict pore pressure and associated fault displacements given the fluid injection cycles40. These studies demonstrate that deep learning approaches can effectively capture the temporal evolution of permeability or micro-seismicity based on operational parameters. However, they often focus solely on temporal predictions without considering the spatial evolution of MEQs, which is critical for assessing the extent of affected areas and potential impacts. Furthermore, these models rely on simplified assumptions for permeability changes, such as a triggering front of the MEQ cloud that migrates in proportion to the square root of time since the start of injection, an assumption that is inconsistent with observed MEQ data41. These idealizations limit the applicability and accuracy of the models in complex scenarios.

Our study advances the forecasting of the spatiotemporal evolution of MEQs induced by hydraulic stimulation using a deep learning approach that tackles these challenges. Specifically, we employ transformer networks, a type of neural network architecture that uses self-attention mechanisms to capture complex dependencies within data sequences42,43. Compared with recurrent neural networks such as LSTMs, transformer networks can model long-range temporal dependencies more efficiently and are less susceptible to issues like vanishing gradients44. Their ability to focus on different parts of the input data through attention mechanisms makes them particularly well-suited for capturing both spatial and temporal patterns in MEQ data. Based on hydraulic stimulation history, our model predicts key MEQ features, including the cumulative number of MEQs, cumulative seismic moment, and the spatial extent of induced micro-seismicity. By incorporating both spatial and temporal information, the model provides more comprehensive forecasts that can inform real-time monitoring and risk mitigation strategies in subsurface activities.

Results

We use hydraulic stimulation data and MEQ history from the EGS Collab project45,46. Figure 1 shows the architecture of our transformer model for forecasting the spatiotemporal evolution of MEQs based on hydraulic stimulation and MEQ histories (see section “Method: Transformer neural network architecture”).

Fig. 1: Architecture of the transformer-based MEQ forecasting model.

Given input history from time steps \(t_0\) through \(t_n\), the model predicts MEQ features at future time steps \(t_{n+1}\) through \(t_n + l_{\rm future}\), where \(l_{\rm future}\) is the forecast range (see section “Method: Transformer neural network architecture”).

EGS Collab hydraulic stimulation datasets

We utilize hydraulic stimulation and MEQ data from the EGS Collab project, intermediate-scale (10–20 m) field tests at the Sanford Underground Research Facility in Lead, South Dakota. This study focuses on Experiment 1 data, aimed at producing a fracture network connecting an injection well to a production well via hydraulic fracturing47. A series of stimulations and flow tests were conducted at a depth of 1.5 km to re-open and generate hydraulic fractures in crystalline rock under reservoir-like stress conditions, with passive seismic data cataloged48 and Continuous Active-Source Seismic Monitoring45,49,50.

Figure 2 shows the stimulation-induced MEQs for each stimulation event along with the injection and production wells. Two 60 m boreholes were used, one for injection (E1-I) and one for production (E1-P). A total of five stimulation episodes were carried out in May 2018. During the first two stimulations, injections at flow rates below 1 L/min produced few MEQs. In addition, water leakage was observed between the production well and one monitoring well. The injection point was therefore moved to a notch at a depth of 50 m in the injection hole (red triangle in Fig. 2) from Stimulation 3 onward and used through Stimulation 5. During Stimulations 3–5, three continuous hydraulic stimulations were performed using controlled step-rate injections to re-open or create fractures around the injection well, with a maximum injection rate of 5 L/min, resulting in rich MEQ signals46,51. This study therefore uses data from Stimulations 3 to 5, generated from the same injection point with a rich MEQ history, to train neural networks. The data were recorded at 1-s intervals. Stimulations 3 and 4 each lasted approximately 1 h (3600 time steps), and the first 1 h and 10 min of Stimulation 5 were used (4100 time steps). These continuous records were segmented into overlapping input-output windows for supervised training, validation, and testing, as described in section “Data preprocessing: crop and normalization”.

Fig. 2: Spatial distribution of fluid-induced microearthquakes during hydraulic stimulations.

The figure shows the three-dimensional spatial distribution of microearthquakes (MEQs) generated during three hydraulic stimulation episodes. The solid black line represents the injection well, and the dashed black line represents the production well. A red triangle marks the injection point located at the 50 m notch in the injection well. Colored scatter points indicate MEQ locations: yellow for stimulation 3, green for stimulation 4, and purple for stimulation 5.

Figure 3 presents the series of stimulations along with the spatiotemporal MEQ data and corresponding magnitudes. Detailed information about the MEQs—including location, time, and magnitudes—was continuously monitored during the hydraulic stimulations45,46. In addition, to quantify the spatial extent of MEQs in response to fluid injection, we extracted the 95th and 50th percentiles (median) distances of the MEQ clouds from the injection points as a function of time. Although the monitoring array is extensive, the catalog still carries intrinsic uncertainties: hypocenter locations are accurate to about 1 m and there is no reported uncertainty range for magnitude45. These uncertainties limit the fidelity of the training data and establish a floor on achievable forecast accuracy. Additionally, including all raw events—without excluding those below the magnitude of completeness—could constrain the neural network’s capability to learn underlying MEQ patterns (Supplementary Fig. 1).
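The percentile distances described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that the cloud at time t contains every event located up to t, and that percentiles use linear interpolation between order statistics, neither of which is specified in the text. The function names `percentile` and `cloud_extent` are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics.

    The interpolation rule is an assumption; the text does not specify one.
    """
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cloud_extent(events, injection_point, t_now):
    """Return (P50, P95) of the Euclidean distances from the injection point.

    events: list of (time, (x, y, z)) tuples.
    Only events with time <= t_now are used (assumed cumulative cloud).
    Returns None if no event has occurred yet.
    """
    dists = [math.dist(xyz, injection_point) for t, xyz in events if t <= t_now]
    if not dists:
        return None
    return percentile(dists, 50), percentile(dists, 95)


# Three synthetic events at 1 m, 2 m, and 3 m from the injection point.
events = [(1, (1.0, 0.0, 0.0)), (2, (0.0, 2.0, 0.0)), (3, (0.0, 0.0, 3.0))]
p50, p95 = cloud_extent(events, (0.0, 0.0, 0.0), t_now=3)
```

Evaluating `cloud_extent` at every time step yields the P50 and P95 curves as functions of time.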

Fig. 3: Microearthquake and injection history for EGS Collab Stimulation 3-5.

Columns correspond to: Stimulation 3 (training data), Stimulation 4 (validation data), and Stimulation 5 (test data). The first row presents the hydraulic stimulation history, showing injection rate (blue) and injection pressure (red). The second row displays the locations of microearthquakes (MEQs) relative to the injection point, with distances calculated as the Euclidean distance between the injection point and observed microseismic events. P95 and P50 represent the 95th and 50th percentile distances over time. The third row shows the cumulative number of MEQ events and the magnitude of each discrete event.

Forecasting performance

We evaluate three forecast intervals (1 s, 15 s, and 30 s) using a sliding-window strategy. At each forecasting instant \(t_n\), the model ingests the entire monitoring history \([t_0, t_n]\) and predicts the subsequent interval \([t_{n+1}, t_n + l_{\rm future}]\), where \(l_{\rm future}\) is the forecast range (e.g., 1 s, 15 s, or 30 s). For instance, with a 15 s range, the model forecasts the next 15 s (e.g., \(t_{101}\) to \(t_{115}\)) from the history \(t_1\) to \(t_{100}\). Once the actual monitoring data for these 15 s are recorded, they are appended to the monitoring history. The model then uses the extended history \(t_1\) to \(t_{115}\) to forecast the following segment, \(t_{116}\) to \(t_{130}\), and this procedure repeats until the monitoring concludes. Because the model always conditions on actual measurements and never on its own previous outputs, forecasting errors do not accumulate over successive forecasts (Fig. 4).
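The rolling procedure can be illustrated with a short Python sketch. The callable `forecast_fn` is a hypothetical stand-in for the trained transformer, and `persistence` is a placeholder model used only to make the example runnable; neither comes from the study.

```python
def rolling_forecast(series, l_min, l_future, forecast_fn):
    """Forecast `series` in consecutive, non-overlapping blocks of l_future steps.

    forecast_fn(history, l_future) must return l_future predicted values.
    At every step the history contains observed values only, so earlier
    prediction errors are never fed back into the model.
    """
    predictions = []
    split = l_min
    while split + l_future <= len(series):
        history = series[:split]                 # observations up to the split
        predictions.extend(forecast_fn(history, l_future))
        split += l_future                        # the block has now been observed
    return predictions


def persistence(history, l_future):
    """Placeholder model (assumption): repeat the last observed value."""
    return [history[-1]] * l_future


observed = list(range(10))                       # synthetic monitoring record
forecast = rolling_forecast(observed, l_min=4, l_future=3, forecast_fn=persistence)
```

Here the first block is predicted from the first four samples and the second block from the first seven, mirroring the growth of the history from 100 to 115 samples in the 15 s example.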

Fig. 4: Schematic of the forecasting procedure.

X(k) denotes the cumulative monitoring input and Y(k) the corresponding forecast window; k is the segment index. The forecasting range is lfuture (see section “Method: Transformer neural network architecture”).

Figure 5 compares the forecasted and observed cumulative MEQ counts. For the 1-second forecast model the predicted curves are virtually indistinguishable from the ground truth, even on unseen data (validation R2 = 0.999, test R2 = 0.980). The 15-second forecast model maintains high fidelity (validation R2 = 0.929, test R2 = 0.972), with a slight tendency to overestimate MEQ growth during the most intense injection phases. The 30-s forecast model still captures the overall trend but systematically underpredicts the MEQ count late in each episode (validation R2 = 0.649, test R2 = 0.809). These results show that the transformer delivers excellent short-term forecasts, with accuracy declining gradually as the forecast window lengthens.

Fig. 5: Cumulative MEQ counts: observed data (black dotted) versus forecasts for the 1-s (blue), 15-s (red), and 30-s (green) models.

Each forecast curve is constructed by predicting successive, non-overlapping segments whose length equals the forecast interval and concatenating them to cover the full record. Panels show the training (left), validation (middle), and test (right) sets. Shaded bands denote  ±σ (one predicted standard deviation, corresponding to  ≈ 68% coverage under a Gaussian assumption).

Second, we forecast the cumulative logarithmic seismic moment, a proxy for the activated reservoir volume and thus a key metric for planning new production wells52,53. The cumulative moment \(\mathcal{M}\) is defined as39

$$\mathcal{M}(t_i)=\int_{t_0}^{t_i}\log M_0\,dt, \tag{1}$$

with

$$\log M_0=1.5\,M_w+13.5, \tag{2}$$

where M0 is the seismic moment, Mw the moment magnitude, t0 the start of injection, and ti the current injection time.
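As a rough illustration of Eqs. (1) and (2), the sketch below accumulates \(\log M_0\) over discrete events. The text does not state how the integral is discretized or scaled, so the running sum over events used here is an assumption, and the function names are hypothetical.

```python
def log_moment(mw):
    """Eq. (2): log M0 = 1.5 * Mw + 13.5."""
    return 1.5 * mw + 13.5


def cumulative_log_moment(events, n_steps):
    """Running sum of log M0 over events, sampled at each time step.

    events: list of (time_index, Mw) with 0 <= time_index < n_steps.
    Assumption: Eq. (1) is read as a sum over the events that have
    occurred up to each time step; events outside the record are ignored.
    """
    per_step = [0.0] * n_steps
    for t, mw in events:
        if 0 <= t < n_steps:
            per_step[t] += log_moment(mw)
    cumulative = []
    total = 0.0
    for value in per_step:
        total += value
        cumulative.append(total)
    return cumulative


# Two events at step 1 and one at step 3 (synthetic magnitudes).
curve = cumulative_log_moment([(1, -3.0), (1, -2.0), (3, -3.0)], n_steps=5)
```

The resulting curve is non-decreasing whenever every event has a positive \(\log M_0\).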

Figure 6 compares the predicted and observed cumulative moments for the 1-, 15-, and 30-s forecast models across the three data splits. The 1-s forecast model reproduces the observations almost exactly (validation R2 = 0.999, test R2 = 0.978). Performance remains high for the 15-s forecast model (validation R2 = 0.878, test R2 = 0.935), although the predictive bands widen compared with the 1-s case. The 30-s forecast model still captures the overall trend but underestimates the released seismic energy (validation R2 = 0.546, test R2 = 0.765). These results indicate that our neural network links hydraulic-energy input to seismic-energy release, providing reliable short-term estimates of cumulative moment while showing a gradual and interpretable loss of accuracy as the forecast range increases.

Fig. 6: Cumulative logarithmic seismic moment: observed data (black dotted) versus forecasts from the 1-, 15-, and 30-s models (blue, red, green) for the training (left), validation (middle), and test (right) sets.

Shaded bands denote  ±σ.

Accurately forecasting the spatial evolution of MEQ clouds is critical for delineating the affected area, guiding mitigation, and optimizing future well placement15. Figure 7 compares the spatial extent of the MEQ clouds across the training, validation, and test sets, quantified by the 50th and 95th percentiles of the Euclidean distance from the injection point. The 1-s and 15-s forecast models reproduce the ground truth trajectories of both the median distance (P50) and the far distance (P95), achieving R2 > 0.97 for the 1-s forecast model and R2 > 0.94 for the 15-s forecast model.

Fig. 7: Temporal evolution of the MEQ cloud’s spatial extent.

The three rows correspond to the training (top), validation (middle), and test (bottom) datasets. In each row, the solid curve shows the observed 50th-percentile distance (P50) and the dashed curve the observed 95th-percentile distance (P95). Forecasts from the 1-, 15-, and 30-s models are plotted in blue, red, and green, respectively. Shaded regions denote  ±σ (standard deviation).

Figure 8 illustrates the final stabilized extents predicted by these models: absolute errors are below 0.4 m for the 1-s model and below 2 m for the 15-s model (Table 1). For the 1-second case, the observed-predicted differences lie within the model’s  ±σ band, indicating that the discrepancies are consistent with the reported uncertainty. In contrast, the 15-s differences exceed σ, revealing the limitations of the mid-range model. The 30-s model underestimates both P50 and P95 in all data splits, highlighting its reduced reliability for long-range spatial forecasts.

Fig. 8: Spatial evolution of microearthquake (MEQ) clouds and forecast performance.

The figure shows the spatial distribution of MEQs and forecast results for different datasets and time horizons. Each row represents a dataset: training (top), validation (middle), and test (bottom). Each column shows projections on the XY, YZ, and ZX planes. Solid circles indicate the observed 50th-percentile radius (P50), while dashed circles represent the 95th-percentile radius (P95). Blue and red lines show forecasts from the 1-s and 15-s models, respectively, with shaded regions denoting  ±σ (one predicted standard deviation). Colored dots represent MEQs from different stimulation phases: yellow for Stim 3, green for Stim 4, and purple for Stim 5. Solid black lines indicate the injection well, dashed black lines the production well, and red triangles mark the injection point. All spatial dimensions are in meters.

Table 1 Final MEQ spatial extent

Discussion and conclusion

Our transformer model accurately forecasts fluid-induced MEQs, capturing both their temporal evolution and spatial growth (Table 2). This dual capability is, to the best of our knowledge, novel; earlier studies focused mainly on temporal predictions39,54. Reliable spatiotemporal forecasts are essential for estimating permeability changes and mitigating the risks associated with induced seismicity. In the following, we discuss how permeability enhancement can be inferred from monitoring data and model outputs, how fracture characteristics can be estimated, and the potentials and limitations of deep-learning-based forecasting for field-scale, fluid-induced earthquakes.

Table 2 R2 scores for all metrics and forecast models

Estimation of permeability enhancement

Estimating permeability enhancement is a critical task in EGS, yet direct measurements are challenging in the subsurface. This limitation also applies to our study—we aim to understand how permeability evolves during hydraulic stimulation, but no direct measurements were available from the field experiment. Although the correlation between MEQs and permeability remains elusive55, we derive a physically grounded rationale to indirectly estimate permeability using model outputs. Specifically, we apply the cubic law for permeability, which relates changes in fracture aperture to permeability change56,57:

$$\Delta k=\frac{\left(b_0+\Delta b\right)^{3}}{12s}-\frac{b_0^{3}}{12s} \tag{3}$$

where Δk is the permeability change, b0 is the initial fracture aperture, Δb is the aperture change, and s is the spacing between parallel fractures. Assuming that the initial aperture b0 is negligible compared to the aperture change (i.e., b0 ≪ Δb), we approximate the permeability evolution as \(\Delta k\approx \frac{\Delta b^{3}}{12s}\). Given that the EGS Collab Experiment 1 aimed to establish fracture networks via hydraulic fracturing (i.e., tensile fractures)55,58, we assume the seismic moment is linked to normal displacement by tensile opening. The equivalent moment M0 for a tensile opening can be expressed as39:

$$M_0=2GA\,\Delta u_n \tag{4}$$

where G is the shear modulus, A is the area of the fracture, and Δun is the normal displacement across the fracture. Taking the normal displacement Δun to scale with the aperture change Δb and assuming that the area A of the fracture is proportional to the aperture (A ∝ Δb)59, we obtain a scaling relation between seismic moment M0 and permeability change60:

$$\log M_0\propto \frac{2}{3}\log \Delta k \tag{5}$$

With these scaling relationships, we infer that the overall logarithmic permeability increment is linearly proportional to the logarithmic seismic moment, though this assumption primarily holds during early stimulation, where the initial aperture is substantially smaller than the aperture increment (i.e., b0 ≪ Δb).
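For readers who want the intermediate steps, the chain from Eqs. (3) and (4) to Eq. (5) can be written out as follows. This is a sketch under the stated assumptions (negligible initial aperture, \(A\propto \Delta b\)), plus one assumption made explicit here: the normal displacement \(\Delta u_n\) is taken to scale with the aperture change \(\Delta b\). The constant \(C\) collects \(G\), \(s\), and the proportionality factors.

$$\begin{aligned} \Delta k &\approx \frac{\Delta b^{3}}{12s} &&\Rightarrow\quad \Delta b \propto \Delta k^{1/3},\\ M_0 &= 2GA\,\Delta u_n \propto \Delta b\cdot\Delta b = \Delta b^{2} &&\Rightarrow\quad M_0 \propto \Delta k^{2/3},\\ \log M_0 &= \tfrac{2}{3}\log \Delta k + C. \end{aligned}$$

Under this scaling, a change of one unit in \(\log M_0\) corresponds to a change of 1.5 units in \(\log \Delta k\).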

During the first stimulation, the observed cumulative logarithmic seismic moment reaches ≈3 (Fig. 6 left), implying a permeability increase of roughly two orders of magnitude. The 1-s forecast reproduces this estimate, whereas the 15-s forecast model overpredicts the moment by about one order of magnitude, and the 30-s forecast model underpredicts it by a similar amount. Because the cumulative seismic moment predicted by our network can be mapped directly to permeability changes, the model provides a practical, indirect means of tracking permeability evolution during hydraulic stimulation, although this mapping is valid only for the initial seismic-moment range where the derivation’s assumptions hold.

Inference of the fracture characteristics

In fluid injection operations, we need to control the spatial extent of fracturing. As an example, in EGS fields, it is crucial to prevent MEQs from extending beyond the region between injection and production wells while enhancing permeability within this region through fracturing. Our model provides estimates of two spatial extents of MEQs: the 95th percentile distance (P95) and the 50th percentile distance (P50). P95 represents the far extent of MEQs, while P50 indicates the most active MEQ regions, which likely correspond to areas of greatest permeability increase due to fracture generation and re-opening.

The importance of tracking P95 and P50 becomes clear when the spatial extents from each stimulation are compared (Table 1). From stimulation 3 (training) to stimulation 4 (validation), the observed P95 grows by 3.85 m (from 10.23 to 14.08 m), while P50 retreats by 0.21 m (from 5.92 to 5.71 m), indicating a slight shrinkage of the seismically active zone. Our 1-s forecast model reproduces these shifts closely, predicting a 4.27 m increase in P95 (from 10.74 to 15.01 m) and a 0.13 m retreat in P50 (from 6.29 to 6.16 m); all absolute errors fall within the 1-s forecast model’s ±σ band. Between stimulation 4 (validation) and stimulation 5 (test), the observed P95 increases by 1.15 m (from 14.08 to 15.23 m), whereas P50 advances by 4.21 m (from 5.71 to 9.92 m). The 1-s forecast model again captures these trends, predicting a 0.94 m rise in P95 (from 15.01 to 15.95 m) and a 4.09 m increase in P50 (from 6.16 to 10.25 m). By accurately forecasting P50 and P95 in real time, the network enables practitioners to infer fracture propagation and activation, making it a practical tool for managing stimulation where direct measurements are not feasible.

Potential and challenges of deep learning forecasting

Among the various deep learning approaches, we chose the transformer model as our core architecture. The success of the transformer model is driven by several key factors. First, the self-attention mechanism allows the model to capture long-term dependencies42,61,62, which are crucial in fluid-induced seismicity, where MEQs are influenced by cumulative fluid injection, pore pressure changes, and perturbed in-situ stress conditions2. In particular, fluid-induced seismicity often exhibits long time intervals between injection and seismicity. For instance, the largest earthquake (local magnitude 3.9) at the deep geothermal site GEOVEN in Vendenheim occurred more than six months after shut-in63. The self-attention mechanism enables the model to weigh the importance of different input features over time, making it highly suited for sequential data44.

Second, transformers excel at processing spatiotemporal data64, which is vital for accurately predicting the spatial distribution of MEQs. This ability provides critical insights into fracture propagation65 and fluid migration66, both of which are key factors in assessing the effectiveness of hydraulic stimulation. The model’s performance in predicting the spatial extent of seismic events reflects its capacity to capture both the temporal and spatial dynamics of fluid injection-induced microseismicity. Third, the transformer’s non-recurrent architecture allows it to handle irregular time series data67, a common occurrence in microseismic monitoring due to variable injection schedules and operational pauses. This flexibility enhances the model’s robustness across different stimulation phases and geological settings, making it adaptable to varying conditions and data availability—a common challenge in real-world geophysical applications.

While the model shows promising results, extending it to large-scale field operations introduces additional uncertainties due to unknown geological heterogeneity and the extended temporal dependencies inherent to fluid-induced seismicity. The data used in this study were collected from an intermediate-scale (10–20 m) experiment with comprehensive monitoring tools from the EGS Collab project47,50. Such dense instrumentation may not be feasible in reservoir-scale engineering applications, raising questions about the model’s generalizability to less controlled, large-scale environments. One promising strategy for adapting deep learning forecasting techniques to larger-scale fluid-induced seismicity applications involves transfer learning with fine-tuning. For example, successful transferability between datasets from Utah FORGE and EGS Collab was recently demonstrated using appropriate fine-tuning methods39. Although further fine-tuning will likely be required to adjust the model to larger operational scales, the fundamental assumption remains that the neural network model learns generalizable signal patterns associated with fluid-induced MEQs. Additionally, integrating uncertainty quantification into predictions becomes increasingly important given the higher uncertainty inherent in real-field-scale operations. By incorporating these strategies, along with judicious monitoring, transformer networks could be systematically validated and effectively implemented at larger scales. Future work could involve training and validating the model’s performance with field-scale fluid-induced seismic data and hydraulic stimulation histories, thus ensuring robustness in more complex geological settings.

In summary, despite limitations related to monitoring systems and scale, this study presents a deep-learning-based approach for forecasting MEQs in response to fluid injection. The transformer model’s ability to predict both temporal and spatial evolution highlights its potential as a valuable tool in subsurface operations, with the prospect of improving safety and efficiency.

Method: transformer neural network architecture

We employ a transformer neural network to forecast the spatiotemporal evolution of fluid-induced microearthquakes (MEQs). The attention mechanism captures dependencies in the monitoring time series, allowing the model to learn patterns across multiple temporal scales. Figure 1 illustrates the overall architecture. Given a sequence of past monitoring data, the model predicts the future MEQ features. The following subsections describe data processing, network architecture, loss function, and hyperparameter tuning.

Data preprocessing: crop and normalization

We first construct training segments by sliding a growing stimulation history across the cumulative time series and advancing the forecast horizon in non-overlapping blocks. The monitoring data at discrete time index t are defined as:

$$\boldsymbol{x}(t)=\left[x_1(t),x_2(t),\ldots,x_6(t)\right]^{T}\in \mathbb{R}^{M}, \tag{6}$$

where the monitoring dimension M = 6 includes hydraulic stimulation features, (1) flow rate (x1) and (2) wellhead pressure (x2), and spatiotemporal MEQ features: (3) cumulative MEQ count (x3), (4) \(\log M_0\) (x4), (5) 95th percentile distance (x5), and (6) 50th percentile distance (x6).

The cropping procedure is controlled by two hyperparameters. The minimum history length \(l_{\min}\) specifies the number of monitoring samples that are always available, and the forecast horizon \(l_{\rm future}\) specifies how many future steps are predicted at once. For a monitoring record ending at \(t_{\rm end}\), the number of segments is

$$N=\frac{t_{\rm end}-l_{\min}}{l_{\rm future}} \tag{7}$$

For each segment index k ∈ {0, . . . , N − 1} the split time is set as

$$t_k^{\rm split}=l_{\min}+k\,l_{\rm future} \tag{8}$$

Thus, the cumulative monitoring input (X(k)) and the subsequent forecast window (Y(k)) are defined as:

$$\boldsymbol{X}^{(k)}=\{\boldsymbol{x}(t)\,|\,1\le t\le t_k^{\rm split}\}\in \mathbb{R}^{t_k^{\rm split}\times M} \tag{9}$$
$$\boldsymbol{Y}^{(k)}=\{\boldsymbol{x}(t)\,|\,t_k^{\rm split} < t\le t_k^{\rm split}+l_{\rm future}\}\in \mathbb{R}^{l_{\rm future}\times F} \tag{10}$$

where F = 4 corresponds to the forecast MEQ features: (1) cumulative MEQ count, (2) \(\log M_0\), (3) P95, and (4) P50. Each successive segment index k advances the split by \(l_{\rm future}\), ensuring that the predicted time blocks Y(k) are non-overlapping and contiguous, while the input window grows monotonically. This approach yields continuous, leakage-free forecasting segments that can be applied in real time once at least \(l_{\min}\) monitoring samples have been acquired (Fig. 4).
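The cropping in Eqs. (7)–(10) can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes that N is rounded down when the record length is not an exact multiple of the horizon, and that the four target features occupy columns 3 to 6 of the monitoring vector (zero-based indices 2 to 5), following the ordering in Eq. (6).

```python
def crop_segments(series, l_min, l_future, target_cols=(2, 3, 4, 5)):
    """Build (X, Y) pairs with a growing history and non-overlapping targets.

    series: list of rows, one per time step, each row holding M features;
            row i corresponds to time index t = i + 1.
    Returns a list of (X, Y) where X has t_split rows of M features and
    Y has l_future rows of len(target_cols) features.
    """
    t_end = len(series)
    n_segments = (t_end - l_min) // l_future     # Eq. (7); floor is an assumption
    segments = []
    for k in range(n_segments):
        t_split = l_min + k * l_future           # Eq. (8)
        x = series[:t_split]                     # Eq. (9): 1 <= t <= t_split
        y = [[row[c] for c in target_cols]       # Eq. (10): next l_future steps
             for row in series[t_split:t_split + l_future]]
        segments.append((x, y))
    return segments


# Synthetic record: 10 time steps, 6 features, each feature equal to t.
record = [[float(t)] * 6 for t in range(1, 11)]
segments = crop_segments(record, l_min=4, l_future=3)
```

With 10 samples, a minimum history of 4, and a horizon of 3, this produces two segments whose targets cover steps 5 to 7 and 8 to 10.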

To fairly normalize the data without information leakage from future steps, normalization is applied individually to each input window X(k). For each monitoring dimension m ∈ {1, . . . , M} and each segment k, we define the normalization using only the known input window as follows:

$$\tilde{\boldsymbol{x}}_m^{(k)}=\frac{\boldsymbol{x}_m^{(k)}-\min_{1\le t\le t_k^{\rm split}} x_m(t)}{\max_{1\le t\le t_k^{\rm split}} x_m(t)-\min_{1\le t\le t_k^{\rm split}} x_m(t)},\quad 1\le t\le t_k^{\rm split} \tag{11}$$

The normalization parameters obtained from each input window X(k) are then consistently applied to scale the corresponding forecast window Y(k). This ensures that normalization relies exclusively on information available at the prediction time, thus avoiding any data leakage from future observations.
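A minimal sketch of this per-window scaling is given below. The function name, the column mapping between targets and monitoring features, and the small `eps` guard against a constant feature (zero range) are assumptions added for illustration; the text does not describe how a zero range is handled.

```python
def normalize_segment(x, y, target_cols=(2, 3, 4, 5), eps=1e-12):
    """Min-max scale an input window and apply the same scaling to its targets.

    x: input window, list of rows with M features (statistics come from x only).
    y: forecast window, list of rows whose j-th entry is feature target_cols[j].
    eps: assumed lower bound on the range to avoid division by zero.
    """
    n_features = len(x[0])
    mins = [min(row[m] for row in x) for m in range(n_features)]
    maxs = [max(row[m] for row in x) for m in range(n_features)]
    spans = [max(maxs[m] - mins[m], eps) for m in range(n_features)]
    x_norm = [[(row[m] - mins[m]) / spans[m] for m in range(n_features)]
              for row in x]
    y_norm = [[(row[j] - mins[c]) / spans[c] for j, c in enumerate(target_cols)]
              for row in y]
    return x_norm, y_norm


# Two-step input window and a one-step forecast window (synthetic values).
x_win = [[0.0] * 6, [2.0, 4.0, 10.0, 1.0, 5.0, 3.0]]
y_win = [[20.0, 2.0, 10.0, 6.0]]
x_scaled, y_scaled = normalize_segment(x_win, y_win)
```

Because the statistics come from the input window alone, scaled targets can exceed 1 when a cumulative quantity keeps growing after the split, as in this example.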

Neural network architecture

Our transformer neural network architecture employs a multi-head attention mechanism designed to effectively capture temporal dependencies from variable-length sequences. Given an input monitoring sequence X(k), the multi-head attention layer processes the input as follows42:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}, \tag{12}$$

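where \(\mathbf{Q}=\boldsymbol{X}^{(k)}\mathbf{W}_Q\), \(\mathbf{K}=\boldsymbol{X}^{(k)}\mathbf{W}_K\), and \(\mathbf{V}=\boldsymbol{X}^{(k)}\mathbf{W}_V\) are the query, key, and value matrices, respectively; \(\mathbf{W}_Q\), \(\mathbf{W}_K\), and \(\mathbf{W}_V\) are learnable weight matrices; and \(d_k\) is the dimension of the key vectors.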
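To make Eq. (12) concrete, the following pure-Python sketch computes single-head scaled dot-product self-attention on a tiny input. It is illustrative only: the helper names are hypothetical, the weight matrices are fixed toy values rather than learned parameters, and the splitting into multiple heads and the output projection used in a full transformer layer are omitted.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """softmax(Q K^T / sqrt(d_k)), applied row by row."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, transpose(k))
    return [softmax([s / scale for s in row]) for row in scores]


def self_attention(x, w_q, w_k, w_v):
    """Eq. (12) with Q = X W_Q, K = X W_K, V = X W_V (single head)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    return matmul(attention_weights(q, k), v)


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # three time steps, two features

weights = attention_weights(matmul(x, identity), matmul(x, identity))
uniform_out = self_attention(x, identity, zeros, identity)
```

Each row of `weights` sums to one. With all keys set to zero, every score is equal, so each output row reduces to the average of the value vectors.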

Following the attention layer, a feed-forward network (FFN)68 is applied independently to each time step. The FFN consists of two linear transformations with a Rectified Linear Unit (ReLU) activation function:

$$\mathrm{FFN}(\mathbf{z})=\mathrm{ReLU}(\mathbf{z}\mathbf{W}_1+\mathbf{b}_1)\mathbf{W}_2+\mathbf{b}_2, \tag{13}$$

where \(\mathbf{z}\) denotes the input from the attention output, and \(\mathbf{W}_1\), \(\mathbf{W}_2\), \(\mathbf{b}_1\), and \(\mathbf{b}_2\) are learnable parameters.

To enhance training stability, layer normalization and residual connections are applied after both the attention and feed-forward layers, which supports effective gradient propagation through the network.
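The feed-forward sublayer of Eq. (13), wrapped in a residual connection and layer normalization, might look like the following sketch. The post-normalization ordering, the omission of the learnable gain and bias in layer normalization, and all dimensions are assumptions for illustration.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """Normalize each time step across its feature dimension (no learnable gain/bias)."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def ffn(z, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to every time step independently."""
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2

def ffn_sublayer(z, W1, b1, W2, b2):
    """Residual connection followed by layer normalization (assumed ordering)."""
    return layer_norm(z + ffn(z, W1, b1, W2, b2))

# Placeholder sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_hidden = 5, 4, 8
z = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
h = ffn_sublayer(z, W1, b1, W2, b2)
```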

After the attention and feed-forward layers, global average pooling and dense layers reduce the sequence to a single vector, producing predictions for the forecasting window Y(k). In particular, the model predicts both the mean (μ) and log-variance (\(\log {\sigma }^{2}\)) of the forecast MEQ features to quantify prediction uncertainty:

$${{{{\boldsymbol{y}}}}}_{{{{\rm{pred}}}}}\in {{\mathbb{R}}}^{{l}_{{{{\rm{future}}}}}\times 2F},\quad {{{{\boldsymbol{y}}}}}_{{{{\rm{pred}}}}}(t)=[{\mu }_{1}(t),\,...,\,{\mu }_{F}(t),\,\log {\sigma }_{1}^{2}(t),\,...,\,\log {\sigma }_{F}^{2}(t)]$$
(14)

The model is trained using the Adam optimizer69 with a heteroscedastic Gaussian negative log-likelihood (NLL) loss function70,71, augmented by a monotonicity penalty weighted by the hyperparameter (λ):

$${{{\mathcal{L}}}}=\,{\mbox{NLL}}\,({{{{\bf{y}}}}}_{{{{\rm{true}}}}},{{{{\bf{y}}}}}_{{{{\rm{pred}}}}})+\lambda \,{{\mbox{Penalty}}}_{{{{\rm{mono}}}}}$$
(15)

The NLL explicitly measures the discrepancy between predictions and true values, accounting for predictive uncertainty. Given the predicted mean (μpred) and log-variance (\(\log {\sigma }_{\,{\mbox{pred}}\,}^{2}\)), the NLL is defined as:

$$\,{\mbox{NLL}}\,({{{{\bf{y}}}}}_{{{{\rm{true}}}}},{{{{\bf{y}}}}}_{{{{\rm{pred}}}}})=\frac{1}{2NF}\mathop{\sum }_{i = 1}^{N}\mathop{\sum }_{f = 1}^{F}\left[\frac{{({y}_{i,f}^{{{{\rm{true}}}}}-{\mu }_{i,f}^{{{{\rm{pred}}}}})}^{2}}{{\sigma }_{i,f}^{2}}+\alpha \log \left({\sigma }_{i,f}^{2}\right)\right],$$
(16)

where N is the number of time steps in the forecast window, F is the number of MEQ target features, and α is a hyperparameter that discourages the model from inflating the variance. This formulation captures both prediction accuracy and model confidence, penalizing over- or under-confident forecasts.
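As a worked illustration of Eqs. (14) and (16), the sketch below splits a prediction array into means and log-variances and evaluates the weighted Gaussian NLL. The function name, the default `alpha`, and the toy values are assumptions; this is not the training code used for the reported results.

```python
import numpy as np

def gaussian_nll(y_true, y_pred, alpha=1.0):
    """Heteroscedastic Gaussian NLL with a weighted log-variance term.

    y_true: (N, F) observed targets
    y_pred: (N, 2F) network output laid out as [mu_1..mu_F, log_var_1..log_var_F]
    """
    F = y_true.shape[1]
    mu, log_var = y_pred[:, :F], y_pred[:, F:]
    sq_err = (y_true - mu) ** 2
    # 1/(2NF) * sum(...) equals 0.5 * mean(...) over all N*F entries;
    # exp(-log_var) is 1/sigma^2 and avoids an explicit division.
    return 0.5 * np.mean(sq_err * np.exp(-log_var) + alpha * log_var)

y_true = np.array([[1.0, 2.0]])
exact = gaussian_nll(y_true, np.array([[1.0, 2.0, 0.0, 0.0]]))
biased = gaussian_nll(y_true, np.array([[0.0, 2.0, 0.0, 0.0]]))
inflated = gaussian_nll(y_true, np.array([[1.0, 2.0, np.log(4.0), np.log(4.0)]]))
```

Predicting the log-variance rather than the variance keeps σ² positive without a constraint. In the toy example, an exact mean with unit variance gives zero loss, while inflating the variance to 4 is penalized through the α-weighted log term even though the mean is still exact.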

To enforce non-decreasing forecasts of the cumulative quantities, a monotonicity penalty is applied to the cumulative MEQ count and the cumulative logarithmic seismic moment. The penalty is defined as:

$${{\mbox{Penalty}}}_{{{{\rm{mono}}}}}=\mathop{\sum }_{t = 2}^{T}\left| \min \left(0,{{{{\bf{y}}}}}_{\,{\mbox{pred}}\,}^{t}-{{{{\bf{y}}}}}_{\,{\mbox{pred}}\,}^{t-1}\right)\right| \,,$$
(17)

where only the selected cumulative features are included in the penalty term.
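The penalty of Eq. (17) reduces to summing the magnitudes of all negative step-to-step differences. In the sketch below, applying the penalty to the predicted means and selecting cumulative features through a column index list (`cumulative_idx`) are illustrative assumptions.

```python
import numpy as np

def monotonicity_penalty(mu, cumulative_idx):
    """Sum of the magnitudes of all decreases in the selected cumulative features.

    mu:             (T, F) predicted means over the forecast window
    cumulative_idx: column indices of the cumulative features (hypothetical layout)
    """
    steps = np.diff(mu[:, cumulative_idx], axis=0)  # y^t - y^(t-1) for t = 2..T
    return np.abs(np.minimum(0.0, steps)).sum()

# Toy forecast: column 0 rises and then drops by 0.5; column 1 drops by 1 twice.
mu = np.array([[1.0, 5.0], [2.0, 4.0], [1.5, 3.0]])
penalty_first = monotonicity_penalty(mu, [0])
penalty_both = monotonicity_penalty(mu, [0, 1])
```

Increases contribute nothing, so a non-decreasing forecast incurs zero penalty and the term only influences training when the cumulative predictions drop.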

Finally, all predictions are rescaled using the inverse of the normalization applied during preprocessing. The model performance is evaluated using the coefficient of determination (R2):

$${R}^{2}=1-\frac{\mathop{\sum }_{i = 1}^{n}{\left({Y}_{i}-{\hat{Y}}_{i}\right)}^{2}}{\mathop{\sum }_{i = 1}^{n}{\left({Y}_{i}-\bar{Y}\right)}^{2}}$$
(18)

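where \({Y}_{i}\) are the observed values, \({\hat{Y}}_{i}\) are the corresponding predictions, \(\bar{Y}\) is the mean of the observations, and Y includes the four spatiotemporal MEQ features.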
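Eq. (18) can be checked on a small example with the sketch below. Flattening all features into one vector before computing the score is an assumption; the text does not state whether the four features are pooled or scored separately.

```python
import numpy as np

def r2_score(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()  # pooling features is an assumption
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

r2_example = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

In this example the residual sum of squares is 1 and the total sum of squares is 2, so R2 = 0.5. A perfect forecast gives R2 = 1, predicting the observed mean everywhere gives R2 = 0, and forecasts worse than the mean give negative values.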

Neural-network hyper-parameter tuning

The transformer model is trained to forecast spatiotemporal MEQs from hydraulic-stimulation history and past MEQ responses. While network weights are learned automatically, several settings—loss-function coefficients, architectural widths, batch size, dropout rate, and penalty weights—must be chosen by the user72,73. Supplementary Table 1 lists the values that remain fixed in every experiment.

Two coefficients are tuned by grid search: β (the variance-regularization weight inside the heteroscedastic Gaussian NLL term, written as α in Eq. (16)) and λ (the weight on the monotonic-increase penalty applied to cumulative MEQ count and cumulative seismic moment). For each forecast horizon lfuture ∈ {1, 15, 30}, models are trained with β, λ ∈ {0.1, 1.0, 10.0}. Validation R2 scores identify the optimal pair (β, λ); the corresponding results appear in Supplementary Table 2.
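The search procedure can be outlined as a loop over horizons and coefficient pairs. In this sketch, `train_and_validate` is a hypothetical callable standing in for model training plus validation scoring, and `toy_score` is an artificial function used only to make the example executable; neither reflects the actual training code or the values in Supplementary Table 2.

```python
from itertools import product

def grid_search(train_and_validate, horizons=(1, 15, 30), grid=(0.1, 1.0, 10.0)):
    """Return, for each horizon, the (beta, lam) pair with the highest validation score.

    train_and_validate(l_future, beta, lam) -> validation R2 (hypothetical interface)
    """
    best = {}
    for l_future in horizons:
        scores = {
            (beta, lam): train_and_validate(l_future, beta, lam)
            for beta, lam in product(grid, grid)  # 3 x 3 = 9 candidate pairs
        }
        best_pair = max(scores, key=scores.get)
        best[l_future] = (best_pair, scores[best_pair])
    return best

def toy_score(l_future, beta, lam):
    """Artificial stand-in score, NOT real model output; peaks at beta=1.0, lam=0.1."""
    return -abs(beta - 1.0) - abs(lam - 0.1) - 0.01 * l_future

best = grid_search(toy_score)
```

Each horizon requires nine training runs under this grid, so the three horizons together involve 27 trained models.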

Short-horizon models, with forecast windows of up to fifteen seconds, achieve excellent accuracy; for example, the lfuture = 15 model reaches \({R}_{\,{\mbox{val}}\,}^{2}=0.924\). As the horizon lengthens, performance degrades: at lfuture = 30 the best model attains \({R}_{\,{\mbox{val}}\,}^{2}=0.046\). The horizon-specific models reported in Supplementary Table 2 are used for all subsequent experiments.