Introduction

Background and motivation

With the rising global energy demand and increasingly stringent environmental regulations, wind energy, as a clean and renewable resource, has taken a prominent role in the energy strategies of many countries1. In recent years, global initiatives, supported by policies and technological innovations, have continuously increased wind power installations to reduce dependence on fossil fuels, lower carbon emissions, and mitigate environmental pollution2. However, the intermittency and volatility of wind power significantly impact the stable operation of power grids, making accurate and reliable wind power forecasting a critical factor for large-scale grid integration3.

Literature review

Wind power forecasting is inherently challenging due to its high uncertainty and nonlinear characteristics. Fluctuations in wind speed, regional environmental diversity, and climate change result in complex, multi-scale dynamics in wind power data4. To address these challenges, various forecasting methods have been proposed, yet improving the accuracy and reliability of wind power prediction remains a key research focus5. Existing forecasting approaches can be broadly classified into physical models, statistical models, and machine learning-based methods. However, each type of model has inherent limitations, making it difficult to comprehensively address the complexities of wind power forecasting.

Physical models

Physical models simulate wind turbine operations based on meteorological and geographical inputs6. Though grounded in well-established aerodynamic principles and suitable for short-term forecasting, they heavily rely on high-resolution input data. As a result, their accuracy deteriorates when real-time or fine-grained data is unavailable. Moreover, physical models are inherently deterministic and struggle to capture the stochastic and nonlinear nature of wind power time series7,8.

Statistical models

Statistical models such as ARIMA and SVR9 are widely used due to their interpretability and simplicity. However, these models typically assume linearity and stationarity, which are seldom satisfied in wind power data10. They lack the flexibility to capture abrupt fluctuations, multi-scale variations, and temporal dependencies that extend beyond their short memory horizons11.

Machine learning-based models

Early machine learning models like ANN and SVM introduced nonlinearity into forecasting frameworks, yet they still struggled to model complex temporal structures12. More recently, deep learning models such as CNNs and LSTMs have gained popularity13,14. CNNs effectively extract spatial or local features but lack the temporal modeling capacity required for time series. LSTMs are more adept at capturing long-term dependencies; however, they typically operate in the time domain and thus cannot effectively model frequency-domain patterns15. This leads to suboptimal forecasting in cases where spectral characteristics are significant.

Transformer-based architectures such as Informer and TFT have introduced global attention mechanisms to model long-range temporal dependencies more effectively16,17,18. FEDformer further expands upon this by incorporating seasonal-trend decomposition and frequency-domain attention to reduce time-domain noise and highlight periodic patterns19. SEAformer similarly adopts frequency-domain decomposition to strengthen long-horizon predictions20. While these approaches offer significant advancements, they tend to focus on high-level temporal patterns and often ignore low-frequency drifts or fine-grained local structures. Additionally, attention mechanisms in these models increase computational cost, limiting real-time deployment potential.

Moreover, most existing deep models are single-path and monolithic, meaning they fail to differentiate between local detail learning and global trend extraction. This often leads to a loss of interpretability and reduced adaptability in diverse operational contexts. To address these issues, ensemble and hybrid architectures have gained attention.

Recent literature shows increasing interest in hybrid and multi-module models. For instance, hybrid CNN-LSTM models attempt to combine local spatial and long-term temporal learning21, and wavelet-LSTM frameworks integrate wavelet decomposition to capture multi-resolution frequency features22. Attention modules such as CBAM or squeeze-and-excitation networks have also been applied to emphasize salient temporal features23. However, many of these hybrid models remain heuristic in their design, and few systematically explore the frequency characteristics of wind power time series or incorporate residual modeling strategies.

Additionally, studies rarely consider post-processing of prediction errors, such as modeling residuals for correction24. In fact, error sequences often contain structured information (e.g., cyclic deviation, bias), which can be leveraged to improve accuracy. Yet this aspect remains under-explored in most mainstream models.

Research motivation and proposed framework

Although numerous models have achieved notable progress by improving either temporal or spectral representations, few have jointly addressed multi-scale frequency characteristics, long-term temporal dependencies, and nonlinear dynamics within a unified architecture. Moreover, residual learning and post-correction mechanisms remain largely unexplored in the field of wind power forecasting. These limitations restrict existing models from effectively capturing the complex temporal–spectral patterns and nonlinear behaviors inherent in wind power time series.

To overcome the above challenges, this study proposes a hybrid deep learning architecture that systematically models the multi-scale, nonstationary, and nonlinear characteristics of wind power data. Specifically, Wavelet Transform Convolution (WTC) is utilized to extract localized spectral components across multiple frequency bands; Long Short-Term Memory (LSTM) networks capture long-range temporal dependencies; the Time Series Lightweight Adaptive Network (TSLANet) enhances attention efficiency while maintaining global contextual awareness; the Frequency-Enhanced Channel Attention Mechanism (FECAM) emphasizes key frequency-domain features; and an attention mechanism based on FastKAN provides expressive yet compact nonlinear transformations.

Furthermore, a Least Squares Support Vector Machine (LSSVM) is incorporated for residual error correction in the post-prediction stage, which significantly improves stability and accuracy in multi-step forecasting tasks.

The proposed integrated framework addresses several key research gaps by:

  1. capturing multi-scale temporal–frequency patterns that are often overlooked by time-domain models;
  2. modeling both global and local dependencies in a computationally efficient manner;
  3. introducing an interpretable frequency-domain attention mechanism;
  4. achieving enhanced nonlinear representation with reduced model complexity; and
  5. integrating an LSSVM-based residual correction mechanism that improves robustness in multi-step forecasting.

By jointly leveraging time and frequency domain representations and combining deep and shallow learning paradigms, the proposed model achieves superior forecasting accuracy, robustness, and interpretability, providing a comprehensive solution for real-world wind power prediction tasks.

Wind power prediction model structure

Data characterization and exogenous feature analysis

Effective wind power forecasting requires not only advanced modeling techniques but also a deep understanding of the input data’s characteristics. In this study, we utilize a dataset collected from a wind farm that includes six key features measured by sensors mounted on a meteorological mast: wind speed, wind direction, air density, turbulence intensity, wind shear below hub height, and the corresponding power output. These features collectively reflect both the energy potential and operational conditions affecting wind turbines.

To quantitatively evaluate the relative importance and influence patterns of these exogenous variables, we conducted a model-agnostic interpretability analysis using SHAP (SHapley Additive exPlanations)25, with the results shown in Fig. 1. The results reveal that wind speed is the most critical predictor, displaying a strong positive correlation with power output. This aligns with the aerodynamic principle that wind power increases cubically with wind speed. In contrast, turbulence intensity shows a negative contribution to power prediction, consistent with its physical role in inducing unsteady flow conditions that impair turbine efficiency. Moderate yet meaningful contributions were also observed from air density and below-hub-height wind shear, which indirectly reflect changes in atmospheric pressure and vertical wind gradient. Wind direction, however, had a negligible impact on the prediction, likely due to modern wind turbines’ ability to yaw and maintain optimal alignment with prevailing winds.

Fig. 1. Feature SHAP analysis.

This data-driven insight serves as a foundation for model component selection in our proposed hybrid architecture. The observed multi-scale variability in wind speed and turbulence intensity justifies the incorporation of WTC for localized frequency-domain feature extraction. The long-term temporal dependencies intrinsic to meteorological sequences support the use of LSTM networks, which excel at capturing sequential memory over extended horizons. To further enhance the model’s efficiency and capture global contextual relationships without incurring prohibitive computational costs, we employ the TSLANet. TSLANet effectively balances attention expressiveness and complexity, allowing the model to focus adaptively on salient temporal segments while maintaining scalability for long sequences.

Given that the relative contribution of each input feature is non-uniform and evolves over time, we integrate the FECAM to emphasize important variables dynamically, especially under shifting atmospheric regimes. Furthermore, the nonlinear and nonstationary relationships observed between features and outputs motivate the use of an attention mechanism based on FastKAN, which provides a compact and flexible approximation of complex functional mappings. Lastly, to refine the final output and address systematic prediction errors, LSSVM is used in a post-processing stage to correct residuals, leveraging potential hidden patterns not captured by the primary forecasting modules.

In summary, this analysis demonstrates that each modeling technique in our architecture was chosen based on empirical evidence derived from data characteristics. By aligning feature behavior with appropriate algorithmic capabilities—spanning time–frequency transformation, temporal memory, attention adaptation, nonlinear mapping, and error correction—we construct a forecasting framework that is both theoretically grounded and practically effective for complex wind power prediction tasks.

Module design rationale

The wind power forecasting model presented in this paper comprises five core modules: WTC, LSTM, TSLANet, FECAM, and an attention mechanism based on FastKAN. The overall process design is illustrated in Fig. 2.

Fig. 2. Overall process design of the model.

The proposed model is constructed to leverage the complementary strengths of time-domain, frequency-domain, and nonlinear modeling techniques for wind power forecasting. The architecture integrates a sequence of functionally distinct yet synergistic modules to address the multi-scale, nonstationary, and nonlinear characteristics of wind data.

First, the WTC module serves as a learnable wavelet-based encoder, performing multi-resolution decomposition to extract temporal features at different frequency scales. Compared to conventional CNNs, WTC preserves fine-grained time–frequency information while reducing noise through hierarchical filtering and reconstruction.

To capture temporal dependencies, the output of WTC is processed by an LSTM layer followed by TSLANet, a temporal-spectral learning module composed of the Adaptive Spectral Block (ASB) and the Interactive Convolution Block (ICB). ASB performs adaptive frequency-domain filtering based on FFT, enhancing dominant spectral components via learnable complex weighting and data-driven masking. In parallel, ICB enhances time-domain interactions through depth-wise convolutions and feature mixing, complementing ASB’s spectral emphasis. Together, the two blocks enable TSLANet to capture both spectral and temporal dynamics.

To further refine high-frequency representations, we incorporate FECAM, a channel attention module inspired by the Discrete Cosine Transform (DCT). FECAM selectively amplifies informative frequency-aware features across channels, helping the model focus on fluctuation-prone patterns.

Finally, the enhanced representations are passed to FastKAN, a kernel-based nonlinear mapping layer capable of approximating complex functional relationships more efficiently than traditional MLPs. FastKAN enables the model to project rich feature embeddings into the output space with high flexibility and expressiveness.

Each module contributes a distinct capability: WTC for multiscale temporal decomposition, LSTM for long-range temporal dependencies, TSLANet for dynamic temporal-spectral learning, FECAM for high-frequency enhancement, and FastKAN for nonlinear regression. The sequential coupling ensures that information from different domains is progressively integrated, enabling accurate and robust forecasting.

Wavelet convolutions

To effectively capture the complex multi-frequency characteristics of wind power data, the model first applies a Wavelet Transform Convolution (WTC) for preprocessing. Leveraging the multi-resolution capabilities of the wavelet transform, WTC enables joint analysis in the time and frequency domains, allowing for the extraction of features across different temporal scales26. This improves the quality of representations fed into subsequent modules. Moreover, by extending the receptive field without increasing parameter complexity, WTC enhances the model’s ability to capture both low- and high-frequency components, which is critical for modeling the non-stationary and long-range dependencies inherent in wind power series. The WTC architecture is illustrated in Fig. 3.

Fig. 3. Schematic diagram of WTC structure.

The WTC decomposes the input data into distinct frequency components using convolution operations. For a given wind power series \(X = [x_{1} ,x_{2} , \ldots ,x_{n} ]\), the wavelet transform convolution can be expressed as follows:

$$Y_{{{\text{WT}}}} = {\text{WTConv}}(X) = [{\text{WT}}(x_{1} ),{\text{WT}}(x_{2} ), \ldots ,{\text{WT}}(x_{n} )]$$
(1)

In this context, \({\text{WT}}(x)\) represents the one-dimensional wavelet transform of the input sequence \(x\). Specifically, high-frequency (detail) and low-frequency (trend) information are extracted through high-pass and low-pass filters, respectively, using the following formulas:

$$Y_{{\text{WT, hi}}} = \sum\limits_{i} {x_{i} } \cdot h(i)$$
(2)
$$Y_{{\text{WT, lo}}} = \sum\limits_{i} {x_{i} } \cdot l(i)$$
(3)

Here, \(h(i)\) and \(l(i)\) are the coefficients for the high-pass and low-pass filters. After obtaining and concatenating the high- and low-frequency components, this combined information is used as input for the next layer.
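As a minimal illustration of Eqs. (2) and (3), the following Python sketch performs one level of wavelet filtering. It assumes the two-tap Haar filters and a stride of 2, chosen only for brevity; the function name `haar_dwt` is hypothetical, and the sketch is not the WTC implementation used in this study.

```python
import math


def haar_dwt(x):
    """One level of a Haar wavelet decomposition of an even-length sequence.

    Returns (low, high): the low-pass (trend) and high-pass (detail)
    coefficients. Haar filters and stride 2 are assumptions of this sketch.
    """
    if len(x) % 2 != 0:
        raise ValueError("sequence length must be even")
    s = 1.0 / math.sqrt(2.0)
    l = (s, s)    # low-pass filter coefficients l(i)
    h = (s, -s)   # high-pass filter coefficients h(i)
    low, high = [], []
    for k in range(0, len(x), 2):
        window = x[k:k + 2]
        # Eq. (3): weighted sum with the low-pass coefficients
        low.append(sum(xi * li for xi, li in zip(window, l)))
        # Eq. (2): weighted sum with the high-pass coefficients
        high.append(sum(xi * hi for xi, hi in zip(window, h)))
    return low, high


low, high = haar_dwt([4.0, 4.0, 2.0, 6.0])
```

For this input, the low-pass branch returns the scaled pairwise sums and the high-pass branch the scaled pairwise differences, so the flat first pair produces a zero detail coefficient.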

After processing by the WTC, the wind data is decomposed into distinct frequency components, enriching the input data for the subsequent LSTM. With temporal patterns separated into different scales, the LSTM can effectively learn temporal dependencies without needing to isolate high- and low-frequency components. Additionally, WTC compresses the data into fewer, more relevant features, enabling the LSTM to focus on capturing temporal relationships with fewer parameters. This approach helps prevent overfitting, particularly in the presence of noisy or non-stationary wind power data.

Long short-term memory networks

The LSTM manages long-term dependencies using memory cells and gating mechanisms27. Its structure is shown in Fig. 4.

Fig. 4. LSTM structure diagram.

For the input sequence \(Y_{{{\text{WT}}}}\), the LSTM update rules are as follows:

$$\begin{gathered} f_{t} = \sigma (W_{f} \cdot [h_{t - 1} ,Y_{{{\text{WT}},t}} ] + b_{f} ) \hfill \\ i_{t} = \sigma (W_{i} \cdot [h_{t - 1} ,Y_{{{\text{WT}},t}} ] + b_{i} ) \hfill \\ o_{t} = \sigma (W_{o} \cdot [h_{t - 1} ,Y_{{{\text{WT}},t}} ] + b_{o} ) \hfill \\ \tilde{C}_{t} = \tanh (W_{C} \cdot [h_{t - 1} ,Y_{{{\text{WT}},t}} ] + b_{C} ) \hfill \\ C_{t} = f_{t} \cdot C_{t - 1} + i_{t} \cdot \tilde{C}_{t} \hfill \\ h_{t} = o_{t} \cdot \tanh (C_{t} ) \hfill \\ \end{gathered}$$
(4)

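Here, \(f_{t}\), \(i_{t}\), and \(o_{t}\) represent the forget, input, and output gates, respectively; \(\tilde{C}_{t}\) is the candidate cell state, \(C_{t}\) is the cell state, and \(h_{t}\) is the hidden state, which also serves as the output at time step \(t\). \(\sigma\) denotes the sigmoid function, and \(W\) and \(b\) with the corresponding subscripts are the weights and biases of each gate.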
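The update rules in Eq. (4) can be traced with a small sketch. To keep it readable, the example below assumes a scalar input and a scalar hidden state, so each weight matrix reduces to a pair of numbers; the function name `lstm_step` and the parameter layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following Eq. (4) for scalar input and hidden state.

    params maps each gate name ("f", "i", "o", "c") to a tuple
    (w_h, w_x, b): the weights on h_{t-1} and x_t, and the bias.
    """
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))        # forget gate
    i_t = sigmoid(pre_activation("i"))        # input gate
    o_t = sigmoid(pre_activation("o"))        # output gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # new cell state
    h_t = o_t * math.tanh(c_t)                # new hidden state
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate state to 0, so the cell state is simply halved at each step. This makes the role of the forget gate easy to verify by hand.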

By feeding the frequency information \(Y_{{{\text{WT}}}}\), derived from wavelet decomposition, into the LSTM, the model can capture both short- and long-term dependencies across different frequencies. Combined with the LSTM’s ability to handle extended temporal dependencies, this approach allows the model to better manage irregular patterns and abrupt changes in wind power generation. This improves prediction accuracy and enhances model robustness, which is essential for practical wind power generation applications.

Time series lightweight adaptive network

To enhance the extraction of complex information from frequency domain signals, this paper introduces the TSLANet module28. The core components of TSLANet are the Adaptive Spectral Block (ASB) and the Interactive Convolution Block (ICB). The frequency-adaptive filtering in ASB, along with the feature interaction mechanism in ICB, significantly enhances the accuracy of wind power predictions. The TSLANet structure is shown in Fig. 5.

Fig. 5. TSLANet structure diagram.

The ASB module processes the input sequence in the frequency domain using the Fourier transform and applies an adaptive frequency mask to filter out noise, retaining only the frequency components valuable for prediction. Its primary function is to adaptively reduce high-frequency noise while preserving essential frequency information, thereby improving the model’s capacity to capture complex wind speed variations. ASB effectively addresses frequency fluctuations, periodic patterns, and sudden changes in wind speed, enhancing the model’s adaptability across different time scales.

For the input time series \(X \in {\mathbb{R}}^{B \times N \times C}\), we first apply the Fast Fourier Transform (FFT) to obtain its frequency domain representation:

$$X_{{{\text{FFT}}}} = {\text{FFT}}(X)$$
(5)

Next, an adaptive filter is applied to isolate the dominant, high-energy spectral components. The frequency domain energy is calculated as follows:

$$E = \sum\limits_{i} | X_{{{\text{FFT}},i}} |^{2}$$
(6)

The energy is normalized and compared with a threshold θ to generate an adaptive mask:

$${\text{Mask}} = E> \theta$$
(7)

This mask is then applied to the frequency domain representation:

$$X_{{{\text{Filtered}}}} = X_{{{\text{FFT}}}} \times {\text{Mask}}$$
(8)

Next, the masked and unmasked frequency domain representations are weighted by the learnable weights \(W_{{{\text{high}}}}\) and \(W\), respectively, and summed:

$$X_{{{\text{Weighted}}}} = X_{{{\text{Filtered}}}} \times W_{{{\text{high}}}} + X_{{{\text{FFT}}}} \times W$$
(9)

Finally, the data is transformed back to the time domain using the Inverse Fast Fourier Transform (IFFT):

$$X_{{{\text{ASB}}}} = {\text{IFFT}}(X_{{{\text{Weighted}}}} )$$
(10)
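The sequence of operations in Eqs. (5) to (10) can be sketched for a single univariate series as follows. Several simplifications are assumed: a naive discrete Fourier transform stands in for the FFT, the energy is normalized by its maximum, and \(W_{{{\text{high}}}}\) and \(W\) are fixed scalars rather than learnable weights. The function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stands in for the FFT of Eq. 5)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform (Eq. 10)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def adaptive_spectral_block(x, theta, w_high, w):
    X = dft(x)                                    # Eq. (5)
    energy = [abs(c) ** 2 for c in X]             # Eq. (6), per frequency bin
    peak = max(energy) or 1.0                     # max-normalization (assumption)
    mask = [1.0 if e / peak > theta else 0.0 for e in energy]   # Eq. (7)
    filtered = [c * m for c, m in zip(X, mask)]                 # Eq. (8)
    weighted = [f * w_high + c * w for f, c in zip(filtered, X)]  # Eq. (9)
    # The mask is symmetric for real input, so the result is real up to rounding.
    return [v.real for v in idft(weighted)]
```

For example, with `theta=0.3`, `w_high=1.0`, and `w=0.0`, a strong sinusoid contaminated by a weak one at a different frequency is returned with the weak component removed, because its normalized energy falls below the threshold.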

The ICB module leverages multi-scale convolution operations to extract features from time series data, effectively capturing both local details and global trends in wind speed variations. By integrating these multi-scale features through interaction mechanisms, the ICB enables the model to learn feature interactions across different time scales, enhancing its ability to predict wind power generation.

Within the ICB module, the input time series undergoes convolutions of varying sizes, capturing both short-term dependencies and longer-range patterns. This multi-scale approach allows the model to extract a comprehensive range of features for improved accuracy in wind power forecasting.

$$\begin{gathered} X_{1} = {\text{Conv}}1(X) \hfill \\ X_{2} = {\text{Conv}}2(X) \hfill \\ \end{gathered}$$
(11)

After applying the activation function and dropout, the feature interaction is as follows:

$$\begin{gathered} {\text{Out}}_{1} = X_{1} \times {\text{Drop}}({\text{Act}}(X_{2} )) \hfill \\ {\text{Out}}_{2} = X_{2} \times {\text{Drop}}({\text{Act}}(X_{1} )) \hfill \\ \end{gathered}$$
(12)

The final output is then:

$$X_{{{\text{ICB}}}} = {\text{Conv3}}({\text{Out}}_{1} + {\text{Out}}_{2} )$$
(13)
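A single-channel sketch of Eqs. (11) to (13) is given below. It assumes odd kernel lengths with zero padding, GELU as the activation, and inference mode, in which dropout acts as the identity; the kernels are passed in as plain lists, and all function names are hypothetical.

```python
import math


def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation that preserves the sequence length.

    Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[t + j] * kernel[j] for j in range(len(kernel)))
            for t in range(len(x))]


def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def interactive_conv_block(x, k1, k2, k3):
    x1 = conv1d_same(x, k1)                           # Eq. (11), branch 1
    x2 = conv1d_same(x, k2)                           # Eq. (11), branch 2
    # Eq. (12): each branch is modulated by the activated other branch.
    # Dropout is omitted because it is the identity at inference time.
    out1 = [a * gelu(b) for a, b in zip(x1, x2)]
    out2 = [b * gelu(a) for a, b in zip(x1, x2)]
    mixed = [p + q for p, q in zip(out1, out2)]
    return conv1d_same(mixed, k3)                     # Eq. (13)
```

The cross-multiplication in Eq. (12) is what distinguishes this block from two independent convolution branches: each branch gates the other, so a feature passes through strongly only when both receptive fields respond.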

The TSLANet model, which combines the ICB and ASB modules, integrates both temporal and frequency domain features. This design enables it to capture wind speed variations across different scales and their effects on power generation, significantly enhancing the model’s generalization capability.

Frequency enhanced channel attention mechanism

The complex frequency information resulting from multiple transformations can significantly impact predictions. To address this, we introduce the FECAM module29. FECAM improves time series predictions by combining frequency domain features with channel attention mechanisms, effectively managing time series data with intricate frequency components. The FECAM structure is shown in Fig. 6.

Fig. 6. FECAM structure diagram.

FECAM uses the Discrete Cosine Transform (DCT) to extract frequency information from the input data, applying channel-level weighting based on this information to enhance the model’s sensitivity. This approach enables the model to accurately capture key fluctuations, as well as high- and low-frequency components that most impact power output, ultimately improving prediction precision.

DCT effectively decomposes frequency information in time series data, allowing the model to directly utilize these features. Unlike the discrete Fourier transform, whose implicit periodic extension can introduce the Gibbs phenomenon (spurious high-frequency components) when processing non-periodic signals, the DCT avoids this artifact, making it well-suited for wind power data, which often lacks strict periodicity.

For each channel in the input multivariate wind power series, FECAM first applies DCT, as represented by the following formula:

$$X_{{{\text{DCT}}}} (k) = \sum\limits_{i = 0}^{L - 1} {x_{i} } \cos \left( {\frac{\pi }{L}(i + 0.5)k} \right),\quad k = 0,1, \ldots ,L - 1$$
(14)

Here, \(x_{i}\) is the \(i\)-th element of the input sequence X, L is the sequence length, and k indexes the frequency components; the DCT thus converts the time-domain signal into a frequency-domain signal. The frequency information extracted through DCT effectively represents the periodic characteristics and short-term fluctuations in wind speed variations, helping the model identify the features that most significantly contribute to wind power generation.

After the DCT transformation, the resulting frequency domain feature, \(X_{{{\text{Freq}}}}\), is used to construct channel-level attention weights, as shown in the formula below:

$${\text{Attn}} = \sigma (W_{2} \delta (W_{1} X_{{{\text{Freq}}}} ))$$
(15)

In this context, \(W_{1}\) and \(W_{2}\) represent the weights of the fully connected layers, σ is the sigmoid activation function, and δ is the ReLU activation function. The generated attention weights Attn are used to scale the original input, enhancing the emphasis on key frequency components.

This mechanism learns the weight distribution across each channel, highlighting the frequencies and channels that contribute most to the prediction. By combining these attention weights with the original features in a weighted manner, it allows the model to better capture the most relevant information for accurate forecasting.

The final output is then:

$$X_{{{\text{FECAM}}}} = X \times {\text{Attn}}$$
(16)

This mechanism scales features across different channels, increasing the weight of important channels while effectively reducing high-frequency noise, such as instantaneous wind speed fluctuations. It amplifies the most valuable frequency domain information, thereby enhancing the model’s predictive capability.
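The steps in Eqs. (14) to (16) can be sketched as follows. One detail is not specified above and is assumed here: each channel is summarized by the mean absolute value of its DCT coefficients before the two fully connected layers are applied. The weight matrices are passed in as nested lists, and the function names are hypothetical.

```python
import math


def dct(x):
    """Discrete cosine transform of Eq. (14), evaluated for k = 0..L-1."""
    L = len(x)
    return [sum(x[i] * math.cos(math.pi / L * (i + 0.5) * k) for i in range(L))
            for k in range(L)]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def fecam(channels, W1, W2):
    """Scale each channel by a frequency-derived attention weight."""
    # Channel descriptor: mean absolute DCT coefficient (assumption).
    freq = [sum(abs(c) for c in dct(ch)) / len(ch) for ch in channels]
    hidden = [max(0.0, z) for z in matvec(W1, freq)]                 # ReLU
    attn = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W2, hidden)]  # sigmoid, Eq. (15)
    scaled = [[a * v for v in ch] for a, ch in zip(attn, channels)]  # Eq. (16)
    return scaled, attn
```

As a sanity check on Eq. (14), a constant sequence yields a single non-zero coefficient at k = 0, which reflects the absence of any oscillatory content.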

Fast Kolmogorov-Arnold network attention

To model nonlinear dependencies while balancing global and local representations, the model incorporates a multi-head attention mechanism built upon the FastKAN layer30, as illustrated in Fig. 7. This design integrates FastKAN’s nonlinear transformation capability into the attention framework, enabling simultaneous learning of temporal and spectral features. FastKAN, an efficient variant of the Kolmogorov-Arnold Network (KAN)31, replaces traditional B-splines with radial basis functions (RBFs) to approximate complex mappings, significantly improving computational efficiency. Each layer applies learnable RBFs to perform nonlinear transformations, enhancing the model’s capacity to represent intricate dynamics in wind power data.

Fig. 7. FastKAN attention structure diagram.

In this framework, FastKAN is applied to the queries, keys, and values, extracting local features through radial basis functions. A radial basis function measures the distance between the input x and a reference grid point, smoothing this distance using a Gaussian function. The specific formula is:

$${\text{RBF}}_{{{\text{query}}}} = \exp \left( { - \left( {\frac{{q - {\text{grid}}}}{{{\text{denominator}}}}} \right)^{2} } \right)$$
(17)

In this context, q represents the query feature, and similar operations are applied to the keys and values. The grid is created by discretizing a specified range, while the denominator controls the smoothness of the radial basis function.
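A short sketch of the radial basis expansion in Eq. (17) for a single scalar feature is shown below. The grid range, the number of grid points, and the choice of the grid spacing as the denominator are illustrative assumptions; the function name `rbf_features` is hypothetical.

```python
import math


def rbf_features(q, grid_min=-2.0, grid_max=2.0, num_grids=5):
    """Gaussian radial basis expansion of a scalar feature, as in Eq. (17).

    The default grid and the use of the grid spacing as the denominator
    are assumptions of this sketch.
    """
    spacing = (grid_max - grid_min) / (num_grids - 1)
    grid = [grid_min + j * spacing for j in range(num_grids)]
    denominator = spacing
    return [math.exp(-((q - g) / denominator) ** 2) for g in grid]


features = rbf_features(0.0)
```

With these defaults the grid is [-2, -1, 0, 1, 2], so a query of 0 activates the central basis function fully and its neighbors progressively less. A larger denominator would flatten this profile and give smoother responses.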

The similarity weights between the queries and keys are then computed and applied to the values:

$${\text{Att}}_{{{\text{output}}}} = {\text{softmax}}\left( {\frac{{W_{q} \cdot W_{k}^{T} }}{{\sqrt {d_{k} } }}} \right) \cdot W_{v}$$
(18)

Here, \(W_{q}\), \(W_{k}\), and \(W_{v}\) are the query, key, and value representations transformed by the FastKAN layer, and \(d_{k}\) is the dimensionality of these vectors.

The final output is generated by combining the attention-weighted results with gated modulation:

$$O_{{{\text{gated}}}} = \sigma (W_{g} \cdot q) \cdot {\text{Att}}_{{{\text{output}}}}$$
(19)

In this context, \(W_{g}\) controls the gating, and \(O_{{{\text{gated}}}}\) represents the final gated output.

The final prediction result is:

$$\hat{P}(t) = W_{{{\text{out}}}} O_{{{\text{gated}}}} + b$$
(20)

In wind power forecasting, integrating the attention mechanism with FastKAN allows the model to focus more effectively on key features, such as abrupt changes in wind speed. The FastKAN layer enhances the accuracy of feature transformations, enabling the model to capture the relationships between input features more precisely. Additionally, the gating mechanism further improves the model’s adaptability by dynamically adjusting outputs based on input patterns, enabling it to optimize prediction performance adaptively.
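The attention and gating steps of Eqs. (18) and (19) can be illustrated with a single-head sketch. It assumes that the query, key, and value representations have already been produced by the FastKAN layer and that the gate pre-activation \(W_{g} \cdot q\) is supplied as one scalar per query position; the function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)                          # subtract the maximum for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(Q, K, V, gate_logits):
    """Single-head scaled dot-product attention with a sigmoid gate.

    Q, K, V are lists of vectors; gate_logits holds one scalar per query,
    standing in for the product W_g * q (assumption).
    """
    d_k = len(K[0])
    outputs = []
    for q, g in zip(Q, gate_logits):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                                  # Eq. (18)
        attended = [sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))]
        gate = 1.0 / (1.0 + math.exp(-g))                          # sigmoid gate
        outputs.append([gate * a for a in attended])               # Eq. (19)
    return outputs
```

A zero query attends uniformly to all keys, and a zero gate logit passes half of the attended value. Both cases are convenient for checking the implementation by hand.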

LSSVM error correction

To further enhance prediction accuracy, this paper employs LSSVM for error correction. LSSVM demonstrates strong generalization ability and effectively captures the nonlinear error characteristics in wind power forecasts32, enabling accurate correction of initial prediction results. The LSSVM structure is shown in Fig. 8.

Fig. 8. LSSVM structure diagram.

By using the least squares method instead of the traditional quadratic programming approach in SVM, LSSVM significantly reduces computational complexity, making it well-suited for error correction tasks involving large-scale data. In this study, we utilize the LSSVM model by taking the residuals—differences between preliminary predicted values and actual wind power values—as inputs. By learning the relationship between these residuals and the input features, LSSVM corrects errors, thereby improving overall prediction accuracy.

Given the preliminary forecast series for wind power \(\hat{P}(t)\), the error between this forecast and the actual wind power \(P(t)\) is defined as:

$$e(t) = P(t) - \hat{P}(t)$$
(21)

The LSSVM model constructs an error correction model by learning the error \(e(t)\) in the time series. The final corrected wind power forecast \(P_{corr} (t)\) is expressed as:

$$P_{corr} (t) = \hat{P}(t) + e_{pred} (t)$$
(22)

Here, \(e_{pred} (t)\) represents the LSSVM model’s prediction of the error \(e(t)\).
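The correction in Eqs. (21) and (22) can be sketched with a small LSSVM regressor, which is fitted by solving a single linear system instead of a quadratic program. The RBF kernel, the values of the regularization parameter `gamma` and the kernel width `sigma`, and the use of the input features as regressors are assumptions of this sketch; all function names are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def rbf_kernel(u, v, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2.0 * sigma ** 2))


def lssvm_fit(X, e, gamma=100.0, sigma=1.0):
    """Fit an LSSVM to residuals e(t) (Eq. 21); returns bias b and weights alpha."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf_kernel(X[i], X[j], sigma)
                          + (1.0 / gamma if i == j else 0.0) for j in range(n)])
    sol = solve(A, [0.0] + list(e))
    return sol[0], sol[1:]


def lssvm_predict(x, X, b, alpha, sigma=1.0):
    return b + sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, X))


def correct_forecast(p_hat, x, X, b, alpha, sigma=1.0):
    """Eq. (22): add the predicted residual to the preliminary forecast."""
    return p_hat + lssvm_predict(x, X, b, alpha, sigma)
```

If the preliminary forecast carried a constant bias, the fitted model would reduce to that bias term alone, and `correct_forecast` would shift every forecast by the same amount. Structured residuals, such as cyclic deviations, are captured through the kernel terms.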

Case analysis and verification

Data sources

The effectiveness of the proposed model in wind power output forecasting was evaluated on a real-world dataset collected from a wind turbine unit in the United States. The dataset spans from 00:00 on January 1, 2023, to 23:50 on August 1, 2023, with 10-min intervals, totaling 30,672 data points. The data were split into 80% for training and 20% for testing. The input sequence length was set to 144, and prediction horizons of 1, 4, and 8 steps were used under a rolling prediction strategy to forecast future wind power outputs.
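The sample count and the construction of input-target pairs can be checked with a short sketch. The chronological 80/20 split and the direct windowing shown here are assumptions made for illustration (the rolling strategy itself is not reproduced), and the function name `make_windows` is hypothetical.

```python
from datetime import datetime, timedelta

# Number of 10-min samples between the first and last timestamps (inclusive).
start = datetime(2023, 1, 1, 0, 0)
end = datetime(2023, 8, 1, 23, 50)
n_points = (end - start) // timedelta(minutes=10) + 1

# Chronological 80/20 split (the ordering of the split is an assumption).
n_train = n_points * 8 // 10
n_test = n_points - n_train


def make_windows(series, input_len=144, horizon=1):
    """Pair each input window with the `horizon` values that follow it."""
    samples = []
    for t in range(len(series) - input_len - horizon + 1):
        inputs = series[t:t + input_len]
        targets = series[t + input_len:t + input_len + horizon]
        samples.append((inputs, targets))
    return samples
```

The timestamps span 213 days at 144 samples per day, which reproduces the stated total of 30,672 points. An input length of 144 therefore corresponds to exactly one day of history.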

Data preprocessing

Due to the differences in magnitudes between wind power output and other influencing factors, directly inputting the raw data could degrade the model’s performance and generalization capability. To address this, a standardization method was applied to bring all features of the raw sample data to a comparable scale (zero mean and unit variance). The calculation is defined as follows:

$$x_{std} = \frac{x - \mu }{\sigma }$$
(23)

where \(x\) is the raw value of a feature, \(x_{std}\) is its standardized value, μ is the mean of the feature, and σ is the standard deviation of the feature.

For the wind direction feature, whose original range is [0,360] degrees, a sine function was employed for normalization based on its physical properties. This approach mapped the wind direction data to the range [–1,1], preserving its periodicity and directionality while eliminating boundary discontinuities. Such preprocessing makes the data more suitable for neural network models. The calculation is given as follows:

$$x^{*} = \sin \left( {\frac{x}{180}\pi } \right)$$
(24)

where \(x^{*}\) is the normalized wind direction value.
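Both preprocessing steps, Eqs. (23) and (24), can be expressed in a few lines. The sketch below assumes the population standard deviation and a feature that is not constant; the function names are hypothetical.

```python
import math
import statistics


def standardize(values):
    """Eq. (23): subtract the mean and divide by the standard deviation.

    Uses the population standard deviation (assumption) and requires a
    non-constant feature so that sigma is non-zero.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def encode_wind_direction(degrees):
    """Eq. (24): map a wind direction in degrees to the range [-1, 1]."""
    return math.sin(degrees / 180.0 * math.pi)
```

Directions of 0 and 360 degrees both map to approximately 0, which removes the artificial jump at the boundary of the original range.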

Evaluation metrics

To comprehensively assess the predictive performance of the proposed model, four evaluation metrics were employed: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (sMAPE), and the Coefficient of Determination (\(R^{2}\)). To enhance interpretability and maintain consistency with the other evaluation metrics, \(R^{2}\) is reported in percentage form33.
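For reference, the four metrics can be computed as in the sketch below. The sMAPE variant shown, with the sum of absolute values in the denominator and zero contribution when both values are zero, is one common definition and is assumed here; the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; a 0/0 term counts as zero error (assumption)."""
    terms = []
    for a, b in zip(y_true, y_pred):
        denom = abs(a) + abs(b)
        terms.append(0.0 if denom == 0 else 2.0 * abs(a - b) / denom)
    return 100.0 * sum(terms) / len(terms)


def r2_percent(y_true, y_pred):
    """Coefficient of determination, reported in percentage form."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 100.0 * (1.0 - ss_res / ss_tot)
```

As a worked example, for actual values [1, 2, 3, 4] and predictions [1, 2, 3, 5], the residual sum of squares is 1 and the total sum of squares is 5, so RMSE is 0.5, MAE is 0.25, and \(R^{2}\) is 80%.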

Additionally, multiple baseline models were selected for comparison to validate the model’s performance. Forecasting experiments were conducted for 1-step (10 min), 4-step (40 min), and 8-step (1 h 20 min) prediction horizons. The comparison models are summarized in Table 1.

Table 1 Comparison model abbreviation table.

Model configuration and hyperparameter settings

The proposed hybrid model integrates wavelet-based convolutional feature extraction, sequential modeling via LSTM, spectral attention, and kernel-based nonlinear mapping. The model input dimension is set to 19, corresponding to the number of observed meteorological and power-related features. The LSTM module is configured with 2 layers and a hidden size of 128, which provides a balance between model expressiveness and training efficiency. To enhance temporal and spectral representation, a two-level wavelet decomposition is applied using Daubechies-4 basis functions within the WTC module. The convolutional kernel size is fixed at 5 to ensure adequate temporal locality while maintaining computational feasibility.

For frequency domain modeling, the ASB includes a learnable threshold parameter initialized to 0.3, which allows the model to adaptively emphasize high-energy spectral components during training. The ICB contains two convolution branches (1 × 1 and 3 × 1) and incorporates a dropout layer with a rate of 0.2 to mitigate overfitting risks. The final prediction head is constructed using a two-layer FastKAN module with structure [128, 1], which enhances nonlinear approximation ability while preserving training stability.

The model is trained using the Adam optimizer with an initial learning rate of 1 × 10⁻³ and a batch size of 32. All weight parameters are initialized using a truncated normal distribution where applicable, and activation functions throughout the model include GELU (in ICB) and tanh (implicitly in FastKAN basis functions). Hyperparameters are selected based on empirical validation performance and domain-specific modeling principles, rather than grid or random search, in consideration of computational constraints.
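The settings described in this subsection can be gathered into a single configuration object, which makes them easier to audit and reproduce. The sketch below is one possible layout; the class and field names are hypothetical, and the wavelet identifier "db4" assumes the naming convention of common wavelet libraries for Daubechies-4.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    input_dim: int = 19                  # meteorological and power-related features
    lstm_layers: int = 2
    lstm_hidden_size: int = 128
    wavelet: str = "db4"                 # assumed identifier for Daubechies-4
    wavelet_levels: int = 2              # two-level decomposition in the WTC
    conv_kernel_size: int = 5
    asb_threshold_init: float = 0.3      # learnable threshold, initial value
    icb_branch_kernels: tuple = (1, 3)   # 1 x 1 and 3 x 1 convolution branches
    icb_dropout: float = 0.2
    fastkan_layers: tuple = (128, 1)     # prediction head structure [128, 1]
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 32

config = ModelConfig()
print(config)
```

Freezing the dataclass prevents accidental changes to the settings between experiments.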

Experimental analysis

As illustrated in Fig. 9, we compared the prediction results of different models. Figure 9 (a), (b) and (c) represent the 1-step, 4-step, and 8-step prediction curves, respectively. In the single-step prediction, all models perform well. However, as the prediction horizon increases, the baseline models gradually deviate from the actual value curve, while the proposed model continues to deliver satisfactory results.

Figure 9

(a) Comparison of the prediction performance of the models in 1-step prediction. (b) Comparison of the prediction performance of the models in 4-step prediction. (c) Comparison of the prediction performance of the models in 8-step prediction.

Table 2 presents the prediction accuracy of the models for wind turbine power output, with the best result for each prediction horizon highlighted. The RMSE and MAE of every model increase with the number of prediction steps, because dependencies between distant time points become harder to capture as the forecast horizon extends. The proposed model nevertheless achieves the highest accuracy at every horizon, which indicates that its feature extraction remains effective as the horizon grows. For the 1-step prediction task, the proposed model reduces RMSE and MAE by up to 47.28% and 41.67%, respectively, compared to the baseline models. In the 4-step prediction task, the relative reductions in RMSE and MAE are 38.38% and 37.47%, respectively. In the 8-step prediction task, they are 45.17% and 42.95%, respectively.
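The relative reductions quoted here are presumably computed as the drop in error divided by the baseline error. A minimal sketch of that calculation follows; the function name and the numeric inputs are hypothetical and are not taken from Table 2.

```python
def relative_reduction(baseline_error, proposed_error):
    """Percentage by which the error drops relative to the baseline."""
    return 100.0 * (baseline_error - proposed_error) / baseline_error

# Hypothetical error values for illustration only.
example = relative_reduction(baseline_error=2.0, proposed_error=1.5)
print(f"{example:.2f}%")
```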

Table 2 Comparison of model evaluation results.

The exceptional performance of the proposed model in multi-step forecasting is primarily attributed to its ability to accurately capture the complex dynamic characteristics of wind power. Specifically, the WTC extracts both high-frequency and low-frequency components to capture short-term fluctuations and long-term trends. The frequency-domain decomposition method in the ASB adaptively enhances critical frequency features. Simultaneously, the ICB extracts multi-scale convolutional features, achieving deep integration of local and global patterns. Additionally, the FECAM improves the focus on key channel features, while the FastKAN-based nonlinear mapping effectively captures the complex nonlinear relationship between wind speed and power output.

Error correction

Although the proposed model demonstrates strong prediction performance, it may still exhibit systematic bias and fail to capture certain complex nonlinear relationships or multi-scale features, which can adversely affect forecasting accuracy. To address these issues, the Least Squares Support Vector Machine (LSSVM) model was employed to correct prediction errors. This approach aims to reduce systematic bias, mitigate the impact of random errors, and supplement the features missed by the original model, ultimately improving prediction accuracy.
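To make the correction step concrete, the sketch below fits an LSSVM regressor to the residuals of a base forecast and adds the predicted residual back to the forecast. This section does not state the LSSVM inputs, kernel, or hyperparameters, so the RBF kernel, the use of the base prediction as the only input feature, the values of `gamma` and `sigma`, and all function names are assumptions. The linear system is the standard LSSVM regression formulation, solved here with plain Gaussian elimination to keep the example dependency-free.

```python
import math

def rbf(a, b, sigma):
    """RBF kernel between two feature vectors."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma
        A.append(row)
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(x, X, b, alpha, sigma=1.0):
    """Evaluate f(x) = b + sum_i alpha_i K(x, x_i)."""
    return b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X))

# Toy data with a constant bias of 0.5; illustrative only.
base_pred = [1.0, 2.0, 3.0, 4.0]
actual = [1.5, 2.5, 3.5, 4.5]
X = [[p] for p in base_pred]
residuals = [a - p for a, p in zip(actual, base_pred)]
b, alpha = lssvm_fit(X, residuals)
corrected = [p + lssvm_predict([p], X, b, alpha) for p in base_pred]
print(corrected)
```

In practice the corrector would be fitted on residuals from a training or validation split and applied to held-out forecasts; the toy example reuses the same points only to keep the sketch short.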

A comparative analysis of multi-step prediction performance between the LSSVM-corrected ensemble model and the uncorrected ensemble model is illustrated in Fig. 10. Figure 10 (a), (b) and (c) represent the 1-step, 4-step, and 8-step prediction curves, respectively. The evaluation metrics for the LSSVM-corrected model are presented in Table 3.

Figure 10

(a) Comparison of error correction in 1-step prediction. (b) Comparison of error correction in 4-step prediction. (c) Comparison of error correction in 8-step prediction.

Table 3 Evaluation results of error correction.

The results indicate that error correction using LSSVM consistently improves forecasting performance across the prediction horizons. In the 1-step, 4-step, and 8-step predictions, both RMSE and MAE decrease relative to the uncorrected model. In the 1-step prediction task, the relative reductions in RMSE and MAE are 18.15% and 16.95%, respectively. In the 4-step prediction task, they are 24.31% and 16.56%, and in the 8-step prediction task, 31.83% and 33.73%. The data further suggest that error correction captures periodic and trend-related variations missed by the base model. It also alleviates error accumulation: the RMSE reduction grows steadily with the horizon, and both metrics improve most in the 8-step task.

Impact of wind speed variations

To assess the model’s capability in handling dynamic variations of key features, the relationship between absolute prediction error and wind speed fluctuations across different models was analyzed, as illustrated in Fig. 11. The horizontal axis represents the rate of wind speed change at the current time step relative to its value two time steps earlier (20 min), termed the "near-2 wind speed change rate." This metric quantifies the intensity of wind speed fluctuations.
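A minimal sketch of how such a change rate could be computed is given below. The text does not give the formula, so the absolute relative change |v(t) − v(t−2)| / |v(t−2)| is an assumption, as are the function name and the handling of a zero reference speed.

```python
def near_k_change_rate(speeds, k=2):
    """Absolute relative change of wind speed versus its value k steps earlier."""
    rates = []
    for t in range(k, len(speeds)):
        previous = speeds[t - k]
        if previous == 0:
            rates.append(float("nan"))  # rate is undefined for a zero reference speed
        else:
            rates.append(abs(speeds[t] - previous) / abs(previous))
    return rates

# Illustrative wind speeds (m/s) at 10-minute intervals; not data from the study.
speeds = [4.0, 5.0, 6.0, 2.5]
rates = near_k_change_rate(speeds, k=2)
print(rates)
```

With 10-minute sampling, k = 2 corresponds to the 20-minute lag described above.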

Figure 11

Absolute error and wind speed fluctuations.

As shown in Fig. 11, the wind speed change rate at the wind farm was predominantly concentrated between 0.5 and 1.0, indicating significant and unstable wind fluctuations. Under these conditions, the proposed model effectively captured wind speed variation patterns, exhibiting high predictive stability. Specifically, the absolute error remained mostly below 5, and even in extreme cases where the wind speed change rate exceeded 1.0, the error was contained within the range of 5 to 10. The slow and concentrated growth of errors further demonstrated the model’s reliability under intense wind fluctuations.

By contrast, the traditional LSTM model exhibited greater instability and higher error fluctuations when handling rapid wind speed variations, with significantly larger errors at certain time steps compared to the proposed model. This suggests that the LSTM model lacked robustness in managing short-term abrupt wind speed changes, resulting in reduced prediction accuracy, particularly in cases of pronounced wind speed variability.

The analysis of error distribution confirms that the proposed model maintains strong stability and adaptability under intense wind fluctuations. While larger wind speed variations inevitably lead to some increase in error, the magnitude remained relatively small, with a concentrated distribution and no significant anomalies. These findings underscore the model’s superior predictive stability and robustness in complex, highly dynamic environments.

Conclusion

The proposed model captures multi-scale features in both the frequency and time domains, significantly improving its ability to represent trend dynamics and nonlinear patterns in wind power data. By integrating the WTC and the ASB, the model extracts the multi-scale features that traditional methods often fail to capture.

Furthermore, the ICB and the FECAM enhance the robustness of feature representation by suppressing noise and amplifying informative temporal–spectral features. The FastKAN-based nonlinear mapping further strengthens the model’s ability to approximate complex dynamic relationships, ensuring both robustness and high predictive accuracy in short- and long-term forecasting tasks.

To further improve forecasting precision, an LSSVM is incorporated as a post-prediction residual correction layer. This hybrid structure effectively compensates for residual errors accumulated during multi-step prediction, thereby enhancing stability and maintaining high accuracy across extended forecasting horizons.

Experimental results confirm that the proposed model consistently outperforms the baseline models on RMSE, MAE, sMAPE, and R², achieving superior performance in both short-term fluctuation tracking and long-term trend prediction. In particular, the wind speed variation experiment demonstrates the model’s robustness under highly dynamic wind conditions. Analysis of the relationship between absolute prediction error and the near-2 wind speed change rate shows that the errors of the proposed model are markedly less sensitive to sharp wind speed fluctuations than those of the other models, confirming its adaptability.

Overall, the integration of time–frequency decomposition, nonlinear mapping, and residual correction enables the model to achieve high accuracy, strong generalization, and robust performance under complex and rapidly changing wind conditions. Future research will further explore sensitivity analysis of key parameters and the incorporation of uncertainty quantification techniques to enhance interpretability and predictive reliability in large-scale wind power forecasting applications.