Abstract
Accurate forecasting of solar energy is essential for balancing supply and demand, enhancing energy planning, and supporting the integration of renewable resources into modern electricity grids. While recent research has focused heavily on machine learning models such as Long Short-Term Memory networks for solar energy forecasting, these approaches often lack transparency and interpretability. This study presents an interpretable-by-design photovoltaic (PV) forecasting framework that couples hierarchical factor analysis (HFA) with ridge regression. HFA compresses high-dimensional meteorological data into three physically meaningful second-order factors, after which a single-parameter ridge model provides coefficient-level transparency and regularization in this compact space. Using 15-min measurements from a 93.6 kWp plant in Adıyaman, Türkiye (May 17, 2021–Jan 12, 2025), we evaluate under a unified chronological split (0.64/0.16/0.20). The model combines strong generalization with clear insight into how meteorological variables affect solar power generation, ensuring transparency and verifiability. These results highlight regression-based methods as robust, explainable alternatives to complex deep learning models in photovoltaic forecasting. Because developing and forecasting with highly multivariate models is rarely straightforward, our approach is designed to yield a streamlined model through which future prediction is easier: by simplifying complexity and clarifying how each parameter affects the result, the proposed model makes it easier to identify the most important drivers of solar power generation.
Introduction
Renewable energy (RE) is recognized as the key component of the future energy landscape due to the accelerating impact of fossil fuels on climate change1. Solar energy, as a sustainable and fast-growing RE source, converts sunlight directly into electricity via PV devices2. PV systems are favored for power generation because of advantages such as storage capability, environmental friendliness, simple design, and grid integration3. As PV capacity grows, managing generated power and ensuring system reliability become increasingly critical4. While PV module characteristics can be tested in labs under standard conditions (American Society for Testing and Materials (ASTM) E927-10), real-world performance is affected by sunlight intensity and atmospheric factors such as temperature, humidity, and wind5,6. The variability of meteorological conditions poses challenges for stable and reliable PV system operation and grid integration7. Accurate PV power forecasting helps balance electricity supply and demand, reduces grid fluctuations, avoids unnecessary costs, and supports efficient energy storage and grid management, particularly for large-scale solar plants8,9,10,11.
In this study, a novel two-stage methodological framework is proposed to enhance PV power forecasting by combining HFA and Ridge Regression, with a specific focus on model interpretability, statistical validity, and practical applicability. Unlike the majority of recent studies that rely predominantly on machine learning (ML) and deep learning (DL) techniques, often perceived as black-box models, this study demonstrates that regression-based forecasting can offer not only competitive performance but also superior explainability and verifiability. In the first stage of the framework, HFA analyzes and extracts latent factors from a large set of meteorological variables, reducing their complexity while retaining the structural and temporal relationships essential to those variables. These latent factors are then used as inputs to a Ridge Regression model in the second stage, which mitigates multicollinearity and improves generalization through L2 regularization.
This approach challenges the current trend in the PV forecasting literature by showing that interpretable regression models, when combined with robust dimensionality reduction techniques, can outperform or rival more complex ML algorithms, particularly in terms of model transparency, traceability, and reproducibility. This study makes three main contributions to the field of PV power forecasting:
1. It introduces a hierarchical structure that reduces dimensionality and preserves temporal and thematic relationships among features.
2. It applies an interpretable and regularized linear regression method, which allows for transparent prediction modeling and stability under high correlation.
3. It demonstrates the applicability and effectiveness of the proposed framework on actual PV power data from a solar power plant in Türkiye, establishing its significance for generalizability, grid integration, and energy planning.
Literature review
In PV power forecasting, three main estimation techniques are widely used. The first is physical approaches that model energy production based on meteorological variables and photovoltaic device interactions, often employing numerical weather prediction, sky imaging, and satellite data12. The second involves statistical and ML methods, which are fast, efficient, and cost-effective for short- and medium-term forecasts13. These include time series analysis, ML, and DL techniques14. Time series models, such as ARIMA, have proven effective for short-term forecasts in large-scale PV plants15. ML models like Artificial Neural Networks (ANN), Random Forests, Support Vector Machines (SVM), Gaussian Process Regression (GPR), and Extreme Learning Machines (ELM) are popular due to their ability to handle nonlinearity and heterogeneous inputs8,16. Ensemble and optimized ML models have shown improved accuracy; for instance, ANN ensembles, Random Forest after feature selection, and GPR were found most accurate in various studies17,18,19. Optimized ELM models have also demonstrated stable short-term forecasts in microgrids20. Due to scalability challenges with large PV datasets, DL models like Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and hybrid CNN–LSTM architectures are increasingly used12. While early studies used standalone RNNs21, hybrid models currently dominate for better performance22. Examples include RNN-LSTM combined with physical models achieving high accuracy in France23, a 4-layer LSTM model with low error in Brazil24, and a CNN-GRU hybrid for a floating PV plant in Thailand25. The third comprises hybrid approaches combining physical and statistical methods, which continue to gain attention and often involve data preprocessing, model building, and parameter tuning26. Studies report hybrid models outperform standalone ones in one-week forecasts27. Recent reviews emphasize managing uncertainty, improving data quality, and enhancing generalizability in ML-based renewable forecasting28. Advanced hybrid frameworks incorporating feature selection, signal decomposition, and DL optimization have achieved high accuracy on large PV systems26,29. Besides nonlinear ML and DL models, traditional regression methods are also explored. For example, comparisons of multiple regression and ML models on Jordanian PV data with Chimp Optimization Algorithm tuning showed MLP provided the best forecasting accuracy16.
Recent studies on PV power forecasting highlight the importance of both methodological innovation and explainability. Research on building-integrated PV systems in Switzerland has shown that careful model tuning and evaluation can significantly improve predictive performance. A weighted ensemble combining machine learning models with sky imagery demonstrated superior short-term forecasting accuracy compared to traditional approaches29. Nematzadeh and Esen proposed an explainable ML framework that identifies the most influential meteorological parameters and delivers generalizable forecasts without relying on site-specific sensors30. Tripathi et al. compared multiple regression techniques under fluctuating weather conditions and found Support Vector Machine Regression (SVMR) to outperform both multivariate and Gaussian regression31. A comprehensive IEEE Access review emphasized the growing role of ensemble methods such as Random Forest and Boosting for short-term PV forecasting32. Complementing this, studies dedicated to SVMR confirmed its ability to capture nonlinear dependencies and provide highly accurate PV power predictions31.
Beyond methodological advances, PV forecasting research also extends to diverse application contexts such as Building Integrated Photovoltaic (BIPV) and Floating Solar Photovoltaic (FSPV) systems, whose unique environmental challenges further highlight the need for robust and interpretable models. Both can be installed as alternatives in densely populated locations where land is scarce: BIPV systems are created by installing solar PV-based power systems on traditional structures33,34, while FSPV plants are installed on water surfaces and aim to increase production performance by eliminating land acquisition35. Partial shading, the physical properties of the environment, and obstructions from surrounding objects complicate forecasting in BIPV systems36. In FSPV systems, the difficulty instead stems from factors such as water temperature, ambient humidity, panel temperature, and the cooling effect. Consequently, the sensor data used in these systems often contain multicollinearity; Factor Analysis is therefore used to identify critical parameters, and Ridge Regression to address the multiple-correlation problem in BIPV and FSPV settings. The methods, strengths, and limitations of key studies in the literature, together with application contexts such as BIPV and FSPV, are summarized in Table 1.
Although time series forecasting and machine learning-based approaches have been extensively explored in the context of PV power prediction, the integration of dimensionality reduction techniques within time series frameworks remains significantly underrepresented in the literature. Moreover, studies that apply factor reduction are predominantly aligned with black-box ML models, while the potential of interpretable and statistically rigorous models has not been adequately examined. In light of this gap, the present study proposes a novel two-stage framework: first employing hierarchical time series factor analysis to model temporal dependencies while reducing feature dimensionality, followed by Ridge Regression applied to the refined factor set for accurate, generalizable, and interpretable PV power forecasting.
Methodology
This study employed HFA and Ridge Regression methods to address the high dimensionality of time series data and improve the modeling process (Fig. 1). Initially, the Time Series Factor Analysis (TSFA) technique was used to transform the high-dimensional data into a more manageable structure by identifying latent constructs among the observed variables. The components obtained from the factor analysis were then fed into a Ridge Regression model to evaluate their effects on the dependent variable. This section provides an overview of the dataset, including the data source and factor characteristics, together with brief descriptions of the analytical methods used.
Hierarchical time series factor analysis (HFA)
HFA is a multi-stage statistical procedure developed to uncover latent structures within complex, high-dimensional datasets, such as time series, and to facilitate a deeper interpretation of inter-variable relationships. Unlike traditional factor analysis, HFA aims not only to identify the direct relationships between observed variables and latent factors but also the relationships among the extracted factors themselves. This hierarchical structure is therefore of great importance for theoretical clarity and statistical robustness. In the initial stage of HFA, factor loadings are estimated using techniques such as Principal Component Analysis (PCA) or the Maximum Likelihood method. These loadings yield a set of first-order latent factors that capture the primary shared variance among the observed variables. The second stage examines the intercorrelations among the first-order factors to identify higher-order latent constructs. The result is a multi-level factor structure that parsimoniously summarizes the underlying data37.
This staged dimensionality reduction both improves model interpretability and addresses multicollinearity, a critical problem in regression analyses with highly correlated predictors. The latent constructs reduce the number of variables while retaining most of the variance, so HFA makes the ensuing modeling phases statistically more efficient and informative38.
The methodological application of HFA follows a systematic sequence of steps. The procedure starts with data preparation, where the dataset is preprocessed through normalization and treatment of missing values to meet the assumptions required for factor analysis. Exploratory factor analysis is then applied to the observed variables to extract a reduced number of first-order latent constructs. The optimal number of factors is determined using statistical heuristics such as the Kaiser criterion (eigenvalue > 1), scree plot examination, and parallel analysis. To improve the clarity and interpretability of the factor structures, rotation techniques are employed, either orthogonal methods like Varimax or oblique alternatives such as Promax, depending on the nature of the factor correlations. Once the first-order factors are established and refined, a second-level factor analysis is conducted on their intercorrelation matrix to identify broader, second-order latent dimensions. Finally, the hierarchical model is evaluated using goodness-of-fit indices, which collectively assess the adequacy and validity of the proposed structural hierarchy39.
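To make these steps concrete, the sketch below outlines the two-stage extraction using the open-source factor_analyzer package. It is a minimal sketch under stated assumptions: the synthetic DataFrame, variable names, and settings are illustrative placeholders, not the study's exact configuration.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor_analyzer

# Placeholder for the standardized meteorological matrix (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 20)),
                  columns=[f"var{i}" for i in range(20)])

# Stage 1: exploratory factor analysis on the observed variables.
fa_probe = FactorAnalyzer(rotation=None)
fa_probe.fit(df)
eigenvalues, _ = fa_probe.get_eigenvalues()
n_first = int((eigenvalues > 1).sum())            # Kaiser criterion (eigenvalue > 1)

fa1 = FactorAnalyzer(n_factors=n_first, rotation="oblimin")  # oblique rotation
fa1.fit(df)
first_order = pd.DataFrame(
    fa1.transform(df),
    columns=[f"FL_Factor{i + 1}" for i in range(n_first)],
)

# Stage 2: factor the first-order scores to obtain second-level constructs.
fa2 = FactorAnalyzer(n_factors=3, rotation="oblimin")
fa2.fit(first_order)
second_order = pd.DataFrame(
    fa2.transform(first_order),
    columns=["SL_Factor1", "SL_Factor2", "SL_Factor3"],
)

# Rotated component matrix: loadings of first-level on second-level factors.
print(pd.DataFrame(fa2.loadings_, index=first_order.columns))
```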
Ridge regression
Ridge Regression was introduced by Hoerl and Kennard to address the problem of multicollinearity in the classical multiple linear regression model40. When independent variables are highly correlated, the coefficients estimated by the classical Ordinary Least Squares (OLS) method may exhibit high variance, which degrades the generalization capacity of the model. Ridge Regression solves this by adding an L2 penalty term, a shrinkage approach that introduces a small amount of bias in exchange for a reduction in variance. The classical linear regression model is defined as:

$$y = X\beta + \varepsilon,$$

where y is the vector of the dependent variable, X is the matrix of independent variables, β is the vector of regression coefficients, and ε is the error term. The OLS estimator is given by:

$$\hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}y.$$

However, when the matrix X⊤X is nearly singular, the variance of this estimator inflates. To counter this, Ridge Regression introduces a penalty term into the objective function:

$$\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta}\left\{ \lVert y - X\beta \rVert_2^2 + k\,\lVert \beta \rVert_2^2 \right\}.$$

Here, k ≥ 0 is referred to as the Ridge parameter and controls the degree of regularization. The Ridge solution is obtained as:

$$\hat{\beta}_{\mathrm{ridge}} = (X^{\top}X + kI)^{-1}X^{\top}y.$$

This formulation penalizes the sum of squared coefficients, discouraging large weights and yielding a more stable, generalizable model. Ridge Regression can outperform OLS in terms of mean squared error (MSE), particularly when multicollinearity is present.
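For concreteness, a minimal NumPy sketch of the closed-form estimator above follows; it illustrates the formula, not the study's actual fitting code.

```python
import numpy as np

def ridge_fit(X, y, k):
    """Closed-form Ridge solution: beta = (X^T X + k I)^{-1} X^T y.

    X is centered first so the intercept is not penalized, as is conventional.
    k = 0 recovers the OLS estimator; larger k shrinks coefficients toward zero.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + k * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Usage with synthetic collinear data: the solve stays stable despite the
# near-duplicate column, which would inflate OLS variance.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.standard_normal(200)
y = 2.0 * X[:, 0] + X[:, 1] + rng.standard_normal(200)
b0, b = ridge_fit(X, y, k=1.0)
```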
Experimental results
This section presents the data collection process used to implement the proposed methodology and the findings obtained from its application.
Data collection
The dataset consists of time series data collected from a PV power plant with 93.6 kWp (kilowatt-peak) installed capacity in Besni, Adıyaman, Türkiye. Besni is located in southeastern Türkiye, on the western side of Adıyaman Province. Adıyaman is located at 38° 18’ 11.40” east longitude and 37° 48’ 1.19” north latitude. Many PV power generation systems have been installed in Adıyaman. Among these, the system in Besni, at 38° 18’ 48” east and 37° 43’ 15” north, was selected as the one best meeting the ideal standards for data collection (Fig. 2). All data used for training and evaluating the models in this study are freely available in a public GitHub repository at: https://github.com/tkaraca/An-Interpretable-Statistical-Approach-to-Photovoltaic-Power-Forecastin.git.
A total of 234 monocrystalline PV panels rated at 400 Wp each were used in the system (234 × 400 Wp = 93.6 kWp). The electrical properties of the panels used are given in Table 2.
In the database provided by these PV systems, data not used in any previous study were recorded from the grid-connected power plant between 17/05/2021 and 12/01/2025. The data obtained from the facility include the active power of the photovoltaic panels, recorded every 15 min. Meteorological data were taken from the Open-Meteo website for dates and time intervals synchronous with the PV data.
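A plausible alignment step is sketched below, assuming CSV exports with a shared timestamp column; the file and column names are hypothetical, not the repository's actual schema.

```python
import pandas as pd

# Hypothetical file and column names; the repository's schema may differ.
pv = pd.read_csv("pv_power.csv", parse_dates=["timestamp"])
met = pd.read_csv("open_meteo.csv", parse_dates=["timestamp"])

# Snap both series to a common 15-min grid and inner-join on timestamps so
# each active-power reading is paired with its synchronous weather record.
pv["timestamp"] = pv["timestamp"].dt.floor("15min")
met["timestamp"] = met["timestamp"].dt.floor("15min")
data = pv.merge(met, on="timestamp", how="inner").sort_values("timestamp")
```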
Based on the data obtained, a total of 20 features are identified. Active Power represents the product of the voltage and current generated by solar radiation on the PV cell along with the power factor, indicating the efficiency of power usage. Total Irradiance is the amount of electromagnetic energy from the Sun reaching the Earth’s atmosphere and directly influences PV power generation as it increases. Temperature rise reduces the open-circuit voltage of the PV panel while causing a slight increase in short-circuit current. Relative Humidity negatively impacts energy received due to dew droplets reflecting solar radiation in different directions. The Dew Point is the temperature at which humid air becomes saturated and forms dew or fog. Wind Speed positively affects PV performance by cooling the modules, whereas the effect of Wind Direction varies depending on the hemisphere. Wind Gusts refer to sudden fluctuations in atmospheric wind and are considered along with speed and direction. Visibility, reduced by gases, water vapor, and dust, decreases solar radiation reaching the panels and thus lowers energy production. Precipitation increases surface wetness, impacting PV output. Apparent Temperature describes the temperature perceived by the human body. Shortwave Radiation covers solar radiation in the 300–3000 nm wavelength range. Direct Radiation is sunlight reaching the earth without scattering, while Diffuse Radiation consists of scattered or absorbed solar radiation due to atmospheric particles. Direct Normal Irradiance illuminates a surface perpendicular to the sun’s rays. Global Tilted Irradiance quantifies the total solar energy received by a sloped surface, useful for evaluating fixed-tilt PV panels. Terrestrial Radiation is the long-wave, low-energy radiation emitted by the Earth. Snowfall Height indicates snow accumulation that negatively affects PV system performance in snowy regions. Freezing Level Height marks the altitude where air temperature is 0 °C. Lastly, Sunshine Duration measures the total time the earth is exposed to direct solar radiation.
Application of HFA
The original dataset, composed of 29 meteorological and environmental factors, is subjected to HFA to reduce its dimensionality in a structured and interpretable manner. In the first stage, the data are compressed into nine first-level factors, which are subsequently grouped into three second-level factors, forming a hierarchical structure. Before factor reduction, the correlation matrix and logical associations among variables are examined to assess suitability for factor analysis. The number of factors to retain is determined using statistical heuristics: the eigenvalue criterion (λ > 1) and a scree plot.
The resulting first-level factors (FL_Factor1–FL_Factor9) group thematically and statistically similar indicators. In the second stage, the correlations among these factors are analyzed to derive broader, conceptually cohesive second-level factors, as illustrated in Fig. 3.
The second-level factor structure derived from the first-level factors was designed to capture broader meteorological themes by aggregating conceptually and statistically related components.
- SL_Factor 1, labeled the “Radiation Factor”, encapsulates all irradiance-related variables and includes FL_Factor5, FL_Factor8, and FL_Factor9.
- SL_Factor 2, called the “Temperature Factor”, comprises thermal and humidity-related indicators through FL_Factor1, FL_Factor2, and FL_Factor7.
- SL_Factor 3, termed the “Wind and Climate Factor”, integrates wind characteristics, precipitation data, and other general atmospheric parameters through FL_Factor3, FL_Factor4, and FL_Factor6.
This hierarchical reduction process preserves the underlying variance of the original dataset while effectively reducing multicollinearity. As a result, it contributes to more stable, generalizable, and interpretable regression models in the subsequent estimation phase. To further validate the hierarchical structure, a rotated component matrix is generated to examine how the first-level factors load onto the second-level factors.
Table 3 summarizes the results of the second-level hierarchical factor analysis by presenting the rotated component matrix, which illustrates how each first-level factor (FL_Factor1 to FL_Factor9) loads onto the second-level latent constructs (SL_Factor1, SL_Factor2, and SL_Factor3). The highest loading values in each row indicate the dominant association of each first-level factor with a particular second-level factor. For example, FL_Factor5, FL_Factor8, and FL_Factor9 load most strongly onto SL_Factor1, corresponding to the radiation factor. Likewise, FL_Factor1, FL_Factor2, and FL_Factor7 are primarily aligned with SL_Factor2, representing the temperature factor. Meanwhile, FL_Factor3, FL_Factor4, and FL_Factor6 are associated with SL_Factor3, forming the wind and climate factor. The rotation technique used was oblimin, which permits correlations between factors and is suitable for meteorological datasets where interdependence among variables is expected. The strength and clarity of the loadings confirm the internal consistency of the factor structure and support the validity of the thematic groupings used in subsequent modeling. This matrix thus provides a statistically reliable and interpretable basis for building regression models that are both accurate and explainable.
Ridge regression of the three-factor model
A linear regression model is trained using the L2 (Ridge) regularization method on the obtained three-factor model. Ridge regression penalizes the sum of squared coefficients to control model complexity and prevent overfitting. The λ value is selected from the cross-validated MSE curve, with the 1-SE rule applied to maintain generalizability.
Figure 4 displays the relationship between different values of the regularization parameter (λ) and the cross-validated MSE in the Ridge Regression model. As λ increases from zero, the MSE initially decreases sharply, indicating improved generalization due to reduced overfitting. However, after a certain point, further increases in λ lead to a plateau and eventually a slight increase in error, reflecting the model’s underfitting tendency.
Two vertical reference lines are included on the plot: the first (blue) corresponds to the λ value that yields the minimum cross-validated MSE, while the second (green) denotes the 1 standard error (1-SE) rule, which selects the most regularized model whose error is still within one standard error of the minimum. In this study, the optimal λ value is chosen based on this trade-off to balance complexity and generalizability. The plot confirms that a moderate λ value prevents overfitting without sacrificing much predictive accuracy.
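A minimal sketch of this selection logic is given below; the fold count, λ grid handling, and helper name are illustrative assumptions (scikit-learn's alpha plays the role of λ).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def select_lambda_1se(X, y, lambdas, n_splits=3):
    """Largest (most regularized) lambda whose mean CV-MSE stays within one
    standard error of the minimum mean CV-MSE (the 1-SE rule)."""
    cv = TimeSeriesSplit(n_splits=n_splits)              # order-preserving folds
    mse = np.empty((len(lambdas), n_splits))
    for i, lam in enumerate(lambdas):
        for j, (tr, va) in enumerate(cv.split(X)):
            model = Ridge(alpha=lam).fit(X[tr], y[tr])   # alpha == our lambda
            mse[i, j] = mean_squared_error(y[va], model.predict(X[va]))
    mean = mse.mean(axis=1)
    se = mse.std(axis=1, ddof=1) / np.sqrt(n_splits)
    threshold = mean.min() + se[mean.argmin()]
    # Most regularized model still within one SE of the minimum:
    return max(lam for lam, m in zip(lambdas, mean) if m <= threshold)
```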
Detailed performance results of the model are provided in Table 2. The regularization parameter lambda (λ) is optimized to minimize the MSE on the validation set, and the optimal value is determined as 2.171. The training process uses 45,351 samples for the training set, 11,338 for the validation set, and 14,172 for the test set. The resulting MSE values on the validation and test sets are 104.637 and 104.251, respectively; the small difference between these two errors indicates that the model generalizes well from the validation set to the test set.
The model’s overall performance is further evaluated using various error metrics calculated on the test set (Table 4). In addition to the MSE, the root mean squared error (RMSE) is computed as 10.21, while the mean absolute error (MAE) and mean absolute percentage error (MAPE) are obtained as 7.087 and 8.91%, respectively. Coefficient of determination (R²), the proportion of variance in the dependent variable accounted for by the independent variables of the model, is calculated to be 0.823. It indicates that the Ridge regression model can account for approximately 82.3% of variation in the target variable. Such a high R² value suggests that the model effectively captures the underlying data structure and minimizes unexplained variance. In conjunction with the low error metrics observed (MSE, RMSE, MAE, and MAPE), this value supports the conclusion that the model achieves high predictive accuracy and robust generalizability when applied to unseen data (Table 5).
In this additive decomposition, the Base value (29.459) represents the model’s intercept, which remains constant across all cases and reflects the predicted outcome without any feature input. The final Predicted value is computed by summing this base with the weighted contributions of features x1, x2, and x3. The contributions shown in Table 6 are directly influenced by the coefficients learned by the Ridge Regression model. Because Ridge applies L2 regularization, it penalizes large coefficients to reduce overfitting and multicollinearity, distributing influence more smoothly across correlated features. In this case, x1 consistently contributes the most to the prediction, indicating that it holds the highest weight in the model despite regularization.
Conversely, x2 has negligible contributions, indicating that Ridge has successfully reduced its coefficient towards zero since it is less predictive. x3 has positive contributions with moderate effect in all cases, indicating its complementary but secondary nature in the regression equation. These contribution values provide insight into how Ridge Regression allocates influence among features and reinforces the interpretability of linear models when paired with additive explanation methods. Such analysis is valuable for evaluating model performance and understanding which features drive individual predictions.
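This additive breakdown can be reproduced directly from a fitted linear model. A short sketch follows, using the coefficients reported in Eq. 1 below; the factor values in the usage example are hypothetical.

```python
import numpy as np

def additive_contributions(intercept, coefs, x):
    """prediction = intercept + sum_i coef_i * x_i, returned term by term."""
    contribs = coefs * x                      # one additive contribution per feature
    return intercept + contribs.sum(), contribs

# Coefficients as reported for the three-factor model; factor values made up.
intercept = 29.416
coefs = np.array([19.274, 1.222, 0.168])      # f1, f2, f3
pred, parts = additive_contributions(intercept, coefs, np.array([0.8, -0.2, 0.5]))
print(pred, parts)                            # base plus per-factor contributions
```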
The regression equation shown in Eq. 1 summarizes the coefficients estimated by the Ridge Regression model:

$$\hat{y} = 29.416 + 19.274\,f_1 + 1.222\,f_2 + 0.168\,f_3 \quad (1)$$

The intercept value of 29.416 is the baseline prediction when all features are zero. Among the features, f1 has the highest coefficient (19.274), indicating that it is the most influential predictor in the model. f2 has a moderate effect (1.222), while f3 has a minimal influence (0.168), which reflects Ridge Regression’s tendency to shrink less relevant coefficients toward zero through L2 regularization.
These coefficients relate directly to the feature contributions observed in the earlier prediction tables, confirming that f1 consistently drives the model’s output. Ridge Regression’s regularization enhances the model’s stability and prevents overfitting, particularly when features are correlated or the feature space is high-dimensional.
Ridge Regression is selected in this study for its robustness and suitability for PV power forecasting tasks, where input features such as solar irradiance, temperature, humidity, and time variables are often interrelated and subject to noise. In PV forecasting, model generalizability is critical because environmental conditions vary significantly across time and location. Ridge Regression addresses this by penalizing large coefficient values, preventing the model from overfitting training noise or outliers. It also ensures that all predictors contribute proportionally irrespective of multicollinearity, which is common in meteorological data. By stabilizing coefficient estimates and reducing model variance, Ridge Regression enhances prediction accuracy and provides more reliable outputs, making it a suitable and interpretable model for predicting short-term PV power generation.
Figure 5 illustrates the predictive performance of the Ridge Regression model. The scatter plot compares the predicted values with the measured observations. The red diagonal line represents the ideal case in which predictions perfectly match the measured values. As shown in the figure, the majority of the points cluster around the diagonal, indicating that the Ridge Regression model achieves a strong degree of accuracy.
Despite some dispersion, particularly at lower and higher values, the general alignment suggests that the model captures the underlying trends in the data effectively. The marginal histograms on the axes illustrate the distributions of predicted and observed values and indicate that the model slightly underpredicts at higher irradiance levels.
Figure 6 presents the feature importance scores of the Ridge Regression model based on mean dropout loss. Among the features, x1 has the largest mean dropout loss (30.13), making it the most significant feature for accurate predictions. In contrast, x2 and x3 contribute comparatively less to the model’s performance, with dropout losses of 10.286 and 10.113, respectively.
These findings are consistent with the coefficient magnitudes observed in the Ridge Regression equation. The high dropout loss for x1 reaffirms its dominant predictive role, while the relatively low values for x2 and x3 suggest that the model can still perform adequately in their absence. This analysis highlights Ridge Regression’s ability to concentrate predictive weight on the most informative variables while maintaining regularized control over less significant ones, further enhancing generalizability.
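Mean dropout loss of this kind can be approximated with a permutation-style computation: shuffle one feature, re-measure the loss, and average over repeats. The sketch below is an analogue under that assumption, not the exact explainability tooling used to produce Fig. 6.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def mean_dropout_loss(model, X, y, n_repeats=10, seed=0):
    """RMSE after permuting each feature, averaged over repeats. Higher values
    mean the model relies more heavily on that feature."""
    rng = np.random.default_rng(seed)
    losses = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break feature-target link
            losses[j] += np.sqrt(mean_squared_error(y, model.predict(Xp)))
    return losses / n_repeats
```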
Validation of the study
In this study, a validation process is conducted using real-world data to evaluate the predictive performance of the developed model. In the validation phase, we used a new, independent dataset distinct from the training and testing data. Specifically, the dataset composition was altered by incorporating observations from different seasons, enabling an out-of-season generalization assessment and preventing information leakage. In this context, the predicted values obtained through the Ridge regression model are compared with the observed values, and the model’s generalizability is analyzed. This comparative analysis based on real data provides an important reference for testing whether the model performs successfully not only on training data but also in real-world application scenarios.
The calculated performance metrics further affirm the predictive strength of the proposed model. The Proposed Method achieves the best performance on the test set (MSE = 2.92, RMSE = 1.71, MAE = 1.47), accompanied by near-perfect goodness-of-fit (R² = 0.95) and minimal normalized error (Table 7). Additionally, the scaled MSE value of 0.005 indicates that prediction errors are quantitatively limited and small relative to the overall data range. Overall, the evidence indicates that the Proposed Method generalizes more effectively and tracks the magnitude of the signal substantially better than the competing approaches.
Figure 7 complements the quantitative results with a scatter plot of predicted versus actual total active power. Ridge Regression predictions are plotted against observed values, with a red dashed identity line depicting perfect agreement. Points lie in a tight cluster around this line throughout the full operating range, except for a few marked outliers indicating slight under- or overestimation. This close correspondence to the ideal fit shows that the model captures the dominant signal, which is further supported by the low RMSE and high R² in Table 7, emphasizing that the model is robust and generalizes well in forecasting photovoltaic power from weather inputs.
The error distribution in Fig. 8 is tightly concentrated around zero and nearly symmetric, with only occasional outliers. This symmetry indicates no systematic bias: predictions are neither uniformly high nor uniformly low. The near-normal shape of the histogram agrees with the residual assumptions of a well-specified model, and the reasonably even spread suggests close to constant variance (homoskedasticity), consistent with stable performance at both low and high power levels.
The overall distribution of Absolute Percentage Error (APE) values shows a low, flat central tendency without visible drift, indicating consistent error behavior over the evaluation period; occasional spikes aside, roughly 90% of the predictions remain within about 27% relative error (Fig. 9). The sparsity and bursty nature of the peaks suggest they arise under high-ramp or low-power conditions rather than from systematic bias. In most cases the model’s percentage errors are low and consistent, with the few high-APE cases potentially mitigated by ramp-aware features or special handling of low-power regimes.
Benchmark comparison and result analysis
We evaluated persistence (lag-1), ARIMA, SVR with Radial Basis Function (RBF), Random Forest, Gradient Boosting, LSTM, and the proposed HFA–Ridge under a single, time-ordered split (train/validation/test = 0.64/0.16/0.20), with the dataset chronologically sorted by the order field and no shuffling. Pre-processing followed a train-only fitting policy: covariates were standardized on the training set and the same transform applied to validation/test; the target was never scaled for metric computation. Hyperparameters were selected only on train + validation using blocked time-series CV (K = 3); each model was then refit on train + validation at the selected settings and evaluated once on the held-out test slice. We report MSE, RMSE, MAE, MASE, skill vs. persistence, and R². On the internal slice, the persistence denominators are RMSE = 6.623 kW and MAE = 3.308, which anchor all skill/MASE values (Tables 8 and 9).
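The split and the persistence-anchored metrics can be sketched as follows; array names are illustrative, and the MASE shown is anchored to the test-slice persistence MAE as described above.

```python
import numpy as np

def chronological_split(n):
    """Index slices for a time-ordered 0.64/0.16/0.20 train/val/test split."""
    i_tr, i_va = int(0.64 * n), int(0.80 * n)
    return slice(0, i_tr), slice(i_tr, i_va), slice(i_va, n)

def skill_and_mase(y_test, y_pred):
    """Skill vs. lag-1 persistence and a persistence-anchored MASE."""
    y_true, y_naive = y_test[1:], y_test[:-1]        # lag-1 persistence forecast
    rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
    mae = lambda a, b: np.mean(np.abs(a - b))
    skill_rmse = 1.0 - rmse(y_true, y_pred[1:]) / rmse(y_true, y_naive)
    mase = mae(y_true, y_pred[1:]) / mae(y_true, y_naive)   # < 1 beats persistence
    return skill_rmse, mase
```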
Model-specific tuning is carried out through grid searches and parameter selection. Using the same CV protocol on train + validation:
- ARIMA orders were chosen by Akaike Information Criterion (AIC) over a bounded grid (p, q ≤ 4, d ≤ 2), with daily seasonality (s = 96) retained only if it improved AIC;
- SVR (RBF) used standardized inputs and a grid over C ∈ {1, 10, 100}, γ ∈ {“scale”, 0.1, 0.01, 0.001}, ε ∈ {0.1, 0.2, 0.3}, employing Halving Grid Search CV (factor ≈ 3) where available, otherwise Grid Search CV;
- Random Forest used n_estimators ∈ {200, 400}, max_depth ∈ {None, 12, 20}, min_samples_leaf ∈ {1, 3, 5};
- Gradient Boosting used n_estimators ∈ {200, 300, 400}, learning_rate ∈ {0.01, 0.05, 0.1}, max_depth ∈ {2, 3, 4}, subsample ∈ {0.6, 0.8, 1.0} with neg-MSE scoring;
- LSTM consumed standardized features in 96-step sliding windows (≈ 24 h of 15-min data), with two stacked LSTM layers (64 units each), dropout = 0.1, Adam (1e-3), batch = 256, and at most 400 epochs with Early Stopping / Reduce LR On Plateau (see the sketch after this list);
- HFA–Ridge used three second-order latent factors (Radiation, Temperature, Wind/Climate) learned via hierarchical factor analysis, with λ chosen by CV plus the 1-SE rule.
The final hyperparameters actually used are reported in Tables 8 and 10.
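As one example of the baseline configurations listed above, a Keras sketch of the LSTM follows; the architecture matches the listed settings, while the callback patience values are assumptions of this sketch.

```python
import tensorflow as tf

def build_lstm(n_features, window=96):
    """Two stacked 64-unit LSTM layers with dropout 0.1, Adam(1e-3), MSE loss."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

# Patience values below are assumptions; the text only names the callbacks.
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=10),
]
# model.fit(X_win, y, batch_size=256, epochs=400,
#           validation_data=(Xv, yv), callbacks=callbacks)
```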
On the internal test split (Table 9), tree ensembles (Gradient Boosting, Random Forest) achieve the lowest absolute errors. Persistence is, as expected for a 15-minute horizon, a strong reference; against this anchor (RMSE = 6.623 kW; MAE = 3.308), the proposed HFA–Ridge is competitive in explained variance (R² = 0.823) yet shows negative skill in RMSE/MAE, which is typical in short-lag-dominated settings with high autocorrelation. ARIMA, LSTM, and SVR trail across metrics on this internal slice.
The study shows that the factor-analysis + ridge regression framework is both accurate and interpretable, offering robustness against multicollinearity and clear parameter transparency. While ensemble methods like Boosting and Random Forest achieve strong predictive performance, their complexity, overfitting risk, and low interpretability reduce their practical value. Ridge regression is highlighted as a balanced alternative, combining precision with interpretability for reliable photovoltaic forecasting.
For validation purposes, we used a new, independent dataset distinct from the training and testing data. Specifically, the dataset was restructured by systematically selecting one full day from each month across 45 months, yielding 53 observations per day. This design ensured a homogeneous and seasonally balanced dataset, supporting fair evaluation and reducing the risk of information leakage. To test robustness under distribution shift, we froze all models and evaluated them on this independent data set (Table 11). Ensembles show smaller gains; LSTM and SVR underperform. Taken together, these results indicate that HFA–Ridge generalizes best under seasonal shift while retaining full interpretability. It should be pointed out that ARIMA is omitted from Table 11 because it is a univariate baseline without exogenous inputs and does not admit feature-level additive attributions; its predictive metrics are reported alongside the other baselines in Table 9.
Moreover, Table 11 illustrates how the Ridge model assigns consistent, additive contributions that align with its learned coefficients on factorized inputs. This shows clearly how physically meaningful factors translate into predictions, something black-box models cannot provide. Such clarity reinforces the value of regression-based approaches as both precise and interpretable tools for PV forecasting.
In addition, the joint evidence from Tables 9 and 11 shows a consistent trade-off: ensembles win in-split absolute error, whereas the HFA–Ridge model wins on the validation data set with positive skill and MASE < 1, while uniquely preserving coefficient-level interpretability and low deployment cost. Given the high multicollinearity and physical coherence among meteorological drivers, the factor-plus-ridge design yields stable coefficients, transparent case-level attributions, and favorable generalization, key requirements for operational forecasting.
Finally, we analyze the computational cost and scalability of the proposed method. The hierarchical factor analysis reduces the original predictor space to three second-order latent factors. Training cost is dominated by factor extraction on p variables, followed by a ridge fit on a 3-dimensional design. In big-O terms this is O(n·p²) for factor extraction plus O(n·k² + k³) for ridge with k = 3; inference is O(k) per time step (compute factors plus one dot product). Only a single regularization parameter λ is tuned (selected by time-series CV with the 1-SE rule), so search cost is minimal. The entire pipeline trains and serves on CPU, supports streaming deployment, and retrains rapidly for new periods or sites with the same preprocessing.
Compared with the competing baselines, the proposed HFA–Ridge is the most economical to train and redeploy: hierarchical factor analysis compresses the predictor space to three second-order factors, after which ridge is fit in a 3-D design with a single regularization parameter (λ), yielding CPU-only training and constant-time per-step inference in the reduced factor space.
- Tree ensembles (random forest, gradient boosting) require multi-parameter search over depth/trees/subsample/leaf size; their training cost grows roughly with (trees × depth × cross-validation passes), and although inference is fast, multiple tree traversals and model size complicate frequent re-training.
- SVR (RBF) incurs at least quadratic scaling in the number of samples during hyperparameter search, and prediction latency scales with the number of support vectors.
- LSTM entails epoch-based backpropagation over sequences with a larger hyperparameter surface (layers/units/dropout/batch/optimizer) and typically benefits from GPU acceleration; inference is linear in sequence length yet heavier than a single linear model.
- Classical ARIMA needs order/seasonality selection and becomes costly under repeated re-fitting or seasonal terms.

As a result, while tree ensembles attain the lowest in-split absolute errors, HFA–Ridge offers the most favorable accuracy–interpretability–cost trade-off: minimal tuning (one λ), compact CPU-only deployment, and positive external performance (MASE < 1, positive skill) on the independent multi-season set; hence it scales best for operational use. Scalability is analyzed analytically via algorithmic complexity and tuning burden and linked empirically to external performance.
Discussion and conclusion
This study introduces an interpretable, statistically grounded two-stage forecasting framework that combines HFA with ridge regression. By organizing meteorological drivers into a small set of second-order latent factors and regularizing the final linear model, the approach addresses multicollinearity explicitly while preserving coefficient-level interpretability. The resulting representation is conceptually coherent and computationally lean, offering a structured alternative to flat, high-dimensional inputs commonly used in PV forecasting.
The benchmarking results, read jointly across the internal split (Table 9) and the independent data set (Table 11), reveal a consistent pattern. On the internal 15-minute horizon, the lag-1 persistence baseline is strong and tree ensembles achieve the lowest absolute errors; the proposed HFA–Ridge attains competitive explained variance (R² ≈ 0.82) but shows negative skill vs. persistence in RMSE/MAE, an outcome consistent with short-lag-dominated settings. Crucially, when evaluated out-of-season without refitting, HFA–Ridge outperforms all baselines on the external set, achieving MASE < 1 and positive skill vs. persistence (Table 11). This indicates superior transportability to unseen seasonal regimes, which is often the decisive criterion for operational deployment.
From a practical standpoint, the method is inexpensive to train, simple to redeploy, and auditable. HFA compresses correlated inputs to three physically meaningful factors, and ridge regression tunes only a single regularization parameter; the entire pipeline runs on CPU with constant-time per-step inference in the reduced factor space. In contexts where retraining frequency, resource constraints, or regulatory accountability matter, the combination of external-set accuracy, parameter transparency, and low operational cost provides a favorable accuracy–interpretability–deployment trade-off. While ensembles remain attractive for minimizing in-split error, their tuning burden, model size, and limited coefficient-level interpretability can be constraining in field settings; conversely, HFA–Ridge is slightly behind in-split yet best under seasonal shift and far easier to scale across sites and periods.
In sum, the proposed HFA-Ridge design provides an accurate, generalizable, and explainable forecasting pipeline that aligns scientific interpretability with operational needs. It complements high-performing ensembles by offering a transparent, low-cost option that excels under seasonal shift, thereby broadening the toolkit available for reliable PV power forecasting.
Data availability
The datasets generated and/or analysed during the current study are available in the GitHub repository: https://github.com/tkaraca/An-Interpretable-Statistical-Approach-to-Photovoltaic-Power-Forecastin.git.
Abbreviations
ANN: Artificial Neural Networks
AIC: Akaike Information Criterion
BIPV: Building Integrated Photovoltaic
DL: Deep Learning
ELM: Extreme Learning Machines
FSPV: Floating Solar Photovoltaic
GPR: Gaussian Process Regression
GRU: Gated Recurrent Units
HFA: Hierarchical Factor Analysis
LSTM: Long Short-Term Memory
ML: Machine Learning
PCA: Principal Component Analysis
PV: Photovoltaic
RBF: Radial Basis Function
RE: Renewable Energy
RNN: Recurrent Neural Networks
SVM: Support Vector Machines
TSFA: Time Series Factor Analysis
References
Wang, K., Qi, X. & Liu, H. Photovoltaic power forecasting based LSTM-Convolutional network. Energy 189, 116225 (2019).
Tang, H., Kang, F., Li, X. & Sun, Y. Short-term photovoltaic power prediction model based on feature construction and improved transformer. Energy 320, 135213 (2025).
Konstantinou, M., Peratikou, S. & Charalambides, A. G. Solar photovoltaic forecasting of power output using LSTM networks. Atmosphere 12, 124 (2021).
Kim, G. G. et al. Prediction model for PV performance with correlation analysis of environmental variables. IEEE J. Photovolt. 9, 832–841 (2019).
Esen, V., Saglam, Ş., Oral, B. & Ceylan Esen, Ö. Toward class AAA LED large scale solar simulator with active cooling system for PV module tests. IEEE J. Photovolt. (2022). https://doi.org/10.1109/JPHOTOV.2021.3117912
Mellit, A., Massi Pavan, A., Ogliari, E., Leva, S. & Lughi, V. Advanced methods for photovoltaic output power forecasting: A review. Appl. Sci. 10, 487 (2020).
Sangrody, H., Zhou, N. & Zhang, Z. Similarity-Based models for Day-Ahead solar PV generation forecasting. IEEE Access. 8, 104469–104478 (2020).
AlKandari, M. & Ahmad, I. Solar power generation forecasting using ensemble approach based on deep learning and statistical methods. Appl. Comput. Inf. 20, 231–250 (2024).
Eseye, A. T., Zhang, J. & Zheng, D. Short-term photovoltaic solar power forecasting using a hybrid Wavelet-PSO-SVM model based on SCADA and meteorological information. Renew. Energy. 118, 357–367 (2018).
Liu, R. et al. A short-term probabilistic photovoltaic power prediction method based on feature selection and improved LSTM neural network. Electr. Power Syst. Res. 210, 108069 (2022).
Pretto, S., Ogliari, E., Niccolai, A. & Nespoli, A. A new probabilistic ensemble method for an enhanced Day-Ahead PV power forecast. IEEE J. Photovolt 12, 581–588 (2022).
Rajagukguk, R. A., Ramadhan, R. A. A. & Lee, H. J. A review on deep learning models for forecasting time series data of solar irradiance and photovoltaic power. Energies 13, 6623 (2020).
Jogunuri, S. et al. Random forest machine learning algorithm based seasonal multi-step ahead short‐term solar photovoltaic power output forecasting. IET Renew. Power Gener. 19, 1–16 (2024).
Harrou, F., Kadri, F. & Sun, Y. Forecasting of photovoltaic solar power production using LSTM approach. In Advanced Statistical Modeling, Forecasting, and Fault Detection in Renewable Energy Systems (eds Harrou, F. & Sun, Y.) (IntechOpen, 2020). https://doi.org/10.5772/intechopen.91248
Sharadga, H., Hajimirza, S. & Balog, R. S. Time series forecasting of solar power generation for large-scale photovoltaic plants. Renew. Energy. 150, 797–807 (2020).
Al-Dahidi, S., Alrbai, M., Alahmer, H., Rinchi, B. & Alahmer, A. Enhancing solar photovoltaic energy production prediction using diverse machine learning models tuned with the chimp optimization algorithm. Sci. Rep. 14, 18583 (2024).
Amiri, A. F., Chouder, A., Oudira, H., Silvestre, S. & Kichou, S. Improving photovoltaic power prediction: insights through computational modeling and feature selection. Energies 17, 3078 (2024).
Nespoli, A., Leva, S., Mussetta, M. & Ogliari, E. G. C. A selective ensemble approach for accuracy improvement and computational load reduction in ANN-Based PV power forecasting. IEEE Access. 10, 32900–32911 (2022).
Tahir, M. F., Yousaf, M. Z., Tzes, A., El Moursi, M. S. & El-Fouly, T. H. M. Enhanced solar photovoltaic power prediction using diverse machine learning algorithms with hyperparameter optimization. Renew. Sustain. Energy Rev. 200, 114581 (2024).
Ganesh, R., Saha, T. K. & Kumar, M. L. S. Implementation of optimized extreme learning machine-based energy storage scheme for grid connected photovoltaic system. J. Energy Storage. 88, 111611 (2024).
Kermia, M. H., Abbes, D. & Bosche, J. Photovoltaic power prediction using a recurrent neural network RNN. In 6th IEEE International Energy Conference (ENERGYCon) 545–549 (2020).
Yu, J. et al. Deep learning models for PV power forecasting: review. Energies 17, 3973 (2024).
Narayanan, S., Kumar, R., Ramadass, S. & Ramasamy, J. Hybrid forecasting model integrating RNN-LSTM for renewable energy production. Electr. Power Compon. Syst. 1–19. https://doi.org/10.1080/15325008.2024.2316247 (2024).
Dhaked, D. K., Dadhich, S. & Birla, D. Power output forecasting of solar photovoltaic plant using LSTM. Green. Energy Intell. Transp. 2, 100113 (2023).
Thipwangmek, N., Suetrong, N., Taparugssanagorn, A., Tangparitkul, S. & Promsuk, N. Enhancing Short-Term solar photovoltaic power forecasting using a hybrid deep learning approach. IEEE Access. 12, 108928–108941 (2024).
Hou, Z., Zhang, Y., Liu, Q. & Ye, X. A hybrid machine learning forecasting model for photovoltaic power. Energy Rep. 11, 5125–5138 (2024).
Asiedu, S. T., Nyarko, F. K. A., Boahen, S., Effah, F. B. & Asaaga, B. A. Machine learning forecasting of solar PV production using single and hybrid models over different time horizons. Heliyon 10, e28898 (2024).
Eze, B. O. & Ayorinde, O. S. Prediction of renewable energy generation using machine learning: A systematic review of literature. Int. J. Innov. Res. Electron. Commun. 11, 1714–1718 (2024).
Yang, S. & Luo, Y. Short-term photovoltaic power prediction based on RF-SGMD-GWO-BiLSTM hybrid models. Energy 316, 134545 (2025).
Nematzadeh, S. & Esen, V. Explainable machine learning and predictive statistics for sustainable photovoltaic power prediction on areal meteorological variables. Appl. Sci. 15, 8005 (2025).
Tripathi, A. K. et al. Advancing solar PV panel power prediction: A comparative machine learning approach in fluctuating environmental conditions. Case Stud. Therm. Eng. 59, 104459 (2024).
Gaboitaolelwe, J. et al. Machine learning based solar photovoltaic power forecasting: a review and comparison. IEEE Access. 11, 40820–40845 (2023).
Sarkar, D., Kumar, A. & Sadhu, P. K. Different diode models comparison using Lambert W function for extracting maximum power from BIPV modules. Int. J. Energy Res. 45, 691–702 (2021).
Sarkar, D. & Sadhu, P. K. A new hybrid BIPV array for enhancing maximum power with reduced mismatch losses under extreme partial shading scenarios. Energy Sources Part. Recovery Util. Environ. Eff. 44, 5172–5198 (2022).
Singh, N. K., Goswami, A. & Sadhu, P. K. Parameter extraction of floating solar PV system with war strategy optimization for sustainable cleaner generation. Microsyst. Technol. 30, 481–488 (2024).
Sarkar, D. & Sadhu, P. K. A novel fixed BIPV array for improving maximum power with low mismatch losses under partial shading. IETE J. Res. 69, 8423–8443 (2023).
Wansbeek, T. & Meijer, E. Measurement error and latent variables. In A Companion to Theoretical Econometrics (Wiley, 2003). https://doi.org/10.1002/9780470996249.ch9
Bai, J. & Ng, S. Determining the number of factors in approximate factor models. Econometrica 70, 191–221 (2002). https://doi.org/10.1111/1468-0262.00273
Buskmiller, C. et al. Validation of a brief measure for complicated grief specific to reproductive loss. Cureus https://doi.org/10.7759/cureus.37884 (2023).
Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
Funding
No funding was received to assist with preparing this manuscript.
Author information
Authors and Affiliations
Contributions
V.E. conceived the main idea and supervised the study. B.C. and B.Y.K. performed the data analysis. T.K.K. prepared the figures and tables. A.S.S. and T.D. contributed to the literature review. The methodology was developed by B.C., B.Y.K., and T.K.K. All authors contributed to the writing of the manuscript. V.E. performed the editing and final revision. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Conflicts of interest
The authors have no conflicts of interest to declare relevant to this article’s content.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Esen, V., Coban, B., Kavus, B.Y. et al. An interpretable statistical approach to photovoltaic power forecasting using factor analysis and ridge regression. Sci Rep 15, 38947 (2025). https://doi.org/10.1038/s41598-025-22838-x