Introduction

The massive adoption of PV systems as a renewable energy source has transformed the global energy landscape1,2,3,4,5, driving studies on their performance under extreme weather conditions6,7,8,9. In particular, high-altitude environments (> 3800 m a.s.l.)10,11,12 present unique challenges: high atmospheric variability, non-stationary irradiance, and accelerated component degradation6,7,8,13,14,15,16. Given this, conventional predictive models exhibit limitations in accuracy and generalization17,18, which has motivated the development of machine learning (ML)-based metamodels that combine multiple techniques19,20,21,22,23,24. Recent advances integrate adaptive preprocessing25,26, feature selection27,28,29, and regularization30,31 with algorithms such as LightGBM and CatBoost32,33,34,35,36,37, while stacking approaches leverage synergies between underlying models to improve robustness38,39,40,41,42. However, gaps persist in reliable prediction under the extreme climate variability of mountainous areas43,44,45,46,47,48, where the scarcity of quality data exacerbates the problem16. Despite progress in PV metamodels38,39,40,41,42,47,48,49,50, existing studies suffer from three critical limitations in high-altitude environments: (i) insufficient handling of artifacts in monitoring data (losses, redundancies, and time shifts)16,51,52; (ii) underutilization of dimensionality reduction techniques to mitigate noise and spurious correlations27,28,29; and (iii) reliance on stacking architectures with homogeneous base models, which limits ensemble diversity39,40,41. This leads to overfitting under non-stationary weather conditions40,42 and high errors in active power prediction (MAE > 12.24 according to39; RMSE > 47.78 according to41). Furthermore, most models do not incorporate adaptive strategies for fragmented or incomplete data16,51,53, a recurring problem in systems monitored in remote areas14,15.

To overcome these limitations, this study proposes a robust methodology based on a four-stage process: (1) adaptive data preprocessing; (2) sequential feature selection (SFS), a key technique for coping with missing data, a recurring problem in extreme-climate monitoring; (3) principal component analysis (PCA) for dimensionality reduction and model optimization; and (4) a hybrid stacking ensemble designed to substantially improve prediction accuracy by combining the regularization of models such as Lasso and Ridge with the predictive agility of LightGBM and CatBoost. This approach not only contributes to the advancement of scientific knowledge but also offers practical implications for improving the efficiency and planning of photovoltaic systems under the challenging conditions of mountainous regions. Therefore, the main objective of this research was:

To develop and validate a high-precision active power prediction metamodel for photovoltaic (PV) systems installed at extreme altitudes.

Methodology

Photovoltaic array

The photovoltaic array was installed at 3,800 m above sea level. It is a grid-tied system comprising eight ERA SOLAR ESPSC 370 monocrystalline modules of 370 Wp each, for an installed capacity of 2,960 Wp, with a SolarEdge P370 DC-DC optimizer on each module and a 3 kW SolarEdge SE3000H HD-Wave single-phase inverter.

Data description and preprocessing

To guarantee reliable data acquisition, the system was monitored following high standards such as IEC 60904-1 "Measurement of photovoltaic current-voltage characteristics"54 and IEC 61724-1 "Monitoring of photovoltaic system performance"55, which were key to ensuring adequate data acquisition. To measure voltage and current on the direct-current side, the system used Schneider Electric Zelio Analog transducers, which sent the captured information to the PLC. For the metrics on the AC side, a HIKING TOMZN power meter was used, compliant with IEC 62053-21 "Requirements for alternating current active energy meters"56, which monitors current, voltage, active power, reactive power, power factor (cos φ), frequency, and total energy in forward and reverse kWh. This information was likewise sent to the Siemens LOGO PLC via the RS-485 Modbus industrial communication protocol. Finally, software configured in LabVIEW received the information held in the PLC over Modbus RS-485.

The instantaneous prediction approach adopted in this study was specifically designed to address the operational requirements of grid-tied photovoltaic systems at extreme altitudes, where real-time power forecasting is essential for inverter control and grid stability. Unlike time series models that require continuous historical data sequences, our approach maintains functionality with incomplete datasets, a critical advantage in remote monitoring environments where communication failures and equipment malfunctions frequently occur. This design philosophy enables immediate deployment in new installations without historical data accumulation periods, while the computational efficiency (convergence in < 50 iterations) allows implementation on resource-constrained edge computing devices typical of remote mountainous installations.

Model description

To build the two prediction metamodels for a PV system under extreme high-altitude conditions, four elements were used: data preprocessing, the Sequential Feature Selector (SFS), Principal Component Analysis (PCA), and stacking with CatBoost and LightGBM, which are detailed in the general flowchart in Fig. 1.

Fig. 1. General flow chart of the metamodels created.

As Fig. 1 shows, the sensor data were first consolidated and pre-processed into a single ordered dataset before applying SFS and PCA. SFS then tests feature combinations to choose the most informative variables and reduce the width of the dataset, preventing the regression model from being swamped by irrelevant inputs. PCA next transforms the selected features into new, uncorrelated variables (the principal components) that capture the essence of the data, improving clarity and limiting overfitting. Finally, stacking was applied: the level 1 models, Lasso and Ridge (with their respective regularizations), were trained on the PCA components, and their outputs fed the level 2 models, CatBoost and LightGBM. A minimal end-to-end sketch of this flow is shown below.
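The sketch below reproduces this four-stage flow with scikit-learn-style components. It is illustrative only: the file name, hyperparameters, and the use of scikit-learn's StackingRegressor are assumptions rather than the study's exact configuration, and the SFS stage is assumed to have already been applied to the input table.

```python
# Illustrative sketch of the Fig. 1 flow (assumed file name and hyperparameters).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import StackingRegressor
from lightgbm import LGBMRegressor

df = pd.read_csv("pv_dataset.csv")                  # hypothetical preprocessed dataset
X, y = df.drop(columns=["AC_Power"]), df["AC_Power"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35, random_state=42)

# Level 1: regularized linear models fitted on standardized PCA components.
level1 = [
    ("lasso", make_pipeline(StandardScaler(), PCA(n_components=5), Lasso(alpha=0.01))),
    ("ridge", make_pipeline(StandardScaler(), PCA(n_components=5), Ridge(alpha=1.0))),
]
# Level 2: a boosted metamodel trained on the level-1 predictions.
meta = StackingRegressor(estimators=level1, final_estimator=LGBMRegressor())
meta.fit(X_tr, y_tr)
print("test R^2:", meta.score(X_te, y_te))
```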

The following metrics were used to determine the performance of the two proposed metamodels (a computation sketch follows the list):

  • Train score and test score: indicate, on a scale from 0 to 1, how well the model fits the training and test data, respectively. A model that fits the training data too closely (high variance) overfits: it bends excessively to the training points, generalizes poorly, and therefore yields a low test score.

  • MAE: Represents the average of the absolute differences between the actual and predicted values in the dataset. It is calculated with Eq. (1), where N is the total number of samples, \(y_{i}\) is the actual value, and \(\hat{y}_{i}\) is the predicted value49.

$$MAE=\frac{1}{N}\sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|$$
(1)
  • MSE: Represents the average of the squared differences between the actual and predicted values in the dataset and measures the variance of the residuals. Its value was calculated using Eq. (2)49.

$$MSE=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
(2)
  • RMSE: Represents the average magnitude of the errors in a model's predictions. It is calculated as the square root of the mean of the squared errors, Eq. (3)49.

$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}$$
(3)
  • nRMSE: A unitless metric that allows direct comparison of model accuracy even when the predicted variables have very different units or ranges. It normalizes the RMSE by the range of observed values and is calculated using Eq. (4)49.

$$nRMSE=\frac{RMSE}{y_{max}-y_{min}}$$
(4)
  • Coefficient of determination (R²): Represents the proportion of variance in the dependent variable that is explained by the regression model. Its value was calculated using Eq. (5)49.

$$R^{2}=1-\frac{\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{N}\left(y_{i}-\bar{y}\right)^{2}}$$
(5)
  • Adjusted coefficient of determination: Adjusts R² according to the number of independent variables in the model and is always less than or equal to R². Eq. (6) was used for its calculation, where n is the number of observations and k is the number of independent variables49.

$$R_{adj}^{2}=1-\left[\frac{\left(1-R^{2}\right)\left(n-1\right)}{n-k-1}\right]$$
(6)
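As a reference for the list above, the following NumPy sketch computes Eqs. (1)-(6); y_true, y_pred, and k are placeholders for the observed values, the model predictions, and the number of predictors.

```python
# Hedged sketch: the metrics of Eqs. (1)-(6) computed with NumPy.
import numpy as np

def regression_report(y_true, y_pred, k):
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))                       # Eq. (1)
    mse = np.mean((y_true - y_pred) ** 2)                        # Eq. (2)
    rmse = np.sqrt(mse)                                          # Eq. (3)
    nrmse = rmse / (y_true.max() - y_true.min())                 # Eq. (4)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot                                     # Eq. (5)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)                # Eq. (6)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "nRMSE": nrmse, "R2": r2, "R2_adj": r2_adj}
```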

Two important blocks to ensure the generalization of the model are the Sequential Feature Selector and the Principal Component Analysis. Both blocks are detailed in Fig. 2.

In Fig. 2 the flowchart begins with the identification of the dependent variable, "AC_Power", and the independent variables: Date, Time, AC_Current, AC_Voltage, AC_Apparent_Power, AC_Reactive_Power, AC_Power_Factor, DC_Voltage, AC_Frequency, DC_Power, and DC_Current. Baseline performance was first evaluated to establish a benchmark for model development. The variables were then added to the model one by one and performance was re-evaluated; if performance improved, the variable was kept, and the procedure was repeated until all independent variables had been tried. The process stopped when the optimal subset of variables had been selected.

PCA was then applied to reduce dimensionality, tracking how much of the variability of the original data was represented by the newly generated components until an acceptable threshold was reached. To do this, the data provided by the SFS were standardized by subtracting the mean and dividing by the standard deviation, ensuring that each feature contributes equally to the analysis. The covariance matrix was computed to determine the linear relationships between the features, and its eigenvalues and eigenvectors were calculated; the eigenvalues indicate the amount of variance each principal component captures in the data. The eigenvalues were then sorted in descending order together with their eigenvectors to identify the components that capture most of the variance, and the number of principal components was selected according to the variance criterion. Finally, the original data were projected onto the space defined by the selected eigenvectors, yielding a new, lower-dimensional dataset that retains most of the relevant information from the original one. This transformation reduces dimensionality and noise, improves performance, and aids visualization. A from-scratch sketch of these PCA steps is given after Fig. 2.

Fig. 2. Detailed flowchart for SFS and PCA.
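The PCA procedure just described can be reproduced from scratch as in the following sketch; it is illustrative (the 99.999% threshold matches the variance figure reported later), not the authors' code.

```python
# From-scratch sketch of the PCA steps in Fig. 2.
import numpy as np

def pca_project(X, var_threshold=0.99999):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each feature
    C = np.cov(Z, rowvar=False)                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]              # sort by captured variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()       # cumulative explained variance
    n_comp = int(np.searchsorted(cum, var_threshold) + 1)
    return Z @ eigvecs[:, :n_comp]                 # data projected onto the selected PCs
```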

Figure 3 shows that the process to create the metamodel, represented by the stacking ensemble block, begins with feature selection via SFS; the dimensionality of the selected variables is then reduced with PCA to train the Ridge and Lasso models. Training both models generates predictions that become the input for the level 2 stacking models, CatBoost and LightGBM. Finally, the results obtained were compared to determine the best metamodel for predicting the system's active power; a sketch of this two-level scheme follows the figure.

Fig. 3. Metamodel flowchart.
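The following sketch illustrates the two-level scheme of Fig. 3. The out-of-fold construction of the level 1 predictions and the hyperparameters are assumptions intended only to illustrate the workflow, not a confirmed implementation detail.

```python
# Two-level stacking sketch (assumed workflow): level-1 Lasso/Ridge are fit on the
# PCA components and their predictions feed the level-2 CatBoost/LightGBM models.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def two_level_stack(X_pca, y):
    # Out-of-fold predictions avoid leaking the target into the level-2 models.
    p_ridge = cross_val_predict(Ridge(alpha=1.0), X_pca, y, cv=5)
    p_lasso = cross_val_predict(Lasso(alpha=0.01), X_pca, y, cv=5)
    Z = np.column_stack([p_ridge, p_lasso])        # level-2 input matrix
    lgbm = LGBMRegressor().fit(Z, y)               # metamodel candidate 1
    cat = CatBoostRegressor(verbose=0).fit(Z, y)   # metamodel candidate 2
    return lgbm, cat
```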

Results

Data pre-processing

To implement the prediction metamodels, the hosted Google Colab Jupyter Notebook service was used with the following characteristics: Processor: Intel® Xeon® CPU @ 2.20 GHz, RAM: 12.7 GB and hard disk: 107.7 GB.

For the generation of the model, 92,964 records were used, comprising the following fields: 'Date', 'Time', 'AC_Current', 'AC_Voltage', 'AC_Power', 'AC_Frequency', 'AC_Apparent_Power', 'AC_Reactive_Power', 'AC_Power_Factor', 'DC_Current', 'DC_Voltage', and 'DC_Power' (the 'Date', 'AC_Frequency', and 'DC_Current' fields appeared duplicated in the raw files, a redundancy handled during preprocessing). The data were separated into 65% for training and 35% for testing, corresponding to 60,426 training records and 32,538 test records for each variable, as sketched below.
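A minimal split sketch under these proportions (the file name and random seed are assumptions):

```python
# 65/35 split of the 92,964 records: 60,426 for training, 32,538 for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("pv_dataset.csv")                 # hypothetical consolidated file
X, y = df.drop(columns=["AC_Power"]), df["AC_Power"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.65, random_state=42)
```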

To develop the prediction metamodels, the data collected from the photovoltaic system, organized into daily directories generated by the monitoring system, were integrated into a data hub presented in Fig. 4. This figure illustrates the initial structure of the dataset, highlighting its complexity and the specificities of the research case study, such as missing values (1.40% on average, according to Table 1), unnamed variables, data displaced by errors in the collection system, and redundancies. During preprocessing, unnamed variables were eliminated, and displaced data were corrected by imputing missing values. These anomalies, common in monitoring systems under extreme altitude conditions, were effectively addressed by combining adaptive preprocessing and dimensionality reduction (SFS and PCA), which allowed the generation of a clean and representative dataset for training the predictive models. Figure 4 therefore represents the starting point of the data cleaning and consolidation process, essential for the robustness of the proposed metamodels.

The complexity observed in Fig. 4 reflects the genuine challenges of data acquisition in extreme altitude environments rather than presentation artifacts. This visualization demonstrates several critical aspects: (i) irregular temporal structure with inconsistent timestamps typical of remote monitoring systems, (ii) variable data density showing periods with complete measurements alternating with significant gaps, and (iii) the necessity for sophisticated preprocessing techniques to handle data fragmentation effectively. The figure represents raw data collected under industrial standards (IEC 60904-1, IEC 61724-1, IEC 62053-21), illustrating why conventional electrical calculations (P = V×I) become unreliable and sophisticated prediction models are essential for maintaining system operability under such challenging conditions.

Fig. 4. Raw dataset structure.

From Table 1, missing data were analyzed for all variables, amounting to only 1.40% on average. Subsequently, records with incomplete values were deleted; a cleaning sketch is given after Table 1.

Table 1 Percentage of missing data.
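The cleaning step can be sketched as follows (file name hypothetical; the column handling mirrors the anomalies described above):

```python
# Cleaning sketch: drop unnamed columns, de-duplicate redundant fields,
# and delete records with missing values (about 1.40% of rows on average).
import pandas as pd

df = pd.read_csv("raw_monitoring_dump.csv")
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]  # remove unnamed variables
df = df.loc[:, ~df.columns.duplicated()]               # remove duplicated fields
df = df.dropna()                                       # delete incomplete records
```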

Table 2 describes the statistics of all the variables to be processed. As a selection criterion, a variable was retained only if its coefficient of variation did not exceed 100% and its coefficient of asymmetry (skewness) did not exceed 4. This criterion is drawn from the existing literature. Table 2 shows that all variables meet both requirements; a screening sketch follows the table.

Table 2 Statistics of the data to be processed.
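Reading the stated asymmetry bound as a limit of 4 on the skewness coefficient (an interpretation on our part), the rule can be expressed as:

```python
# Variable screening: keep a variable only if CV <= 100% and |skewness| <= 4.
import pandas as pd

def screen_variables(df: pd.DataFrame) -> list:
    keep = []
    for col in df.select_dtypes("number").columns:
        cv = 100 * df[col].std() / abs(df[col].mean())  # coefficient of variation, %
        if cv <= 100 and abs(df[col].skew()) <= 4:      # assumed reading of the bound
            keep.append(col)
    return keep
```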

The correlation matrix (Fig. 5) quantifies interdependencies among the electrical variables of the 3800 m a.s.l. photovoltaic system, revealing critical redundancies that validate our dimensionality reduction strategy. Perfect multicollinearity (|r| = 1.00) between AC_Current, AC_Power, AC_Apparent_Power, and DC_Power confirms operational coupling in which current measurements inherently dictate power outputs, justifying the Sequential Feature Selector's (SFS) elimination of redundant variables to isolate 8 non-collinear predictors. Similarly, near-unity correlations between reactive power metrics (AC_Reactive_Power–AC_Power_Factor: r = 0.93; AC_Reactive_Power–DC_Voltage: r = 0.92) demonstrate harmonic distortion artifacts exacerbated by altitude-induced grid instability, necessitating Principal Component Analysis (PCA) to decompose these spurious relationships into orthogonal components capturing 99.999% of the variance. Crucially, AC_Frequency's weak correlations (r ≤ 0.63) with other variables underscore its role as a unique indicator of grid transients, explaining its retention during adaptive preprocessing.

Fig. 5. Correlation matrix.

The strategic implementation of Sequential Feature Selection (SFS) directly addresses the multicollinearity challenges evident in Fig. 5. While perfect correlations (|r| = 1.00) between electrical parameters might suggest redundancy, the practical value of our approach becomes apparent in real-world scenarios where sensor failures, communication disruptions, and environmental artifacts create incomplete datasets. The SFS methodology systematically eliminates redundant variables while preserving predictive capability, enabling accurate power forecasting even when primary measurements (current/voltage) are unavailable. This capability is particularly crucial in extreme altitude installations where maintenance access is limited and sensor reliability is compromised by harsh environmental conditions.

Sequential feature selector

To reduce the number of variables, the Sequential Feature Selector was applied in the second stage. For this selection method, the avg_score metric was used to determine the optimal number of variables, keeping the subset with the best score across iterations. The value found was −165.20, and the optimal subset according to the SFS method comprised 8 variables: 'AC_Current', 'AC_Voltage', 'AC_Apparent_Power', 'AC_Reactive_Power', 'AC_Power_Factor', 'DC_Voltage', 'DC_Power', 'DC_Current', as shown in Fig. 6; a selection sketch follows the figure.

Fig. 6. Selecting variables with SFS.
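The avg_score terminology suggests an mlxtend-style forward selector; the sketch below uses that API under assumed estimator and cross-validation settings, reusing the training split defined earlier.

```python
# Forward SFS sketch (mlxtend-style API; estimator and cv are assumptions).
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sfs = SFS(LinearRegression(),
          k_features="best",                    # keep the best-scoring subset
          forward=True,                         # add variables one at a time
          scoring="neg_mean_squared_error",     # yields avg_score values such as -165.20
          cv=5)
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_names_)                     # the 8 selected predictors
print(sfs.subsets_[len(sfs.k_feature_idx_)]["avg_score"])
```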

Principal component analysis

In the third stage, the reduction of dimensionality was carried out by means of Principal Component Analysis. The result of the application of the analysis is shown in Fig. 7.

Fig. 7. Outcome of the PCA application (cumulative explained variance as a function of the number of principal components).

By reducing the dimensionality of the original data into principal components, it was verified that the first five components already capture practically all the variability of the original data (99.999%), so the last three components were discarded, as shown in Fig. 7 and in the values of Table 3. A component-selection sketch is given after Table 3.

Table 3 Values of the components obtained by the PCA.
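This selection step can be written with scikit-learn's variance-threshold interface (the feature list is the SFS result reported above; everything else is illustrative):

```python
# Keep the fewest principal components explaining 99.999% of the variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

selected = ["AC_Current", "AC_Voltage", "AC_Apparent_Power", "AC_Reactive_Power",
            "AC_Power_Factor", "DC_Voltage", "DC_Power", "DC_Current"]
Z = StandardScaler().fit_transform(X_train[selected])
pca = PCA(n_components=0.99999, svd_solver="full")  # variance-threshold selection
X_pcs = pca.fit_transform(Z)                        # 5 components retained here
print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```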

Stacking level 1

Ridge

To obtain the base models (level 1) for stacking processing, the Ridge model was first generated on the components determined by PCA, with the results shown in Table 4.

Table 4 Results of ridge regression.

The importance of the principal components is shown in Fig. 8, where it can be seen that PC3 (21.29%), PC4 (57.39%), and PC5 (20.97%) are the components with the greatest influence on the model; a sketch of how such importances can be derived follows the figure.

Fig. 8. Importance of each component for the level 1 model: Ridge.
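One plausible way to obtain such percentages, assuming they are derived from the magnitudes of the fitted level 1 coefficients (an assumption on our part, reusing X_pcs and y_train from the sketches above):

```python
# Component importance sketch: normalized absolute Ridge coefficients per PC.
import numpy as np
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0).fit(X_pcs, y_train)
importance = 100 * np.abs(ridge.coef_) / np.abs(ridge.coef_).sum()
for i, imp in enumerate(importance, start=1):
    print(f"PC{i}: {imp:.2f}%")
```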

Lasso

In the same way, to obtain the other base model (level 1) for stacking processing, the Lasso model was also generated on the components determined by PCA, the results of which are shown in Table 5.

Table 5 Lasso regression results.

Similarly, Fig. 9 shows the importance of the principal components for the Lasso model, where PC3 (21.35%), PC4 (57.53%), and PC5 (20.77%) are the components with the greatest influence.

Fig. 9. Importance of each component for the level 1 model: Lasso.

To verify that the results of the component analysis were consistent, the statistical analysis of the two models was performed and compared with the original dependent variable, as shown in Table 6.

Table 6 Statistics of level 1 and objective models.

From Table 6, it can be seen that the behavior of the data predicted by Ridge and Lasso and that of the original dependent variable do not present notable anomalies or differences. Based on the Lasso and Ridge level 1 models, the level 2 stacking models were generated for the development of the prediction metamodels for a photovoltaic system under extreme altitude conditions.

Stacking level 2

At this level (see Fig. 10), the CatBoost metamodel obtained after stacking shows that Ridge contributes 60% to the metamodel, while Lasso contributes 40%. Similarly, after applying level 2 stacking using LightGBM to generate the metamodel, the results show that the Lasso model contributes 55%, while the Ridge model contributes 45%. A sketch of how these contributions can be read from the level 2 models follows the figure.

Fig. 10. Stacking level 2.
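Assuming the reported percentages correspond to the level 2 boosters' normalized feature importances over the two level 1 inputs (an interpretation, not a confirmed detail), they can be read as follows, reusing the lgbm and cat models from the earlier stacking sketch:

```python
# Contribution sketch: normalized feature importances of the level-2 models
# over their two inputs [ridge_pred, lasso_pred].
import numpy as np

def level1_contributions(model, names=("Ridge", "Lasso")):
    imp = np.asarray(model.feature_importances_, dtype=float)
    return dict(zip(names, np.round(100 * imp / imp.sum(), 1)))

print(level1_contributions(cat))    # e.g. {'Ridge': 60.0, 'Lasso': 40.0}
print(level1_contributions(lgbm))   # e.g. {'Ridge': 45.0, 'Lasso': 55.0}
```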

Figure 11 shows the evolution of the MSE during training of the LightGBM model, evidencing a rapid decline in the first 50 iterations before stabilizing around 13.66 (MSE) on the test set. The slight discrepancy between the training curve (solid line) and the validation curve (dotted line) confirms the robustness of the model against overfitting. Compared with CatBoost and OLS (Fig. 14), LightGBM not only achieves lower error values but also requires fewer iterations to converge, thus optimizing computational resources. Additionally, the figure was manually validated to correct minor distortions generated by automated tools, guaranteeing the reliability of its graphical representation. A logging sketch for these curves follows the figure.

Fig. 11. LightGBM MSE convergence.
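Curves like these can be logged with LightGBM's evaluation API, as in this sketch (variable names reuse the earlier stacking sketch; settings are assumptions):

```python
# Log per-iteration train/test MSE ('l2') during LightGBM training.
from lightgbm import LGBMRegressor

model = LGBMRegressor(n_estimators=200)
model.fit(Z_train, y_train,
          eval_set=[(Z_train, y_train), (Z_test, y_test)],
          eval_names=["train", "test"],
          eval_metric="l2")                       # l2 is the MSE
history = model.evals_result_                     # {'train': {'l2': [...]}, 'test': {'l2': [...]}}
```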

Figure 12 illustrates the evolution of the coefficient of determination of the LightGBM model during training (blue line) and validation (orange line). R² reaches a stable value of 0.999858 on the test set after 150 iterations, explaining virtually all the variance in the data. The rapid convergence (before iteration 50) and the close proximity between the two curves confirm that the model is not only accurate but also generalizable. Compared with CatBoost and OLS (Fig. 13), LightGBM improves R² by 0.0011 and 0.0071 points, respectively, highlighting its suitability for complex environments. It should be noted that the figure was manually refined to correct minor deviations in the axes, ensuring accurate interpretation.

Fig. 12. LightGBM coefficient of determination performance.

Metamodel comparison

Comparing the performance of the two developed metamodels and the reference OLS model across the score, coefficient of determination, and adjusted coefficient of determination (Fig. 13), the LightGBM metamodel achieves 99.9858% on all three metrics, while CatBoost shows slightly lower performance at 99.9848% and the ordinary least squares (OLS) model reaches 99.9787%.

Fig. 13. Performance of all models.

Figure 14 illustrates the comparative performance of the LightGBM, CatBoost, and ordinary least squares (OLS) models in predicting active power for photovoltaic systems under extreme high-altitude conditions. The LightGBM model demonstrates superior performance with a mean absolute error (MAE) of 6.7601, a mean squared error (MSE) of 13.6614, a root mean squared error (RMSE) of 14.24, and a normalized RMSE (nRMSE) of 0.01, outperforming the CatBoost metamodel, which exhibits an MAE of 7.2036, an MSE of 14.1517, an RMSE of 14.52, and an nRMSE of 0.01. In contrast, the OLS model achieves an MAE of 4.7180, an MSE of 16.24, and comparable RMSE and nRMSE values, overall reflecting lower predictive accuracy. These metrics highlight LightGBM's enhanced capability to fine-tune predictions and capture complex data patterns in challenging environmental contexts, underscoring its effectiveness for accurate power forecasting in high-altitude photovoltaic systems.

Fig. 14. Model error performance.

Therefore, it is concluded that the LightGBM metamodel is the most suitable for making predictions for the photovoltaic system, as it achieves the best metric values as well as shorter training and prediction times.

Discussion

Table 7 presents the results of previous studies related to this research, detailing the algorithms employed, the performance metrics used, and the values obtained.

Table 7 Model comparison.

Table 7 shows that the proposed LightGBM-based metamodel outperforms other models and approaches reported in the literature. The metamodel achieved a test score and an adjusted coefficient of determination of 99.9858%, values significantly higher than those reported by other studies, such as artificial neural networks combined with XGBoost and LSTM, which achieved an adjusted coefficient of determination of 0.9840. It likewise exceeds the 90.58% obtained by50, who used a network combining XGBoost, CatBoost, LGBM, and RF, and the 96.2% obtained by47 with a stacked LSTM sequence-to-sequence autoencoder hybrid DL approach. This result underscores the effectiveness of the LightGBM stacking approach for accurately predicting energy production under extreme high-altitude conditions. In terms of mean squared error (MSE), the LightGBM metamodel showed a value of 13.6614, considerably lower than the MSE of 29 reported for models combining convolutional neural networks and LSTM42, and much lower than the MSE of 2283.18 obtained by models such as XGB, LGB, and RF41. This indicates a greater ability of the proposed metamodel to minimize prediction errors. In addition, the MAE of the LightGBM metamodel was 6.7601, a marked improvement over the MAE of 12.24 observed in studies using Random Forest, XGBoost, and AdaBoost39, and also better than the 8.59 obtained with artificial neural networks combined with SVM and LSTM48.

Practical applications and model utility

Our model offers practical and valuable solutions for high-altitude solar systems. It enables smarter predictive maintenance by identifying critical sensors, reducing operating costs. Furthermore, the system is fault-tolerant, maintaining exceptional accuracy (R² = 99.9858%) even with partial sensor data, thus ensuring continuous operation. This instantaneous prediction capability is crucial for real-time grid integration and management of mountainous microgrids, while its computationally lightweight design enables deployment on resource-constrained edge devices.

Data fragmentation challenges in extreme altitude monitoring

Photovoltaic systems above 3,800 m face unique monitoring challenges due to harsh environmental conditions and limited access, resulting in common data fragmentation (an 8.03% loss in our study). Traditional predictive approaches, which require comprehensive data sets, are ineffective in these circumstances. To overcome this, our hybrid stacking method is specifically designed to handle this incomplete information, maintaining high predictive accuracy and ensuring continuous system operation regardless of the sensor status.

Conclusions

Reliable active power prediction in photovoltaic systems at extreme altitudes above 3800 m a.s.l. faces critical limitations due to non-stationary climate variability, monitoring data loss (8.03% loss in AC_Frequency/DC_Current, temporal shifts), and overfitting in conventional models (MAE > 12.24 in previous studies). To address this, the present study developed and validated a high-accuracy metamodel through a four-stage process: adaptive preprocessing for series reconstruction, sequential feature selection (SFS) that identified 8 optimal predictors, dimensionality reduction with PCA (capturing 99.999% of the variance with 5 components), and hybrid stacking that integrates Lasso/Ridge regularization with the nonlinear capability of LightGBM/CatBoost. The results demonstrate exceptional accuracy: the LightGBM model achieved R² = 99.9858%, MAE = 6.7601, and MSE = 13.6614, significantly outperforming CatBoost (MAE: 7.2036) and OLS (MSE: 16.24), with stable convergence in < 50 iterations and minimal training-validation discrepancy. The novelty lies in the algorithmic synergy that combines mathematical rigor (PCA for component orthogonality) and computational flexibility (boosting for nonlinearities), solving the dual challenge of fragmented data and environmental complexity. This approach enables the management of PV systems in mountains under climatic uncertainty (CV < 82.98% in key variables), optimizing grid integration in remote areas. Future research should validate the model in high-irradiation deserts, incorporate autoencoders for unsupervised fault detection, and develop fault-tolerant hardware to mitigate acquisition artifacts.