Introduction

The massive adoption of PV systems as a renewable energy source has transformed the global energy landscape1,2,3,4,5, driving studies on their performance under extreme weather conditions6,7,8,9. In particular, high-altitude environments (> 3800 m a.s.l.)10,11,12 present unique challenges: high atmospheric variability, non-stationary irradiance, and accelerated component degradation6,7,8,13,14,15,16. Given this, conventional predictive models exhibit limitations in accuracy and generalization17,18, which has motivated the development of machine learning (ML)-based metamodels that combine multiple techniques19,20,21,22,23,24. Recent advances integrate adaptive preprocessing25,26, feature selection27,28,29, and regularization30,31 with algorithms such as LightGBM and CatBoost32,33,34,35,36,37, while stacking approaches leverage synergies between underlying models to improve robustness38,39,40,41,42. However, gaps persist in reliable prediction under the extreme climate variability of mountainous areas43,44,45,46,47,48, where the scarcity of quality data exacerbates the problem16. Despite progress in PV metamodels38,39,40,41,42,47,48,49,50, existing studies suffer from three critical limitations in high-altitude environments: (i) insufficient handling of artifacts in monitoring data (losses, redundancies, and time shifts)16,51,52; (ii) underutilization of dimensionality reduction techniques to mitigate noise and spurious correlations27,28,29; and (iii) reliance on stacking architectures with homogeneous base models, which limits ensemble diversity39,40,41. This leads to overfitting under non-stationary weather conditions40,42 and high errors in active power prediction (MAE > 12.24 according to39; RMSE > 47.78 according to41). Furthermore, most models do not incorporate adaptive strategies for fragmented or incomplete data16,51,53, a recurring problem in systems monitored in remote areas14,15.

To overcome these limitations, this study proposes a robust methodology based on a four-stage process: (1) adaptive data preprocessing; (2) sequential feature selection (SFS), a key technique for coping with missing data, a recurring problem in extreme-climate monitoring; (3) principal component analysis (PCA) for dimensionality reduction and model optimization; and (4) a hybrid stacking ensemble designed to substantially improve prediction accuracy by combining the regularization of models such as Lasso and Ridge with the predictive agility of LightGBM and CatBoost. This approach not only contributes to the advancement of scientific knowledge but also offers practical implications for improving the efficiency and planning of photovoltaic systems under the challenging conditions of mountainous regions. Therefore, the main objective of this research was:

To develop and validate a high-precision active power prediction metamodel for photovoltaic (PV) systems installed at extreme altitudes.

Methodology

Photovoltaic array

The photovoltaic array was installed at 3,800 m above sea level. It is a grid-tied system comprising eight ERA SOLAR ESPSC 370 monocrystalline modules of 370 Wp each, for an installed capacity of 2,960 Wp, with a SolarEdge P370 DC-DC optimizer on each module and a 3 kW SolarEdge SE3000H HD-Wave single-phase inverter.

Data description and preprocessing

To guarantee reliable data acquisition, the system was monitored following high standards such as IEC 60904-1 "Measurement of photovoltaic current-voltage characteristics"54 and IEC 61724-1 "Monitoring of photovoltaic system performance"55, which were key to ensuring adequate data acquisition. To measure voltage and current on the direct-current side, the system used Schneider Electric Zelio Analog transducers, which sent the captured information to the PLC. For the metrics on the AC side, a HIKING TOMZN power meter was used, compliant with IEC 62053-21 "Requirements for alternating current active energy meters"56, which monitors current, voltage, active power, reactive power, power factor (cos φ), frequency, and total energy in forward and reverse kWh. This information was likewise sent to the Siemens LOGO PLC via the RS-485 Modbus industrial communication protocol. Finally, software configured in LabVIEW received the information held in the PLC over Modbus RS-485.

The instantaneous prediction approach adopted in this study was specifically designed to address the operational requirements of grid-tied photovoltaic systems at extreme altitudes, where real-time power forecasting is essential for inverter control and grid stability. Unlike time series models that require continuous historical data sequences, our approach maintains functionality with incomplete datasets, a critical advantage in remote monitoring environments where communication failures and equipment malfunctions frequently occur. This design philosophy enables immediate deployment in new installations without historical data accumulation periods, while the computational efficiency (convergence in < 50 iterations) allows implementation on resource-constrained edge computing devices typical of remote mountainous installations.

Model description

To build the two prediction metamodels for a PV system under extreme high-altitude conditions, four elements were used: data preprocessing, the Sequential Feature Selector (SFS), Principal Component Analysis (PCA), and stacking with CatBoost and LightGBM, which are detailed in the general flowchart in Fig. 1.

Fig. 1. General flow chart of the metamodels created.

As Fig. 1 shows, the sensor data were first consolidated and pre-processed into a single ordered dataset before applying SFS and PCA. SFS then tests feature combinations to choose the most informative variables and reduce the width of the dataset, preventing the regression model from being swamped by irrelevant inputs. PCA next transforms the selected features into new, uncorrelated variables (the principal components) that capture the essence of the data, improving clarity and limiting overfitting. Finally, stacking was applied: the level 1 models, Lasso and Ridge (with their respective regularizations), were trained on the PCA components, and their outputs fed the level 2 models, CatBoost and LightGBM. A minimal end-to-end sketch of this flow is shown below.
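The sketch below reproduces this four-stage flow with scikit-learn-style components. It is illustrative only: the file name, hyperparameters, and the use of scikit-learn's StackingRegressor are assumptions rather than the study's exact configuration, and the SFS stage is assumed to have already been applied to the input table.

```python
# Illustrative sketch of the Fig. 1 flow (assumed file name and hyperparameters).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import StackingRegressor
from lightgbm import LGBMRegressor

df = pd.read_csv("pv_dataset.csv")                  # hypothetical preprocessed dataset
X, y = df.drop(columns=["AC_Power"]), df["AC_Power"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35, random_state=42)

# Level 1: regularized linear models fitted on standardized PCA components.
level1 = [
    ("lasso", make_pipeline(StandardScaler(), PCA(n_components=5), Lasso(alpha=0.01))),
    ("ridge", make_pipeline(StandardScaler(), PCA(n_components=5), Ridge(alpha=1.0))),
]
# Level 2: a boosted metamodel trained on the level-1 predictions.
meta = StackingRegressor(estimators=level1, final_estimator=LGBMRegressor())
meta.fit(X_tr, y_tr)
print("test R^2:", meta.score(X_te, y_te))
```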

The following metrics were used to determine the performance of the two proposed metamodels (a computation sketch follows the list):

  • Train score and test score: indicate, on a scale from 0 to 1, how well the model fits the training and test data, respectively. A model that fits the training data too closely (high variance) overfits: it bends excessively to the training points, generalizes poorly, and therefore yields a low test score.

  • MAE: Represents the average of the absolute differences between the actual and predicted values in the dataset. It is calculated with Eq. (1), where N is the total number of samples, \(y_{i}\) is the actual value, and \(\hat{y}_{i}\) is the predicted value49.

$$MAE=\frac{1}{N}\sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|$$
(1)
  • MSE: Represents the average of the squared differences between the actual and predicted values in the dataset and measures the variance of the residuals. Its value was calculated using Eq. (2)49.

$$MSE=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
(2)
  • RMSE: Represents the average magnitude of the errors in a model's predictions. It is calculated as the square root of the mean of the squared errors, Eq. (3)49.

$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}$$
(3)
  • nRMSE: A unitless metric that allows direct comparison of model accuracy even when the predicted variables have very different units or ranges. It normalizes the RMSE by the range of observed values and is calculated using Eq. (4)49.

$$nRMSE=\frac{RMSE}{y_{max}-y_{min}}$$
(4)
  • Coefficient of determination (R²): Represents the proportion of variance in the dependent variable that is explained by the regression model. Its value was calculated using Eq. (5)49.

$$R^{2}=1-\frac{\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{N}\left(y_{i}-\bar{y}\right)^{2}}$$
(5)
  • Adjusted coefficient of determination: Adjusts R² according to the number of independent variables in the model and is always less than or equal to R². Eq. (6) was used for its calculation, where n is the number of observations and k is the number of independent variables49.

$$R_{adj}^{2}=1-\left[\frac{\left(1-R^{2}\right)\left(n-1\right)}{n-k-1}\right]$$
(6)
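As a reference for the list above, the following NumPy sketch computes Eqs. (1)-(6); y_true, y_pred, and k are placeholders for the observed values, the model predictions, and the number of predictors.

```python
# Hedged sketch: the metrics of Eqs. (1)-(6) computed with NumPy.
import numpy as np

def regression_report(y_true, y_pred, k):
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))                       # Eq. (1)
    mse = np.mean((y_true - y_pred) ** 2)                        # Eq. (2)
    rmse = np.sqrt(mse)                                          # Eq. (3)
    nrmse = rmse / (y_true.max() - y_true.min())                 # Eq. (4)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot                                     # Eq. (5)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)                # Eq. (6)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "nRMSE": nrmse, "R2": r2, "R2_adj": r2_adj}
```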

Two important blocks to ensure the generalization of the model are the Sequential Feature Selector and the Principal Component Analysis. Both blocks are detailed in Fig. 2.

In Fig. 2 the flowchart begins with the identification of the dependent variable, "AC_Power", and the independent variables: Date, Time, AC_Current, AC_Voltage, AC_Apparent_Power, AC_Reactive_Power, AC_Power_Factor, DC_Voltage, AC_Frequency, DC_Power, and DC_Current. Baseline performance was first evaluated to establish a benchmark for model development. The variables were then added to the model one by one and performance was re-evaluated; if performance improved, the variable was kept, and the procedure was repeated until all independent variables had been tried. The process stopped when the optimal subset of variables had been selected.

PCA was then applied to reduce dimensionality, tracking how much of the variability of the original data was represented by the newly generated components until an acceptable threshold was reached. To do this, the data provided by the SFS were standardized by subtracting the mean and dividing by the standard deviation, ensuring that each feature contributes equally to the analysis. The covariance matrix was computed to determine the linear relationships between the features, and its eigenvalues and eigenvectors were calculated; the eigenvalues indicate the amount of variance each principal component captures in the data. The eigenvalues were then sorted in descending order together with their eigenvectors to identify the components that capture most of the variance, and the number of principal components was selected according to the variance criterion. Finally, the original data were projected onto the space defined by the selected eigenvectors, yielding a new, lower-dimensional dataset that retains most of the relevant information from the original one. This transformation reduces dimensionality and noise, improves performance, and aids visualization. A from-scratch sketch of these PCA steps is given after Fig. 2.

Fig. 2. Detailed flowchart for SFS and PCA.
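The PCA procedure just described can be reproduced from scratch as in the following sketch; it is illustrative (the 99.999% threshold matches the variance figure reported later), not the authors' code.

```python
# From-scratch sketch of the PCA steps in Fig. 2.
import numpy as np

def pca_project(X, var_threshold=0.99999):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each feature
    C = np.cov(Z, rowvar=False)                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]              # sort by captured variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()       # cumulative explained variance
    n_comp = int(np.searchsorted(cum, var_threshold) + 1)
    return Z @ eigvecs[:, :n_comp]                 # data projected onto the selected PCs
```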

Figure 3 shows that the process to create the metamodel, represented by the stacking ensemble block, begins with feature selection via SFS; the dimensionality of the selected variables is then reduced with PCA to train the Ridge and Lasso models. Training both models generates predictions that become the input for the level 2 stacking models, CatBoost and LightGBM. Finally, the results obtained were compared to determine the best metamodel for predicting the system's active power; a sketch of this two-level scheme follows the figure.

Fig. 3. Metamodel flowchart.
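The following sketch illustrates the two-level scheme of Fig. 3. The out-of-fold construction of the level 1 predictions and the hyperparameters are assumptions intended only to illustrate the workflow, not a confirmed implementation detail.

```python
# Two-level stacking sketch (assumed workflow): level-1 Lasso/Ridge are fit on the
# PCA components and their predictions feed the level-2 CatBoost/LightGBM models.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def two_level_stack(X_pca, y):
    # Out-of-fold predictions avoid leaking the target into the level-2 models.
    p_ridge = cross_val_predict(Ridge(alpha=1.0), X_pca, y, cv=5)
    p_lasso = cross_val_predict(Lasso(alpha=0.01), X_pca, y, cv=5)
    Z = np.column_stack([p_ridge, p_lasso])        # level-2 input matrix
    lgbm = LGBMRegressor().fit(Z, y)               # metamodel candidate 1
    cat = CatBoostRegressor(verbose=0).fit(Z, y)   # metamodel candidate 2
    return lgbm, cat
```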

Results

Data pre-processing

To implement the prediction metamodels, the hosted Google Colab Jupyter Notebook service was used with the following characteristics: Processor: Intel® Xeon® CPU @ 2.20 GHz, RAM: 12.7 GB and hard disk: 107.7 GB.

For the generation of the model, 92,964 records were used, comprising the following fields: 'Date', 'Time', 'AC_Current', 'AC_Voltage', 'AC_Power', 'AC_Frequency', 'AC_Apparent_Power', 'AC_Reactive_Power', 'AC_Power_Factor', 'DC_Current', 'DC_Voltage', and 'DC_Power' (the 'Date', 'AC_Frequency', and 'DC_Current' fields appeared duplicated in the raw files, a redundancy handled during preprocessing). The data were separated into 65% for training and 35% for testing, corresponding to 60,426 training records and 32,538 test records for each variable, as sketched below.
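A minimal split sketch under these proportions (the file name and random seed are assumptions):

```python
# 65/35 split of the 92,964 records: 60,426 for training, 32,538 for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("pv_dataset.csv")                 # hypothetical consolidated file
X, y = df.drop(columns=["AC_Power"]), df["AC_Power"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.65, random_state=42)
```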

To develop the prediction metamodels, the data collected from the photovoltaic system, organized into daily directories generated by the monitoring system, were integrated into a data hub presented in Fig. 4. This figure illustrates the initial structure of the dataset, highlighting its complexity and the specificities of the research case study, such as missing values (1.40% on average, according to Table 1), unnamed variables, data displaced by errors in the collection system, and redundancies. During preprocessing, unnamed variables were eliminated, and displaced data were corrected by imputing missing values. These anomalies, common in monitoring systems under extreme altitude conditions, were effectively addressed by combining adaptive preprocessing and dimensionality reduction (SFS and PCA), which allowed the generation of a clean and representative dataset for training the predictive models. Figure 4 therefore represents the starting point of the data cleaning and consolidation process, essential for the robustness of the proposed metamodels.

The complexity observed in Fig. 4 reflects the genuine challenges of data acquisition in extreme altitude environments rather than presentation artifacts. This visualization demonstrates several critical aspects: (i) irregular temporal structure with inconsistent timestamps typical of remote monitoring systems, (ii) variable data density showing periods with complete measurements alternating with significant gaps, and (iii) the necessity for sophisticated preprocessing techniques to handle data fragmentation effectively. The figure represents raw data collected under industrial standards (IEC 60904-1, IEC 61724-1, IEC 62053-21), illustrating why conventional electrical calculations (P = V×I) become unreliable and sophisticated prediction models are essential for maintaining system operability under such challenging conditions.

Fig. 4. Raw dataset structure.

From Table 1, missing data were analyzed for all variables, amounting to only 1.40% on average. Subsequently, records with incomplete values were deleted; a cleaning sketch is given after Table 1.

Table 1 Percentage of missing data.
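The cleaning step can be sketched as follows (file name hypothetical; the column handling mirrors the anomalies described above):

```python
# Cleaning sketch: drop unnamed columns, de-duplicate redundant fields,
# and delete records with missing values (about 1.40% of rows on average).
import pandas as pd

df = pd.read_csv("raw_monitoring_dump.csv")
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]  # remove unnamed variables
df = df.loc[:, ~df.columns.duplicated()]               # remove duplicated fields
df = df.dropna()                                       # delete incomplete records
```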

Table 2 describes the statistics of all the variables to be processed. As a selection criterion, a variable was retained only if its coefficient of variation did not exceed 100% and its coefficient of asymmetry (skewness) did not exceed 4. This criterion is drawn from the existing literature. Table 2 shows that all variables meet both requirements; a screening sketch follows the table.

Table 2 Statistics of the data to be processed.
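Reading the stated asymmetry bound as a limit of 4 on the skewness coefficient (an interpretation on our part), the rule can be expressed as:

```python
# Variable screening: keep a variable only if CV <= 100% and |skewness| <= 4.
import pandas as pd

def screen_variables(df: pd.DataFrame) -> list:
    keep = []
    for col in df.select_dtypes("number").columns:
        cv = 100 * df[col].std() / abs(df[col].mean())  # coefficient of variation, %
        if cv <= 100 and abs(df[col].skew()) <= 4:      # assumed reading of the bound
            keep.append(col)
    return keep
```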

The correlation matrix (Fig. 5) quantifies interdependencies among the electrical variables of the 3800 m a.s.l. photovoltaic system, revealing critical redundancies that validate our dimensionality reduction strategy. Perfect multicollinearity (|r| = 1.00) between AC_Current, AC_Power, AC_Apparent_Power, and DC_Power confirms operational coupling in which current measurements inherently dictate power outputs, justifying the Sequential Feature Selector's (SFS) elimination of redundant variables to isolate 8 non-collinear predictors. Similarly, near-unity correlations between reactive power metrics (AC_Reactive_Power–AC_Power_Factor: r = 0.93; AC_Reactive_Power–DC_Voltage: r = 0.92) demonstrate harmonic distortion artifacts exacerbated by altitude-induced grid instability, necessitating Principal Component Analysis (PCA) to decompose these spurious relationships into orthogonal components capturing 99.999% of the variance. Crucially, AC_Frequency's weak correlations (r ≤ 0.63) with other variables underscore its role as a unique indicator of grid transients, explaining its retention during adaptive preprocessing.

Fig. 5. Correlation matrix.

The strategic implementation of Sequential Feature Selection (SFS) directly addresses the multicollinearity challenges evident in Fig. 5. While perfect correlations (|r| = 1.00) between electrical parameters might suggest redundancy, the practical value of our approach becomes apparent in real-world scenarios where sensor failures, communication disruptions, and environmental artifacts create incomplete datasets. The SFS methodology systematically eliminates redundant variables while preserving predictive capability, enabling accurate power forecasting even when primary measurements (current/voltage) are unavailable. This capability is particularly crucial in extreme altitude installations where maintenance access is limited and sensor reliability is compromised by harsh environmental conditions.

Sequential feature selector

To reduce the number of variables, the Sequential Feature Selector was applied in the second stage. For this selection method, the avg_score metric was used to determine the optimal number of variables, keeping the subset with the best score across iterations. The value found was −165.20, and the optimal subset according to the SFS method comprised 8 variables: 'AC_Current', 'AC_Voltage', 'AC_Apparent_Power', 'AC_Reactive_Power', 'AC_Power_Factor', 'DC_Voltage', 'DC_Power', 'DC_Current', as shown in Fig. 6; a selection sketch follows the figure.

Fig. 6. Selecting variables with SFS.
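The avg_score terminology suggests an mlxtend-style forward selector; the sketch below uses that API under assumed estimator and cross-validation settings, reusing the training split defined earlier.

```python
# Forward SFS sketch (mlxtend-style API; estimator and cv are assumptions).
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sfs = SFS(LinearRegression(),
          k_features="best",                    # keep the best-scoring subset
          forward=True,                         # add variables one at a time
          scoring="neg_mean_squared_error",     # yields avg_score values such as -165.20
          cv=5)
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_names_)                     # the 8 selected predictors
print(sfs.subsets_[len(sfs.k_feature_idx_)]["avg_score"])
```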

Principal component analysis

In the third stage, the reduction of dimensionality was carried out by means of Principal Component Analysis. The result of the application of the analysis is shown in Fig. 7.

Fig. 7. Outcome of the PCA application (cumulative explained variance as a function of the number of principal components).

By reducing the dimensionality of the original data into principal components, it was verified that the first five components already capture practically all the variability of the original data (99.999%), so the last three components were discarded, as shown in Fig. 7 and in the values of Table 3. A component-selection sketch is given after Table 3.

Table 3 Values of the components obtained by the PCA.
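This selection step can be written with scikit-learn's variance-threshold interface (the feature list is the SFS result reported above; everything else is illustrative):

```python
# Keep the fewest principal components explaining 99.999% of the variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

selected = ["AC_Current", "AC_Voltage", "AC_Apparent_Power", "AC_Reactive_Power",
            "AC_Power_Factor", "DC_Voltage", "DC_Power", "DC_Current"]
Z = StandardScaler().fit_transform(X_train[selected])
pca = PCA(n_components=0.99999, svd_solver="full")  # variance-threshold selection
X_pcs = pca.fit_transform(Z)                        # 5 components retained here
print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```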

Stacking level 1

Ridge

To obtain the base models (level 1) for stacking processing, the Ridge model was first generated on the components determined by PCA, with the results shown in Table 4.

Table 4 Results of ridge regression.

The importance of the principal components is shown in Fig. 8, where it can be seen that PC3 (21.29%), PC4 (57.39%), and PC5 (20.97%) are the components with the greatest influence on the model; a sketch of how such importances can be derived follows the figure.

Fig. 8. Importance of each component for the level 1 model: Ridge.
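One plausible way to obtain such percentages, assuming they are derived from the magnitudes of the fitted level 1 coefficients (an assumption on our part, reusing X_pcs and y_train from the sketches above):

```python
# Component importance sketch: normalized absolute Ridge coefficients per PC.
import numpy as np
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0).fit(X_pcs, y_train)
importance = 100 * np.abs(ridge.coef_) / np.abs(ridge.coef_).sum()
for i, imp in enumerate(importance, start=1):
    print(f"PC{i}: {imp:.2f}%")
```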

Lasso

In the same way, to obtain the other base model (level 1) for stacking processing, the Lasso model was also generated on the components determined by PCA, the results of which are shown in Table 5.

Table 5 Lasso regression results.

Similarly, Fig. 9 shows the importance of the principal components for the Lasso model, where PC3 (21.35%), PC4 (57.53%), and PC5 (20.77%) are the components with the greatest influence.

Fig. 9. Importance of each component for the level 1 model: Lasso.

To verify that the results of the component analysis were consistent, the statistical analysis of the two models was performed and compared with the original dependent variable, as shown in Table 6.

Table 6 Statistics of level 1 and objective models.

From Table 6, it can be seen that the behavior of the data predicted by Ridge and Lasso and that of the original dependent variable do not present notable anomalies or differences. Based on the Lasso and Ridge level 1 models, the level 2 stacking models were generated for the development of the prediction metamodels for a photovoltaic system under extreme altitude conditions.

Stacking level 2

At this level (see Fig. 10), the CatBoost metamodel obtained after stacking shows that Ridge contributes 60% to the metamodel, while Lasso contributes 40%. Similarly, after applying level 2 stacking using LightGBM to generate the metamodel, the results show that the Lasso model contributes 55%, while the Ridge model contributes 45%. A sketch of how these contributions can be read from the level 2 models follows the figure.

Fig. 10. Stacking level 2.
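Assuming the reported percentages correspond to the level 2 boosters' normalized feature importances over the two level 1 inputs (an interpretation, not a confirmed detail), they can be read as follows, reusing the lgbm and cat models from the earlier stacking sketch:

```python
# Contribution sketch: normalized feature importances of the level-2 models
# over their two inputs [ridge_pred, lasso_pred].
import numpy as np

def level1_contributions(model, names=("Ridge", "Lasso")):
    imp = np.asarray(model.feature_importances_, dtype=float)
    return dict(zip(names, np.round(100 * imp / imp.sum(), 1)))

print(level1_contributions(cat))    # e.g. {'Ridge': 60.0, 'Lasso': 40.0}
print(level1_contributions(lgbm))   # e.g. {'Ridge': 45.0, 'Lasso': 55.0}
```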

Figure 11 shows the evolution of the MSE during training of the LightGBM model, evidencing a rapid decline in the first 50 iterations before stabilizing around 13.66 (MSE) on the test set. The slight discrepancy between the training curve (solid line) and the validation curve (dotted line) confirms the robustness of the model against overfitting. Compared with CatBoost and OLS (Fig. 14), LightGBM not only achieves lower error values but also requires fewer iterations to converge, thus optimizing computational resources. Additionally, the figure was manually validated to correct minor distortions generated by automated tools, guaranteeing the reliability of its graphical representation. A logging sketch for these curves follows the figure.

Fig. 11. LightGBM MSE convergence.
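Curves like these can be logged with LightGBM's evaluation API, as in this sketch (variable names reuse the earlier stacking sketch; settings are assumptions):

```python
# Log per-iteration train/test MSE ('l2') during LightGBM training.
from lightgbm import LGBMRegressor

model = LGBMRegressor(n_estimators=200)
model.fit(Z_train, y_train,
          eval_set=[(Z_train, y_train), (Z_test, y_test)],
          eval_names=["train", "test"],
          eval_metric="l2")                       # l2 is the MSE
history = model.evals_result_                     # {'train': {'l2': [...]}, 'test': {'l2': [...]}}
```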

Figure 12 illustrates the evolution of the coefficient of determination of the LightGBM model during training (blue line) and validation (orange line). R² reaches a stable value of 0.999858 on the test set after 150 iterations, explaining virtually all the variance in the data. The rapid convergence (before iteration 50) and the close proximity between the two curves confirm that the model is not only accurate but also generalizable. Compared with CatBoost and OLS (Fig. 13), LightGBM improves R² by 0.0011 and 0.0071 points, respectively, highlighting its suitability for complex environments. It should be noted that the figure was manually refined to correct minor deviations in the axes, ensuring accurate interpretation.

Fig. 12. LightGBM coefficient of determination performance.

Metamodel comparison

Comparing the performance of the two developed metamodels and the reference OLS model across the score, coefficient of determination, and adjusted coefficient of determination (Fig. 13), the LightGBM metamodel achieves 99.9858% on all three metrics, while CatBoost shows slightly lower performance at 99.9848% and the ordinary least squares (OLS) model reaches 99.9787%.

Fig. 13. Performance of all models.

Figure 14 illustrates the comparative performance of the LightGBM, CatBoost, and ordinary least squares (OLS) models in predicting active power for photovoltaic systems under extreme high-altitude conditions. The LightGBM model demonstrates superior performance with a mean absolute error (MAE) of 6.7601, a mean squared error (MSE) of 13.6614, a root mean squared error (RMSE) of 14.24, and a normalized RMSE (nRMSE) of 0.01, outperforming the CatBoost metamodel, which exhibits an MAE of 7.2036, an MSE of 14.1517, an RMSE of 14.52, and an nRMSE of 0.01. In contrast, the OLS model achieves an MAE of 4.7180, an MSE of 16.24, and comparable RMSE and nRMSE values, overall reflecting lower predictive accuracy. These metrics highlight LightGBM's enhanced capability to fine-tune predictions and capture complex data patterns in challenging environmental contexts, underscoring its effectiveness for accurate power forecasting in high-altitude photovoltaic systems.

Fig. 14. Model error performance.

Therefore, it is concluded that the LightGBM metamodel is the most suitable for making predictions for the photovoltaic system, as it achieves the best metric values as well as shorter training and prediction times.

Discussion

Table 7 presents the results of previous studies related to this research, detailing the algorithms employed, the performance metrics used, and the values obtained.

Table 7 Model comparison.

Table 7 shows that the proposed LightGBM-based metamodel outperforms other models and approaches reported in the literature. The metamodel achieved a test score and an adjusted coefficient of determination of 99.9858%, values significantly higher than those reported by other studies, such as artificial neural networks combined with XGBoost and LSTM, which achieved an adjusted coefficient of determination of 0.9840. It likewise exceeds the 90.58% obtained by50, who used a network combining XGBoost, CatBoost, LGBM, and RF, and the 96.2% obtained by47 with a stacked LSTM sequence-to-sequence autoencoder hybrid DL approach. This result underscores the effectiveness of the LightGBM stacking approach for accurately predicting energy production under extreme high-altitude conditions. In terms of mean squared error (MSE), the LightGBM metamodel showed a value of 13.6614, considerably lower than the MSE of 29 reported for models combining convolutional neural networks and LSTM42, and much lower than the MSE of 2283.18 obtained by models such as XGB, LGB, and RF41. This indicates a greater ability of the proposed metamodel to minimize prediction errors. In addition, the MAE of the LightGBM metamodel was 6.7601, a marked improvement over the MAE of 12.24 observed in studies using Random Forest, XGBoost, and AdaBoost39, and also better than the 8.59 obtained with artificial neural networks combined with SVM and LSTM48.

Practical applications and model utility

Our model offers practical and valuable solutions for high-altitude solar systems. It enables smarter predictive maintenance by identifying critical sensors, reducing operating costs. Furthermore, the system is fault-tolerant, maintaining exceptional accuracy (R² = 99.9858%) even with partial sensor data, thus ensuring continuous operation. This instantaneous prediction capability is crucial for real-time grid integration and management of mountainous microgrids, while its computationally lightweight design enables deployment on resource-constrained edge devices.

Data fragmentation challenges in extreme altitude monitoring

Photovoltaic systems above 3,800 m face unique monitoring challenges due to harsh environmental conditions and limited access, resulting in common data fragmentation (an 8.03% loss in our study). Traditional predictive approaches, which require comprehensive data sets, are ineffective in these circumstances. To overcome this, our hybrid stacking method is specifically designed to handle this incomplete information, maintaining high predictive accuracy and ensuring continuous system operation regardless of the sensor status.

Conclusions

Reliable active power prediction in photovoltaic systems at extreme altitudes above 3800 m a.s.l. faces critical limitations due to non-stationary climate variability, monitoring data loss (8.03% loss in AC_Frequency/DC_Current, temporal shifts), and overfitting in conventional models (MAE > 12.24 in previous studies). To address this, the present study developed and validated a high-accuracy metamodel through a four-stage process: adaptive preprocessing for series reconstruction, sequential feature selection (SFS) that identified 8 optimal predictors, dimensionality reduction with PCA (capturing 99.999% of the variance with 5 components), and hybrid stacking that integrates Lasso/Ridge regularization with the nonlinear capability of LightGBM/CatBoost. The results demonstrate exceptional accuracy: the LightGBM model achieved R² = 99.9858%, MAE = 6.7601, and MSE = 13.6614, significantly outperforming CatBoost (MAE: 7.2036) and OLS (MSE: 16.24), with stable convergence in < 50 iterations and minimal training-validation discrepancy. The novelty lies in the algorithmic synergy that combines mathematical rigor (PCA for component orthogonality) and computational flexibility (boosting for nonlinearities), solving the dual challenge of fragmented data and environmental complexity. This approach enables the management of PV systems in mountains under climatic uncertainty (CV < 82.98% in key variables), optimizing grid integration in remote areas. Future research should validate the model in high-irradiation deserts, incorporate autoencoders for unsupervised fault detection, and develop fault-tolerant hardware to mitigate acquisition artifacts.