Introduction

Solar energy, a renewable and sustainable source, plays a pivotal role in the global transition toward a future of clean energy. In a world increasingly driven by the imperative to reduce carbon emissions and mitigate climate change, solar energy emerges as a vital solution due to its abundance and availability. The growing demand for renewable energy sources places solar energy at the forefront, not only as an alternative to fossil fuels, but also as a fundamental pillar in the quest for energy sustainability.

As highlighted by Chandel and Roy1, accurate forecasting of solar radiation is critical to optimizing the performance of photovoltaic systems and efficiently integrating solar energy into the electrical grid. The effectiveness of solar energy utilization, however, is inherently linked to the precision and completeness of the collected meteorological data. Precise data on solar irradiation, temperature, humidity, and other environmental variables are essential to evaluate the solar energy potential of a region. Data collection, however, can be fraught with challenges, including data gaps due to equipment limitations, maintenance, or adverse environmental conditions. These gaps can lead to inaccurate analyses that affect decision-making and efficiency in the implementation of solar energy solutions.

Recent advances in machine learning have significantly enhanced the methods available for optimizing and managing solar energy systems. For instance, Zhou et al.2 developed a machine learning-based optimal design of a renewable system integrating phase change materials, which includes on-site PV, radiative cooling, and hybrid ventilation. This study demonstrated substantial improvements in energy performance across different climatic regions. Further advances were made by Khare et al.3, who investigated the use of Generative Adversarial Networks (GANs) for the imputation of solar radiation data, demonstrating significant improvements in the quality of the imputed data.

The challenges associated with forecasting solar radiation, arising from the complexity of atmospheric processes and the dynamics of climate patterns, are addressed through advanced machine learning techniques4,5. In Ref.1, the authors presented a hybrid strategy using Random Forest and XGBoost to fill significant gaps in solar data, given their effectiveness on large and complex datasets. The study conducted by Xu et al.6 introduced an innovative approach to predict the higher heating value (HHV) of coal using a gradient boosting regression tree model. The authors proposed a gradient boosting (GB) model that relies on the correlation between the proximate analysis of coal and its HHV, offering a less resource-intensive alternative for estimating coal calorific values. This approach not only highlights the efficacy of the gradient boosting regression tree (GBRT) model in predicting the HHV of coal with greater accuracy but also offers a promising methodology for coal quality analysis, potentially reducing the costs and time associated with traditional evaluation.

Necati Aksoy and Istemihan Genc7 developed innovative predictive models to forecast energy production in solar power plants, employing gradient boosting methods, specifically XGBoost, LightGBM, and CatBoost. The study emphasizes the importance of accuracy and speed in forecasting the energy generated by solar power plants in smart grids, microgrids, or small-scale grids, and how forecasts influence critical decisions in the dynamic energy management of these networks. The authors developed machine learning models trained on a dataset composed of various meteorological characteristics. These models offer high accuracy and rapid learning, bringing significant benefits to the solar energy sector. Furthermore, the study compares the performance of the models and discusses their practical applicability.

The research conducted by Sasirekha et al.8 highlighted the importance of solar radiation forecasting in the energy generation sector, emphasizing the efficiency of Random Forest on large datasets and in estimating missing data. This comparative study found that, although Random Forest requires more processing time due to the larger number of decision trees, it provides more accurate predictions, which is especially useful in practical applications such as solar radiation analysis for electric power production.

Villegas-Mier et al.9 addressed solar radiation forecasting using an optimized Random Forest algorithm. Their work compared the results with other machine learning models and showed a significant improvement in accuracy compared to conventional methods such as linear regression and recurrent neural networks. The study was carried out in Querétaro, Mexico, and highlighted the effectiveness of Random Forest in predicting solar radiation, especially in locations with average weather conditions for most of the year.

In this context, machine learning techniques such as Random Forest and Gradient Boosting emerge as powerful tools to address limitations in the analysis of solar energy data. Random Forest, an ensemble learning method, is known for its high accuracy and ability to handle large datasets with multiple input variables4. By applying this model to the imputation of missing solar energy generation data, we can significantly improve the quality and reliability of the analyses. For this purpose, the work uses variables such as temperature, radiation, humidity, and wind speed for data estimation.

Similarly, Gradient Boosting (GB) is an ensemble machine learning method applicable to classification and regression problems. It trains models sequentially, focusing on the more difficult cases by adjusting the weights of instances with incorrect predictions. The aim is to minimize a loss function, similarly to the training of neural networks, but by combining multiple models to enhance accuracy. Numerous hyperparameters, including the learning rate and the loss function, are used to optimize model performance10.

The selection of random forests and gradient boosts for this study is motivated by their ability to process large volumes of data and capture complex relationships between input variables without the need for extensive data transformations. Moreover, while Random Forest is known for its ease of use, robust generalization, and resistance to overfitting, Gradient Boosting offers flexibility in optimizing different loss functions and can achieve greater accuracy by adaptively adjusting to failures in previous predictions. Other methods such as neural networks, SVM (Support Vector Machines), and linear models have their advantages, such as the ability of neural networks to model complex nonlinear relationships and the efficiency of SVMs in high-dimensional spaces. However, these methods may require more data for effective training or be more susceptible to overfitting, making Random Forest and Gradient Boosting a more balanced choice for many solar energy analysis scenarios11,12,13.

Thus, this study explores the implementation of Gradient Boosting and Random Forest models to enhance the integrity of solar data, assessing which method is more suitable for imputing missing photovoltaic generation data between April and June. The database refers to a PV plant on the campus of the Federal Institute of Bahia, located in the city of Irecê, Bahia, Brazil. The data are open, available on the Kaggle platform, and include meteorological data provided by the National Institute of Meteorology (INMET) of Brazil.

Literature review

The field of artificial intelligence has seen significant advancements in the past few decades, fuelling various innovative applications in various domains, from medicine to industry. At the heart of this progress lies machine learning. This branch of artificial intelligence is focused on developing systems that learn and improve from experience without explicit programming. It encompasses supervised, unsupervised, and reinforcement learning14,15. In supervised learning, the model is trained with a labeled dataset, using techniques such as linear and logistic regression, neural networks, and decision trees. Decision trees, a method of supervised learning, are used for classification and regression, learning from simple rules based on the data’s characteristics16,17,18.

In unsupervised learning, models are trained on unlabeled data to discover patterns, clusters, or hidden structures within the data. Popular algorithms include K-Means, PCA (Principal Component Analysis), and Gaussian Mixture Models 19,20,21. Meanwhile, in reinforcement learning, agents learn to make sequential decisions to maximize a cumulative reward. This is used in applications such as games and robotics. Popular algorithms include Q-Learning and DDPG (Deep Deterministic Policy Gradients)22,23.

In recent years, the application of machine learning techniques has become increasingly important in the field of solar energy. Several studies have demonstrated the effectiveness of various algorithms in predicting solar radiation and imputing missing data in solar power generation databases.

Machine learning techniques are broadly categorized according to the function they perform in data processing. These categories span various algorithms and methods, each designed to solve specific types of problems. The main categories include classification, regression, clustering, dimensionality reduction, and probability estimation17.

  • Classification: is one of the most common tasks in machine learning. It involves assigning a class or category to a given object or instance based on its features. This is particularly useful in cases where one wishes to predict a specific category or label for new data. Some of the most popular classification algorithms include the following:

    • Decision Trees: Hierarchical decision structures that divide the data into subsets based on characteristics.

    • Random Forests: Ensembles of decision trees that aggregate their predictions to increase accuracy.

These algorithms are effective in dealing with categorical data and are widely used in tasks such as spam detection, medical diagnosis, and pattern recognition.

  • Regression: deals with the prediction of continuous values, rather than discrete categories. It is used when the goal is to estimate a numerical value based on a set of input variables. Common regression algorithms include the following:

    • Linear regression: Models the linear relationship between the input variables and the output variable.

    • Logistic regression: Used for binary classification problems, where the output is a probability.

Regression is applied in a wide range of domains, such as prediction of real estate prices, analysis of economic trends, and prediction of sports results.

  • Clustering: is a technique that groups unlabeled data based on their intrinsic similarities. It is used to identify groups or clusters of data that share common characteristics. Some popular clustering algorithms include the following:

    • K-Means: Groups the data into K clusters, where K is specified by the user.

    • Gaussian Mixture Models: Models the clusters as Gaussian distributions and is useful when the data have a more complex structure.

Clustering is used in market segmentation, social network analysis, and image segmentation, among other applications.

  • Dimensionality Reduction: is a technique that aims to reduce the complexity of the data while retaining the most important information. This is especially useful when dealing with high-dimensional datasets. One of the most common methods of dimensionality reduction is Principal Component Analysis (PCA), which identifies the directions of greatest variability in the data and projects the data onto a lower-dimensional space. Dimensionality reduction is applied to data visualization, image compression, and the simplification of machine learning models.

  • Probability Estimation: often used in classification tasks to calculate the probability that an example belongs to a specific class. The Naive Bayes algorithm is a classic example of a probability estimation method, which relies on Bayes’ theorem to calculate the conditional probabilities of the classes.

Supervised learning is one of the most fundamental and widely used categories of machine learning models. In this paradigm, machine learning algorithms learn from data consisting of input examples and their corresponding outputs, often called target labels. Formally, the input examples are represented as \({W}_{i}\) and their respective target labels as \({y}_{i}\), for each example \(i\in\left\{1,\dots,N\right\}\). The crucial aspect that differentiates supervised learning is the presence of these target labels, which provide valuable feedback to guide the algorithm through the learning process, allowing it to accurately capture the relationships between inputs and outputs.

The development process of a machine learning model involves solving an optimization problem. Frequently, however, a single model is not the best solution for a complex problem. This is where ensemble learning comes into play, aiming to improve performance by combining the predictions of multiple models24.

The Random Forest method is a machine learning algorithm that operates by constructing multiple decision trees during training and outputting the average of these trees’ predictions to improve accuracy and control overfitting. Each decision tree in the Random Forest is built from a bootstrap sample of the data, and the splits of each tree are determined by a random subset of the features. This ensemble learning approach allows the model to capture both the variability and the patterns in the data, making it robust and precise, especially in cases with complex and multidimensional data. Random Forest is widely used for its effectiveness, ease of implementation, and good performance on various prediction tasks.

The state of the art in applying Random Forest to solar energy prediction reveals a growing trend in the use of artificial intelligence (AI) techniques to forecast photovoltaic energy generation. In addition to Random Forest, other methods such as NARX, NARMAX, ARIMA, and neural network models, such as the backpropagation neural network and the extreme learning machine (ELM), have been explored. Each of these models presents unique features, such as the ELM’s ability to self-learn and adapt and ARIMA’s sensitivity to changes in solar irradiation22,25,26.

Random Forest stands out for its ability to handle large volumes of data and reduce overfitting, making it an effective choice for precise predictions. However, the field still faces challenges, such as the need for efficient processing of large data volumes and the development of more efficient training algorithms for neural networks. Thus, while Random Forest is a promising option, there is room for innovation and improvement in solar energy prediction using AI.

The application of machine learning models, especially Random Forest, in solar energy data analysis has been explored in various recent research. For example, a study conducted by H. Sun et al.27 demonstrated how Random Forest could be effectively used to predict solar energy production under varying weather conditions. This study highlights the model’s robustness in handling the nonlinearity of environmental data, a challenge also encountered in our research28.

In contrast, Gradient Boosting is an ensemble machine learning algorithm that enhances accuracy through the sequential construction of decision trees. Unlike methods that build trees independently, Gradient Boosting focuses on correcting errors from previous trees, adapting to the most challenging cases during training. Each new tree is built to minimize the residual errors of previous predictions, using a process that gradually reduces the loss function. This process involves selecting specific hyperparameters, such as the learning rate and loss function, which guide the fine-tuning of the model for improved performance.

The application of Gradient Boosting in solar energy predictions demonstrates the versatility and efficacy of this method in handling complex and variable data. Similar to Random Forest, Gradient Boosting benefits from ensemble learning to provide robust and accurate predictions. However, it is distinguished by its ability to progressively correct errors, which can lead to higher accuracy in scenarios where data patterns are particularly challenging.

Despite the effectiveness of these methods, challenges remain in processing large data volumes efficiently and developing training algorithms that can fully leverage the available data29,30. The evolving landscape of machine learning in solar energy prediction continues to promise enhancements in accuracy and operational efficiency, underscoring the importance of innovation in this field31,32,33.

Methodology

The Methodology section of this study outlines the process employed to address gaps in solar energy generation data by utilizing the Random Forest and Gradient Boosting algorithms. It provides a detailed account of data preparation, the handling of missing values, the division of data into training and testing sets, model training procedures, and the evaluation of model performance.

Materials

Data were collected from the solar station at the photovoltaic plant and the database of the National Institute of Meteorology (INMET) of Brazil, covering variables such as air temperature, humidity, wind speed, and irradiation, alongside energy generation. This information was sourced from a photovoltaic solar plant located at the Federal Institute of Bahia in Irecê, Brazil, a region known for its high solar potential due to its proximity to the equator. The dataset was obtained from the open-source platform Kaggle, and the author has no affiliation with the mentioned institution. Figure 1 presents the system data used for this study.

Fig. 1. Irecê generation power data with a missing period.

The first notable observation is the consistency of energy generation over time, with peaks following a daily pattern indicative of solar generation relative to daylight. Energy production varies between approximately 0.5 and 2.0 kWh, with most days showing robust energy production, consistent with expectations for a photovoltaic plant in a high-insolation region. However, the area highlighted in the chart indicates a data gap between the end of May and the beginning of June 2019, which draws attention. This gap could be attributed to various causes, such as failures in the monitoring system, plant maintenance, or extreme weather events that may have temporarily halted operations. The absence of data during this period is critical as it prevents a comprehensive analysis of the plant’s performance and efficiency, as well as potential environmental or economic impacts.

A slight trend of reduction in energy generation is also observable immediately before the data interruption. This could indicate a gradual degradation of solar panel efficiency or expected seasonal variations due to changes in the solar angle over the year.

With access to the primary data, the imputation study was conducted. The process used in the development of the work is illustrated in Fig. 2. Initially, the data were separated into two sets: one containing complete records of energy generation and another where these values were absent. The SimpleImputer from the sklearn library was used to replace missing values in the predictor variables with the mean of the set, preparing the data for training.
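The mean-imputation step described above can be sketched with scikit-learn's SimpleImputer. This is an illustrative example with made-up values for the four predictor columns, not the actual INMET records:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: temperature, irradiation, humidity, wind speed (illustrative values).
X = np.array([
    [25.0, 800.0, np.nan, 3.2],
    [27.5, np.nan, 55.0, 2.8],
    [np.nan, 910.0, 48.0, 3.5],
])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)  # each NaN is replaced by its column mean
```

After fitting, `X_filled[2, 0]` holds the mean of the first column, 26.25, and no NaN values remain in the predictor matrix.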

Fig. 2. Development process flowchart.

The Random Forest Regressor and Gradient Boosting algorithms were adopted to model the relationship between environmental variables and energy generation. The models were trained with complete data and then used to estimate energy generation for records where it was missing.

The validity of the predictions was ensured by checking for the absence of negative values, which would be inconsistent with the nature of energy generation data. The evaluation also considered the performance of the models on test data using metrics such as mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R2). These metrics provide a comprehensive view of the models’ performance, including their accuracy and ability to explain the variance in the data.
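The non-negativity check mentioned above can be enforced by clipping predictions at zero. A minimal sketch with hypothetical model outputs (the values are illustrative, not results from the study):

```python
import numpy as np

# Hypothetical model outputs in kWh; a small negative value can occur
# numerically even though negative generation is physically impossible.
y_pred = np.array([0.8, 1.6, -0.02, 1.1])
y_pred_valid = np.clip(y_pred, 0.0, None)  # floor predictions at zero
```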

Additionally, the relative importance of the climatic variables in the prediction models was observed and evaluated using the feature_importances_ attribute of the trained models. This allows understanding which variables have the most significant impact on the prediction of energy generation.

Random Forest

The Random Forest used in the study is an ensemble learning method for regression, which constructs multiple decision trees during training. The method is suitable for our study because of its effectiveness in handling high-dimensional data and its ability to detect interactions between dependent and independent variables.

Random Forest is a versatile machine learning algorithm that is applicable to both regression and classification tasks. It builds multiple decision trees through bootstrap sampling and combines their predictions. In regression, the prediction is given by the average of the outputs of the trees:

$$\hat{{\text{y}}}\left(x\right)=\frac{1}{K}{\sum}_{k=1}^{K}{T}_{k}\left(x\right)$$
(1)

where \(\hat{{\text{y}}}\left(x\right)\) is the prediction for a new data point x, and \({T}_{k}\left(x\right)\) is the prediction of the k-th tree. In classification, the prediction is the class receiving the most votes among the trees:

$$\hat{{\text{y}}}\left(x\right)=mode\left\{{T}_{k}\left(x\right)\right\}$$
(2)

The algorithm also assesses the importance of variables, which is crucial for understanding the features that most influence the model’s predictions.

The machine learning process is carried out in two phases: training and application. In the training phase, the Random Forest model learns from the training data. The application phase uses the trained model to make predictions on new data.
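The two phases above can be sketched with scikit-learn's RandomForestRegressor on synthetic data standing in for the meteorological predictors (the feature layout and coefficients are assumptions for illustration). The sketch also verifies Eq. (1): the forest's prediction is the average of its individual trees' outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic predictors: temperature, irradiation, humidity, wind speed.
X_train = rng.uniform(size=(200, 4))
# Synthetic generation target, driven mainly by the irradiation column.
y_train = 2.0 * X_train[:, 1] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)          # training phase

X_new = rng.uniform(size=(5, 4))
y_hat = model.predict(X_new)         # application phase

# Eq. (1): predict() equals the mean over the K fitted trees.
per_tree_mean = np.mean([t.predict(X_new) for t in model.estimators_], axis=0)
```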

Gradient Boosting

Gradient Boosting is another ensemble learning method used in this study, specifically for regression and classification tasks. Unlike the Random Forest, Gradient Boosting optimizes predictive models sequentially, correcting the errors of previous predictors through an iterative procedure. This method turned out to be efficient for our study, especially due to its ability to minimize both bias and variance, resulting in more accurate predictions5.

Gradient Boosting constructs an additive model in a progressive stage-by-stage manner; at each stage, decision trees are introduced that correct the residuals of the current model. The final prediction is a weighted combination of the predictions from all the decision trees:

$$\hat{{\text{y}}}\left(x\right)={\sum}_{k=1}^{K}{\gamma}_{k}{T}_{k}\left(x\right)$$
(3)

where \(\hat{{\text{y}}}\left(x\right)\) represents the prediction for a new data point x, \({T}_{k}\left(x\right)\) is the prediction from the k-th decision tree, and \({\gamma}_{k}\) is the weight associated with the k-th tree.

This method is particularly known for its flexibility, being adjustable to optimize different loss functions and control overfitting through parameters such as the learning rate and the number of trees. Furthermore, like Random Forest, Gradient Boosting performs a feature importance analysis, allowing for a detailed interpretation of the variables that have the most significant influence on the predictions.

During the training phase, the Gradient Boosting model iteratively learns to correct the errors from previous trees, and in the application phase, it applies the trained model to predict the test data. The ability of Gradient Boosting to handle various types of data and its efficacy in complex situations make it a robust choice for predictive modeling in our study.
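The stage-by-stage fitting of Eq. (3) can be illustrated with scikit-learn's GradientBoostingRegressor on synthetic data (the target function is an assumption for the sketch). `staged_predict` exposes the additive model after each boosting stage, so the training error can be seen to shrink as later trees correct earlier residuals:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)

gb = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=1
)
gb.fit(X, y)

# Training MSE after each of the 200 stages of the additive model.
errors = [np.mean((y - stage_pred) ** 2) for stage_pred in gb.staged_predict(X)]
```

The learning rate scales each tree's contribution (the \({\gamma}_{k}\) weights), trading per-stage progress against the number of trees needed.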

Comparison between Random Forest and Gradient Boosting

The ensemble learning models Random Forest and Gradient Boosting are widely used in regression and classification tasks due to their high accuracy and generalization capability. Both models utilize decision trees as building blocks but employ different strategies in the learning process.

Random Forest is a learning model that operates by constructing a set of independent decision trees, each trained with a random sample of the data. The diversity among the trees is what gives Random Forest its robustness and resistance to overfitting. In contrast, Gradient Boosting constructs an additive model sequentially, where each new tree corrects the errors of the previous one, which can result in better performance if the model’s complexity is carefully managed to avoid overfitting.

While Random Forest may be easier to tune and less prone to overfitting due to its random nature, Gradient Boosting often achieves higher accuracy, though it requires fine-tuning of parameters and special attention to regularization.

Table 1 summarizes the main features of each model, facilitating the comparison between them.

Table 1 Comparison of the main features of Random Forest and Gradient Boosting.

When comparing the machine learning models Random Forest and Gradient Boosting, we evaluate two powerful approaches for building predictive models using decision trees. However, the way they construct and combine these trees is fundamentally different. Random Forest generates many trees independently, each learning from a random sample of the data. This is akin to having a series of experts with diverse opinions reaching a consensus. The method is quite resistant to overfitting and generally performs well even without fine-tuning of parameters, making it an attractive choice for beginners.

On the other hand, Gradient Boosting builds trees sequentially. Each new tree is created to correct the errors left by the previous tree, somewhat as if each new expert builds their knowledge based on the weaknesses of those before them. This can lead to more accurate outcomes but also a higher risk of overfitting, especially if the model is overly complex for the data at hand. Moreover, Gradient Boosting may be more sensitive to outliers and mostly requires more time and expertise to correctly adjust the parameters.

Both models allow for evaluating which variables are most important for making predictions, which is incredibly useful for understanding the model’s outcomes. However, the decision between using Random Forest or Gradient Boosting may depend on several considerations, such as the size and nature of the data, the available computational capacity, and the user’s experience in tuning model parameters.

Performance metrics

Prediction metrics are crucial for evaluating the accuracy and reliability of machine learning models. For this development, the following are used29,31:

  • Mean Bias Error (MBE): measures the tendency of the model to predict values higher or lower than the actual values:

    $$MBE=\frac{1}{N}{\sum}_{i=1}^{N}\left({y}_{i}^{pred}-{y}_{i}^{obs}\right)$$
    (4)
  • Mean Absolute Error (MAE): is suitable for situations with linear cost functions:

    $$MAE=\frac{1}{N}{\sum}_{i=1}^{N}\left|{y}_{i}^{pred}-{y}_{i}^{obs}\right|$$
    (5)
  • Root Mean Square Error (RMSE): is more sensitive to significant prediction errors:

    $$RMSE=\sqrt{\frac{1}{N}{\sum}_{i=1}^{N}{\left({y}_{i}^{pred}-{y}_{i}^{obs}\right)}^{2}}$$
    (6)
  • Mean Squared Error (MSE): provides a measure of the variance of the prediction errors, penalizing larger deviations more heavily:

    $$MSE=\frac{1}{N}{\sum}_{i=1}^{N}{\left({y}_{i}^{pred}-{y}_{i}^{obs}\right)}^{2}$$
    (7)
  • Coefficient of Determination (R2): indicates the proportion of the variance in the dependent variable that is predictable from the independent variables:

    $${R}^{2}=1-\frac{{\sum}_{i=1}^{N}{\left({y}_{i}^{obs}-{y}_{i}^{pred}\right)}^{2}}{{\sum}_{i=1}^{N}{\left({y}_{i}^{obs}-{\bar{y}}_{obs}\right)}^{2}}$$
    (8)

    where \({\bar{y}}_{obs}\) is the average of the observed values.

In the equations presented, \({y}_{i}^{pred}\) represents the value predicted by the model for the i-th observation, \({y}_{i}^{obs}\) the observed or actual value, N the total number of observations, and \({\bar{y}}_{obs}\) the average of the observed values. These metrics, combined, offer a comprehensive view of the model’s performance, highlighting not only the accuracy and bias of the predictions but also the variance of the errors and the model’s ability to explain variations in the observed data.
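Eqs. (4)-(8) translate directly into NumPy. MBE has no scikit-learn counterpart, so all five metrics are written out for consistency; the sample arrays are illustrative:

```python
import numpy as np

def mbe(obs, pred):
    # Eq. (4): mean bias error, positive when the model over-predicts.
    return np.mean(pred - obs)

def mae(obs, pred):
    # Eq. (5): mean absolute error.
    return np.mean(np.abs(pred - obs))

def mse(obs, pred):
    # Eq. (7): mean squared error.
    return np.mean((pred - obs) ** 2)

def rmse(obs, pred):
    # Eq. (6): root of the MSE, sensitive to large deviations.
    return np.sqrt(mse(obs, pred))

def r2(obs, pred):
    # Eq. (8): proportion of observed variance explained by the model.
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - np.mean(obs)) ** 2)
    return 1.0 - ss_res / ss_tot

obs = np.array([1.0, 2.0, 3.0])
pred = np.array([1.1, 1.9, 3.2])
```

For these sample values, MSE is 0.02, RMSE is its square root, and R2 is 0.97.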

Relative importance of predictors

In this section, we explore the relative importance of predictors in the machine learning models Random Forest and Gradient Boosting, both widely used for data imputation and predictions across various application domains, including solar energy generation. Understanding which variables have the greatest influence on the prediction of energy generation is crucial, as it allows the optimization of models to obtain more accurate and efficient predictions. We use the criterion of impurity decrease (Random Forest) and the contribution to error reduction (Gradient Boosting) to quantify the importance of each predictor. This analysis not only identifies the most significant factors affecting solar energy generation but also provides valuable insights for future research and operational practices in the field of renewable energy. Below, we mathematically detail how this importance is calculated and discuss the implications of our findings for the design and implementation of more effective photovoltaic systems.

  • The importance of predictors \(I(f)\) in Gradient Boosting models is quantified by each predictor’s contribution to reducing the loss function throughout the model’s training process. This contribution is measured as the sum of the error reductions attributed to each split that uses the predictor in question, across all trees \(T\) that comprise the model. Mathematically, the importance of a predictor f can be expressed as:

    $$I\left(f\right)={\sum}_{t=1}^{T}{\sum}_{j\in {S}_{t}\left(f\right)}{\gamma}_{tj}$$
    (9)

    where \({S}_{t}\left(f\right)\) represents the set of all splits performed on feature f in the tree t, and \({\gamma}_{tj}\) indicates the contribution of the split j in tree t to the reduction of errors. The error reduction of a specific split is calculated as the difference between the sum of losses before and after the split, providing a quantitative measure of the improvement in the model attributed to that split.

    After calculating the contributions of all the splits across all trees, the values are summed for each predictor. Finally, the importance is normalized to ensure that the sum of the importance of all features equals 1. In scikit-learn, these importances are accessible through the feature_importances_ property of the GradientBoostingRegressor object, offering a direct view of the relevance of each predictor in the model.

  • In the Random Forest model, the importance of predictors is calculated based on each predictor’s contribution to the decrease in weighted average impurity across all decision trees that make up the forest. For regression problems, impurity is measured by the decrease in variance. The importance of a feature \(f\) is determined as follows:

    $$I\left(f\right)=\frac{{\sum}_{t=1}^{T}\Delta V(f,t)}{{\sum}_{f{\prime}\in F}{\sum}_{t=1}^{T}\Delta V(f{\prime},t)}$$
    (10)

    where T represents the total number of trees in the forest, \(\Delta V(f,t)\) is the decrease in variance due to feature f in tree t, and F is the set of all features. The decrease in variance is calculated as the difference between the variance before and after each split that uses feature f, summed over all trees in which the feature appears.

    The importance calculated in this way is normalized so that the sum of the importances of all features equals 1. This method provides a quantitative view of the relevance of each predictor in the model, reflecting its ability to improve the accuracy of Random Forest predictions. In scikit-learn, the importance of predictors is accessible through the feature_importances_ property of the RandomForestRegressor object, facilitating the analysis and interpretation of each feature’s contribution.
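Both importance definitions above are exposed in scikit-learn through the feature_importances_ attribute, already normalized so the values sum to 1. The sketch below uses synthetic data together with the predictor names from this study; the printed rankings are illustrative only.

```python
# Sketch: extracting normalized feature importances from scikit-learn's
# GradientBoostingRegressor and RandomForestRegressor. Feature names mirror
# the predictors discussed in this study; the data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(42)
features = ["Temperature", "Humidity", "Wind", "Irradiance_Avg", "Irradiance_Std"]
X = rng.random((300, len(features)))
# Synthetic target dominated by the irradiance columns
y = 3.0 * X[:, 4] + 1.0 * X[:, 3] + 0.1 * rng.standard_normal(300)

for model in (GradientBoostingRegressor(random_state=0),
              RandomForestRegressor(random_state=0)):
    model.fit(X, y)
    importances = model.feature_importances_  # normalized: sums to 1
    ranked = sorted(zip(features, importances), key=lambda p: -p[1])
    print(type(model).__name__,
          [(name, round(imp, 3)) for name, imp in ranked])
```

Sorting the attribute against the column names, as above, reproduces the kind of ranking reported later in Table 3.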

Data application: sizing of the green hydrogen production system through photovoltaic energy data

With the results obtained from data imputation through advanced machine learning techniques, specifically Random Forest and Gradient Boosting, the need arises to apply them in a real context and observe the practical impact of this processed information. Thus, to verify the efficacy of the imputed data and their influence on practical decision-making, we propose testing them by sizing an electrolyzer for green hydrogen production. This step is crucial because the treated data originate from a photovoltaic plant located in a strategic region of the country, recognized for its potential to contribute significantly to green hydrogen production34,35.

The choice of sizing an electrolyzer as the data application method is not arbitrary. It reflects a growing trend in the search for alternative and sustainable energy production sources, where green hydrogen stands out as a promising energy vector36.

Applying these data in a real sizing scenario not only validates the quality and precision of the imputations performed, but also provides valuable insights into how the integration of renewable sources can be optimized to meet future energy demands.

To size the proposed system, the following mathematical equations are utilized, considering key variables such as the efficiency of the electrolyzer, additional consumptions in the electrolysis system, and the energy required for hydrogen production:

  1. Total Energy Generated (Etotal) was calculated by summing all the energy generated after data imputation, providing a solid basis for subsequent estimates of hydrogen production.

  2. Annual Operating Hours (Hop_annual) were determined by counting all hours with positive energy production, adjusted to reflect the effective operational time in an annual format.

  3. Annual Energy Available for Hydrogen Production (Eh2_annual) reflects the portion of the generated energy that is effectively available for electrolysis, after accounting for additional consumption in the system.

  4. Mass of Hydrogen Produced (mh2) was calculated by dividing the energy available for electrolysis by the energy required to produce one kilogram of hydrogen, providing a direct estimate of the system’s production potential.

  5. Electrolyzer Power (Pelectrolyzer) was estimated to appropriately size the electrolyzer capacity needed to process the available energy during effective operating hours, maximizing hydrogen production.
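The five quantities above map directly onto a few lines of code. The sketch below is a minimal illustration: the auxiliary-consumption fraction and the specific energy of electrolysis are assumed placeholder values, not parameters taken from this study.

```python
# Sketch of sizing quantities (1)-(5) from an hourly generation series.
# AUX_FRACTION and E_PER_KG_H2 are illustrative assumptions, not values
# reported in this study.
import numpy as np

AUX_FRACTION = 0.15   # assumed share of energy lost to auxiliary consumption
E_PER_KG_H2 = 53.0    # assumed kWh required per kg of hydrogen

def size_electrolyzer(hourly_energy_kwh: np.ndarray) -> dict:
    """Compute the five sizing quantities from an hourly generation series (kWh)."""
    e_total = hourly_energy_kwh.sum()                 # (1) total energy generated
    h_op = int((hourly_energy_kwh > 0).sum())         # (2) annual operating hours
    e_h2 = e_total * (1.0 - AUX_FRACTION)             # (3) energy available for electrolysis
    m_h2 = e_h2 / E_PER_KG_H2                         # (4) mass of hydrogen produced
    p_electrolyzer = e_h2 / h_op if h_op else 0.0     # (5) electrolyzer power (kW)
    return {"Etotal": e_total, "Hop_annual": h_op,
            "Eh2_annual": e_h2, "mh2": m_h2, "Pelectrolyzer": p_electrolyzer}

# Usage with a toy year: 10 daylight hours at 2 kWh, 14 night hours at 0
day = np.array([0.0] * 7 + [2.0] * 10 + [0.0] * 7)
print(size_electrolyzer(np.tile(day, 365)))
```

Applied to each imputed hourly series, the same function yields the comparative sizing quantities discussed later around Table 4.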

Applying these equations, significant differences were observed between the imputation models and the original database. The Random Forest model produced estimates indicating a substantially greater capacity for hydrogen production, evidenced by a total generated energy, mass of hydrogen produced, and electrolyzer power superior to those of the Gradient Boosting model, and, notably, to the database with missing data.

These results highlight the critical importance of precise and robust data imputation for the effective planning and sizing of renewable energy systems. Furthermore, they underscore the potential of machine learning technologies to overcome challenges associated with the variability and incompleteness of solar energy data, facilitating the transition to a more sustainable energy future through optimized green hydrogen production. The practical application of these imputed data in the design of green hydrogen production systems serves as a rigorous test of their validity, demonstrating how they can be employed to drive real advances in renewable energy technologies.

Results and discussion

This section presents an in-depth examination of the original solar energy generation data from a photovoltaic (PV) plant located in the climatically diverse region of Irecê, Bahia. The dataset is made up of measurements of daily energy output, quantified in kilowatt-hours (kWh), covering the period from September 2018 to July 2019, as shown in Fig. 3. The visualization of these time-series data reveals a conspicuous gap, as highlighted by the “Data missing” annotation, spanning from May to July 2019. The absence of data during this interval is a substantial impediment to the seamless record of energy generation, a critical element for the operational management and performance analysis of the plant.

Fig. 3
figure 3

Daily solar energy generation data with missing gap: a time series from September 2018 to July 2019 for a photovoltaic plant in Irecê, Bahia.

The imputation of this missing data is not merely a procedural necessity but a vital process to ensure the accuracy of trend analysis, yield optimization, and reliability assessments for the energy facility. To achieve this, our study leveraged a comprehensive set of climatic parameters, hypothesizing their potential as predictive variables to reconstruct the lost information effectively. These parameters, recorded over the same time span, consist of daily values for temperature, humidity, wind speed, and solar irradiance, including the mean and its variability measure, the standard deviation (see Fig. 4).

Fig. 4
figure 4

Time series analysis of climatic variables as predictors for solar energy generation: temperature, humidity, wind speed, and solar irradiance (average and standard deviation) from July 2018 to July 2019.

Temperature readings are imperative to gauge the operational efficiency of PV panels, as their performance is known to be sensitive to thermal conditions. Humidity is another critical factor; its fluctuation can significantly affect the level of solar insolation, as it is closely related to cloud formation and, consequently, the amount of solar energy reaching the ground. Wind speed also plays a dual role: it can assist in cooling the panels, thereby improving their efficiency, but it can also lead to soiling, which hampers energy absorption. Finally, solar irradiance is the cornerstone predictor, directly correlating with the potential solar energy that can be harnessed. The average irradiance indicates the expected energy input to the system, while its standard deviation captures the variations, adding a layer of complexity to the prediction models due to the inconsistency in solar energy supply.

Utilizing these variables for imputation is paramount because they embody the environmental factors that directly influence solar power generation. Accurate imputation of missing data using these variables can significantly enhance the operational and strategic decisions made for the management of solar energy facilities. The forthcoming sections will delve into the methodologies employed for imputation, the comparative analysis of their predictive accuracy, and the implications of these findings on future practices in solar energy data management.

Building on the comprehensive analysis of climate variables, this section presents the main findings obtained from the application of two advanced machine learning algorithms: Random Forest and Gradient Boosting. These methodologies were selected for their proven efficacy in handling nonlinear relationships and their robustness in dealing with diverse datasets. The Random Forest algorithm, known for its simplicity and its ability to perform efficiently on large datasets, was the first to be employed in the imputation task. This ensemble learning method operates by constructing multiple decision trees during training and producing the average prediction of the individual trees, thus reducing overfitting while maintaining high accuracy.

Subsequently, Gradient Boosting was applied as an alternative to the first model. Gradient Boosting allows the optimization of arbitrary differentiable loss functions, working to reduce both bias and variance in the problem.
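The imputation workflow described in the last two paragraphs can be sketched as follows: each regressor is fitted on the rows where generation was recorded and then used to predict the gap. The column names and synthetic data below are assumptions for illustration, not the study’s dataset.

```python
# Sketch of the imputation workflow: fit on observed rows, predict the gap.
# Column names mirror the predictors in this study; the data are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
n = 365
df = pd.DataFrame({
    "Temperature": 20 + 10 * rng.random(n),
    "Humidity": 40 + 40 * rng.random(n),
    "Wind": 5 * rng.random(n),
    "Irradiance_Avg": 400 + 300 * rng.random(n),
    "Irradiance_Std": 50 + 100 * rng.random(n),
})
df["Energy_kWh"] = (0.004 * df["Irradiance_Avg"] + 0.01 * df["Irradiance_Std"]
                    + 0.05 * rng.standard_normal(n))
df.loc[300:, "Energy_kWh"] = np.nan        # simulate the missing gap

predictors = ["Temperature", "Humidity", "Wind", "Irradiance_Avg", "Irradiance_Std"]
observed = df["Energy_kWh"].notna()

for Model in (RandomForestRegressor, GradientBoostingRegressor):
    model = Model(random_state=0)
    model.fit(df.loc[observed, predictors], df.loc[observed, "Energy_kWh"])
    filled = df["Energy_kWh"].copy()
    filled[~observed] = model.predict(df.loc[~observed, predictors])
    print(Model.__name__, "imputed", int((~observed).sum()), "days")
```

In practice the filled series from each model would then be plotted against the observed data, as in Figs. 5 and 6.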

The results of both models offer insightful revelations about the predictive capabilities of the selected variables in the energy generation data. The performance metrics of each model are meticulously dissected, providing a comparative analysis to discern the strengths and limitations inherent to each approach. Evaluation is anchored in metrics that include MAE, MSE, RMSE, and R2, thus painting a comprehensive picture of their predictive power in the domain of solar energy imputation.
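The four metrics named above, together with the Mean Bias Error discussed later, can be computed with scikit-learn and NumPy as sketched below; the arrays are placeholders rather than values from this study.

```python
# Computing MAE, MSE, RMSE, R2, and MBE for a set of predictions.
# y_true and y_pred are placeholder arrays, not values from this study.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_pred = np.array([1.1, 1.4, 2.1, 2.4, 3.2])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mbe = np.mean(y_pred - y_true)   # mean bias error: sign shows over/under-prediction

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f} MBE={mbe:.4f}")
```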

The graphical results, seen in Fig. 5, represent the data imputation of solar energy generation using two distinct machine learning models: Random Forest (RF) and Gradient Boosting (GB). Both images display time series covering the period from September 2018 to July 2019, with the imputation indicated in the regions highlighted by the respective models.

Fig. 5
figure 5

Results of solar energy generation data imputation using Random Forest and Gradient Boosting.

In the first part of the chart, the region highlighted in blue shows the imputation performed by the RF prediction for the period where energy-generation data were missing. The imputed values closely follow the fluctuations observed in the real data, suggesting that the Random Forest model successfully captured the temporal variation of the series: each predicted point follows the trend and variation of the historical data. The Random Forest model, which averages multiple decision trees to generate its prediction, maintains consistency with the natural fluctuations observed in the real data, an indicator of a successful imputation.

The second part of the chart shows the data imputation performed by the Gradient Boosting model, highlighted in green. Similarly, the model filled the data gap while maintaining consistency with the observed trends, indicating a comparable ability of Gradient Boosting to simulate energy generation based on the existing patterns in the data. This model provided an imputation that, similarly to RF, follows the characteristics of the original data but with a tendency to smooth out fluctuations, possibly due to its iterative optimization approach.

The results, when observed together, suggest that both models could perform imputations that visually align with the observed data patterns. This is a positive indication of the applicability of advanced machine learning techniques to fill in the gaps in time-series data of solar energy generation. The selection between Random Forest and Gradient Boosting should be based on a careful analysis that includes the precision of the imputations, computational complexity, and the ease of model interpretation.

The direct comparison of the two models is exemplified in the combined image (see Fig. 6), where the RF and GB predictions are overlaid. It is observed that both techniques can capture the dynamics of energy generation data. However, subtle differences are noted in how each model responds to variations in the data. RF tends to follow more closely the peaks and valleys, while GB shows a generalization that may be preferable in certain operational contexts where the smoothness of predictions is desirable.

Fig. 6
figure 6

Time series analysis of solar energy generation from July 2018 to July 2019, applied Random Forest and Gradient Boosting.

In the Gradient Boosting model graph (Fig. 7), it is observed that the points tend to cluster closer to the reference line for lower energy values, up to approximately 1.5 kWh. However, for higher energy values, the model’s predictions appear to diverge significantly from the ideal line, indicating a reduction in accuracy for predicting higher energy values.

In contrast, the Random Forest model graph (Fig. 8) shows a more uniform dispersion of points along and around the reference line, covering the entire spectrum of data. This suggests that the Random Forest model can make predictions consistently close to the actual values, regardless of the energy level involved, demonstrating robustness and less bias compared to Gradient Boosting.

Fig. 7
figure 7

Gradient Boosting model dispersion results.

Fig. 8
figure 8

Random forest model dispersion results.

Comparatively, the Random Forest model exhibits greater consistency in its predictions relative to the Gradient Boosting model, as evidenced by the more uniform distribution of points along the reference line. Furthermore, while Gradient Boosting faces challenges in accurately predicting higher energy values, Random Forest demonstrates superior and more consistent performance across different energy levels.

Therefore, the selection of the most appropriate model for the data imputation task should consider not only the accuracy of the predictions but also the nature of the missing data and the specific requirements of the application in terms of computational complexity and interpretability of the results. The findings presented here lay the groundwork for a more in-depth analysis of these factors, contributing to the existing literature in the field of data imputation in renewable energy systems.

Metrics

In this study, the effectiveness of the two machine learning models is compared using quantitative metrics. The analysis revealed that the Random Forest model significantly outperformed the Gradient Boosting model across all evaluated metrics; the values recorded by each model are detailed in Table 2.

Table 2 Comparing the performance of Random Forest and Gradient Boosting models across four different metrics: MAE, MSE, RMSE, and R2.

The superiority of the Random Forest model suggests greater accuracy in predictions, with significantly lower errors compared to Gradient Boosting. The lower MAE and RMSE indicate that on average the predictions of the Random Forest model are closer to the actual values, while the higher R2 reveals that this model can explain a larger proportion of the variance in the test data.

Notably, both models presented a Mean Bias Error (MBE) close to zero, indicating the absence of a significant systematic bias in the predictions. However, the slightly more negative MBE of Gradient Boosting suggests a bias towards lower predictions, although this bias is minimal.

The superior performance of the Random Forest model can be attributed to its ability to effectively model complexities and non-linear interactions in the data without the risk of overfitting, an essential characteristic for analyzing complex time series and environmental data like that of energy generation. On the other hand, the relatively inferior performance of the Gradient Boosting model could be improved through more detailed parameter tuning, feature selection, and preprocessing techniques.
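A minimal sketch of the parameter tuning suggested above, using scikit-learn’s GridSearchCV over a few common GradientBoostingRegressor hyperparameters; the grid values and synthetic data are illustrative assumptions, not the settings used in this study.

```python
# Illustrative hyperparameter search for GradientBoostingRegressor.
# Grid values and data are examples, not the settings used in this study.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 3.0 * X[:, 4] + X[:, 3] + 0.1 * rng.standard_normal(200)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, round(-search.best_score_, 4))
```

Expanding the grid (subsample, min_samples_leaf, etc.) trades computation time for a potentially better fit.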

Feature importance

The results for feature importance are presented in the Random Forest model, where the variable ’Irradiance_Std’, representing the standard deviation of solar irradiance, emerged as the most significant predictor with a relative importance of 83.494%, see Table 3. This result underscores the relevance of irradiance variability in predicting solar energy, suggesting that fluctuations in the amount of solar radiation received are a critical indicator of energy yield. Following in importance, ’Irradiance_Avg’, the average irradiance, contributed 10.894%, reinforcing the notion that energy generation prediction heavily depends on the average amount of solar radiation. Surprisingly, factors like ’Humidity’ and ’Temperature’ showed lower importance, with 2.6948% and 2.6113% respectively, which may indicate that despite their influence on the operational environment of photovoltaic panels, it is the solar irradiance metrics that prevail in determining energy production. The predictor ’Wind’ showed the least importance, with only 0.3106%, reflecting a marginal influence of wind on the operational conditions or efficiency of photovoltaic installations.

Table 3 Comparison of feature importance with model result differences expressed as percentages.

Conversely, the Gradient Boosting model presented a similar predictor importance profile but with some notable differences. ’Irradiance_Std’ was also the most prominent with 82.8195%, confirming its position as the dominant indicator in predicting solar energy generation. Interestingly, ’Irradiance_Avg’ saw an increase in its importance to 15.6734%, which may reflect Gradient Boosting’s ability to delve deeper into the relationship between average solar irradiance and energy generation. The predictors ’Humidity’, ’Temperature’, and ’Wind’ registered importances of 0.8869%, 0.5679%, and 0.0523% respectively. This pattern reiterates the trend observed in the Random Forest model, where the average and variability of solar irradiance are the most critical components in modeling solar energy generation, while other climatic variables seem to play secondary roles. These findings suggest that models reliant on variables related to solar irradiance are more robust and reliable for data imputation in solar energy generation systems. The subordination of factors such as humidity, temperature, and wind speed highlights the need to focus on the most influential aspects of weather that directly affect the efficiency and output of solar installations. These insights provide valuable input for optimizing the design and operation of photovoltaic systems, as well as for enhancing the accuracy of solar energy prediction models.

Data application results

Based on the results obtained from the application of the Random Forest and Gradient Boosting models for the imputation of missing solar energy generation records, and their subsequent application in the sizing of green hydrogen production systems, significant differences in performance and results were observed (Table 4).

Table 4 Comparative overview of green hydrogen production system sizing.

The Random Forest model produced a data imputation that led to a total estimated energy production of 581,103.58 kWh, with a total of 61,601.25 operational hours. This resulted in an estimated annual hydrogen production (Eh2_annual) of 493,938.05 kWh and a total mass of hydrogen produced (mh2) of 14,819.62 kg, with an electrolyzer power (Pelectrolyzer) of 5.61 kW.

In contrast, the application of the Gradient Boosting model provided an imputation that resulted in a total estimated energy production of 216,970.60 kWh, with 67,300.75 operational hours. This translated into an estimated annual hydrogen production of 184,425.01 kWh and a total mass of hydrogen produced of 5,533.30 kg, with the electrolyzer power being 1.92 kW.

To contextualize these results, the analysis of the original database with missing data revealed a much lower total energy production of only 14,261.87 kWh, with 3,771.75 operational hours, an annual hydrogen production of 12,122.59 kWh, a total mass of hydrogen produced of 363.71 kg, and an electrolyzer power of 2.25 kW.

The discrepancy between the results obtained through the imputation models and the original database highlights the significant impact that data imputation can have on the sizing and evaluation of green hydrogen production systems. The Random Forest model demonstrated a remarkable ability to optimize the utilization of solar energy generation for hydrogen production, suggesting that accurate data imputation can play an essential role in maximizing the efficiency and yield of such systems.

Conclusions

This study provided significant insights into the effectiveness of machine learning models, Random Forest and Gradient Boosting, for imputing missing data in solar energy generation databases. The Random Forest model was shown to excel in accuracy and efficiency, as indicated by performance metrics including mean absolute error (MAE) of 0.0364, mean squared error (MSE) of 0.0097, root mean squared error (RMSE) of 0.0985, and a coefficient of determination (R2) of 0.9779. These notably superior values, compared to standard models, highlight Random Forest’s robustness in handling the complexities of solar data.

Through the case study involving the application of imputed data to size a green hydrogen production system, it became evident that accurate data imputation can play a crucial role in maximizing operational efficiency and yield in renewable energy systems. The case study also underscored the value of feature importance analysis, with ’Irradiance_Std’ emerging as the most significant predictor, indicating that fluctuations in solar radiation are critical determinants of energy generation efficacy. These findings not only validate the utility of advanced machine learning methods in energy data management but also provide a solid foundation for the future development of more accurate and efficient predictive models. The widespread implementation of these techniques could significantly enhance decision-making and operational strategy at solar energy facilities, promoting a more sustainable and efficient energy transition.