Introduction

Solar energy, a renewable and sustainable source, plays a pivotal role in the global transition toward a future of clean energy. In a world increasingly driven by the imperative to reduce carbon emissions and mitigate climate change, solar energy emerges as a vital solution due to its abundance and availability. The growing demand for renewable energy sources places solar energy at the forefront, not only as an alternative to fossil fuels, but also as a fundamental pillar in the quest for energy sustainability.

As highlighted by Chandel and Roy1, accurate forecasting of solar radiation is critical to optimizing the performance of photovoltaic systems and efficiently integrating solar energy into the electrical grid. The effectiveness of solar energy utilization, however, is inherently linked to the precision and completeness of the collected meteorological data. Precise data on solar irradiation, temperature, humidity, and other environmental variables are essential to evaluate the solar energy potential of a region. Data collection, however, can be fraught with challenges, including data gaps due to equipment limitations, maintenance, or adverse environmental conditions. These gaps can lead to inaccurate analyses that affect decision-making and efficiency in the implementation of solar energy solutions.

Recent advances in machine learning have significantly enhanced the methods available for optimizing and managing solar energy systems. For instance, Zhou et al.2 developed a machine learning-based optimal design of a renewable system integrating phase change materials, which includes on-site PV, radiative cooling, and hybrid ventilation. This study demonstrated substantial improvements in energy performance across different climatic regions. Further advances were made by Khare et al.3, who investigated the use of Generative Adversarial Networks (GANs) for the imputation of solar radiation data, demonstrating significant improvements in the quality of the imputed data.

The challenges associated with forecasting solar radiation, arising from the complexity of atmospheric processes and the dynamics of climate patterns, are addressed through advanced machine learning techniques4,5. In Ref.1, the authors presented a hybrid strategy using Random Forest and XGBoost to fill significant gaps in solar data, given their effectiveness on large and complex datasets. The study conducted by Xu et al.6 introduced an innovative approach to predict the higher heating value (HHV) of coal using a gradient boosting regression tree model. The authors proposed a gradient boosting (GB) model that relies on the correlation between the proximate analysis of coal and its HHV, offering a less resource-intensive alternative for estimating coal calorific values. This approach not only highlights the efficacy of the gradient boosting regression tree (GBRT) model in predicting the HHV of coal with greater accuracy but also offers a promising methodology for coal quality analysis, potentially reducing the costs and time associated with traditional evaluation.

Necati Aksoy and Istemihan Genc7 developed innovative predictive models to forecast energy production in solar power plants, employing gradient boosting methods, specifically XGBoost, LightGBM, and CatBoost. The study emphasizes the importance of accuracy and speed in forecasting the energy generated by solar power plants in smart grids, microgrids, or small-scale grids, and how forecasts influence critical decisions in the dynamic energy management of these networks. The authors developed machine learning models trained on a dataset composed of various meteorological characteristics. These models offer high accuracy and rapid learning, bringing significant benefits to the solar energy sector. Furthermore, the study compares the performance of the models and discusses their practical applicability.

The research conducted by Sasirekha et al.8 highlighted the importance of solar radiation forecasting in the energy generation sector, emphasizing the efficiency of Random Forest on large datasets and in estimating missing data. This comparative study found that, although Random Forest requires more processing time due to the larger number of decision trees, it provides more accurate predictions, which is especially useful in practical applications such as solar radiation analysis for electric power production.

Villegas-Mier et al.9 addressed solar radiation forecasting using an optimized Random Forest algorithm. Their work compared the results with other machine learning models and showed a significant improvement in accuracy compared to conventional methods such as linear regression and recurrent neural networks. The study was carried out in Querétaro, Mexico, and highlighted the effectiveness of Random Forest in predicting solar radiation, especially in locations with average weather conditions for most of the year.

In this context, machine learning techniques such as Random Forest and Gradient Boosting emerge as powerful tools to address limitations in the analysis of solar energy data. Random Forest, an ensemble learning method, is known for its high accuracy and ability to handle large datasets with multiple input variables4. By applying this model to the imputation of missing solar energy generation data, we can significantly improve the quality and reliability of the analyses. For this purpose, the work uses variables such as temperature, radiation, humidity, and wind speed for data estimation.

Similarly, Gradient Boosting (GB) is an ensemble machine learning method applicable to classification and regression problems. It trains models sequentially, focusing on the more difficult cases by adjusting the weights of instances with incorrect predictions. The aim is to minimize a loss function, similarly to the training of neural networks, but by combining multiple models to enhance accuracy. Numerous hyperparameters, including the learning rate and the loss function, are used to optimize model performance10.

The selection of random forests and gradient boosts for this study is motivated by their ability to process large volumes of data and capture complex relationships between input variables without the need for extensive data transformations. Moreover, while Random Forest is known for its ease of use, robust generalization, and resistance to overfitting, Gradient Boosting offers flexibility in optimizing different loss functions and can achieve greater accuracy by adaptively adjusting to failures in previous predictions. Other methods such as neural networks, SVM (Support Vector Machines), and linear models have their advantages, such as the ability of neural networks to model complex nonlinear relationships and the efficiency of SVMs in high-dimensional spaces. However, these methods may require more data for effective training or be more susceptible to overfitting, making Random Forest and Gradient Boosting a more balanced choice for many solar energy analysis scenarios11,12,13.

Thus, this study explores the implementation of Gradient Boosting and Random Forest models to enhance the integrity of solar data, assessing which method is more suitable for imputing missing photovoltaic generation data between April and June. The database refers to a PV plant on the campus of the Federal Institute of Bahia, located in the city of Irecê, Bahia, Brazil. The data are open, available on the Kaggle platform, and include meteorological data provided by the National Institute of Meteorology (INMET) of Brazil.

Literature review

The field of artificial intelligence has seen significant advancements in the past few decades, fuelling various innovative applications in various domains, from medicine to industry. At the heart of this progress lies machine learning. This branch of artificial intelligence is focused on developing systems that learn and improve from experience without explicit programming. It encompasses supervised, unsupervised, and reinforcement learning14,15. In supervised learning, the model is trained with a labeled dataset, using techniques such as linear and logistic regression, neural networks, and decision trees. Decision trees, a method of supervised learning, are used for classification and regression, learning from simple rules based on the data’s characteristics16,17,18.

In unsupervised learning, models are trained on unlabeled data to discover patterns, clusters, or hidden structures within the data. Popular algorithms include K-Means, PCA (Principal Component Analysis), and Gaussian Mixture Models 19,20,21. Meanwhile, in reinforcement learning, agents learn to make sequential decisions to maximize a cumulative reward. This is used in applications such as games and robotics. Popular algorithms include Q-Learning and DDPG (Deep Deterministic Policy Gradients)22,23.

In recent years, the application of machine learning techniques has become increasingly important in the field of solar energy. Several studies have demonstrated the effectiveness of various algorithms in predicting solar radiation and imputing missing data in solar power generation databases.

Machine learning techniques are broadly categorized according to the function they perform in data processing. These categories span various algorithms and methods, each designed to solve specific types of problems. The main categories include classification, regression, clustering, dimensionality reduction, and probability estimation17.

  • Classification: is one of the most common tasks in machine learning. It involves assigning a class or category to a given object or instance based on its features. This is particularly useful in cases where one wishes to predict a specific category or label for new data. Some of the most popular classification algorithms include the following:

    • Decision Trees: Hierarchical decision structures that divide the data into subsets based on characteristics.

    • Random Forests: Ensembles of decision trees that aggregate their predictions to increase accuracy.

These algorithms are effective in dealing with categorical data and are widely used in tasks such as spam detection, medical diagnosis, and pattern recognition.

  • Regression: deals with the prediction of continuous values, rather than discrete categories. It is used when the goal is to estimate a numerical value based on a set of input variables. Common regression algorithms include the following:

    • Linear regression: Models the linear relationship between the input variables and the output variable.

    • Logistic regression: Used for binary classification problems, where the output is a probability.

Regression is applied in a wide range of domains, such as prediction of real estate prices, analysis of economic trends, and prediction of sports results.

  • Clustering: is a technique that groups unlabeled data based on their intrinsic similarities. It is used to identify groups or clusters of data that share common characteristics. Some popular clustering algorithms include the following:

    • K-Means: Groups the data into K clusters, where K is specified by the user.

    • Gaussian Mixture Models: Models the clusters as Gaussian distributions and is useful when the data have a more complex structure.

Clustering is used in market segmentation, social network analysis, and image segmentation, among other applications.

  • Dimensionality Reduction: is a technique that aims to reduce the complexity of the data while retaining the most important information. This is especially useful when dealing with high-dimensional datasets. One of the most common methods of dimensionality reduction is Principal Component Analysis (PCA), which identifies the directions of greatest variability in the data and projects the data onto a lower-dimensional space. Dimensionality reduction is applied to data visualization, image compression, and the simplification of machine learning models.

  • Probability Estimation: often used in classification tasks to calculate the probability that an example belongs to a specific class. The Naive Bayes algorithm is a classic example of a probability estimation method, which relies on Bayes’ theorem to calculate the conditional probabilities of the classes.

Supervised learning is one of the most fundamental and widely used categories of machine learning models. In this paradigm, machine learning algorithms learn from data consisting of input examples and their corresponding outputs, often called target labels. Formally, the input examples are represented as \({W}_{i}\) and their respective target labels as \({y}_{i}\), for each example \(i\in\left\{1,\dots,N\right\}\). The crucial aspect that differentiates supervised learning is the presence of these target labels, which provide valuable feedback to guide the algorithm through the learning process, allowing it to accurately capture the relationships between inputs and outputs.

The development process of a machine learning model involves solving an optimization problem. Frequently, however, a single model is not the best solution for a complex problem. This is where ensemble learning comes into play, aiming to improve performance by combining the predictions of multiple models24.

The Random Forest method is a machine learning algorithm that operates by constructing multiple decision trees during training and outputting the average of these trees’ predictions to improve accuracy and control overfitting. Each decision tree in the Random Forest is built from a bootstrap sample of the data, and the splits of each tree are determined by a random subset of the features. This ensemble learning approach allows the model to capture both the variability and the patterns in the data, making it robust and precise, especially in cases with complex and multidimensional data. Random Forest is widely used for its effectiveness, ease of implementation, and good performance on various prediction tasks.

The state of the art in applying Random Forest to solar energy prediction reveals a growing trend in the use of artificial intelligence (AI) techniques to forecast photovoltaic energy generation. In addition to Random Forest, other methods such as NARX, NARMAX, ARIMA, and neural network models, such as the backpropagation neural network and the extreme learning machine (ELM), have been explored. Each of these models presents unique features, such as the ELM’s ability to self-learn and adapt and ARIMA’s sensitivity to changes in solar irradiation22,25,26.

Random Forest stands out for its ability to handle large volumes of data and reduce overfitting, making it an effective choice for precise predictions. However, the field still faces challenges, such as the need for efficient processing of large data volumes and the development of more efficient training algorithms for neural networks. Thus, while Random Forest is a promising option, there is room for innovation and improvement in solar energy prediction using AI.

The application of machine learning models, especially Random Forest, in solar energy data analysis has been explored in various recent research. For example, a study conducted by H. Sun et al.27 demonstrated how Random Forest could be effectively used to predict solar energy production under varying weather conditions. This study highlights the model’s robustness in handling the nonlinearity of environmental data, a challenge also encountered in our research28.

In contrast, Gradient Boosting is an ensemble machine learning algorithm that enhances accuracy through the sequential construction of decision trees. Unlike methods that build trees independently, Gradient Boosting focuses on correcting errors from previous trees, adapting to the most challenging cases during training. Each new tree is built to minimize the residual errors of previous predictions, using a process that gradually reduces the loss function. This process involves selecting specific hyperparameters, such as the learning rate and loss function, which guide the fine-tuning of the model for improved performance.

The application of Gradient Boosting in solar energy predictions demonstrates the versatility and efficacy of this method in handling complex and variable data. Similar to Random Forest, Gradient Boosting benefits from ensemble learning to provide robust and accurate predictions. However, it is distinguished by its ability to progressively correct errors, which can lead to higher accuracy in scenarios where data patterns are particularly challenging.

Despite the effectiveness of these methods, challenges remain in processing large data volumes efficiently and developing training algorithms that can fully leverage the available data29,30. The evolving landscape of machine learning in solar energy prediction continues to promise enhancements in accuracy and operational efficiency, underscoring the importance of innovation in this field31,32,33.

Methodology

The Methodology section of this study outlines the process employed to address gaps in solar energy generation data by utilizing the Random Forest and Gradient Boosting algorithms. It provides a detailed account of data preparation, the handling of missing values, the division of data into training and testing sets, model training procedures, and the evaluation of model performance.

Materials

Data were collected from the solar station at the photovoltaic plant and the database of the National Institute of Meteorology (INMET) of Brazil, covering variables such as air temperature, humidity, wind speed, and irradiation, alongside energy generation. This information was sourced from a photovoltaic solar plant located at the Federal Institute of Bahia in Irecê, Brazil, a region known for its high solar potential due to its proximity to the equator. The dataset was obtained from the open-source platform Kaggle, and the author has no affiliation with the mentioned institution. Figure 1 presents the system data used for this study.

Fig. 1. Irecê generation power data with a missing period.

The first notable observation is the consistency of energy generation over time, with peaks following a daily pattern indicative of solar generation relative to daylight. Energy production varies between approximately 0.5 and 2.0 kWh, with most days showing robust energy production, consistent with expectations for a photovoltaic plant in a high-insolation region. However, the area highlighted in the chart indicates a data gap between the end of May and the beginning of June 2019, which draws attention. This gap could be attributed to various causes, such as failures in the monitoring system, plant maintenance, or extreme weather events that may have temporarily halted operations. The absence of data during this period is critical as it prevents a comprehensive analysis of the plant’s performance and efficiency, as well as potential environmental or economic impacts.

A slight trend of reduction in energy generation is also observable immediately before the data interruption. This could indicate a gradual degradation of solar panel efficiency or expected seasonal variations due to changes in the solar angle over the year.

With access to the primary data, the imputation study was conducted. The process used in the development of the work is illustrated in Fig. 2. Initially, the data were separated into two sets: one containing complete records of energy generation and another where these values were absent. The SimpleImputer from the sklearn library was used to replace missing values in the predictor variables with the mean of the set, preparing the data for training.
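The mean-imputation step described above can be sketched with scikit-learn's SimpleImputer. This is an illustrative example with made-up values for the four predictor columns, not the actual INMET records:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: temperature, irradiation, humidity, wind speed (illustrative values).
X = np.array([
    [25.0, 800.0, np.nan, 3.2],
    [27.5, np.nan, 55.0, 2.8],
    [np.nan, 910.0, 48.0, 3.5],
])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)  # each NaN is replaced by its column mean
```

After fitting, `X_filled[2, 0]` holds the mean of the first column, 26.25, and no NaN values remain in the predictor matrix.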

Fig. 2. Development process flowchart.

The Random Forest Regressor and Gradient Boosting algorithms were adopted to model the relationship between environmental variables and energy generation. The models were trained with complete data and then used to estimate energy generation for records where it was missing.

The validity of the predictions was ensured by checking for the absence of negative values, which would be inconsistent with the nature of energy generation data. The evaluation also considered the performance of the models on test data using metrics such as mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R2). These metrics provide a comprehensive view of the models’ performance, including their accuracy and ability to explain the variance in the data.
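The non-negativity check mentioned above can be enforced by clipping predictions at zero. A minimal sketch with hypothetical model outputs (the values are illustrative, not results from the study):

```python
import numpy as np

# Hypothetical model outputs in kWh; a small negative value can occur
# numerically even though negative generation is physically impossible.
y_pred = np.array([0.8, 1.6, -0.02, 1.1])
y_pred_valid = np.clip(y_pred, 0.0, None)  # floor predictions at zero
```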

Additionally, the relative importance of the climatic variables in the prediction models was observed and evaluated using the feature_importances_ attribute of the trained models. This allows understanding which variables have the most significant impact on the prediction of energy generation.

Random Forest

The Random Forest used in the study is an ensemble learning method for regression, which constructs multiple decision trees during training. The method is suitable for our study because of its effectiveness in handling high-dimensional data and its ability to detect interactions between dependent and independent variables.

Random Forest is a versatile machine learning algorithm that is applicable to both regression and classification tasks. It builds multiple decision trees through bootstrap sampling and combines their predictions. In regression, the prediction is given by the average of the outputs of the trees:

$$\hat{{\text{y}}}\left(x\right)=\frac{1}{K}{\sum}_{k=1}^{K}{T}_{k}\left(x\right)$$
(1)

where \(\hat{{\text{y}}}\left(x\right)\) is the prediction for a new data point x, and \({T}_{k}\left(x\right)\) is the prediction of the k-th tree. In classification, the prediction is the class receiving the most votes among the trees:

$$\hat{{\text{y}}}\left(x\right)=mode\left\{{T}_{k}\left(x\right)\right\}$$
(2)

The algorithm also assesses the importance of variables, which is crucial for understanding the features that most influence the model’s predictions.

The machine learning process is carried out in two phases: training and application. In the training phase, the Random Forest model learns from the training data. The application phase uses the trained model to make predictions on new data.
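The two phases above can be sketched with scikit-learn's RandomForestRegressor on synthetic data standing in for the meteorological predictors (the feature layout and coefficients are assumptions for illustration). The sketch also verifies Eq. (1): the forest's prediction is the average of its individual trees' outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic predictors: temperature, irradiation, humidity, wind speed.
X_train = rng.uniform(size=(200, 4))
# Synthetic generation target, driven mainly by the irradiation column.
y_train = 2.0 * X_train[:, 1] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)          # training phase

X_new = rng.uniform(size=(5, 4))
y_hat = model.predict(X_new)         # application phase

# Eq. (1): predict() equals the mean over the K fitted trees.
per_tree_mean = np.mean([t.predict(X_new) for t in model.estimators_], axis=0)
```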

Gradient Boosting

Gradient Boosting is another ensemble learning method used in this study, specifically for regression and classification tasks. Unlike the Random Forest, Gradient Boosting optimizes predictive models sequentially, correcting the errors of previous predictors through an iterative procedure. This method turned out to be efficient for our study, especially due to its ability to minimize both bias and variance, resulting in more accurate predictions5.

Gradient Boosting constructs an additive model in a progressive stage-by-stage manner; at each stage, decision trees are introduced that correct the residuals of the current model. The final prediction is a weighted combination of the predictions from all the decision trees:

$$\hat{{\text{y}}}\left(x\right)={\sum}_{k=1}^{K}{\gamma}_{k}{T}_{k}\left(x\right)$$
(3)

where \(\hat{{\text{y}}}\left(x\right)\) represents the prediction for a new data point x, \({T}_{k}\left(x\right)\) is the prediction from the k-th decision tree, and \({\gamma}_{k}\) is the weight associated with the k-th tree.

This method is particularly known for its flexibility, being adjustable to optimize different loss functions and control overfitting through parameters such as the learning rate and the number of trees. Furthermore, like Random Forest, Gradient Boosting performs a feature importance analysis, allowing for a detailed interpretation of the variables that have the most significant influence on the predictions.

During the training phase, the Gradient Boosting model iteratively learns to correct the errors from previous trees, and in the application phase, it applies the trained model to predict the test data. The ability of Gradient Boosting to handle various types of data and its efficacy in complex situations make it a robust choice for predictive modeling in our study.
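The stage-by-stage fitting of Eq. (3) can be illustrated with scikit-learn's GradientBoostingRegressor on synthetic data (the target function is an assumption for the sketch). `staged_predict` exposes the additive model after each boosting stage, so the training error can be seen to shrink as later trees correct earlier residuals:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)

gb = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=1
)
gb.fit(X, y)

# Training MSE after each of the 200 stages of the additive model.
errors = [np.mean((y - stage_pred) ** 2) for stage_pred in gb.staged_predict(X)]
```

The learning rate scales each tree's contribution (the \({\gamma}_{k}\) weights), trading per-stage progress against the number of trees needed.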

Comparison between Random Forest and Gradient Boosting

The ensemble learning models Random Forest and Gradient Boosting are widely used in regression and classification tasks due to their high accuracy and generalization capability. Both models utilize decision trees as building blocks but employ different strategies in the learning process.

Random Forest is a learning model that operates by constructing a set of independent decision trees, each trained with a random sample of the data. The diversity among the trees is what gives Random Forest its robustness and resistance to overfitting. In contrast, Gradient Boosting constructs an additive model sequentially, where each new tree corrects the errors of the previous one, which can result in better performance if the model’s complexity is carefully managed to avoid overfitting.

While Random Forest may be easier to tune and less prone to overfitting due to its random nature, Gradient Boosting often achieves higher accuracy, though it requires fine-tuning of parameters and special attention to regularization.

Table 1 summarizes the main features of each model, facilitating the comparison between them.

Table 1 Comparison of the main features of Random Forest and Gradient Boosting.

When comparing the machine learning models Random Forest and Gradient Boosting, we evaluate two powerful approaches for building predictive models using decision trees. However, the way they construct and combine these trees is fundamentally different. Random Forest generates many trees independently, each learning from a random sample of the data. This is akin to having a series of experts with diverse opinions reaching a consensus. The method is quite resistant to overfitting and generally performs well even without fine-tuning of parameters, making it an attractive choice for beginners.

On the other hand, Gradient Boosting builds trees sequentially. Each new tree is created to correct the errors left by the previous tree, somewhat as if each new expert builds their knowledge based on the weaknesses of those before them. This can lead to more accurate outcomes but also a higher risk of overfitting, especially if the model is overly complex for the data at hand. Moreover, Gradient Boosting may be more sensitive to outliers and mostly requires more time and expertise to correctly adjust the parameters.

Both models allow for evaluating which variables are most important for making predictions, which is incredibly useful for understanding the model’s outcomes. However, the decision between using Random Forest or Gradient Boosting may depend on several considerations, such as the size and nature of the data, the available computational capacity, and the user’s experience in tuning model parameters.

Performance metrics

Prediction metrics are crucial for evaluating the accuracy and reliability of machine learning models. For this development, the following are used29,31:

  • Mean Bias Error (MBE): measures the tendency of the model to predict values higher or lower than the actual values:

    $$MBE=\frac{1}{N}{\sum}_{i=1}^{N}\left({y}_{i}^{pred}-{y}_{i}^{obs}\right)$$
    (4)
  • Mean Absolute Error (MAE): is suitable for situations with linear cost functions:

    $$MAE=\frac{1}{N}{\sum}_{i=1}^{N}\left|{y}_{i}^{pred}-{y}_{i}^{obs}\right|$$
    (5)
  • Root Mean Square Error (RMSE): is more sensitive to significant prediction errors:

    $$RMSE=\sqrt{\frac{1}{N}{\sum}_{i=1}^{N}{\left({y}_{i}^{pred}-{y}_{i}^{obs}\right)}^{2}}$$
    (6)
  • Mean Squared Error (MSE): provides a measure of the variance of the prediction errors, penalizing larger deviations more heavily:

    $$MSE=\frac{1}{N}{\sum}_{i=1}^{N}{\left({y}_{i}^{pred}-{y}_{i}^{obs}\right)}^{2}$$
    (7)
  • Coefficient of Determination (R2): indicates the proportion of the variance in the dependent variable that is predictable from the independent variables:

    $${R}^{2}=1-\frac{{\sum}_{i=1}^{N}{\left({y}_{i}^{obs}-{y}_{i}^{pred}\right)}^{2}}{{\sum}_{i=1}^{N}{\left({y}_{i}^{obs}-{\bar{y}}_{obs}\right)}^{2}}$$
    (8)

    where \({\bar{y}}_{obs}\) is the average of the observed values.

In the equations presented, \({y}_{i}^{pred}\) represents the value predicted by the model for the i-th observation, \({y}_{i}^{obs}\) the observed or actual value, N the total number of observations, and \({\bar{y}}_{obs}\) the average of the observed values. These metrics, combined, offer a comprehensive view of the model’s performance, highlighting not only the accuracy and bias of the predictions but also the variance of the errors and the model’s ability to explain variations in the observed data.
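Eqs. (4)-(8) translate directly into NumPy. MBE has no scikit-learn counterpart, so all five metrics are written out for consistency; the sample arrays are illustrative:

```python
import numpy as np

def mbe(obs, pred):
    # Eq. (4): mean bias error, positive when the model over-predicts.
    return np.mean(pred - obs)

def mae(obs, pred):
    # Eq. (5): mean absolute error.
    return np.mean(np.abs(pred - obs))

def mse(obs, pred):
    # Eq. (7): mean squared error.
    return np.mean((pred - obs) ** 2)

def rmse(obs, pred):
    # Eq. (6): root of the MSE, sensitive to large deviations.
    return np.sqrt(mse(obs, pred))

def r2(obs, pred):
    # Eq. (8): proportion of observed variance explained by the model.
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - np.mean(obs)) ** 2)
    return 1.0 - ss_res / ss_tot

obs = np.array([1.0, 2.0, 3.0])
pred = np.array([1.1, 1.9, 3.2])
```

For these sample values, MSE is 0.02, RMSE is its square root, and R2 is 0.97.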

Relative importance of predictors

In this section, we explore the relative importance of predictors in the machine learning models Random Forest and Gradient Boosting, both widely used for data imputation and predictions across various application domains, including solar energy generation. Understanding which variables have the greatest influence on the prediction of energy generation is crucial, as it allows the optimization of models to obtain more accurate and efficient predictions. We use the criterion of impurity decrease (Random Forest) and the contribution to error reduction (Gradient Boosting) to quantify the importance of each predictor. This analysis not only identifies the most significant factors affecting solar energy generation but also provides valuable insights for future research and operational practices in the field of renewable energy. Below, we mathematically detail how this importance is calculated and discuss the implications of our findings for the design and implementation of more effective photovoltaic systems.

  • The importance of predictors \(I(f)\) in Gradient Boosting models is quantified by each predictor’s contribution to reducing the loss function throughout the model’s training process. This contribution is measured as the sum of the error reductions attributed to each split that uses the predictor in question, across all trees \(T\) that comprise the model. Mathematically, the importance of a predictor f can be expressed as:

    $$I\left(f\right)={\sum}_{t=1}^{T}{\sum}_{j\in {S}_{t}\left(f\right)}{\gamma}_{tj}$$
    (9)

    where \({S}_{t}\left(f\right)\) represents the set of all splits performed on feature f in the tree t, and \({\gamma}_{tj}\) indicates the contribution of the split j in tree t to the reduction of errors. The error reduction of a specific split is calculated as the difference between the sum of losses before and after the split, providing a quantitative measure of the improvement in the model attributed to that split.

    After calculating the contributions of all the splits across all trees, the values are summed for each predictor. Finally, the importance is normalized to ensure that the sum of the importance of all features equals 1. In scikit-learn, these importances are accessible through the feature_importances_ property of the GradientBoostingRegressor object, offering a direct view of the relevance of each predictor in the model.

  • In the Random Forest model, the importance of predictors is calculated based on each predictor’s contribution to the decrease in weighted average impurity across all decision trees that make up the forest. For regression problems, impurity is measured by the decrease in variance. The importance of a feature \(f\) is determined as follows:

    $$I\left(f\right)=\frac{{\sum}_{t=1}^{T}\Delta V(f,t)}{{\sum}_{f{\prime}\in F}{\sum}_{t=1}^{T}\Delta V(f{\prime},t)}$$
    (10)

    where T represents the total number of trees in the forest, \(\Delta V(f,t)\) is the decrease in variance due to feature f in tree t, and F is the set of all features. The decrease in variance is calculated as the difference between the variance before and after each split that uses feature f, summed over all trees in which the feature appears.

    The importance calculated in this way is normalized so that the sum of the importances of all features equals 1. This method provides a quantitative view of the relevance of each predictor in the model, reflecting its ability to improve the accuracy of Random Forest predictions. In scikit-learn, the importance of predictors is accessible through the feature_importances_ property of the RandomForestRegressor object, facilitating the analysis and interpretation of each feature’s contribution.
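Both importance definitions above are exposed in scikit-learn through the feature_importances_ attribute, already normalized so the values sum to 1. The sketch below uses synthetic data together with the predictor names from this study; the printed rankings are illustrative only.

```python
# Sketch: extracting normalized feature importances from scikit-learn's
# GradientBoostingRegressor and RandomForestRegressor. Feature names mirror
# the predictors discussed in this study; the data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(42)
features = ["Temperature", "Humidity", "Wind", "Irradiance_Avg", "Irradiance_Std"]
X = rng.random((300, len(features)))
# Synthetic target dominated by the irradiance columns
y = 3.0 * X[:, 4] + 1.0 * X[:, 3] + 0.1 * rng.standard_normal(300)

for model in (GradientBoostingRegressor(random_state=0),
              RandomForestRegressor(random_state=0)):
    model.fit(X, y)
    importances = model.feature_importances_  # normalized: sums to 1
    ranked = sorted(zip(features, importances), key=lambda p: -p[1])
    print(type(model).__name__,
          [(name, round(imp, 3)) for name, imp in ranked])
```

Sorting the attribute against the column names, as above, reproduces the kind of ranking reported later in Table 3.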

Data application: sizing of the green hydrogen production system through photovoltaic energy data

With the results obtained from data imputation through advanced machine learning techniques, specifically Random Forest and Gradient Boosting, the need arises to apply them in a real context and observe the practical impact of this processed information. Thus, to verify the efficacy of the imputed data and their influence on practical decision-making, we propose testing them by sizing an electrolyzer for green hydrogen production. This step is crucial because the treated data originate from a photovoltaic plant located in a strategic region of the country, recognized for its potential to contribute significantly to green hydrogen production34,35.

The choice of sizing an electrolyzer as the data application method is not arbitrary. It reflects a growing trend in the search for alternative and sustainable energy production sources, where green hydrogen stands out as a promising energy vector36.

Applying these data in a real sizing scenario not only validates the quality and precision of the imputations performed, but also provides valuable insights into how the integration of renewable sources can be optimized to meet future energy demands.

To size the proposed system, the following mathematical equations are utilized, considering key variables such as the efficiency of the electrolyzer, additional consumptions in the electrolysis system, and the energy required for hydrogen production:

  1. Total Energy Generated (Etotal) was calculated by summing all the energy generated after data imputation, providing a solid basis for subsequent estimates of hydrogen production.

  2. Annual Operating Hours (Hop_annual) were determined by counting all hours with positive energy production, adjusted to reflect the effective operational time in an annual format.

  3. Annual Energy Available for Hydrogen Production (Eh2_annual) reflects the portion of the generated energy that is effectively available for electrolysis, after accounting for additional consumption in the system.

  4. Mass of Hydrogen Produced (mh2) was calculated by dividing the energy available for electrolysis by the energy required to produce one kilogram of hydrogen, providing a direct estimate of the system’s production potential.

  5. Electrolyzer Power (Pelectrolyzer) was estimated to appropriately size the electrolyzer capacity needed to process the available energy during effective operating hours, maximizing hydrogen production.
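The five quantities above map directly onto a few lines of code. The sketch below is a minimal illustration: the auxiliary-consumption fraction and the specific energy of electrolysis are assumed placeholder values, not parameters taken from this study.

```python
# Sketch of sizing quantities (1)-(5) from an hourly generation series.
# AUX_FRACTION and E_PER_KG_H2 are illustrative assumptions, not values
# reported in this study.
import numpy as np

AUX_FRACTION = 0.15   # assumed share of energy lost to auxiliary consumption
E_PER_KG_H2 = 53.0    # assumed kWh required per kg of hydrogen

def size_electrolyzer(hourly_energy_kwh: np.ndarray) -> dict:
    """Compute the five sizing quantities from an hourly generation series (kWh)."""
    e_total = hourly_energy_kwh.sum()                 # (1) total energy generated
    h_op = int((hourly_energy_kwh > 0).sum())         # (2) annual operating hours
    e_h2 = e_total * (1.0 - AUX_FRACTION)             # (3) energy available for electrolysis
    m_h2 = e_h2 / E_PER_KG_H2                         # (4) mass of hydrogen produced
    p_electrolyzer = e_h2 / h_op if h_op else 0.0     # (5) electrolyzer power (kW)
    return {"Etotal": e_total, "Hop_annual": h_op,
            "Eh2_annual": e_h2, "mh2": m_h2, "Pelectrolyzer": p_electrolyzer}

# Usage with a toy year: 10 daylight hours at 2 kWh, 14 night hours at 0
day = np.array([0.0] * 7 + [2.0] * 10 + [0.0] * 7)
print(size_electrolyzer(np.tile(day, 365)))
```

Applied to each imputed hourly series, the same function yields the comparative sizing quantities discussed later around Table 4.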

Applying these equations, significant differences were observed between the imputation models and the original database. The Random Forest model produced estimates indicating a substantially greater capacity for hydrogen production, evidenced by a total generated energy, mass of hydrogen produced, and electrolyzer power superior to those of the Gradient Boosting model, and, notably, to the database with missing data.

These results highlight the critical importance of precise and robust data imputation for the effective planning and sizing of renewable energy systems. Furthermore, they underscore the potential of machine learning technologies to overcome challenges associated with the variability and incompleteness of solar energy data, facilitating the transition to a more sustainable energy future through optimized green hydrogen production. The practical application of these imputed data in the design of green hydrogen production systems serves as a rigorous test of their validity, demonstrating how they can be employed to drive real advances in renewable energy technologies.

Results and discussion

This section presents an in-depth examination of the original solar energy generation data from a photovoltaic (PV) plant located in the climatically diverse region of Irecê, Bahia. The dataset is made up of measurements of daily energy output, quantified in kilowatt-hours (kWh), covering the period from September 2018 to July 2019, as shown in Fig. 3. The visualization of these time-series data reveals a conspicuous gap, as highlighted by the “Data missing” annotation, spanning from May to July 2019. The absence of data during this interval is a substantial impediment to the seamless record of energy generation, a critical element for the operational management and performance analysis of the plant.

Fig. 3
figure 3

Daily solar energy generation data with missing gap: a time series from September 2018 to July 2019 for a photovoltaic plant in Irecê, Bahia.

The imputation of this missing data is not merely a procedural necessity but a vital process to ensure the accuracy of trend analysis, yield optimization, and reliability assessments for the energy facility. To achieve this, our study leveraged a comprehensive set of climatic parameters, hypothesizing their potential as predictive variables to reconstruct the lost information effectively. These parameters, recorded over the same time span, consist of daily values for temperature, humidity, wind speed, and solar irradiance, including the mean and its variability measure, the standard deviation (see Fig. 4).

Fig. 4
figure 4

Time series analysis of climatic variables as predictors for solar energy generation: temperature, humidity, wind speed, and solar irradiance (average and standard deviation) from July 2018 to July 2019.

Temperature readings are imperative to gauge the operational efficiency of PV panels, as their performance is known to be sensitive to thermal conditions. Humidity is another critical factor; its fluctuation can significantly affect the level of solar insolation, as it is closely related to cloud formation and, consequently, the amount of solar energy reaching the ground. Wind speed also plays a dual role: it can assist in cooling the panels, thereby improving their efficiency, but it can also lead to soiling, which hampers energy absorption. Finally, solar irradiance is the cornerstone predictor, directly correlating with the potential solar energy that can be harnessed. The average irradiance indicates the expected energy input to the system, while its standard deviation captures the variations, adding a layer of complexity to the prediction models due to the inconsistency in solar energy supply.

Utilizing these variables for imputation is paramount because they embody the environmental factors that directly influence solar power generation. Accurate imputation of missing data using these variables can significantly enhance the operational and strategic decisions made for the management of solar energy facilities. The forthcoming sections will delve into the methodologies employed for imputation, the comparative analysis of their predictive accuracy, and the implications of these findings on future practices in solar energy data management.

Building on the comprehensive analysis of climate variables, this section presents the main findings obtained from the application of two advanced machine learning algorithms: Random Forest and Gradient Boosting. These methodologies were selected for their proven efficacy in handling nonlinear relationships and their robustness in dealing with diverse datasets. The Random Forest algorithm, known for its simplicity and its ability to perform efficiently on large datasets, was the first to be employed in the imputation task. This ensemble learning method operates by constructing multiple decision trees during training and producing the average prediction of the individual trees, thus reducing overfitting while maintaining high accuracy.

Subsequently, Gradient Boosting was applied as an alternative to the first model. Gradient Boosting allows the optimization of arbitrary differentiable loss functions, working to reduce both bias and variance in the problem.
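The imputation workflow described in the last two paragraphs can be sketched as follows: each regressor is fitted on the rows where generation was recorded and then used to predict the gap. The column names and synthetic data below are assumptions for illustration, not the study’s dataset.

```python
# Sketch of the imputation workflow: fit on observed rows, predict the gap.
# Column names mirror the predictors in this study; the data are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
n = 365
df = pd.DataFrame({
    "Temperature": 20 + 10 * rng.random(n),
    "Humidity": 40 + 40 * rng.random(n),
    "Wind": 5 * rng.random(n),
    "Irradiance_Avg": 400 + 300 * rng.random(n),
    "Irradiance_Std": 50 + 100 * rng.random(n),
})
df["Energy_kWh"] = (0.004 * df["Irradiance_Avg"] + 0.01 * df["Irradiance_Std"]
                    + 0.05 * rng.standard_normal(n))
df.loc[300:, "Energy_kWh"] = np.nan        # simulate the missing gap

predictors = ["Temperature", "Humidity", "Wind", "Irradiance_Avg", "Irradiance_Std"]
observed = df["Energy_kWh"].notna()

for Model in (RandomForestRegressor, GradientBoostingRegressor):
    model = Model(random_state=0)
    model.fit(df.loc[observed, predictors], df.loc[observed, "Energy_kWh"])
    filled = df["Energy_kWh"].copy()
    filled[~observed] = model.predict(df.loc[~observed, predictors])
    print(Model.__name__, "imputed", int((~observed).sum()), "days")
```

In practice the filled series from each model would then be plotted against the observed data, as in Figs. 5 and 6.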

The results of both models offer insightful revelations about the predictive capabilities of the selected variables in the energy generation data. The performance metrics of each model are meticulously dissected, providing a comparative analysis to discern the strengths and limitations inherent to each approach. Evaluation is anchored in metrics that include MAE, MSE, RMSE, and R2, thus painting a comprehensive picture of their predictive power in the domain of solar energy imputation.
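The four metrics named above, together with the Mean Bias Error discussed later, can be computed with scikit-learn and NumPy as sketched below; the arrays are placeholders rather than values from this study.

```python
# Computing MAE, MSE, RMSE, R2, and MBE for a set of predictions.
# y_true and y_pred are placeholder arrays, not values from this study.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_pred = np.array([1.1, 1.4, 2.1, 2.4, 3.2])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mbe = np.mean(y_pred - y_true)   # mean bias error: sign shows over/under-prediction

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f} MBE={mbe:.4f}")
```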

The graphical results, seen in Fig. 5, represent the data imputation of solar energy generation using two distinct machine learning models: Random Forest (RF) and Gradient Boosting (GB). Both images display time series covering the period from September 2018 to July 2019, with the imputation indicated in the regions highlighted by the respective models.

Fig. 5
figure 5

Results of solar energy generation data imputation using Random Forest and Gradient Boosting.

In the first part of the chart, the region highlighted in blue shows the imputation performed by the RF prediction for the period where energy-generation data were missing. The imputed values closely follow the fluctuations observed in the real data, suggesting that the Random Forest model successfully captured the temporal variation of the series: each predicted point follows the trend and variation of the historical data. The Random Forest model, which averages multiple decision trees to generate its prediction, maintains consistency with the natural fluctuations observed in the real data, an indicator of a successful imputation.

The second part of the chart shows the data imputation performed by the Gradient Boosting model, highlighted in green. Similarly, the model filled the data gap while maintaining consistency with the observed trends, indicating a comparable ability of Gradient Boosting to simulate energy generation based on the existing patterns in the data. This model provided an imputation that, similarly to RF, follows the characteristics of the original data but with a tendency to smooth out fluctuations, possibly due to its iterative optimization approach.

The results, when observed together, suggest that both models could perform imputations that visually align with the observed data patterns. This is a positive indication of the applicability of advanced machine learning techniques to fill in the gaps in time-series data of solar energy generation. The selection between Random Forest and Gradient Boosting should be based on a careful analysis that includes the precision of the imputations, computational complexity, and the ease of model interpretation.

The direct comparison of the two models is exemplified in the combined image (see Fig. 6), where the RF and GB predictions are overlaid. It is observed that both techniques can capture the dynamics of energy generation data. However, subtle differences are noted in how each model responds to variations in the data. RF tends to follow more closely the peaks and valleys, while GB shows a generalization that may be preferable in certain operational contexts where the smoothness of predictions is desirable.

Fig. 6
figure 6

Time series analysis of solar energy generation from July 2018 to July 2019, applied Random Forest and Gradient Boosting.

In the Gradient Boosting model graph (Fig. 7), it is observed that the points tend to cluster closer to the reference line for lower energy values, up to approximately 1.5 kWh. However, for higher energy values, the model’s predictions appear to diverge significantly from the ideal line, indicating a reduction in accuracy for predicting higher energy values.

In contrast, the Random Forest model graph (Fig. 8) shows a more uniform dispersion of points along and around the reference line, covering the entire spectrum of data. This suggests that the Random Forest model can make predictions consistently close to the actual values, regardless of the energy level involved, demonstrating robustness and less bias compared to Gradient Boosting.

Fig. 7
figure 7

Gradient Boosting model dispersion results.

Fig. 8
figure 8

Random forest model dispersion results.

Comparatively, the Random Forest model exhibits greater consistency in its predictions relative to the Gradient Boosting model, as evidenced by the more uniform distribution of points along the reference line. Furthermore, while Gradient Boosting faces challenges in accurately predicting higher energy values, Random Forest demonstrates superior and more consistent performance across different energy levels.

Therefore, the selection of the most appropriate model for the data imputation task should consider not only the accuracy of the predictions but also the nature of the missing data and the specific requirements of the application in terms of computational complexity and interpretability of the results. The findings presented here lay the groundwork for a more in-depth analysis of these factors, contributing to the existing literature in the field of data imputation in renewable energy systems.

Metrics

In this study, the effectiveness of the two machine learning models is compared using quantitative metrics. The analysis revealed that the Random Forest model significantly outperformed the Gradient Boosting model across all evaluated metrics; the values recorded by each model are detailed in Table 2.

Table 2 Comparing the performance of Random Forest and Gradient Boosting models across four different metrics: MAE, MSE, RMSE, and R2.

The superiority of the Random Forest model suggests greater accuracy in predictions, with significantly lower errors compared to Gradient Boosting. The lower MAE and RMSE indicate that on average the predictions of the Random Forest model are closer to the actual values, while the higher R2 reveals that this model can explain a larger proportion of the variance in the test data.

Notably, both models presented a Mean Bias Error (MBE) close to zero, indicating the absence of a significant systematic bias in the predictions. However, the slightly more negative MBE of Gradient Boosting suggests a bias towards lower predictions, although this bias is minimal.

The superior performance of the Random Forest model can be attributed to its ability to effectively model complexities and non-linear interactions in the data without the risk of overfitting, an essential characteristic for analyzing complex time series and environmental data like that of energy generation. On the other hand, the relatively inferior performance of the Gradient Boosting model could be improved through more detailed parameter tuning, feature selection, and preprocessing techniques.
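A minimal sketch of the parameter tuning suggested above, using scikit-learn’s GridSearchCV over a few common GradientBoostingRegressor hyperparameters; the grid values and synthetic data are illustrative assumptions, not the settings used in this study.

```python
# Illustrative hyperparameter search for GradientBoostingRegressor.
# Grid values and data are examples, not the settings used in this study.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 3.0 * X[:, 4] + X[:, 3] + 0.1 * rng.standard_normal(200)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, round(-search.best_score_, 4))
```

Expanding the grid (subsample, min_samples_leaf, etc.) trades computation time for a potentially better fit.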

Feature importance

The results for feature importance are presented in the Random Forest model, where the variable ’Irradiance_Std’, representing the standard deviation of solar irradiance, emerged as the most significant predictor with a relative importance of 83.494%, see Table 3. This result underscores the relevance of irradiance variability in predicting solar energy, suggesting that fluctuations in the amount of solar radiation received are a critical indicator of energy yield. Following in importance, ’Irradiance_Avg’, the average irradiance, contributed 10.894%, reinforcing the notion that energy generation prediction heavily depends on the average amount of solar radiation. Surprisingly, factors like ’Humidity’ and ’Temperature’ showed lower importance, with 2.6948% and 2.6113% respectively, which may indicate that despite their influence on the operational environment of photovoltaic panels, it is the solar irradiance metrics that prevail in determining energy production. The predictor ’Wind’ showed the least importance, with only 0.3106%, reflecting a marginal influence of wind on the operational conditions or efficiency of photovoltaic installations.

Table 3 Comparison of feature importance with model result differences expressed as percentages.

Conversely, the Gradient Boosting model presented a similar predictor importance profile but with some notable differences. ’Irradiance_Std’ was also the most prominent with 82.8195%, confirming its position as the dominant indicator in predicting solar energy generation. Interestingly, ’Irradiance_Avg’ saw an increase in its importance to 15.6734%, which may reflect Gradient Boosting’s ability to delve deeper into the relationship between average solar irradiance and energy generation. The predictors ’Humidity’, ’Temperature’, and ’Wind’ registered importances of 0.8869%, 0.5679%, and 0.0523% respectively. This pattern reiterates the trend observed in the Random Forest model, where the average and variability of solar irradiance are the most critical components in modeling solar energy generation, while other climatic variables seem to play secondary roles. These findings suggest that models reliant on variables related to solar irradiance are more robust and reliable for data imputation in solar energy generation systems. The subordination of factors such as humidity, temperature, and wind speed highlights the need to focus on the most influential aspects of weather that directly affect the efficiency and output of solar installations. These insights provide valuable input for optimizing the design and operation of photovoltaic systems, as well as for enhancing the accuracy of solar energy prediction models.

Data application results

Based on the results obtained from the application of the Random Forest and Gradient Boosting models for the imputation of missing solar energy generation records, and their subsequent application in the sizing of green hydrogen production systems, significant differences in performance and results were observed (Table 4).

Table 4 Comparative overview of green hydrogen production system sizing.

The Random Forest model produced a data imputation that led to a total estimated energy production of 581,103.58 kWh, with a total of 61,601.25 operational hours. This resulted in an estimated annual hydrogen production (Eh2_annual) of 493,938.05 kWh and a total mass of hydrogen produced (mh2) of 14,819.62 kg, with an electrolyzer power (Pelectrolyzer) of 5.61 kW.

In contrast, the application of the Gradient Boosting model provided an imputation that resulted in a total estimated energy production of 216,970.60 kWh, with 67,300.75 operational hours. This translated into an estimated annual hydrogen production of 184,425.01 kWh and a total mass of hydrogen produced of 5,533.30 kg, with the electrolyzer power being 1.92 kW.

To contextualize these results, the analysis of the original database with missing data revealed a much lower total energy production of only 14,261.87 kWh, with 3,771.75 operational hours, an annual hydrogen production of 12,122.59 kWh, a total mass of hydrogen produced of 363.71 kg, and an electrolyzer power of 2.25 kW.

The discrepancy between the results obtained through the imputation models and the original database highlights the significant impact that data imputation can have on the sizing and evaluation of green hydrogen production systems. The Random Forest model demonstrated a remarkable ability to optimize the utilization of solar energy generation for hydrogen production, suggesting that accurate data imputation can play an essential role in maximizing the efficiency and yield of such systems.

Conclusions

This study provided significant insights into the effectiveness of machine learning models, Random Forest and Gradient Boosting, for imputing missing data in solar energy generation databases. The Random Forest model was shown to excel in accuracy and efficiency, as indicated by performance metrics including mean absolute error (MAE) of 0.0364, mean squared error (MSE) of 0.0097, root mean squared error (RMSE) of 0.0985, and a coefficient of determination (R2) of 0.9779. These notably superior values, compared to standard models, highlight Random Forest’s robustness in handling the complexities of solar data.

Through the case study involving the application of imputed data to size a green hydrogen production system, it became evident that accurate data imputation can play a crucial role in maximizing operational efficiency and yield in renewable energy systems. The case study also underscored the value of feature importance analysis, with ’Irradiance_Std’ emerging as the most significant predictor, indicating that fluctuations in solar radiation are critical determinants of energy generation efficacy. These findings not only validate the utility of advanced machine learning methods in energy data management but also provide a solid foundation for the future development of more accurate and efficient predictive models. The widespread implementation of these techniques could significantly enhance decision-making and operational strategy at solar energy facilities, promoting a more sustainable and efficient energy transition.