Introduction

Research background and literature overview

Predicting solar irradiation values is useful for different areas of knowledge, especially agriculture1,2,3,4 and energy production5,6. The efficiency of this prediction is essential both in systems planning and in the execution of short-, medium-, and long-term actions in these knowledge areas7. Considering that the physical measurement of solar irradiation, due to the cost and technology involved, is not possible for all locations worldwide, predicting future values is usually restricted to areas with measurement stations8.

When predicting solar irradiation values is necessary or desired and there are no meteorological stations, prediction models are used9,10. Empirical models have long been used for this purpose. However, these generally rely on strong generalizations and do not consistently provide the best results11. In recent years, with the advancement of machine learning techniques, computational resources have been leveraged to partially or entirely replace empirical models, yielding promising results12,13. Regardless of the prediction method, it must be appropriately calibrated for efficient execution.

There are many challenges in solar irradiation prediction; therefore, different techniques are employed in various locations14,15. Furthermore, the literature clearly shows that a prediction model that works well in one place is not necessarily the best in another, just as the best variables for one place are not necessarily the best for another16. When selecting the most suitable machine learning model, it is essential to consider whether the data originates from weather stations or satellites17,18 and the desired prediction granularity19,20,21,22. Regardless of the type of data and model, hybrid methods that optimize the hyperparameters of the learning models have better results23,24,25.

In this paper, we present a framework for optimizing computational models for solar irradiation prediction, applied to data from the state of Minas Gerais, Brazil. Minas Gerais is, in economic and population terms, the third-largest state in Brazil, with a territory of 586,852.35 \(km^2\) and a population of 20 million inhabitants. Its Gross Domestic Product is the third largest in Brazil26, and with a monthly energy demand of approximately \(5\times 10^6\) MWh27, the state is in a constant process of expanding its energy matrix. In Brazil, the power grid’s primary source of production is hydroelectric resources. However, the exploitation of these resources is nearing its limit, and the strong dependence on a seasonal energy source causes disturbances and price fluctuations during periods of drought. Consequently, considerable investment has been made in photovoltaic energy production, primarily in distributed micro- and mini-generation28.

Despite this growing demand, few published studies use computational models to predict future solar irradiation values in Minas Gerais. We found 14 works that address this topic in this region, and only two of these cover the entire state, using a daily granularity. One of these uses empirical models, and the other uses computational models. Most papers use empirical models for specific regions of the state29,30,31,32,33,34,35,36,37. Some use computational resources to solve more detailed numerical models or apply empirical models as input to computational models38,39. The most comprehensive works are40,41.

Reference40 compiled a database with 51 stations, showing the challenges in collecting and cleaning data. Fifteen different empirical models were applied to all available cities. The best result, as determined by the \(R^2\) metric, was 59.58%, with an average of 36.97%. Reference41 uses the same database as the previous work but applies computational models, specifically Artificial Neural Networks (ANN) and Multivariate Adaptive Regression Splines (MARS). The present proposal uses these two works as the main baselines for comparing results.

Two other works do not directly deal with solar irradiation prediction but are strongly related to the present paper. In42, the authors apply a computational approach to this same database, covering the entire state of Minas Gerais, using an artificial neural network, random forest, support vector machine, and multiple linear regression to predict reference evapotranspiration. Reference43 maps data on solar radiation and rainfall in Minas Gerais from different bases, showing how geographic factors influence these measurements; it presents an interesting set of maps showing, in addition to the climatic factors that are the study’s objective, the location and type of various meteorological stations and their maintainers. However, that work is limited to mapping, and no prediction is made.

This research distinguishes itself from the previous ones by proposing a geolocated transfer learning framework that leverages spatial dependencies by incorporating data from neighboring meteorological stations. This fundamentally departs from previous works by enabling robust predictions in areas without direct meteorological stations, a challenge largely unaddressed by the current literature’s focus on isolated or empirical models. This approach, specifically the use of neighboring city data for transfer learning, represents a key methodological advancement in solar irradiation forecasting for regions with sparse meteorological networks.

Research significance and motivation

Brazil has an estimated potential of 176 GW for producing electricity from hydro resources. However, approximately 108 GW of this is already in use, and more than half of the available potential is located in the Tocantins-Araguaia and Amazon watersheds. Consequently, the expansion of hydroelectric resource utilization has encountered recent obstacles, including adverse socio-environmental consequences stemming from the execution of large hydroelectric ventures, substantial upfront investment expenditures, and the complexity of establishing power plants close to significant consumption hubs. This, in turn, necessitates supplementary investments in extensive transmission infrastructure to transport the generated electricity44.

Given the importance of the state of Minas Gerais in the Brazilian economy, its potential for photovoltaic energy generation, the growing demand for new energy sources, especially renewable ones, and the expectation of increased installation of photovoltaic power plants in the state, it is clear that there is a need for different studies to develop efficient forecasting models for solar radiation resources in this location. The scarcity of research in the literature on this region of the country, particularly utilizing computational resources, underscores the need for further investigation.

Research objectives

In this work, we present a computational model for the entire state of Minas Gerais and the methodology for its development. This model utilizes data from 67 weather stations distributed across all state regions, as well as data from an additional 67 stations in neighboring states, with measurements taken over more than 20 years in 129 cities. Different approaches are studied, manipulating the data used in the training and test databases, to answer the following questions: Does the addition of geolocation data improve the prediction of solar irradiation? Is it possible to efficiently predict solar irradiation values in a place without meteorological stations?

Two different approaches are compared in this work. In the first experiment, we add data from neighboring stations to the training data of each station, aiming to verify how the inclusion of geolocation data affects the results. In the second, we remove from the training base the data for the location where we want to make the forecast, leaving only the data from the neighboring stations. In this way, we have a training base without data from the location where the forecast will be performed. Both approaches are compared with the individual, single-station executions of each station.

This paper is organized as follows: Section 2.1 provides a detailed description of the database. Section 2.2 presents all the performance metrics adopted to compare this proposal with other works, describes the framework for selecting the fittest machine learning model and optimizing its hyperparameters, and also presents the methodology for developing a geolocated model. Section 3 details the proposed framework execution over the utilized database. Finally, conclusions and extensions are considered in Section 5.

Material and methods

Data

The National Institute of Meteorology (Instituto Nacional de Meteorologia-INMET)45 maintains and makes available data from more than 500 meteorological stations throughout Brazil, 68 of which are automatic meteorological stations in Minas Gerais-MG. This work utilized data from various periods, spanning from December 2002 to December 2021, collected from 67 different stations in 66 cities across all regions of Minas Gerais, which formed its primary database. The availability of data at each station depends on several external factors, including the installation date, potential failures, and scheduled maintenance periods.

We also used data from stations in other states besides Minas Gerais to develop the methodology proposed in this work. To select these stations, we calculated the 10 closest stations for each of the 66 cities in Minas Gerais. All cities on this list were included in the database for this work, regardless of whether they are located in Minas Gerais or not. Data from 67 stations outside Minas Gerais were used: 17 from the state of São Paulo (SP), 13 from Rio de Janeiro (RJ), 12 from Espírito Santo (ES), 11 from Bahia (BA), 9 from Goiás (GO), 4 from the Distrito Federal (DF), and 1 from Mato Grosso do Sul (MS).

Supplementary Table S1 details the code of each station, its city, federative unit, geographic coordinates, observed period, and the total data analyzed, while Fig. 1 highlights all cities on the map where there are analyzed weather stations, with colors differing between states. In this work, 20 variables were used. Table 1 describes them according to45.

Fig. 1
figure 1

Map of the studied municipalities [map generated with the pyearth library46 with shapefiles (.shp) provided by IBGE47].

Table 1 Used variables.

It is worth noting that the INMET data have an hourly measurement granularity. In this work, we used daily values for each variable: the daily sum was used for the variables Global and Qo, and the average of the daily readings for the others. The Qo variable, unlike the others, is not measured by devices and is not available through INMET; it was synthetically generated with the PySolar48 library. Daily granularity was chosen mainly to facilitate comparison with the main works in the literature. Different granularities, such as hourly and monthly, yield distinct results and are typically employed in distinct problems.
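Under these conventions, the hourly-to-daily aggregation can be sketched with pandas. The column names (Global, Temp) and the toy values are illustrative assumptions; the actual variable set is listed in Table 1.

```python
import pandas as pd

def to_daily(hourly: pd.DataFrame, sum_cols=("Global", "Qo")) -> pd.DataFrame:
    """Aggregate hourly readings to daily granularity:
    daily sums for the irradiation variables, daily means for the rest."""
    sum_cols = [c for c in sum_cols if c in hourly.columns]
    mean_cols = [c for c in hourly.columns if c not in sum_cols]
    daily = pd.concat(
        [hourly[sum_cols].resample("D").sum(),
         hourly[mean_cols].resample("D").mean()],
        axis=1,
    )
    return daily[hourly.columns]  # restore the original column order

# toy example: 48 hourly readings -> 2 daily rows
idx = pd.date_range("2021-01-01", periods=48, freq="h")
df = pd.DataFrame({"Global": 1.0, "Temp": 25.0}, index=idx)
daily = to_daily(df)
```

With constant hourly values, each day sums Global to 24.0 and averages Temp to 25.0, which makes the sum-versus-mean convention easy to check.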

Methodology

Performance metrics

The metrics described in Table 2, as outlined in the scikit-learn library documentation49, were used to evaluate the performance of the methods. The table also describes the purpose of each metric, the best and worst attainable values, and their meaning.

Table 2 Performance metrics and Indicators for Model Evaluation. For each performance metric, \(\hat{y}_i\) represents the predicted output, \(y_i\) is the measured output, \(\bar{y}\) denotes the mean of the measured values, and N is the total number of samples in the dataset.
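As an illustration, several of these metrics can be computed directly with scikit-learn; the toy arrays below are illustrative, and RMSE is derived from the mean squared error.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# toy measured vs. predicted daily irradiation values
y_true = np.array([5.1, 4.8, 6.0, 5.5])
y_pred = np.array([5.0, 5.0, 5.8, 5.6])

r2 = r2_score(y_true, y_pred)                        # 1.0 is a perfect fit
mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # objective minimized later
```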

Performance profiles

Performance profiles are a valuable resource in optimization benchmarking, providing a comprehensive method for assessing and comparing the efficacy of diverse optimization algorithms across a range of test scenarios. Viewing each station as an independent problem, we deal with 67 individual problems, analyzed under five distinct scenarios and evaluated across six different metrics, which yields 2010 distinct results. Managing such a volume of results can be challenging, particularly when dealing with potentially inconsistent metrics. To address this, we evaluate the models through performance profiles, as detailed by50. This approach facilitates a graphical assessment of one solver’s superiority over another, and its methodology is elaborated upon below.

Consider a set P of test problems \(p_{j}\), with \(j = 1, 2, \ldots , n_{p}\), a set A of algorithms \(a_{i}\), with \(i = 1, 2, \ldots , n_{a}\), and let \(t_{p,a} > 0\) be a performance metric (such as compute time or average error). The performance ratio is defined as:

$$\begin{aligned} r_{p,a}= \frac{t_{p,a}}{\min \{t_{p,a}:a \in A\}} \end{aligned}$$
(1)

The algorithm performance profile is defined as:

$$\begin{aligned} \rho _{a}(\tau )=\frac{1}{n_{p}}\left| \{p \in P: r_{p,a} \le \tau \}\right| \end{aligned}$$
(2)

where \(\rho _{a}(\tau )\) is the fraction of problems for which algorithm a attains performance within a factor \(\tau\) of the best performance obtained among all algorithms.
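A minimal sketch of Eqs. (1) and (2), assuming a matrix of per-problem costs where lower is better:

```python
import numpy as np

def performance_profile(T: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """T[p, a]: cost of algorithm a on problem p (lower is better, > 0).
    Returns rho[a, k] = fraction of problems with r_{p,a} <= taus[k]."""
    ratios = T / T.min(axis=1, keepdims=True)          # r_{p,a}, Eq. (1)
    n_p, n_a = T.shape
    rho = np.array([[(ratios[:, a] <= tau).mean() for tau in taus]
                    for a in range(n_a)])              # rho_a(tau), Eq. (2)
    return rho

# toy example: 3 problems x 2 algorithms
T = np.array([[1.0, 2.0],
              [2.0, 2.0],
              [4.0, 1.0]])
rho = performance_profile(T, taus=np.array([1.0, 2.0, 4.0]))
```

In this toy case, each algorithm is best on two of the three problems at \(\tau = 1\), and both reach 100% of the problems once \(\tau\) is large enough, which is exactly how the profile curves are read in Fig. 9.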

Transfer learning

Transfer learning is a machine learning technique that involves utilizing knowledge gained from one task or domain to enhance the performance of a different, yet related, task or domain. In transfer learning, a pre-trained model, typically trained on a large dataset, serves as a starting point for a new task, rather than training a model from scratch.

The idea behind transfer learning is that the knowledge acquired by a model while learning one task can be leveraged to accelerate learning or improve generalization on a different task. By starting with a pre-trained model, the model already possesses learned features, patterns, or representations that are generally useful across tasks. These learned features can be utilized as a foundation for the new task, allowing the model to adapt and specialize more quickly.

The process of transfer learning typically involves the following steps:

Pre-training: A model is trained on a large dataset from a source task or domain. This training step is usually computationally expensive and time-consuming.

Feature extraction: The pre-trained model is used to extract relevant features or representations from the data of the source task. These features capture important patterns or information in the data.

Fine-tuning: The extracted features are then used to initialize a new model that is specifically designed for the target task or domain. This new model is trained on a smaller dataset specific to the target task, which is often labeled or annotated.

Adaptation: The new model is further trained on the target task dataset, typically with a lower learning rate, to adjust the model’s parameters to the task’s specific requirements. This step allows the model to fine-tune its learned features and improve its performance on the target task.

Transfer learning can be particularly beneficial when the target task has limited data available, as it helps mitigate the risk of overfitting and improves the model’s ability to generalize. It has been successfully applied in various domains, including computer vision, natural language processing, and audio analysis, enabling the development of more accurate and efficient models.
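As a minimal illustration of the pre-train/adapt idea above (not the geolocated procedure of this paper, which is described in the Geospatial model section), scikit-learn's MLPRegressor with warm_start can continue training a pre-trained network on a small target set. The synthetic data and network size are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# source domain: plentiful data from a related task
X_src = rng.normal(size=(500, 3))
y_src = X_src @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# target domain: scarce data, same underlying relationship
X_tgt = rng.normal(size=(40, 3))
y_tgt = X_tgt @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

# pre-training on the source task
model = MLPRegressor(hidden_layer_sizes=(16,), warm_start=True,
                     max_iter=500, random_state=0)
model.fit(X_src, y_src)

# adaptation: with warm_start=True, fit() continues from the learned
# weights instead of re-initializing them
model.set_params(max_iter=100)
model.fit(X_tgt, y_tgt)
score = model.score(X_tgt, y_tgt)  # R^2 on the target data
```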

Automatic feature selection

The feature selection process is crucial in building machine learning models. This process implies selecting the most relevant variables or features from a dataset. It is essential for reducing dimensionality, enhancing model interpretability, and improving overall performance.

The developer can perform the variable selection process manually, keeping the variables judged relevant and discarding the others; in this process, the correlation matrix of the variables is usually examined. Considering the numerous databases and their combinations, this approach would be highly time-consuming and inefficient for the present work. Therefore, feature selection was performed automatically by an optimization algorithm, which also optimized the hyperparameters of the machine learning model. All available variables were initially considered usable in training the final model, and the optimization process selected only those that performed best. Details of this process are described in Section 2.2.

Hybrid computational approach

Aiming at an efficient comparison baseline for the geolocated model, we developed an individual, single-station computational model for each station in Minas Gerais. When dealing with a specific station individually, we refer to it as the Target Station. The methodology applied for the development of these models has already been widely validated in41,51,52 and is described below.

Initially, all machine learning models are individually executed to find the best set of hyperparameters for each model applied to each station. Each machine learning model has a set of values containing the upper and lower bounds of its hyperparameters. At this initial stage, it is possible to use a specific subset of variables, all available variables, or a feature selection technique.

After the initial settings, the optimization algorithm randomly generates a population of candidate solutions. Each candidate solution represents a set of hyperparameters associated with the machine learning model. If feature selection is performed automatically, each solution also carries its own set of variables. Each solution is evaluated using a k-fold cross-validation strategy, where the objective to be minimized is the root mean square error (RMSE) between the observed and predicted values. As we work with time series, future values are never used for training within the cross-validation folds; this procedure is known as Time Series Split Cross-Validation.

When the stopping criterion is met, the evolutionary cycle ends, and the solution with the best RMSE is stored. With all models executed, we can identify which one performed best at the Target Station and determine the optimal configuration of its hyperparameters. Figure 2 illustrates this process.

Fig. 2
figure 2

Optimization process for each model.
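The cross-validation objective described above can be sketched as follows. Ridge stands in for the actual learners, and the synthetic data and hyperparameter are illustrative assumptions; the point is that TimeSeriesSplit keeps the folds time-ordered, so no future sample is used to predict the past.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def cv_rmse(X, y, alpha, n_splits=5):
    """Objective minimized by the optimizer: mean RMSE over
    time-ordered folds (train always precedes test)."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = Ridge(alpha=alpha)   # stand-in learner; the paper uses ELM
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return float(np.mean(scores))

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=300)
obj = cv_rmse(X, y, alpha=1.0)  # value the optimizer would try to minimize
```

The optimizer then treats `cv_rmse` as a black box, varying the hyperparameters (here only `alpha`) and, when automatic feature selection is active, the column subset of `X` as well.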

Geospatial model

This paper introduces a novel geolocated transfer learning (TL) approach designed to overcome the limitations of localized prediction models. Unlike conventional methods that typically use data solely from the target station or rely on satellite imagery (which, while intrinsically geolocated, can be less accurate than ground measurements when available), our framework explicitly integrates detailed geolocation data from multiple meteorological stations. Specifically, for developing a geospatial computational model in Minas Gerais, data from the N closest stations to the Target Station are aggregated into a unified dataset enriched with comprehensive geolocation information. The distinct innovation lies in our two-pronged strategy: (1) enhancing the Target Station model by incorporating neighboring data into its training, thereby creating a truly ’geolocated’ model, and (2) critically demonstrating the feasibility of solar irradiation prediction in areas without existing meteorological stations. This is achieved by constructing a training base exclusively from neighboring station data and validating it against the Target Station’s data, a process we define as transfer learning. This approach enables the model to learn broader spatial patterns and generalize effectively, addressing a significant gap in the current literature, where predictions are often constrained to measured locations.

For developing a geospatial computational model of Minas Gerais, the data from the N closest stations to the Target Station are grouped in a single base with the geolocation data of all the stations involved. The methodology described in Section 2.2 is then applied to this new database. However, the validation process only takes place with the Target Station data. In this way, the model representing the Target Station also incorporates data from neighboring stations into its formulation and can be considered a geolocated model. Figure 3 shows this dynamic, while Fig. 4 illustrates this step on the map. Special care is needed when dividing the training and testing intervals: some stations, such as Caratinga and Espinosa, do not have data up to the end of the usually observed period, December 31, 2021. Therefore, when such a station is the Target Station, the data from its neighboring stations must also be truncated at the date of its last reading.

Fig. 3
figure 3

Population of the geospatial database.

Fig. 4
figure 4

Neighboring grouping.

In order to verify the feasibility of predicting solar irradiation in areas without meteorological stations, we employed an alternative approach, slightly modifying the methodology described above. We created a database with data from the N closest stations to the Target Station. However, we removed the Target Station data from the training base. In this manner, all training is conducted using data from neighboring stations, and validation is performed using data from the Target Station. This process is, in essence, what we refer to as transfer learning. Figure 5 shows the geospatial base creation process without the presence of the Target Station in the training base.

Fig. 5
figure 5

Transfer learning geolocated database creation.
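A minimal sketch of this dataset construction, assuming a table of station coordinates and a long-format data table (the station names and coordinates are hypothetical): neighbors are ranked by great-circle distance, and the Target Station appears only in the test base.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def transfer_learning_split(stations, data, target, n_neighbors=2):
    """Training base: data from the N stations closest to the target.
    Test base: data from the Target Station only (never seen in training)."""
    tgt = stations.loc[target]
    dist = stations.drop(index=target).apply(
        lambda s: haversine_km(tgt.lat, tgt.lon, s.lat, s.lon), axis=1)
    neighbors = dist.nsmallest(n_neighbors).index
    train = data[data.station.isin(neighbors)]
    test = data[data.station == target]
    return train, test

# toy example with hypothetical stations and coordinates
stations = pd.DataFrame(
    {"lat": [-19.9, -20.4, -21.8, -16.0], "lon": [-43.9, -43.5, -46.6, -43.9]},
    index=["BeloHorizonte", "OuroPreto", "Alfenas", "Montes"])
data = pd.DataFrame({"station": ["BeloHorizonte", "OuroPreto",
                                 "Alfenas", "Montes"],
                     "Global": [20.1, 19.5, 21.0, 22.3]})
train, test = transfer_learning_split(stations, data, "BeloHorizonte")
```

For the geolocated (non-transfer) scenarios 2 and 4, the only change would be to also keep the Target Station's rows in `train` up to the training cut-off date.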

Computational experiments, analysis and discussion

Although the methodology described above can be applied to any forecast horizon, in this work, the forecast is made one day ahead, using data from the previous day. As the data from each station has different periods of availability in each Scenario and Target Station combination, different periods were used for the machine learning model’s testing and training periods.

First phase of experiments

During the development of the present study, different approaches were tested. Initially, the framework did not account for the transfer learning process. At this stage, experiments were conducted using various machine learning models. We report the results of this stage below to justify some of the choices made in the final process.

At this preliminary stage, we also analyzed the biome’s correlation with the results. Initially, we assumed this variable would be strongly correlated, given that the biome is highly influenced by various climatic factors, such as altitude and the region’s general geography. However, in most of the tests performed, the variable selection process excluded this exogenous variable, demonstrating that it did not influence the results. Therefore, we chose not to use this data in the second phase of the experiments.

Initially, six different machine learning models were applied, following the methodology described in41. The models were Artificial Neural Networks (ANN), Extreme Learning Machine53,54 (ELM), Elastic Net55 (EN), Multivariate Adaptive Regression Spline56,57 (MARS), Extreme Gradient Boosting58,59 (XGB), and Support Vector Regression60 (SVR). Of all the applied models, the one that demonstrated the best performance was the ELM, as shown in Table 3, which displays the normalized area under the curves of the performance profiles for each learning model.

Table 3 Normalized area under the curves of the performance profiles.

Second phase of experiments

Considering that the ELM model yielded the best result in the initial tests, we applied it exclusively in the second development stage. ELM is a type of machine learning algorithm, specifically a variation of the feedforward neural network, and its training is done differently from most other neural networks. It is one of the most used machine-learning models for predicting solar irradiation61.

Unlike traditional neural networks, which use an iterative training process to adjust the weights of all connections between layers, ELM randomly assigns the input-to-hidden weights and computes only the output weights analytically. Because it does not require a long, iterative training phase, it is much faster than many other machine learning methods.

ELM is also known to be highly efficient in terms of processing and requires relatively little training data to produce accurate results. It is often used in classification and prediction tasks such as pattern recognition, time series analysis, and signal processing.
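A minimal from-scratch sketch of the ELM idea (random, fixed hidden layer; output weights obtained in closed form by least squares); the activation function, layer size, and toy data are illustrative assumptions.

```python
import numpy as np

class ELMRegressor:
    """Minimal Extreme Learning Machine: the hidden layer is random and
    fixed; only the output weights are fitted, via least squares."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_features = X.shape[1]
        # random projection, never updated afterwards
        self.W = self.rng.normal(size=(n_features, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # closed-form solution: no iterative backpropagation needed
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]
model = ELMRegressor(n_hidden=40).fit(X, y)
rmse = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))
```

In the actual framework, hyperparameters such as `n_hidden` are the quantities tuned by the optimizer.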

As an optimizer, we use a Simple Genetic Algorithm (SGA). SGA is a method for solving optimization problems with and without constraints, inspired by natural selection. Each possible solution to the problem is considered an individual in the population, and the algorithm repeatedly modifies the population in search of fitter individuals. At each step, the algorithm selects individuals from the current population as parents and uses them to produce offspring for the next generation. Over successive generations, the population evolves towards an optimal solution. The implementation used was from the Pygmo library62.
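The experiments used Pygmo's implementation; purely as an illustration of the mechanism, a self-contained toy SGA with tournament selection, uniform crossover, and Gaussian mutation might look as follows (the operators and rates are illustrative assumptions, not the parameters of Table 4).

```python
import numpy as np

def sga_minimize(f, bounds, pop_size=30, generations=60, seed=0):
    """Toy Simple Genetic Algorithm. `bounds` is one (low, high) pair
    per gene; `f` maps a gene vector to a scalar cost (lower is better)."""
    rng = np.random.default_rng(seed)
    low, high = np.array(bounds, dtype=float).T
    pop = rng.uniform(low, high, size=(pop_size, len(low)))
    for _ in range(generations):
        fit = np.apply_along_axis(f, 1, pop)
        # tournament selection: the better of two random individuals survives
        i, j = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fit[i] < fit[j])[:, None], pop[i], pop[j])
        # uniform crossover between consecutive parents
        mask = rng.random(pop.shape) < 0.5
        children = np.where(mask, parents, np.roll(parents, 1, axis=0))
        # Gaussian mutation, clipped back into the search box
        children += rng.normal(scale=0.05 * (high - low), size=pop.shape)
        pop = np.clip(children, low, high)
    fit = np.apply_along_axis(f, 1, pop)
    return pop[fit.argmin()], float(fit.min())

# minimize the sphere function on [-5, 5]^2
best_x, best_f = sga_minimize(lambda x: float((x ** 2).sum()),
                              bounds=[(-5, 5), (-5, 5)])
```

In the framework, the gene vector encodes the learner's hyperparameters (and, with automatic feature selection, the variable subset), and `f` is the Time Series Split cross-validation RMSE.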

All scenarios presented were executed 30 times independently, aiming at greater statistical confidence in the results found. Five different scenarios were analyzed for all stations:

  1. The Target Station database was used alone for training and testing. This scenario represents the station model independently, is the one traditionally analyzed in other works40,41, and serves as the main object of comparison of the results.

  2. Data from the two closest neighboring cities were added to the Target Station training bases. Only Target Station data was used as a test basis. In this scenario, we verify the benefit of using data from neighboring stations to compose the training base. We can consider the models generated in this scenario as geolocated.

  3. Data from the two closest neighboring cities were used as a training base, and data from the Target Station as a test base. This scenario verifies the feasibility of applying transfer learning, using cities with meteorological stations for training in order to generate models for cities without meteorological stations.

  4. Data from the four nearest neighboring cities were added to the Target Station training bases. Only Target Station data was used as a test basis. This scenario is similar to scenario 2, differing only in the number of stations used.

  5. Data from the four nearest neighboring cities were used as a training base, and Target Station data as a test base. This scenario is similar to scenario 3, differing only in the number of stations used.
Supplementary Table S2 presents the averages of 30 independent runs of all city scenarios. The best values found for each city are highlighted in bold, and the standard deviation is in parentheses. Furthermore, Supplementary Table S3 presents the count of the selected variables across 30 independent runs for all scenarios in all cities. Table 4 presents the parameters used in the genetic algorithm, while Table 5 presents the ELM parameters, as well as the search intervals employed.

Table 4 Optimization algorithm parameters.
Table 5 Parameters used for machine learning model.

Discussion

Considering scenario 1 and comparing it with the main works found in the literature40,41, this work covers 16 stations that were not previously studied. At 22 stations, the results presented here are more favorable; at the other 29 stations, the results reported in the literature are better, always considering the \(R^2\) metric. These results suggest that the MARS model may be superior to the ELM; however, the experiments presented here are insufficient to confirm this. The variation in the results can be attributed to several factors, including the observed interval, the split between the training and test sets, and the differences between the empirical and machine learning models used in other works. Regardless, this work aims to demonstrate the viability of using geographically distant data to forecast values at locations without meteorological stations, rather than to obtain better models where stations already exist. The comparison of this scenario’s results with those found in the literature is presented in Supplementary Table S4, with the best results from each city highlighted in bold.

Through scenarios 2 and 4, we attempted to verify the hypothesis that adding data from neighboring stations would improve the results at a specific station. We found that for most stations (47, exactly), adding these data to the training base was not advantageous: in 47 stations, the result of scenario 1 was better than the results of scenarios 2 and 4. In 11 stations, scenario 2 had better results, and in 9, scenario 4 had better results, always considering the \(R^2\) metric. This outcome suggests that a simple distance-based selection of neighboring stations may not be sufficient to capture the intricate spatial dependencies in a region with diverse topography, such as Minas Gerais. While proximity is a factor, variations in altitude, terrain, and microclimates can significantly influence solar irradiation patterns, leading to a non-uniform decay of spatial correlation. Therefore, a station’s geographical closeness does not automatically guarantee a strong positive correlation in solar irradiation. In some cases, it may even introduce noise if the neighboring station is in a significantly different microenvironment.

Scenarios 3 and 5 present the central hypothesis of this work, where we apply transfer learning to predict solar irradiation values in cities without meteorological stations. To this end, as previously explained, we consider each city as if it did not have a meteorological station present, and we use data from neighboring cities to train the model. We only use data from the Target Station to confirm the transfer learning efficiency in the model validation period.

In scenario 3 specifically, in 39 of the 67 cities, there were independent executions where the metric \(R^2\) was negative. In 16, the average results of this metric were less than zero. In the other 23, even with some bad executions, the final average was reasonable. In the 2010 independent runs (\(67 \times 30\)), 234 had a metric \(R^2\) value less than zero, which is approximately 11.65%. Figure 6 shows the Boxplot of \(\hbox {R}^2\) for each city, considering only the executions where \(\hbox {R}^2\) was positive. Considering only these executions, \(\hbox {R}^2\) mean was 0.5633 with a standard deviation of 0.1849.

Fig. 6
figure 6

Scenario 3 \(\hbox {R}^2\) Boxplot.

In scenario 5, in 11 cities, there were independent executions where the metric \(R^2\) was negative, and in 4, the average results of this metric were less than zero. Analyzing the total executions in this scenario, of the 2010 independent executions (\(67 \times 30\)), 83 had a metric \(R^2\) less than zero, which is approximately 4.13%. Figure 7 shows the Boxplot of \(\hbox {R}^2\) for each city, considering only the executions where \(\hbox {R}^2\) was positive. Considering only these executions, \(\hbox {R}^2\) mean was 0.5851 with a standard deviation of 0.1681.

Fig. 7
figure 7

Scenario 5 \(\hbox {R}^2\) Boxplot.

The negative values of the \(\hbox {R}^2\) metric are associated with two main groups of factors: (1) geographic and topographic characteristics that generate local microclimates distinct from those of neighboring cities used in training; and (2) the scarcity or inadequacy of data, which compromises the spatiotemporal representativeness of the model. The average temperature decreases approximately 6.5 \(^{\circ }\)C for each kilometer of altitude due to the environmental lapse rate. In cities located at high altitudes or on steep slopes, the microclimate differs substantially from that of neighboring stations at lower altitudes. Automatic modeling based on neighborhood data, without accounting for altitudinal differences, can lead to significant underestimations or overestimations.

Figure 8 shows the scatter plots of the measured and estimated solar irradiation for the best individual execution of each station, according to the \(R^2\) metric, grouped by scenario. In it, we highlight the ideal regression line and the region where predicted values match the measured ones within a 20% margin. Figure 9 plots the performance profiles for the five scenarios, considering all five metrics used, and Table 6 reports the areas under the normalized performance-profile curves. Although the performance-profile curves of the five scenarios are very close and the scatter plots are very similar, the curve of scenario 1 grows faster than those of scenarios 2 and 4, confirming that, for this study, adding neighboring cities' data to the training base was not advantageous. Regarding transfer learning for forecasting in areas without meteorological stations, the performance profiles suggest that increasing the number of cities used improves performance. However, an analysis of variance (ANOVA) produced a p-value of 0.206, indicating no statistically significant difference between the group means. Taken together with the proximity of the scenario 3 and 5 curves to the scenario 1 curve, this suggests that transfer learning can be used efficiently, since the transfer-learning scenarios achieved results statistically similar to those of the original scenario. It also reinforces the inefficiency of using neighboring stations to improve the prediction of the target station (scenarios 2 and 4).
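The ANOVA step above can be sketched with `scipy.stats.f_oneway`, which implements the one-way analysis of variance used to compare the scenario means. The per-scenario samples below are synthetic placeholders; the study's actual data yielded p = 0.206.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-scenario metric values (one entry per city, 67 cities each),
# standing in for the real results of the five scenarios.
rng = np.random.default_rng(0)
scenarios = [rng.normal(loc=mu, scale=0.15, size=67)
             for mu in (0.60, 0.58, 0.56, 0.58, 0.59)]

stat, p_value = f_oneway(*scenarios)

# A p-value above 0.05 means we cannot reject the null hypothesis that all
# scenario means are equal, i.e. no statistically significant difference.
significant = p_value < 0.05
```

Since the study's p-value (0.206) is well above 0.05, the same test supports the conclusion that the transfer-learning scenarios are statistically indistinguishable from the baseline.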

Fig. 8
figure 8

Scatter plot for the five studied scenarios.

Fig. 9
figure 9

Performance profiles.

Table 6 Performance profiles with normalized areas.

Model strengths and limitations

These results show that, in most cases, transfer learning is a viable option for predicting solar irradiation in areas without meteorological stations, even though the results are unsatisfactory at some stations. The improvement from scenario 3 to scenario 5 also shows that the number of cities used in the training base is relevant: in this study, the results improved as stations were added to the training base, as was evident in 46 of the 67 stations.

Conclusion

This work presents the methodology and results of applying geolocation data in constructing computational models to predict solar irradiation in Minas Gerais, Brazil. Data from 134 cities, acquired over more than 20 years of observations, were analyzed to answer two main questions: Does adding geolocation data to predict solar irradiation improve results? Is it possible to efficiently predict solar irradiation values in a place without meteorological stations?

Five scenarios were evaluated: we verified the independent results for each of the 67 cities in Minas Gerais that have meteorological stations, analyzed whether including geolocation data from neighboring cities improves the results, and applied transfer learning in a methodology that enables prediction in cities without weather stations.

The results of scenario 1, with independent executions for each station, were compared with other works in the literature and served as the baseline for the other experiments. In scenarios 2 and 4, we added geolocation data from neighboring cities and found that this approach generally worsens the results relative to scenario 1. In scenarios 3 and 5, we composed the training bases from the neighboring stations of each studied city and used the city's own data as the test base, thus applying the transfer learning technique. The study of these scenarios revealed that, in most cases, it is possible to efficiently predict solar irradiation where no meteorological station is available by using the databases of neighboring cities, and that the number of cities used to build the training database affects the results obtained.

The main achievement of this work is the development of models for predicting solar irradiation in cities lacking meteorological stations. The large number of cities analyzed allowed us to verify that transfer learning techniques work for most sites. One limitation of this work is the choice of cities used to compose the training base, as the current selection relies solely on Euclidean distance. We initially verified that using the 5 stations closest to the target location yields the best results; however, more studies can be done in this area. This paper's contributions are summarized as follows:

  1. A methodology for transfer learning to geographically distributed locations;

  2. A methodology for developing an optimized computational model;

  3. Automatic feature selection;

  4. An unprecedented application of the ELM model for predicting solar irradiation in Minas Gerais;

  5. A study covering the entire state of Minas Gerais.

This work represents a significant step forward in predicting solar irradiation in regions lacking meteorological stations, particularly in Minas Gerais, Brazil. The proposed methodology, which combines geolocation data and transfer learning techniques, has proven effective in producing accurate solar irradiation predictions.

A promising avenue for improvement is the intelligent selection of the neighborhood stations that compose the training database. The choice of stations can significantly impact the model's performance, and future research should focus on refining the selection criteria, thereby enhancing the accuracy and applicability of the developed models.
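For reference, the selection criterion used in this study (plain Euclidean distance on station coordinates, keeping the 5 nearest stations) can be sketched as follows. The coordinates and function name are illustrative; a production version might prefer great-circle (haversine) distance, since Euclidean distance on raw latitude/longitude degrees distorts east-west separations.

```python
import numpy as np

def k_nearest_stations(target_coord, station_coords, k=5):
    """Return indices of the k stations closest to the target location,
    using plain Euclidean distance on (latitude, longitude) pairs --
    the criterion used in this study (k = 5 gave the best results)."""
    diffs = np.asarray(station_coords, dtype=float) - np.asarray(target_coord, dtype=float)
    dists = np.linalg.norm(diffs, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical coordinates (degrees): a target city and candidate stations.
target = (-19.92, -43.94)
stations = [(-21.76, -43.35), (-18.91, -48.27), (-19.75, -44.32),
            (-20.72, -42.87), (-19.47, -44.25), (-16.73, -43.86)]
nearest = k_nearest_stations(target, stations, k=5)
```

Future selection criteria could replace the distance metric here with one that also weighs altitude difference or climatic similarity, addressing the microclimate issues discussed earlier.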

Despite these limitations, this study contributes valuable methodologies and insights toward precise solar irradiation prediction and further advances in renewable energy planning and implementation. Its findings could support the development of more efficient and cost-effective solar energy systems, helping to reduce reliance on fossil fuels and combat climate change.