Introduction

One of the most significant hazards to satellites in geostationary orbit (GEO) is internal charging and discharging caused by high-energy electrons1. Energetic (> 2 MeV) electron fluxes exceeding \(10^{8}\ \mathrm{cm^{-2}\,d^{-1}\,sr^{-1}}\) have been linked to satellite anomalies in GEO2. Extensive studies have mined > 2 MeV electron measurements from the GOES satellites, leading to real-time assessments and risk predictions for internal charging1,3.

Artificial intelligence algorithms are frequently applied to predict high-energy electron fluxes. Data preprocessing and continuous model improvements are key strategies for enhancing prediction accuracy. Space radiation detection datasets often contain missing values because of factors such as satellite communication interruptions, equipment malfunctions, or abnormal detections. These missing values pose significant challenges for data-driven machine learning models4. Imputing missing data is a widely used technique to preserve as much information as possible in space radiation datasets5. This imputation process is a crucial preprocessing step for machine learning models used in predicting GEO high-energy electron fluxes, especially in time series forecasting, which relies on continuous satellite detection data6,7,8,9. In previous studies, linear interpolation9 and second-order polynomial interpolation10 have been employed to impute missing data in GOES high-energy electron flux measurements. However, high-energy electron flux varies with space weather, which is stochastic and strongly nonlinear. These interpolation methods therefore have limited accuracy, especially for datasets with large-scale missing data, and the limitations become more pronounced when the missing data points fluctuate strongly over short time scales. Leveraging the ability of machine learning models to handle datasets with nonlinear features, Ruifei Cui et al.7 used a random forest (RF) algorithm to impute missing data in medium Earth orbit (MEO) on the basis of GEO detection data and obtained realistic results.

The GOES satellites are positioned in GEO, 35,786 km above the Earth’s equator, and follow a dual-satellite operation strategy. One satellite, located at 75°W, is known as GOES-East (referred to as GOES-E), whereas the other satellite, positioned at 135°W, is known as GOES-West (referred to as GOES-W). In this study, the 5-min averaged data of electron fluxes sourced from GOES-E and GOES-W are used, focusing on months with large-scale missing data. A genetic algorithm–random forest (GA-RF) model is developed to impute large-scale missing data, and the model results are compared with those of other machine learning models and interpolation methods.

Data

Datasets

In this study, 5-min averaged data of > 2 MeV electron fluxes are collected from GOES-E and GOES-W. All the GOES data can be accessed from the National Oceanic and Atmospheric Administration (NOAA) website (www.ncei.noaa.gov).

The number of missing data points for each month in the high-energy electron detection data from the GOES-W and GOES-E satellites is shown in Fig. 1. There are many consecutive months with large-scale missing values, with the maximum number of missing data points reaching 8640 of 8928. Months with more than 2880 missing data points are considered to have large-scale missing data and are included in this study, as shown in Table 1.
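The thresholds quoted above follow from the 5-min cadence: one day holds 288 samples, so a 31-day month holds 8928 samples, 2880 missing points correspond to 10 days, and 8640 to 30 days. The following is a minimal sketch of the monthly count, assuming missing samples are marked as `None`; the function names and the synthetic series are illustrative and are not taken from the study.

```python
from collections import Counter
from datetime import datetime, timedelta

SAMPLES_PER_DAY = 24 * 60 // 5      # 288 five-minute samples per day
LARGE_SCALE_THRESHOLD = 2880        # threshold from the text (10 days of samples)


def missing_per_month(timestamps, fluxes):
    """Count missing samples (None) per (year, month)."""
    counts = Counter()
    for ts, flux in zip(timestamps, fluxes):
        if flux is None:
            counts[(ts.year, ts.month)] += 1
    return counts


def large_scale_months(counts, threshold=LARGE_SCALE_THRESHOLD):
    """Return the months whose missing count exceeds the threshold."""
    return sorted(month for month, n in counts.items() if n > threshold)


# Synthetic example (not real GOES data): a 31-day month whose first 11 days are missing.
start = datetime(2001, 1, 1)
timestamps = [start + timedelta(minutes=5 * i) for i in range(31 * SAMPLES_PER_DAY)]
fluxes = [None if i < 11 * SAMPLES_PER_DAY else 1.0e3 for i in range(len(timestamps))]

counts = missing_per_month(timestamps, fluxes)
flagged = large_scale_months(counts)
```

In this synthetic case the month has 11 × 288 = 3168 missing samples, which exceeds the threshold, so it is flagged.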

Fig. 1 Missing data points for the GOES-E/W satellites.

Table 1 Large-scale missing data points by month.

Excluding the months listed in Table 1, the datasets used in this study were collected from 1999 to 2016, as listed in Table 2.

Table 2 Collected data for the GOES-E/W satellites.

Because the solar cycle evolves over the modeled period, training the model on some parts of the solar cycle and validating it on another will not yield an adequate performance estimate11. To address this issue, a 5-fold cross-validation method is adopted: the datasets are split into ten segments, and in each fold eight segments are merged to create the training dataset while the remaining two are used as the testing dataset, as shown in Fig. 2.
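One way to read this splitting scheme is sketched below. The text does not state which two segments are held out in each fold, so the sketch assumes contiguous segments with adjacent pairs held out in turn; the function name `five_fold_splits` is hypothetical.

```python
def five_fold_splits(n_samples, n_segments=10, n_folds=5):
    """Yield (train_idx, test_idx); each fold holds out two of ten contiguous segments.

    Assumption: held-out segments are adjacent pairs (0-1, 2-3, ...).
    """
    bounds = [round(k * n_samples / n_segments) for k in range(n_segments + 1)]
    segments = [list(range(bounds[k], bounds[k + 1])) for k in range(n_segments)]
    per_fold = n_segments // n_folds          # 2 segments held out per fold
    for fold in range(n_folds):
        held_out = set(range(fold * per_fold, (fold + 1) * per_fold))
        test_idx = [i for k in sorted(held_out) for i in segments[k]]
        train_idx = [i for k in range(n_segments) if k not in held_out
                     for i in segments[k]]
        yield train_idx, test_idx


splits = list(five_fold_splits(100))
```

With this layout every sample is used for testing exactly once across the five folds, and each fold trains on 80% of the data.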

Fig. 2 Diagram of the 5-fold cross-validation method.

Parameter selection and correlation analysis

The input and output parameters are selected for correlation analysis on the basis of physical models related to Earth’s trapped electrons and machine learning models established by previous researchers8,9,12,13,14. For the GOES-W detection data, the input parameters used in this study are the average solar wind velocity (V), solar wind velocity in the x-direction (Vx), proton density, SYM/H index, x, y, and z components of the interplanetary magnetic field (IMF) in the GSE coordinate system, geomagnetic index parameters (AU and AE) and 5-minute integral flux values of > 0.6 MeV and > 2 MeV electrons from the GOES-E satellites. The output parameter is the 5-minute integral flux of > 2 MeV electrons from the GOES-W satellites. If the GOES-E satellite is considered instead, the input parameters include the 5-minute integral fluxes of > 0.6 MeV and > 2 MeV electrons from the GOES-W satellites, whereas the output parameter is the 5-minute integral flux of > 2 MeV electrons from the GOES-E satellites.

Since variations in the outer radiation belt are influenced primarily by its previous state, time series data of these parameters are used as inputs rather than instantaneous values15. A correlation coefficient greater than 0.3 is generally considered indicative of a strong correlation. When the time offset exceeds 120 h, however, the correlation coefficients between the input parameters and the output parameter consistently fall below 0.3 (refs. 16,17,18,19,20,21). Therefore, we select a maximum time scale of 120 h, which corresponds to a 5-day period, and conduct the correlation analysis using the Spearman rank correlation method. The resulting Spearman correlation coefficients within 5 days of offset time are shown in Fig. 3. The correlation between the input high-energy electron fluxes (> 0.6 MeV and > 2 MeV) and the target > 2 MeV electron flux decreases as the offset time lag increases. For the GOES-E data, the correlation coefficients remain above 0.3 within the first 5 days. The correlation is stronger at smaller offset time lags. The parameters V and Vx exhibit strong correlations with the output, with correlation coefficients exceeding 0.3 for up to 5 offset time lags. The other parameters have correlation coefficients greater than 0.3 for up to 2 offset time lags each. Previous studies have indicated that relativistic electron fluxes in GEO do not strongly correlate with the interplanetary magnetic field (IMF)9. The IMF typically contains both southward and northward components, leading to minimal daily average variation. In our study, the half-day mean Bz value is used as a substitute for the average IMF as an input feature. The correlation coefficient for Bz exceeds 0.3 for two offset time lags. For the GOES-W data, the correlation pattern mirrors that of the GOES-E data. Following the correlation analysis, the numbers of input parameters for the GOES-E and GOES-W datasets are 43 and 48, respectively.
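The lag selection described above can be illustrated with a small, dependency-free sketch. It computes the Spearman coefficient as the Pearson correlation of average ranks and keeps the lags whose coefficient exceeds 0.3. The helper names, the use of the signed coefficient (rather than its absolute value), and the synthetic series are assumptions made for illustration only.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(driver, target, max_lag):
    """Coefficient between driver[t - lag] and target[t] for lag = 0..max_lag."""
    return [spearman(driver[:len(driver) - lag], target[lag:])
            for lag in range(max_lag + 1)]


def select_lags(coeffs, threshold=0.3):
    """Keep the lags whose (signed) coefficient exceeds the threshold."""
    return [lag for lag, r in enumerate(coeffs) if r > threshold]


# Synthetic series: the target is a monotonic function of the driver two steps earlier.
driver = [(7 * i) % 11 for i in range(40)]
target = [0, 0] + [d ** 3 for d in driver[:-2]]
coeffs = lagged_spearman(driver, target, max_lag=5)
selected = select_lags(coeffs)
```

Because the Spearman coefficient depends only on ranks, the cubic relation in the synthetic data still yields a coefficient of 1 at a lag of two steps, which is why a rank-based measure suits nonlinear dependencies.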

Fig. 3 Correlations between the input parameters used and the logarithm of daily > 2 MeV electron fluxes within 5 days of offset for GOES-E and GOES-W.

Method

Algorithm

A GA-RF model was used in this study for large-scale missing data imputation. The RF algorithm has been efficiently applied in classification and regression tasks because of its superior training speed and good generalizability22,23,24. The GA is a form of inductive learning, providing an alternative to conventional optimization methods based on adaptive search techniques. It excels at identifying near-optimal solutions for complex optimization problems25. The optimal solutions are achieved upon the completion of iterations.

Fig. 4 Flowchart of the GA-RF.

The overall algorithm flowchart of the GA-RF is shown in Fig. 4. In this model, the input parameters of > 2 MeV electron integral flux are sequences from Table 2, and the other input parameters are selected on the basis of correlation analysis. The output parameter is > 2 MeV electron integral flux for the target satellites. This algorithm aims to optimize the parameters of the RF algorithm using GA techniques. This is achieved by first flattening all decision tree parameters into a chromosome, with the number of trees (trees), maximum tree depth (depth), and minimum number of samples required to be at a leaf node (leaf) determining the length of each chromosome in the GA population. The initial population size is set to 20, and the maximum evolution generation is set to 100. The main steps can be summarized as follows:

  • Initialization: An initial population of 20 chromosomes is generated using binary encoding on the basis of the parameters of the decision trees (trees, depth, and leaf).

  • Fitness function: The fitness function is defined as the inverse of the root mean square error (RMSE) value obtained from training the RF model.

  • Genetic operations:

    • Selection: A roulette wheel selection method is used, where chromosomes with higher fitness have a greater chance of being selected.

    • Crossover: A two-point crossover method is employed, where two chromosomes exchange segments. The crossover probability is set to 0.7.

    • Mutation: Mutation is applied by randomly flipping bits in a chromosome. The mutation probability is set to 0.01.

  • Evolution: The GA iterates through selection, crossover, and mutation operations to evolve the population. The goal is to find the chromosome with the best fitness value.

  • Optimal parameters: The chromosome with the highest fitness value represents the optimal set of parameters for the RF model.
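The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the bit widths per parameter are hypothetical, and `toy_rmse` is a stand-in for training an RF with the decoded parameters and returning its RMSE. The population size (20), maximum generation (100), crossover probability (0.7), and mutation probability (0.01) follow the text.

```python
import random

BITS = {"trees": 8, "depth": 5, "leaf": 4}   # hypothetical bit widths per parameter
CHROM_LEN = sum(BITS.values())


def decode(chrom):
    """Map a bit list to (trees, depth, leaf); each value is offset to be >= 1."""
    params, pos = [], 0
    for width in BITS.values():
        bits = chrom[pos:pos + width]
        params.append(int("".join(map(str, bits)), 2) + 1)
        pos += width
    return tuple(params)


def two_point_crossover(a, b, rng):
    """Exchange the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, CHROM_LEN), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(chrom, rng, p_mut):
    """Flip each bit independently with probability p_mut."""
    return [bit ^ 1 if rng.random() < p_mut else bit for bit in chrom]


def run_ga(rmse_of, pop_size=20, generations=100, p_cross=0.7, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(pop_size)]
    best, best_fit = None, 0.0
    for gen in range(generations):
        fits = [1.0 / rmse_of(*decode(c)) for c in pop]      # fitness = 1 / RMSE
        for chrom, fit in zip(pop, fits):
            if fit > best_fit:
                best, best_fit = chrom[:], fit
        if gen == generations - 1:
            break
        children = []
        while len(children) < pop_size:
            a, b = rng.choices(pop, weights=fits, k=2)        # roulette wheel selection
            if rng.random() < p_cross:
                a, b = two_point_crossover(a, b, rng)
            children += [mutate(a, rng, p_mut), mutate(b, rng, p_mut)]
        pop = children[:pop_size]
    return decode(best), 1.0 / best_fit


def toy_rmse(trees, depth, leaf):
    """Stand-in for 'train an RF and return its RMSE' (not a real model)."""
    return 0.2 + abs(trees - 120) / 500 + abs(depth - 12) / 50 + abs(leaf - 3) / 40


best_params, best_rmse = run_ga(toy_rmse)
```

In a real setting, `rmse_of` would fit a random forest with the decoded number of trees, maximum depth, and minimum leaf size, and return the RMSE on held-out data; the rest of the loop would stay unchanged.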

Furthermore, other machine learning algorithms, including back propagation (BP), long short-term memory (LSTM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), and random forest (RF), are compared with the GA-RF algorithm. The parameter settings for these algorithms are detailed in Table 3.

Table 3 BP, LSTM, ELM, XGBoost, and RF model settings.

Evaluation indicators

Four evaluation indicators, including the linear correlation coefficient (LC), prediction efficiency (PE), mean absolute error (MAE) and root mean squared error (RMSE), are introduced for quantification when assessing and comparing the predictive performance of the models. They are defined as follows:

\(LC=\frac{\sum\limits_{i=1}^{n}\left(t_i-\bar{t}\right)\left(T_i-\bar{T}\right)}{\sqrt{\sum\limits_{i=1}^{n}\left(t_i-\bar{t}\right)^2\sum\limits_{i=1}^{n}\left(T_i-\bar{T}\right)^2}}\)

\(PE=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{t_i} - {T_i}} \right)}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{\left( {{T_i} - \bar {T}} \right)}^2}} }}\)

\(MAE=\frac{{\sum\limits_{{i=1}}^{n} {|{t_i} - {T_i}|} }}{n}\)

\(RMSE=\sqrt{\frac{\sum\limits_{i=1}^{n}\left(t_i-T_i\right)^2}{n}}\)

where \({t_i}\) is the forecast value, \({T_i}\) is the observed value, \(\bar {t}\) is the mean of the forecast values, \(\bar {T}\) is the mean of the observed values, and n is the number of samples. Each of these indicators evaluates the model from a different perspective. The LC denotes the strength of the linear relationship between the forecast and observed values. The PE measures the prediction accuracy relative to the variance of the observations. The closer the LC and PE values are to 1, the better. The MAE and RMSE reflect the level of fit between the predicted and observed values. The smaller these values are, the better.
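As a quick check of how the four indicators behave, the sketch below implements them with the standard library only. It uses the standard Pearson form for LC and the squared forecast-minus-observation differences for PE and RMSE; the function names and the toy values are illustrative.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def lc(t, T):
    """Linear correlation coefficient between forecasts t and observations T."""
    tb, Tb = _mean(t), _mean(T)
    num = sum((a - tb) * (b - Tb) for a, b in zip(t, T))
    den = math.sqrt(sum((a - tb) ** 2 for a in t) * sum((b - Tb) ** 2 for b in T))
    return num / den


def pe(t, T):
    """Prediction efficiency: 1 minus squared error over observation variance."""
    Tb = _mean(T)
    return 1.0 - sum((a - b) ** 2 for a, b in zip(t, T)) / sum((b - Tb) ** 2 for b in T)


def mae(t, T):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(t, T)) / len(T)


def rmse(t, T):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, T)) / len(T))


observed = [2.0, 3.0, 4.0, 5.0]
forecast = [2.5, 3.5, 4.5, 5.5]      # toy forecast with a constant offset of 0.5

scores = {"LC": lc(forecast, observed), "PE": pe(forecast, observed),
          "MAE": mae(forecast, observed), "RMSE": rmse(forecast, observed)}
```

For this toy input the LC is 1 because the forecast tracks the observations perfectly up to an offset, whereas the PE drops to 0.8 and the MAE and RMSE both equal 0.5. This shows why the indicators are reported together: the LC alone does not penalize a systematic bias.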

Results and analysis

Imputation data evaluation

The evolution of the RMSE over the GA iterations is shown in Fig. 5. The best-performing model was obtained after approximately 83 iterations, with an RMSE of 0.2045.

Fig. 5 Optimization of the iterative performance of the random forest model using the genetic algorithm.

The evaluation indicators listed in Table 4 show the 5-fold cross-validation performance of BP, LSTM, RF, ELM, XGBoost and GA-RF. Across the four evaluation metrics (RMSE, MAE, PE, and LC), the GA-RF model exhibited the best performance, with the smallest RMSE and MAE and the largest PE and LC. Specifically, the RMSE, MAE, PE, and LC for the GA-RF model were 0.3872, 0.2084, 0.9199, and 0.8140, respectively, for the GOES-E satellite data and 0.4197, 0.2474, 0.9290, and 0.8595, respectively, for the GOES-W satellite data.

Table 4 Evaluation indicators of the BP, LSTM, RF, ELM, XGBoost and GA-RF models for the GOES-E/W satellites.

Figure 6 presents scatter density plots comparing the imputed data and the detection data for the total dataset, training set, validation set, and test set for both the GOES-E and the GOES-W satellites. Most of the data points lie along the 1:1 diagonal line, indicating that the GA-RF model effectively imputed the missing data in this study.

Fig. 6 Gaussian kernel density estimation for the logarithm of imputation data by GA-RF and detection data (black dashed line: the imputation value is equal to the observed value; red dashed line: log10(Flux(GA-RF)) = log10(Flux(detection)) ± 1.0).

Imputation performance between the GA-RF and interpolation methods

Figure 7 shows the imputation results for missing values of the > 2 MeV electron integral flux from the GOES-E satellites, comparing two commonly used imputation methods, cubic spline interpolation and linear interpolation, with the GA-RF algorithm presented in this paper. The data obtained from cubic spline interpolation and linear interpolation are smooth extensions that resemble the data mean. These methods fail to capture the variations driven by space weather, which are crucial for 5-minute resolution data. As a result, they do not reflect the physical behavior of the electron flux, which limits their utility in this study.
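The smoothing behavior of interpolation across a long gap is easy to reproduce. The sketch below fills a gap by straight-line interpolation between the two bounding samples; the series is synthetic (values standing in for log10 flux), and the function name `linear_fill` is illustrative.

```python
def linear_fill(values):
    """Fill interior runs of None by straight-line interpolation.

    Leading or trailing None values have no bounding sample and are left as-is.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            filled[i] = (1 - w) * values[left] + w * values[right]
    return filled


truth = [3.0, 3.4, 2.1, 4.0, 2.6, 3.8, 3.0]       # synthetic log10 flux values
gappy = [truth[0]] + [None] * 5 + [truth[-1]]     # everything between the endpoints is missing
filled = linear_fill(gappy)

true_range = max(truth) - min(truth)              # variability of the real signal
filled_range = max(filled) - min(filled)          # variability after interpolation
```

Here the true values vary by almost two orders of magnitude inside the gap, while the interpolated segment is essentially flat because both endpoints happen to be equal. The information about what happened inside the gap has to come from other inputs, which is the motivation for a model driven by solar wind, geomagnetic, and companion-satellite data.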

Fig. 7 Comparison of the imputation of missing data from GOES-E from December 5, 2007, to December 16, 2007, by GA-RF, cubic spline interpolation and linear interpolation.

Results analysis for the imputation of missing values

In this section, different time series of data from GOES-12 are selected as new test sets, specifically from November 2007 to December 2007, December 2008 to February 2009, November 2009 to January 2010, and March 2010 to April 2010. These periods are excluded from the time ranges listed in Table 1. The data imputation performance of GA-RF is presented in Table 5, together with that of cubic spline interpolation and linear interpolation. The results demonstrate that GA-RF outperforms the other methods in terms of imputation accuracy.

Table 5 Evaluation indicators for GA-RF, cubic spline interpolation and linear interpolation.

The resulting imputation data of different time series selected from Table 2 compared with the detection data are shown in Figs. 8, 9, 10 and 11. The figures show that the model accurately captures rapid increases and decreases in high-energy electron fluxes, with the overall predicted values closely aligning with the satellite-detected data.

Fig. 8 Comparison of the observation data with the imputation data of GOES-12 from November 1, 2007 to December 4, 2007.

Fig. 9 Comparison of the observation data with the imputation data of GOES-12 from December 16, 2007 to December 31, 2007.

Fig. 10 Comparison of the observation data with the imputation data of GOES-10 from August 1, 2000 to September 12, 2000.

Fig. 11 Comparison of the observation data with the imputation data of GOES-10 from May 1, 2006 to June 23, 2006.

Conclusions

Missing data are a significant issue that can degrade the performance of machine learning models for predicting high-energy electron fluxes on the basis of satellite data. This study focuses on high-energy electron fluxes from GOES satellites and targets months with large-scale missing data for the imputation process. The GA-RF model, along with other machine learning models, was trained to impute missing data when large-scale gaps occurred in satellite detection data within a given month.

The GA-RF model demonstrated the best overall performance in model evaluation. The RMSE, MAE, PE, and LC for the GA-RF model were 0.3872, 0.2084, 0.9199, and 0.8140, respectively, for the GOES-E satellite data and 0.4197, 0.2474, 0.9290, and 0.8595, respectively, for the GOES-W satellite data. Additionally, we compared the results of GA-RF with those of cubic spline interpolation and linear interpolation. The imputed data from the GA-RF model effectively captured electron flux variations, with the imputed values closely matching the satellite detection data. Specifically, the RMSE, MAE, PE, and LC for the GA-RF model were 0.3983, 0.1938, 0.8275, and 0.9259, respectively, for the GOES-E satellite data and 0.4082, 0.2625, 0.8814, and 0.9395, respectively, for the GOES-W satellite data.

Building on the imputation process presented in this study, future studies will focus on developing a prediction model for relativistic electrons at GEO.