Abstract
A genetic algorithm-optimized random forest algorithm (GA-RF) model is constructed to impute large-scale missing data for 5-min averaged data of > 2 MeV electron integral fluxes from GOES-E/W satellites. The model inputs include V, Vx, the proton density, the SYM/H index, B, Bx, By, Bz, AU, AE, and > 0.6 MeV and > 2 MeV electron integral fluxes from GOES-E/W. The target variable is the > 2 MeV electron integral flux from GOES-W/E. A comparison of the GA-RF model with other machine learning models, including the backpropagation (BP), long short-term memory (LSTM), random forest (RF), extreme learning machine (ELM), and extreme gradient boosting (XGBoost) models, reveals that the GA-RF model achieves the highest PE and LC values and the lowest RMSE and MAE values, indicating that the GA-RF model outperforms the other models in imputing large-scale missing data. Compared with commonly used interpolation methods, such as cubic spline interpolation and linear interpolation, the GA-RF model effectively captures electron flux variations and provides imputed data that closely align with satellite-detected values.
Similar content being viewed by others
Introduction
One of the most significant hazards to satellites in geostationary orbit (GEO) is internal charging and discharging caused by high-energy electrons1. Energetic (> 2 MeV) electrons fluxes > 108 cm− 2 d− 1 sr− 1 have been linked to satellite anomalies in GEO2. Extensive studies have focused on data mining using > 2 MeV high-energy electron detection from GOES satellites, leading to real-time assessments and risk predictions for internal charging1,3.
Artificial intelligence algorithms are frequently applied to predict high-energy electron fluxes. Data preprocessing and continuous model improvements are key strategies for enhancing prediction accuracy. Space radiation detection datasets often contain missing values because of factors such as satellite communication interruptions, equipment malfunctions, or abnormal detections. These missing values pose significant challenges for data-driven machine learning models4. Imputing missing data is a widely used technique to preserve as much information as possible in space radiation datasets5. This imputation process is a crucial preprocessing step for machine learning models used in predicting GEO high-energy electron fluxes, especially in time series forecasting, which relies on continuous satellite detection data6,7,8,9. In previous studies, linear interpolation9 and second-order polynomial interpolation10 have been employed to impute missing data in GOES high-energy electron flux measurements. High-energy electron flux varies with space weather, which is stochastic and exhibits strong nonlinear characteristics. The aforementioned interpolation methods have limitations in terms of model accuracy, especially when dealing with datasets with large-scale missing data. These limitations become more pronounced when imputing data points that have large fluctuations within a small time scale. Fully leveraging the advantages of neural network models in handling nonlinear feather datasets, Ruifei Cui et al.7 utilized a random forest (RF) algorithm to impute missing data in the MEO orbit on the basis of GEO detection data, and good and realistic results were achieved.
The GOES satellites are positioned in a GEO 35,786 km above the Earth’s equator, following a dual-satellite operation strategy. One satellite, located at 75°W, is known as GOES-West (referred to as GOES-W), whereas the other satellite, positioned at 135°W, is referred to as GOES-East (referred to as GOES-E). In this study, the 5-min averaged data of electron fluxes sourced from GOES-E and GOES-W are used, focusing on months with large-scale missing data. A genetic algorithm–random forest (GA-RF) model is developed to impute large-scale missing data, and the model results are compared with those of other machine learning models and interpolation methods.
Data
Datasets
In this study, 5-min averaged data of > 2 MeV electron fluxes are collected from GOES-E and GOES-W. All the GOES data can be accessed from the National Oceanic and Atmospheric Administration (NOAA) website (www.ncei.noaa.gov).
The number of missing data points for each month in the high-energy electron detection data from the GOES-W and GOES-E satellites is shown in Fig. 1. There are many consecutive months with large-scale missing values, with the maximum number of missing data points reaching 8640 of 8928. Missing data points exceeding 2880 per month are considered large-scale missing data are included in this study, as shown in Table 1.
Missing data points for the GOES-E/W satellites.
Excluding the months listed in Table 1, the datasets used in this study were collected from 1999 to 2016, as listed in Table 2.
Considering the evolution of the solar cycle in the modeled process, simply training the model with some parts of the solar cycle and validating with another part when dividing the dataset will not yield an adequate performance estimate11. To address this issue, a 5-fold cross-validation method is adopted, and the datasets are split into ten folds. In this method, eight folds are merged to create the training datasets and the remaining two folds are used as the testing datasets, as shown in Fig. 2.
Diagram of the 5-fold cross-validation method.
Parameter selection and correlation analysis
The input and output parameters are selected for correlation analysis on the basis of physical models related to Earth’s trapped electrons and machine learning models established by previous researchers8,9,12,13,14. For the GOES-W detection data, the input parameters used in this study are the average solar wind velocity (V), solar wind velocity in the x-direction (Vx), proton density, SYM/H index, x, y, and z components of the interplanetary magnetic field (IMF) in the GSE coordinate system, geomagnetic index parameters (AU and AE) and 5-minute integral flux values of > 0.6 MeV and > 2 MeV electrons from the GOES-E satellites. The output parameter is the 5-minute integral flux of > 2 MeV electrons from the GOES-W satellites. If the GOES-E satellite is considered instead, the input parameters include the 5-minute integral fluxes of > 0.6 MeV and > 2 MeV electrons from the GOES-W satellites, whereas the output parameter is the 5-minute integral flux of > 2 MeV electrons from the GOES-E satellites.
Since variations in the outer radiation belt are influenced primarily by its previous state, time series data of these parameters are used as inputs rather than relying on instantaneous values15. A correlation coefficient greater than 0.3 is generally considered as indicative of a strong correlation. However, when the time offset exceeds 120 h, the correlation coefficients between the input parameters and the output parameter consistently fall below 0.316,17,18,19,20,21. Therefore, we select a maximum time scale of 120 h, which corresponds to a 5-day period, to conduct the correlation analysis by the Spearman rank correlation analysis method. The resulting Spearman correlation coefficients within 5 days of offset time are shown in Fig. 3. The correlation between the input high-energy electron fluxes (> 0.6 MeV and > 2 MeV) and the target > 2 MeV electron flux decreases as the offset time lag increases. For the GOES-E data, the correlation coefficients remain above 0.3 within the first 5 days. The correlation is stronger at smaller offset time lags. The parameters V and Vx exhibit strong correlations with the output, with correlation coefficients exceeding 0.3 for up to 5 offset time lags. The other parameters have correlation coefficients greater than 0.3 for up to 2 offset time lags each. Previous studies have indicated that relativistic electron fluxes in GEO do not strongly correlate with the interplanetary magnetic field (IMF)9. The IMF typically contains both southward and northward components, leading to minimal daily average variation. In our study, the half-day mean Bz value is used as a substitute for the average IMF as an input feature. The correlation coefficient for Bz exceeds 0.3 for two offset time lags. For the GOES-W data, the correlation pattern mirrors that of the GOES-E data. Following the correlation analysis, the numbers of input parameters for the GOES-E and GOES-W datasets are 43 and 48, respectively.
Correlations between the input parameters used and the logarithm of daily > 2 MeV electron fluxes within 5 days of offset for GOES-E and GOES-W.
Method
Algorithm
A GA-RF model was used in this study for large-scale missing data imputation. The RF algorithm has been efficiently applied in classification and regression tasks because of its superior training speed and good generalizability22,23,24. The GA is a form of inductive learning, providing an alternative to conventional optimization methods based on adaptive search techniques. It excels at identifying near-optimal solutions for complex optimization problems25. The optimal solutions are achieved upon the completion of iterations.
Flowchart of the GA-RF.
The overall algorithm flowchart of the GA-RF is shown in Fig. 4. In this model, the input parameters of > 2 MeV electron integral flux are sequences from Table 2, and the other input parameters are selected on the basis of correlation analysis. The output parameter is > 2 MeV electron integral flux for the target satellites. This algorithm aims to optimize the parameters of the RF algorithm using GA techniques. This is achieved by first flattening all decision tree parameters into a chromosome, with the number of trees (trees), maximum tree depth (depth), and minimum number of samples required to be at a leaf node (leaf) determining the length of each chromosome in the GA population. The initial population size is set to 20, and the maximum evolution generation is set to 100. The main steps can be summarized as follows:
-
Initialization: An initial population of 20 chromosomes is generated using binary encoding on the basis of the parameters of the decision trees (trees, depth, and leaf).
-
Fitness function: The fitness function is defined as the inverse of the root mean square error (RMSE) value obtained from training the RF model.
-
Genetic operations:
-
Selection: A roulette wheel selection method is used, where chromosomes with higher fitness have a greater chance of being selected.
-
Crossover: A two-point crossover method is employed, where two chromosomes exchange segments. The crossover probability is set to 0.7.
-
Mutation: Mutation is applied by randomly flipping bits in a chromosome. The mutation probability is set to 0.01.
-
-
Evolution: The GA iterates through selection, crossover, and mutation operations to evolve the population. The goal is to find the chromosome with the best fitness value.
-
Optimal parameters: The chromosome with the highest fitness value represents the optimal set of parameters for the RA model.
Furthermore, other machine learning algorithms, including back propagation (BP), long short-term memory (LSTM), ELM (extreme learning machine (ELM), extreme gradient boosting (XGBoost), and random forest (RF), are compared with the GA-RF algorithm. The parameter settings for these algorithms are detailed in Table 3.
Evaluation indicators
Four evaluation indicators, including the linear correlation coefficient (LC), prediction efficiency (PE), mean absolute error (MAE) and root mean squared error (RMSE), are introduced for quantification when assessing and comparing the predictive performance of the models. They are defined as follows:
\(LC=\frac{{\sum\nolimits_{{i=1}}^{n} {\left( {{t_i} - \bar {t}} \right)\left( {{T_i} - \bar {T}} \right)} }}{{\sqrt {{{\sum\limits_{{i=1}}^{n} {{{\left( {{t_i} - \bar {t}} \right)}^2}\left( {{T_i} - \bar {T}} \right)} }^2}} }}\)
\(PE=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{t_i} - {T_i}} \right)}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{\left( {{T_i} - \bar {T}} \right)}^2}} }}\)
\(MAE=\frac{{\sum\limits_{{i=1}}^{n} {|{t_i} - {T_i}|} }}{n}\)
\(RMSE=\sqrt {\frac{{\sum\limits_{{i=1}}^{n} {{{({t_i} - \bar {T})}^2}} }}{n}}\)
where \({t_i}\) is the forecasting value, \({T_i}\) is the observation value, \(\bar {t}\) is the mean of the forecasting value, \(\bar {T}\) is the mean value of the observation, and n is the number of samples. Each of these indicators evaluates the model from a different perspective. The LC denotes the strength and correlation of the linear relationship between the forecasted and observed values. The PE measures the prediction accuracy. The closer the LC and PE values are to 1, the better. The MAE and RMSE reflect the level of fit between the prediction and observed values. The smaller the values are, the better.
Results and analysis
Imputation data evaluation
The evolution of the RMSE over the training epochs is shown in Fig. 5. The model with the best performance was achieved after approximately 83 training epochs, with an RMSE of 0.2045.
Optimization of the iterative performance of the random forest model using the genetic algorithm.
The specific evolution indicators listed in Table 4 demonstrate the performance of the 5-fold cross-validation of BP, LSTM, RF, ELM, XGBoost and GA-RF. Among the four evaluation metrics—RMSE, PE, MAE, and LC—the RF-GA model exhibited the best performance, with the smallest RMSE and MAE and the largest PE and LC. Specifically, the RMSE, PE, MAE, and LC for the RF-GA model were 0.3872, 0.2084, 0.9199, and 0.8140, respectively, for the GOES-E satellite data and 0.4197, 0.2474, 0.9290, and 0.8595, respectively, for the GOES-W satellite data.
Figure 6 presents scatter density plots comparing the imputed data and the detection data for the total dataset, training set, validation set, and test set for both the GOES-E and the GOES-W satellites. Most of the data points are aligned along the diagonal line, with a slope of 1:1, indicating that the RF-GA model effectively imputed the missing data in this study.
Gaussian kernel density estimation for the logarithm of imputation data by GA-RF and detection data (black dashed line: the imputation value is equal to the observed value; red dashed line: log10(Flux(GA-RF)) = log10(Flux(detection)) ± 1.0).
Imputation performance between the GA-RF and interpolation methods
Figure 7 demonstrate the imputation results of missing values for > 2 MeV electron integral flux from GOES-E satellites, comparing commonly used imputation methods, including cubic sample interpolation and linear interpolation, and the GA-RF algorithm presented in this paper. The data obtained from cubic sample interpolation and linear interpolation are smooth extensions, resembling the data mean. These methods, however, fail to capture the variations in the data effectively with respect to space weather, which is crucial for 5-minute resolution data. As a result, they do not align with the physical laws governing the data, limiting their utility in this study.
Comparison of the imputation of missing data from GOES-E from December 5, 2007, to December 16, 2007, by GA-RF, cubic sample interpolation and linear interpolation.
Results analysis for the imputation of missing values
In this section, different time series of data from GOES-12 are selected as new test sets, specifically from November 2007 to December 2007, December 2008 to February 2009, November 2009 to January 2010, and March 2010 to April 2010. These periods are excluded from the time ranges listed in Table 1. The data imputation performance of GA-RF is presented in Table 5, where it is also compared with that of cubic spline interpolation and linear interpolation. The results, shown in Table 5, demonstrate that GA-RF outperforms the other methods in terms of imputation accuracy.
The resulting imputation data of different time series selected from Table 2 compared with the detection data are shown in Figs. 8, 9, 10 and 11. The figures show that the model accurately captures rapid increases and decreases in high-energy electron fluxes, with the overall predicted values closely aligning with the satellite-detected data.
Comparison of the observation data with the imputation data of GOES-12 from November 1, 2007 to December 4, 2007.
Comparison of the observation data with the imputation data of GOES-12 from December 16, 2007 to December 31, 2007.
Comparison of the observation data with the imputation data of GOES-10 from August 1, 2000 to September 12, 2000.
Comparison of the observation data with the imputation data of GOES-10 from May 1, 2006 to June 23, 2006.
Conclusions
Missing data are a significant issue that can impact the performance of machine learning models for predicting high-energy electron fluxes on the basis of satellite data. This study focuses on high-energy electron fluxes from GOES satellites, targeting large-scale missing data months for the imputation process. The GA-RF model, along with other machine learning models, was trained to impute missing data when large-scale gaps occurred in satellite detection data within a given month.
The GA-RF model demonstrated the best overall performance in model evaluation. The RMSE, MAE, PE, and LC for the GA-RF model were 0.3872, 0.2084, 0.9199, and 0.8140, respectively, for the GOES-E satellite data and 0.4197, 0.2474, 0.9290, and 0.8595, respectively, for the GOES-W satellite data. Additionally, we compared the results of GA-RF with those of cubic spline interpolation and linear interpolation. The imputed data from the GA-RF model effectively captured electron flux variations, with the imputed values closely matching the satellite detection data. Specifically, the RMSE, MAE, PE, and LC for the GA-RF model were 0.3983, 0.1938, 0.8275, and 0.9259, respectively, for the GOES-E satellite data and 0.4082, 0.2625, 0.8814, and 0.9395, respectively, for the GOES-W satellite data.
Building on the imputation process presented in this study, future studies will focus on developing a prediction model for relativistic electrons at GEO.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Horne, R. B. et al. The satellite risk prediction and radiation forecast system (SaRIF). Space Weather. 19, e2021SW002823. https://doi.org/10.1029/2021SW002823 (2021).
Iucci, N. et al. Space weather conditions and spacecraft anomalies in different orbits. Space Weather. 3, 1–16 (2005). S01001.
Benoı̂t, T. et al. Presentation and validation of the internal charging risk forecast in the PAGER framework, advances in Space Research, 72,9: 3666–3676, (2023) https://doi.org/10.1016/j.asr.2023.07.047
mmanuel, T. et al. A survey on missing data in machine learning. J. Big Data. 8, 140. https://doi.org/10.1186/s40537-021-00516-9 (2021).
Wang, S., Li, W., Hou, S., Guan, J. & Yao, J. STA-GAN: ASpatio-temporal attention generative adversarial network for missing value imputation in satellite data. Remote Sens. 15, 88. https://doi.org/10.3390/rs15010088 (2023).
Zhang, H. et al. A prediction model of relativistic electrons at geostationary orbit using the EMD-LSTM network and geomagnetic indices. Space Weather. 20 (3), 1–15 (2022).
Cui, R. et al. Machine learning for the relationship of high-energy Electron flux between GEO and MEO with application to missing values imputation for Beidou MEO Data. Open. Astronomy. 30 (1), 62–72. https://doi.org/10.1515/astro-2021-0008 (2021).
Zhang, H. et al. Relativistic electron flux prediction at geosynchronous orbit based on the neural network and the quantile regression method. Space Weather. 18 (9). https://doi.org/10.1029/2020SW002445 (2020). e2020SW002445.
Wei, L. et al. Quantitative prediction of high-energy electron integral flux at geostationary orbit based on deep learning. Space Weather. 16 (7), 903–916. https://doi.org/10.1029/2018SW001829 (2018).
Ling, A. G., Ginet, G. P., Hilmer, R. V. & Perry, K. L. A neural network-based geosynchronous relativistic electron flux forecasting model. Space Weather. 8 (9), S09003. https://doi.org/10.1029/2010SW000576 (2010).
Smirnov, A. G. et al. Medium energy electron flux in earth’s outer radiation belt (MERLIN): a machine learning model. Space Weather. 18, e2020SW002532. https://doi.org/10.1029/2020SW002532 (2020).
Chu, X. et al. Relativistic electron model in the outer radiation belt using a neural network approach. Space Weather. 19, e2021SW002808. https://doi.org/10.1029/2021SW002808 (2021).
Li, X. et al. Energetic electrons, 50 keV to 6 MeV, at geosynchronous orbit: their responses to solar wind variations. Space Weather. 3, S04001. https://doi.org/10.1029/2004SW000105 (2005).
Li, L. Y., Cao, J. B. & Zhou, G. C. Relation between the variation of geomagnetospheric relativistic electron flux and storm/substorm. Chin. J. Geophysics- Chin. Ed. 49 (1), 9–15 (2006).
Ma, D. et al. Modeling the dynamic variability of sub-relativistic outer radiation belt electron fluxes using machine learning. Space Weather. 20, e2022SW003079. https://doi.org/10.1029/2022SW003079 (2022).
Rilling, G., Flandrin, P. & Goncalves, P. On empirical mode decomposition and its algorithms. Proc. IEEE-EURASIP Workshop Nonlinear Signal. Image Process. NSIP-03. 3 (3), 8–11 (2003).
Rycroft, M., Nicoll, K., Aplin, K. & Harrison, R. Recent advances in global electric circuit coupling between the space environment and the Troposphere. J. Atmos. Solar Terr. Phys. 90–91. https://doi.org/10.1016/j.jastp.2012.03.015 (2012).
Sain, S. R. & Stephan, R. The nature of statistical learning theory. Technometrics 38 (4), 409–422. https://doi.org/10.1080/00401706.1996.10484565 (1997).
Sakaguchi, K. et al. Relativistic electron flux forecast at geostationary orbit using Kalman filter based on a multivariate autoregressive model. Space Weather. 11 (2), 79–89. https://doi.org/10.1002/swe.20020 (2013).
Seppälä, A., Matthes, K., Randall, C. & Mironova, I. What is the solar influence on climate? Overview of activities during CAWSES-II. Progress Earth Planet. Sci. 1, 24. https://doi.org/10.1186/s40645-014-0024-3 (2014).
Simms, L. et al. A distributed lag autoregressive model of geostationary relativistic electron fluxes: comparing the influences of waves, seed and source electrons and solar wind inputs. J. Geophys. Research: Space Phys. 123 (5), 3646–3671. https://doi.org/10.1029/2017ja025002 (2018).
Fang, X. et al. Mar., Forecasting incidence of infectious diarrhea using random forest in Jiangsu province, China. BMC Infect. Dis., 20, 1, (2020)
Jamei, M. G. M., Ahmadianfar, I. & Pourrajab, R. Prediction of nanofluids viscosity using random forest (RF) approach. Chemom Intell. Lab. Syst., 201, (2020). Art. 104010.
Breiman, L. Random forests, Mach. Learn., vol. 45, no. 1, pp. 5–32, (2001).
Assiri, A. Anomaly classification using genetic algorithm-based random forest model for network attack detection. Computers Mater. Continua. 66 (1), 767–778. https://doi.org/10.32604/cmc.2020.013813 (2021).
Author information
Authors and Affiliations
Contributions
Meihua Fang primarily contributed to the conceptualization and methodology of the manuscript. Dingyi Song was responsible for data processing and algorithm development. Jianfei Chen mainly handled data collection and algorithm design. Biao Wang was in charge of data preprocessing. Mengyun He managed the data. Yukuan Ma conducted the literature Investigation.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Fang, M., Song, D., Chen, J. et al. Missing value imputation for > 2 MeV electron fluxes in geostationary orbit based on GA-RF model. Sci Rep 15, 10427 (2025). https://doi.org/10.1038/s41598-025-87082-9
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-87082-9













