Missing value imputation for > 2 MeV electron fluxes in geostationary orbit based on GA-RF model

Fang, Meihua; Song, Dingyi; Chen, JianFei; Wang, Biao; He, Mengyun; Ma, Yukuan

doi:10.1038/s41598-025-87082-9

Article
Open access
Published: 26 March 2025

Meihua Fang¹,
Dingyi Song¹,
JianFei Chen²,
Biao Wang¹,
Mengyun He¹ &
Yukuan Ma¹

Scientific Reports volume 15, Article number: 10427 (2025)

Abstract

A genetic algorithm-optimized random forest algorithm (GA-RF) model is constructed to impute large-scale missing data for 5-min averaged data of > 2 MeV electron integral fluxes from GOES-E/W satellites. The model inputs include V, Vx, the proton density, the SYM/H index, B, B_x, B_y, B_z, AU, AE, and > 0.6 MeV and > 2 MeV electron integral fluxes from GOES-E/W. The target variable is the > 2 MeV electron integral flux from GOES-W/E. A comparison of the GA-RF model with other machine learning models, including the backpropagation (BP), long short-term memory (LSTM), random forest (RF), extreme learning machine (ELM), and extreme gradient boosting (XGBoost) models, reveals that the GA-RF model achieves the highest PE and LC values and the lowest RMSE and MAE values, indicating that the GA-RF model outperforms the other models in imputing large-scale missing data. Compared with commonly used interpolation methods, such as cubic spline interpolation and linear interpolation, the GA-RF model effectively captures electron flux variations and provides imputed data that closely align with satellite-detected values.

Introduction

One of the most significant hazards to satellites in geostationary orbit (GEO) is internal charging and discharging caused by high-energy electrons¹. Energetic (> 2 MeV) electron fluxes > 10⁸ cm⁻² d⁻¹ sr⁻¹ have been linked to satellite anomalies in GEO². Extensive studies have focused on data mining of > 2 MeV high-energy electron measurements from GOES satellites, leading to real-time assessments and risk predictions for internal charging¹,³.

Artificial intelligence algorithms are frequently applied to predict high-energy electron fluxes. Data preprocessing and continuous model improvements are key strategies for enhancing prediction accuracy. Space radiation detection datasets often contain missing values because of factors such as satellite communication interruptions, equipment malfunctions, or abnormal detections. These missing values pose significant challenges for data-driven machine learning models⁴. Imputing missing data is a widely used technique to preserve as much information as possible in space radiation datasets⁵. This imputation process is a crucial preprocessing step for machine learning models used in predicting GEO high-energy electron fluxes, especially in time series forecasting, which relies on continuous satellite detection data⁶,⁷,⁸,⁹. In previous studies, linear interpolation⁹ and second-order polynomial interpolation¹⁰ have been employed to impute missing data in GOES high-energy electron flux measurements. High-energy electron flux varies with space weather, which is stochastic and exhibits strong nonlinear characteristics. These interpolation methods have limited accuracy, especially for datasets with large-scale missing data, and the limitations become more pronounced when imputing data points that fluctuate strongly over short time scales. Leveraging the ability of machine learning models to handle datasets with nonlinear features, Ruifei Cui et al.⁷ used a random forest (RF) algorithm to impute missing data in medium Earth orbit (MEO) on the basis of GEO detection data and obtained good, realistic results.

The GOES satellites are positioned in GEO, 35,786 km above the Earth’s equator, and follow a dual-satellite operation strategy. One satellite, located at 75°W, is known as GOES-East (referred to as GOES-E), whereas the other satellite, positioned at 135°W, is known as GOES-West (referred to as GOES-W). In this study, the 5-min averaged data of electron fluxes sourced from GOES-E and GOES-W are used, focusing on months with large-scale missing data. A genetic algorithm–random forest (GA-RF) model is developed to impute large-scale missing data, and the model results are compared with those of other machine learning models and interpolation methods.

Data

Datasets

In this study, 5-min averaged data of > 2 MeV electron fluxes are collected from GOES-E and GOES-W. All the GOES data can be accessed from the National Oceanic and Atmospheric Administration (NOAA) website (www.ncei.noaa.gov).

The number of missing data points for each month in the high-energy electron detection data from the GOES-W and GOES-E satellites is shown in Fig. 1. There are many consecutive months with large-scale missing values, with the maximum number of missing data points reaching 8640 of the 8928 five-minute points in a 31-day month. Months with more than 2880 missing data points (the equivalent of 10 days of 5-min data) are considered to have large-scale missing data and are included in this study, as shown in Table 1.
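
The monthly bookkeeping described above can be illustrated with a minimal sketch. It assumes a hypothetical list of timestamps that carry valid measurements; the function names and the synthetic 11-day outage are illustrative and are not taken from the paper. Only the 5-min cadence and the 2880-point threshold come from the text.

```python
from collections import Counter
from datetime import datetime, timedelta

CADENCE = timedelta(minutes=5)     # 5-min averaged data, as in the text
LARGE_SCALE_THRESHOLD = 2880       # missing points per month, as in the text


def missing_points_per_month(valid_timestamps, start, end):
    """Count 5-min slots in [start, end) that have no valid measurement."""
    present = set(valid_timestamps)
    missing = Counter()
    t = start
    while t < end:
        if t not in present:
            missing[(t.year, t.month)] += 1
        t += CADENCE
    return missing


def large_scale_months(missing):
    """Return the (year, month) pairs whose missing count exceeds the threshold."""
    return sorted(m for m, n in missing.items() if n > LARGE_SCALE_THRESHOLD)


# Synthetic example (hypothetical): January 2008 with an 11-day outage.
start, end = datetime(2008, 1, 1), datetime(2008, 2, 1)
outage_start, outage_end = datetime(2008, 1, 5), datetime(2008, 1, 16)
valid = []
t = start
while t < end:
    if not (outage_start <= t < outage_end):
        valid.append(t)
    t += CADENCE

missing = missing_points_per_month(valid, start, end)
flagged = large_scale_months(missing)
```

In this synthetic case the outage removes 11 × 288 = 3168 points, so January 2008 would be flagged as a month with large-scale missing data.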

Table 1 Large-scale missing data points by month.

Excluding the months listed in Table 1, the datasets used in this study were collected from 1999 to 2016, as listed in Table 2.

Table 2 Collected data for the GOES-E/W satellites.

Because the modeled process evolves over the solar cycle, simply training the model on some parts of the solar cycle and validating on another part will not yield an adequate performance estimate¹¹. To address this issue, a 5-fold cross-validation method is adopted: the datasets are split into ten folds, and in each round eight folds are merged to create the training dataset while the remaining two folds are used as the testing dataset, as shown in Fig. 2.
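
A minimal sketch of such a split is given here for illustration. The text specifies ten folds with two held out per round; which two folds are paired in each round is shown only in Fig. 2, so the pairing used below (fold k with fold k + 5) is an assumption, as are the function name and the use of contiguous, equally sized folds.

```python
def ten_fold_five_round_splits(n_samples, n_folds=10, n_rounds=5):
    """Yield (train_idx, test_idx); each round holds out two of the ten folds."""
    # Contiguous folds in time order (assumed layout).
    edges = [i * n_samples // n_folds for i in range(n_folds + 1)]
    folds = [list(range(edges[i], edges[i + 1])) for i in range(n_folds)]
    for r in range(n_rounds):
        held_out = {r, r + n_rounds}  # assumed pairing, not confirmed by the paper
        test = [i for f in sorted(held_out) for i in folds[f]]
        train = [i for f in range(n_folds) if f not in held_out for i in folds[f]]
        yield train, test


splits = list(ten_fold_five_round_splits(1000))
```

Pairing folds that are far apart in time is one way to let every round test on more than one phase of the solar cycle; other pairings would satisfy the description equally well.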

Parameter selection and correlation analysis

The input and output parameters are selected for correlation analysis on the basis of physical models related to Earth’s trapped electrons and machine learning models established by previous researchers^8,9,12,13,14. For the GOES-W detection data, the input parameters used in this study are the average solar wind velocity (V), solar wind velocity in the x-direction (Vx), proton density, SYM/H index, x, y, and z components of the interplanetary magnetic field (IMF) in the GSE coordinate system, geomagnetic index parameters (AU and AE) and 5-minute integral flux values of > 0.6 MeV and > 2 MeV electrons from the GOES-E satellites. The output parameter is the 5-minute integral flux of > 2 MeV electrons from the GOES-W satellites. If the GOES-E satellite is considered instead, the input parameters include the 5-minute integral fluxes of > 0.6 MeV and > 2 MeV electrons from the GOES-W satellites, whereas the output parameter is the 5-minute integral flux of > 2 MeV electrons from the GOES-E satellites.

Since variations in the outer radiation belt are influenced primarily by its previous state, time series data of these parameters are used as inputs rather than relying on instantaneous values¹⁵. A correlation coefficient greater than 0.3 is generally considered indicative of a strong correlation. When the time offset exceeds 120 h, however, the correlation coefficients between the input parameters and the output parameter consistently fall below 0.3¹⁶,¹⁷,¹⁸,¹⁹,²⁰,²¹. Therefore, we select a maximum time scale of 120 h, which corresponds to a 5-day period, and conduct the correlation analysis with the Spearman rank correlation method. The resulting Spearman correlation coefficients within 5 days of offset time are shown in Fig. 3. The correlation between the input high-energy electron fluxes (> 0.6 MeV and > 2 MeV) and the target > 2 MeV electron flux decreases as the time lag increases. For the GOES-E data, the correlation coefficients remain above 0.3 within the first 5 days and are stronger at smaller time lags. The parameters V and Vx exhibit strong correlations with the output, with correlation coefficients exceeding 0.3 for up to 5 time lags. The other parameters have correlation coefficients greater than 0.3 for up to 2 time lags each. Previous studies have indicated that relativistic electron fluxes in GEO do not strongly correlate with the interplanetary magnetic field (IMF)⁹. The IMF typically contains both southward and northward components, leading to minimal daily average variation. In our study, the half-day mean Bz value is used as a substitute for the average IMF as an input feature. The correlation coefficient for Bz exceeds 0.3 for two time lags. For the GOES-W data, the correlation pattern mirrors that of the GOES-E data. Following the correlation analysis, the numbers of input parameters for the GOES-E and GOES-W datasets are 43 and 48, respectively.
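
The lagged rank-correlation screening can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, lags are counted in sample steps, ties are not averaged when ranking, and the data are synthetic (a target that responds monotonically to the driver three steps later).

```python
import numpy as np


def ordinal_ranks(values):
    """Ranks 0..n-1; ties are not averaged in this simplified sketch."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def lagged_spearman(driver, target, lag):
    """Spearman correlation between driver[t - lag] and target[t]."""
    driver = np.asarray(driver, dtype=float)
    target = np.asarray(target, dtype=float)
    if lag > 0:
        driver, target = driver[:-lag], target[lag:]
    ok = ~(np.isnan(driver) | np.isnan(target))  # skip missing samples
    return float(np.corrcoef(ordinal_ranks(driver[ok]),
                             ordinal_ranks(target[ok]))[0, 1])


# Synthetic data (hypothetical): the target follows the driver 3 steps later.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.full_like(x, np.nan)
y[3:] = np.exp(x[:-3])

profile = {lag: lagged_spearman(x, y, lag) for lag in range(6)}
best_lag = max(profile, key=profile.get)
selected = [lag for lag, rho in profile.items() if rho > 0.3]  # 0.3 threshold from the text
```

A lag is kept as an input feature only when its coefficient exceeds 0.3, which mirrors the selection rule described above.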

Method

Algorithm

A GA-RF model was used in this study for large-scale missing data imputation. The RF algorithm has been efficiently applied in classification and regression tasks because of its superior training speed and good generalizability²²,²³,²⁴. The GA is a form of inductive learning, providing an alternative to conventional optimization methods based on adaptive search techniques. It excels at identifying near-optimal solutions for complex optimization problems²⁵. The optimal solutions are achieved upon the completion of iterations.

The overall algorithm flowchart of the GA-RF is shown in Fig. 4. In this model, the input parameters of > 2 MeV electron integral flux are sequences from Table 2, and the other input parameters are selected on the basis of correlation analysis. The output parameter is > 2 MeV electron integral flux for the target satellites. This algorithm aims to optimize the parameters of the RF algorithm using GA techniques. This is achieved by first flattening all decision tree parameters into a chromosome, with the number of trees (trees), maximum tree depth (depth), and minimum number of samples required to be at a leaf node (leaf) determining the length of each chromosome in the GA population. The initial population size is set to 20, and the maximum evolution generation is set to 100. The main steps can be summarized as follows:

Initialization: An initial population of 20 chromosomes is generated using binary encoding on the basis of the parameters of the decision trees (trees, depth, and leaf).
Fitness function: The fitness function is defined as the inverse of the root mean square error (RMSE) value obtained from training the RF model.
Genetic operations:
- Selection: A roulette wheel selection method is used, where chromosomes with higher fitness have a greater chance of being selected.
- Crossover: A two-point crossover method is employed, where two chromosomes exchange segments. The crossover probability is set to 0.7.
- Mutation: Mutation is applied by randomly flipping bits in a chromosome. The mutation probability is set to 0.01.
Evolution: The GA iterates through selection, crossover, and mutation operations to evolve the population. The goal is to find the chromosome with the best fitness value.
Optimal parameters: The chromosome with the highest fitness value represents the optimal set of parameters for the RF model.
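
The steps above can be sketched in a few lines of Python. The population size (20), number of generations (100), crossover probability (0.7), mutation probability (0.01), roulette wheel selection, two-point crossover and bit-flip mutation follow the text; the bit widths, the parameter ranges, the function names and the toy fitness function are assumptions made for illustration, since the paper does not state the encoding details.

```python
import random

# (name, number of bits, minimum value): assumed encoding, not from the paper.
FIELDS = [("trees", 8, 10), ("depth", 5, 1), ("leaf", 4, 1)]
N_BITS = sum(bits for _, bits, _ in FIELDS)


def decode(chrom):
    """Map a binary chromosome to the three RF hyperparameters."""
    params, pos = {}, 0
    for name, bits, minimum in FIELDS:
        params[name] = minimum + int("".join(map(str, chrom[pos:pos + bits])), 2)
        pos += bits
    return params


def roulette(pop, fits, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    return rng.choices(pop, weights=fits, k=1)[0]


def two_point_crossover(a, b, rng):
    """Exchange the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, N_BITS), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(chrom, rng, p_mut):
    """Flip each bit independently with probability p_mut."""
    return [bit ^ 1 if rng.random() < p_mut else bit for bit in chrom]


def run_ga(fitness, pop_size=20, generations=100, p_cross=0.7, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        fits = [fitness(decode(c)) for c in pop]
        for c, f in zip(pop, fits):  # remember the best chromosome seen so far
            if f > best_fit:
                best, best_fit = c[:], f
        children = []
        while len(children) < pop_size:
            a, b = roulette(pop, fits, rng), roulette(pop, fits, rng)
            if rng.random() < p_cross:
                a, b = two_point_crossover(a, b, rng)
            children += [mutate(a, rng, p_mut), mutate(b, rng, p_mut)]
        pop = children[:pop_size]
    return decode(best), best_fit


def toy_fitness(p):
    """Stand-in for 1/RMSE of a trained RF (hypothetical error surface)."""
    rmse = (0.2 + abs(p["trees"] - 120) / 500
            + abs(p["depth"] - 12) / 50 + abs(p["leaf"] - 2) / 50)
    return 1.0 / rmse


best_params, best_fitness = run_ga(toy_fitness)
```

In an actual run, `toy_fitness` would be replaced by a function that trains a random forest with the decoded values and returns the inverse of its RMSE. With scikit-learn, for example, the three values would map onto the `n_estimators`, `max_depth` and `min_samples_leaf` arguments of `RandomForestRegressor`; the paper does not say which library was used.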

Furthermore, other machine learning algorithms, including back propagation (BP), long short-term memory (LSTM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), and random forest (RF), are compared with the GA-RF algorithm. The parameter settings for these algorithms are detailed in Table 3.

Table 3 BP, LSTM, ELM, XGBoost, and RF model settings.

Evaluation indicators

Four evaluation indicators, including the linear correlation coefficient (LC), prediction efficiency (PE), mean absolute error (MAE) and root mean squared error (RMSE), are introduced for quantification when assessing and comparing the predictive performance of the models. They are defined as follows:

\(LC=\frac{\sum\limits_{i=1}^{n}\left(t_i-\bar{t}\right)\left(T_i-\bar{T}\right)}{\sqrt{\sum\limits_{i=1}^{n}\left(t_i-\bar{t}\right)^2\sum\limits_{i=1}^{n}\left(T_i-\bar{T}\right)^2}}\)

\(PE=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{t_i} - {T_i}} \right)}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{\left( {{T_i} - \bar {T}} \right)}^2}} }}\)

\(MAE=\frac{{\sum\limits_{{i=1}}^{n} {|{t_i} - {T_i}|} }}{n}\)

\(RMSE=\sqrt{\frac{\sum\limits_{i=1}^{n}\left(t_i-T_i\right)^2}{n}}\)

where \({t_i}\) is the forecast value, \({T_i}\) is the observed value, \(\bar {t}\) is the mean of the forecast values, \(\bar {T}\) is the mean of the observed values, and n is the number of samples. Each of these indicators evaluates the model from a different perspective. The LC denotes the strength of the linear relationship between the forecast and observed values. The PE measures the prediction accuracy relative to the variance of the observations. The closer the LC and PE values are to 1, the better. The MAE and RMSE reflect the level of fit between the forecast and observed values. The smaller the values are, the better.
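
As a worked illustration, the four indicators can be computed with NumPy as sketched below, using the standard definitions in which the MAE and RMSE are taken against the paired observations \({T_i}\). The function name and the four-point example are hypothetical.

```python
import numpy as np


def evaluation_indicators(t, T):
    """Return LC, PE, MAE and RMSE for forecasts t against observations T."""
    t = np.asarray(t, dtype=float)
    T = np.asarray(T, dtype=float)
    dt, dT = t - t.mean(), T - T.mean()
    lc = np.sum(dt * dT) / np.sqrt(np.sum(dt ** 2) * np.sum(dT ** 2))
    pe = 1.0 - np.sum((t - T) ** 2) / np.sum(dT ** 2)
    mae = np.mean(np.abs(t - T))
    rmse = np.sqrt(np.mean((t - T) ** 2))
    return {"LC": float(lc), "PE": float(pe), "MAE": float(mae), "RMSE": float(rmse)}


# Hypothetical example: forecasts offset from the observations by a constant 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
forecast = [1.5, 2.5, 3.5, 4.5]
scores = evaluation_indicators(forecast, observed)
# scores: LC = 1.0, PE = 0.8, MAE = 0.5, RMSE = 0.5
```

The example shows why the indicators are reported together: a constant bias leaves the LC at 1 but lowers the PE and raises the MAE and RMSE.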

Results and analysis

Imputation data evaluation

The evolution of the RMSE over the training epochs is shown in Fig. 5. The model with the best performance was achieved after approximately 83 training epochs, with an RMSE of 0.2045.

The evaluation indicators listed in Table 4 show the 5-fold cross-validation performance of BP, LSTM, RF, ELM, XGBoost and GA-RF. Across the four evaluation metrics (RMSE, MAE, PE, and LC), the GA-RF model exhibited the best performance, with the smallest RMSE and MAE and the largest PE and LC. Specifically, the RMSE, MAE, PE, and LC for the GA-RF model were 0.3872, 0.2084, 0.9199, and 0.8140, respectively, for the GOES-E satellite data and 0.4197, 0.2474, 0.9290, and 0.8595, respectively, for the GOES-W satellite data.

Table 4 Evaluation indicators of the BP, LSTM, RF, ELM, XGBoost and GA-RF models for the GOES-E/W satellites.

Figure 6 presents scatter density plots comparing the imputed data and the detection data for the total dataset, training set, validation set, and test set for both the GOES-E and the GOES-W satellites. Most of the data points lie along the 1:1 diagonal line, indicating that the GA-RF model effectively imputed the missing data in this study.

Imputation performance between the GA-RF and interpolation methods

Figure 7 shows the imputation results for missing values of the > 2 MeV electron integral flux from the GOES-E satellites, comparing commonly used imputation methods, namely cubic spline interpolation and linear interpolation, with the GA-RF algorithm presented in this paper. The data obtained from cubic spline interpolation and linear interpolation are smooth extensions that resemble the data mean. These methods fail to capture the variations driven by space weather, which matter at 5-minute resolution. As a result, the interpolated values are not physically realistic, which limits their utility in this study.
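
The limitation of interpolation across a long gap can be reproduced on synthetic data. The sketch below covers linear interpolation only and uses a hypothetical log-flux series (a flat baseline with a one-day enhancement that falls entirely inside the gap); the function name and all numerical values are illustrative assumptions, not GOES data. A cubic spline, for example SciPy's `CubicSpline`, could be substituted for `np.interp` in the same way.

```python
import numpy as np


def linear_fill(time, flux):
    """Fill NaN gaps by linear interpolation between the nearest valid points."""
    flux = np.asarray(flux, dtype=float)
    valid = ~np.isnan(flux)
    filled = flux.copy()
    filled[~valid] = np.interp(time[~valid], time[valid], flux[valid])
    return filled


# Three days at 5-min cadence; time is in minutes.
time = np.arange(0, 3 * 288) * 5.0
# Hypothetical log10 flux: baseline of 2.0 with an enhancement centred on day 2.
truth = 2.0 + 1.5 * np.exp(-((time - 2160.0) / 360.0) ** 2)

gappy = truth.copy()
gap = (time >= 1440) & (time < 2880)  # the whole second day is missing
gappy[gap] = np.nan

filled = linear_fill(time, gappy)
max_error = float(np.max(np.abs(filled - truth)[gap]))
```

Because the filled values depend only on the two gap edges, the enhancement inside the gap is lost entirely, and the largest error is close to the full amplitude of the event.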

Results analysis for the imputation of missing values

In this section, different time series of data from GOES-12 are selected as new test sets, specifically from November 2007 to December 2007, December 2008 to February 2009, November 2009 to January 2010, and March 2010 to April 2010. These periods are excluded from the time ranges listed in Table 1. The data imputation performance of GA-RF is presented in Table 5, where it is compared with that of cubic spline interpolation and linear interpolation. The results demonstrate that GA-RF outperforms the other methods in terms of imputation accuracy.

Table 5 Evaluation indicators for GA-RF, cubic spline interpolation and linear interpolation.

The resulting imputation data of different time series selected from Table 2 compared with the detection data are shown in Figs. 8, 9, 10 and 11. The figures show that the model accurately captures rapid increases and decreases in high-energy electron fluxes, with the overall predicted values closely aligning with the satellite-detected data.

Conclusions

Missing data are a significant issue that can impact the performance of machine learning models for predicting high-energy electron fluxes on the basis of satellite data. This study focuses on high-energy electron fluxes from GOES satellites, targeting large-scale missing data months for the imputation process. The GA-RF model, along with other machine learning models, was trained to impute missing data when large-scale gaps occurred in satellite detection data within a given month.

The GA-RF model demonstrated the best overall performance in model evaluation. The RMSE, MAE, PE, and LC for the GA-RF model were 0.3872, 0.2084, 0.9199, and 0.8140, respectively, for the GOES-E satellite data and 0.4197, 0.2474, 0.9290, and 0.8595, respectively, for the GOES-W satellite data. Additionally, we compared the results of GA-RF with those of cubic spline interpolation and linear interpolation. The imputed data from the GA-RF model effectively captured electron flux variations, with the imputed values closely matching the satellite detection data. Specifically, the RMSE, MAE, PE, and LC for the GA-RF model were 0.3983, 0.1938, 0.8275, and 0.9259, respectively, for the GOES-E satellite data and 0.4082, 0.2625, 0.8814, and 0.9395, respectively, for the GOES-W satellite data.

Building on the imputation process presented in this study, future studies will focus on developing a prediction model for relativistic electrons at GEO.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Horne, R. B. et al. The satellite risk prediction and radiation forecast system (SaRIF). Space Weather. 19, e2021SW002823. https://doi.org/10.1029/2021SW002823 (2021).
2. Iucci, N. et al. Space weather conditions and spacecraft anomalies in different orbits. Space Weather. 3, S01001, 1–16 (2005).
3. Benoît, T. et al. Presentation and validation of the internal charging risk forecast in the PAGER framework. Advances in Space Research. 72 (9), 3666–3676. https://doi.org/10.1016/j.asr.2023.07.047 (2023).
4. Emmanuel, T. et al. A survey on missing data in machine learning. J. Big Data. 8, 140. https://doi.org/10.1186/s40537-021-00516-9 (2021).
5. Wang, S., Li, W., Hou, S., Guan, J. & Yao, J. STA-GAN: A spatio-temporal attention generative adversarial network for missing value imputation in satellite data. Remote Sens. 15, 88. https://doi.org/10.3390/rs15010088 (2023).
6. Zhang, H. et al. A prediction model of relativistic electrons at geostationary orbit using the EMD-LSTM network and geomagnetic indices. Space Weather. 20 (3), 1–15 (2022).
7. Cui, R. et al. Machine learning for the relationship of high-energy electron flux between GEO and MEO with application to missing values imputation for Beidou MEO data. Open Astronomy. 30 (1), 62–72. https://doi.org/10.1515/astro-2021-0008 (2021).
8. Zhang, H. et al. Relativistic electron flux prediction at geosynchronous orbit based on the neural network and the quantile regression method. Space Weather. 18 (9), e2020SW002445. https://doi.org/10.1029/2020SW002445 (2020).
9. Wei, L. et al. Quantitative prediction of high-energy electron integral flux at geostationary orbit based on deep learning. Space Weather. 16 (7), 903–916. https://doi.org/10.1029/2018SW001829 (2018).
10. Ling, A. G., Ginet, G. P., Hilmer, R. V. & Perry, K. L. A neural network-based geosynchronous relativistic electron flux forecasting model. Space Weather. 8 (9), S09003. https://doi.org/10.1029/2010SW000576 (2010).
11. Smirnov, A. G. et al. Medium energy electron flux in earth’s outer radiation belt (MERLIN): a machine learning model. Space Weather. 18, e2020SW002532. https://doi.org/10.1029/2020SW002532 (2020).
12. Chu, X. et al. Relativistic electron model in the outer radiation belt using a neural network approach. Space Weather. 19, e2021SW002808. https://doi.org/10.1029/2021SW002808 (2021).
13. Li, X. et al. Energetic electrons, 50 keV to 6 MeV, at geosynchronous orbit: their responses to solar wind variations. Space Weather. 3, S04001. https://doi.org/10.1029/2004SW000105 (2005).
14. Li, L. Y., Cao, J. B. & Zhou, G. C. Relation between the variation of geomagnetospheric relativistic electron flux and storm/substorm. Chin. J. Geophysics- Chin. Ed. 49 (1), 9–15 (2006).
15. Ma, D. et al. Modeling the dynamic variability of sub-relativistic outer radiation belt electron fluxes using machine learning. Space Weather. 20, e2022SW003079. https://doi.org/10.1029/2022SW003079 (2022).
16. Rilling, G., Flandrin, P. & Goncalves, P. On empirical mode decomposition and its algorithms. Proc. IEEE-EURASIP Workshop Nonlinear Signal. Image Process. NSIP-03. 3 (3), 8–11 (2003).
17. Rycroft, M., Nicoll, K., Aplin, K. & Harrison, R. Recent advances in global electric circuit coupling between the space environment and the troposphere. J. Atmos. Solar Terr. Phys. 90–91. https://doi.org/10.1016/j.jastp.2012.03.015 (2012).
18. Sain, S. R. & Stephan, R. The nature of statistical learning theory. Technometrics 38 (4), 409–422. https://doi.org/10.1080/00401706.1996.10484565 (1997).
19. Sakaguchi, K. et al. Relativistic electron flux forecast at geostationary orbit using Kalman filter based on a multivariate autoregressive model. Space Weather. 11 (2), 79–89. https://doi.org/10.1002/swe.20020 (2013).
20. Seppälä, A., Matthes, K., Randall, C. & Mironova, I. What is the solar influence on climate? Overview of activities during CAWSES-II. Progress Earth Planet. Sci. 1, 24. https://doi.org/10.1186/s40645-014-0024-3 (2014).
21. Simms, L. et al. A distributed lag autoregressive model of geostationary relativistic electron fluxes: comparing the influences of waves, seed and source electrons and solar wind inputs. J. Geophys. Research: Space Phys. 123 (5), 3646–3671. https://doi.org/10.1029/2017ja025002 (2018).
22. Fang, X. et al. Forecasting incidence of infectious diarrhea using random forest in Jiangsu province, China. BMC Infect. Dis. 20, 1 (2020).
23. Jamei, M. G. M., Ahmadianfar, I. & Pourrajab, R. Prediction of nanofluids viscosity using random forest (RF) approach. Chemom. Intell. Lab. Syst. 201, 104010 (2020).
24. Breiman, L. Random forests. Mach. Learn. 45 (1), 5–32 (2001).
25. Assiri, A. Anomaly classification using genetic algorithm-based random forest model for network attack detection. Computers Mater. Continua. 66 (1), 767–778. https://doi.org/10.32604/cmc.2020.013813 (2021).

Author information

Authors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Meihua Fang, Dingyi Song, Biao Wang, Mengyun He & Yukuan Ma
China Helicopter Research and Development Institute, Jingdezhen, 333000, China
JianFei Chen

Contributions

Meihua Fang primarily contributed to the conceptualization and methodology of the manuscript. Dingyi Song was responsible for data processing and algorithm development. JianFei Chen mainly handled data collection and algorithm design. Biao Wang was in charge of data preprocessing. Mengyun He managed the data. Yukuan Ma conducted the literature investigation.

Corresponding author

Correspondence to Meihua Fang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Fang, M., Song, D., Chen, J. et al. Missing value imputation for > 2 MeV electron fluxes in geostationary orbit based on GA-RF model. Sci Rep 15, 10427 (2025). https://doi.org/10.1038/s41598-025-87082-9

Received: 17 July 2024
Accepted: 15 January 2025
Published: 26 March 2025
Version of record: 26 March 2025
DOI: https://doi.org/10.1038/s41598-025-87082-9
