Introduction

The rapid global spread of the novel coronavirus in 2019 triggered significant social and economic crises, prompting countries worldwide to implement emergency measures to contain the virus. This situation underscored the urgent need to establish a robust and quantifiable analytical system to mitigate the adverse impacts of future infectious diseases on society. Early investigations into the COVID-19 outbreak in China addressed this need by analyzing initial data1,2,3. As the pandemic progressed, Philip Nadler et al. first identified the limitations of single-model epidemiological frameworks4, which subsequently led to the application of data assimilation techniques to enhance the study of COVID-19 transmission.

The integration of data assimilation methods into infectious disease modeling has become increasingly crucial. Even with more flexible forecasting models, limitations in their predictive capabilities persist due to a lack of dynamism. The time-dependent Susceptible-Exposed-Asymptomatic-Infected-Quarantined-Removed (SEAIQR) model, an extension of the Susceptible-Infective-Removed (SIR) model5, incorporates temporal trends into epidemic progression by categorizing the population into multiple compartments to capture disease transmission dynamics. Although the inclusion of dynamic contact rate parameters improves predictive accuracy over traditional models, real-world scenarios often involve complex patterns of population interactions and stochastic epidemic control measures, resulting in more intricate and nonlinear disease evolution, thereby compromising predictive accuracy. Data assimilation techniques can rectify such issues by adjusting predictive data. Rhodes CJ et al. were pioneers in applying data assimilation methods to the SIR model for the study of influenza and other diseases6,7. Nonetheless, these methods are typically constrained to assimilating one-dimensional observational data, such as daily trend shifts, into the model. This limitation necessitates a more sophisticated data assimilation approach that can incorporate temporal trend variations over extended periods into predictive models.

The Ensemble Kalman Filter (EnKF) method has demonstrated superior performance in assimilating data for nonlinear systems8. It represents the system state distribution through a set of samples, and classical EnKF approximates the covariance matrix using ensemble-based stochastic updating methods. While this random update mechanism is effective for certain problems, it falls short of meeting the higher accuracy requirements in high-dimensional nonlinear scenarios. To address this shortcoming, Papadakis9 proposed the Weighted Ensemble Kalman Filter (WENKF) method; in the context of infectious disease forecasting, however, it has not sufficiently aligned with dynamic real-world conditions. In contrast, Lal R10 introduced time-varying elements into epidemic prediction models and employed a Kalman filter with damping coefficients for data assimilation. This approach, while innovative, lacks the flexibility and stability necessary for handling complex epidemic dynamics. In this study, we propose a novel modification: a flexible weighting function designed to enhance the stability and fidelity of the data assimilation process. This work integrates WENKF with a hybrid data assimilation framework, combining real-time EnKF with the K-Nearest Neighbors (KNN) method. Unlike previous time-varying data assimilation approaches in infectious disease models4, this study is the first to explore real-time EnKF for data assimilation in such models, leveraging the high coupling between EnKF and the time-varying data assimilation framework. Furthermore, by incorporating KNN, the hybrid approach addresses issues related to ensemble divergence and sample validity during stochastic updates. The resampling of the weighting function improves the alignment of the ensemble with actual epidemic dynamics. In doing so, the study introduces epidemic dynamic adjustment patterns into the ensemble updating process of the EnKF, proposing a data assimilation method informed by dynamic adjustments based on empirical data.

Although existing methods have shown promise in specific contexts, they are generally limited by the imprecision in selecting time-varying parameters and their inability to adapt to the complexities of real-world conditions. The hybrid approach presented in this study overcomes these limitations by offering a more flexible and dynamic data assimilation framework. Future research should explore its adaptability across different regions or infectious diseases, particularly under conditions of initial parameter fluctuation.

In the data analysis section of this study, a comparative evaluation was conducted between the fundamental time-dependent SEAIQR model and traditional data assimilation techniques. The results demonstrate that integrating real-time EnKF with KNN significantly improves the predictive accuracy of the assimilation model, highlighting the efficacy of the hybrid approach in epidemic forecasting.

Data sources

The epidemic data were sourced from the Infectious Disease Surveillance System within the “China Disease Control and Prevention Information System,” covering real-time epidemic monitoring in Xi’an from December 9, 2021, to January 8, 2022.

Time-dependent SEAIQR model

The time-dependent SEAIQR model11,12,13 incorporates multiple state categories, allowing for a more precise representation of the disease transmission dynamics. It extends the SEIR model with two additional compartments: asymptomatic infected individuals (A) and quarantined individuals (Q). The model also incorporates isolation and recovery states, as well as time-varying factors related to actual epidemic control measures. The system of differential equations for this model is given by Eq. 1:

$$\begin{aligned} \begin{aligned}&\frac{dS}{dt}=\frac{-\left[ \beta (t)+q(1-\beta (t))\right] S(I+\theta E+kA)}{N}+\lambda Sq-\nu S+hR\\&\frac{dE}{dt}=\frac{\beta (t)(1-q)S(I+\theta E+kA)}{N}-\sigma E\\&\frac{dI}{dt}=(1-p)\sigma E-(\delta _{I}+\alpha _{D}+\gamma _{I})I&\\&\frac{dR}{dt}=\gamma _{I}I+\gamma _{A}A+\gamma _{H}H+\nu S-hR+\alpha _{D}(I+A+H)\\&\frac{dA}{dt}=p\sigma E-(\delta _{I}+\alpha _{D}+\gamma _{A})A\\&\frac{dSq}{dt}=\frac{q(1-\beta (t))S(I+\theta E+kA)}{N}-\lambda Sq\\&\frac{dEq}{dt}=\frac{\beta (t)qS(I+\theta E+kA)}{N}-\delta _{q}Eq\\&\frac{dH}{dt}=\delta _{I}(I+A)+\delta _{q}Eq-(\alpha _{D}+\gamma _{H})H\\ \end{aligned} \end{aligned}$$
(1)

Eq. 1 simulates the evolution of the key quantities as a time-series model. The initial conditions and parameter settings used in Eq. 1 are provided in Table 1.

Table 1 Parameter settings in the main analysis13.
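As an illustration of how Eq. 1 can be stepped forward in time, the following minimal sketch evaluates the right-hand side of the system and applies one explicit Euler step. The parameter values and the initial state are hypothetical placeholders, not the values of Table 1; the function names, the dictionary keys, and the choice of N as the sum of all compartments are likewise assumptions made only for this example.

```python
# Minimal sketch of Eq. 1. All numbers below are hypothetical placeholders,
# NOT the settings of Table 1.
EXAMPLE_PARAMS = {
    "q": 0.2, "theta": 0.5, "k": 0.6, "lam": 0.07, "nu": 0.001, "h": 0.0005,
    "sigma": 0.2, "p": 0.3, "delta_I": 0.1, "delta_q": 0.1, "alpha_D": 0.001,
    "gamma_I": 0.05, "gamma_A": 0.07, "gamma_H": 0.08,
}
# State order: S, E, I, R, A, Sq, Eq, H (hypothetical initial values)
EXAMPLE_STATE = [9900.0, 40.0, 10.0, 0.0, 5.0, 30.0, 10.0, 5.0]


def seaiqr_rhs(state, beta_t, par):
    """Return the eight time derivatives of Eq. 1 for contact rate beta_t."""
    S, E, I, R, A, Sq, Eq, H = state
    N = sum(state)  # assumption: N is the total over all compartments
    q = par["q"]
    force = S * (I + par["theta"] * E + par["k"] * A) / N
    dS = (-(beta_t + q * (1 - beta_t)) * force
          + par["lam"] * Sq - par["nu"] * S + par["h"] * R)
    dE = beta_t * (1 - q) * force - par["sigma"] * E
    dI = ((1 - par["p"]) * par["sigma"] * E
          - (par["delta_I"] + par["alpha_D"] + par["gamma_I"]) * I)
    dR = (par["gamma_I"] * I + par["gamma_A"] * A + par["gamma_H"] * H
          + par["nu"] * S - par["h"] * R + par["alpha_D"] * (I + A + H))
    dA = (par["p"] * par["sigma"] * E
          - (par["delta_I"] + par["alpha_D"] + par["gamma_A"]) * A)
    dSq = q * (1 - beta_t) * force - par["lam"] * Sq
    dEq = beta_t * q * force - par["delta_q"] * Eq
    dH = (par["delta_I"] * (I + A) + par["delta_q"] * Eq
          - (par["alpha_D"] + par["gamma_H"]) * H)
    return [dS, dE, dI, dR, dA, dSq, dEq, dH]


def euler_step(state, beta_t, par, dt=1.0):
    """Advance the state by one explicit Euler step of length dt (in days)."""
    deriv = seaiqr_rhs(state, beta_t, par)
    return [x + dt * dx for x, dx in zip(state, deriv)]
```

Because every outflow in Eq. 1 reappears as an inflow elsewhere, the eight derivatives sum to zero, which gives a quick consistency check on any implementation. An explicit Euler step is used here only for brevity; a higher-order integrator may be preferable in practice.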

Data assimilation scheme for the time-dependent SEAIQR model

Initial data processing

To account for errors in epidemic observations and to better simulate real epidemic data, reasonable perturbations are introduced to the initial epidemic observation data set W to reflect factors such as nucleic acid testing errors, thereby aligning the data more closely with actual values. The processed observation data is then used as the true values. The false positive rate of nucleic acid testing in China during this timeframe was \(0.23\%\)12. In this study, nucleic acid testing errors are incorporated into the observation error covariance matrix \(R_{w}\) and the observation operator H. The initial forecast error covariance matrix P is based on the initial error parameters derived from using only the SEAIQR model13 during this epidemic period.

EnKF and KNN methods

Ensemble Kalman Filtering (EnKF) is a data assimilation technique used to estimate the state of a dynamic system from noisy and incomplete observations. Unlike traditional Kalman filters, which rely on a single estimate of the system state, EnKF uses an ensemble of model states to represent the uncertainty in the system. The method updates the ensemble by incorporating real-time observational data, thereby improving the model predictions. EnKF is widely used in fields such as weather forecasting, climate modeling, and epidemiology, where the system is governed by complex, nonlinear dynamics, and real-time observations are available. The process of classical EnKF is described by the following equations:

$$\begin{aligned} & \begin{aligned} X_{i,k+1}^a=X_{i,k+1}^f+K_{k+1}\left[ Y_{k+1}^o-H_{k+1}\left( X_{i,k+1}^f\right) \right] \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} & \begin{aligned} \overline{X}^a_{k+1}=\frac{1}{N}\sum _{i=1}^NX_{i,k+1}^a \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} & \begin{aligned} K_{k+1}=P_{k+1}^fH^T\left( HP_{k+1}^fH^T+R_k\right) ^{-1} \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} & \begin{aligned} P_{k+1}^f=\frac{1}{N-1}\sum _{i=1}^N\Big (X_{i,k+1}^f-\overline{X}_{k+1}^a\Big )\Big (X_{i,k+1}^f-\overline{X}_{k+1}^a\Big )^T \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} & \begin{aligned} HP_{k+1}^fH^T=\frac{1}{N-1}\sum _{i=1}^N\left[ H\left( X_{i,k+1}^f\right) -H\left( \overline{X}_{k+1}^f\right) \right] \left[ H\left( X_{i,k+1}^f\right) -H\left( \overline{X}_{k+1}^f\right) \right] ^T \end{aligned} \end{aligned}$$
(6)

Eq. 2 updates each forecast ensemble member with the observation to obtain the corresponding analysis member, and Eq. 3 averages the analysis members. Eqs. 4 to 6 define the Kalman gain and the ensemble covariances it requires. The EnKF achieves optimized prediction by alternately iterating through prediction and update steps, coupling the observations with the predicted values.
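The update in Eqs. 2 to 6 can be sketched as follows for the simplest case, in which a single component of the state is observed directly. This is a minimal illustration under stated assumptions rather than the implementation used in this study: the function name `enkf_update`, the scalar observation, and the optional observation perturbation are choices made for this example, and the ensemble covariance is centred on the forecast mean, as in Eq. 6.

```python
def enkf_update(forecast, y_obs, obs_index, r_var, rng=None):
    """One EnKF analysis step for a scalar observation of one state component.

    forecast  : list of ensemble members, each a list of floats (X_i^f)
    y_obs     : observed value (Y^o)
    obs_index : index of the observed component (plays the role of H)
    r_var     : observation error variance (R)
    rng       : optional random.Random; if given, each member sees a
                perturbed observation (stochastic EnKF variant)
    """
    n = len(forecast)
    dim = len(forecast[0])
    mean_f = [sum(m[j] for m in forecast) / n for j in range(dim)]
    hx = [m[obs_index] for m in forecast]  # H(X_i^f)
    hx_mean = mean_f[obs_index]            # H applied to the forecast mean

    # Eq. 6: H P^f H^T, a scalar in this setting
    hph = sum((h - hx_mean) ** 2 for h in hx) / (n - 1)
    # P^f H^T: covariance of each state component with the observed one
    ph = [
        sum((m[j] - mean_f[j]) * (h - hx_mean) for m, h in zip(forecast, hx))
        / (n - 1)
        for j in range(dim)
    ]
    # Eq. 4: Kalman gain, one entry per state component
    gain = [c / (hph + r_var) for c in ph]

    analysis = []
    for m, h in zip(forecast, hx):
        y = y_obs if rng is None else y_obs + rng.gauss(0.0, r_var ** 0.5)
        # Eq. 2: move the member towards the observation
        analysis.append([m[j] + gain[j] * (y - h) for j in range(dim)])

    # Eq. 3: analysis mean
    mean_a = [sum(m[j] for m in analysis) / n for j in range(dim)]
    return analysis, mean_a
```

With the three-member ensemble `[[10, 1], [12, 2], [14, 3]]`, an observation of 20 on the first component, and `r_var = 1`, the gain is (0.8, 0.4) and the analysis mean is (18.4, 5.2): the observed component moves most of the way towards the observation, and the unobserved component is adjusted through its ensemble covariance with the observed one.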

K-Nearest Neighbors is a simple yet effective machine learning algorithm commonly used for classification and regression tasks. Its fundamental idea is to make predictions or classifications based on the proximity of samples in the feature space. Specifically, if a sample’s nearest neighbors in the feature space predominantly belong to a particular class, then the sample is likely to belong to that class as well. The basic procedure is as follows, with the training and testing sets remaining in standard format:

$$\begin{aligned} & \begin{aligned} D=\left\{ \left( x_1,y_1\right) ,\left( x_2,y_2\right) ,\cdots ,\left( x_N,y_N\right) \right\} \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} & \begin{aligned} X^l=\begin{Bmatrix}x_1^l,x_2^l,\cdots ,x_N^l\end{Bmatrix} \end{aligned} \end{aligned}$$
(8)

Here, D represents the training set, and \(X^l\) denotes the test set. By selecting k nearest points near the test set, the degree of association is assessed using the following formula:

$$\begin{aligned} \begin{aligned} L_p\left( x_i,x_j\right) =\left( \sum _{l=1}^n\left| x_i^l-x_j^l\right| ^p\right) ^{\frac{1}{p}} \end{aligned} \end{aligned}$$
(9)
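As a brief illustration of Eqs. 7 to 9, the sketch below computes the \(L_p\) distance and classifies a query point by majority vote among its k nearest training samples. The function names and the toy data in the usage note are assumptions made for this example only.

```python
from collections import Counter


def lp_distance(x_i, x_j, p=2):
    """L_p distance of Eq. 9 between two feature vectors of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)


def knn_classify(train, query, k=3, p=2):
    """Majority vote among the k training samples closest to the query.

    train : list of (feature_vector, label) pairs, as in Eq. 7
    query : feature vector to classify
    """
    neighbours = sorted(
        train, key=lambda pair: lp_distance(pair[0], query, p)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

For example, with one-dimensional training points near 0 labelled 0 and points near 1 labelled 1, a query at 0.15 is assigned label 0 for k = 3. Setting p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.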

In this study, the classification conditions of the KNN method are integrated with actual epidemic trends, using metrics such as the number of close contacts and the scope of isolation as classification indicators. The classification indicator \(\alpha\) is derived from studies on the actual number of contacts and isolated individuals within the same social context14,15,16. \(\alpha\) is an empirical function encompassing response attributes, which is further derived from \(\alpha _0,\alpha _1,\dots ,\alpha _{t_{days}}\). Here, \(\alpha _i\) represents the response indicators observed under similar social contexts:

$$\begin{aligned} \begin{aligned} \alpha _i=\frac{I_i+A_i}{Sq_i},i=0,\cdots ,t_{days}. \end{aligned} \end{aligned}$$
(10)

Taking the number of days in the empirical observation period as the independent variable of the time-varying function \(\alpha\), and the empirical observation days as interpolation nodes, the Newton interpolation method is applied to \(\alpha _i\), resulting in the empirical function of the indicator parameters:

$$\begin{aligned} \begin{aligned} \alpha _{t_{days}}=f\left( t_{days}\right) =f\left( t_0\right) +f\left[ t_0,t_1\right] \left( t_{days}-t_0\right) +f\left[ t_0,t_1,t_2\right] \left( t_{days}-t_0\right) \left( t_{days}-t_1\right) \\+\cdots +f\left[ t_0,t_1,\cdots ,t_{n-1},t_n\right] \left( t_{days}-t_0\right) \cdots \left( t_{days}-t_{n-1}\right) \end{aligned} \end{aligned}$$
(11)

Differentiating the empirical function in Eq. 11 yields the response indicators:

$$\begin{aligned} \begin{aligned}&f^{\prime }\Big (t_{days}\Big )=f\Big [t_{0},t_{1}\Big ]+f\Big [t_{0},t_{1},t_{2}\Big ]\Big (2t_{days}-1\Big )+\cdots \\ &+f\Big [t_{0},t_{1},\cdots ,t_{n-1},t_{n}\Big ]\sum _{j=0}^{n-1}T_{j},(t_0=0,t_1=1)\\ &T_{j}=\prod _{i=0,i\ne j}^{n-1}\left( t_{days}-t_{i}\right) \end{aligned} \end{aligned}$$
(12)

This yields \(\alpha\),

$$\begin{aligned} \begin{aligned}&\overline{\alpha }=\frac{\sum _{k=0}^{n}\alpha _{k}}{n}\\ &\alpha =\frac{\overline{\alpha }+f^{\prime }\left( t_{days-1}\right) +f\left( t_{days-1}\right) }{2},\\ &f\left( t_{0}\right) =\alpha _{0} \end{aligned} \end{aligned}$$
(13)
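The interpolation behind Eqs. 10 to 12 can be sketched as follows: the divided differences give the coefficients of the Newton form in Eq. 11, and the derivative in Eq. 12 follows from applying the product rule to each product of node factors. The function names are hypothetical, and the sketch covers only the interpolation and its derivative, not the combination in Eq. 13.

```python
def response_indicator(infected, asymptomatic, quarantined_susceptible):
    """Eq. 10: alpha_i = (I_i + A_i) / Sq_i for one observation day."""
    return (infected + asymptomatic) / quarantined_susceptible


def divided_differences(t, f):
    """Return [f[t0], f[t0,t1], ..., f[t0,...,tn]] for nodes t and values f."""
    coef = list(f)
    n = len(t)
    for level in range(1, n):
        # Update from the end so lower-order differences are still available
        for i in range(n - 1, level - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (t[i] - t[i - level])
    return coef


def newton_eval(t, coef, x):
    """Evaluate the Newton form of Eq. 11 at x."""
    total, basis = 0.0, 1.0
    for m, c in enumerate(coef):
        total += c * basis
        basis *= x - t[m]  # basis becomes prod_{i<=m} (x - t_i)
    return total


def newton_derivative(t, coef, x):
    """Evaluate the derivative of the Newton form (Eq. 12) at x."""
    total = 0.0
    for m in range(1, len(coef)):
        # Product rule on prod_{i<m} (x - t_i): sum over the omitted factor j
        s = 0.0
        for j in range(m):
            term = 1.0
            for i in range(m):
                if i != j:
                    term *= x - t[i]
            s += term
        total += coef[m] * s
    return total
```

As a check, for nodes 0, 1, 2 with values 0, 1, 4 (that is, \(f(t)=t^2\)), the coefficients are 0, 1, 1, and evaluating at t = 3 returns f = 9 and f' = 6, in agreement with the term \(f[t_0,t_1,t_2](2t_{days}-1)\) of Eq. 12. High-degree polynomial interpolation over many daily nodes can oscillate, so the number of nodes is worth keeping moderate.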

Real-time data is classified based on the indicator \(\alpha\), with the first category having the following interval:

$$\begin{aligned} & \begin{aligned} p_{1\text {min}}=\alpha (t_{\text {days}})+c_{1} \end{aligned} \end{aligned}$$
(14)
$$\begin{aligned} & \begin{aligned} p_{1\text {max}}=\alpha (t_{\text {days}})+c_{2} \end{aligned} \end{aligned}$$
(15)

In Eqs. 14 and 15, \(p_{1\min }\) and \(p_{1\max }\) represent the time-varying weight ranges. This interval is used to generate a normal distribution for the training set \(X_1\) that satisfies the EnKF requirements.

$$\begin{aligned} & \begin{aligned} Var_{1}=\frac{\left( \begin{matrix}p_{1\min }-p_{1\max }\end{matrix}\right) ^{2}}{2} \end{aligned} \end{aligned}$$
(16)
$$\begin{aligned} & \begin{aligned} X_1\sim N\Bigg (\frac{p_{1\max }-p_{1\min }}{2},Var_1\Bigg ) \end{aligned} \end{aligned}$$
(17)

Similarly, the training set \(X_{2}\) for the second group, which covers the exclusion intervals, is generated as follows in Eqs. 18 to 21.

$$\begin{aligned} & \begin{aligned} p_{2\text {min}}=\alpha (t_{\text {days}})+c_{3} \end{aligned} \end{aligned}$$
(18)
$$\begin{aligned} & \begin{aligned} p_{2\text {max}}=\alpha (t_{\text {days}})+c_{4} \end{aligned} \end{aligned}$$
(19)
$$\begin{aligned} & \begin{aligned} Var_{2}=\frac{\left( \begin{matrix}p_{2\min }-p_{2\max }\end{matrix}\right) ^{2}}{2} \end{aligned} \end{aligned}$$
(20)
$$\begin{aligned} & \begin{aligned} X_2\sim N\Bigg (\frac{p_{2\max }-p_{2\min }}{2},Var_2\Bigg ) \end{aligned} \end{aligned}$$
(21)

In this context, the return value for the first category data is denoted as 1, while the return value for the second category data is denoted as 0. The values of \(c_1\), \(c_2\), \(c_3\) and \(c_4\) can be optimized through iterative experimentation. This approach facilitates the classification function for the real-time EnKF’s random update component. In this subsection, KNN and the time-varying parameter \(\alpha\) are used to obtain the real-time classification interval of the updated part of the real-time EnKF, so that the data assimilation process is preliminarily coupled with the KNN method.

A hybrid data assimilation method based on real-time Ensemble Kalman filtering and KNN

In practical scenarios, the number of individuals in isolation varies with changes in the number of close contacts and newly confirmed cases. At the analysis update step, this study introduces an ensemble processing component that adapts to this background, affecting both the forecast and analysis values according to their time-varying proportions. This adjustment aligns the assimilation process more closely with the actual problem context. In real-time EnKF, the random updates of the ensemble are replaced by time-varying updates. The stochastic update step of the EnKF yields N ensemble members:

$$\begin{aligned} \begin{aligned} X_{i,k+1}^a=\left\{ S_{i,k+1},Sq_{i,k+1},E_{i,k+1},Eq_{i,k+1},A_{i,k+1},I_{i,k+1},H_{i,k+1},R_{i,k+1}\right\} ,i=1,\cdots ,N \end{aligned} \end{aligned}$$
(22)

In Eq. 22, \(S_{i,k+1}\) denotes the number of susceptible individuals, \(Sq_{i,k+1}\) represents the isolated susceptible individuals, \(E_{i,k+1}\) corresponds to the exposed individuals, and \(Eq_{i,k+1}\) indicates the isolated exposed individuals. \(A_{i,k+1}\) refers to the asymptomatic cases, \(I_{i,k+1}\) accounts for the confirmed cases, \(H_{i,k+1}\) represents the hospitalized patients, and \(R_{i,k+1}\) signifies the recovered individuals. All these quantities are assessed at the time point \(k+1\) .

Within the context of social networks, changes in the number of confirmed cases and asymptomatic individuals can significantly affect isolation strategies. This impact is proportional to local response efficiency and public attention. In this analysis, the extent of government and societal response is embodied in parameter \(\lambda _{i,k+1}\), which encompasses both computational and observational errors:

$$\begin{aligned} \begin{aligned} \lambda _{i,k+1}=\frac{I_{i,k+1}+A_{i,k+1}}{Sq_{i,k+1}} \end{aligned} \end{aligned}$$
(23)

To integrate time-varying characteristics17 more appropriately into the data assimilation process, a time-varying parameter \(\beta _{k+1}\)11 is introduced. In the following expression, \(\beta _{0}\) is the initial infection rate at the onset of the epidemic, t denotes the number of days since the onset of the epidemic, and r signifies the response rate, so that \(\beta _{k+1}\) decays exponentially with the number of days since the outbreak:

$$\begin{aligned} \begin{aligned} \beta _{k+1}=\beta _0\times e^{-rt} \end{aligned} \end{aligned}$$
(24)

By coupling parameter \(\lambda _{i,k+1}\) with parameter \(\beta _{k+1}\) as a weighting function in the real-time EnKF, the computational error of the latter is mitigated, and its time-varying properties are enhanced:

$$\begin{aligned} \begin{aligned} W_{i,k+1}=\beta _{k+1}\frac{I_{i,k+1}+A_{i,k+1}}{Sq_{i,k+1}} \end{aligned} \end{aligned}$$
(25)

In the random update step, each random update \(X_{i,k+1}^a\) is assigned a specific weight \(W_{i,k+1}\), which serves to evaluate the correlation between the random sets and the actual control measures. The weights of the N random sets are dynamically categorized. Sets falling into the first category are retained, with the number of retained sets denoted as \(N_{new}\), while those in the second category are discarded. As a result, the updated sets are as follows:

$$\begin{aligned} \begin{aligned} X_{N_{new},k+1}^a=\left\{ X_{N_{new}}\in X_{1}\right\} \end{aligned} \end{aligned}$$
(26)
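The weighting and retention steps of Eqs. 24 to 26 can be illustrated with the short sketch below. It is a simplification under stated assumptions: the first-category interval of Eqs. 14 and 15 is applied directly as the retention test, standing in for the trained KNN classifier, and the function names, dictionary keys, and example constants are hypothetical.

```python
import math


def contact_rate(beta0, r, t):
    """Eq. 24: exponentially decaying contact rate after t days."""
    return beta0 * math.exp(-r * t)


def member_weight(member, beta_t):
    """Eq. 25: weight of one analysis member, beta * (I + A) / Sq."""
    return beta_t * (member["I"] + member["A"]) / member["Sq"]


def retain_first_category(ensemble, beta_t, alpha_t, c1, c2):
    """Keep members whose weight lies in [alpha_t + c1, alpha_t + c2].

    Simplifying assumption: the interval of Eqs. 14 and 15 is used directly
    in place of the KNN classifier. Returns (member, weight) pairs (Eq. 26).
    """
    low, high = alpha_t + c1, alpha_t + c2
    kept = []
    for member in ensemble:
        w = member_weight(member, beta_t)
        if low <= w <= high:
            kept.append((member, w))
    return kept
```

For instance, with a contact rate of 0.5, a member with I = 10, A = 10 and Sq = 100 has weight 0.1, whereas a member with I = 90, A = 10 and Sq = 100 has weight 0.5. With \(\alpha = 0.1\), \(c_1 = -0.05\) and \(c_2 = 0.05\), only the first member is retained.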

Subsequently, the retained set undergoes resampling, initiated by calculating the cumulative weight of \(X_{N_{new},k+1}^a\):

$$\begin{aligned} \begin{aligned} CDF_m=\sum _{j=1}^mW_{i,k+1}^j \end{aligned} \end{aligned}$$
(27)

We generate an array containing \(N_{new}\) elements, such that each element \(u_{N_{new}}\) satisfies the following conditions

$$\begin{aligned} \begin{aligned} u_{N_{new}}\sim [0,\frac{1}{N_{new}}) \end{aligned} \end{aligned}$$
(28)

By incrementing m, the first element of \(u_{N_{new}}\) that is greater than \(CDF_{m-1}\) but not greater than \(CDF_m\) is replaced with the corresponding element from the original array. This process is repeated until all elements are replaced. The resulting final array set is then averaged to obtain the optimized array \(\overline{X}^a_{k+1}\) and the prediction error covariance matrix \(P^{new}_{k+1}\):

$$\begin{aligned} \begin{aligned} \overline{X}^a_{k+1}=\frac{1}{N_{new}}\sum _{j=1}^{N_{new}}X_{j,k+1}^a \end{aligned} \end{aligned}$$
(29)

and

$$\begin{aligned} \begin{aligned} P_{k+1}^{new}=\frac{1}{N_{new}-1}\sum _{i=1}^{N}\Big (X_{i,k+1}^{f}-\overline{X}_{k+1}^{a}\Big )\Big (X_{i,k+1}^{f}-\overline{X}_{k+1}^{a}\Big )^{T} \end{aligned} \end{aligned}$$
(30)
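One common way to realise the resampling described by Eqs. 27 to 29 is systematic resampling, sketched below. Reading Eq. 28 as a single random offset drawn from \([0, 1/N_{new})\) and followed by equally spaced pointers is an assumption of this example, as are the normalisation of the weights to sum to one and the function names.

```python
import random


def systematic_resample(members, weights, start=None, rng=None):
    """Resample members in proportion to their weights.

    members : retained ensemble members
    weights : non-negative weights, one per member (normalised internally)
    start   : optional offset in [0, 1/n); drawn at random if omitted
    rng     : optional random.Random used to draw the offset
    """
    n = len(members)
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:  # Eq. 27: cumulative (normalised) weights
        running += w / total
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating-point round-off

    if start is None:
        rng = rng or random.Random()
        start = rng.uniform(0.0, 1.0 / n)  # Eq. 28

    picked, m = [], 0
    for j in range(n):
        pointer = start + j / n
        # Advance m until CDF_{m-1} < pointer <= CDF_m
        while m < n - 1 and pointer > cdf[m]:
            m += 1
        picked.append(members[m])
    return picked


def ensemble_mean(members):
    """Eq. 29: component-wise mean of the resampled members."""
    n = len(members)
    return [sum(x[j] for x in members) / n for j in range(len(members[0]))]
```

With equal weights, every member is selected exactly once, so the ensemble is unchanged. When one member carries all of the weight, it fills the whole resampled ensemble, which illustrates how resampling concentrates the ensemble on members that agree with the weighting criterion.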

Figs. 1 and 2 illustrate the overall process of the hybrid method. The data from the predictive model is first output and simultaneously fed into the EnKF. Key information such as the number of infections and isolations is used to extract weights. These extracted weights are then classified using a KNN model trained with periodic parameters, resulting in weighted sets. Resampling methods are then applied to generate an optimized ensemble by combining high-weight sets with low-weight ones. Finally, the optimized ensemble is used to update the parameters in the EnKF, providing an updated prediction covariance matrix and refined forecast data. This process enables more accurate real-time predictions of key information, such as the number of infections.

Fig. 1

A hybrid data assimilation method based on real-time EnKF and KNN. The classified target data, obtained through the introduction of classification criteria, is updated via resampling. This process selects high-weight particles while ensuring coverage of low-weight particles, thereby improving the performance of the new ensemble. The updated ensemble is then used as the input for the filtering update, facilitating the prediction process.

Fig. 2

The data is input into the data assimilation model, where it undergoes weight extraction, classification, and resampling, followed by an update within the assimilation model.

Results

Data assimilation results

Denote the real-time EnKF assimilation method mentioned in this paper as R_EnKF. The assimilation results using the real-time EnKF and the KNN-based hybrid data assimilation method for Xi’an from December 9, 2021, to January 8, 2022, are shown in Fig. 3. Figs. 4 and 5 present the data assimilation results of the real-time EnKF and the EnKF under the same conditions.

Fig. 3

Real-Time EnKF and KNN-Based Hybrid Data Assimilation Method for Xi’an (Dec 9, 2021–Jan 8, 2022) Achieving Improved Alignment Between Predicted and Observed COVID-19 Cases.

Fig. 4

The real-time EnKF data assimilation method for Xi’an (Dec 9, 2021–Jan 8, 2022) demonstrates a comparison of optimization performance, showing that the real-time EnKF method outperforms traditional EnKF but performs worse than the hybrid method.

Fig. 5

The EnKF data assimilation method for Xi’an (Dec 9, 2021–Jan 8, 2022) demonstrates a comparison of optimization performance, showing that the EnKF method underperforms relative to both the real-time EnKF and hybrid methods.

In this context, \(u\_analysis\) represents the analysis values from the hybrid data assimilation, \(u\_pred\) represents the predicted values from the SEAIQR model, \(u\_w\) represents the real-time epidemic observation data, and \(u\_real\) represents the “true values” constructed by incorporating observation data and accounting for observational errors, including nucleic acid testing inaccuracies. The experimental results demonstrate that data assimilation methods incorporating time-varying attributes achieve superior accuracy compared to conventional EnKF methods. Additionally, the hybrid approach outperforms the real-time EnKF method used alone. Its accuracy is contingent upon the precise selection of key parameters, so its potential is greatest in contexts with ample statistical samples.

Comparison of errors under different time-varying parameters

To assess the impact of the accuracy of time-varying parameters on data assimilation performance, experiments were conducted in this section on \(\alpha\) under varying perturbation trends. During the data assimilation process, the classification intervals of its weight vectors are influenced by the time-varying parameter \(\alpha\) . Different control intensities applied to the time-varying parameter can affect the error in data assimilation. The most realistic time-varying parameter yields the best processing results. When using the parameter \(\alpha\) that most accurately reflects practical control conditions, the errors of the updated array \(X^{a}_{N_{new},k+1}\) under high and low weights during the active control phase are shown in Fig. 6.

Fig. 6

The errors of the updated array under high and low weights corresponding to the more accurate \(\alpha\) .

When the time-varying parameter \(\alpha\) is increased, the weight classification will be disrupted, resulting in a weakened correlation between the errors carried by array \(X^{a}_{N_{new},k+1}\) and the time-varying weights. The errors will gradually detach from the context of social control, and the results will increasingly resemble those of the classical EnKF. This effect is illustrated in Fig. 7.

Fig. 7

The errors of the updated array under high and low weights corresponding to \(\alpha\) values that exceed the standard range.

Reducing \(\alpha\) will cause the initial experience parameters \(c_1\) , \(c_2\) , \(c_3\) and \(c_4\) to dominate the classification metrics. If the initial metrics are selected ambiguously, it will severely impact the data assimilation results, similarly causing the data update process to lose real-time filtering effectiveness and leading to increased errors, as shown in Fig. 8.

Fig. 8

The errors of the updated array under high and low weights corresponding to \(\alpha\) values that fall below the standard range.

In summary, the selection of \(\alpha\) is critical: values that are excessively high or low lead to increased errors and reduced accuracy of this hybrid data assimilation method.

Sensitivity analysis

The evaluation metrics are the mean absolute error (MAE) and the root mean squared error (RMSE), which are used to compare the assimilation effects of various data assimilation methods. They are calculated as follows:

$$\begin{aligned} & \begin{aligned} MAE=\frac{1}{n}\sum _{i=1}^{n}|y_{i}-\hat{y}_{i} | \end{aligned} \end{aligned}$$
(31)
$$\begin{aligned} & \begin{aligned} RMSE=\sqrt{\frac{\sum _{i=1}^n\left( y_i-\hat{y}_i\right) ^2}{n}} \end{aligned} \end{aligned}$$
(32)

In this regard, \(y_i\) represents the actual values, \(\hat{y}_{i}\) represents the predicted values, and n represents the number of forecast days.
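For reference, Eqs. 31 and 32 translate directly into the following short sketch. The function names are chosen for this example, and the sample values in the usage note are illustrative only, not results from this study.

```python
import math


def mae(actual, predicted):
    """Eq. 31: mean absolute error over n forecast days."""
    return sum(abs(y - yh) for y, yh in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Eq. 32: root mean squared error over n forecast days."""
    squared = sum((y - yh) ** 2 for y, yh in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

For actual values (1, 2, 3) and predicted values (1, 2, 5), the MAE is 2/3 and the RMSE is \(\sqrt{4/3}\). The RMSE is larger because squaring gives more influence to the single large error, which is why the two metrics are usually reported together.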

In practical applications, both MAE and RMSE quantify the discrepancy between predicted and actual values, and parameters can be systematically adjusted through methods such as gradient descent or the construction of inversion neural networks. MAE and RMSE provide quantitative guidance for optimizing model parameters, and through hyperparameter tuning techniques, they enable a focus on those parameters that significantly reduce error metrics. By evaluating the impact of different parameter configurations on model performance, key parameters can be adjusted to enhance predictive accuracy.

In Fig. 9, the vertical axis represents the error between data assimilation analysis values and observed values, while the horizontal axis denotes the days since the epidemic outbreak. Fig. 10 shows the comparison of different data assimilation methods during the period of enhanced government response.

Fig. 9

Compared to other methods, the hybrid data assimilation approach enhances the robustness and accuracy of infectious disease prediction.

Fig. 10

During the pandemic control period, the hybrid data assimilation approach yields more accurate predictive results.

The errors of different data assimilation methods are presented in Table 2, where “EnKF” refers to the EnKF method, “R_EnKF+KNN” denotes the hybrid method of real-time EnKF and KNN, “R_EnKF” represents the real-time EnKF method, “4DVAR+EnKF” is the hybrid method of Four-Dimensional Variational data assimilation and EnKF, “4DVAR+R_EnKF” denotes the hybrid method of Four-Dimensional Variational data assimilation and real-time EnKF, and time-dependent SEAIQR represents the basic predictive model.

Table 2 Comparison of prediction results, demonstrating a 7.97% reduction in prediction error with the hybrid method compared to traditional EnKF. This hybrid approach improves predictive accuracy by integrating real-time adjustments with pattern recognition techniques, thereby outperforming other data assimilation methods.

To assess the uncertainty associated with the performance metrics of the predictive model, we computed the 95% confidence intervals for the MAE and RMSE values. A confidence interval provides a range within which we expect the true performance metric to lie with 95% probability, offering a more reliable estimate than a single-point value. The inclusion of confidence intervals is particularly important in our study, as it accounts for the inherent variability in model predictions across different parameter settings. Using the chi-squared distribution, we derived the 95% confidence interval for the MAE as [0.971, 2.649] and for the RMSE as [1.310, 2.248]. This statistical measure reinforces the validity of our model’s improvements and helps quantify the uncertainty in the reported performance enhancements.

The analysis results show that the real-time EnKF method proposed in this paper improves the prediction effect of the basic model and the single data assimilation model to a certain extent. On this basis, the hybrid method combined with the KNN algorithm further improves the prediction effect of the model, and the error is smaller than that of other mixed data assimilation methods. Compared with the traditional EnKF method, the hybrid data assimilation method reduces the model prediction error by 7.97%.

Discussion

This study demonstrates the effectiveness of the hybrid EnKF-KNN method in improving the accuracy of epidemic predictions. The hybrid approach significantly outperforms traditional models in forecasting epidemic dynamics. Compared to single-method approaches such as Kalman Filtering (KF) and Ensemble Kalman Filtering (EnKF), the hybrid EnKF-KNN method exhibits superior precision and robustness. When compared to other hybrid methods, such as the combination of 4DVAR and EnKF, the EnKF-KNN method shows similar performance but with enhanced computational efficiency, which is especially critical in dealing with the complex and rapidly changing transmission dynamics observed during the COVID-19 pandemic. Unlike traditional models that often rely on static assumptions or simplified transmission hypotheses, the hybrid method leverages real-time adjustments and pattern recognition to more accurately capture the dynamic features of epidemic spread. This makes it a valuable tool for informing public health strategies in dynamic environments and highlights its substantial potential for application in the field of infectious disease forecasting.

In future research, we plan to integrate recent data on population mobility and social behavior to further enhance the accuracy of real-time predictions. Mobility data provides more precise spatiotemporal information, enabling the model to capture more dynamic transmission patterns, thereby improving the timeliness and accuracy of forecasts. Additionally, we aim to apply the hybrid data assimilation method to a variety of epidemic models to improve forecasting performance across different objectives. For instance, in addition to the existing SEAIQR model, we plan to extend it to models such as the SQIR model18 and the modified SEIR pandemic fractional-order model19, with the goal of addressing the unique transmission characteristics of different types of epidemics using diverse model structures.

With the increasing availability of multi-source observational data, we will also explore new numerical methods to optimize the coupling of different types of observational data20,21. By more effectively integrating data from multiple sources, we aim to enhance data utilization and generalizability, thus providing more reliable support for epidemic forecasting and control in various contexts. This approach not only improves the predictive accuracy of the models but also strengthens their adaptability to different regions and transmission patterns, further expanding their potential for real-world applications.

Conclusion

The hybrid data assimilation approach that combines real-time EnKF with KNN, as proposed in this study, demonstrates substantial effectiveness in refining data within infectious disease models, thereby enhancing their predictive accuracy to a notable extent. On the basis of these findings, the study provides a statistical projection indicating that the hybrid method is particularly effective in the context of epidemic control strategies tailored to specific regional characteristics, with its efficacy showing a positive correlation to the degree of regional specificity. Nevertheless, the method is not without its limitations, particularly in the precise selection of time-varying parameters. Moreover, there exists considerable potential for further investigation into the method's adaptability across various regions or infectious diseases, especially in scenarios where initial parameter fluctuations are present. It is anticipated that this study will serve as a catalyst for further scholarly research in this area.