Introduction

Viral hepatitis refers to liver inflammation caused by infection with one of five known viruses: hepatitis A, B, C, D, and E 1,2. This condition poses a significant global public health challenge, affecting billions worldwide and contributing to high rates of morbidity and mortality. Hepatitis A and E typically follow a self-limiting course with full recovery, whereas hepatitis B and C often progress to chronic infection and are associated with severe health outcomes. Historical records trace hepatitis back to ancient times, with documented outbreaks dating back 5000 years in China and descriptions of jaundice recorded by Hippocrates on the island of Thasos in the fifth century BC 3. Viral hepatitis causes over 1.4 million deaths annually 4. In a multicenter international study across 161 countries, the prevalence of hepatitis B virus (HBV) surface antigen (HBsAg) was reported at 3.61% 5. Despite global declines in the disease burden of HBV and hepatitis C virus (HCV) infections over the past three decades, HBV remains prevalent in China 6. Consequently, viral hepatitis has emerged as a top global health priority, prompting the implementation of extensive public health policies.

To effectively inform health policies aimed at preventing viral hepatitis, accurate prediction of its trends is paramount. Research in Iran has identified the Holt Exponential Smoothing (HES) model as highly accurate in forecasting HBV incidence7. However, comprehensive predictive studies for viral hepatitis remain limited. Existing literature predominantly focuses on clinical and virological factors, often overlooking environmental influences. For instance, a study in Spain demonstrated that each additional rainy day increased the risk of contracting hepatitis A two weeks later (IRR = 1.03, 95% CI = 1.01–1.05)8. Additionally, Chen et al.9 found a correlation between PM2.5 exposure and hepatitis progression to hepatocellular carcinoma, though research on the synergistic effects of pollutants with hepatitis infection remains scarce.

This study aims to investigate the epidemiological characteristics of viral hepatitis, develop predictive models using various methods, and explore the single, multiple, and interactive effects of pollutants. Specifically, our objectives are to: (a) construct and evaluate prediction models using diverse methodologies; (b) explore the single and multiple effects of pollutants across different groups; and (c) analyze pollutant interactions over lagged timeframes.

Patients and methods

Overview of the study area

Beijing, situated in northern China, covers a land area of 16,410.54 square kilometers. It is centrally located at approximately 116°20′ east longitude and 39°56′ north latitude. Beijing experiences a warm temperate semi-humid and semi-arid monsoon climate, characterized by hot and rainy summers and cold and dry winters. Administratively, the city comprises 16 districts and serves as the capital of the People’s Republic of China.

Data source

Data on all reported cases of viral hepatitis in Beijing from 2005 to 2020 were sourced from the public health science data center website (https://www.phsciencedata.cn/). This dataset includes information on the incidence and morbidity of various types of viral hepatitis such as HAV, HBV, HCV, HDV, HEV, and unclassified hepatitis. Diagnosis of all patients followed the criteria outlined in the viral hepatitis management guidelines issued by the Ministry of Health of the People's Republic of China. Ethical approval for this study was obtained from the China Center for Disease Control and Prevention. To ensure confidentiality, viral hepatitis data were analyzed anonymously. Given that viral hepatitis is classified as a statutory infectious disease under national mandatory surveillance, informed consent was not required. Monthly pollutant data (2014–2020) were sourced from the National Oceanic and Atmospheric Administration (NOAA) (https://www.noaa.gov/), encompassing AQI, PM2.5, PM10, SO2, CO, NO2 and O3.

Time series analysis of single and multiple interaction

This study employed three models for time series analysis. The SARIMA and Holt-Winters models were primarily used for predicting the incidence trends of viral hepatitis. The Holt-Winters exponential smoothing model is effective in smoothing out random fluctuations and assigns varying weights to data across cycles, thereby enhancing the accuracy of future trend predictions10. Holt-Winters' additive model has the following expression:

$$\begin{aligned} \hat{y}_{t + h|t} & = l_{t} + hb_{t} + s_{t - m + h} , \\ l_{t} & = \alpha \left( {y_{t} - s_{t - m} } \right) + \left( {1 - \alpha } \right)\left( {l_{t - 1} + b_{t - 1} } \right), \\ b_{t} & = \beta \left( {l_{t} - l_{t - 1} } \right) + \left( {1 - \beta } \right)b_{t - 1} , \\ s_{t} & = \gamma \left( {y_{t} - l_{t - 1} - b_{t - 1} } \right) + \left( {1 - \gamma } \right)s_{t - m} . \\ \end{aligned}$$

where 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and 0 ≤ γ ≤ 1 − α. Here lt is the level, bt is the trend, and st−m+h is the seasonal term; α, β, and γ are the smoothing parameters; m is the seasonal period; and h is the forecast horizon (the number of steps ahead).
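To make the recursions concrete, the following minimal Python sketch applies the additive Holt-Winters updates to a series. It is an illustration only, not the code used in this study (which reports R 4.3.1). The function name `holt_winters_additive` and the initialisation of the level, trend, and seasonal terms are assumptions, and the horizon is restricted to 1 ≤ h ≤ m so that the seasonal index st−m+h always refers to a term that has already been estimated.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, h):
    """Return forecasts for steps 1..h from the additive Holt-Winters recursions.

    y: observed series (list of numbers) covering at least two seasons
    m: seasonal period; h: forecast horizon, assumed 1 <= h <= m
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    if not 1 <= h <= m:
        raise ValueError("this sketch only supports 1 <= h <= m")

    # Hypothetical initialisation (an assumption, not taken from the paper):
    # level = mean of the first season, trend = average per-step change
    # between the first two seasons, seasonal terms = first-season deviations.
    first, second = y[:m], y[m:2 * m]
    level = sum(first) / m
    trend = (sum(second) - sum(first)) / (m * m)
    season = [value - level for value in first]  # season[i]: latest s for phase i

    for t in range(m, len(y)):
        s_prev = season[t % m]                    # s_{t-m}
        prev_level, prev_trend = level, trend     # l_{t-1}, b_{t-1}
        level = alpha * (y[t] - s_prev) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season[t % m] = gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_prev

    last = len(y) - 1
    # Forecast k steps ahead: l_t + k * b_t + s_{t-m+k}, with t the last index.
    return [level + k * trend + season[(last + k) % m] for k in range(1, h + 1)]
```

For monthly incidence data, m would be 12. In practice the smoothing parameters are usually chosen by minimising the in-sample one-step-ahead error rather than fixed by hand.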

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model decomposes the observed values into three parts: residuals, seasonal features, and true trends11. The SARIMA (p, d, q) (P, D, Q) s model can be expressed as follows:

$$\begin{array}{*{20}c} {{\Phi }_{p} \left( L \right)A_{P} \left( {L^{s} } \right){\Delta }^{d} {\Delta }_{s}^{D} y_{t} = {\Theta }_{q} \left( L \right)B_{Q} \left( {L^{s} } \right)\varepsilon_{t} ,} \\ {{\Phi }_{p} \left( L \right) = 1 - \varphi_{1} L - \varphi_{2} L^{2} - \cdots - \varphi_{p} L^{p} ,} \\ {A_{P} \left( {L^{s} } \right) = 1 - \alpha_{1} L^{s} - \alpha_{2} L^{2s} - \cdots - \alpha_{P} L^{Ps} ,} \\ {{\Theta }_{q} \left( L \right) = 1 + \theta_{1} L + \theta_{2} L^{2} + \cdots + \theta_{q} L^{q} ,} \\ {B_{Q} \left( {L^{s} } \right) = 1 + \beta_{1} L^{s} + \beta_{2} L^{2s} + \cdots + \beta_{Q} L^{Qs} ,} \\ {{\Delta } = 1 - L,} \\ {{\Delta }_{s} = 1 - L^{s} ,} \\ {{\Delta }_{s} y_{t} = (1 - L^{s} )y_{t} = y_{t} - y_{t - s} ,} \\ {\varepsilon_{t} \sim WN\left( {0,\sigma^{2} } \right)} \\ \end{array}$$

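where Δ and Δs denote the non-seasonal and seasonal difference operators, respectively, L is the lag operator, and s is the seasonal period. φ and α are the non-seasonal and seasonal autoregressive coefficients, θ and β are the non-seasonal and seasonal moving-average coefficients, and εt is white noise that is independent and identically distributed12.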
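The differencing operators are the part of the SARIMA specification that is easiest to check by hand. The short Python sketch below applies Δs D times and Δ d times to a series; the function names are hypothetical and the code only illustrates the operators, not the estimation of the autoregressive and moving-average coefficients.

```python
def difference(y, lag=1):
    """Apply (1 - L^lag) once: each output value is y[t] - y[t - lag]."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


def sarima_difference(y, d, D, s):
    """Apply the seasonal difference D times and the ordinary difference d times."""
    out = list(y)
    for _ in range(D):
        out = difference(out, s)   # seasonal differencing, removes period-s pattern
    for _ in range(d):
        out = difference(out, 1)   # ordinary differencing, removes trend
    return out
```

As a usage example, a series made of a linear trend plus a fixed pattern of period 4 becomes constant after one seasonal difference and zero after one further ordinary difference, so `sarima_difference(y, d=1, D=1, s=4)` returns only zeros. Each seasonal difference shortens the series by s observations and each ordinary difference by one.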

Following this, Spearman correlation analysis was used to identify relevant pollutants. Subsequently, a generalized additive model (GAM) was used to explore the interaction of pollutant factors on the prevalence of viral hepatitis13. The model is specified as follows:

$$\log [E(Y_{t} )] = \alpha_{1} + s(X_{1} ,X_{2} ) + \Sigma s\left( {X_{t} } \right)$$

α1 is the intercept; X1 and X2 denote the two interacting pollutants; s() denotes a penalized spline function; s (X1, X2) is a spline function of the interaction between X1 and X2, both of which enter the model as variables lagged by 5–6 months; and \(\Sigma s\left( {X_{t} } \right)\) collects the spline terms of the non-interacting pollutants.
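The screening step and the construction of lagged exposures can be sketched in a few lines of Python. This is a hedged illustration rather than the study's code: the helper names (`ranks`, `spearman`, `lagged_pairs`) are hypothetical, significance testing is omitted, and the penalized-spline GAM itself is not reproduced here.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


def lagged_pairs(exposure, outcome, lag):
    """Pair the exposure in month t - lag with the outcome in month t."""
    if lag == 0:
        return list(exposure), list(outcome)
    return list(exposure[:-lag]), list(outcome[lag:])
```

With monthly series, `spearman(*lagged_pairs(pm25, cases, 5))` would give the rank correlation between a pollutant and incidence five months later, where `pm25` and `cases` are placeholder names for the two series. Each month of lag removes one pair of observations.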

Machine learning training process

To predict viral hepatitis across different age groups and subtypes, various machine learning (ML) algorithms were employed and their results compared. Data from 2014 to 2018 formed the training set and data from 2019 to 2020 formed the testing set, with both sets undergoing ten-fold cross-validation. The Gaussian Process Regression (GPR) model defines a Gaussian process to model a distribution over functions and then performs Bayesian inference in function space14. Four kernel functions (Rbf, Polynomial, Laplace, and Bessel) were compared in the GPR model. Support vector regression (SVR) was also used; it maps input features to a higher-dimensional space through a kernel function and fits a regression function there, penalizing only errors that fall outside a tolerance margin15. Four kernel functions were compared in the SVR model: Linear, Polynomial, Radial, and Sigmoid. The SVR and GPR models were constructed in R 4.3.1 with the e1071 and kernlab packages, respectively. Pollutants served as predictor variables in the ML models, and the outcome variables were the overall incidence in the population, the incidence among different age groups, and the incidence of different types of viral hepatitis. This allows us to investigate the sensitivity of different populations to air pollutants in terms of disease incidence.
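For readers less familiar with kernel methods, the Python sketch below writes out three of the kernel functions named above, together with a split by calendar year and two common test-set metrics. It rests on several assumptions: the kernel parameterisations follow the commonly documented forms (Gaussian exp(−σ‖x − z‖²), Laplace exp(−σ‖x − z‖), polynomial (scale·⟨x, z⟩ + offset)^degree), the record layout with a `"year"` key is hypothetical, and R² is computed as 1 − SSres/SStot, which may differ from the definition behind the R2test values reported in this study.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-sigma * squared Euclidean distance)."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, z)))


def laplace_kernel(x, z, sigma=1.0):
    """Laplace kernel: exp(-sigma * Euclidean distance)."""
    return math.exp(-sigma * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z))))


def polynomial_kernel(x, z, degree=2, scale=1.0, offset=1.0):
    """Polynomial kernel: (scale * <x, z> + offset) ** degree."""
    return (scale * sum(a * b for a, b in zip(x, z)) + offset) ** degree


def split_by_year(records, last_train_year=2018):
    """Chronological split: years up to last_train_year train, later years test."""
    train = [r for r in records if r["year"] <= last_train_year]
    test = [r for r in records if r["year"] > last_train_year]
    return train, test


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R squared is undefined for a constant outcome")
    return 1 - ss_res / ss_tot
```

A chronological split keeps future observations out of the training data, which matters for time-ordered surveillance series. Under this definition, R² on held-out data can be close to zero or even negative when a model predicts no better than the mean of the test outcomes.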

Single pollution and weighted quantile sum (WQS) statistical analyses

The WQS regression model serves to evaluate the combined effects of multiple exposure variables on a specified outcome. Each exposure variable is assigned a weight within the model to quantify its influence on the outcome variable16. Initially, this study employed the WQS model to identify pollutants significantly impacting the incidence rate of viral hepatitis across various age groups and subtypes. To assess the cumulative impact of simultaneous exposure to multiple pollutants and discern the individual contribution of each pollutant, a “mixtures” approach via WQS regression analysis was utilized. Concurrently, epidemiological data were stratified into air quality categories based on Beijing's AQI, distinguishing between polluted and good air quality levels. Within each environmental quality state, the WQS regression model was applied to analyze how different pollutants influence the incidence and mortality of viral hepatitis.
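The core of the WQS approach is the index itself: each pollutant is converted to a quantile score and the scores are combined with non-negative weights that sum to one. The Python sketch below computes that index for given weights. It is a simplified illustration under stated assumptions: the function names are hypothetical, quartile scoring (q = 4) is assumed, tied values are binned by sort order, and the bootstrap estimation of the weights and the regression of the outcome on the index are not shown.

```python
def quantile_scores(values, q=4):
    """Score each value 0..q-1 according to its rank-based quantile bin."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, index in enumerate(order):
        scores[index] = rank * q // n
    return scores


def wqs_index(exposures, weights, q=4):
    """Weighted quantile sum per time point.

    exposures: dict mapping pollutant name -> list of values (equal lengths)
    weights:   dict mapping pollutant name -> non-negative weight, summing to 1
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scored = {name: quantile_scores(vals, q) for name, vals in exposures.items()}
    n = len(next(iter(exposures.values())))
    return [sum(weights[name] * scored[name][t] for name in exposures)
            for t in range(n)]
```

The coefficient reported as βWQS corresponds to the slope of the outcome on this index, and the estimated weights indicate how much each pollutant contributes to the mixture effect.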

Results

Demographic characteristics

From Table 1, the incidence of viral hepatitis in Beijing between 2005 and 2020 exhibited a general declining trend, with a notable short-term surge observed from 2016 to 2018. Conversely, the mortality rate displayed an increasing trend, peaking at 0.77 per 100,000 in 2011. Predominantly, HBV and HCV subtypes accounted for approximately 86.25% of cases, while HDV cases were rare, totaling only three. The seasonal distribution indicated spring and summer epidemics. Among age groups, individuals aged 35–64 years constituted the majority at 51.23%, followed by those aged 15–34 years at 31.38%.

Table 1 Distribution of viral hepatitis cases by age, types and season groups in Beijing, China, 2005–2020.

The analysis of time series model results

Comparing the forecast plots in Fig. 1A, B, it can be observed that the Holt-Winters model outperforms the SARIMA model in some time periods. In Table S1, the Deviation indicator shows only a minor discrepancy between the Holt-Winters and SARIMA models in predicting outcomes for 2019. However, the Holt-Winters model exhibits a notable advantage in its predictions for 2020. In Table S2, the parameters for the Holt-Winters additive model are determined as α = 0.44, β = 0.09, γ = 1, while the SARIMA multiplicative model is specified as SARIMA (1,1,0) (2,1,0) with a seasonal period of 12. Metrics such as RMSE, however, show little discernible difference in the performance of the two models.

Fig. 1

Forecast plots for Holt-Winters (A) and SARIMA (B) models. The deep shaded regions indicate 80% confidence intervals, the light shaded regions indicate 95% confidence intervals.

Model prediction comparisons

Figure S1 illustrates the results of Spearman's correlation analysis, revealing positive associations between five pollutants (PM2.5, PM10, SO2, CO and NO2) and the prevalence of viral hepatitis. Notably, PM2.5 shows a significant cross-correlation with both PM10 and CO (r = 0.84, P < 0.001). Table 2 compares four kernel algorithms of GPR, indicating relatively better predictive performance for HCV among the hepatitis types (R2test in the range [0.087, 0.202]). Similarly, among age groups, individuals aged 35 and above exhibit more accurate predictions (R2test in the range [0.024, 0.150]). The Besseldot kernel function within the GPR model demonstrates superior predictive capability. Table 3 evaluates four kernel algorithms of SVM, highlighting HBV as having better predictive outcomes among the hepatitis types (R2test in the range [0.215, 0.303]). Additionally, individuals aged 35 and above show enhanced prediction accuracy (R2test in the range [0.010, 0.132]). The Polynomial kernel function proves advantageous within the SVM framework. Overall, SVM demonstrates superior predictive performance compared to GPR across the evaluated metrics, underscoring its efficacy in modeling the relationships between pollutants, hepatitis types, age groups, and viral hepatitis development.

Table 2 Comparison of the prediction results with different kernels of Gaussian process regression (GPR) models.
Table 3 Comparison of the prediction results with different kernels of support vector machine (SVM) models.

Assessing the combined association between multiple pollutant exposures and viral hepatitis

Table S3 presents the comprehensive sensitivity analysis, indicating that the combined effect of the five pollutants on viral hepatitis is βWQS (95% CI) = 0.066 (0.018, 0.114). Among different subtypes, pollutants demonstrate significant adverse effects on HAV, HCV, and HEV. Across different age groups, except for the 0–14 age group, pollutants show notable adverse effects. Subsequently, based on the results of the overall sensitivity analyses, the relevant key factors were initially screened. From Table 4, focusing on individual pollutant effects, PM2.5 emerges as the primary risk factor for viral hepatitis overall, with a risk effect of βWQS (95% CI) = −0.0050 (−0.0089, −0.0013). Among different subgroups, PM2.5 stands out as the most sensitive risk factor, particularly impacting HCV and HEV patients and individuals aged 35–64. SO2 primarily affects HCV patients and individuals aged 35–64, with risk effects of βWQS (95% CI) = 0.0022 (0.0004, 0.0040) and βWQS (95% CI) = 0.0043 (0.0005, 0.0080), respectively. CO mainly impacts HAV patients, with a risk effect of βWQS (95% CI) = −0.0355 (−0.0695, −0.0016). NO2 primarily affects individuals aged 0–14, while PM10 influences HEV patients. In terms of combined pollutant effects, pollutants mainly affect HCV patients and individuals aged 35–64 (with risk effects of βWQS (95% CI) = 0.0342 (0.0210, 0.0474) and βWQS (95% CI) = 0.0453 (0.0153, 0.1556), respectively).

Table 4 Comparison of results from the survey-weighted single-pollutant analyses and WQS regression of the matrix-specific pollutant mixtures for viral hepatitis.

Regarding environmental pollution periods, as illustrated by Fig. S2, SO2 and CO are key pollutants influencing the onset and mortality of viral hepatitis. For onset, SO2 and PM2.5 are the predominant factors both during polluted periods (Fig. S2C) and during periods of good environmental conditions (Fig. S2A). For mortality, CO and SO2 play critical roles during polluted periods (Fig. S2D), while CO and PM2.5 are the significant influencers during good environmental periods (Fig. S2B).

Non-linear interaction of pollutants

From Table S4, significant interaction effects are observed for the PM2.5-PM10 and PM2.5-CO pollutant pairs at lags of 5–6 months. Specifically, the PM2.5-PM10 interaction is better fitted at a lag of 5 months, while the PM2.5-CO interaction is better fitted at a lag of 6 months. Figure 2 illustrates fitting effect plots, revealing that the risk of viral hepatitis onset is elevated at lower levels of PM2.5 and PM10 (Fig. 2A and B), while high levels of PM2.5 (100–120 μg/m³) and CO (Fig. 2C and D) correspond to increased onset risk. Additionally, as depicted in the fitting curves of Fig. S3, the dose–response relationships of SO2 and NO2 with viral hepatitis onset become progressively clearer with increasing lag months. At a lag of 6 months, NO2 achieves its maximum risk effect at a level of 30–40 μg/m³.

Fig. 2

The fitting interactions of the association among pollutants and viral hepatitis cases in Beijing, 2014–2020 based on the generalized additive model (GAM), with lagging of 5 (A, B) and 6 (C, D) months.

Discussion

The incidence of viral hepatitis in Beijing Municipality exhibited an overall decreasing trend from 2005 to 2020, primarily attributed to widespread hepatitis vaccination and standardized antiviral treatments in China. These advancements have significantly reduced the number of new cases17. However, despite these preventive measures, factors such as improved quality of life and various environmental influences have exacerbated the progression of hepatitis, leading to increased incidences of cirrhosis and liver cancer. Furthermore, the chronic nature of viral hepatitis, combined with limited effective prevention and treatment options, has contributed to a slight rise in long-term mortality rates. The primary types of hepatitis in this region are HBV (Hepatitis B Virus) and HCV (Hepatitis C Virus). HBV transmission, particularly from mother to child, has historically been prevalent in China due to inadequate medical hygiene practices in the past. HCV, in contrast, often presents with subtle symptoms and is not typically part of routine health screenings, which has contributed to its spread. Our study identified distinct seasonal patterns, with spring and summer showing higher incidence rates. The age group most susceptible to infection was predominantly 35–64 years old, consistent with findings from previous research18. This age distribution reflects the prolonged duration of hepatitis infections, with older individuals typically experiencing longer periods of infection.

Establishing robust statistical models is essential for predicting the occurrence trends of infectious diseases. Commonly utilized in time series analysis are models like Holt-Winters and ARIMA, each offering distinct advantages for predictive accuracy and practical application. In the context of viral hepatitis prediction, this study compared the Holt-Winters model with SARIMA and found that the former generally outperformed the latter. This superiority can be attributed to challenges in determining SARIMA parameters and the potential for overfitting due to complex calculations, leading to less stable predictions. The Holt-Winters model proves effective in capturing epidemiological patterns of hepatitis onset due to its computational simplicity and high predictive accuracy19. Furthermore, this study employs machine learning-based methods to predict hepatitis onset risks associated with pollutant levels. Evaluation across different hepatitis types and age groups consistently shows superior predictive performance for primary hepatitis types and highly susceptible populations, aligning with epidemiological insights. This underscores that individuals in sensitive demographics are more vulnerable to environmental pollutants, influencing hepatitis susceptibility.

Different types of viral hepatitis primarily spread through gastrointestinal and bloodborne routes. HAV and HEV, for instance, mainly transmit through the gastrointestinal tract, with transmission influenced by pollutants such as PM10 and CO. This can be linked to increasing industrialization and declining environmental awareness. Higher levels of airborne particulate matter and vehicle emissions exacerbate environmental pollution, thereby enhancing transmission through the gastrointestinal route. Other types of viral hepatitis primarily transmit through blood and bodily fluids, affected notably by pollutants like SO2 and PM2.5. Epidemiological studies have shown an association between PM2.5 levels and liver fibrosis20. Animal research indicates that air pollution can activate Kupffer cells, trigger endoplasmic reticulum stress responses, induce cytokine production, and promote collagen deposition, thereby exacerbating fibrosis progression21. This suggests environmental pollutants can impact hepatic metabolism through the bloodstream route. Furthermore, this study identifies SO2 and CO as significant pollutants influencing the onset and mortality of viral hepatitis. CO, due to its high affinity for hemoglobin binding in the bloodstream, poses a notable threat to the progression and mortality of hepatitis. These findings underscore the importance highlighted in China's infectious disease planning of addressing hepatitis transmitted through the bloodstream route.

Current literature on infectious disease prediction and pollutant impacts often focuses on single methodologies and specific effects. This study, however, employed diverse time-series methods to forecast viral hepatitis and to analyze the interactive effects of pollutants, revealing wide month-to-month prediction intervals marked by considerable fluctuations. These findings underscore the challenge of capturing the inherent volatility in viral hepatitis data using conventional models. Moreover, because the study was restricted to a single region, the generalizability of the findings across different types of hepatitis affected by pollutants is limited. Future research is encouraged to validate these macroscopic epidemiological insights at a microscopic level, utilizing animal models to elucidate the underlying physiological mechanisms.

Conclusion

The Holt-Winters model outperformed SARIMA in predicting viral hepatitis incidence. SVM and GPR models utilizing pollutant data showed potential for enhanced prediction accuracy. Patients with HAV and HEV were primarily impacted by PM10 and CO, while SO2 and PM2.5 affected other types. The 35–64 age group exhibited higher susceptibility. Long-term exposure to mixed pollutants influenced hepatitis development with a lag of 5–6 months, emphasizing the need for sustained pollutant monitoring for effective public health strategies.