Introduction

Tuberculosis remains a formidable public health challenge globally1,2,3. According to the latest Global Tuberculosis Report 2024 released by the World Health Organization (WHO)4, there were approximately 10.8 million new TB cases and 1.3 million TB-related deaths worldwide in 2023. China accounted for 7.1% (approximately 767,000 cases) of the global TB incidence, ranking third in terms of disease burden, following India and Indonesia. Notably, the prevalence of TB in China exhibits significant regional disparities. In 2021, the incidence rate in Xinjiang reached as high as 128 per 100,000 population, which was 2.4 times the national average (53.1 per 100,000 population), and the rate of decline over the past decade has lagged behind that of eastern provinces5,6. This abnormal distribution suggests that, in addition to biomedical factors, environmental driving mechanisms may be a crucial explanatory pathway7,8.

A growing body of evidence indicates that environmental factors are closely associated with the transmission dynamics of TB9. Air pollutants, such as PM2.5 and NO₂, can impair respiratory immune function10,11,12,13, while meteorological conditions, such as temperature and humidity, indirectly modulate transmission risks by affecting the survival rate of Mycobacterium tuberculosis aerosols14,15. The Xinjiang Uygur Autonomous Region, situated in northwestern China and at the heart of the Eurasian continent, represents a quintessential arid and semi-arid zone. Its distinctive geography, defined by the Tianshan Mountains partitioning the Junggar and Tarim Basins and bordered by expansive deserts such as the Taklamakan, combines with an extreme continental climate (low annual precipitation and frequent dust storms) and a substantial dependence on coal for winter heating to form a unique environmental profile. Major emission sources comprise natural mineral dust derived from extensive arid terrains and desert areas, as well as anthropogenic pollutants originating from coal combustion used for centralized heating during the prolonged cold season, complemented by increasingly significant contributions from vehicular and industrial emissions in urban centers. These conditions result in consistently high levels of particulate matter (especially PM10) and unique patterns of air pollutant exposure, which may significantly influence the transmission dynamics of respiratory infectious diseases like TB. At the same time, the TB epidemic in Xinjiang has long been characterized by “three highs and one low” (high infection rate, high prevalence rate, high rural epidemic, and low annual decline rate), with a disease burden far exceeding the national average. There is an urgent need to investigate its environmental driving factors in order to formulate precise prevention and control strategies.

Existing studies have predominantly focused on single environmental factors or employed traditional regression models16,17,18,19, which are limited in capturing the combined effects of multiple factors and non-linear relationships. Moreover, these studies have been geographically concentrated in the developed eastern regions, with insufficient attention paid to the environmental specificities of the arid northwestern regions of China20,21,22. Additionally, traditional machine learning models have limitations in ranking variable importance and providing causal explanations, and model interpretability has rarely been explored in depth23,24. To address these gaps, this study integrates TB incidence data, air pollutant concentrations, and meteorological indicators in Xinjiang from 2010 to 2022. It constructs and compares gradient boosted decision tree (GBDT) and extreme gradient boosting (XGBoost) machine learning models, and uses SHapley Additive exPlanations (SHAP) values to elucidate the marginal contributions and directions of effect of various environmental factors.

This study makes three contributions. First, it is the first to apply the XGBoost-SHAP interpretable framework to quantify complex nonlinear interactions between multiple environmental factors and pulmonary tuberculosis (PTB) in arid regions. Second, it identifies dust-driven PM10 and coal-combustion-derived CO as predominant drivers specific to arid ecosystems, differing from mechanisms in humid areas. Third, it proposes targeted environmental intervention strategies for dust control and clean energy transition in northwestern China. These contributions provide new insights into the environmental driving mechanisms of TB and offer a scientific basis for the formulation of regional public health policies, contributing to the achievement of the global goal of ending the TB epidemic.

Methods

Data collection

Global and Chinese TB incidence data were obtained from the Global Tuberculosis Report 2023 published by the WHO and the Global Tuberculosis Database. The number of reported TB cases and incidence rates from 2010 to 2022 for 14 regions (prefectures) in Xinjiang were extracted from the “Infectious Disease Reporting Information Management System” of the China Information System for Disease Control and Prevention (excluding data from outside the province and the Xinjiang Production and Construction Corps). The data were aggregated according to the date of onset. Air pollutant monitoring data, including CO (mg/m3), O3 (µg/m3), NO2 (µg/m3), PM2.5 (µg/m3), and PM10 (µg/m3), were sourced from the China National Environmental Monitoring Centre’s Real-time Air Quality Monitoring Platform (https://air.cnemc.cn/). Meteorological data, comprising average temperature (AT, ℃), average wind speed (WS, m/s), average rainfall (AR, mm), and average humidity (AH, %), were obtained from the National Oceanic and Atmospheric Administration of the United States.

Data pre-processing

To ensure data quality and consistency, a comprehensive pre-processing pipeline was implemented. Missing values in the dataset were addressed using the K-Nearest Neighbors (KNN) imputation method, which estimates each missing value from the characteristics of the ten nearest data points. This method was selected for its ability to preserve the underlying data structure and reduce bias compared to mean or median imputation, particularly in time-series environmental data. Z-score standardization was then applied to continuous variables to eliminate dimensional influence. The variance inflation factor (VIF) was used to check for multicollinearity among air pollutants and meteorological factors; the results indicated weak or no multicollinearity (Supplementary Material Table S1).
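The imputation and standardization steps can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the functions `knn_impute` and `zscore_columns`, the toy records, and the distance rule (Euclidean distance over the columns observed in both rows) are assumptions made for illustration, and library implementations such as scikit-learn's `KNNImputer` differ in detail.

```python
import math
from statistics import mean, pstdev


def knn_impute(rows, k=10):
    """Fill None entries with the mean of that column over the k nearest rows.

    Assumption of this sketch: distance is Euclidean over the columns that
    are observed in both rows, and rows missing the target column cannot
    act as donors.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            donors = [v for _, v in candidates[:k]]
            if donors:
                filled[i][j] = mean(donors)
    return filled


def zscore_columns(rows):
    """Standardize each column to mean 0 and population SD 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - mu) / sd if sd > 0 else 0.0
             for v, mu, sd in zip(r, mus, sds)] for r in rows]


# Hypothetical records: [PM10, PM2.5, humidity]; None marks a missing value.
toy = [
    [140.0, 60.0, 40.0],
    [150.0, None, 38.0],
    [90.0, 35.0, 55.0],
    [145.0, 62.0, 41.0],
    [95.0, 37.0, 54.0],
]
imputed = knn_impute(toy, k=2)          # the study used ten neighbours
standardized = zscore_columns(imputed)
```

With `k=2`, the missing PM2.5 value is filled from the two rows whose PM10 and humidity are closest, which is the behaviour that distinguishes KNN imputation from a global mean or median fill.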

Statistical method

Statistical analysis

The preprocessed dataset was divided into a training subset containing 70% of the data and a test subset containing 30% of the data. Continuous variables were described using means and standard deviations (SD) or interquartile ranges (IQR). Pearson correlation analysis was used to investigate the correlations among all exposure variables. Additionally, restricted cubic splines were used to model potential non-linear relationships between each air pollutant exposure, meteorological factor, and TB incidence. All statistical analyses were primarily conducted using statistical packages in R 4.1.3, with the statistical significance level set at 0.05.
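As an illustration of the 70/30 partition and the Pearson correlation described above, the following Python sketch may help. The analyses in this study were run in R; the helper names `train_test_split` and `pearson_r`, the random seed, the toy values, and the assumption that the split is a simple random one are all hypothetical.

```python
import math
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and hold out the first n_test as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


records = list(range(100))              # stand-in for region-month rows
train, test = train_test_split(records)

# Hypothetical values chosen so that PM10 falls as humidity rises.
pm10 = [80.0, 120.0, 150.0, 200.0, 95.0]
humidity = [60.0, 48.0, 42.0, 30.0, 55.0]
r = pearson_r(pm10, humidity)
```

Fixing the seed keeps the partition reproducible, so that models compared on the test subset see exactly the same held-out records.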

Gradient boosted decision tree

GBDT is a classical machine learning method proposed by Friedman et al.25. GBDT employs decision trees as weak learners and iteratively corrects model errors through a gradient descent strategy. In contrast to the parallel construction of random forests, the sequential training mechanism of GBDT enables it to focus more on correcting the residuals of the preceding models, often resulting in advantages in prediction accuracy26. In this study, an exhaustive grid search was conducted over a predefined parameter space to automate hyperparameter optimization. The performance of each unique hyperparameter combination was rigorously evaluated using a robust 10-fold cross-validation protocol. To mitigate the risk of overfitting and ascertain the optimal number of iterations for each specific hyperparameter configuration, early stopping was integrated into each training run during the cross-validation process. The optimal number of iterations is depicted in the supplementary material (Fig. S1). The final model reported in the results was trained utilizing the best-performing hyperparameter set identified through this optimization procedure. The implementation was based on the LightGBM framework27, and a summary of the optimized hyperparameters is provided in Supplementary Table S2.
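The sequential residual-fitting idea behind GBDT can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the LightGBM-based model used in the study: it uses squared loss, one-split trees (stumps) on a single feature, and hypothetical function names, and it omits grid search, cross-validation, and early stopping.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error on a 1-D feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_rounds, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, learning_rate, stumps


def predict_boosting(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def training_mse(model, x, y):
    return sum((yi - predict_boosting(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy nonlinear relationship.
x = [float(v) for v in range(1, 11)]
y = [v ** 2 for v in x]
mse_0 = training_mse(fit_boosting(x, y, n_rounds=0), x, y)   # mean-only baseline
mse_1 = training_mse(fit_boosting(x, y, n_rounds=1), x, y)
mse_50 = training_mse(fit_boosting(x, y, n_rounds=50), x, y)
```

Because every round targets what the previous rounds left unexplained, the training error keeps falling as rounds are added, which is why a stopping rule such as the early stopping described above is needed to guard against overfitting.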

Extreme gradient boosting

XGBoost28 is an efficient implementation of the GBDT proposed by Chen and Guestrin. Through technological innovations such as second-order derivative optimization, regularization control, and the Weighted Quantile Sketch, XGBoost significantly enhances the accuracy and training speed of traditional GBDT.

Key improvements include the following. Regularization in the loss function: introducing regularization terms (L1/L2) into the loss function helps control model complexity, thereby mitigating overfitting. Second-order Taylor expansion: by utilizing both the first-order derivative (gradient) and the second-order derivative (Hessian) of the loss function, XGBoost optimizes the node splitting criterion. Parallelized design: features such as pre-sorted feature blocks and cache optimization break through the computational bottlenecks of GBDT, enabling parallel processing. Automatic handling of missing values: XGBoost learns the default splitting direction for missing values, eliminating the need for preprocessing.

XGBoost iteratively optimizes an objective function with regularization terms:

$$Obj^{{\left( t \right)}} = \sum\limits_{{i = 1}}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right]} + \gamma T_{t} + \frac{1}{2}\lambda \left\| {\omega _{t} } \right\|^{2}$$
(1)

Here, \(Obj^{{\left( t \right)}}\) represents the total optimization objective at the t-th iteration, n denotes the number of training samples, gi refers to the gradient, i.e., the first-order derivative of the loss function L with respect to the current predicted value, hi represents the Hessian, i.e., the second-order derivative of the loss function L with respect to the current predicted value, and \(f_{t} \left( {x_{i} } \right)\) is the predicted value of the t-th tree for the i-th sample xi. The hyperparameter γ is the penalty for leaf splitting to control the complexity of the tree, Tt indicates the number of leaf nodes in the t-th tree, λ is the L2 regularization coefficient to prevent overfitting, and \(\left\| {\omega _{t} } \right\|^{2}\) is the squared L2 norm of the leaf weight vector, which measures the model complexity. We set the tree depth to 4, the learning rate to 0.1, and the regularization coefficients γ = 0.3 and λ = 1. The optimal number of iterations is shown in the supplementary material (Fig. S2). The model was tuned through cross-validation, and a summary of the optimized hyperparameters is provided in Supplementary Table S2.
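To make the roles of γ and λ in Eq. (1) concrete, a short worked derivation may be useful. It follows the standard XGBoost formulation rather than anything stated in this study, and the symbols \(I_{j}\), \(G_{j}\), \(H_{j}\) and \(\omega_{j}\) are notation introduced here for illustration: \(I_{j}\) is the set of samples falling into leaf j of the t-th tree and \(\omega_{j}\) is the weight of that leaf, so that \(f_{t}(x_{i}) = \omega_{j}\) for every \(i \in I_{j}\). Grouping the sum in Eq. (1) by leaf gives

$$Obj^{(t)} = \sum\limits_{j = 1}^{T_{t}} \left[ G_{j}\omega_{j} + \frac{1}{2}\left( H_{j} + \lambda \right)\omega_{j}^{2} \right] + \gamma T_{t}, \qquad G_{j} = \sum\limits_{i \in I_{j}} g_{i}, \quad H_{j} = \sum\limits_{i \in I_{j}} h_{i}$$

Each bracket is a quadratic in \(\omega_{j}\), so setting its derivative to zero yields the optimal leaf weight and the corresponding minimum of the objective:

$$\omega_{j}^{*} = -\frac{G_{j}}{H_{j} + \lambda}, \qquad Obj^{(t)*} = -\frac{1}{2}\sum\limits_{j = 1}^{T_{t}} \frac{G_{j}^{2}}{H_{j} + \lambda} + \gamma T_{t}$$

Under this formulation, a larger λ shrinks every leaf weight toward zero, and an additional leaf is only worthwhile if it lowers the first term by more than γ.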

This study compares the predictive performance of the two models using the following quantitative metrics: R2, RMSE, and MAE.
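For reference, the three metrics can be computed as in the following Python sketch. The function name `regression_metrics` and the toy observed and predicted values are hypothetical; the definitions are the usual ones (R2 as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared error, MAE as the mean absolute error).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Hypothetical incidence values (per 100,000) and model predictions.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, predicted)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a model's error is dominated by a few poorly predicted observations.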

SHAP theory and interpretation

To enhance the interpretability of the machine learning model and quantify the marginal contribution of each environmental variable to the prediction of PTB incidence, we employed SHAP values. SHAP is a unified framework based on cooperative game theory that assigns an importance value to each feature for every individual prediction. Its core assumption is that the model’s prediction result can be decomposed into the sum of contributions from each feature, namely, an additive explanatory model:

$$f\left( x \right) = \varphi _{0} + \sum\limits_{{i = 1}}^{n} {\varphi _{i} }$$
(2)

Here, f(x) represents the predicted value for a sample x, φ0 is the baseline value (often the model’s prediction with an empty feature set), and φi is the SHAP value of feature i, that is, the contribution of that feature to the prediction result. In theory, SHAP values are calculated exactly by enumerating all possible feature combinations. For a sample containing n features, the formula for calculating the SHAP value φi of feature i is as follows:

$$\varphi _{i} = \sum\nolimits_{{s \subseteq x \setminus \left\{ {x_{i} } \right\}}} {\frac{{\left| s \right|!\left( {n - \left| s \right| - 1} \right)!}}{{n!}}} \left[ {f\left( {s \cup \left\{ {x_{i} } \right\}} \right) - f\left( s \right)} \right]$$
(3)

In this formula, s is a feature subset that does not include feature i, \(\left| s \right|\) is the size of the subset, and f(s) is the model’s prediction under the feature subset s. This formula calculates the contribution of feature i by computing the increment in the model’s prediction when feature i is added to each subset s, and then taking a weighted sum based on the probability of each subset’s occurrence.
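The enumeration in Eq. (3) can be written out directly for a small number of features. The Python sketch below is illustrative only: `shapley_values` and `toy_model` are hypothetical names, and evaluating f(s) by replacing the features outside s with fixed baseline values is an assumption of this sketch (practical SHAP implementations for tree models use faster algorithms and a different treatment of absent features).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every subset s of the other features.

    Assumption: f(s) keeps x's values for features in s and substitutes the
    baseline values for all remaining features.
    """
    n = len(x)

    def f(subset):
        return predict([x[j] if j in subset else baseline[j] for j in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight |s|! (n - |s| - 1)! / n! from Eq. (3).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (f(set(s) | {i}) - f(set(s)))
        phi.append(total)
    return f(set()), phi


def toy_model(features):
    """Hypothetical model: one additive term and one interaction term."""
    x0, x1, x2 = features
    return 2.0 * x0 + x1 * x2


phi0, phi = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
```

In this toy case the additive feature receives its full effect (about 2) while the two interacting features split their joint effect equally (about 3 each), and the baseline plus the three values recovers the prediction of 8, as required by the additive form in Eq. (2). The cost grows exponentially with the number of features, which is why exact enumeration is practical only for small n.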

SHAP results are typically presented graphically. For instance, the SHAP variable importance plot succinctly illustrates the contribution of each feature to the model’s predictions: the larger the value, the greater the contribution29,30. This improves the interpretability and transparency of models, thereby aiding understanding of and trust in their use for disease screening.

Results

Descriptive analysis results

Figure 1 is a schematic diagram of the general situation of the study area, which illustrates the geographical, administrative, and environmental distribution patterns within the region, while also indicating the specific site locations of meteorological monitoring stations in Xinjiang. Xinjiang is situated in the northwest of China. Based on local geographical characteristics, this region can be divided into three parts: Southern Xinjiang, Northern Xinjiang, and Eastern Xinjiang. As illustrated in Fig. 2, which presents the epidemiological surveillance data of tuberculosis from 2010 to 2022, the global incidence rate of tuberculosis exhibited a gradual upward trend, increasing from 133 per 100,000 population to 138 per 100,000 population. China as a whole demonstrated a better performance compared to the global average, with its incidence rate dropping from 72 per 100,000 population to 58 per 100,000 population. However, the Xinjiang region displayed a distinct evolutionary pattern. From 2010 to 2017, the incidence rate in Xinjiang consistently remained higher than the national average, reaching its peak in 2018. Xinjiang had been in a state of hyper-high prevalence for an extended period before the rate began to decline.

Fig. 1 Overview map of the study area.

Fig. 2 Trend of tuberculosis incidence rates globally, in China and in Xinjiang from 2010 to 2022.

The mean exposure concentrations of all air pollutants and meteorological factors in Xinjiang are summarized in Table 1. Notably, the annual mean concentrations of both PM2.5 and PM10 substantially exceeded the limits set by the WHO (Air Quality Guidelines 2021: PM2.5 = 5 µg/m3, PM10 = 10 µg/m3) and China’s National Ambient Air Quality Standards (GB 3095-2012: PM2.5 = 35 µg/m3, PM10 = 70 µg/m3). All meteorological factors exhibited varying degrees of fluctuation. The IQR was used to quantify the variability of each factor around the median; a larger IQR indicates greater variability in the data. AR showed the largest fluctuation (IQR = 12.77 mm), indicating substantial variation in precipitation levels across the observation period, which is a characteristic feature of the arid continental climate in Xinjiang. In contrast, WS showed the least variability (IQR = 0.93 m/s), suggesting relatively stable wind conditions. Figure 3 presents the results of the Pearson correlation analysis between air pollutants and meteorological factors in Xinjiang. Among the air pollutants, PM10 and PM2.5 are very closely related, with a correlation coefficient as high as 0.86 (P < 0.001). Additionally, a positive correlation is observed between PM2.5 and NO2, with r = 0.45 (P < 0.001). The correlation coefficient between NO2 and CO is r = 0.35 (P < 0.001), indicating a moderate positive correlation in their concentration changes. In contrast, NO2 and O3 show a negative correlation, with r = −0.42 (P < 0.001), meaning that when the concentration of NO2 increases, the concentration of O3 tends to decrease. CO demonstrates a positive correlation with both PM10 and PM2.5, with correlation coefficients of 0.23 (P < 0.005) for both, suggesting that CO co-varies to a similar degree with both particulate fractions.

From the perspective of the associations between air pollutants and meteorological factors, there is a significant correlation between particulate matter and AH. Specifically, PM10 exhibits a strong negative correlation with AH, with r = −0.65 (P < 0.001); PM2.5 also shows a strong negative correlation with AH, with r = −0.56 (P < 0.001). The association of AR with particulate matter differs by size fraction. The correlation coefficient between AR and PM10 is −0.57 (P < 0.001), while that between AR and PM2.5 is −0.48 (P < 0.001), which is relatively weaker. WS is also associated with particulate matter concentrations. WS shows a significant negative correlation with PM2.5, with r = −0.33 (P < 0.001), and a similarly significant negative correlation with PM10, with r = −0.31 (P < 0.001). Furthermore, there are numerous significant correlations among meteorological factors. AH and WS exhibit a significant negative correlation, with r = −0.35 (P < 0.001), whereas AH and AR show an extremely strong positive correlation, with r = 0.83 (P < 0.001). Meanwhile, WS and AT are positively correlated, with r = 0.35 (P < 0.001); however, WS and AR are negatively associated, with r = −0.34 (P < 0.001).

Table 1 Descriptive statistics of exposure concentrations of air pollutants and meteorological factors in Xinjiang.
Fig. 3 The relationship between air pollutants and meteorological factors.

Nonlinear exposure-response (E-R) relationship patterns between atmospheric pollutants, meteorological factors, and tuberculosis risk

The E-R curves in Fig. 4 depict the nonlinear associations between environmental factors and PTB incidence, expressed as Odds Ratios (ORs). An OR represents the ratio of the odds of PTB incidence occurring at a specific exposure level compared to a reference level (typically the median exposure value in this study). An OR greater than 1 indicates an increased risk of PTB, while an OR less than 1 suggests a protective effect against the disease. A detailed analysis of Fig. 4 reveals several key patterns. The concentrations of PM2.5 and PM10 exhibit a pattern characterized as “gentle at low exposure, steep increase at high exposure” (Fig. 4a, b). This indicates that the risk of PTB increases marginally at lower concentrations but rises sharply when pollutant levels exceed a certain threshold. The E-R curves for NO2 and AT display a wavy, non-monotonic shape (Fig. 4c, f). The risk fluctuates at lower concentrations and gradually stabilizes as concentrations increase, suggesting a complex, nonlinear influence on PTB transmission. In contrast, O3 and CO show a predominantly monotonic increasing trend (Fig. 4d, e), implying a consistent rise in PTB risk with rising concentrations of these pollutants. The curve for WS presents a distinct inverted J-shape (Fig. 4g). A “protective threshold” is observed within the range of 4.0–5.5 m/s, where the OR is at its lowest. Deviating from this optimal range, either with lower or higher wind speeds, results in a significant increase in PTB risk. This is likely because moderate winds promote aerosol dispersion, while calm conditions lead to pollutant accumulation, and very strong winds may resuspend dust particles. AR demonstrates a significant protective effect when it exceeds approximately 10 mm (Fig. 4h), as precipitation likely removes airborne particles through wet deposition. The E-R curve for AH is relatively flat (Fig. 4i), though a slight increase in risk is observed in low-humidity intervals (< 45%).

Fig. 4 The exposure-response relationship between air pollutants and meteorological factors and the incidence of tuberculosis. (a) PM2.5. (b) PM10. (c) NO2. (d) O3. (e) CO. (f) AT. (g) WS. (h) AR. (i) AH.

Model fitting and performance analysis

As shown in Fig. 5, a comparison of the scatter plots of predicted values versus actual values reveals significant performance disparities between the XGBoost model and the GBDT model in tuberculosis data modeling. The predicted points of the XGBoost model are more closely clustered along the ideal fitting line (dashed line), and the range of residual distribution is notably narrower. This indicates that the XGBoost model has a significantly superior ability to capture complex nonlinear relationships in the data compared to the GBDT model, thereby reducing systematic biases in the prediction results.

Table 2 presents the quantitative evaluation results of the fitting performance of the two models on the tuberculosis dataset. The coefficient of determination of the XGBoost model (R2 = 0.906) is significantly higher than that of the GBDT model (R2 = 0.492), suggesting that the XGBoost model accounts for a substantially larger proportion of the data variance. Moreover, the XGBoost model demonstrates a marked improvement in prediction stability, as evidenced by its significantly lower RMSE and MAE compared to the GBDT model, indicating better error control capabilities.

The residual distribution plots of the XGBoost model presented in Fig. 6 offer intuitive diagnostic evidence. The model residuals for both the training set (Fig. 6a) and the test set (Fig. 6b) exhibit a random distribution pattern centered around the zero axis, with no discernible directional trends or regular fluctuations. This is consistent with the expected form of random errors.

Table 2 Evaluation of model fitting effect. Evaluation of model prediction effect.
Fig. 5 Actual vs. predicted comparison.

Fig. 6 Residuals vs. predictions. (a) Train. (b) Test.

Analysis of feature contributions in the model

The global feature importance analysis (Fig. 7) conducted on the XGBoost model based on the SHAP method has unveiled a complex association pattern between the incidence rate of tuberculosis and environmental drivers in the Xinjiang region. The quantitative evaluation results indicate significant hierarchical differences in the contributions of various environmental variables to the disease burden. Among the multiple environmental indicators included in the analysis, PM10 exhibits a dominant influence, with its mean absolute SHAP value ranking first. This suggests that this variable has the strongest explanatory power in the model prediction process.

CO concentration, as a secondary influential factor, follows closely behind, and its feature importance score also reaches a statistically significant level, implying that gaseous pollutants may play an important role in the transmission dynamics of tuberculosis. Although AT, AH, and PM2.5 have relatively lower contributions, they still maintain a moderate level of feature importance, indicating that these meteorological and air quality parameters constitute key components in the environmental risk spectrum for tuberculosis incidence. The feature importance scores for variables such as AR and NO2 are at a lower magnitude. Among them, NO2 has the weakest influence, suggesting that it may only have a marginal effect on the occurrence and development of the disease under specific exposure scenarios.

Fig. 7 Feature importance of the SHAP-based XGBoost explainer.

Discussion

This study focused on the association between the incidence data of PTB and environmental factors in the Xinjiang region from 2010 to 2022. By constructing GBDT and XGBoost machine learning models and combining them with SHAP interpretability analysis techniques, the study systematically dissected the nonlinear association mechanisms and feature importance hierarchies between environmental exposure factors and tuberculosis incidence. The research findings provide new scientific evidence and decision-making support for tuberculosis prevention and control in arid regions.

This study indicates that the incidence rate of PTB in Xinjiang reached its peak in 2018 and then experienced a sharp decline in 2019. The main reasons for this phenomenon may be as follows: Firstly, the large historical base of PTB patients in Xinjiang has led to a slow decline in the epidemic, with the PTB incidence rate consistently ranking first in the country over the years31. Secondly, the implementation of a policy combining passive surveillance with active case-finding for tuberculosis, along with the comprehensive roll-out of initiatives such as screening for suspected PTB patients during the national health check-up in 2018 and PTB screening among high-risk populations, significantly contributed to the marked increase in the number of reported cases in Xinjiang in 2018. In addition, under the guidelines and instructions of the national investigation into under-reporting and under-registration of PTB cases and diagnostic re-verification work in China in 201732, Xinjiang launched a tuberculosis technical assistance program, which promoted the improvement of local PTB detection capabilities. This serves as another crucial reason for the sharp surge in the number of reported PTB cases in 2018. From 2018 to 2022, the PTB incidence rate in Xinjiang has shown a year-on-year decrease. This may be attributed to the continuous implementation of the tuberculosis technical assistance program, which has effectively enhanced the comprehensive technical skills of PTB prevention and treatment personnel in the region33. Moreover, patients identified through active screening have all been included in management, resulting in a certain degree of decline in the PTB incidence rate in Xinjiang after 2019.

In the evaluation of model performance, the XGBoost framework significantly outperformed GBDT in capturing complex environmental exposure-disease relationships (R2 = 0.91 vs. 0.49). This is mainly attributed to its use of second-order Taylor expansion optimization, L1/L2 regularization constraints, and weighted quantile sketch techniques, which give it a stronger ability to capture structure when dealing with high-dimensional interactive features. This finding aligns with the theoretical advantages of the XGBoost algorithm proposed by Chen and Guestrin28 and supports its applicability to empirical research in the field of public health.

The SHAP value analysis further revealed that PM10 holds a dominant position among environmental drivers in Xinjiang, followed by CO, AT, and PM2.5. This conclusion is consistent with the findings of a multi-center study in East Asia34, which identified particulate matter pollution as an important risk factor for respiratory infectious diseases. However, in Xinjiang, the contribution of PM10 is significantly higher than that of PM2.5. This discrepancy may be attributable to the unique geographical and climatic characteristics of Xinjiang. Specifically, the arid climate, prevalent dust storms, and basin topography promote the generation and persistence of coarse particulate matter (PM10), leading to its dominance over PM2.5 as the primary particulate pollutant. The arid climate and frequent sandstorms in Xinjiang result in an annual average PM10 concentration of 142.1 µg/m3 (14 times the WHO limit), far exceeding the levels in eastern Chinese cities. The coarse particulate matter in Xinjiang mainly originates from natural dust processes, such as the frequent dust storms originating from the surrounding deserts (e.g., the Taklamakan Desert), which is a dominant source throughout the year. Sand dust particles may exacerbate infection risks by damaging the respiratory mucosal barrier and promoting the colonization of Mycobacterium tuberculosis, a hypothesis that resonates with the pathological studies conducted by Rivas-Santiago et al.35 and Glass36.

As a secondary risk factor, CO exhibits a monotonically increasing exposure-response curve, which is closely related to widespread anthropogenic emissions from residential coal combustion for heating during the extended winter, a common practice in both rural and urban areas of Xinjiang. Raqib et al. confirmed in their study on household air pollution that CO can enhance tuberculosis susceptibility by inhibiting macrophage immune function37. This mechanism has also received support in a tuberculosis study in Lima, Peru38, but it has not been highlighted in previous studies in the humid eastern regions of China and other research39,40. The annual average temperature in Xinjiang is only 6.2℃, with a significant seasonal temperature difference of 8.9℃. Extreme low-temperature conditions (< 4℃) may prolong the survival time of Mycobacterium tuberculosis in aerosols41 and simultaneously increase indoor congregation behaviors among the population, thereby elevating the risk of droplet transmission.

The effects of meteorological factors exhibit significant nonlinear characteristics. There exists a “protective threshold” of wind speed WS ranging from 4.0 to 5.5 m/s, which is consistent with the dynamics of aerosol dispersion42. When WS is below this threshold, air stagnation leads to the accumulation of pollutants, while excessively high wind speeds may lift pathogen-bearing particulate matter from the surface43. When AR exceeds 10 mm, it can reduce the risk of transmission through the wet deposition effect. However, this effect is weaker in arid regions compared to humid areas, reflecting regional differences in dust deposition mechanisms. This study found that the risk of PTB increases when AH is below 45%, which is consistent with research conclusions that dry environments in temperate regions promote droplet transmission44. However, high humidity (>55%) does not show a significant protective effect, differing from conclusions in tropical studies45. This discrepancy may be related to the arid baseline characteristics of Xinjiang.

It is noteworthy that O3 shows a positive association with PTB incidence, but its SHAP importance ranking is moderately low. This is mainly because the annual average concentration of O3 in Xinjiang (62.4 µg/m3) is lower than that in severely ozone-polluted areas in inland China, and only 14.3% of samples exceed the threshold concentration of 80 µg/m3. Experimental studies46 have shown that O3 may only significantly inhibit the phagocytic function of alveolar macrophages through oxidative stress when its concentration is above 80 µg/m3. The suppression of O3 generation under high PM10 conditions47 further explains why its effect is masked by dominant factors. This is consistent with the conclusions of an Indian study48 but differs from a report in Los Angeles, USA49, where low particulate matter levels make O3 the primary risk factor. The exposure-response curve of NO2 exhibits a wavy pattern, and its SHAP importance ranking is the lowest. This nonlinear pattern suggests a possible duality in its role: the risk increases in the low-concentration zone (< 30 µg/m3), which may reflect the spread of pathogens carried by primary pollutants from traffic emissions50; the effect weakens in the high-concentration zone (> 50 µg/m3), possibly due to the consumption of NO2 through photochemical reactions to generate O3, forming a “NO2-O3” antagonistic effect51,52. In addition, the overall NO2 concentration in Xinjiang is relatively low (mean value of 27.8 µg/m3), with only 7.2% of samples exceeding China’s secondary standard (40 µg/m3), further limiting its overall influence.

Through multi-regional comparisons, this study found that the environmental exposure characteristics and disease-association patterns in Xinjiang differ markedly from those in inland China and other international regions53,54,55. Compared with developed regions in China56,57,58, Xinjiang’s unique arid climate, frequent sandstorms, and energy structure lead to pronounced regional specificities in the intensity and pathways through which environmental factors affect PTB incidence. At the methodological level, this study conducted a comparative analysis of model selection, verifying the advantages of the XGBoost-SHAP framework in analyzing complex environmental health effects and providing a methodological reference for similar studies.

Limitations and policy implications

Despite its contributions, this study is subject to several limitations. Firstly, at the data level, reliance on aggregated incidence data and environmental monitoring data from Xinjiang may introduce ecological bias. Moreover, individual-level environmental exposure and health data were not included, making it impossible to directly establish individual-level causal associations. Secondly, socioeconomic indicators (per capita GDP, medical accessibility), behavioral factors (smoking rate, population aggregation), and biological factors (HIV co-infection, drug resistance) were not incorporated into the model, which may lead to overestimation of the contribution of environmental factors, especially in impoverished and medically under-resourced areas such as rural Xinjiang. Furthermore, while SHAP values quantify the importance of environmental predictors, the findings lack direct mechanistic support from experimental studies, such as studies of the survival of M. tuberculosis in aerosols under varying temperature and humidity conditions or of the immunosuppressive pathways of air pollutants. Finally, the exposure-response relationships identified in the arid environment of Xinjiang may not be generalizable to humid regions because of fundamental differences in pollutant composition and climate patterns.

Notwithstanding these limitations, our findings carry important implications for public health policy. The identification of PM10 (largely dust-derived) and CO (primarily from coal combustion) as the dominant environmental drivers of PTB in Xinjiang suggests that regional control strategies should prioritize integrated dust suppression measures (e.g., afforestation, soil stabilization) and accelerated clean energy transition (e.g., replacing coal-based heating with electricity or natural gas in rural households). Moreover, the non-linear effects of meteorological factors suggest that public health advisories and intervention plans could be optimized based on seasonal weather patterns—for instance, issuing health warnings during low-wind periods (< 4.0 m/s) or high-dust seasons to reduce outdoor exposure. These targeted environmental interventions, combined with ongoing biomedical strategies, could significantly reduce the TB burden in arid regions and contribute to achieving the WHO’s End TB goals.

Conclusions

This study systematically evaluated the composite effects of multiple environmental factors on the incidence of PTB within the specific environmental context of Xinjiang by integrating PTB incidence data, concentrations of various air pollutants, and meteorological indicators from 2010 to 2022. By comparing two machine learning models, GBDT and XGBoost, and employing the SHAP value interpretation framework, this study quantified the marginal contributions of individual environmental factors and revealed the direction and magnitude of the effects exerted by the primary environmental drivers.
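To illustrate what a “marginal contribution” means in this framework, the sketch below computes exact Shapley values for a two-feature toy function by averaging each feature's marginal contribution over all feature orderings. It is a conceptual illustration only: the SHAP library uses tree-specific algorithms and a background dataset rather than this brute-force enumeration against a single baseline, and `toy_model`, its coefficients, and the feature values are hypothetical and unrelated to the fitted model in this study.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features are switched from their baseline value to their value in `x`
    one at a time; each feature's contribution is the resulting change in
    prediction, averaged over every possible ordering of the features.
    """
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            updated = predict(current)
            phi[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_model(f):
    """Hypothetical risk score with an interaction term (illustration only)."""
    return 0.02 * f["pm10"] + 1.5 * f["co"] + 0.01 * f["pm10"] * f["co"]


baseline = {"pm10": 100.0, "co": 1.0}  # hypothetical reference exposure
x = {"pm10": 150.0, "co": 2.0}         # hypothetical observation

phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction at `x` and the prediction at `baseline`. This additivity property is what allows SHAP values to be read as a decomposition of each prediction across features.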

The key findings are as follows: First, the XGBoost model demonstrated superior performance in fitting complex nonlinear relationships and handling high-dimensional data interactions (R2 = 0.91), significantly outperforming the GBDT model (R2 = 0.49). Second, SHAP analysis indicated that PM10 (mean concentration: 142.1 µg/m3, exceeding the WHO guideline limit by 14-fold) was the most prominent environmental risk factor for PTB incidence in Xinjiang, followed by CO, mean temperature, and PM2.5. Both PM10 and CO exhibited monotonic exposure-response relationships, with PTB risk increasing as their concentrations rose. Additionally, meteorological factors demonstrated significant nonlinear characteristics: a “protective threshold” for wind speed was identified within the 4.0–5.5 m/s range, while precipitation exceeding 10 mm showed a trend of reducing PTB risk.
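For readers less familiar with the metric used in the model comparison, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows this calculation under the standard definition. The observed and predicted values are hypothetical and are not the study's data or model outputs.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


# Hypothetical incidence values and predictions, NOT results from the study.
observed = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 14.0, 19.0]

score = r_squared(observed, predicted)
print(round(score, 3))
```

A value close to 1 indicates that the model explains most of the variance in the observed values. The metric is only comparable between models when it is computed on the same held-out data.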

The results of this study underscore that, within the arid ecosystem of Xinjiang, dust-related PM10 and coal combustion-derived CO are key environmental drivers of PTB transmission. This study not only provides a novel perspective for understanding the environmental mechanisms underlying PTB in arid regions but also demonstrates the effectiveness of the XGBoost-SHAP framework in dissecting complex environmental health effects. From a public health practice standpoint, the findings support the development of targeted regional policies for dust pollution control and clean energy promotion, offering a scientific basis for using environmental interventions to accelerate progress toward ending the tuberculosis epidemic.

Future research should aim to incorporate individual-level data, socioeconomic variables, and biomarker information, while validating the generalizability of the exposure-response relationships established in this study across different climatic zones and pollution contexts through multicenter collaborations. Such efforts will further advance the field of environmental tuberculosis toward precision and interpretability.