Introduction

Skeletal fluorosis (SF) is a neglected chronic disease that poses a significant public health burden in many low- and middle-income countries, with an estimated 100 million people affected globally, particularly in India, Africa, and China1,2. Fluoride occurs naturally in water, air, soil, and food; however, exposure levels exceeding the World Health Organization (WHO) threshold of 1.5 mg/L can lead to serious metabolic and skeletal disorders3,4,5. In China, fluorosis predominantly arises from three environmental exposure pathways: coal-burning, drinking-water contamination, and consumption of fluoride-rich brick tea. These pathways affect 28 provincial-level administrative regions and over 70,000 villages nationwide6. Although each exposure route leads to distinct fluoride absorption patterns, the majority of fluoride is ultimately deposited in mineralized tissues, cumulatively resulting in progressive skeletal damage over time7.

Clinically, early-stage SF presents with subtle symptoms such as joint pain and stiffness, which may progress to severe skeletal deformities and disability if left untreated8. Current diagnostic criteria rely heavily on radiographic evidence, which typically becomes apparent only after significant and often irreversible skeletal damage has occurred. Such diagnostic delays greatly undermine the effectiveness of early interventions9.

Recent advances in machine learning (ML) offer promising opportunities to overcome these diagnostic limitations through predictive modeling. ML algorithms can integrate heterogeneous clinical and environmental data to detect early, subtle indicators of SF progression, potentially outperforming conventional diagnostic approaches10. In this study, we aimed to develop and validate an ML-based predictive framework that integrates environmental fluoride exposure data with clinical biomarkers. We utilized data from three fluoride-endemic regions in China, each representing a unique environmental fluoride exposure route, and evaluated the performance of several ML algorithms. The inclusion of diverse exposure patterns enhances the generalizability of the model across different affected populations. This predictive tool enables early identification of high-risk individuals and provides a valuable approach for targeted public health interventions to reduce the burden and progression of SF.

Methods

Study design and participants

Data and participants for this study were drawn from the China Fluorosis Cohort (CFC), initiated in 2022. Based on the national classification criteria for endemic fluorosis areas (GB/T 17018–2011) and historical surveillance data, three provinces with distinct sources of fluoride exposure were selected: Shanxi (drinking-water type), Guizhou (coal-burning type), and Sichuan (brick-tea type)11,12,13. A multi-stage stratified cluster sampling strategy was used to recruit participants. In the first stage, 1–2 endemic areas were selected for each exposure type. In the second stage, the local Centers for Disease Control and Prevention (CDC) selected 3–6 communities/villages in each district based on population size. In the final stage, the diagnosis and grading of SF were conducted by a team of two radiologists and three orthopedic specialists based on standardized X-ray images and the People’s Republic of China Health Industry Standard for Endemic Fluorosis (WS/T192-2021). The same team was responsible for reading all radiographs from the three study regions. All cases were reviewed independently, and any diagnostic discrepancies were resolved through expert consensus. This process led to the inclusion of 1,789 individuals with varying degrees of SF severity.

All procedures in this study adhered to the ethical standards of our university, the National Research Committee, and the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The study was approved by the Ethics Committee of Guizhou Medical University (Ethics Approval No. 2023-189), and all participants provided written informed consent.

Data collection and preprocessing

Data were collected through a multimodal framework combining an epidemiological survey, clinical examination, and laboratory testing. A structured questionnaire designed by the research team for the CFC was administered through face-to-face interviews by master’s-level graduate students with medical backgrounds. All interviewers received centralized, standardized training prior to fieldwork to ensure consistency in question delivery and interpretation. To maintain data quality, a random sample of 20% of completed questionnaires was cross-checked each evening by an experienced quality control team. Feedback and retraining were provided as necessary, and questionnaires deemed invalid were excluded from the final analysis.

On the final day of fieldwork at each community site, standardized physical examinations were conducted. Local health center nurses were responsible for measuring anthropometric and physiological indicators, including height, weight, and blood pressure. Orthopedic specialists evaluated the functional status and pain levels of major limb joints (shoulders, elbows, knees, and hips) using standardized clinical protocols.

Blood and urine samples were collected from all participants during the survey. These specimens were used to analyze a comprehensive panel of laboratory markers, including liver function, kidney function, lipid level, electrolytes, bone metabolism indicators, and inflammatory markers.

Participants who did not undergo clinical examination or refused laboratory testing were excluded. Variables with more than 20% missing data were removed, and individuals with more than 50% missing values were also excluded from the analysis. For variables with 10% or less missing data, multiple imputation was performed using the mice package in R to reduce potential bias. The final study dataset comprised 1,309 individuals and was randomly divided into training and test sets in a 7:3 ratio. Min-max scaling was applied to continuous variables to ensure consistent scaling between the training and test sets, and categorical variables were converted to dummy variables by one-hot encoding to match the input requirements of the ML models.
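The split-then-scale step can be illustrated with a minimal Python sketch. The study itself used R, and the text does not state how the scaling parameters were estimated. The sketch assumes they are learned from the training set only and then reused on the test set, which is one common way to keep the two sets on a consistent scale. The function names, the seed, and the urinary fluoride values are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle rows with a fixed seed and split them into training and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


def fit_min_max(values):
    """Learn the scaling parameters (minimum and maximum) from training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with previously learned parameters.

    Test-set values can fall slightly outside [0, 1] when they lie
    beyond the range seen in the training set.
    """
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical urinary fluoride values in mg/L (not study data).
uf = [0.8, 1.3, 2.0, 2.6, 3.1, 1.1, 4.2, 0.9, 1.7, 2.4]

train, test = split_rows(uf, train_fraction=0.7, seed=42)
lo, hi = fit_min_max(train)              # assumption: parameters come from the training set
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

print(len(train), len(test))
print(train_scaled)
print(test_scaled)
```

Reusing the training minimum and maximum means that no information from the test set enters preprocessing.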

Feature selection and ML model building

To optimize feature selection and improve model interpretability, Least Absolute Shrinkage and Selection Operator (LASSO) regression was applied to the training set. Through L1 regularization, LASSO shrinks the coefficients of non-critical features to exactly zero, which reduces the influence of multicollinearity and yields a sparse feature set. Initially, 80 candidate variables were included, and the standardized feature matrix and the dichotomous dependent variable (SF severity, coded 0/1) were input into the model. LASSO regression was conducted using the cv.glmnet() function in the glmnet R package with 10-fold cross-validation. The regularization path of the penalty parameter λ was generated automatically using the default data-adaptive sequence provided by glmnet, which spans a logarithmically spaced range from maximal to minimal regularization. A fixed random seed was applied to ensure the reproducibility of fold assignments. Finally, the 22 features with non-zero coefficients at the penalty value minimizing the cross-validation error (lambda.min) were retained for subsequent analysis.
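How an L1 penalty drives weak coefficients to exactly zero can be seen in a small coordinate-descent sketch. This is an illustration only, not the study's glmnet fit. It uses a least-squares loss, (1/(2n))·RSS + λ·‖β‖₁, to stay short, whereas a binary outcome would normally be modeled with a logistic likelihood. The toy data, function names, and penalty values are all hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=50):
    """Minimize (1/(2n)) * RSS + penalty * sum(|beta_j|) by cyclic coordinate descent.

    Assumes the columns of X and the response y are already centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                fitted_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (z / n)
    return beta


# Toy design: two centered, orthogonal features; the second has only a weak effect.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [2.0 * x1 + 0.2 * x2 for x1, x2 in X]

sparse = lasso_coordinate_descent(X, y, penalty=0.5)  # strong penalty
dense = lasso_coordinate_descent(X, y, penalty=0.1)   # weak penalty

print(sparse)  # the weak feature's coefficient is exactly zero
print(dense)   # both coefficients survive, each shrunk by the penalty
```

With the stronger penalty the weak feature is removed entirely, which is the mechanism by which the candidate set shrinks as λ increases along the regularization path.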

Using these selected features, ML algorithms, including Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Decision Tree (DT), were constructed to develop prediction models for SF severity. To ensure model stability and prevent information leakage, all model training and hyperparameter tuning were conducted strictly within the training set using a grid-search strategy combined with 10-fold cross-validation. For each algorithm, a predefined and sufficiently wide hyperparameter space was systematically explored to identify the optimal configuration that maximized the area under the receiver operating characteristic curve (AUC). The search ranges were carefully calibrated to ensure that the finalized optimal values did not reside on the extreme boundaries of the predefined grids, thereby confirming the thoroughness of the optimization process. All evaluated parameters, their respective search ranges, and the final optimized values are detailed in Table 1.
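The tuning procedure, a grid search scored by cross-validated AUC inside the training set, can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's code: it tunes only the neighbor count of a one-feature KNN classifier, uses synthetic data, and all function names and grid values are hypothetical.

```python
import random


def auc_score(labels, scores):
    """Rank-based AUC: chance that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a fold contains a single class
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, n_folds, seed):
    """Partition shuffled indices 0..n-1 into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def knn_score(train_x, train_y, x, k):
    """Predicted probability: share of positives among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def grid_search_cv(x, y, grid, n_folds=10, seed=1):
    """Return the grid value with the highest mean cross-validated AUC and all mean AUCs."""
    folds = k_fold_indices(len(x), n_folds, seed)
    mean_auc = {}
    for k in grid:
        fold_aucs = []
        for held_out in folds:
            held = set(held_out)
            tr_x = [x[i] for i in range(len(x)) if i not in held]
            tr_y = [y[i] for i in range(len(x)) if i not in held]
            scores = [knn_score(tr_x, tr_y, x[j], k) for j in held_out]
            auc = auc_score([y[j] for j in held_out], scores)
            if auc is not None:
                fold_aucs.append(auc)
        mean_auc[k] = sum(fold_aucs) / len(fold_aucs) if fold_aucs else 0.0
    best = max(mean_auc, key=mean_auc.get)
    return best, mean_auc


# Synthetic training data: one feature whose mean is shifted by one unit in the positive class.
rng = random.Random(0)
labels = [i % 2 for i in range(60)]
feature = [label + rng.gauss(0.0, 1.0) for label in labels]

best_k, cv_auc = grid_search_cv(feature, labels, grid=[1, 3, 5, 7, 9, 11])
print(best_k, cv_auc)
```

Because every fold is drawn from the training data, the held-out test set plays no part in choosing the hyperparameters.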

Table 1 Hyperparameters for grid search tuning in each model.

After model optimization, the final models were applied exclusively to the independent test set to evaluate their generalization performance. Model performance was evaluated using AUC, accuracy (ACC), sensitivity (SENS), specificity (SPEC), and F1 score (Eqs. (1)–(4)). Among these metrics, AUC was predefined as the primary criterion for model comparison and selection, while the remaining indicators provided complementary assessments of classification performance. For the training set, performance metrics were reported as the mean values across the 10-fold cross-validation, whereas for the test set, metrics were calculated directly from the independent predictions. Furthermore, calibration curves and decision curve analysis (DCA) were conducted on the test set to assess the agreement between predicted probabilities and observed outcomes, as well as to quantify the net clinical benefit across a range of risk thresholds. The threshold-dependent metrics are defined as follows:

\(\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}\) (1)

\(\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}\) (2)

\(\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}\) (3)

\(\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}\) (4)
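As a worked illustration of Eqs. (1)–(4), the sketch below computes the four metrics from the cells of a confusion matrix, where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives. Precision is taken as TP / (TP + FP), the standard definition, which the text uses but does not state. The counts are hypothetical and are not the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, sensitivity, specificity, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # standard definition, assumed here
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for 200 test participants (not study data).
metrics = classification_metrics(tp=90, fp=30, fn=10, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the F1 score equals 2TP / (2TP + FP + FN) = 180/220, which shows that F1 ignores true negatives and can therefore sit well above specificity when false positives are common.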

Model explanatory analysis

SHAP (SHapley Additive exPlanations) values were used for feature importance analysis. The method is based on the Shapley value from cooperative game theory: it quantifies the contribution of each feature to an individual model prediction, which allows both the direction of each variable’s influence and its relative importance to be visualized.
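The underlying Shapley value can be computed exactly for a small model by enumerating every feature coalition, as in the sketch below. This is a didactic illustration under stated assumptions, not the procedure used in the study: practical SHAP implementations rely on faster algorithms for tree ensembles and average over background data, whereas this sketch replaces absent features with a single baseline value. The toy model, feature order, and values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are set to their baseline value
    (a simplifying assumption for this sketch).
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(x):
    """Hypothetical risk score with an interaction term; the third feature is unused."""
    pain, uf, unused = x
    return 0.1 + 0.3 * pain + 0.2 * uf + 0.1 * pain * uf


instance = [1.0, 1.0, 1.0]   # scaled feature values for one hypothetical participant
baseline = [0.0, 0.0, 0.0]   # reference values standing in for "feature absent"

phi = shapley_values(toy_model, instance, baseline)
print(phi)
print(sum(phi), toy_model(instance) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the unused feature receives a contribution of zero. These are the two properties that make the per-feature values comparable across predictors.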

Statistical analysis

All statistical analyses were performed using R (version 4.4.2) and SAS (version 9.4). The normality of continuous variables was assessed using the Kolmogorov-Smirnov test, which revealed that all continuous variables were non-normally distributed. Therefore, continuous variables were expressed as median and interquartile range (IQR) and compared using the Mann-Whitney U test or the Kruskal-Wallis test. Categorical variables were expressed as counts and percentages and compared using the chi-square test. All tests were two-sided, and P < 0.05 was considered statistically significant.

Results

Characteristics of participants

The training set included 917 participants (600 with mild SF and 317 with moderate-to-severe SF), and the test set consisted of 392 participants (254 mild, 138 moderate-to-severe). No significant differences in baseline characteristics were observed between the training and test sets (between-set P > 0.05), indicating that the data split preserved the representativeness of the overall sample. Patients with moderate-to-severe fluorosis were significantly older than those with mild fluorosis in the training set (median age approximately 68 vs. 58 years, P < 0.001), a trend that was also observed in the test set (P < 0.001). In addition, several demographic characteristics were associated with fluorosis severity. Notably, living area (region of residence), ethnicity, education level, and annual income differed significantly between the mild and moderate-to-severe groups (P < 0.001 for each), whereas severity did not differ significantly between male and female participants. The distribution of disease severity varied markedly by region. The coal-burning fluorosis region, Guizhou, had a much higher proportion of moderate-to-severe cases than the brick-tea fluorosis region, Sichuan, with the drinking-water fluorosis region, Shanxi, in between (overall P < 0.001 for living area). For example, in the training set, approximately half of the participants from Guizhou had moderate-to-severe SF, versus around 15% of those from Sichuan and around 40% of those from Shanxi. A similar pattern was seen in the test set. These differences likely reflect the varying intensity of environmental fluoride exposure across regions.

As for bone health–related indices, bone mineral density and a bone mass index measure were significantly lower in participants with moderate-to-severe SF than in those with mild SF. Conversely, parathyroid hormone (PTH) levels were significantly higher in the moderate/severe group than in the mild group (median 52.8 vs. 48.3 pg/mL, P = 0.002), whereas β-CrossLaps (a marker of bone resorption) levels were significantly lower in moderate/severe compared to mild cases (0.45 vs. 0.52 ng/mL, P = 0.001). These differences in bone turnover markers and bone indices were consistent in both the training and test sets (P < 0.05 in each set).

General biochemical and hematological indices also differed significantly between severity groups. Total bilirubin, urinary fluoride (UF), white blood cell count (WBC), neutrophil count (NEUT), mean corpuscular volume (MCV), mean corpuscular haemoglobin (MCH), and multiple inflammatory cytokines (e.g., IL-5, IFN-α, IL-2, IL-1β, IFN-γ, and IL-4) all differed significantly between mild and moderate-to-severe SF. Notably, UF levels in patients with moderate-to-severe SF were substantially higher than those in mild cases (median 2.04 vs. 1.30 mg/L, P < 0.001). Interestingly, several inflammatory cytokines were higher in mild fluorosis than in moderate-to-severe fluorosis. The trends for all these biomarkers were similar in both the training and test sets. Detailed data on the characteristics and group comparisons can be found in Table 2.

Table 2 Participant characteristics of the training and test set.

Feature selection

The LASSO regression retained 22 features with non-zero coefficients from the 80 candidate variables as predictors of SF severity. The coefficient paths plotted against the logarithm of the regularization parameter λ are shown in Supplementary material 1.

The final selected predictors included demographic characteristics (Age, Living area, Average sleep time), skeletal health indicators (Bone Density, Bone Mass Index), joint mobility measures (Elbow, Spinal, Knee, Shoulder function), pain level score, bone metabolism markers (PTH, Osteocalcin, PINP, β-CrossLaps, Vitamin D), inflammatory marker (IL-4), hematological indices (WBC, HCT, MCV, MCHC), liver function indicators (Total bilirubin), and an environmental exposure marker (UF).

Evaluation and performance of ML models

We assessed the predictive performance of five ML classifiers for SF severity using multiple evaluation metrics. AUC was predefined as the primary metric for model comparison, and the RF model achieved the highest cross-validated AUC in the training set among all evaluated classifiers. On the training set, the RF model achieved an AUC of 0.875 (95% CI: 0.851–0.913), along with favorable sensitivity (0.903; 95% CI: 0.871–0.936) and F1 score (0.855; 95% CI: 0.832–0.879). It also achieved an accuracy of 0.797 (95% CI: 0.765–0.829) (Fig. 1A; Table 3). When applied to the independent test set, the RF model maintained comparable performance with an AUC of 0.832, accuracy of 0.793, and F1 score of 0.841, indicating stable predictive performance and good generalizability (Fig. 1B; Table 4).

Calibration and decision curve analyses were used as complementary evaluations. Both RF and XGBoost showed good calibration and higher net benefit across decision thresholds (Fig. 1C,D). As RF consistently demonstrated the highest AUC during model development and stable performance in the independent test set, it was selected for subsequent interpretive analysis.
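The net benefit plotted in a decision curve can be computed from predicted probabilities with a few lines of Python. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) · p_t / (1 − p_t) at risk threshold p_t, and compares the model with a "treat everyone" strategy. The labels, probabilities, and function names are hypothetical and are not taken from the study.

```python
def net_benefit(labels, probabilities, threshold):
    """Net benefit of acting on everyone whose predicted risk is at or above the threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of acting on every individual regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = moderate-to-severe SF) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
probabilities = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.1, 0.4, 0.35, 0.05]

for threshold in (0.2, 0.5):
    model_nb = net_benefit(labels, probabilities, threshold)
    all_nb = net_benefit_treat_all(labels, threshold)
    print(f"threshold={threshold}: model={model_nb:.3f}, treat-all={all_nb:.3f}")

nb_model_half = net_benefit(labels, probabilities, 0.5)
nb_all_half = net_benefit_treat_all(labels, 0.5)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero, the latter being the net benefit of treating no one.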

Fig. 1 Performance evaluation of ML models for SF severity prediction. (A) ROC curves of the five ML models in the training set. (B) ROC curves of the five ML models in the test set. (C) Calibration curves. (D) Decision curve analysis.

Table 3 Model evaluation metrics in the training set.
Table 4 Model evaluation metrics in the test set.

To further explore the contribution of individual features to predicting SF severity, we conducted SHAP value analysis using the RF model. The SHAP analysis identified pain level score, knee function, age, shoulder function, and UF levels as the top five predictors with the greatest impact on the model’s predictions (Fig. 2A). Pain score emerged as the most influential predictor, exhibiting the highest SHAP values (>0.06), which indicates its critical role in differentiating SF severity. Other key factors, such as bone mass index, place of residence, total bilirubin, and bone metabolism markers (PINP, osteocalcin), also notably influenced severity predictions.

The SHAP summary plot (beeswarm plot, Fig. 2B) visualizes how each feature contributes to model predictions. Specifically, lower pain scores were highly consistent with the prediction of mild SF, and individuals with no significant activity limitations in knee and shoulder function were similarly categorized as mild SF. Conversely, older age was associated with a greater predicted likelihood of moderate-to-severe SF. In addition, lower bone mass index values and higher levels of bone metabolic markers (PINP and osteocalcin) were also associated with greater predicted SF severity.

SHAP dependence plots for the most important predictors, including pain score and UF concentration, were further analyzed (Fig. 2C and D). UF exhibited a non-linear relationship with fluorosis severity prediction: initial increases in UF concentration sharply raised SHAP values, signifying increased disease severity risk, but beyond a certain threshold, additional increases produced diminishing predictive impact. In contrast, the dependence plot for pain score revealed a more complex pattern; low pain scores were consistently associated with mild fluorosis predictions (negative SHAP values), while higher scores correlated positively and increasingly with predictions of moderate-to-severe SF. These results underline the pivotal role of pain severity and environmental fluoride exposure, captured through UF levels, as predictors of SF severity.

Fig. 2 SHAP-based interpretability analysis of key predictors for SF severity. (A) SHAP feature importance plot. (B) SHAP summary plot. (C) SHAP dependence plot for UF. (D) SHAP dependence plot for pain level score.

Discussion

This study presents a robust and interpretable ML model for predicting the severity of SF, integrating clinical biomarkers and environmental fluoride exposure data from three distinct endemic regions in China, representing coal-burning (Guizhou), drinking-water (Shanxi), and brick-tea consumption (Sichuan) exposure pathways. Among the five ML models tested, the RF model demonstrated the highest predictive performance (AUC = 0.875 in training and 0.832 in test), highlighting the utility of ML approaches in capturing the complex interplay between environmental exposure and disease progression. This represents an advancement beyond traditional radiographic severity assessment, offering a complementary data-driven perspective.

The selection of performance metrics is crucial when evaluating ML models in epidemiological contexts where class imbalance is common. Such an imbalance is frequently encountered in studies of chronic diseases, cancer prognosis, and disease severity stratification. In many of these settings, researchers may apply resampling techniques such as SMOTE or other balancing strategies during model training to address skewed class distributions14,15. Importantly, however, these preprocessing strategies do not alter the predominant use of AUC as the primary evaluation metric in medical classification research16,17. In this study, the dataset reflects the real-world distribution of SF severity, where moderate-to-severe cases are naturally less frequent than mild cases. Our objective was not to artificially rebalance the data, but to preserve this distribution to better reflect practical screening scenarios in endemic populations. While some researchers advocate for Precision-Recall (PR) curves in imbalanced settings, recent evidence suggests that AUC maintains robust discriminative power and remains highly competitive with PR-AUC in many medical classification scenarios18,19. Nevertheless, we acknowledge that relying solely on AUC may obscure the practical implications of false positives under specific decision thresholds. To address this, we complemented AUC with threshold-dependent metrics such as the F1-score, which balances precision and recall, as well as calibration and decision curve analyses to provide a more practically interpretable evaluation of model performance.

Pain level emerged as the most influential predictor of SF severity in our model, followed by knee function, age, shoulder function, and UF. This finding aligns closely with the clinical progression of SF, which typically begins with joint pain, stiffness, and limited mobility, and may eventually lead to deformities due to ossification of ligaments and the formation of bony outgrowths7,20. Deformities of the knees, hips, and spine are especially common in moderate-to-severe cases21,22. SHAP analysis revealed a clear gradient between higher pain scores and the predicted probability of severe SF. This suggests that pain, while often considered a subjective symptom, carries substantial risk stratification and severity discrimination value. Previous studies have demonstrated strong correlations between pain level and both SF severity (r = 0.71, P < 0.001) and patient age (r = 0.79, P < 0.001)23. This relationship indicates that pain may reflect not only current disease status but also cumulative exposure and biological vulnerability associated with aging.

In addition, pain scores are closely linked to joint function, particularly in the knees, a site frequently affected by fluorosis-related osteoarticular damage. One study emphasized the utility of pain scores in assessing functional impairment and disease burden in fluorosis-induced osteoarthritis24. This supports our finding that knee and shoulder function were among the top-ranking features in the predictive model. As such, pain scoring and joint function assessment may serve as practical and cost-effective indicators for early risk identification in endemic communities.

UF, as a direct biomarker of environmental fluoride exposure and body burden, also played a key role in our model23,25. Elevated UF levels have been shown to significantly increase the risk of SF (OR = 7.25)26, and our SHAP analysis confirmed that individuals with higher UF levels were more likely to be classified as having moderate-to-severe SF. Interestingly, the relationship between UF and SF severity appeared to follow a nonlinear threshold effect—suggesting that once UF levels surpass a critical value (e.g., 4 mg/L), disease progression may accelerate more rapidly and irreversibly. This highlights the public health importance of early intervention in high-exposure communities before fluoride accumulation crosses harmful thresholds. Multiple studies have also affirmed the value of UF as a non-invasive, objective biomarker for both individual-level risk assessment and community-level exposure surveillance27,28.

Together, these findings support a comprehensive model in which subjective symptoms (e.g., pain), functional measures (e.g., joint mobility), and objective biomarkers (e.g., UF) jointly contribute to SF severity assessment, risk evaluation, and prevention. The integration of such multidimensional indicators could significantly improve early detection and risk stratification strategies in endemic settings.

Beyond joint symptoms and fluoride exposure, alterations in bone metabolism also emerged as key indicators of SF severity. In particular, elevated levels of two bone formation markers, PINP and osteocalcin, were strongly associated with moderate-to-severe disease in our model29,30. PINP, an early marker of type I collagen synthesis, reflects enhanced osteoblast proliferation and collagen production, while osteocalcin, secreted by mature osteoblasts, indicates active matrix mineralization31,32. The concurrent elevation of both markers points to a pathophysiological state of high bone turnover in SF patients. This metabolic profile underscores the involvement of systemic bone remodeling processes in the pathogenesis of SF, beyond the localized symptoms and fluoride burden. These findings align with the recognized pathological features of SF, in which heightened osteogenic activity is accompanied by compromised structural integrity: although bone anabolism is accelerated, the newly formed bone tends to be structurally abnormal, characterized by increased density but higher brittleness33. Our results reinforce this paradoxical relationship: despite elevated bone formation markers, individuals with moderate-to-severe SF had significantly lower bone mass index and bone density than those with milder disease. This “high turnover, low quality” state of bone metabolism represents a hallmark of osteofluorosis and likely contributes to increased fracture risk, skeletal deformities, and impaired mobility8,34.

Notably, bone metabolic disturbances were evident not only between healthy individuals and SF patients, but also among patients with differing levels of disease severity, indicating a progressive decline in bone quality along the disease continuum. Moreover, such imbalances appear to persist regardless of specific exposure pathways, highlighting the need to incorporate bone quality assessment into early screening and population-level health monitoring of SF. Evaluating markers like PINP and osteocalcin could aid in identifying individuals at risk for progressive bone damage, thereby informing preventive strategies and guiding the development of targeted therapeutic interventions aimed at restoring bone homeostasis.

Our results reveal a gradient association between fluoride exposure pathways and SF severity, with coal-burning areas showing the highest proportion of moderate-to-severe cases, followed by drinking-water and brick-tea regions. This pattern likely stems from differences in exposure complexity, cumulative dose, and intervention effectiveness. Coal-burning fluorosis in Guizhou is particularly severe due to combined dietary and inhalation exposure. Locally used coal and clay contain high fluoride levels (78 mg/kg and 1116 mg/kg, respectively), with coal-clay briquettes reaching 313 mg/kg35. Roasting staple foods like corn leads to surface fluoride accumulation, raising levels to 20.26 mg/kg, 16 times that of unroasted samples36. Despite stove renovation efforts, traditional practices persist, limiting control efficacy25. In contrast, drinking-water-type fluorosis, though globally widespread (≥1.5 mg/L exposure in ~200 million people; 72.07 million in China)12, has declined due to national interventions like the “Three Five-Year Plans” and the “Three-Year Action Plan”37,38,39,40. Brick-tea-type fluorosis, prevalent in Sichuan, is caused by long-term consumption of fluoride-rich Camellia sinensis products41. Though it leads to chronic accumulation, symptoms are generally milder and reversible upon cessation, making behavioral intervention and education the primary prevention strategy42,43. In summary, the severity of SF correlates with the intensity and multiplicity of fluoride exposure routes. These findings emphasize the need for tailored interventions, ranging from technological and environmental to behavioral measures, based on specific exposure contexts.

This study has several notable strengths. It is the first to construct an interpretable predictive model for SF severity that integrates three major fluoride exposure pathways (coal-burning, drinking-water, and brick-tea types), enhancing both the predictive utility and generalizability of the findings across endemic regions. The identification of key early-stage indicators, such as pain level scores, joint function, and UF, underscores their evaluative value throughout the disease continuum, extending beyond the distinction between patients and healthy individuals to the differentiation among varying degrees of severity. These insights support more precise risk stratification and provide an evidence base for targeted, context-specific public health interventions in fluoride-affected populations worldwide.

Nonetheless, several limitations should be acknowledged. First, the relatively low specificity observed in some models may be attributable to the substantial overlap in clinical and biochemical profiles between mild and moderate-to-severe SF groups, as well as the inherently non-linear and non-separable nature of disease severity progression. These characteristics may increase the likelihood of false-positive classifications when distinguishing between severity levels. Second, the study did not achieve precise quantification of individual-level fluoride exposure. Although regional fluoride types were considered, environmental monitoring data (e.g., from air, water, and food) were not directly integrated, limiting the granularity of exposure assessment. Third, key nutritional factors such as calcium and vitamin D intake, which are known to modulate fluoride toxicity, were not included in the model, potentially affecting the comprehensiveness of disease prediction. Finally, while age was identified as a strong predictor, it does not fully capture cumulative fluoride exposure or distinguish age-related bone degeneration from fluoride-specific skeletal damage. Longitudinal cohort studies with detailed dietary and exposure tracking are needed to disentangle these effects and validate the model’s applicability over time.

Conclusions

This study developed and validated an interpretable machine learning framework for predicting the severity of SF by integrating multi-dimensional predictors, including demographic factors, environmental exposure levels, and key biomarkers. The model provides a practical tool for early identification and risk assessment in affected populations. By supporting timely diagnosis and stratified management, it holds potential value for guiding public health interventions in fluoride-endemic regions. Further work is needed to optimize the model’s performance and evaluate its broader applicability in diverse geographical contexts.