Introduction

Arthritis is a musculoskeletal condition that commonly leads to disability and significantly impacts the quality of life of those affected1. It not only causes pain and affects physical functioning but is also associated with various other outcomes, including mental health issues, sleep disturbances, work limitations, and even mortality2. Currently, more than 300 million people worldwide suffer from arthritis, and its prevalence continues to rise3. Despite this, there is still a lack of effective medications to halt the progression of arthritis. The high prevalence of arthritis and its associated complications4, such as joint injuries and disability, place an increasing burden on the global public health system5.

Arthritis affects women more than men, and the risk of developing the disease increases with age6. It reduces quality of life in older adults and places an increased burden on them, their families, and society as a whole7,8. Therefore, it is important to identify potential high-risk populations in the early stages of arthritis onset so that prevention and symptom relief can be targeted.

According to current research, the risk factors for arthritis can be divided into two categories: joint-level and individual-level factors9. Joint-level factors refer to the degree of loading on the joints, while individual-level factors include age, sex, obesity, and various other indicators. Emerging evidence indicates that systemic inflammation is a critical factor in the pathogenesis of various forms of arthritis10,11. The systemic immune-inflammation index (SII) is used as an indicator of the systemic inflammatory response, and its field of application has expanded in recent years. An increasing number of studies suggest that the SII can also be used to predict the severity of certain diseases and to monitor the effectiveness of treatments12.

Moreover, recent studies have demonstrated that the non-HDL cholesterol to HDL cholesterol ratio (NHHR) is associated with several diseases, such as periodontitis13 and diabetes14, and diabetes is itself a risk factor for arthritis15. However, few studies have examined the possible association between the NHHR and arthritis. In addition, a growing number of studies have linked obesity-related adipokines to the development of arthritis16. Some studies have shown that certain obesity indices are associated with arthritis4,17,18, especially indices that differ from the traditional body mass index (BMI), such as the weight-adjusted-waist index (WWI), the waist-to-height ratio (WHtR), and other obesity indices that incorporate waist circumference. These indices are easier to measure than obesity-related adipokines.

At present, the relationship between various indicators and arthritis still needs to be verified. Therefore, this study aims to investigate the relationship between different obesity indicators, NHHR, and SII and the risk of arthritis, identify the factors that have the greatest impact on arthritis, develop a new nomogram, and provide new methods for predicting arthritis.

Materials and methods

Study population and data

The study population was obtained from the National Health and Nutrition Examination Survey (NHANES), a public database collected and maintained by the CDC in the United States. The data are updated biennially and were essentially continuous from 1999 to 2020. Due to the impact of the COVID-19 pandemic, data collection was suspended from March 2020 to July 2021 and resumed in August 2021. This cross-sectional study used data collected from August 2021 to August 2023 on U.S. residents aged 20 years or older. The exclusion criteria were as follows: (1) participants under 20 years of age; (2) participants who did not complete the obesity indicator measurements and the arthritis questionnaire; and (3) participants with missing covariate data. After manual data screening, 3,660 participants were included in the analysis (Fig. 1), of whom 1,234 had been diagnosed with arthritis.

Fig. 1 Flowchart of the study participants.

Definition of variables

The outcome variable in this study was self-reported arthritis, which was ascertained through the questionnaire item: "Have you ever been diagnosed with arthritis?" After reviewing the literature, this study selected confounding variables that could affect arthritis, including sex, race, education level, family poverty-income ratio (PIR), age (grouped as 20–39, 40–59, and 60+ years), and drinking status (participants who answered “yes” to the questionnaire were considered drinkers). Smoking status was based on the questionnaire item about having smoked at least 100 cigarettes in one's lifetime (participants who answered “yes” were considered smokers)19. Hypertension and diabetes status were also based on participants’ affirmative responses to the questionnaire.

Trained technicians measured the height, weight, and waist circumference (WC) of all participants. The total serum 25(OH)D concentration was calculated as the sum of 25(OH)D2 and 25(OH)D3 and grouped into quartiles20: Group 1 (0–57.3 nmol/L), Group 2 (57.4–78.5 nmol/L), Group 3 (78.6–103.0 nmol/L), and Group 4 (above 103.0 nmol/L). After a fasting period of at least 8 h, blood samples were collected from participants for laboratory testing, including measurement of triglyceride (TG) and high-density lipoprotein cholesterol (HDL-C) levels. The formulas for calculating BMI, WHtR, WWI, SII, NHHR, the conicity index (CI), and a body shape index (ABSI) are outlined below21:

$$\text{BMI} = \frac{\text{Weight}}{\text{Height}^{2}} \qquad \text{(ref. 22)}$$
$$\text{ABSI} = \frac{\text{WC}}{\text{Height}^{1/2} \times \text{BMI}^{2/3}} \qquad \text{(ref. 23)}$$
$$\text{CI} = \frac{\text{WC}}{0.109 \times \sqrt{\text{Weight}/\text{Height}}} \qquad \text{(ref. 24)}$$
$$\text{WWI} = \frac{\text{WC}}{\sqrt{\text{Weight}}} \qquad \text{(ref. 25)}$$
$$\text{WHtR} = \frac{\text{WC}}{\text{Height}} \qquad \text{(ref. 4)}$$
$$\text{NHHR} = \frac{\text{TC} - \text{HDL}}{\text{HDL}} \qquad \text{(ref. 12)}$$
$$\text{SII} = \frac{\text{Platelets} \times \text{Neutrophils}}{\text{Lymphocytes}} \qquad \text{(ref. 26)}$$

The components of the SII formula are all derived from peripheral blood, and the fourteen indicators chosen for this investigation can be obtained through straightforward anthropometric measurements and basic blood chemistry analyses. Detailed measurement procedures for the study variables are available at https://www.cdc.gov/nchs/nhanes/.
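As an illustration of how these indices follow from a handful of routine measurements, the sketch below computes them in Python. It is not the study's code. It uses the commonly cited definitions (WHtR as waist circumference over height, WWI as waist circumference over the square root of weight), and the unit conventions, function names, and example values are assumptions introduced here.

```python
import math


def anthropometric_indices(weight_kg, height_cm, wc_cm):
    """Compute BMI, WHtR, WWI, ABSI and CI from basic body measurements.

    Unit conventions are assumptions based on the commonly cited
    definitions of each index, not on this study's text.
    """
    height_m = height_cm / 100.0
    wc_m = wc_cm / 100.0
    bmi = weight_kg / height_m ** 2                # kg/m^2
    whtr = wc_cm / height_cm                       # dimensionless ratio
    wwi = wc_cm / math.sqrt(weight_kg)             # cm per sqrt(kg)
    absi = wc_m / (bmi ** (2.0 / 3.0) * math.sqrt(height_m))
    ci = wc_m / (0.109 * math.sqrt(weight_kg / height_m))
    return {"BMI": bmi, "WHtR": whtr, "WWI": wwi, "ABSI": absi, "CI": ci}


def blood_indices(tc, hdl, platelets, neutrophils, lymphocytes):
    """NHHR from total and HDL cholesterol (same units); SII from cell counts."""
    nhhr = (tc - hdl) / hdl
    sii = platelets * neutrophils / lymphocytes
    return {"NHHR": nhhr, "SII": sii}


if __name__ == "__main__":
    # Made-up example values for illustration only, not NHANES data.
    print(anthropometric_indices(weight_kg=81.0, height_cm=180.0, wc_cm=90.0))
    print(blood_indices(tc=200.0, hdl=50.0, platelets=250.0,
                        neutrophils=4.0, lymphocytes=2.0))
```

Keeping the anthropometric and blood-based indices in separate functions mirrors the two measurement sources described above.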

Development and validation of machine learning prediction models

In this study, Python 3.12.3 (https://www.python.org) was used to build the models. The dataset was divided into training and testing sets at a 7:3 ratio, and the Random Forest (RF) algorithm was used for model construction. The prediction model was built on the training set and evaluated on the test set. A 500-fold repeated cross-validation method was employed, and the area under the receiver operating characteristic curve (AUC) was calculated. Finally, the optimal model was interpreted using SHapley Additive exPlanations (SHAP) to assess the impact of each variable on the prediction results. A SHAP dependence plot was created to illustrate the relationship between the predictor and the dependent variable.
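The two evaluation ingredients named here, a 7:3 split and the AUC, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the text does not say which libraries were used, the random forest itself is omitted, and the function names and seed are hypothetical.

```python
import random


def split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the row indices 0..n-1 and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    train, test = split_indices(3660)
    print(len(train), len(test))
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in sample size, which is acceptable for a sketch; a rank-based formulation would be preferable for large test sets.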

Statistical methods

Data analysis was conducted using R 4.4.1 software (http://www.R-project.org) and SPSS version 26.0 (https://www.ibm.com/spss). Continuous variables were presented as means and standard deviations, with Student’s t-tests used to determine differences between groups. Categorical variables were expressed as frequencies (n) and percentages (%), and group comparisons were conducted using the chi-square test.

Moreover, the relationships between arthritis and other variables were examined through least absolute shrinkage and selection operator (LASSO) regression analysis. This method was employed to reduce the dimensionality of the data and identify the most valuable predictors. The coefficients of variables that contributed minimally to the overall model were shrunk to zero, which helps preserve the predictive accuracy of the model27. The data were divided 7:3 into training and test sets. The training set was used to develop the nomogram model. Receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA) were then used to evaluate the performance of the model and to assess its predictive effectiveness in the training and test sets.
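The way LASSO drives small coefficients exactly to zero can be seen in the soft-thresholding step that coordinate-descent LASSO solvers apply to each coefficient. The sketch below shows only that single step, with made-up coefficient values and a hypothetical penalty; it is not the fitting procedure used in this study, whose software settings are not described.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def shrink_all(coefficients, lam):
    """Apply the soft-thresholding step to every coefficient in a list."""
    return [soft_threshold(c, lam) for c in coefficients]


if __name__ == "__main__":
    # Hypothetical unpenalized coefficients and penalty, for illustration only.
    raw = [1.5, -0.3, 0.05, -2.0]
    print(shrink_all(raw, lam=0.5))
```

Larger values of the penalty zero out more coefficients, which is why the choice of λ determines how many predictors are retained.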

On the basis of the regression coefficients of the logistic regression, a nomogram was constructed for all the factors, and the ROC curve was used to evaluate the discriminative ability of the nomogram28. The predictive abilities of the RF model and the nomogram model were compared, and the model with the highest AUC value was selected as the final model. A calibration curve was used to assess the agreement between the predicted probabilities and the observed probabilities of the final model29. Finally, DCA was performed, because it avoids the cutoff-value, sensitivity, and specificity issues that may be associated with the ROC curve and directly quantifies the net benefit of the final model in a clinical setting30. Statistical significance was defined as a P-value ≤ 0.05.
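For readers unfamiliar with DCA, the net benefit at a threshold probability pt is conventionally computed as TP/n − (FP/n) × pt/(1 − pt), and compared with a "treat all" reference strategy. The sketch below implements that standard formula in Python; the function names and the toy labels and probabilities are assumptions for illustration, not study data.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: classify every participant as high risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    # Toy labels and predicted probabilities, for illustration only.
    y = [1, 1, 0, 0]
    p = [0.9, 0.4, 0.6, 0.1]
    for pt in (0.2, 0.5):
        print(pt, net_benefit(y, p, pt), net_benefit_treat_all(y, pt))
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all value and zero (the treat-none strategy).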

Results

Characteristics of the participants

In this study, 3,660 participants were identified as the final study population (Fig. 1). The demographic characteristics of the participants are shown in Table 1. Among the participants without arthritis, 1,171 (48.3%) were male and 1,255 (51.7%) were female, with a mean age of 48.1 ± 16.6 years and a mean SII of 511 ± 307. Among the participants with arthritis, 505 (40.9%) were male and 729 (59.1%) were female, with a mean age of 63.5 ± 11.7 years and a mean SII of 617 ± 438. Table 1 also shows detailed baseline data for the other variables; all variables except PIR differed significantly between the two groups (P < 0.05).
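The sex comparison can be re-derived from the counts reported in this paragraph. The sketch below computes the Pearson chi-square statistic for the 2×2 table of sex by arthritis status, without continuity correction; whether the original analysis applied a correction is not stated, so this is an assumption, and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


if __name__ == "__main__":
    # Rows: without arthritis, with arthritis. Columns: male, female.
    chi2 = chi_square_2x2(1171, 1255, 505, 729)
    critical_value = 3.841  # chi-square critical value for 1 degree of freedom at alpha = 0.05
    print(round(chi2, 2), chi2 > critical_value)
```

A statistic above the critical value is consistent with the reported significant difference in sex distribution between the two groups.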

Table 1 Baseline characteristics of participant demographics.

Model development and selection

ROC curves were constructed to assess the individual predictive performance of each variable for arthritis (Fig. 2). When each variable was used alone to predict arthritis, the five best-performing variables were age, WWI, ABSI, SII, and WHtR. BMI was not among them.

Fig. 2 ROC curve for predicting arthritis separately for all variables.

The LASSO regression model analyzed a total of 23 variables and selected the best variables based on the value of λ (Fig. 3). The 12 selected variables were entered into the logistic regression equation. After using the stepwise backward method to exclude variables with P > 0.05 in the logistic regression equation, a total of nine variables were ultimately included in the logistic regression (Table 2).

Fig. 3 LASSO regression for all variables: (A) regression coefficient diagram. (B) cross-validation curve.

Table 2 Multivariate logistic regression of predictive factors for arthritis patients.

Based on the final model, a nomogram was constructed for predicting arthritis. The risk factors included age, sex, PIR, race, the occurrence of diabetes, vitamin D levels, the SII, and the WHtR (Fig. 4).

Fig. 4 Nomogram for predicting arthritis risk.

The AUC of the nomogram model is 0.784 (Fig. 5A), while the AUC of the random forest model is 0.771 (Fig. 5B). Therefore, the nomogram model was selected as the final prediction model.

Fig. 5 Predictive model evaluation: (A) ROC curve of nomogram model. (B) ROC curve of random forest model.

Model interpretation and validation

The SHAP values of the variables in the random forest model were calculated to generate a variable importance ranking chart (Fig. 6A) and SHAP dependence plots (Fig. 6B). As shown in Fig. 6A, age, WHtR, and vitamin D level are the three most important variables for predicting the risk of arthritis. The SHAP dependence plots further illustrate the influence of variables on model predictions, with Fig. 6B displaying the SHAP values associated with WHtR and arthritis. A WHtR of 0.6 is identified as the critical threshold for the occurrence of arthritis, and the risk of developing arthritis rises as WHtR increases, demonstrating a non-linear positive correlation between WHtR and arthritis.

Fig. 6 Predictive model evaluation: (A) variable importance. (B) SHAP dependence plots.

Calibration curves (Fig. 7A) and DCA (Fig. 7B) were plotted for the nomogram model. The calibration curve indicates good agreement between the predicted and observed results. The DCA curve shows a net benefit at threshold probabilities ranging from 0 to 65%, suggesting that the model has meaningful predictive value for arthritis. Additionally, the predictive model demonstrated good discriminative ability, high accuracy, and potential clinical benefit in the validation set.

Fig. 7 Predictive model evaluation: (A) calibration curves. (B) DCA.

Discussion

This study developed a nomogram model consisting of nine predictive factors to assess the risk of arthritis. In validation, the model demonstrated good predictive value and could provide more precise risk stratification for potential high-risk populations.

Arthritis is currently a significant public health problem, with obesity31, dietary intake, type 2 diabetes, and inflammatory conditions widely recognized as influencing its development. Among these factors, the association between BMI and arthritis has been extensively studied, but BMI is not a good indicator of fat distribution in the body32. Studies have shown that BMI has a J-shaped association with all-cause mortality and most cause-specific mortality, with a lower BMI associated with an increased risk of death33. However, much of the dispute over the ‘obesity paradox’ can be explained by the fact that individuals in a relatively low BMI range34 have a lower lean body mass and body weight rather than less body fat. This may explain why BMI was not a significant predictor in the LASSO regression.

Upon reviewing the literature, other indicators of obesity, such as WHtR and WWI, were included in this study. These newly proposed indicators explain certain diseases better than BMI25, so they were incorporated into the model. Both WWI and WHtR were identified as meaningful predictors under the chosen regularization parameter in the LASSO regression, which aligns with previous research18, but WWI was subsequently excluded from the nomogram model because it was not statistically significant in the logistic regression. This could be due to the inclusion of different variables in the model or to differences in the samples.

Gender, age, ethnicity, and PIR were also selected as important factors in this model. This is broadly consistent with the results of Wang17, in which gender and age were likewise included in the model as important factors, whereas ethnicity was not. The difference may arise because, although both studies are based on NHANES, the samples were collected before and after the outbreak of COVID-19, respectively, with different sample selection methods.

Diabetes and hypertension were both included in this analysis, and diabetes was retained in the nomogram model as a significant parameter. This is broadly consistent with the findings of Matsunaga et al.35, because metabolic abnormalities are correlated with some types of inflammation; however, some studies36 suggest that diabetes and arthritis may be complications, so further research is needed to address these contradictions. Hypertension was excluded in the logistic regression, which differs slightly from the results of previous studies37. Those studies suggest that the hemodynamic abnormalities induced by primary hypertension38 may further promote the transformation of endothelial cells into a proinflammatory phenotype, increase the expression of inflammatory genes and adhesion molecules, and exacerbate the inflammatory cascade response. In this study, hypertension was selected by the LASSO regression but was not significantly associated with arthritis in the logistic regression; this may be due to differences in the samples. The evidence in this area therefore remains contradictory, and further studies are needed to explore this association.

Vitamin D deficiency39 is associated with numerous health issues. For instance, rheumatoid arthritis, an autoimmune disease, is a chronic inflammatory systemic disorder that can damage joints, leading to cartilage destruction and bone erosion. Vitamin D3, acting as a steroid hormone40, plays a significant role in bone metabolism. Although there is no consensus regarding human serum vitamin D levels under healthy and pathological conditions, concentrations ranging from approximately 30 to 100 ng/mL have been reported in healthy individuals, and concentrations ranging from 21 to 29 ng/mL characterise a vitamin D-deficient state. Levels less than 20 ng/mL indicate significant vitamin D deficiency41. In this study, serum vitamin D levels were divided into quartiles, and the risk of arthritis was high in the fourth group, which had the highest concentrations. This seemingly paradoxical finding may arise because some arthritis patients had followed medical prescriptions and taken vitamin D for a long time.

The results of this study indicate that the SII can be used as a parameter in predictive models of arthritis. Consistent evidence from previous studies10 suggests that the SII is a more precise predictor of inflammation and may also be added as a parameter to future models for predicting inflammation-related diseases.

Among the nomogram predictors, WHtR made a comparatively large contribution to the predicted risk, which is similar to previous research42. That study also introduced the triglyceride glucose index but did not evaluate the effect of WHtR. The present study addressed this gap and found that WHtR has a significant impact on arthritis. High-risk groups can therefore monitor changes in these variables, and WHtR is easily obtained by measuring waist circumference and height. Compared with BMI, WHtR has a better predictive effect on arthritis.

Study strengths and limitations

The study has several strengths. Firstly, it is the first to identify WHtR as a vital influencing factor for the risk of arthritis. Secondly, for the first time, SII and WHtR, along with other easily obtainable factors, were combined to construct a simple, feasible, fast, and cost-effective arthritis incidence prediction model. The model has been validated through a validation set to confirm its reliability, consistency, and accuracy.

This study also has several limitations. First, due to the nature of the cross-sectional study design, this study was unable to establish clear causal relationships. Second, the results of the current study are based on U.S. adults; therefore, it may not represent the characteristics of other populations. Third, although this study adjusted for a number of potential confounders, it was unable to completely rule out the possibility of unmeasured confounders. Additionally, the study could not develop more accurate models for predicting arthritis due to incomplete data for the years 2021–2023, such as TG, LDL, some laboratory data, and dietary data.

Despite these limitations, this study has developed a new nomogram model that performs well.

Conclusion

This study shows that SII and WHtR play an important role in the risk of arthritis. It systematically integrated the predictive role of key variables in arthritis and established a nomogram. Because the variables in this nomogram are inexpensive and easy to collect, it could in the future save health resources, diversify the prediction of arthritis, and provide more effective methods for early prevention, control of disease progression, and avoidance of serious harm caused by arthritis. For example, high-risk populations can use the nomogram to target interventions at high-risk variables, such as reducing waist circumference through abdominal exercise to lower WHtR, and doctors and nurses can pay more attention to SII or other easily obtainable indicators during physical examinations of high-risk populations to reduce the risk of arthritis.