Introduction

Acute appendicitis is a leading cause of acute surgical abdomen worldwide, with an estimated 17.7 million new cases reported in 2019 (incidence rate: 228/100,000). This number continues to rise annually1. Appendicitis is classified into uncomplicated (simple or suppurative) and complicated (perforated or gangrenous) types, each requiring distinct treatment approaches. For uncomplicated appendicitis, several studies have shown the effectiveness of antibiotic therapy, offering a potential alternative to surgery in selected cases2,3,4,5. In contrast, surgical intervention remains the standard of care for complicated appendicitis, as delays or inappropriate treatment can lead to severe complications.

A multicenter study6 found that non-surgical treatment for uncomplicated appendicitis resulted in fewer complications and a shorter hospital stay. Similarly, a meta-analysis by Xu et al.7 indicated that for uncomplicated appendicitis, the hospitalization duration and one-month post-treatment quality of life were comparable to those following surgery. A collaborative study8 also reported no significant difference between the non-surgical and surgical groups in physical status scores at 30 days post-treatment or in 90-day complication rates. However, for complicated appendicitis, current guidelines9 still recommend surgery as the preferred treatment. Despite advancements in diagnosis and treatment, distinguishing between uncomplicated and complicated appendicitis preoperatively remains a significant challenge in clinical practice. Failing to accurately identify complicated appendicitis can lead to unnecessary medical interventions and wasted resources, whereas non-surgical treatment for uncomplicated cases can help reduce the burden on medical resources. Some researchers have attempted to classify appendicitis types using laboratory indicators alone10,11,12,13, while others have combined laboratory results with clinical symptoms and signs to develop predictive scoring models14,15. Therefore, there is a clear need for a reliable, fast, and convenient scoring system to assist in the preoperative diagnosis of complicated appendicitis.

Several predictive scoring systems, such as the Alvarado, Appendicitis Inflammatory Response (AIR), and Raja Isteri Pengiran Anak Saleha (RIPASA) scoring systems, have been recommended by the World Society of Emergency Surgery for diagnosing acute appendicitis. The Alvarado score, developed in 1986, incorporates clinical symptoms, signs, and laboratory findings to provide a diagnostic framework6. The AIR score, introduced later, refines this approach by emphasizing inflammatory markers for greater diagnostic specificity7. The RIPASA score, introduced by Chong et al.26 in 2010 and designed for Asian populations, integrates a broader range of clinical parameters to enhance sensitivity in diverse settings8. Studies16,17,18 demonstrated that the sensitivity and specificity of an Alvarado score > 7 in diagnosing appendicitis ranged from 71.1 to 88.4% and from 63.6 to 78.5%, respectively. Previous studies16,17,20 indicated that the sensitivity and specificity of RIPASA scores ranged from 92.7 to 96.7% and from 72.7 to 90.6%, respectively. Under the AIR score, patients are categorized into three risk levels: low-risk (0–4 points), moderate-risk (5–8 points), and high-risk (9–12 points); the higher the score, the greater the likelihood of appendicitis. Research16,18,21 indicated that when the AIR score was > 5, the sensitivity and specificity for diagnosing appendicitis were 77.7–93.3% and 65.1–80.1%, respectively. For an AIR score ≥ 9, the reported sensitivity and specificity were 96.2% and 89.4%, respectively. While these systems are widely applied, their efficacy in identifying complicated appendicitis has not been thoroughly validated. In particular, their shortcomings include limited sensitivity in complex cases and a lack of universal applicability across diverse patient populations.

Recently, several models have been developed to reduce the number of negative appendectomies, an approach that may be associated with low morbidity and mortality rates and a moderate recurrence rate. However, delaying surgery can potentially lead to the development of complications. Accordingly, machine learning models have been developed to reduce false-positive and false-negative diagnoses of appendicitis. One recently developed model outperformed the AIR score across most combinations of sensitivity and specificity22.

This study aimed to fill this gap by developing a novel scoring system to predict complex appendicitis and assess its diagnostic accuracy compared with the widely used Alvarado, AIR, and RIPASA scoring systems. The potential to improve clinical outcomes and reduce unnecessary treatments underscores the necessity of this research.

Methods

Study design

This study retrospectively collected data from 1241 patients with acute appendicitis who underwent surgical treatment at Tsinghua Changgung Hospital (Beijing, China) between May 2016 and October 2023. The pathological classification of appendicitis was based on the standard classification system used at Tsinghua Changgung Hospital, categorizing appendicitis into non-complicated appendicitis, phlegmonous appendicitis, and gangrenous appendicitis. Phlegmonous appendicitis refers to inflammation with surrounding tissue involvement, while gangrenous appendicitis involves necrosis of the appendix wall and is associated with more severe disease. Most appendectomies were performed using an open surgical technique, the majority via the McBurney incision; in certain cases, laparotomy or laparoscopy was employed based on the clinical situation and the surgeon’s discretion. Ethical approval and written informed consent were waived by the Ethics Committee of Tsinghua Changgung Hospital owing to the retrospective nature of the study. Patients over 18 years of age, of both sexes, who were diagnosed with acute appendicitis, underwent appendectomy, and had appendicitis confirmed by postoperative pathology were included. Exclusion criteria were age under 18 years, other appendiceal diseases such as mucinous tumors or appendiceal carcinoid tumors, appendiceal abscess precluding pathological examination of the appendix, and incomplete clinical data.

The following parameters were evaluated: gender, age, preoperative body temperature, presence of migratory (metastatic) right lower abdominal pain, nausea and vomiting, diarrhea, time from onset to visit, signs of peritonitis, preoperative C-reactive protein (CRP) level, preoperative white blood cell (WBC) count, preoperative lymphocyte percentage, preoperative neutrophil percentage, and preoperative neutrophil-to-lymphocyte ratio (NLR). Imaging findings, namely the presence of appendiceal fecaliths, pelvic fluid accumulation, periappendiceal fat stranding, and appendiceal diameter, as well as pathological examination results, were also recorded.

Data processing

Statistical analyses were performed using R (version 4.2.2). The 1241 included patients were randomly divided into a modeling group (n = 868) and a validation group (n = 373) at a 7:3 ratio. Risk factor analysis was undertaken in the modeling group. Continuous variables were presented as mean ± standard deviation (SD) if they followed a normal distribution, as determined by the Kolmogorov-Smirnov test. For non-normally distributed variables, data were reported as median and interquartile range (IQR). Group comparisons for normally distributed data were conducted using Student’s t-test, while the Mann-Whitney U test was applied for non-normally distributed data. All statistical tests were two-sided, and P < 0.05 was considered statistically significant. Binary logistic regression (backward likelihood ratio method) was performed on variables with P < 0.05 in univariate analysis to identify independent risk factors for complex appendicitis. The independent risk factors were used to fit a prediction model in R, and a nomogram was developed from the fitted model. The model was validated in the validation group, and its discrimination was assessed by receiver operating characteristic (ROC) analysis. A calibration curve was also plotted to evaluate the agreement between predicted and observed risk.
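The 7:3 random allocation described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name `split_cohort`, the seed, and the use of consecutive integers as patient identifiers are assumptions made only for illustration.

```python
import random

def split_cohort(patient_ids, train_fraction=0.7, seed=2023):
    """Shuffle patient identifiers and split them into modeling and validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)   # hypothetical fixed seed, for reproducibility only
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)   # floor of 0.7 * n
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers 1..1241 stand in for the study cohort.
modeling, validation = split_cohort(range(1, 1242))
print(len(modeling), len(validation))   # 868 373
```

Flooring 0.7 × 1241 gives 868, which leaves 373 patients for validation, matching the group sizes reported in the text.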

Patients’ data from the validation group were evaluated using the following scoring systems:

  • Alvarado scoring system: The maximum score is 10, calculated based on 8 clinical parameters (such as right lower abdominal pain, nausea, fever, and WBC count) and categorized into low (0–4), medium (5–6), and high risk (7–10) categories.

  • RIPASA scoring system: The maximum score is 15, determined using 15 clinical and laboratory parameters (such as gender, WBC count, and fever), with a threshold of ≥ 7.5 indicating acute appendicitis.

  • AIR scoring system: The maximum score is 12, derived from 8 clinical parameters (including fever, pain, and WBC count), classified into low (0–4), medium (5–8), and high risk (9–12) levels.

  • MZXBTCH scoring system: This newly developed scoring system incorporates multiple factors, including preoperative clinical, laboratory, and imaging data. The total score is based on a combination of independent risk factors identified from the data analysis.

ROC curves were plotted to determine the cutoff values for predicting complex appendicitis. The area under the curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy of the four scoring systems were compared.
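As a rough illustration of how a cutoff and an AUC can be read from a score, the sketch below computes the AUC as the probability that a complex case outscores a non-complex one and picks the threshold that maximizes Youden's J (sensitivity + specificity − 1). The text does not state which cutoff criterion was used, so Youden's J is an assumption here, and the scores and labels are made-up toy values, not study data.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each rule 'score >= threshold'."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_cutoff(scores, labels):
    """Threshold with the largest Youden's J (assumed criterion, see lead-in)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)

# Toy data: 1 = complex appendicitis, 0 = non-complex (hypothetical values).
toy_scores = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1]
print(auc(toy_scores, toy_labels))
print(youden_cutoff(toy_scores, toy_labels))
```

Other cutoff criteria (for example, fixing a minimum sensitivity) would give different thresholds from the same ROC curve.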

To statistically compare the performance of the models, McNemar’s test for paired data was used to compare the sensitivity and specificity of the scales, using only patients with pathologically confirmed complex appendicitis (for sensitivity) and those with non-complex appendicitis (for specificity). Additionally, Cohen’s kappa coefficient was calculated to assess the degree of concordance between the diagnostic tests. Finally, the AUC values were compared using DeLong’s method, which is designed for comparing ROC curves of paired samples.
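The two paired-agreement statistics named above can be sketched in a few lines of Python. The exact (binomial) form of McNemar's test is used here as an assumption, since the text does not say whether the exact or chi-square version was applied; the function names and the example counts are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n   # P(X <= k), X ~ Bin(n, 0.5)
    return min(1.0, 2 * tail)

def cohen_kappa(a, b, c, d):
    """Kappa for a 2x2 table: a = both tests positive, b = only test 1 positive,
    c = only test 2 positive, d = both tests negative."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical counts for two scoring systems applied to the same patients.
print(mcnemar_exact(0, 10))
print(cohen_kappa(40, 10, 5, 45))
```

For sensitivity, the table would be built only from patients with complex appendicitis; for specificity, only from those with non-complex appendicitis.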

Development of the MZXBTCH scoring system

The MZXBTCH scoring system was developed to predict complicated appendicitis based on independent risk factors identified in the modeling group. Variables with statistical significance (P < 0.05) in univariate analysis were included in a binary logistic regression analysis using the backward likelihood ratio method to determine independent predictors. The acronym ‘MZXBTCH’ is derived from the initials of its primary predictive parameters: Metastatic pain, CRP, X-ray imaging features (appendiceal diameter), Body temperature, Time from onset to visit, Clinical signs of peritonitis, and Hematological markers (NLR). These clinical and imaging indicators were incorporated into the scoring system for ease of use and recall by clinicians.

Each identified predictor was assigned a weighted score proportional to its regression coefficient, reflecting its relative contribution to the risk of complicated appendicitis. The scoring system integrates both clinical and imaging parameters, including preoperative CRP levels, NLR, presence of peritonitis signs, appendiceal diameter, and periappendiceal fat stranding. A nomogram was developed using R software to visualize the scoring system and allow clinicians to calculate a cumulative risk score for each patient. The cutoff value for predicting complicated appendicitis was determined through ROC curve analysis in the modeling group, optimizing sensitivity and specificity. The system was then validated in the validation group, where its diagnostic performance, including the AUC, sensitivity, specificity, PPV, NPV, and overall accuracy, was compared with that of existing scoring systems.
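One common convention for turning regression coefficients into nomogram points is to give 100 points to the predictor with the largest effect across its observed range and to scale the others proportionally. The sketch below illustrates that convention only; the text does not specify the exact weighting used, and the predictor names, coefficients, and ranges are hypothetical placeholders rather than study values.

```python
def nomogram_points(coefficients, ranges, max_points=100):
    """Maximum points per predictor, proportional to |beta| times the predictor's range."""
    effects = {
        name: abs(beta) * (ranges[name][1] - ranges[name][0])
        for name, beta in coefficients.items()
    }
    largest = max(effects.values())
    return {name: max_points * effect / largest for name, effect in effects.items()}

# Hypothetical coefficients (log-odds per unit) and value ranges, for illustration only.
example_coefficients = {"marker_a": 0.02, "sign_b": 1.2, "size_c": 0.1}
example_ranges = {"marker_a": (0, 200), "sign_b": (0, 1), "size_c": (5, 20)}

points = nomogram_points(example_coefficients, example_ranges)
for name, value in points.items():
    print(f"{name}: up to {value:.1f} points")
```

A patient's total score is then the sum of the points read off for each predictor, which is what the nomogram axis lets a clinician do by hand.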

Results

Analysis of risk factors for complex appendicitis

As a secondary outcome, this study aimed to identify independent risk factors for complex appendicitis using multivariate logistic regression analysis. A total of 1,241 patients met the inclusion criteria. After randomization, the modeling group consisted of 868 patients (248 with complex appendicitis and 620 with non-complex appendicitis), while the validation group involved 373 patients (98 with complex appendicitis and 275 with non-complex appendicitis). There were no significant differences between the two groups in terms of general demographics, clinical features, imaging findings, or laboratory results (Table 1).

Table 1 Baseline characteristics of participants in the modeling and validation groups.

In the modeling group, univariate analysis revealed significant differences between patients with complex and non-complex appendicitis in terms of age, right lower abdominal pain, history of appendicitis, preoperative body temperature, and signs of peritonitis (Table 2). Laboratory parameters, such as preoperative CRP level, WBC count, lymphocyte percentage, neutrophil percentage, and NLR also significantly differed between the two groups. Imaging findings, including the presence of appendicolith, periappendiceal fat stranding, and appendix diameter, exhibited significant associations with complex appendicitis.

Table 2 Univariate analysis of factors associated with complex appendicitis.

Multivariate logistic regression analysis identified six independent risk factors for complex appendicitis: preoperative body temperature (odds ratio (OR) = 1.104, 95% confidence interval (CI): 1.067–1.143, P < 0.001), CRP level (OR = 1.002, 95% CI 1.001–1.002, P < 0.001), lymphocyte percentage (OR = 0.994, 95% CI 0.990–0.996, P < 0.001), presence of appendicolith (OR = 1.127, 95% CI 1.068–1.190, P < 0.001), periappendiceal fat stranding (OR = 1.133, 95% CI 1.072–1.198, P < 0.001), and appendix diameter (OR = 1.013, 95% CI 1.004–1.022, P < 0.001) (Table 3).

Table 3 Results of multivariate logistic regression analysis of risk factors.
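Because each odds ratio above is expressed per one-unit change of the predictor as coded in the model, small values such as 1.002 can still correspond to a sizeable effect over a clinically relevant change. The sketch below converts the reported odds ratios to logistic regression coefficients (β = ln OR) and rescales one of them; the 50-unit increment is an arbitrary illustrative choice, and the measurement units are not stated in the text.

```python
from math import exp, log

# Odds ratios per one-unit increase, as reported in the multivariate analysis.
REPORTED_OR = {
    "body_temperature": 1.104,
    "crp": 1.002,
    "lymphocyte_percentage": 0.994,
    "appendicolith": 1.127,
    "periappendiceal_fat_stranding": 1.133,
    "appendix_diameter": 1.013,
}

def coefficient(odds_ratio):
    """Logistic regression coefficient corresponding to an odds ratio."""
    return log(odds_ratio)

def rescaled_odds_ratio(odds_ratio, delta):
    """Odds ratio for a change of `delta` units instead of one unit."""
    return exp(coefficient(odds_ratio) * delta)

for name, value in REPORTED_OR.items():
    print(f"{name}: beta = {coefficient(value):+.4f}")

# Illustrative 50-unit increase in CRP (increment chosen arbitrarily).
print(round(rescaled_odds_ratio(REPORTED_OR["crp"], 50), 3))
```

Under these assumptions, a 50-unit rise in CRP multiplies the odds of complex appendicitis by about 1.105, whereas the per-unit figure of 1.002 alone might suggest a negligible effect.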

Development and evaluation of the MZXBTCH model

The primary outcome of this study was to evaluate the diagnostic accuracy of the newly developed MZXBTCH scoring system for predicting complex appendicitis compared with the Alvarado, RIPASA, and AIR scoring systems. A predictive model for complex appendicitis, the MZXBTCH model, was developed based on the identified independent risk factors. A nomogram was developed (Fig. 1), and its performance was evaluated using the ROC curve, with an AUC of 0.773 (Fig. 2). Calibration curves demonstrated strong agreement between the model’s predicted probabilities and the actual outcomes in the modeling group (Fig. 3).

Fig. 1 Nomogram of the MZXBTCH model.

Fig. 2 ROC curve for the modeling group.

Fig. 3 Calibration curve of the MZXBTCH model.

Comparison of predictive models in the validation group

The MZXBTCH model was then validated in the validation group, and its predictive performance was compared with that of three widely used scoring systems: Alvarado, RIPASA, and AIR. The ROC curve of the MZXBTCH model lay above and to the left of those of the other scales, yielding a greater AUC and indicating better overall discrimination for complex appendicitis. Specifically, the MZXBTCH model achieved an AUC of 0.82, compared with 0.66, 0.68, and 0.76 for the Alvarado, RIPASA, and AIR scales, respectively (Table 4). The MZXBTCH model demonstrated a sensitivity of 74.49% and a specificity of 76.36%, both higher than those of the other models. Its PPV and NPV were 0.53 and 0.89, respectively, with an overall accuracy of 75.9%. By comparison, the AIR model had the second-best performance, with an AUC of 0.76 and an accuracy of 73.7% (Table 4; Fig. 4). As a further secondary outcome, the diagnostic performance of the MZXBTCH model was validated through calibration curves and comparison with existing scoring systems based on AUC, sensitivity, specificity, PPV, NPV, and overall accuracy.
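The reported metrics for the MZXBTCH model can be cross-checked against the validation group sizes (98 complex and 275 non-complex cases). The sketch below back-calculates the implied confusion-matrix counts from the reported sensitivity and specificity; these counts are inferred by rounding and are an assumption, as the cell counts themselves are not given in the text.

```python
def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Recover integer cell counts from group sizes and reported rates."""
    tp = round(n_pos * sensitivity)   # true positives
    tn = round(n_neg * specificity)   # true negatives
    return tp, n_pos - tp, tn, n_neg - tn   # TP, FN, TN, FP

# Validation group: 98 complex and 275 non-complex cases; rates as reported.
tp, fn, tn, fp = confusion_from_rates(98, 275, 0.7449, 0.7636)

ppv = tp / (tp + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(tp, fn, tn, fp)                                            # 73 25 210 65
print(round(ppv, 2), round(npv, 2), round(100 * accuracy, 1))    # 0.53 0.89 75.9
```

The implied counts reproduce the reported PPV (0.53), NPV (0.89), and accuracy (75.9%), so the headline metrics are internally consistent.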

Table 4 Comparison of performance among various scoring systems in predicting complex appendicitis.

Fig. 4 ROC curves for predicting complex appendicitis using scoring systems in each group.

Discussion

This study developed the MZXBTCH scoring system for predicting complex appendicitis, incorporating a range of clinical, laboratory, and imaging factors. The results revealed that preoperative body temperature, CRP level, lymphocyte percentage, appendiceal fecaliths, periappendiceal fat stranding, and appendix diameter were independent risk factors for complex appendicitis. These factors can serve as reliable indicators for preoperative identification of complex cases, which is crucial because treatment strategies differ substantially between complex and uncomplicated appendicitis. Non-surgical management can be considered for uncomplicated cases owing to lower complication rates and shorter hospital stays, as noted by multiple studies6,7,8. In contrast, surgery remains the treatment of choice for complex appendicitis, as the condition is associated with higher complication rates, including gangrene and perforation.

To assess the statistical significance of the differences between the scales, 95% CIs for sensitivity, specificity, and AUC were calculated. The CIs for the sensitivity and specificity of the MZXBTCH model (sensitivity: 74.49%, specificity: 76.36%) generally did not overlap with those of the other models, suggesting that the differences in performance were significant. However, we acknowledge that the CIs of some scales, such as AIR (sensitivity: 72.35%, specificity: 74.21%), overlapped slightly with those of the MZXBTCH model, indicating that the differences from those scales may be less pronounced. For a more robust comparison, inferential statistical tests were performed, including McNemar’s test for paired data and Cohen’s kappa coefficient for test concordance. In addition, the AUC values were compared using DeLong’s method. The results showed that the MZXBTCH model outperformed the other scales with a significantly higher AUC (0.82 vs. 0.66 for Alvarado, 0.68 for RIPASA, and 0.76 for AIR), indicating better overall diagnostic performance.

A previous study23 demonstrated a significant correlation between high serum CRP level and complex appendicitis. A CRP level ≥ 50 mg/L predicted complex appendicitis with a likelihood of 52.7%, ≥ 100 mg/L with 74.4%, and ≥ 150 mg/L with 82.6%. Similarly, Akai et al.24 identified high CRP level and fever as risk factors for complex appendicitis. However, Mutlu et al.25 found no significant differences in WBC count or neutrophil percentage between complex and non-complex appendicitis. Although univariate analysis in other studies15,26 indicated higher WBC count and neutrophil percentage in complex appendicitis, multivariate logistic regression analysis did not identify these as independent risk factors. Yildiz et al.27 also found that WBC count had less than 60% sensitivity in predicting complex appendicitis. These results are consistent with the findings of the present study.

The performance of the MZXBTCH model was superior to that of traditional scoring systems, such as Alvarado, RIPASA, and AIR. The model achieved an AUC of 0.822, demonstrating greater accuracy in predicting complex appendicitis, particularly in terms of specificity (76.4%) and overall accuracy (75.9%). This is consistent with recent studies that have highlighted the limitations of existing scoring systems in distinguishing complex from uncomplicated appendicitis, as the AUC for these models often falls below 0.7, indicating suboptimal performance28,29,30. The MZXBTCH model’s inclusion of novel factors, such as appendicoliths and periappendiceal fat stranding, provides a more comprehensive approach compared with earlier models.

Pogorelic et al.31 investigated whether the AIR score could be used to predict perforated appendicitis in children. Their findings revealed that when the AIR score was ≥ 9, the sensitivity and specificity for predicting perforated appendicitis were 89.5% and 71.9%, respectively. Similarly, Stiel et al.32 found, in a study of pediatric appendicitis, that when the Alvarado score was > 6, the sensitivity and specificity for predicting perforation were 84.3% and 39.3%, respectively. However, these studies concentrated only on perforated appendicitis and excluded gangrenous cases, limiting their ability to fully represent the predictive efficacy for complex appendicitis.

This study has several limitations that should be considered. Firstly, it is a retrospective analysis conducted at a single center, which might limit the generalizability of the findings to other settings or populations with different patient characteristics. Additionally, although the sample size was large, the retrospective design may introduce biases, such as selection bias, and restricts the ability to establish causality. Despite the promising performance of the MZXBTCH model, its sensitivity and specificity (74.49% and 76.36%, respectively) remained below optimal thresholds, indicating that further refinement is needed. Finally, external validation in larger, multicenter, prospective studies is necessary to confirm the model’s accuracy and applicability in diverse clinical environments. Future studies should concentrate on validating this model in diverse populations and exploring its integration into clinical practice to enhance decision-making in the management of acute appendicitis.

Conclusions

This study established the MZXBTCH scoring system for predicting complex appendicitis. Compared with the Alvarado, RIPASA, and AIR scoring systems, the MZXBTCH scoring system exhibited superior screening performance, with higher sensitivity, specificity, and accuracy. It holds promise for the early identification of patients requiring appendectomy, contributing to more efficient use of hospital resources. Future research will explore the effectiveness of the MZXBTCH scoring system in different demographic groups, such as various age ranges, genders, or ethnic populations, to determine potential variations in its predictive accuracy. Additionally, multicenter studies with larger, more diverse populations will help confirm the generalizability of the model across different clinical settings. Investigating its performance in pediatric and elderly populations, where presentation may differ, may also provide valuable insights. Prospective studies may also assess the integration of emerging biomarkers or advanced imaging techniques to enhance the model’s predictive capabilities, ensuring it remains a relevant tool for clinical decision-making in the evolving landscape of diagnostic technologies.