Abstract
Evaluating how large language models (LLMs) perform in a specific medical domain can clarify how well they generalize to real-world applications. We assessed the predictive and decision-support value of two state-of-the-art LLMs in predicting bone cement leakage (BCL) and new vertebral fractures (NVF) after percutaneous kyphoplasty (PKP) and compared them with traditional machine learning (TML) models and spine surgeons. This study used combined retrospective and prospective data from a single tertiary hospital. Two LLMs (GPT-5 and DeepSeek R1) under zero- and few-shot prompting strategies, five TML models, and two spine surgeons with or without exposure to LLM responses were asked to predict complications from demographic, perioperative baseline, and radiographic data. We also tested the LLMs’ ability to predict complication subtypes. For BCL prediction, both LLMs demonstrated acceptable performance under zero-shot conditions (F1-score, 0.857–0.871; MCC, 0.164–0.332), comparable to the TML models (F1-score, 0.758–0.867; MCC, 0.265–0.416) and slightly superior to surgeons alone (F1-score, 0.675–0.684; MCC, 0.074–0.185). Few-shot prompting enhanced specificity but yielded uncertain overall gains. For NVF prediction, zero-shot LLM performance was poor (F1-score, 0.309; MCC, 0.044) but improved with few-shot learning. The RBF-SVM model showed the best performance for NVF prediction (F1-score, 0.536; MCC, 0.414). LLM explanations enhanced surgeon performance in BCL prediction but not in NVF prediction. The LLMs predicted complication subtypes poorly. These findings suggest that the predictive performance of current LLMs varies across complications after PKP; the models remain immature for real-world clinical application and need further improvement.
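For readers less familiar with the two metrics reported above, the following is a minimal sketch of how the F1-score and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix. The function names and the counts are hypothetical and are not study data; they only illustrate one way a high F1-score can coexist with a low MCC, namely when positive cases dominate and few negatives are identified correctly.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); ignores true negatives entirely."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; uses all four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical, illustrative counts (NOT taken from the study):
# 90 true positives-class cases and 30 negative-class cases.
tp, fn = 80, 10   # positives predicted correctly / missed
tn, fp = 10, 20   # negatives predicted correctly / falsely flagged

print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Because F1 does not reward correctly identified negatives while MCC does, reporting both, as the abstract does, gives a more balanced view of a classifier than either metric alone.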
Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to institutional and participant privacy considerations, but are available from the corresponding author on reasonable request.
Code availability
The underlying code for this study is not publicly available but may be made available to qualified researchers on reasonable request from the corresponding author.
Acknowledgements
We sincerely thank DP and LJ for their valuable investigation and evaluation in this research. No funding was received for conducting this study.
Author information
Contributions
T.W., R.C., and M.L. contributed equally to this work. Conceptualization: L.Z., T.W., and R.C.; methodology: T.W., R.C., and M.L.; formal analysis and investigation: T.W., R.C., M.L., H.K., B.W., and Z.M.; writing—original draft preparation: T.W. and R.C.; writing—review and editing: M.L., H.K., B.W., Z.M., A.W., N.F., and S.Y.; resources: L.Z.; supervision: L.Z.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, T., Chen, R., Liang, M. et al. Comparative performance of LLMs and machine learning in predicting complications after percutaneous kyphoplasty for osteoporotic vertebral compression fractures. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02588-4
DOI: https://doi.org/10.1038/s41746-026-02588-4


