Comparative performance of LLMs and machine learning in predicting complications after percutaneous kyphoplasty for osteoporotic vertebral compression fractures

Wang, Tianyi; Chen, Ruiyuan; Liang, Minghui; Ke, Han; Wang, Baodong; Ma, Ziqian; Wang, Aobo; Fan, Ning; Yuan, Shuo; Zang, Lei

doi:10.1038/s41746-026-02588-4

Article
Open access
Published: 01 April 2026

Tianyi Wang¹, Ruiyuan Chen¹, Minghui Liang¹, Han Ke¹, Baodong Wang¹, Ziqian Ma¹, Aobo Wang¹, Ning Fan¹, Shuo Yuan¹ & Lei Zang¹

npj Digital Medicine (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Evaluating the performance of large language models (LLMs) in specific medical domains can clarify how well they generalize to real-world applications. We assessed the predictive and decision-support value of two state-of-the-art LLMs in predicting bone cement leakage (BCL) and new vertebral fractures (NVF) after percutaneous kyphoplasty (PKP) and compared them with traditional machine learning (TML) models and spine surgeons. This study used combined retrospective and prospective data from a single tertiary hospital. Two LLMs (GPT-5 and DeepSeek R1) under zero- and few-shot strategies, five TML models, and two spine surgeons with and without exposure to LLM responses were asked to predict complications from demographic, perioperative baseline, and radiographic data. We also tested the LLMs’ ability to predict complication subtypes. For BCL prediction, both LLMs demonstrated acceptable performance under zero-shot conditions (F1-score, 0.857–0.871; Matthews correlation coefficient [MCC], 0.164–0.332), comparable to TML models (F1-score, 0.758–0.867; MCC, 0.265–0.416), and slightly superior to surgeons alone (F1-score, 0.675–0.684; MCC, 0.074–0.185). Few-shot prompting enhanced specificity but yielded uncertain overall gains. For NVF prediction, zero-shot LLM performance was poor (F1-score, 0.309; MCC, 0.044) but improved with few-shot learning. The RBF-SVM model showed the best performance for NVF prediction (F1-score, 0.536; MCC, 0.414). LLM explanations enhanced surgeon performance in BCL prediction but not in NVF prediction. LLMs predicted complication subtypes poorly. These findings suggest that the predictive performance of current LLMs varies across complications after PKP; the models remain immature for real-world clinical use and need further improvement.
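The abstract reports both F1-score and MCC, and the two can diverge sharply: a high F1-score can coexist with a low MCC when the positive class dominates and few true negatives are identified. The following minimal Python sketch illustrates that relationship using the standard definitions of both metrics. The function name `f1_and_mcc` and the confusion-matrix counts are hypothetical, chosen only for illustration; they are not taken from the study, and class imbalance is offered here as one possible explanation, not as the authors' stated one.

```python
import math


def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1-score, MCC) for a binary confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives. A zero denominator yields 0.0.
    """
    # F1 is the harmonic mean of precision and recall; it ignores tn.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four cells, so it penalizes poor specificity.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return f1, mcc


# Hypothetical counts (NOT study data): 100 cases, 80 with the complication.
# The classifier labels almost everything positive, so specificity is low.
f1, mcc = f1_and_mcc(tp=76, fp=16, fn=4, tn=4)
print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

With these illustrative counts the F1-score is high while the MCC stays low, because only 4 of 20 negative cases are recognized. This is why reporting both metrics, as the abstract does, gives a fuller picture than F1-score alone.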

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to institutional and participant privacy considerations, but are available from the corresponding author on reasonable request.

Code availability

The underlying code for this study is not publicly available but may be made available to qualified researchers on reasonable request from the corresponding author.

Acknowledgements

We sincerely thank DP and LJ for their valuable investigation and evaluation in this research. No funding was received for conducting this study.

Author information

These authors contributed equally: Tianyi Wang, Ruiyuan Chen, Minghui Liang.

Authors and Affiliations

Department of Orthopedics, Beijing Chaoyang Hospital, Capital Medical University, Beijing, China
Tianyi Wang, Ruiyuan Chen, Minghui Liang, Han Ke, Baodong Wang, Ziqian Ma, Aobo Wang, Ning Fan, Shuo Yuan & Lei Zang

Contributions

T.W., R.C., and M.L. contributed equally to this work. Conceptualization: L.Z., T.W., and R.C.; methodology: T.W., R.C., and M.L.; formal analysis and investigation: T.W., R.C., M.L., H.K., B.W., and Z.M.; writing—original draft preparation: T.W. and R.C.; writing—review and editing: M.L., H.K., B.W., Z.M., A.W., N.F., and S.Y.; resources: L.Z.; supervision: L.Z.

Corresponding author

Correspondence to Lei Zang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, T., Chen, R., Liang, M. et al. Comparative performance of LLMs and machine learning in predicting complications after percutaneous kyphoplasty for osteoporotic vertebral compression fractures. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02588-4

Received: 21 November 2025
Accepted: 17 March 2026
Published: 01 April 2026
DOI: https://doi.org/10.1038/s41746-026-02588-4