Abstract
Background
Understanding how symptoms affect daily functioning is central to improving care for childhood cancer survivors. As narrative symptom reporting becomes increasingly common in survivorship care, scalable automated tools are needed to interpret these descriptions and identify their functional impact. This study evaluates how two large language models (ChatGPT-4o, Llama-3.1) perform this task across different prompt engineering strategies.
Methods
We analyzed semi-structured interviews from 30 childhood cancer survivors and their caregivers, yielding 819 pain- and fatigue-related symptom narratives. Each narrative was expert-annotated for physical, social, or cognitive functional impact, serving as the reference standard. ChatGPT-4o and Llama-3.1 were evaluated using four prompting strategies: zero-shot, few-shot, step-by-step reasoning (Chain-of-Thought), and generated knowledge. Model outputs were compared with expert annotations, and performance was quantified using standard classification and discrimination metrics with resampling-based confidence intervals.
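To make the evaluation setup concrete, the following is a minimal, illustrative sketch (not the study's actual code) of how the four prompting strategies might be expressed as templates and how a resampling-based (bootstrap) confidence interval for a classification metric could be computed. The template wording, the example narrative, and the use of macro-averaged F1 via scikit-learn are assumptions made for illustration only.

```python
# Illustrative sketch only; not the authors' implementation.
# Assumed: prompt wording, example narrative, and macro-F1 as the metric.
import random
from sklearn.metrics import f1_score

NARRATIVE = "My legs hurt so much after school that I stopped playing soccer."

# The four prompting strategies described in the Methods, framed as templates.
PROMPTS = {
    "zero_shot": (
        "Does the following symptom narrative describe an impact on physical, "
        "social, or cognitive functioning? Narrative: {text}"
    ),
    "few_shot": (
        "Example 1: 'I was too tired to finish my homework.' -> cognitive\n"
        "Example 2: 'The pain kept me from seeing my friends.' -> social\n"
        "Now classify this narrative: {text}"
    ),
    "chain_of_thought": (
        "Think step by step: identify the symptom, the activity it limits, "
        "and whether that activity is physical, social, or cognitive. "
        "Narrative: {text}"
    ),
    "generated_knowledge": (
        "First, briefly state what physical, social, and cognitive functioning "
        "mean for childhood cancer survivors. Then, using that knowledge, "
        "classify this narrative: {text}"
    ),
}

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for macro-F1 over paired labels."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample narratives with replacement
        scores.append(
            f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx],
                     average="macro", zero_division=0)
        )
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]

if __name__ == "__main__":
    # Show how each strategy would frame the same narrative for the model.
    for name, template in PROMPTS.items():
        print(f"--- {name} ---\n{template.format(text=NARRATIVE)}\n")
    # Toy labels standing in for expert annotations vs. model predictions.
    truth = ["physical", "social", "physical", "cognitive", "physical", "social"]
    preds = ["physical", "social", "cognitive", "cognitive", "physical", "physical"]
    print("95% CI for macro-F1:", bootstrap_f1_ci(truth, preds))
```

Bootstrap intervals of this kind are one standard way to obtain the resampling-based confidence intervals mentioned above; the study itself compares each model's label for every narrative against the expert annotation.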
Results
Here, we show that prompting strategies based on generated knowledge and step-by-step reasoning consistently outperform zero-shot and few-shot prompting in both models, producing the most accurate and stable classification of physical, social, and cognitive functional impact. ChatGPT-4o achieves more balanced precision and discrimination across the three functional domains, whereas Llama-3.1 demonstrates higher sensitivity but substantially lower precision, particularly for physical and social functioning.
Conclusions
Prompt engineering improves how large language models interpret survivor-reported pain and fatigue. These findings support the use of carefully designed prompts to enable automated, context-aware analysis of symptom narratives, providing a scalable approach to support symptom monitoring and survivor-centered care.
Plain language summary
Many children who survive cancer continue to have pain and tiredness that affect how they move, think, and take part in daily life. These experiences are often recorded in interviews or written notes, which are hard for doctors to review quickly. This study tested whether two computer programs could aid in interpreting these symptom narratives. We provided the programs with different types of instructions and asked them to determine how pain and fatigue affected physical, social, and thinking activities. We found that instructions that included background information or step-by-step thinking led to more accurate results. These tools could help doctors better track symptoms, identify problems earlier, and provide more personalized care for childhood cancer survivors.
Data availability
Source data underlying Table 1 are in Supplementary Table 1; source data underlying Tables 2–3 are in Supplementary Data parts 3 and 4. These source data are available on Zenodo.org (https://zenodo.org/records/18526848)53. These public datasets enable reproduction of the reported tables but do not include unstructured symptom narratives. The full analytic datasets used for model training and statistical analyses, including unstructured symptom narratives, may contain protected health information (PHI) and therefore cannot be made publicly available. These data are stored within secure St. Jude data systems and are available to qualified researchers upon reasonable request, subject to institutional review and execution of a data use agreement (DUA). Requests for access to restricted analytic data should be directed to the corresponding author, who will coordinate access in accordance with St. Jude data-governance policies.
References
Ohlsen, T., Martos, M. & Hawkins, D. Recent advances in the treatment of childhood cancers. Curr. Opin. Pediatr. 36, 57–63 (2023).
Siegel, R. L., Miller, K. D., Wagle, N. S. & Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 73, 17–48 (2023).
Miller, K. D. et al. Cancer treatment and survivorship statistics, 2022. CA Cancer J. Clin. 72, 409–436 (2022).
Armstrong, G. T. et al. Reduction in late mortality among 5-year survivors of childhood cancer. N. Engl. J. Med. 374, 833–842 (2016).
Hudson, M. M. et al. Clinical ascertainment of health outcomes among adults treated for childhood cancer. JAMA 309, 2371–2381 (2013).
Bhakta, N. et al. The cumulative burden of surviving childhood cancer: an initial report from the St Jude Lifetime Cohort Study (SJLIFE). Lancet 390, 2569–2582 (2017).
Shin, H. et al. Associations of symptom clusters and health outcomes in adult survivors of childhood cancer: a report from the St Jude Lifetime Cohort Study. J. Clin. Oncol. 41, 497–507 (2023).
Horan, M. R. et al. Multilevel characteristics of cumulative symptom burden in young survivors of childhood cancer. JAMA Netw. Open 7, e2410145–e2410145 (2024).
Molcho, M., D’Eath, M., Alforque Thomas, A. & Sharp, L. Educational attainment of childhood cancer survivors: a systematic review. Cancer Med. 8, 3182–3195 (2019).
Leahy, A. B. & Steineck, A. Patient-reported outcomes in pediatric oncology: the patient voice as a gold standard. JAMA Pediatr. 174, e202868–e202868 (2020).
Basch, E., Leahy, A. B. & Reeve, B. B. Symptom monitoring with patient-reported outcomes during pediatric cancer care. JAMA 332, 1979–1980 (2024).
Aiyegbusi, O. L. et al. Recommendations to address respondent burden associated with patient-reported outcome assessment. Nat. Med. 30, 650–659 (2024).
Minvielle, E., di Palma, M., Mir, O. & Scotté, F. The use of patient-reported outcomes (PROs) in cancer care: a realistic strategy. Ann. Oncol. 33, 357–359 (2022).
Lu, Z. et al. Natural language processing and machine learning methods to characterize unstructured patient-reported outcomes: validation study. J. Med. Internet Res. 23, e26777 (2021).
Sezgin, E., Hussain, S. A., Rust, S. & Huang, Y. Extracting medical information from free-text and unstructured patient-generated health data using natural language processing methods: feasibility study with real-world data. JMIR Form. Res. 7, e43014 (2023).
Castro, A., Pinto, J., Reino, L., Pipek, P. & Capinha, C. Large language models overcome the challenges of unstructured text data in ecology. Ecol. Inf. 82, 102742 (2024).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4171–4186 (Association for Computational Linguistics, 2019).
Yang, X. et al. Multiple large language models versus experienced physicians in diagnosing challenging cases with gastrointestinal symptoms. NPJ Digit. Med. 8, 85 (2025).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Hadi, A., Tran, E., Nagarajan, B. & Kirpalani, A. Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians. PLoS ONE 19, e0307383 (2024).
Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns 5, 100943 (2024).
Wei, W. I. et al. Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong. Clin. Microbiol. Infect. 30, 142.e1–142.e3 (2024).
Nazi, Z. A., Hossain, M. R. & Mamun, F. A. Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Nat. Lang. Proc. J. 10, 100124 (2025).
Sim, J. A. et al. Natural language processing with machine learning methods to analyze unstructured patient-reported outcomes derived from electronic health records: a systematic review. Artif. Intell. Med. 146, 102701 (2023).
Sim, J. A., Huang, X., Horan, M. R., Baker, J. N. & Huang, I. C. Using natural language processing to analyze unstructured patient-reported outcomes data derived from electronic health records for cancer populations: a systematic review. Expert Rev. Pharmacoecon. Outcomes Res. 24, 467–475 (2024).
Cho, S. et al. Leveraging large language models for improved understanding of communications with patients with cancer in a call center setting: proof-of-concept study. J. Med. Internet Res. 26, e63892 (2024).
Gupta, G. K., Singh, A., Manikandan, S. V. & Ehtesham, A. Digital diagnostics: the potential of large language models in recognizing symptoms of common illnesses. AI 6, 13 (2025).
Huang, X. et al. Evaluating the performance of ChatGPT in clinical pharmacy: a comparative study of ChatGPT and clinical pharmacists. Br. J. Clin. Pharm. 90, 232–238 (2024).
Han, C. et al. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: insights from the UK Biobank and KoGES data. iScience 27, 109022 (2024).
Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969–e2440969 (2024).
Rydzewski, N. R. et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI 1, 1–14 (2024).
Hirosawa, T. et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med. Inf. 11, e48808 (2023).
Chen, A., Chen, D. O. & Tian, L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. J. Am. Med. Inf. Assoc. 31, 2084–2088 (2024).
Forrest, C. B. et al. Self-reported health outcomes of children and youth with 10 chronic diseases. J. Pediatr. 246, 207–212 (2022).
Pierzynski, J. A. et al. Patient-reported outcomes in paediatric cancer survivorship: a qualitative study to elicit the content from cancer survivors and caregivers. BMJ Open 10, e032414 (2020).
Forrest, C. B. et al. Establishing the content validity of PROMIS Pediatric pain interference, fatigue, sleep disturbance, and sleep-related impairment measures in children with chronic kidney disease and Crohn’s disease. J. Patient Rep. Outcomes 4, 11 (2020).
Varni, J. W. et al. PROMIS Pediatric Pain Interference Scale: an item response theory analysis of the pediatric pain item bank. J. Pain 11, 1109–1119 (2010).
Lai, J. S. et al. Development and psychometric properties of the PROMIS(®) pediatric fatigue item banks. Qual. Life Res. 22, 2417–2427 (2013).
Hahn, E. A. et al. Precision of health-related quality-of-life data compared with other clinical measures. Mayo Clin. Proc. 82, 1244–1254 (2007).
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems 22199–22213 (NeurIPS, 2022).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems 24824–24837 (NeurIPS, 2022).
Liu, J. et al. Generated knowledge prompting for commonsense reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 3154–3169 (Association for Computational Linguistics, 2022).
Zelikman, E., Wu, Y., Mu, J. & Goodman, N. STaR: Bootstrapping reasoning with reasoning. In Proceedings of the 36th International Conference on Neural Information Processing Systems 15476–15488 (NeurIPS, 2022).
Kakaday, R., Herrera, E. Z., Coskey, O., Hertel, A. W. & Kaiser, P. The STREAMLINE Pilot—study on time reduction and efficiency in AI-mediated logging for improved note-taking experience. Appl. Clin. Inf. 16, 614–621 (2025).
Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit. Med. 6, 135 (2023).
Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38 (2023).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems 9459–9474 (NeurIPS, 2020).
Li, M., Kilicoglu, H., Xu, H. & Zhang, R. BiomedRAG: a retrieval augmented large language model for biomedicine. J. Biomed. Inf. 162, 104769 (2025).
Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).
Xu, D. et al. Editing factual knowledge and explanatory ability of medical large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management 2660–2670 (Association for Computing Machinery, 2024).
Rajkomar, A., Hardt, M., Howell, M. D., Corrado, G. & Chin, M. H. Ensuring fairness in machine learning to advance health equity. Ann. Intern. Med. 169, 866–872 (2018).
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. 54, 1–35 (2021).
Source data from: Optimizing prompting strategies improves large language model classification of pain- and fatigue-related functional impact in childhood cancer survivors. Zenodo https://doi.org/10.5281/zenodo.18526848 (2026).
Acknowledgements
The research reported in this manuscript was supported by the U.S. National Cancer Institute (NCI) under award numbers NCI U01CA195547, NCI R21CA202210, NCI R01CA238368, and NCI R01CA258193. This research was also supported by a grant from the National Research Foundation of Korea (NRF), funded by the Korean government (no. 2022R1C1C1009902). The content is solely the responsibility of the authors and does not represent the official views of the funding agencies. Support for St. Jude Children’s Research Hospital was also provided by the Cancer Center Support (CORE) grant (P30 CA21765, C. Roberts, Principal Investigator) and the American Lebanese-Syrian Associated Charities (ALSAC). In addition, the authors thank Rachel M. Keesey and Ruth J. Eliason for conducting in-depth interviews with study participants; Jennifer L. Clegg and Conor M. Jones, MD, for annotating symptom data from the interview data; and Christopher B. Forrest, MD, PhD, for adjudicating discrepancies in the annotation of symptom data.
Author information
Authors and Affiliations
Contributions
Conceptualization: J.A.S. and I.C.H.; Funding acquisition: J.A.S., K.K.N., M.M.H., and I.C.H.; Methodology: J.A.S., X.H., and I.C.H.; Data analysis: J.A.S. and M.K.; Interpretation of data: all co-authors; Project administration: I.C.H.; Resources: K.K.N., M.M.H., J.N.B., and I.C.H.; Supervision: I.C.H.; Writing: J.A.S., M.R.H., and I.C.H.; Review and editing: all co-authors; All authors have read and agreed to the submitted version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sim, J.A., Horan, M.R., Huang, X. et al. Optimizing prompting strategies improves large language model classification of pain- and fatigue-related functional impact in childhood cancer survivors. Commun Med (2026). https://doi.org/10.1038/s43856-026-01499-5