Benchmarking large language models for replication of guideline-based PGx recommendations

Abstract

We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene–drug–phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, through both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, while our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference speed. Results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
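The contrast between lexical metrics and semantic evaluation can be illustrated with a minimal sketch. The code below is not the authors' evaluation pipeline. It implements only clipped n-gram precision, the core ingredient of BLEU, assuming a single reference, whitespace tokenization, and no brevity penalty. The function names and the three example sentences are hypothetical, invented for illustration, and are not drawn from the benchmark or from CPIC.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision against one reference.

    Simplified on purpose: whitespace tokens, one reference,
    no brevity penalty, so this is not a full BLEU score.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical texts (assumption: placeholder drug name, not benchmark data).
reference = "avoid drug x in poor metabolizers"
paraphrase = "do not prescribe drug x to poor metabolizers"
contradiction = "use drug x in poor metabolizers"

p_paraphrase = clipped_precision(paraphrase, reference)
p_contradiction = clipped_precision(contradiction, reference)

print(f"paraphrase:    {p_paraphrase:.3f}")
print(f"contradiction: {p_contradiction:.3f}")
```

In this toy case the contradictory sentence shares more words with the reference than the faithful paraphrase does, so a purely lexical score ranks it higher. That is the kind of failure a semantic score validated by expert review is meant to catch.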

Fig. 1: Exploratory data analysis of CPIC-based pharmacogenomic recommendations.
Fig. 2: Performance of LLM-based scoring metric for evaluating pharmacogenomic recommendations.
Fig. 3: LLM performance (LLM Score and BLEU) in clinical recommendation generation.
Fig. 4: BLEU score comparison across recent open-weight model variants.

Data availability

The datasets used in this study have been deposited on Zenodo and are publicly available at: https://doi.org/10.5281/zenodo.15658429. All data are released under the MIT License.

Code availability

The code used for data processing and analysis is available on Zenodo at the same link: https://doi.org/10.5281/zenodo.15658429, under the MIT License.

Author information

Contributions

MZ designed the study, performed LLM evaluation, and contributed to manuscript writing. IS implemented the machine learning models, performed all analyses, and contributed to study design and manuscript writing. DS contributed to study design, result analysis, and manuscript writing. AM contributed to data visualization, introductory writing, and manuscript editing. DSok and IT provided infrastructure support and reviewed the manuscript. AG contributed to project supervision and provided strategic input.

Corresponding author

Correspondence to Mike Zack.

Ethics declarations

Ethics approval and consent to participate

This study did not involve human participants, animals, or patient data. Therefore, ethics approval and informed consent were not required.

Competing interests

All authors are employees of PGxAI Inc. and declare no other competing interests.

Cite this article

Zack, M., Slobodchikov, I., Stupichev, D. et al. Benchmarking large language models for replication of guideline-based PGx recommendations. Pharmacogenomics J 25, 23 (2025). https://doi.org/10.1038/s41397-025-00383-0

  • DOI: https://doi.org/10.1038/s41397-025-00383-0
