Abstract
We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene–drug–phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, through both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, while our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference speed. Results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
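To make the contrast between lexical and semantic evaluation concrete, the following is a minimal sketch of a unigram-overlap F1 score, a simplified stand-in for ROUGE-1 rather than the scoring code used in the study. The function name `unigram_f1` and both example sentences are hypothetical; the sentences are placeholders, not CPIC recommendations.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two strings (simplified ROUGE-1-style score).

    Assumption: lowercase whitespace tokenization, with no stemming or
    stopword removal.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical placeholder sentences with opposite meanings.
reference = "avoid drug x and choose an alternative"
candidate = "use drug x and avoid an alternative"

print(round(unigram_f1(reference, candidate), 3))  # 0.857
```

The two placeholder sentences share six of seven tokens, so the lexical score is high even though the instructions point in opposite directions. This is one reason a lexical metric alone may not reflect whether a recommendation is clinically correct.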
Data availability
The datasets used in this study have been deposited on Zenodo and are publicly available at: https://doi.org/10.5281/zenodo.15658429. All data are released under the MIT License.
Code availability
The code used for data processing and analysis is available on Zenodo at the same link: https://doi.org/10.5281/zenodo.15658429, under the MIT License.
Author information
Contributions
MZ designed the study, performed LLM evaluation, and contributed to manuscript writing. IS implemented the machine learning models, performed all analyses, and contributed to study design and manuscript writing. DS contributed to study design, result analysis, and manuscript writing. AM contributed to data visualization, introductory writing, and manuscript editing. DSok and IT provided infrastructure support and reviewed the manuscript. AG contributed to project supervision and provided strategic input.
Ethics declarations
Ethics approval and consent to participate
This study did not involve human participants, animals, or patient data. Therefore, ethics approval and informed consent were not required.
Competing interests
All authors are employees of PGxAI Inc. and declare no other competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zack, M., Slobodchikov, I., Stupichev, D. et al. Benchmarking large language models for replication of guideline-based PGx recommendations. Pharmacogenomics J 25, 23 (2025). https://doi.org/10.1038/s41397-025-00383-0
DOI: https://doi.org/10.1038/s41397-025-00383-0