Benchmarking large language models for replication of guideline-based PGx recommendations

Zack, Mike; Slobodchikov, Ioan; Stupichev, Danil; Moore, Alex; Sokolov, David; Trifonov, Igor; Gobbs, Allan

doi:10.1038/s41397-025-00383-0

Article
Published: 26 July 2025

Mike Zack (ORCID: orcid.org/0000-0003-0607-4812)¹, Ioan Slobodchikov¹, Danil Stupichev¹, Alex Moore¹, David Sokolov¹, Igor Trifonov¹ & Allan Gobbs¹

The Pharmacogenomics Journal volume 25, Article number: 23 (2025)

Abstract

We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene–drug–phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, through both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, while our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference speed. Results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
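To illustrate what a lexical metric of the kind named in the abstract measures, the following is a minimal sketch of a simplified, single-reference BLEU (clipped n-gram precision with a brevity penalty). It is not the authors' evaluation code: the function names, the whitespace tokenisation, the choice of `max_n=2`, the absence of smoothing, and the example sentences are all assumptions made for illustration, and the sentences are hypothetical rather than quoted from CPIC.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU (illustrative, no smoothing):
    geometric mean of clipped n-gram precisions for n = 1..max_n,
    multiplied by a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # any zero precision zeroes the score in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalise candidates that are not longer than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical reference recommendation and two hypothetical model outputs.
reference = "avoid codeine use due to risk of toxicity"
paraphrase = "do not prescribe codeine because of toxicity risk"
contradiction = "use codeine at standard dose"

print(simple_bleu(reference, reference))
print(simple_bleu(paraphrase, reference))
print(simple_bleu(contradiction, reference))
```

In this toy case the paraphrase, which carries the same clinical meaning as the reference, scores well below an exact match because few word sequences overlap. That gap between surface overlap and meaning is the kind of limitation a semantic evaluation such as the abstract's LLM Score is meant to address.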

**Fig. 1: Exploratory data analysis of CPIC-based pharmacogenomic recommendations.**

**Fig. 2: Performance of LLM-based scoring metric for evaluating pharmacogenomic recommendations.**

**Fig. 3: LLM performance (LLM Score and BLEU) in clinical recommendation generation.**

**Fig. 4: BLEU score comparison across recent open-weight model variants.**

Data availability

The datasets used in this study have been deposited on Zenodo and are publicly available at: https://doi.org/10.5281/zenodo.15658429. All data are released under the MIT License.

Code availability

The code used for data processing and analysis is available on Zenodo at the same link: https://doi.org/10.5281/zenodo.15658429, under the MIT License.

Author information

Authors and Affiliations

PGxAI Inc., 330 E Charleston Rd, Palo Alto, CA, 94306, USA
Mike Zack, Ioan Slobodchikov, Danil Stupichev, Alex Moore, David Sokolov, Igor Trifonov & Allan Gobbs

Contributions

MZ designed the study, performed LLM evaluation, and contributed to manuscript writing. IS implemented the machine learning models, performed all analyses, and contributed to study design and manuscript writing. DS contributed to study design, result analysis, and manuscript writing. AM contributed to data visualization, introductory writing, and manuscript editing. DSok and IT provided infrastructure support and reviewed the manuscript. AG contributed to project supervision and provided strategic input.

Corresponding author

Correspondence to Mike Zack.

Ethics declarations

Ethics approval and consent to participate

This study did not involve human participants, animals, or patient data. Therefore, ethics approval and informed consent were not required.

Competing interests

All authors are employees of PGxAI Inc. and declare no other competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental material

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zack, M., Slobodchikov, I., Stupichev, D. et al. Benchmarking large language models for replication of guideline-based PGx recommendations. Pharmacogenomics J 25, 23 (2025). https://doi.org/10.1038/s41397-025-00383-0

Received: 09 May 2025
Revised: 03 July 2025
Accepted: 16 July 2025
Published: 26 July 2025
DOI: https://doi.org/10.1038/s41397-025-00383-0