Abstract
Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants.
Data availability
All data used in this paper are publicly available. The full GO (2023-11-15 release) was downloaded from http://release.geneontology.org/2023-11-15/ontology/index.html. The selected NeST gene set is available to download from https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation/blob/main/data/Omics_data/NeST__IAS_clixo_hidef_Nov17.edges. The L1000 data used in this study are available at https://maayanlab.cloud/static/hdfs/harmonizome/data/lincscmapchemical/gene_attribute_edges.txt.gz. The viral infection data are available at https://maayanlab.cloud/static/hdfs/harmonizome/data/geovirus/gene_attribute_matrix.txt.gz. Detailed information on data download and parsing procedures, along with all datasets used in this paper, is available in our GitHub repository at https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation.
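For readers who want to fetch these files programmatically, the snippet below is a minimal sketch of downloading and loading the L1000 gene-attribute edge list with pandas. The URL is taken from the statement above; the assumption that the file is a gzip-compressed, tab-separated edge list reflects the Harmonizome export format and is not part of the authors' pipeline.

```python
# Minimal sketch: download and load the L1000 gene-attribute edge list.
# Assumes a gzip-compressed, tab-separated Harmonizome export; column names
# are whatever the file provides and are not hard-coded here.
import pandas as pd

L1000_URL = (
    "https://maayanlab.cloud/static/hdfs/harmonizome/data/"
    "lincscmapchemical/gene_attribute_edges.txt.gz"
)

# pandas can stream a gzipped TSV directly from a URL.
edges = pd.read_csv(L1000_URL, sep="\t", compression="gzip", low_memory=False)
print(edges.shape)
print(edges.head())
```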
Code availability
The code to run the LLM gene set analysis pipeline and to reproduce results for the evaluation tasks is available via GitHub at https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation or Code Ocean (ref. 77; https://doi.org/10.24433/CO.7045777.v1) under the MIT License. Note that LLM outputs are inherently stochastic, and the precise names and analysis text produced by the models are not guaranteed to be the same from run to run. We minimized the variability of the outputs as described in the 'Controlling the variability of LLM responses' section in Methods.
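For orientation before cloning the repository, the following is a minimal sketch of a single gene-set query with the variability controls referred to above (temperature set to 0 and, where the backend supports it, a fixed seed). It assumes the OpenAI Python client; the prompt wording, example genes and response handling are illustrative and do not reproduce the repository's exact code.

```python
# Minimal sketch of one LLM gene-set query with reduced output variability.
# Assumes the OpenAI Python client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

genes = ["KLHL3", "CUL3", "WNK1", "WNK4"]  # illustrative gene set
prompt = (
    "Propose a concise name for the most prominent biological process "
    "performed by this gene set, a confidence score between 0 and 1, and a "
    "brief supporting rationale. Genes: " + ", ".join(genes)
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # sample as deterministically as possible
    seed=42,        # best-effort reproducibility where supported
)
print(response.choices[0].message.content)
```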
References
Zeeberg, B. R. et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 4, R28 (2003).
Breitling, R., Amtmann, A. & Herzyk, P. Iterative group analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinf. 5, 34 (2004).
Beissbarth, T. & Speed, T. P. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20, 1464–1465 (2004).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Al-Shahrour, F. et al. From genes to functional classes in the study of biological systems. BMC Bioinf. 8, 114 (2007).
Backes, C. et al. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Res. 35, W186–W192 (2007).
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinf. 14, 128 (2013).
Pomaznoy, M., Ha, B. & Peters, B. GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinf. 19, 470 (2018).
Cerami, E. G. et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 39, D685–D690 (2011).
Fabregat, A. et al. The Reactome pathway Knowledgebase. Nucleic Acids Res. 44, D481–D487 (2015).
Pico, A. R. et al. WikiPathways: pathway editing for the people. PLoS Biol. 6, e184 (2008).
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012).
Pillich, R. T. et al. NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange. Bioinformatics 39, btad118 (2023).
Wang, S. et al. Typing tumors using pathways selected by somatic evolution. Nat. Commun. 9, 4159 (2018).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Gene Ontology Consortium et al. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. & Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023).
Croft, D. Reactome: a database of biological pathways. Nat. Preced. https://doi.org/10.1038/npre.2010.5025.1 (2010).
Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).
Sollis, E. et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res. 51, D977–D985 (2023).
Blake, J. A. et al. The Mouse Genome Database genotypes::phenotypes. Nucleic Acids Res. 37, D712–D719 (2009).
Weng, M.-P. & Liao, B.-Y. MamPhEA: a web tool for mammalian phenotype enrichment analysis. Bioinformatics 26, 2212–2213 (2010).
Keenan, A. B. et al. ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Res. 47, W212–W224 (2019).
Rubin, J. D. et al. Transcription factor enrichment analysis (TFEA) quantifies the activity of multiple transcription factors from a single experiment. Commun. Biol. 4, 661 (2021).
Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).
Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).
Hu, C. et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 51, D870–D876 (2023).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) 1877–1901 (NeurIPS, 2020).
Vaswani, A. et al. Attention is all you need. Neural Inf. Process Syst. 30, 5998–6008 (2017).
OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Jiang, A. Q. et al. Mixtral of experts. Preprint at https://arxiv.org/abs/2401.04088 (2024).
Gemini Team et al. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).
Joachimiak, M. P., Harry Caufield, J., Harris, N. L., Kim, H. & Mungall, C. J. Gene set summarization using large language models. Preprint at https://arxiv.org/abs/2305.13338 (2023).
Moghaddam, S. R. & Honey, C. J. Boosting theory-of-mind performance in large language models via prompting. Preprint at https://arxiv.org/abs/2304.11490 (2023).
Hebenstreit, K., Praas, R., Kiesewetter, L. P. & Samwald, M. An automatically discovered chain-of-thought prompt generalizes to novel models and datasets. Preprint at https://arxiv.org/abs/2305.02897 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 24824–24837 (NeurIPS, 2022).
Caufield, J. H. et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40, btae104 (2024).
Miller, G. A. & Charles, W. G. Contextual correlates of semantic similarity. Lang. Cogn. Process. 6, 1–28 (1991).
Xiong, M. et al. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations (ICLR, 2024).
Fu, J., Ng, S.-K., Jiang, Z. & Liu, P. GPTScore: evaluate as you desire. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1: Long Papers (eds Duh, K. et al.) 6556–6576 (Association for Computational Linguistics, 2024).
Kolberg, L. et al. g:Profiler-interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 51, W207–W212 (2023).
Duan, Q. et al. LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res. 42, W449–W460 (2014).
Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452.e17 (2017).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2013).
Zheng, F. et al. Interpretation of cancer mutations using a multiscale map of protein systems. Science 374, eabf3067 (2021).
Pinkas, D. M. et al. Structural complexity in the KCTD family of Cullin3-dependent E3 ubiquitin ligases. Biochem. J. 474, 3747–3761 (2017).
Dhanoa, B. S., Cogliati, T., Satish, A. G., Bruford, E. A. & Friedman, J. S. Update on the Kelch-like (KLHL) gene family. Hum. Genomics 7, 13 (2013).
Pleiner, T. et al. WNK1 is an assembly factor for the human ER membrane protein complex. Mol. Cell 81, 2693–2704.e12 (2021).
Berthold, J. et al. Characterization of RhoBTB-dependent Cul3 ubiquitin ligase complexes—evidence for an autoregulatory mechanism. Exp. Cell. Res. 314, 3453–3465 (2008).
McCormick, J. A. et al. Hyperkalemic hypertension-associated cullin 3 promotes WNK signaling by degrading KLHL3. J. Clin. Invest. 124, 4723–4736 (2014).
Sohara, E. & Uchida, S. Kelch-like 3/Cullin 3 ubiquitin ligase complex and WNK signaling in salt-sensitive hypertension and electrolyte disorder. Nephrol. Dial. Transpl. 31, 1417–1424 (2016).
Tang, H., Finn, R. D. & Thomas, P. D. TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations. Bioinformatics 35, 518–520 (2019).
Groh, B. S. et al. The antiobesity factor WDTC1 suppresses adipogenesis via the CRL4WDTC1 E3 ligase. EMBO Rep. 17, 638–647 (2016).
Ji, W. & Rivero, F. Atypical rho GTPases of the RhoBTB subfamily: roles in vesicle trafficking and tumorigenesis. Cells 5, 28 (2016).
Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://arxiv.org/abs/2303.13375 (2023).
López Espejel, J., Ettifouri, E. H., Yahaya Alassan, M. S., Chouham, E. M. & Dahhane, W. GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot learning and performance boosting through prompts. Nat. Lang. Process. J. 5, 100032 (2023).
Yu, H. et al. Evaluation of retrieval-augmented generation: a survey. Preprint at https://arxiv.org/abs/2405.07437 (2024).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations (ICLR, 2023).
Nair, V., Schumacher, E., Tso, G. & Kannan, A. DERA: enhancing large language model completions with dialog-enabled resolving agents. In Proc. 6th Clinical Natural Language Processing Workshop (eds Naumann, T. et al.) 122–161 (Association for Computational Linguistics, 2023).
Shinn, N. et al. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 8634–8652 (NeurIPS, 2023).
Li, G., Al Kader Hammoud, H. A., Itani, H., Khizbullin, D. & Ghanem, B. CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Scale Language Model Society. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 36, 51991–52008 (NeurIPS, 2023).
Schick, T. et al. Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 68539–68551 (NeurIPS, 2023).
Shen, Y. et al. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 36, 38154–38180 (NeurIPS, 2023).
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at https://arxiv.org/abs/1909.05858 (2019).
Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In The Eighth International Conference on Learning Representations (ICLR, 2020).
Smith, B. et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 25, 1251–1255 (2007).
Tirmizi, S. H. et al. Mapping between the OBO and OWL ontology languages. J. Biomed. Semant. 2, S3 (2011).
Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 4228–4238 (Association for Computational Linguistics, 2021).
Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45 (Association for Computational Linguistics, 2020).
Rouillard, A. D. et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database 2016, baw100 (2016).
Seal, R. L. et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 51, D1003–D1009 (2023).
Hu, M. et al. Evaluation of Large Language Models for Discovery of Gene Set Function (Code Ocean, 2024); https://doi.org/10.24433/CO.7045777.V1
Acknowledgements
This work was supported by National Institutes of Health grants U24 CA269436 (R.T.P., D.F., K.S., T.I. and D.P.), OT2 OD032742 (M.H., T.I. and D.P.), U24 HG012107 (D.F., T.I. and D.P.) and U01 MH115747 (S.A. and T.I.). Additional support was received from Schmidt Futures (M.H. and T.I.). We thank X. Zhao and A. Singhal for insightful comments, M. R. Kelly for providing the NeST data raw files and C. Churas and J. Lenkiewicz for helping to improve the GitHub repository.
Author information
Authors and Affiliations
Contributions
M.H., S.A., T.I. and D.P. designed the study. M.H. and S.A. developed and implemented the automated LLM-based gene set interpretation pipeline, performed the data analysis and organized the GitHub repository. S.A. developed and assessed the semantic similarity calculation. I.L. and M.H. contributed to the development of the citation search and validation pipeline. D.P. contributed to the coding and the evaluation of the analysis. R.T.P. assisted in the study design, prompt engineering and the evaluation of the analysis. M.H., R.T.P., R.B. and D.P. conducted the scientific review of the LLM output. M.H. and D.P. contributed to the user interface design for the GSAI tool. D.F. built the web interface for the GSAI tool, and K.S. set up the server for accessing open-source LLMs. M.H., S.A., T.I. and D.P. wrote the paper with input from all authors. All authors approved the final version of this paper.
Corresponding authors
Ethics declarations
Competing interests
T.I. is a cofounder and member of the advisory board and has an equity interest in Data4Cure and Serinus Biosciences. T.I. is a consultant for and has an equity interest in Ideaya Biosciences. The terms of these arrangements have been reviewed and approved by the University of California San Diego in accordance with its conflict-of-interest policies. The other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Qiao Jin, Zhiyong Lu, Zhizheng Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Schematic of the citation module.
a, GPT-4 is asked to provide gene symbol keywords and functional keywords separately. Multiple gene keywords and functions are combined and used to search PubMed for relevant paper titles and abstracts in the scientific literature. GPT-4 is queried to evaluate each abstract, saving supporting references. b, Prompts used to query the GPT-4 model.
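To make the search step in panel a concrete, the sketch below combines one gene keyword with one functional keyword and queries PubMed through the public NCBI E-utilities. The query construction, keyword pairing and result handling are illustrative assumptions rather than the citation module's exact implementation.

```python
# Minimal sketch of the PubMed search step: pair a gene keyword with a
# functional keyword and retrieve matching PubMed IDs via NCBI E-utilities.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pubmed(gene_keyword: str, function_keyword: str, retmax: int = 5):
    """Return PubMed IDs for records mentioning both keywords."""
    params = {
        "db": "pubmed",
        "term": f"{gene_keyword} AND {function_keyword}",
        "retmax": retmax,
        "retmode": "json",
    }
    reply = requests.get(ESEARCH, params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["esearchresult"]["idlist"]

# Example pairing of a gene keyword with a functional keyword.
print(search_pubmed("KLHL3", "ubiquitin ligase complex"))
```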
Extended Data Fig. 2 Distribution of GO term gene sizes.
a, Distribution of term size (number of genes) for terms in the Biological Process branch (GO-BP). Terms with 3–100 genes shown (n = 8,910). b, Distribution of term size for the 1000 GO terms used in Task 1.
Extended Data Fig. 3 Evaluation of GPT-4 in recovery of GO-CC and GO-MF names.
a, Cumulative number of GO-CC term names recovered by GPT-4 (y-axis) at a given similarity percentile (x-axis). 0 = least similar, 100 = most similar. Blue curve: semantic similarities between GPT-4 names and assigned GO-CC term names. Grey dashed curve: semantic similarities between GPT-4 names and random GO-CC term names. The red dotted line marks that 642 of the 1000 sampled GO-CC names are recovered by GPT-4 at a similarity percentile of 95%. b, As for panel a, but for GO-MF terms rather than GO-CC. The red dotted line marks that 757 of the 1000 sampled GO-MF names are recovered by GPT-4 at a similarity percentile of 95%.
Extended Data Fig. 4 Supplemental analysis of the confidence score.
a, Distribution of confidence scores (n = 300) assigned by GPT-4 with confidence level threshold set based on the distribution pattern. “High confidence” (red): 0.87–1.00; “Medium confidence” (blue): 0.82–0.86; “Low confidence” (dark orange): 0.01–0.81; “Name not assigned” (gray): 0. b, Scatter plot of naming accuracy versus GPT-4 self-assessed confidence score for real gene sets drawn from GO (points, n = 100). Accuracy is estimated by the semantic similarity between the GPT-4 proposed name and the real GO term name. The best-fit regression line is shown in dark gray. The correlation coefficient (R) is determined by a two-sided Pearson’s correlation with p-value shown.
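In outline, the statistic in panel b can be computed as follows; the arrays here are placeholders for the per-gene-set confidence scores and semantic similarities, and the use of scipy's two-sided Pearson test is an assumption about the implementation rather than the authors' exact code.

```python
# Minimal sketch of the panel-b statistic: two-sided Pearson correlation
# between self-reported confidence and semantic similarity to the GO name.
# The arrays below are placeholders, not the study data.
import numpy as np
from scipy import stats

confidence = np.array([0.90, 0.85, 0.40, 0.88, 0.00, 0.95])  # placeholder
similarity = np.array([0.72, 0.66, 0.31, 0.70, 0.12, 0.80])  # placeholder

r, p_value = stats.pearsonr(confidence, similarity)  # two-sided by default
print(f"R = {r:.2f}, p = {p_value:.3g}")
```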
Extended Data Fig. 5 Distribution of ‘omics gene set sizes.
Distribution shown for all ‘omics gene sets considered in this study (n = 300).
Supplementary information
Supplementary Information
Description for Supplementary Tables 1–4.
Supplementary Tables 1–4
Supplementary Table 1. Complete analysis of GO terms, 50/50 mix and random for all models (related to task 1: Figs. 2a and 3b and Table 1). Supplementary Table 2. Complete GPT-4 analysis of GO terms (related to task 1: Fig. 2). Supplementary Table 3. Complete GPT-4 analysis of omics gene sets (related to task 2: Fig. 4 and Extended Data Table 4). Supplementary Table 4. Reviewer fact-checking of GPT-4 analysis text and citation relevance (related to task 2: Fig. 5).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, M., Alkhairy, S., Lee, I. et al. Evaluation of large language models for discovery of gene set function. Nat Methods 22, 82–91 (2025). https://doi.org/10.1038/s41592-024-02525-x