Abstract
ProTrek unifies protein sequence, structure and natural language function in a trimodal language model trained with contrastive learning, enabling comprehensive searches between any two modalities, including searches within a single modality. ProTrek surpasses current alignment tools (for example, Foldseek and MMseqs2) in both speed and accuracy for identifying functionally related proteins. Computational and wet-lab experimental validations show that the ProTrek server (www.search-protrek.com), with precomputed embeddings for over 5 billion proteins, efficiently processes and analyzes large-scale protein repositories.
Data availability
The pretrained model weights of ProTrek (650 million and 35 million parameters) can be obtained online (https://huggingface.co/westlake-repl/ProTrek_650M and https://huggingface.co/westlake-repl/ProTrek_35M, respectively). All precomputed protein sequence, structure and text embeddings for the SWISS-PROT database are available online (https://huggingface.co/datasets/westlake-repl/faiss_index). Larger protein databases (terabyte-scale embeddings covering billions of entries) are accessible through ProTrek’s web server (http://search-protrek.com) and the official GitHub repository (https://github.com/westlake-repl/ProTrek). The structural 3Di sequences are available from GitHub (https://github.com/steineggerlab/foldseek). The text descriptions are available from UniProt (https://www.uniprot.org).
Code availability
ProTrek is open-sourced under the MIT license. The code repository is available from GitHub (https://github.com/westlake-repl/ProTrek). The ProTrek web server can be accessed online (http://search-protrek.com). ColabProTrek v1 and v2 are available online (https://colab.research.google.com/github/westlake-repl/SaprotHub/blob/main/colab/ColabProTrek.ipynb and https://colab.research.google.com/github/westlake-repl/SaprotHub/blob/main/colab/ColabSeprot.ipynb?hl=en, respectively). For users requiring private database integration, we provide command-line tools to build embeddings for local deployment through GitHub (https://github.com/westlake-repl/ProTrek#add-custom-database).
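For local deployments such as the custom-database workflow above, retrieval reduces to nearest-neighbour search over precomputed embeddings. The sketch below illustrates in plain Python the inner-product search that the Faiss index files perform at scale; the toy vectors and the `search` helper are illustrative placeholders, not ProTrek’s actual API.

```python
# Minimal sketch of embedding retrieval: L2-normalize vectors so that
# inner product equals cosine similarity, then rank database entries
# by their inner product with the query. A production deployment would
# delegate this to a Faiss index over billions of vectors.
import math

def normalize(v):
    """L2-normalize a vector so inner product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy "database" of precomputed protein embeddings (id -> vector).
db = {
    "P1": normalize([0.9, 0.1, 0.0]),
    "P2": normalize([0.1, 0.9, 0.0]),
    "P3": normalize([0.5, 0.5, 0.1]),
}

def search(query_vec, k=2):
    """Return the top-k database ids ranked by inner product with the query."""
    q = normalize(query_vec)
    scores = {pid: sum(a * b for a, b in zip(q, v)) for pid, v in db.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(search([1.0, 0.0, 0.0]))  # → ['P1', 'P3']
```

Because ProTrek’s three encoders project into a shared space, the same ranking step serves sequence-to-sequence, text-to-sequence and structure-to-sequence queries; only the query encoder changes.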
References
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40, 932–937 (2022).
Gane, A. et al. ProtNLM: model-based natural language protein annotation. Preprint at https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf (2022).
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).
Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227 (2013).
Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).
Hong, L. et al. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat. Biotechnol. 43, 983–995 (2025).
Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).
Touvron, H. et al. LLaMA 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).
Zhou, X. et al. Decoding the molecular language of proteins with Evolla. Preprint at bioRxiv https://doi.org/10.1101/2025.01.05.630192 (2025).
Peng, F. Z. et al. PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks. Nat. Methods 22, 945–949 (2025).
Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. In Proc. 12th International Conference on Learning Representations (ICLR, 2024); https://openreview.net/forum?id=6MRm3G4NiU
Su, J. et al. SaprotHub: making protein modeling accessible to all biologists. Preprint at bioRxiv https://doi.org/10.1101/2024.05.24.595648 (2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Koehler Leman, J. et al. Sequence–structure–function relationships in the microbial protein universe. Nat. Commun. 14, 2351 (2023).
Todd, A. E., Orengo, C. A. & Thornton, J. M. Evolution of protein function, from a structural perspective. Curr. Opin. Chem. Biol. 3, 548–556 (1999).
Douze, M. et al. The Faiss library. Preprint at https://arxiv.org/abs/2401.08281 (2024).
Liu, S. et al. A text-guided protein design framework. Nat. Mach. Intell. 7, 580–591 (2025).
Xu, M., Yuan, X., Miret, S. & Tang, J. ProtST: multi-modality learning of protein sequences and biomedical texts. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 38749–38767 (PMLR, 2023).
Chen, J. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371–379 (2024).
Hu, Z. et al. Discovery and engineering of small SlugCas9 with broad targeting range and high specificity and activity. Nucleic Acids Res. 49, 4008–4019 (2021).
Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).
Kweon, J. et al. Efficient DNA base editing via an optimized DYW-like deaminase. Preprint at bioRxiv https://doi.org/10.1101/2024.05.15.594452 (2024).
Gherardini, P. F., Wass, M. N., Helmer-Citterich, M. & Sternberg, M. J. E. Convergent evolution of enzyme active sites is not a rare phenomenon. J. Mol. Biol. 372, 817–845 (2007).
Doolittle, R. F. Convergent evolution: the need to be explicit. Trends Biochem. Sci. 19, 15–18 (1994).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Pomaznoy, M., Ha, B. & Peters, B. GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinformatics 19, 470 (2018).
Dauparas, J. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
He, Y. et al. Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing. Mol. Cell 84, 1257–1270 (2024).
Tong, H. et al. Development of deaminase-free T-to-S base editor and C-to-G base editor by engineered human uracil DNA glycosylase. Nat. Commun. 15, 4897 (2024).
Ye, L. et al. Glycosylase-based base editors for efficient T-to-G and C-to-G editing in mammalian cells. Nat. Biotechnol. 42, 1538–1547 (2024).
Cornman, A. et al. The OMG dataset: an Open MetaGenomic corpus for mixed-modality genomic language modeling. In Proc. 13th International Conference on Learning Representations (ICLR, 2025); https://openreview.net/forum?id=jlzNb1iWs3
Kavli, B. et al. Excision of cytosine and thymine from DNA by mutants of human uracil-DNA glycosylase. EMBO J. 15, 3442–3447 (1996).
Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025).
Burley, S. K. et al. RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 47, D464–D474 (2019).
Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).
Pruitt, K. D., Tatusova, T., Brown, G. R. & Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135 (2012).
Dai, F. et al. Toward de novo protein design from natural language. Preprint at bioRxiv https://doi.org/10.1101/2024.08.01.606258 (2024).
Liu, N. et al. Protein design with dynamic protein vocabulary. Preprint at https://arxiv.org/abs/2505.18966 (2025).
Kuang, J., Liu, N., Sun, C., Ji, T. & Wu, Y. PDFBench: a benchmark for de novo protein design from function. Preprint at https://arxiv.org/abs/2505.20346 (2025).
Ko, Y. S. Using ProTrek for protein binder design. Twitter https://x.com/youngsuko9/status/1865845977673834595 (2024).
Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1827760237194920435 (2024).
Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1813427191000035330 (2024).
Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1882642214624678193 (2025).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (Association for Computational Linguistics, 2019).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Gupta, R. & Liu, Y.) 3505–3506 (Association for Computing Machinery, 2020).
Loshchilov, I. & Hutter, F. Fixing weight decay regularization in Adam. OpenReview.net https://openreview.net/forum?id=rk6qdGgCZ (2018).
Loshchilov, I. & Hutter, F. SGDR: stochastic gradient descent with warm restarts. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Skq89Scxx
Xu, J. et al. Protein inverse folding from structure feedback. Preprint at https://arxiv.org/abs/2506.03028 (2025).
Enzyme Nomenclature (Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, 2024); https://iubmb.qmul.ac.uk/enzyme/
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).
Kucera, T., Oliver, C., Chen, D. & Borgwardt, K. ProteinShake: building datasets and benchmarks for deep learning on protein structures. In Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (NeurIPS, 2023).
Acknowledgements
We thank P. Beltrao, S. Ovchinnikov, J. Zeng, C. Wang and J. Huang for their valuable suggestions to improve this paper. We thank L. Hong and M. Li for providing the predownloaded OMG and MGnify databases. This work is supported by Zhejiang Province Leading Geese Plan (2025C01094), the National Natural Science Foundation of China (32471547, U21A20427, 82450102 and 32025016), the Ministry of Science and Technology of the People’s Republic of China (2022YFA0807300), the National Key Research and Development Program of China (2022ZD0115100), the Zhejiang Key Laboratory of Low-Carbon Intelligent Synthetic Biology (2024ZY01025), the Guangzhou Science and Technology Program City–University Joint Funding Project (2023A03J0001), the Nansha Key Science and Technology Project (2023ZD015), the Guangdong Education Department (2023ZDZX2073), the Hong Kong University of Science and Technology 20 for 20 Cross-Campus Collaborative Research Scheme (G051) and the Westlake Center of Synthetic Biology and Integrated Bioengineering. We also thank N. Li and the Westlake HPC Center for computing resources and technical support.
Author information
Authors and Affiliations
Contributions
F.Y. conceptualized and led this research. J.S., Y.H. and S.Y. performed the main research. H.L. and X.C. supervised and led the wet-lab experiments and analyses. J.S. and F.Y. designed the ProTrek network architecture. J.S. conducted the machine learning modeling and implementation. J.S. and X. Zhou collected and curated the training datasets. J.S., S.J. and I.T. performed the computational experimental analyses. I.T., H.L. and X.C. provided bioinformatics guidance. X. Zhang developed the ColabProTrek. Y.H., S.Y., X.S., H.L. and X.C. conducted the wet-lab experiments and evaluations. Y.W. explored an early wet-lab validation experiment. F.Y., J.S., Y.H., X.C. and H.L. wrote and revised the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Wei Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Evaluation of hit rates for Enzyme Commission (EC) numbers and CRISPR-associated proteins using ProTrek’s text-to-sequence retrieval function.
Textual descriptions were used to query the UniRef50 database and retrieve the top 100 proteins not included in the training data. a, Bar plot depicting the hit rate for six EC numbers selected from six top-level EC classes, categorized into exact matches and increasing levels of mismatch (Mismatch-1: mismatch in the last digit; Mismatch-2: mismatch in the last two digits; Mismatch-3: only the first digit matches; Mismatch: mismatch in all digits). Each bar shows the proportion of exact matches and progressively less accurate matches, demonstrating ProTrek’s performance in identifying enzyme categories and the variation in match accuracy across EC numbers. b, Bar plot showing the hit rate for six extensively studied CRISPR-associated protein families (Cas10, Cas3, Cas9, Cas12, Cas13 and Csm/Cmr). Each bar is segmented into matches and mismatches, showing ProTrek’s retrieval performance across these protein categories. A retrieved protein is counted as a true hit if the corresponding keyword (such as ‘Cas10’, ‘Cas3’, ‘Cas9’ or ‘Cas12’) or EC number (such as ‘EC:3.4.15.1’) appears in its ground-truth annotation.
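The mismatch categories in panel a can be read as a left-to-right, digit-by-digit comparison of EC numbers. The sketch below encodes that reading; the function name and labels follow the caption, but this is our interpretation, not the authors’ evaluation code.

```python
# Classify how well a retrieved protein's EC number matches the query EC
# number: count shared leading digits, then map the count to the
# Exact/Mismatch-1/-2/-3/Mismatch categories of Extended Data Fig. 1a.
def ec_match_level(query_ec: str, hit_ec: str) -> str:
    q, h = query_ec.split("."), hit_ec.split(".")
    shared = 0
    for a, b in zip(q, h):
        if a != b:
            break
        shared += 1
    if shared == 4:
        return "Exact"
    if shared == 3:
        return "Mismatch-1"  # only the last digit differs
    if shared == 2:
        return "Mismatch-2"  # the last two digits differ
    if shared == 1:
        return "Mismatch-3"  # only the first digit matches
    return "Mismatch"        # all digits differ

print(ec_match_level("3.4.15.1", "3.4.15.1"))  # → Exact
print(ec_match_level("3.4.15.1", "3.4.21.7"))  # → Mismatch-2
```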
Extended Data Fig. 2 Structural similarity matrix of the top 30 exact matched enzymes.
The matrix shows the pair-wise TM-scores for the top 30 exact matched enzymes that were shown in Extended Data Fig. 1a. The value of each cell in the matrix denotes the TM-score, indicating the degree of structural similarity for each protein pair. The color scale indicates the TM-score value, where warmer colors represent higher structural similarity and cooler colors represent lower similarity.
Extended Data Fig. 3 Structural similarity matrix of the top 30 identified CRISPR-associated proteins.
The matrix displays the pair-wise TM-scores for the top 30 matched CRISPR-associated proteins that were shown in Extended Data Fig. 1b. The value of each cell in the matrix denotes the TM-score, indicating the degree of structural similarity for each protein pair. The color scale indicates the TM-score value, where warmer colors represent higher structural similarity and cooler colors represent lower similarity.
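The symmetric matrices of Extended Data Figs. 2 and 3 can be assembled as follows. The `score_fn` argument stands in for an external structural aligner such as TM-align; the dummy scorer below exists only to make the sketch runnable, and the TM-score is treated as symmetric here for display purposes.

```python
# Build a symmetric all-vs-all similarity matrix over a set of items,
# with 1.0 on the diagonal (a structure compared with itself).
def pairwise_matrix(items, score_fn):
    n = len(items)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = 1.0 if i == j else score_fn(items[i], items[j])
            mat[i][j] = mat[j][i] = s  # score treated as symmetric here
    return mat

# Dummy scorer: pretend "structures" are points and map distance to (0, 1].
def dummy_score(a, b):
    return 1.0 / (1.0 + abs(a - b))

m = pairwise_matrix([0.0, 1.0, 3.0], dummy_score)
```

In practice, each `score_fn(a, b)` call would run TM-align on a pair of structure files and parse the reported TM-score; the matrix is then rendered as the heat maps shown in the figures.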
Extended Data Fig. 4 Experimental validation of proteins identified through ProTrek searches.
a, Sequence alignment of hUDG-Y147A with proteins identified via ProTrek searches. S-V1 to S-V5 represent the top five hits obtained from the sequence-to-sequence search, while T-V1 to T-V5 represent the top five hits obtained from the text-to-sequence search. Yellow-highlighted regions denote highly conserved sequences. Sequence alignment was performed using Clustal Omega. b, Proteins S-V1 to S-V5 and T-V1 to T-V5 were fused with Cas9n following the introduction of a mutation analogous to UDG-Y147A. Notably, S-V1 and T-V1 refer to the same protein. eGFP was used as a negative control, while the mock group represented cells without any treatment. HeLa cells were transfected with the base editor constructs alongside specific sgRNAs targeting Dicer-1 and VEGFA. Five days post-transfection, thymine nucleotide substitutions at the target sites were quantified using high-throughput sequencing (HTS), with the mutated nucleotide positions annotated relative to the 5’ end of the protospacer. Data are presented as mean ± s.d. from two independent experiments (n = 2).
Extended Data Fig. 5 Alignment speed comparison (CPU time) for 100 query proteins against the UniRef50 database, using 24 CPU cores.
ProTrek demonstrates efficient database processing and searching capabilities with rapid response times, achieving over 100-fold speed improvement compared to Foldseek and MMseqs2.
Extended Data Fig. 6 ProTrek score sensitivity analysis for keyword and template regions in functional descriptions.
We first computed the similarity score (‘Score_original’) between a protein sequence and its complete textual description. Next, we systematically removed each word from the description and recalculated the similarity score with the protein. For the template region (blue circle), we averaged all similarity scores obtained after deleting words within this region to derive ‘Score_changed’. Similarly, for the keyword region (orange triangle), ‘Score_changed’ was calculated by averaging all similarity scores resulting from the removal of words in that specific region.
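The leave-one-word-out procedure described above can be sketched directly. The `similarity` callable is a placeholder for ProTrek’s sequence-text scorer, and the toy scorer and word indices are illustrative assumptions, not the authors’ implementation.

```python
# For one region of a description, compute Score_original (full text)
# and Score_changed (average score after deleting each region word).
def deletion_sensitivity(seq, description, region_indices, similarity):
    words = description.split()
    score_original = similarity(seq, description)
    ablated_scores = []
    for i in region_indices:
        # Rebuild the description with word i removed.
        ablated = " ".join(w for j, w in enumerate(words) if j != i)
        ablated_scores.append(similarity(seq, ablated))
    score_changed = sum(ablated_scores) / len(ablated_scores)
    return score_original, score_changed

# Toy scorer: rewards presence of the keyword "hydrolase" in the text.
toy = lambda seq, text: 1.0 if "hydrolase" in text.split() else 0.2

orig, changed = deletion_sensitivity(
    "MKT...", "putative hydrolase enzyme", region_indices=[1], similarity=toy)
print(orig, changed)  # → 1.0 0.2
```

A large drop from `orig` to `changed` marks the deleted words as keywords driving the score, whereas a negligible drop marks them as template text.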
Supplementary information
Supplementary Information
Supplementary Tables 1–16 and Figs. 1–9.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Su, J., He, Y., You, S. et al. A trimodal protein language model enables advanced protein searches. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02836-0
DOI: https://doi.org/10.1038/s41587-025-02836-0
This article is cited by
- Learning physical interactions to compose biological large language models. Communications Chemistry (2026)
- Ab-initio amino acid sequence design from protein text description with ProtDAT. Nature Communications (2025)
- Democratizing protein language model training, sharing and collaboration. Nature Biotechnology (2025)