Medical knowledge representation enhancement in large language models through clinical tokens optimization
  • Article
  • Open access
  • Published: 29 January 2026


  • Qianqian Li1,
  • Jijun Tong2,
  • Shanna Liu3,
  • Chang Li3,
  • Jie Tang3 &
  • Qingli Zhou3

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computer science
  • Software

Abstract

During the training of medical large language models (LLMs), conventional tokenizers frequently segment domain-specific medical terms into multiple subword tokens, resulting in suboptimal recognition and representation of specialized vocabulary. As a consequence, the model has difficulty acquiring medical domain knowledge effectively during fine-tuning. To address this limitation, the present study introduces "clinical tokens" (medical subword units) by augmenting the vocabulary of the original LLaMA2 tokenizer. The adapted tokenizer retains medical terms as whole tokens wherever feasible, thereby improving tokenization accuracy and enabling the model to learn and interpret medical knowledge more effectively. For downstream task adaptation, the study employs the Byte Pair Encoding (BPE) algorithm to construct a domain-specific vocabulary and tokenization model that incorporates the clinical tokens. We compare the tokenization performance of three variants: the original LLaMA2 tokenizer, the Chinese-LLaMA2 tokenizer (expanded with an extended Chinese vocabulary), and the clinical-token-augmented tokenizer, and then fine-tune the corresponding large language models on curated medical datasets. The experimental results indicate that the enhanced tokenizer improves encoding and decoding efficiency, extends the model's effective context window, and yields superior performance on downstream medical tasks.
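To illustrate the BPE procedure the vocabulary construction builds on, the following is a minimal, self-contained Python sketch. The toy corpus, word frequencies, and merge budget are invented for illustration; the study itself operates on LLaMA2's SentencePiece tokenizer rather than code like this.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict.

    Simplified, in-memory version of the byte-pair-encoding
    training step: repeatedly fuse the most frequent adjacent
    symbol pair until the merge budget is exhausted.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word already collapsed to a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(_merge(symbols, best)): freq
                 for symbols, freq in vocab.items()}
    return merges

def _merge(symbols, pair):
    """Replace adjacent occurrences of `pair` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols

# With a (hypothetical) corpus where clinical terms are frequent,
# a term such as "hypertension" ends up as a single whole token:
merges = learn_bpe({"hypertension": 50, "hypotension": 30}, 30)
print(encode("hypertension", merges))  # ['hypertension']
```

Fewer tokens per clinical term is precisely what the abstract's efficiency claim rests on: if a term that previously cost many subword tokens now costs one, more clinical content fits into the same fixed-length context window.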

Data availability

The datasets supporting the findings of this study are publicly available on GitHub and Hugging Face. Specifically:

  1. Medical domain pre-training data: SylvanL/Traditional-Chinese-Medicine-Dataset-Pretrain on Hugging Face.
  2. General-domain pre-training data (SkyPile dataset): Skywork/SkyPile-150B on Hugging Face.
  3. Medical QA data: Toyhom/Chinese-medical-dialogue-data on GitHub and michaelwzhu/ShenNong_TCM_Dataset on Hugging Face.
  4. General-purpose GPT-4-generated QA pairs: Instruction-Tuning-with-GPT-4/GPT-4-LLM on GitHub.


Funding

The authors are grateful for the support of the Department of Science and Technology of Zhejiang Province (Grant No. 2024C03270).

Author information

Authors and Affiliations

  1. School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-tech University, Hangzhou, 310000, Zhejiang, China

    Qianqian Li

  2. School of Information Science and Engineering (School of Cyber Science and Technology), Zhejiang Sci-tech University, Hangzhou, 310000, Zhejiang, China

    Jijun Tong

  3. Department of Information Technology, International Institutes of Medicine, the Fourth Affiliated Hospital of School of Medicine and International School of Medicine, Zhejiang University, Yiwu, 322000, Zhejiang, China

    Shanna Liu, Chang Li, Jie Tang & Qingli Zhou


Contributions

Q.L.: Conceptualization, Writing - original draft, Data curation, Formal analysis, Software. J.T.: Formal analysis, Methodology. S.L.: Investigation, Resources. C.L.: Software, Visualization. J.T.: Writing - review & editing, Supervision, Validation. Q.Z.: Conceptualization, Writing - review & editing, Funding acquisition, Project administration, Resources.

Corresponding authors

Correspondence to Jie Tang or Qingli Zhou.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Li, Q., Tong, J., Liu, S. et al. Medical knowledge representation enhancement in large language models through clinical tokens optimization. Sci Rep (2026). https://doi.org/10.1038/s41598-026-37438-6


  • Received: 20 June 2025

  • Accepted: 22 January 2026

  • Published: 29 January 2026

  • DOI: https://doi.org/10.1038/s41598-026-37438-6


Keywords

  • clinical tokens
  • LLMs
  • tokenizer
  • natural language processing (NLP)