Abstract
During the training of medical large language models (LLMs), conventional tokenizers frequently segment domain-specific medical terms into multiple subword tokens, which weakens the recognition and representation of specialized vocabulary. As a consequence, the model struggles to acquire medical domain knowledge effectively during fine-tuning. To address this limitation, the present study introduces “clinical tokens” (medical subword units) by augmenting the vocabulary of the original LLaMA2 tokenizer. The adapted tokenizer keeps medical terms as whole tokens wherever feasible, improving tokenization accuracy and enabling the model to learn and interpret medical knowledge more effectively. For downstream task adaptation, this study employs the Byte Pair Encoding (BPE) algorithm to construct a domain-specific vocabulary and tokenization model that includes these clinical tokens. We compare the tokenization performance of three variants: the original LLaMA2 tokenizer, the Chinese-LLaMA2 tokenizer (which extends the original vocabulary with Chinese tokens), and the clinical token-augmented tokenizer, and then fine-tune the corresponding large language models on curated medical datasets. The experimental results indicate that the enhanced tokenizer improves encoding and decoding efficiency, extends the model’s effective context window, and yields superior performance on downstream medical tasks.
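As a rough illustration of the vocabulary-augmentation step described above (a minimal sketch, not the authors’ released code), the following Python snippet trains a domain BPE model with SentencePiece on a medical corpus and adds its pieces to the LLaMA2 tokenizer as extra clinical tokens. The corpus path, vocabulary size, and model identifier are illustrative assumptions, and the simple `add_tokens` route shown here stands in for merging the SentencePiece vocabularies directly.

```python
# Minimal sketch (assumptions: corpus file, vocab size, model id are illustrative).
import sentencepiece as spm
from transformers import LlamaTokenizer

# 1) Learn a medical-domain BPE vocabulary from plain-text medical data.
spm.SentencePieceTrainer.train(
    input="medical_corpus.txt",   # hypothetical medical pre-training corpus
    model_prefix="clinical_bpe",
    vocab_size=16000,             # illustrative size
    model_type="bpe",
    character_coverage=0.9995,
)

# 2) Load the original LLaMA2 tokenizer and the new domain BPE model.
base_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
sp = spm.SentencePieceProcessor(model_file="clinical_bpe.model")

# 3) Add domain pieces the base tokenizer does not already contain,
#    so frequent medical terms are kept as single tokens.
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
new_tokens = [p for p in new_pieces if p not in base_tok.get_vocab()]
num_added = base_tok.add_tokens(new_tokens)
print(f"added {num_added} clinical tokens")

base_tok.save_pretrained("llama2-clinical-tokenizer")
```

After augmentation, the model’s input embedding matrix must be resized to the new vocabulary size before fine-tuning (for example, `model.resize_token_embeddings(len(base_tok))` in Hugging Face Transformers), and encoding efficiency can be compared across tokenizers simply by counting the tokens each produces for the same medical text.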
Data availability
The datasets supporting the findings of this study are publicly available on GitHub and Hugging Face. Specifically:

1. Medical domain pre-training data: available at SylvanL/Traditional-Chinese-Medicine-Dataset-Pretrain on Hugging Face.
2. General-domain pre-training data (SkyPile dataset): available at Skywork/SkyPile-150B on Hugging Face.
3. Medical QA data: available at Toyhom/Chinese-medical-dialogue-data on GitHub and michaelwzhu/ShenNong_TCM_Dataset on Hugging Face.
4. General-purpose GPT-4-generated QA pairs: available at Instruction-Tuning-with-GPT-4/GPT-4-LLM on GitHub.
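The Hugging Face-hosted corpora above can be loaded with the `datasets` library, as in the sketch below; the split names are assumptions and should be checked against each dataset card, and the GitHub-hosted resources are obtained by cloning their repositories.

```python
# Illustrative sketch: loading the Hugging Face-hosted datasets cited above.
# Split names are assumptions; consult each dataset card for the actual layout.
from datasets import load_dataset

tcm_pretrain = load_dataset("SylvanL/Traditional-Chinese-Medicine-Dataset-Pretrain", split="train")
skypile = load_dataset("Skywork/SkyPile-150B", split="train", streaming=True)  # very large; stream it
shennong_qa = load_dataset("michaelwzhu/ShenNong_TCM_Dataset", split="train")
# Toyhom/Chinese-medical-dialogue-data and Instruction-Tuning-with-GPT-4/GPT-4-LLM
# are hosted on GitHub and can be cloned directly.
```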
Funding
The authors are grateful for the support of the Department of Science and Technology of Zhejiang Province (Grant No. 2024C03270).
Author information
Authors and Affiliations
Contributions
Q.L.: Conceptualization, Writing - original draft, Data curation, Formal analysis, Software. J.T.: Formal analysis, Methodology. S.L.: Investigation, Resources. C.L.: Software, Visualization. J.T.: Writing - review & editing, Supervision, Validation. Q.Z.: Conceptualization, Writing - review & editing, Funding acquisition, Project administration, Resources.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Q., Tong, J., Liu, S. et al. Medical knowledge representation enhancement in large language models through clinical tokens optimization. Sci Rep (2026). https://doi.org/10.1038/s41598-026-37438-6