Abstract
New drug development is a costly and time-consuming undertaking in the pharmaceutical industry. However, the problem of relatively poor-quality, expensive, and delayed regulatory affairs translation, which hinders this undertaking, has long been neglected by the pharmaceutical community. This study designed, for the first time, a tailored and impactful lightweight large language model (LLM), PhT-LM, to improve regulatory affairs translation and reduce translation costs. After web crawling, cleaning, and verifying the bilingual documents from the official websites of competent regulatory authorities in China and international organizations, a translation dataset containing 34,769 bilingual data pairs was established. Next, the open-source Qwen-1_8B-Chat model was chosen as the base model and fine-tuned on the aforementioned translation dataset using the low-rank adaptation technique. Finally, a retrieval-augmented generation technique was applied to further enhance the model’s translation performance. Compared with popular general-purpose large language models, this lightweight model achieved a mean BLEU-4 score of 36.018 and a mean CHRF score of 58.047 based on a self-constructed training corpus, with score improvements ranging from 16% to 65% and a favorable cost-benefit analysis. Further, the model’s excellence was demonstrated by human evaluation, particularly its superiority in English-Chinese translation tasks. Our model offers the worldwide pharmaceutical industry a promising tool for translating regulatory affairs documents with high quality, efficiency, and reduced cost.
Introduction
The global pharmaceutical industry is a rapidly evolving sector shaped by ongoing innovation and discovery. Regulatory affairs, which ensure regulatory compliance, remain one of the most crucial factors driving this industry. Dossier submissions and marketing applications to agencies that oversee drug development, such as the US Food and Drug Administration (FDA), the European Medicines Agency (EMA), and the National Medical Products Administration (NMPA) in China1,2,3,4, are time-consuming. Undoubtedly, a new drug project is, for any pharmaceutical business, a costly program full of risks and challenges5. In essence, even when early-stage research and development of a new drug proceed smoothly, companies must invest a significant amount of time and resources in preparing and translating registration dossiers, owing to differences in languages and in international regulations across countries, which hinder new drug market access6.
Large language models (LLMs) have demonstrated exceptional performance in natural language processing tasks and play a prominent role across multiple critical stages of the pharmaceutical lifecycle. From drug discovery and design to clinical trials and patient management, these models provide robust support for enhancing operational efficiency, reducing costs, and optimizing decision-making processes7,8,9. For instance, SeeTrials10, a tool developed by Intelligent Medical Objects (US), leverages OpenAI’s GPT-4 to efficiently extract safety and efficacy data from clinical trial abstracts, helping researchers and study designers quickly grasp the methodologies and outcomes of analogous studies and thereby supporting decision-making in clinical trials. Beyond trial design, LLM-powered tools like TrialGPT11 utilize end-to-end retrieval, matching, and ranking capabilities to accurately predict patient eligibility while providing explanations. This not only shortens patient screening timelines but also significantly improves the efficiency of matching individuals with appropriate clinical trials.
In new drug application (NDA) submissions, LLMs automate the review of submitted materials to ensure completeness, consistency, and regulatory compliance. By enabling companies to proactively identify and address deficiencies in their submissions, these models streamline the approval workflow and mitigate delays in regulatory review12. In patient care and management, the Foresight model13, trained on UK hospital patient data, exhibits strong potential in disease risk prediction, differential diagnosis, and personalized medication recommendations. LLMs also excel in translation tasks spanning literature and law, providing efficient support for cross-lingual knowledge transfer, legal translation, and case study analysis14,15.
Although LLMs have showcased impressive applications in the pharmaceutical sector, they face notable challenges16,17,18. Specifically, large-parameter LLMs impose substantial demands on computational resources for training and inference—resulting in high costs—while general-purpose LLMs exhibit strong dependence on the quality and quantity of training data. Compounding these issues, inherent biases and errors in LLMs, alongside concerns about legal compliance and data privacy related to their training corpora, further hinder their reliable application in this field. To address these limitations, several key technical advancements have emerged. Retrieval-Augmented Generation (RAG)—a technique first proposed by Facebook AI Research (FAIR) in 2020—has gained widespread adoption in the LLM era19. This approach integrates vector databases as external knowledge repositories, enabling dynamic retrieval of relevant information during text generation, which effectively mitigates errors stemming from knowledge gaps. In addition to RAG, two foundational capabilities have further enhanced LLM utility: in-context learning and prompt engineering. Tom B. Brown et al. first identified the in-context learning ability of GPT-series models, demonstrating that these models can guide generation outputs via contextual cues without additional training20. Building on this, Pengfei Liu et al. introduced prompt engineering: a methodology that steers LLM behavior toward specific objectives by designing tailored prompts—without updating model weights—thereby alleviating domain-specific biases and errors21.
However, current similarity-based retrieval mechanisms have limitations, particularly in specialized industry translation tasks such as pharmaceutical regulatory documentation between Chinese and English, which is highly confidential, time-critical, and expensive, and must strictly adhere to standard terminology22,23,24. For example, an NDA submitted to the US FDA generally consists of over 200 documents spanning as many as 60,000-100,000 pages. High-quality translation with quick turnaround streamlines subsequent review and approval25. The translation issues regarding these dossiers in China can be broadly categorized as follows: (1) Slow speed: completing more than 200 registration documents typically takes at least 1-2 months. (2) Poor quality: multidisciplinary information covering chemistry or biology, medicine, pharmacy, and production is written and compiled by multiple departments within a company. All of the original data, charts, figures, protocols, procedures, and records should be translated completely into clear and consistent language; at present, however, the translations fail to meet the requirements of regulatory authorities. (3) High cost: translation costs can reach hundreds of thousands of US dollars, adding a heavy burden to the high-risk journey of a new drug. Lastly, the dearth of professional translators has long been overlooked by the industry26. These tasks demand not only semantic consistency but also strict adherence to precise terminology translation. Vector databases rely primarily on semantic similarity during retrieval, often yielding semantically matched content while failing to guarantee accurate identification and translation of domain-specific terms. Consequently, the key to wider implementation of LLMs is to develop efficient and cost-effective applications for specific vertical domains27,28,29.
Following the wave of LLMs, lightweight LLMs have surged in domain-specific research to meet stakeholders' demands for speed, easy distribution, and security30,31,32. Such a model can be specially trained to target pharmaceutical regulatory affairs. In particular, this study created PhT-LM to improve the quality and efficiency, and to reduce the cost, of bilingual translation in pharmaceutical regulatory affairs.
Materials and methods
The original materials employed to establish a corpus were obtained from the official website of the NMPA, the regulatory agency in China responsible for the review and approval of drugs, as well as bilingual pharmaceutical textbooks used in universities and colleges and other officially published guidelines in both Chinese and English. Following the cleansing and reviewing processes, an English-Chinese bilingual pharmaceutical text translation dataset was generated and stored in a knowledge base comprising two components: a document database and a vector database. Subsequently, a general-purpose LLM, the Qwen-1_8B-Chat model, was optimized on the aforementioned translation dataset through the Low-Rank Adaptation (LoRA) technique. Next, the Retrieval-Augmented Generation (RAG) technique was employed: the regulatory affairs document to be translated was fed into the fine-tuned model together with similar translation examples retrieved from the knowledge base, with the objective of further improving the quality of the model’s translation33. The complete workflow used to create the lightweight LLM, PhT-LM, is illustrated in Fig. 1.
The overall workflow of PhT-LM construction.
Data
Data collection
In order to guarantee the authority and comprehensiveness of the model’s data sources, the data for this study were primarily derived from the following sources: (1) Chinese-English bilingual texts of laws and regulations regarding drug registration published on the NMPA website (https://www.nmpa.gov.cn/) paralleled with those on its English website (https://english.nmpa.gov.cn/); (2) Chinese-English bilingual terms of Pharmacy issued by the China National Committee for Terminology in Science and Technology (CNCTST) in 2014; and (3) relevant teaching materials in the field of pharmacy.
In data collection, automated scripts written with the Selenium library in Python were used to crawl relevant text from the NMPA websites, including titles, document contents, and attachments. Data such as textbooks related to the international registration of medicines and other sources were used directly in their electronic versions. The final corpus was compiled from 1,506 documents.
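The study drove the live NMPA pages with Selenium; the downstream step of pulling titles and paragraph text out of the fetched HTML can be sketched with the standard-library `html.parser`. The tag names and sample markup below are illustrative only, not the actual NMPA page structure:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collect the page title and paragraph text from crawled HTML."""

    def __init__(self):
        super().__init__()
        self._tag = None        # tag currently being read, if any
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "title":
            self.title += text
        elif self._tag == "p":
            self.paragraphs.append(text)

# Hypothetical page; a real crawler would obtain this via Selenium's page_source.
page = ("<html><title>Drug Registration Notice</title>"
        "<body><p>Article 1 ...</p><p>Article 2 ...</p></body></html>")
parser = ArticleExtractor()
parser.feed(page)
print(parser.title)             # Drug Registration Notice
print(len(parser.paragraphs))   # 2
```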
Data pre-processing
Following the collection of documents, a series of rigorous cleaning procedures was carried out to ensure that the resulting dataset thoroughly covers the terminology and key concepts of regulatory affairs. The data cleaning procedures included document screening, pairing Chinese and English texts, data deduplication, and content validation.
The raw data were pre-processed as follows: Step 1: A manual screening method was used to pair source documents by content, ensuring that each Chinese document aligned with its English version. Step 2: The Docx library of Python was employed to read the paired Chinese and English documents, and each paragraph was mapped to its counterpart to construct the Chinese-English translation dataset, which was stored as an Excel file. Step 3: The dataset was de-duplicated to eliminate redundant data pairs and prevent identical textual content from recurring across documents originating from disparate sources. Step 4: A manual review of the format and content of the data was conducted, including, but not limited to, examination of blanks, garbled characters, and the Chinese and English translations themselves, to confirm the accuracy and reliability of the translated data. Step 5: To prevent potential order effects during subsequent model fine-tuning, the dataset was randomized, eliminating this potential source of bias. Finally, a dataset comprising 34,769 Chinese-English pairs pertaining to regulatory affairs, with a total of 912,826 tokens, was created.
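Steps 3 and 5, deduplication and randomization, can be sketched as follows; the helper name and fixed seed are illustrative choices, not taken from the study:

```python
import random

def build_dataset(pairs, seed=42):
    """Deduplicate Chinese-English pairs, then shuffle to remove order effects."""
    seen, unique = set(), []
    for zh, en in pairs:
        key = (zh.strip(), en.strip())
        if key not in seen:          # drop exact duplicates from different sources
            seen.add(key)
            unique.append(key)
    random.Random(seed).shuffle(unique)  # fixed seed keeps the shuffle reproducible
    return unique

pairs = [
    ("药品注册", "Drug registration"),
    ("临床试验", "Clinical trial"),
    ("药品注册", "Drug registration"),  # duplicate occurring in a second document
]
dataset = build_dataset(pairs)
print(len(dataset))  # 2
```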
Retrieval module
The overall framework of the retrieval module is depicted in Fig. 2. In the knowledge base construction stage, the preprocessed Chinese and English translation data were imported into the document database field by field. Concurrently, the text embedding model encoded the texts into vectors, which were imported into the vector database. Both the document database and the vector database were stored in Elasticsearch (hereafter referred to as ES), as two independent indexes in the ES cluster34. During retrieval, the input text was searched against the document database and the vector database separately, and the query results of both databases were obtained. The results of the two routes were then proportionally fused into the final query result.
Framework of the retrieval module.
ES retrieval consists of three main components: text processing, query statement generation, and matching. The input text is first pre-processed to generate query statements by segmenting it, removing stop words, setting word weights, and so on; the query statements are then used for multi-field matching. Text segmentation uses the IK analyzer, which supports two segmentation modes: fine-grained and smart. To optimize retrieval performance, the fine-grained mode is used at indexing time to improve the hit rate by increasing the diversity of segmentations, while the smart mode is used at query time to reduce unnecessary segments and improve retrieval accuracy34.
Vector database retrieval, or semantic retrieval, is a search method based on natural language processing and semantic analysis that determines the relevance of documents to a query by analyzing their semantic content. The effectiveness of semantic retrieval hinges on the encoding capability of the text embedding model. MTEB (Massive Text Embedding Benchmark, https://huggingface.co/spaces/mteb/leaderboard) is a benchmark spanning 8 embedding tasks covering a total of 56 datasets and 112 languages. Considering model size, inference time, and other practical factors, the mxbai-embed-large-v1 model, ranked No. 4 on the English leaderboard, and the stella-large-zh-v3-1792d model, ranked No. 3 on the Chinese leaderboard (as of March 15, 2024), were chosen as the embedding models35.
This study employed proportional fusion as the foundation of its retrieval strategy, viz. combining the multiplexed recall results from ES retrieval and vector retrieval in equal proportion, enabling the retrieval of translation examples for source texts along both the keyword-matching and semantic-alignment dimensions. The retrieval results followed a straightforward rule: for any specified number n of results, the top n/2 highest-ranked results are extracted equally from the ES retrieval and the vector database retrieval as the final query results. These retrieved examples synergistically combine lexical-matching precision with semantic relevance, capturing direct lexical correspondences at the vocabulary level while also leveraging contextual information to enhance translation naturalness, fluency, and contextual adaptability.
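A minimal sketch of the proportional fusion rule, assuming each route returns a ranked list of example identifiers. The duplicate-skipping and top-up behavior here are our own assumptions; the paper specifies only the top-n/2 extraction from each route:

```python
def fuse_results(es_hits, vec_hits, n=4):
    """Proportional fusion: take the top n/2 from each route, skipping duplicates."""
    half = n // 2
    fused = []
    for hit in es_hits[:half] + vec_hits[:half]:
        if hit not in fused:
            fused.append(hit)
    # If deduplication left gaps, top up from the remaining candidates in rank order
    # (an assumption, not stated in the paper).
    for hit in es_hits[half:] + vec_hits[half:]:
        if len(fused) >= n:
            break
        if hit not in fused:
            fused.append(hit)
    return fused[:n]

es_hits = ["ex1", "ex2", "ex3"]
vec_hits = ["ex2", "ex4", "ex5"]  # ex2 recalled by both routes
print(fuse_results(es_hits, vec_hits, n=4))  # ['ex1', 'ex2', 'ex4', 'ex3']
```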
Fine-tuned model
The selection of basic model
Table 1 lists commonly used general-purpose lightweight LLMs, such as RedPajama-INCITE-Chat-3B, Firefly-Bloom-1B4, Qwen-1_8B, and OpenBuddy-3B. Among them, Qwen-1_8B-Chat, developed by Aliyun, has 1.8 billion parameters, is based on the Transformer architecture, and was pre-trained on over 2.2 trillion tokens of data. On the English downstream evaluation task MMLU (Massive Multitask Language Understanding), the Qwen-1_8B-Chat model outperforms existing open-source models of similar size, such as OpenLLaMA-Chinese-3B and OpenBuddy-3B. On the C-Eval validation set, it surpasses chat models such as ChatGLM2-6B and LLaMA2-7B. Finally, all its weights are fully open for academic research36. Hence, this study took Qwen-1_8B-Chat as the base model.
Experiments
(1) Constructing training and test datasets
First, the texts to be translated were extracted from the constructed translation dataset. These texts were fed into the retrieval module to identify the translation examples most similar to each text. This study then constructed prompt templates integrating the translation examples and the text to be translated; the template styles are shown in Tables 2 and 3, respectively, aiming to train the model to extract feature information from contextual translation examples and improve translation quality. A total of 34,769 data items with contextualized translation examples were constructed, of which 32,769 were assigned to the training set and the remaining 2,000 to the test set.
Since short texts fail to fully demonstrate the superiority of the translation model, texts to be translated whose length after word splitting was less than 10 tokens were removed programmatically. Finally, an English-to-Chinese test dataset containing 658 items was constructed in parallel with a Chinese-to-English test dataset of 628 items.
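The length filter can be sketched as follows; plain whitespace splitting stands in here for the word segmentation actually used in the study, and the 10-token threshold matches the text above:

```python
def filter_short(pairs, min_tokens=10):
    """Drop pairs whose source text has fewer than min_tokens tokens after splitting.
    Whitespace splitting is a stand-in for the study's tokenizers."""
    return [(src, tgt) for src, tgt in pairs if len(src.split()) >= min_tokens]

pairs = [
    ("The applicant shall submit the complete registration dossier to the agency", "..."),
    ("See Annex 1", "..."),  # too short to exercise the translation model
]
kept = filter_short(pairs)
print(len(kept))  # 1
```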
(2) Fine-tuning
This study used the LoRA technique to fine-tune the selected LLM. This technique adapts the model to a specific task by introducing a small number of trainable parameters while leaving most of the model’s parameters unchanged37. All fine-tuning was performed on an Ubuntu 22.04 system equipped with an 80 GB A100 GPU to ensure a stable experimental environment. The hyperparameters recommended for Qwen-1_8B-Chat were employed: cutoff_len = 2048, batch_size = 4, lr = 5e-5, num_train_epochs = 2, lora_rank = 8; their effectiveness has been demonstrated in multiple studies38,39,40,41. The fine-tuned model was named PhT-LM, reflecting its target domain.
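To see why rank-8 LoRA trains so few weights: for a d_out x d_in weight matrix, LoRA freezes the matrix and trains only the low-rank factors B (d_out x r) and A (r x d_in). A quick calculation for a hypothetical 2048-dimensional projection (the dimension is illustrative, not the actual Qwen-1_8B layer size):

```python
def lora_params(d_out, d_in, r=8):
    """Trainable parameters a rank-r LoRA adapter adds to one d_out x d_in weight:
    B (d_out x r) plus A (r x d_in), while d_out * d_in weights stay frozen."""
    return r * (d_out + d_in)

d = 2048                       # hypothetical projection size
adapter = lora_params(d, d, r=8)
full = d * d                   # frozen parameters in the original weight
print(adapter, full, f"{adapter / full:.2%}")  # 32768 4194304 0.78%
```

Under this assumption, the rank-8 adapter trains well under 1% of the parameters of each adapted matrix, which is what makes fine-tuning feasible on a single A100.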
Retrieval-augmented generation
Retrieval-Augmented Generation (RAG) integrates the advantages of information retrieval systems and LLMs, aiming to enhance the quality and relevance of generated text33. This study adopts the RAG paradigm, with the specific process outlined in Fig. 3: First, the retrieval framework described in the Retrieval module section is employed, combining keyword matching and semantic matching to search for potentially useful examples. Next, the retrieved translation examples are integrated according to the translation templates shown in Tables 2 and 3. Finally, the PhT-LM model fine-tuned as described above serves as the generation model, producing more accurate and contextually appropriate translations from the constructed prompt. This method effectively integrates external translation databases and terminological knowledge into the generation process of LLMs, notably enhancing their performance in domain-specific translation tasks.
The algorithm process of RAG.
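The prompt-assembly step of the RAG pipeline can be sketched as below; the template wording is a hypothetical stand-in for the actual templates in Tables 2 and 3:

```python
def build_prompt(examples, source_text, direction="en2zh"):
    """Assemble a translation prompt from retrieved examples plus the source text.
    The wording is illustrative, not the study's actual template."""
    task = "Translate the following English text into Chinese." if direction == "en2zh" \
        else "Translate the following Chinese text into English."
    lines = [task, "Reference examples:"]
    for i, (src, tgt) in enumerate(examples, 1):
        lines.append(f"{i}. {src} => {tgt}")
    lines.append(f"Text to translate: {source_text}")
    return "\n".join(lines)

prompt = build_prompt(
    [("marketing authorization", "上市许可")],       # retrieved from the knowledge base
    "The marketing authorization holder shall ...",  # document to translate
)
print(prompt)
```

The resulting string is what the fine-tuned generation model receives, so terminology from the knowledge base is visible in context at inference time.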
Evaluation metrics
Bilingual Evaluation Understudy (BLEU) is a commonly used metric for evaluating machine translation42. This study adopts Jieba (a popular Chinese text segmentation tool developed by Sun Junyi from Baidu) for Chinese word segmentation when calculating BLEU scores, and uses the sentence_bleu function in the NLTK library. The weights parameter is set to [(1,), (1/2, 1/2), (1/3, 1/3, 1/3), (1/4, 1/4, 1/4, 1/4)], i.e., the n-gram weights are distributed equally when calculating BLEU at the 1-gram, 2-gram, 3-gram, and 4-gram levels, respectively. To handle the case where the BLEU score falls to zero when an n-gram in the machine translation does not appear in the reference, SmoothingFunction().method4 in the NLTK library is employed as the smoothing technique.
Character n-gram F-score (CHRF) is another metric for evaluating machine translation quality42. It assesses quality by counting n-gram matches between the reference translation and the candidate translation. In this study, the sentence_chrf function in the NLTK library was used to compute the similarity between the machine translation and the reference.
The translation quality of regulatory affairs dossiers depends not only on lexical and grammatical correctness but also on accurate character-level matching, such as of terminology. This study therefore combines the BLEU and CHRF criteria to provide a more comprehensive, objective, and precise evaluation: BLEU focuses on n-gram matching at the lexical level, and CHRF on n-gram matching at the character level. Together, they measure the consistency of translations at all levels, favoring more accurate, consistent, and professional translations.
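Under the settings described above, the two metrics can be computed with NLTK as follows; the English whitespace-tokenized example stands in for Jieba-segmented Chinese output:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.chrf_score import sentence_chrf

reference = "the applicant shall submit the dossier".split()
candidate = "the applicant shall submit the dossier".split()  # illustrative exact match

smooth = SmoothingFunction().method4
weights = [(1,), (1/2, 1/2), (1/3,) * 3, (1/4,) * 4]  # BLEU-1 through BLEU-4
bleu = [sentence_bleu([reference], candidate, weights=w, smoothing_function=smooth)
        for w in weights]
chrf = sentence_chrf(reference, candidate)
print([round(b, 3) for b in bleu], round(chrf, 3))
```

For an exact match all four BLEU scores and the CHRF score equal 1.0; method4 smoothing only intervenes when an n-gram precision would otherwise be zero.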
Code availability
The retrieval module source code, and fine-tuning model built in this study are publicly available on GitHub (https://github.com/chent-1928/PhT_LM).
Results
The study constructed a new corpus for regulatory affairs in the pharmaceutical industry, comprising 1,506 documents officially published in China in both English and Chinese. After web crawling, cleaning, and verifying the bilingual documents from the official websites of competent regulatory authorities in China and international organizations, a translation dataset containing 34,769 bilingual pairs was constructed. Based on this corpus, we designed a tailored and impactful lightweight large language model (LLM), PhT-LM. The open-source Qwen-1_8B-Chat model was chosen as the base model and fine-tuned on the above translation dataset using the low-rank adaptation technique, and the retrieval-augmented generation technique was then used to further polish the model’s translation performance. Compared with popular general-purpose large language models, our model achieved a mean BLEU-4 score of 36.018 and a mean CHRF score of 58.047 based on a self-constructed training corpus, with score improvements ranging from 16% to 65%. In addition, PhT-LM speeds up regulatory affairs translation compared with conventional translation by skilled translators, and its excellent performance also cuts translation fees.
Translation performance test of PhT-LM
This study compares PhT-LM with general-purpose LLMs including Qwen-1_8B-Chat, ChatGLM3-6B, Chinese-Alpaca-2-7B, and GPT-3.5-turbo in terms of translation performance. As Table 4(a) shows, on the English-Chinese translation test dataset, PhT-LM achieves the highest scores on all five evaluation indices, viz. BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CHRF, with scores of 69.053, 56.726, 47.744, 40.741, and 51.115, respectively, clearly surpassing the other four models. Similarly, for the Chinese-to-English translation task, Table 4(b) shows PhT-LM’s superior performance, with score improvements ranging from 16% to 65%.
In addition, PhT-LM is compared with mainstream translation services such as Google Translate and DeepL. It again ranks first on all five evaluation metrics for both English-Chinese and Chinese-English translation tasks, further verifying its significant advantage in translation performance.
In this study, the retrieval module was also added to the Qwen-1_8B-Chat, ChatGLM3-6B, and Chinese-Alpaca-2-7B models; the results are shown in Table 5. Comparing the scores in Tables 4 and 5, adding the retrieval module greatly improved all index scores of the three models, by 25% to 37%.
Impact of the number of translation examples on the translation quality of PhT-LM
In order to reveal the importance of translation examples for the model’s in-context learning, and to explore how their number affects the fine-tuned model’s performance, this study varies the number of query results returned from the retrieval module. Test datasets with 1, 2, 4, 8, and 16 translation examples were constructed separately; the dataset with 4 translation examples performed best.
Fig. 4 shows the impact of different numbers of contextual translation examples on the model’s BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CHRF metrics. The overall trends of all metrics are similar in both the English-to-Chinese and Chinese-to-English tasks: as the number of translation examples rises from 1 to 4, the BLEU-1 to BLEU-4 and CHRF scores trend upward and then stabilize. When more translation examples are added, e.g., up to 16, the BLEU and CHRF scores decline. The figure shows that 4 examples is the turning point for both English-to-Chinese and Chinese-to-English translation quality; unlike in generic LLMs, the number of contextual translation examples thus behaves differently in a domain-specific lightweight model.
The impact of different numbers of contextual translation examples on the model.
Ablation experiments
This study examines the effectiveness of the proportional fusion retrieval strategy through ablation experiments. Two test groups were constructed, covering the Chinese-to-English and English-to-Chinese tasks separately. Each test group contains 2 test files, and the contextual translation examples for the texts to be translated were generated by ES retrieval alone and by vector database retrieval alone, respectively.
Table 6 shows a notable improvement in translation quality regardless of whether ES retrieval or vector database retrieval is used. On the English-to-Chinese test dataset, the model supported by ES retrieval alone scored 68.349, 56.031, 47.118, 40.086, and 50.521 on BLEU-1 to BLEU-4 and CHRF, respectively, surpassing on all five indices the model supported by vector database retrieval alone (67.661, 54.969, 46.051, 31.199, and 49.684). Both were lower than the scores of 69.053, 56.726, 47.744, 40.471, and 51.115 achieved by the model using the proportional fusion retrieval strategy.
Similarly, in the Chinese-to-English task, the five metrics of the model using ES retrieval alone (56.246, 44.749, 36.711, 30.881, and 64.436) were higher than those of the model supported by vector database retrieval (56.024, 43.883, 35.744, 29.290, and 63.310), and lower than those of the model with the proportional fusion retrieval strategy (57.212, 45.475, 37.246, 31.295, and 64.978). The proportional fusion strategy, in which ES and vector database retrieval complement each other’s feature information, thus further improves translation quality and is a promising approach for this domain-specific lightweight model.
Translation efficiency and cost
Table 7 lists PhT-LM’s translation speed on the test dataset under two retrieval strategies, ES only and proportional fusion. ES retrieval responds much faster than proportional fusion per 1,000 words: for English-to-Chinese translation, the ES retrieval strategy takes 30.49 seconds versus 56.28 seconds with proportional fusion. For Chinese-to-English translation the results are even more striking: the ES retrieval and proportional fusion strategies take merely 19.40 seconds and 38.11 seconds, respectively, roughly two-thirds of the time required for the English-to-Chinese tasks.
PhT-LM represents a cost-effective option for businesses handling translation tasks. The model can be deployed on most smartphones or personal computers, imposing no extra hardware requirements on companies performing translation work. In contrast, regulatory affairs translation currently requires translators with a high level of specialization and professional training, and the total cost of engaging translators or translation companies for such work can reach several hundred thousand dollars.
In general, hundreds of thousands of pages in an NDA submission must be submitted to the US FDA. Taking 100,000 pages as an example, the whole translation process can be completed within around 50 h, a remarkable speed compared with conventional human translation.
Use-case validations
To evaluate the generalization of the model in real-world application scenarios, this study selected three regulatory guidance documents issued by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH), the EMA, and the FDA as external test cases, namely: “ICH S1A: Guideline on the need for carcinogenicity studies of pharmaceuticals” issued by the ICH, the “Guideline on the clinical investigation of medicines for the treatment of Alzheimer’s Disease” issued by the EMA, and the Guidance for Industry “Allergic Rhinitis: Developing Drug Products for Treatment” issued by the US FDA. None of these documents were included in the model’s knowledge base or training corpus, and they come from different regulatory systems, covering highly specific topics and linguistic complexity, making them suitable benchmarks for assessing the model’s generalization.
The final translation quality evaluation results, shown in Table 8, demonstrate that the PhT-LM model achieved superior overall performance. In both English-to-Chinese and Chinese-to-English tasks, the model achieved the highest scores on the BLEU-3 and BLEU-4 metrics, while its CHRF scores are also close to those of the top-performing GPT-3.5-turbo.
Notably, BLEU-1 and BLEU-2 primarily measure unigram and bigram matching accuracy. GPT-3.5-turbo, an ultra-large-scale language model trained on massive and diverse corpora covering extensive domains and styles, exhibits greater lexical flexibility. PhT-LM’s smaller knowledge base and training data, however, limit its exposure to diverse linguistic patterns during training, which partially lowers its low-order n-gram matching rates and leads to slightly lower BLEU-1 and BLEU-2 scores than GPT-3.5-turbo.
To demonstrate the translation effect more intuitively, Tables 9 and 10 provide paragraph translation examples. The PhT-LM model correctly translates the word consistency as 一致性 rather than 共识 (“consensus”), and renders the phrase worldwide regulatory assessments as 全球监管评估 rather than 全球注册评价监管机构. The translation of both general words and technical terms confirms the model’s bilingual capability: it not only completes the translation accurately but also preserves the meaning of the original text and the accuracy of professional terms.
Domain-expert assessment
To comprehensively assess the model’s applicability in real-world pharmaceutical regulatory contexts, this study further conducted a domain-expert assessment, an in-depth examination of the model’s translation quality from a professional perspective focusing on three key dimensions: semantic accuracy, terminology consistency, and linguistic fluency.
The study design is as follows: 100 English-to-Chinese and 100 Chinese-to-English bilingual sentence pairs were randomly selected from the constructed test set, and manual scoring was performed by professionals with 3–5 years’ experience in translating dossiers of regulatory affairs. A 1–3 point scale was adopted for scoring (with higher scores indicating better performance), covering the following three dimensions:
- Accuracy: whether the target text accurately conveys the semantic meaning of the source text;
- Consistency: whether professional terms and expressions remain unified and conform to the conventional norms of regulatory documents;
- Fluency: whether the target text is natural and coherent.
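As a rough illustration of the scoring protocol (not the authors' actual tooling), per-dimension totals can be aggregated as follows; with 100 sentence pairs rated 1–3, each dimension has a maximum total of 300:

```python
def aggregate(scores):
    """Sum expert ratings per dimension.

    scores: one dict per evaluated sentence pair, each mapping the three
    dimensions to a 1-3 rating. Returns per-dimension totals (max = 3 * n)."""
    totals = {"accuracy": 0, "consistency": 0, "fluency": 0}
    for s in scores:
        for dim in totals:
            assert 1 <= s[dim] <= 3, "ratings must be on the 1-3 scale"
            totals[dim] += s[dim]
    return totals
```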
Notably, the models evaluated were limited to those that demonstrated excellent performance on the BLEU and CHRF metrics, namely GPT-3.5-turbo (representing general-purpose large models), DeepL (representing mainstream commercial translation services), and the PhT-LM proposed in this study.
In the English-to-Chinese task, the results were consistent with the automated metric evaluation (Table 11): PhT-LM demonstrated notable advantages, scoring 242, 263, and 279 in accuracy, consistency, and fluency, respectively, significantly outperforming GPT-3.5-turbo and DeepL in all three dimensions. This fully reflects its professional strengths in terminology processing and contextual adaptation within the regulatory affairs domain.
In the Chinese-to-English task, PhT-LM performed on par with GPT-3.5-turbo and DeepL in consistency, while it was slightly inferior to the latter two in accuracy (235) and fluency (278). This may be attributed to the complex implicit semantic structures and highly concise sentence patterns of Chinese regulatory texts, which place higher demands on the model’s capabilities in semantic decoding and English reconstruction. General-purpose large models, in contrast, still hold advantages in the flexibility and naturalness of language generation.
Nevertheless, PhT-LM retained overall competitiveness, with particularly outstanding performance in semantic fidelity and terminology standardization under professional contexts. This demonstrates its practical value and irreplaceability in translation tasks within the pharmaceutical regulatory field.
Discussion
Increasing requirements for the quality of document translation in the field of pharmaceutical regulation have been acknowledged by the industry43,44. Particularly in the preparation of regulatory submission materials, high-quality translation has become one of the key factors ensuring the smooth progress of approval processes.
The linguistic review process of product information in the centralised procedure, published by the EMA, clearly states that low-quality translations may cause delays when application materials are submitted to the European Commission45,46. Similarly, the FDA has issued multiple guidance documents to standardize translation practices; its Guidance for Industry, Translation of GLP Study Reports: Questions and Answers, requires that translators possess relevant educational backgrounds and experience in medical document translation, and that written procedures be established to ensure translation quality47.
Data quality requirements
Research has shown that data quality is critical when building LLMs for these vertical industries48. Whether an LLM with more than 7 billion parameters or a lightweight language model with only 1.8 billion, both possess the ability of in-context learning. Therefore, great care was taken in retrieving, crawling, cleaning, and verifying the source data. Because versatile LLMs draw on wide-ranging training data, their primary goal is to generalize across a variety of tasks and scenarios, which is quite different from that of a domain-specific LLM49. As a result, their applications in vertical domains lack accuracy, authority, and coherence, especially in the highly regulated pharmaceutical sector. PhT-LM, by contrast, is built on authoritative, accurate, and reliable data, and it excels in translation quality through retrieval enhancement. Empirical experiments on real-world texts confirm the effectiveness of our model.
Selection of retrieval strategies
It is worth noting that although the retrieval module using the proportional fusion strategy yields the greatest improvement in translation quality, this improvement comes at increased cost. In particular, vector-database retrieval requires a text-embedding model to encode the sentences to be translated, a process that consumes GPU resources and raises inference cost. Therefore, in resource-constrained settings, such as low-end mobile phones or PCs with limited video memory, the ES retrieval strategy is the more efficient way to support text translation, and this route is more applicable to most small and medium-sized businesses. Moreover, the ES retrieval strategy offers strong translation support without increasing the computational burden.
In fact, the model using ES retrieval alone is noninferior in translation quality to the retrieval module using the proportional fusion strategy. Compared with GPT-3.5-turbo, PhT-LM with ES retrieval improves performance by 14% to 31%. Therefore, considering the cost of deploying models in pharmaceutical industrial settings, the ES retrieval strategy, which requires only 4 GB of video memory for inference, is highly recommended as an effective way to balance performance and resource consumption.
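A sketch of why the ES route is cheap: Elasticsearch's default ranking function is BM25, a purely lexical score that requires no embedding model and no GPU. The following self-contained approximation (parameter values k1 and b are Elasticsearch's defaults; the corpus and query are illustrative) scores stored bilingual examples against a sentence to be translated:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simplified BM25,
    the lexical ranking used by Elasticsearch's default similarity.
    Whitespace tokenization only; returns one score per document."""
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)  # avg doc length
    df = Counter()                                           # document freq.
    for d in tokenized:
        df.update(set(d))
    N = len(docs)
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

The whole computation is integer counting and a few logarithms per query term, which is why it fits on commodity hardware, in contrast to vector retrieval, where every incoming sentence must first pass through a GPU-hosted embedding model.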
Currently, generic LLMs fail to meet the accuracy demands of vertical-domain translation, especially for regulatory affairs documents, whose machine translations are often criticized for poor professionalism, consistency, and accuracy50,51. This study suggests that an easily deployed lightweight LLM such as PhT-LM is a practical and economical option for vertical-domain translators or business entities, who can conveniently establish their own models with an appropriate number of contextual examples after validation52.
Performance differences between English-to-Chinese and Chinese-to-English translation
English is a global lingua franca, and a large share of open-source datasets are presented in English53. As a result, generic LLMs are exposed to vast amounts of English data during pre-training, allowing them to demonstrate superior English comprehension54. Similarly, the PhT-LM model performs better on the English-to-Chinese translation task than on the Chinese-to-English task. Interestingly, the model’s CHRF scores on the Chinese-to-English task are higher than those on the English-to-Chinese task. This stems from the fact that CHRF mainly measures character-level overlap with the reference translation. Given the inherent differences between Chinese and English, the model is more likely to generate character sequences similar to the reference in the Chinese-to-English direction, thus achieving a higher CHRF score. In follow-up studies, the pharmaceutical-domain training data can be further expanded to improve the model’s performance on the Chinese-to-English task.
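The character-level nature of CHRF can be made concrete with a simplified character n-gram F-score. The real chrF metric additionally averages over several n-gram orders and handles whitespace specially; this sketch keeps only the core computation:

```python
from collections import Counter

def char_ngram_fscore(hyp, ref, n=2, beta=2.0):
    """Simplified single-order chrF component: F-score over character
    n-grams, with recall weighted beta times as heavily as precision."""
    hyp_ngr = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    ref_ngr = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((hyp_ngr & ref_ngr).values())   # clipped matches
    prec = overlap / max(sum(hyp_ngr.values()), 1)
    rec = overlap / max(sum(ref_ngr.values()), 1)
    if prec + rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
```

Because the metric works on characters rather than words, a near-miss such as "regulation" against the reference "regulatory" still earns substantial credit for the shared stem, whereas a word-level match would score zero; this partial-credit behavior contributes to the direction-dependent CHRF scores noted above.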
Cost-benefit analysis
As Table 12 shows, the cost of regulatory affairs translation generally depends on text complexity, the language pair, and task urgency. Human translators must possess specialized knowledge and language proficiency given the technical nature of the content, and urgent translation services typically cost 25% to 50% more. According to a professional translation company that has operated internationally for 20 years, the price for English to Chinese is US$0.10 per English word, and vice versa. Translating an NDA, for example, can incur total translation costs of several hundred thousand dollars over several months.
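The arithmetic behind these figures can be sketched with a simple estimator; the per-word rate and the 25–50% urgency surcharge follow the figures quoted above, while the example word count is purely illustrative, since real dossiers vary widely in size:

```python
def human_translation_cost(word_count, rate_per_word=0.10, urgency_surcharge=0.0):
    """Estimated human translation fee in US dollars.

    rate_per_word: $0.10/word as quoted in the text.
    urgency_surcharge: 0.25-0.50 for rush jobs per the text; 0 otherwise."""
    return word_count * rate_per_word * (1 + urgency_surcharge)

# A hypothetical 3-million-word dossier at the standard rate:
standard = human_translation_cost(3_000_000)           # $300,000
rush = human_translation_cost(3_000_000, urgency_surcharge=0.5)  # $450,000
```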
Furthermore, DeepL’s pricing structure imposes a significant financial burden on enterprise users. For instance, its Individual Plan costs $8.74 per month, while its Team and Business Plans are priced at $28.74 and $57.49 per user per month, respectively. As enterprise translation needs grow, these costs increase proportionally.
In contrast, PhT-LM is fully open-source and can be deployed locally with just 4GB of VRAM, drastically reducing both initial investment and long-term operational costs for enterprises. While the model may not match the speed of commercial solutions, it excels in translation accuracy and domain-specific expertise. For large-scale translation tasks, PhT-LM outperforms human translators in response speed and scalability, making it especially suitable for high-frequency, standardized translation scenarios.
Although human translation remains indispensable for context comprehension, cultural adaptability, and professional precision, PhT-LM has demonstrated superior translation quality in the domain of pharmaceutical regulatory documents. Its ability to deliver accurate, contextually relevant translations makes it an attractive alternative, particularly when high-quality translations are required at a lower cost.
With its advantages in cost control, translation efficiency, and information security, PhT-LM stands out as a highly efficient and cost-effective solution for pharmaceutical regulatory document translation, offering a compelling alternative to both human and commercial translation services.
Prospective analysis
To further enhance PhT-LM’s adaptability in real-world scenarios, future research can pursue in-depth improvements in two directions: data expansion and model optimization. At the data level, PhT-LM is currently trained and evaluated on Chinese-English bidirectional translation tasks; given high-quality raw data, it could be extended to additional languages, building a multilingual translation platform for international drug registration and compliance review. At the model level, more advanced lightweight large language models (LLMs) can be adopted to further optimize model size, enabling efficient operation on edge devices and meeting the deployment needs of pharmaceutical enterprises across different IT environments.
Furthermore, PhT-LM can in the future be integrated with existing pharmaceutical regulatory systems, such as eCTD submission platforms and GMP document management systems. Such integration would enable automated and intelligent scheduling of translation tasks, and, combined with technologies such as information extraction and entity recognition, could support the development of auxiliary functions55. In the long term, PhT-LM is expected to evolve into a professional intelligent agent for the pharmaceutical regulatory field that also supports compliance writing. For instance, based on existing templates or structured content, it could provide terminology-consistent, format-standardized language support in line with regulatory guidelines, assisting authors in drafting initial versions of application materials more efficiently56.
In brief, PhT-LM holds significant advantages in current pharmaceutical regulatory translation scenarios. Through continuous optimization and expansion in the aforementioned forward-looking directions, it is expected to deliver broader application value in the intelligent processing of pharmaceutical regulatory documents, bringing greater efficiency and consistency to compliance processes in the pharmaceutical industry.
Potential applications in other domains
The PhT-LM framework proposed in this study combines a lightweight model, high-precision retrieval, and a domain knowledge base, which broadens its potential applications. Beyond Chinese-English bidirectional translation of pharmaceutical regulatory documents, it is applicable in other vertical domains that are highly specialized, handle confidential material, and demand extremely high accuracy, such as the translation of multilingual clinical trial protocols, informed consent forms, research papers, and patents. By swapping in a domain-specific knowledge base and applying task-oriented lightweight fine-tuning, specialized translation systems or intelligent auxiliary tools targeting different application needs can be quickly constructed. Its generalization and transferability make it a strong template for such systems.
Data access and ethical considerations
Although the bilingual data used in this study are primarily derived from publicly available regulatory texts issued by the National Medical Products Administration, ICH, FDA, and EMA—none of which pertain to personal privacy or business secrets—ethical considerations must still be addressed in the model’s training and application. When such public data are incorporated into a private knowledge base and embedded into LLMs, the model may inadvertently “memorize” and reproduce sensitive expressions, potentially leading to the dissemination of misleading information or disputes over liability attribution in specific contexts. Moreover, as regulations in the rapidly evolving pharmaceutical industry continue to emerge worldwide, over-reliance on the model’s outputs without timely manual review could result in misinterpretation of regulatory requirements or errors in translation.
Even if the data sources are compliant, the model’s auxiliary role should be clearly defined so that it does not replace human professional judgment. Therefore, in practical deployment, it is recommended to establish standard operating procedures: the model’s decision-making weight should be strictly limited, with domain experts retaining final review responsibility. Meanwhile, the sources of the knowledge base data should be disclosed at deployment to maintain transparency, which helps establish a sustainable balance between technological innovation and public trust.
Limitations
Despite its impressive performance in regulatory affairs translation, the lightweight LLM still has certain limitations. Future research could explore the following directions:
Knowledge base: The data the current model relies on is primarily sourced from Chinese regulatory authorities or from ICH, FDA, and EMA regulatory guidelines translated by Chinese professionals, which inevitably introduces cultural and linguistic biases. Collecting officially issued Chinese translations of pharmaceutical regulations from other languages is extremely difficult, as such materials are rarely released by the regulatory authorities of other countries or international organizations. This homogeneity of data sources may allow linguistic patterns and expression conventions specific to Chinese regulatory contexts to dominate the model, which in turn affects its translation performance in non-Chinese contexts, leading to performance degradation or insufficient cultural adaptability.
To alleviate such biases, future research should prioritize expanding data sources by systematically collecting, cleaning, and integrating official documents from regulatory authorities of multiple countries and regions (e.g., Japan, Spain, and the European Union) to enrich the knowledge base. Integrating regulatory texts from multilingual and multicultural backgrounds would not only enhance the model’s multilingual translation capabilities through training but also promote balanced representation of linguistic features across different regulatory systems, thereby significantly improving the model’s generalization ability and cultural adaptability in international application scenarios.
Base model: PhT-LM is built on the Qwen-1_8B model. With the rapid development of LLMs, more advanced lightweight models, such as Qwen2.5-1.5B and Llama-3-1B, continue to emerge. A more advanced base model may further enhance overall performance.
Training strategy: As a parameter-efficient method, LoRA fine-tuning relies on low-rank approximation, which may limit the model’s attainable performance. Future work could therefore first conduct secondary pre-training on vertical-domain corpora to deepen the model’s understanding, and then apply full fine-tuning according to specific task requirements to maximize performance.
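The low-rank constraint mentioned above can be sketched in a few lines of NumPy. The dimensions are illustrative; in practice LoRA attaches one such adapter to each targeted attention or projection matrix and trains A and B by gradient descent while W stays frozen:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # hidden size and LoRA rank (illustrative values)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x, alpha=16.0):
    """Forward pass through the adapted layer. The effective weight is
    W + (alpha / r) * B @ A, so the learned update B @ A can never exceed
    rank r -- the low-rank approximation discussed above."""
    return x @ (W + (alpha / r) * B @ A).T
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly; only the small matrices A (r x d) and B (d x r) are updated, which is what makes the method parameter-efficient, and also why the rank-r cap can bound attainable quality compared with full fine-tuning.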
Translation speed: Although the translation speed of PhT-LM currently satisfies the existing requirements, the integration of more advanced efficient inference optimization techniques such as vLLM, together with more efficient retrieval algorithms to expedite the information retrieval process within a large-scale knowledge base, has the potential to further curtail the translation time.
Conclusion
This study customizes a novel and practical lightweight LLM, PhT-LM, to tackle the slow pace, poor quality, and high cost of Chinese-English regulatory affairs translation in the pharmaceutical industry. The model notably improves the translation quality of registration documents, outperforming the available general-purpose LLMs. In addition, it contains fewer parameters and requires only 4 GB of video memory for inference, so it can be deployed easily and quickly on personal computers or cell phones, offering great convenience to pharmaceutical businesses while protecting their pivotal confidential information. Finally, it unleashes the potential of domain-specific LLMs to improve the quality and speed of bilingual translation, helping these companies stand out amid fierce competition in the pharmaceutical industry.
Data availability
The datasets used in this study are available upon request. Interested researchers should contact the author via email at 1928539732@qq.com. Access to the data is granted exclusively for non-commercial academic purposes.
References
Zeng, D. China Pharmaceutical and Health Care Products Import and Export Chamber of Commerce Research Report: the Era of Local Pharmaceutical Companies Collectively Going to Sea is Coming. Economic Information Daily. 2023-05-10(006), https://doi.org/10.28419/n.cnki.njjck.2023.001671 (2023).
Zheng, Q. Model-informed drug development: principles and cases. Chin. J. New. Drugs. 32 (19), 1946–1952. https://doi.org/10.3969/j.issn.1003-3734.2023.19.008 (2023).
Wang, W., Huang, M., Chen, C., Li, C. & Wang, L. Trends of new drug R&D and approval in china: an analysis of the data of new drug applications from 2016 to 2020 by NMPA. Chin. J. New. Drugs. 32(4), 386–395, https://doi.org/10.3969/j.issn.1003-3734.2023.04.008 (2023).
Youssoufian, H. & Lewis, J. Getting from the bench to the patient: biotechnology strategies. Surg. Oncol. Clin. N Am. 22 (4), 885–901. https://doi.org/10.1016/j.soc.2013.06.007 (2013).
Liu, G., Yi, T., Li, X. & Shi, Y. Research on industrial competitive intelligence of biomedical industry in view holistic National security architecture. J. Intell. 40(9), 58–64, https://doi.org/10.3969/j.issn.1002-1965.2021.09.010 (2021).
Ding, H. & Tian, L. Research and enlightenment of registration and declaration system based on e—CTD format in America. Chin. J. New. Drugs. 32(21), 2121–2128, https://doi.org/10.3969/j.issn.1003-3734.2023.21.001 (2023).
Zhou, H. et al. A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. https://doi.org/10.48550/arXiv.2311.05112. Accessed on (18 Feb 2025).
Zheng, Y. et al. Large Language Models for Medicine: A Survey. https://doi.org/10.48550/arXiv.2405.13055. Accessed on (18 Feb 2025).
Clusmann, J. et al. The future landscape of large Language models in medicine. Commun. Med. (Lond). 3 (1), 141. https://doi.org/10.1038/s43856-023-00370-1 (2023).
Lee, K. et al. SEETrials: Leveraging large language models for safety and efficacy extraction in oncology clinical trials. Infor-matics in Med Unlocked. 50, 101589. https://doi.org/10.1016/j.imu.2024.101589 (2024).
Jin, Q. et al. Matching patients to clinical trials with large Language models. Nat. Commun 15, 9074. https://doi.org/10.1038/s41467-024-53081-z(2024).
Mao, X., Bi, B., Cao, C., Yao, C. & Li, G. The application of large Language model technology in drug clinical research and development. Chinese J. Med. Guide 26(11), 1093–1097, https://doi.org/10.3969/j.issn.1009-0959.2024.11.007(2024).
Kraljevic, Z. et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit. Health. 6 (4), e281–e290. https://doi.org/10.1016/S2589-7500(24)00025-6 (2024).
Wu, M., Yuan, Y., Haffari, G. & Wang, L. (Perhaps) beyond human translation: Harnessing Multi-Agent collaboration for translating ultra-Long Literary Texts. https://doi.org/10.48550/arXiv.2405.11804 Accessed on (25 Mar 2024).
Ding, L. A Comparative Study on the Quality of English-Chinese Translation of Legal Texts Between ChatGPT and Neural Machine Translation Systems. Theory Practice in Language Studies. 14(9), 2823–2833. https://doi.org/10.17507/tpls.1409.18 (2024).
Yang, Y. et al. MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications. https://doi.org/10.48550/arXiv.2310.15777. Accessed on (25 Mar 2024).
Xu, Y., Hu, L., Zhao, J., Du, W. & Wang, W. Technology application prospects and risk challenges of large Language models. J. Comput. Appl. 44 (6), 1655–1662. https://doi.org/10.11772/j.issn.1001-9081.2023060885 (2024).
Chavan, A., Magazine, R., Kushwaha, S., Debbah, M. & Gupta, D. Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward. https://doi.org/10.48550/arXiv.2402.01799. Accessed on (27 Mar 2024).
Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. https://doi.org/10.48550/arXiv.2005.11401. Accessed on (18 Feb 2025).
Tom, B. et al. Language Models are Few-Shot Learners. https://doi.org/10.48550/arXiv.2005.14165. Accessed on (18 Feb 2025).
Liu, P. et al. Pre-train, Prompt, and predict: A systematic survey of prompting methods in natural Language processing. ACM Comput. Surveys 55(9):1–35, https://doi.org/10.1145/3560815(2023).
Xiong, M. & Chi, X. On the security of the application of the generative large Language Model——Taking ChatGPT as an example. J. Shandong Social Sci. 5, 79–90, https://doi.org/10.14112/j.cnki.37-1053/c.2023.05.009(2023).
Cai, T., Wang, X., Ma, T., Chen, X. & Zhou, D. Large Language Models as Tool Makers. https://doi.org/10.48550/arXiv.2305.17126. Accessed on (27 Mar 2025).
Liu, Z. et al. AgentLite: A lightweight library for Building and advancing Task-Oriented LLM agent system. https://doi.org/10.48550/arXiv.2402.15538. Accessed on (28 Mar 2025).
Gao, J., Pan, J., Li, M., Lan, G. & Gao, C. Requirements for the writing and submission of clinical study reports after implementation of the provisions for drug registration. China Food & Drug Adm. Magazine 10, 20–26.https://doi.org/10.3969/j.issn.1673-5390.2021.10.003 (2021).
Qi, S. Features of chart translation in imported drug registration materials. Chin. J. Pharm. 53 (7), 1078–1079. https://doi.org/10.7626/j.issn.1001-8255.2022.7.zgyygy202207026 (2022).
Xu, R. A knowledge-enhanced pretrained language model in Closed Domains. East China Normal University: Shanghai, China. https://doi.org/10.27149/d.cnki.ghdsu.2023.001859 (2023).
Gong, R. et al. Landing methodology in the era of big Language Model——Cost, efficiency and effectiveness. AI-View 3, 52–61, https://doi.org/10.16453/j.2096-5036.2023.03.005 (2023).
Chen, H., Liu, Z. & Sun, M. The social opportunities and challenges in the era of large Language models. Journal of Computer Research and Development. 61(5), 1094–1103, https://doi.org/10.7544/issn1000-1239.202330700 (2024).
Li, Z. et al. FlexKBQA: A flexible LLM-Powered framework for Few-Shot knowledge Base Question Answering. https://doi.org/10.48550/arXiv.2308.12060. Accessed on (30 Mar 2024).
Park, Y. et al. Any-Precision LLM: Low-Cost deployment of multiple, Different-Sized LLMs. https://doi.org/10.48550/arXiv.2402.10517. Accessed on (30 Mar 2024).
Huang, W. et al. A Fast, Performant, secure distributed training framework For Large Language Model. https://doi.org/10.48550/arXiv.2401.09796. Accessed on (30 Mar 2024).
Gao, Y. et al. Retrieval-Augmented Generation for Large Language Models: A Survey. https://doi.org/10.48550/arXiv.2312.10997. Accessed on (30 Mar 2024).
Dong, Y., Jia, Y., Zhu, Y., Li, E. & X, X. Research on information retrieval method based on elastic search distributed search engine. Journal Hubei Normal Univ. (Natural Science) 43(4), 56–61, https://doi.org/10.3969/j.issn.2096-3149.2023.04.008 (2023).
Muennighoff, N., Tazi, N., Magne, L. & Reimers, N. MTEB: Massive Text Embedding Benchmark. https://doi.org/10.48550/arXiv.2210.07316. Accessed on (01 Apr 2024).
Bai, J. et al. Qwen Technical Report. https://doi.org/10.48550/arXiv.2309.16609. Accessed on (01 Apr 2024).
Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. https://doi.org/10.48550/arXiv.2106.09685. Accessed on (01 Apr 2024).
Chen, D. pingan-team at SemEval-2025 Task 2: LoRA-Augmented Qwen2.5 with Wikidata-Driven Entity Translation. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 2065–2070, Vienna, Austria. Association for Computational Linguistics. https://aclanthology.org/2025.semeval-1.268 (2025).
Zhao, J., Yu, X. & Yang, Z. MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning. https://doi.org/10.48550/arXiv.2503.21838. Accessed on (03 Sep 2025).
Xu, H., Kim, Y. J., Sharaf, A. & Awadalla, H. H. A paradigm shift in machine translation: boosting translation performance of Large Language Models. https://doi.org/10.48550/arXiv.2309.11674. Accessed on (03 Sep 2025).
Zheng, B., Gu, J., Li, S. & Dong, C. LM4LV: A frozen large Language model for low-level Vision Tasks. https://doi.org/10.48550/arXiv.2405.15734. Accessed on (03 Sep 2025).
Liu, S. Evaluating the application efficacy of machine translation in maritime contexts: A rigorous evaluation via BLEU, chrF++, and bertscore metrics. J. OCEAN. U CHINA (Social Sciences). 2, 21–31. https://doi.org/10.16497/j.cnki.1672-335X.202402003 (2024).
ISO. ISO 17100:2015 Translation services —Requirements for translation services.[2015.05]. https://www.iso.org/obp/ui/en/#iso:std:iso:17100:ed-1:v1:en. Accessed on (07 Jul 2025).
ISO. ISO 18587:2017 Translation services —Post-editing of machine translation output — Requirements.[2017.04]. https://www.iso.org/obp/ui/en/#!iso:std:62970:en. Accessed on (07 Jul 2025).
EMA.The Linguistic Review Process Of Product Information in the Centralised Procedure[EB/OL]. [2025.02.10]. https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/linguistic-review-process-product-information-centralised-procedure-human_en.pdf. Accessed on (07 Jul 2025).
CMDh.Best Practice Guide On The Submission Of High Quality National Translations[EB/OL]. [2012.04]. https://www.aifa.gov.it/documents/20142/516919/bpg_on_the_submission_of_high_quality_national_translations_cmdh_255_2012_rev0_2012_05.pdf. Accessed on (07 Jul 2025).
FDA.Translation of Good Laboratory Practice Study Reports: Questions and Answers.[EB/OL]. [2023.11.23]. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/translation-good-laboratory-practice-study-reports-questions-and-answers. Accessed on (07 Jul 2025).
Li, M. et al. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. https://doi.org/10.48550/arXiv.2308.12032. Accessed on (03 Apr 2024).
Liu, Z. et al. Tailoring large Language models to radiology: A preliminary approach to LLM adaptation for a highly specialized domain. In: (eds Cao, X., Xu, X., Rekik, I., Cui, Z. & Ouyang, X.) Machine Learning in Medical Imaging. MLMI Lecture Notes in Computer Science, vol 14348. Springer, Cham. https://doi.org/10.1007/978-3-031-45673-2_46 (2023).
Huang, H. et al. Towards Making the Most of LLM for Translation Quality Estimation. NLPCC. 375–386. https://doi.org/10.1007/978-3-031-44693-1_30 (2023).
Zhang, B., Liu, Z., Cherry, C. & Firat, O. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method. https://doi.org/10.48550/arXiv.2402.17193. Accessed on (03 Apr 2024).
Wang, L. et al. Document-Level Machine Translation with Large Language Models. https://doi.org/10.48550/arXiv.2304.02210. Accessed on (05 Apr 2024).
Intrator, Y. et al. Breaking the Language Barrier: Can Direct Inference Outperform Pre-Translation in Multilingual LLM Applications?. https://doi.org/10.48550/arXiv.2403.04792. Accessed on (05 Apr 2024).
Zhang, X., Li, S., Hauer, B., Shi, N. & Kondrak, G. Don’t Trust ChatGPT when your Question is not in English: A Study of Multilingual Abilities and Types of LLMs. https://doi.org/10.48550/arXiv.2305.16339. Accessed on (05 Apr 2024).
Pu, Y. Research on the application and development of AI model in digital government Systems. Technology Innovation and Application. 14(35), 22–26, https://doi.org/10.19981/j.CN23-1581/G3.2024.35.004(2024).
Zhao, J., & Li, X. Building and Applying Translation Agents Powered by Large Language Models. Technology Enhanced Foreign Language Education, 5, 22-28, https://doi.org/10.20139/j.issn.1001-5795.20240504 (2024).
Funding
This work is funded by General Program of Philosophy and Social Science Research in Jiangsu Colleges and Universities “Research on Language Services in the Pharmaceutical Industry from the perspective of linguistic economics—a case study of Jiangsu Province“(No. 2022SJYB0058).
Author information
Authors and Affiliations
Contributions
Data curation, Chenlu Jiang; Funding acquisition, Jue Gan; Investigation, Tao Wang; Methodology, Tao Chen; Project administration, Fengzhen Hou and Jue Gan; Resources, Zihan Liu and Jue Gan; Software, Tao Chen; Validation, Hongyi Mo; Visualization, Tao Chen; Writing – original draft, Tao Chen and Jue Gan; Writing – review & editing, Tao Chen and Jue Gan. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, T., Mo, H., Wang, T. et al. A lightweight large language model for regulatory affairs translation in pharmaceutical industry. Sci Rep 15, 37992 (2025). https://doi.org/10.1038/s41598-025-21867-w
DOI: https://doi.org/10.1038/s41598-025-21867-w