Introduction

Statistics from the International Agency for Research on Cancer (IARC) show that lung cancer was the leading cause of global cancer mortality in 2020, with approximately 1.8 million deaths1. China faces a similar burden, with persistently high incidence and mortality rates2. The knowledge surrounding lung cancer is vast and complex, covering etiology, pathology, clinical manifestations, and therapeutic regimens. Although extensive medical literature exists, its fragmented nature and varied terminology hinder effective integration and use.

Furthermore, this challenge is amplified within the context of Chinese medical literature. This domain is characterized by a high degree of terminological variance, including numerous synonyms, acronyms, and the integration of concepts from Traditional Chinese Medicine (TCM). Clinical notes often feature dense, complex sentence structures that describe nested, hierarchical relationships. A successful knowledge graph construction framework must therefore not only extract information but also be robust enough to navigate this linguistic complexity, standardize diverse terminology, and deconstruct intricate syntactic patterns. This study presents a framework specifically designed to meet these challenges.

Knowledge graphs (KGs) offer a solution by representing this knowledge as structured “entity-relation-entity” triplets. KG construction typically follows two main approaches. Top-down methods build from structured sources like encyclopedias to create an initial ontology and schema. In contrast, bottom-up approaches extract facts from unstructured data using techniques like relation extraction, integrating high-confidence information into the knowledge base3.

Figure 1 illustrates the progression from raw data to a completed knowledge graph through the stages of knowledge extraction, fusion (entity alignment), and quality assessment. Input data, categorized by degree of structure as structured, semi-structured, or unstructured, are converted into triplets by methods suited to each category. The triplets then undergo knowledge fusion to yield standardized representations. In addition, implicit information can be mined with inference rules to uncover new knowledge. All extracted knowledge undergoes quality evaluation before inclusion in the knowledge graph.

Fig. 1

Overall framework for the lung cancer knowledge graph (LCKG) construction. The process begins with heterogeneous data inputs (structured, semi-structured, and unstructured). Knowledge extraction converts this raw data into triplets. Knowledge fusion, which includes entity alignment and quality assessment, integrates these triplets into a unified representation. The final, quality-checked knowledge is stored in a graph database and made available through application services.

Constructing knowledge graphs from raw text is an intricate task. Conventional methods, such as manual annotation, rule-based matching, and shallow neural networks, often suffer from inefficiency, limited coverage, and poor adaptability4. Manual construction is labor-intensive, costly, and slow to update, which makes it ill-suited to the rapid evolution of medical knowledge. Rule- and template-based methods are constrained by the comprehensiveness and flexibility of their predefined rules and often overlook latent knowledge and novel relationships5. Shallow neural networks perform inadequately when extracting deep, complex associations, particularly given the varied expressions found in unstructured text, which often results in unsatisfactory precision and recall6. In the lung cancer domain, Chen et al. constructed a lung cancer knowledge graph using traditional named entity recognition (NER) methods7. Zhang et al. combined top-down and bottom-up strategies to build a lung cancer knowledge graph in a semi-automated manner8. Shao et al. applied deep learning-based named entity recognition to lung cancer treatment case data, extracting the entities required for the experiment and standardizing the data9.

In recent years, the rapid rise of deep learning, and of the Transformer architecture in particular, has driven the widespread adoption and study of large language models (LLMs). As artificial intelligence (AI) technologies centered on natural language understanding and generation, LLMs have reshaped natural language processing (NLP) with their strong contextual understanding and content generation capabilities10. Fundamentally, LLMs learn linguistic patterns and rules from vast amounts of text through deep neural networks, forming rich representations of language knowledge. Prominent LLMs include the GPT (Generative Pre-trained Transformer) series11, Tsinghua University’s ChatGLM series12, and Alibaba Cloud’s QianWen13. All of them employ self-attention mechanisms and large-scale unsupervised pre-training to capture the grammatical structure and semantic features of language, which strengthens their ability to understand and generate text and helps them handle complex medical data.

In constructing a medical knowledge graph for lung cancer, using LLMs to structure data reduces annotation costs and accelerates graph construction. First, thanks to their strong language understanding, LLMs show marked advantages in complex tasks such as entity recognition, entity classification, relation extraction, and anaphora resolution, and therefore extract entities and relationships more efficiently14. Second, LLMs help improve the completeness of the knowledge graph. Existing knowledge graphs often fail to incorporate all required knowledge, leaving gaps, whereas LLMs can draw on their vast training data to supply knowledge that current graphs lack. While traditional knowledge graphs can only accommodate explicit, human-understood knowledge, LLMs allow knowledge to be expressed in vectorized form and integrated into the knowledge graph’s architecture, where it participates in knowledge computation and reasoning15 and broadens the graph’s coverage.

Given these considerations, this study proposes a strategy for constructing a lung cancer knowledge graph based on LLMs. Through fine-tuning, we constructed KGLM (Knowledge Graph Large Model), which uses prompts to extract triplets from unstructured data. The extracted triplets were then fused with semi-structured clinical data that had been cleansed by rules and with other open graph datasets. The result is a lung cancer diagnosis and treatment knowledge graph built through highly automated extraction of latent knowledge triplets from unstructured text. The process considerably reduced manual workload and improved the timeliness and comprehensiveness of knowledge graph construction.

The principal contributions of this study are threefold:

  1. We propose a unified, end-to-end framework for knowledge graph construction that reduces the complexity of traditional multi-stage pipelines. By fine-tuning a large language model to directly generate structured triplets, our approach is designed to mitigate error propagation between separate NER and RE modules and to simplify deployment.

  2. We demonstrate the effectiveness of a multi-component prompt engineering strategy to ensure high-quality, structured output. Our prompt template goes beyond simple instructions by incorporating system role-setting, strict output formatting, and Chain-of-Thought (CoT) reasoning. This transforms the LLM from a general conversational agent into a precise, machine-readable knowledge extraction tool.

  3. We empirically validate our KGLM framework, showing that it significantly outperforms established baselines. Through comprehensive objective and subjective evaluations, we prove that the combination of fine-tuning and advanced prompt engineering leads to substantial improvements in extraction accuracy (F1 score of 82%) and produces a knowledge graph of higher clinical relevance and usability.

To clearly define the scope of our empirical study, we emphasize that all comparative experiments in this work are conducted against open-weight and locally deployable baselines (ChatGLM, BERT, CNN and their variants), rather than proprietary frontier medical LLMs such as GPT-4 or Med-PaLM. Evaluating API-based systems on clinical text would raise data-privacy and governance concerns and introduce confounding factors related to prompt cost, latency, rate limits, and continuous model updates. As a result, we treat the absence of such head-to-head comparisons with proprietary models as an explicit limitation of this study.

Basic research

Fine-tuning of LLMs

Fine-tuning techniques have eased the challenge of adapting LLMs, fueling the development of domain-specific models across disciplines. However, full-precision fine-tuning of such models demands very large amounts of GPU (graphics processing unit) memory, putting it virtually out of reach on personal computers. To reduce these memory requirements, DeepSpeed employs ZeRO-Offload, which uses the CPU (central processing unit) and system memory as supplements during training16. Even so, computation on CPUs remains very slow. To improve fine-tuning efficiency, Edward Hu et al.17 highlight the low-rank nature of the weight updates in LLM fine-tuning and introduce the Low-Rank Adaptation (LoRA) method. LoRA freezes the pretrained parameters during fine-tuning and trains only a small set of additional parameters introduced into the model. Building upon LoRA, Tim Dettmers et al.18 propose Quantized LoRA (QLoRA), which applies k-bit quantization to the base model parameters before fine-tuning. With a 4-bit quantization scheme, QLoRA can reduce the fine-tuning resource consumption of large models by an order of magnitude, dramatically cutting the cost of fine-tuning LLMs. As depicted in Fig. 2, QLoRA substantially reduces the memory footprint during fine-tuning by combining quantization of the base model with LoRA adapters, improving the utilization of computational resources.

Fig. 2

Comparison of different LLM fine-tuning methods. (a) Full finetune: All model parameters (e.g., 16-bit) are updated, requiring high memory for both the base model and optimizer states. (b) LoRA (Low-Rank Adaptation): Most parameters are frozen, and only small, low-rank “adapter” matrices are trained, significantly reducing memory usage. (c) QLoRA (Quantized LoRA): The base model’s parameters are quantized to a lower precision (e.g., 4-bit) before adding LoRA adapters, further decreasing the memory footprint and enabling fine-tuning on consumer-grade hardware.

Prompt engineering

Prompt engineering is an emerging research field comprising practical techniques and strategies for guiding LLMs to generate accurate, relevant, goal-oriented output tailored to specific contexts19. The input text a user provides when interacting with an LLM can be understood as a prompt, which the model uses to generate its response. The quality and clarity of a prompt typically influence the accuracy and relevance of the generated text: well-designed prompts steer the model toward the desired output, whereas vague or ambiguous prompts may lead to uncertain or unrelated text. Crafting effective prompts is therefore a crucial skill when applying LLMs to natural language tasks.

Jason Wei et al.20 introduced Chain-of-Thought (CoT) prompting, which elicits complex reasoning through intermediate reasoning steps. WANG Xuezhi et al.21 proposed Self-Consistency with CoT (SCoT), which replaces greedy decoding with sampling to generate multiple distinct CoT reasoning paths and then votes to produce the final outcome. Recognizing that simple prompting techniques are insufficient for complex tasks, YAO Shunyu et al.22 put forth the Tree of Thoughts (ToT) framework, which generalizes CoT prompting and guides language models to use thoughts as intermediate steps in solving general problems. The text generation approach under each prompting strategy is depicted in Fig. 3.

Fig. 3

Comparison of different prompt engineering strategies. (a) Input-output prompting: The simplest form, providing a direct instruction and expecting an immediate answer. (b) Chain of thought (CoT) prompting: The model is instructed to generate intermediate reasoning steps before providing a final answer. (c) Self-consistency with CoT: Multiple CoT reasoning paths are generated, and the final answer is determined by a majority vote, improving robustness. (d) Tree of thought (ToT) prompting: The model explores multiple reasoning paths in a tree-like structure, evaluating intermediate steps to make more deliberate decisions.
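The voting step of self-consistency can be illustrated with a minimal sketch. It assumes the final answers have already been parsed out of several independently sampled CoT completions; the function name `majority_vote` and the example answers are hypothetical.

```python
from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among sampled reasoning paths."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    # most_common(1) yields the answer with the highest count.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from five sampled CoT completions.
sampled_answers = ["stage II", "stage III", "stage II", "stage II", "stage I"]
final_answer = majority_vote(sampled_answers)
```

Only the aggregation is shown here; sampling the reasoning paths themselves depends on the model and decoding settings in use.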

Knowledge graph relationship extraction

Relation Extraction (RE) is dedicated to accurately extracting entities, attributes, and the relationships between them from vast amounts of unstructured textual resources. These extracted pieces of information serve as the foundation stones of knowledge graphs, revealing the inherent semantic connections between entities in a structured format. Over recent years, RE has emerged as a focal point of research within the realm of natural language processing.

Miwa Makoto et al.23 were pioneers in proposing a parameter-sharing RE model, which they validated on the ACE05 dataset, demonstrating a 6.1% improvement in F1 score compared to conventional methods. To further capture the intricate dependencies between entities and their relationships, ZHENG Suncong et al.24 innovatively transformed the RE problem into a sequence labeling task, devising an end-to-end deep learning model based on a unified tagging scheme, effectively enhancing extraction efficiency and precision.

Addressing the overlooked phenomenon of overlapping entities within relation triples in prior studies, ZENG Xiangrong et al.25 thoroughly analyzed two scenarios: single entity overlap and entity pair overlap. Subsequently, WEI Zhepei et al.26 introduced the CasRel framework, a phased pointer labeling approach specifically tailored to tackle the issue of overlapping entities, significantly enhancing the model’s capability to identify and handle such relationships. Furthermore, the team led by CHANG Hongyang27 successfully adapted the CasRel model to the context of Chinese medical texts, implementing adaptive enhancements that bolstered the model’s decoding performance when confronted with complex entity overlaps in Chinese medical texts, thus offering a solution to the challenging entity recognition problem.

In summary, the continuous advancement and refinement of RE techniques not only enhance the quality of knowledge graph construction but also provide a solid foundation for knowledge graph applications across various domains.

Methods

Overall framework

Traditional knowledge graph construction from unstructured medical data relies on laborious manual annotation or conventional automated extraction techniques, which are time-consuming, resource-intensive, and costly. In response, this study employs LLMs to assist in knowledge graph construction, following the process outlined in Fig. 4. First, in the information extraction phase, the KGLM model is built by fine-tuning and is paired with carefully crafted prompt templates for knowledge extraction, allowing valuable knowledge elements to be retrieved automatically from a large corpus of unstructured data in the lung cancer domain. Second, in the knowledge fusion phase, the triplets extracted by the prompted KGLM are merged with semi-structured clinical data processed by rules and with openly accessible graph data. Entity alignment, integration, and quality assessment are carried out using a combination of text similarity and semantic similarity, laying the groundwork for efficient construction and accurate population of the lung cancer knowledge graph. Finally, in the storage and visualization stage, the fused knowledge is written into a Neo4j graph database for storage and visual presentation, forming a comprehensive knowledge network specific to lung cancer. This approach reduces the cost of knowledge graph construction and improves the structural integrity and machine-readability of the extracted knowledge, which is crucial for downstream applications such as the systematic organization and intelligent application of lung cancer diagnostic and therapeutic knowledge. The knowledge graph assembled in this manner is referred to as LCKG (Lung Cancer Knowledge Graph).

Fig. 4

The three-stage framework for constructing the Lung Cancer Knowledge Graph (LCKG) using an LLM. (a) Information extraction: The fine-tuned KGLM, enhanced with prompt engineering, extracts knowledge triplets from unstructured web data. The prompt provides an instruction and an example of the desired (head, relation, tail) output format. (b) Knowledge fusion: The extracted triplets are integrated with semi-structured clinical data and structured public graph data. This fusion involves rule-matching, entity alignment, and a final quality assessment before manual cleaning of edge cases. (c) Storage and visualization: The unified and cleaned triplets are stored in a Neo4j graph database, which supports querying (via Cypher) and visualization tools.
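As an illustration of the storage stage, the sketch below turns one (head, relation, tail) triplet into a parameterised Cypher statement. The node label `Entity`, the relationship type `REL`, and the property names are assumptions made for this example and are not the schema used in this study.

```python
from typing import Dict, Tuple

# "Entity", "REL", "name" and "type" are hypothetical schema choices.
MERGE_TRIPLET = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[:REL {type: $relation}]->(t)"
)

def to_cypher(triplet: Tuple[str, str, str]) -> Tuple[str, Dict[str, str]]:
    """Return a parameterised Cypher statement and its parameters for one triplet."""
    head, relation, tail = triplet
    return MERGE_TRIPLET, {"head": head, "relation": relation, "tail": tail}

# Illustrative triplet, not taken from the study's data.
query, params = to_cypher(("lung cancer", "symptom", "cough"))
```

Using `MERGE` rather than `CREATE` avoids inserting the same node or edge twice, and passing values as parameters keeps entity names out of the query string. The relation is stored as a property here because Cypher does not allow relationship types to be supplied as parameters.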

Information extraction

Although LLMs can generate knowledge on their own, their output often lacks a systematic structure. To improve the model’s adherence to specific instructions, we employ fine-tuning to develop a model specifically for extracting triplets from unstructured text. In this study, we base our fine-tuning on the ChatGLM model, which uses a prefix decoder-only architecture, an evolution of the causal decoder-only architecture exemplified by GPT that combines the benefits of unidirectional and bidirectional attention. In terms of model details, ChatGLM applies gradient scaling to the embedding layer and uses post-layer normalization (Post-LN) to enhance training stability. Moreover, Rotary Position Embedding (RoPE) is used in place of traditional absolute positional encoding, and GeLU activation is employed in the feed-forward network (FFN) of the transformer architecture.

The LoRA fine-tuning process, as illustrated in Fig. 5, effectively introduces a “side branch” on the right, where the data first undergoes dimensionality reduction via a Linear layer A, transforming it from dimension d to r. This is followed by a second Linear layer B, which maps the data back to dimension d. Ultimately, the outputs from both the left (original) path and this adjusted right path are combined, or fused, to yield the final hidden state output.

Fig. 5

The mechanism of Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Instead of updating the original pretrained weights (W), LoRA introduces two small, low-rank matrices (A, B). The input (x) passes through both the frozen original path and the trainable “side branch” (AB). The outputs are then added together. Only matrices A and B are updated during training, drastically reducing the number of trainable parameters.
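The side-branch computation can be made concrete with a small sketch in plain Python. It assumes a single linear layer with toy dimensions (d = 2, r = 1); the helper names, the `scale` argument, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x: Matrix, w: Matrix, a: Matrix, b: Matrix, scale: float = 1.0) -> Matrix:
    """Frozen path x @ w plus the scaled low-rank side branch (x @ a) @ b."""
    frozen = matmul(x, w)            # original pretrained weights, never updated
    side = matmul(matmul(x, a), b)   # d -> r -> d bottleneck
    return [[f + scale * s for f, s in zip(fr, sr)] for fr, sr in zip(frozen, side)]

x = [[1.0, 2.0]]                 # one input row with d = 2
w = [[1.0, 0.0], [0.0, 1.0]]     # frozen weights (identity, for readability)
a = [[0.5], [0.5]]               # down-projection A: d -> r
b_zero = [[0.0, 0.0]]            # up-projection B set to zero
b_trained = [[1.0, -1.0]]        # hypothetical values of B after training

before = lora_forward(x, w, a, b_zero)      # side branch contributes nothing
after = lora_forward(x, w, a, b_trained)    # side branch shifts the output
```

With B at zero the layer reproduces the frozen model exactly, which is why LoRA training can start from the pretrained behaviour. Only A and B, holding 2dr values rather than d², would receive gradient updates.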

Next, the model undergoes 8-bit quantization, a process illustrated by Eq. (1).

$$X^{\text{int8}}=\text{round}\left(\frac{127}{\text{absmax}(X^{\text{FP32}})}\cdot X^{\text{FP32}}\right)=\text{round}\left(c^{\text{FP32}}\cdot X^{\text{FP32}}\right)$$
(1)

Here, X represents the input tensor, with c serving as a constant denoting the quantization scale. The inverse process for the quantized model is given by Eq. (2).

$$\text{dequant}\left(c^{\text{FP32}},X^{\text{int8}}\right)=\frac{X^{\text{int8}}}{c^{\text{FP32}}}\approx X^{\text{FP32}}$$
(2)
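Eqs. (1) and (2) can be traced with a short sketch, assuming a flat list of floats stands in for the tensor X. The function names and the sample values are hypothetical.

```python
from typing import List, Tuple

def quantize_absmax(x: List[float]) -> Tuple[List[int], float]:
    """Eq. (1): scale by c = 127 / absmax(x), then round to the nearest integer."""
    absmax = max(abs(v) for v in x)
    if absmax == 0.0:
        return [0 for _ in x], 1.0   # all-zero input: any scale reproduces it
    c = 127.0 / absmax
    return [round(c * v) for v in x], c

def dequantize(c: float, x_int8: List[int]) -> List[float]:
    """Eq. (2): divide by the same scale to approximate the original values."""
    return [q / c for q in x_int8]

values = [0.5, -1.2, 0.25, 2.0]      # illustrative FP32 values
quantized, c = quantize_absmax(values)
restored = dequantize(c, quantized)
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

The largest-magnitude element always maps to ±127, and the round-trip error of every element is bounded by half a quantization step, 0.5 / c.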

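QLoRA then adds the LoRA adapter matrices to the quantized model, whose weights are recovered through Eq. (2), yielding Eq. (3). Here W denotes the frozen base weights, L1 and L2 are the trainable low-rank adapter matrices (corresponding to A and B in Fig. 5), and s is a scalar scaling factor.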

$$Y=XW+sXL_{1}L_{2}$$
(3)

In the experimental workflow depicted in Fig. 4, ChatGLM is first deployed locally and then fine-tuned on a carefully designed dataset. Each sample in this dataset has the following format: {“prompt”: “Please extract all triplets from the given sentence. Given sentence: text”, “response”: “[(head1, relation1, tail1), (head2, relation2, tail2) … (headn, relationn, tailn)]”}. The “prompt” field contains a clear instruction guiding the model to extract triplet information from the given sentence, while the “response” field gives the expected output: a list of all extracted triplets, each composed of a head entity (head), a relationship (relation), and a tail entity (tail). Fine-tuning ChatGLM on this dataset yields a model named KGLM, which enables the automatic extraction of knowledge triplets from a lung cancer-related corpus.
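A sample in the format described above could be assembled as follows. This is a sketch only: the helper `make_sample`, the example sentence, and the relation name `treated_with` are hypothetical and do not come from the study’s dataset.

```python
import json
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]

INSTRUCTION = "Please extract all triplets from the given sentence. Given sentence: "

def make_sample(sentence: str, triplets: List[Triplet]) -> Dict[str, str]:
    """Build one prompt/response pair in the fine-tuning format."""
    response = "[" + ", ".join(f"({h}, {r}, {t})" for h, r, t in triplets) + "]"
    return {"prompt": INSTRUCTION + sentence, "response": response}

# Hypothetical annotated sentence used only to show the shape of a sample.
sample = make_sample(
    "Cisplatin is used to treat lung cancer.",
    [("lung cancer", "treated_with", "cisplatin")],
)
# One JSON object per line; ensure_ascii=False keeps Chinese text readable.
line = json.dumps(sample, ensure_ascii=False)
```

Writing one such JSON object per line gives a file that common fine-tuning scripts can stream without loading the whole dataset into memory.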

Because KGLM retains the semantic understanding ability of the original LLM in addition to its extraction capability, we designed a prompt template to guide knowledge extraction from unstructured data. The template, illustrated in Fig. 6, is a central contribution of our framework and consists of four components, each addressing a specific challenge in medical knowledge extraction. It strengthens the model’s extraction ability on complex medical texts through role definition, structured constraints, and explicit guidance of multi-step reasoning: it first sets a professional System Role (a), then provides examples of the desired Triplet Structure and JSON format (b, c), and finally uses CoT reasoning rules (d) to handle nested relationships.

  a. Set System Roles: This instruction primes the model to focus on factual extraction by assigning it the role of a “professional NLP expert”.

  b/c. Define triples and output format: These components enforce a strict JSON-based triplet structure, ensuring the output is machine-readable for automated knowledge base population.

  d. Output Rules with CoT Reasoning: This rule incorporates a Chain-of-Thought (CoT) prompt (“let’s think step by step”) to handle complex, nested relationships. This is a direct response to the syntactic complexity of Chinese clinical narratives, where multiple hierarchical details are often embedded within a single sentence. The CoT prompt explicitly guides the model to deconstruct these relationships sequentially (e.g., Disease → Treatment → Dosage), ensuring that deeper-level attributes are correctly extracted and linked.

Fig. 6

The multi-component prompt template used to guide the KGLM model. The template consists of four parts: (a) System role: Assigns the model the identity of an expert to focus its output. (b) Define triples: Provides few-shot examples of the desired triplet structure. (c) Output format requirements: Specifies strict rules, such as using JSON format, to ensure machine-readability. (d) Output rules: Includes a Chain-of-Thought (CoT) instruction (“let’s think step by step”) to help the model process complex and nested relationships.
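The four components can be sketched as a prompt builder paired with a strict parser for the model’s reply. The wording of each component, the JSON key names, and both function names are assumptions for illustration; the actual template is the one shown in Fig. 6.

```python
import json
from typing import Dict, List

def build_prompt(text: str) -> str:
    """Assemble the four template components around the input text."""
    parts = [
        "You are a professional NLP expert.",                                  # (a) system role
        'A triple has the form {"head": ..., "relation": ..., "tail": ...}.',  # (b) triple definition
        "Return only a JSON list of triples, with no extra text.",             # (c) output format
        "For nested relationships, let's think step by step.",                 # (d) CoT rule
        "Text: " + text,
    ]
    return "\n".join(parts)

def parse_triples(raw: str) -> List[Dict[str, str]]:
    """Keep only well-formed triples; return [] when the reply is not valid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    required = {"head", "relation", "tail"}
    return [d for d in data if isinstance(d, dict) and set(d) == required]

prompt = build_prompt("The patient received chemotherapy for lung cancer.")
```

Rejecting malformed replies at parse time is one way to route atypical outputs to manual review instead of letting them enter the graph silently.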

Although the data is largely unstructured, in most cases all attribute values pertaining to the same entity appear in the same or contiguous text sections. As a result, when focusing on a specific disease entity, its related attribute values are mostly standardized, making it straightforward for large models to extract the required triplets according to the prompt template. Accordingly, in the prompt template, the disease is designated as the head entity, and the attribute values of the disease are set as the tail entities. Given an entity and its attribute values, the model infers the most fitting relationship between them. For complex attribute relationships, such as attributes of attribute values, a CoT reasoning sequence is introduced. Relationships are matched recursively, step by step: for example, a disease’s attributes include treatment methods, and the attributes of those treatment methods in turn include efficacy and duration of treatment. An example of this hierarchical extraction from a real-world patient case is illustrated in Fig. 7. The figure shows how KGLM processes the raw text to identify a primary disease and then maps related entities such as symptoms, examinations, and specific treatment details (e.g., Medication, Dosage) into a structured, interconnected graph.

Fig. 7

An example of knowledge triplet extraction from a patient case summary. The model processes the unstructured text (top box) and identifies key entities such as “Disease,” “Symptoms,” “Medication,” and their attributes (e.g., “Dosage,” “Course of Treatment”). It then structures this information into a hierarchical graph of (head, relation, tail) triplets, as shown in the diagram below the text. The right side shows the English version, and the left side shows the original Chinese example.

Through the aforementioned methods, we achieve the autonomous generation of knowledge triplets and the efficient transformation of the corpus. The unstructured data obtained through web scraping contains detailed therapies for various diseases, symptoms, and different populations, providing rich material for subsequent entity alignment and fusion. Nonetheless, even the prompted KGLM model might yield atypical triplet formations or substandard extraction outcomes when confronted with certain intricate or ambiguous medical text segments. To ensure high accuracy and consistency across the board in the output, a manual data cleansing strategy is employed in this study. For these edge cases, trained professionals review and correct the model’s extraction results, aiming to uphold the advantages of automation while safeguarding the quality and integrity of the ultimate knowledge repository.

Knowledge fusion

A single entity can be expressed in multiple ways due to variations in textual descriptions. For instance, “Chronic Obstructive Pulmonary Disease” and “COPD” are two different expressions referring to the same medical condition. Such semantically identical entities not only waste graph database resources but also impair entity recall during subsequent knowledge graph queries. Hence, during knowledge graph construction, it is essential to avoid generating duplicate entities and instead to fuse and store different expressions of the same knowledge under a single representation. This process of mapping entity terms with different expressions to a single entity is known as entity alignment or coreference resolution, and it primarily addresses the issue of synonymy. It reduces the redundancy that arises when knowledge from different data sources is fused, thereby enhancing the quality of the knowledge graph.

Currently, entity alignment mainly employs two calculation methods: text similarity and semantic similarity:

  1. Alignment based on text similarity boasts high computational efficiency and rapid speed, but it fails to handle entities with similar text but divergent semantics. Jaccard similarity, also called Jaccard coefficient, is used to compare the similarity and dissimilarity between finite sample sets, with higher values indicating greater sample similarity. The formula for calculating Jaccard similarity is given by Eq. (4).

    $$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
    (4)

  2. Alignment based on semantic similarity relies on model encoding. Currently, commonly used semantic representation models include Sentence-BERT (SBERT) proposed by Nils Reimers et al.28. Such models offer high accuracy but are computationally intensive and slower in speed. Combining semantic vectors with text description information for entity alignment achieves cross-language alignment of entities. The distance between semantic vectors is often computed using cosine similarity, with the calculation method outlined in Eq. (5).

    $$\cos\langle A,B\rangle=\frac{A\cdot B}{|A|\times|B|}$$
    (5)

Here, A and B denote encoded semantic vectors, while |A| and |B| represent the respective magnitudes of A and B.
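Eqs. (4) and (5) translate directly into code. The sketch below assumes that the sets in Eq. (4) are the sets of characters of each entity name, which is one common choice for Chinese text; the source does not state the tokenization, so this is an assumption.

```python
import math
from typing import Sequence

def jaccard(a: str, b: str) -> float:
    """Eq. (4) over character sets (character-level tokenization is an assumption)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0   # two empty names are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Eq. (5): dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

text_score = jaccard("abc", "bcd")                  # shares {b, c} out of {a, b, c, d}
semantic_score = cosine([1.0, 2.0], [2.0, 4.0])     # parallel vectors
```

In practice the vectors passed to `cosine` would come from a sentence encoder such as SBERT; plain lists are used here so that the arithmetic stays visible.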

To address the challenges of varied entity representations and a large number of candidate entities, we employ a hybrid approach that combines text and semantic similarity. This methodology is particularly well-suited to Chinese medical texts, which feature many synonyms, abbreviations, and distinct but semantically identical terms. For instance, a key challenge is aligning a formal diagnosis like “Chronic Obstructive Pulmonary Disease” with its common clinical abbreviation “COPD”. A purely text-based method like Jaccard similarity would yield a low score, potentially failing to merge the entities. Our hybrid approach (Algorithm 1) first uses Jaccard as a fast filter. If it fails, the more computationally intensive SBERT model is invoked. SBERT, having been trained on vast text corpora, represents these two terms as semantically close and produces a high cosine similarity score, correctly identifying them as duplicates. As depicted in Algorithm 1, entity alignment commences by computing the text similarity. Should the text similarity exceed a predefined threshold, the semantic similarity is further assessed to determine whether it too surpasses its threshold; otherwise, the pair is promptly judged not to be duplicates. This two-stage methodology avoids excessive model computation when only a small number of entities are duplicates, while also accounting for similar texts with divergent semantics, thereby reducing misclassifications and improving computational efficiency.

Algorithm 1 Entity alignment method based on the Jaccard coefficient and SBERT.

Here, \(Jaccard(x,n)\) denotes the Jaccard similarity between entities, and \(\cos (a,b)\) represents the cosine similarity between vectors a and b. The thresholds were selected through a two-dimensional sensitivity analysis over t1 (Jaccard) and t2 (SBERT). We evaluated t1, t2 ∈ [0.70, 0.95] on 500 validation entity pairs. The results (Fig. 8) show a plateau of high F1 when t1 ∈ [0.80, 0.88] and t2 ∈ [0.83, 0.90]. For practical deployment, we recommend settings within this region. A symmetric choice of t1 = t2 = 0.85 offers a balanced trade-off, while asymmetric combinations (e.g., t1 = 0.82, t2 = 0.88) may be preferred when either recall or precision is prioritized.
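A two-dimensional threshold sweep of this kind can be sketched as follows. The grid step of 0.05, the decision rule (both scores must exceed their thresholds), and the six scored pairs are assumptions for illustration only; the pairs are made-up toy values, not the 500 validation pairs used in the study.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(pairs, grid):
    """Return {(t1, t2): F1} for every threshold combination in the grid.

    pairs: iterable of (jaccard_score, cosine_score, is_true_duplicate).
    """
    results = {}
    for t1 in grid:
        for t2 in grid:
            tp = fp = fn = 0
            for jac, cos, label in pairs:
                predicted = jac > t1 and cos > t2
                if predicted and label:
                    tp += 1
                elif predicted and not label:
                    fp += 1
                elif label:
                    fn += 1
            results[(t1, t2)] = f1_from_counts(tp, fp, fn)
    return results


# Toy scored pairs (hypothetical values, for illustration only).
toy_pairs = [
    (0.95, 0.97, True),
    (0.90, 0.92, True),
    (0.88, 0.80, False),
    (0.72, 0.95, False),
    (0.86, 0.91, True),
    (0.60, 0.99, True),
]
grid = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
scores = sweep_thresholds(toy_pairs, grid)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Plotting `scores` as a heat map over (t1, t2) gives the kind of view summarized in Fig. 8, where a flat high-F1 region indicates that the method is not overly sensitive to the exact threshold choice.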

Fig. 8 Threshold sensitivity analysis for entity alignment. F1 scores across combinations of Jaccard (t1) and SBERT (t2) thresholds on 500 validation pairs, showing a stable high-performance region (t1 ∈ [0.80, 0.88], t2 ∈ [0.83, 0.90]) with an optimum around 0.85.

The entity alignment method detailed above effectively identifies sets of duplicate entities across our diverse data sources. The next crucial task is to merge these duplicates into a single, standardized representation. In this work, we adopt a straightforward yet effective merging strategy: for each set of synonymous entities, the term with the shortest string length is chosen as the canonical name. All other expressions are then linked to this canonical entity as aliases.
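The merging rule (shortest string as canonical name, remaining expressions as aliases) can be sketched in a few lines. The function name and the alphabetical tie-break between equally short names are assumptions, since the text does not specify how ties are resolved.

```python
def merge_duplicates(groups):
    """Map each group of synonymous names to {canonical_name: [aliases]}.

    The shortest string is chosen as the canonical name; ties are broken
    alphabetically (an assumption made here for determinism).
    """
    merged = {}
    for group in groups:
        names = sorted(set(group), key=lambda name: (len(name), name))
        canonical, aliases = names[0], names[1:]
        merged[canonical] = aliases
    return merged


print(merge_duplicates([["Chronic Obstructive Pulmonary Disease", "COPD"]]))
```

One consequence of this rule is that abbreviations tend to become canonical names, with the full expressions retained as aliases.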

This entire workflow, from initial data ingestion to final storage, constitutes our knowledge fusion process, which is depicted in Fig. 9. The process integrates data from various sources by applying our Jaccard-SBERT alignment model. Following the merging step, a final manual quality check is performed by experts to ensure the accuracy of the consolidated data before it is integrated into the final knowledge graph. During this expert review, any conflicting facts or relationships extracted from heterogeneous sources are adjudicated to ensure data integrity. In addition, raters flagged potentially harmful or prescriptive claims for exclusion, and alignment decisions were documented to support downstream auditability.

Fig. 9 Flowchart of the multi-source knowledge fusion process. Triplets from three heterogeneous sources (unstructured web data, semi-structured clinical data, and structured public graph data) are first normalized. An entity alignment method, based on a combination of Jaccard and SBERT similarity, is used to merge entities with different names but the same meaning. After a final quality assessment by experts, the unified data is stored in the Neo4j database.

Storage and visualization of LCKG

Neo4j is a widely used graph database whose storage structure comprises nodes and edges, which represent the entities and the relationships between them in a knowledge graph. In addition to its robust storage capabilities, Neo4j offers intuitive knowledge graph visualization features, enabling users to browse and edit graph content with enhanced clarity29. Another critical factor in selecting Neo4j for this study was its built-in declarative graph query language, Cypher, which allows complex operations on the graph database to be expressed in a straightforward manner and covers data management functions such as creation, deletion, modification, and retrieval. While our implementation uses Neo4j for its robust features, our knowledge construction framework is platform-agnostic: the output is a standardized set of triplets that can be readily imported into any other property graph database (e.g., Amazon Neptune, JanusGraph), making our core methodology widely applicable. To augment the native Cypher query language, this study incorporates the APOC (Awesome Procedures On Cypher) extension library, a plugin that provides a wealth of utilities, procedures, and functions for Neo4j graph databases. APOC is installed and activated by manually editing configuration files within the Neo4j server environment. Leveraging the APOC extension library, the consolidated lung cancer knowledge graph data is stored in the Neo4j graph database, thereby constructing a detailed and structured LCKG. In summary, LCKG harnesses Neo4j to achieve efficient storage and intuitive visualization, ensuring reliable preservation of extracted knowledge while creating a user-friendly environment conducive to querying, analyzing, and continuously enriching the lung cancer knowledge graph.
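To illustrate how a (head, relation, tail) triplet maps onto Cypher, the sketch below builds a parameterized `MERGE` statement as a string. It is an assumption-laden example rather than the storage code used in this study: the `Entity` label, the `name` property, and the `triplet_to_cypher` helper are hypothetical. Because plain Cypher does not accept a relationship type as a query parameter, the type is sanitized and backtick-quoted here; APOC offers procedures such as `apoc.merge.relationship` for dynamic types, although the text does not state which procedures were used.

```python
import re


def triplet_to_cypher(head: str, relation: str, tail: str):
    """Return (query, params) that idempotently stores one triplet.

    Node label `Entity` and property `name` are hypothetical choices.
    """
    # Replace runs of non-word characters so the relation is a safe type name.
    rel_type = re.sub(r"\W+", "_", relation).strip("_").upper()
    if not rel_type:
        raise ValueError("relation must contain at least one word character")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:`{rel_type}`]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triplet_to_cypher("lung cancer", "treated by", "chemotherapy")
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` means that re-importing the same triplet does not create duplicate nodes or edges, which matters when data from several sources overlap.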

Experiments

Data construction and preprocessing

Acknowledging the relative scarcity of knowledge graph research in the lung cancer domain, this study focuses on targeted knowledge graph construction. Data collection encompasses diverse information sources, specifically:

  1.

    Unstructured Web Data: Utilizing web crawling implemented in Python, this study systematically retrieves lung cancer-related medical data from domestic authoritative healthcare platforms such as “Haodf Online” and “Xunyiwenyao”. Taking “Haodf Online” as an example, by searching for the keyword “lung cancer” on the website, we successfully identified 8,558 doctors and experts specializing in lung cancer treatment from departments of Thoracic Surgery and Oncology. Subsequently, we individually accessed the consultation pages of these experts, parsed their HyperText Markup Language (HTML) resources, and ultimately gathered 5,000 unstructured data records rich in lung cancer-related knowledge. The raw collected data undergoes preliminary processing, with expansion methods employed to search for a broader range of vocabulary or terminology related to lung cancer, such as “pneumonia,” “tuberculosis,” “lung diseases,” and “respiratory system issues.” This enables a more comprehensive filtering of relevant content. Finally, irrelevant data is manually removed through a review process, enhancing the accuracy of the data. All personally identifiable information (PII), such as names, specific dates of birth, and contact information, is already anonymized by the platforms themselves as part of their terms of service. Our web scraping process collected only this pre-anonymized, publicly accessible text. Because web-scraped medical advice can reflect platform-specific and socio-demographic biases, we combined rule-based filtering with expert manual review to remove prescriptive or low-credibility content and to prioritize medically plausible, guideline-consistent statements. These steps mitigate—but cannot eliminate—selection and presentation biases inherent to public web sources.

  2.

    Semi-structured Clinical Lung Cancer Data: This study utilizes a clinical dataset on lung cancer treatment contributed by the renowned Traditional Chinese Medicine master, Zhou Zhongying. Each real case contains effective entities and relationships such as the names of Traditional Chinese Medicine (TCM) diseases, Chinese herbs, and formulas. To align with the requirements of knowledge graph construction, the dataset undergoes customized processing. Adhering to the structured standards of knowledge graph triplets, we perform rule-based cleaning and transformation, converting the original semi-structured data into a standardized knowledge representation format. Prior to being used in this study, the dataset was fully anonymized by the contributing institution. All patient identifiers were removed and replaced with non-identifiable codes.

  3.

    Structured Public Graph Data: From existing open-source projects for medical knowledge graphs, this study selectively extracts triplets pertinent to lung cancer. Ensuring the applicability and consistency of the data, these selected triplets are integrated and normalized, adopting a uniform data format for use in subsequent knowledge graph construction.

From the collated raw data, we curated a high-quality dataset of 49,860 instruction–response pairs specifically for the fine-tuning process. This curation followed a semi-automated, human-in-the-loop pipeline. The initial pool of candidate triplets was generated by applying rule-based scripts to our structured data sources and by prompting a baseline large language model with few-shot examples on the semi-structured and unstructured texts. The resulting candidates were then rigorously reviewed and corrected by trained annotators to ensure quality.

This rigorous process yielded a high-quality fine-tuning corpus where each instance is formatted as {“prompt”: “instruction…”, “response”: “[(triplet_1), (triplet_2),…]”}. The final corpus contains 49,860 examples, with 59.9% derived from unstructured web data, 21.1% from semi-structured clinical data, and 19.0% from structured public graph data. The corpus was subsequently divided into 80/10/10 train/validation/test splits (39,888/4,986/4,986 examples, respectively). Table 1 provides a detailed statistical breakdown of this dataset. This curated dataset formed the foundation for all subsequent model training and evaluation experiments detailed in this paper.
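The 80/10/10 division can be reproduced with a short shuffle-and-slice routine. The seed value and the function name are assumptions; the text does not say how the split was randomized or whether it was stratified by source.

```python
import random


def split_corpus(examples, seed=42):
    """Shuffle and split into 80/10/10 train/validation/test lists.

    Integer arithmetic avoids floating-point rounding in the split sizes.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)  # seed is an arbitrary, assumed value
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_corpus(range(49860))
print(len(train), len(val), len(test))
```

With 49,860 examples this yields the reported 39,888/4,986/4,986 sizes.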

Table 1 Detailed statistics of the fine-tuning dataset.

Knowledge graph evaluation methodologies

To conduct a comprehensive and rigorous evaluation of the lung cancer knowledge graph constructed using an LLM, this study employs a dual perspective approach with diversified assessment metrics, considering both objective quantitative measures and subjective expert evaluations, aimed at ensuring the reliability and practicality of the knowledge graph as a medical knowledge resource.

  1.

    Objective Scoring (Relation Extraction Performance Evaluation): For the crucial task of relation extraction (RE) in knowledge graph construction, this study utilizes Precision, Recall, and F1 Score as key indicators of its performance. Equations (6) to (8) define the calculation methods for these metrics:

    $$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$$
    (6)
    $$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
    (7)
    $$\text{F1 Score}=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
    (8)

    Here, TP represents the number of true positive samples correctly identified by the model, TN stands for true negative samples correctly classified, FP denotes false positive samples incorrectly labeled as positive, and FN signifies false negative samples misclassified as negative30. These statistics reflect the model’s accuracy and completeness in identifying relationships within the lung cancer knowledge graph.

  2.

    Subjective Scoring (Overall Knowledge Graph Quality Assessment):

    (1)

      Knowledge Graph Completeness: The knowledge graph’s comprehensiveness and breadth of coverage in the domain of lung cancer knowledge is evaluated by quantifying indicators such as the diversity of entity types included, the number of distinct relationship types covered, and the total quantity of integrated triplets.

    (2)

      Knowledge Graph Quality: To ensure the professionalism and accuracy of the knowledge graph content, authoritative experts from the medical field are invited to provide subjective scores for the lung cancer knowledge graph constructed by the LLM. Drawing upon their specialized backgrounds and clinical experience, these experts will thoroughly examine factual information within the knowledge graph, conducting a comprehensive assessment of its accuracy, usability, and clinical relevance. Independent scoring by multiple experts serves to validate the quality of the lung cancer knowledge graph, confirming that it meets professional standards of acceptance.
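For the objective scoring in Eqs. (6) to (8), one common way to obtain TP, FP, and FN for relation extraction is to compare the set of predicted triplets with the set of gold triplets. The sketch below assumes exact-match comparison of (head, relation, tail) tuples; the matching criterion actually used in this study is not stated, so this is an illustrative choice.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (head, relation, tail) tuples."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)   # triplets extracted and correct
    fp = len(pred_set - gold_set)   # triplets extracted but not in the gold set
    fn = len(gold_set - pred_set)   # gold triplets the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [
    ("lung cancer", "treatment", "chemotherapy"),
    ("chemotherapy", "drug", "paclitaxel"),
    ("lung cancer", "symptom", "cough"),
]
predicted = [
    ("lung cancer", "treatment", "chemotherapy"),
    ("chemotherapy", "drug", "paclitaxel"),
    ("lung cancer", "symptom", "fever"),
]
print(precision_recall_f1(predicted, gold))
```

Note that true negatives do not enter any of the three formulas, which is why set-based counting is sufficient here.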

Experimental environment

During the fine-tuning experiments with the LLMs, this study employed the PyTorch deep learning framework on a single NVIDIA GeForce RTX 4090 GPU with 24 gigabytes (GB) of graphics memory. The hyperparameters were carefully selected to maximize training efficiency and model performance. We employed 8-bit quantization, a critical step that reduced the model’s memory footprint from approximately 24 GB to a more manageable 6 GB. This reduction, in turn, enabled us to use a batch size of 64 and keep the peak GPU memory usage during training to a modest 16.2 GB. Following empirical testing, a learning rate of 2e-4 was chosen, providing an optimal balance between convergence speed and stability. To manage overfitting risk, we utilized the AdamW optimizer, which incorporates weight decay as a form of L2 regularization. Furthermore, the QLoRA method itself serves as a strong regularizer by significantly constraining the number of trainable parameters.
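As a back-of-the-envelope check, the reported drop from about 24 GB to about 6 GB matches simple weight-size arithmetic. The sketch assumes a round count of 6 billion parameters, 32-bit weights before quantization, and 8-bit weights after; none of these figures is stated explicitly in the text, and the estimate ignores activations, optimizer state, and quantization metadata.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 6e9  # assumed round figure for a "6B" model

print(weight_memory_gb(N_PARAMS, 32))  # full-precision weights
print(weight_memory_gb(N_PARAMS, 8))   # 8-bit quantized weights
```

The gap between the quantized weight size and the 16.2 GB peak usage is then attributable to everything this estimate leaves out, such as activations for the batch and the trainable adapter state.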

We employed an early stopping criterion based on the validation loss, which was monitored throughout the training process. Training was conducted for 3000 steps, a process that took approximately 5.1 h to complete, at which point we observed the validation loss beginning to plateau, indicating that further training risked overfitting. In terms of performance, the model achieved an average inference speed of approximately 45 tokens per second.

Experimental procedures

The experiments in this study encompass two principal research components aimed at comprehensively evaluating the performance benefits of prompts and fine-tuning in the construction of a lung cancer knowledge graph: Part I is an objective performance assessment, involving both horizontal and vertical experiments to conduct exhaustive quantitative comparisons of different methods’ performance on the Relation Extraction (RE) task. Part II shifts to a subjective performance evaluation, focusing on examining the changes in completeness and quality—two core dimensions—of the lung cancer knowledge graph constructed before and after applying prompts and fine-tuning engineering.

Regarding the baseline model, we selected ChatGLM for several key reasons. At the time our experiments were designed, it was a state-of-the-art, openly accessible model that demonstrated strong competitiveness in benchmark evaluations, particularly for Chinese language tasks. In all experiments reported in this paper, we use the official open-source ChatGLM-6B checkpoint provided in the official ChatGLM-6B GitHub repository, with the default tokenizer and configuration, so that our baseline is precisely specified and can be reproduced using the same model version. Its architecture and public availability made it a practical and powerful foundation for fine-tuning. Furthermore, ChatGLM provides superior Chinese token coverage compared with many other models, a critical feature for our predominantly Chinese dataset. These factors made it a representative and robust choice for developing our KGLM. Fine-tuning concluded at step 3,000, when the loss reached 0.0584, and the corresponding model weights were saved as KGLM. Subsequently, we compared model outputs before and after the integration of prompt engineering and fine-tuning. Representative examples of these improvements are shown in Table 2.

Table 2 Model response examples.

To verify the independent contribution of each design module in the prompt template (a. System Role; b. Define Triples and c. Output Format Requirements; d. CoT Reasoning), we conducted ablation experiments. The results are summarized in Table 3.

Table 3 Ablation experiment results of different prompt designs.

The “System Role” (a. Set System Roles) constrains the model to act as a “professional NLP expert specializing in knowledge graphs.” This focuses the model on the relevant domain, preventing the generation of out-of-scope or conversational text. As observed, its removal led to a 7% F1 score decrease because the model began identifying non-medical entities (like “patient ID” or “hospital name”) that, while present in the text, are irrelevant for a clinical knowledge graph, thereby reducing precision.

The “Triple Schema” and “Output Format Requirements” (b. Define Triples, c. Output Format Requirements) are critical for ensuring the output is structured and machine-readable. By providing a clear template (head, relation, tail) and specifying a JSON format, we guarantee the output can be automatically ingested into the knowledge graph. The 14% F1 score drop upon its removal was due to the model generating outputs in varied, unstructured formats (e.g., natural language sentences, lists without proper delimiters), which caused the downstream knowledge fusion process to fail, severely impacting both precision and recall.
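The dependence of downstream ingestion on a fixed output format can be made concrete with a small parser. The JSON layout assumed here, a list of objects with `head`, `relation`, and `tail` keys, is hypothetical; the exact schema requested by the prompt is not reproduced in this section.

```python
import json

REQUIRED_KEYS = ("head", "relation", "tail")  # assumed field names


def parse_triples(raw_output: str):
    """Parse model output into (head, relation, tail) tuples.

    Returns an empty list when the output is not a JSON list, and skips
    items that lack any required non-empty string field.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    triples = []
    for item in data:
        if isinstance(item, dict) and all(
            isinstance(item.get(key), str) and item[key].strip()
            for key in REQUIRED_KEYS
        ):
            triples.append(tuple(item[key].strip() for key in REQUIRED_KEYS))
    return triples


structured = '[{"head": "lung cancer", "relation": "treatment", "tail": "chemotherapy"}]'
free_text = "The treatment for lung cancer is chemotherapy."
print(parse_triples(structured))
print(parse_triples(free_text))
```

A free-text answer may contain the same fact, but a strict parser recovers nothing from it, which is consistent with the drop in both precision and recall reported when the schema and format modules are removed.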

The “Output Rules,” which introduce Chain-of-Thought (CoT) reasoning (d. Output Rules), serve as a direct stress test for handling the nested syntactic structures prevalent in Chinese clinical notes. Our ablation experiment provides a quantitative measure of this capability. The removal of CoT led to a 22% decrease in recall for nested attributes because the model defaulted to extracting only simple, direct relationships. It failed to “think step by step” to connect hierarchical information. For example, without CoT, the model might correctly extract that a disease’s treatment is “chemotherapy” but fail to link the subsequent attributes that “chemotherapy” involves the drug “paclitaxel” and that “paclitaxel” has a dosage of “175 mg/m²”. The CoT prompt forces this multi-step reasoning, proving essential for deconstructing these complex, nested patterns.
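The target of this multi-step decomposition can be illustrated by flattening a nested description into a chain of triples. The sketch shows only the desired output structure, not the model's reasoning. The nested input format, the relation names, and the use of "lung cancer" as the disease are assumptions for illustration.

```python
def flatten_nested(name, children):
    """Flatten a nested description into (head, relation, tail) triples.

    children maps a relation name to a list of (tail_name, tail_children)
    pairs, where tail_children has the same shape (hypothetical format).
    """
    triples = []
    for relation, tails in children.items():
        for tail_name, tail_children in tails:
            triples.append((name, relation, tail_name))
            # Recurse so attributes of the tail become their own triples.
            triples.extend(flatten_nested(tail_name, tail_children))
    return triples


nested = {
    "treatment": [
        ("chemotherapy", {
            "drug": [
                ("paclitaxel", {"dosage": [("175 mg/m²", {})]}),
            ],
        }),
    ],
}
for triple in flatten_nested("lung cancer", nested):
    print(triple)
```

A model that extracts only direct relationships would stop at the first triple; the two deeper triples are the nested attributes whose recall the ablation measures.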

While Table 3 quantitatively demonstrates the contribution of each prompt component, these aggregate scores do not directly reveal how the prompt-enhanced model improves extraction behavior at the sentence level—especially for Chinese medical texts that involve synonymy, Traditional Chinese Medicine (TCM) terminology, and nested clinical structures. To provide a more concrete and interpretable illustration of these improvements, we further conduct a focused micro-level comparison on a small set of representative Chinese clinical sentences. This comparison examines the extracted (head, relation, tail) triplets produced by different models under controlled cases designed to reflect these language-specific challenges.

Table 4 Micro-level comparison of triplet extraction results on representative Chinese clinical sentences.

As shown in Table 4, the prompt-enhanced KGLM exhibits consistent improvements across all three challenge types. For synonymy and abbreviation, the model correctly normalizes multiple mentions of the same disease into a single canonical entity. For TCM terminology, it accurately extracts domain-specific treatment names rather than collapsing them into generic therapy relations. For nested clinical structures, the explicit Chain-of-Thought guidance enables the model to decompose complex descriptions into multi-level triples, separating drugs from their dosage attributes. These sentence-level observations provide a concrete explanation for the recall gains observed in the ablation study, particularly for nested attributes, and illustrate how structured prompting translates into a cleaner and more coherent downstream knowledge graph.

Collectively, these results show that the full template’s superior performance (F1 = 0.82) is not incidental but a direct result of its systematic, multi-component design… The complete template was demonstrably superior to all simplified variants in terms of structured output and complex relationship processing.

Furthermore, this multi-component design is structured for potential adaptability. In principle, to apply this framework to other medical domains, such as cardiology or oncology, one would primarily need to update the few-shot examples within the “Define Triples” section with domain-specific entities and relations. The core structure of the prompt is intended to provide a generalizable scaffold for knowledge extraction across various medical specialties, though further research is required to empirically validate its effectiveness in other domains.

Results and analysis

Objective evaluation

In the objective evaluation, we benchmarked our KGLM against deep learning baselines representing two major prior paradigms in knowledge extraction. Our aim was to illustrate the architectural advantages of our generative approach. As a traditional baseline, we incorporated a Convolutional Neural Network (CNN), which exemplifies non-contextual deep learning methods31. To represent the transformer-based pipeline approach, we included BERT, whose contextualized embeddings revolutionized NLP and remain a strong baseline, particularly in the biomedical domain via models like BioBERT32,33. While newer encoder models like RoBERTa exist, BERT and CNN were chosen as landmark representatives of their respective architectural eras, allowing for a clear comparison across different fundamental approaches. In this initial work, we focused on such single-stage RE baselines and did not implement a full Chinese medical NER + CasRel/TPLinker pipeline on our corpus; we regard building and analyzing that pipeline as important future work for quantifying NER→RE error propagation more precisely.

This experiment quantitatively calculated the Precision, Recall, and F1 Score for the KGLM model with prompt engineering, BERT, CNN models, and their variants enhanced with Attention mechanisms in the knowledge graph relation extraction (RE) task. The experimental results are summarized in Table 5, visually illustrating the differences and strengths of each model in their capacity to build knowledge graphs.

Table 5 Results of the horizontal comparative experiment for models.

Based on the experimental data analysis shown in Table 5, the KGLM model with prompt engineering achieved higher F1 scores than BERT, CNN, and their attention-augmented variants in the Relation Extraction (RE) task. Although the improvement in F1 score is moderate, this approach reduces the need for manual rule design and feature engineering. By simplifying the overall knowledge graph construction workflow, it improves efficiency and automation within this specific domain, supporting more practical and scalable implementation of medical knowledge graph construction.

A deeper analysis highlights the architectural strengths of the KGLM framework. Traditional pipeline methods separate Named Entity Recognition (NER) and Relation Extraction (RE) into two stages, making them prone to error propagation: missed entities in NER block valid relations, while false positives generate spurious ones. Although a full NER + RE pipeline was not implemented for direct comparison, KGLM’s unified, end-to-end design jointly identifies entities and relations in a single step, effectively reducing such cascading errors. Moreover, unlike rigid, schema-bound classifiers, KGLM’s generative architecture leverages extensive pre-trained knowledge to handle linguistic variability and complex sentence structures with greater robustness. Guided by tailored prompt engineering, this flexibility enhances both precision and recall. Future work could include a quantitative comparison with dedicated pipeline baselines to further validate these advantages.

Subsequently, a vertical ablation experiment is conducted for comparison, the results of which are presented in Table 6. The fine-tuned KGLM model performs particularly well in the task of building a lung cancer-specific knowledge graph. After fine-tuning and the introduction of prompts, its F1 score rises from 0.57 to 0.82 compared to the original model, an absolute gain of 0.25 (25 percentage points), or a relative gain of about 44%. This substantial enhancement highlights the critical role of prompt engineering and fine-tuning in the generation tasks of LLMs. Due to the generative nature of large models, their high sensitivity to input sequences cannot be overlooked; even minute changes in the input sequence might result in significant discrepancies in the generated output34. The fine-tuned KGLM model effectively counters this uncertainty, while the introduction of prompt templates provides more stable and accurate knowledge extraction outcomes.

Table 6 Results of the vertical comparative experiment for models.

Subjective evaluation

In the subjective evaluation of the knowledge graph, focusing on the core dimensions of knowledge graph completeness and quality, a comprehensive assessment is made of the practical value and professional standard of the constructed knowledge graph.

Firstly, the completeness of the knowledge graph: This is achieved by comparing and analyzing the number of entity relationships and triplets contained in the knowledge graph constructed before and after model fine-tuning, as well as after the introduction of prompt templates. The aim is to highlight the critical role of prompts and fine-tuning in improving the performance of knowledge graph construction. Table 7 accurately illustrates the coverage of entity relationships and the quantity of triplets during the construction of the knowledge graph.

Table 7 Comparison of knowledge graph integrity.

Table 7 reveals that in terms of the extraction of entities, relations, and triplets for the knowledge graph, the KGLM model exhibits significant superiority over LLMs without fine-tuning. However, despite the noticeable increase in the number of entities, some remain unlinked through appropriate relational chains to other entities, suggesting that these isolated entities have not been fully integrated into the overall structure of the knowledge graph, thus compromising its coherence and practicality. Although the introduction of prompts results in fewer entities, relations, and triplets than the KGLM model, the post-prompt model displays greater acuity in entity recognition. The identified entities are better integrated into the knowledge graph, and they can comprehensively cover and identify the rich types of relations and various triplets in the text, contributing to the construction of a more complex and refined knowledge network. This finding aligns with the work of Trajanoska et al.35, who also demonstrated that LLMs can enhance knowledge graph construction. However, our work moves beyond their general framework by implementing a specific, fine-tuned model (KGLM) combined with a multi-component prompt strategy, which, as our subjective evaluation shows, leads to measurable improvements in the accuracy and clinical relevance of the extracted knowledge.

Secondly, the quality of the knowledge graph: To ensure a professional and credible assessment, we convened a panel of three senior experts for an independent, blinded evaluation. The panel was intentionally composed of both clinical and informatics experts to ensure a comprehensive assessment. To conduct the evaluation, experts were provided with a structured scoring form and a blinded, randomized sample of 50 extracted knowledge subgraphs from each model. The detailed evaluation rubric, including the specific questions and 1–10 scale anchor points for each criterion, is provided in Supplementary Appendix 1. Experts evaluated the knowledge graph based on three criteria: accuracy, usability, and clinical relevance.

  (1)

    Accuracy: Ensuring knowledge graph data quality, encompassing the accuracy of entity attribute values, correctness and logical rigor of triple relations, and source reliability (supported by authoritative medical literature, clinical guidelines, or databases).

  (2)

    Usability: Assessing the adaptability and service effectiveness of the knowledge graph in medical application scenarios (such as recommendation systems, question-and-answer systems), focusing on knowledge retrieval efficiency, interface friendliness, and scenario diversity (such as clinical decision-making, patient education, and research collaboration).

  (3)

    Clinical relevance: Evaluating the fit of the knowledge graph with current clinical practice, with attention to whether the disease definitions, diagnostic criteria, treatment plans, etc., in the knowledge graph align with prevailing clinical guidelines, as well as the timeliness of updates on cutting-edge knowledge. Gao et al.36 highlight the challenge of keeping medical knowledge graphs current and clinically relevant. Our subjective evaluation directly addresses this challenge by assessing our LCKG against clinical standards, and the positive expert scores (Table 8) suggest that our LLM-based approach is a promising direction for building and maintaining clinically useful knowledge resources.

The final score for each model is the average of the scores from the three experts. To maximize the objectivity and reliability of this subjective assessment, we implemented a rigorous evaluation protocol. This process began with a pre-evaluation calibration session, where all experts scored non-study samples and discussed their ratings to ensure a shared understanding and consistent application of the standardized evaluation instrument—a detailed and unified scoring rubric with clear operational definitions and anchors for each dimension (accuracy, usability, and clinical relevance). The formal evaluation was then conducted using a double-blind, randomized design: knowledge subgraph samples were randomly selected from each model’s output and presented to the raters without revealing their source, thereby mitigating potential bias.

To empirically validate the consistency of the expert ratings, we computed inter-rater reliability (IRR) using the intraclass correlation coefficient (ICC). Specifically, a two-way mixed-effects model for absolute agreement (ICC(3,k), with k = 3 raters) was applied across all blinded evaluation samples. The analysis yielded a high average-measures ICC of 0.87 (95% CI: 0.83–0.90), indicating good to excellent agreement among the three experts. This level of consistency suggests that the pre-evaluation calibration was effective and that the scoring criteria were applied in a stable and reliable manner.
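An average-measures ICC can be computed from a two-way ANOVA decomposition of the rating matrix, as sketched below. Two variants are returned because naming conventions differ: in the Shrout and Fleiss scheme, ICC(3,k) is the consistency form, whereas an absolute-agreement average-measures coefficient follows the McGraw and Wong definition. Which variant the statistical software used in this study reported is not stated, and the ratings in the example are made-up toy values.

```python
def icc_average_measures(ratings):
    """Return (consistency, absolute_agreement) average-measures ICCs.

    ratings: list of rows, one per rated sample, each holding k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("no variance between rated samples")

    consistency = (ms_rows - ms_error) / ms_rows
    agreement = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return consistency, agreement


# Toy example: three raters who rank samples identically but with offsets.
toy_ratings = [[2, 3, 4], [4, 5, 6], [6, 7, 8], [8, 9, 10]]
consistency, agreement = icc_average_measures(toy_ratings)
print(round(consistency, 3), round(agreement, 3))
```

In the toy data the raters differ only by a constant offset, so the consistency form is 1 while the absolute-agreement form is lower; this is the practical difference between the two definitions.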

As noted above, the final score for each model is the average of the three experts' scores. The “growth rate” refers to the percentage increase in the average score of the KGLM model with prompts relative to ChatGLM-6B. This part of the evaluation reveals the significant advantages in the depth and breadth of the knowledge generated during knowledge graph construction after the introduction of fine-tuning and prompts. The specific results are shown in Table 8. The scoring process was guided by a consistent rubric to ensure that all experts applied the same standards when evaluating accuracy, usability, and clinical relevance.
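Written out, the growth rate is the relative change in the mean expert score. The symbols \(\bar{S}_{\text{KGLM+prompt}}\) and \(\bar{S}_{\text{ChatGLM-6B}}\) are notation introduced here for illustration and denote the mean score of each model on a given criterion:

$$\text{Growth rate}=\frac{\bar{S}_{\text{KGLM+prompt}}-\bar{S}_{\text{ChatGLM-6B}}}{\bar{S}_{\text{ChatGLM-6B}}}\times 100\%$$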

Table 8 Knowledge graph quality expert scoring.

According to the data presented in Table 8, in terms of accuracy, the experts found that the KGLM with prompts exhibited more precise knowledge representation, with relatively fewer errors or inaccuracies. Regarding usability, the results imply that the KGLM with prompts demonstrated superior adaptability and service effectiveness in medical application scenarios. In terms of clinical relevance, the results indicated that the KGLM with prompts more effectively covered knowledge closely related to medical practice and could better serve medical contexts such as clinical decision support and patient management. Looking at the percentage growth rates, the KGLM with prompts showed improvements across all metrics compared to ChatGLM-6B. Specifically, accuracy improved by 29.3%, usability increased by 25.1%, and clinical relevance rose by 9.57%. These figures reflect the notable advancements of the KGLM with prompts in optimizing the quality of the knowledge graph, particularly in enhancing information accuracy and system usability. These findings offer valuable references for future research and applications in knowledge graphs, underscoring the importance of prioritizing and optimizing the accuracy and usability of knowledge graphs in model development to improve their real-world impact.

Visual analysis

In this study, we used the KGLM model to extract triplets from a corpus of unstructured text related to lung cancer and stored them in a Neo4j graph database for visualization, as shown in Fig. 10. The visualized graph reflects the model’s actual extraction performance during knowledge graph construction and highlights where improvement is needed, specifically how to refine the model so that it extracts triplet structures more precisely, thereby improving the structural quality and utility of the knowledge graph.

Fig. 10

A visualized subset of the knowledge graph constructed by the fine-tuned KGLM model before applying the structured prompt template. This graph shows that the model can identify relevant concepts (e.g., Lung Cancer, Chemotherapy), but some extracted nodes are long, non-standardized phrases, and the relationships are not always precise.

From Fig. 10, it can be observed that the KGLM model has some capacity for recognizing text segments associated with lung cancer. However, some of the extracted items are not standardized concepts but complete phrases carrying entity meanings. Examples include “treatment regimens targeting specific gene mutations,” “possibilities for standalone use or combination with other therapies,” and “stimulating the immune system to combat diseased cells.” While this content is closely tied to lung cancer diagnosis and treatment, it does not take the form of concept labels that correspond directly to standard entities or relationships in a knowledge graph. The reason is that, although the KGLM model has been fine-tuned for entity and relationship recognition and can generate relatively apt responses under specific prompts, it remains by nature a conversational LLM. As discussed by Guo et al.37, while massive AI models excel at understanding and generation, they have inherent limitations in maintaining the high-level precision and structural fidelity required for tasks like knowledge extraction. Our findings in Fig. 10 are consistent with this, showing that even a fine-tuned LLM can produce non-standardized outputs. This underscores why our subsequent step, the integration of a highly structured prompt template, is not just beneficial but essential for bridging the gap between a conversational model and a precise knowledge engineering tool.

To enhance the quality and accuracy of knowledge graph construction, prompt templates were introduced during knowledge extraction with the KGLM model so that only the more precise and readily integrable entity-relationship data were retained in the knowledge graph. The resulting knowledge graph, shown in Fig. 11, exhibits higher information accuracy and applicability. However, the system showed limitations in decoding nested semantic structures that require cascaded reasoning. A representative error occurred in processing the statement “Drug A (targets Epidermal Growth Factor Receptor, EGFR) inhibits metastasis”, where only the direct relationship (Drug_A→targets→EGFR) was captured, while the implicit therapeutic effect (EGFR_inhibition→reduces_risk_of→Metastasis) remained unextracted. Therefore, it remains essential to conduct rigorous screening and optimization of the extracted triplet set.
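One way such screening of the extracted triplet set could look is sketched below in Python. This is only an illustration under stated assumptions, not the procedure used in this study: the relation vocabulary `ALLOWED_RELATIONS`, the length threshold `MAX_ENTITY_WORDS`, and the helper names are hypothetical. The whitespace-based word count fits the English examples quoted above; Chinese text would need a character-based or tokenizer-based length check instead.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical schema: relation names and the length limit are illustrative.
ALLOWED_RELATIONS = {"targets", "treats", "has_symptom", "reduces_risk_of"}
MAX_ENTITY_WORDS = 4


def is_atomic(entity: str) -> bool:
    """Treat an entity as atomic if it is non-empty and at most a few words."""
    words = entity.split()
    return 0 < len(words) <= MAX_ENTITY_WORDS


def screen(triples):
    """Split triples into (kept, rejected) using the schema and length rule."""
    kept, rejected = [], []
    for t in triples:
        ok = (
            t.relation in ALLOWED_RELATIONS
            and is_atomic(t.head)
            and is_atomic(t.tail)
        )
        (kept if ok else rejected).append(t)
    return kept, rejected


candidates = [
    Triple("Drug A", "targets", "EGFR"),
    # Phrase-like head of the kind seen in Fig. 10: rejected as non-atomic.
    Triple("treatment regimens targeting specific gene mutations",
           "treats", "Lung Cancer"),
    # Relation outside the hypothetical schema: rejected.
    Triple("Drug A", "is_related_to", "Metastasis"),
]
kept, rejected = screen(candidates)
print(kept)
print(rejected)
```

A filter of this kind only removes malformed triples. It does not recover implicit relations such as EGFR_inhibition→reduces_risk_of→Metastasis, which would still require additional reasoning or manual review.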

Fig. 11

A visualized subset of the knowledge graph constructed by the KGLM model after integrating the structured prompt template. Compared to Fig. 10, the entities in this graph are more standardized and atomic (e.g., Smoking, Chest Pain). The relationships are more precise, resulting in a cleaner, more accurate, and more machine-readable knowledge graph structure.

The prompt-enhanced KGLM model better handles complex medical concepts in our examples, revealing intricate relationships among entities and attributes that traditional methods often miss, thus efficiently converting unstructured lung cancer data into structured triplets. As evident when comparing the raw output in Fig. 10 with the prompt-guided output in Fig. 11, our approach allows for a direct insight into how various pulmonary disease entities are interconnected and influence one another. While the model without the structured prompt (Fig. 10) identified relevant concepts, its output contained non-standardized phrases. In contrast, Fig. 11 demonstrates that with the integrated prompt, the resulting knowledge graph is composed of clean, atomic entities and precise relationships, making it far more accurate and suitable for automated querying and analysis. The sentence-level extraction behaviors underlying these visual improvements are further analyzed in Table 4.

Finally, this study integrates the semi-structured clinical data processed by rules, openly accessible graph resources, and the triplet knowledge extracted from unstructured web data using the KGLM model augmented with prompts, and builds from them a comprehensive knowledge graph database, LCKG, tailored to the characteristics of lung cancer diagnosis and treatment. Within this lung cancer knowledge graph, the interrelations between entities such as disease manifestations, typical symptoms, and corresponding therapeutic measures are represented in detail. Users can query the knowledge graph interactively, either through the Cypher query language or by clicking on entity node labels in the Neo4j graph database interface. Taking “lung cancer” as an example, executing the Cypher query “MATCH k=(s:Disease {Name: 'Lung Cancer'})-[p]->(o) RETURN k;” returns the one-hop paths that start at the “lung cancer” disease node. Part of the query result is displayed in the graph visualization interface shown in Fig. 12.

Fig. 12

Visualization of a Cypher query result for “Lung Cancer” in the final LCKG. This graph displays all entities directly connected to the central Lung Cancer node. The different colors represent different entity types: Diseases (green), Symptoms (red), and Examinations/Treatments (yellow). This demonstrates the knowledge graph’s ability to retrieve and display complex, interconnected information for a given query.
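To make the semantics of the query above concrete, the following Python sketch shows a parameterized form of the same Cypher statement together with an in-memory analogue over a toy triple list. The `Disease` label and `Name` property come from the query in the text; the toy graph, its relation names, and the helper `outgoing_paths` are hypothetical and do not reflect the actual contents of LCKG.

```python
# Parameterized form of the query from the text; $name is a Cypher parameter,
# which avoids embedding quoted strings directly in the statement.
QUERY = "MATCH k=(s:Disease {Name: $name})-[p]->(o) RETURN k"


def outgoing_paths(triples, name):
    """In-memory analogue of the directed one-hop pattern (s)-[p]->(o).

    Returns every (head, relation, tail) whose head equals `name`.
    Node labels are ignored in this simplified version.
    """
    return [(h, r, t) for (h, r, t) in triples if h == name]


# Hypothetical toy graph, not the contents of LCKG.
toy_graph = [
    ("Lung Cancer", "has_symptom", "Chest Pain"),
    ("Lung Cancer", "treated_by", "Chemotherapy"),
    ("Smoking", "risk_factor_for", "Lung Cancer"),
]
paths = outgoing_paths(toy_graph, "Lung Cancer")
for path in paths:
    print(path)
```

Because the pattern is directed, an edge that points into the Lung Cancer node (here the Smoking edge) is not matched; an undirected pattern such as `-[p]-` would be needed to include it. With the official Neo4j Python driver, a statement like `QUERY` would typically be executed through `session.run(QUERY, name="Lung Cancer")`, assuming an open session against the database.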

Conclusions

This study introduces the Knowledge Graph Language Model (KGLM), a framework developed by fine-tuning a Large Language Model for specialized knowledge graph construction. Our methodology employs sophisticated prompt templates to systematically extract and consolidate knowledge from large-scale, unstructured texts within the lung cancer domain. This textual knowledge is then fused with semi-structured clinical data and public structured graphs, creating a unified knowledge base within a Neo4j architecture tailored for lung cancer diagnostics and therapeutics. Empirically, our framework achieved higher F1 than the baselines we evaluated for RE on Chinese lung-cancer data. Overall, this applied framework demonstrates the feasibility of combining fine-tuning and prompt-based reasoning for domain-specific medical knowledge extraction in Chinese clinical contexts.

Limitations and future work

Within the broader LLM landscape, frontier medical models such as GPT-438, Med-PaLM39, and BioGPT40 have demonstrated strong performance primarily on clinical question-answering tasks (e.g., USMLE-style multiple choice and long-form responses). However, unlike these general medical reasoning models, our objective centers on schema-constrained triple extraction for knowledge-graph construction. Given this different focus, we did not perform direct head-to-head comparisons with proprietary API models in this study. Two further considerations support this choice: evaluating API-based models on clinical text entails data-privacy risks, and differences in prompt cost, latency, rate limits, and ongoing model updates impede strict like-for-like (“apples-to-apples”) comparisons.

At the time of our experimentation, ChatGLM-6B represented a strong and widely adopted open-weight foundation model for Chinese language tasks and therefore served as the primary baseline in this study. Since then, newer open-source Chinese LLMs, such as the Qwen and Baichuan series, have reported substantial improvements on general NLP and information extraction benchmarks. However, a fair and task-specific comparison with our system would require these models to be fine-tuned on our curated medical dataset, which remains an important direction for future work. Accordingly, the present study should be viewed primarily as a validation of the proposed fine-tuning and prompt-engineering framework for specialized knowledge extraction, rather than as an exhaustive comparison of all contemporary base models.

To further contextualize our results within the broader landscape of Chinese information extraction research, we additionally summarize the zero-shot performance of representative open-weight Chinese LLMs on standard IE benchmarks in Table 941. While ChatGLM-6B provides a robust foundation for fine-tuning in our framework, recent models such as Qwen and Baichuan2 demonstrate considerable potential in zero-shot settings, particularly for relation extraction (RE) tasks. Notably, Qwen-7B-Chat achieves an F1 score exceeding 91% on the DuIE2.0 benchmark under type-constrained evaluation, highlighting the promise of newer architectures. These observations suggest that future iterations of the LCKG framework could benefit from incorporating such models to further improve extraction accuracy and enhance generalization across broader medical domains.

Table 9 Zero-shot performance of representative open-weight Chinese LLMs on Chinese IE/RE benchmarks (F1-score).

Despite these encouraging findings, several other limitations remain. First, the model has been fine-tuned on a single disease domain, lung cancer, which may limit its generalizability. Our quantitative evaluation is restricted to an in-distribution test set drawn from the same data sources, and performance on truly out-of-distribution lung cancer texts or other external corpora has not yet been assessed. Although the framework is designed to be conceptually adaptable to other disease domains, we do not report empirical validation beyond lung cancer in this study. Systematic evaluation under distribution shift, including validation on external lung cancer corpora and additional disease domains, is therefore an important direction for future work. Second, because the training and evaluation data are primarily Chinese, linguistic and regional biases may influence the extracted knowledge; at present, these potential biases are qualitatively acknowledged but not quantified. To address these issues, future work will expand data coverage to additional cancers, other languages, and international corpora to assess adaptability and robustness more systematically. Third, the present study does not include a systematic quantitative error analysis. While representative examples have been discussed qualitatively, a detailed categorization of error types and their variation across different medical text genres remains to be conducted. Such diagnostic work will form a key focus of future research.

Moreover, the dynamic nature of biomedical knowledge poses ongoing challenges. To maintain the relevance of the Lung-Cancer Knowledge Graph (LCKG), future updates will involve periodic re-training, automated literature monitoring, and semi-automated expert validation. This is critical for incorporating new information, such as updated treatment guidelines, novel therapeutic approvals, and evolving diagnostic criteria, to ensure the graph remains a clinically relevant resource over time. Finally, it is important to note that both LCKG and KGLM are intended solely for research purposes; any clinical use requires prior expert evaluation and institutional approval.