Abstract
Seed quality standards are the essential basis for crop cultivation supervision. With the continuous development of China’s standard system, the number of seed quality standard documents has increased dramatically. However, the rapid growth and unstructured nature of standard documents hinder efficient query and semantic association. To address the lack of structured knowledge representation in the seed domain, this study proposes a Knowledge Graph (KG) construction framework for seed quality standards. First, a domain-specific ontology is constructed, defining 7 core classes and 12 relationship types to standardize semantic structure. Second, a hybrid knowledge extraction strategy is implemented: regular expressions are used for tabular and semi-structured data, while a BERT-BiLSTM-CRF model is employed for unstructured text. Experimental results demonstrate that the proposed model achieves an F1-score of 91.61% in Named Entity Recognition (NER), outperforming comparable baseline models. Finally, a KG containing 2436 nodes and 3011 relationships is stored in Neo4j, enabling multi-dimensional retrieval and visualization. The proposed framework significantly improves the accuracy of standard information retrieval and provides a digital foundation for intelligent quality management in the plantation industry.
Introduction
As the foundation of agricultural production1,2, seed quality is the primary guarantee for crop yield and food security. It also significantly affects farmer livelihoods and production motivation, making its standardized management a top priority for modern agriculture3. The supervision of seed quality relies heavily on standard documents, which define technical requirements, inspection rules, and testing methods. With the rapid evolution of agricultural technologies, the volume of seed quality standards has increased dramatically, updating at an accelerated pace. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) have proposed the concept of SMART standards (Standards Machine Applicable, Readable, and Transferable), which aims to convert current standards documents that can only be read and used by humans into actionable knowledge4. However, in China, these critical standards currently exist predominantly as unstructured paper or image files. This format renders conventional keyword-based retrieval ineffective for capturing semantic relationships, thus failing to accommodate rapid information updates or complex queries, and unable to provide users with the required traceability and interconnected insights.
The primary challenge in digitizing these standards lies in the heterogeneity of the data. A typical seed quality standard document consists of three types of information: (1) Structured tables; (2) Semi-structured metadata; and (3) Unstructured textual clauses. Existing digitization approaches often struggle to handle this complexity simultaneously. Rule-based methods are effective for structured metadata but lack generalization for complex text3. Conversely, while emerging Large Language Models (LLMs) show promise in information extraction, they suffer from “hallucinations” and lack the strict interpretability required for normative standard documents4. To address these challenges, Knowledge Graphs (KG) offer a promising solution by structuring multi-source heterogeneous data into a semantic network of entities and relationships5,6. Knowledge graphs have been extensively applied to accelerate digital transformation in domains such as emergency response and aerospace7. However, the seed quality standard domain remains underexplored, lacking a dedicated ontology and a robust, integrated extraction framework capable of handling its heterogeneous data characteristics.
To address the limitations of traditional manual consultation and to advance the digitization and structural transformation of standard documents, this study targets seed quality standard files and proposes a hybrid knowledge extraction framework that integrates rule-based and learning-based techniques. This framework adopts a cooperative, task-specific strategy tailored to the coexisting data types within the documents: structured tables, semi-structured metadata, and unstructured text. Precisely defined regular expressions are employed to ensure precision extraction from tables and semi-structured data. Concurrently, a BERT-BiLSTM-CRF model is leveraged to capture complex semantic relationships and domain-specific terminology from the unstructured textual content. This hybrid approach capitalizes on the complementary strengths of each method: regular expressions guarantee the reliability and consistency of critical metadata; the BERT layer enhances contextual understanding of agricultural terms; the BiLSTM layer effectively models long-range dependencies within regulatory clauses; and the CRF layer ensures the global optimality of entity label sequences. Together, they significantly improve the accuracy, robustness, and completeness of the knowledge extraction process. Based on the extracted knowledge, we construct a queryable and extensible knowledge graph, aiming to enable faster, more accurate information retrieval and more intuitive analysis of content relationships compared to conventional methods.
This study makes three primary contributions. First, it constructs a formal ontology for seed quality standards, comprising seven core classes and twelve relationship types, which provides a standardized schema for unifying heterogeneous standard data. Second, it designs a hybrid knowledge extraction framework. For unstructured text, information is extracted using a BERT-BiLSTM-CRF model, which achieves an F1-score of 91.61% and significantly outperforms conventional models in extracting entities such as inspection rules and drafting units. Finally, a knowledge graph containing 2,436 nodes and 3,011 relationships is built and stored in a Neo4j database. Its practical utility is demonstrated through multi-dimensional retrieval scenarios, showing that it enables more precise positioning of technical specifications compared to traditional document search methods.
Literature review
Digitization of standards and ontology construction
The transition from human-readable to machine-processable standards is a cornerstone of modern industrial and regulatory digital transformation. The concept of SMART Standards, advocated by ISO and IEC, has catalyzed research into converting conventional documents into structured, queryable knowledge bases.
Early research primarily focused on converting PDF documents into XML or structured formats using rule-based parsing. For instance, Fu and Qiang8 proposed an automated ontology construction approach that mines deep semantics from XML Schemas and instance documents through a rule-based intermediate model, enabling both conceptual-level generation and instance-level population. In the domain of food safety, Hu et al.9 proposed a food safety standard ontology to map regulatory constraints. More recent approaches leverage Natural Language Processing (NLP) and KG technologies to extract and link entities and relationships from heterogeneous standard texts. Fan et al.10 addressed the complexity and heterogeneity of airworthiness directive texts by proposing a domain knowledge-integrated large language model fine-tuning approach. Utilizing parameter-efficient adaptation techniques and prompt template enhancement, this method enables precise extraction and structuring of aviation fault knowledge, effectively supporting the construction of fault knowledge graphs and the development of intelligent maintenance management systems. This shift enables not only efficient retrieval but also advanced applications such as compliance checking, automated auditing, and intelligent decision support.
However, constructing an ontology for seed quality standards presents unique challenges. Unlike general industrial standards, seed standards contain highly heterogeneous data: rigorous numerical limits co-exist with descriptive inspection rules11. Previous ontologies often lack the granularity to represent the “Crop-Quality Indicator-Limit” ternary relationship effectively, which is crucial for the agricultural domain.
Named entity recognition in vertical domains
Knowledge extraction, particularly Named Entity Recognition (NER), is the core step in KG construction. Standard documents are inherently multi-modal, comprising structured tables, semi-structured metadata, and unstructured descriptive text.
Early works relied on dictionaries and regular expressions. Based on semantic rules and a multi-layer Conditional Random Fields model, Cui et al.12 effectively addressed the challenge of nested entity recognition in meteorological reports, providing crucial technical support for the automated construction of meteorological knowledge graphs. While these methods achieve high precision for structured metadata, they suffer from low recall when processing unstructured text with complex sentence structures13.
Statistical and deep learning models have become the de facto standard for NER from unstructured text, capturing contextual nuances and domain-specific semantics14. The advent of deep learning has shifted the paradigm towards sequence labeling models. The BiLSTM-CRF architecture15 became a baseline for many years. With the introduction of pre-trained language models, BERT (Bidirectional Encoder Representations from Transformers) has significantly improved performance by capturing context-dependent semantic representations16.
Recently, Large Language Models (LLMs) like GPT-4 have shown impressive zero-shot extraction capabilities. Li et al.17 proposed a method that integrates an enhanced joint extraction model with large language models, which significantly improves the accuracy and intelligence of fault diagnosis by constructing a high-quality knowledge base and enabling intelligent question-answering. However, for the standards domain, LLMs face challenges regarding “hallucination” (generating non-existent regulations) and lack of reproducibility18. In addition, learning-based models require substantial annotated data and may underperform on highly formulaic or tabular content, where rules are more efficient and reliable.
A hybrid framework that strategically combines rules for structured/semi-structured data and deep learning for unstructured text can therefore offer a balanced solution, maximizing both precision and recall while ensuring scalability. To the best of our knowledge, such a hybrid, multi-level extraction pipeline has not been systematically designed and evaluated for the domain of seed quality standards, where the integration of tabular quality indicators, metadata, and technical descriptions is critical.
Comparative analysis
To position our work within the current research landscape, we compare our proposed framework with representative studies from the past 5 years in Table 1.
As shown in the comparison, most existing works rely on a single extraction modality. Few address the hybrid nature of standard documents. Existing research exhibits three main gaps when considering the digitization of seed quality standards.
First, lack of integrated hybrid frameworks. Prior works typically employ a single extraction paradigm, failing to leverage the complementary strengths of rule-based and learning-based methods for the multi-format data inherent in standards.
Second, absence of formal, reusable ontologies for seed quality. While domain ontologies exist in food safety and aviation, seed quality standards lack a dedicated ontology to represent core entities and relationships. Previous ontologies lack granularity for agricultural-specific ternary relationships, hindering consistent knowledge representation.
Third, insufficient quantitative evaluation and reproducibility. Many studies present qualitative demonstrations or limited metrics, and few share code, data, or configurations. This limits independent verification—a critical requirement for scientific progress.
Our work distinguishes itself by integrating a hybrid extraction strategy and constructing a domain-specific ontology for seed quality.
Methodology
Data processing
Currently, knowledge graphs in the standards field are mainly built from existing standard documents, issued regulations, normative documents, and similar sources, and no publicly available dataset exists for the field of seed quality standards. This paper collects seed-oriented national standard documents as a data source from multiple standard websites, such as the National Standard Information Public Service Platform, the National Standardization Administration Committee, and the Industry Standard Information Service Platform. These documents contain the basic information of the standards; stipulate the terminology, quality requirements, test methods, and test rules in the field of seeds; divide seed categories according to crop type; and establish distinct quality standards for each category.
The study stored the seed-related standard document data obtained from the various platforms in PDF format. First, the tabular data in the documents are extracted, and necessary adjustments are made to ensure that the format is compatible with the graph database. The documents are then converted to TXT format, and text processing techniques are applied to segment the contents into clauses. At the same time, the already-processed tabular parts are removed, laying the foundation for the knowledge extraction work.
Since the seed quality standard domain lacks large amounts of labeled data, the raw text data were collected and organized in advance to ensure that the model obtains sufficient training material; the sorted data were then cleaned to remove invalid punctuation and characters, and finally annotated. All text data were labeled with the “BIO” scheme: non-entity characters were uniformly labeled “O”, the first character of each entity was labeled “B-entity name”, and the remaining characters of the entity were labeled “I-entity name”, yielding 1,530 labeled items in total.
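As a minimal illustration of the BIO scheme just described, the sketch below assigns one tag per character given entity spans; the helper `tag_bio` and the example span are ours for demonstration, not the authors' annotation tooling.

```python
# Illustrative BIO tagger: one tag per character, "O" for non-entity
# characters, "B-<label>" for the first character of an entity, and
# "I-<label>" for its remaining characters.

def tag_bio(text, entities):
    """`entities` is a list of (start, end, label) spans, end-exclusive."""
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# Example: label "maize" (characters 0-4) as a hypothetical Crop entity.
tags = tag_bio("maize seed", [(0, 5, "Crop")])
```

The same per-character convention applies to the Chinese standard clauses, where each character (rather than each word) receives a tag.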
Seed quality standard document characteristics and ontology construction
A standard is a normative document developed by negotiation and approved by a recognized body for common use to achieve the best order within a specific scope19. By hierarchy, standards can be divided into international standards, regional standards, national standards, industry standards, local standards, and enterprise standards. Standard documents differ from general text in content, form, and scope of application: they follow a specific layout format and drafting rules, with neatly structured content, timeliness, and accuracy. With the rapid development of science and technology, standard documents will continue to be formulated or revised; as normative documents, their content must remain professional, precise, and standardized.
Seed quality standard documents cover seed quality grading, quality requirements, test methods, test rules, and other aspects, involving many seed types across cash crops, grain crops, vegetables, and melons, which are diverse and complex. Therefore, to build the ontology structure of seed quality standard documents, it is necessary to define a complete and universal set of ontology concepts from the common elements of the documents.
Ontology construction20,21 is one of the core tasks of knowledge graph entity-relationship extraction; it defines the types of things in the standard documents and describes their properties. The construction of the seed quality standard document ontology is considered from two aspects: (1) Standardized document structure. Standard documents have a specific layout format and drafting rules, from which the essential information in their core elements can be filtered, including the relationships between the document and other documents, proposing information, attribution information, drafting units, and principal drafters. (2) Content commonality of the documents. According to the content shared between seed quality standard documents, the key contents related to seed quality can be extracted, including the quality requirements of seeds, inspection methods, and inspection rules.
To address the heterogeneity of seed quality standard documents and enable machine-readable knowledge representation, we constructed a domain-specific ontology following the SMART principles and competency questions (CQs) derived from user needs. Based on the SMART standard requirements and the competency questions, we defined 7 core classes that cover all critical elements of seed quality standards. Table 2 details their definitions, descriptions, and examples.
The core classes in Table 2 form the static foundation of the ontology, but their true value lies in the semantic relationships that connect them. These relationships are the dynamic soul of the ontology, enabling it to represent complex semantic information from seed quality standards. To visualize these connections, we constructed an Entity-Relationship (ER) Diagram (Fig. 1), which maps the core classes to their attributes and semantic relationships. This schema ensures that our knowledge extraction and graph construction are semantically consistent and goal-oriented.
Seed quality standards document ontology architecture. Note A square represents an entity type, a key header represents a relationship type, and an oval represents an attribute.
Hybrid knowledge extraction framework
The data of the seed quality standard document contains table-based structured data, text-based semi-structured data, and unstructured data. The knowledge extraction process for seed quality standard documents is shown in Fig. 2. Different extraction methods can be used to improve the extraction efficiency according to the data characteristics.
Seed quality standard file knowledge extraction process.
Structured data extraction
Table-based structured data. Tabular data, as part of structured data, can be obtained directly from standard documents through conversion. To address the lack of formal definitions, we define the seed quality knowledge graph as a directed graph \(G = \left( {E,R,A} \right)\), where:
- \(E = \left\{ {e_{1} ,e_{2} , \ldots ,e_{n} } \right\}\) denotes the set of entities (nodes).
- \(R = \left\{ {r_{1} ,r_{2} , \ldots ,r_{m} } \right\}\) denotes the set of relations (edges).
- \(A\) denotes the set of attribute values associated with entities.
By defining data nodes (entities) and edges (relationships), a knowledge graph can be constructed and visualized.
Seed quality standard documents contain structured data in the form of tables, whose content mainly includes the crop name, seed type, purity, germination rate, moisture, and other specific requirements. After converting the original document format and fine-tuning the layout, this data can be obtained directly from the tables. Once the entities and relationships are defined, the table content can be extracted with the pandas library; the nodes and edges are then written into the Neo4j database through a Python connection to build the knowledge graph. The collated table data is shown in Table 3.
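The table-to-graph step described above can be sketched as follows. The row fields, node labels, and the `HAS_INDICATOR` relationship name are illustrative assumptions rather than the actual schema of Table 3, and no live Neo4j connection is made here.

```python
# Sketch: turn collated table rows into Cypher MERGE statements for Neo4j.
# The fields and the HAS_INDICATOR relationship are hypothetical examples.

rows = [
    {"crop": "Maize", "indicator": "Germination rate", "limit": ">=85%"},
    {"crop": "Maize", "indicator": "Moisture", "limit": "<=13.0%"},
]

def row_to_cypher(row):
    # MERGE keeps nodes unique when the same crop appears in several rows.
    return (
        f"MERGE (c:Crop {{name: '{row['crop']}'}}) "
        f"MERGE (i:Indicator {{name: '{row['indicator']}'}}) "
        f"MERGE (c)-[:HAS_INDICATOR {{limit: '{row['limit']}'}}]->(i)"
    )

statements = [row_to_cypher(r) for r in rows]
# Each statement could then be executed with the official neo4j driver,
# e.g. session.run(stmt), against a running database.
```

In practice the rows would come from a pandas DataFrame loaded from the converted tables, as described in the text.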
Semi-structured data extraction
Text-based semi-structured data. The basic information of the seed quality standard document is written in a rigorous, standardized, and logical manner; it has specific structural characteristics and belongs to semi-structured data. This kind of data mainly describes and supplements the seed quality standard document and provides detailed information about the standard, so it is treated as the attribute information of the document. Owing to its prominent structural characteristics, regular expressions can be used to complete the extraction task. Regular expressions are text patterns used to describe and match specific patterns in strings. They consist of ordinary characters (e.g., letters and numbers) and special characters (called metacharacters), and can perform a variety of complex text-processing tasks, such as searching, replacing, validating, and extracting textual data. They are flexible, logical, and powerful, allowing complex string manipulation to be achieved quickly and straightforwardly.
Semi-structured data mainly comprises attribute classes: the basic information of the standard, including the name of the standard, the date of release, the date of implementation, the scope of application, the scope of provisions, and so on; its categories and definitions are shown in Table 2. Owing to its standardized structure and rigorous writing logic, it has specific structural characteristics and can be regarded as text-based semi-structured data. Therefore, regular expressions are used to extract this part of the data; the extraction rules are shown in Table 4.
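As a hedged illustration of this regex-based extraction, the sketch below pulls a standard code and two dates from a sample sentence; the patterns and the sample text are invented for demonstration and are not the rules of Table 4.

```python
import re

# Illustrative metadata extraction with regular expressions.
# Patterns and sample text are assumptions, not the rules from Table 4.
text = "GB 4404.1-2008, released 2008-01-22, implemented 2008-09-01."

patterns = {
    "standard_code": r"GB\s?[\d.]+-\d{4}",             # national standard code
    "release_date":  r"released\s(\d{4}-\d{2}-\d{2})",
    "impl_date":     r"implemented\s(\d{4}-\d{2}-\d{2})",
}

def extract_metadata(text, patterns):
    out = {}
    for field, pat in patterns.items():
        m = re.search(pat, text)
        if m:
            # use the capture group when present, else the whole match
            out[field] = m.group(1) if m.groups() else m.group(0)
    return out

meta = extract_metadata(text, patterns)
```

For the actual Chinese standard documents, the patterns would match the corresponding Chinese field markers instead of English keywords.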
Unstructured data extraction
The body of the standard document does not follow a predefined data model; its content takes various forms, lacks a fixed format and structure, and is therefore unstructured data that cannot be extracted by the two methods above. Machine learning methods can automatically discover patterns and relationships in the data by learning from annotated samples, thereby achieving effective extraction of unstructured data. Accordingly, a machine learning model is chosen for the knowledge extraction task on seed quality standard documents. The BERT-BiLSTM-CRF model combines the powerful contextual semantic understanding of BERT, which better captures the meanings of Chinese characters; the ability of BiLSTM to handle long-distance sequential dependencies, so that information is not lost even in long passages; and a CRF layer that ensures annotation consistency and a globally optimal label sequence. Together they form a robust information extraction framework that improves the accuracy and consistency of unstructured data extraction tasks.
The content of seed quality standard documents, apart from the data discussed above, is unstructured. Structured and semi-structured data usually have a clear format and semantic rules, while unstructured data has no fixed format or rules and is difficult to utilize directly. To improve the efficiency of information extraction from unstructured data, this paper adopts the BERT-BiLSTM-CRF model for relational extraction. Before fine-tuning the model, part of the data must be manually annotated, and the extracted information is stored and represented in triple format (entity, relationship, entity). The entity labeling rules and the specific triple forms are shown in Tables 5 and 6, respectively.
The model combines the contextual understanding capability of pre-trained language models, the long-range dependency capture capability of sequence modeling, and the globally optimal annotation capability of conditional random fields. It can identify the relationships between entities while extracting them.
BERT-BiLSTM-CRF named entity recognition model
To address the task of entity extraction from unstructured seed quality standard documents, this study adopts the BERT-BiLSTM-CRF model, a widely used architecture for sequence labeling tasks in NLP. The model integrates contextualized semantic understanding, sequential dependency modeling, and global label constraint decoding, enabling accurate recognition of domain-specific entities. This section details its architecture, training configuration, and functionality.
Model architecture
The BERT-BiLSTM-CRF model consists of three sequential modules (Fig. 3), designed to progressively transform raw text into structured entity labels.
Model frame diagram of BERT-BiLSTM-CRF.
The input standard clauses are first tokenized and fed into a pre-trained BERT model, which captures bidirectional contextual information via multi-head self-attention22,23. This pre-trained model (12 layers, 768 hidden units) converts each token into a high-dimensional embedding that encodes semantic nuances. For example, in the sentence “The quality of secondary hybrid seeds meets the first-grade standard”, BERT contextualizes “secondary hybrid seeds” as a Crop entity and “first-grade standard” as a Standard entity, even when these terms appear in complex syntactic structures.
The BERT-generated embeddings are then passed to a BiLSTM network24 with a hidden size of 128. Unlike unidirectional LSTMs, BiLSTM processes the text in both forward (left-to-right) and backward (right-to-left) directions, capturing long-range dependencies between entities. This module enhances the model’s ability to recognize entities that span multiple tokens or appear in non-contiguous positions.
Finally, a CRF layer is applied to the BiLSTM outputs to decode the optimal entity label sequence. Unlike standalone BiLSTM which predicts labels independently for each token, CRF enforces global constraints on label transitions. For instance, in the sentence “Maize seeds must meet GB 4404.1-2021”, CRF ensures “Maize” (Crop) and “GB 4404.1-2021” (Standard) are labeled as separate, sequentially coherent entities, avoiding invalid label sequences.
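The CRF layer's global decoding can be illustrated with a minimal Viterbi sketch that forbids invalid BIO transitions (e.g., "I-Crop" may only follow "B-Crop" or "I-Crop"). The label set and emission scores below are toy values, not outputs of the trained model, and the transition rule stands in for the learned CRF transition matrix.

```python
# Minimal Viterbi decoder showing how a CRF enforces valid BIO sequences.

LABELS = ["O", "B-Crop", "I-Crop"]
NEG_INF = float("-inf")

def allowed(prev, cur):
    # "I-X" is only valid directly after "B-X" or "I-X".
    if cur.startswith("I-"):
        return prev in (f"B-{cur[2:]}", cur)
    return True

def viterbi(emissions):
    """emissions: list of {label: score} dicts, one per token."""
    # Treat the sentence start as preceded by "O", so "I-*" cannot begin.
    best = {
        lab: ((emissions[0][lab], [lab]) if allowed("O", lab)
              else (NEG_INF, [lab]))
        for lab in LABELS
    }
    for scores in emissions[1:]:
        nxt = {}
        for cur in LABELS:
            cands = [
                (s + scores[cur], path + [cur])
                for lab, (s, path) in best.items()
                if allowed(lab, cur)
            ]
            nxt[cur] = max(cands)
        best = nxt
    return max(best.values())[1]

# Although "I-Crop" has the highest raw score at token 0, the transition
# constraint forces a sequence that starts with "O" or "B-Crop".
emissions = [
    {"O": 0.1, "B-Crop": 0.5, "I-Crop": 0.9},
    {"O": 0.2, "B-Crop": 0.1, "I-Crop": 0.8},
]
path = viterbi(emissions)
```

A standalone BiLSTM picking the per-token argmax would output the invalid sequence ["I-Crop", "I-Crop"]; the constrained decode instead yields a well-formed entity.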
Training configuration
To ensure reproducibility, the model was trained with the following configurations:
- Optimizer: AdamW with weight decay (λ = 0.01) to prevent overfitting.
- Learning Rates: 2 × 10−5 for the BERT pre-trained layers and 1 × 10−3 for the BiLSTM-CRF layers.
- Batch Size & Sequence Length: 16 samples per batch, with a maximum sequence length of 256 tokens.
- Training Dynamics: 20 epochs with early stopping (patience = 3) to avoid overfitting; validation loss monitored on a held-out dataset (20% of total annotations).
- Hardware & Framework: Experiments conducted on an NVIDIA RTX 3090 GPU using PyTorch 1.12.0.
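The early-stopping rule above (patience = 3 on validation loss) can be sketched as a small helper; `stopping_epoch` is an illustrative function of ours, not the authors' training code.

```python
# Early stopping with patience = 3: stop once validation loss has failed
# to improve for 3 consecutive epochs.

def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, or None."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Loss improves for three epochs, then fails to improve for three
# consecutive epochs, so training stops at epoch 6.
losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.66]
```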
Functionality in seed standard processing
The trained BERT-BiLSTM-CRF model can automatically extract critical information from a large amount of unstructured text, significantly improving the efficiency and accuracy of information extraction. Manual extraction of information is not only time-consuming and labor-intensive but also prone to errors, and the automated processing of the model can significantly reduce the manual workload. For the currently existing unstructured seed quality standard documents that have not yet been processed, the trained model can be immediately put into use to extract essential information from them automatically. As time goes by, new seed quality standard documents will be released, which are also primarily unstructured, and the relevant information can also be extracted and updated by the model to achieve dynamic updating and maintenance of the data and to reduce human errors and delays in operation.
Results and evaluation
Model performance
To objectively evaluate the performance of the proposed knowledge extraction model, we adopted standard evaluation metrics: Precision (P), Recall (R), and F1-score (F1). The calculation formulas are as follows:

\(P = \frac{TP}{TP + FP}\), \(R = \frac{TP}{TP + FN}\), \(F1 = \frac{2 \times P \times R}{P + R}\)

where TP represents true positives, FP false positives, and FN false negatives.
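A worked example of these metrics, using made-up entity-level counts (not results from the paper):

```python
# Precision, recall, and F1 computed from entity-level counts.

def prf1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# 90 correctly extracted entities, 10 spurious, 8 missed.
p, r, f1 = prf1(tp=90, fp=10, fn=8)
# p = 0.900, r ≈ 0.918, f1 ≈ 0.909
```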
The experiments were conducted on the dataset described in “Data processing” section, utilizing the training/validation/test split of 8:1:1.
To validate the effectiveness of the BERT-BiLSTM-CRF model, we compared it against two mainstream models commonly used in the standards domain. The experimental results are presented in Table 7.
BiLSTM-CRF: A classic sequence labeling model without pre-trained embeddings, using random initialization for word vectors.
BERT-CRF: A model that uses BERT for embeddings but lacks the BiLSTM layer for sequence feature capturing.
As shown in Table 7, our proposed model achieves the best performance across all metrics. Compared to BiLSTM-CRF, our model improves the F1-score by approximately 3%. This demonstrates that the pre-trained BERT layer effectively captures the rich semantic information of agricultural terms, resolving the polysemy issues that traditional embeddings fail to handle. Compared to BERT-CRF, the addition of the BiLSTM layer results in a 2.8% improvement in F1-score. This indicates that BiLSTM is crucial for capturing long-distance dependencies in standard clauses, such as complex “Inspection Rules” that span multiple lines.
To further analyze the model’s robustness, we evaluated the extraction performance for different entity types. The F1-scores for key entities are shown in Fig. 4 and Table 8.
Comparison of F1-scores across different models.
The model achieves near-perfect performance on highly structured entities like “Standard Code”. Notably, even for complex entities like “Inspection Rules”, the model maintains an F1-score of nearly 90%, proving the efficacy of the hybrid architecture.
Knowledge graph
Knowledge storage
In this paper, a Neo4j graph database is used for knowledge storage: all the knowledge extracted from seed quality standard documents is integrated and organized into triples, i.e., entity-relationship-entity, and stored in Neo4j, mapping the structured knowledge onto the graph database25. In Neo4j, labels define the types of nodes, which helps to filter and query nodes quickly. Nodes and their attributes correspond to entities and their properties, edges correspond to inter-entity relationships, and the storage scheme is shown in Table 9. The final knowledge graph of seed quality standard documents contains 2436 nodes and 3011 entity relationships.
Knowledge graph visualization
The Neo4j database provides a visualization tool for the knowledge graph of seed quality standard documents. Figure 5 shows some nodes and edges of the graph. Nodes of the same color indicate the same type of knowledge entity, and the connecting lines between nodes indicate the relationships between them. The knowledge entities include the standard code, proposing organization, attributing unit, drafting unit, drafter, test method, and test rule, among others; the entity relationships include revision, citation, and drafting relationships. By adding query conditions, the relevant information of a standard can be displayed visually. Clicking a “standard code” entity node reveals its attributes, including the release date, implementation date, standard name, scope of provisions, and scope of use. Through this information, users can fully understand the formulation process of a seed quality standard, the institutions and personnel involved, and the testing techniques and rules adopted. This visualization makes the knowledge structure of seed quality standard documents clear, helping relevant personnel carry out in-depth analysis and research.
Local visualization of the knowledge graph.
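A multi-dimensional retrieval of the kind described above could be issued as a parameterized Cypher query; the labels, relationship types, and property names below are assumptions about the graph schema (not taken from Table 9), and no live database connection is made in this sketch.

```python
# Example multi-dimensional retrieval: for a given crop, find the quality
# indicators and limits specified by its standard. Schema names here
# (Standard, Crop, APPLIES_TO, HAS_INDICATOR) are illustrative.

CYPHER = """
MATCH (s:Standard)-[:APPLIES_TO]->(c:Crop {name: $crop})
MATCH (s)-[r:HAS_INDICATOR]->(i:Indicator)
RETURN s.code AS standard, i.name AS indicator, r.limit AS limit
"""

def build_query(crop):
    # With the official neo4j driver this would be executed as:
    #   with driver.session() as session:
    #       records = list(session.run(CYPHER, crop=crop))
    # Here we only return the query and its parameters.
    return CYPHER, {"crop": crop}

query, params = build_query("Maize")
```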
Conclusion and discussion
Conclusion
To address the challenges of data heterogeneity and inefficient semantic retrieval in seed quality standards, this study proposed a comprehensive framework for constructing a domain-specific KG. The key contributions are summarized as follows:
First, we designed a seed quality standard ontology that unifies structured parameters and unstructured clauses. This ontology, validated through SMART principles and competency questions, defines 7 core classes and 12 semantic relationships, forming a semantic schema to model the full lifecycle of seed standards.
Second, we developed a hybrid knowledge extraction model combining rule-based methods and deep learning. This model achieved an F1-score of 91.61% in entity and relationship extraction, outperforming baseline models by leveraging domain rules for structured data and contextual learning for unstructured text.
Third, we successfully built a KG with over 2,400 nodes, demonstrating its practical value in facilitating accurate and intelligent standard retrieval.
Overall, this research provides a viable technical path for the “SMART” transformation of agricultural standards. For regulators, it offers a tool to visualize standard relationships; for seed companies and farmers, it simplifies the complex process of consulting quality requirements, thereby contributing to the digital management of the plantation industry.
Interpretation of results
The experimental results demonstrate that our proposed hybrid framework significantly enhances the digitization of seed quality standards.
The BERT-BiLSTM-CRF model achieved an F1-score of 91.61%, which validates our hypothesis that pre-trained language models can effectively handle the semantic complexity of agricultural texts. Specifically, the high recognition rate for “Inspection Rules” indicates that the model successfully captured the boundary features of long, descriptive clauses, which was a major bottleneck in previous rule-based systems.
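A sequence tagger of this kind typically emits BIO tags per token, which must be decoded into entity spans before they enter the graph. The following is a generic decoding sketch; the tag set (`TestMethod`) and the sample token sequence are illustrative, not the paper's label scheme.

```python
# Sketch: decode a BIO tag sequence (the usual output of a
# BERT-BiLSTM-CRF tagger) into (entity_text, entity_type) spans.
# The tag names and sample tokens are illustrative assumptions.

def bio_to_spans(tokens, tags):
    """Collect contiguous B-/I- runs into entity spans."""
    spans, current, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # new entity begins
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [tok], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(tok)             # entity continues
        else:                               # O tag or type mismatch
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [], None
    if current:                             # flush trailing entity
        spans.append(("".join(current), current_type))
    return spans

tokens = ["净", "度", "分", "析"]
tags = ["B-TestMethod", "I-TestMethod", "I-TestMethod", "I-TestMethod"]
entities = bio_to_spans(tokens, tags)
```

The CRF layer's main contribution is exactly at this boundary level: it penalizes invalid transitions such as `O → I-TestMethod`, which is why long descriptive clauses like inspection rules are segmented more reliably than with a softmax head alone.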
Furthermore, the constructed Knowledge Graph (2,436 nodes) successfully transforms static PDF documents into a queryable semantic network, enabling precise answers to competency questions that traditional keyword search engines fail to address.
Unlike general-purpose KG construction methods that rely solely on text extraction, our approach integrates Regular Expressions for tabular data. This hybrid strategy ensures high accuracy for critical numerical limits, which is a strict requirement for standard enforcement. Compared to recent LLM-based extraction attempts, our supervised learning approach offers better reproducibility and explainability. While LLMs are powerful, they are prone to generating non-existent standard clauses. In contrast, our pipeline extracts only what is explicitly present in the text, ensuring the normative authority of the standards.
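The rule-based branch for numerical limits can be illustrated with a single pattern. The regex below is a simplified assumption of what such a rule looks like, not the paper's actual expression; the indicator names and thresholds in the sample row are likewise hypothetical.

```python
# Sketch: a regular expression extracting numerical quality limits
# (indicator, comparator, value, unit) from table-like text. The
# pattern and the sample row are illustrative assumptions.
import re

LIMIT_RE = re.compile(
    r"(?P<indicator>[\u4e00-\u9fff]+)"   # CJK indicator name, e.g. purity
    r"\s*(?P<op>[≥≤><=])"                # comparator
    r"\s*(?P<value>\d+(?:\.\d+)?)"       # numeric threshold
    r"\s*(?P<unit>%)?"                    # optional percent unit
)

row = "净度 ≥ 99.0% 发芽率 ≥ 85%"
limits = [m.groupdict() for m in LIMIT_RE.finditer(row)]
```

Because every extracted value is matched verbatim against the source text, a failed match surfaces as a missing field rather than a plausible-looking fabricated threshold, which is the reproducibility property the paragraph above argues for.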
Limitation and future work
Despite these achievements, three limitations should be noted:
1. Data scope constraints: The current dataset focuses on national and industry-level standards, excluding local and enterprise standards, which often have non-standardized formats. This limits the generalizability of the KG to regional or enterprise-specific scenarios.
2. PDF parsing dependency: Knowledge extraction performance relies heavily on PDF quality. Scanned documents with poor OCR accuracy introduce noise, reducing extraction precision for historical standards.
3. Static knowledge maintenance: The KG is currently static and cannot automatically update with standard revisions. Manual intervention is required to handle dynamic changes, hindering real-time usability.
To address these limitations, future work will focus on three directions:
1. Multimodal knowledge extraction: Integrate computer vision (CV) techniques to process non-text elements in standards, such as diagrams and scanned tables, expanding data coverage to older or unstructured documents.
2. Incremental KG updating: Develop an automated mechanism to monitor standard revisions and update the KG incrementally, including handling relationships like "standard A supersedes standard B" and revising outdated thresholds.
3. Knowledge-driven question answering (KBQA): Build a natural language interface atop the KG to support user-centric queries and decision-making, enhancing practical utility for end-users.
Data availability
The data that support the findings of this study are available from the corresponding author, Qiong He, upon reasonable request.
References
Niu, S. et al. Research on a lightweight method for maize seed quality detection based on improved YOLOv8. IEEE Access 12, 32927–32937 (2024).
Gang, Y., Chen, C. & Weiyue, W. Research on the protection and utilization of agricultural germplasm resources under the action of seed industry revitalization—Taking Yangzhou City, Jiangsu Province as an Example. Jiangsu Agric. Sci. 52, 20–27 (2024).
Maredia, M. K. & Bartle, B. Excess demand amid quality misperceptions: the case for low-cost seed quality signalling strategies. Eur. Rev. Agric. Econ. 50, 360–394 (2023).
Jeon, K. et al. A relational framework for smart information delivery manual (IDM) specifications. Adv. Eng. Inform. 49, 101319 (2021).
Liu, Q., Li, Y., Duan, H., Liu, Y. & Qin, Z. Knowledge graph construction techniques. J. Comput. Res. Dev. (2016).
Peng, C., Xia, F., Naseriparsa, M. & Osborne, F. Knowledge graphs: Opportunities and challenges. Artif. Intell. Rev. 56, 13071–13102 (2023).
Li, X. et al. Exploration and practice of standard digitalization application in the aviation industry. In Information Technology and Standardization 68–72, 78 (2022).
Fu, Z. & Qiang, L. Constructing ontologies by mining deep semantics from XML schemas and XML instance documents. Int. J. Intell. Syst. 37, 661–698 (2021).
Hu, D., Weng, C., Wang, R., Song, X. & Qin, L. Construction method of national food safety standard ontology. In Green, Pervasive, and Cloud Computing, GPC 2022 (eds Yu, C., Zhou, J., Song, X. & Lu, X.) Vol. 13744, 50–66 (Springer International Publishing, Cham, 2023).
Fan, Y., Sun, Y., Mi, B. & Fu, X. Aircraft fault knowledge graph construction based on large language model incorporating Chinese airworthiness knowledge. Int. J. Softw. Eng. Knowl. Eng. 1, 2. https://doi.org/10.1142/S0218194025500962 (2025).
da Silva, A. R. On testing for seed sample heterogeneity with the exact probability distribution of the germination count range. Seed Sci. Res. 30, 59–63 (2020).
Cui, M. et al. Semantic rule-based information extraction for meteorological reports. Int. J. Mach. Learn. Cybern. 15, 177–188 (2024).
Nastou, K., Koutrouli, M., Pyysalo, S. & Jensen, L. J. Improving dictionary-based named entity recognition with deep learning. Bioinformatics (Oxford, England) 40, ii45–ii52 (2024).
Keraghel, I., Morbieu, S. & Nadif, M. Recent advances in named entity recognition: A comprehensive survey and comparative study (2024).
Zhu, Y. A knowledge graph and BiLSTM-CRF-enabled intelligent adaptive learning model and its potential application. Alex. Eng. J. 91, 305–320 (2024).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (2019).
Li, L., Haruna, A., Ying, W., Noman, K. & Li, Y. Knowledge graph-driven fault diagnosis for aviation equipment: Integrating improved joint extraction with large language model. J. Ind. Inf. Integr. 50, 101039 (2026).
Li, J., Cheng, X., Zhao, X., Nie, J.-Y. & Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H., Pino, J. & Bali, K.) 6449–6464 (Association for Computational Linguistics, Singapore, 2023). https://doi.org/10.18653/v1/2023.emnlp-main.397.
Kelly, Y., O’Rourke, N., Flynn, R., Hegarty, J. & O’Connor, L. Definitions of health and social care standards used internationally: A narrative review. Int. J. Health Plan. Manag. 38, 40–52 (2023).
Gao, S., Ren, G. & Li, H. Knowledge management in construction health and safety based on ontology modeling. Appl. Sci. Basel 12, 8574 (2022).
Zhou, D. et al. Ontology Reshaping for Knowledge Graph Construction: Applied on Bosch Welding Case. In The Semantic Web—ISWC 2022 770–790 (Springer, Cham, 2022). https://doi.org/10.1007/978-3-031-19433-7_44.
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019).
Xu, S., Zhang, C. & Hong, D. BERT-based NLP techniques for classification and severity modeling in basic warranty data study. Insur. Math. Econ. 107, 57–67 (2022).
Huang, Z., Xu, W. & Yu, K. Bidirectional LSTM-CRF models for sequence tagging. Preprint at https://doi.org/10.48550/arXiv.1508.01991 (2015).
Monteiro, J., Sa, F. & Bernardino, J. Experimental evaluation of graph databases: JanusGraph, Nebula Graph, Neo4j, and TigerGraph. Appl. Sci. Basel 13, 5770 (2023).
Acknowledgements
The authors would like to thank the Beijing Knowledge Management Research Base for their assistance with the study.
Funding
The research is funded by the Beijing Municipal Education Commission Research Plan General Project (Grant Number: KM202411232007).
Author information
Authors and Affiliations
Contributions
Qiong He: Writing—review and editing, supervision, funding acquisition. Zhenwei Yang: Writing—original draft, data collection, data curation, software, visualization, conceptualization. Jian Zhang: Writing—original draft, Software, methodology, validation.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yang, Z., He, Q. & Zhang, J. Construction and application of knowledge graph for seed quality standard documents. Sci Rep 16, 5997 (2026). https://doi.org/10.1038/s41598-026-37084-y
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-026-37084-y