Abstract
Corpora are crucial in computational linguistics and natural language processing. With the advancement of large language models (LLMs), high-quality human-annotated corpora have become essential for training high-performance general and domain-specific LLMs. This paper presents the construction of a part-of-speech (POS) tagged corpus based on the Twenty-Four Histories and their corresponding modern Chinese translations. First, in the data preprocessing stage, methods such as regular expressions, language models, and manual proofreading were employed to ensure data quality. In addition, task-specific annotation guidelines were established to standardize the POS tagset. Subsequently, the distribution patterns at the lexical level in the constructed corpus were explored from dimensions including word length, word frequency, POS tag distribution, word co-occurrence frequency, POS tag co-occurrence frequency, and word collocation relationships. Finally, we discuss potential applications. This corpus is released to support digital humanities research on ancient Chinese and to facilitate the intelligent processing of classical texts.
Introduction
The strategic value of data has gained formal recognition at the highest levels of government; in April 2020, China’s Central Committee and State Council promulgated the Guidelines on Market-Oriented Allocation of Production Factors1 (Guidelines on Market-Oriented Allocation of Production Factors. https://english.www.gov.cn/policies/latestreleases/202004/10/content_WS5e8faa09c6d0c201c2cc0922.html), formally recognizing data as the fifth core production factor alongside land, labor, capital, and technology. Complementing this vision, the UK Government’s National Data Strategy2 (National Data Strategy. https://www.gov.uk/guidance/national-data-strategy), released in December 2022, establishes principles for the development of a data infrastructure that is conducive to innovation-driven growth.
As generative artificial intelligence continues to advance, the economic and scientific value of these “data factors” has assumed increasing significance. Effectively harnessing linguistic resources through the construction of core corpora integrated with large language models (LLMs) is, therefore, of paramount importance. In the digital humanities, computational methods are increasingly being used to enhance the collation, analysis, and utilization of historical documents, thereby accelerating their digitization, knowledge structuring, and AI-enabled transformation. Preprocessed ancient texts provide foundational support for domain-specific LLMs such as the Xunzi series.
The current landscape of digital resources for ancient Chinese is characterized by a significant dichotomy: vast yet largely unannotated textual archives, contrasted with a limited number of small, linguistically annotated corpora. This disparity has created a “poverty of labeled data,” a primary bottleneck that hinders the development of robust, data-driven NLP models. Although vast digital libraries such as the Chinese Text Project1 and the Peking University Center for Chinese Linguistics Corpus2 have made enormous volumes of raw text accessible, their utility in training supervised models is constrained by the lack of linguistic annotation. In contrast, smaller annotated resources, such as the Sheffield Corpus of Chinese (SCC)3 and the more recent Ancient Chinese Word Segmentation and Part-Of-Speech Corpus (ACP)4, are frequently of insufficient scale for training large models. Other notable projects have focused on specific tasks, such as the Chinese Historical Information Extraction Corpus (CHisIEC) for named entities5. Annotation standards have also evolved, exemplified by the Academia Sinica Ancient Chinese Corpus (ASACC)6 and the Nanjing Normal University Ancient Chinese Tagset (2017)7, which introduced a 17-tag POS system designed specifically for Classical Chinese.
Research on POS-tagged corpora spans construction methodologies, computational tools, and practical applications. Common practices include text selection, preprocessing, tagset design, and annotation. Studies often employ hybrid or machine-assisted methods: for example, Maulud et al.8 used a bi-gram HMM with rule-based techniques for Sorani Kurdish, while Bernhard et al.9 emphasized tagset unification for regional languages in France. Advanced models such as CRF and Bi-LSTM have been applied to segmentation and tagging tasks, as seen in work on Chinese clinical texts10 and the Khasi language11. To improve efficiency, semi-automatic annotation systems have been developed for languages such as Korean12 and Amharic13, supporting downstream applications including POS tagger development and sentiment analysis.
An especially vital resource for machine translation and contrastive linguistics is the parallel corpus, which comprises sentence-aligned texts in two languages. For many years, the absence of a large-scale parallel corpus for classical and modern Chinese represented a considerable obstacle to computational research. Recent methodological advances in sentence alignment employ dynamic programming to identify the longest common subsequence of characters14, while combinations of lexical and statistical information have also proven effective15. These advances have enabled the creation of corpora with millions of sentence pairs16. The integration of artificial intelligence is further transforming this domain, as such parallel data is crucial for fine-tuning LLMs on specialized domains. As demonstrated in the training of GujiGPT17, combining domain-specific ancient texts with high-quality parallel data serves to mitigate the phenomenon known as “catastrophic forgetting,” thereby enhancing the model’s expert knowledge while preserving its foundational linguistic abilities.
Despite these advances, existing ancient Chinese corpora continue to be limited in both scale and depth of their annotation, as most development in Chinese corpus construction has focused on the modern language. To address this gap, this paper details the construction of the Twenty-Four Histories Ancient-Modern Part-of-Speech Tagged Corpus. The Twenty-Four Histories represent China’s official historical records, composed and transmitted over a period of nearly five thousand years. This study aims to significantly expand the resources available for ancient Chinese, providing a foundational dataset for research in linguistics, history, and literature, and thus fostering collaborative innovation between the digital humanities and artificial intelligence.
Methods
Corpus acquisition and cleaning
The textual basis for this corpus was derived from two primary sources: the book series The Complete Translation of the Twenty-Four Histories (《二十四史全译》) and a digitized, simplified Chinese version obtained from the Guoxue Dashi (国学大师) website. The Complete Translation of the Twenty-Four Histories is a large-scale scholarly project initiated in 1991 and completed in 2003, spanning 13 years of compilation. The entire publication contains approximately 100 million characters, combining both classical source texts and modern Chinese translations. For our work, we utilized the first edition published in 2004 as the primary source for Optical Character Recognition (OCR). To mitigate potential OCR-induced errors and ensure textual fidelity, the digitized version from the Guoxue Dashi website was used as a reference for cross-verification.
Following acquisition, a rigorous data cleaning protocol was established to ensure the quality and integrity of the dataset. We systematically removed incomplete sentences and paragraphs to maintain the contextual coherence and readability of the texts. Simultaneously, we standardized the text by correcting or excising erroneous content (such as OCR inaccuracies arising from traditional Chinese characters) and non-body elements such as headers and footers. Furthermore, the quality of the corpus was enhanced by eliminating meaningless repetitions, irrelevant content, and anomalous characters or symbols. This comprehensive cleaning process yielded a final dataset comprising both plain text and original scanned image files for the parallel ancient and modern Chinese versions of the Twenty-Four Histories.
Sentence alignment
To facilitate effective comparison and analysis, it is paramount to ensure structural, sequential, and content consistency between the ancient Chinese text and its modern Chinese translation. This was achieved through a semi-automated alignment process. This study employed ABBYY Aligner 2.0 to perform both paragraph-level and sentence-level alignment. The workflow began with converting the source PDF documents into electronic text using OCR technology. The resulting ancient and modern Chinese texts were then loaded into the alignment software.
A two-stage alignment was conducted: first at the paragraph level, followed by a more granular sentence-level alignment. However, due to significant divergences in syntax and grammar between ancient and modern Chinese, the automated output necessitated a subsequent phase of meticulous manual proofreading. The software flagged potential misalignments, which were then subjected to manual review and correction. Following the alignment correction, the entire corpus was cross-referenced against the original PDF files to identify and rectify any OCR-related textual errors.
Annotation standards
The foundational processing of the corpus involved two primary linguistic annotation tasks: word segmentation and Part-of-Speech (POS) tagging. While established standards exist for modern Chinese, no universally accepted standard currently exists for ancient Chinese processing. To address this challenge and facilitate the construction of a bilingual parallel corpus suitable for machine translation, we adopted a “modern-anchored, ancient-adapted” annotation strategy. Consequently, our protocol primarily adheres to the Chinese national standard GB/T 13715 for segmentation and the Grammatical Knowledge-base of Contemporary Chinese for POS tagging. This alignment ensures that grammatical categories in the ancient text map logically to their modern counterparts, thereby reducing noise in downstream computational tasks.
However, simply applying modern rules to classical texts is insufficient due to unique features such as monosyllabic dominance and flexible word boundaries. To resolve the issue of ambiguous boundaries caused by the lack of clear delimiters in ancient Chinese, we utilized the parallel modern translation as a semantic anchor. Characters in the ancient text were grouped into a single token only if they functioned as a cohesive semantic unit corresponding to a specific lexical unit in the modern translation. This method ensures that the segmentation respects the semantic integrity of the historical text while maintaining alignment with modern linguistic concepts.
In terms of POS tagging, special attention was given to the phenomenon of “class shift,” where a word’s grammatical function shifts based on context. We prioritized the functional role of the word in specific contexts over its inherent lexical category. Specifically, under the predicate core principle, verbs were identified based on their role as the syntactic predicate. To further distinguish flexible verbal usages distinct from modern Chinese grammar, we introduced the specific tag “gv”. This tag annotates causative (e.g., “活” in “臣活之” means “to cause to live”), benefactive (e.g., “死” in “死国” means “to die for the country”), and putative usages, allowing the model to distinguish between standard verbs and those undergoing semantic shifts. The final comprehensive tagset includes 22 distinct POS categories, using forward slashes to delimit segmentation boundaries and alphabetic codes to denote the corresponding POS tag. The complete tagset and detailed annotation rules are presented in Table 1.
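As an illustration of how such slash-delimited annotations can be read back into tokens, the following minimal sketch parses one annotated line. It assumes that tokens are separated by whitespace and written as word/tag; the exact file layout is given in Table 5 and may differ. The function name `parse_tagged` and the example tags for “臣” and “之” are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def parse_tagged(line: str) -> List[Token]:
    """Split a 'word/tag word/tag ...' line into (word, tag) pairs.

    Assumption: whitespace separates tokens and the last slash in each
    token separates the word from its tag.
    """
    tokens: List[Token] = []
    for item in line.split():
        # rpartition splits on the LAST slash, so a slash inside the word survives
        word, sep, tag = item.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"malformed token: {item!r}")
        tokens.append((word, tag))
    return tokens


# Illustrative line built from the causative example in the text
example = "臣/n 活/gv 之/r"
tokens = parse_tagged(example)
flexible_verbs = [word for word, tag in tokens if tag == "gv"]
```

Filtering on the “gv” tag, as in the last line, is one way to collect the flexible verbal usages for separate study.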
Hybrid annotation pipeline
To ensure expert-level understanding of linguistic nuances and historical context while maximizing annotation efficiency, we instituted a “Machine-Manual” hybrid pipeline. This approach integrates a domain-specific deep learning model with rigorous human expertise through an iterative process. To automate the initial annotation, we trained a sequence labeling model tailored for ancient texts, combining SikuBERT and a BiLSTM layer as the encoder. SikuBERT (110 M parameters) is pre-trained on the “Siku Quanshu” full-text corpus to capture the semantic representation of classical Chinese. A CRF layer is integrated into the decoding phase to optimize sequence labeling performance. Prior to the specific task, the model was initially trained on a composite dataset containing ancient texts (Zuo Zhuan) and modern corpora (People’s Daily POS tagged corpus) to establish foundational linguistic capability. The specific training configurations are detailed in Table 2.
Human annotators served as both the source of high-quality seed data and the final quality gate. We recruited a specialized team comprising 11 doctoral candidates and 34 master’s students majoring in Chinese, Computational Linguistics, and Ancient Chinese History. Participants underwent a rigorous two-week training curriculum covering the GB/T 13715 standard and our project-specific adaptation protocols. To qualify for the formal project, annotators were required to score above 95% accuracy on a “Gold Standard” evaluation of 500 sentences set by senior linguists.
The annotation workflow was executed in four progressive stages. First, the tailored model was applied to the unannotated ancient and modern Chinese texts to perform initial word segmentation and POS tagging. Second, each annotator manually processed a seed dataset consisting of 820 ancient-modern sentence pairs. In this stage, they verified and corrected OCR errors by referring to the original printed texts, while also performing manual word segmentation and part-of-speech tagging. Third, these corrected seed data were used to fine-tune the sequence labeling model, which then generated automated annotations for the remaining corpus. Finally, in the targeted manual correction phase, annotators were assigned to verify the model-generated outputs, fixing segmentation boundaries and tagging errors to ensure the final dataset met the required standards.
Quality control mechanism
To guarantee the fidelity of the final dataset, we implemented a task-specific quality control mechanism focusing on Inter-Annotator Agreement (IAA). Due to the complexity of segmentation, we adopted a boundary-based consistency metric wherein each character position was evaluated on whether it was followed by a word boundary, effectively converting word segmentation sequences into boundary sequences before calculating the Kappa coefficient. For POS tagging agreement, which intrinsically depends on segmentation boundaries, we employed a method of aligning inconsistent segmentations into minimal character segments to produce equivalent sub-word sequences. Each minimal span was then mapped to the POS tag of its corresponding word, and Fleiss’ kappa was utilized to measure the agreement. As shown in Table 3, the assessment yielded scores consistently exceeding 0.96 for both word segmentation and POS tagging across ancient and modern texts, significantly surpassing the conventional threshold for reliability and affirming the high quality of the annotated corpus.
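The two agreement measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used for Table 3: the function names are hypothetical, the two example annotations are invented, and Fleiss’ kappa is used for both measures for simplicity, whereas the text does not specify which kappa variant was applied to segmentation boundaries.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def boundary_sequence(words: Sequence[str]) -> List[int]:
    """One label per character: 1 if a word boundary follows it, else 0."""
    seq: List[int] = []
    for word in words:
        seq.extend([0] * (len(word) - 1) + [1])
    return seq


def minimal_span_tags(a: Sequence[Token], b: Sequence[Token]) -> List[Tuple[str, str]]:
    """Cut the text at the union of both annotators' boundaries and
    return, for each minimal span, the tag each annotator gave it."""
    bound_a = boundary_sequence([w for w, _ in a])
    bound_b = boundary_sequence([w for w, _ in b])
    if len(bound_a) != len(bound_b):
        raise ValueError("annotations do not cover the same text")
    # Per-character tags: every character inherits the tag of its word
    tags_a = [tag for word, tag in a for _ in word]
    tags_b = [tag for word, tag in b for _ in word]
    # A boundary in either annotation closes a minimal span
    return [(tags_a[i], tags_b[i])
            for i, (x, y) in enumerate(zip(bound_a, bound_b)) if x or y]


def fleiss_kappa(items: Sequence[Sequence[Hashable]]) -> float:
    """items[i] holds the labels all raters gave item i (equal rater count)."""
    n_items, n_raters = len(items), len(items[0])
    totals: Counter = Counter()
    agreement = 0.0
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        agreement += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar = agreement / n_items
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return 1.0  # every rater used the same single label throughout
    return (p_bar - p_e) / (1 - p_e)


# Invented example: two annotators disagree on whether 即位 is one word
annotator_a = [("太祖", "nr"), ("即", "v"), ("位", "n")]
annotator_b = [("太祖", "nr"), ("即位", "v")]

seg_items = list(zip(boundary_sequence([w for w, _ in annotator_a]),
                     boundary_sequence([w for w, _ in annotator_b])))
seg_kappa = fleiss_kappa(seg_items)
pos_items = minimal_span_tags(annotator_a, annotator_b)
pos_kappa = fleiss_kappa(pos_items)
```

In the example, the disagreement over “即位” produces three minimal spans (“太祖”, “即”, “位”), so POS agreement can still be measured span by span even though the two segmentations differ.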
Results
This section details the construction results and feature analysis of the Twenty-Four Histories Ancient-Modern Part-of-Speech Tagged Corpus. First, the fundamental statistics of the corpus are presented, including the volume of parallel sentence pairs and total character counts. Subsequently, the study explores linguistic distribution patterns from both ancient and modern Chinese perspectives. Finally, through a case study on the spatial distribution of four dynasties, we demonstrate the corpus’s potential to support quantitative historical research in the digital humanities.
Corpus statistics and final format
As shown in Table 4, the corpus is substantial in scale, containing a total of 209,456 sentence alignments and 29,280 paragraph alignment entries; over 37,000 parallel alignment entries have been manually verified. In terms of character volume, the dataset comprises approximately 7.68 million manually verified ancient Chinese characters and over 9.98 million modern Chinese characters.
Table 5 illustrates the specific format of the “Ancient-Modern” parallel alignment. This example demonstrates that the ancient source text and modern translation are precisely aligned at the sentence level, with each token annotated with its corresponding Part-of-Speech tag.
Feature analysis of ancient Chinese part-of-speech tagged corpus
Before analyzing specific linguistic features, it is essential to contextualize the Twenty-Four Histories parallel corpus within the landscape of existing resources. While the SCC offers diachronic variety, it is limited in scale, and the Academia Sinica Ancient Chinese Corpus (ASACC) lacks parallel modern translations. Similarly, whereas the ACP corpus focuses primarily on segmentation, our work extends to comprehensive POS tagging using a standard compatible with modern NLP tasks. By uniquely integrating large-scale data with sentence-level ancient-modern alignment, our corpus overcomes these limitations. This structural advantage enables a comparative linguistic analysis from the dual perspectives of ancient and modern Chinese, covering key metrics such as word length, frequency, POS distribution, and co-occurrence relationships.
As presented in Table 6, analyzing the word length distribution within the segmented ancient Chinese corpus reveals several key linguistic features. Spanning 18 distinct word-length categories, the data shows that shorter words predominate. There is considerable variance in the frequency of different word lengths.
A defining characteristic is the prevalence of single-character words. Monosyllabic words account for approximately 70% of the total tokens, and words shorter than three characters collectively constitute about 90% of the corpus. This high frequency of single-character usage is indicative of the linguistic conciseness characteristic of ancient Chinese. Conversely, the corpus also contains a small number of words with lengths exceeding ten characters. These longer constructs are not common vocabulary, but typically represent specific formulaic expressions, such as lengthy posthumous imperial titles (e.g., “誠孝恭肅明德弘仁順天啓聖昭皇后”) or complex numerical expressions (e.g., “三億七千六百三十二萬九千九百八十” (376,329,980)).
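A word-length distribution of this kind can be computed directly from the segmented tokens. The sketch below is illustrative only: it assumes tokens are available as (word, tag) pairs and that punctuation carries the tag “w”, which is an assumption about the tagset rather than something stated above. The function name `length_distribution` and the sample tags are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def length_distribution(tokens: Iterable[Token],
                        skip_tags=frozenset({"w"})) -> Dict[int, Tuple[int, float]]:
    """Map each word length to (token count, share of all counted tokens).

    Tokens whose tag is in skip_tags (assumed: punctuation) are ignored.
    """
    counts = Counter(len(word) for word, tag in tokens if tag not in skip_tags)
    total = sum(counts.values())
    return {length: (n, n / total) for length, n in sorted(counts.items())}


# Small sample based on a sentence quoted in this paper; tags are illustrative
sample = [("命", "v"), ("宰臣", "n"), ("陳執中", "nr"),
          ("書", "v"), ("之", "r"), ("。", "w")]
dist = length_distribution(sample)
```

Summing the shares for lengths 1 and 2 gives the proportion of words shorter than three characters discussed above.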
As detailed in Table 7, an analysis of the 50 most frequent words in the ancient Chinese source text reveals a predominance of function words, including pronouns, particles, conjunctions, prepositions, and adverbs.
The character “之” holds the highest frequency, underscoring its pivotal role and functional versatility in ancient Chinese. It commonly serves as both a structural particle and a pronoun. For instance, in the phrase “南陽國王之印” (the Seal of the King of Nanyang), “之” functions as a particle indicating a possessive or attributive relationship. In contrast, in “命宰臣陳執中書之” (Ordered Minister Chen Zhizhong to write it), “之” is used as an object pronoun. Other high-frequency function words include “以”, which operates as both a preposition and a conjunction. The most common adverbs are “不” (not) and “又” (again), while frequently occurring pronouns include “其” (his/her/its), “者” (one who), and “所” (that which). The conjunctions “而” and “與” are also prominent.
Alongside these function words, the list includes several common verbs, such as “為” (to be/to do), “有” (to have), “至” (to arrive), and “及” (to reach). Basic numerals like “一” (one), “二” (two), and “三” (three) also feature prominently, reflecting their wide distribution in historical records.
As depicted in Table 8, the corpus demonstrates a predominance of verbs and common nouns, collectively accounting for approximately 50% of all tokens. Nouns denoting personal names, adverbs, and place nouns occur at similar frequencies, each constituting approximately 7% of the total. Time nouns and pronouns exhibit frequencies around 30,000 tokens, collectively representing 4.5% of the corpus. Immediately following in frequency ranking are numerals, prepositions, conjunctions, and adjectives, all exceeding 10,000 occurrences.
This distribution indicates that ancient Chinese relies heavily on verbs alongside nouns expressing personal names, locations, official positions, and temporal references. These lexical categories carry substantial semantic content, forming a critical foundation for entity and relationship extraction in ancient Chinese textual analysis.
Table 9 presents the co-occurrence frequency distribution of the top 50 most frequent words in ancient Chinese. The pair “一” and “員” exhibits the highest co-occurrence frequency. As a numeral and a measure word respectively, their combination “一員” follows the grammatical rule governing numeral-measure word constructions. Semantically, this collocation is frequently used to denote official positions within historical administrative systems. For instance, the sentence “延佑七年,省寺卿、少卿各一員,定置如上” translates as: “In the seventh year of the Yanyou era (1320 CE), one Temple Minister and one Deputy Temple Minister were established, with appointments fixed as stated above.”
The frequent co-occurrence of “遣” and “使” suggests a significant emphasis on diplomatic activities throughout Chinese history, reflecting the importance various dynasties placed on foreign relations. This verb-object pairing was also commonly employed in contexts involving the dispatching of envoys, transmission of orders or messages, and execution of specific missions.
Moreover, the corpus contains numerous collocations related to military affairs and warfare, such as “以” and “兵” (using troops), “破” and “之” (to defeat someone), “諸” and “將” (various generals), and “討” and “之” (to campaign against someone). Additionally, common grammatical or functional collocations include “不” and “可” (not possible), “不” and “能” (not able to), as well as “以” and “為” (regard as / appoint as), which are frequently encountered in ancient texts.
As shown in Table 10, the most frequently co-occurring POS-tag combination is “v” (verb) and “n” (noun), representing verb–noun constructions. Examples include pairs such as “即” and “位” (ascend to the throne), “掌” and “制誥” (manage imperial edicts), “加” and “别名” (add an alternative name), and “輪” and “官” (rotate officials). These combinations reflect common syntactic and semantic patterns in ancient Chinese.
The second most frequent combination is “n” followed by “v,” where the noun functions as an adverbial modifier preceding the verb. For instance, in the phrase “遇隻日入侍邇英閤” (“on odd-numbered days enter to attend at the Erying Pavilion”), “日” (day) serves as a temporal adverbial; similarly, in “就內殿講讀” (“conduct lectures in the inner palace”), “內殿” (inner palace) functions as a locational adverbial. However, certain irregularities exist within this POS collocation pattern. These anomalies are primarily due to the treatment of punctuation marks as stop words during corpus processing, which may result in a noun appearing at the end of one sentence being erroneously paired with a verb at the beginning of the following sentence.
In addition, the combinations “v” and “v” (verb–verb) as well as “d” (adverb) and “v” (verb) are also prevalent in ancient Chinese. The former often involves sequences of monosyllabic verbs, such as “更” and “為” (change into) or “稱” and “曰” (declare or state), typically used to express complex actions or transitions. Adverbs (tagged as “d”) generally precede verbs to modify specific actions or behaviors. Furthermore, other common POS collocations include “v” and “ns” (verb–place noun) and “n” and “n” (noun–noun) combinations, reflecting the rich morphosyntactic diversity of ancient Chinese.
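The co-occurrence counts discussed in this subsection, and the punctuation artifact noted for the noun–verb pattern, can be illustrated with a short sketch. It assumes that co-occurrence means adjacency of two tokens and that punctuation carries the tag “w”; both are assumptions, since the exact window and tag are not specified here. The function name `adjacent_pairs` and the sample tokens are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Token = Tuple[str, str]  # (word, POS tag)
PUNCT_TAG = "w"          # assumed tag for punctuation marks


def adjacent_pairs(tokens: Iterable[Token], field: int = 0,
                   punct_as_boundary: bool = True) -> Counter:
    """Count adjacent pairs of words (field=0) or POS tags (field=1).

    punct_as_boundary=False drops punctuation like a stop word, so tokens
    on either side of it become neighbours; True blocks such pairs.
    """
    pairs: Counter = Counter()
    prev = None
    for token in tokens:
        if token[1] == PUNCT_TAG:
            if punct_as_boundary:
                prev = None  # do not pair across punctuation
            continue
        if prev is not None:
            pairs[(prev[field], token[field])] += 1
        prev = token
    return pairs


# Invented two-clause sample: "即位。稱曰"
sample = [("即", "v"), ("位", "n"), ("。", "w"), ("稱", "v"), ("曰", "v")]
dropped = adjacent_pairs(sample, field=1, punct_as_boundary=False)
bounded = adjacent_pairs(sample, field=1, punct_as_boundary=True)
word_pairs = adjacent_pairs(sample, field=0)
```

With punctuation simply dropped, the sample yields a spurious (“n”, “v”) pair spanning the sentence break; treating punctuation as a boundary removes it, which is one possible way to avoid the irregularities described above.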
Modern Chinese part-of-speech tagged corpus feature analysis
As shown in Table 11, there are notable similarities between the distribution of word lengths in modern Chinese and ancient Chinese. In both cases, words of length 1 and 2 dominate, collectively accounting for approximately 90% of all words, although there are considerable differences in frequency across different word length categories. Moreover, certain features of ancient Chinese, such as long posthumous names and numerical expressions, are preserved to some extent in the modern translation. A key distinction is that two-character words constitute the majority in modern Chinese, a trend closely associated with contemporary writing conventions, which generally favor multi-character expressions. Furthermore, the overall frequencies of various word lengths in ancient Chinese are slightly higher than those found in modern Chinese.
Consistent with the linguistic distribution, the set of words containing more than 10 characters was dominated by numerals (n > 110), while the remainder comprised proper nouns (see Fig. 1).
Table 12 presents the distribution of the top 50 most frequent words in the modern Chinese corpus. After filtering out common function words and other meaningless high-frequency terms as stop words, several lexemes related to official titles and military activities emerge prominently. These include terms such as “皇帝” (emperor), “命令” (command), “朝廷” (court), “軍隊” (army), “刺史” (governor), and “節度使” (military commissioner), which reflect the rich political and martial context embedded in the Twenty-Four Histories corpus.
Words like “皇帝”, “下詔”, “任命”, and “詔令” indicate the frequent documentation of imperial edicts and official appointments within these historical records. The appearance of “節度使” particularly highlights the historical context of frontier defense and resistance against external invasions. Moreover, a number of single-character words such as “爲”, “説”, and “時” still appear frequently. These terms, however, carry limited semantic content when used for textual representation, suggesting an incomplete removal of ancient or traditional forms from the stop word list.
Table 13 presents the frequency distribution of POS-tags in the modern Chinese corpus. Mirroring the pattern observed in ancient Chinese, verbs and common nouns collectively constitute approximately 50% of all tags. Nouns denoting personal names, toponyms, temporal references, and official positions each exceed 5% representation. This predominance of content words (as opposed to function words) signifies their critical role in identifying textual themes and core semantic content within modern Chinese texts.
Notably, the modern Chinese tagset introduces the specialized “l” POS tag, which classifies four-character idioms and set phrases retained from ancient Chinese, such as 大赦天下 (general amnesty), 所到之處 (wherever one goes), 從今以後 (henceforth), and 數以萬計 (numbering in the tens of thousands).
As shown in Table 14, the most frequent word co-occurrence in the modern Chinese corpus is the pair “改” (change) and “爲” (to be/to become). This collocation typically appears in contexts involving modifications to posthumous titles, official positions, and names of fiefs. Examples include “把公主稱號改爲帝姬” (changing the title of princess to imperial princess), “調任瀘川尉,改爲仕邡尉” (transferred to the magistrate of Luchuan, changed to that of Shifang), and “根據他的封地改爲‘濟南王印’爲宜” (it would be appropriate to change the seal to ‘Jinan King Seal’ according to his fief).
Other frequently occurring collocations such as “任命” (appoint) and “爲”, “封” (grant a title) and “爲”, “升” (promote) and “爲”, and “提升” (elevate) and “爲” are closely associated with activities related to official appointments, promotions, and imperial bestowals.
The co-occurrence of “派遣” (dispatch) and “使者” (envoy) reflects diplomatic interactions between dynasties, as illustrated by examples like “壬戌,鄧全國派遣使者入朝貢奉” (On Renxu day, Deng Quanguo dispatched envoys to pay tribute at court) and “庚辰,契丹派遣使者前來祭奠高祖” (On Gengchen day, the Khitans dispatched envoys to mourn Emperor Gaozu). The dispatching of envoys also served as a common method for handling specific diplomatic or ceremonial tasks.
Moreover, the co-occurrence of “占辭” (divinatory statement) and “説” (say) highlights the significant role of divination in ancient Chinese society. Historical evidence suggests that such practices profoundly influenced decision-making processes in both major state affairs and everyday matters.
Finally, the pairing of “设” (establish) and “达鲁花赤” (Darughachi) underscores the administrative importance of the Darughachi office during the Yuan Dynasty, reflecting broader patterns of governance and political control under Mongol rule.
Table 15 presents the distribution of the top 50 POS-tag co-occurrences in the modern Chinese corpus. Verb–noun and verb–verb constructions are the most frequent grammatical collocations in modern Chinese, a pattern that closely resembles the syntactic features of ancient Chinese. These combinations are typically used to describe states, actions, or behavioral events. Examples include “起草” (draft) and “文書” (document), “傳達” (convey) and “聖旨” (imperial edict), “輪流” (take turns) and “講讀” (lecture), as well as “下詔” (issue an edict) and “同意” (agree).
Noun–verb co-occurrences also appear at high frequency and are commonly found in subject-predicate structures or in contexts where actions are modified by nominal elements. Representative examples include “言官” (censor) and “認爲” (believe), “皇子” (prince) and “外出” (go out), and “日” (day) and “講讀” (lecture) in the phrase “逢雙日講讀” (lectures are held on even days).
Co-occurrences such as proper nouns (e.g., personal names) with verbs, verbs with place names, function words with nouns, and noun–noun pairs occur at frequencies of approximately 20,000. These patterns are particularly useful for extracting structured semantic relations such as person–verb–location triplets, which provide rich information about political and military events.
In addition, several POS tag combinations—including “n” (noun) and “d” (adverb), “v” (verb) and “nr” (personal name), “t” (time noun) and “t,” “v” and “nx” (proper noun), and “v” and “r” (pronoun)—occur more than 10,000 times. These co-occurrence patterns encode valuable information regarding temporal references, changes in official titles, and complex event relationships, offering further insight into the thematic and structural characteristics of the historical texts.
Diachronic evolution of grammatical categories
The parallel architecture of our Corpus enables a quantitative investigation into the diachronic evolution of grammatical categories. To systematically capture these shifts, we constructed a POS Transition Matrix based on the manual verified sentence pairs. This analysis reveals how specific grammatical functions in ancient Chinese map onto modern Chinese linguistic structures, offering insights into the typological shift from synthetic to analytic syntax.
As illustrated in Fig. 2, a high degree of stability is observed in core content words. Common nouns (n) and standard verbs (v) exhibit the highest retention rates, forming a diagonal dominance in the transition matrix. The transition matrix also reveals category shifts between parallel sentences: Static nouns often function as dynamic predicates. High-frequency examples include (‘疏‘, ‘上疏‘) (Shu, ‘memorial’ → ‘submit a memorial’) and (‘位‘, ‘即位‘) (Wei, ‘throne’ → ‘ascend the throne’). This pattern illustrates the explicitation of action, where modern Chinese requires a verbal head to carry the predicative force of the original noun. Conversely, certain ancient verbs undergo nominalization. For instance, the verbs (‘杖‘, ‘杖刑‘) (Zhang, ‘to cane’ → ‘caning punishment’) and (‘諡‘, ‘謚號‘) (Shi, ‘to confer a title’ → ‘posthumous title’) map to specific institutional nouns. This reflects a tendency in modern Chinese to resolve semantic ambiguity by fixing specific ancient actions into defined nominal concepts.
A significant diachronic trend is the alignment of ancient Chinese function words, specifically prepositions (p) and adverbs (d), with full lexical verbs in modern Chinese. The instrumental preposition Yi (以), typically functioning as a coverb (‘with/by’), frequently aligns with full verbs such as (‘以’, ‘任命’) (Yi → ‘appoint’), (‘以’, ‘用’) (Yi → ‘use’), and (‘以’, ‘認爲’) (Yi → ‘consider’). Tokens tagged as adverbs in the source texts often carry aspectual or modal meanings that translate into full verbs or auxiliaries. For example, the aspectual adverb Shi (始, ‘initially’) maps to the inchoative verb (‘始’, ‘開始’) (Shi → ‘begin’). Likewise, the modal adverb Yi (宜, ‘fittingly’) aligns with the auxiliary verb (‘宜’, ‘應該’) (Yi → ‘should’), and the future marker Jiang (將) aligns with (‘將’, ‘準備’) (Jiang → ‘prepare’).
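High-frequency pairs for a given category shift, such as preposition to verb, can be listed by filtering aligned pairs on their tags. This is a hedged sketch under the same assumption of pre-aligned ((ancient word, tag), (modern word, tag)) pairs; `top_shift_pairs` is a hypothetical helper, and the frequencies below are toy values rather than corpus statistics.

```python
from collections import Counter

def top_shift_pairs(aligned_pairs, src_tag, tgt_tag, k=10):
    """Return the k most frequent word pairs for one tag-to-tag shift."""
    counts = Counter(
        (src_word, tgt_word)
        for (src_word, s_tag), (tgt_word, t_tag) in aligned_pairs
        if s_tag == src_tag and t_tag == tgt_tag
    )
    return counts.most_common(k)

# Word pairs follow the examples in the text; frequencies are invented.
shift_pairs = [
    (("以", "p"), ("任命", "v")),
    (("以", "p"), ("任命", "v")),
    (("以", "p"), ("用", "v")),
    (("始", "d"), ("開始", "v")),
    (("宜", "d"), ("應該", "v")),
]

p_to_v = top_shift_pairs(shift_pairs, "p", "v")
d_to_v = top_shift_pairs(shift_pairs, "d", "v", k=1)
print(p_to_v)
print(d_to_v)
```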
Case study of the spatial distribution of four dynasties
The comprehensive POS tagging within our corpus serves as a powerful tool for digital humanities research, enabling the direct extraction of structured historical data from unstructured classical texts. Unlike traditional qualitative reading, analysis based on specific POS tags allows macro-historical trends, such as the evolution of administrative geography, to be examined quantitatively.
To demonstrate this utility, we conducted a spatial analysis of the Tang, Song, Yuan, and Ming dynasties. By systematically filtering for the ‘ns’ (place name) tag embedded in the corpus, we extracted the fifty most frequently occurring toponyms for each era. These historical locations were subsequently mapped onto contemporary geographical coordinates and visualized to reveal shifts in political and economic centers (Fig. 3).
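The extraction step described above amounts to a tag filter followed by a frequency ranking. The sketch below is one possible way to do it, assuming each dynasty's text is available as a list of sentences made of (word, tag) tuples; the function name `top_toponyms`, the dictionary layout, and the toy sentences are all hypothetical. Mapping the resulting names to modern coordinates would require a historical gazetteer and is not shown.

```python
from collections import Counter

def top_toponyms(tagged_sentences, k=50, place_tag="ns"):
    """Return the k most frequent words carrying the place-name tag."""
    counts = Counter(
        word
        for sent in tagged_sentences
        for word, tag in sent
        if tag == place_tag
    )
    return counts.most_common(k)

# Toy data for illustration only; not taken from the corpus.
corpus_by_dynasty = {
    "Tang": [
        [("至", "v"), ("長安", "ns")],
        [("還", "v"), ("長安", "ns")],
        [("幸", "v"), ("洛陽", "ns")],
    ],
    "Ming": [
        [("至", "v"), ("南京", "ns")],
    ],
}

top_by_dynasty = {
    dynasty: top_toponyms(sentences, k=50)
    for dynasty, sentences in corpus_by_dynasty.items()
}
print(top_by_dynasty["Tang"])
```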
The spatial analysis reveals distinct patterns of geographical focus and territorial emphasis across dynasties. The Tang dynasty distribution (represented by star markers) demonstrates a pronounced concentration in the northwestern regions, particularly around the imperial capitals of Chang’an and Luoyang, with notable extensions into frontier territories. The Song dynasty pattern (square markers) exhibits a notable eastward shift, with primary concentrations in the Central Plains and Jiangnan regions, reflecting the dynasty’s reduced territorial scope and economic reorientation toward southern China.
The Yuan dynasty distribution (circular markers) displays the most expansive geographical coverage, encompassing territories extending from Mongolia in the north to Yunnan in the southwest, consistent with the Mongol Empire’s vast territorial reach. Finally, the Ming dynasty pattern (triangular markers) reveals a bipolar concentration in North China and the Jiangnan region, effectively illustrating the administrative structure of the dual-capital system (Beijing and Nanjing) that characterized Ming governance.
These findings confirm that the POS-tagged corpus effectively supports quantitative historical research. By transforming text into structured geospatial data, it enables historians to verify macro-historical trends and uncover latent patterns in administrative geography that are difficult to discern through close reading alone.
Discussion
This study has detailed the construction of the Twenty-Four Histories Ancient-Modern Chinese part-of-speech tagged corpus, a resource designed to address a significant gap in the digital humanities. The resulting corpus provides a robust data foundation for multifaceted disciplinary research. As the canonical collection of China’s official historical records, the Twenty-Four Histories documents crucial developmental stages of the Chinese language. The systematic annotation of this collection affords researchers the ability to conduct more efficient retrieval and analysis, thereby offering comprehensive data support for inquiries in ancient Chinese linguistics, philology, and historiography. For instance, statistical analysis of lexical frequencies can elucidate the linguistic characteristics of specific historical periods, while the POS-tagged data facilitates quantitative investigations into particular syntactic phenomena and their diachronic evolution, enriching the understanding of historical language change.
Beyond its value to traditional scholarship, the corpus is indispensable for promoting the intelligent processing and knowledge mining of ancient texts. High-quality annotated corpora are a prerequisite for advancing the application of natural language processing technologies to ancient literature, moving beyond simple digitization toward intelligent analysis. The challenges inherent in ancient Chinese, such as the absence of punctuation and widespread polysemy, have historically impeded automated processing. This corpus serves as high-quality training data, providing crucial support for enhancing the efficacy and accuracy of large language models for downstream tasks. Furthermore, it enables the development of dedicated platforms for ancient textual resources, integrating functionalities for search, annotation, and collaborative research, thereby creating a valuable data infrastructure for advanced computational methodologies.
Moreover, the corpus is poised to facilitate the broader public dissemination of traditional culture. The precise, sentence-level alignment between ancient texts and their modern translations lowers the barrier to entry for non-specialists, including students and history enthusiasts. This accessibility allows wider audiences to engage directly with the political, economic, and cultural dimensions of ancient Chinese society. Consequently, the resource functions as a powerful educational tool, helping learners to visualize linguistic change across time and deepen their understanding and appreciation of traditional Chinese culture.
In conclusion, the construction of the Twenty-Four Histories Ancient-Modern tagged corpus accelerates the digital and knowledge-driven transformation of ancient texts. By utilizing artificial intelligence to enhance the organization and analysis of these documents, this work provides a reliable data foundation for sophisticated applications, including language model fine-tuning, machine translation, and cross-linguistic research. It ultimately offers robust support for both academic inquiry and practical applications at the intersection of the digital humanities and artificial intelligence.
Data availability
The datasets are available in our GitHub repository.
Code availability
The underlying code is available in our GitHub repository: https://github.com/vino5211/Sequence-Labeling-for-POS-tag.
References
Sturgeon, D. Chinese Text Project: a dynamic digital library of premodern Chinese. Digit. Scholarsh. Humanit. 36, i101–i112 (2021).
Zhan, W., Guo, R., Chang, B., Chen, Y. & Chen, L. The building of the CCL corpus: its design and implementation. Corp. Linguist. 6, 71–86 (2019).
Hu, X., Williamson, N. & McLaughlin, J. Sheffield corpus of Chinese for diachronic linguistic study. Lit. Linguist. Comput. 20, 281–293 (2005).
Ke, Y. Construction of ancient Chinese word segmentation and part-of-speech corpus. In Proc. 23rd Chinese National Conference on Computational Linguistics. Vol. 1, 819–829 (Chinese Information Processing Society of China, 2024).
Tang, X. et al. CHisIEC: an information extraction corpus for ancient Chinese history. Preprint at https://arxiv.org/abs/2403.15088 (2024).
Wei, P.-C., Thompson, P. M., Liu, C.-H., Huang, C.-R. & Sun, C. Historical corpora for synchronic and diachronic linguistics studies. Comput. Linguist. Chin. Lang. Process. 2, 131–145 (1997).
Chen, X. et al. Ancient Chinese Corpus LDC2017T14 (Linguistic Data Consortium, 2017).
Bernhard, D. et al. Corpora with part-of-speech annotations for three regional languages of France: Alsatian, Occitan and Picard. In Proc. 11th Edition of the Language Resources and Evaluation Conference. 3917–3924 (European Language Resources Association, 2018).
Ashida, M., Lee, S. & Namgyal, K. Building a part-of-speech tagged corpus for Drenjongke (Bhutia). In Proc. 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop. 57–63 (Association for Computational Linguistics, 2020).
Warjri, S. et al. Part-of-speech (POS) tagging using deep learning-based approaches on the designed Khasi POS corpus. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 21, 63:1–63:24 (2021).
Lee, D. G. et al. Minimizing human intervention for constructing Korean part-of-speech tagged corpus. IEICE Trans. Inf. Syst. E93-D, 2336–2338 (2010).
Abebe, T. & Alemneh, E. Amharic text corpus based on parts of speech tagging and headwords. In 2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA) 77–82 (IEEE, 2021).
Trushkina, J. Development of a multilingual parallel corpus and a part-of-speech tagger for Afrikaans. In Intelligent Information Processing III (eds Shi, Z., Shimohara, K. & Feng, D.) 453–462 (Springer, 2007).
Zhang, Z., Li, W. & Su, Q. Automatic translating between ancient Chinese and contemporary Chinese with limited aligned corpora. In Natural Language Processing and Chinese Computing (eds Jie, T. et al.) 157–167 (Springer, 2019).
Liu, D. et al. Ancient–modern Chinese translation with a new large training dataset. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19, 1–13 (2019).
Cao, J. et al. Translating ancient Chinese to modern Chinese at scale: a large language model-based approach. In Proc. ALT2023: Ancient Language Translation Workshop. 61–69 (Asia-Pacific Association for Machine Translation, 2023).
Wang, D. et al. GujiBERT and GujiGPT: construction of intelligent information processing foundation language models for ancient texts. Preprint at https://arxiv.org/abs/2307.05354 (2023).
Acknowledgements
The research was funded by the National Social Science Foundation of China (No. 21&ZD331) and the Social Science Fund Project of Jiangsu Province (No. 23TQC004).
Author information
Contributions
Conceptualization, Dongbo Wang; writing—original draft preparation, Wenhao Ye and Xue Zhao; writing—review and editing, Wenhao Ye; formal analysis, Qiankun Xu.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ye, W., Xu, Q., Zhao, X. et al. Construction of the twenty-four histories ancient-modern part-of-speech tagged corpus. npj Herit. Sci. 14, 97 (2026). https://doi.org/10.1038/s40494-026-02309-w