Background & Summary

Historical documents are an invaluable cultural heritage produced during the evolution of human civilization. In particular, the long-standing Chinese civilization has left a vast collection of such documents, covering fields as diverse as history, art, and medicine. The recognition and analysis of Chinese historical documents therefore hold significant value for unveiling ancient Chinese culture, and are considered an important and urgent research topic. Advances in artificial intelligence (AI), particularly in deep learning, have facilitated the development of automatic recognition and analysis techniques1,2 for these documents, substantially reducing the reliance on extensive manual labor by human experts. Nonetheless, the efficacy of these deep-learning-based methods heavily depends on the availability of extensive annotated datasets for model training.

In this field, there have been several pioneering efforts in dataset construction. For instance, Xu et al.3 established CASIA-AHCDB, which contains over 2.2 million character images from Chinese historical documents. MTHv1, proposed by Yang et al.4, is the first dataset with page-level annotation; it consists of 1,500 historical document images with annotated texts and their reading order. However, it is limited to a single document type, i.e., Buddhist scriptures. Building upon MTHv1, Ma et al.5 introduced MTHv2 by expanding the dataset to 3,199 images, but it still contains only Buddhist scriptures. Similar to MTHv1 and MTHv2, IC19 HDRC6 provides page-level annotation and includes 12,850 historical document images, but it also covers only a single type of document, namely family genealogies. To address the stylistic limitations of the above datasets, Shi et al.7 proposed M5HisDoc, which includes 8,000 historical document images and features multiple styles. However, several key limitations persist:

  • Limited data scale: All existing datasets contain only a few thousand to tens of thousands of historical document images. This is very small compared to the massive amount of historical documents in the real world, which hinders the development of relevant methods.

  • Insufficient character category coverage: The maximum number of character categories in existing datasets is only 16,151, which hinders models from addressing the challenge of category diversity in real-world scenarios.

  • Lack of book-level annotation: Existing datasets are annotated primarily at the page level, without book-level annotations, making book-level research impossible.

To address these limitations, we introduce HisDoc1B8, a large-scale dataset of Chinese historical documents. For its construction, we design an effective semi-automatic annotation method, which consists of four main processes: character location, character annotation, character arrangement, and text punctuation. Leveraging this method, we harness vast amounts of unlabeled web data to construct HisDoc1B. As detailed in Table 1, HisDoc1B includes annotations for over 40 thousand books, 3 million images, and 1 billion characters across 30,651 categories. HisDoc1B has the following features. (a) It is the largest dataset in the field, over 200 times larger than the largest existing dataset. (b) It is the most completely annotated dataset of Chinese historical documents, offering unique book-level and punctuation annotations. We believe this dataset will aid and inspire future research in the recognition and analysis of Chinese historical documents.

Table 1 Comparison of HisDoc1B with existing Chinese historical document datasets.

Methods

To construct a large-scale dataset of Chinese historical documents, we designed a systematic pipeline, as illustrated in Fig. 1. This pipeline contains three key steps: data collection, data annotation, and data validation. The data annotation process is based on a semi-automatic annotation method we propose, as depicted in Fig. 2, which significantly reduces the burden of manual labor. The semi-automatic annotation method comprises four main steps: character location, character annotation, character arrangement, and text punctuation. In this section, we delve into the details of the data construction.

Fig. 1
figure 1

The pipeline of building HisDoc1B dataset.

Fig. 2
figure 2

The pipeline of data annotation.

Data collection

Data acquisition

The main purpose of our study is to construct a large-scale dataset of Chinese historical documents. To achieve this goal, we need to acquire a sufficient amount of raw historical document data. Therefore, we selected a website with extensive historical document resources, GuoXueDaShi (https://www.guoxuedashi.net/guji/). This platform boasts an extensive collection of over one million books of historical documents, which were collected and organized by classical Chinese enthusiasts. The magnitude of data available on this website meets the requirements of our research. From this website, approximately 45,000 scanned books of historical documents were downloaded as source data, which are primarily in Portable Document Format (PDF) and DjVu formats.

Data cleaning

To guarantee the quality of the dataset, we engaged professional annotators to conduct a preliminary review of the downloaded books. The aim of the review was to filter out content that deviated from our research standards, such as files with extensive watermarks and non-Chinese historical documents. This process resulted in the exclusion of about 10% of the data. Subsequently, we used automatic scripts to convert the scanned books into image format (JPEG) and numbered the pages sequentially. After these two steps, we obtained over three million high-quality images of Chinese historical documents.
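The conversion step can be scripted with standard tooling. The following is a minimal sketch, assuming the pdf2image library (a wrapper around Poppler); the directory names, DPI, and file-naming scheme are illustrative rather than the exact values used in our pipeline, and DjVu files would require a separate converter.

```python
# Minimal sketch: convert downloaded PDF books into sequentially numbered JPEG pages.
# Assumes pdf2image (pip install pdf2image) with Poppler installed; paths and the
# 200-dpi setting are illustrative, not the exact pipeline values.
from pathlib import Path
from pdf2image import convert_from_path

src_dir = Path("downloads")   # folder of scanned PDF books (hypothetical)
dst_dir = Path("images")      # output folder for JPEG pages (hypothetical)

for pdf_path in sorted(src_dir.glob("*.pdf")):
    book_dir = dst_dir / pdf_path.stem
    book_dir.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(str(pdf_path), dpi=200)   # one PIL image per page
    for page_idx, page in enumerate(pages, start=1):
        # Number pages sequentially within each book, e.g. 0001.jpg, 0002.jpg, ...
        page.convert("RGB").save(book_dir / f"{page_idx:04d}.jpg", "JPEG", quality=95)
```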

Data annotation

Character location

In this step, our goal is to obtain the position of all the characters on the historical document images. Referring to previous works4,7, we employ rectangular boxes that enclose the characters to indicate their position. Given the number of our images (over 3 million), it would be impractical to manually annotate all the rectangular boxes, as it would consume an enormous amount of human labor.

Therefore, we used a deep-learning-based object detection algorithm as an automatic character position annotator, which treats all characters as detection targets. To develop an accurate character-localization model, high-quality training data is required. To this end, we proposed a strategy to efficiently construct the training set: selecting one image sample from each book and manually annotating the character positions on these images. With this strategy, we only need to manually annotate 40,281 images, a 98.7% reduction in annotation effort compared to full annotation. This strategy rests on the following prior knowledge: the key to character location in Chinese historical documents lies in accurately distinguishing the foreground characters from the complex background, and the stylistic attributes of the foreground characters and background are highly consistent across images from the same book. This insight suggests that a character location model capable of accurately localizing the characters in a sample image from a book can effectively handle all images in the entire book.

To efficiently annotate these 40,281 images, we adopted a hybrid approach that combines model-generated pseudo-annotations with manual refinement. Initially, the open-source datasets MTHv25 and M5HisDoc7 were used to train the object detection model. For the choice of character localizer, we compared mainstream detectors such as Faster R-CNN9, YOLOX10 and YOLOv711. Experimental results indicate that YOLOv7 achieves superior training and inference efficiency with comparable performance, so we chose YOLOv7 as the base model for the character localizer. Subsequently, the model was employed to generate preliminary annotations for the character positions of these images. Thereafter, we invited professional annotators to refine these annotations, including filling in omissions, removing redundancies, and correcting inaccuracies. The annotation refinement process was facilitated by Labelme (https://github.com/labelmeai/labelme), an open-source annotation tool. This hybrid annotation approach further reduced human labor.
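To make the pseudo-annotations editable in Labelme, the detector outputs can be written into Labelme's rectangle JSON schema. The sketch below is a minimal illustration: the `detect` call mentioned in the closing comment is a hypothetical wrapper around the trained localizer, while the "shapes"/"rectangle" structure follows the Labelme annotation format.

```python
# Minimal sketch: dump detector boxes as a Labelme-compatible JSON file so that
# annotators can refine them (fill omissions, remove redundancies, fix errors).
import json
from pathlib import Path
from PIL import Image

def to_labelme(image_path: str, boxes, out_path: str) -> None:
    """Write (x1, y1, x2, y2) boxes into Labelme's rectangle annotation format."""
    width, height = Image.open(image_path).size
    shapes = [
        {
            "label": "char",                    # single foreground class
            "points": [[x1, y1], [x2, y2]],     # top-left and bottom-right corners
            "group_id": None,
            "shape_type": "rectangle",
            "flags": {},
        }
        for (x1, y1, x2, y2) in boxes
    ]
    record = {
        "version": "5.0.1",
        "flags": {},
        "shapes": shapes,
        "imagePath": Path(image_path).name,
        "imageData": None,                      # Labelme reloads pixels from imagePath
        "imageHeight": height,
        "imageWidth": width,
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)

# Example (hypothetical detector wrapper):
#   boxes = detect(image_path)
#   to_labelme(image_path, boxes, image_path.replace(".jpg", ".json"))
```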

Through this method, we obtained accurate character locations for 40,281 images. Combining these refined annotations with the open-source data as the training set, we trained the YOLOv7 model as a highly accurate character location annotator. Finally, the annotator was applied to our collection of over 3 million images, yielding accurate character location annotations. The character positions obtained in this step total over 1 billion.

Character annotation

The next crucial step is dedicated to recognizing each character on the images, i.e., representing each character with a computer-readable code. In the previous step, we obtained the locations of over 1 billion characters, so over 1 billion characters need to be recognized. For such a large amount of data, manual annotation would be extremely labor-intensive and impractical.

Therefore, we adopted an automatic annotation method based on deep learning, specifically employing a classifier to automatically assign the character images to their respective character categories. For the character annotator, we evaluated both convolutional networks12 and Vision Transformer (ViT)13 architectures, with results showing that ViT achieves better performance. Therefore, to achieve higher-quality annotations, we selected ViT as the architecture for training the character annotator. The total number of categories was set to 31,524, which encompasses the 27,533 characters in the national standard (GB 18030-2000) as well as other character categories in the open-source datasets3,5,7.

To develop a high-performance character annotator, we implemented the following three strategies:

  • Self-supervised pre-training: Leveraging the precise annotations of character position obtained from the previous step, we cropped out character images from document images to form a self-supervised learning dataset. Utilizing this dataset, we pre-trained our model using the MAE14 framework, a self-supervised learning method that has been proven to significantly enhance the model performance.

  • Data synthesis: We employed MTHv25, CASIA-AHCDB3, and M5HisDoc7 as the foundation of the training set. However, due to the limited scale of these datasets, we integrated data synthesis techniques to augment the training set. From a font website (https://www.foundertype.com/), we selected 320 TrueType font (TTF) files that are similar in style to the characters in Chinese historical documents. The character images rendered with these font files were used to supplement the training set. Additionally, for the categories covered by neither the open-source datasets nor the TTF data, we used FontDiffuser15, an advanced deep-learning-based font generation approach, for data completion. Specifically, using the character images of existing categories within the TTF files as training data, we trained the FontDiffuser model to generate character images for each of the absent character categories. Furthermore, to enhance the authenticity of the synthetic data, we randomly replaced the white background with actual historical document backgrounds.

  • Category-balanced sampling: In light of the long-tail distribution of character categories within open-source datasets, directly training models on them could lead to a bias towards common categories and neglect of rare ones. To mitigate this, we adopted a class-balanced sampling strategy, randomly sampling the dataset during training so that each category is given equal consideration, thereby improving the model’s comprehensive recognition capability (a minimal sketch is given after this list).
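The category-balanced sampling mentioned above can be realized, for example, with PyTorch's WeightedRandomSampler, weighting each sample inversely to the frequency of its category. The sketch below is a minimal illustration under that assumption; the dataset object, its `labels` attribute, and the batch size are placeholders, not the exact training configuration.

```python
# Minimal sketch: class-balanced sampling with PyTorch, weighting every sample by
# the inverse frequency of its character category so rare classes are not drowned
# out by common ones. `train_set` is a hypothetical dataset whose `labels`
# attribute lists the category index of each sample.
from collections import Counter
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(train_set, batch_size=256):
    counts = Counter(train_set.labels)                        # samples per category
    weights = [1.0 / counts[label] for label in train_set.labels]
    sampler = WeightedRandomSampler(
        weights=torch.DoubleTensor(weights),
        num_samples=len(weights),                             # one "epoch" of draws
        replacement=True,                                     # rare classes may repeat
    )
    return DataLoader(train_set, batch_size=batch_size, sampler=sampler)
```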

Through these strategies, we successfully developed an automatic character annotator. Subsequently, we fed all the cropped character images into this annotator, accomplishing the character annotation step with precision and efficiency.

Algorithm 1

Segmentation of Text Blocks.

Algorithm 2

Sort Characters in Text Blocks.

Character arrangement

The objective of this stage is to arrange the characters within the images in the correct reading order. Analysis of Chinese historical documents reveals that the arrangement of characters follows certain patterns. For instance, as shown in Fig. 3(a), the characters are typically segregated into distinct blocks, and within each block, they are organized in a top-to-bottom and right-to-left sequence. Leveraging these patterns, we designed a character arrangement algorithm based on heuristic rules, which consists of three primary steps, as illustrated in Fig. 3:

  • Segmentation of text blocks: In Chinese historical documents, the segmentation of text blocks is typically indicated by black lines, as illustrated in Fig. 3(a). This feature provides a visual clue for the automatic segmentation of text blocks. Based on this feature, we employed image processing techniques to identify and extract these segment lines. Firstly, the image binarization technique from the OpenCV library was applied to the image. Secondly, a masking operation was conducted to set all pixels within the character regions to white. The purpose of this step is to reduce the potential interference of characters with the detection of segment lines, as the strokes of the characters may be pixel-wise similar to the segment lines and could thus affect the performance of the algorithm. Finally, the horizontal projection technique was used to locate the segment lines. This technique is implemented by calculating the total number of black pixels across each row of the image. By analyzing the distribution of the projection results, we can identify the locations of peaks, which are taken as the segment lines between text blocks. The specific algorithmic process is demonstrated in Algorithm 1, and a minimal code sketch of this projection step is given after this list.

  • Character arrangement within text blocks: Within each text block, characters are first arranged from top to bottom, forming columns known as text lines. These text lines are then ordered from right to left. Notably, in Chinese historical documents, a special format sometimes occurs where two smaller text lines appear below a larger text line, referred to as “double-column annotation”. These smaller text lines typically serve to explain the larger text line above them. In the case of “double-column annotation”, the reading order should start with the larger text line, followed by the smaller text lines in a right-to-left order. Based on these reading-order rules, we developed a heuristic algorithm, as depicted in Algorithm 2, whose central idea is to aggregate characters into horizontal regions and detect the presence of “double-column annotation” within these regions. If such a situation is absent, the region is treated as an independent text line; otherwise, the characters within the “double-column annotation” are further distinguished. After aggregating all the characters into separate text lines, the algorithm sorts the characters according to the top-to-bottom (within a text line) and right-to-left (between text lines) rules.

  • Character arrangement between text blocks: The ordering between text blocks follows the basic principle of top-to-bottom arrangement. Therefore, we sort the text blocks based on the vertical coordinates (y-coordinates), from small to large.
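The following is a minimal sketch of the projection step referenced in Algorithm 1, assuming OpenCV and NumPy: the page is binarized, the detected character boxes are masked out, black pixels are summed per row, and rows whose counts exceed a threshold are grouped into segment lines. The Otsu binarization and the half-width peak threshold are illustrative choices, not the exact pipeline settings.

```python
# Minimal sketch of the projection-based text-block segmentation (cf. Algorithm 1).
# Assumes OpenCV and NumPy; the binarization method and the peak threshold
# (here half of the image width) are illustrative, not the exact pipeline values.
import cv2
import numpy as np

def find_segment_lines(image_path, char_boxes, ratio=0.5):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize: black ink -> 1, background -> 0 (inverted Otsu threshold).
    _, binary = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Mask character regions so strokes do not mimic segment lines.
    for (x1, y1, x2, y2) in char_boxes:
        binary[int(y1):int(y2), int(x1):int(x2)] = 0
    # Horizontal projection: number of black pixels in each row.
    projection = binary.sum(axis=1)
    # Rows whose black-pixel count exceeds `ratio` of the width are candidates.
    candidates = np.where(projection > ratio * binary.shape[1])[0]
    # Group consecutive candidate rows into single segment lines (take the middle row).
    lines, group = [], []
    for row in candidates:
        if group and row != group[-1] + 1:
            lines.append(int(np.median(group)))
            group = []
        group.append(row)
    if group:
        lines.append(int(np.median(group)))
    return lines   # y-coordinates separating text blocks
```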

Fig. 3
figure 3

The visualization of character arrangement. (a) Segmentation of text blocks. (b) Sort characters within text blocks. (c) Sort characters between text blocks. Zoom in for a better view.

By applying the aforementioned algorithms to all document images, the character arrangement step is completed efficiently.

Text punctuation

After successfully extracting individual text sequences from each image, it is crucial to punctuate these sequences to align with standard reading conventions. Given the extensive number of sequences to process, we employed an automatic punctuation method based on deep learning.

To develop an efficient automatic punctuation system, we used an ancient text corpus (https://github.com/garychowcmu/daizhigev20) as our training dataset. This corpus encompasses a rich collection of punctuated paragraphs sourced from historical documents. The model we employed was the Transformer16, which is widely recognized for its outstanding performance on sequential data.

The specific implementation steps are as follows: Initially, we preprocessed the selected ancient text corpus by removing all existing punctuation marks, yielding a set of unpunctuated text sequences for model input. Concurrently, the originally punctuated text served as the model’s training target. Through this approach, the model can learn how to predict and add appropriate punctuation marks based on the text content.

To enhance the accuracy of automatic punctuation, we adopted a strategy of chunked input. Specifically, each unpunctuated text sequence was segmented into 30-character chunks, which were subsequently fed into the punctuation model incrementally. The punctuation outcomes of these chunks were then concatenated sequentially to reconstitute the fully punctuated text.
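The chunked inference described above amounts to splitting each sequence into fixed-length windows, punctuating each window, and concatenating the results. The sketch below illustrates this; `punctuate_chunk` is a hypothetical wrapper around the trained Transformer punctuation model, and the 30-character chunk size follows the text.

```python
# Minimal sketch of chunked punctuation inference. `punctuate_chunk` is a
# hypothetical callable that takes an unpunctuated string and returns the same
# string with punctuation marks inserted by the trained model.
def punctuate_text(text: str, punctuate_chunk, chunk_size: int = 30) -> str:
    punctuated = []
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]      # 30-character window
        punctuated.append(punctuate_chunk(chunk))   # model adds punctuation marks
    return "".join(punctuated)                      # reconstitute the full text
```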

Data validation

To evaluate the accuracy and reliability of our semi-automatic annotation method, we conducted the following data validation. Firstly, we randomly selected 100 images from the annotated dataset. Secondly, we invited domain experts to independently and meticulously annotate these images. These annotations served as the ground truth for comparison with the results produced by our semi-automatic annotation system. Thirdly, we performed a thorough comparative analysis; the comparison results are presented in Table 2. The metrics for character location, character annotation, character arrangement, and text punctuation are the F1-score at 0.7 IoU, top-1 accuracy, ARDs, and F1-score17, respectively. The results indicate that our semi-automatic annotation system performs well on these metrics, demonstrating the high quality of our dataset.

Table 2 Results of data validation.

Data Records

The HisDoc1B8 dataset consists of two main folders, one dedicated to storing historical documents in the form of e-books and the other containing their respective annotation files. The e-books are archived in PDF or DjVu formats. The annotation files are stored in JavaScript Object Notation (JSON) format, aligning with each corresponding e-book. The dataset employs a unique book identifier (ID) to pair e-books with their annotation files, ensuring precise alignment.

Each annotation file contains various annotation entries corresponding to the individual pages of the book. Each entry includes a detailed record of the following three key parts (an illustrative example is given after this list):

  • Character position: an array of rectangular boxes, sequenced in reading order. Each rectangular box indicates the position of a character within the document image. The coordinates are represented in the format ‘x1, y1, x2, y2’, where ‘x1, y1’ specifies the top-left corner of the rectangle and ‘x2, y2’ denotes the bottom-right corner.

  • Character content: a sequence of Unicode-encoded symbols, with the order corresponding to the character positions.

  • Punctuated text: the complete text sequence that includes punctuation marks.
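For illustration, a single page-level entry might look like the following sketch, shown as a Python dictionary mirroring the JSON file. The key names are hypothetical placeholders for the three parts above; the coordinates follow the ‘x1, y1, x2, y2’ convention and the boxes and characters are aligned in reading order.

```python
# Illustrative page-level annotation entry (hypothetical key names and values).
page_entry = {
    "page": 12,
    "char_positions": [                 # rectangular boxes in reading order
        [820.0, 90.0, 868.0, 142.0],
        [820.0, 148.0, 868.0, 200.0],
        [820.0, 206.0, 868.0, 258.0],
        [820.0, 264.0, 868.0, 316.0],
    ],
    "char_contents": ["天", "下", "太", "平"],   # Unicode characters, aligned with boxes
    "punctuated_text": "天下太平。",              # full page text with punctuation marks
}
```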

HisDoc1B focuses mainly on the commonly used scripts in Chinese historical documents. However, some special scripts, such as oracle bone inscriptions18,19 and bronze inscriptions, are missing from our dataset. Covering them is a direction for future work.

Technical Validation

To assess the utility of the proposed HisDoc1B dataset for Chinese historical document recognition and analysis, we conducted three technical validation tasks: character detection, character recognition, and incremental pre-training of a language model.

Character detection

The aim of this task is to validate the application value of the character location annotations within HisDoc1B. To this end, we devised and executed a cross-validation experiment for character detection: employing a consistent model architecture, we conducted training on diverse datasets of Chinese historical documents and subsequently assessed performance on each dataset. The comparative datasets included MTHv25 and M5HisDoc7. Given the extensive data scale of HisDoc1B, we implemented a sampling strategy to improve the efficiency of the experiment, randomly selecting two images per book, one for the training set and one for the test set. We chose the YOLOv711 model and trained it for 100 epochs on each of the above training sets. Subsequently, we validated the model performance on the test set of each dataset. The validation metric is the F1-score at a 0.7 Intersection over Union (IoU) threshold. The results are shown in Table 3. Based on the results, we observed that the model trained on the HisDoc1B dataset demonstrates the most robust generalization performance on the non-homologous datasets. This is attributed to the stylistic diversity and large scale of our dataset, which enhance the model’s generalization capabilities; it also reflects the accuracy of the annotations. Conversely, models trained on other datasets perform poorly on the test set of HisDoc1B, mainly due to the extensive range of character categories in our dataset, which introduces new challenges to the field. The model trained on M5HisDoc exhibits the second-best generalization, which is attributed to its inclusion of multiple styles.

Table 3 Results of character detection.

Character recognition

To evaluate the practical utility of the character annotations within the HisDoc1B dataset, we designed and conducted a character recognition cross-validation experiment similar to the previous one. The datasets compared with HisDoc1B include MTHv25, CASIA-AHCDB3, and M5HisDoc7. For HisDoc1B, we randomly sampled 10 images per category to construct the test set; the remaining images constituted the training set. To speed up training, categories with more than 300 samples in each training set were downsampled to 300. Because the character categories vary across datasets, the zero-shot recognition model HierCode20 was employed, giving the model the ability to generalize beyond the training categories. The model was trained for 90 epochs on each of the above training sets. Subsequently, we validated the model performance on the test set of each dataset. The validation metric is the macro accuracy21. The results are shown in Table 4. Based on the results, we observed that the model trained on the HisDoc1B dataset exhibits the best generalization performance on the non-homologous datasets. This is attributed to the stylistic diversity and large scale of our dataset, which enhance the model’s generalization capabilities; it also reflects the accuracy of the annotations. Conversely, models trained on other datasets perform poorly on the test set of HisDoc1B, primarily due to the diverse range of character categories within our dataset, which introduces new challenges to the field. The model trained on M5HisDoc exhibits the second-best generalization, which is attributed to its inclusion of multiple styles.

Table 4 Results of character recognition.

Incremental pre-training of language model

To validate the utility and quality of the book-level text provided in the HisDoc1B dataset, we conducted an incremental pre-training experiment with a large language model. For comparative analysis, the DaiZhiGe (https://github.com/garychowcmu/daizhigev20) data was used, an open-source corpus of Chinese ancient texts. The language model selected for this experiment is Qwen1.5-1.5B22, which was trained for 1 epoch on each of the above corpora. Subsequently, we employed ACLUE23, a comprehensive benchmark of ancient Chinese language understanding, to evaluate the performance of the pre-trained language models. The training set of ACLUE was used for instruction fine-tuning of each pre-trained model, and the scores on the test set of ACLUE were compared after fine-tuning. The results are detailed in Table 5, showing that the large language model pre-trained on the HisDoc1B dataset outperforms the model pre-trained on the DaiZhiGe corpus, both before and after fine-tuning. This superior performance can be attributed to the large scale and diversity of the HisDoc1B dataset, which provides the model with rich knowledge of ancient Chinese culture and linguistic priors. Although the DaiZhiGe corpus is larger than HisDoc1B, it contains noise that negatively affects model performance. In contrast, the data quality of HisDoc1B is maintained at a high standard through our meticulous data construction pipeline. These findings not only confirm the accuracy and quality of the annotations in the HisDoc1B dataset but also demonstrate its effectiveness in advancing language modeling for the understanding of ancient Chinese texts.

Table 5 Results of incremental pre-training of language model.

Baseline experiments

The baseline experiment for character detection follows the same data split as in Section Character detection. We conduct baseline experiments with a typical two-stage method, Faster R-CNN9, as well as two one-stage methods, YOLOX10 and YOLOv711. The models are trained for 12 epochs with a batch size of 8. We use stochastic gradient descent (SGD) for optimization, with an initial learning rate of 0.001, which decays by a factor of 0.1 at the 8th and 11th epochs. Input images are resized and padded to a resolution of 1024 × 1024 while maintaining the aspect ratio. The evaluation metrics include precision, recall, and F1-score at a 0.7 IoU threshold. The results are shown in Table 6, where the various methods demonstrate comparable performance.

Table 6 Results of baseline experiments on character detection.

The baseline experiment for character recognition uses the same data split as in Section Character recognition. We conduct baseline experiments on two types of methods: (1) conventional image classifiers, including ResNet5012 and ViT13, and (2) zero-shot character recognition models, including HDE24, RIE25, and HierCode20. The models are optimized using AdamW26 with a base learning rate of 1e-3, which decays to 1e-6 following a cosine annealing schedule. During the first 5 epochs, the learning rate linearly warms up from 1e-4 to 1e-3. We train the models for 90 epochs with a batch size of 1,024. All models use RandAugment27 for data augmentation. Input images are resized and padded to a resolution of 96 × 96 while maintaining the aspect ratio. Evaluation metrics include top-1 accuracy and macro accuracy21. From the results in Table 7, we draw the following findings: (1) ViT slightly outperforms ResNet50 among conventional image classifiers, and (2) zero-shot character recognition models outperform conventional classifiers owing to the variations in character categories between the training and test sets.

Table 7 Results of baseline experiments on character recognition.
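The recognition-baseline optimization schedule described above (AdamW, linear warm-up from 1e-4 to 1e-3 over the first 5 epochs, then cosine decay to 1e-6 over the remaining epochs) can be assembled from standard PyTorch schedulers stepped once per epoch. The snippet below is a sketch under that assumption; the placeholder module stands in for the actual backbone, and the exact implementation may differ.

```python
# Minimal sketch of the recognition-baseline schedule: AdamW with base lr 1e-3,
# linear warm-up from 1e-4 over the first 5 epochs, then cosine annealing down to
# 1e-6 for the remaining 85 of 90 epochs. The scheduler is stepped once per epoch.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(8, 8)            # placeholder; stands in for ResNet50/ViT/etc.
optimizer = AdamW(model.parameters(), lr=1e-3)

warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=85, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(90):
    # ... one epoch of training with batch size 1,024 and RandAugment ...
    scheduler.step()
```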

Exploration of the challenges associated with the size of the dataset

The large scale of our dataset introduces a wide variety of styles and character categories, presenting the challenges discussed in Sections Character detection and Character recognition. To further explore the challenges that the dataset size poses for training resources, we conducted the following experiments. Following the setup in Section Character recognition, we trained the HierCode20 model with backbones of varying parameter sizes and for varying training durations. The results are shown in Table 8, where a clear performance improvement is observed with increased model parameters and training time. This demonstrates that the scale of our dataset necessitates models with more parameters and longer training durations to achieve optimal results, which poses a significant challenge to training resources.

Table 8 Results of exploration experiments.

Usage Notes

The HisDoc1B dataset consists of two main folders. One folder contains the e-books of historical documents in the format of PDF or DjVu files. The other folder contains the corresponding annotation files for each book in JSON format. We have provided a Python script that facilitates the conversion of e-books into image format (JPEG), reads annotation data from the JSON files, and organizes them within the appropriate data structures.
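As an illustration of how the released script can be used, the sketch below pairs one book's JSON annotation file with its converted page images and iterates over the per-page records. The directory layout and JSON field names are hypothetical placeholders for the three parts listed in Data Records; refer to the released Python script for the exact conversion and parsing logic.

```python
# Minimal usage sketch: load one book's annotation file and iterate over its pages.
# Field names and paths are illustrative placeholders, not the authoritative schema.
import json
from pathlib import Path

book_id = "0000123"                                  # hypothetical book identifier
ann_path = Path("annotations") / f"{book_id}.json"
image_dir = Path("images") / book_id                 # JPEGs produced from the e-book

with open(ann_path, encoding="utf-8") as f:
    book_ann = json.load(f)

for page in book_ann["pages"]:                       # one record per page
    image_path = image_dir / f"{page['page']:04d}.jpg"
    boxes = page["char_positions"]                   # [x1, y1, x2, y2] per character
    chars = page["char_contents"]                    # aligned Unicode characters
    text = page["punctuated_text"]                   # full punctuated page text
    print(image_path, len(boxes), len(chars), text[:20])
```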