Table 1 Comparison of HisDoc1B with existing Chinese historical document datasets.

From: A large-scale dataset for Chinese historical document recognition and analysis

| Dataset | #Books | #Document images | #Characters | #Character categories | Text punctuation |
| --- | --- | --- | --- | --- | --- |
| MTHv1 [4] | — | 1,500 | 521,370 | 4,058 | × |
| MTHv2 [5] | — | 3,199 | 1,081,678 | 6,733 | × |
| IC19 HDRC [6] | — | 11,715 | 2,482,994 | 8,353 | × |
| M5HisDoc [7] | — | 8,000 | 4,367,360 | 16,151 | × |
| CASIA-AHCDB [3] | — | — | 2,276,740 | 10,350 | × |
| HisDoc1B [8] (Ours) | 40,281 | 3,163,330 (270×) | 1,082,544,808 (248×) | 30,615 (1.9×) | ✓ |

  1. The highest and second-highest values within each column are denoted by bold and underline, respectively. HisDoc1B contains more than 200 times as many document images and characters, and 1.9 times as many character categories, as the largest existing dataset in each respective column. It is the only dataset that includes book-level and punctuation annotations.
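The multipliers in the HisDoc1B row can be checked with a few lines of arithmetic. The sketch below assumes each multiplier is the ratio of the HisDoc1B count to the largest count reported by any existing dataset in the same column. The dictionary layout and the helper name `ratio_to_largest` are illustrative choices, not something taken from the paper.

```python
# Counts copied from Table 1; None marks a value the table does not report.
existing = {
    "MTHv1": {"images": 1_500, "characters": 521_370, "categories": 4_058},
    "MTHv2": {"images": 3_199, "characters": 1_081_678, "categories": 6_733},
    "IC19 HDRC": {"images": 11_715, "characters": 2_482_994, "categories": 8_353},
    "M5HisDoc": {"images": 8_000, "characters": 4_367_360, "categories": 16_151},
    "CASIA-AHCDB": {"images": None, "characters": 2_276_740, "categories": 10_350},
}
hisdoc1b = {"images": 3_163_330, "characters": 1_082_544_808, "categories": 30_615}


def ratio_to_largest(column):
    """Return (dataset name, ratio) for the largest existing dataset in a column.

    Hypothetical helper: assumes the multiplier is HisDoc1B divided by the
    column maximum over the existing datasets that report a value.
    """
    reported = {
        name: row[column]
        for name, row in existing.items()
        if row[column] is not None
    }
    largest = max(reported, key=reported.get)
    return largest, hisdoc1b[column] / reported[largest]


ratios = {column: ratio_to_largest(column) for column in hisdoc1b}
for column, (name, ratio) in ratios.items():
    print(f"{column}: {ratio:.1f}x relative to {name}")
```

Under that assumption, the ratios round to the 270×, 248×, and 1.9× shown in the table, with IC19 HDRC as the reference for document images and M5HisDoc as the reference for characters and character categories.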