Table 1 Comparison of HisDoc1B with existing Chinese historical document datasets.

Dataset	#Books	#Document images	#Characters	#Character categories	Text punctuation
MTHv1⁴	—	1,500	521,370	4,058	×
MTHv2⁵	—	3,199	1,081,678	6,733	×
IC19 HDRC⁶	—	11,715	2,482,994	8,353	×
M5HisDoc⁷	—	8,000	4,367,360	16,151	×
CASIA-AHCDB³	—	—	2,276,740	10,350	×
HisDoc1B⁸ (Ours)	40,281	3,163,330 (270×)	1,082,544,808 (248×)	30,615 (1.9×)	✓

The highest and second-highest values in each column are shown in bold and underlined, respectively. HisDoc1B contains more than 200 times as many document images and characters as the largest existing dataset, and 1.9 times as many character categories. It is the only dataset that includes book-level and punctuation annotations.
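The multipliers in parentheses in the HisDoc1B row can be checked against the table itself. The sketch below assumes each multiplier is the HisDoc1B value divided by the largest value among the other datasets in the same column; the table does not state this explicitly, and the names `prior_max`, `hisdoc1b`, and `scale_factors` are illustrative only.

```python
# Largest value among the earlier datasets in each column of Table 1
# (assumption: the multipliers are measured against these maxima).
prior_max = {
    "images": 11_715,          # IC19 HDRC
    "characters": 4_367_360,   # M5HisDoc
    "categories": 16_151,      # M5HisDoc
}

# HisDoc1B row of Table 1.
hisdoc1b = {
    "images": 3_163_330,
    "characters": 1_082_544_808,
    "categories": 30_615,
}


def scale_factors(ours, baseline):
    """Return ours[k] / baseline[k] for every column k."""
    return {k: ours[k] / baseline[k] for k in ours}


ratios = scale_factors(hisdoc1b, prior_max)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Under that assumption the ratios round to the 270×, 248×, and 1.9× reported in the table.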
