Abstract
Oracle bone inscriptions, the earliest known form of Chinese writing, hold immense historical and linguistic significance. However, existing digital datasets are typically limited to isolated characters and lack the contextual and structural information essential for comprehensive analysis. We present the Oracle Bone Inscriptions Multi-modal Dataset (OBIMD), a large-scale, publicly available corpus that provides pixel-aligned rubbing and facsimile images, character-level annotations, and sentence-level transcriptions with corresponding reading sequences. OBIMD encompasses 10,077 oracle bone inscription images spanning five phases of the Shang Dynasty, featuring 93,652 annotated characters, 21,667 recorded missing-character positions, 21,941 sentence units, and 4,192 non-sentential elements. By integrating visual, structural, and linguistic modalities, OBIMD supports multi-modal learning and diverse tasks such as facsimile enhancement, character retrieval, and syntactic reconstruction. It constitutes a foundational resource for oracle bone inscription recognition and interpretation, enabling scalable and systematic analysis of ancient Chinese writing.
Data availability
The OBIMD dataset generated and analysed during the current study is available on the Hugging Face Hub at https://doi.org/10.57967/hf/7828 (ref. 14).
Code availability
Source code and scripts used for the technical validation experiments on the OBIMD dataset (ref. 14) are publicly available on GitHub at https://github.com/libang1991/OBIMD. The repository includes the core implementations of the baseline models evaluated in this manuscript, supporting reproducibility of the reported results. The web-based annotation platform used for OBIMD construction is available at https://www.jgwlbq.org.cn/oracle-bone.
References
Boltz, W. G. Early Chinese writing. World Archaeology 17, 420–436 (1986).
Keightley, D. N. The Shang state as seen in the oracle-bone inscriptions. Early China 5, 25–34, https://doi.org/10.1017/S0362502800006118 (1979).
Qi, Y. & Yuan, W. A digital infrastructure for the study of oracle bone inscriptions https://jgw.aynu.edu.cn/ (2019).
Fujikawa, Y. et al. Recognition of oracle bone inscriptions by using two deep learning models. International Journal of Digital Humanities 5, 65–79, https://doi.org/10.1007/s42803-022-00044-9 (2023).
Li, J. et al. Towards better long-tailed oracle character recognition with adversarial data augmentation. Pattern Recognition 140, 109534, https://doi.org/10.1016/j.patcog.2023.109534 (2023).
Fu, X. & Zhou, R. Shape prior fusion for oracle bone inscriptions detection. in Proceedings of the 2024 7th International Conference on Image and Graphics Processing 394–401, https://doi.org/10.1145/3647649.3647711 (2024).
Liu, G., Xing, J. & Xiong, J. Spatial pyramid block for oracle bone inscription detection. in Proceedings of the 2020 9th International Conference on Software and Computer Applications 133–140, https://doi.org/10.1145/3384544.3384561 (2020).
Huang, S. et al. OBC306: A large-scale oracle bone character recognition dataset. in 2019 International Conference on Document Analysis and Recognition (ICDAR) 681–688, https://doi.org/10.1109/ICDAR.2019.00114 (2019).
Yue, X. et al. Dynamic dataset augmentation for deep learning-based oracle bone inscriptions recognition. ACM Journal on Computing and Cultural Heritage 15, 1–20, https://doi.org/10.1145/3532868 (2022).
Wang, M. & Deng, W. A dataset of oracle characters for benchmarking machine learning algorithms. Scientific Data 11, 87, https://doi.org/10.1038/s41597-024-02933-w (2024).
Li, B. et al. HWOBC-a handwriting oracle bone character recognition database. Journal of Physics: Conference Series 1651, 012050, https://doi.org/10.1088/1742-6596/1651/1/012050 (2020).
Wang, P. et al. An open dataset for oracle bone character recognition and decipherment. Scientific Data 11, 976, https://doi.org/10.1038/s41597-024-03807-x (2024).
Guan, H. et al. An open dataset for the evolution of oracle bone characters: EVOBC. Preprint at https://arxiv.org/abs/2401.12467 (2024).
Key Laboratory of Oracle Bone Inscriptions Information Processing, Li, B., Yang, J. et al. OBIMD. Hugging Face https://doi.org/10.57967/hf/7828 (2026).
Oracle Bone AI Collaborative Platform. An AI-driven research workspace for oracle bone inscriptions https://www.jgwlbq.org.cn/oracle-bone (2024).
Guo, M. & Hu, H. (eds) Jiaguwen Heji [Collection of Oracle Bone Inscriptions] (Zhonghua Book Company, 1978–1982).
Chinese Academy of Social Sciences. Yinxu Huayuanzhuang Dongdi Jiagu [Oracle Bone Inscriptions from Huayuanzhuang East at Yinxu] (Yunnan Nationalities Publishing House, 2003).
Huang, T. (ed.) Jiaguwen Moben Daxi [Comprehensive Series of Oracle Bone Facsimiles] (Peking University Press, 2022).
Oracular Digital Platform. Glyph library https://oracular.azurewebsites.net/glyphs (2024).
Ultralytics. ultralytics (v8.2.94). GitHub https://github.com/ultralytics/ultralytics/releases/tag/v8.2.94 (2024).
Kuhn, H. W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2, 83–97, https://doi.org/10.1002/nav.3800020109 (1955).
Ultralytics. ultralytics (v8.3.0). GitHub https://github.com/ultralytics/ultralytics/releases/tag/v8.3.0 (2024).
Vaswani, A. et al. Attention is all you need. in Advances in Neural Information Processing Systems 30 (NIPS 2017) 5998–6008, https://papers.nips.cc/paper/7181-attention-is-all-you-need (2017).
Han, W. et al. Self-supervised learning of Orc-Bert augmentor for recognizing few-shot oracle characters. in Proceedings of the Asian Conference on Computer Vision 652–668, https://doi.org/10.1007/978-3-030-69544-6_39 (2021).
Zhang, G. et al. Deciphering ancient Chinese oracle bone inscriptions using case-based reasoning. in International Conference on Case-Based Reasoning 309–324, https://doi.org/10.1007/978-3-030-86957-1_21 (Springer, 2021).
Yue, X. et al. An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions. International Journal on Document Analysis and Recognition (IJDAR) 27, 583–601, https://doi.org/10.1007/s10032-024-00463-0 (2024).
Acknowledgements
This research was supported by the National Natural Science Foundation of China (Grant No. 62506007), the Natural Science Foundation of Henan Province (Grant No. 242300420680), the Paleography and Chinese Civilization Inheritance and Development Program (Grant Nos. G1807, G1806, G2821), the Henan Province Science and Technology Research Project (Grant Nos. 242102210116, 252102321071), the Open Research Topic of the Key Laboratory of Oracle Information Processing, Ministry of Education (Grant Nos. OIP2024E002, OIP2024H002), the Key Technology Project of Henan Educational Department of China (Grant No. 22ZX010), and the Henan Province High-Level Talents International Training Program (Grant No. GCC2025028).
Author information
Contributions
Bang Li and Jing Yang co-wrote the manuscript and contributed equally to this work. Yujie Liang and Zengmao Ding also contributed to manuscript writing. Technical validation experiments were conducted by Bang Li, Jing Yang, and Zengmao Ding. Xiaobin Hu and Taisong Jin proposed key revisions before submission. Bang Li and Donghao Luo initiated and supervised the construction of the OBIMD dataset. Yujie Liang, Zengmao Ding, and Xu Peng implemented algorithmic pre-annotation for the dataset. Bang Li, Jing Yang, and Shengwei Han conducted manual annotation and developed annotation guidelines. Bang Li, Donghao Luo, and Yongge Liu coordinated the annotation teams. Peichao Qin provided the standard oracle bone character library for the dataset. Rongrong Ji, Feng Gao, and Yongge Liu supported the project through funding and resources. Correspondence should be addressed to Donghao Luo or Taisong Jin, who provided overall guidance on project design, dataset framework, and manuscript review. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, B., Yang, J., Liang, Y. et al. OBIMD: A Multi-modal Dataset for Contextual Interpretation of Oracle Bone Inscriptions. Sci Data (2026). https://doi.org/10.1038/s41597-026-06967-0
DOI: https://doi.org/10.1038/s41597-026-06967-0


