Scientific Data
OBIMD: A Multi-modal Dataset for Contextual Interpretation of Oracle Bone Inscriptions
  • Data Descriptor
  • Open access
  • Published: 14 March 2026

  • Bang Li (1, equal contribution),
  • Jing Yang (1, equal contribution),
  • Yujie Liang (2),
  • Xiaobin Hu (3),
  • Zengmao Ding (1),
  • Xu Peng (3),
  • Shengwei Han (1),
  • Peichao Qin (4),
  • Donghao Luo (3),
  • Taisong Jin (2),
  • Feng Gao (1),
  • Yongge Liu (1) &
  • Rongrong Ji (2)

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • History
  • Science in culture

Abstract

Oracle bone inscriptions, the earliest known form of Chinese writing, hold immense historical and linguistic significance. However, existing digital datasets are typically limited to isolated characters and lack the contextual and structural information essential for comprehensive analysis. We present the Oracle Bone Inscriptions Multi-modal Dataset (OBIMD), a large-scale, publicly available corpus that provides pixel-aligned rubbing and facsimile images, character-level annotations, and sentence-level transcriptions with corresponding reading sequences. OBIMD encompasses 10,077 oracle bone inscription images spanning five phases of the Shang Dynasty, featuring 93,652 annotated characters, 21,667 recorded missing-character positions, 21,941 sentence units, and 4,192 non-sentential elements. By integrating visual, structural, and linguistic modalities, OBIMD supports multi-modal learning and diverse tasks such as facsimile enhancement, character retrieval, and syntactic reconstruction. It constitutes a foundational resource for oracle bone inscription recognition and interpretation, enabling scalable and systematic analysis of ancient Chinese writing.
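
To make the three modalities concrete, the following is a minimal sketch of how one image pair with its character-level and sentence-level annotations might be modelled in Python. It is not the published OBIMD schema: every class name, field name, the box convention, the use of `None` for missing-character positions, and the placeholder labels and paths are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CharAnnotation:
    """One annotated position (hypothetical fields, not the OBIMD schema)."""
    box: Tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels
    label: Optional[str]            # assumed: None marks a missing-character position
    sentence_id: Optional[int]      # assumed: None marks a non-sentential element
    order: Optional[int]            # index in the sentence's reading sequence


@dataclass
class InscriptionRecord:
    """One rubbing/facsimile pair with its annotations (hypothetical layout)."""
    rubbing_path: str
    facsimile_path: str
    phase: int                      # Shang Dynasty phase, 1 to 5
    chars: List[CharAnnotation] = field(default_factory=list)

    def sentence(self, sentence_id: int) -> List[CharAnnotation]:
        """Return the annotations of one sentence sorted by reading order."""
        members = [c for c in self.chars if c.sentence_id == sentence_id]
        return sorted(members, key=lambda c: c.order)

    def transcription(self, sentence_id: int, gap: str = "□") -> str:
        """Join labels in reading order, writing `gap` for missing characters."""
        return "".join(
            c.label if c.label is not None else gap
            for c in self.sentence(sentence_id)
        )


# Placeholder data: labels "A", "B", "x" and both paths are invented.
record = InscriptionRecord(
    rubbing_path="rubbing/example.png",
    facsimile_path="facsimile/example.png",
    phase=1,
    chars=[
        CharAnnotation((40, 50, 30, 30), "B", sentence_id=0, order=1),
        CharAnnotation((40, 90, 30, 30), None, sentence_id=0, order=2),
        CharAnnotation((40, 10, 30, 30), "A", sentence_id=0, order=0),
        CharAnnotation((200, 10, 20, 20), "x", sentence_id=None, order=None),
    ],
)

print(record.transcription(0))
```

Because the abstract describes the rubbing and facsimile images as pixel-aligned, the sketch stores a single box per character and assumes it applies to both images; a real loader should follow the field definitions distributed with the dataset.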

Data availability

The OBIMD dataset generated and analysed during the current study is available on the Hugging Face Hub at https://doi.org/10.57967/hf/7828 (ref. 14).

Code availability

Source code and scripts used for the technical validation experiments on the OBIMD dataset (ref. 14) are publicly available on GitHub at https://github.com/libang1991/OBIMD. The repository includes the core implementations of the baseline models evaluated in this manuscript, supporting reproducibility of the reported results. The web-based annotation platform used for OBIMD construction is available at https://www.jgwlbq.org.cn/oracle-bone.

References

  1. Boltz, W. G. Early Chinese writing. World Archaeology 17, 420–436 (1986).

  2. Keightley, D. N. The Shang state as seen in the oracle-bone inscriptions. Early China 5, 25–34, https://doi.org/10.1017/S0362502800006118 (1979).

  3. Qi, Y. & Yuan, W. A digital infrastructure for the study of oracle bone inscriptions https://jgw.aynu.edu.cn/ (2019).

  4. Fujikawa, Y. et al. Recognition of oracle bone inscriptions by using two deep learning models. International Journal of Digital Humanities 5, 65–79, https://doi.org/10.1007/s42803-022-00044-9 (2023).

  5. Li, J. et al. Towards better long-tailed oracle character recognition with adversarial data augmentation. Pattern Recognition 140, 109534, https://doi.org/10.1016/j.patcog.2023.109534 (2023).

  6. Fu, X. & Zhou, R. Shape prior fusion for oracle bone inscriptions detection. in Proceedings of the 2024 7th International Conference on Image and Graphics Processing 394–401, https://doi.org/10.1145/3647649.3647711 (2024).

  7. Liu, G., Xing, J. & Xiong, J. Spatial pyramid block for oracle bone inscription detection. in Proceedings of the 2020 9th International Conference on Software and Computer Applications 133–140, https://doi.org/10.1145/3384544.3384561 (2020).

  8. Huang, S. et al. OBC306: A large-scale oracle bone character recognition dataset. in 2019 International Conference on Document Analysis and Recognition (ICDAR) 681–688, https://doi.org/10.1109/ICDAR.2019.00114 (2019).

  9. Yue, X. et al. Dynamic dataset augmentation for deep learning-based oracle bone inscriptions recognition. ACM Journal on Computing and Cultural Heritage 15, 1–20, https://doi.org/10.1145/3532868 (2022).

  10. Wang, M. & Deng, W. A dataset of oracle characters for benchmarking machine learning algorithms. Scientific Data 11, 87, https://doi.org/10.1038/s41597-024-02933-w (2024).

  11. Li, B. et al. HWOBC-a handwriting oracle bone character recognition database. Journal of Physics: Conference Series 1651, 012050, https://doi.org/10.1088/1742-6596/1651/1/012050 (2020).

  12. Wang, P. et al. An open dataset for oracle bone character recognition and decipherment. Scientific Data 11, 976, https://doi.org/10.1038/s41597-024-03807-x (2024).

  13. Guan, H. et al. An open dataset for the evolution of oracle bone characters: EVOBC. Preprint at https://arxiv.org/abs/2401.12467 (2024).

  14. Key Laboratory of Oracle Bone Inscriptions Information Processing, Li, B., Yang, J. et al. OBIMD. Hugging Face https://doi.org/10.57967/hf/7828 (2026).

  15. Oracle Bone AI Collaborative Platform. An AI-driven research workspace for oracle bone inscriptions https://www.jgwlbq.org.cn/oracle-bone (2024).

  16. Guo, M. & Hu, H. (eds) Jiaguwen Heji [Collection of Oracle Bone Inscriptions] (Zhonghua Book Company, 1978–1982).

  17. Chinese Academy of Social Sciences. Yinxu Huayuanzhuang Dongdi Jiagu [Oracle Bone Inscriptions from Huayuanzhuang East at Yinxu] (Yunnan Nationalities Publishing House, 2003).

  18. Huang, T. (ed.) Jiaguwen Moben Daxi [Comprehensive Series of Oracle Bone Facsimiles] (Peking University Press, 2022).

  19. Oracular Digital Platform. Glyph library https://oracular.azurewebsites.net/glyphs (2024).

  20. Ultralytics. ultralytics (v8.2.94). GitHub https://github.com/ultralytics/ultralytics/releases/tag/v8.2.94 (2024).

  21. Kuhn, H. W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2, 83–97, https://doi.org/10.1002/nav.3800020109 (1955).

  22. Ultralytics. ultralytics (v8.3.0). GitHub https://github.com/ultralytics/ultralytics/releases/tag/v8.3.0 (2024).

  23. Vaswani, A. et al. Attention is all you need. in Advances in Neural Information Processing Systems 30 (NIPS 2017) 5998–6008, https://papers.nips.cc/paper/7181-attention-is-all-you-need (2017).

  24. Han, W. et al. Self-supervised learning of Orc-Bert augmentor for recognizing few-shot oracle characters. in Proceedings of the Asian Conference on Computer Vision 652–668, https://doi.org/10.1007/978-3-030-69544-6_39 (2021).

  25. Zhang, G. et al. Deciphering ancient Chinese oracle bone inscriptions using case-based reasoning. in International Conference on Case-Based Reasoning 309–324, https://doi.org/10.1007/978-3-030-86957-1_21 (Springer, 2021).

  26. Yue, X. et al. An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions. International Journal on Document Analysis and Recognition (IJDAR) 27, 583–601, https://doi.org/10.1007/s10032-024-00463-0 (2024).

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant No. 62506007), the Natural Science Foundation of Henan Province (Grant No. 242300420680), the Paleography and Chinese Civilization Inheritance and Development Program (Grant Nos. G1807, G1806, G2821), the Henan Province Science and Technology Research Project (Grant Nos. 242102210116, 252102321071), the Open Research Topic of the Key Laboratory of Oracle Information Processing, Ministry of Education (Grant Nos. OIP2024E002, OIP2024H002), the Key Technology Project of Henan Educational Department of China (Grant No. 22ZX010), and the Henan Province High-Level Talents International Training Program (Grant No. GCC2025028).

Author information

Author notes
  1. These authors contributed equally: Bang Li, Jing Yang.

Authors and Affiliations

  1. Key Laboratory of Oracle Bone Inscriptions Information Processing, Ministry of Education of China, Anyang Normal University, Anyang, Henan, China

    Bang Li, Jing Yang, Zengmao Ding, Shengwei Han, Feng Gao & Yongge Liu

  2. Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China

    Yujie Liang, Taisong Jin & Rongrong Ji

  3. Youtu Lab, Tencent, Shanghai, China

    Xiaobin Hu, Xu Peng & Donghao Luo

  4. Faculty of Asian and Middle Eastern Studies, University of Cambridge, Cambridge, UK

    Peichao Qin

Contributions

Bang Li and Jing Yang co-wrote the manuscript and contributed equally to this work. Yujie Liang and Zengmao Ding also contributed to manuscript writing. Technical validation experiments were conducted by Bang Li, Jing Yang, and Zengmao Ding. Xiaobin Hu and Taisong Jin proposed key revisions before submission. Bang Li and Donghao Luo initiated and supervised the construction of the OBIMD dataset. Yujie Liang, Zengmao Ding, and Xu Peng implemented algorithmic pre-annotation for the dataset. Bang Li, Jing Yang, and Shengwei Han conducted manual annotation and developed annotation guidelines. Bang Li, Donghao Luo, and Yongge Liu coordinated the annotation teams. Peichao Qin provided the standard oracle bone character library for the dataset. Rongrong Ji, Feng Gao, and Yongge Liu supported the project through funding and resources. Correspondence should be addressed to Donghao Luo or Taisong Jin, who provided overall guidance on project design, dataset framework, and manuscript review. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Donghao Luo or Taisong Jin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, B., Yang, J., Liang, Y. et al. OBIMD: A Multi-modal Dataset for Contextual Interpretation of Oracle Bone Inscriptions. Sci Data (2026). https://doi.org/10.1038/s41597-026-06967-0

  • Received: 16 July 2025

  • Accepted: 24 February 2026

  • Published: 14 March 2026

  • DOI: https://doi.org/10.1038/s41597-026-06967-0
