A graph-based evolutionary dataset for oracle bone characters from inscriptions to modern Chinese scripts

Jiao, Qingju; Wu, Jingwen; Liu, Qi; Zhang, Han; Zhang, Zhan; Li, Bang; Xiong, Jing; Liu, Guoying; Liu, Yongge

doi:10.1038/s40494-025-01951-0

Download PDF

Article
Open access
Published: 26 July 2025

A graph-based evolutionary dataset for oracle bone characters from inscriptions to modern Chinese scripts

Qingju Jiao^1,2,3,
Jingwen Wu³,
Qi Liu³,
Han Zhang^1,2,
Zhan Zhang³,
Bang Li³,
Jing Xiong⁴,
Guoying Liu⁵ &
…
Yongge Liu³

npj Heritage Science volume 13, Article number: 369 (2025) Cite this article

832 Accesses
Metrics details

Abstract

The traditional semantic interpretation of oracle bone characters, the earliest known system of Chinese writing, has relied heavily on expert-driven manual analysis. Therefore, artificial intelligence (AI)-based approaches have been increasingly explored for deciphering these characters. A fundamental prerequisite for AI-driven semantic inference is construction of a high-quality evolutionary dataset. In this study, a graph-based evolutionary dataset is introduced, encompassing 756 groups and 3780 Chinese characters across five historical stages. Unlike existing datasets that primarily represent characters as images, the proposed dataset employs a graph-based representation, wherein nodes correspond to key structural points of a character, and edges define their spatial relationships. Experimental analyses demonstrate that graph representations offer superior capabilities in capturing the structural stability of characters across evolutionary stages compared to image-based representations. The dataset is expected to serve as a valuable resource for the application of AI-driven methodologies in the decipherment of unknown oracle bone characters.

A dataset of oracle characters for benchmarking machine learning algorithms

Article Open access 18 January 2024

A text image dual conditional stable diffusion model for oracle bone inscription decipherment

Article Open access 12 September 2025

Evaluating explainability for graph neural networks

Article Open access 18 March 2023

Introduction

Oracle bone inscriptions, the earliest known form of Chinese characters, are essential for understanding the origins and evolution of Chinese culture. Research of oracle bone inscriptions can offer important insights into the early phases of ancient Chinese written language. However, traditional methods for studying oracle bone inscriptions rely heavily on manual interpretation by experts¹, significantly hindering the advancement of research in this field. Given the successful application of artificial intelligence (AI) across various fields, efforts have been made to employ AI in the study of oracle bone inscriptions, yielding promising results¹. These studies primarily focus on the construction of oracle bone character datasets, character recognition, and the rejoining of oracle bone fragments, among other aspects. In fact, the construction of oracle bone inscriptions dataset is the foundation for using AI to solve oracle bone inscriptions issues, this is a great challenge for the AI field. There have been some pioneering efforts in this area.

In 2015, Guo et al. constructed a dataset (named Oracle-20K) containing 261 categories of 20,039 oracle bone characters, with the largest and smallest categories consisting of 291 and 25 characters, respectively. Then, a novel hierarchical representation, which combines a Gabor-related low-level representation and a sparse-encoder-related mid-level representation, was proposed to recognize to oracle bone characters². In 2019, Zhang et al. built a dataset named Oracle-AYNU for oracle bone character recognition. Oracle-AYNU contains 39,062 handprinted oracle bone characters and 2583 classes. Based on Oracle-AYNU, the authors proposed a classification method for oracle bone characters recognition. The proposed method employs a convolutional neural network (CNN) to map oracle bone character images to Euclidean space, and classes them by using the nearest neighbor rule³. In 2020, Han et al. constructed a dataset (called Oracle-50K) containing 59,081 oracle bone character images across 2668 categories, in which the oracle bone characters exhibit a long-tail distribution. In order to resolve the few-shot learning problem, Han et al. proposed a novel data augmentation approach pre-trained by self-supervised learning for oracle bone character recognition⁴. Li et al. constructed a handwriting dataset (called HWOBC) containing 83,245 oracle bone character samples and 3881 categories, and they also provided results for several state-of-the-art deep CNN algorithms for HWOBC⁵. In 2021, Zhang et al. constructed two datasets, named Ancient-3 and Ancient-5, each containing 3900 oracle bone samples and 1186 classes. Furthermore, this work employed an auto-encoder to learn suitable image representations to capture the relationship between two adjacent writing systems. The research can provide effective data for deciphering unknown oracle bone characters⁶. In 2022, in order to recognize oracle bone characters, Lin et al. first introduced two new datasets containing oracle radical character dataset (ORCD) and oracle combined-character dataset (OCCD). ORCD includes both rubbing source data and handprinted data, and has 6700 samples and 64 radical classes; OCCD includes 54,876 synthesized oracle characters and 7310 handprinted characters. And then, Lin et al. proposed an oracle radical extract and recognition framework (ORERF) to recognize oracle bone characters⁷.

In addition to building the datasets for oracle bone character recognition, researchers have also constructed datasets for oracle bone character evolution. In 2022, Wang et al. constructed an evolutionary dataset of oracle bone characters, comprising 972 categories and spanning 5 historical stages: oracle bone inscriptions, bronze inscriptions, seal script, official script and regular script⁸. In 2024, Guan et al. constructed an evolutionary oracle bone characters dataset (named EVOBC), consisting of 229,170 images representing 13,714 distinct character categories. The characters in the EVOBC spans six historical stages: oracle bone Characters, bronze inscriptions, seal script, spring and autumn period characters, warring states period characters and clerical script⁹.

However, the oracle bone characters in the datasets mentioned above are represented as images, which limits the effectiveness of these datasets. Compared with existing approaches, some graph theories are used to recognize characters. For example, Sahare et al. employ graph distance theory to separate touched (or joined) and overlapped characters^10,11, and use Dijkstra’s algorithm to segment text-line^12,13. Gan et al. first represent handwritten characters as skeleton graphs, and then recognize them via pyramid graph transformer¹⁴. The results of literature¹⁴ show that it is more interpretable to recognize characters as a graph structure. Therefore, this work builds an evolutionary oracle bone characters dataset represented by graphs. The dataset in this work includes 756 groups of 3780 Chinese characters from five stages, ranging from oracle bone inscriptions, bronze inscriptions, seal script, official script, to regular script. Note that each group in the proposed dataset consists of five evolutionary shape of a graph-based oracle bone character. Extensive experiments show that using a graph structure to represent Chinese characters can better reflect the stability and locality of evolution than using image representation.

Methods

The original image data of evolutionary oracle bone characters, as described in the literature⁸, includes 972 groups and 4860 images of oracle bone characters. Figure 1 presents the flowchart for building the graph-based evolutionary oracle bone character dataset. This work introduces four key steps for constructing the dataset including data source and optimization, image to graph, graph compression and similarity calculation, and data validation.

Data source

Most oracle bone characters in current research are represented by images¹ that can effectively capture the global structure of oracle bone characters. To effectively construct the evolutionary oracle bone characters dataset, we carried out a follow-up investigation using the 972 groups (each group contains five evolutionary shape of an image-based oracle bone character) of 4860 Chinese character images dataset reported in our previous work⁸. The images in this dataset come from two primary sources.

The first is “Guo Xue Da Shi” (国学大师, https://www.guoxuedashi.net/jgwhj/) initiated and maintained by classical Chinese enthusiasts as shown in Fig. 2. It includes a variety of historical texts and dictionaries, histories, and more.

The second is “Chinese Character Etymology and Evolution Dictionary” (汉字源流字典)¹⁵, as shown in Fig. 3. This is the first systematic presentation of ancient character forms of the gathered Chinese characters (oracle bone inscriptions, bronze inscriptions, prehistoric script, ancient script, and seal script), their initial meanings (the original meanings), their origins, and the evolution of the meanings of each ancient and modern item.

Next, the images downloaded from the two sources mentioned above are preprocessed by grayscale processing and image size normalization. First, since color images increase the complexity of computation and the color information does not contain the structural features of Chinese characters, therefore, we perform grayscale processing on the color images by the weighted average method⁸. The grayscale images can make structural information of oracle bone script clearer. And then, each grayscale image is normalized to 105 × 105 pixels by the bilinear interpolation method⁸. In the following section, the input data for image-based methods are grayscale images with 105 × 105 pixels.

Optimized image-based evolutionary oracle bone characters

Note that, although the oracle bone character image data used in this work originated from our previously published dataset⁸, minor optimizations have been applied to enhance its quality. For example, the original image dataset was mixed with images of other stages in addition to the five stages (oracle bone inscriptions, bronze inscriptions, seal script, official script and regular script) involved in the experiment, such as running script, cursive script. To avoid errors that could affect the experimental results, we manually filtered out this part of the images during the process of transforming them into graphs. As a result, 1080 samples were excluded in this categorization after being screened. At last, the number of Chinese character images used in this work is 3780, and these images are classed into 756 groups, where each group contains five evolutionary shape of an image-based oracle bone character. Table 1 summarizes all the groups in the evolutionary oracle bone characters dataset and gives the number of data in each timeframe.

Table 1 Statistics on the quantity of characters in different stages

Full size table

Oracle bone character image to graph structure conversion

The most important step for converting Chinese (or oracle bone character) character images into graph structures is to extract nodes from images. In 2011, Li et al. proposed rules (Li method in Fig. 4) for the conversion of oracle-bone characters to graph structures¹⁶. First, the start point (p1 in Fig. 4A2) and end point (p3 in Fig. 4A2) of straight stroke in oracle bone characters are abstracted as nodes in graph. Second, the start point (p4 in Fig. 4A2) and end point (p5 in Fig. 4A2) of circular stroke are abstracted as nodes in graphs. And third, the intersection (p2 in Fig. 4A2) of two or more strokes is abstracted as nodes in graphs. In the literature¹⁶, Li et al. abstracted the start point (p4 in Fig. 4A2) and end point (p5 in Fig. 4A2) of circular strokes as nodes in the graph as well, which would result in the existence of multiple edges (for example, there are three edges between point p4 and point p5 in Fig. 4A2) between two nodes in the oracle bone character graph structure, and this phenomenon reflects the topology of oracle bone characters because these three connections are merged into an edge (the coarse edge in Fig. 4A4) with a larger weight (the element 3 in Fig. 4A5). In addition, modern Chinese writing rarely contains circular strokes, therefore, we update the rule of extracting nodes from circular stroke in the second step mentioned above.

**Fig. 4: An example of transform oracle bone character “Hundred” image into graph.**

For a circular stroke (Our method in Fig. 4), three nodes are extract from start point (p4 in Fig. 4B2), middle point (p6 or p7 in Fig. 4B2) and end point (p5 in Fig. 4B2). In a word, the start point, end point, (or middle point) in straight (or circular) stroke and the intersectional point of two or more strokes are abstracted as nodes, and then, the graph of oracle bone character is formed by connecting these nodes. Figure 4 shows an example of oracle bone character “Hundred” image to graph structure conversion. From Fig. 4, it can be observed that the graph extracted by our method (Fig. 4B4) can better represent the geometric structure of oracle bone character compared to the graph extracted by Li method (Fig. 4A4). Note that, to ensure the accuracy of conversion, this process of oracle bone character image to graph structure conversion is performed entirely manually.

A character graph (Fig. 4B4) with $n$ nodes and $m$ edges can be described as ${\rm{G}}=({\rm{V}},{\rm{E}})$, ${\rm{V}}$ and ${\rm{E}}$ are the sets of nodes and edges respectively. The graph is then converted into an adjacency matrix (Fig. 4B5) to facilitate the next step of computation. The adjacency matrix ${\bf{A}}$ is an $n\times n$ square matrix, where the element ${{\bf{A}}}_{{ij}}$ in the matrix shows the connectivity between node $i$ and node $j$. If there is an edge (${e}_{{ij}}\in E$) between node $i$ and node $j$, then ${{\bf{A}}}_{{ij}}={w}_{{ij}}$; otherwise ${{\bf{A}}}_{{ij}}=0$. Since the edge only represents the connectivity of the nodes and has no meaning of length (that is, the graph is unweighted), therefore the value of ${w}_{{ij}}$ is 1 in our method (Fig. 4), and the adjacency matrix ${\bf{A}}$ is symmetric and invariant. Table 2 shows an evolution of “丙” (Third) from the view of image and graph.

Table 2 An evolution of “丙” (Third) from the view of image and graph

Full size table

Results

To demonstrate the utility of the oracle bone character evolutionary dataset represented as a graph structure, a technical validation task is designed from a computational perspective. This task calculated the stability of the character as it evolves and the changes in the similarity of its elements throughout the evolutionary process.

Similarity calculation of character graph

In this section, three methods including DHC-E¹⁷, FEATHER¹⁸ and GL2vec¹⁹ are employed to extract features of character graph (adjacency matrix). Here, we take DHC-E as an example to explain the process of feature extraction and similarity calculation of oracle bone character graph (adjacency matrix). DHC-E¹⁷ can compress a network into a one-dimensional entropy sequence with four main steps. Here, we denote the degree of node $i$ as ${k}_{i}$, and its neighbors’ degree as ${k}_{{j}_{1}},{k}_{{j}_{2}},\cdots ,{k}_{{j}_{{k}_{i}}}$. First, the H-index ${h}_{i}$ of node $i$ is calculated by the degrees of its neighbors (Eq. (1)).

$${h}_{i}=\beta ({k}_{{j}_{1}},{k}_{{j}_{2}},\cdots ,{k}_{{j}_{{k}_{i}}})$$

(1)

where ${\rm{\beta }}$ is the operation of calculating H-index of node $i$, and ${k}_{{j}_{1}},{k}_{{j}_{2}},\cdots ,{k}_{{j}_{{k}_{i}}}$ is sorted by descending order.

Second, the sth-order (s > 0) H-index can be calculated by Eq. (2). This process is repeated until the H-indexes of all nodes stabilize, obtaining s times H-index ${h}_{i}^{(s)}$. The node i’s H-index ${h}_{i}^{(0)},{h}_{i}^{(1)},{h}_{i}^{(2)},$⋯ sequence will converge to its coreness ${c}_{i}$.

$${h}_{i}^{(s)}=\beta ({h}_{{j}_{1}}^{\left(s-1\right)},{h}_{{j}_{2}}^{\left(s-1\right)},\cdots ,{h}_{{j}_{{k}_{i}}}^{(s-1)})$$

(2)

where ${h}_{i}^{\left(0\right)}={k}_{i}$ and ${h}_{i}^{\left(1\right)}={h}_{i}$.

Third, for a graph ${\rm{G}}$ with $n$ nodes, we can obtain $n$ H-index sequences $\left\{({h}_{i}^{\left(0\right)},{h}_{i}^{\left(1\right)},\cdots ,{h}_{i}^{(s)}),|,i=\mathrm{1,2},\cdots ,n\right\}$ after $s$ iterations, and ${h}_{1}^{(m)},{h}_{2}^{(m)},\cdots ,{h}_{n}^{\left(m\right)}(0\le m\le s)$ represent the state of the graph in the $m$ iterations. To represent the state $m$, DHC-E calculates the probability distribution ${p(h}_{1}^{(m)}),p({h}_{2}^{\left(m\right)}),\cdots ,p({h}_{n}^{\left(m\right)})$ of ${h}_{1}^{(m)},{h}_{2}^{(m)},\cdots ,{h}_{n}^{\left(m\right)}$, and measures its Shannon entropy ${H}_{m}$ by Eq. (3).

$${H}_{m}=-\mathop{\sum }\limits_{i=1}^{n}p({h}_{i}^{(m)}){\log }_{2}p({h}_{i}^{(m)})$$

(3)

Note that DHC-E is a parameter-free method, and the dimension of output vector of oracle bone characters graphs is 12, that is the number of iteration $s$ is 12.

Finally, a feature vector ${v}^{g}$ (Eq. (4)) representing the entire graph $g$ is generated based on the Shannon entropy value ${H}_{m}$ after each iteration.

$${v}^{g}=\left({H}_{1},{H}_{2},\ldots \ldots ,{H}_{s}\right)$$

(4)

The measure of vector similarity is one of the central tasks in the analysis of high-dimensional data to assess the relationship between data points. Therefore, it is necessary to use a commonly used distance metric, Euclidean distance, which determines the similarity between vectors by measuring their geometric distance in space. For two vectors ${v}^{{gi}}=({H}_{1}^{{gi}},{H}_{2}^{{gi}},\cdots ,{H}_{s}^{{gi}})$ and ${v}^{{gj}}={H}_{1}^{{gj}},{H}_{2}^{{gj}},\cdots ,{H}_{s}^{{gj}}$, the Euclidean distance formula is (Eq. (5)):

$$d\left({v}^{{gi}},{v}^{{gj}}\right)=\sqrt{\mathop{\sum }\limits_{t=1}^{s}{({H}_{t}^{{gi}}-{H}_{t}^{{gj}})}^{2}}$$

(5)

where ${v}^{{gi}}$ and ${v}^{{gj}}$ are the feature vectors of graphs ${gi}$ and ${gj}$ respectively, ${H}_{t}^{{gi}}$ and ${H}_{t}^{{gj}}$ are the values of Shannon’s entropy for graphs ${gi}$ and ${gj}$ after the t-th iteration respectively. s is the dimension of the feature vector, i.e., the number of iterations.

To simplify the conversion of distance into a similarity metric, we used a normalization method (Eq. (6)) that produces values in the range (0, 1], where values closer to 1 indicate higher similarity.

$${d}_{{ij}}^{N}=\frac{1}{1+\frac{d\left({v}^{{gi}},{v}^{{gj}}\right)}{12}}$$

(6)

Like DHC-E, FEATHER and GL2vec first compress a graph into a vector, and then calculate the similarity between vectors of oracle bone character graphs. Note that, in order to be consistent with the DHC-E method, the output dimension of vector for FEATHER and GL2vec is set 12.

Feather extraction of character image

In order to verify the expressive performance of character image and graph, the feature of character images are also extracted. Image similarity measurement is a foundational problem in the field of computer vision, and many algorithms have been proposed to address it. The algorithms can be divided into traditional methods and deep learning-based methods. Traditional algorithms include the histogram-based method (Hist)²⁰, perceptual hash (pHSAH)²¹, and structural similarity (SSIM)²². The deep learning-based algorithms first extract the feature vector from the input image, and then computer the similarity of the feature vectors. For example, neural networks used for extracting feature vector include VGG16²³, ResNet²⁴, EfficientNet²⁵, Vision Transformers (ViT)²⁶, and others. Therefore, the similarity values of oracle bone characters represented by graph are compared by 7 image-based methods including VGG16, ResNet, EfficientNet, Vision Transformers (ViT), pHASH, SSIM and Hist.

Experimental results

To gain a clearer understanding of the pattern of character evolution across different stages, we calculate the average similarity value for each of the five stages across all samples and visualize the evolution of character similarities. Here, we will give an example showing how to calculate the average value of similarity of each stage for a method. Assuming that the first group of five stages of the character vectors are ${S}_{1}^{1},{S}_{2}^{1},{S}_{3}^{1},{S}_{4}^{1},{S}_{5}^{1}$, the second group of five stages of the character vectors are ${W}_{1}^{2},{W}_{2}^{2},{W}_{3}^{2},{W}_{4}^{2},{W}_{5}^{2}$, In each group of characters, calculate the similarity (Eqs. (5) and (6)) between each pair of neighboring stages: for the first group of character vectors, calculate the similarity of neighboring stages ${S}_{\mathrm{1,2}}^{1},{S}_{\mathrm{2,3}}^{1},{S}_{\mathrm{3,4}}^{1},{S}_{\mathrm{4,5}}^{1}$, for the second group of character vectors, calculate the similarity of neighboring stages ${W}_{\mathrm{1,2}}^{2},{W}_{\mathrm{2,3}}^{2},{W}_{\mathrm{3,4}}^{2},{W}_{\mathrm{4,5}}^{2}$ and so on until all 756 groups are calculated. Next, the mean value of the similarity of each pair of neighboring stages is calculated. The similarity between the first and second stages can be calculated using Eqs. (7) and (8), respectively.

$${M}_{1,2}=\frac{{S}_{1,2}^{1}+{W}_{1,2}^{2}}{2}$$

(7)

$${M}_{2,3}=\frac{{S}_{2,3}^{1}+{W}_{2,3}^{2}}{2}$$

(8)

For each pair of neighboring phases, their average values of similarity are calculated. For the similarity mean of the i-th and i + 1-th stages in the n-group character, the equation can be written as follows (Eq. (9)):

$${M}_{i,i+1}=\frac{{\sum }_{k=1}^{n}{X}_{i,i+1}^{k}}{n}$$

(9)

where ${X}_{i,i+1}^{k}$ represents the similarity between the i and i + 1 stages of the k-th group.

The results of similarity values of 10 methods are shown in Fig. 5. As can be seen from Fig. 5, regardless of whether the similarity value is large or small, the oracle bone character evolutionary process is relatively stable with small fluctuations across four stages (OBI-BI, BI-SS, SS-OS, OS-RS). Generally speaking, comparing with image-based methods, graph-based methods achieve significant improvement on similarities, meaning that the graph structure can effectively describe the evolution process of Chinese characters. Graph-based approaches focuses on quantifying internal structural changes (such as the increase or decrease of strokes of character nodes and the changes of radicals) through graph-embedded vector representations derived from matrix transformations, while conventional image-based methods rely on pixel-level contour matching to assess superficial morphological similarity between graphical characters. Among 7 image-based approaches, the traditional methods including pHASH and Hist obtain larger values of similarity, however, deep learning-based methods achieve slightly poorer values of similarity. This phenomenon is most likely due to the fact that these deep learning-based approaches learn too much redundant information from oracle bone character images with simple character pattern.

**Fig. 5: The similarities of different methods in character evolutionary stages.**

In addition, we analyze the similarities of each stages in detail. In general, the similarities of four stages gradually increase, which means that the differences (higher similarity corresponds to lower difference) in the shapes of Chinese characters are becoming smaller and smaller. In order to reveal the differences of Chinese characters from the perspective of graph and image, we calculate the average similarity (Fig. 6) of graph-based methods (DHC-E, FEATHER and GL2vec) and image-based methods (VGG16, ResNet, EfficientNet, ViT, pHASH, SSIM and Hist). Regardless of the perspectives of the graph and image, the minimum dissimilarity across the four stages is OS-RS (Fig. 6), implies that the modern Chinese characters exhibit increasing structure similarity. From the perspectives of graph and image, the maximum differences across the four stages are OBI-BI and SS-OS respectively.

**Fig. 6: The average similarity of graph-based and image-based methods.**

Next, we explain an interesting phenomenon of the SS-OS stage transition from seal script to official script. In 221 BCE, Qin Shi Huang, the first emperor of the Qin dynasty, unified China and standardized the Chinese writing system. This reform replaced the earlier seal script with a more simplified official script, introducing significant stylistic and structural modifications. From Fig. 5, we can see that the SS-OS similarities of VGG-16, ResNet, pHASH and SSIM are the smallest among 4 phases (Likewise, the average similarity of SS-OS is the smallest (Fig. 6B)), meaning that these four methods can reflect the significant change of seal script and official script from a global view. Unfortunately, graph-based methods cannot capture the significant change of SS-OS.

While the transition from seal script to official script during the Qin unification marked a significant transformation in Chinese character morphology, the evolutionary trajectory of Chinese character reveals an unbroken continuum of development, with each phase demonstrating incremental modifications rather than abrupt discontinuities. Therefore, from a computational perspective, graph-based representations are less influenced by evolutionary perturbations that affect character outlines, making them more suitable for typological studies requiring structural stability analysis.

We provide eight groups of typical oracle bone character evolutionary data, and the graph-based similarity and image-based similarity are shown in Tables 3 and 4. Table 3 shows the similarity of stage from oracle bone character (OBI) to bronze inscriptions (BI). From Table 3, we can see that the similarity of the graph-based methods are significantly higher than that of other image-based methods as a whole. For example, from the perspective of graph structure, the oracle bone character ‘’ (Bamboo) is identical with bronze inscriptions (the similarity is equal to 1 in DHC-E and FEATHER methods), whereas from an image-based perspective, Oracle Bone Inscriptions and bronze inscriptions exhibit significant differences (the similarities of VGG16, ResNet, EfficientNet, ViT, PHASH, SSIM and Hist are equal to 0.551, 0.571, 0.544, 0.476, 0.652, 0.670 and 0.487 respectively).

Table 3 Graph-based and image-based similarities of stage OBI-BI

Full size table

Table 4 Comparison of similarity in the evolution of characters at various stages

Full size table

From Table 4, we can see that the similarity of graph-based method are higher than that of the image-based methods in BI-SS, SS-OS, and OS-RS phases. These typical examples demonstrate that the method we used to analyze the regularity of character evolution based on graph representation is feasible. Compared to image processing methods, graph-based method yields a higher similarity between the stages of oracle bone character evolution, and the graph representation can more accurately reflect the subtle changes in the evolution of oracle bone characters, resulting in stronger consistency and coherence in similarity assessments across different stages of evolution.

Discussion

Oracle bone inscriptions represent the earliest fully developed writing system in China, and the semantic inference of unknown oracle bone characters remains one of the most critical challenges in the study of these inscriptions. Understanding the evolutionary patterns of oracle bone characters is a fundamental step towards inferring their semantics. In this study, a graph-structured dataset of evolutionary oracle bone characters has been constructed, comprising 756 groups and 3780 Chinese characters. Compared to traditional image-based datasets, the graph-structured dataset introduced here offers two significant advantages. First, the graph representation methods outperform the conventional image-based methods in terms of the stability of oracle bone character evolution. Second, the graphical representation approach demonstrated higher accuracy in comparing similarities across different evolutionary stages. It is anticipated that the evolutionary dataset constructed in this work will provide valuable data support for the semantic reasoning of unrecognized oracle bone characters.

Here, note that we convert an unknown character image into graph manually using the rules proposed in this work. In fact, some methods¹⁴ based on machine learning are proposed to represent Chinese characters as graphs. For example, Gan et al. use two steps to construct graphs from images of Chinese characters. First, the character skeleton is extracted by thinning the given image according to the rules of ref. ²⁷ or a pre-trained fully convolutional network (FCN)²⁸; and then, the character graph is constructed by clustering closing nodes and merging the corresponding edges from the skeleton. The oracle bone characters graphs constructed by in this work can serve as a reference dataset for training machine learning methods, or can be used to research the law of evolution of Chinese ancient characters.

Data availability

The datasets for 3780 images and graphs (matrixes) of oracle bone characters, and the vectors of matrixes processed by DHC-E, FEATHER and GL2vec methods are freely available online at https://github.com/BrisksHan/GBEDOBC.

References

Li, J. et al. A comprehensive survey of oracle character recognition: challenges, datasets, methodology, and beyond. Pattern Recognit. 169, 111824 (2026).
Article Google Scholar
Guo, J. et al. Building hierarchical representations for oracle character and sketch recognition. IEEE Trans. Image Process. 25, 104–118 (2015).
Article Google Scholar
Zhang, Y. K., Zhang, H., Liu, Y., Yang, Q. & Liu, C. L. Oracle character recognition by nearest neighbor classification with deep metric learning. In 2019 International Conference on Document Analysis and Recognition (ICDAR) 309–314 (The Institute of Electrical and Electronics Engineers (IEEE), 2019).
Han, W., Ren, X., Lin, H., Fu, Y. & Xue, X. Self-supervised learning of orc-bert augmentator for recognizing few-shot oracle characters. In Proc. Asian Conference on Computer Vision (ACCV) (Springer, 2020).
Li, B. et al. HWOBC—a handwriting oracle bone character recognition database. J. Phys. Conf. Ser. 1651, 012050 (2020).
Article Google Scholar
Zhang, G., Liu, D., Smyth, B. & Dong, R. Deciphering ancient Chinese oracle bone inscriptions using case-based reasoning. In International Conference on Case-Based Reasoning 309–324 (Springer, 2021).
Lin, X., Chen, S., Zhao, F. & Qiu, X. Radical-based extract and recognition networks for Oracle character recognition. Int. J. Doc. Anal. Recognit. 25, 219–235 (2022).
Article Google Scholar
Wang, M. et al. Study on the evolution of Chinese characters based on few-shot learning: from oracle bone inscriptions to regular script. PLoS ONE 17, e0272974 (2022).
Article CAS PubMed PubMed Central Google Scholar
Guan, H. et al. An open dataset for the evolution of oracle bone characters: Evobc. arXiv https://doi.org/10.48550/arXiv.2401.12467 (2024).
Sahare, P. & Dhok, S. B. Robust character segmentation and recognition schemes for multilingual Indian document images. IETE Tech. Rev. 36, 209–222 (2019).
Article Google Scholar
Sahare, P. & Dhok, S. B. Multilingual character segmentation and recognition schemes for Indian document images. IEEE Access 6, 10603–10617 (2018).
Article Google Scholar
Sahare, P., Tembhurne, J. V., Parate, M. R., Diwan, T. & Dhok, S. B. Script independent text segmentation of document images using graph network based shortest path scheme. Int. J. Inf. Technol. 15, 2247–2261 (2023).
Google Scholar
Sahare, P., Tembhurne, J. V., Parate, M. R., Diwan, T. & Dhok, S. B. Script-independent text segmentation from document images. Int. J. Ambient Comput. Intell. 13, 1–21 (2022).
Article Google Scholar
Gan, J. et al. Characters as graphs: interpretable handwritten Chinese character recognition via Pyramid Graph Transformer. Pattern Recognit. 137, 109317 (2023).
Article Google Scholar
Gu, Y. K. Chinese Character Etymology and Evolution Dictionary (Huaxia Publishing House, 2003).
Li, Q., Yang, Y. & Wang, A. Recognition of inscriptions on bones or tortoise shells based on graph isomorphism (in Chinese). Comput. Eng. Appl. 47, 112–114 (2011).
Google Scholar
Wang, H. Deng, Y., Lu, L. & Chen, G. Hyperparameter-free and explainable whole graph embedding. arXiv https://doi.org/10.48550/arXiv.2108.02113 (2021).
Rozemberczki, B. & Sarkar, R. Characteristic functions on graphs: birds of a feather, from statistical descriptors to parametric models. In Proc. 29th ACM International Conference on Information and Knowledge Management (CIKM) 1325–1334 (Association for Computing Machinery (ACM), 2020).
Chen, H. & Koga, H. GL2vec: graph embedding enriched by line graphs with edge features. In International Conference on Neural Information Processing (ICONIP) 3–14 (Springer, 2019).
Rother, C., Minka, T., Blake, A. & Kolmogorov, V. Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) 993–1000 (The Institute of Electrical and Electronics Engineers (IEEE), 2006).
Monga, V., Banerjee, A. & Evans, B. L. A clustering based approach to perceptual image hashing. IEEE Trans. Inf. Forensics Secur. 1, 68–79 (2006).
Article Google Scholar
Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004).
Article PubMed Google Scholar
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR) (OpenReview.net, 2015)
He, K. Zhang, X. Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (The Institute of Electrical and Electronics Engineers (IEEE), 2016).
Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks. In Proc. 36th International Conference on Machine Learning 6105–6114 (Association for Computing Machinery (ACM), 2019).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (OpenReview.net, 2021).
Zhang, T. Y. & Suen, C. Y. A fast parallel algorithm for thinning digital patterns. Commun. ACM 27, 236–239 (1984).
Article Google Scholar
Wang, T. Q. & Liu, C. L. Fully convolutional network based skeletonization for handwritten Chinese characters. In Proc. AAAI Conference on Artificial Intelligence (Association for the Advancement of Artificial Intelligence (AAAI), 2018).

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61806007, Henan Provincial Colleges and Universities Youth Key Teacher Training Plan under Grant 2021GGJS129, Ancient Characters and the Chinese Civilization Inheritance and Development Project under Grant G2821, G1806, G3028 and G1807, the Key Technology Project of Henan Educational Department of China under Grant 22ZX010, Henan Revitalization Cultural Engineering Special Project under Grant 2023XWH296.

Author information

Authors and Affiliations

School of Computer and Information Engineering, Anyang Normal University, Anyang, Henan, China
Qingju Jiao & Han Zhang
Oracle Bone Inscriptions Application Big Data Development Innovation Laboratory, Anyang Normal University, Anyang, Henan, China
Qingju Jiao & Han Zhang
Key Laboratory of Oracle Bone Inscriptions Information Processing, Ministry of Education of China, Anyang, Henan, China
Qingju Jiao, Jingwen Wu, Qi Liu, Zhan Zhang, Bang Li & Yongge Liu
School of Computer Science, Qufu Normal University, Rizhao, Shandong, China
Jing Xiong
Office of Academic Research, Anyang Normal University, Anyang, Henan, China
Guoying Liu

Authors

Qingju Jiao
View author publications
Search author on:PubMed Google Scholar
Jingwen Wu
View author publications
Search author on:PubMed Google Scholar
Qi Liu
View author publications
Search author on:PubMed Google Scholar
Han Zhang
View author publications
Search author on:PubMed Google Scholar
Zhan Zhang
View author publications
Search author on:PubMed Google Scholar
Bang Li
View author publications
Search author on:PubMed Google Scholar
Jing Xiong
View author publications
Search author on:PubMed Google Scholar
Guoying Liu
View author publications
Search author on:PubMed Google Scholar
Yongge Liu
View author publications
Search author on:PubMed Google Scholar

Contributions

Qingju Jiao has designed and written methods, experiments, and original draft of this paper, Jingwen Wu and Qi Liu have collected and processed original data of the oracle bone characters, Jing Xiong, Guoying Liu and Yongge Liu have designed and written methods, experiments, and original draft of this paper, Jingwen Wu and Han Zhang has analyzed and validated the dataset and format of this paper and given a writing review and editing. Zhan Zhang and Bang Li have given experiment advice on the methodology. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Qingju Jiao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Jiao, Q., Wu, J., Liu, Q. et al. A graph-based evolutionary dataset for oracle bone characters from inscriptions to modern Chinese scripts. npj Herit. Sci. 13, 369 (2025). https://doi.org/10.1038/s40494-025-01951-0

Download citation

Received: 21 April 2025
Accepted: 13 July 2025
Published: 26 July 2025
DOI: https://doi.org/10.1038/s40494-025-01951-0

A graph-based evolutionary dataset for oracle bone characters from inscriptions to modern Chinese scripts

Abstract

Similar content being viewed by others

A dataset of oracle characters for benchmarking machine learning algorithms

A text image dual conditional stable diffusion model for oracle bone inscription decipherment

Evaluating explainability for graph neural networks

Introduction