Multimodal AI for Yuan Buddhist sculpture chronology and style

Xing, Jia; Ren, Wei; Lei, Du; Zhao, Lin; Qin, Xue; Shao, Hongdang; Li, Wenjie; Han, Yirui; Yu, Zike; Xu, Zheng; Yin, Rui; Yuan, Jiantao; Wang, Jun; Chen, Wei; Xu, Jun; Zhou, Xiaoping; Yang, Cheng; Zhou, Wei; Zhou, Binbin

doi:10.1038/s40494-025-01994-3

Article
Open access
Published: 05 September 2025

Jia Xing¹,²,
Wei Ren¹,
Du Lei¹,
Lin Zhao¹,
Xue Qin³,
Hongdang Shao³,
Wenjie Li³,
Yirui Han¹,
Zike Yu¹,
Zheng Xu⁴,
Rui Yin⁵,
Jiantao Yuan⁵,
Jun Wang⁶,
Wei Chen⁶,
Jun Xu⁷,
Xiaoping Zhou¹,
Cheng Yang¹,
Wei Zhou⁸ &
Binbin Zhou⁶

npj Heritage Science volume 13, Article number: 443 (2025)

Abstract

The analysis and protection of grotto sculptures face growing urgency due to climate-driven deterioration and a shortage of domain experts. Here we present ChronoStyleNet (CSN), the world’s first domain-specific multimodal large model purpose-built for sculptural heritage. CSN is trained on 295 expert-annotated statues from the Feilai Peak Grottoes (approximately 49% of documented statues in the West Lake region and 13% of known grotto sculptures in Zhejiang) and 2.46 GB of archaeological literature. It achieves precise performance through targeted fine-tuning and structured prompting under limited data conditions. Evaluated on 22 Yuan-dynasty samples, CSN outperformed five mainstream multimodal large language systems within a six-dimensional ontology-aligned framework. This work establishes a scalable benchmark for domain-adaptive AI in cultural heritage, offering replicable methodology for other endangered monuments worldwide. CSN demonstrates that domain-adapted multimodal AI can deliver high-precision interpretation even with scarce data, providing new tools for digital preservation and scholarly research. It highlights AI’s potential to bridge expertise gaps and reshape conservation practices in vulnerable heritage contexts.

Introduction

The preservation of cultural heritage is facing mounting global challenges. Climate change, environmental degradation, and the destruction of heritage assets have led to the irreversible loss of historical and cultural knowledge. Among these, grotto sculptures stand out as vital material embodiments of religious art, encapsulating the social, cultural, and technological transformations across different historical periods. However, as outdoor stone carvings, they are particularly vulnerable to environmental factors such as weathering, erosion, and human impact.

Within this context, the Feilai Peak Grottoes in Hangzhou represent a major heritage site in the Jiangnan region, preserving a rich corpus of carvings that embody both regional traditions and evolving temporal styles. During the Yuan dynasty, Feilai Peak witnessed the emergence of a highly distinctive Sino-Tibetan syncretic style, characterized by a fusion of Han Chinese and Tibetan Buddhist artistic elements^1,2. Sculptures from this period exhibit remarkable diversity in facial modeling, attire detailing, and religious iconography. This creates a complex and heterogeneous stylistic landscape that requires both precise visual analysis and deep cultural understanding for accurate interpretation.

This highly diverse and structurally complex sculptural corpus provides a challenging testing ground for AI in tasks of stylistic recognition and cultural semantic interpretation. However, accurately parsing and understanding these features remains a significant challenge^3,4. Traditional approaches to stylistic and chronological classification of grotto sculptures rely heavily on the expertise of trained archaeologists and art historians, who integrate visual analysis, textual interpretation, and comparative stylistics⁵. However, this manual process faces growing limitations when applied to increasingly large and diverse image datasets. On the one hand, there is a growing shortage of domain experts: the field is experiencing a pronounced generational gap, with aging scholars and limited successor capacity, which has slowed documentation and interpretation efforts⁶. This expertise gap is compounded by the scarcity of annotated data, as high-quality documentation of irreplaceable heritage requires specialized skills and ethical considerations that limit large-scale data collection. On the other hand, many grotto sites—particularly in climatically sensitive regions like southern China—are experiencing accelerated deterioration due to weathering and environmental change^7,8,9,10, placing urgent demands on timely documentation, analysis, and preservation. These challenges highlight the need for a novel approach that can scale with limited expert resources, leveraging sparse annotations and domain knowledge to address both data scarcity and interpretive complexity.

The rapid advancement of artificial intelligence is profoundly reshaping multiple academic fields, including cultural heritage research, historical artifact analysis, and digital preservation^11,12,13,14,15. High-quality data collection and secure preservation are foundational to heritage digitization efforts, with existing studies primarily focusing on large-scale data mining^16,17, knowledge graph construction^15,18,19,20, and three-dimensional reconstruction techniques^21,22,23,24. While traditional AI models, such as convolutional neural networks (CNNs), have achieved notable success in tasks such as image classification and feature extraction^25,26,27,28, their inherent “black-box” nature limits scalability, interpretability, and adaptability to domain-specific knowledge^3,4,29.

In recent years, multimodal large models (MLLMs) have been increasingly applied to cultural heritage analysis due to their ability to process and reason over both visual and textual inputs^30,31,32,33. However, general-purpose MLLMs often lack nuanced understanding of archaeological terminology and historical context^34,35. As task complexity grows^36,37, conventional human supervision also faces challenges, as annotators find it increasingly difficult to reliably validate the accuracy of model outputs^38,39,40,41. Meanwhile, MLLMs have begun to be employed as automated evaluation tools to support quality assessment of model-generated content^42,43.

Despite these advancements, existing research largely emphasizes the development and transferability testing of general-purpose technologies, with limited attention paid to domain-specific adaptation. In particular, the analysis of grotto sculptures from the Jiangnan region—artifacts characterized by complex regional cultural features—remains underexplored. To address the unique demands of such regionally contextualized heritage, it is crucial to develop multimodal recognition and evaluation systems that are both culturally adaptive and aligned with rigorous academic standards⁴⁴.

To address these challenges in grotto sculpture analysis, this research introduces ChronoStyleNet (CSN), a domain-adaptive multimodal generative framework for sculptural heritage. Built upon an instruction-tuned multimodal large language model architecture^45,46,47,48, CSN uniquely integrates structured knowledge distillation to overcome limitations of traditional approaches. Specifically, the model encodes archaeological reasoning—such as chain-of-thought style analysis—into prompt templates, enabling it to simulate expert decision-making processes (e.g., linking “the right-exposed kasaya” features to Tibetan Vajrayana influences). To our knowledge, this is the world’s first domain-specific multimodal large model purpose-built for sculptural heritage, establishing both a benchmark dataset and a replicable methodological template for other endangered monuments worldwide.

CSN is trained on a curated dataset of 295 expert-annotated sculptures from Hangzhou’s Feilai Peak Grottoes (representing 49% of West Lake region carvings and 13% of Zhejiang’s known statues), supplemented by a 2.46 GB domain knowledge base including The Complete Collection of Chinese Grotto Sculpture, Chinese Grotto Temples^1,2,10,49, and 233 research papers, which together cover a wide range of dynasties, regions, iconographic systems, and scholarly perspectives on Chinese Buddhist art. This hybrid training strategy, which fuses visual features (via frontal/lateral photographs) with textual semantics (e.g., historical context from local gazetteers), enables cross-modal association learning, allowing CSN to achieve robust performance with minimal labeled data. Unlike general-purpose MLLMs, CSN’s architecture prioritizes low-resource adaptability: its structured prompting mechanism and fine-tuning on archaeologically meaningful features (e.g., facial modeling, ritual gestures) preserve expert-level interpretive logic while reducing dependency on massive annotations. This design makes CSN particularly suitable for heritage sites with limited documentation resources, where traditional expert-driven methods are impractical.

To evaluate its effectiveness, we construct a six-dimensional evaluation system aligned with expert archaeological practice, covering: (1) historical period identification, (2) stylistic classification, (3) visual detail description, (4) terminological accuracy, (5) cultural interpretation, and (6) linguistic coherence. We then test ChronoStyleNet on a sample of 22 Yuan-dynasty grotto sculptures and compare its outputs with those of five leading mainstream multimodal large language systems with visual input capabilities—GPT-4o, Claude 3.5, Gemini 1.5 Pro, LLaMA 3.3 70B, and Grok Beta. Scoring is conducted using GPT-4o under blind conditions, with expert moderation for validation. Quantitative performance is supplemented with qualitative case analysis to examine models’ strengths and failure modes in detail. The full experimental framework of ChronoStyleNet is summarized in Fig. 1.

**Fig. 1: Overview of ChronoStyleNet research workflow.**

Here we show that ChronoStyleNet outperforms general-purpose models in both recognition and interpretive accuracy for heritage sculpture analysis, demonstrating the value of domain-adapted MLLMs in supporting digital preservation, expert collaboration, and public education in low-resource archaeological contexts.

Methods

Data collection and preprocessing

This study curated a dataset of 345 grotto sculptures from the Feilai Peak site in Hangzhou. From this set, 295 well-preserved sculptures were selected based on visual completeness and historical clarity. These samples represent approximately 49% of all documented statues in the West Lake region and 13% of the known grotto sculptures in Zhejiang Province. It is important to note that at least 40% of Zhejiang’s grotto sculptures are unsuitable for data annotation. In many cases, key features—such as the heads—have been destroyed and later reconstructed in modern times, as seen in the Shiwudong Cave in Hangzhou, which contains around 500 such statues. In other cases, the statues are so small that they lack discernible facial or bodily features, as exemplified by the Great Buddha Temple in Xinchang, which includes over 1000 miniature figures. Each piece was systematically documented with one frontal and two lateral photographs. Attributes including name, period, inscription, facial structure, hairstyle, costume, posture, gesture, pedestal, halo, and body-head proportions were annotated using an archaeological ontology schema. A sample annotation format is illustrated in Supplementary Table 1.
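
The annotated attributes listed above can be pictured as one record per statue. The following is a minimal sketch, assuming a flat record layout; the class name `StatueAnnotation`, the field names, and the placeholder values are illustrative assumptions and do not reproduce the ontology schema of Supplementary Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatueAnnotation:
    """One annotated statue. Field names are hypothetical, not the authors' schema."""
    name: str
    period: str
    inscription: Optional[str]
    facial_structure: str
    hairstyle: str
    costume: str
    posture: str
    gesture: str
    pedestal: str
    halo: str
    body_head_proportion: Optional[float]
    frontal_image: str = ""
    lateral_images: List[str] = field(default_factory=list)

    def is_fully_documented(self) -> bool:
        # The protocol in the text calls for one frontal and two lateral photographs.
        return bool(self.frontal_image) and len(self.lateral_images) == 2


# Placeholder values only; they do not describe any real statue in the dataset.
example = StatueAnnotation(
    name="example statue", period="Yuan", inscription=None,
    facial_structure="placeholder", hairstyle="placeholder", costume="placeholder",
    posture="placeholder", gesture="placeholder", pedestal="placeholder",
    halo="placeholder", body_head_proportion=None,
    frontal_image="front.jpg", lateral_images=["left.jpg", "right.jpg"],
)
```

A completeness check of this kind would be one simple way to flag records that are missing a photograph before training.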

We preprocessed images via normalization and quality filtering. Complementary textual materials—such as Buddhist terminology dictionaries, local gazetteers, and image-term alignment samples—were added to enhance domain specificity. A mixed data strategy balanced general image-text pairs with domain-specific content. Structured prompt templates were introduced to guide the model through archaeological reasoning tasks.
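
A structured prompt template of the kind mentioned above might look like the following sketch. The step wording, `REASONING_STEPS`, and `build_prompt` are assumptions made for illustration; the actual templates used for CSN are not reproduced here.

```python
# Hypothetical reasoning steps that move from observation to style to period.
REASONING_STEPS = [
    "Describe the visible features: face, hairstyle, costume, posture, gesture, pedestal, halo.",
    "Relate those features to a regional or religious stylistic system.",
    "Propose a historical period and state which features support it.",
]


def build_prompt(statue_id: str) -> str:
    """Assemble a step-by-step instruction for one statue image."""
    lines = [f"You are assisting with the analysis of grotto statue {statue_id}."]
    for number, step in enumerate(REASONING_STEPS, start=1):
        lines.append(f"Step {number}: {step}")
    lines.append("Answer each step in order, using archaeological terminology.")
    return "\n".join(lines)


prompt = build_prompt("S01")
```

Ordering the steps this way asks the model to commit to visual evidence before it names a period, which is the reasoning chain the evaluation later checks for.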

Model architecture

ChronoStyleNet (CSN) is built upon the llava-onevision-qwen2-7b⁵⁰ architecture, an MLLM that integrates Qwen2 as the language backbone and supports high-resolution image input up to 2306 × 2306 pixels. Among the tested open-source MLLMs (llava-v1.5, llava-next, llava-onevision), the Qwen2-based variant exhibited the most robust Chinese-language understanding, making it a suitable foundation for this domain-specific task.

CSN was fine-tuned using 19,708 instruction–response pairs, including 10,731 general-domain samples and 8977 domain-specific samples related to heritage and Buddhist statuary. General instructions were curated from 13 open-source datasets and filtered for quality. Domain-specific data were generated using structured expert annotations combined with prompt-based synthesis using Qwen2-VL-72B.

All images were standardized via center-cropping and resizing, converted to RGB, and normalized using ImageNet statistics. Data augmentation techniques included random rotation, horizontal flipping, and perspective distortion.
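
The center-cropping and normalization steps can be illustrated without any imaging library. This is a dependency-free sketch, assuming a square center crop and the commonly used ImageNet channel statistics; the function names are hypothetical, and a real pipeline would more likely apply equivalent library transforms to whole image tensors.

```python
from typing import Tuple

# Widely used ImageNet per-channel mean and standard deviation (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Scale 8-bit RGB values to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


box = center_crop_box(400, 300)
pixel = normalize_pixel((128, 128, 128))
```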

Six-dimensional evaluation framework

To assess the archaeological applicability of CSN, we designed a six-dimensional evaluation framework that covers both visual recognition and cultural reasoning⁵¹, including:

Historical period identification: Accuracy of dynastic attribution

Stylistic classification: Ability to differentiate between regional and religious stylistic systems

Visual detail description: Precision in describing facial, attire, iconography, and pose features

Terminological accuracy: Consistency with archaeological and art-historical vocabulary

Cultural interpretation: Depth of understanding regarding religious or symbolic meaning

Linguistic coherence: Clarity, structure, and logical reasoning in textual output

This framework serves both as an internal performance benchmark and a comparative standard against general-purpose MLLMs in cultural heritage tasks.
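
Per-dimension scores under such a framework can be aggregated as sketched below. The short dimension keys and the example values are assumptions made for illustration; they are not the scores reported in this study.

```python
from statistics import mean
from typing import Dict, List

# Short keys standing in for the six dimensions listed above (hypothetical names).
DIMENSIONS = ["period", "style", "detail", "terminology", "culture", "coherence"]


def mean_by_dimension(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension over all scored samples for one model."""
    for index, scores in enumerate(samples):
        missing = set(DIMENSIONS) - set(scores)
        if missing:
            raise ValueError(f"sample {index} lacks scores for {sorted(missing)}")
    return {dim: mean(scores[dim] for scores in samples) for dim in DIMENSIONS}


# Illustrative values only.
scored = [
    {"period": 4, "style": 3, "detail": 2, "terminology": 3, "culture": 3, "coherence": 4},
    {"period": 2, "style": 3, "detail": 4, "terminology": 3, "culture": 1, "coherence": 4},
]
averages = mean_by_dimension(scored)
```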

Test sample construction

This study focuses on Yuan dynasty sculpture samples based on both historical significance and practical accessibility. Feilai Peak, located in Hangzhou, houses the largest and best-preserved collection of Yuan-period sculptures, notable for their fusion of Han Chinese and Tibetan Buddhist artistic traditions. In addition to its historical value, the sculptures are generally well preserved, with critical visual features such as dress, posture, and halos remaining largely intact. Many works are securely dated through local gazetteers, stone inscriptions, and archaeological research. Moreover, our research team is based in Hangzhou, enabling consistent on-site documentation and high-resolution image capture, thereby ensuring both the feasibility and completeness of the dataset. We selected 22 Yuan dynasty sculptures as evaluation samples, based on two criteria:

(1) strong representativeness of the stylistic and iconographic diversity of Yuan-period Feilai Peak artifacts;
(2) clear and uncontested period attributions supported by inscriptions and literature.

Samples were divided into:

In-domain group (18): Statues from Feilai Peak covered in training data.

Out-of-domain group (4): Statues from Baocheng Temple, a site in Hangzhou not included in the training set but sharing stylistic traits with the in-domain samples.

Stylistic similarity between the Baocheng Temple samples and the Feilai Peak corpus was determined based on expert assessment of religious figure types, drapery carving techniques, and facial modeling styles. Both sites are located within the same historical-cultural region and reflect the institutional spread of Tibetan Buddhism in the Jiangnan area during the Yuan dynasty. Despite differences in iconographic themes, the Baocheng sculptures are stylistically complementary to those at Feilai Peak. For example, the Mahākāla statue (a fierce guardian deity in Esoteric Buddhism, often regarded as a wrathful form of Avalokiteshvara) at Baocheng Temple, which bears a clear Yuan inscription, shares key iconographic motifs with wrathful deities at Feilai Peak, such as fierce expressions, flaming aureoles, bone ornaments, and trampling poses. These visual parallels reflect a localized reinterpretation of tantric sculpture rituals under state-sponsored Buddhism, highlighting an integrated visual and cognitive framework for Yuan-era esoteric imagery in Jiangnan. This setup enables testing both recognition accuracy and cross-domain generalization. Detailed annotations of all samples, including domain classification and descriptive metadata, are provided in Supplementary Table 2.
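
The in-domain and out-of-domain grouping described above amounts to partitioning samples by whether their site appears in the training data. The sketch below assumes each sample carries a `site` field; the identifiers are placeholders, and `split_by_domain` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Sites represented in the training set, per the description above.
TRAINING_SITES = {"Feilai Peak"}

Sample = Dict[str, str]


def split_by_domain(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Partition samples by whether their site was seen during training."""
    in_domain = [s for s in samples if s["site"] in TRAINING_SITES]
    out_of_domain = [s for s in samples if s["site"] not in TRAINING_SITES]
    return in_domain, out_of_domain


# Placeholder identifiers matching the 18 + 4 split in the text.
test_samples = (
    [{"id": f"F{i:02d}", "site": "Feilai Peak"} for i in range(1, 19)]
    + [{"id": f"B{i:02d}", "site": "Baocheng Temple"} for i in range(1, 5)]
)
in_domain, out_of_domain = split_by_domain(test_samples)
```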

Baseline models and evaluation targets

To benchmark CSN, we selected five widely adopted multimodal-capable mainstream large language systems as comparative baselines: GPT-4o, Claude 3.5, Gemini 1.5 Pro, LLaMA 3.3 70B, and Grok Beta. These models represent current multimodal capabilities and were evaluated on the same 22 samples under controlled blind conditions. Each received identical image inputs and standardized prompts. This ensured evaluation fairness by avoiding prompt engineering differences or model-specific bias. GPT-4o, Claude 3.5, Gemini 1.5 Pro, and Grok Beta were accessed via their respective official APIs. LLaMA 3.3 70B (Meta) was accessed through the Monica platform, which enables multimodal querying via an integrated vision module. The specific versions were as follows, with model characteristics and performance summarized in Table 1.

Table 1 Comparison of model characteristics and performance

GPT-4o (OpenAI): Version gpt-4o, released November 20, 2024

Claude 3.5 Sonnet (Anthropic): Version Claude 3.5 Sonnet, released June 21, 2024

Gemini 1.5 Pro Experimental (Google): Version gemini-1.5-pro-exp-03-25, released March 25, 2025

LLaMA 3.3 70B (Meta): Version LLaMA-3.3-Nemotron-70B-Select, initially released December 6, 2024, last updated March 18, 2025 (accessed via Monica API)

Grok 3 Beta (xAI): Version Grok 3 Beta, released February 17, 2025

Scoring protocol

Evaluation was conducted using GPT-4o as a neutral scorer following a standardized multi-step process:

Reference creation: Target answers for each statue were written by domain experts. Throughout the processes of image annotation and evaluation, we collaborated closely with specialists in Buddhist grotto sculpture. Detailed expert evaluation descriptions can be found in Supplementary Table 2.

Output normalization: All model responses were anonymized and uniformly formatted.

Automated scoring: GPT-4o compared model outputs to references across six dimensions and provided justifications⁴³. Prompt structure and representative annotated scoring outputs can be found in Supplementary Note 1.

Expert calibration: Final scores were refined based on expert feedback to ensure the validity of the evaluation. An independent panel of cave-temple specialists, unaffiliated with the authors, rigorously assessed the model’s identification performance. Figure 2 illustrates the full evaluation workflow.
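
The output normalization step in the protocol above can be sketched as replacing model names with neutral labels before scoring. This is an assumption-laden illustration: `anonymize`, the "System A" label format, and the seeded shuffle are hypothetical, and the sketch does not remove model names that appear inside the response text itself.

```python
import random
from typing import Dict, Tuple


def anonymize(outputs: Dict[str, str], seed: int = 0) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Map model names to neutral labels in a reproducible shuffled order.

    Returns the blinded responses and the key needed to unblind them later.
    """
    names = sorted(outputs)
    random.Random(seed).shuffle(names)
    key = {f"System {chr(ord('A') + i)}": name for i, name in enumerate(names)}
    blinded = {label: outputs[name].strip() for label, name in key.items()}
    return blinded, key


# Placeholder responses; the model names here are generic stand-ins.
blinded, key = anonymize({"model_x": " first answer ", "model_y": "second answer"})
```

Keeping the key separate from the blinded responses lets the scorer see only neutral labels, while the scores can still be attributed to the right model afterwards.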

**Fig. 2: Evaluation workflow for grotto-style identification based on GPT-4o scoring.**

Statistical information

Both quantitative and qualitative analyses were conducted. To evaluate model performance across six expert-defined evaluation dimensions, independent samples two-tailed t-tests were used to compare ChronoStyleNet against each baseline model (GPT-4o, Claude 3.5, Gemini 1.5 Pro, LLaMA 3.3 70B, Grok Beta). For each comparison, Levene’s Test was applied to determine variance equality. If the assumption of equal variances was violated (p < 0.05), Welch’s t-test results were reported. The corresponding t-values, degrees of freedom, p-values, and Levene’s F-values for homogeneity of variance are provided in Supplementary Table 3. To assess the generalization capacity of ChronoStyleNet, an independent samples two-tailed t-test was conducted to compare its performance on in-domain (n = 18) and out-of-domain (n = 4) samples. The corresponding t-values, degrees of freedom, and p-values are provided in Supplementary Table 4. In all figures, error bars represent standard deviation (SD). Each model was evaluated across 22 test samples, with scoring conducted independently across six evaluation dimensions. All statistical analyses were performed using IBM SPSS Statistics 27.0 and GraphPad Prism 10.2.
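
For readers unfamiliar with Welch’s t-test, the statistic and its Welch–Satterthwaite degrees of freedom can be computed as in the sketch below. It is an illustration with made-up numbers, not a reproduction of the SPSS analysis; the p-value is omitted because it requires the t-distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Illustrative numbers only, not scores from this study.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike the pooled-variance t-test, this form does not assume equal variances, which is why it is the fallback when Levene’s test rejects homogeneity.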

Qualitative review examined reasoning logic, term usage, and symbolic interpretation. Error analysis focused on typical failures in period judgment, reasoning logic, and the precision of domain-specific knowledge usage. This helped define the current model’s limits and informed future optimization directions.

Results

ChronoStyleNet performance across six evaluation dimensions

This study evaluates the performance of ChronoStyleNet (CSN) against five mainstream multimodal systems (GPT-4o, Claude 3.5, Gemini 1.5 Pro, LLaMA 3.3 70B, and Grok Beta) using 22 Yuan Dynasty grotto statues as test samples. To aid understanding of ChronoStyleNet’s performance, Fig. 3 illustrates its overall architecture, highlighting key modules from data preprocessing to multimodal integration. The evaluation framework focuses on six key dimensions: historical period identification, style classification, descriptive details, terminology standardization, cultural interpretation, and linguistic logic. Figure 4 presents a streamlined visualization of this six-dimensional evaluation framework, highlighting scoring dimensions and their interrelations. The full scoring rubric is detailed in Supplementary Table 5. The output performance of sample tasks was systematically tested, with scoring conducted by GPT-4o as a neutral scorer under blind conditions using identical inputs and uniform instructions. This minimized the influence of cue words, information leakage, and human bias. The results were further refined with expert reviews to ensure fairness. Complete per-sample evaluation scores for each model across the six dimensions are available in Supplementary Table 6.

**Fig. 3: Architecture of ChronoStyleNet multimodal framework.**

**Fig. 4: Evaluation framework overview.**

Among the six score items, CSN ranked first in the mean scores across five dimensions, showing a significant overall advantage (see Supplementary Table 3 for full test statistics). It performed well in the dimensions of historical period identification accuracy (3.95 points) and logical-contextual coherence (4.18 points), scoring much higher than all other models. CSN was able to extract precise temporal features that aligned with Yuan-dynasty characteristics—such as Vajrayana elements and regional facial modeling—thus enabling accurate dynastic classification. In contrast, general-purpose multimodal large language models tended to produce broader stylistic descriptions lacking specific cultural markers, often leading to vague or incorrect period attribution. This is reflected in their tendency to classify many Yuan samples as Tang dynasty (Fig. 6b), a pattern likely influenced by the dominance of Tang-era material in their training corpus and their limited integration of domain-specific chronology and symbolic reasoning.

CSN also significantly outperforms the other models in terminological normativity (3.14 points) and depth of cultural interpretation (2.77 points), reflecting more accurate use of domain terminology and a stronger grasp and expression of cultural context. CSN is capable of articulating how dynastic periods, regional styles, and religious influences interact, particularly the syncretic fusion of Han Chinese and Tibetan elements in Yuan-dynasty grotto art, thereby demonstrating more advanced cultural reasoning and academic expressiveness.

In terms of richness of descriptive details, CSN scores (2.50) are slightly lower than GPT-4o (2.73) and Claude 3.5 (3.00), ranking fourth among the models. This suggests that while CSN demonstrates a strong ability to identify critical stylistic cues for accurate period classification, it tends to terminate its descriptive output once sufficient evidence for dynastic attribution has been established. In contrast, general-purpose models often provide more exhaustive—but less targeted—descriptions. This difference reflects a trade-off between precision and verbosity: CSN prioritizes diagnostic features relevant to archaeological reasoning, whereas baseline models emphasize broader scene rendering. Consequently, there remains room for CSN to enhance its descriptive depth and contextual supplementation without compromising its accuracy.

To visualize model performance, we present a comparative summary in Fig. 5, which illustrates multi-dimensional evaluation outcomes.

**Fig. 5: Visual Comparison of Multimodal Model Performance and Interpretability on Yuan Dynasty Grotto Sculptures.**

CSN’s standard deviation is generally higher than that of the other models (e.g., 1.43 on historical period recognition accuracy), indicating that its outputs fluctuate from case to case. This may be because its generation strategy is more oriented towards diversity and exploration. In contrast, models such as Claude 3.5, which had a standard deviation of 0, were stable but produced more stereotyped output. The Tukey HSD post-hoc test comparing CSN with the highest-scoring other model in each scoring dimension showed the following results:

In six dimensions (historical period identification, style categorization, terminology specification, cultural interpretation, logical coherence, and overall evaluation), the CSN scored significantly higher than the other models (p < 0.001), which showed a clear advantage.

Only in “richness of descriptive details” did Claude 3.5 score highest, and its difference from CSN was significant (p = 0.012), showing the relative advantage of Claude 3.5 in describing details.

Overall, CSN demonstrated leadership in knowledge accuracy, logic, terminology standardization, and cultural understanding, making it one of the most comprehensive models in the current test, and only slightly inferior to Claude 3.5 and GPT-4o in detail depiction.

In-domain and out-of-domain generalizations

This experiment focuses on the recognition and analysis of Yuan dynasty grotto statues, aiming to explore how well CSN understands and judges grotto statues with distinctive regional characteristics. Because the training set contains Yuan dynasty statues from Feilai Peak, this experiment compares “in-domain” and “out-of-domain” samples:

In-domain samples (18 statues): from the Feilai Peak area in the training set, with highly consistent styles.

Out-of-domain samples (4 statues): from other grottoes in Hangzhou, which were not involved in the training but have similar styles.

Figure 6 integrates visual heatmaps, category flow diagrams, and scoring comparisons to highlight CSN’s marked advantage in accurately analyzing in-domain samples, while also exposing performance gaps on out-of-domain tasks. The statistical results (Fig. 6c) show that CSN scores significantly higher on all dimensions for in-domain samples than for out-of-domain samples. The model thus adapts more strongly to the regional characteristics represented in its training data, and its generalization ability needs further improvement. A full breakdown of generalization scores on in-domain and out-of-domain samples is provided in Supplementary Table 4.

**Fig. 6: Comparative evaluation of CSN and baseline multimodal models.**

In contrast, other mainstream models such as GPT-4o and Claude 3.5 are more homogeneous and neutral in their overall scores. This homogeneity is reflected in the relatively uniform and low variance in heatmap shading across models (Fig. 6a), indicating limited differentiation in evaluation scores across samples. For example, GPT-4o demonstrates consistent scores on dimensions such as logic & coherence and term consistency, as evidenced by solid color blocks without gradation. However, this consistency also highlights GPT-4o’s limited ability to capture nuanced regional differences and complex stylistic variations in out-of-domain samples.

Error analysis and interpretability findings

To further reveal the boundaries and limitations of the CSN model’s ability in the Buddhist statue recognition task, typical misjudgments and omissions are selected for analysis, focusing on the model’s deficiencies in stylistic judgments, terminology use, symbolic interpretation and logical expression.

In terms of overall performance, CSN is able to determine the Yuan Dynasty style of the statues in most cases, but its analysis often rests on vague, generic wording and lacks clear, characteristic supporting details. When identifying Yuan dynasty Buddhist statues, the model mostly uses general expressions such as “complex decoration” and “delicate carving” as the basis for judgment, but fails to point out key artistic features such as “the five-leafed crown” and “royal ease pose (lalitasana)”. In Sample 01 Vairocana (a central figure in Esoteric Buddhism, representing the universal Buddha), the model correctly attributes the statue to the Yuan period but omits doctrinally significant elements such as “the five-Buddha crown” and “turning-the-wheel mudrā”, which are essential for identifying Vairocana’s religious identity. This tendency toward generalization weakens the persuasiveness and professionalism of the model’s judgments.

In terms of language expression, the CSN had problems with unclear logical chains and jumps in reasoning. For example, some of the responses contained assertions such as “It can be seen from the complex decoration that this is the style of the Yuan Dynasty”, failing to establish a reasonable reasoning process from visual details to style judgment and then to period attribution. This makes the output of the model formally complete, but lacking in academic verifiability and rigor.

The interpretation of symbolism is another challenge of the model. In the face of Tibetan Buddhist statues with strong religious connotations, the CSN often fails to accurately recognize their symbolic systems or misunderstands their religious functions. In Sample 05 Zambala (a tantric wealth deity in Tibetan Buddhism), although the model correctly identifies the dynasty, it fails to recognize core tantric symbols such as the jewel-spitting mongoose or the deity’s ritual association with abundance, thereby overlooking the statue’s esoteric religious role. In Sample 22 Mahākāla, the model completely ignores its typical identity as a protector deity, failing to mention its visual features such as stepping on people, embracing skulls, and flaming patterns, and even misinterpreting the “wrathful” shape as a symbol of “wisdom and enlightenment”, which is completely out of the cultural context of the Buddhist Tantric statue system. These issues are illustrated in Fig. 6d, which presents a comparative visualization of expert interpretations, CSN outputs, and baseline models’ responses across three Yuan-dynasty Buddhist statues.

Summary of evaluation outcomes

The comprehensive analysis of the CSN model’s recognition performance and multi-dimensional evaluation results on 22 Yuan dynasty grotto statues shows that the model performs well in “historical period recognition”, “terminology standardization”, “cultural interpretation” and “linguistic coherence”. Its overall score is ahead of other mainstream models, showing strong professional expression ability and adaptability to regional contexts. Within the coverage of the training corpus, CSN is able to accurately grasp the style of Yuan dynasty statues, and complete the task of style categorization and expression in a more standardized language.

However, the model still has systematic limitations in several key aspects. First, CSN performs significantly better on “in-domain” samples than on “out-of-domain” samples. This shows that its recognition ability depends strongly on the training context: it still struggles to transfer to unfamiliar styles and rare subjects, and its generalization ability is insufficient. Second, its scores on the “richness of descriptive details” dimension are low, and the standard deviation of its scores is generally high across dimensions, indicating that its output still needs to improve in consistency and completeness of detail.
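The two limitations named here, an in-domain versus out-of-domain gap and a high per-dimension standard deviation, can be summarized with a few lines of code. The following is a minimal sketch only: the sample identifiers, the in-domain flags, the 0 to 10 scale, and every score value are invented for illustration and are not the study's data.

```python
from statistics import mean, stdev

# Hypothetical records per evaluation dimension: (sample_id, in_domain, score).
# All values are invented for illustration; the 0-10 scale is an assumption.
scores = {
    "historical period recognition": [
        ("s01", True, 9.0), ("s02", True, 8.5), ("s03", False, 6.0), ("s04", False, 5.5),
    ],
    "richness of descriptive details": [
        ("s01", True, 6.5), ("s02", True, 7.0), ("s03", False, 4.0), ("s04", False, 3.5),
    ],
}

def summarize(records):
    """Return the mean, sample standard deviation, and in/out-of-domain mean gap."""
    all_scores = [score for _, _, score in records]
    in_domain = [score for _, flag, score in records if flag]
    out_domain = [score for _, flag, score in records if not flag]
    return {
        "mean": mean(all_scores),
        "std": stdev(all_scores),
        "domain_gap": mean(in_domain) - mean(out_domain),
    }

summary = {dim: summarize(recs) for dim, recs in scores.items()}
for dim, stats in summary.items():
    print(f"{dim}: mean={stats['mean']:.2f} std={stats['std']:.2f} gap={stats['domain_gap']:.2f}")
```

A large `domain_gap` would correspond to the dependence on training context described above, and a large `std` to inconsistent output quality within a dimension.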

Discussion

This study proposes and evaluates ChronoStyleNet, a domain-specific multimodal model designed for stylistic identification and semantic interpretation of grotto sculptures, with a focus on Yuan-dynasty artworks from Feilai Peak, Hangzhou. Aligning with the trend of applying MLLMs to cultural heritage research, we construct a six-dimensional evaluation system incorporating expert-level archaeological criteria. Comparative testing with five general-purpose multimodal-capable systems reveals that CSN outperforms them in dimensions such as historical period recognition, cultural interpretation, and terminological precision.

ChronoStyleNet serves as an assistive tool for large-scale grotto image processing, style recognition, and scholarly description generation. Its structured prompt design and feedback loop allow for programmable expression of expert knowledge and continuous optimization during deployment. Importantly, CSN’s modular architecture and prompt-based design offer the potential to extend the model to other religious traditions or regional sculptural systems. With minimal architectural changes, the model can be transferred to new iconographic corpora through targeted prompt calibration and transfer learning. While the framework is broadly adaptable, certain iconographic systems may benefit from targeted adjustments to align with distinct symbolic logics. With its reusable scoring criteria and modular supervision interface, CSN also lowers the barrier for non-specialist users, offering educational applicability under guided use. As human–AI interaction increases, the model’s descriptive and interpretive capabilities will likely evolve through expert feedback and iterative refinement. This work highlights the potential of AI to bridge expertise gaps in endangered sites globally, redefining best practices for digital conservation and setting a new state of the art for heritage informatics.

However, the model faces multiple limitations that inform directions for future development. First, its current training scope is limited to sculptures from Feilai Peak. While this site offers strong preservation and clear stylistic features, broader coverage across regions and dynasties is needed to improve generalizability. We propose a three-phase data expansion plan: first extending to other major grottoes within Zhejiang, then to Jiangnan sites, and ultimately toward a balanced national-scale dataset. Future iterations will incorporate additional images per object, including multi-angle and close-up perspectives, to support more comprehensive spatial understanding and visual detail recognition.

CSN still struggles to capture the symbolic and ritual significance embedded in visual details. To improve this, we will introduce structured symbolic annotations informed by expert knowledge, linking elements such as gestures, ornaments, and deity forms to their religious functions. These annotations will support a stepwise training process that moves from visual recognition to symbolic inference and then to final interpretation. In parallel, we plan to incorporate religious ontologies and knowledge graphs to help the model reason across deity hierarchies and symbolic systems. In terms of output expressivity, CSN sometimes produces less detailed descriptions than the baseline models. This may reflect a trade-off between clarity and richness. To address this, we will explore multi-stage generation and reinforcement-based strategies that allow for more elaborate yet accurate outputs. Prompt design will also be improved by embedding symbolic cues to make responses more specific and culturally grounded.
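One way to picture the proposed structured symbolic annotations is as a small lookup table that links a recognized visual element to its religious function, followed by an interpretation step that cites the evidence for each claim. The sketch below is an assumption about what such a structure could look like, not the authors' schema: the class name, field names, and both table entries are hypothetical, with the entries loosely based on the Zambala and Mahākāla examples discussed in the evaluation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SymbolAnnotation:
    element: str   # visual element visible on the statue
    category: str  # e.g. "attribute", "gesture", "ornament"
    meaning: str   # symbolic or ritual function assigned by an expert

# Illustrative entries only; a real table would be authored by domain experts.
ANNOTATIONS = [
    SymbolAnnotation("jewel-spitting mongoose", "attribute", "wealth and abundance"),
    SymbolAnnotation("flaming patterns", "ornament", "wrathful protector imagery"),
]

def infer_symbolism(detected_elements):
    """Symbolic inference: map recognized elements to annotated meanings."""
    index = {a.element: a for a in ANNOTATIONS}
    matched = [index[e] for e in detected_elements if e in index]
    unmatched = [e for e in detected_elements if e not in index]
    return matched, unmatched

def interpret(detected_elements):
    """Final interpretation: list each claim together with its visual evidence."""
    matched, unmatched = infer_symbolism(detected_elements)
    lines = [f"{a.element} ({a.category}) -> {a.meaning}" for a in matched]
    if unmatched:
        lines.append("unannotated elements: " + ", ".join(unmatched))
    return lines

# The detected elements stand in for the output of the visual recognition stage.
result = interpret(["jewel-spitting mongoose", "lotus pedestal"])
print(result)
```

Keeping unmatched elements visible, rather than silently dropping them, would let an expert reviewer see where the annotation table needs to grow.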

These findings suggest that while ChronoStyleNet has demonstrated its feasibility and utility, its output should remain subject to professional review in high-stakes applications such as digital conservation or public exhibitions. Future development may include integrating real-time environmental sensing for deterioration monitoring, extending the model’s utility in risk-aware heritage management. In public-facing applications, CSN can serve as the backbone for interactive tools such as AR-guided tours or semantic visualization interfaces. By bridging the gap between expert knowledge and lay perception, CSN provides a framework that facilitates the translational application of grotto art in public education and cultural communication.

Dataset Documentation

The dataset covers 49% of the qualified sculptural samples from the West Lake region, based on a total of 596 grotto sculptures across 24 cave sites. Approximately 500 sculptures from Shiwudong (Stone House Cave) were excluded from the count because they are modern reconstructions and lack historical authenticity.
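The 49% figure can be checked with simple arithmetic, assuming it is the ratio of the 295 expert-annotated statues reported in the abstract to the 596 qualified West Lake sculptures stated above. That pairing is an inference from the two numbers, not a formula given in the paper.

```python
annotated_statues = 295  # expert-annotated Feilai Peak statues (from the abstract)
qualified_total = 596    # qualified West Lake grotto sculptures (stated above)

# Assumed definition of coverage: annotated statues divided by the qualified total.
coverage = annotated_statues / qualified_total
print(f"West Lake coverage: {coverage:.1%}")
```

Under this assumption the ratio is just under one half, which agrees with the reported 49% after rounding.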

Data availability

The expert-annotated sculpture dataset (295 statues from Feilai Peak Grottoes) and the test dataset (22 Yuan-dynasty sculptures) generated and analyzed during the current study are not publicly available at this time due to the sensitive nature of ongoing research and agreements with heritage authorities for the protection of cultural site data. However, these datasets, along with a detailed list of sources for the 2.46 GB archaeological literature corpus, are available from the corresponding author (Wei Ren, wei.ren1012@foxmail.com) on reasonable request, subject to appropriate ethical considerations and data sharing agreements.

Code availability

The custom code for the ChronoStyleNet (CSN) model, including scripts for data processing, model training (based on the llava-onevision-qwen2-7b framework), and evaluation, is currently under active development and refinement as part of ongoing research. The code is not yet publicly archived but is available from the corresponding author (Wei Ren, wei.ren1012@foxmail.com) on reasonable request for research and verification purposes, subject to a material transfer agreement or appropriate licensing to protect intellectual property. All training and evaluation code will be released in a public GitHub repository within six months of the paper’s acceptance. The repository link will be provided upon publication.

References

Lai, T. Han-Tibetan Treasures: a Study of the Feilai Peak Cave Statues in Hangzhou. (Cultural Relics Publishing House, 2015).
Shao, Q. (Ed.) Hangzhou Grottoes (Shanghai Jiao Tong University Press, 2023).
Grilli, E., Özdemir, E. & Remondino, F. Application of machine and deep learning strategies for the classification of heritage point clouds. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. XLII-4/W18, 447–454 (2019).
Cintas, C. et al. Automatic feature extraction and classification of Iberian ceramics based on deep convolutional networks. J. Cult. Herit. 41, 106–112 (2020).
Howard, A. F., Wu, H., Li, S. & Yang, H. Chinese Sculpture (Yale University Press, 2006).
Croce, V. et al. Semi-automatic classification of digital heritage on the aïoli open source 2D/3D annotation platform via machine learning and deep learning. J. Cult. Herit. 62, 187–197 (2023).
Wang, K., Xu, G., Li, S. & Ge, C. Geo-environmental characteristics of weathering deterioration of red sandstone relics: a case study in tongtianyan grottoes, southern China. Bull. Eng. Geol. Environ. 77, 1515–1527 (2018).
Zhang, J. et al. Surface weathering characteristics and degree of niche of sakyamuni entering nirvana at dazu rock carvings, China. Bull. Eng. Geol. Environ. 78, 3891–3899 (2019).
Sesana, E., Gagnon, A. S., Ciantelli, C., Cassar, J. & Hughes, J. J. Climate change impacts on cultural heritage: a literature review. WIREs Clim. Change 12, e710 (2021).
Su, B. Studies on Chinese Grotto Temples (SDX Joint Publishing Company, 2019).
Yu, T. et al. Artificial intelligence for dunhuang cultural heritage protection: The project and the dataset. Int. J. Comput. Vis. 130, 2646–2673 (2022).
Anichini, F. et al. Developing the ArchAIDE application: a digital workflow for identifying, organising and sharing archaeological pottery using automated image recognition. Internet Archaeol. https://doi.org/10.11141/ia.52.7 (2020).
Chen, H. et al. DualAST: dual style-learning networks for artistic style transfer. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 872–881. https://doi.org/10.1109/CVPR46437.2021.00093 (IEEE, 2021).
Gîrbacia, F. An analysis of research trends for using artificial intelligence in cultural heritage. Electronics 13, 3738 (2024).
Chapinal-Heras, D. & Díaz-Sánchez, C. A review of AI applications in human sciences research. Digit. Appl. Archaeol. Cult. Herit. 30, e00288 (2023).
Windhager, F. et al. Visualization of cultural heritage collection data: State of the art and future challenges. IEEE Trans. Vis. Comput. Graph. 25, 2311–2330 (2019).
Li, M., Wang, Y. & Xu, Y.-Q. Computing for Chinese cultural heritage. Vis. Inform. 6, 1–13 (2022).
Marsili, G. & Orlandi, L. M. Digital humanities and cultural heritage preservation: the case of the BYZART (byzantine art and archaeology on europeana) project. Stud. Digit. Herit. 3, 2. https://doi.org/10.14434/sdh.v3i2.27721 (2019).
Marchand, E. et al. Extraction of a Knowledge Graph from French cultural heritage documents. In Proc. ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium, 23–35. https://doi.org/10.1007/978-3-030-55814-7_2 (Springer International Publishing, 2020).
Yang, S. & Hou, M. Knowledge graph representation method for semantic 3D modeling of Chinese grottoes. Herit. Sci. 11, 266 (2023).
Spallone, R. et al. 3D modelling and virtual reality for museum heritage presentation: Contextualisation of sculpture from the tang era. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. XLVIII-2-W4-2024, 413–420 (2024).
Skublewska-Paszkowska, M., Milosz, M., Powroznik, P. & Lukasik, E. 3D technologies for intangible cultural heritage preservation—literature review for selected databases. Herit. Sci. 10, 3 (2022).
De Luca, L. 3D modelling and semantic enrichment in cultural heritage. In Photogrammetric Week ’13 (ed. Fritsch, D.). https://hal.science/hal-02892090 (2013).
Grilli, E., Dininno, D., Marsicano, L., Petrucci, G. & Remondino, F. Supervised segmentation of 3D cultural heritage. In Proc. 3rd Digital Heritage International Congress (DigitalHERITAGE) held jointly with 2018 24th International Conference on Virtual Systems & Multimedia (VSMM 2018), 1–8. https://doi.org/10.1109/DigitalHeritage.2018.8810107 (IEEE, 2018).
Bi, X., Sun, Z. & Chen, Z. A novel unsupervised contrastive learning framework for ancient Yi script character dataset construction. Npj Herit. Sci. 13, 39 (2025).
Yu, T. et al. End-to-end partial convolutions neural networks for dunhuang grottoes wall-painting restoration. In Proc. IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 1447–1455. https://doi.org/10.1109/ICCVW.2019.00182 (IEEE, 2019).
Li, Y. et al. Universal style transfer via feature transforms. Adv. Neural Inf. Process. Syst. 30, https://proceedings.neurips.cc/paper/2017/hash/49182f81e6a13cf5eaa496d51fea6406-Abstract.html (2017).
Marafini, F. A proposal of classification for machine-learning vibration-based damage identification methods. 593–598. https://doi.org/10.21741/9781644902431-96 (2023).
Tan, X., Wu, X. & Yang, C. Visual cultural symbol recognition based on muti-feature extracting. In Proc. 8th International Symposium on Computational Intelligence and Design (ISCID), 306–310. https://doi.org/10.1109/ISCID.2015.304 (IEEE, 2015).
Zhu, D. et al. XunZi-MLLM: A multimodal large language model for ancient text and image recognition. Digit. Scholarsh. Humanit. fqaf026. https://doi.org/10.1093/llc/fqaf026 (2025).
Rachabatuni, P. K., Principi, F., Mazzanti, P. & Bertini, M. Context-aware chatbot using MLLMs for Cultural Heritage. In Proc. ACM Multimedia Systems Conference 2024, 459–463. https://doi.org/10.1145/3625468.3652193 (ACM, 2024).
Zhang, C. et al. Can MLLMs understand the deep implication behind Chinese images? In Proc. 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 700, 14369–14402. https://doi.org/10.18653/v1/2025.acl-long. (Association for Computational Linguistics, Vienna, Austria, 2025).
Liu, S. et al. CultureVLM: Characterizing and improving cultural understanding of vision-language models for over 100 countries. Preprint at https://doi.org/10.48550/arXiv.2501.01282 (2025).
Zhou, Z., Xi, Y., Xing, S. & Chen, Y. Cultural bias mitigation in vision-language models for digital heritage documentation: a comparative analysis of debiasing techniques. Artif. Intell. Mach. Learn. Rev. 5, 28–40 (2024).
Zhong, T. et al. Opportunities and challenges of large language models for low-resource languages in humanities research. Preprint at https://doi.org/10.48550/arXiv.2412.04497 (2024).
Urailertprasert, N., Limkonchotiwat, P., Suwajanakorn, S. & Nutanong, S. SEA-VQA: Southeast Asian cultural context dataset for visual question answering. In Proc. 3rd Workshop on Advances in Language and Vision Research (ALVR), 173–185. https://doi.org/10.18653/v1/2024.alvr-1.15 (Association for Computational Linguistics, 2024).
Luo, Y., Tang, J., Huang, C., Hao, F. & Lian, Z. CalliReader: contextualizing chinese calligraphy via an embedding-aligned vision-language model. Preprint at https://doi.org/10.48550/arXiv.2503.06472 (2025).
Rein, D. et al. GPQA: a graduate-level Google-proof Q&A benchmark. First Conference on Language Modeling. https://openreview.net/forum?id=Ti67584b98 (ICLR, 2024).
Pavlova, V. & Makhlouf, M. Building an efficient multilingual non-profit IR System for the Islamic domain leveraging multiprocessing design in rust. In Proc. 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track 73, 981–990, https://doi.org/10.18653/v1/2024.emnlp-industry. (Association for Computational Linguistics, Miami, Florida, US, 2024).
Alrefaie, M. T., Salem, F., Morsy, N. E., Samir, N. & Gaber, M. M. The dynamics of meaning through time: assessment of large language models. Preprint at https://doi.org/10.48550/arXiv.2501.05552 (2025).
Chartier, M., Dakkoune, N., Bourgeois, G. & Jean, S. HiBenchLLM: historical inquiry benchmarking for large language models. Data Knowl. Eng. 156, 102383 (2025).
Liu, Y. et al. G-eval: NLG evaluation using GPT-4 with better human alignment. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 153, 2511–2522 https://doi.org/10.18653/v1/2023.emnlp-main. (Association for Computational Linguistics, Singapore, 2023).
Sottana, A., Liang, B., Zou, K. & Yuan, Z. Evaluation metrics in the era of GPT-4: reliably evaluating large language models on sequence to sequence tasks. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 543, 8776–8788 https://doi.org/10.18653/v1/2023.emnlp-main. (Association for Computational Linguistics, Singapore, 2023).
Akbulut, C. et al. Century: a framework and dataset for evaluating historical contextualisation of sensitive images. In Proc. 13th International Conference on Learning Representations (2025).
Yu, T. et al. Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants. Preprint at https://doi.org/10.48550/arXiv.2310.00653 (2023).
Li, F. et al. LLaVA-NeXT-interleave: tackling multi-image, video, and 3D in large multimodal models. Preprint at https://doi.org/10.48550/arXiv.2407.07895 (2024).
Dong, X. et al. InternLM-XComposer2: mastering free-form text-image composition and comprehension in vision-language large model. Preprint at https://doi.org/10.48550/arXiv.2401.16420 (2024).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual Instruction Tuning. In Proc. 37th International Conference on Neural Information Processing Systems 1–25 (Curran Associates Inc., New Orleans, LA, USA, 2023).
Zhejiang Provincial Institute of Cultural Relics and Archaeology & Institute for Cultural Heritage, Zhejiang University. Collected Studies on the Archaeology of Zhejiang Grottoes: Volume I & II. (Zhejiang Ancient Books Publishing House, 2024).
LMMs-Lab. LLaVA-OneVision-Qwen2-7B Model Card. https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov (2024).
Tibaut, A. & Guerra de Oliveira, S. A framework for the evaluation of the cultural heritage information ontology. Appl. Sci. 12, 795 (2022).
OpenAI. GPT-4o Technical Report. https://openai.com/gpt-4o (2024).
Anthropic. Claude 3.5 Sonnet Model Card. https://www.anthropic.com/index/claude-3-5 (2024).
Google DeepMind. Gemini 1.5 Pro Experimental Release Notes. https://deepmind.google/technologies/gemini (2025).
Meta AI. LLaMA-3.3-Nemotron-70B-Select Release Notes. https://ai.meta.com/llama (2024).
xAI. Grok 3 Beta Overview. https://x.ai/blog/grok (2025).

Acknowledgements

This research was funded by the Natural Science Foundation of Zhejiang Province (Y22D010489; “3D identification and classification of Buddhist sculptures based on the ResNet model”), the Philosophy and Social Sciences Planning Project of Zhejiang Province—Rare & Under-studied Disciplines Programme (23LMJX15YB; “AI-based cataloguing of figural sculptures from Zhejiang’s grotto temples and rock carvings”), the second cohort of “14th Five-Year Plan” Provincial Graduate-level Teaching Reform Projects, Zhejiang (JGCG2024479; “Construction and sharing of a graduate educational resource bank built on large-scale digital models of grotto sculptures”), and the 2022 Spring Breeze (Chunhui) Collaborative Research Project of the Ministry of Education (HZKY202220194; “Digital restoration of overseas-lost grotto sculptures using artificial intelligence”). We also acknowledge the advanced computing resources provided by the Supercomputing Center of Hangzhou City University. The authors are grateful to Kui Su, Jun Lu, Yio Zhang, Bing Yang and Yi Ding for their valuable assistance with high-performance computing.

Author information

Authors and Affiliations

School of Art and Archaeology, Hangzhou City University, Hangzhou, China
Jia Xing, Wei Ren, Du Lei, Lin Zhao, Yirui Han, Zike Yu, Xiaoping Zhou & Cheng Yang
School of Art and Archaeology, Zhejiang University, Hangzhou, China
Jia Xing
Sigmai Co. Ltd, Hangzhou, China
Xue Qin, Hongdang Shao & Wenjie Li
Zhejiang Wanli University, Ningbo, China
Zheng Xu
School of Information and Electrical Engineering, Hangzhou City University, Hangzhou, China
Rui Yin & Jiantao Yuan
School of Computer and Computing Science, Hangzhou City University, Hangzhou, China
Jun Wang, Wei Chen & Binbin Zhou
International School of Cultural Tourism, Hangzhou City University, Hangzhou, China
Jun Xu
Shenzhen Technology University, Shenzhen, China
Wei Zhou

Contributions

J.X. collected and organized the dataset, assisted in model fine-tuning, conducted experiments, analyzed results, created visualizations, and finished the manuscript. D.L. and L.Z. contributed to data collection and organization, and assisted in model tuning. X.Q. provided suggestions for research design and offered critical feedback. H.D.S. and W.R. designed the visual prompt tuning strategy and provided technical consultation. Y.R.H. organized data and provided feedback during model training. Z.K.Y. contributed to data collection. Z.X., R.Y., J.T.Y., J.W., W.C., and J.X. participated in the review of the manuscript. X.P.Z. was responsible for data labeling and annotation correction. C.Y. collected part of the data on grotto cultural relics and statues. W.Z. constructed and repeatedly revised the theoretical framework for research on cultural relic models. B.B.Z. contributed to the review of the manuscript. W.R. supervised the project, proposed the ChronoStyleNet framework, and provided key guidance on the overall direction. All authors contributed to the interpretation of results and provided substantial feedback on the analysis and manuscript.

Corresponding author

Correspondence to Wei Ren.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information (download DOCX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xing, J., Ren, W., Lei, D. et al. Multimodal AI for Yuan Buddhist sculpture chronology and style. npj Herit. Sci. 13, 443 (2025). https://doi.org/10.1038/s40494-025-01994-3

Received: 18 June 2025
Accepted: 11 August 2025
Published: 05 September 2025
Version of record: 05 September 2025
DOI: https://doi.org/10.1038/s40494-025-01994-3

Introduction