Introduction

Cultural heritage serves as a bridge between the past and the present, embodying the diversity of human expression, belief, and values. Its preservation is vital, connecting communities to history while fostering identity, belonging, and cultural continuity1. Cultural heritage comprises both tangible and intangible dimensions. Tangible heritage refers to physical objects and architectural structures of cultural, historical, or artistic significance, including castles, palaces, grottoes, temples, churches, ancient cities, gardens, and archeological sites. Intangible heritage, by contrast, encompasses non-material cultural practices transmitted across generations, such as performing arts, traditional craftsmanship, folklore, and diverse social customs. Together, tangible and intangible heritage form the core of cultural legacy, playing a vital role in shaping identity, preserving historical memory, and sustaining cultural continuity2.

Digitization of tangible cultural heritage relies largely on three-dimensional reconstruction. Early methods, notably laser scanning (e.g., LiDAR)3,4, produced accurate meshes but were costly and limited in reproducing surface materials and photorealism5,6. Advances in photogrammetry later introduced Structure-from-Motion (SfM)7 combined with Multi-View Stereo (MVS), enabling the generation of mesh models from images captured with ordinary cameras8,9 and thereby significantly lowering the threshold for reconstruction10,11.

For intangible cultural heritage (ICH), research has moved beyond static images or audio12 toward technologies that capture embodied performance. Motion capture (MoCap)13,14 can precisely record performers’ spatial positions and movement sequences, enabling accurate reconstruction of intangible practices. These data are applied to virtual character models derived from scanning or parametric modeling, then integrated with speech alignment and physical simulation to reproduce dynamic performances such as dance and opera. Combined with virtual reality, MoCap delivers immersive ICH experiences15. Despite surpassing traditional video documentation in terms of movement accuracy and editability, MoCap-based methods still struggle to convey performers’ appearance and fine-grained detail with fidelity16.

Digital documentation of cultural heritage should move beyond the archiving of isolated elements and instead seek to reconstruct the holistic contexts in which heritage practices are embedded. The UNESCO Yamato Declaration explicitly promotes the principle of integrated safeguarding17, emphasizing the intrinsic interdependence of tangible and intangible cultural heritage: material heritage provides the spatial and material conditions for intangible practices, while intangible practices confer meaning, vitality, and social value upon physical spaces. Nevertheless, many existing digitalization efforts depart from this principle. Approaches that focus on the recording of movements and rituals in intangible heritage, as well as those that prioritize geometric reconstruction in architectural conservation, have each developed mature methodological frameworks but remain largely segregated, lacking coordinated integration within a shared cultural context. This fragmentation directly undermines the concept of the spirit of place articulated in the ICOMOS Quebec Declaration18, which defines it as a composite of tangible and intangible elements. At a theoretical level, when intangible heritage is abstracted from its original contexts and digitized in isolation, this essential interaction is disrupted, leading to the loss of critical cultural information and increasing the risk of misinterpretation. At a practical level, intensified tourism development at heritage sites has progressively displaced intangible practices from their physical settings through overcrowding and noise, making it increasingly difficult to experience the integrated presence of people, place, and setting in situ. This predicament underscores the urgent need for digital technologies to intervene and compensate for such losses. Yet, much existing research relies on conventional three-dimensional reconstruction methods19, which often fail to simultaneously capture nuanced performances and complex environments. The resulting visual incongruities weaken immersion in digital experiences and sever the emotional connection between audiences and the spirit of place. This, in turn, constrains the potential of digital dissemination to foster cultural identification among communities and the broader public. Consequently, the development of integrated methods for recording and presenting intangible and tangible cultural heritage that are visually faithful, spatiotemporally coherent, and culturally holistic has become a critical challenge in cultural heritage digitalization.

3D Gaussian Splatting (3DGS)20 introduces Gaussian distributions as the fundamental primitives for scene representation, marking a paradigm shift in three-dimensional reconstruction from geometric approximation toward photorealistic rendering. As the field has progressed, this technique has demonstrated remarkable performance and robustness in cutting-edge domains such as spatial computing21, autonomous driving22, and digital human generation23. Its potential within the digital reconstruction of cultural heritage has likewise become increasingly evident. Compared with traditional photogrammetry, 3DGS exhibits clear advantages in handling complex geometries and non-Lambertian surfaces—including semi-transparent materials (e.g., jade and glass), highly reflective objects (e.g., lacquerware), and dynamic water reflections—whose intricate light-matter interactions are difficult to reproduce using conventional modeling pipelines. Moreover, it enables immersive real-time rendering with minimal manual intervention24,25.

Building on the development of Gaussian Splatting, recent studies have introduced a temporal dimension to construct temporally continuous dynamic 3DGS, commonly referred to as Gaussian Splatting-based volumetric video26. This representation is particularly well suited for documenting performance arts that are highly dependent on spatiotemporal rhythm, such as martial arts, dance, and traditional opera27. Compared with conventional two-dimensional video, volumetric video offers multi-view, interactive observation, providing a substantially more immersive spatial viewing experience. Compared with motion capture (MoCap), volumetric video acquisition does not require performers to wear sensors or tight-fitting suits, thereby preserving the authentic physical dynamics of traditional costumes—such as the flowing “water sleeves” in Kunqu opera—while avoiding the psychological intrusion that technical apparatuses may impose on performers. Despite these conceptual and technical advantages, empirical studies that employ dynamic 3DGS for comprehensive and systematic documentation of intangible cultural heritage remain scarce.

It must also be acknowledged that, as an emerging technology, 3DGS and its derived volumetric video representations still face several limitations. First, high-quality dynamic Gaussian models are often accompanied by significant rendering overhead. Second, the algorithm is somewhat sensitive to lighting conditions during data acquisition; signal-to-noise ratio (SNR) issues in extremely low-light scenarios may slightly degrade reconstructed geometric detail. Nevertheless, a growing body of research is actively addressing these challenges, and the rapid pace of technical iteration underscores the strong developmental potential of 3DGS28,29,30. Moreover, the overwhelming advantages of 3DGS in visual realism, combined with the rapid evolution of its ecosystem, position it as arguably the most effective solution for balancing efficiency and quality in immersive cultural heritage visualization.

Suzhou, located by Lake Tai, is celebrated as “heaven on earth” for its natural beauty and cultural prosperity. Among its most significant cultural contributions is Kunqu Opera, which originated in Kunshan, Suzhou, during the 14th century and dominated Chinese theater for nearly three centuries from the mid-Ming dynasty onward. In 2008, it was inscribed by UNESCO on the Representative List of the Intangible Cultural Heritage of Humanity. As one of the oldest forms of Chinese opera, it remains a cultural treasure of Chinese performing arts. A Stroll in the Garden, An Interrupted Dream (Youyuan Jingmeng), a renowned excerpt from The Peony Pavilion, exemplifies its artistic richness through the dream encounter and a mutual declaration of love between Du Liniang and Liu Mengmei, and was memorably performed by masters Mei Lanfang and Yu Zhenfei. Complementing this intangible heritage are Suzhou’s classical gardens, which originated in the Spring and Autumn period, developed during the Jin and Tang dynasties, flourished in the Song, and reached their peak in the Ming and Qing. In recognition of their outstanding value, the Humble Administrator’s Garden, Lingering Garden, Master of the Nets Garden, and Mountain Villa with Embracing Beauty were inscribed as UNESCO World Heritage Sites in 1997, with the Canglang Pavilion, Lion Grove Garden, Couple’s Garden Retreat, Garden of Cultivation, and Retreat and Reflection Garden added in 2000. Taken together, these heritage sites provide an exemplary context for the digital representation of both intangible and tangible heritage.

Grounded in the principles of the ICOMOS Quebec Declaration regarding the use of digital technologies to preserve, transmit, and interpret the spirit of place, and based on the premise that this spirit is fundamentally mediated through human experience and perception, this study selects the Humble Administrator’s Garden and the Kunqu Opera excerpt A Stroll in the Garden, An Interrupted Dream as case studies. We created a freely navigable, high-fidelity, and mobile Integrated Digital Theater that enables audiences to interactively explore the Kunqu garden scene. This platform simultaneously reproduces highly lifelike dynamic representations of performers and costumes alongside photorealistic static reconstructions of the World Heritage site, creating a virtual experience that is nearly indistinguishable from physical reality. Utilizing a VR platform, the experimental design first conducted a multidimensional comparison among flat video, virtual scenes generated via traditional photogrammetry and motion capture, and those reconstructed using Gaussian Splatting, covering aspects such as cognitive interaction and esthetic-emotional responses. Building on this, we further compared the narrative efficacy of the Integrated Digital Theater against a standalone intangible heritage narrative mode to examine how different storytelling strategies influence cultural understanding and immersive experience. Ultimately, this study attempts to construct a hybrid digital platform for cultural heritage, aiming to provide novel pathways for the synergistic dissemination of World Heritage and intangible cultural heritage, as well as the digital continuation of the spirit of place.

In summary, the contributions of this study are twofold:

  1. We propose a workflow for the unified digitization of intangible and tangible heritage using Gaussian Splatting. Results show that it enables low-cost, photorealistic, real-time rendering of dynamic performances and complex scenes.

  2. We construct an Integrated Digital Theater that combines intangible and tangible cultural heritage. By fusing the garden setting of the Humble Administrator’s Garden with Kunqu Opera performance, the theater offers an immersive, visually realistic, and collaborative heritage experience, establishing a pathway for their synergistic representation in a unified virtual environment.

Methods

Architecture

Our research workflow is illustrated in Fig. 1. The platform utilizes a range of digital assets, including flat videos, motion capture data, 3D models, Gaussian splatting point clouds, and recordings of the Kunqu opera segment. First, we designed a garden tour route by integrating the layout of Suzhou’s Humble Administrator’s Garden with the narrative of A Stroll in the Garden, An Interrupted Dream, and recorded the corresponding singing segments for each key location. Foreground masks were extracted from all video frames using background subtraction31 and composited with photographs of the garden scenes to produce flat videos. Scene reconstruction was performed using both photogrammetry and Gaussian splatting methods, each employing extensive multi-view video as input. To support this, a custom capture system was established to record key scenes along the tour. For reenactment of Kunqu performances, actors were filmed in a multi-view studio, and Gaussian splatting was applied to generate volumetric videos. For comparison, motion capture data were collected to produce character mesh models, with animations generated by automatically mapping captured motions onto the models32. To evaluate the impact of different presentation methods on visual perception, immersion, emotional response, and intangible cultural heritage cognition and transmission willingness, three display modes were compared: Flat Video, Photogrammetric Mesh with Mocap, and Gaussian Splatting (GS). We also conducted a comparative experiment contrasting Kunqu performances presented without background against those set in the Humble Administrator’s Garden. This aimed to examine audience comprehension and engagement in the Kunqu Digital Theater versus the Integrated Digital Theater. Finally, we developed an Integrated Digital Theater. Audiences can explore segments of A Stroll in the Garden, An Interrupted Dream from a garden sandbox, experiencing the performance immersively.

Fig. 1: Research framework of the proposed system.

The workflow comprises three stages: data acquisition, content production, and experimental study. We generated Kunqu performances and the Humble Administrator’s Garden scenes in three different presentation modes to comparatively evaluate their strengths and limitations. We further examined the impact of context-free Kunqu performances versus contextually integrated performances on cultural heritage dissemination. Based on the findings, we developed an integrated platform for Kunqu digital performance.

Data Acquisition and Preprocessing for Gaussian Splatting

Gaussian splatting uses multi-view video as input. We applied different equipment setups and preprocessing workflows for capturing static large scenes and dynamic actor performances.

For static large scenes, extensive multi-view video collection is necessary33. To improve efficiency, a multi-camera system comprising four action cameras, one panoramic camera, and multiple support rigs was deployed. All action cameras were set to 4K resolution at 60 fps, with low-distortion mode (DEWARP), a 1/60 s shutter speed, automatic ISO, and auto white balance to ensure high-resolution images with sufficient overlap for 3D reconstruction. Cameras were mounted on a 3-meter telescopic pole34 with horizontal mounts, quick-release plates, and ball heads oriented approximately 45° apart to maximize overlapping viewpoints (Fig. 2a). A panoramic camera atop the rig provided additional geometric reference for alignment. For taller scenes, drone photography supplemented high-angle views (Fig. 2d, e). Camera movement was slowed near key areas to improve local detail quality. Frames were extracted from the multi-camera video to reduce redundancy and computational load, followed by quality screening to remove blurred or poorly exposed frames. To mitigate interference from crowded heritage sites, automated human detection and occlusion correction35,36 were applied, ensuring accurate feature extraction and point cloud reconstruction.
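For reproducibility, the sketch below illustrates one way to implement the frame-extraction and blur-screening step with OpenCV; the sampling stride and sharpness threshold are illustrative assumptions rather than the exact values used in our pipeline.

```python
import os
import cv2

def extract_sharp_frames(video_path, out_dir, step=15, blur_threshold=100.0):
    """Sample every `step`-th frame, keeping only acceptably sharp ones.

    Sharpness is scored by the variance of the Laplacian, a common
    proxy for motion blur; the threshold is scene-dependent.
    """
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = kept = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if cv2.Laplacian(gray, cv2.CV_64F).var() >= blur_threshold:
                cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.jpg"), frame)
                kept += 1
        idx += 1
    cap.release()
    return kept
```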

Fig. 2: Data acquisition and preprocessing workflow for garden scenes.

a Multi-camera shooting array. b Video frame extraction. c Human removal. d Aerial footage captured by drone. e Drone-assisted Gaussian Splatting for elevated imaging (roof as an example).

For dynamic actor performances, an 81-camera Z-CAM studio under global illumination was established with green screens, capturing performances at 3840 × 2160 resolution and 30 fps27. To reduce motion blur during rapid movements, such as water sleeves, the shutter speed was set to 640 microseconds. Background subtraction37 extracted clean actor dynamics for subsequent 3D reconstruction.
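The foreground-extraction step can be approximated as follows. This HSV chroma-key sketch is a simplified stand-in for the cited background-subtraction method37, and the green bounds are assumptions to be tuned per studio lighting.

```python
import cv2
import numpy as np

def green_screen_mask(frame_bgr, lo=(35, 60, 60), hi=(85, 255, 255)):
    """Return a binary foreground mask for one green-screen frame.

    The HSV bounds bracket typical studio green; morphological
    open/close passes remove specks and fill small holes before
    the mask is used for 3D reconstruction.
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, np.array(lo, np.uint8), np.array(hi, np.uint8))
    fg = cv2.bitwise_not(green)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)
    return fg
```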

Regarding audio acquisition, lavalier microphones were utilized for near-field recording to ensure the acoustical fidelity of Kunqu vocal performances and to capture subtle movement-induced acoustical details (e.g., friction sounds generated by sleeves). The microphones were positioned approximately 30 cm from the performer’s mouth (adjusted according to costume variations) to minimize environmental reverberation and obtain dry audio recordings with a high signal-to-noise ratio (SNR). This configuration aligns with measurement standards recommended in recent studies on the acoustical heritage of Kunqu Opera38, thereby ensuring the reliable capture of sound pressure level (SPL) and fundamental frequency (F0).

3D Scene Reconstruction and Volumetric Video Production via Gaussian Splatting

High-fidelity 3D reconstructions of the Humble Administrator’s Garden were achieved using 3D Gaussian Splatting (3DGS)20. After acquiring multi-view images through camera arrays and drone footage, Structure-from-Motion (SfM) was used for camera parameter estimation and sparse point cloud generation. The sparse point cloud served as input to construct a scene representation of millions of 3D Gaussian kernels, each parameterized by spatial position, anisotropic covariance, opacity, and spherical harmonic coefficients. Kernel optimization produced high-fidelity reconstructions. Visual refinement, including noise removal and redundancy reduction, was performed using SuperSplat. Following this workflow along the designed garden route, key locations—including the Orange Pavilion, Snow-Like Fragrant Prunus Mume Pavilion, Reliance on Jade Pavilion, Pavilion in Lotus Breezes, Hall of Eighteen Camellias, Hall of Thirty-Six Pairs of Mandarin Ducks, and Keep and Listen Pavilion—were reconstructed. Figure 3c shows partial 3D Gaussian point cloud results, and Fig. 3b compares Gaussian splatting and traditional mesh models for the Hall of Thirty-Six Pairs of Mandarin Ducks from three viewpoints.
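To make the representation concrete, the sketch below shows the standard per-Gaussian parameterization used by 3DGS20 and the construction of the anisotropic covariance Σ = R S Sᵀ Rᵀ from a unit quaternion and per-axis scales; tensor sizes and initial values are illustrative (in practice the Gaussians are initialized from the SfM points).

```python
import torch

# Per-Gaussian parameters: position (3), rotation quaternion (4),
# log-scale (3), opacity logit (1), and spherical-harmonic color
# coefficients (16 per channel for degree-3 SH). Random values are
# stand-ins for illustration only.
N = 100_000
positions = torch.randn(N, 3)
quaternions = torch.nn.functional.normalize(torch.randn(N, 4), dim=-1)
log_scales = 0.1 * torch.randn(N, 3)
opacity_logits = torch.zeros(N)
sh_coeffs = torch.zeros(N, 16, 3)

def covariances(quaternions, log_scales):
    """Build anisotropic covariances: Sigma = R S (R S)^T."""
    w, x, y, z = quaternions.unbind(-1)
    # Standard quaternion-to-rotation-matrix conversion (w, x, y, z).
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)
    M = R @ torch.diag_embed(log_scales.exp())
    return M @ M.transpose(1, 2)
```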

Fig. 3: Production of 3D mesh models and Gaussian Splatting reconstructions for garden environments.

a On-site data acquisition. b Comparison between Gaussian Splatting (top) and photogrammetric mesh model (bottom). c Examples of generated 3D Gaussian point clouds: indoor (left) and outdoor (right).

For volumetric videos of Kunqu performances, frame-by-frame rendering of Gaussian point clouds compromises temporal consistency and is resource-intensive for VR. To address this, we applied the optimization scheme of DualGS27, separating Gaussian representations into Joint Gaussians for global motion and skeleton dynamics and Skin Gaussians for surface appearance and fine details, thereby decoupling motion and appearance. A coarse-to-fine training strategy first aligned Joint Gaussians, then progressively refined Skin Gaussians’ spatial and visual parameters, ensuring temporal consistency and high visual fidelity. Figure 4e illustrates a comparison between Gaussian volumetric video and motion capture-based mesh animation.
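The skeleton below conveys the coarse-to-fine, two-stage structure of this optimization. It is a schematic sketch, not the DualGS implementation: the placeholder loss stands in for the actual photometric error against multi-view frames, and the Gaussian counts and learning rates are illustrative.

```python
import torch

# Two-stage optimization in the spirit of DualGS: first fit sparse
# "Joint" Gaussians capturing global motion, then refine dense "Skin"
# Gaussians for surface appearance.
joint_positions = torch.randn(2_000, 3, requires_grad=True)    # motion proxies
skin_positions = torch.randn(200_000, 3, requires_grad=True)   # appearance detail

def placeholder_loss(params):
    # Stand-in for differentiable rendering vs. captured frames.
    return (params ** 2).mean()

# Stage 1 (coarse): align Joint Gaussians across frames.
optimizer = torch.optim.Adam([joint_positions], lr=1e-3)
for _ in range(500):
    optimizer.zero_grad()
    placeholder_loss(joint_positions).backward()
    optimizer.step()

# Stage 2 (fine): with motion fixed, progressively refine Skin Gaussians.
optimizer = torch.optim.Adam([skin_positions], lr=5e-4)
for _ in range(2_000):
    optimizer.zero_grad()
    placeholder_loss(skin_positions).backward()
    optimizer.step()
```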

Fig. 4: Production of dynamic mesh animation and Gaussian Splatting volumetric video.

a On-site data acquisition. b Motion capture session. c Studio shooting environment. d Illustration of two-person interaction in volumetric video, arranged in chronological order from left to right. e Volumetric video generated by Gaussian Splatting (top) vs. mesh animation created via motion capture (bottom). f Gaussian Splatting results of all actor instances.

Integrated Digital Theater

Digital interactive applications increasingly facilitate cultural heritage dissemination39,40,41. Building on this, a prototype system of an Integrated Digital Theater was constructed using Unity 2022.3.55f1 and the open-source UnityGaussianSplatting plugin. The theater combines the tangible cultural landscape of the garden with the intangible heritage of Kunqu, providing an immersive and interactive experience to enhance heritage impact and engagement.

The system’s primary interface is a drone-view digital sandbox, as illustrated in Fig. 5, allowing audiences to overview the garden. The garden is divided into representative locations, following the narrative of A Stroll in the Garden, An Interrupted Dream. Audiences may follow the recommended tour route or freely select any scene and its associated singing segment. Upon entering a scene, the view smoothly transitions from a drone perspective to a ground-level perspective, presenting volumetric performances that correspond to the scene. Audiences can freely navigate the digital garden, immersively experiencing both the environment and the opera. After each performance, the system guides viewers to the next scene, while allowing return to the primary interface at any time, enabling highly autonomous exploration of the cultural heritage environment.

Fig. 5: Interface and interaction design of the Integrated Digital Theater.

The image demonstrates the system’s drone-view digital sandbox, which provides an overview of the entire Humble Administrator’s Garden. Audiences can freely select representative scenes and their corresponding Kunqu singing segments for immersive viewing.

Living Documentation Preservation

Regarding long-term preservation, Gaussian Splatting remains an emerging technology whose file formats are subject to rapid evolution and iteration, introducing a degree of long-term uncertainty. In accordance with the principles of scholarly transparency and the requirements articulated in Principle 4 of the London Charter42, we therefore go beyond preserving the final 3DGS point cloud assets. We additionally archive the paradata generated throughout the production process—including multi-view video footage, camera parameters, and training configuration files—as well as historical documentation and contextual knowledge related to the associated cultural heritage15. This comprehensive preservation strategy ensures the regenerability of the digital heritage, allowing future researchers to reprocess the original data using updated algorithms rather than being constrained by a single, contemporary rendering format.

With regard to metadata standards and interoperability, we adopt CIDOC-CRM (ISO 21127)43 and its extension model CRMdig44 to accurately represent the paradigm of living documentation, understood here as a digitally mediated reconstruction of cultural contexts. As illustrated in Fig. 6, CRMdig is used to establish a clear digital provenance: the Kunqu opera performance is mapped as an E5 Event, the garden environment as an E53 Place, and both are integrated through a D2 Digitization Process (scene composition and integration) into a final D1 Digital Object, namely the integrated digital theater. This ontology-based and structured approach enables explicit semantic linkages between intangible performances and tangible spatial contexts at the data level, ensuring that future researchers can transparently trace both the origins of the source materials and the logic of their reconstruction. In this sense, living documentation constitutes the core value of a sustainable digital archive.
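As an illustration of this mapping, the sketch below encodes the provenance triples with rdflib. The entity identifiers, the example namespace, and the exact CRMdig property labels (e.g., L1_digitized, L11_had_output) are assumptions chosen for readability, not our production schema.

```python
from rdflib import Graph, Namespace, RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
CRMDIG = Namespace("http://www.ics.forth.gr/isl/CRMdig/")
EX = Namespace("http://example.org/heritage/")  # illustrative project namespace

g = Graph()
g.bind("crm", CRM)
g.bind("crmdig", CRMDIG)
g.bind("ex", EX)

performance = EX["Kunqu_Youyuan_Jingmeng_Performance"]  # E5 Event
garden = EX["Humble_Administrators_Garden"]             # E53 Place
process = EX["Scene_Composition_and_Integration"]       # D2 Digitization Process
theater = EX["Integrated_Digital_Theater"]              # D1 Digital Object

g.add((performance, RDF.type, CRM["E5_Event"]))
g.add((garden, RDF.type, CRM["E53_Place"]))
g.add((process, RDF.type, CRMDIG["D2_Digitization_Process"]))
g.add((theater, RDF.type, CRMDIG["D1_Digital_Object"]))

g.add((performance, CRM["P7_took_place_at"], garden))   # event anchored to place
g.add((process, CRMDIG["L1_digitized"], performance))   # input of digitization
g.add((process, CRMDIG["L11_had_output"], theater))     # resulting digital object

print(g.serialize(format="turtle"))
```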

Fig. 6: Semantic metadata structure for digital heritage preservation.

CRMdig representation of the Integrated Digital Theater.

Participants

This study was approved by the Ethics Committee of Shanghai University (Approval No. ECSHU 2024-007). All participants provided informed consent prior to participation. In addition, for all identifiable individuals appearing in the figures, written informed consent was obtained for the publication of their images and digital representations.

This study was conducted in two experimental phases. The first experiment recruited 39 participants (21 male, 18 female). The age distribution included 23 participants aged 19-35 and 16 aged over 35. Regarding VR/AR experience, 15 participants reported no experience, 22 reported limited experience, and 2 had extensive experience. In terms of cultural familiarity, 3 participants had no knowledge of Kunqu Opera, 28 had awareness only, and 8 reported moderate familiarity. Regarding Suzhou Classical Gardens (exemplified by the Humble Administrator’s Garden), 2 participants reported no knowledge, 15 had awareness only, 20 reported moderate familiarity, and 2 possessed high familiarity.

The second experiment involved 18 participants (8 male, 10 female), with 12 aged 19-35 and 6 aged over 35. Of these, 6 participants had no experience with VR, while the remaining 12 reported limited experience. Regarding Kunqu Opera, 1 participant reported possessing no prior knowledge of the subject, 13 indicated having an awareness only, and 4 self-rated as having a moderate level of familiarity. Regarding Suzhou Classical Gardens, 2 participants reported no knowledge, 6 had awareness only, and 10 reported moderate familiarity.

To mitigate potential simulator sickness, participants were briefed on possible discomfort prior to the experiment and screened for any history of motion sickness. They were explicitly informed of their right to pause or withdraw at any stage. Rest intervals of 30-60 seconds were enforced between conditions to alleviate visual and vestibular fatigue. No participants withdrew due to discomfort. A post-experiment inspection of the questionnaire data revealed no anomalies; thus, no data were excluded.

Experimental Design

To examine how different digital presentation modes influence audience perceptions of visual quality, immersion, emotional response, and engagement with intangible cultural heritage (ICH) in virtual reality (VR), we conducted a first experiment comparing three presentation modes: Flat Video (C1), Photogrammetric Mesh with Mocap (C2), and Gaussian Splatting (C3). A standalone Kunqu excerpt from A Stroll in the Garden, An Interrupted Dream, in which Du Liniang and her maid Chunxiang visit the rear garden and experience emotional resonance with the scenery, served as the content presented within the Integrated Digital Theater system. The background was set outside the Hall of Thirty-Six Pairs of Mandarin Ducks in the Humble Administrator’s Garden. Pico 4 and Quest 3 headsets were used for VR presentation.

As illustrated in Fig. 7b-d, after explaining the study objectives and obtaining informed consent, participants received a brief introduction to the procedure and a short text describing the history of Kunqu and the Humble Administrator’s Garden. VR devices were adjusted to ensure clear visualization of the presentations. Participants then experienced the Kunqu excerpt sequentially under the three presentation modes. To counterbalance potential order effects, a Latin Square design was employed, assigning participants to one of three sequence groups (C1-C2-C3, C2-C3-C1, or C3-C1-C2). For participants unfamiliar with Kunqu, real-time subtitles were provided to facilitate comprehension. To address depth sorting artifacts—specifically where the plugin occasionally miscalculated the depth relationship between the actor and the scene—we manually partitioned the environment into foreground and background layers. This depth stratification ensured that the actor was correctly composited within the 3D space, thereby resolving occlusion issues and enhancing the viewing experience. At the conclusion of the experiment, all participants completed questionnaires and participated in brief semi-structured interviews.

Fig. 7: Experimental design and interactive features for the user study.

a Experimental grouping diagram. b Background introduction of Kunqu opera and the Humble Administrator’s Garden. c Real-time subtitles. d Correct occlusion relationships established through manual spatial partitioning (Foreground/Background stratification). e On-site experiment setup.

Building on the first experiment, a second experiment was conducted to examine differences between the Integrated Digital Theater and the Kunqu Digital Theater. A new condition (C4) presented only the Kunqu performance, excluding the Humble Administrator’s Garden background, while retaining the same excerpt performed by the actor. This allowed for paired comparisons with C3 (Integrated Digital Theater including the garden). The experimental procedure remained consistent with the first experiment, with participants completing the questionnaires and semi-structured interviews immediately after the viewing session. To counterbalance presentation order effects, a similar grouping strategy was employed, assigning participants to one of two sequence groups (C3–C4 and C4–C3).

To ensure comparability across conditions, the Kunqu Opera content was standardized for all presentation modes (using the identical standalone Kunqu excerpt from A Stroll in the Garden, An Interrupted Dream), resulting in a duration of approximately 5 minutes and 30 seconds per condition. Accounting for the initial introductory briefing and the rest intervals between conditions, the total duration of the experimental procedure was approximately 20 minutes for the first experiment and 14 minutes for the second.

As shown in Table 1, the questionnaire employed in this study was adapted from prior literature on cultural heritage preservation in virtual reality19,45, whose structure has been validated in the original research. To ensure applicability in the present context, reliability and validity tests were conducted on the collected data. For the first experiment, Cronbach’s alpha was 0.973, with values for each dimension ranging from 0.826 to 0.936. For the second experiment, Cronbach’s alpha was 0.955, with values ranging from 0.680 to 0.965, indicating acceptable to excellent internal consistency (most values exceeded the commonly accepted threshold of 0.70). In addition, the KMO values were 0.933 and 0.778, respectively, and Bartlett’s tests of sphericity were significant (p < 0.001), further confirming that the data were suitable for subsequent analyses. The complete list of questionnaire items is provided in Supplementary Information.
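These reliability and validity statistics can be reproduced with standard Python tooling, as sketched below; the input file name and item layout are illustrative, and the original analysis may have been run in other statistical software.

```python
import pandas as pd
import pingouin as pg
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity, calculate_kmo)

# One row per participant, one column per Likert item
# (file name and column layout are illustrative).
responses = pd.read_csv("experiment1_items.csv")

alpha, ci = pg.cronbach_alpha(data=responses)        # internal consistency
chi2, p = calculate_bartlett_sphericity(responses)   # Bartlett's sphericity
_, kmo_model = calculate_kmo(responses)              # sampling adequacy

print(f"Cronbach's alpha = {alpha:.3f}, KMO = {kmo_model:.3f}, "
      f"Bartlett p = {p:.2e}")
```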

Table 1 Questionnaire Dimensions and Sources

Results

Quantitative Analysis of Reconstruction Quality

We employed Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) to evaluate 3D reconstruction quality. These three metrics constitute the established standard in 3D reconstruction and neural rendering: PSNR measures pixel-level signal fidelity, SSIM evaluates structural similarity, and LPIPS assesses perceptual similarity based on deep features. It is important to emphasize that these metrics quantify the error between the rendered output and the corresponding ground-truth (GT) video. The photogrammetry-and-MoCap baseline, by relying on skeleton-driven animation and pre-constructed models, produces only a stylized approximation of real-world scenes, creating a fundamental domain gap between its imagery and authentic video footage. In this context, computing pixel-wise error metrics such as PSNR or SSIM against real footage would be mathematically ill-posed and would fail to objectively reflect the method’s true value in terms of user visual perception. Consequently, we report quantitative fidelity data exclusively for the Composite Gaussian Splatting method in Table 2.

Table 2 Quantitative reconstruction quality and temporal consistency comparison

To evaluate temporal stability, we adopted the T-LPIPS metric. As presented in Table 2, the “Photogrammetric Mesh with Mocap” method achieved the lowest T-LPIPS score (0.0065). This result aligns with expectations, as the fixed topology of geometric meshes possesses inherent temporal coherence, thereby yielding the best stability performance. Conversely, since real-world footage and the chroma-keying process inevitably introduce sensor noise and edge flickering, the “Flat Video” baseline exhibited a slightly higher score compared to the mesh-based approach. Notably, our “Composite Gaussian Splatting” method achieved a score (0.0167) highly comparable to that of the “Flat Video” baseline (0.0147). This indicates that, despite 3DGS being a discrete point-based representation, our approach effectively suppresses temporal flickering, maintaining a level of visual fluidity commensurate with natural video footage.
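For reference, the metrics in Table 2 can be computed as follows. This sketch assumes frames supplied as uint8 RGB arrays and the common definition of T-LPIPS as the mean LPIPS distance between consecutive frames; definitions in the literature vary slightly.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

loss_fn = lpips.LPIPS(net="alex")  # perceptual metric network

def to_tensor(img_uint8):
    """HWC uint8 in [0, 255] -> NCHW float in [-1, 1], as LPIPS expects."""
    t = torch.from_numpy(img_uint8).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def frame_metrics(rendered, gt):
    """PSNR/SSIM/LPIPS between one rendered frame and its ground truth."""
    psnr = peak_signal_noise_ratio(gt, rendered, data_range=255)
    ssim = structural_similarity(gt, rendered, channel_axis=-1, data_range=255)
    lp = loss_fn(to_tensor(rendered), to_tensor(gt)).item()
    return psnr, ssim, lp

def t_lpips(frames):
    """Temporal LPIPS: mean perceptual distance between consecutive frames."""
    ds = [loss_fn(to_tensor(a), to_tensor(b)).item()
          for a, b in zip(frames[:-1], frames[1:])]
    return sum(ds) / len(ds)
```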

Quantitative Runtime Performance Analysis

To evaluate the practical operational efficiency of the system, we conducted an objective quantitative analysis of the performance overhead for the three rendering schemes within the Unity environment. All data are summarized in Table 3, which also specifies the hardware configuration used for testing.

Table 3 Quantitative runtime performance evaluation of the three methods (Platform: Intel Core Ultra 9, NVIDIA RTX 5070 Ti, 16GB RAM; VR Resolution: 3648 × 1968; Plugin: UnityGaussianSplatting)

As presented in Table 3, the “Flat Video” and “Photogrammetric Mesh with Mocap” methods remain extremely lightweight in terms of system memory consumption (65 MB and 171 MB, respectively), imposing a negligible burden on the operating system. Regarding GPU performance, both methods exhibited similar resource utilization profiles (VRAM usage ≈ 3.6–3.7 GB). This suggests that the resource load for these methods primarily stems from the fixed baseline overhead of the VR runtime environment and the high-resolution frame buffer (3648 × 1968), rather than the geometric complexity of the content itself. Furthermore, it is worth noting that this efficiency is also attributed to the Unity engine’s native support and mature optimization pipeline for these traditional digital assets. Consequently, both methods effortlessly maintained the target frame rate of 72 FPS.

In contrast, our “Composite Gaussian Splatting” method indeed demonstrates higher resource demands, with system memory usage rising to 3.1 GB and VRAM usage reaching 4.4 GB. However, this memory footprint remains well within the limits of mainstream consumer hardware configurations (16 GB/32 GB RAM). Furthermore, compared to the baseline methods, the marginal increment in VRAM usage introduced by 3DGS is only 0.8 GB. This indicates that the proposed method achieves superior visual quality at a performance cost that is acceptable for consumer-grade laptops.

Regarding frame rate, the observed decrease in average FPS is attributable to the Asynchronous SpaceWarp (ASW) mechanism inherent in the VR runtime46. When computational loads are high, ASW forces the application to run at half-rate while synthesizing intermediate frames via extrapolation to maintain motion smoothness. Consequently, despite the numerical reduction in FPS, the perceptual experience remained fluid during the actual operation.

Comparison of Digital Presentation Modes

As shown in Fig. 8, the mean ratings and standard deviations across the three digital presentation modes—Flat Video (C1), Photogrammetric Mesh with MoCap (C2), and Gaussian Splatting (C3)—indicate that C3 consistently achieved the highest scores on all dimensions, followed by C2 and then C1. The Friedman test revealed significant within-group differences across all dimensions (p < .001). Pairwise comparisons further demonstrated that C3 outperformed both C2 and C1 with highly significant differences across all dimensions (p < .001). When comparing C2 and C1, C2 scored significantly higher in “Interest” (p < .001), moderately higher in “Immersion” (p < .01), and slightly higher in “Understanding” (p < .05).
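The statistical procedure can be sketched as follows. The ratings array is a random placeholder, and the Bonferroni adjustment shown is one common choice for the post-hoc pairwise comparisons; the original analysis may have used a different correction.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# ratings: participants x conditions (C1, C2, C3) for one questionnaire
# dimension; values here are illustrative placeholders.
ratings = np.random.randint(1, 8, size=(39, 3))

stat, p = friedmanchisquare(ratings[:, 0], ratings[:, 1], ratings[:, 2])
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Post-hoc pairwise Wilcoxon signed-rank tests with Bonferroni correction.
pairs = [(0, 1), (0, 2), (1, 2)]
for i, j in pairs:
    _, p_pair = wilcoxon(ratings[:, i], ratings[:, j])
    p_corrected = min(p_pair * len(pairs), 1.0)
    print(f"C{i + 1} vs C{j + 1}: corrected p = {p_corrected:.4f}")
```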

Fig. 8: Results of comparison of digital presentation modes.

Bar charts compare mean ratings for seven dimensions: Interest, Visual Experience, Immersion, Understanding, Flow Experience, Empathic Imagination, and Motivation. One, two, and three asterisks indicate significant differences at the p < 0.05, p < 0.01, and p < 0.001 levels, respectively, based on the Friedman test.

In the post-questionnaire semi-structured interviews, participants were asked about their satisfaction with the different digital presentation modes and the reasons for preferring or disliking a particular mode. Regarding the Kunqu performers, participants noted that volumetric video using Gaussian splatting made the actors appear “more realistic” and “three-dimensional,” whereas the traditional motion capture with mesh modeling approach evoked a “strong sense of incongruity” and felt “puppet-like.” In terms of motion accuracy, actors in C3 moved smoothly, whereas C2 exhibited issues such as “clothing gaps or tearing during movement.” One participant commented that, compared with C1—which felt “like watching television”—C3 “offered multiple perspectives, allowing close-up views of the performance details,” which they found highly engaging. Other participants noted issues with eye contact in C2, describing the character models’ gazes as vacant.

Regarding the Humble Administrator’s Garden scene, participants consistently highlighted C3’s advantages of “stronger immersion” and “greater naturalness”. One participant specifically emphasized the reconstruction quality of trees and the water surface, noting that, unlike C2—where “tree canopies had noticeable gaps and deficiencies”—C3’s “leaves were more complete and realistic, the lotus leaves filled the pond, the newly emerging lotus buds looked natural, and the lake even reflected the Hall of Thirty-Six Pairs of Mandarin Ducks”. In terms of integration between the performance and the scene, C3 was perceived as “more lifelike” and “best combining the two elements,” while some participants felt C2 presented “a mismatch between characters and environment” and caused “a sense of floating”. Ultimately, the vast majority of participants preferred C3, with only two favoring C1 for reasons of “greater clarity” and “minimal dizziness or discomfort”.

Comparison of Narrative Modes

As shown in Fig. 9, the mean ratings and standard deviations across the two VR narrative modes indicate that the Integrated Digital Theater (C3) consistently outperformed the Kunqu Digital Theater (C4) across all dimensions, highlighting the overall experiential advantage of C3. The Wilcoxon test further confirmed that C3 achieved significantly higher scores in Interest (p < .001), Visual Experience (p < .05), Immersion (p < .001), Understanding (p < .01), Flow Experience (p < .001), Empathic Imagination (p < .001), and Motivation (p < .05). Collectively, these results suggest that the Integrated Digital Theater enhances comprehension and learning, while simultaneously fostering esthetic appreciation, emotional engagement, and creative involvement, contributing to a more immersive and motivating participant experience.

Fig. 9: Results of comparison of narrative modes.

Bar charts compare mean ratings for seven dimensions: Interest, Visual Experience, Immersion, Understanding, Flow Experience, Empathic Imagination, and Motivation. One, two, and three asterisks indicate significant differences at the p < 0.05, p < 0.01, and p < 0.001 levels, respectively, based on the Wilcoxon signed-rank test.

Subsequent semi-structured interviews asked participants about their impressions of the two modes and invited them to choose their preferred version. Fourteen out of eighteen participants selected C3. They generally agreed that C3 offered a texture surpassing that of film, creating the sensation of “stepping inside the screen” and experiencing the beauty of the Humble Administrator’s Garden alongside the protagonist. One participant who had previously visited the garden remarked that the crowded conditions of in-person visits often made it difficult to fully appreciate the scenery, sometimes even causing frustration. In contrast, the Integrated Digital Theater allowed them to immerse themselves in the beauty of the garden in a more focused and comfortable way. The incorporation of the garden environment not only enriched the visual experience but also complemented the Kunqu narrative, enabling audiences to better appreciate the interplay of character emotions and environmental atmosphere that shapes cultural meaning. A Kunqu enthusiast noted that the single-actor mode required imagination to fill in the absent scenery; while the familiar excerpt had been heard countless times, only in the garden setting (C3)—watching Du Liniang dance amid spring blossoms—were those long-imagined scenes finally materialized before their eyes. As another participant reflected, Du Liniang’s line, “Without entering the garden, how could one know the abundance of spring?” seemed not only to question herself within the play but also to address the audience—“and in this very moment, I myself have stepped into the play, into the spring-filled garden of the Humble Administrator’s Garden”—realizing the immersive experience of “the person in the play, the play in the scene”.

For the few participants who preferred the single-actor mode, their reasons centered on the concern that the garden setting might distract attention. Some felt that the esthetic appeal of Kunqu partly derives from the stage’s spatial abstraction, while others thought the scenery was overly striking, diminishing focus on the singing and performance itself.

Discussion

This study first validated the potential of Gaussian Splatting, particularly volumetric videos generated by Gaussian Splatting, for the digitization of cultural heritage. Compared with conventional baseline methods, 3DGS is capable of generating photorealistic garden scenes more efficiently and at an acceptable performance cost. Building on this, the volumetric videos generated by this technology can better capture the dynamic costumes and facial expressions of Kunqu performers, providing a solid foundation for immersive presentations. Furthermore, this study proposes integrating tangible cultural heritage (the Humble Administrator’s Garden) and intangible cultural heritage (Kunqu performances) within a unified Integrated Digital Theater. Compared with the standalone narrative mode, this fusion model is not merely a visual superposition; rather, it fundamentally reconstructs the spirit of place within the digital space. That is, the holistic value of heritage does not reside solely in isolated material artifacts, but rather emerges from the symbiotic coupling of continuous interactions between tangible spaces and intangible practices. This study demonstrates that high-fidelity digital integration serves as the key pathway to realizing this coupling and sustaining the spirit of place.

In comparative experiments on presentation modes, the Gaussian Splatting display significantly outperformed Flat Video and Photogrammetric Mesh with Mocap in visual realism, immersion, and esthetic appeal. Both traditional mesh and GS approaches offer autonomous spatiality, enabling audiences to observe performances and environments from multiple angles and to “rewrite” their perspective through movement, transforming “viewed planar memories” into “experienced spatial memories”47,48. Within the context of cultural heritage digitization, such spatiality synchronizes the performance site with the audience’s presence, allowing the relationships between performers and environment, as well as the paths and rhythms of the performance, to be continuously reconstructed through audience movement. This experience not only stimulates exploration and enhances esthetic engagement but also directs attention to intangible heritage and the tangible heritage upon which it depends.

Although Photogrammetric Mesh with Mocap approaches can address spatiality, their visual presentation often lacks realism. In cultural heritage digitization, realism is crucial not only for visual immersion but also for maintaining the continuity of cultural subjects and their environment3. Gaussian Splatting highlights realism in performer representation, costume, movement, and environmental context. Without accurate portrayal of performers, the identity of the transmitted heritage is unclear; without environmental reconstruction, the site of transmission is lost. Traditional mesh approaches demand high geometric precision and rendering quality, limiting their ability to fully present the dynamic unity of “person” and “place” under constrained resources. In contrast, Gaussian Splatting (GS), with its high fidelity in dynamic performance and complex scene replication, preserves realism in both performers and environments, where the coexistence of “people” and “place” enables contextual realism for cultural transmission49. This triple-layered realism avoids the incongruity caused by virtual avatars in traditional meshes and supports the authenticity, identity, and regionality of intangible heritage, allowing digital Kunqu transmission to move beyond static representations of “person” and “place” toward a dynamically unified reproduction of the “setting”.

In the comparative experiment examining narrative modes, the Integrated Digital Theater (C3) outperformed the Kunqu Digital Theater (C4) across all metrics. Interviews indicated it improved understanding and learning while boosting engagement and esthetic, emotional, and creative responses. This approach emphasizes visual, cultural, and symbolic wholeness. Tangible heritage anchors intangible performances, creating visual wholeness that supports immersion and reinforces understanding and imagination. Single intangible heritage presentations risk reducing practices to mere “skill demonstrations”, whereas integrated heritage presentations restore cultural wholeness, embedding them in historical and lived contexts and preserving a coherent cultural ecosystem. Pure skill transmission may fade over time, but embedding it within a cultural context ensures continuity. Furthermore, from the perspective of social dissemination, intangible heritage—due to its fluidity and abstract nature—often lacks distinct visual markers. This makes it difficult for audiences lacking relevant cultural background to discern its significance when viewing the performance in isolation, whereas tangible heritage inherently possesses high visual recognizability. As noted by Bouchenaki50, for intangible heritage to be effectively perceived and preserved, it often needs to be embodied in tangible visible signs. Their integration imparts distinctive markers to intangible heritage, reinforcing symbolic wholeness, cross-cultural relevance, and social impact, and enabling digital dissemination from local to global audiences.

Overall, this study meets the fundamental requirements for integrating tangible and intangible heritage digitization, providing a new viable paradigm for the digital representation of combined cultural heritage. At the technical level, Gaussian Splatting overcomes the limitations of flat video and traditional modeling, enabling synchronous reproduction of performers and environments while enhancing spatiality and realism. At the narrative level, Integrated Digital Theater scenes reinforce the wholeness of tangible and intangible heritage, allowing audiences to engage not only with skills but also with the cultural setting and historical atmosphere. Within digital space, these scenes reconstruct the spirit of place, addressing the disconnection between tangible and intangible heritage associated with tourism overcrowding at popular material heritage sites, and enabling audiences to experience heritage value within a complete contextual setting. Results indicate that the Integrated Digital Theater significantly enhances cognitive, emotional, and esthetic experiences, promoting dual interest in intangible and tangible heritage. By maintaining the unity of “person, place, and setting,” this approach safeguards intangible heritage in its cultural and transmission context, offering a novel framework for digital dissemination and cross-cultural education.

Although this study demonstrates the effectiveness of the proposed workflow in the specific context of Suzhou Classical Gardens and Kunqu Opera, extending the living documentation paradigm to a wider range of intangible–tangible heritage combinations and more complex environments will require further improvements in technical adaptability and scene generalization. First, despite cost optimizations in the digital acquisition pipeline for static scenes, the current approach to dynamic volumetric video generation still presents substantial barriers in terms of hardware cost, deployment complexity, and computational requirements, thereby limiting its adoption. At present, volumetric capture relies on an array of 81 professional cameras with hardware-level synchronization, which are costly, cumbersome to transport, and necessitate a dedicated green-screen studio with professional lighting to ensure reliable reconstruction quality. Furthermore, the generation of high-quality 3DGS models demands substantial GPU memory (typically exceeding 16 GB) and significant storage capacity, posing additional computational challenges. To mitigate these constraints and enhance the accessibility of the living documentation paradigm, we are developing cost-effective alternatives. These include portable multi-camera systems based on high-frame-rate consumer action cameras utilizing algorithmic, image-level post-hoc synchronization, as well as robust background segmentation techniques based on optical flow and deep learning to eliminate the reliance on professional green screens and controlled lighting. Collectively, these efforts aim to establish a lightweight and mobile acquisition workflow that empowers lower-resourced heritage institutions to achieve high-quality, integrated documentation even under limited budgets and non-specialized site conditions.

Second, while the selected case study is representative, its cultural context is relatively stable and does not encompass larger-scale, highly dynamic, or environmentally complex heritage scenarios. To validate the generalizability of the proposed approach, future work will employ low-altitude drone-based orbital capture strategies for the reconstruction of large-scale outdoor heritage sites, and utilize our portable multi-camera system for the acquisition of collective dynamic performances. This phase will prioritize evaluating system adaptability within complex, large-scale spatial environments and under outdoor, uncontrolled lighting conditions.

Finally, cultural heritage experiences extend beyond the visual modality alone. For art forms such as Kunqu opera, in which vocal technique and melodic expression are fundamental, acoustical heritage constitutes an essential dimension38,51. While the present work achieves spatial alignment between sound sources and performer positions, the physical reproduction of the characteristic spatial reflections and reverberation of garden architecture has not yet been realized, owing to the persistently high density of visitors within the Humble Administrator’s Garden. Reconstructing an authentic soundscape in virtual environments, thereby enabling a deeper integration of visual and auditory experiences, remains a key direction for future research toward further refining the integrated digital theater.