Multimodal interaction enhancement of digital cultural heritage system: user behavior analysis and interface reconstruction of the heritage scanning library of the palace museum

Ke, Linghui; Qin, Huimin; Long, Jiaao; Xiao, Pengyu

doi:10.1038/s41598-026-44955-x


Article
Open access
Published: 30 March 2026


Scientific Reports volume 16, Article number: 10654 (2026)


Abstract

The Palace Museum’s digital cultural relics library, a key component of China’s cultural digitization initiative, offers high-precision image acquisition and multi-dimensional search. However, practical use reveals significant issues, including a unimodal interaction design, a fragmented information structure, and an insufficiently layered user experience. Following the system logic of “behavior-driven-perception enhancement-interface reconstruction”, this study combines eye tracking and behavior analysis to investigate the visual attention, information acquisition efficiency and user experience of three typical user groups in the digital cultural relics library. The results reveal significant differences among user groups in their understanding of cultural content and their interface use strategies, which in turn point to a gap in cultural expression and a defective navigation mechanism in the current system. Based on these findings, this paper proposes a multimodal interface optimization scheme centered on “audio-visual interaction”. The scheme covers a visual guide system, hierarchical voice explanation, semantic structure reconstruction and a user-level adaptation mechanism, and its logical paths are integrated visually in a system flowchart. The optimization goal is not only to improve interface friendliness and cultural communication, but also to foster immersive perception and facilitate the narrative transmission of cultural information. Finally, the study builds a closed theoretical loop of “empathic psychology-media materiality-multimodal interaction”, which offers a new direction for moving digital cultural heritage platforms from “static presentation” to “dynamic transmission”.

Introduction

As a treasure of Chinese civilization, the Palace Museum carries thousands of years of history and culture and is one of the most important cultural heritage sites in China. With the rapid development of digital technology, the digital conservation of the Forbidden City has moved from mere physical retention to a new stage of value transfer¹. The wide application of high-precision scanning technology has enabled the Forbidden City to build a digital artifact library containing rich architectural components and cultural artifacts, providing strong support for the long-term preservation and wide dissemination of cultural heritage². However, the digital heritage library of the Palace Museum faces several problems in practical application. First, there is an imbalance between technical effectiveness and cognitive efficiency³. Although the scanning accuracy of the cultural relics is very fine, interaction between the user and the system remains limited to a single mode, mainly one-way browsing, and lacks an immersive experience. The average user spends a relatively short time on each scanned artifact page, resulting in a low rate of information absorption. Consequently, technological advancements have not yet translated into a measurable enhancement of users’ cognitive understanding of cultural heritage. Second, cultural symbols are distorted in the process of digital translation⁴. The rich cultural semantics contained in the manufacturing techniques of the Palace Museum’s cultural relics are often obscured by technical data, so that users end up “seeing things and not knowing the text” when browsing and cannot deeply understand the cultural connotations⁵. Finally, the system lacks an effective user stratification mechanism. Different types of users, such as professional researchers and mass tourists, have significantly different needs regarding the culture of the Forbidden City⁶. Professional researchers need in-depth disassembly and analysis of component parameters, while mass tourists prefer storytelling narratives. The existing homogenized interface cannot meet these diverse needs, resulting in a poor user experience. These contradictions highlight the urgency of shifting cultural heritage digitization from “technology-centric” to “user-centric”. How to use advanced technical means to enhance users’ interactive experience with digital cultural heritage has become an urgent problem⁷.

At present, digital exhibitions of museum cultural relics are widespread. Digital technology supports the inheritance and innovation of museum cultural relics through three-dimensional scanning and modeling, immersive interaction, intelligent monitoring and repair, and gamification design. For instance, the ‘Jinling Diagrams Digital Art Exhibition’ in Nanjing uses UWB positioning and multisensory feedback to create interactive scenes like ‘Characters in the Picture.’ This scene incorporates a task chain based on Song Dynasty occupations, achieving gamified cultural dissemination. Furthermore, the exhibition integrates holographic printing to embed cultural relics into creative products⁸. Immersive exhibitions such as “Touching the Dream of Sanxingdui” reconstruct the ancient Shu cultural scene with light projection and interactive experience, combine intangible cultural heritage elements to weaken the audience’s attachment to the original cultural relics, and promote the living transmission of culture⁹. However, the current field of cultural heritage digitization has some shortcomings. On the one hand, the application of empathic psychological mechanisms in dynamic graphic interaction is still insufficient. Most studies have focused mainly on visual presentation and neglected the potential of multisensory synergy for cultural cognition². In fact, human perception is a multisensory process, and integrating visual, auditory, tactile and other sensory information allows cultural connotations to be understood and experienced more comprehensively and deeply. On the other hand, media materiality is fragmented in the process of digital communication. Digital models often strip away the physical characteristics of traditional carriers such as paper and wood, leading to the “decontextualization” of cultural experience¹⁰. The physical carriers of cultural heritage themselves carry rich cultural information, and ignoring these physical characteristics can weaken the effect of cultural communication.

From a technical point of view, current digital repositories of cultural heritage generally offer only a single mode of interaction. Most existing cultural heritage digital libraries support only basic rotation and zoom functions, which fall far short of users’ growing exploratory needs and limit their in-depth understanding and experience of cultural heritage. The broken narrative chain is another prominent issue. The scanning data of the Taihe Temple of the Forbidden City, for example, are not effectively associated with related knowledge networks such as “construction craftsmanship-social etiquette”, which fragments cultural information and makes it difficult for users to form a systematic and comprehensive understanding¹¹. In the practical application of digital presentation of museum artifacts, the lack of user stratification is a major challenge for the Forbidden City’s existing system. Because the cognitive differences between user groups such as scholars and tourists are not fully considered, mass users often encounter comprehension barriers when faced with specialized terminology. This not only affects the user experience but also hinders the wide dissemination of cultural heritage¹². As highlighted by Hornecker and Stifter¹³, museum installations must move beyond visual spectacle to support cognitive curiosity through structured interaction.

To address the above problems, this paper proposes a three-level framework of “behavior-driven–perception-enhanced–interface reconstruction”, aiming to improve both the user experience and the cultural dissemination effectiveness of the Palace Museum’s digital heritage library. Although research on digital heritage platforms and immersive interaction techniques has grown rapidly, most studies focus either on technical affordances or on general user experience outcomes, with limited empirical attention to how different user groups cognitively process interface information in situ. In particular, comparative evidence on attentional distribution and perceptual strategies among expert users, enthusiasts, and general tourists remains scarce. Moreover, while multimodal interaction is frequently advocated as a design direction, the perceptual mechanisms through which interface structures influence user understanding are often assumed rather than empirically examined. To address this gap, this study investigates user cognitive behavior at a finer granularity through eye-tracking data, with the aim of informing interface-level design.

Taking the Palace Museum Digital Cultural Relics Library as the research object, this study analyzes interaction optimization and information reconstruction from two perspectives: user behavior analysis and multimodal interaction reconstruction. Through the analysis of user operation logs and eye-tracking experiments, the cognitive paths and decision-making processes of three representative user groups—professional scholars (primarily in history and cultural studies), history enthusiasts, and general tourists—are examined during their interaction with the platform. This study is positioned as an empirical investigation of user cognitive and attentional behaviors in digital heritage interfaces rather than the development or validation of a complete multimodal system. Its primary contribution lies in identifying perceptual and cognitive differences among user groups through eye-tracking and behavioral analysis. The multimodal interface strategies discussed later are therefore presented as design implications derived from empirical findings, rather than implemented or empirically validated systems.

Specifically, this study aims to answer the following research questions:

RQ1

Are there significant differences in the distribution of visual attention (gaze duration) among the three user groups when interacting with different information areas (e.g., 3D display, parameters, cultural interpretation, recommendations)?

RQ2

Do the three user groups differ significantly in their information acquisition efficiency and task completion ability?

RQ3

Are there distinct, identifiable cognitive path patterns among the three user groups during complex functional operations?

The findings provide a data-driven basis for the multimodal interface optimization scheme.

Related work

Digitization process

Early cultural heritage digitization work mainly focused on high-precision archiving, such as the three-dimensional reconstruction of Dunhuang murals and other projects, aiming to achieve the permanent preservation of cultural heritage through digital means. However, work at this stage often had a tendency to “emphasize technology over communication”, focusing too much on data collection and storage and neglecting the communication and transmission of cultural heritage³⁰. In recent years, with the deepening understanding of the value of cultural heritage, research has gradually shifted to “living heritage”. This concept emphasizes the centrality of user participation in the transmission of cultural heritage, arguing that through effective interactive narratives, users’ cultural memories can be activated to sustain the value of cultural heritage in modern society¹¹. Meanwhile, the continuation of media materiality has also received more attention, i.e., digital models should restore the physical properties of the original carriers of cultural heritage as much as possible in order to maintain the integrity of the cultural experience¹⁰.

The Palace Museum Digital Heritage Library is a heritage digitization project officially launched on July 16, 2019. It covers 26 categories of cultural relics, such as ceramics, paintings, and calligraphy, and displays more than 50,000 cultural relics in detail through high-definition images. The library supports a multi-dimensional search system: in addition to traditional keyword search, it adds dynasty filtering, a classification index and decorative-feature queries, allowing users to locate target artifacts by color, shape and other artistic features. It also renders artifact details in high definition, providing 50-megapixel images with a local zoom function; for example, “On the River During the Qingming Festival” can be displayed with details of brushstrokes that are difficult to distinguish with the naked eye¹. The Palace Museum is now upgrading the tour experience and has begun to pilot VR guided tours and other new technologies to enhance the user experience³³. However, the overall process of browsing and displaying cultural relics still suffers from narrative fragmentation, making it difficult for users to develop a comprehensive cultural understanding through the system³.

Theoretical framework

This study focuses on the heritage and innovation of museum cultural relics driven by digital technology, and its theoretical foundation is rooted in the deep integration of the theory of empathic psychology and the theory of media materiality, which together construct the logic of transmission of “technology-perception-culture”, providing core support for the communication of cultural heritage empowered by digital technology.

Empathy psychology

It should be clarified that this study does not seek to address empathy as a comprehensive psychological construct. Instead, empathy is operationalized here in a limited and interface-relevant sense, focusing primarily on cognitive empathy and affective engagement¹⁴. Other dimensions of empathy, such as interpersonal or clinical empathy, fall outside the scope of this research. This operational definition is adopted to ensure analytical clarity and alignment with the study’s focus on perceptual and cognitive interaction with digital heritage interfaces¹⁵.

The core principle of empathy psychology is to transcend the cognitive limitations of any single sense through multi-sensory synergy, thereby strengthening users’ immersive perception and emotional resonance with cultural heritage¹⁶. In cultural heritage digitalization scenarios, the practical value of this theory is reflected in “sensory symbolization”: transforming the historical context and material characteristics of cultural heritage into perceptible visual, auditory and tactile signals, guiding users to make the leap from “cognition” to “experience”, and ultimately achieving a deep internalization of cultural connotations¹⁷.

For example, the UWB-based positioning technology and multi-sensory feedback system⁸ in the “Jinling Diagrams Digital Art Exhibition” in Nanjing is a typical application of empathy psychology. When the audience wears the positioning bracelet to go “into the painting” and approaches the virtual canal dock, the 4D projection restores the sound of the boat trackers’ work chant, which changes in strength with distance, and the swaying of the wooden boat simulated by the tactile floor is linked with the visually presented Song Dynasty marketplace scene. The three sensory signals jointly build a “sense of presence in the life of the Song Dynasty”, so that the audience’s perception of the marketplace culture and the history of canal transport embedded in the Jinling Map is transformed from abstract text into a concrete experience. In addition, in the VR archaeological experience¹⁸,²⁹ of the Tanshishan Ruins Museum, when the audience wears the VR equipment to “excavate” the virtual tomb, the equipment simulates the tactile sensation of grasping ceramic pieces through force feedback. Combined with the visually presented structure of the tomb and the aurally restored on-site ambient sound, this forms a closed “sight - hearing - touch” sensory loop, so that the specialized knowledge of prehistoric archaeology becomes perceptible and participatory through sensory linkage, effectively lowering the cognitive threshold of cultural heritage.

Media materiality

Media materiality emphasizes that the inheritance of cultural heritage relies not only on the physical retention of “material carriers” but also on the regeneration and dissemination of cultural symbols through the “material translation” of the media. In the context of digitization, this theory is embodied in two dimensions: the first is the precise restoration of the material properties of cultural relics, and the second is the digital translation of the historical logic and cultural significance carried by the material carriers. Together they realize the dual heritage of “material truth” and “cultural symbols”. The digital restoration and modeling practice of the Tanshishan Ruins Museum²⁹ illustrates this theory. Three-dimensional scanning at the 0.05 mm level captures the texture, fracture morphology and other material details of the pottery pieces. During virtual splicing, the AI algorithm not only calculates the mechanical equilibrium but also analyzes historical associations through the evolution of the pottery patterns, so that the digital model not only restores the material texture of the pottery but also presents the cultural logic of “shards - tombs - prehistoric society” through data visualization. The same is true for the 3D modeling in the Notre Dame de Paris exhibition¹⁹: the digital model accurately reproduces the material texture, carving techniques and other material features of the stone carvings, and AR technology transforms their construction age and religious symbolism into interactive holographic annotations. The materiality of architectural components and their cultural symbolism are thereby unified in the digital medium, and cultural heritage moves from “static preservation” to “dynamic dissemination”.

Multimodal interaction & technical support

The implementation of this study relies on several key technologies. Generative AI plays an important role in deconstructing the semantics of cultural symbols in the study, and is able to analyze the evolutionary history of cultural relics and architecture. Multimodal interaction technology, on the other hand, realizes the synergy of gesture control, sound and haptic feedback to enhance the user interaction experience, and its design idea is in line with the theory of empathic psychological augmentation²⁰. AR spatio-temporal superimposed layer technology can holographically project the historical scene, which helps to construct the memory ecology. Blockchain depository technology can solidify user behavior data and weight the data for analysis, which is suitable for communication effect monitoring. Together, these technologies provide technical support for the research and promote the realization of multimodal interaction enhancement and interface reconstruction.

The empowerment of digital technology in the inheritance and innovation of museum artifacts relies on the synergistic effects of multimodal interaction key technologies. It is not only a tool for theoretical implementation but also reconfigures multiple interactive relationships through its inherent characteristics, providing systematic support for the digital dissemination of cultural heritage²¹. Multimodal interaction technology integrates interaction methods such as gesture control, voice control, and haptic feedback to construct “natural and immersive” human-machine interaction scenarios, enhancing users’ perception of culture through the synergistic enhancement of multi-sensory signals. In the construction of South Korea’s intangible cultural heritage digital archive, a method for comprehensively managing and providing intangible cultural heritage from a digital perspective was proposed²². Based on multimodal interaction, policies were established considering factors related to cultural governance and standardized management. In the design of cultural heritage digitization using ancient Egyptian theological totems, through visual development, animation processing, and interaction design, digital technology and multimodal interaction play a core role in the inheritance, innovation, and dissemination of cultural heritage²³.

In summary, empathic psychology and media materiality provide the logical guide for applying digital technology, and multimodal interaction technology is the key support for putting the theory into practice. Working together, they promote the transformation of museum cultural relics from static preservation to dynamic inheritance, immersive experience, and innovative dissemination, and ultimately maximize the transmission of cultural heritage value²⁴.

Analysis of user cognitive behavior in the Forbidden City digital heritage library based on an eye-tracking experiment

Experimental design

Experimental objectives, research questions, and hypotheses

With the rapid development of digital museums, the cognitive behavior and interaction path of users in the face of the complex structure and rich content of the digital cultural relics library show highly heterogeneous characteristics. In order to better understand the cognitive variability of different user groups in the digital cultural relics environment, this study selects the digital cultural relics library of the Palace Museum as an experimental platform, adopts eye-tracking technology and behavioral recording methods, and systematically analyzes the visual behavior and interaction paths of three types of typical users—professional scholars, history and culture enthusiasts, and general tourists—during the process of using the library.

Based on the study objectives, the following research questions (RQs) and hypotheses (H) are proposed:

RQ1

Are there significant differences in the distribution of visual attention (gaze duration) among the three user groups when interacting with different information areas (e.g., 3D display, parameters, cultural interpretation, recommendations)?

H1

History and culture enthusiasts will exhibit a significantly higher gaze duration in the Cultural Interpretation tab (AOI3) compared to professional scholars and general tourists, reflecting a culture-oriented attention focus.

RQ2

Do the three user groups differ significantly in their information acquisition efficiency and task completion ability?

H2

Professional scholars will show higher information acquisition efficiency (lower First Fixation Time, lower Completion Time, higher Information Accuracy Rate) compared to the other two groups, due to their specialized background and search experience.

RQ3

Are there distinct, identifiable cognitive path patterns among the three user groups during complex functional operations?

H3

The cognitive paths will be group-specific: professional scholars will follow a “technology-oriented” linear path, history and culture enthusiasts a “culture-associated” cyclic path, and general tourists an “interface-dependent” random path.

The study analyzes the distribution of visual attention, information acquisition efficiency and interaction path patterns of these three types of typical users in the process of using the library. The quantitative indicators reveal the behavioral characteristics of users in the perceptual and cognitive levels, aiming to provide data support and theoretical basis for the interface design and layered service strategy of the digital cultural relics library.
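The path patterns named in H3 (linear, cyclic, random) can be made more concrete with a small sketch. The following Python example is only an illustration under stated assumptions: it takes a sequence of AOI labels per participant, counts transitions between AOIs, and computes a revisit ratio as a rough indicator of cyclic behavior. The function names, the revisit ratio itself, and the sample sequences are hypothetical and are not taken from the study.

```python
from collections import Counter


def collapse(aoi_sequence):
    """Drop consecutive repeats so that dwelling in one AOI is not a transition."""
    return [a for i, a in enumerate(aoi_sequence)
            if i == 0 or a != aoi_sequence[i - 1]]


def transition_counts(aoi_sequence):
    """Count transitions between consecutive, different AOIs."""
    visits = collapse(aoi_sequence)
    return Counter(zip(visits, visits[1:]))


def revisit_ratio(aoi_sequence):
    """Share of AOI visits that return to an AOI already visited earlier."""
    visits = collapse(aoi_sequence)
    if not visits:
        return 0.0
    seen = set()
    revisits = 0
    for aoi in visits:
        if aoi in seen:
            revisits += 1
        seen.add(aoi)
    return revisits / len(visits)


# Hypothetical AOI sequences, invented for illustration only.
linear_path = ["AOI1", "AOI1", "AOI2", "AOI3", "AOI4"]
cyclic_path = ["AOI1", "AOI3", "AOI1", "AOI3", "AOI2", "AOI3"]

linear_ratio = revisit_ratio(linear_path)
cyclic_ratio = revisit_ratio(cyclic_path)
cyclic_transitions = transition_counts(cyclic_path)
```

A path that never returns to an earlier AOI yields a ratio of zero, while repeated returns push the ratio upward. Any threshold separating the three patterns would have to be chosen and justified separately.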

Experimental design and participants

The experiment adopted a mixed-methods approach, combining eye-tracking and behavior analysis with post-hoc semi-structured interviews to achieve both quantitative precision and qualitative depth.

Sample Size Rationale and Recruitment. A total of 18 participants were recruited and divided into three distinct groups (n = 6 per group): Professional Scholars, History Enthusiasts, and General Tourists. The adequacy of this sample size is supported by previous HCI-related studies²⁵. The sample size was constrained primarily by two critical factors:

1. High-Fidelity Data Collection Requirements: The study required simultaneous, resource-intensive collection of eye-tracking data, detailed behavior logs, and subjective cognitive load assessment, limiting the feasible number of sessions.
2. Rarity of the Population: The necessity of accessing and recruiting specialized “Professional Scholar” populations relevant to the Palace Museum’s research field presented a significant field constraint. While the overall number is small, the allocation (n = 6 per group) meets the minimum threshold often cited in specialized Human-Computer Interaction (HCI) and eye-tracking studies for detecting an effect of size \(\eta_p^2 \geq 0.14\) when comparing distinct user groups under controlled conditions.
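
For orientation, the partial eta squared threshold quoted above can be related to Cohen's f through the standard conversion. This conversion is not stated in the study and is added here only as a worked check:

\[
f = \sqrt{\frac{\eta_p^2}{1 - \eta_p^2}} = \sqrt{\frac{0.14}{0.86}} \approx 0.40
\]

Under Cohen's conventional benchmarks, f = 0.40 marks a large effect, so the stated threshold corresponds to detecting large between-group differences only.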

A purposive sampling strategy was employed to ensure clear demarcation and maximum typicality across the three user groups, thus maximizing internal validity despite the small sample size.
Inclusion and Exclusion Criteria:
1. Professional Scholars: Must hold a master’s degree or higher in archaeology, museology, or art history, and have published at least one peer-reviewed paper related to Chinese cultural relics or museums.
2. History Enthusiasts: Must actively visit museums (at least 3 times per year) and regularly engage with cultural history content through digital or print media.
3. General Tourists: Must have no formal academic background in history or museology and only visit museums casually (less than once per year).
Recruitment was primarily conducted through university academic networks (for scholars), targeted online history forums (for enthusiasts), and public advertisements (for tourists).

Experimental equipment and platform

To ensure efficient acquisition and processing of experimental data, the Tobii Glasses 3 portable eye tracker was used for visual data tracking. The device has a sampling rate of 100 Hz, strong head motion compensation, a field of view of 82° × 50°, and a latency of less than 50 ms, and can accurately record gaze point position, fixation duration and saccade trajectories. For behavioral recording, UXLogger software was used to synchronously capture users’ mouse operations on the web page, including clicks, page dwell time and scrolling trajectories, at a sampling frequency of 10 Hz.
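
Because the gaze stream (100 Hz) and the behavior log (10 Hz) run at different rates, the two must be aligned on a common time base before they can be analyzed together. The study does not describe how this was done; the sketch below shows one simple option, nearest-timestamp matching, assuming both streams share a clock in milliseconds. The function names and sample timestamps are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Index of the timestamp in sorted_times that is closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    # Pick the closer neighbour; ties go to the earlier sample.
    if sorted_times[i] - t < t - sorted_times[i - 1]:
        return i
    return i - 1


def align(gaze_times_ms, log_events):
    """Pair each (time_ms, label) log event with the nearest gaze sample index."""
    return [(label, nearest_index(gaze_times_ms, t)) for t, label in log_events]


# One second of 100 Hz gaze samples: one sample every 10 ms.
gaze_times_ms = list(range(0, 1000, 10))

# Hypothetical log events with slight timing jitter.
log_events = [(0, "scroll"), (104, "click"), (996, "scroll")]

aligned = align(gaze_times_ms, log_events)
```

With a 10 ms gaze interval, nearest-sample matching places each log event within 5 ms of a gaze sample, provided the two clocks are actually synchronized.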

The experimental interaction platform was the Chinese side of the official website of the Digital Heritage Library of the Palace Museum (https://en.dpm.org.cn/collections/), presented in the Chrome 112.0 browser on a 23-inch high-definition display (resolution 1920 × 1080). Advertisement blocking and auto-filling plug-ins were uniformly turned off to ensure the consistency of the operating environment and the purity of data collection (Fig. 1).

Experimental task design

Combined with the functional structure of the digital library of cultural relics and the user’s real operating path, the experiment is designed with three tasks, which are arranged according to the depth of cognition from shallow to deep:

Task 1: Free Exploration (5 min) Participants are asked to independently browse the “Qing Qianlong pastel hollowed bottle” page under the “ceramics” category. The page contains a 3D model, basic information, and a cultural labeling module. This task was designed to capture users’ natural gaze behavior and browsing preferences in an unguided state.

Task 2: Targeting (3 min) Participants are required to locate the “Black lacquer and gold double-dragon medicine cabinet” under the category of “Household Appliances” and obtain three key pieces of information about its material composition, functional use and decorative symbols. This task is used to assess the user’s navigation efficiency and information retrieval ability.

Task 3: Functional Operation (5 min) Users are required to complete the following operations on the display page of the “Black Lacquer and Gold Double-Dragon Medicine Cabinet”:

Zoom in to observe the details of the “dragon pattern”;
Go to the “Craftsmanship Analysis” tab and browse the process of “Gold Painting” and “Lacquering”;
Select a similar artifact from the “Related Recommendations” to compare materials and functions.

The task tests the users’ mastery of complex interactive operations and semantic integration ability.

Experimental process

The experiment is carried out in accordance with a standardized process, with the following specific steps:

Pre-test questionnaire: subjects’ cultural background, frequency of using digital tools and familiarity with heritage knowledge were recorded;
Eye tracker calibration: a nine-point calibration was used, with the error kept within 0.5°;
Task implementation: the three tasks of free exploration, target localization and functional operation were completed sequentially, with a 2-minute interval between tasks to reduce fatigue effects.

A single round of experiments was controlled within 40 min, and all experiments were conducted in a controlled laboratory environment to improve the effectiveness of experimental control.
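
To give a sense of what a 0.5° calibration error means on the 23-inch, 1920 × 1080 display used here, the angle can be converted to on-screen pixels. The sketch below does this under one important assumption: a viewing distance of 60 cm, which the study does not report. The function names are illustrative.

```python
import math


def pixels_per_cm(diagonal_in, res_w, res_h):
    """Horizontal pixel density of a display from its diagonal and resolution."""
    aspect = res_w / res_h
    width_in = diagonal_in * aspect / math.sqrt(aspect ** 2 + 1)
    return res_w / (width_in * 2.54)


def angle_to_pixels(angle_deg, distance_cm, px_per_cm):
    """On-screen extent in pixels subtended by a visual angle at the given distance."""
    size_cm = 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)
    return size_cm * px_per_cm


VIEWING_DISTANCE_CM = 60.0  # assumption: not reported in the study

density = pixels_per_cm(23, 1920, 1080)
error_px = angle_to_pixels(0.5, VIEWING_DISTANCE_CM, density)
```

Under this assumed distance the 0.5° bound comes to roughly 20 pixels, which is useful when judging whether an AOI boundary is wide enough to absorb calibration error.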

Data analysis methods

This study adopts a multi-level analysis that combines quantitative and qualitative methods to systematically explore the characteristics of user cognitive behavior. At the quantitative level, based on the raw eye movement data generated by Tobii Studio software, we extracted the First Fixation Time (FFT), Total Fixation Duration (TFD), Fixation Count (FC), and Saccade Distance (SD). These indicators portray users’ visual behavior from the perspectives of initial visual attraction, information processing depth, attention distribution density and visual search breadth, respectively.
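
The four indicators can be illustrated with a short computation over a list of fixations. The study does not give formal definitions, so the sketch below uses common conventions as assumptions: FFT is the onset of the first fixation relative to stimulus onset, TFD is the sum of fixation durations, FC is the number of fixations, and SD is the mean Euclidean distance between consecutive fixations. The `Fixation` record and the sample values are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Fixation:
    start_ms: float     # onset relative to stimulus onset
    duration_ms: float
    x: float            # screen coordinates in pixels
    y: float


def fixation_metrics(fixations):
    """Return FFT, TFD, FC and mean saccade distance for time-ordered fixations."""
    if not fixations:
        return {"FFT": None, "TFD": 0.0, "FC": 0, "SD": None}
    # Distance between each pair of consecutive fixation centres.
    jumps = [hypot(b.x - a.x, b.y - a.y)
             for a, b in zip(fixations, fixations[1:])]
    return {
        "FFT": fixations[0].start_ms,
        "TFD": sum(f.duration_ms for f in fixations),
        "FC": len(fixations),
        "SD": sum(jumps) / len(jumps) if jumps else None,
    }


# Hypothetical fixations, invented for illustration only.
sample = [
    Fixation(120, 200, 100, 100),
    Fixation(400, 300, 400, 500),
    Fixation(800, 250, 400, 800),
]
metrics = fixation_metrics(sample)
```

Restricting the input list to fixations that fall inside one information area gives the same indicators per area.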

In order to further analyze users’ attention to different information modules, the key pages in the experimental task are divided into Areas of Interest (AOIs) based on the page structure and functional logic. The division criteria mainly refer to the type of information, user interaction density and visual proportion.

In addition, preliminary inspection of gaze heatmaps revealed stable and recurrent clusters of visual attention across participants. These empirically salient regions informed the final AOI configuration, which was consolidated into the following four AOIs to ensure analytical consistency and interpretability across stimuli (Fig. 2):

AOI1: 3D artifact display area (the main visual area, supporting detail zoom);
AOI2: Parameter description panel (contains technical information such as material, size, etc.);
AOI3: Cultural Interpretation tab (provides historical background and craftsmanship interpretation);
AOI4: related recommendation column (displaying thumbnails of similar cultural relics).
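The AOI division above can be turned into a per-AOI share of gaze duration with a simple hit test. The sketch below assumes rectangular AOIs; the pixel coordinates and function names are hypothetical and do not reflect the actual page layout.

```python
from collections import defaultdict

# Hypothetical AOI rectangles as (left, top, right, bottom) in pixels.
AOIS = {
    "AOI1": (0, 0, 800, 600),       # 3D artifact display area
    "AOI2": (800, 0, 1200, 300),    # parameter description panel
    "AOI3": (800, 300, 1200, 600),  # cultural interpretation tab
    "AOI4": (0, 600, 1200, 800),    # related recommendation column
}


def aoi_of(x, y, aois=AOIS):
    """Return the name of the first AOI containing the point, or None."""
    for name, (left, top, right, bottom) in aois.items():
        if left <= x < right and top <= y < bottom:
            return name
    return None


def dwell_share(fixations, aois=AOIS):
    """Percentage of in-AOI fixation time spent in each AOI."""
    totals = defaultdict(float)
    for x, y, duration_ms in fixations:
        name = aoi_of(x, y, aois)
        if name is not None:        # fixations outside every AOI are ignored
            totals[name] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in aois}
    return {name: 100.0 * totals[name] / grand_total for name in aois}


# Made-up fixations as (x, y, duration_ms); the last one lies outside all AOIs.
shares = dwell_share([(100, 100, 500), (900, 100, 300), (900, 400, 200), (1500, 50, 400)])
```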

At the qualitative level, we combined the operation paths recorded by UXLogger with the user interviews to sort out the typical cognitive paths of users in completing the three types of tasks and to summarize the characteristics of their behavioral patterns. At the same time, the cognitive load perceived by users in each task stage was scored and compared using NASA-TLX scale data, covering dimensions such as mental demand, temporal demand and frustration.

The data were analyzed using SPSS 27.0. Differences among user groups on each eye movement index were tested with one-way ANOVA followed by post-hoc multiple comparisons (Tukey HSD); Pearson’s correlation coefficient was used to explore the linear relationships between key indexes, in order to reveal the statistical association between cognitive behaviors and operation results.

Analysis of experimental results

Differences in visual attention allocation

A one-way analysis of variance (ANOVA) was conducted on the percentage of gaze duration across four AOIs for the three user groups. Results revealed significant differences in visual attention allocation among the three groups (F (2, 15) = 7.89, P < 0.01), supporting research hypothesis H1 (Table 1).
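For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed by hand as the ratio of between-group to within-group mean squares. The sketch below is not the SPSS procedure used in the study, and the gaze-duration percentages are made up; with three groups of six participants it does reproduce the reported degrees of freedom (2, 15).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up gaze-duration percentages for one AOI, six participants per group.
scholars = [30, 32, 28, 31, 29, 30]
enthusiasts = [35, 37, 33, 36, 34, 35]
tourists = [50, 52, 48, 51, 49, 50]
f_stat, df1, df2 = one_way_anova([scholars, enthusiasts, tourists])
```

The p-value is then read from the F distribution with `df1` and `df2` degrees of freedom, which a statistics package provides.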

Table 1 Differences in Visual Attention Allocation among three types of users.


Professional scholars’ gaze duration was primarily concentrated on AOI2 (44.9%), indicating their focus on precise physical and historical parameters of the artifacts.

History enthusiasts showed a significantly higher interest in AOI3 (22.8%) than the other two groups, reflecting their pursuit of the historical narratives and emotional connections behind cultural relics.

General tourists’ attention was mainly focused on AOI1 (52.3%), that is, the 3D model and main images of the cultural relics, while the least attention was paid to the parameter information and cultural interpretation.

Information acquisition efficiency analysis

In the target localization task, users need to find the designated artifact through category navigation and extract the three key pieces of information. Analysis of first fixation time (FFT), task completion time, number of click errors and information accuracy rate shows a significant difference in information acquisition efficiency among the three groups of users, supporting hypothesis H2.

The results are shown in Table 2.

Table 2 Comparison of information access efficiency indicators among three types of users.


Professional scholars have the shortest first fixation time, indicating that they can quickly find the information target. General tourists have the longest, reflecting their difficulty in identifying information targets. The ANOVA results showed significant differences among the groups.

General tourists have the highest number of click errors, indicating more trial-and-error behavior during interaction. The analysis of error types shows that their click errors are mainly concentrated in two categories: misjudgment of category navigation (63%), for example confusing “household utensils” with “furniture”, and misunderstanding of terminology (28%), for example misunderstanding the terms “gilded” and “gold-plated”. Pearson’s correlation analysis showed a significant positive correlation between FFT and total completion time (r = 0.83, p < 0.001), and a significant negative correlation between clicking errors and information accuracy (r = − 0.76, p < 0.001).
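Pearson’s r is the covariance of two variables divided by the product of their standard deviations. The following sketch shows the calculation on made-up values; it does not use the study data and will not reproduce the coefficients reported above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: first fixation time (s) and task completion time (s).
fft = [1.0, 2.0, 3.0, 4.0, 5.0]
completion = [62.0, 70.0, 75.0, 88.0, 95.0]
r = pearson_r(fft, completion)
```

A value of r close to 1 means that longer first fixation times go together with longer completion times; a value close to −1 indicates the opposite relationship.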

Cognitive path pattern analysis

The typical cognitive paths of three types of users in functional operation tasks are summarized through the joint analysis of their eye movement trajectories and webpage behavioral records in the process of task operation (supporting hypothesis H3):

The path of professional scholars is characterized by a “technology-oriented” linear structure, with the order of the path being AOI1 (model observation) → AOI2 (parameter review) → AOI4 (relevant recommendation comparison). The path of this group is smooth, with fewer jumps and strong information integration ability;

The path of history and culture enthusiasts is characterized by a “culture-associated” cyclic structure, mainly switching repeatedly between AOI1 and AOI3, with the path expressed as AOI1→AOI3→AOI1→AOI3, and reinforcing the understanding of cultural symbols through multiple comparisons;

General tourists’ paths show an “interface-dependent” random structure: the paths are complex, with a high repetition rate and frequent jumps between AOI4 and AOI1; after failing to locate a target, these users tend to click “home page” or recommended content repeatedly, and they lack a clear cognitive strategy.
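One simple way to describe such paths quantitatively is to count transitions between AOIs and the share of visits that return to an AOI already seen. The sketch below is an illustrative assumption, not the procedure used in this study; the two example sequences are taken from the scholar and enthusiast paths described above.

```python
from collections import Counter


def transitions(sequence):
    """Count moves between consecutive, different AOIs in a visit sequence."""
    return Counter((a, b) for a, b in zip(sequence, sequence[1:]) if a != b)


def revisit_rate(sequence):
    """Share of AOI visits that return to an AOI already seen earlier."""
    if not sequence:
        return 0.0
    seen, revisits = set(), 0
    for aoi in sequence:
        if aoi in seen:
            revisits += 1
        seen.add(aoi)
    return revisits / len(sequence)


linear = ["AOI1", "AOI2", "AOI4"]            # scholar path described above
cyclic = ["AOI1", "AOI3", "AOI1", "AOI3"]    # enthusiast path described above
```

Under this measure a linear path has a revisit rate of zero, while cyclic and random paths score higher; separating cyclic from random paths would additionally require looking at which transitions repeat.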

Summary of qualitative findings and cognitive load

In addition to the robust quantitative analysis, the qualitative data—derived from the UXLogger operation paths, user interviews, and NASA-TLX scale data—provided essential context on user experience and cognitive effort.

Behavioral Patterns: The smooth, linear path of professional scholars and the repeated, cyclic comparisons of history and culture enthusiasts suggest clear, self-directed cognitive strategies, while the high repetition rate and complex jumping of general tourists indicate a lack of clear cognitive strategies and dependence on interface guidance.

Information Barriers: Qualitative error analysis revealed that general tourists’ primary difficulties stemmed from two categories: misjudgement of category navigation (63%) and misunderstanding of specialized terminology (28%). Examples include confusing “household utensils” with “furniture” and misunderstanding terms like “gilded”. This highlights a significant knowledge barrier for non-specialist users.

Cognitive Load (NASA-TLX): The comparison of cognitive load scores showed that general tourists experienced higher perceived cognitive load across multiple dimensions, including mental demand, temporal demand, and frustration. This is consistent with their lower information accuracy and higher clicking errors. Conversely, professional scholars reported the lowest overall load, reflecting their superior goal manipulation and information recognition efficiency.

These differences provide a clear optimization direction for the subsequent interface design, particularly in terms of layering information and clarifying specialized terminology for general tourists.

Following the completion of the quantitative tasks, all 18 participants (6 from each group) engaged in a 15–20 min semi-structured, in-depth interview.

The interview guide focused on three core areas:

(1) Perceived difficulties during information acquisition;
(2) Suggestions for improving the interaction mode of the 3D model;
(3) Subjective feelings about the perception of cultural value.

All interviews were audio-recorded and transcribed verbatim.
The qualitative data were analyzed using a thematic analysis approach, following the widely accepted six-phase procedure proposed by Braun²⁷ and Clarke²⁸. The analysis proceeded in four stages. First, two researchers independently conducted open coding on the interview transcripts to identify meaningful units related to users’ perceived difficulties, interaction preferences, and cultural understanding. Second, the initial codes were iteratively compared and clustered into candidate themes through discussion and consensus-building. Third, the themes were reviewed, refined, and validated against the original transcripts to ensure internal coherence and conceptual clarity. Disagreements in coding were resolved through repeated discussion until full agreement was reached.
ChatGPT was used as an auxiliary tool to assist in language summarization, preliminary pattern prompting, and reflexive checking, such as reorganizing researcher-generated codes and facilitating comparison across participant groups. All AI-assisted outputs were carefully examined, revised, and validated by the researchers to ensure analytical accuracy and methodological rigor.
To enhance the reliability and transparency of the qualitative analysis, the thematic results were triangulated with quantitative findings from eye-tracking data, behavioral logs, and NASA-TLX cognitive load measures. This multi-source triangulation ensured that the qualitative interpretations were grounded in observable behavioral evidence and strengthened the credibility of the findings (Table 3).

Table 3 Key qualitative findings and triangulation.


The strong convergence between the objective eye-tracking metrics and the subjective interview themes significantly enhances the internal validity of the core findings.

In this chapter, we analyze the eye movement data, task indicators and user behavioral trajectories to reveal the significant cognitive differences between professional scholars, history and culture enthusiasts, and general tourists in the process of using the Forbidden City’s digital artifact repository. These differences emerge consistently across objective metrics (eye-tracking), logged behaviors (UXLogger), and subjective reports (NASA-TLX and interviews), and they provide a clear optimization direction for the subsequent interface design.

Discussion

Theoretical interpretation of user behavior data

The eye-tracking experiment and behavior analysis revealed significant differences in the visual attention, information acquisition efficiency, and cognitive paths among the three typical user groups (professional scholars, history enthusiasts, and general tourists). These findings provide strong empirical evidence that can be interpreted within the theoretical framework of empathy psychology and media materiality introduced in Sect. 2.2.

Interpretation based on empathy psychology

The results of visual attention allocation clearly indicate that users’ focus is driven by their inherent cognitive motivations. Professional scholars, driven by the need for precise data verification and comparative study, exhibited higher attention on AOI2 (Parameter Information) and demonstrated the highest information acquisition efficiency. This behavior aligns with the cognitive dimension of empathy, where their professional background predisposes them to seek structured, factual data. Conversely, history enthusiasts dedicated the most attention to AOI3 (Cultural Interpretation). This suggests their motivation is rooted in emotional empathy, seeking connections with the historical narrative and cultural significance of the artifact, leading to a strong demand for in-depth, interpretative content. General tourists, due to their relatively lower prior knowledge, exhibited a more diffuse and fragmented attention pattern across AOIs and showed the lowest efficiency. Their high cognitive load reflects a struggle to form a coherent mental model, indicating a barrier to achieving effective empathetic engagement.

Interpretation based on media materiality

The analysis of cognitive path patterns and the high cognitive load among tourists support the concept of media materiality. The current system, while providing high-resolution images, relies predominantly on text-based and static visual interfaces. For scholars, this “materiality” is sufficient, allowing them to follow a linear cognitive path (3D model → parameters → related recommendations).

However, for tourists, the single materiality fails to provide an intuitive, multi-sensory experience necessary for complex object comprehension. Their random cognitive path and high inefficiency are a direct manifestation of the interface’s inability to dynamically guide interaction and reduce the perceptual gap between the digital object and the real cultural relic, thereby creating a material barrier to information acquisition.

Validation of research hypotheses

Based on the statistical analysis of the experimental data, the research hypotheses are validated as follows:

H1 (Visual Attention Difference): Supported. The analysis of gaze count and gaze duration confirmed a significant difference in the allocation of visual attention among the three user groups across the four defined AOIs, consistent with the view that user motivation drives visual focus.

H2 (Information Acquisition Efficiency Difference): Supported. The results confirm that professional scholars exhibit the highest information acquisition efficiency and accuracy, while general tourists show the lowest efficiency, as evidenced by completion time, accuracy rate, and subjective cognitive load scores.

H3 (Cognitive Path Difference): Supported. Three distinct cognitive path patterns (Linear, Cyclic, and Random) were identified and shown to correspond to the professional scholar, history enthusiast, and general tourist groups, respectively, confirming the relationship between expertise and interaction strategy.

Limitations, robustness, and theoretical basis for enhancement

The experimental results highlight a critical functional gap: the existing interface, designed primarily for expert retrieval, fails to cater to the diverse needs of non-expert users, leading to high cognitive load and poor information acquisition efficiency for a significant user base.

Limitations and Robustness. We acknowledge that the small, purposively sampled size (Sect. 3.1.2) limits the external validity and generalizability of the results to the wider public. This sampling design carries a potential for selection bias, favoring highly motivated individuals. To mitigate this risk and enhance robustness, we performed triangulation by comparing the quantitative eye-tracking and behavior results with the rich qualitative data from the semi-structured interviews (Sect. 3.3.4). The strong, consistent alignment between the objective metrics and the subjective experiences indicates the key finding: interaction mode fragmentation is the core problem—a conclusion robust enough to drive the proposed enhancement scheme. We did not perform alternative specifications (e.g., sub-sample analyses) due to the already small group sizes, relying instead on the mixed-methods approach for validation.

Our analysis demonstrates that the current system’s materiality is inadequate for inducing empathic understanding, particularly among general users. This finding, corroborated by both quantitative and qualitative data, points directly to multimodal interaction (Sect. 2.3) as the theoretical and practical means to overcome these core limitations of a single interaction mode, fragmented information structure, and insufficient user experience hierarchy.

Design implications for multimodal interface optimization for the digital cultural relics library page of the palace museum

This section synthesizes the empirical findings into a set of design implications for multimodal interface optimization. These implications are conceptual and inferential in nature, grounded in observed user behavior patterns, rather than representing an implemented or empirically evaluated system.

Design orientation and problem restatement

Based on the user behavior study in the previous section, this section systematically proposes interface improvement strategies in terms of information structure, visual guidance and cultural content accessibility. In the eye-tracking experiment and behavioral analysis in the previous section, it was found that the digital heritage library of the Palace Museum has a significant differentiation of user experience among different user groups. Professional scholars focus on technical parameters and structural details, history and culture enthusiasts tend to explore cultural semantics, while general tourists rely on interface guidance and find it difficult to understand the page content in depth. The problems are mainly focused on:

Single interaction mode: most of the pages adopt static graphics and sliding zoom models, lacking perceptible interaction dimensions;

Fragmentation of information structure: the shape, pattern and historical context of cultural relics do not build a systematic narrative path, and users can only receive scattered information in isolation;

Uneven load of user perception: general tourists in particular feel disoriented by the page information density and operation flow, which affects browsing efficiency and cultural understanding.

Based on the above problems, this chapter proposes a three-dimensional optimization path of “behavior-driven-perception enhancement-interface reconstruction” from the perspective of multimodal interaction, and explores a synergistic design strategy for deeper transmission of cultural information and enhancement of user experience, building on the existing functions of the Palace Museum’s digital cultural relics library website.

Multimodal interaction system construction: interface enhancement design based on audiovisual integration

This paper focuses on multimodal “visual-auditory” fusion, breaking through the limitations of the traditional “see-read” type of cultural relics display. It aims to establish context and emotional connection between cultural symbols and user perception, and to enhance the user’s sense of immersion and depth of understanding through dynamic visual guidance and voice interaction²⁶.

Visual layer reconstruction

(1) Interactive visual guide layer design: introduce a “contextualized visual guide” module that embeds a dynamic guide bar at the top of each cultural relic detail page. The bar contains four panels, “material analysis”, “evolution of use”, “decorative symbolism” and “background of the era”, each with dynamic illustrations. For example, the material panel can present a local deconstruction animation and the decorative panel can show how the pattern evolved, to activate the user’s willingness to explore visually.
(2) Key visual markers and image hotspots: a hotspot marking mechanism provides visual hints for key local areas of cultural relics, such as the eye of the “dragon pattern” and the location of cultural symbols such as the “tire glaze combination”, and prompts users to click and explore through micro-animations and luminous borders.
(3) Visual flow guidance mechanism: build a “cultural browsing path” informed by the eye movement paths observed in the experiment, and dynamically adjust the visual layout of the page according to user behavioral data, for example by moving up the AOI3 cultural interpretation area, which history and culture enthusiasts often focus on, and integrating the main visual area with the recommendation column to form a smooth visual flow.

Auditory interaction system design

Voice explanation guidance system

To meet the cognitive needs of users at different levels, a layered voice explanation is constructed, with three explanation types configured for each cultural relic:

Academic-oriented: focusing on craftsmanship, structure, and characteristics of the era, recorded by digital literature and exposition experts, applicable to professional scholars;

Cultural narrative: telling the human stories and historical scenes behind the cultural relics, applicable to cultural enthusiasts;

Popular science type: explaining terms and symbolism in simple language, applicable to general tourists.

Users can select the type of explanation by switching the voice mode manually, or have it assigned by an AI recommendation mode.
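The selection logic for the three explanation types could be sketched as follows. The identifiers (`TRACK_BY_GROUP`, `choose_track`) and the rule that a manual choice overrides the recommended type are assumptions for illustration; the proposal does not specify them, and the recommendation step is reduced here to a lookup by user group.

```python
# Hypothetical mapping from user group to the explanation type described above.
TRACK_BY_GROUP = {
    "professional_scholar": "academic",
    "history_enthusiast": "cultural_narrative",
    "general_tourist": "popular_science",
}


def choose_track(user_group=None, manual_choice=None, default="popular_science"):
    """Pick an explanation type; an explicit user choice always wins."""
    valid = set(TRACK_BY_GROUP.values())
    if manual_choice is not None:
        if manual_choice not in valid:
            raise ValueError(f"unknown track: {manual_choice}")
        return manual_choice
    # Unknown or missing groups fall back to the simplest explanation.
    return TRACK_BY_GROUP.get(user_group, default)
```

Defaulting to the popular science type is a design choice made here because general tourists showed the highest cognitive load in the experiment.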

Voice Interactive Q&A System.
A voice Q&A system based on speech recognition and generative AI is embedded, so that users can actively ask questions such as “What is this pattern?” or “Why is it called ‘tracing gold’?” The system response is structured as follows:

The first statement: accurate answer (extraction of cultural relics metadata and AI content generation).

The second statement: extend the supplement (guide the user into the relevant cultural relics or background).

The third statement: recommended links (such as “Click to learn about the evolution of Qing Dynasty craftsmanship”).

In this way, the system elevates hearing from “passive reception” to a tool for “active exploration”.
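The three-statement response structure can be represented as a small record. This is a sketch under stated assumptions: the `VoiceAnswer` type and `build_answer` helper are hypothetical, the text fields are placeholders, and speech recognition and content generation are outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceAnswer:
    answer: str        # first statement: direct answer
    extension: str     # second statement: related artifact or background
    link_prompt: str   # third statement: recommended link

    def as_script(self):
        """Join the three statements in speaking order."""
        return " ".join([self.answer, self.extension, self.link_prompt])


def build_answer(answer, extension, link_title):
    """Assemble a reply; the link wording follows the example in the text."""
    return VoiceAnswer(answer, extension, f"Click to learn about {link_title}.")


# Placeholder content only; real answers would come from metadata and AI generation.
reply = build_answer(
    "[direct answer drawn from artifact metadata]",
    "[pointer to a related artifact or background]",
    "the evolution of Qing Dynasty craftsmanship",
)
```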

Auditory Symbol Simulation Mechanism.
Some cultural relic pages add a “sound pattern reduction module”, which simulates the timbre of the relic’s material through artificial sound synthesis or real recordings. For example, the page of “red lacquer ware” plays the subtle sandy sound of scraping lacquer, and the page of “bronze bell” restores the resonance of the bell being struck; together with a visualized spectrum animation, these constitute a joint audio-visual experience.

Design of audio-visual synergy mechanism

In order to further realize the inter-sensory linkage enhancement, the page constructs a “synergistic trigger” mechanism between the visual and auditory outputs:

When the user clicks on a hotspot area, the corresponding voice clip is automatically triggered;

When the page slides to the cultural interpretation area, the background explanation will be played automatically (mute switch can be set);

When the user has not operated for more than 30 s, the system will prompt the user to continue exploring with a cultural voice “wake-up call”.
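The three trigger rules above can be expressed as a small event controller. The class and action names below are hypothetical, and the sketch only decides which audio action to start; playback itself is left to the page.

```python
IDLE_LIMIT_S = 30  # idle threshold stated in the text


class SynergyController:
    """Maps page events to audio actions; returns the action name or None."""

    def __init__(self, muted=False):
        self.muted = muted           # mute switch for the background explanation
        self.last_action_s = 0.0

    def on_hotspot_click(self, hotspot_id, now_s):
        self.last_action_s = now_s
        return f"play_clip:{hotspot_id}"

    def on_scroll_into(self, section, now_s):
        self.last_action_s = now_s
        if section == "cultural_interpretation" and not self.muted:
            return "play_background_explanation"
        return None

    def on_tick(self, now_s):
        """Call periodically; fires the wake-up prompt after a long pause."""
        if now_s - self.last_action_s > IDLE_LIMIT_S:
            self.last_action_s = now_s   # avoid repeating the prompt every tick
            return "play_wake_up_prompt"
        return None
```

For example, a click at second 5 followed by no activity produces the wake-up prompt on the first tick after second 35.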

The above design is intended to enhance users’ motivation for active exploration and multi-sensory immersion in cultural information, in particular helping general tourists move from surface browsing to in-depth understanding, and to build a cultural experience channel that users at different levels can feel, hear and trace (Fig. 3).

Information structure and cultural narrative design

Optimization objectives

Information structure design not only concerns the efficiency of user search, but also directly determines the presentation logic and cognitive depth of cultural information. This section is committed to reconstructing the knowledge organization and narrative presentation path of the Forbidden City digital heritage library, so that it shifts from “object focus” to “context focus” and cultural relics make the leap from static display to cultural narrative.

Three-level information structure system

(1) Basic information layer: including the name of cultural relics, age, material, size and other technical data. This layer adopts a standardized display structure, supports keyword search and condition filtering, and provides data support for professional users.
(2) Semantic interpretation layer: constructs semantic interpretation units from the dimensions of historical background, vessel function, and decorative symbols.
(3) Cultural narrative layer: builds story paths around cultural relics, emphasizing the historical tension of “people-objects-events-time and space”.
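The three layers can be pictured as one record per artifact, with condition filtering operating on the basic layer only. The sketch below assumes hypothetical type and field names, and all field values other than the artifact name are placeholders rather than catalogue data.

```python
from dataclasses import dataclass, field


@dataclass
class BasicInfo:
    name: str
    period: str
    material: str
    size: str


@dataclass
class ArtifactRecord:
    basic: BasicInfo                                  # layer 1: technical data
    semantics: dict = field(default_factory=dict)     # layer 2: background, function, symbols
    narrative: list = field(default_factory=list)     # layer 3: ordered story steps


def filter_by(records, **conditions):
    """Condition filtering on the basic information layer."""
    return [
        r for r in records
        if all(getattr(r.basic, key) == value for key, value in conditions.items())
    ]


# Placeholder values; only the first artifact name comes from the experiment tasks.
records = [
    ArtifactRecord(BasicInfo("Black lacquer and gold double-dragon medicine cabinet",
                             "period A", "material A", "size A")),
    ArtifactRecord(BasicInfo("placeholder artifact", "period B", "material A", "size B")),
]
matches = filter_by(records, material="material A", period="period A")
```

Keeping the semantic and narrative layers optional lets professional users work with the basic layer alone while other users are offered the richer layers.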

Content organization strategies: semantic mapping and thematic cluster

Timeline indexing mechanism: a dynastic evolution timeline is introduced so that users can browse the representative artifacts of different historical periods and form a horizontal cultural comparison vision (Fig. 4).

Thematic network navigation: building a four-dimensional knowledge map of “technique - artifacts - use - system”; for example, clicking on “gilding” can link to related craft types such as “picking red” and “gilding”, helping users understand the evolution of cultural crafts (Fig. 5).
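A knowledge map of this kind can be stored as an undirected graph whose nodes carry a dimension prefix. In the sketch below only the link between “gilding” and “picking red” comes from the text; the other nodes are placeholders, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)   # links are navigable in both directions
    return graph


def related(graph, start, max_hops=1):
    """Nodes reachable from start within max_hops links, excluding start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen - {start}


graph = build_graph([
    ("technique:gilding", "technique:picking red"),   # link named in the text
    ("technique:gilding", "artifact:artifact A"),     # placeholder node
    ("artifact:artifact A", "use:use A"),             # placeholder node
])
```

Raising `max_hops` widens the set of suggestions from directly linked items to items two or more links away.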

Adopting the concept of modularized layout, the cultural explanatory content and visualization materials are presented side by side to enhance the readability and absorption rate of the information. The narrative process unfolds in sections through the scrolling trigger mechanism, guiding users to track the evolution of cultural relics in time and space (Fig. 6).

Through the systematic optimization of the three major paths of multimodal interaction, information structure and layered interface in this chapter, the Forbidden City’s digital heritage library is expected to shift from a “high-precision image display platform” to an “explorable, comprehensible, and perceptible digital cultural space”, and to enhance the dissemination and infectious power of cultural heritage digitization¹⁸.

Conclusion and future work

This study initiated from the analysis of existing problems within the Palace Museum’s digital cultural heritage system and proposed the process of “behavior-driven—perception enhancement—interface reconstruction.” By employing eye-tracking and behavior analysis on three typical user groups (professional scholars, history enthusiasts, and general tourists), we empirically investigated the differences in their visual attention allocation, information acquisition efficiency, and cognitive path patterns.

Response to research questions

With respect to visual attention allocation across interface areas, the results indicate pronounced differences among the three user groups. History and culture enthusiasts demonstrated significantly longer fixation durations in the Cultural Interpretation area (AOI3), reflecting a strong orientation toward contextual understanding and meaning construction. Professional scholars, by contrast, allocated more visual attention to the Technical Parameter area (AOI2), consistent with their goal-driven and information-specific interaction strategies. General tourists predominantly focused on the main visual display (AOI1) and recommendation-related content (AOI4), suggesting a more surface-oriented and interface-guided engagement pattern.

Regarding information acquisition efficiency, professional scholars exhibited the highest performance across all relevant indicators, including task completion time and fixation efficiency, which can be attributed to their structured cognitive strategies and domain expertise. General tourists showed comparatively lower efficiency and higher cognitive load, indicating difficulties in navigating and integrating complex information. History and culture enthusiasts occupied an intermediate position, combining high engagement with moderate efficiency, reflecting an exploratory and learning-oriented interaction style.

In terms of cognitive path patterns, scanpath analysis revealed distinct interaction trajectories among the user groups. Professional scholars tended to follow a linear and technology-oriented path, typically progressing from the main visual area to technical information and then to auxiliary content. History and culture enthusiasts exhibited a more cyclical path, frequently alternating between the main visual display and interpretive content to reinforce understanding and emotional engagement. General tourists displayed a more fragmented and interface-dependent path structure, characterized by irregular transitions across multiple areas.

Taken together, these findings demonstrate that differences in user expertise and cultural interest systematically shape visual attention distribution, information processing efficiency, and interaction pathways in digital heritage interfaces.

Theoretical and practical contributions

This study contributes to digital heritage interface research by empirically revealing how differences in user expertise and cultural interest shape visual attention patterns, information acquisition efficiency, and cognitive interaction paths. By grounding these findings in eye-tracking data, the study advances a behavior-based understanding of user cognition in digital heritage contexts.

At the theoretical level, the research extends existing discussions on immersive interaction and empathy by operationalizing cognitive understanding and affective engagement at the interface level, rather than treating empathy as a generalized psychological construct. This approach provides a more focused, interface-oriented application of empathy-related concepts in digital heritage research.

From a practical perspective, the findings offer evidence-based implications for multimodal interface design. The results suggest that different user groups benefit from differentiated interface support, highlighting the need for layered information structures and adaptive design strategies in digital heritage platforms.

Limitations and future research

The primary limitation of this study is the small, non-randomized sample size (N = 18), which restricts the statistical generalizability of the quantitative findings to the wider population of digital cultural heritage users. The experimental design was justified by its reliance on purposive sampling to capture deep-seated cognitive differences, but increasing participant numbers in future studies would further strengthen the robustness of the findings.

To mitigate this limitation, the qualitative interview findings and NASA-TLX cognitive load data served as critical supplementary evidence. These data provided essential context on information barriers, such as misjudging navigation or misunderstanding specialized terminology (e.g., “gilded”), which corroborated the quantitative eye-tracking and task performance metrics, thereby enhancing the internal validity of the proposed optimization direction. It should be noted that the proposed multimodal optimization scheme is conceptual and design-driven, and its effectiveness has not yet been empirically validated through implementation and user testing.

In addition, the integration of emerging digital tools in qualitative research introduces methodological and ethical considerations. Although generative AI technologies were involved during the qualitative analysis process, AI-assisted qualitative support tools were used strictly in a supportive role, in line with emerging ethical guidelines for responsible AI use in academic research. Specifically, all coding, theme generation, and interpretative decisions were conducted manually by the researchers, while AI tools were limited to auxiliary functions such as language organization and reflexive checking.

Future research should further explore standardized frameworks for the transparent and ethical integration of AI-assisted tools in qualitative analysis, ensuring that methodological rigor, researcher reflexivity, and analytical accountability remain central.

Furthermore, the implementation of the proposed multimodal system faces practical constraints. Technical feasibility remains a key challenge, as the full integration of advanced technologies—such as generative AI for semantic deconstruction and haptic feedback systems—requires substantial engineering efforts to ensure system stability and low latency in high-traffic environments. Cost considerations are also significant, given the substantial investment required for infrastructure, hardware, and long-term maintenance. In addition, the adoption of complex multimodal interfaces may necessitate dedicated user training or onboarding mechanisms, particularly for general tourists who exhibited high cognitive load with the existing system.

In summary, this study explores potential pathways for transforming digital cultural heritage systems into more user-centered and experience-oriented platforms by integrating empathy theory, multimodal interaction, and user cognition-driven cultural communication logic. The findings not only offer practical design insights for optimizing the Palace Museum’s digital heritage library but also contribute a methodological reference for future research in digital humanities and digital heritage studies.

Data availability

The datasets generated and analysed during the current study are not publicly available due to privacy concerns mandated by the IRB, but are available from the corresponding author on reasonable request.

References

Dalong, D. The Construction of the Virtual Museum in the Forbidden City of China [Article]. Inform. Cult. 59 (3). https://doi.org/10.7560/ic59302 (2024).
Fang, L., Sun, J. & Liu, Y. Research on the Quality Control Method of Cultural Heritage Digital Information Service: A Case Study of the Digital Cultural Relics Library Platform of the Palace Museum in Beijing [Article]. Libr. Trends. 71 (4). https://doi.org/10.1353/lib.2023.a927953 (2023). [In Chinese]
Li, J., Nie, J. W. & Ye, J. Evaluation of virtual tour in an online museum: Exhibition of Architecture of the Forbidden City [Article]. Plos One. 17 (1). https://doi.org/10.1371/journal.pone.0261607 (2022). Article e0261607.
Tu, J. C., Liu, L. X. & Cui, Y. A Study on Consumers’ Preferences for the Palace Museum’s Cultural and Creative Products from the Perspective of Cultural Sustainability [Article]. Sustainability, 11 (13), Article 3502. (2019). https://doi.org/10.3390/su11133502
Fan, Q. Research on intangible cultural heritage resource description and knowledge fusion based on linked data [Article]. Electron. Libr. 42 (4), 521–535. https://doi.org/10.1108/el-01-2023-0018 (2024).
Yu, Z., Hongxiao, C. & Qingfeng, Z. Study on Visitor Experience Value Assessment Based on Multi-Dimensional Factor Analysis - Case Study of Palace Museum in Beijing, World Cultural Heritage Site [research-article]. Int. J. Adv. Cult. Technol. 12 (4), 96–112 (2024). ://KJD:ART003152829.
Sanderson, K. Visual Interface Design for Digital Cultural Heritage: A Guide to Rich-prospect Browsing. Electron. Libr. 30 (1), 150–151. https://doi.org/10.1108/02640471211204150 (2012).
Li, D. N. & Zhao, X. C. Design Aesthetic Innovation and Practical Case Studies in the Digital Protection of Museum Cultural Relics: A Case Study of the Jinling Painting Digital Art Exhibition in Nanjing. Orient. Collect., (3), 68–70. (2025).
Cao, C. X. & Duan, Y. Transformation and dilemmas of artifact-free immersive digital cultural heritage exhibitions: Audience perception analysis based on online text. Dongnan Wenhua. 5, 162–170 (2024). [In Chinese]
Zabulis, X. et al. A Digitally Enhanced Ethnography for Craft Action and Process Understanding [Article]. Appl. Sciences-Basel. 15 (10). https://doi.org/10.3390/app15105408 (2025). Article 5408.
Huang, Z. Research on a Cross-Resource Interaction Model for Archival Heritage in Virtual and Real Interaction. Shanxi Archives. 6, 172–175 (2023). 171. [In Chinese]
Ting, Z. & Young, K. S. A Comparison Study of Youth Participatory Culture and Arts Education Projects in Chinese and Korean Museums: Focused on the Palace Museum in China and the National Museum of Korea [research-article]. 9 (4), 59–76. (2022).
Hornecker, E. & Stifter, M. Learning from interactive museum installations about interaction design for public settings. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. (2006).
Paananen, V., Kiarostami, M. S., Lik-Hang, L., Braud, T. & Hosio, S. From Digital Media to Empathic Spaces: A Systematic Review of Empathy Research in Extended Reality Environments. ACM Computing Surveys (2023).
Ye, M., Zhao, C., Ma, H. & Wang, J. Bridging Through Multisensory Experiences: The Inheritance and Innovation of Grand Canal Intangible Cultural Heritage in Cultural and Creative Products. Advances in Engineering Technology Research (2025). [In Chinese]
Li, Q., Wang, P., Liu, Z. & Wang, C. How generous interface affect user experience and behavior: Evaluating the information display interface for museum cultural heritage. Computer Animation and Virtual Worlds (2023).
Seongmi, J. For Advancing Digital Archives in Multifaceted Utilization of Intangible Cultural Heritage [research-article]. Cult. Convergence. 45 (12), 133–147 (2023). ://KJD:ART003031175.
Zou, C., Rhee, S. Y., He, L., Chen, D. & Yang, X. Sounds of History: A Digital Twin Approach to Musical Heritage Preservation in Virtual Museums [Article]. Electronics, 13 (12), Article 2388. (2024). https://doi.org/10.3390/electronics13122388
Cheng, J. X. Research on the multidimensional perception reconstruction of the virtual ecosystem of sports intangible cultural heritage in the new era. J. Shandong Sport Univ. 2, 63–64. https://doi.org/10.26914/c.cnkihy.2025.014198 (2025). [In Chinese]
Liu, Y. T. Application of dynamic graphic interactive narrative based on synsensory psychology [Master’s Thesis, Luxun Academy of Fine Arts]. (2023). https://doi.org/10.27217/d.cnki.glxmc.2023.000044
Niccolucci, F. & Felicetti, A. Digital Twin Sensors in Cultural Heritage Ontology Applications [Article]. Sensors 24 (12). https://doi.org/10.3390/s24123978 (2024). Article 3978.
Jeong, H. H., Oh, H. J., Kim, T. Y. & Kim, Y. A Study on the Construction and Utilization of Digital Archives for Intangible Cultural Heritage in Korea [research-article]. J. Korean Biblia Soc. Libr. Inform. Sci. 27 (2), 95–134. https://doi.org/10.14699/kbiblia.2016.27.2.095 (2016).
Zeng, Q., Lee, M. & Eune, J. Digital design method of cultural heritage using Ancient Egyptian theological totem [Article]. Heliyon, 9 (5), Article e15960. (2023). https://doi.org/10.1016/j.heliyon.2023.e15960
Smithies, J. et al. MaDiH ((sic)): A Transnational Approach to Building Digital Cultural Heritage Capacity [Article]. Acm J. Comput. Cult. Herit. 15 (4), 71. https://doi.org/10.1145/3513261 (2022).
Caine, K. Local standards for sample size at CHI. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 981–992. (2016).
Sweller, J. Cognitive load theory. In J. P. Mestre & B. H. Ross (Eds.), The psychology of learning and motivation: Cognition in education (pp. 37–76). Elsevier Academic Press. (2011). https://doi.org/10.1016/B978-0-12-387691-1.00002-8
Braun, V. & Clarke, V. Using thematic analysis in psychology. Qualitative Res. Psychol. 3 (2), 77–101. https://doi.org/10.1191/1478088706qp063oa (2006).
Braun, V. & Clarke, V. Reflecting on reflexive thematic analysis. Qualitative Res. Sport Exerc. Health. 11 (4), 589–597. https://doi.org/10.1080/2159676X.2019.1628806 (2019).
Chen, M. X. Application and reflection of digital technology in archaeological site museums: A case study of the Tanshishan Site Museum in Fujian Province. Fujian Wenbo. 4, 78–83 (2024). [In Chinese]
Geng, G., He, X. L., Wang, M. L., Li, K. & He, X. W. Research Progress on Key Technologies for the Revitalization of Cultural Heritage. Chin. J. Image Graphics. 27 (6), 1988–2007 (2022). [In Chinese]
Gopakumar, M. et al. Full-colour 3D holographic augmented-reality displays with metasurface waveguides. Nature (2024).
Kim, J. Y., 한한 & 김영진 A Study on the Utilization of Modern Exhibition in Traditional Chinese Palace Architectural Space - Centering on the Ansiru of the Palace Museum [research-article]. J. Korea Institute Spat. Des. 17 (5), 71–86 (2022).
Ting, M. Y. & heehyun, K. Research on Package Design of Cultural Products for The Palace Museum of Beijing- Centered on Emotional Design Theory [베이징 고궁박물관의 문화상품 패키지 디자인 연구- 감성디자인 이론을 중심으로 -] [research-article]. J. Brand Des. Association Korea. 22 (1), 45–56 (2024). ://KJD:ART003064808.

Funding

No funding.

Author information

Authors and Affiliations

School of Art and Design, Hubei University of Technology, Nanli Road, Hongshan District, Hubei Province, Wuhan City, People’s Republic of China
Linghui Ke, Huimin Qin, Jiaao Long & Pengyu Xiao

Contributions

Linghui Ke and Huimin Qin wrote the main manuscript text and Jiaao Long and Pengyu Xiao prepared figures. All authors reviewed the manuscript.

Corresponding author

Correspondence to Linghui Ke.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics

This study was approved by the HBUT Ethics Research Committee. The ethics project ID number is HBUT/2025/0070.

Methods

In accordance with the ethical principles outlined in the Declaration of Helsinki, all participants provided informed consent before participating in the study. The anonymity and confidentiality of the participants were guaranteed, and participation was completely voluntary.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ke, L., Qin, H., Long, J. et al. Multimodal interaction enhancement of digital cultural heritage system: user behavior analysis and interface reconstruction of the heritage scanning library of the palace museum. Sci Rep 16, 10654 (2026). https://doi.org/10.1038/s41598-026-44955-x

Received: 18 July 2025
Accepted: 16 March 2026
Published: 30 March 2026
Version of record: 31 March 2026
DOI: https://doi.org/10.1038/s41598-026-44955-x
