Introduction

With the rise of immersive technologies, digital twin technology (DTT) has become an important tool in cultural heritage preservation and dissemination. Through high-precision 3D modeling, real-time synchronization, and interactive simulation, it functions as a digital “life-support system” for fragile sites such as the Mogao Caves and Pompeii1,2. Beyond preservation, DTT enables tourists to “travel through time and space,” fostering immersive historical experiences grounded in the notion of “virtual uniqueness”3: the idea that cultural symbols acquire exclusive presence and interactivity in digital environments. For example, the Palace Museum in Beijing uses DTT to reconstruct Qing Dynasty court life, integrating role-play and narrative design to transform visitors from passive viewers into embodied participants, shifting the experience from “browsing” to deep cultural understanding4.

However, despite growing investments in immersive technologies, such experiential enhancements have not translated proportionally into sustainable behavioral change among tourists5. This paradox of technological enthusiasm versus behavioral indifference reveals a persistent gap between digital engagement and behavior transformation2. Existing research has primarily emphasized technical aspects such as modeling, visualization, and interface optimization, while paying insufficient attention to the underlying psychological and behavioral mechanisms1,6.

Current DTT literature divides largely into two tracks. The first concerns technological realization, focusing on 3D modeling, data integration, and system interactivity7,8. The second, anchored in the Technology Acceptance Model (TAM), evaluates user attitudes through variables like perceived usefulness and ease of use9,10. Yet TAM, while well established in information systems research, proves inadequate in highly contextualized, affect-rich environments such as cultural heritage tourism. It cannot explain emotional resonance, symbolic interpretation, and narrative immersion, all of which are fundamental to the construction of cultural meaning10,11.

Moreover, models emphasizing surface-level satisfaction or hedonic response fail to capture how visitors internalize cultural content psychologically2,12. In this regard, cultural identity should be reconceptualized as a dynamic process evolving from cognitive understanding to emotional involvement and ultimately behavioral commitment13,14,15,16. However, such internalization is often reduced to simplified indicators such as “pleasure,” neglecting the layered symbolic meaning-making process. Similarly, place attachment theories developed in physical contexts struggle to explain how emotional bonds form in virtual environments driven by symbolic cues17. Mechanisms such as narrative immersion, symbolic reconstruction, and role-playing underpin what we term symbolic embeddedness, yet this remains under-theorized18,19.

Cultural capital adds another layer of complexity. As Bourdieu noted20, it shapes individuals’ interpretive schemas and cultural preferences. Tourists with higher cultural capital may be more attuned to symbolic narratives21,22, but they may also exhibit a critical distance toward hypermediated experiences19, a phenomenon we refer to as the “anti-immersion effect.” This intersection of cultural capital and digital heritage experience remains theoretically underdeveloped. Most existing studies rely on structural equation modeling (SEM), which estimates linear, population-level relationships23. However, sustainable behavioral intentions often result from the interplay of multiple psychological conditions24. To better address this complexity, we adopt a dual approach, using partial least squares SEM (PLS-SEM) to examine average effects and fuzzy-set qualitative comparative analysis (fsQCA) to uncover diverse causal combinations associated with behavioral outcomes.

In summary, the research questions addressed in this study are as follows:

(1) How do immersive experiences triggered by DTT transform into cultural psychological construction processes through perception (realism and narrativity)?

(2) How do cultural identity and place attachment mediate the relationship between perceptual experiences and tourists’ behavioral intentions?

(3) How does an individual’s cultural capital moderate the pathways through which immersive experiences influence cultural psychological construction?

(4) Under various combinations of perceptual and psychological variables, are there multiple equivalent pathways leading to tourists’ sustainable behavioral intentions?

To address these questions, this study proposes an integrated framework that moves beyond traditional perception–behavior models by incorporating perceptual stimuli, psychological construction, and behavioral responses. Theoretically, it emphasizes the mediating roles of cultural identity and place attachment within the PPB pathway, and introduces the moderating role of cultural capital. Methodologically, it combines PLS-SEM with fsQCA to identify both average effects and heterogeneous causal configurations. Through this theoretical–methodological dual integration, the study seeks to respond to the limitations of current models and better capture the complexity of tourist behavior in digital heritage contexts.

Methods

The dual drivers of digital twin technology

In recent years, DTT has been widely employed in cultural heritage preservation and communication, combining precise visual modeling with immersive interaction2,7,25. By integrating 3D scanning, IoT sensors, and real-time rendering, DTT constructs a “virtual mirror” of heritage sites that enables visitors to explore and engage with cultural resources remotely26. From a perceptual perspective, two experiential drivers are particularly central to the way DTT shapes users’ responses: Perceived Realism (PR) and Narrativity (NC). We argue that both can be directly traced back to identifiable technical inputs in DTT, which provides a strong rationale for our hypotheses.

PR describes the subjective sense of consistency between virtual representations and their real-world cultural counterparts27. In DTT contexts, this sense of realism is not a vague impression but the outcome of several concrete technical inputs. Geometric and material fidelity—for example, dense point clouds, mesh resolution, and physically based rendering—allows surfaces, textures, and architectural details to be discerned without visual artifacts, reinforcing the impression of “being there”28. Physically plausible lighting and physics engines ensure that shadows, reflections, and object behaviors follow natural causal rules, thereby reducing contradictions that might otherwise undermine credibility29. Spatiotemporal registration and real-time data feeds (e.g., synchronizing IoT sensor input with a virtual environment) provide a temporal anchor, so visitors experience events that align with “what is happening now”30. Finally, system performance and latency (adequate frame rate, minimal lag) prevent perceptual breaks that can destroy the illusion of reality.

Together, these inputs converge to strengthen authenticity, presence, and the impression of “experiencing the real site”31,32. Empirical studies show that such realistic rendering significantly contributes to the internalization of cultural meaning and the formation of Cultural Identity (CI). Yet prior work also warns that realism is not linearly positive: too much sensory detail can overwhelm users’ cognitive resources, diverting attention to technical elements rather than cultural significance33,34. Leow and Ch’ng19 similarly note that an excessive focus on sensory fidelity may crowd out deeper cultural interpretation. Thus, while the relationship is contingent on context and individual background, the dominant expectation supported by both technology design and empirical evidence is that greater PR will facilitate the strengthening of cultural identity.

H1: PR positively influences CI.

NC refers to the storyfulness of digital-heritage presentations, capturing the extent to which visitors are immersed in a culturally meaningful storyline through role enactment, scene re-creation, and multisensory cues35,36. In DTT, narrativity is supported by several technical inputs. Interaction granularity, such as object-level inspection, manipulation, and path choice, allows visitors to become active participants rather than passive viewers, which heightens their sense of agency and story involvement. Narrative orchestration through branching triggers, quest-chain logic, and diegetic interfaces maintains dramatic progression and role goals, sustaining narrative transportation. Multisensory delivery, including spatial audio and haptic feedback, enriches the storyworld with embodied cues. Stable system performance prevents rhythm breaks that would disrupt immersion, while spatiotemporal registration with real-time events adds credible “evidence nodes” that make the enacted story feel situated rather than scripted.

Compared with static exhibitions, these narrativity-oriented affordances foster temporal transcendence—the sense of experiencing the past in the present—and situated immersion. Case projects illustrate this vividly: the Mostar Old Bridge combined 360° VR with embodied action to let visitors re-enact intangible rituals37, while the Carignano Palace reconstruction employed dynamic narration to place audiences within parliamentary history, deepening both comprehension and affective engagement38. Converging evidence shows that when visitors are narratively transported, that is, when they feel inside the storyworld, they are more likely to develop place-based bonding and belonging39,40. Nevertheless, narrativity also has boundary conditions. Fragmented plots or excessive branching can lead to cognitive overload and weaken transportation41, and when PR is very high, sensory fidelity may capture attention at the expense of narrative processing19. Acknowledging these risks yet following the dominant empirical pattern, we propose the following hypothesis:

H2: NC positively influences place attachment.

Psychological mechanisms of cultural identity and place attachment

Cultural Identity (CI) refers to the psychological bond that tourists form with a specific culture during heritage experiences. It typically comprises three dimensions: cognitive identity (understanding and internalizing cultural values), emotional identity (a sense of belonging and self-association), and behavioral intention (a commitment to cultural preservation)42,43. This construct not only reflects tourists’ subjective acceptance of culture but is also recognized as a key psychological foundation for fostering sustainable tourism behaviors44.

In virtual cultural contexts enabled by DTT, the formation mechanism of CI is undergoing transformation2. Through immersive experiences and interactive learning, DTT allows tourists to directly engage with the history and preservation values of heritage sites. For example, virtual archeology platforms that simulate restoration processes significantly enhance users’ cognitive identity by deepening their understanding of heritage craftsmanship45. Augmented reality (AR) guided tours reconstruct historical narratives, strengthening tourists’ situational resonance and emotional identification with heritage46.

However, some studies caution that in pursuit of immersion, certain projects adopt overly theatrical or entertaining approaches, which may divert tourists’ attention to the storyline itself while neglecting underlying cultural values. This can result in CI remaining at a superficial level of pleasurable experience39. Therefore, the effectiveness of DTT in fostering CI may vary depending on content design and user background. Nonetheless, there is broad consensus on its potential to influence identity construction. Prior research suggests that stronger CI leads to greater engagement in both EBI (e.g., pollution reduction, low-carbon travel) and CRI (e.g., observing local customs, protecting heritage sites)42. Accordingly, the following hypotheses are proposed to explore the mediating role of CI between PR and behavioral intention:

H3a: CI mediates the relationship between PR and EBI.

H3b: CI mediates the relationship between PR and CRI.

Place Attachment (PA) refers to an individual’s emotional bond and psychological connection to a specific location, typically comprising two dimensions: place identity, which reflects tourists’ emotional belonging and sense of self-extension toward a heritage site; and place dependence, which denotes functional reliance and the perceived uniqueness of experiences associated with the site17,47.

In traditional tourism contexts, PA is primarily formed through physical presence and on-site experiences. However, this psychological mechanism is undergoing transformation within digitally constructed environments powered by DTT7,48. On the one hand, the high replicability and on-demand accessibility of virtual environments may weaken tourists’ dependence on physical space49. On the other hand, immersive storytelling and role-play experiences may significantly enhance contextual resonance, thereby reinforcing tourists’ sense of PA19,50. For example, the VR reconstruction of Pompeii simulates life before the volcanic eruption, enabling visitors to develop deeper historical identification.

Existing studies suggest that higher levels of PA are positively associated with stronger intentions to engage in EBI51,52. When individuals form deep emotional connections to a heritage site, they are more likely to participate in protective actions, such as minimizing waste or supporting eco-friendly initiatives. Therefore, the following hypothesis is proposed:

H4a: NC influences EBI through PA.

On the other hand, when tourists perceive a heritage site as functionally irreplaceable, they are more inclined to demonstrate culturally respectful behaviors, such as complying with behavioral norms or maintaining the sanctity of the site51,53. This indicates that PA also plays a significant role in guiding behavioral responses. Thus, the following hypothesis is proposed:

H4b: NC influences CRI through PA.

CI and PA are closely linked in tourists’ psychological mechanisms32,54. Existing research has shown that an increase in CI often coincides with a deeper emotional attachment to cultural heritage sites55. In the context of DTT, this relationship is especially pronounced: visitors gain a deeper understanding of the historical significance of cultural heritage through virtual interactions, which not only strengthens their CI but also enhances their emotional connection and attachment to the site1,2. For example, the digital restoration project of the Yuanmingyuan (Old Summer Palace) not only helps visitors recognize its historical destruction but also fosters national cultural consciousness, thereby enhancing visitors’ PA to the site56. Therefore, CI and PA may jointly influence tourists’ behavioral intentions, forming a chain mediation path. Based on this, the following hypotheses are proposed:

H5a: CI and PA form a chain mediation path in the relationship between PR and EBI.

H5b: CI and PA form a chain mediation path in the relationship between PR and CRI.

The role of cultural capital in digital twin experiences

Cultural Capital (CC) refers to the cultural resources, cognitive structures, and aesthetic abilities possessed by individuals, which profoundly influence how people perceive, decode, and respond to cultural information20,57. In cultural heritage tourism, CC not only determines whether visitors can deeply understand the intrinsic meaning of heritage, but also affects their experience quality and psychological engagement pathways in DTT scenarios1,58. However, research on the moderating mechanisms of CC in DTT contexts remains scarce, and systematic empirical testing is particularly lacking.

First, CC enhances tourists’ narrative decoding ability. Visitors with high CC typically possess richer historical knowledge and aesthetic literacy, enabling them to interpret complex cultural narratives during immersive experiences21,59. For example, in the AR restoration project at Kyoto’s Kiyomizu-dera, such visitors not only attend to the architectural details but also understand the religious symbolism and cultural context22. In contrast, those with lower CC often react more strongly to the “novelty” of the technological presentation but struggle to understand the cultural context60,61. Therefore, CC may enhance the positive impact of NC on PA.

Bourdieu’s habitus theory emphasizes that the cognitive frameworks and behavioral patterns formed through an individual’s social background and historical experiences profoundly influence their perception and response to the external world20. In the digital cultural heritage experience, tourists’ CC influences their perception and understanding of virtual heritage through habitus. For instance, visitors with higher CC are accustomed to critically examining the presentation of virtual heritage, placing higher demands on the accuracy, historical consistency, and cultural value of the technology2,62. This critical examination may lead high CC visitors to develop a more rational and profound CI in virtual environments, while those with lower CC are more likely to be influenced by sensory stimuli, resulting in a more superficial CI63. Based on this, the following hypotheses are proposed:

H6: CC positively moderates the impact of NC on PA.

H7: CC negatively moderates the impact of PR on CI.

PPB model and theoretical integration

Traditional models such as TAM and emotional theories provide valuable insights into the impact of digital technology on user behavior64, but they fall short in capturing the complex psychological processes in cultural heritage tourism. TAM emphasizes perceived usefulness and ease of use but overlooks identity construction and emotional engagement65, and remains insufficient in addressing culturally embedded behaviors66,67,68,69. Meanwhile, emotional theories often treat affect as an isolated driver, failing to integrate perception and cognition in a systemic way9,10.

The PPB model provides a more integrative framework, positioning CI and PA as mediating variables that bridge perceptual stimuli and behavioral outcomes. Unlike TAM’s linear logic, PPB emphasizes how psychological mechanisms shape behavioral intentions, incorporating elements from emotional and behavior change theories into a multi-layered causal chain. As shown in Fig. 1, this model helps conceptualize the relational structure among PR, NC, CI, PA, and behavioral outcomes such as EBI and CRI.

Fig. 1: Research hypothesis paths of the PPB model.

Solid blue arrows denote hypothesized positive effects; dashed orange arrows denote hypothesized negative effects. Light-blue nodes indicate independent variables, orange nodes indicate mediating variables, green nodes indicate dependent variables, and the yellow node denotes the moderating variable. See hypotheses H6 and H7 for the specific moderated paths.

However, tourist responses in immersive cultural environments are rarely linear or homogeneous. For example, some individuals develop strong attachment under narrative engagement despite low realism, while others bypass affective identification through high cultural capital. These divergent pathways challenge the explanatory power of mean-based models70. To address this, we introduce fsQCA to identify multiple causal configurations underlying behavior. This configurational approach allows us to explore the path heterogeneity within the PPB chain, offering a more nuanced understanding of behavior formation in digital heritage contexts23,24,71.

Study area

This study focuses on three representative heritage sites in Guangzhou, China: the Chen Clan Ancestral Hall, Yongqingfang, and the Nanyue King Palace Ruins (Fig. 2). Site selection was based on three considerations:

Fig. 2: Research location map.

The base map shows district boundaries, rivers, and major green spaces. The inset indicates the urban core. Panels mark the study sites: Chen Clan Ancestral Hall, Yongqingfang, Nanyue Kingdom Palace. Color coding: green = green space; blue = river region.

First, the sites cover three major heritage types defined by the International Council on Monuments and Sites72—tangible (Chen Clan Ancestral Hall), intangible (Yongqingfang), and archeological (Nanyue King Palace Ruins). This typological diversity enhances the generalizability of the findings73.

Second, the cases reflect a gradient of digital interventions. The Chen Clan Ancestral Hall features high-fidelity AR restoration (0.1 mm accuracy), suitable for testing perceived realism’s effect on cultural identity (CI, H1)74. Yongqingfang employs narrative-driven VR to activate place attachment (PA, H2)75. The Nanyue King Palace Ruins combine LiDAR and AR for guided interaction, allowing investigation of PR–NC synergy on behavioral intention (H3–H5)76. This gradient offers a practical basis for hypothesis testing.

Third, selecting sites within a single urban context controls for regional variation. Guangzhou, a top-ranked cultural tourism destination in China (“China Cultural Tourism Statistical Yearbook”, 2023), ensures cultural consistency and minimizes confounding influences.

Questionnaire design and variable measurement

The structured questionnaire consisted of two parts. The first part collected basic demographic data (e.g., gender, age, education level, and site visited), while the second part measured the core constructs of this study, seven variables in total: two independent variables (PR and NC), two mediating variables (CI and PA), two dependent variables (EBI and CRI), and one moderator (CC). Each variable was measured through a dedicated subscale using a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree). A total of 21 items were used in the main scale.

To ensure theoretical consistency and cross-contextual validity, all scales were adapted from validated instruments in tourism, digital heritage, and immersive experience. The measurement of PR referred to the spatial presence scale by Wagler and Hanus77, with items anchored to the digital-twin experience (e.g., “the visuals and interactions in this digital-twin experience felt lifelike”) to ensure context specificity. NC was grounded in narrativity-oriented frameworks, drawing on Reese et al.78 and Mulholland et al.79, and adapted to highlight story involvement within the digital-twin setting. CI adopted the cognitive–emotional identity structure developed by Fu and Luo16, with explicit referent shifts from physical heritage visits to digital-twin presentations, using anchors such as “in this digital-twin experience” and “as presented in the digital twin” to capture identity formation in virtual contexts. PA was measured using the dual-dimensional model of Williams and Vaske17 (identity and dependence), with item wording similarly adapted to emphasize attachment to the place as represented in the digital twin. EBI and CRI were derived from established environmental-psychology scales80,81,82, and contextualized by framing behavioral intentions after the digital-twin experience rather than in generic terms. CC was measured through indicators of education, aesthetics, and participation, adapted from Bourdieu-inspired empirical models83,84, and treated as a background resource without DTT contextualization.

All English-language items underwent a rigorous translation and localization process, involving two experts—one specializing in digital heritage and the other in applied linguistics. They conducted independent forward translations, followed by consensus synthesis and back-translation. To ensure semantic clarity and contextual adaptability, a pilot test was conducted with 28 participants who had recently visited a cultural site and used DTT tools. Feedback indicated confusion with terms such as “narrative coherence” and “virtual realism,” prompting refinement of item phrasing. Descriptions of example DTT applications (e.g., AR-guided tours, VR immersion theaters) were added to reduce ambiguity.

Following the pilot, minor adjustments were made to item wording to improve comprehension, while preserving conceptual integrity. The final version demonstrated adequate face validity and internal consistency, and a summary of variables, item examples, and sources is presented in Table 1.

Table 1 Measurement variables and scales for key constructs

Sample selection

This study adopted a stratified convenience sampling method, targeting tourists who had visited three representative cultural heritage sites in Guangzhou—Chen Clan Ancestral Hall, Yongqingfang, and the Nanyue King Palace Ruins. The stratification was based on two dimensions: heritage type (traditional architecture, intangible cultural heritage, and archeological ruins) and mode of technological engagement (VR, AR, and digital interaction), ensuring comprehensive coverage of DTT application scenarios. Respondents were required to meet three core criteria: (1) be at least 18 years old; (2) have physically visited one of the targeted heritage sites within the past 12 months; and (3) have engaged with at least one DTT application during the visit, verified through device usage records or a DTT feature recognition test.

According to the dual criteria for sample size determination under PLS-SEM, the minimum sample size was calculated using both Hair et al.85 “10-times rule” (7 paths in the model, minimum 70 samples) and G*Power analysis (effect size = 0.15, α = 0.05, power = 0.8), which yielded a required sample size of at least 395. Considering a potential invalid response rate of 20%, a total of 600 questionnaires were distributed. Ultimately, 516 valid responses were collected, yielding a valid response rate of 86%.
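The arithmetic behind these figures can be retraced with a short sketch. It assumes only the numbers stated above (7 paths, a G*Power requirement of 395, a 20% invalid-response allowance, 600 distributed, 516 valid); the function names are illustrative and the G*Power computation itself is not reproduced.

```python
import math


def ten_times_rule(n_paths: int) -> int:
    """Minimum sample size under the 10-times heuristic (ten cases per path)."""
    return 10 * n_paths


def inflate_for_invalid(required_n: int, invalid_rate: float) -> int:
    """Questionnaires needed so that required_n remain after invalid ones are dropped."""
    return math.ceil(required_n / (1.0 - invalid_rate))


heuristic_n = ten_times_rule(7)                    # 10-times rule with 7 paths
target_n = max(heuristic_n, 395)                   # the stricter of the two criteria
minimum_to_distribute = inflate_for_invalid(target_n, 0.20)
valid_rate = 516 / 600                             # valid responses / distributed

print(heuristic_n, target_n, minimum_to_distribute, round(valid_rate, 2))
```

Under these assumptions the 600 distributed questionnaires exceed the inflated minimum, and the 516 valid responses exceed the stricter of the two sample-size criteria.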

Data collection and quality control

A multimodal approach combining online and offline data collection was employed. Offline responses were collected by trained surveyors stationed at the exits of the heritage sites, with DTT engagement verified via device usage logs. Online data were distributed through the official heritage site apps to maintain data integrity. To ensure sample diversity and data quality, measures such as IP address filtering and response time monitoring were implemented.

The data cleaning process was conducted in two stages: (1) primary cleaning removed responses with abnormally short completion times (more than three standard deviations below the mean, i.e., under 98 s) or with over 10% missing data; (2) advanced cleaning used the Longstring index (threshold = 0.8) to detect patterned responses and Mahalanobis distance (p < 0.001) to identify multivariate outliers, ensuring robustness and reliability86,87.
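A minimal sketch of the screening rules is given below. It assumes that the Longstring index is expressed as the longest run of identical consecutive answers divided by the number of answered items, and that a response is dropped when this ratio exceeds 0.8; both are interpretations rather than details stated above. The Mahalanobis step is omitted because it requires a covariance matrix inversion, and all function names are hypothetical.

```python
from itertools import groupby


def longstring_ratio(answers):
    """Longest run of identical consecutive answers, as a share of all answers."""
    longest = max(len(list(run)) for _, run in groupby(answers))
    return longest / len(answers)


def missing_share(answers):
    """Share of unanswered items; None marks a missing answer."""
    return sum(a is None for a in answers) / len(answers)


def keep_response(answers, seconds, min_seconds=98,
                  max_missing=0.10, max_longstring=0.8):
    """Apply the completion-time, missing-data, and Longstring screens in order."""
    if seconds < min_seconds:
        return False
    if missing_share(answers) > max_missing:
        return False
    answered = [a for a in answers if a is not None]
    return longstring_ratio(answered) <= max_longstring


# Hypothetical 21-item responses on a 5-point scale.
varied = [4, 3, 5, 4, 4, 2, 3, 4, 5, 3, 4, 4, 3, 5, 4, 2, 3, 4, 5, 4, 3]
straightlined = [4] * 21

print(keep_response(varied, seconds=240))         # passes all three screens
print(keep_response(varied, seconds=60))          # too fast
print(keep_response(straightlined, seconds=240))  # patterned answers
```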

Data analysis method

PLS-SEM was used to assess the hypothesized structural relationships, including mediation and moderation effects among multiple latent constructs. PLS-SEM is particularly suitable for studies with moderate sample sizes, non-normally distributed data, and evolving theoretical frameworks88. It also enables the analysis of complex models without requiring multivariate normality or large samples, distinguishing it from covariance-based SEM89.

The analysis was performed using SmartPLS 4.0. To ensure data quality, diagnostic checks were conducted in SPSS 29.0. All variance inflation factor (VIF) values were below the recommended threshold of 5.0, confirming no multicollinearity88. The Shapiro–Wilk test was used to evaluate normality, and deviations supported the appropriateness of PLS-SEM for this dataset.

Measurement model evaluation followed established criteria85. Internal consistency reliability was assessed using Cronbach’s α and rho_A, both with thresholds of 0.70. Composite Reliability (CR) values above 0.70 and Average Variance Extracted (AVE) values above 0.50 indicated acceptable convergent validity. Discriminant validity was examined using the heterotrait–monotrait (HTMT) ratio, with values below 0.85 deemed acceptable85,88.
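For readers less familiar with these indices, the sketch below computes CR and AVE from standardized outer loadings and Cronbach’s α from raw item scores, using the standard textbook formulas. The loadings are hypothetical values chosen for illustration, not figures from this study, and SmartPLS may differ in implementation detail.

```python
from statistics import variance


def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1 - l ** 2 for l in loadings)
    return squared_sum / (squared_sum + error_variance)


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def cronbach_alpha(item_columns):
    """Alpha from a list of item columns (one list of respondent scores per item)."""
    k = len(item_columns)
    totals = [sum(row) for row in zip(*item_columns)]
    item_variance = sum(variance(col) for col in item_columns)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical three-item construct.
loadings = [0.85, 0.88, 0.90]
cr = composite_reliability(loadings)
ave = average_variance_extracted(loadings)

print(round(cr, 3), cr > 0.70)
print(round(ave, 3), ave > 0.50)
```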

The structural model was evaluated using R² and adjusted R² values to assess explanatory power. Mediation and moderation effects were tested using bias-corrected bootstrap resampling (5000 iterations), with significance assessed based on confidence intervals90. The path weighting scheme was applied with a convergence criterion of 10⁻⁵, following recommendations for stable model estimation88.
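The logic of bootstrap-based mediation testing can be illustrated with a simplified sketch. It uses a plain percentile interval rather than the bias-corrected interval reported here, works on observed rather than latent scores, and runs on simulated data; it is therefore an illustration of the resampling idea, not a reproduction of the SmartPLS procedure. All names and data are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def indirect_effect(x, m, y):
    """Standardized a*b: a is X -> M, b is M -> Y controlling for X."""
    r_xm, r_my, r_xy = pearson(x, m), pearson(m, y), pearson(x, y)
    a = r_xm
    b = (r_my - r_xy * r_xm) / (1 - r_xm ** 2)
    return a * b


def percentile_bootstrap_ci(x, m, y, n_boot=5000, alpha=0.05, seed=1):
    """Percentile confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(round(alpha / 2 * n_boot))]
    upper = estimates[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Simulated scores: X drives M, and M drives Y.
sim = random.Random(42)
x = [sim.gauss(0, 1) for _ in range(150)]
m = [0.6 * xi + sim.gauss(0, 1) for xi in x]
y = [0.5 * mi + sim.gauss(0, 1) for mi in m]

lower, upper = percentile_bootstrap_ci(x, m, y, n_boot=1000)
print(round(lower, 3), round(upper, 3))
```

A mediation effect is treated as significant when the interval excludes zero. The demonstration uses 1000 resamples to keep the run short; the default of 5000 matches the setting described above.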

To complement the linear estimations of PLS-SEM and uncover causal asymmetry and conjunctural patterns, this study employed fsQCA using fsQCA 3.1b software. FsQCA is particularly suitable for exploring equifinal mechanisms where multiple configurations of conditions can lead to the same outcome, thereby offering a richer causal interpretation beyond net effects24.

For five-point Likert scales, previous studies recommend direct calibration with values of 4, 3, and 2 to align thresholds with the semantic meaning of the scale and to avoid misclassifying mid-range responses as full members24,91. Percentile-based thresholds may distort meaning when distributions cluster near the midpoint. As shown in Table 2, our data exhibit moderate negative skewness and clustering around scale points 3–4, making percentile calibration particularly prone to inflating membership scores. In this study, we first computed continuous composite scores for each reflective construct by averaging its items (for example, PR1 to PR3 for PR). These scores were then calibrated through direct anchors. To preserve discrimination under the slightly skewed distributions in our data, we adopted a stricter 5–3–1 scheme, with 5 representing full membership, 3 the crossover point, and 1 full non-membership. Specifically, we employed the calibrate function in fsQCA 3.1b, which mapped the composite scores (range 1–5) onto fuzzy set membership values between 0 and 1 according to these anchors.
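The calibration step can be sketched as follows. The code re-implements the logistic direct method as it is commonly described (log-odds of +3 at the full-membership anchor, 0 at the crossover, and −3 at the full non-membership anchor); it is an illustrative approximation rather than the fsQCA 3.1b source, and the item scores are hypothetical.

```python
import math


def composite(items):
    """Composite score for a construct: the mean of its item scores."""
    return sum(items) / len(items)


def calibrate(score, full=5.0, crossover=3.0, non=1.0):
    """Direct-method calibration of a raw score into fuzzy membership (0 to 1)."""
    if score >= crossover:
        log_odds = 3.0 * (score - crossover) / (full - crossover)
    else:
        log_odds = 3.0 * (score - crossover) / (crossover - non)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical answers of one respondent to three items of one construct.
pr_score = composite([4, 4, 5])
pr_membership = calibrate(pr_score)

for raw in (1, 2, 3, 4, 5):
    print(raw, round(calibrate(raw), 3))
print(round(pr_score, 3), round(pr_membership, 3))
```

With the 5–3–1 anchors, a composite score of 5 maps to a membership of roughly 0.95, a score of 3 to exactly 0.5, and a score of 1 to roughly 0.05, so a score of 4 yields a clearly lower membership than it would under the 4–3–2 anchors.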

Table 2 Descriptive statistics

The analysis followed standard fsQCA procedures. A necessity analysis was first conducted using a consistency threshold of 0.90 to identify individually indispensable conditions. This was followed by truth table construction, applying a frequency threshold of 1 and a consistency cutoff of 0.80 to determine sufficient condition sets. Among the solution types produced, the parsimonious solution was selected for interpretation, as it retains only the most essential causal paths while minimizing redundancy24. This approach ensures theoretical clarity and facilitates robust cross-model comparison with the PLS-SEM results.
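The consistency and coverage measures that underlie these thresholds follow standard fuzzy-set formulas, sketched below on hypothetical membership scores for five cases. The variable names and values are illustrative only; truth table minimization is not shown.

```python
def fuzzy_and(*sets):
    """Membership in a configuration: the minimum across its conditions."""
    return [min(values) for values in zip(*sets)]


def overlap(x, y):
    """Sum of case-wise minima of two membership lists."""
    return sum(min(a, b) for a, b in zip(x, y))


def sufficiency_consistency(x, y):
    """Degree to which X is a subset of Y: overlap / sum(X)."""
    return overlap(x, y) / sum(x)


def necessity_consistency(x, y):
    """Degree to which Y is a subset of X: overlap / sum(Y)."""
    return overlap(x, y) / sum(y)


def coverage(x, y):
    """Share of the outcome accounted for by X: overlap / sum(Y)."""
    return overlap(x, y) / sum(y)


# Hypothetical calibrated memberships for five cases.
ci = [0.9, 0.8, 0.6, 0.3, 0.2]
pa = [0.8, 0.9, 0.4, 0.6, 0.1]
ebi = [0.9, 0.9, 0.5, 0.4, 0.3]

configuration = fuzzy_and(ci, pa)
print(round(sufficiency_consistency(configuration, ebi), 3))
print(round(coverage(configuration, ebi), 3))
print(round(necessity_consistency(ci, ebi), 3))
```

In the analysis described above, a single condition would be treated as necessary only if its necessity consistency reached 0.90, and a configuration would enter the solution only if its sufficiency consistency reached 0.80.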

Results

Descriptive analysis

To examine the overall distribution of the measured variables, descriptive statistics including means, standard deviations, skewness, and kurtosis were calculated (see Table 2). The mean scores for all items ranged from 3.5 to 4.2, with standard deviations between 0.6 and 0.9, indicating a moderately high and concentrated level of agreement among respondents. The absolute values of skewness and kurtosis were all below 1.0, suggesting no significant deviations from normality. These results support the assumption of approximate normal distribution, confirming the data’s suitability for subsequent PLS-SEM analysis involving path estimation and mediation testing.

Reliability analysis of constructs

According to Hair et al.88, internal consistency is acceptable when Cronbach’s α and rho_A exceed 0.70, while convergent validity is supported when CR > 0.70 and AVE > 0.50. Multicollinearity concerns are ruled out when all VIFs remain below 5.0. As shown in Table 3, all constructs met these standards: Cronbach’s α ranged from 0.822 to 0.906, rho_A from 0.839 to 0.925, CR from 0.893 to 0.940, and AVE from 0.735 to 0.840. VIF values (1.774–2.950) indicated no collinearity issues.

Table 3 Reliability, validity, and model fit evaluation

The explanatory power of the model was evaluated using R² values for the four endogenous constructs. According to the interpretive thresholds proposed by Hair et al.88 (0.75 substantial, 0.50 moderate, and 0.25 weak), PA (R² = 0.301) exceeds the weak threshold but falls short of the moderate one, whereas EBI (R² = 0.219), CRI (R² = 0.150), and CI (R² = 0.084) fall below the weak threshold, indicating limited yet still meaningful predictive relevance.

Construct validity verification

To assess the suitability of the dataset for factor analysis, we applied the KMO measure and Bartlett’s test of sphericity (see Table 4). A KMO value above 0.80 is generally considered meritorious92; the obtained value of 0.846 thus indicates meritorious sampling adequacy. Bartlett’s test yielded χ² = 5793.101 with df = 210 and p < 0.001, rejecting the null hypothesis that the correlation matrix is an identity matrix. Together, these results confirm that the data meet the statistical prerequisites for factor analysis.

Table 4 KMO and Bartlett’s test
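For reference, the standard form of Bartlett’s statistic and its degrees of freedom is given below, where n is the sample size, p the number of items, and R the item correlation matrix. The symbols are introduced here for illustration. The reported df = 210 corresponds to p = 21 items, which would match seven constructs with three items each; that item count is inferred from the df and is not stated in this section.

```latex
\chi^2 = -\left[(n - 1) - \frac{2p + 5}{6}\right] \ln\lvert R \rvert,
\qquad
df = \frac{p(p - 1)}{2}
\quad\Rightarrow\quad
\frac{21 \times 20}{2} = 210 .
```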

To verify the construct validity of the measurement model, principal component analysis with varimax rotation was conducted. This method is commonly employed to extract orthogonal factors and examine whether items cluster around theoretically expected dimensions. According to Hair et al.88, a factor loading of 0.70 or above is considered strong, while 0.60 is still acceptable in exploratory contexts. As shown in Table 5, seven components were extracted, matching the seven predefined constructs of the model.

Table 5 Factor analysis

All measurement items exhibited high loadings on their respective factors (range = 0.765–0.913), with no significant cross-loadings observed, indicating a clearly differentiated structure. Specifically, items for CC loaded on Component 1 (0.880–0.913), PA on Component 2 (0.793–0.852), EBI on Component 3 (0.800–0.847), CRI on Component 4 (0.790–0.872), CI on Component 5 (0.801–0.864), PR on Component 6 (0.765–0.858), and NC on Component 7 (0.769–0.810). The clean structure and absence of cross-loadings further support the discriminant validity and theoretical coherence of the measurement model.

Following standard PLS-SEM criteria, indicator reliability was assessed via standardized outer loadings. All items exceeded the recommended threshold of 0.70, with values ranging from 0.833 to 0.933. The standard errors were narrow, and the bootstrapped t-values fell between 32.026 and 109.477, all significant at p < 0.001 as reported in Table 6. No indicator fell within the 0.40 to 0.70 interval, so no removal was required. The interaction terms CC×NC and CC×PR were modeled as single-indicator constructs in the two-stage procedure; their loadings were fixed at 1.000 and were therefore excluded from inferential testing. Taken together with the CR and AVE values reported in Table 3, these results confirm satisfactory indicator reliability and convergent validity for all reflective constructs.

Table 6 Standardized outer loadings

Discriminant validity was further assessed using the HTMT, a method proposed by Henseler et al.93 that offers higher sensitivity in detecting discriminant validity issues than the traditional Fornell–Larcker criterion or cross-loading analysis. HTMT is the ratio of the average between-construct item correlations (heterotrait–heteromethod) to the geometric mean of the average within-construct item correlations (monotrait–heteromethod), with values below 0.85 generally considered acceptable. As shown in Table 7, all HTMT values ranged from 0.017 to 0.605, well below the conservative threshold of 0.85, thus confirming adequate discriminant validity among the latent constructs.

Table 7 Heterotrait–monotrait (HTMT) ratios among latent constructs
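The HTMT ratio for a pair of constructs can be sketched as follows. The item correlation matrix and the item groupings are hypothetical, and the function name `htmt` is introduced only for this illustration; the sketch uses raw correlations, whereas some software takes absolute values.

```python
from itertools import combinations, product
from math import sqrt


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def htmt(corr, items_a, items_b):
    """HTMT for two constructs given an item correlation matrix.

    corr: square matrix (list of lists) of item correlations.
    items_a, items_b: row/column indices of each construct's items.
    """
    # Average correlation between items of different constructs.
    hetero = _mean(corr[i][j] for i, j in product(items_a, items_b))
    # Average correlation among items of the same construct.
    mono_a = _mean(corr[i][j] for i, j in combinations(items_a, 2))
    mono_b = _mean(corr[i][j] for i, j in combinations(items_b, 2))
    return hetero / sqrt(mono_a * mono_b)


# Hypothetical correlations: items 0-1 measure one construct, items 2-3 another.
corr = [
    [1.0, 0.8, 0.3, 0.3],
    [0.8, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.7],
    [0.3, 0.3, 0.7, 1.0],
]
ratio = htmt(corr, [0, 1], [2, 3])
print(f"HTMT = {ratio:.3f}, below 0.85: {ratio < 0.85}")
```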

PLS-SEM results

To assess the structural relationships among the latent variables, PLS-SEM was employed, and path significance was evaluated via bootstrapping with 5000 resamples85. As shown in Table 8, the evaluation involved four statistical indicators: standardized path coefficients (reflecting the magnitude and direction of effects), standard errors (STDEV), t-values, and p values. A path is considered statistically significant if its t-value exceeds 1.96 and its p-value falls below 0.05 under a two-tailed test assumption.

Table 8 Path coefficients and significance for hypotheses testing

All hypothesized paths were statistically supported. Specifically, PR positively influenced CI (β = 0.230, t = 3.998, p < 0.001), and NC significantly enhanced PA (β = 0.275, t = 4.805, p < 0.001), confirming H1 and H2. Regarding moderation, CC strengthened the effect of NC on PA (H6: β = 0.154, t = 3.738, p < 0.001), while it negatively moderated the effect of PR on CI (H7: β = −0.115, t = 2.143, p = 0.032), indicating that the PR → CI relationship weakens as CC increases.

The mediating roles of CI and PA were also confirmed. CI significantly predicted both PA (β = 0.338, t = 6.346, p < 0.001) and CRI (β = 0.243, t = 4.227, p < 0.001), while PA positively influenced both CRI (β = 0.219, t = 4.032, p < 0.001) and EBI (β = 0.356, t = 6.591, p < 0.001). Additionally, CI had a direct effect on EBI (β = 0.191, t = 3.247, p = 0.001), and CC directly enhanced PA (β = 0.182, t = 4.353, p < 0.001), confirming the parallel pathways proposed in the model.
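The two-tailed significance rule can be checked against the reported t-values with a short sketch. It assumes a standard normal reference distribution, which is a close approximation for large samples; the software used in the study may rely on a t-distribution instead. The function name is hypothetical.

```python
from math import erfc, sqrt


def two_tailed_p(t_value):
    """Two-tailed p-value for a test statistic under a standard normal."""
    return erfc(abs(t_value) / sqrt(2.0))


# t-values reported above for H7 (CC x PR -> CI) and for CI -> EBI.
p_h7 = two_tailed_p(2.143)
p_ci_ebi = two_tailed_p(3.247)
print(f"H7: p = {p_h7:.3f}; CI -> EBI: p = {p_ci_ebi:.3f}")
```

Under this approximation the two values round to 0.032 and 0.001, matching the reported p-values.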

Mediation effects were examined using bootstrapping. As shown in Table 9, all 95% confidence intervals for indirect effects excluded zero, indicating the significance of the mediation paths. Specifically, in H3a, PR significantly influences EBI via CI (indirect effect = 0.045, t = 2.107, p = 0.035). In H3b, PR significantly promotes CRI through CI (indirect effect = 0.056, t = 2.870, p = 0.004). H4a reveals that NC promotes EBI through PA (indirect effect = 0.099, t = 3.461, p = 0.001), and H4b indicates that NC significantly impacts CRI through PA (indirect effect = 0.060, t = 2.972, p = 0.003). For H5a, the chain mediation path PR → CI → PA → EBI is confirmed to be significant (indirect effect = 0.028, t = 3.039, p = 0.002), and H5b demonstrates that the corresponding chain PR → CI → PA → CRI is also significant (indirect effect = 0.017, t = 2.510, p = 0.012).

Table 9 Mediation effect test
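As a consistency check, each indirect effect should approximately equal the product of the path coefficients along its chain. The sketch below recomputes these products from the coefficients reported in the PLS-SEM results; small discrepancies are expected because the published coefficients are rounded to three decimals. The dictionary and function names are hypothetical.

```python
# Standardized path coefficients as reported in the PLS-SEM results.
PATHS = {
    ("PR", "CI"): 0.230,
    ("NC", "PA"): 0.275,
    ("CI", "PA"): 0.338,
    ("CI", "CRI"): 0.243,
    ("PA", "CRI"): 0.219,
    ("PA", "EBI"): 0.356,
    ("CI", "EBI"): 0.191,
}

# Chains and indirect effects as reported in the mediation results.
REPORTED = {
    "H3a": (("PR", "CI", "EBI"), 0.045),
    "H3b": (("PR", "CI", "CRI"), 0.056),
    "H4a": (("NC", "PA", "EBI"), 0.099),
    "H4b": (("NC", "PA", "CRI"), 0.060),
    "H5a": (("PR", "CI", "PA", "EBI"), 0.028),
    "H5b": (("PR", "CI", "PA", "CRI"), 0.017),
}


def indirect_effect(chain):
    """Multiply the path coefficients along a chain of constructs."""
    effect = 1.0
    for source, target in zip(chain, chain[1:]):
        effect *= PATHS[(source, target)]
    return effect


for label, (chain, reported) in REPORTED.items():
    computed = indirect_effect(chain)
    print(f"{label}: {' -> '.join(chain)}  "
          f"product = {computed:.3f}, reported = {reported:.3f}")
```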

Figure 3 illustrates the PPB model developed in this study, highlighting the path relationships and corresponding coefficients among key variables, including PR, NC, CI, and PA. This diagram provides a clear visual representation of the interactions and strengths of influence between constructs, shedding light on their critical roles in shaping tourists’ behavioral intentions within the context of digital heritage experiences.

Fig. 3: Perception-place-behavior model.

Solid blue arrows indicate positive effects; dashed orange arrows indicate negative effects. Light-blue nodes = independent variables; orange = mediators; green = dependent variables; yellow = moderating variable. Coefficients are standardized; asterisks denote significance (*P < 0.05, **P < 0.01, ***P < 0.001).

fsQCA results

A necessity analysis was conducted to identify whether any single condition is indispensable for achieving the outcomes of interest. As shown in Table 10, the analysis examined both positive outcomes (EBI and CRI) and their negations (~EBI and ~CRI). A condition is considered necessary only if its consistency exceeds the threshold of 0.90, indicating that the outcome rarely occurs without the presence of that condition94.

Table 10 Necessary condition analysis

For EBI, the highest consistency values were observed in CI (0.897), PR (0.893), and PA (0.874), all falling short of the 0.90 benchmark. Similar patterns were found for CRI and its negation, where no condition met the threshold of necessity. These results confirm that none of the antecedents function as necessary conditions on their own.
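The necessity consistency compared against the 0.90 benchmark can be sketched with the standard fuzzy-set formula, in which the overlap between condition and outcome is divided by the total outcome membership. The membership scores below are hypothetical, not the study's calibrated data, and the function names are illustrative.

```python
def necessity_consistency(condition, outcome):
    """Share of the outcome's membership that is covered by the condition."""
    overlap = sum(min(x, y) for x, y in zip(condition, outcome))
    return overlap / sum(outcome)


def necessity_coverage(condition, outcome):
    """Relevance of the condition: overlap relative to condition membership."""
    overlap = sum(min(x, y) for x, y in zip(condition, outcome))
    return overlap / sum(condition)


# Hypothetical calibrated memberships for four cases.
ci = [0.9, 0.8, 0.6, 0.3]
ebi = [0.8, 0.9, 0.4, 0.5]
consistency = necessity_consistency(ci, ebi)
coverage = necessity_coverage(ci, ebi)
print(f"consistency = {consistency:.3f}, coverage = {coverage:.3f}, "
      f"necessary: {consistency >= 0.90}")
```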

Note: indicates the presence of a core condition; (non-bold) indicates a peripheral condition; indicates the absence (negation) of a condition; blank cells indicate an irrelevant or “do not care” status in the configuration.

Table 11 presents six sufficient configurations for achieving high levels of EBI. The overall solution demonstrates high consistency (0.916) and substantial coverage (0.829), both exceeding the recommended thresholds of 0.80 for consistency and 0.45 for coverage71, confirming the reliability of the solution.

Table 11 Configurational results for EBI

Configuration 1 consists of the presence of NC and PA as core conditions, with PR and CI as peripheral conditions. Configuration 2 includes PR and CRI as core conditions, while NC and CI act as peripheral conditions. Configuration 3 features NC and CRI as core conditions, with PR and PA included as peripheral conditions. Configuration 4 is composed of PR and NC as core conditions, and PA and CC as peripheral conditions. Configuration 5 contains no core conditions; PR, PA, CC, and CRI are present as peripheral conditions. Configuration 6 also presents no core conditions; NC, CI, PA, CC, and CRI are all included as peripheral conditions.

Table 12 presents seven configurations that are sufficient for achieving high levels of CRI. The overall solution demonstrates a high consistency of 0.914 and substantial solution coverage of 0.883, indicating that these configurations collectively account for the majority of cases with high CRI.

Table 12 Configurational results for CRI

Configuration 1 consists of the presence of PR as a core condition, with CC and the absence of CI (i.e., ~CI) as peripheral conditions. Configuration 2 includes no core conditions; PR, NC, and CI appear as peripheral conditions. Configuration 3 features PR and NC as core conditions, with PA as a peripheral condition. Configuration 4 includes PA and CC as core conditions, and CI as a peripheral condition. Configuration 5 presents no core conditions; PR appears as a peripheral condition, alongside the absence of NC (i.e., ~NC), PA (i.e., ~PA), and CC (i.e., ~CC). Configuration 6 contains no core conditions; CC is included as a peripheral condition, with the absence of NC (i.e., ~NC), CI (i.e., ~CI), and PA (i.e., ~PA). Configuration 7 features no core conditions; CI and PA are included as peripheral conditions, and both PR and NC are absent (i.e., ~PR and ~NC).

Discussion

This study employed a dual-method strategy, combining PLS-SEM and fsQCA, to examine how DTT relates to tourists’ sustainable behavioral intentions. By analyzing both net effects and configurational sufficiency, we identified dominant pathways as well as additional condition bundles, which suggests that the outcomes can be reached across heterogeneous subgroups within our sample. This section integrates the two sets of results to provide a layered interpretation of the perception–emotion–behavior chain and the moderating role of cultural capital.

The PLS-SEM analysis indicated two primary routes: a cognitive pathway (PR → CI) and an emotional pathway (NC → PA). Both were statistically significant, with PR associated with higher CI (β = 0.230, p < 0.001) and NC associated with higher PA (β = 0.275, p < 0.001). This pattern is consistent with Fan, Jiang, and Deng’s meta-analytic evidence that immersive AR/VR experiences improve appraisals and cognitive alignment in tourism contexts31. The fsQCA results mirror this distinction. For EBI, narrative-driven configurations (Configurations 1 and 3) consistently placed NC as a core condition, paralleling the PLS pattern that NC predicts PA and thereby supports environmental engagement. This result aligns with Chrysanthi et al.’s argument that when narrative structures are spatially embedded within heritage settings, they foster stronger affective bonding and situated emotional immersion35. For CRI, realism-driven and CC-stabilized configurations (Configurations 1, 3, and 6) emphasized PR (a core condition in Configurations 1 and 3) and CC (present in Configurations 1 and 6), resonating with the PLS pattern that realism is linked to normative respect. Thus, both methods converge on the pattern that emotional immersion tends to anchor environmental intentions, whereas cognitive recognition tends to underpin cultural respect intentions in this dataset.

While the net-effect model highlighted the prominence of NC → PA and PR → CI, fsQCA revealed additional sufficient combinations. For EBI, a realism–cognition pattern (Configurations 2 and 4) showed that PR, combined with ethical attitudes or weaker narrativity, could still be sufficient for behavioral activation. Similarly, multi-factor collaborations (Configurations 5 and 6) indicated that a mix of comparatively weaker conditions (e.g., PR, PA, CC) can jointly cross the sufficiency threshold. For CRI, emotion–cognition collaboration (Configurations 2, 4, and 7) suggested that CI and PA can compensate for limited PR or NC, while CC-stabilized pathways indicated that cultural capital may support respect intentions even when immersive cues are weak. Taken together, these findings are consistent with causal plurality: beyond dominant net-effect routes, redundant and compensatory bundles can also produce the outcome under our calibration. This helps explain why digital heritage interventions may remain effective even when technical realism or narrative intensity is modest—other conditions can fill the gap.

A notable result concerns the directionally opposed moderation of CC. In the PLS-SEM model, CC negatively moderated the PR → CI path (β = −0.115, p = 0.032) but positively moderated the NC → PA path (β = 0.154, p < 0.001). This asymmetry is consistent with museum and heritage learning studies showing that technology-forward presentations can induce technology overload and critical distancing when provenance or interpretation is opaque19. A broader systematic review likewise cautions that preservation technologies may distort intended meanings unless balanced by interpretability and authenticity scaffolds62. The fsQCA solutions reinforce this duality: CC-stabilized configurations for CRI (Configurations 1, 4, and 6) suggest that configurations including CC were associated with higher CRI even when dominant perceptual or affective cues were limited.

Why CC weakens the realism–cognition route. Psychological accounts suggest that higher CC shifts audiences from heuristic acceptance to accuracy-oriented, analytic processing (dual-process and elaboration-likelihood perspectives). Confronted with photorealistic scenes, high-CC visitors engage epistemic vigilance and authenticity norms: checking provenance, comparing rendered details with prior knowledge, and probing gaps between technical realism and historical/cultural authenticity. In this effortful mode, realism functions less as a persuasive shortcut and more as a credibility test, attenuating PR’s ability to translate perception into identity unless paired with verifiability scaffolds (e.g., source annotations, version histories, uncertainty cues). Expertise research also indicates that knowledgeable audiences prioritize evidential grounding over surface spectacle, which can produce aesthetic distance when verification is costly or opaque.

Why CC strengthens the narrative–affect route. Conversely, CC provides denser schema networks that make plots, symbols, and rituals easier to decode. When narrative cues fit these schemata, coherence increases and ambiguity resolves with less effort, facilitating narrative transportation and empathic involvement. In Bourdieu’s account20, cultural capital operates as a meaning “decoder,” enabling efficient parsing of symbols and plots and thereby amplifying narrative effects. Hence, CC does not globally suppress affect; it re-channels affect away from surface realism toward meaning-laden stories, deepening place attachment when narrativity is strong.
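The opposed moderation effects can be made concrete as conditional (simple) slopes. The sketch below assumes that CC is standardized, so that a one-standard-deviation change in CC shifts each path by its interaction coefficient; this is a common reading of two-stage interaction terms but is an assumption here, and the function name is hypothetical. The coefficients are those reported in the PLS-SEM results.

```python
def conditional_slope(main_effect, interaction, moderator_z):
    """Effect of the predictor at a given standardized moderator level."""
    return main_effect + interaction * moderator_z


# NC -> PA (beta = 0.275) moderated by CC (interaction = 0.154).
nc_pa_low = conditional_slope(0.275, 0.154, -1.0)
nc_pa_high = conditional_slope(0.275, 0.154, 1.0)

# PR -> CI (beta = 0.230) moderated by CC (interaction = -0.115).
pr_ci_low = conditional_slope(0.230, -0.115, -1.0)
pr_ci_high = conditional_slope(0.230, -0.115, 1.0)

print(f"NC -> PA: low CC = {nc_pa_low:.3f}, high CC = {nc_pa_high:.3f}")
print(f"PR -> CI: low CC = {pr_ci_low:.3f}, high CC = {pr_ci_high:.3f}")
```

Under this assumption, the narrative route roughly triples in strength from low to high CC (0.121 to 0.429), while the realism route falls to about a third (0.345 to 0.115).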

The PLS-SEM model also indicated a sequential mediation (PR → CI → PA → behavioral intention; H5a, H5b). Although the indirect effects were modest (H5a = 0.028, p = 0.002; H5b = 0.017, p = 0.012), the sequence is consistent with a layered process whereby technical perception aligns identity, which then consolidates into emotional attachment and, in turn, relates to intention. This mediating role of place attachment echoes evidence that attachment transmits perceptual appraisals to pro-environmental behavior in tourism settings48. This resonates with accounts of multi-layered place attachment and situates technology as a cognitive primer. fsQCA complements this by showing that such layering is not the only viable mechanism: configurations indicating the absence of CI (~CI) were still sufficient for high CRI when PR was strong, and emotion–cognition bundles could substitute for missing realism.

The necessary-condition check (consistency threshold = 0.90) did not identify PR, NC, CI, PA, or CC as necessary for either EBI or CRI under our calibration. This aligns with the sufficiency-focused results: no single must-have factor was detected. Instead, high levels of both intentions arose through alternative combinations. Theoretically, this supports an equifinality view in which multiple distinct pathways can reach the same outcome. Practically, it cautions against single-cue optimization (e.g., hyper-realism) and favors orchestration of complementary features that can substitute for or reinforce one another.

Although modeled within the same PPB framework, the mechanisms behind EBI and CRI differ in emphasis. EBI is more closely tied to the narrative–attachment route, where NC fosters PA and affective bonds support environmental action. CRI is more closely linked to the realism–identity route, where PR enhances CI and cognitive alignment supports normative respect. The fsQCA evidence strengthens this differentiation, and the narrative emphasis is consistent with findings on place-based digital storytelling that highlight the affective power of spatially grounded narratives32. In this study, these patterns support retaining the two intentions as distinct outcomes rather than collapsing them into a higher-order construct. In practice, DTT platforms may emphasize immersive storytelling and emotional transport to cultivate EBI, while strengthening verifiable realism and provenance cues to foster CRI. These design implications are also in line with cautions on authenticity management under technological intensification19,62.

These findings elaborate on TAM by proposing a PPB lens that captures both linear net-effect patterns and configurational sufficiency. Unlike TAM’s traditional focus on perceived ease and usefulness, the PPB perspective highlights that realism, narrativity, emotion, and symbolic identity jointly shape sustainable heritage intentions, showing that affective bonding and symbolic coherence can be as influential as utilitarian assessments. The study also refines Bourdieu’s20 notion of CC by demonstrating its dual role: as a decoder along narrative routes, CC strengthens schema-based comprehension and symbolic immersion, while as a filter along realism-to-cognition routes, it raises verification demands and reduces reliance on visual fidelity, making realism less persuasive unless supported by provenance cues. Finally, combining PLS-SEM and fsQCA illustrates methodological complementarity: PLS identifies dominant associations and mediated sequences such as PR → CI → PA → intention, whereas fsQCA reveals equifinal and compensatory configurations such as CC-stabilized pathways that sustain CRI even when affective or perceptual cues are weak. Together, these insights provide a richer account of how heterogeneous visitors interpret and act on digital heritage experiences and underscore the importance of mixed-method validation in contexts where behavioral causality is distributed across multiple routes.

From a practical standpoint, the results suggest that digital heritage platforms should adopt adaptive presentation strategies. At the technical application level, offering a dual-mode system—“expert mode” for high-CC users (featuring in-depth historical content, source annotations) and “story mode” for low-CC users (featuring gamified interactions, simplified storylines)—could enhance both engagement and educational outcomes. Additionally, integrating PPB-based behavior prediction models into smart heritage site management systems can support dynamic resource allocation and tailored communication. The fsQCA findings on “multi-factor collaboration” highlight that even weak individual signals can collectively generate behavioral outcomes. Hence, designing composite interventions that combine modest realism, engaging narrative, and symbolic cues may be more effective than over-reliance on any single element.

This study has several limitations that should be acknowledged. First, the data are based on cross-sectional, self-reported surveys, which restrict causal inference. Accordingly, the directional relationships discussed in this study should be interpreted as theoretically informed associations rather than definitive causal effects. Such designs may also be influenced by social desirability and common method variance. Longitudinal or experimental designs would provide stronger evidence of temporal dynamics and causal mechanisms. Second, the survey was conducted in technologically advanced Lingnan heritage sites. These contexts may amplify the salience of digital twin technology, limiting the transferability of findings to regions with less advanced infrastructure, oral traditions, or different cultural norms. Third, the fsQCA findings are sensitive to calibration and threshold choices. While we adopted a widely used 5–3–1 direct calibration and reported solution consistency and coverage, alternative anchors could alter peripheral configurations, especially those involving weaker conditions. Fourth, the dependent variables captured behavioral intentions rather than actual behaviors. Linking survey data with behavioral traces (e.g., digital interaction logs, donation or volunteering records) would improve ecological validity. Finally, the moderation patterns of cultural capital, though statistically supported, were not tested for measurement invariance across subgroups. Future research using multi-group models, longitudinal validation, or complementary qualitative inquiry could further confirm the stability and interpretive depth of these effects.

The study also opens several directions for future research. Because it focused on Lingnan heritage in urbanizing China, the findings may not generalize to oral cultures or low-tech settings. Cross-cultural studies could assess whether the same configurations hold in contexts with different cultural norms or infrastructural conditions. Additionally, with the rise of Artificial Intelligence Generated Content (AIGC), the boundaries of authorship, authenticity, and interaction are shifting. Future work could investigate how AI-generated narratives alter the perception–emotion–behavior chain, whether algorithmic personalization reshapes place attachment, and how narrative “truth” is negotiated in human–machine collaborations. These questions could further expand the PPB model’s relevance in an age of intelligent cultural mediation.