Introduction

Cultural landscapes are the result of long-term interactions between human activities and the natural environment. They reflect social transformations and environmental characteristics across different historical periods1. As an essential part of cultural landscapes, traditional towns preserve unique spatial structures and architectural styles2,3. They also carry rich historical and cultural heritage, serving as important sites for studying the interaction between human settlements and the environment.

As tourist demands diversify, the factors influencing visitor experiences have also become increasingly varied. Based on geographical features, this study classifies China’s traditional towns into four types: water towns, original ecological towns, commercial and trade-oriented towns, and religious and cultural towns. Water towns, such as Zhouzhuang4 (Fig. 1a), are known for their dense water networks and stone bridges over flowing rivers, creating a unique waterfront cultural atmosphere. Original ecological towns, such as Hongcun5 (Fig. 1b), emphasize harmony between humans and nature. They preserve the original village layout and ecological characteristics. Commercial and trade-oriented towns, such as Pingyao Ancient City6 (Fig. 1c), were historically trade centers. They feature well-preserved market districts and traditional architectural complexes. Religious and cultural towns, such as Shangqing Ancient Town7 (Fig. 1d), attract tourists with their religious architecture, ritual activities, and strong cultural atmosphere. These traditional towns exhibit significant differences in landscape features and cultural connotations. As a result, they influence tourists’ perceptions and behavior patterns in distinct ways. However, quantitative research on how different types of traditional town landscape features shape the tourist experience remains limited.

Fig. 1: Examples of four typical traditional town types in China.

a Water town. b Original ecological town. c Commercial and trade-oriented town. d Religious and cultural town.

In recent years, social media has become an important platform for tourists to record and share their travel experiences. Tourists share text, images, and ratings on social media, which reflect their emotional experiences, key interests, and landscape perceptions8,9. Compared to traditional surveys and interviews, social media data offers advantages such as large scale, real-time updates, and rich information. However, social media data is often unstructured, and its heterogeneity across modalities makes it difficult to use and integrate effectively. Text data typically describes tourists’ subjective experiences, such as cultural atmosphere, comfort, and accessibility, while image data primarily presents architectural styles, natural landscapes, and religious elements. Single-modality data struggles to capture the full scope of tourists’ experiences. Therefore, multimodal learning has been increasingly applied to cultural heritage research in recent years, enhancing the comprehensiveness and accuracy of feature extraction by integrating data from different modalities. However, the fusion of multimodal data still faces challenges such as data heterogeneity, information redundancy, and modality misalignment. There is an urgent need for more systematic methods to optimize these processes.

The use of machine learning in cultural heritage and tourism research has steadily expanded, particularly in areas such as tourist sentiment analysis, behavior prediction, and experience evaluation10,11,12. However, many machine learning models remain black boxes that lack transparent interpretability, which limits researchers’ understanding of how various landscape features influence tourists’ perceptions13. Therefore, explainable machine learning has increasingly attracted attention. Among explainability methods, SHAP (SHapley Additive exPlanations) is an approach based on game theory that quantifies the contributions of different features to model prediction outcomes14. Although SHAP has been widely applied in fields such as medicine, finance, and environmental science15,16,17,18, its application in cultural heritage and tourism research remains limited. SHAP can help quantify the impact of landscape features (such as environment, infrastructure, and experience) on tourist ratings and sentiment values, thereby providing scientific evidence for traditional town landscape optimization and tourism management.

This study uses XGBoost (eXtreme Gradient Boosting) as a regression model, combined with the SHAP method, to quantify the impact of various landscape features on tourist ratings and sentiment values19. XGBoost, with its powerful nonlinear modeling capabilities, robustness to missing data, and good generalization performance, has become an important tool for handling large-scale, multidimensional data. In multimodal data analysis, XGBoost can effectively integrate text and image information20,21. When combined with the SHAP method, it reveals which landscape features have the greatest impact on tourist ratings and sentiment values, along with their underlying mechanisms22. This method not only provides high-accuracy prediction results but also enhances the interpretability of the research, offering scientific evidence for cultural landscape management and tourism optimization.

Based on the above research background, this study aims to answer the following questions:

(1) How can a multimodal labeling system suitable for traditional towns be constructed, and how can an effective feature analysis framework be established using XGBoost and SHAP?

(2) How does the multimodal fusion of text and images quantify the impact of landscape features on tourist ratings and sentiment values? What are the advantages and complementarities of this fusion?

(3) What are the key differences in tourist landscape perception and evaluation across different types of traditional towns?

Methods

This study collects text and image review data on traditional towns in China from Ctrip. After data preprocessing, a two-level tagging system is established. By converting text and images into vectors, cosine similarity is calculated to perform tag mapping. To fully utilize multimodal information, tag data from text and images are further fused. XGBoost and SHAP are used to analyze the impact of different tags on sentiment values and ratings. This quantifies the importance of various features in tourist evaluations. The research framework of this study is shown in Fig. 2.

Fig. 2: Research framework diagram.

Study area and data

China has many traditional towns with long histories and diverse styles. They carry rich cultural heritage and distinctive landscape features, making them an ideal case for exploring tourist perceptions and the multidimensional characteristics of traditional towns. Ctrip (https://www.ctrip.com), one of China’s most widely used travel booking and review platforms, has accumulated a large volume of authentic tourist feedback. It provides a reliable data source for this study. This study collects tourist review data on traditional towns from Ctrip, covering the period from January 1, 2015, to January 31, 2025. The dataset includes both textual reviews and user-uploaded images, enabling a multimodal representation of visitor perceptions. To ensure data representativeness, we included only traditional towns with at least 40 reviews. As a result, we obtained valid data from 114 traditional towns. To ensure the scientific validity and accuracy of traditional town classification, the study invited eight university researchers in the field of landscape design to review and verify the classifications. Based on their geographical environment, cultural background, and functional characteristics, the traditional towns were classified into four categories: ecological traditional towns (n = 47), waterfront traditional towns (n = 21), commercial traditional towns (n = 28), and religious cultural traditional towns (n = 18) (Fig. 3). During data preprocessing, invalid and duplicate reviews were removed. As a result, the study obtained 131,596 valid text reviews and 382,026 images uploaded by tourists. These data are used for subsequent multimodal data fusion and SHAP-based interpretability analysis to quantify the impact of landscape features in different types of traditional towns on tourist perceptions.

Fig. 3: Study area.

Data Preprocessing

To ensure data standardization and usability, the text and image data collected from Ctrip must undergo preprocessing. In the text preprocessing stage, reviews with fewer than 10 characters are removed to filter out invalid data. The text content is cleaned by removing HTML tags and special characters. During word segmentation, the Jieba segmentation tool is used to split the text23. Stop words and punctuation are removed to reduce noise and improve data quality24. In the image preprocessing stage, all images are resized to 640×640 pixels to ensure uniform dimensions. Random horizontal flipping and a maximum 10-degree rotation are applied for data augmentation25,26. This enhances data diversity and improves model generalization. A hash algorithm is used to detect and remove duplicate images, preventing data redundancy27.
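The cleaning and de-duplication steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: Jieba segmentation, stop-word removal, image resizing, and augmentation are not shown; the function names are hypothetical; applying the 10-character filter after cleaning is an assumption; and SHA-256 stands in for the unnamed hash algorithm, so only byte-identical images are detected.

```python
import hashlib
import re

MIN_CHARS = 10  # reviews shorter than this are dropped, as stated in the text

TAG_RE = re.compile(r"<[^>]+>")      # HTML tags
SPECIAL_RE = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace


def clean_review(text):
    """Strip HTML tags and special characters; return None if the result is too short.

    Assumption: the length filter is applied after cleaning.
    """
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) >= MIN_CHARS else None


def drop_duplicate_images(images):
    """Keep the first occurrence of each byte-identical image.

    `images` maps a file name to raw image bytes (hypothetical layout).
    SHA-256 is an assumption; the text does not name the hash algorithm.
    """
    seen = set()
    unique = {}
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = data
    return unique


cleaned = clean_review("<p>Lovely canals, stone bridges!</p>")
too_short = clean_review("<b>nice</b>")
unique_images = drop_duplicate_images({"a.jpg": b"x", "b.jpg": b"x", "c.jpg": b"y"})
```

A perceptual hash would also catch resized or re-encoded copies; whether the study used one is not stated.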

Definition of the traditional town tagging system

This subsection establishes a two-level tagging system for traditional towns based on their landscape characteristics, cultural background, and tourist experiences. The first-level tags include environment, infrastructure, and experience. The second-level tags are categorized as follows: environment includes water, vegetation, and topography; infrastructure includes architecture and accessibility; and experience includes aesthetics, comfort, culture, and consumption.
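For readers who want to work with the tagging system programmatically, it could be represented as a simple mapping. The variable names below are illustrative assumptions; the tag names themselves are taken from the paragraph above.

```python
# Two-level tagging system: first-level tag -> second-level tags.
TAG_SYSTEM = {
    "environment": ["water", "vegetation", "topography"],
    "infrastructure": ["architecture", "accessibility"],
    "experience": ["aesthetics", "comfort", "culture", "consumption"],
}

# Flat list of second-level tags, in a fixed order usable as label-vector positions.
SECOND_LEVEL_TAGS = [tag for tags in TAG_SYSTEM.values() for tag in tags]

# Reverse lookup: second-level tag -> first-level tag.
PARENT_OF = {tag: parent for parent, tags in TAG_SYSTEM.items() for tag in tags}
```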

Text sentiment analysis

This study uses the BosonNLP tool to conduct sentiment analysis on tourists’ text reviews, extracting continuous sentiment scores28 that range from 0 to 1. Higher values indicate more positive emotions, while lower values indicate more negative emotions. To facilitate subsequent modeling and interpretation, the sentiment scores are divided into three intervals: positive sentiment (0.8–1.0), neutral sentiment (0.4–0.8), and negative sentiment (0–0.4), which characterize the emotional tendencies expressed by tourists in their reviews.
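The three-way binning can be written as a small function. The intervals come from the text; how the shared boundaries are assigned is an assumption here (0.8 is treated as positive and 0.4 as neutral), and the function name is hypothetical.

```python
def sentiment_category(score):
    """Map a sentiment score in [0, 1] to 'negative', 'neutral', or 'positive'.

    Assumption: boundary values belong to the upper interval
    (0.4 -> neutral, 0.8 -> positive); the text does not specify this.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("sentiment score must lie in [0, 1]")
    if score >= 0.8:
        return "positive"
    if score >= 0.4:
        return "neutral"
    return "negative"
```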

Sentiment scores, together with the ratings actively given by tourists on the platform, constitute the satisfaction evaluation system in this study. The rating represents an explicit judgment of the overall experience by tourists and is a discrete variable ranging from 1 to 5. The sentiment score is derived from the naturally expressed emotions in the review text and, as a continuous variable, offers greater granularity in capturing nuanced differences.

The inclusion of sentiment scores helps to identify potential discrepancies between ratings and emotions. For example, some reviews have high ratings but express negative emotions in the text, while others have low ratings but use positive language. Such discrepancies between ratings and sentiment provide a methodological basis for analyzing perception gaps and identifying key influencing factors in subsequent analysis. They also offer targeted supplementary information for the optimization of cultural landscape spaces.
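One way to flag such discrepancies is sketched below. The sentiment thresholds (0.4 and 0.8) follow the intervals defined earlier; the rating cut-offs (4 stars or more as "high", 2 stars or fewer as "low") and the function name are assumptions made for illustration only.

```python
def rating_sentiment_gap(rating, sentiment):
    """Label a review by whether its star rating and text sentiment disagree.

    Assumptions: rating >= 4 counts as high and rating <= 2 as low;
    sentiment < 0.4 is negative and sentiment >= 0.8 is positive.
    """
    if rating >= 4 and sentiment < 0.4:
        return "high rating, negative text"
    if rating <= 2 and sentiment >= 0.8:
        return "low rating, positive text"
    return "consistent"
```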

Text label mapping

To extract key information from tourist reviews, this section applies a label mapping method. Text reviews are matched with the predefined tagging system for traditional towns, and label vectors are computed. The label vector serves as a key input for the subsequent SHAP analysis. It quantifies the features that tourists focus on. This section employs the SentenceTransformer model for label mapping29. Text reviews are converted into embedding vectors, and their similarity to label text is calculated to determine the relevant feature categories.

(1) Embedding representation of text and labels

The goal of text embedding is to convert textual information into fixed-dimensional vector representations. This captures semantic features, enabling similarity computation in the vector space. This section uses SentenceTransformer for text vectorization. Each review and label text is mapped into the same high-dimensional space to facilitate label matching30.

Let the embedding representation of the review text be \({V}_{c}\) and the embedding representation of the label text be \({V}_{t}\). The embedding calculation formula is as follows:

$${V}_{c}=f({T}_{c}),\,{V}_{t}=f({T}_{t})$$
(1)

\(f(\cdot )\) represents the encoding function of SentenceTransformer, while \({T}_{c}\) and \({T}_{t}\) correspond to the input review text and label text, respectively.

(2) Similarity calculation

The core of label mapping is to calculate the semantic similarity between review text and label text in the embedding space. This section uses cosine similarity for calculation. This method measures text similarity based on the angle between vectors. Its value ranges from -1 to 1. A value closer to 1 indicates a higher semantic similarity between the text and the label31. The calculation formula is as follows:

$$S({T}_{c},{T}_{t})=\frac{{V}_{c}\cdot {V}_{t}}{\parallel {V}_{c}\parallel \parallel {V}_{t}\parallel }$$
(2)

\({V}_{c}\cdot {V}_{t}\) represents the dot product of the embedding vectors of the review text and the label text. \(\parallel {V}_{c}\parallel\) and \(\parallel {V}_{t}\parallel\) are the vector norms of the review text and label text, respectively. To ensure the similarity values range from 0 to 1 for better interpretability, the results are normalized using the following transformation formula:

$${S}^{{\prime} }=\frac{S+1}{2}$$
(3)

(3) Label matching

After calculating similarity, a threshold needs to be set for label matching32. This ensures that the extracted labels are highly relevant to the review content. This study refers to previous research33,34,35 and selects 0.6 as the similarity threshold. This value is chosen by balancing precision and recall.

When the similarity between a review text and a label is 0.6 or higher, the label is considered relevant to the review. Each review is ultimately mapped to a label vector. Each element in the vector corresponds to a label. If a label matches the review, it is assigned a value of 1; otherwise, it is assigned 0.
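Steps (2) and (3) can be illustrated with a short, dependency-free sketch. The toy two-dimensional vectors stand in for SentenceTransformer embeddings, the function names are hypothetical, and the sketch assumes that the 0.6 threshold is applied to the normalized similarity \({S}^{{\prime} }\) of Eq. (3), which the text does not state explicitly.

```python
import math

THRESHOLD = 0.6  # similarity threshold stated in the text


def cosine_similarity(u, v):
    """Eq. (2): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def normalized_similarity(u, v):
    """Eq. (3): rescale cosine similarity from [-1, 1] to [0, 1]."""
    return (cosine_similarity(u, v) + 1.0) / 2.0


def label_vector(review_vec, label_vecs, threshold=THRESHOLD):
    """Return a 0/1 entry per label, in the insertion order of `label_vecs`.

    Assumption: the threshold is compared against the normalized similarity.
    """
    return [
        1 if normalized_similarity(review_vec, vec) >= threshold else 0
        for vec in label_vecs.values()
    ]


# Toy embeddings (placeholders, not real model output).
labels = {"water": [1.0, 0.0], "culture": [0.0, 1.0], "comfort": [-1.0, 0.0]}
review = [1.0, 0.0]
matched = label_vector(review, labels)
```

Under this assumption, \({S}^{{\prime} }\ge 0.6\) is equivalent to a raw cosine similarity of \(S\ge 0.2\), since \({S}^{{\prime} }=(S+1)/2\).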

Image label mapping

To systematically identify image content and assign semantic labels to each traditional town image, this section employs the CLIP (Contrastive Language-Image Pre-training) model36. This enables semantic mapping between traditional town images and the predefined tagging system32. By calculating the semantic similarity between images and label texts, the correspondence between images and labels is determined.

Image embedding generation: The image encoder in the CLIP model extracts high-dimensional features from each input image \(P\), generating the image embedding vector \({V}_{p}\). To facilitate similarity calculation, the vector is normalized to ensure a unit length constraint:

$${V}_{p}^{{\prime} }=\frac{{V}_{p}}{\parallel {V}_{p}\parallel }$$
(4)

Label embedding generation: The CLIP model’s text encoder converts predefined second-level label texts into high-dimensional embedding vectors. These vectors are then normalized to ensure scale consistency, facilitating subsequent similarity calculations.

Similarity calculation: Cosine similarity is used to measure the similarity between each image embedding vector and all label embedding vectors. This determines the semantic relevance between images and label texts. The similarity calculation follows the same method as described in the section “Text Label Mapping”, resulting in a set of assigned labels \({L}_{p}\).
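Because both embeddings are scaled to unit length, the cosine similarity of Eq. (2) reduces to a plain dot product in this setting. Writing \({V}_{t}^{{\prime} }\) for the normalized label embedding (a symbol introduced here only for illustration):

$$S(P,{T}_{t})=\frac{{V}_{p}\cdot {V}_{t}}{\parallel {V}_{p}\parallel \parallel {V}_{t}\parallel }={V}_{p}^{{\prime} }\cdot {V}_{t}^{{\prime} },\qquad {V}_{t}^{{\prime} }=\frac{{V}_{t}}{\parallel {V}_{t}\parallel }$$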

Multimodal label integration

This subsection employs a multimodal fusion method to integrate the label information from text and images, enhancing the representation of traditional town features. Text and images each have their strengths in presenting information. Text reflects tourists’ subjective experiences, whereas images convey intuitive visual features. Single modalities may be prone to information loss or bias. Therefore, integrating labels from both text and images improves data completeness and enhances model interpretability37.

The calculation method for integrating labels is as follows:

$${L}_{f}={L}_{c}\vee {L}_{p}$$
(5)

\({L}_{c}\) and \({L}_{p}\) represent the label values for text and images, respectively, and \({L}_{f}\) is the final fused label.
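Eq. (5) is an element-wise logical OR over binary label vectors. The sketch below assumes one text label vector and one image label vector per review, in the same label order; how labels from several images of one review are combined is not described in the text. The function names are hypothetical.

```python
def fuse_labels(text_labels, image_labels):
    """Eq. (5): a fused label is 1 if the text label or the image label is 1."""
    if len(text_labels) != len(image_labels):
        raise ValueError("label vectors must have the same length")
    return [t | p for t, p in zip(text_labels, image_labels)]


def coverage(label_rows, index):
    """Percentage of reviews whose label at position `index` is set."""
    return 100.0 * sum(row[index] for row in label_rows) / len(label_rows)


# Toy example with three labels, e.g. [water, vegetation, aesthetics].
fused = fuse_labels([1, 0, 0], [0, 0, 1])
toy_rows = [[1, 0], [0, 0], [1, 1], [0, 0]]
first_label_coverage = coverage(toy_rows, 0)
```

Since OR can only add labels, the fused coverage of any label is at least as high as its coverage in either single modality.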

Feature importance analysis based on SHAP

This study uses the SHAP method to analyze the influence of different features on tourist ratings and sentiment scores. SHAP is model-agnostic and offers sample-level interpretability, enabling it to reveal feature influences on predictions at both global and individual levels.

The input data in this study consists of multimodal features combining text and images. Tourist perception outcomes may involve nonlinearity and interaction effects, making traditional linear regression limited in both modeling capability and interpretability. Therefore, we use XGBoost to build the prediction model and apply the SHAP method to interpret its outputs. This method provides both global feature ranking and sample-level explanations, making it particularly suitable for exploring perceptual differences among tourists across various types of traditional towns in this study.

In terms of model selection, this study uses XGBoost for regression analysis. XGBoost is widely used in large-scale, multidimensional data analysis due to its strong nonlinear modeling capabilities, excellent overfitting resistance, and robustness to missing values. Especially in social media data analysis scenarios, XGBoost can effectively handle text, image, and fused features, improving prediction accuracy and stability. This section first trains the model using XGBoost and then uses the SHAP method to calculate the contribution of text, image, and fused features to the prediction of tourist ratings and sentiment values.

The SHAP value calculation for feature \(i\) is as follows38:

$${\psi }_{i}=\sum _{{\rm{T}}\subseteq {\rm{M}}\setminus \{i\}}\frac{|{\rm{T}}|!\,(|{\rm{M}}|-|{\rm{T}}|-1)!}{|{\rm{M}}|!}\,[g({\rm{T}}\cup \{i\})-g({\rm{T}})]$$
(6)

\({\rm{M}}\) is the set of all features, \({\rm{T}}\) is the subset of features excluding feature \(i\), and \(g({\rm{T}})\) represents the model prediction value using only the feature set \({\rm{T}}\). This formula takes into account the influence of the feature in different combinations, ensuring the rationality and interpretability of feature importance.
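To make Eq. (6) concrete, the sketch below evaluates it by brute-force enumeration of all subsets for a three-feature toy value function. This is not how SHAP values are computed for tree models in practice (dedicated tree algorithms avoid the exponential enumeration), and the value function `toy_g` with its coefficients is entirely hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, g):
    """Exact Shapley values by enumerating every subset T of M without feature i.

    `g` maps a frozenset of feature names to a model output, as in Eq. (6).
    """
    m = len(features)
    values = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(m):  # |T| runs from 0 to |M| - 1
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                t = frozenset(subset)
                total += weight * (g(t | {i}) - g(t))
        values[i] = total
    return values


def toy_g(subset):
    """Hypothetical value function with one interaction term (illustration only)."""
    value = 0.0
    if "comfort" in subset:
        value += 0.3
    if "aesthetics" in subset:
        value += 0.2
    if "water" in subset and "comfort" in subset:
        value += 0.1  # interaction shared equally between water and comfort
    return value


FEATURES = ["water", "comfort", "aesthetics"]
phi = shapley_values(FEATURES, toy_g)
```

The values sum to `toy_g(all features) - toy_g(empty set)`, which is the efficiency property that makes SHAP contributions additive.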

Results

Data statistics

This study collected tourist review data from 114 traditional towns in China, covering four dimensions: text reviews, image information, user ratings, and sentiment values. After preprocessing and cleaning, a total of 131,596 valid text reviews and 382,026 review images were obtained. The distributions of sentiment values and user ratings are shown in Fig. 4.

Fig. 4: Distribution of sentiment values and user ratings for traditional town reviews.

a Sentiment value distribution. b User rating distribution.

As shown in Fig. 4a, the sentiment values of traditional town reviews are concentrated at the high end of the scale, giving a noticeably left-skewed (negatively skewed) distribution. Reviews in the 0.9–1.0 range account for 62.04%, the highest proportion among all intervals. After classifying sentiment values into three categories, negative (0–0.4), neutral (0.4–0.8), and positive (0.8–1.0), the results show a clear pattern. Positive reviews dominate, accounting for 68.33%. Negative reviews make up 20.21%. Neutral reviews account for only 11.46%. This indicates that tourists generally have a positive perception of traditional towns. Although some negative sentiment exists, its proportion is relatively low.

Figure 4b shows the distribution of user ratings. Ratings of 5 stars account for 67.17%, while 4-star ratings make up 21.88%. Together, they represent nearly 90%, indicating that most tourists are highly satisfied with traditional towns. Ratings of 3 stars and below are relatively rare. 3-star ratings account for 7.37%, while 1-star and 2-star ratings make up only 2.29% and 1.29%, respectively. Overall, both sentiment values and user ratings show a clear positive trend in tourists’ evaluations of traditional towns.

Label mapping statistics

To examine the differences and complementarities between text and images in representing traditional town features, as well as the effectiveness of multimodal data integration, this section conducts a statistical analysis of the label mapping results. The label mapping results for text, images, and fused data are shown in Fig. 5.

Fig. 5: Statistical analysis of traditional town label mapping.

Text data: In text reviews, aesthetics (49.27%), architecture (43.03%), and accessibility (31.86%) are the most prevalent among infrastructure and experience-related labels. This indicates that tourists are more concerned with the architectural style, visual appeal, and transportation convenience of traditional towns. In contrast, vegetation (5.05%) and topography (6.38%) have lower proportions among environment-related labels. This suggests that tourists rarely use text to describe specific landscape features and are more inclined to express their overall experience.

Image data: The proportion of image labels for topography (26.50%) and vegetation (22.05%) is significantly higher. This indicates that tourists prefer to use photos to visually convey the natural landscape features of traditional towns. However, the proportions of architecture (21.67%), culture (18.32%), and comfort (17.78%) are relatively lower. This suggests that images have certain limitations in conveying tourists’ subjective feelings and cultural experiences.

Fused data: After integrating text and images, the distribution of the three label categories becomes more balanced. The proportions of aesthetics (71.90%), architecture (61.67%), and accessibility (61.58%) increase significantly. At the same time, topography (43.60%) and vegetation (37.17%) show a substantial increase among environment-related labels. This indicates that fused data integrates the strengths of both text and images, enabling a more comprehensive and accurate representation of the diverse characteristics of traditional towns.

Table 1 presents the improvement in label coverage after integrating text and image labels. After integration, the overall text label coverage increased by 186.40%, while image label coverage rose by 147.22% compared to single-modal data. Among text labels, the most significant increases are observed in water (456.61%) and architecture (199.00%). Among image labels, the most significant increases are observed in vegetation (843.00%) and topography (644.78%). This demonstrates that integration effectively improves the insufficient label representation in single-modal data. It further highlights the advantages of multimodal fusion in representing traditional town characteristics.

Table 1 Improvement in text and image label coverage after integration

Correlation analysis of labels

Correlation analysis is used to evaluate the strength of linear relationships between variables39. This study uses the Pearson correlation coefficient40, which ranges from -1 to 1. An absolute value close to 1 signifies a strong linear relationship between variables, whereas values near 0 indicate a weak one. The sign denotes the direction of the relationship: positive values indicate a positive correlation, and negative values indicate a negative correlation. This section calculates the correlation coefficients for text, image, and fused evaluation labels separately.
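As a reference for how the coefficient behaves on 0/1 label columns, here is a minimal pure-Python sketch (for two binary variables, Pearson's coefficient coincides with the phi coefficient). The function name and toy columns are illustrative assumptions, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy label columns: each list holds one label's 0/1 value across four reviews.
same = pearson([1, 0, 1, 0], [1, 0, 1, 0])
opposite = pearson([1, 0, 1, 0], [0, 1, 0, 1])
unrelated = pearson([1, 1, 0, 0], [1, 0, 1, 0])
```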

The correlation results for text evaluation labels are shown in Fig. 6a. Most correlation coefficients are below 0.3, indicating that the labels are relatively independent. Among them, culture and architecture have the highest correlation (0.603), indicating that tourists often associate architectural features with cultural connotations in their evaluations.

Fig. 6: Heatmap of Pearson correlation coefficients.

a Correlation matrix based on text labels. b Correlation matrix based on image labels. c Correlation matrix based on fused multimodal data.

The correlation results for image evaluation labels are shown in Fig. 6b. Most correlation coefficients have absolute values below 0.2, indicating that image labels are largely independent of one another and that tourists typically focus on a single feature when uploading images.

One exception is accessibility, which shows a noticeable negative correlation with both architecture and aesthetics. This suggests that tourists’ images of architecture rarely include elements related to infrastructure or landscape accessibility.

The analysis results for fused data are shown in Fig. 6c. After integrating text and images, the correlation between labels increases. In particular, the correlation between consumption and topography (0.321), as well as between consumption and vegetation (0.281), shows the most significant increase. This indicates that multimodal data fusion enhances the connection between environmental features and tourist experiences. It reveals label relationships that are difficult to capture with a single modality and improves the overall representation of different feature dimensions.

SHAP analysis of traditional town label importance

Importance of review labels for score

The importance of traditional town review labels for the user rating (score) is shown in Fig. 7. The bar chart on the left side of the figure represents the average SHAP values for each label, indicating the average contribution of each label to the score prediction. The summary plot on the right side illustrates the impact trends of specific label values on scores. Lighter colors indicate higher label values.

Fig. 7: Importance of review labels for score.

The data in the figure shows that comfort and aesthetics, both belonging to the experience category, have the most significant impact on scores. Their contribution rates are 24.51% and 17.21%, respectively, far exceeding other categories. Among the environment-related labels, water contributes the most (11.85%), followed by vegetation (7.05%), while topography contributes the least (3.33%). Among the infrastructure-related labels, accessibility has a relatively high contribution (11.28%), while architecture contributes less (7.06%). Additionally, culture (10.65%) and consumption (7.06%) also play important roles within the experience category. Overall, tourist scores are primarily influenced by experience-related labels, followed by environmental and infrastructure-related labels.
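The category-level ranking can be checked by summing the second-level contribution rates quoted above within each first-level category. The percentages below are copied from the paragraph; the variable and function names are illustrative.

```python
# Contribution rates (%) of each label to the score, as quoted in the text.
SCORE_CONTRIBUTION = {
    "comfort": 24.51, "aesthetics": 17.21, "culture": 10.65, "consumption": 7.06,
    "water": 11.85, "vegetation": 7.05, "topography": 3.33,
    "accessibility": 11.28, "architecture": 7.06,
}

# First-level category of each second-level label (from the tagging system).
CATEGORY_OF = {
    "comfort": "experience", "aesthetics": "experience",
    "culture": "experience", "consumption": "experience",
    "water": "environment", "vegetation": "environment", "topography": "environment",
    "accessibility": "infrastructure", "architecture": "infrastructure",
}


def category_totals(contribution, category_of):
    """Sum label-level contribution rates within each first-level category."""
    totals = {}
    for label, share in contribution.items():
        category = category_of[label]
        totals[category] = totals.get(category, 0.0) + share
    return {category: round(total, 2) for category, total in totals.items()}


totals = category_totals(SCORE_CONTRIBUTION, CATEGORY_OF)
```

With these figures, the experience category sums to 59.43%, environment to 22.23%, and infrastructure to 18.34%, which matches the ordering stated in the text.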

Importance of review labels for sentiment value

The importance of review labels for sentiment value is shown in Fig. 8. The data in the figure shows that experience-related labels contribute the most to sentiment value. Among them, aesthetics has the highest contribution, accounting for 22.06%. The second highest contributor is comfort, accounting for 19.06%. The consumption label ranks third in contribution, accounting for 12.22%. Among environment-related labels, water has the highest importance for sentiment value (11.63%), significantly higher than topography (6.58%) and vegetation (6.54%). Among infrastructure-related labels, accessibility accounts for 7.59%, while architecture has the lowest contribution at 5.60%. Overall, tourists’ sentiment values are primarily influenced by experience-related labels, followed by environmental and infrastructure-related labels.

Fig. 8: Importance of review labels for sentiment value.

Importance of labels in four types of traditional towns

We calculated the SHAP values for the four types of traditional towns, as shown in Fig. 9. The data reveals significant differences in tourist concerns across different types of traditional towns.

Fig. 9: Comparison of label importance across four types of traditional towns.

In water towns, water has the highest importance (0.0391). This indicates that tourists pay the most attention to the water environment when evaluating these towns. In original ecological towns, vegetation has the highest importance (0.0251). This aligns with the emphasis on the authenticity of the natural environment in these towns. In religious and cultural towns, comfort (0.1444) and aesthetics (0.1317) have the highest importance among the four town types. The high importance of aesthetics may be closely related to the aesthetic value of religious elements, as tourists are more likely to be influenced by this cultural aesthetic experience. In commercial and trade-oriented towns, consumption reaches its highest importance among the four town types (0.0456), while aesthetics reaches its lowest (0.0642). This suggests that, compared with other town types, tourists focus more on the shopping experience when evaluating commercial traditional towns and pay relatively less attention to aesthetic elements.

Additionally, the culture (0.0877) and accessibility (0.1098) labels are more important in religious cultural traditional towns compared to other types. This may be related to the strong cultural atmosphere of religious towns and the higher expectations tourists have for convenient transportation. Overall, these differences validate the alignment between the label design and the types of traditional towns. They suggest that the tagging system effectively reflects the core characteristics of each type of town.

Although this study does not employ specialized behavior recognition algorithms, the content recorded by tourists in images and texts contains a large number of clues related to leisure activities. These clues have been systematically extracted and represented through the tagging system. In the images, tourists often photograph scenes related to rest, consumption, or cultural experiences, such as seating areas, shops, or temples. These contents are categorized under tags such as comfort, consumption, and culture. In the text, tourists also frequently describe their activity processes, and related terms are categorized into corresponding tag classes during the matching process. In this way, the model not only captures tourists’ perception of spatial features but also, to some extent, reflects their behavioral preferences and modes of participation.

Discussion

Firstly, we explore the complementarity and optimization of multimodal data. Text and image data have complementary advantages in recognizing the landscape features of traditional towns41. Text data can capture tourists’ subjective feelings, such as cultural atmosphere, comfort, and accessibility, while image data visually presents architectural style, natural environment, and infrastructure. However, single-modal data has certain limitations42. Text reviews often lack specific descriptions of environmental elements, while images, although capable of presenting landscape features, are not able to directly reflect tourists’ emotions and experiences.

The research results indicate that text data has a relatively higher proportion in aesthetics, architecture, and accessibility, while image data emphasizes environmental features such as topography and vegetation. After multimodal fusion, the complementarity between these features is fully leveraged. For example, vegetation, which has a relatively low proportion in the text data, sees a significant increase after fusion. This allows the data to more comprehensively reflect tourists’ perceptions of the traditional town landscapes. At the same time, the text label coverage increased by 186.40% and the image label coverage increased by 147.22%, effectively reducing the bias of single-modal data and improving the interpretability of the model.

The potential of multimodal data fusion is particularly significant in optimizing traditional town landscapes. For example, when text reviews indicate that tourists appreciate the cultural atmosphere and aesthetic value of a traditional town, but image analysis shows that vegetation coverage is low, it may suggest that there is room for improvement in the town’s greenery. By integrating multimodal data, a more accurate assessment of the alignment between tourist perceptions and environmental features can be made. This provides a scientific basis for cultural heritage conservation, landscape optimization, and tourism management.

Secondly, we discuss the influence of cultural landscape features on tourist perceptions. This study uses the SHAP method to quantify the contribution of different cultural landscape features to tourist ratings and sentiment values. The results show that experience-related features have the greatest impact on tourist ratings and sentiment values, particularly comfort and aesthetics. This suggests that when evaluating traditional towns, tourists are more likely to focus on the overall atmosphere and visual appeal. Among environmental features, water and vegetation have a significant impact on sentiment values, indicating that a good natural environment can effectively enhance tourists’ enjoyment. Infrastructure-related features, such as accessibility, have a more significant impact on tourist ratings, reflecting the critical role of transportation convenience in tourist satisfaction.
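The idea behind SHAP attributions can be illustrated with an exact Shapley value computation on a very small model. The sketch below is an assumption-laden toy: the linear rating model, its weights, and the feature values are hypothetical, and the study's actual predictive model and SHAP explainer are not specified in this section.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2^n, so this is only practical for a few features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                without = {m: (x[m] if m in coalition else baseline[m]) for m in names}
                with_feature = dict(without, **{name: x[name]})
                total += weight * (predict(with_feature) - predict(without))
        phi[name] = total
    return phi


# Hypothetical linear rating model; the weights are illustrative, not fitted.
WEIGHTS = {"comfort": 0.8, "aesthetics": 0.6, "accessibility": 0.4}


def predict_rating(features):
    return 3.0 + sum(WEIGHTS[k] * v for k, v in features.items())


baseline = {"comfort": 0.2, "aesthetics": 0.3, "accessibility": 0.1}  # toy average tag shares
review = {"comfort": 0.5, "aesthetics": 0.3, "accessibility": 0.0}    # toy single review

phi = shapley_values(predict_rating, review, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the review and the prediction for the baseline, which is the property that lets per-feature contributions to ratings or sentiment values be compared directly.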

Further analysis of the relationships between features reveals interactions among them. For example, culture and architecture exhibit a strong positive correlation, indicating that tourists often consider the cultural background when evaluating architectural features. The correlation between water and comfort is relatively low, suggesting that water features are not yet strongly linked to perceived comfort and that optimizing the water environment may positively affect overall comfort. Additionally, consumption has a particularly strong impact in commercial traditional towns, with the shopping experience becoming a key factor influencing tourist sentiment values and ratings. The integration of multimodal data further enhances the completeness of feature representation, clarifying the relationships between environmental, infrastructure, and experience-related features. Notably, after fusion, the correlation between accessibility and vegetation strengthens. This suggests that when tourists evaluate traditional towns comprehensively, they may simultaneously consider both transportation convenience and the quality of natural landscapes. This result not only supports the validity of the SHAP analysis but also provides a scientific basis for traditional town landscape optimization and tourism management. Specifically, it suggests that, while preserving cultural features and visual aesthetics, further improvements in infrastructure are necessary to enhance the overall tourism experience.
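Pairwise relationships of this kind can be examined with a simple correlation computation. The sketch below assumes Pearson correlation over per-review tag proportions; the section does not state which correlation measure the study uses, and the values in `features` are toy data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy per-review tag proportions (not from the study).
features = {
    "culture":      [0.1, 0.3, 0.5, 0.2, 0.4],
    "architecture": [0.2, 0.4, 0.6, 0.3, 0.5],
    "water":        [0.5, 0.1, 0.3, 0.4, 0.2],
}

names = list(features)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a:>12} vs {b:<12} r = {pearson(features[a], features[b]):+.2f}")
```

A coefficient of this type describes co-occurrence only; it does not by itself show that one feature drives another.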

Thirdly, we discuss the differences in tourist perceptions among different types of traditional towns. These perceptions differ significantly across town types, reflecting the varying core features of each type and the aspects that attract tourists’ attention. In waterfront traditional towns, the unique water network environment leads tourists to focus more on the quality and atmosphere of water features, while they pay relatively less attention to architectural style and cultural elements. Tourists in ecological traditional towns are more inclined to focus on the natural environment, such as vegetation coverage and topographical features. This indicates that their experience in these towns centers on natural authenticity more than on man-made facilities.

In commercial traditional towns, tourists are more focused on the consumption experience. Shopping and the commercial atmosphere are key factors influencing sentiment values. In contrast, the aesthetic evaluation of these towns is relatively low, suggesting that tourists focus more on economic activities than on cultural or landscape aesthetics. In religious cultural traditional towns, tourist evaluations highlight comfort and aesthetics, which may relate to the tranquil atmosphere of religious sites and their unique architectural aesthetics. Additionally, the evaluation of cultural attributes is more prominent in religious cultural traditional towns, indicating that tourists have a stronger perception of the historical and spiritual cultural expressions of these towns.

These differences indicate that tourist evaluations of traditional towns are influenced not only by their natural and cultural characteristics but also by the town’s primary functional focus. Therefore, when planning and managing different types of traditional towns, it is essential to optimize them based on tourists’ core needs. Examples include enhancing water environment maintenance in waterfront traditional towns, improving the shopping experience in commercial traditional towns, and optimizing cultural displays in religious cultural towns43,44. This approach can better meet tourists’ preferences and enhance the overall tourism experience.

Fourthly, we discuss the implications for cultural heritage conservation and tourism management. Comfort, aesthetics, and accessibility are the main factors influencing ratings, while sentiment is more strongly driven by aesthetics and cultural atmosphere. In the integrated sentiment model, the combined contribution of the aesthetics and culture tags exceeds 30%. This indicates that tourists not only focus on the cultural heritage itself but also place greater importance on the overall environment and experiential atmosphere. Therefore, the conservation strategies for traditional towns should be based on tourists’ actual perceptions, balancing cultural authenticity with user experience, and enhancing spatial comfort and aesthetic quality.
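A combined contribution of this kind can be computed from per-review attributions, as sketched below. The convention used here, each feature's share of the total mean absolute SHAP value, is an assumption, since the section does not state how the combined contribution is calculated. The values in `shap_rows` are placeholders and not the study's results.

```python
def contribution_shares(shap_rows):
    """Share of total mean |SHAP| attributable to each feature."""
    features = list(shap_rows[0])
    mean_abs = {
        f: sum(abs(row[f]) for row in shap_rows) / len(shap_rows) for f in features
    }
    total = sum(mean_abs.values())
    return {f: value / total for f, value in mean_abs.items()}


# Toy per-review SHAP values (placeholders, not the study's results).
shap_rows = [
    {"aesthetics": 0.20, "culture": -0.10, "comfort": 0.30, "water": 0.10},
    {"aesthetics": -0.10, "culture": 0.20, "comfort": 0.10, "water": -0.20},
]

shares = contribution_shares(shap_rows)
combined = shares["aesthetics"] + shares["culture"]
print(f"aesthetics + culture: {combined:.1%}")
```

Absolute values are used so that positive and negative attributions do not cancel when a feature raises sentiment in some reviews and lowers it in others.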

Tourist perceptions vary significantly across different types of traditional towns. In original ecological towns and water towns, tourists pay more attention to natural environmental elements such as water bodies and vegetation. Therefore, efforts should focus on preserving the ecological structure, avoiding excessive artificial modifications, and enhancing the quality of natural experiences through measures such as water system restoration and green space renewal. In commercial and religious-cultural towns, tourists are more concerned with accessibility, architectural character, and cultural atmosphere. These towns should improve supporting facilities while controlling excessive commercialization to preserve the continuity of their cultural character. Religious and cultural towns, in particular, should guide visitor behavior to avoid disrupting spaces dedicated to religious rituals.

This study establishes a dual-indicator satisfaction system combining ratings and sentiment scores, which, together with text-image tags, enables dynamic identification of changes in tourist perception. For example, some reviews show inconsistencies between ratings and sentiment, reflecting a gap between tourists’ evaluations and their actual experiences. By tracking the changes in the relationship between satisfaction and tag features, managers can determine which optimization measures truly enhance the tourist experience and which aspects still need to be improved. This allows for the establishment of a flexible evaluation mechanism based on public feedback, enabling adaptation to the dynamic changes and diverse needs of cultural landscapes.
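Inconsistencies between the two indicators can be flagged with a simple comparison, sketched below under stated assumptions: ratings are taken to lie on a 1 to 5 scale, sentiment scores in [0, 1], and the threshold of 0.5 is arbitrary. The record format and the function `find_mismatches` are hypothetical and do not describe the study's implementation.

```python
def find_mismatches(reviews, threshold=0.5):
    """Return (id, gap) pairs for reviews whose rating and sentiment disagree.

    Assumes ratings on a 1-5 scale and sentiment in [0, 1]; the rating is
    rescaled to [0, 1] before the two indicators are compared.
    """
    flagged = []
    for review in reviews:
        rating_norm = (review["rating"] - 1) / 4
        gap = rating_norm - review["sentiment"]
        if abs(gap) >= threshold:
            flagged.append((review["id"], round(gap, 2)))
    return flagged


# Toy records (not from the study).
reviews = [
    {"id": "r1", "rating": 5, "sentiment": 0.9},  # consistent
    {"id": "r2", "rating": 5, "sentiment": 0.3},  # high rating, lukewarm text
    {"id": "r3", "rating": 2, "sentiment": 0.8},  # low rating, positive text
]

print(find_mismatches(reviews))
```

A positive gap marks a rating that is more favorable than the accompanying text, and a negative gap the reverse. Tracking how many reviews are flagged over time is one way managers could monitor the gap between evaluations and experiences.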

Although the study effectively analyzes tourist perception, some aspects still need improvement. These limitations include the following. (1) The data mainly rely on social media, which may lead to user group bias. Future work may incorporate surveys and field studies to enhance data coverage and representativeness. (2) While the SHAP method offers good interpretability, it is affected by feature correlations and cannot accurately infer causal relationships. Methods such as structural equation modeling or causal forests could be introduced to further explore underlying mechanisms. (3) The current label system focuses on environment, facilities, and experience, with limited representation of cultural features and tourist behavior patterns. Future work may improve the label structure through semantic analysis and knowledge graphs. (4) The study samples are concentrated in specific regions and have not yet systematically accounted for cultural background differences. Future studies could conduct cross-cultural comparisons to enhance the generalizability of the findings.