Introduction

Visual media are reshaping the ways individuals engage with the landscapes they visit. Photographic composition displays identifiable patterns, suggesting that landscape design can be systematically analyzed and evaluated. In cultural heritage and tourism contexts, user-generated photographs (UGPs) provide valuable empirical evidence for examining these visual patterns, revealing how interactions with attractions relate to place, environment, and emotional experience1. They also serve as important cues that help travelers identify, categorize, and compare different heritage elements, while reflecting perceptions of specific architectural types within destination images2. From a sustainability perspective, UGPs offer insight into the visual representation of heritage sites, supporting the evaluation of promotional strategies, the improvement of heritage management, the strengthening of social connections, and the creation of shared value in cross-cultural contexts3. Accordingly, tourists’ interpretations of heritage places play a critical role in shaping destination image4 and informing landscape planning5.

Chinese classical gardens (CCGs) have long been recognized as important carriers of cultural and historical heritage6. The distinctiveness of CCGs lies in their irregular forms and intricate architectural structures, which make their intended meanings challenging to interpret. A growing body of research has sought to incorporate tourists’ memories, emotions, and experiences into the understanding of heritage values7. Existing studies address diverse aspects of CCGs, including garden architecture8, garden elements5, landscape restoration9, and landscape preferences10,11. For example, garden elements are typically defined through observable spatial attributes and color schemes12,13, such as vegetation, rockery textures, enclosing structures (e.g., walls and screens), and architectural components including traditional windows, doorways, and pathways5,14,15,16. Thematic color schemes, particularly greens and earth tones, also enhance spatial coherence and support cognitive mapping13,17. To analyze complex landscape photographic compositions, recent studies have increasingly combined deep learning (DL) and artificial intelligence (AI) to advance cultural heritage theory and practice through photographs18, with the aim of preserving, protecting, and rehabilitating cultural heritage. For instance, DL frameworks have been developed in heritage protection for image enhancement19, crack repair20, color degradation21, surface impairment22,23, and damaged murals24, enabling early warnings of cultural asset deterioration25. These techniques generally fall into two categories: object detection and semantic segmentation26. Object detection identifies and localizes specific targets using bounding boxes, operating at the object level (e.g., detecting the presence and position of a car). Applications include extracting scenic spots27, quantifying landscape preference28, and identifying specific cultural aspects29 and objects28. Semantic segmentation classifies each pixel into predefined categories without distinguishing individual instances (e.g., all pixels belonging to cars may be labeled in blue in a segmentation map), thereby operating at the pixel level. This approach has been successfully applied to tasks such as on-site inspection of remote heritage buildings30.

However, several limitations remain in the existing literature. First, current studies examining landscape photography in relation to visitor perception or destination image have predominantly focused on the observer’s perspective, recruiting participants for either eye-tracking experiments1,3,8,31 or questionnaire surveys32. Eye-tracking experiments have been extensively applied to understand observers’ visual behaviors, particularly visual attention. Researchers have also integrated other analytical techniques with eye-tracking analysis1,3. Questionnaire surveys have also been widely used to examine how visual preferences, experiences, cultural identity, and emotions shape perception33,34,35. However, research adopting the photographer’s perspective within CCG contexts remains limited. Second, research on landscape preferences in photographic compositions has primarily focused on the identification of objects27,28, spots3, or shapes8. Some researchers have linked the photographic compositions to esthetic scale, demonstrating its impact on cultural heritage identity36,37. However, although pixel-level analysis provides greater analytical granularity than object-based approaches, semantic segmentation has rarely been applied to heritage photographs to examine tourist perception. In particular, automated identification of compositional structures, extraction of irregular elements (e.g., vegetation and water), and assessment of their spatial organization (e.g., coverage ratios) remain insufficiently explored. Third, cultural background and individual characteristics further shape variations in tourist perception across destinations12,16,38. Yet cross-cultural analyses of tourist perceptions and behaviors across different heritage types remain limited16.

These gaps motivate the present study, which addresses three research questions: how garden-related compositional elements in heritage photographs can be quantitatively identified using advanced computer vision techniques; how photographic compositions reflect the landscape preferences of content creators; and whether heritage landscape elements evoke consistent perceptions among tourists. Utilizing various quantitative methodologies, specifically DeepLab visual processing, spatial grid analysis, and k-means clustering, we explored the photographic compositions of diverse Chinese gardens through 9004 UGPs sourced from Dianping.com and TripAdvisor.com. This research offers multiple additions to the literature on cultural heritage. First, we integrate DL–based semantic segmentation with spatial grid analysis to identify garden compositional elements and their spatial distributions, enabling examination of the relationships among compositional structures, spatial layouts, visual hotspots, and tourist preferences. Second, we propose an objective, data-driven framework for analyzing UGPs. This framework enables the inference of photographers’ perceptions and landscape preferences while clarifying the relationship between compositional structures and garden design and planning. Third, we differentiate tourists’ perceptions between two groups of tourists with diverse backgrounds and two famous gardens, highlighting how tourists’ perceptions and destination image are shaped by cultural variations and different types of CCGs. Our findings yield significant insights for landscape design in classical gardens, enhance the integration of heritage marketing and tourist engagement, and furnish scientific guidance for developing a more appealing and functional garden environment.

Methods

Research framework

This study draws on the theoretical framework of environmental psychology, DL methodologies, spatial grid analysis, and data mining to decode cross-cultural tourists’ perceptions through UGPs in two famous CCGs. First, DL-based semantic segmentation is applied to each UGP to assign a category label to every pixel, followed by a label-to-component mapping associated with historical gardens. Second, occupancy analysis is conducted to quantify the spatial distribution and proportional coverage of each component within the image. Third, the Rule of Thirds39 is introduced to capture content providers’ geometric sense of proportion40 and to provide the basis for the grid analysis. Fourth, component-to-cell occupancy metrics and spatial grid analysis are computed to evaluate compositional features, including component combinations, layout structures, and visual hotspots. Figure 1 presents the overall research framework.

Fig. 1: Research framework (Source: created by the authors).

The light blue blocks represent the study cases, focusing on the Humble Administrator’s Garden (HAG) in Suzhou and the Summer Palace (SP) in Beijing. The light green blocks indicate the data collection phase, where Python-based web crawlers are used to gather UGPs from Dianping.com and TripAdvisor.com. The light orange blocks represent the data analysis phase, which includes semantic segmentation, occupancy statistics, and composition analysis. In this phase, semantic segmentation is applied using DeepLab V3+, occupancy statistics are calculated for ten key components, and composition analysis is performed using the Rule of Thirds and k-means clustering.

Study areas

In alignment with previous research on comparative spatial structures41 and the significant role of royal gardens in Northern China and private gardens in Southern China42, this study selects the Summer Palace (SP) in Beijing and the Humble Administrator’s Garden (HAG) in Suzhou as the focal study areas. Located in Beijing, Northern China, the SP is a representative royal garden renowned for its integration of large-scale natural landscapes, monumental architecture, and imperial cultural symbolism. The HAG, a classical private garden in Suzhou, Southern China, is celebrated for its refined spatial organization, close interaction between architecture and nature, and rich literary and esthetic traditions. Both are counted among China’s ‘Four Great Classical Gardens’ and are inscribed on the World Heritage List of the United Nations Educational, Scientific, and Cultural Organization (UNESCO) (http://whc.unesco.org/en/list/), underscoring the value of natural and cultural sceneries with historical merit. The selection of these two sites ensures data availability, comparability across contrasting garden typologies, and analytical diversity, while maintaining methodological consistency. Consequently, other CCGs are not included in this study.

Data collection

This study incorporated UGPs to analyze tourists’ perceptions of heritage sites, as visual data capture experiential and perceptual dimensions that are often difficult to obtain through scaled questionnaires or interviews43. Dianping.com was selected to represent Eastern tourists due to its extensive repository of photographs related to the selected gardens, exceeding that of comparable platforms such as Mafengwo.cn, Ly.com, and Ctrip.com. TripAdvisor, which has a larger quantity, higher quality, and greater ownership of user-generated CCG photographs than Instagram, Flickr, and Wikimedia Commons, was selected to represent Western tourists. This data selection strategy follows the approach adopted by Sun et al.16. We developed a web crawler to extract user-generated images from the photo galleries of Dianping and TripAdvisor. To ensure dataset balance and comparability, 2736 photographs were collected for each garden from each platform, giving 10,944 photographs in total. This sample size was determined by the minimum number of available photographs across the four site-platform combinations, specifically those from the HAG on TripAdvisor. When the number of available images exceeded this threshold, photographs were randomly sampled.
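The balancing rule described above can be sketched as follows. This is a minimal illustration rather than the authors' crawler code: the function name `balanced_sample`, the fixed seed, and the toy gallery sizes are assumptions introduced here, and only the logic (downsample every source to the size of the smallest one) comes from the text.

```python
import random


def balanced_sample(photo_ids_by_source, seed=0):
    """Randomly downsample every source to the size of the smallest source."""
    target = min(len(ids) for ids in photo_ids_by_source.values())
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draw reproducible
    return {
        source: sorted(rng.sample(ids, target)) if len(ids) > target else list(ids)
        for source, ids in photo_ids_by_source.items()
    }


# Hypothetical gallery sizes; in the study the smallest source was HAG on TripAdvisor.
galleries = {
    "HAG_D": list(range(50)),
    "HAG_T": list(range(20)),
    "SP_D": list(range(80)),
    "SP_T": list(range(35)),
}
sample = balanced_sample(galleries)
```

Every source ends up with the same number of photographs, so later per-source counts can be compared directly.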

Data preprocessing

All images underwent manual screening to ensure sampling accuracy for each study site. Certain images were excluded from subsequent analysis due to challenges in recognition, the prominence of human close-ups, or their lack of connection to the places under study. After human verification, the initial dataset of 10,944 photographs was refined to a final collection of 9004 photographs, which comprises 2495 for HAG on Dianping (HAG_D), 2023 for HAG on TripAdvisor (HAG_T), 2485 for SP on Dianping (SP_D), and 2001 for SP on TripAdvisor (SP_T). Prior to analysis, platform-specific visual artifacts were removed. For Dianping images, watermarks were eliminated by cropping the lower 15% of the image area to ensure consistency and avoid interference with pixel-level semantic segmentation.
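A minimal sketch of the watermark crop, assuming an image is represented as a list of pixel rows ordered from top to bottom. The helper name `crop_bottom` and the integer-percentage rounding are illustrative choices, not details reported in the study.

```python
def crop_bottom(rows, percent=15):
    """Drop the lowest `percent` of pixel rows (rounded down) from an image."""
    height = len(rows)
    removed = height * percent // 100  # integer arithmetic avoids float rounding
    return rows[: height - removed]


# Toy image: 20 rows by 4 columns, each pixel stored as its (row, column) index.
image = [[(r, c) for c in range(4)] for r in range(20)]
cropped = crop_bottom(image)
```

Because the crop removes rows rather than masking them, the coverage ratios computed later are taken over the cropped frame only.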

Semantic segmentation

We introduced DeepLab, a semantic segmentation approach extensively utilized in computer vision44, to identify landscape components within each photograph. First, all images were resized to 512 × 512 pixels to conform to the input requirements of the DeepLab architecture. Second, we utilized the pre-trained DeepLab V3+ model available in the TensorFlow Model Zoo, trained on the ADE20K dataset. Third, each standardized image was processed through the model to generate a segmentation map, in which every pixel was assigned a class ID (i.e., label). Fourth, we created a label-to-component mapping associated with historical gardens, as outlined in Table 1. These components are fundamental features of classical gardens5,15. We further refined the labels to keep those relevant to our task and classified each reserved label into one of ten components. Fifth, segmentation maps were resized back to the original image dimensions to ensure spatial consistency for subsequent analysis. Finally, manual verification was conducted to reduce potential misclassification and enhance labeling accuracy.
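The label-to-component step and the resize back to the original dimensions can be illustrated with a short sketch. The mapping below is a hypothetical excerpt (Table 1 remains the authoritative mapping), the label names are placeholders for the class IDs produced by the model, and nearest-neighbour resizing is an assumption made here so that class labels are never blended.

```python
# Hypothetical excerpt of a label-to-component table; Table 1 is authoritative.
LABEL_TO_COMPONENT = {
    "tree": "trees and plants",
    "plant": "trees and plants",
    "water": "water features",
    "lake": "water features",
    "building": "buildings",
    "wall": "enclosure walls",
    "bridge": "pathways and bridges",
    "rock": "rocks and boulders",
}
UNKNOWN = "UK"  # labels outside the table are excluded from the analysis


def to_components(label_map):
    """Replace every pixel label with its garden component (or UNKNOWN)."""
    return [[LABEL_TO_COMPONENT.get(label, UNKNOWN) for label in row] for row in label_map]


def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize, which copies labels instead of interpolating them."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 3 x 3 segmentation map resized back to a 6 x 3 "original" image.
label_map = [
    ["tree", "tree", "sky"],
    ["building", "wall", "tree"],
    ["lake", "water", "rock"],
]
component_map = resize_nearest(to_components(label_map), 6, 3)
```

Mapping before resizing keeps the lookup on the smaller array; the order could be swapped without changing the result, since both steps act on each pixel independently.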

Table 1 Meaning of the 10 fundamental compositional components (inspired by Kiani and Khakzand’s study64 and adapted by the authors)

Occupancy statistics

Rather than quantifying architectural attributes such as enclosing walls or vegetation density5, which primarily reflect physical design characteristics, this study evaluates semantic components through occupancy analysis from the tourists’ visual perspective. This approach emphasizes the spatial distribution of landscape elements that appear within photographic compositions. The occupancy analysis consisted of a presence indicator and a coverage ratio of the landscape components. The presence indicator was determined by the occurrence probability that any image contains the element, as detailed in formulas (1)–(10). The coverage-ratio index was measured by pixel-level proportion of the component within images that contain it, as detailed in formulas (11)–(20).

$$O{P}_{A}=\frac{Number\,of\,images\,containing\,architectural\,openings}{Total\,number\,of\,images}$$
(1)
$$O{P}_{B}=\frac{Number\,of\,images\,containing\,buildings}{Total\,number\,of\,images}$$
(2)
$$O{P}_{D}=\frac{Number\,of\,images\,containing\,decorative\,elements}{Total\,number\,of\,images}$$
(3)
$$O{P}_{{E}}=\frac{Number\,of\,images\,containing\,enclosure\,walls}{Total\,number\,of\,images}$$
(4)
$$O{P}_{L}=\frac{Number\,of\,images\,containing\,literati\,garden\,furnishings}{Total\,number\,of\,images}$$
(5)
$$O{P}_{M}=\frac{Number\,of\,images\,containing\,mountains\,and\,hills}{Total\,number\,of\,images}$$
(6)
$$O{P}_{P}=\frac{Number\,of\,images\,containing\,pathways\,and\,bridges}{Total\,number\,of\,images}$$
(7)
$$O{P}_{R}=\frac{Number\,of\,images\,containing\,rocks\,and\,boulders}{Total\,number\,of\,images}$$
(8)
$$O{P}_{T}=\frac{Number\,of\,images\,containing\,trees\,and\,plants}{Total\,number\,of\,images}$$
(9)
$$O{P}_{W}=\frac{Number\,of\,images\,containing\,water\,features}{Total\,number\,of\,images}$$
(10)
$$C{R}_{A}=\frac{Number\,of\,pixels\,belonging\,to\,architectural\,openings}{Total\,pixels\,in\,the\,image}$$
(11)
$$C{R}_{B}=\frac{Number\,of\,pixels\,belonging\,to\,buildings}{Total\,pixels\,in\,the\,image}$$
(12)
$$C{R}_{D}=\frac{Number\,of\,pixels\,belonging\,to\,decorative\,elements}{Total\,pixels\,in\,the\,image}$$
(13)
$$C{R}_{E}=\frac{Number\,of\,pixels\,belonging\,to\,enclosure\,walls}{Total\,pixels\,in\,the\,image}$$
(14)
$$C{R}_{L}=\frac{Number\,of\,pixels\,belonging\,to\,literati\,garden\,furnishings}{Total\,pixels\,in\,the\,image}$$
(15)
$$C{R}_{M}=\frac{Number\,of\,pixels\,belonging\,to\,mountains\,and\,hills}{Total\,pixels\,in\,the\,image}$$
(16)
$$C{R}_{P}=\frac{Number\,of\,pixels\,belonging\,to\,pathways\,and\,bridges}{Total\,pixels\,in\,the\,image}$$
(17)
$$C{R}_{R}=\frac{Number\,of\,pixels\,belonging\,to\,rocks\,and\,boulders}{Total\,pixels\,in\,the\,image}$$
(18)
$$C{R}_{T}=\frac{Number\,of\,pixels\,belonging\,to\,trees\,and\,plants}{Total\,pixels\,in\,the\,image}$$
(19)
$$C{R}_{W}=\frac{Number\,of\,pixels\,belonging\,to\,water\,features}{Total\,pixels\,in\,the\,image}$$
(20)
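Formulas (1)–(20) share two templates, which the sketch below applies to a toy dataset. It assumes each image is already a grid of component names with "UK" marking excluded pixels; the function names and the toy images are illustrative, and the averaging of coverage ratios over only those images that contain a component follows the description in the text.

```python
from collections import Counter


def coverage_ratios(component_map):
    """CR for one image: pixels of each component divided by all pixels in the image."""
    counts = Counter(cell for row in component_map for cell in row)
    total = sum(counts.values())
    return {comp: n / total for comp, n in counts.items() if comp != "UK"}


def occupancy(images):
    """Return (OP, mean CR), with the mean taken over images containing the component."""
    per_image = [coverage_ratios(img) for img in images]
    present = Counter(comp for ratios in per_image for comp in ratios)
    op = {comp: n / len(images) for comp, n in present.items()}
    mean_cr = {
        comp: sum(r[comp] for r in per_image if comp in r) / present[comp]
        for comp in present
    }
    return op, mean_cr


T, W, B = "trees and plants", "water features", "buildings"
images = [
    [[T, T], [W, W]],
    [[T, B], [B, B]],
    [[T, T], [T, "UK"]],
    [[W, W], [W, W]],
]
op, mean_cr = occupancy(images)
```

In this toy set, trees and plants appear in three of four images (OP = 0.75) with a mean coverage of 0.5, whereas water appears in only two images but covers 0.75 of the frame on average, which is the kind of contrast the two indices are meant to separate.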

Component combination identification

Component combinations are operationalized using co-occurrence cardinality, defined as the number of distinct landscape component types (out of the ten predefined categories) identified within a single image. This image-level metric captures the compositional complexity reflected in UGPs. The distribution of co-occurrence cardinality is then computed to determine how many images contain exactly one, two, and up to ten component types. This analysis enables assessment of the diversity and structural richness of landscape elements within photographic compositions.
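As a small illustration of co-occurrence cardinality, the sketch below counts distinct component types per image and tabulates the distribution. The input format (one set of component names per image) and the toy data are assumptions made for the example.

```python
from collections import Counter


def cardinality_distribution(component_sets, max_components=10):
    """Number of images containing exactly 1, 2, ..., max_components component types."""
    counts = Counter(len(found) for found in component_sets)
    return [counts.get(k, 0) for k in range(1, max_components + 1)]


# Toy input: the set of component types detected in each of four images.
photos = [
    {"trees and plants"},
    {"trees and plants", "water features"},
    {"buildings", "water features"},
    {"buildings", "trees and plants", "water features"},
]
histogram = cardinality_distribution(photos)
```

The first entry of `histogram` is the number of single-component images, the second the number of two-component images, and so on up to ten.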

Composition layout recognition

To examine spatial composition, we incorporate the Rule of Thirds as a grid-based analytical framework. The Rule of Thirds divides an image into three equal horizontal and three equal vertical sections, forming a 3 × 3 grid with four intersection points that represent areas of visual emphasis. The analysis begins by constructing a density distribution map. For each landscape component, we calculate the intersection area between its bounding box and each of the nine grid cells. Bounding boxes are generated directly from the segmentation masks to enable instance-level compositional analysis: for each class (i.e., the labels listed in Table 1), we identify all spatially discrete regions, defined as connected components, within the corresponding segmentation mask. The bounding box of each instance is then projected back to the original image dimensions and constrained to valid pixel coordinates. This ensures that distinct objects belonging to the same class are treated as independent instances in subsequent spatial analysis. To simplify the resulting distribution, we determine each component’s spatial intensity by identifying the single grid cell with which it shares the largest intersection area; this cell is defined as the component’s primary compositional location. The analysis culminates in the identification of the dominant component within each grid cell. This reverses the analytical perspective: rather than determining where a given component is most prominent, we evaluate which component predominates within each cell. For each of the nine cells, all components whose primary spatial intensity falls within that cell are compared, and the component occupying the largest area is designated as dominant. This multi-step procedure establishes a robust framework for recognizing compositional layouts and underpins the statistical analyses reported in the results section.
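The procedure can be sketched on a toy component map. Several details below are assumptions rather than reported choices: regions are found with 4-connectivity, ties between cells go to the first cell scanned, and "largest area" in the dominance step is read as the intersection area between an instance's bounding box and its primary cell. The short codes T, B, W, and R stand for trees and plants, buildings, water features, and rocks and boulders.

```python
from collections import deque


def instance_boxes(grid, component):
    """Bounding boxes (top, left, bottom, right; exclusive ends) of 4-connected regions."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r0 in range(h):
        for c0 in range(w):
            if seen[r0][c0] or grid[r0][c0] != component:
                continue
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            top, left, bottom, right = r0, c0, r0 + 1, c0 + 1
            while queue:  # breadth-first flood fill over one region
                r, c = queue.popleft()
                top, left = min(top, r), min(left, c)
                bottom, right = max(bottom, r + 1), max(right, c + 1)
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    inside = 0 <= nr < h and 0 <= nc < w
                    if inside and not seen[nr][nc] and grid[nr][nc] == component:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


def primary_cell(box, h, w):
    """Rule-of-Thirds cell (row, col) with the largest box overlap, plus that overlap."""
    top, left, bottom, right = box
    best, best_area = None, 0
    for i in range(3):
        for j in range(3):
            dy = min(bottom, (i + 1) * h / 3) - max(top, i * h / 3)
            dx = min(right, (j + 1) * w / 3) - max(left, j * w / 3)
            area = max(dy, 0) * max(dx, 0)
            if area > best_area:
                best, best_area = (i, j), area
    return best, best_area


def dominant_components(grid, components):
    """For each cell, the component whose instance has the largest overlap there."""
    h, w = len(grid), len(grid[0])
    winners = {}
    for comp in components:
        for box in instance_boxes(grid, comp):
            cell, area = primary_cell(box, h, w)
            if cell not in winners or area > winners[cell][1]:
                winners[cell] = (comp, area)
    return {cell: comp for cell, (comp, _) in winners.items()}


grid = [
    ["T", "T", "T", "UK", "UK", "UK"],
    ["T", "R", "T", "UK", "UK", "UK"],
    ["B", "B", "UK", "UK", "UK", "UK"],
    ["B", "B", "UK", "UK", "UK", "UK"],
    ["UK", "W", "W", "W", "W", "UK"],
    ["UK", "W", "W", "W", "W", "UK"],
]
dominant = dominant_components(grid, ["T", "B", "W", "R"])
```

Here the small rock region falls in the top-left cell but is outweighed by the tree region, so trees dominate the top-left cell, buildings the middle-left cell, and water the bottom-center cell.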

Composition hotspot detection

Composition hotspot detection identifies the locations where tourists most often take photographs. First, we encoded each image as a 10-dimensional binary vector, with one entry per component indicating its presence (one-hot encoding). Next, k-means clustering was carried out on these vectors to identify image clusters for each source. The elbow technique indicated that the optimal k values were 5, 6, 7, and 5 for HAG_D, HAG_T, SP_D, and SP_T, respectively. Finally, we filtered the clusters based on the predominant co-occurrence components identified in the subsection on component combination identification.
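A compact sketch of this step is given below, using a plain-Python k-means so that the example stays self-contained; it is not the authors' implementation. The initialization from distinct vectors, the fixed seed, and the toy photographs are assumptions, and the elbow is usually read from the inertia curve by eye rather than computed automatically.

```python
import random

COMPONENTS = [
    "architectural openings", "buildings", "decorative elements", "enclosure walls",
    "literati garden furnishings", "mountains and hills", "pathways and bridges",
    "rocks and boulders", "trees and plants", "water features",
]


def presence_vector(found):
    """10-dimensional 0/1 vector marking which components an image contains."""
    return [1 if name in found else 0 for name in COMPONENTS]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, seed=0, max_iter=100):
    """Lloyd's algorithm; k must not exceed the number of distinct vectors."""
    rng = random.Random(seed)
    distinct = sorted(set(map(tuple, vectors)))
    centers = [list(v) for v in rng.sample(distinct, k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centers[j])  # keep an empty cluster's center unchanged
        if updated == centers:
            break
        centers = updated
    inertia = sum(sq_dist(v, centers[lab]) for v, lab in zip(vectors, labels))
    return labels, inertia


photos = [
    {"trees and plants", "water features"},
    {"trees and plants", "water features"},
    {"buildings", "enclosure walls"},
    {"buildings", "enclosure walls"},
    {"pathways and bridges"},
    {"pathways and bridges"},
]
vectors = [presence_vector(p) for p in photos]
inertia_by_k = {k: kmeans(vectors, k)[1] for k in (1, 2, 3)}
labels, _ = kmeans(vectors, 3)
```

Plotting `inertia_by_k` against k gives the elbow curve; in this toy case the inertia drops to zero at k = 3 because the data contain exactly three distinct component combinations.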

Results

Semantic segmentation and component occupancy

Figure 2 illustrates representative examples of semantic segmentation results. Figure 3 displays the occurrence probability and average coverage ratios of the 10 components from different sources, thereby identifying the constituent composition of the photos and representing the overall landscape preferences. Table 2 reports the occurrence probability of each component, and Fig. 4 displays selected examples featuring the original photographs and their semantic components. Generally, across both gardens, ‘enclosure walls,’ ‘trees and plants,’ and ‘water features’ are the most prominent components and are likely integral to Chinese classical landscape design. These are followed by ‘literati garden furnishings,’ ‘decorative elements,’ and ‘buildings.’ Specifically, according to Fig. 3a and Table 2, ‘trees and plants’ frequently serve as the backdrop in most images of the CCGs favored by tourists (over 95% of the images). Eastern tourists capture ‘enclosure walls,’ ‘trees and plants,’ and ‘water features’ more frequently than their Western counterparts, particularly for the SP (see Fig. 4a, b, g, h, t). In contrast, ‘architectural openings,’ ‘mountains and hills,’ ‘pathways and bridges,’ and ‘rocks and boulders’ attract Western tourists more than Eastern visitors as subjects for travel mementos (see Fig. 4c, e, i, n, p, q, t, v).

Fig. 2: Examples of UGPs under study, and their segmented images.

Each sub-figure displays the original image on the left (Source: retrieved from Dianping.com or TripAdvisor.com and obscured by the authors for illustration) and the segmented image on the right (Source: created by the authors). UK: unknown (the component/label is excluded from this study). a–c are from the Humble Administrator’s Garden dataset (Dianping.com). Following semantic segmentation, the results primarily identify: a Architectural openings, trees and plants, and water features; b Enclosure walls, literati garden furnishings, and trees and plants; c Buildings, rocks and boulders, and water features. d–f are from the Humble Administrator’s Garden dataset (TripAdvisor.com). Following semantic segmentation, the results primarily identify: d Enclosure walls, buildings, trees and plants, and rocks and boulders; e Buildings, water features, enclosure walls, and trees and plants; f Enclosure walls, literati garden furnishings, architectural openings, and trees and plants. g–i are from the Summer Palace dataset (Dianping.com). Following semantic segmentation, the results primarily identify: g Pathways and bridges, water features, and buildings; h Trees and plants, water features, enclosure walls, and mountains and hills; i Buildings. j–l are from the Summer Palace dataset (TripAdvisor.com). Following semantic segmentation, the results primarily identify: j Buildings, water features, enclosure walls, and pathways and bridges; k Buildings and enclosure walls; l Buildings, trees and plants, enclosure walls, and water features.

Fig. 3: Histogram of component occurrence and average coverage ratio across different sources (Source: created by the authors).

a, b The x axis represents the ten fundamental compositional components. a The y axis denotes the number of images containing each component, whereas in b, it denotes the average coverage ratio of each component (in percentage). The four datasets are consistently color-coded: HAG_D (red), HAG_T (green), SP_D (blue), and SP_T (yellow). The panels correspond to a histogram of component occurrence, b average of the component coverage ratio.

Fig. 4: Examples of UGPs under study and the segmented semantic components.

(Source: retrieved from Dianping.com or TripAdvisor.com and obscured by the authors for illustration). Through semantic segmentation, the recognized semantic components in each image are identified as follows. a, b are from the HAG dataset (Dianping.com), where the segmented results identify: a buildings, enclosure walls, rocks and boulders, pathways and bridges, and trees and plants; b buildings, rocks and boulders, trees and plants, and water features. c–e are from the HAG dataset (TripAdvisor.com), where the segmented results identify: c buildings and trees and plants; d buildings and trees and plants; e architectural openings, buildings, trees and plants, and water features. f–m are from the SP dataset (Dianping.com), where the segmented results identify: f trees and plants and water features; g buildings, enclosure walls, pathways and bridges, rocks and boulders, and trees and plants; h enclosure walls and trees and plants; i buildings and mountains and hills; j pathways and bridges, rocks and boulders, and trees and plants; k buildings and trees and plants; l buildings and trees and plants; m buildings and enclosure walls. n–v are from the SP dataset (TripAdvisor.com), where the segmented results identify: n buildings, mountains and hills, trees and plants, and water features; o architectural openings, rocks and boulders, and trees and plants; p buildings, pathways and bridges, trees and plants, and water features; q pathways and bridges; r buildings and trees and plants; s pathways and bridges, rocks and boulders, trees and plants, and water features; t buildings and trees and plants; u mountains and hills, trees and plants, and water features; v buildings, decorative elements, mountains and hills, trees and plants, and water features.

Table 2 Results of the occurrence probability of the 10 fundamental compositional components across different sources

The average coverage ratio in Fig. 3b indicates that ‘trees and plants’ are not only frequently photographed but also compositionally central, suggesting that visitors, particularly domestic travelers, are continually attracted to greenery. Although ‘water features’ occur less frequently than vegetation, their relatively high coverage ratios indicate expansive and open spatial compositions. Certain components, such as ‘mountains and hills,’ ‘pathways and bridges,’ and ‘rocks and boulders,’ exhibit low frequency (see Fig. 3a) yet a moderate coverage ratio when they do occur. Similarly, buildings appear in relatively few of Eastern tourists’ SP photographs (856 images, or 34.31%; see Fig. 3a), but where present they dominate the image due to their scale and architectural prominence, with a substantial coverage ratio of 15% (ranked 2nd). Figure 4j–o, r–t may provide insights into these aspects.

Figure 5 presents histograms illustrating the distribution of coverage-ratio intervals for each component across different sources. Most components exhibit either a long-tail or single-dominant distribution pattern. Among them, ‘buildings,’ ‘mountains and hills,’ and ‘pathways and bridges’ are particularly noteworthy. For the ‘buildings’ component, SP_D demonstrates a distinct and relatively even distribution across coverage-ratio intervals (refer to Fig. 5b), indicating that buildings of varying scales—from minor elements to dominant architectural subjects—are consistently represented in these photographs (see Fig. 4i, k, l, m). In contrast, SP_T is concentrated in the lower coverage intervals (below 0.5), with very few Western tourist photographs depicting buildings occupying more than half of the frame (see Fig. 4c, d, v). This suggests that Eastern tourists are more inclined to incorporate buildings of diverse scales into their compositions at SP. Regarding ‘mountains and hills,’ SP_T shows a pronounced concentration in the [0, 0.1) interval, with over 100 images within this range (refer to Fig. 5f). This indicates that international tourists more frequently capture distant mountainous scenery as background elements (see Fig. 4d, u, v). The distribution of ‘pathways and bridges’ further reveals a consistent preference among Western tourists in both gardens (see Figs. 5g and 4c, g, p), suggesting that linear spatial elements play a stronger compositional role in their photographic practices.

Fig. 5: Histogram of coverage ratio bins for each component across different sources (Source: created by the authors).

In all panels, the x axis represents coverage ratio bins ranging from 0 to 1 with an interval of 0.1, and the y axis denotes the number of images within each bin. The four datasets are consistently color-coded: HAG_D (red), HAG_T (orange), SP_D (green), and SP_T (blue). The panels correspond to a architectural openings, b buildings, c decorative elements, d enclosure walls, e literati garden furnishings, f mountains and hills, g pathways and bridges, h rocks and boulders, i trees and plants, j water features. Each panel illustrates the distribution of coverage ratios for the corresponding component across the four datasets.

Co-occurrence among multiple components

The co-occurrence of multiple components can substantially enhance the compositional richness and esthetic complexity of garden photographs. Figure 6 presents a histogram illustrating the distribution of component combinations across different sources, thereby revealing tourists’ preferred landscape configurations. Overall, most photographs contain between one and five component types, suggesting a tendency to balance visual simplicity with compositional diversity. Despite certain similarities, notable differences emerge between visitor groups and gardens. For instance, Western tourists were less inclined to capture single-component images in HAG compared with SP. Instead, they more frequently composed photographs featuring three or four components in HAG (see Fig. 6), indicating a preference for more layered spatial compositions in this garden. In contrast, domestic tourists exhibited relatively consistent co-occurrence patterns across both gardens, suggesting stable compositional preferences regardless of site context. However, international tourists demonstrated greater variability between HAG and SP. Specifically, for Western visitors, the distributions of component counts differ markedly between the two gardens: images containing 1, 2, 5, and 6 components are more prevalent in SP than in HAG (see Fig. 4c, q, u, v), whereas photographs featuring 3, 4, 7, and 8 components are more predominant in HAG compared to SP (see Fig. 4e, g, i).

Fig. 6: Histogram of the co-occurrence of multiple compositional components across different sources (Source: created by the authors).

The x axis represents the number of compositional components identified in each image (ranging from 1 to 10), and the y axis denotes the number of images corresponding to each quantity. The four datasets are consistently color-coded: HAG_D (red), HAG_T (orange), SP_D (green), and SP_T (blue). The figure illustrates the distribution of images by the number of compositional components across the four datasets.

Composition layout recognition

Figure 7 presents the histogram of compositional layout across grid positions for different sources. Overall, essential components such as ‘trees and plants,’ ‘water features,’ ‘literati garden furnishings,’ and ‘enclosure walls,’ which visitors frequently mention, are most frequently located in the central and bottom rows of the frame. This distribution suggests that these elements typically occupy the foreground or near-ground spatial zones in tourist photographs. By contrast, the upper row contains fewer components, indicating either reduced visual emphasis or a background function within the composition. Clear cultural distinctions are also evident. Photographs taken by Eastern tourists tend to exhibit more centralized or balanced layouts, whereas Western tourists more frequently produce bottom-heavy compositions, concentrating visual weight in the lower portion of the frame. Further examination reveals that Eastern tourists demonstrate a stronger tendency to utilize upper-row space, particularly in HAG, suggesting an effort to incorporate vertically layered elements into the frame. However, this tendency is less pronounced in SP. In contrast, Western tourists consistently favor positioning components in the center and bottom rows, reinforcing a compositional strategy anchored in foreground emphasis. At the component level, ‘water features’ are frequently arranged in bottom-left positions (see Fig. 4a, f, h, k, n, p, u), indicating a recurring spatial preference in framing water elements. ‘Buildings’ in SP predominantly occupy bottom-row positions (see Fig. 4l, o, r), whereas in HAG they are more often centered within the frame (see Fig. 4a, b, d). Additionally, as shown in Figs. 7 and 4c, g, q, Western tourists tend to position ‘pathways and bridges’ toward the bottom-left or bottom-right areas, further emphasizing directional or linear foreground elements in their compositions.

Fig. 7: Histogram of compositional layout within each spatial grid across different sources (Source: created by the authors).

In each panel, the x axis represents the ten fundamental compositional components, and the y axis denotes the number of images in which each component appears within the corresponding grid position. The four datasets are consistently color-coded: HAG_D (red), HAG_T (orange), SP_D (green), and SP_T (blue). The figure consists of nine panels corresponding to the spatial grid positions defined by the rule-of-thirds framework: a top-left, b top-center, c top-right, d middle-left, e middle-center, f middle-right, g bottom-left, h bottom-center, and i bottom-right. Each panel illustrates the distribution of components within its respective grid position.
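The grid assignment behind these panels can be illustrated with a minimal sketch. It assumes a segmentation mask stored as a nested list of component names, and it counts a component as present in a cell when it covers at least a chosen share of that cell. The function names and the `min_share` threshold are hypothetical; the text does not state the authors' actual assignment rule.

```python
# Nine rule-of-thirds cells, in the same order as panels a to i.
GRID_LABELS = [
    "top-left", "top-center", "top-right",
    "middle-left", "middle-center", "middle-right",
    "bottom-left", "bottom-center", "bottom-right",
]


def grid_cell(row, col, height, width):
    """Return the index (0-8) of the rule-of-thirds cell containing a pixel."""
    grid_row = min(3 * row // height, 2)
    grid_col = min(3 * col // width, 2)
    return 3 * grid_row + grid_col


def cells_occupied(mask, component, min_share=0.05):
    """List the cells in which `component` covers at least `min_share` of the cell.

    `min_share` is an illustrative assumption, not a value from the study.
    """
    height, width = len(mask), len(mask[0])
    hits = [0] * 9
    totals = [0] * 9
    for row in range(height):
        for col in range(width):
            idx = grid_cell(row, col, height, width)
            totals[idx] += 1
            if mask[row][col] == component:
                hits[idx] += 1
    return [
        GRID_LABELS[i]
        for i in range(9)
        if totals[i] and hits[i] / totals[i] >= min_share
    ]


# Toy 6 x 6 mask: water fills the bottom-left cell, trees fill the rest.
toy_mask = [["trees and plants"] * 6 for _ in range(6)]
for r in (4, 5):
    for c in (0, 1):
        toy_mask[r][c] = "water features"

print(cells_occupied(toy_mask, "water features"))
```

Summing such per-image cell lists over a dataset would give per-cell counts of the kind plotted in each panel.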

Composition hotspot detection

Table 3 summarizes the clustering results, listing potential photography locations in the third column and corresponding illustrative examples in the final column. These locations represent popular tourist photography hotspots identified using semantic analysis. At HAG, tourists most frequently capture the lotus pond or the Zigzag Bridge area, where these features serve as focal subjects and trees, flora, and enclosure walls form the background. These selections center on symbolically significant and iconic features, consistent with HAG’s literati narrative. Cluster #2 indicates that Western tourists tend to prefer viewpoints featuring notable architecture and water elements, whereas Eastern tourists more often include a wider range of landscape components within courtyard settings. For Eastern tourists, distinct clusters (i.e., Clusters #3 and #5) highlight repeated emphasis on particular architectural and natural attractions. For Western tourists, unique clusters (i.e., Clusters #3, #4, #5, and #6) typically feature ‘buildings,’ ‘decorative elements,’ ‘enclosure walls,’ ‘literati garden furnishings,’ ‘trees and plants,’ and ‘water features’ in varied configurations. Together, these patterns point to greater emphasis on architectural components, pathway variety, and framed views incorporating water and islands. Among Western tourists, the ‘Elegant Bamboo House Courtyard’ (Cluster #4) and ‘Moon Gate Views’ (Cluster #6) are the only clusters in which ‘architectural openings’ emerge as a key element; this component does not appear as a key element in any Eastern tourist cluster. This pattern further indicates that Western tourists are more likely to focus on specific architectural openings and framed views. In summary, Eastern tourists more often produce balanced compositions that foreground literati-associated elements, whereas Western tourists more frequently capture architecturally framed and visually striking perspectives.

Table 3 Image clusters across different sources based on compositional components

In SP, the clusters show greater thematic diversity, although their overall sizes are generally smaller than those in HAG. Approximately half of the clusters include ‘buildings’ (three of seven for Eastern tourists; two of five for Western tourists), suggesting that tourists value iconic structures and enclosed views, potentially shaped by historical education, cultural identity, and national pride associated with the garden. Both Eastern and Western tourists at SP consistently show interest in locations with prominent ‘water features’, including the “Kunming Lake Shoreline,” “Marble Boat,” “Suzhou Street Area,” “Hall of Embracing the Universe & Lakeside Views,” and “Seventeen-Arch Bridge & South Lake Island Approaches.” Taken together, the expansive water body and its associated features appear to be major attractions for both tourist groups. Specifically, the primary cluster for Eastern tourists at SP centers on the expansive “Kunming Lake Shoreline & Willow-lined Walks,” incorporating ‘decorative elements’, ‘trees and plants’, and ‘water features’. In contrast, Western tourists focus more on the “Marble Boat (Qingyan Boat)” cluster, which highlights ‘decorative elements’, ‘literati garden furnishings’, and ‘water features’. Overall, Eastern tourists appear to be drawn to the lake’s broad esthetic appeal and expansive views, whereas Western tourists more often focus on specific, prominent architectural features embedded in the landscape. The “Long Corridor” (Cluster #2 in SP_D) emerges as a key attraction for Eastern tourists, whereas the “Suzhou Street Area” (Cluster #2 in SP_T) draws substantial interest among Western tourists, possibly reflecting its distinctive cultural experience and scenic setting. The “Tower of Buddhist Incense Complex and Views” (Cluster #4 in SP_D) contrasts with the “Temple Complexes on Longevity Hill” (Cluster #3 in SP_T). 
Both tourist groups show interest in religious and architectural complexes on Longevity Hill, but their focal emphases differ subtly. In conclusion, Eastern tourists tend to engage with imperial narratives and themes of harmony, whereas Western tourists are more often drawn to dramatic, visually symbolic sites.
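The Discussion names k-means as the clustering method behind these hotspots. The sketch below shows one way such clusters and their key elements could be derived, assuming each image is encoded as a binary vector marking which components are present. The encoding, the deterministic farthest-point initialization, and the 0.5 threshold for a key element are all illustrative assumptions rather than the authors' settings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(vectors, k):
    """Deterministic seeding (an assumption): start from the first vector,
    then repeatedly add the vector farthest from all chosen centroids."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = farthest_point_init(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def key_components(centroid, names, threshold=0.5):
    """Components present in at least `threshold` of a cluster's images."""
    return [name for name, share in zip(names, centroid) if share >= threshold]


# Four of the ten components, for brevity; rows are hypothetical images.
COMPONENTS = ["trees and plants", "water features", "buildings", "architectural openings"]
images = [
    [1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0],
]
labels, centroids = kmeans(images, k=2)
for j, centroid in enumerate(centroids):
    print(j, key_components(centroid, COMPONENTS))
```

With binary presence vectors, each centroid entry is the share of cluster images containing that component, which makes the key elements of a cluster straightforward to read off.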

Discussion

The significance of heritage cannot be fully understood without considering human involvement and perception28. Photographic compositions reflect travelers’ perceived preferences and on-site behaviors at a given heritage location. This study applies a quantitative framework to analyze the photographic compositions of UGPs taken by Chinese and Western visitors in northern and southern CCGs, using 4,980 images from Dianping and 4,024 images from TripAdvisor. A comparison of component occupancy, compositional layouts, and perceived hotspots highlights differences in visual attention between Chinese and Western tourists. The analysis integrates quantitative approaches such as the Rule of Thirds, k-means clustering, and DeepLab. The findings reveal distinct patterns between Chinese and Western visitors and a north–south perceptual disparity across the gardens. Specifically: 1) Tourists preferentially include greenery; among international travelers, attention is more frequently directed toward ‘architectural openings,’ ‘mountains and hills,’ ‘pathways and bridges,’ and ‘rocks and boulders’. 2) Although some components appear infrequently, they can disproportionately influence the photograph; in domestic images, ‘buildings’ show wide variation in coverage ratios. 3) Most photographs incorporate one to five component types to emphasize the garden landscape; domestic tourists exhibit a stable co-occurrence pattern across gardens, whereas overseas travelers capture a broader range of components. 4) Photographs by Eastern travelers tend to exhibit more centralized or balanced compositions, whereas those by Western visitors more often exhibit pronounced bottom-heavy layouts. 5) Eastern and Western travelers appear to favor different photographic hotspots. These results advance understanding of destination-image reconfiguration in heritage landscapes by documenting diverse representations of tourists’ perspectives.
They also inform strategies to align destination appeal with cross-cultural differences. The findings support sustainable heritage management by enhancing managers’ ability to recognize and respond to shifts in photo-producer perspectives. While HAG and SP serve as illustrative cases, the analytical approach is transferable to other heritage sites. Building on critical heritage studies, this study evaluates how tourists ascribe significance to heritage. By examining photographic composition mechanisms in cross-cultural travel imagery, the study provides new evidence for understanding how heritage assets are appreciated and assigned meaning. First, this study integrates a DL–based semantic segmentation model with spatial grid analysis to identify an expanded set of garden-related compositional components and map their spatial distributions in cultural heritage photographs. This framework enables quantitative links to be established between compositional elements—component co-occurrence, layout patterns, and visual hotspots—and tourists’ photographic preferences. Second, the study provides an objective assessment of UGPs. Using a comprehensive analytical pipeline, the findings capture photographers’ viewpoints and landscape preferences while clarifying the potential implications of compositional elements for garden design and planning. Third, the study compares perceptions across two tourist groups with distinct cultural backgrounds and two renowned gardens, highlighting how cultural diversity and CCG type shape tourists’ interpretations.

This study provides theoretical contributions to heritage-perception research. First, we introduce a DL-based semantic segmentation workflow for photographic analysis that identifies a broad set of landscape components and characterizes their compositional arrangements in studies of historic gardens. Specifically, 150 semantic labels are consolidated into 10 garden-element categories to examine the photographic compositions of UGPs in CCGs. We then quantify component co-occurrence, layout patterns, and visual hotspots using presence–coverage indices and their spatial distributions. These presence–coverage metrics differ from most related work, which has relied on architectural drawings or satellite imagery to infer spatial configurations41,45,46. Although this approach does not directly evaluate architectural authenticity, vegetation condition, or decorative quality, the images capture tourists’ holistic visual impressions of the gardens. The proposed presence–coverage indices and their applications extend the literature on the conceptual representation of destination image. Moreover, pixel-level segmentation enables the automated delineation of landscape components in historic gardens. Compared with object detection approaches used in prior studies27,28, the proposed semantic analysis delineates pixel-accurate boundaries and shapes. It also accommodates non-rigid or amorphous regions, supports comprehensive scene parsing, improves sensitivity to small or overlapping objects, and provides occupancy grids for geometry-sensitive applications.
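The consolidation of semantic labels and the presence–coverage indices can be sketched as follows. The label-to-category table is a small illustrative subset and not the authors' actual 150-to-10 lookup. The index definitions are also assumptions made for this sketch: presence is the share of images containing a category, and coverage is the mean pixel share among the images where it appears.

```python
from collections import Counter

# Illustrative subset only: a few fine-grained labels mapped to garden categories.
LABEL_TO_CATEGORY = {
    "tree": "trees and plants",
    "plant": "trees and plants",
    "water": "water features",
    "lake": "water features",
    "building": "buildings",
    "wall": "enclosure walls",
}


def coverage_by_category(mask):
    """Return {category: share of image pixels} for one segmentation mask.

    `mask` is a nested list of label names; unmapped labels (e.g. sky) are ignored.
    """
    counts = Counter()
    total = 0
    for row in mask:
        for label in row:
            total += 1
            category = LABEL_TO_CATEGORY.get(label)
            if category is not None:
                counts[category] += 1
    return {category: n / total for category, n in counts.items()}


def presence_and_coverage(masks):
    """Aggregate per-image coverage into dataset-level presence and coverage."""
    per_image = [coverage_by_category(mask) for mask in masks]
    categories = sorted({c for coverage in per_image for c in coverage})
    summary = {}
    for category in categories:
        shares = [cov[category] for cov in per_image if category in cov]
        summary[category] = {
            "presence": len(shares) / len(per_image),
            "coverage": sum(shares) / len(shares),
        }
    return summary


# Two toy 2 x 2 masks standing in for segmented photographs.
toy_masks = [
    [["tree", "lake"], ["plant", "sky"]],
    [["building", "building"], ["tree", "sky"]],
]
print(presence_and_coverage(toy_masks))
```

Keeping the two indices separate reflects the observation that a component can appear rarely yet occupy a large part of the frame when it does.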

Second, this study investigates landscape preferences by examining a comprehensive set of compositional elements in historic-garden photographs from the image producers’ perspective. These compositional elements encompass cultural syntagms (i.e., component combinations), structural conventions (i.e., layout patterns), and interaction anchors (i.e., visual hotspots). This producer-oriented perspective differs from most prior studies, which typically infer attention areas from viewers (e.g., visual preferences)3,8,11,31,36. Because UGPs are created during travel, they provide a complementary lens to observer-centered evaluations and can therefore support cautious inferences about travelers’ shooting intentions. By analyzing UGPs, researchers can characterize photographer–landscape interactions and infer patterns of preference and human–environment engagement10. Our analysis extends Li and Qiu’s study28 by moving from five landscape types, two-element co-occurrences, and within-garden locations to ten constituent compositions, all-component co-occurrences, and within-photo spatial structures. It also foregrounds core compositional elements (e.g., layouts and hotspots) to link these patterns with tourists’ preferences. Likewise, the scale of co-occurrence combinations is comparable to that reported by Wang et al.8 in a different context, who found that integrating intricate building forms can enhance visual attractiveness, whereas excessive building density may induce visual fatigue and discomfort. Building on this line of work, we incorporate a broader range of constituent components beyond buildings using an automated, segmentation-based approach (i.e., semantic segmentation) rather than viewer-based measurements (i.e., eye-tracking).

Third, this study proposes a quantitative framework for analyzing tourists’ perceptions in CCGs. Existing methods for examining tourists’ perceptions of cultural heritage have been predominantly qualitative16,28,47, relying mainly on questionnaires5,32,33,35,37,48,49,50,51 and interviews7,51,52, as well as approaches such as netnography6 and grounded theory16; comprehensive quantitative evidence remains limited. We use DL-based recognition techniques to automatically extract salient information from photographs, which may be underrepresented in subjective evaluations, as suggested in other contexts28. We also employ spatial grid analysis to compute cell-occupancy maps, enabling an automated characterization of component spatiality. Our work highlights the value of large-scale datasets and quantitative approaches for decoding travelers’ perceptions52 and identifying photographic hotspots, aligning with recent initiatives aimed at integrating tourist perceptions into heritage management2 and emphasizing the intersection of heritage landscapes with environmental context, human cognition, and sensory experience35,38.

Fourth, this study extends the literature on CCGs by foregrounding cross-cultural destination images captured by Eastern and Western visitors. This cross-cultural comparison identifies both shared regularities and systematic differences in landscape components and their compositions in CCGs, consistent with observations reported in another context8. Our results suggest that Eastern tourists tend to seek culturally immersive, context-rich scenes, whereas Western travelers more often favor visually distinctive subjects and object-centered compositions. These patterns are in line with prior evidence53,54. Cultural psychology suggests that individuals internalize culturally preferred modes of attention through sustained exposure to cultural elements55. This “cultural imprinting” may heighten sensitivity to culturally normative cues: individuals with a more independent self-construal tend to privilege focal objects and people, whereas those with a more interdependent self-construal are more attuned to contextual information and to the relationships between focal subjects and their surroundings. In our sample, these tendencies are reflected in Western visitors’ greater emphasis on salient built features (e.g., buildings and walkways) and a preference for more structured viewpoints, which may echo Renaissance-derived conventions of proportion and order. This tendency aligns with the “external perspective” (Ich-Ferne), which foregrounds an object or topic as distinct from its environment. By contrast, Eastern visitors more often adopt dynamic, multidimensional viewpoints that emphasize harmony and fluidity—consistent with Taoist thought and Chinese esthetic traditions—and this tendency is reflected in balanced, often centered compositions (e.g., plants and water). This resonates with the “internal perspective” (Ich-Nähe) in psychology, in which the environment is perceived as interrelated and holistic53, as also suggested by traditional Chinese calligraphy54. 
Together, these psychological and cultural lenses help explain how compositional arrangements may mirror tourists’ cultural orientations and affective ties to the places they document53. The results also suggest that service providers and site managers should attend to visitors’ perspectives when interpreting historic-site values7. Overall, our framework supports research on heritage communication and cross-cultural engagement.

Fifth, this study addresses a gap in the literature by comparatively evaluating northern and southern CCGs. Globalization can erode regional cultures and contribute to the homogenization of urban landscapes across many countries41. Accordingly, a careful examination of the distinctive attributes of local cultural heritage landscapes is warranted41. This study highlights the importance of attending to regional historical and cultural landscapes when interpreting heritage sites. By adopting a cross-cultural lens, we reveal both cognitive diversity and cultural coherence across multiple gardens within destination images. This diversity is reflected in the different co-occurrence patterns of component combinations observed in both gardens, as detailed in the subsection on co-occurrence among multiple components. By contrast, the enclosure walls at HAG and the buildings at SP remain salient in tourists’ collective perception, despite cross-cultural differences between Western and Eastern visitors.

Our findings also have practical implications for heritage management and destination marketing. First, destination marketing organizations (DMOs) can strengthen photography-based campaigns by tailoring content to tourists from different cultural backgrounds. Such culturally targeted strategies can stimulate interest and travel behavior across markets56. DMOs can leverage the distinct attributes of northern and southern gardens to develop market-specific promotional materials for Eastern and Western audiences. For example, during peak periods of Western visitation, DMOs could feature under-visited buildings framed by literati garden furnishings and flora to encourage exploration and photography, consistent with Western tourists’ preference for more diverse component combinations. This approach could help redistribute visitor flows, reducing pressure on heavily used viewpoints and mitigating localized environmental impacts. Moreover, our results show that ‘buildings’ form distinct compositional layouts in HAG and SP, enabling site-specific selection of advertising imagery. Decorative elements that attract Western tourists can be accentuated to enhance visual salience in promotional materials, consistent with prior work5,15. Hotspots identified via semantic segmentation and spatial grid analysis can further guide the selection of vantage points and framing for promotional photography. Overall, recognizing regional and cultural variation in heritage interpretation can support more inclusive and sustainable heritage communication through diverse photographic narratives that engage different audiences.

Second, this study can help DMOs evaluate destination-focused promotional initiatives, improve landscape quality, and refine marketing strategies using evidence from UGPs. Managers can examine how branding and on-site management shape tourists’ impressions not only through official operations and personnel but also through tourists’ own perspectives. In particular, they can assess whether user-generated contributions reinforce favorable destination perceptions56, thereby providing an operational way to evaluate whether UGPs cultivate positive destination images. Building on our findings, managers can adjust marketing strategies in light of component co-occurrence, compositional layouts, and hotspots extracted from UGPs. A clearer understanding of photo-producers’ landscape preferences can help identify target markets, inform the design of tourism products and settings aligned with visitors’ expectations, and enhance satisfaction by highlighting distinctive attributes or directing attention to under-visited places28. For example, well-designed garden paths and visually coherent layouts may increase a garden’s appeal, consistent with Liu et al.’s57 observation of a close relationship between human activities and the natural environment. Given the high visibility of water across both gardens—and the prominence of buildings among Western tourists at SP—DMOs could consider enhancing lake-based experiences (e.g., boat services) and strengthening architectural presentation at SP (e.g., lantern lighting). More broadly, landscape style and spatial layout should be considered in garden design to improve overall tourist experience and satisfaction and, as previously reported, to reduce stress, improve mood, and encourage physical activity10. Overall, these results provide managers with an analytical lens for linking marketing actions with tourists’ perceptions.

Third, this study may help local governments develop evidence-informed policies for heritage conservation and enhance tourist experience value through co-creation. By examining cross-cultural landscape preferences and multi-garden design, our study underscores the need for coordinated strategies to manage heritage assets, consistent with prior work58. Local governments should improve the perceptual accessibility of diverse heritage expressions and foster cross-cultural engagement among tourists from different cultural backgrounds, thereby supporting conservation and enriching tourist experiences. Governments can promote the photography-relevant features identified here—such as salient architectural components, fine-grained spatial layouts, and effective framing angles—to support conservation and help international tourists overcome communication barriers. This, in turn, calls for local governments to integrate heritage landscapes into urban design as tangible cultural representations, creating perceptual links between global travelers and destination landscapes and, ultimately, strengthening tourists’ co-creation of heritage value.

Fourth, incorporating UGPs into organization-generated photographs may motivate tourists to produce and share higher-quality images online. Tourists may favor value-added offerings that support the creation of UGPs. In turn, these user-led modes of place promotion may generate credible sources of destination imagery and viable alternatives to official representations59. This approach directly involves tourists in creating promotional content60. Such co-created content may elicit more positive responses during on-site photography, thereby supporting heritage digitization and marketing, informing service optimization, and improving interpretation and display quality—ultimately enhancing tourist satisfaction and loyalty.

This study has several limitations that suggest directions for future research. First, the robustness of our findings may be constrained because the study relied on two platforms (Dianping and TripAdvisor) and a limited sample. Images from other social media platforms (e.g., Flickr, Wikimedia Commons, or Xiaohongshu) and travel websites (e.g., Mafengwo.cn or CTrip) were not included in our sample. Analyzing a more comprehensive dataset can mitigate sampling error61; therefore, the model should be validated across additional channels, particularly international platforms with abundant content (e.g., Flickr and Wikimedia Commons). Second, because of data availability, we examined only two well-known Chinese gardens, limiting the scope of cross-cultural comparison. Other renowned gardens and additional source markets were not examined; future work could expand to these contexts to further test cross-cultural differences. Third, following Sun et al.’s study16, Eastern and Western tourist perceptions were proxied using Dianping.com and TripAdvisor.com, respectively. However, images on Dianping.com may not originate exclusively from Chinese tourists, and TripAdvisor.com posts may include contributions from Chinese users. Consequently, proxy-based misclassification may introduce sample bias and affect cross-group comparability. Fourth, although architectural specifics (e.g., engraving patterns, typologies, and decorative features of windows and doors) can enhance destination imagery6,15, our semantic segmentation pipeline does not capture such fine-grained details. In this instance, DeepLab trained on ADE20K did not support reliable identification of these features. Similarly, some elements (e.g., floors and animals) were excluded from the analysis; future work could incorporate these categories to extend the semantic label set. Fifth, this study assessed the feasibility of applying DeepLab v3+ to cultural-heritage imagery. 
To achieve higher segmentation accuracy, the model should be adapted using domain-specific, human-annotated images; this constitutes a key direction for future work to improve semantic-segmentation precision. Sixth, for the Dianping dataset, we cropped the bottom 15% of each image to reduce watermark interference. Although the main content was largely preserved, cropping may have affected component-level segmentation performance. Future work could explore more robust preprocessing (e.g., vision transformers62 and dynamic networks63) to mitigate cropping-related artifacts. Lastly, we focused on semantic segmentation and spatial grid analysis to identify compositional essentials and examine how these features relate to tourist preferences, emphasizing the cognitive dimension of tourists’ destination images. Future studies could examine UGPs from an affective perspective, incorporate content-provider demographics, and apply complementary methods (e.g., color extraction, object detection, sentiment analysis, and esthetic rating) to further evaluate tourist preferences.
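The watermark cropping described above can be illustrated with a minimal sketch that drops the bottom 15% of rows from an image stored as a list of pixel rows. The function name and the rounding rule are hypothetical; the text gives only the 15% figure.

```python
def crop_bottom(rows, fraction=0.15):
    """Return the image rows with the bottom `fraction` of the height removed.

    Rounding to the nearest whole row is an assumption of this sketch.
    """
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    rows_to_drop = round(len(rows) * fraction)
    return rows[: len(rows) - rows_to_drop]


# A toy 20-row image in which each row holds its own index.
toy_image = [[r] * 4 for r in range(20)]
cropped = crop_bottom(toy_image)
print(len(cropped))
```

Because cropping shifts the lower boundary of the frame, any grid-based layout analysis would be computed on the cropped height rather than the original one.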