Introduction

Ancient Chinese architecture (ACA), as an important part of traditional culture, carries a wealth of information about history, culture, religion, and art. These buildings, including palaces and temples, have unique architectural styles and characteristics, making their preservation particularly important. In recent years, the development of digital preservation technology1 and deep learning2,3,4 has driven rapid advancements in image processing for ancient architecture. These technologies record and recognize ancient architecture more accurately, helping to inherit and display its historical and cultural value. In particular, accurately assigning semantic categories to ancient architecture allows traditional Chinese architecture to be better understood and recorded, which is conducive to the protection and inheritance of cultural heritage. Nonetheless, due to the complex structural features and diverse categories of ACA images, a single label can rarely describe the overall content of an image. Therefore, multi-label ACA classification emerges as a more effective approach than single-label classification, providing vital technical assistance for the conservation of ancient architecture.

In general, ancient buildings exhibit similarities, with structures from different styles or periods sharing comparable visual features. This makes it difficult for a model to distinguish them accurately. Thus, one of the key challenges in multi-label ACA image classification is resolving the confusion caused by the similarity of architectural features. Currently, multi-label image classification has made great progress in various applications such as textile fiber5 and mineral identification6. Two important aspects, namely spatial relation learning7,8 and label correlation9,10,11, are the main considerations in multi-label image classification research. For example, ResNet-SRN12 captures the potential spatial relationships of labels using an attention mechanism and provides spatial regularization for the final classification results. ML-GCN13 introduces GCN into the multi-label image classification task to explicitly establish the correlation between labels. MMDSR14 proposes a multi-layered, multi-perspective dynamic semantic representation to enhance the modeling of relationships between labels. NOAH15 introduces a pairwise category attention mechanism to improve classification performance. Inspired by these developments in multi-label image classification, research has been devoted to constructing ACA image classification models. Such studies mainly focus on two aspects: feature representation and semantic analysis of ancient architectural images. On the one hand, feature representation emphasizes designing effective feature extraction methods that enable the model to extract distinguishable features from ACA images. For complex ancient architectural structures and textures, some studies use pre-trained CNN models to extract features16,17. For example, Wu et al.18 proposed an automatic classification algorithm for ancient Buddhist architecture based on local feature learning in a multicultural context. Compared to traditional feature extraction methods, deep learning-based local feature learning demonstrates better expressive power. On the other hand, in terms of semantic analysis, researchers are dedicated to understanding the semantic information in ancient architectural images, including the function, style, and period of the buildings, through techniques such as deep learning19,20. Additionally, graph-related techniques are increasingly being employed to explore the semantics of ACA images. Zhang et al.20 proposed a method that combines visual attention mechanisms with graph convolutional networks to enrich semantic information.

The aforementioned ACA image classification methods share common shortcomings: their feature learning schemes rely solely on CNNs16,18 to learn local relationships in ancient architecture images, and therefore cannot capture long-range dependencies or preserve positional information. Furthermore, these methods ignore the correlations between image regions and semantic labels, failing to effectively utilize accurate semantic context information. This leads to an inability to resolve the confusion caused by similarities between ancient buildings, resulting in inaccurate classification. Specifically, the confusion arises for the following reasons: (i) When different ancient buildings share common features such as similar structures, ornamentation, and styles, they may present a similar appearance in the image. As shown in Fig. 1a, the Forbidden City and the Summer Palace are both ancient palace complexes from the Ming and Qing Dynasties, sharing a similar court architectural style. They both employ the typical design of yellow tiles and red walls, forming striking visual features. When the model focuses only on these common local features, it may get confused by their high similarity, leading to difficulty in distinguishing between them. (ii) The rich diversity of ancient architecture means that different categories may have significant differences in design and form. However, because of historical, regional, and stylistic diversity, ancient architecture may still exhibit similar appearances in the image. As shown in Fig. 1b, although Fayu Temple and Zhantan Temple differ significantly in design and style, they may exhibit similar appearances in the image. This similarity arises because they are temple structures within the same religious tradition or cultural context, sharing common features such as roof design and architectural structure. In this case, the confusion mainly arises from shared features within this diversity, which lack semantic context to help distinguish them, thereby increasing the complexity of the classification task.

Fig. 1: Samples of feature similarity.
figure 1

a Local common feature similarity leads to confusion. b Some features in the diversity present a similar appearance, leading to confusion.

To address the above issues, this paper proposes a novel multi-label ACA image classification model, the coordinate-to-semantic attention network (CSA-Net). This model identifies the spatial relationships between different architectural features and adaptively aligns the feature representations of images and labels, thereby capturing their correlation and semantic information. Specifically, to finely focus on specific regions in the image and capture long-range spatial dependencies, we introduce a coordinate attention (CA) mechanism to establish positional information. This approach allows the model to consider more comprehensive and abstract features, better distinguishing between the common features of ACA. Furthermore, to generate semantically relevant visual representations for each label, we propose a semantic attention (SA) module that comprehensively utilizes information from both images and labels. This module improves the model’s understanding of the correlation between them and captures the semantic information within images, thereby better understanding the differences between various categories of ACA. Extensive experiments and analysis were conducted on the ACA dataset to verify the superiority of our method. The main contributions of this paper are summarized as follows:

  • We introduce a coordinate attention mechanism to optimize the feature representation of ancient architectural images, capturing the features of ACA more accurately.

  • We design a semantic attention module to achieve alignment between image features and semantics, generating relevant visual representations for each label, which helps to understand the image semantics.

  • Extensive experimental results on the ACA dataset and other datasets demonstrate that the proposed coordinate-to-semantic attention network (CSA-Net) classification model significantly improves multi-label ACA image classification performance compared to existing models.

Related work

Ancient architecture image classification

In ancient architecture image classification, each image may involve multiple aspects or features, such as architectural style, materials, and age. Existing methods are mainly studied regarding the feature representation and semantic analysis for ancient architecture images.

Feature representation methods for ancient architecture

To better represent ancient architecture, researchers have explored various feature representation methods. Traditional methods21,22,23 use handcrafted feature representations such as Local Binary Patterns (LBP) and color histograms. These methods describe images based on local features like edges, corner points, and textures. For example, Yanni et al.24 proposed an algorithm that combines local features with shape contour matching, leveraging unique local features and shape information to recognize buildings under different viewpoints, lighting conditions, and occlusions. However, ACA images face challenges not only from traditional noise factors like illumination, viewpoint, and background but also from the complexity of structural features and uneven feature distribution. Therefore, traditional feature extraction methods often struggle to capture sufficiently rich and accurate information.

Recently, some research efforts have turned to deep learning-based methods, such as using pre-trained convolutional neural network (CNN)25,26 models for feature extraction. Wu et al.18 proposed a classification method based on local feature learning for ancient Buddhist buildings. This model enhances adaptability by incorporating architectural features from different cultural contexts, thereby improving its performance across various geographical and cultural contexts. A method16 using image-to-text technology employs deep learning to extract features from images and generate text descriptions of neoclassical architectural styles. Consequently, deep learning-based local feature learning methods can automatically learn abstract features from images. Wang et al.27 proposed a method for recognizing ancient architecture based on an improved Inception-V3 model with asymmetric convolutions. This approach addresses the issues of small dataset size and insufficient model training for ancient architecture. Another study17 used CNN technology to automatically recognize roof ridge beasts in ancient architecture images, significantly improving the efficiency and accuracy of identifying these features and aiding in the preservation and study of ancient buildings. In28, the authors utilized CNNs to classify building typologies in different regions of Indonesia. Wang et al.25 used a CNN to extract feature representations from architectural images and introduced a channel-spatial attention mechanism to capture the correlation between specific channels and spatial positions in the image. This joint optimization of the CNN and the channel-spatial attention module resulted in accurate classification of architectural styles. Zhou et al.26 proposed a method for dating bronze ding vessels by effectively combining deep learning techniques with archeological knowledge, significantly improving the accuracy and automation of dating these artifacts. These deep learning methods facilitate end-to-end learning, automatically extracting and learning significant features from ancient architectural images.

Semantic analysis methods for ancient architecture and related cultural heritage

Ancient architecture often has beautiful contours and colorful forms, which not only record the architectural layout, hierarchical structure, building forms, and colors, but also contain rich cultural connotations. By analyzing the semantics of ancient architecture, we can deeply understand the culture, values, and information related to the social background they represent. In29, the author proposes a semantic modeling system for the Southeast region, which transforms traditional geometric units (such as points, lines, and triangles) into architectural elements unique to the region. This approach ensures consistency in architectural style and avoids the shortcomings of the syntactic system. To improve the traditional method of modeling geometric units, a flexible semantic approach has been proposed in30, which is based on the characteristics of ancient Chinese style architecture. This method involves a two-level semantic decomposition of roofs: structural and decorative functional categories. A study31 proposed a multimodal neural network framework combining a convolutional neural network for visual feature learning and semantic modeling of text data for the classification and annotation of cultural assets. Another study32 integrated additional text descriptions to capture intra- and inter-modal interactions in intangible cultural heritage images and textual descriptions, providing rich complementary semantic information.

Channel & spatial attention

Channel and spatial attention combines the advantages of channel attention and spatial attention, adaptively selecting important objects and regions in the image. This method effectively utilizes information from various channels and spatial locations to comprehensively extract features, thereby enhancing the model’s feature representation ability. SENet33 was the first to propose an effective mechanism to automatically learn the importance of different channels, explicitly modeling the interdependencies among channels to enhance the representation of important features. CBAM34 sequentially applies channel attention to weight each channel and then spatial attention to weight the feature map along the spatial dimension, focusing on important information in both dimensions. BAM35 directly combines channel and spatial attention to obtain an attention matrix that integrates spatial and channel information.

However, the SE block uses global pooling to aggregate global spatial information before modeling cross-channel relationships, which ignores the importance of positional information. Meanwhile, CBAM and BAM use convolution operations to capture local relationships but fail to model long-range dependencies effectively. To resolve these problems, a novel coordinate attention mechanism36 integrates positional information into channel attention, allowing the network to focus on larger significant regions at reduced computational cost. The mechanism combines spatial information in the horizontal and vertical directions and interacts effectively with channel attention to enhance the modeling of joint spatial and channel dependence, constituting a unified, multidimensional attention mechanism. This directional feature modeling is well-suited to key structures in ancient architecture that exhibit symmetry and directionality. Furthermore, coordinate attention can precisely identify the locations of target objects, offering a broader receptive field than CBAM and BAM and significantly enhancing the expressiveness of the features. Whereas CBAM and BAM combine channel and spatial attention, coordinate attention builds on this foundation to achieve unified modeling, representing a more advanced integration.

Method

Overview

In this section, we introduce a new framework called CSA-Net, which separately depicts the spatial and semantic relations of ACA images, as shown in Fig. 2. The coordinate attention mechanism, detailed in “Coordinate attention mechanism”, processes information from different positions of each part of the ancient architecture and mines the long-range spatial dependencies of ACA image features. In “Semantic attention module”, we propose a semantic attention module that aligns ACA label embeddings and feature maps to accurately capture the key information in the semantic context and efficiently integrate information from different modalities, providing optimal interdependent feature representations and semantic information for the final multi-label classification.

Fig. 2
figure 2

Illustration of the proposed CSA-Net framework.

Coordinate attention mechanism

ACA image feature extraction

Deep convolutional neural networks are widely used as backbone networks for feature learning. We utilize a pre-trained ResNet-101 to capture the spatial features of the ACA image. The higher layers of convolutional networks contain rich semantic information, making the extracted features more discriminative. Therefore, given an ACA image I, the feature map extracted from the last convolutional layer of the CNN model is represented as \(X\in {R}^{C\times H\times W}\), where C is the number of channels and H and W are the height and width of the feature map, respectively. Specifically,

$$X={f}_{{\rm{cnn}}}(I;\theta )$$
(1)

where \({f}_{{\rm{cnn}}}(\cdot )\) is the feature extraction module, X is the generated feature map, and θ denotes the weights of the network.
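As an illustration, Eq. (1) can be realized with a pre-trained torchvision ResNet-101 by discarding its pooling and classification layers; the following minimal PyTorch sketch uses illustrative names and is not taken from our released code:

```python
# Minimal sketch of Eq. (1): extract the last-conv-layer feature map
# X in R^{C x H x W} from a pre-trained ResNet-101 (names are illustrative).
import torch
import torchvision.models as models

backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
# Drop global average pooling and the fully connected classifier, keeping
# everything up to the last convolutional stage.
f_cnn = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 448, 448)  # a 448 x 448 RGB image I (batch of 1)
with torch.no_grad():
    X = f_cnn(image)                 # X has shape (1, 2048, 14, 14)
```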

Coordinate attention

Common features across ACA often lead to visual similarity. Thus, identifying the position and spatial relationship of different architectural features helps distinguish ACA with similar local features but different overall structures. Coordinate attention36 is a very effective method for capturing positional information and spatial relationships, primarily involving two steps: coordinate information embedding and coordinate attention generation (see Fig. 3). The coordinate attention mechanism is described in Algorithm 1.

Fig. 3
figure 3

Schematic illustration of Coordinate Attention mechanism.

Algorithm 1

Coordinate Attention Mechanism

Input:

 Image set I = {p1, p2, …, pn}

 ResNet-101 model

 Coordinate Attention module

Output:

 Feature vector set of I: F = {f1, f2, …, fn}

1: Load I, ResNet-101 model, and Coordinate Attention module

2: Initialize \(F={{\emptyset}}\)

3: for i = 1 to n do

4: Preprocess image pi

5: Extract feature map \(X\in {{\mathbb{R}}}^{C\times H\times W}\) from ResNet-101

6: Apply Coordinate Attention module:

7: Perform horizontal pooling: \({z}_{c}^{w}(w)=\frac{1}{H}\mathop{\sum }\nolimits_{j = 1}^{H}{x}_{c}(j,w)\)

8: Perform vertical pooling: \({z}_{c}^{h}(h)=\frac{1}{W}\mathop{\sum }\nolimits_{i = 1}^{W}{x}_{c}(h,i)\)

9: Fuse features: \(y=\delta \left(BN\left(Con{v}_{1}^{1\times 1}\left(\left[{z}^{h},{z}^{w}\right]\right)\right)\right)\)

10: Split y into horizontal attention map yw and vertical attention map yh

11: Compute attention maps: \({p}^{h}=\delta \left(Con{v}_{h}^{1\times 1}\left({y}^{h}\right)\right)\); \({p}^{w}=\delta \left(Con{v}_{w}^{1\times 1}\left({y}^{w}\right)\right)\)

12: Enhance X by applying the attention maps: \({f}_{i}=X\odot {p}^{h}\odot {p}^{w}\)

13: Add fi to F = \(F\cup \left\{{f}_{i}\right\}\).

14: end for

15: return F

Coordinate information embedding: The coordinate attention block takes feature \(X=\left[{x}_{1},{x}_{2},\ldots,{x}_{C}\right]\in {R}^{C\times H\times W}\) as input and decomposes it into two parallel 1D feature encodings to efficiently integrate the spatial coordinate information of the ACA image. Specifically, two pooling kernels (H,1) and (1, W) are used to perform horizontal and vertical pooling operations on the input feature maps X, respectively, resulting in two feature maps with values only in the height and width dimensions. The output of the c-th channel at height h is:

$${z}_{c}^{h}\left(h\right)=\frac{1}{W}\sum _{0\le i < W}{x}_{c}\left(h,i\right)$$
(2)

The output of the c-th channel at width w is:

$${z}_{c}^{w}\left(w\right)=\frac{1}{H}\sum _{0\le j < H}{x}_{c}\left(j,w\right)$$
(3)

where \({z}_{c}^{h}\in {R}^{C\times H\times 1}\) and \({z}_{c}^{w}\in {R}^{C\times 1\times W}\) represent the outputs associated with the c-th channel at height h and at width w, respectively. The two transformations above aggregate features along the two spatial directions, generating a pair of direction-sensitive feature maps. This approach captures long-range dependencies along one spatial direction while preserving precise positional information along the other. Unlike channel attention methods that compress the feature tensor into a single feature vector, coordinate attention avoids the loss of positional information caused by compressing global spatial information into channels.

Coordinate attention generation: The extracted horizontal zw and vertical zh positional information is concatenated, passed through a convolutional layer and a batch normalization layer, and then processed non-linearly by an activation function.

$$y=\delta \left(BN\left(Con{v}_{1}^{1\times 1}\left(\left[{z}^{h},{z}^{w}\right]\right)\right)\right)$$
(4)

where \(\left[\cdot,\cdot \right]\) denotes the concatenation operation along the spatial dimension, BN is a batch normalization operation, δ is a nonlinear activation function, and \(y\in {R}^{C/r\times \left(H+W\right)}\) is an intermediate feature map that encodes spatial information in the horizontal and vertical directions, fusing positional information and enhancing the expressive power of the features. Next, the fused feature map y is split into a horizontal feature map \({y}^{w}\in {R}^{C/r\times W}\) and a vertical feature map \({y}^{h}\in {R}^{C/r\times H}\), yielding

$${y}^{h},{y}^{w}=Split\left(y\right)$$
(5)

Then yh and yw are passed through a 1 × 1 convolutional layer and an activation function to compute the horizontal and vertical attentional weights, respectively, and to make their output dimensions consistent with X.

$${p}^{h}=\delta \left(Con{v}_{h}^{1\times 1}\left({y}^{h}\right)\right)$$
(6)
$${p}^{w}=\delta \left(Con{v}_{w}^{1\times 1}\left({y}^{w}\right)\right)$$
(7)

where \({p}^{h}\in {R}^{C\times H\times 1}\) and \({p}^{w}\in {R}^{C\times 1\times W}\) denote the corresponding attention weights. These weights reflect the importance of each position and are used for subsequent feature weighting. The resulting two attention maps effectively reflect whether objects of interest exist in the corresponding rows and columns, enabling the model to locate target objects accurately. This positional information helps the model identify and focus on the spatial distribution and arrangement of different architectural features in the ACA image, thereby capturing the overall structure of the ancient architecture more accurately.

After that, the original input feature map is multiplied by the attention weights in the horizontal and vertical directions to obtain the final enhanced feature representation \(f\in {R}^{C\times H\times W}\), formulated as:

$${f}_{c}\left(i,j\right)={x}_{c}\left(i,j\right)\times {p}_{c}^{h}\left(i\right)\times {p}_{c}^{w}\left(j\right)$$
(8)

In this way, the features become more spatially concentrated while the spatial information of the original features is retained, which better captures spatial dependencies in different directions and improves the ACA feature representation ability.
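For reference, the full block (Eqs. (2)–(8)) can be sketched in PyTorch as below. The reduction ratio r = 32, the ReLU used for δ in Eq. (4), and the sigmoid producing the attention maps follow the original coordinate attention design36 and are assumptions rather than our exact configuration:

```python
# Minimal sketch of the coordinate attention block, Eqs. (2)-(8).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 conv that fuses the two directional encodings, Eq. (4)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        # Direction-specific 1x1 convs producing the attention maps, Eqs. (6)-(7)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Coordinate information embedding: pool along W and along H, Eqs. (2)-(3)
        z_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (b, c, w, 1)
        # Concatenate along the spatial axis and fuse, Eq. (4)
        y = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        # Split back into the two directions, Eq. (5)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        p_h = torch.sigmoid(self.conv_h(y_h))                       # (b, c, h, 1)
        p_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (b, c, 1, w)
        # Re-weight the input in both directions via broadcasting, Eq. (8)
        return x * p_h * p_w

f = CoordinateAttention(2048)(torch.randn(1, 2048, 14, 14))  # same shape as input
```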

Semantic attention module

The diversity of historical and cultural backgrounds may lead to similar appearances among ACA. ACA images usually contain rich and complex semantic information, and this complexity means that the classification task cannot rely on visual appearance alone but must also consider factors such as the history, region, and cultural background of the ACA. To capture the information in the semantic context, we first define a graph \(G=\left(V,E\right)\) whose node set V contains two parts: ACA image feature nodes and ACA label embedding nodes. Specifically, the image features are represented as the node feature matrix \(F={\left\{{f}_{j}\right\}}_{j = 1}^{n}\), where \({f}_{j}\in {R}^{{d}_{I}}\) is the feature vector at the j-th spatial region of the input image I, n is the number of spatial regions of the image, and dI is the dimension of each feature vector. The label embeddings are represented as the node feature matrix \(L={\left\{{l}_{i}\right\}}_{i = 1}^{m}\), \({l}_{i}\in {R}^{{d}_{L}}\), where m is the number of labels and dL is the dimension of the label embeddings. The semantic attention module is described in Algorithm 2.

Algorithm 2

GAT for Image-Label Feature Alignment

Input:

 Image feature matrix \(F={\left\{{f}_{j}\right\}}_{j = 1}^{n}\), \({f}_{j}\in {R}^{{d}_{I}}\)

 Label embedding matrix \(L={\left\{{l}_{i}\right\}}_{i = 1}^{m}\), \({l}_{i}\in {R}^{{d}_{L}}\)

 GAT parameters: weight matrices Wl, Wf, attention weight vector a

 LeakyReLU slope α = 0.2

Output:

 Weighted aggregate label representation \(Q={\left\{{Q}_{i}\right\}}_{i = 1}^{m}\)

1: Initialize \(Q={{\emptyset}}\)

2: for each label embedding node li L do

3: for each image feature node fj F do

4: Compute the attention weight \({e}_{ij}=LeakyReLU\left({a}^{T}\left[{W}_{l}{l}_{i}| | {W}_{f}{f}_{j}\right]\right)\)

5: end for

6: Normalize attention weights across all fj: \({a}_{ij}=\frac{exp\left({e}_{ij}\right)}{\mathop{\sum }\nolimits_{k = 1}^{n}exp\left({e}_{ik}\right)}\)

7: Compute the weighted aggregate representation Qi for label node li

8: Add Qi to Q = \(Q\cup \left\{{Q}_{i}\right\}\)

9: end for

10: return Q

A Graph Attention Network (GAT)37 is then used to learn the weights between the nodes; the weights from the label embedding nodes \({l}_{i}\in L\) to the image feature nodes \({f}_{j}\in F\) are denoted as eij. These weights indicate the degree of correlation between the image features and the label embeddings. GAT calculates the weights via a self-attention mechanism, which can be expressed as:

$${e}_{ij}=LeakyReLU\left({a}^{T}\left[{W}_{l}{l}_{i}| | {W}_{f}{f}_{j}\right]\right)$$
(9)

where a is the attention weight vector, Wl and Wf are weight matrices, \(| |\) represents the concatenation operation, and li and fj are the representations of label embedding node i and image feature node j, respectively. The negative-input slope of the LeakyReLU is set to 0.2. The softmax function is then applied to obtain the normalized attention weights:

$${a}_{ij}=\frac{exp\left({e}_{ij}\right)}{\mathop{\sum }\nolimits_{k = 1}^{n}exp\left({e}_{ik}\right)}$$
(10)

where n is the number of image feature nodes (indexed by k), and aij is an attention coefficient that adaptively learns the importance of different image feature nodes to label nodes. Finally, an alignment matrix is constructed from the attention weights to align the image features and label embeddings, reflecting their correlation and better capturing their semantic relationships. As a result, the feature vectors fj of the input image I are weighted by the attention coefficients aij and aggregated, obtaining the weighted aggregate label representation Qi for each label node li, formulated as:

$${Q}_{i}=\mathop{\sum }\limits_{j=1}^{n}{a}_{ij}{f}_{j}$$
(11)

In this way, the semantic relationships between image features and labels are used to update the label representations while retaining the label embeddings. The updated label representations capture more detailed and richer information, which helps the model consider more contextual information in decision-making rather than relying solely on surface visual features, thereby understanding the complex semantic information of ACA images. In addition, updating the label embeddings through semantic relationships helps better distinguish ACA with similar appearances but different semantics.
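A minimal single-head sketch of Eqs. (9)–(11) is given below; the shared projection dimension d and the dense label-to-region attention (every label node attends to every spatial region) are illustrative assumptions:

```python
# Minimal sketch of Eqs. (9)-(11): GAT-style attention from label embedding
# nodes to image feature nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    def __init__(self, d_l: int, d_i: int, d: int):
        super().__init__()
        self.W_l = nn.Linear(d_l, d, bias=False)   # projects label embeddings
        self.W_f = nn.Linear(d_i, d, bias=False)   # projects image features
        self.a = nn.Parameter(torch.randn(2 * d))  # attention weight vector a

    def forward(self, L: torch.Tensor, Fmap: torch.Tensor) -> torch.Tensor:
        # L: (m, d_l) label embeddings; Fmap: (n, d_i) flattened spatial features
        l = self.W_l(L)                      # (m, d)
        f = self.W_f(Fmap)                   # (n, d)
        m, n = l.size(0), f.size(0)
        # Concatenate every (label, region) pair, then apply Eq. (9)
        pairs = torch.cat([l.unsqueeze(1).expand(m, n, -1),
                           f.unsqueeze(0).expand(m, n, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a, negative_slope=0.2)  # (m, n)
        a = F.softmax(e, dim=1)              # Eq. (10): normalize over regions
        return a @ Fmap                      # Eq. (11): Q, shape (m, d_i)

sa = SemanticAttention(d_l=110, d_i=2048, d=256)
Q = sa(torch.randn(22, 110), torch.randn(196, 2048))  # 22 labels, 14x14 regions
```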

Finally, the aggregated label representation Q is mapped through the fully connected layer fc1 and an activation function. The resulting 2048-dimensional features are then mapped to the final output category space through fc2, yielding the ACA label prediction scores:

$$\hat{y}=f{c}_{2}\left(LeakyReLU\left(f{c}_{1}\left(tanh\left(Q\right)\right)\right)\right)$$
(12)

where fc1 and fc2 denote fully connected layers. The CSA-Net model is trained using the binary cross-entropy loss:

$$L=-\mathop{\sum }\limits_{i=1}^{m}\left[{y}^{i}\log \left(\sigma \left({\hat{y}}^{i}\right)\right)+\left(1-{y}^{i}\right)\log \left(1-\sigma \left({\hat{y}}^{i}\right)\right)\right]$$
(13)

where \({\hat{y}}^{i}\) is the output of the fully connected layer, \({y}^{i}\in \left\{0,1\right\}\) indicates whether label i appears in the image, and \(\sigma \left(\cdot \right)\) is the sigmoid function.
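The prediction head of Eq. (12) and the loss of Eq. (13) can be sketched as follows. Mapping each 2048-dimensional label node to a single logit is one plausible reading of the head described above and is an assumption; BCEWithLogitsLoss folds the sigmoid of Eq. (13) into the loss:

```python
# Minimal sketch of Eqs. (12)-(13); producing one logit per label node is
# an assumption about how fc2 maps to the output category space.
import torch
import torch.nn as nn

m = 22                                 # number of ACA labels
head = nn.Sequential(
    nn.Tanh(),                         # tanh(Q)
    nn.Linear(2048, 2048),             # fc1
    nn.LeakyReLU(0.2),
    nn.Linear(2048, 1),                # fc2: one score per label node
)

Q = torch.randn(m, 2048)               # aggregated label representations
y_hat = head(Q).squeeze(-1)            # (m,) prediction logits
y = torch.randint(0, 2, (m,)).float()  # ground-truth multi-hot vector

# BCEWithLogitsLoss applies the sigmoid internally and includes the leading
# minus sign that makes Eq. (13) a minimizable loss.
loss = nn.BCEWithLogitsLoss(reduction="sum")(y_hat, y)
```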

Experiments

In this section, we verify the effectiveness of the proposed CSA-Net framework through experiments on the ACA dataset. First, we introduce the basic information about the ACA dataset. Next, we describe the evaluation metrics and implementation details. Then, comparative experiments and ablation studies are performed on the proposed CSA-Net framework. Finally, we visualize the multi-label classification results of ACA images.

Datasets

We mainly conduct experiments on the ACA dataset to validate the model’s performance in the multi-label classification task of ancient Chinese architecture images. To evaluate the model’s generalization, the MS-COCO and Gothic architecture datasets are further introduced to demonstrate the model’s performance in different tasks and complex scenes. Specifically:

We used six classes of ACA images (http://vision.ia.ac.cn/data/) from the National Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences, and collected additional images online to extend the ACA dataset for our experiments. The dataset contains a total of 2500 images categorized into six types of ACA. For each type, 80% of the images are used as training samples. Additionally, each type of ACA image carries several labels, such as architectural category (palace, temple), dynasty (Tang_Dynasty, Song_Dynasty, etc.), school (Huizhou_School, Shanxi_School, etc.), geographic region (Central_China, etc.), and religion (Buddhism, Taoism). Detailed information on the ACA dataset is shown in Table 1.

Table 1 The ACA dataset details

MS-COCO dataset is a widely used benchmark for multi-label image classification, containing 82,783 training images and 40,504 validation images, covering 80 different object categories. Each image is annotated with multiple labels, effectively supporting model training and evaluation in complex backgrounds and multi-object scenes.

Gothic architecture dataset consists of 330 training images and 220 test images. We define multi-label categories based on architectural features and task requirements. Additional annotations were made to the existing dataset, covering 10 categories. Each image is annotated with multiple dimensions of labels, including Style (e.g., Gothic), Region (e.g., France, the UK), and Cathedral Name (e.g., Notre-Dame de Paris, Reims Cathedral, etc.). This dataset can be used for architectural style classification, multi-label classification, architectural analysis, and other studies.

Experimental settings

Evaluation metrics

To evaluate the results of multi-label ACA image classification, we use Precision, Recall, and F1 score as evaluation metrics. Precision (P) reflects the model’s accuracy in predicting different categories, evaluating its ability to minimize interference from background or other categories in multi-label tasks. Recall (R) measures the model’s coverage of the target categories, with higher recall indicating better capture of relevant features, especially in complex backgrounds. The F1 score, the harmonic mean of precision and recall, balances these two metrics and helps mitigate bias. In this study, the F1 score is used to evaluate the model’s overall performance in multi-label classification, especially when class distributions are imbalanced. The specific formulas are defined as follows:

$$P=\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}\frac{{N}_{i}^{c}}{{N}_{i}^{p}}\qquad R=\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}\frac{{N}_{i}^{c}}{{N}_{i}^{g}}\qquad F1=\frac{2\times P\times R}{P+R}$$
(14)

where M is the total number of label categories, \({N}_{i}^{c}\) denotes the number of images correctly predicted as the i-th category, \({N}_{i}^{p}\) represents the total number of images predicted as the i-th category, and \({N}_{i}^{g}\) is the total number of images truly labeled as the i-th category. In addition, mean Average Precision (mAP) is used to measure the accuracy of the multi-label ACA image classification task. mAP takes the order of predictions into account, calculates the average precision of each category, and takes the mean over all categories as the final metric.
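A sketch of Eq. (14) together with mAP is shown below; thresholding scores at 0.5 for P/R/F1 and using scikit-learn's average_precision_score for mAP are assumptions about the evaluation protocol:

```python
# Minimal sketch of Eq. (14) and mAP. scores: (N, M) predicted
# probabilities; targets: (N, M) multi-hot ground-truth labels.
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(scores: np.ndarray, targets: np.ndarray, threshold: float = 0.5):
    preds = (scores >= threshold).astype(int)
    tp = (preds * targets).sum(axis=0)                    # N_i^c per category
    P = (tp / np.maximum(preds.sum(axis=0), 1)).mean()    # mean N_i^c / N_i^p
    R = (tp / np.maximum(targets.sum(axis=0), 1)).mean()  # mean N_i^c / N_i^g
    F1 = 2 * P * R / (P + R)
    # mAP: per-category average precision over ranked scores, then averaged
    mAP = average_precision_score(targets, scores, average="macro")
    return P, R, F1, mAP
```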

Implementation details

In the training and testing phases of all experiments, the input images were resized to 448 × 448. The model uses ResNet-101 as a feature extractor to obtain the visual features of each image, and the input features are mapped to the same dimension as the label representations. Since GloVe generates word vectors from global word co-occurrence statistics, it is used to produce word embeddings for the ACA labels, converting the 22 ancient architecture labels into 110-dimensional word embedding vectors. The output dimension of the fully connected layer (FC) equals the number of labels. The proposed CSA-Net framework is implemented in PyTorch, and the server is configured with an Intel(R) Xeon(R) E5-2678 v3 @ 2.50 GHz CPU, 62 GB of memory, and an NVIDIA GeForce RTX 2080 Ti 11 GB GPU. The regularization parameter λ is set to 10−3, and SGD is used as the optimizer with a momentum of 0.9, weight decay of 10−4, initial learning rate of 0.01, and a batch size of 8.
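These settings can be summarized in the following sketch; the normalization statistics and the placeholder module standing in for CSA-Net are assumptions:

```python
# Minimal sketch of the preprocessing and optimizer configuration; a
# placeholder module stands in for the full CSA-Net model.
import torch
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((448, 448)),                    # inputs resized to 448 x 448
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                std=[0.229, 0.224, 0.225]),
])

model = torch.nn.Linear(2048, 22)            # placeholder for CSA-Net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```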

Experimental results

To validate the effectiveness of the model, we compared it with the following multi-label image classification methods on the ACA dataset: softmax38, LSEP39, ranking40, WARP41, AG2E42, VGG19+GCN-A20 (where model A is trained with a multi-label loss through a binary correlation matrix), VGG19+GCN-B20 (where model B is obtained by multi-label loss training of the final correlation matrix), GATN43, MSRN44, and DA-GAT45. The quantitative comparison results are shown in Table 2. With an input resolution of 448 × 448, our CSA-Net model achieves 88.75% mAP, outperforming the other methods on all metrics. In Table 2, the first four methods perform multi-label classification based on image features extracted by Inception-v4, and AG2E uses the VGG16 model for feature extraction on ACA images. Compared with these methods and the VGG19-based models, our CSA-Net model significantly outperforms them on all metrics. Additionally, compared with multi-label image classification methods such as GATN, MSRN, and DA-GAT, which are based on image features extracted by ResNet-101, our CSA-Net model improves mAP by 17.99%, 20.19%, and 10.92%, respectively. Moreover, compared with the recent methods MMDSR and NOAH (based on ResNet-101), our model outperforms them on the evaluation metrics, further verifying its advantages and effectiveness. Notably, due to the different feature distributions and semantic contents of general datasets and the ACA dataset, models that perform well on the MS-COCO dataset may not achieve good results on the ACA dataset. The ACA dataset contains specific architectural structures, artistic styles, and cultural elements that are lacking or rare in general datasets; therefore, our model needs to learn the specific features and semantics present in the ACA dataset. As shown in Table 2, our model not only effectively extracts features specific to the ACA dataset but also focuses on these specific semantic contents. Consequently, the experimental results confirm the effectiveness of the CSA-Net model.

Table 2 Performance comparison between CSA-Net and current methods on the ACA dataset

In addition, the CSA-Net model has been validated on other datasets. As shown in Table 3, the CSA-Net model achieves an mAP of 91.18% on the Gothic Architecture dataset, outperforming existing models. CSA-Net accurately captures the unique features of architecture, effectively improving classification performance. Furthermore, CSA-Net is also applicable to more complex general vision tasks. On the MS-COCO dataset, CSA-Net effectively handles complex backgrounds and multi-object scenes through the combined use of coordinate attention and semantic attention modules, which enhances the ability to capture target regions and semantic relationships. As shown in Table 4, the experimental results demonstrate that the CSA-Net model achieves 85.7% on mAP. In summary, CSA-Net not only shows significant advantages in ancient architecture classification tasks with specific semantic features but also performs well in general visual tasks involving diverse categories and complex scenes. It demonstrates the generalization capability of the CSA-Net model, enabling it to adapt to diverse task scenes and category distributions, further validating its effectiveness and applicability.

Table 3 Performance comparison between CSA-Net and current methods on the Gothic architecture dataset
Table 4 Performance comparison between CSA-Net and current methods on the MS-COCO dataset

Ablation studies

Beyond the overall classification performance on the ACA dataset, we conduct ablation studies on the different components of our proposed model. The following experiments investigate how each module affects classification performance. The experimental results are shown in Table 5. The baseline ResNet-101 model, without CA or SA, achieves 83.7% mAP. When only CA or only SA is added, the mAP improves to 86.91% and 84.41%, respectively. With the complete architecture, the performance of the CSA-Net model further improves to 88.75% mAP. Thus, the experimental results demonstrate the effectiveness of the proposed CA and SA and show that they play complementary roles, working together to improve the model’s ability to understand and express the semantic information of images.

Table 5 Ablation study of different components of the CSA-Net model in the ACA dataset

Furthermore, we investigate the sensitivity of CSA-Net to the regularization parameter λ. We study λ ∈ {10−1, 10−2, …, 10−5}, fixing the other hyperparameters and adjusting only λ. The results for different values of λ are shown in Fig. 4; λ = 10−3 yields the highest mAP, reaching 88.75%. This is therefore the optimal setting of the regularization parameter, leading to the best performance on the evaluation metrics. Appropriate regularization thus helps improve the generalization ability and performance of the model.

Fig. 4
figure 4

Comparison among different values of λ.

Analysis of results

In this section, we use the prediction results of the CSA-Net model for a qualitative analysis of the classification results. The XGrad-CAM46 method is employed for visualization analysis, and the t-SNE47 method is used to analyze the label correlations. Finally, the prediction results obtained with different module configurations are analyzed.

Qualitative results of the multi-label ACA classification

The multi-label classification results demonstrate the significant advantages of the CSA-Net model. The label prediction results shown in Fig. 5 indicate that the CSA-Net model predicts more labels, making the learned labels more consistent with the image content. In contrast, the VGG19+GCN model often recognizes only the most prominent objects in the image and generates relatively incomplete labels. As shown in Fig. 5a, Sichuan_School, Palace, and Jin_Dynasty are effectively predicted by the CSA-Net model. Moreover, the CSA-Net model exhibits high accuracy in predicting ACA image labels based on the co-occurrence relationships between objects. As shown in Fig. 5c and f, Shanxi_School and East_China are incorrectly assigned by the VGG19+GCN model to images that do not contain the corresponding visual features, whereas the CSA-Net model accurately analyzes the inter-relationships between labels and correctly predicts object categories that frequently appear together in the same scene. On this basis, although the NOAH method can capture global features, it is insufficient in modeling the label relationships in ancient architecture images, and thus fails to effectively address label confusion, as shown in Fig. 5d and e. By contrast, the CSA-Net model effectively utilizes the contextual information in the image to achieve a more accurate and comprehensive classification.

Fig. 5: Prediction results for samples.
figure 5

af Represent each class of samples on the 6-class ACA dataset. Prediction results of the other models and our CSA-Net model on the ACA validation set, respectively; labels in red font indicate incorrect results predicted by the model.

Visual analysis of the visual features

To further understand how coordinate attention focuses on specific regions in the image to recognize different architectural features, we use XGrad-CAM46 to generate attention heatmaps of ACA images for visual analysis. The experimental results are shown in Fig. 6. The CSA-Net model not only clearly shows the main areas of concern and locates the key features of ACA but also evenly covers the elements related to multiple labels within the image. For example, in Zixiao_Palace, Fayu_Temple, and Longquan_Temple, the CSA-Net model accurately pinpoints the specific features crucial for classification decisions through positional information, effectively filtering out irrelevant background information. This avoids attending equally to the sky, the ground, the background trees, and the building itself, which would spread attention over too broad a region and cause the model to learn incorrect or irrelevant features. The CSA-Net model thus demonstrates a strong ability to distinguish between target and irrelevant regions, reducing the impact of background interference on classification performance and enhancing its robustness. Furthermore, as shown in Fig. 7, the model consistently focuses on the architectural features even under changes in viewpoint (such as front and side views), reflecting its robustness under complex backgrounds and viewpoint variations. Additionally, as shown in Fig. 6 for Shangqing_Palace and Zhantan_Temple, the CSA-Net model effectively utilizes the long-range spatial dependencies within the images. Unlike the baseline model, which focuses only on small parts of the building (such as columns or windows) and overlooks other potentially distinguishing features (like the overall architectural style), the CSA-Net model avoids misclassification when dealing with visually similar buildings of different categories. Therefore, the experimental results verify that the CSA-Net model correctly understands and effectively utilizes the key visual information of the image, improving classification accuracy.

Fig. 6
figure 6

Visualization results of ResNet-101 and CSA-Net models on the ACA validation set.

Fig. 7
figure 7

Visualization results of the CSA-Net model on the ACA dataset from different viewpoints.

Semantic correlation visualization

To qualitatively analyze whether the learned features carry semantic relations, the t-SNE47 method is used to visualize the sample distributions of the ResNet-101 semantic features and the CSA-Net semantic features. As shown in Fig. 8a, the ResNet-101 semantic features show no distinct structure in the reduced-dimension space, with large overlap and confusion among samples. In contrast, Fig. 8b demonstrates that the semantic features trained by the CSA-Net model cause similar samples to cluster together, indicating that the semantic attention module effectively captures the semantic similarity between samples. Additionally, samples from different categories are more dispersed in the reduced-dimension space, indicating that the CSA-Net model has learned more discriminative feature representations, allowing better differentiation between categories and semantic contents. Therefore, the semantic correlation visualization further confirms the effectiveness and superiority of our CSA-Net model.

Fig. 8: Results of semantic correlation visualization for different models.
figure 8

a Semantic correlation visualization on the ResNet-101 model. b Semantic correlation visualization on the proposed CSA-Net model.
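A minimal sketch of the t-SNE projection behind Fig. 8 is given below, with random stand-ins for the extracted semantic features; the perplexity and plotting choices are assumptions:

```python
# Minimal sketch of the t-SNE visualization; random features stand in for
# the semantic features extracted by ResNet-101 or CSA-Net.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(500, 2048)        # (num_samples, feature_dim)
classes = np.random.randint(0, 6, size=500)  # 6 ACA categories

embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=classes, cmap="tab10", s=8)
plt.savefig("tsne_features.png", dpi=200)
```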

Visual analysis of predicted labels using different modules

From the results, we find that the ResNet-101 model makes predictions primarily based on the visual content of the images and is prone to predicting incorrect labels due to confusion. This indicates that the model has difficulty distinguishing certain visually similar architectural elements, such as decorative carvings and architectural structures. After adding CA to ResNet-101, the model can further recognize easily confusable features. For example, compared with plain ResNet-101, ResNet-101+CA can recognize Zixiao_Palace and Zhantan_Temple in Fig. 9, whereas ResNet-101 easily confuses them with Fayu_Temple. When only the SA module is added to ResNet-101, introducing multimodal data (such as the label set) helps provide more contextual information. Yet, when faced with high feature similarity between different categories, the model may classify more conservatively to avoid outputting confusing labels, causing some correct labels to be missed. For example, only high-confidence results such as Temple, Buddhism, and East_China are output for Zhantan_Temple in Fig. 9. When the CA and SA structures are introduced together, ResNet-101+CA+SA, i.e., the CSA-Net model, is obtained. The complete CSA-Net model effectively resolves confusion in ACA images, and its predicted label scores are higher than those of the other module configurations. The results demonstrate that each module of the CSA-Net model effectively improves performance.

Fig. 9: Sample images from the ACA validation set and the predicted labels using the different modules.
figure 9

Labels in red font indicate incorrect results predicted by the model.

Conclusion and future work

In this paper, we propose a novel coordinate-to-semantic attention network (CSA-Net) for multi-label ACA image classification, which utilizes coordinate attention to capture precise positional information, channel relationships, and long-range dependencies. It also integrates label information and image features, aligning them through graph attention to adjust their relationships and guide the model toward a better understanding of the semantic relationships between ACA images and labels. Through comparative experiments and ablation studies, we validate the effectiveness of the CA and SA modules, which play important roles in capturing key image features and the correlations between labels. Experimental results demonstrate that the CSA-Net model better resolves the confusion caused by the similarity of ancient architecture, not only improving classification accuracy but also enhancing the understanding of multi-label semantic relationships.

Although CSA-Net has demonstrated good performance in the multi-label classification task for ancient architecture, its computational resource requirements and real-time performance have not been thoroughly explored. Additionally, the model’s robustness in complex occlusion scenes still needs to be improved. Future work will focus on enhancing the model’s real-time performance and computational efficiency, as well as exploring the improvement of the attention mechanism to address extreme occlusion situations, further enhancing the model’s robustness in complex scenes.