Introduction

Ancient Chinese architecture (ACA), as an important part of traditional culture, carries a wealth of information about history, culture, religion, and art. These buildings, including palaces and temples, have unique architectural styles and characteristics, making their preservation particularly important. In recent years, the development of digital preservation technology1 and deep learning2,3,4 has driven rapid advancements in image processing for ancient architecture. These technologies record and recognize ancient architecture more accurately, helping to inherit and display its historical and cultural value. In particular, accurately assigning semantic categories to ancient architecture allows traditional Chinese architecture to be better understood and recorded, which is conducive to the protection and inheritance of cultural heritage. Nonetheless, due to the complex structural features and diverse categories of ACA images, a single label can rarely describe the overall content of an image. Therefore, multi-label ACA classification emerges as a more effective approach than single-label classification, providing vital technical assistance for the conservation of ancient architecture.

In general, ancient buildings exhibit similarities, with structures from different styles or periods sharing comparable visual features. This makes it difficult for a model to distinguish them accurately. Thus, one of the key challenges in multi-label ACA image classification is resolving the confusion caused by the similarity of architectural features. Currently, multi-label image classification has made great progress in various applications such as textile fiber5 and mineral identification6. Two important aspects, namely spatial relation learning7,8 and label correlation9,10,11, are the main considerations in multi-label image classification research. For example, ResNet-SRN12 captures the potential spatial relationships of labels using an attention mechanism and provides spatial regularization for the final classification results. ML-GCN13 introduces GCN into the multi-label image classification task to explicitly establish the correlation between labels. MMDSR14 proposes a multi-layered, multi-perspective dynamic semantic representation to enhance the modeling of relationships between labels. NOAH15 introduces a pairwise category attention mechanism to improve classification performance. Inspired by these developments in multi-label image classification, research has been devoted to constructing ACA image classification models. Such studies mainly focus on two aspects: feature representation and semantic analysis of ancient architectural images. On the one hand, feature representation emphasizes designing effective feature extraction methods that enable the model to extract distinguishable features from ACA images. For complex ancient architectural structures and textures, some studies use pre-trained CNN models to extract features16,17. For example, Wu et al.18 proposed an automatic classification algorithm for ancient Buddhist architecture based on local feature learning in a multicultural context. Compared to traditional feature extraction methods, deep learning-based local feature learning demonstrates better expressive power. On the other hand, in terms of semantic analysis, researchers are dedicated to understanding the semantic information in ancient architectural images, including the function, style, and period of the buildings, through techniques such as deep learning19,20. Additionally, graph-related techniques are increasingly being employed to explore the semantics of ACA images. Zhang et al.20 proposed a method that combines visual attention mechanisms with graph convolutional networks to enrich semantic information.

The aforementioned ACA image classification methods share common shortcomings: their feature learning schemes rely solely on CNNs16,18 to learn local relationships in ancient architecture images, and therefore cannot capture long-range dependencies or preserve positional information. Furthermore, these methods ignore the correlations between image regions and semantic labels, failing to effectively utilize accurate semantic context information. This leads to an inability to resolve the confusion caused by similarities between ancient buildings, resulting in inaccurate classification. Specifically, the confusion arises for the following reasons: (i) When different ancient buildings share common features such as similar structures, ornamentation, and styles, they may present a similar appearance in the image. As shown in Fig. 1a, the Forbidden City and the Summer Palace are both ancient palace complexes from the Ming and Qing Dynasties, sharing a similar court architectural style. They both employ the typical design of yellow tiles and red walls, forming striking visual features. When the model focuses only on these common local features, it may get confused by their high similarity, leading to difficulty in distinguishing between them. (ii) The rich diversity of ancient architecture means that different categories may have significant differences in design and form. However, because of historical, regional, and stylistic diversity, ancient architecture may still exhibit similar appearances in the image. As shown in Fig. 1b, although Fayu Temple and Zhantan Temple differ significantly in design and style, they may exhibit similar appearances in the image. This similarity arises because they are temple structures within the same religious tradition or cultural context, sharing common features such as roof design and architectural structure. In this case, the confusion mainly arises from shared features within this diversity, which lack semantic context to help distinguish them, thereby increasing the complexity of the classification task.

Fig. 1: Samples of feature similarity.
figure 1

a Local common feature similarity leads to confusion. b Some features in the diversity present a similar appearance, leading to confusion.

To address the above issues, this paper proposes a novel multi-label ACA image classification model, the coordinate-to-semantic attention network (CSA-Net). This model identifies the spatial relationships between different architectural features and adaptively aligns the feature representations of images and labels, thereby capturing their correlation and semantic information. Specifically, to finely focus on specific regions in the image and capture long-range spatial dependencies, we introduce a coordinate attention (CA) mechanism to establish positional information. This approach allows the model to consider more comprehensive and abstract features, better distinguishing between the common features of ACA. Furthermore, to generate semantically relevant visual representations for each label, we propose a semantic attention (SA) module that comprehensively utilizes information from both images and labels. This module improves the model’s understanding of the correlation between them and captures the semantic information within images, thereby better understanding the differences between various categories of ACA. Extensive experiments and analysis were conducted on the ACA dataset to verify the superiority of our method. The main contributions of this paper are summarized as follows:

  • We introduce a coordinate attention mechanism to optimize the feature representation of ancient architectural images, capturing the features of ACA more accurately.

  • We design a semantic attention module to achieve alignment between image features and semantics, generating relevant visual representations for each label, which helps to understand the image semantics.

  • Extensive experimental results on the ACA dataset and other datasets demonstrate that the proposed coordinate-to-semantic attention network (CSA-Net) classification model significantly improves multi-label ACA image classification performance compared to existing models.

Related work

Ancient architecture image classification

In ancient architecture image classification, each image may involve multiple aspects or features, such as architectural style, materials, and age. Existing methods are mainly studied regarding the feature representation and semantic analysis for ancient architecture images.

Feature representation methods for ancient architecture

To better represent ancient architecture, researchers have explored various feature representation methods. Traditional methods21,22,23 use handcrafted feature representations such as Local Binary Patterns (LBP) and color histograms. These methods describe images based on local features like edges, corner points, and textures. For example, Yanni et al.24 proposed an algorithm that combines local features with shape contour matching, leveraging unique local features and shape information to recognize buildings under different viewpoints, lighting conditions, and occlusions. However, ACA images face challenges not only from traditional noise factors like illumination, viewpoint, and background but also from the complexity of structural features and uneven feature distribution. Therefore, traditional feature extraction methods often struggle to capture sufficiently rich and accurate information.

Recently, some research efforts have turned to deep learning-based methods, such as using pre-trained convolutional neural network (CNN)25,26 models for feature extraction. Wu et al.18 proposed a classification method based on local feature learning for ancient Buddhist buildings. This model enhances adaptability by incorporating architectural features from different cultural contexts, thereby improving its performance across various geographical and cultural contexts. A method16 using image-to-text technology employs deep learning to extract features from images and generate text descriptions of neoclassical architectural styles. Consequently, deep learning-based local feature learning methods can automatically learn abstract features from images. Wang et al.27 proposed a method for recognizing ancient architecture based on an improved Inception-V3 model with asymmetric convolutions. This approach addresses the issues of small dataset size and insufficient model training for ancient architecture. Another study17 used CNN technology to automatically recognize roof ridge beasts in ancient architecture images, significantly improving the efficiency and accuracy of identifying these features and aiding in the preservation and study of ancient buildings. In28, the authors utilized CNNs to classify building typologies in different regions of Indonesia. Wang et al.25 used a CNN to extract feature representations from architectural images and introduced a channel-spatial attention mechanism to capture the correlation between specific channels and spatial positions in the image. This joint optimization of the CNN and the channel-spatial attention module resulted in accurate classification of architectural styles. Zhou et al.26 proposed a method for dating bronze ding vessels by effectively combining deep learning techniques with archeological knowledge, significantly improving the accuracy and automation of dating these artifacts. These deep learning methods facilitate end-to-end learning, automatically extracting and learning significant features from ancient architectural images.

Semantic analysis methods for ancient architecture and related cultural heritage

Ancient architecture often has beautiful contours and colorful forms, which not only record the architectural layout, hierarchical structure, building forms, and colors, but also contain rich cultural connotations. By analyzing the semantics of ancient architecture, we can deeply understand the culture, values, and information related to the social background they represent. In29, the author proposes a semantic modeling system for the Southeast region, which transforms traditional geometric units (such as points, lines, and triangles) into architectural elements unique to the region. This approach ensures consistency in architectural style and avoids the shortcomings of the syntactic system. To improve the traditional method of modeling geometric units, a flexible semantic approach has been proposed in30, which is based on the characteristics of ancient Chinese style architecture. This method involves a two-level semantic decomposition of roofs: structural and decorative functional categories. A study31 proposed a multimodal neural network framework combining a convolutional neural network for visual feature learning and semantic modeling of text data for the classification and annotation of cultural assets. Another study32 integrated additional text descriptions to capture intra- and inter-modal interactions in intangible cultural heritage images and textual descriptions, providing rich complementary semantic information.

Channel & spatial attention

Channel and spatial attention combines the advantages of channel attention and spatial attention, adaptively selecting important objects and regions in the image. This method effectively utilizes information from various channels and spatial locations to comprehensively extract features, thereby enhancing the model’s feature representation ability. SENet33 was the first to propose an effective mechanism to automatically learn the importance of different channels, explicitly modeling the interdependencies among channels to enhance the representation of important features. CBAM34 sequentially applies channel attention to weight each channel and then spatial attention to weight the feature map along the spatial dimension, focusing on important information in both dimensions. BAM35 directly combines channel and spatial attention to obtain an attention matrix that integrates spatial and channel information.

However, the SE block uses global pooling to aggregate global spatial information before modeling cross-channel relationships, which ignores the importance of positional information. Meanwhile, CBAM and BAM use convolution operations to capture local relationships but fail to model long-range dependencies effectively. To resolve these problems, a novel coordinate attention mechanism36 integrates positional information into channel attention, allowing the network to focus on larger significant regions at reduced computational cost. The mechanism combines spatial information in the horizontal and vertical directions and interacts effectively with channel attention to enhance the modeling of joint spatial and channel dependence, constituting a unified, multidimensional attention mechanism. This directional feature modeling is well-suited to key structures in ancient architecture that exhibit symmetry and directionality. Furthermore, coordinate attention can precisely identify the locations of target objects, offering a broader receptive field than CBAM and BAM and significantly enhancing the expressiveness of the features. Whereas CBAM and BAM combine channel and spatial attention, coordinate attention builds on this foundation to achieve unified modeling, representing a more advanced integration.

Method

Overview

In this section, we introduce a new framework called CSA-Net, which separately depicts the spatial and semantic relations of ACA images, as shown in Fig. 2. The coordinate attention mechanism, detailed in “Coordinate attention mechanism”, processes information from different positions of each part of the ancient architecture and mines the long-range spatial dependencies of ACA image features. In “Semantic attention module”, we propose a semantic attention module that aligns ACA label embeddings and feature maps to accurately capture the key information in the semantic context and efficiently integrate information from different modalities, providing optimal interdependent feature representations and semantic information for the final multi-label classification.

Fig. 2
figure 2

Illustration of the proposed CSA-Net framework.

Coordinate attention mechanism

ACA image feature extraction

Deep convolutional neural networks are widely used as backbone networks for feature learning. We utilize a pre-trained ResNet-101 to capture the spatial features of the ACA image. The higher layers of convolutional networks contain rich semantic information, making the extracted features more discriminative. Therefore, given an ACA image I, the feature map extracted from the last convolutional layer of the CNN model is represented as \(X\in {R}^{C\times H\times W}\), where C is the number of channels and H and W are the height and width of the feature map, respectively. Specifically,

$$X={f}_{{\rm{cnn}}}(I;\theta )$$
(1)

where \({f}_{{\rm{cnn}}}(\cdot )\) is the feature extraction module, X is the generated feature map, and θ denotes the weights of the network.
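As an illustration, Eq. (1) can be realized with a pre-trained torchvision ResNet-101 by discarding its pooling and classification layers; the following minimal PyTorch sketch uses illustrative names and is not taken from our released code:

```python
# Minimal sketch of Eq. (1): extract the last-conv-layer feature map
# X in R^{C x H x W} from a pre-trained ResNet-101 (names are illustrative).
import torch
import torchvision.models as models

backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
# Drop global average pooling and the fully connected classifier, keeping
# everything up to the last convolutional stage.
f_cnn = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 448, 448)  # a 448 x 448 RGB image I (batch of 1)
with torch.no_grad():
    X = f_cnn(image)                 # X has shape (1, 2048, 14, 14)
```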

Coordinate attention

Common features across ACA often lead to visual similarity. Thus, identifying the position and spatial relationship of different architectural features helps distinguish ACA with similar local features but different overall structures. Coordinate attention36 is a very effective method for capturing positional information and spatial relationships, primarily involving two steps: coordinate information embedding and coordinate attention generation (see Fig. 3). The coordinate attention mechanism is described in Algorithm 1.

Fig. 3
figure 3

Schematic illustration of Coordinate Attention mechanism.

Algorithm 1

Coordinate Attention Mechanism

Input:

 Image set I = {p1, p2, …, pn}

 ResNet-101 model

 Coordinate Attention module

Output:

 Feature vector set of I: F = {f1, f2, …, fn}

1: Load I, ResNet-101 model, and Coordinate Attention module

2: Initialize \(F={{\emptyset}}\)

3: for i = 1 to n do

4: Preprocess image pi

5: Extract feature map \(X\in {{\mathbb{R}}}^{C\times H\times W}\) from ResNet-101

6: Apply Coordinate Attention module:

7: Perform horizontal pooling: \({z}_{c}^{w}(w)=\frac{1}{H}\mathop{\sum }\nolimits_{j = 1}^{H}{x}_{c}(j,w)\)

8: Perform vertical pooling: \({z}_{c}^{h}(h)=\frac{1}{W}\mathop{\sum }\nolimits_{i = 1}^{W}{x}_{c}(h,i)\)

9: Fuse features: \(y=\delta \left(BN\left(Con{v}_{1}^{1\times 1}\left(\left[{z}^{h},{z}^{w}\right]\right)\right)\right)\)

10: Split y into horizontal attention map yw and vertical attention map yh

11: Compute attention maps: \({p}^{h}=\delta \left(Con{v}_{h}^{1\times 1}\left({y}^{h}\right)\right)\); \({p}^{w}=\delta \left(Con{v}_{w}^{1\times 1}\left({y}^{w}\right)\right)\)

12: Enhance X by applying the attention maps: \({f}_{i}=X\odot {p}^{h}\odot {p}^{w}\)

13: Add fi to F = \(F\cup \left\{{f}_{i}\right\}\).

14: end for

15: return F

Coordinate information embedding: The coordinate attention block takes feature \(X=\left[{x}_{1},{x}_{2},\ldots,{x}_{C}\right]\in {R}^{C\times H\times W}\) as input and decomposes it into two parallel 1D feature encodings to efficiently integrate the spatial coordinate information of the ACA image. Specifically, two pooling kernels (H,1) and (1, W) are used to perform horizontal and vertical pooling operations on the input feature maps X, respectively, resulting in two feature maps with values only in the height and width dimensions. The output of the c-th channel at height h is:

$${z}_{c}^{h}\left(h\right)=\frac{1}{W}\sum _{0\le i < W}{x}_{c}\left(h,i\right)$$
(2)

The output of the c-th channel at width w is:

$${z}_{c}^{w}\left(w\right)=\frac{1}{H}\sum _{0\le j < H}{x}_{c}\left(j,w\right)$$
(3)

where \({z}_{c}^{h}\in {R}^{C\times H\times 1}\) and \({z}_{c}^{w}\in {R}^{C\times 1\times W}\) represent the outputs associated with the c-th channel at height h and at width w, respectively. The two transformations above aggregate features along the two spatial directions, generating a pair of direction-sensitive feature maps. This approach captures long-range dependencies along one spatial direction while preserving precise positional information along the other. Unlike channel attention methods that compress the feature tensor into a single feature vector, coordinate attention avoids the loss of positional information caused by compressing global spatial information into channels.

Coordinate attention generation: The extracted horizontal zw and vertical zh positional information is concatenated, passed through a convolutional layer and a batch normalization layer, and then processed non-linearly by an activation function.

$$y=\delta \left(BN\left(Con{v}_{1}^{1\times 1}\left(\left[{z}^{h},{z}^{w}\right]\right)\right)\right)$$
(4)

where \(\left[\cdot,\cdot \right]\) denotes the concatenation operation along the spatial dimension, BN is a batch normalization operation, δ is a nonlinear activation function, and \(y\in {R}^{C/r\times \left(H+W\right)}\) is an intermediate feature map that encodes spatial information in the horizontal and vertical directions, fusing positional information and enhancing the expressive power of the features. Next, the fused feature map y is split into a horizontal feature map \({y}^{w}\in {R}^{C/r\times W}\) and a vertical feature map \({y}^{h}\in {R}^{C/r\times H}\), yielding

$${y}^{h},{y}^{w}=Split\left(y\right)$$
(5)

Then yh and yw are passed through a 1 × 1 convolutional layer and an activation function to compute the horizontal and vertical attentional weights, respectively, and to make their output dimensions consistent with X.

$${p}^{h}=\delta \left(Con{v}_{h}^{1\times 1}\left({y}^{h}\right)\right)$$
(6)
$${p}^{w}=\delta \left(Con{v}_{w}^{1\times 1}\left({y}^{w}\right)\right)$$
(7)

where \({p}^{h}\in {R}^{C\times H\times 1}\) and \({p}^{w}\in {R}^{C\times 1\times W}\) denote the corresponding attention weights. These weights reflect the importance of each position and are used for subsequent feature weighting. The resulting two attention maps effectively reflect whether objects of interest exist in the corresponding rows and columns, enabling the model to locate target objects accurately. This positional information helps the model identify and focus on the spatial distribution and arrangement of different architectural features in the ACA image, thereby capturing the overall structure of the ancient architecture more accurately.

After that, the original input feature map is multiplied by the attention weights in the horizontal and vertical directions to obtain the final enhanced feature representation \(f\in {R}^{C\times H\times W}\), formulated as:

$${f}_{c}\left(i,j\right)={x}_{c}\left(i,j\right)\times {p}_{c}^{h}\left(i\right)\times {p}_{c}^{w}\left(j\right)$$
(8)

In this way, the features become more spatially concentrated while the spatial information of the original features is retained, which better captures spatial dependencies in different directions and improves the ACA feature representation ability.
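For reference, the full block (Eqs. (2)–(8)) can be sketched in PyTorch as below. The reduction ratio r = 32, the ReLU used for δ in Eq. (4), and the sigmoid producing the attention maps follow the original coordinate attention design36 and are assumptions rather than our exact configuration:

```python
# Minimal sketch of the coordinate attention block, Eqs. (2)-(8).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 conv that fuses the two directional encodings, Eq. (4)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        # Direction-specific 1x1 convs producing the attention maps, Eqs. (6)-(7)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Coordinate information embedding: pool along W and along H, Eqs. (2)-(3)
        z_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (b, c, w, 1)
        # Concatenate along the spatial axis and fuse, Eq. (4)
        y = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        # Split back into the two directions, Eq. (5)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        p_h = torch.sigmoid(self.conv_h(y_h))                       # (b, c, h, 1)
        p_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (b, c, 1, w)
        # Re-weight the input in both directions via broadcasting, Eq. (8)
        return x * p_h * p_w

f = CoordinateAttention(2048)(torch.randn(1, 2048, 14, 14))  # same shape as input
```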

Semantic attention module

The diversity of historical and cultural backgrounds may lead to similar appearances among ACA. ACA images usually contain rich and complex semantic information, and this complexity means that the classification task cannot rely on visual appearance alone but must also consider factors such as the history, region, and cultural background of the ACA. To capture the information in the semantic context, we first define a graph \(G=\left(V,E\right)\) whose node set V contains two parts: ACA image feature nodes and ACA label embedding nodes. Specifically, the image features are represented as the node feature matrix \(F={\left\{{f}_{j}\right\}}_{j = 1}^{n}\), where \({f}_{j}\in {R}^{{d}_{I}}\) is the feature vector at the j-th spatial region of the input image I, n is the number of spatial regions of the image, and dI is the dimension of each feature vector. The label embeddings are represented as the node feature matrix \(L={\left\{{l}_{i}\right\}}_{i = 1}^{m}\), \({l}_{i}\in {R}^{{d}_{L}}\), where m is the number of labels and dL is the dimension of the label embeddings. The semantic attention module is described in Algorithm 2.

Algorithm 2

GAT for Image-Label Feature Alignment

Input:

 Image feature matrix \(F={\left\{{f}_{j}\right\}}_{j = 1}^{n}\), \({f}_{j}\in {R}^{{d}_{I}}\)

 Label embedding matrix \(L={\left\{{l}_{i}\right\}}_{i = 1}^{m}\), \({l}_{i}\in {R}^{{d}_{L}}\)

 GAT parameters: weight matrices Wl, Wf, attention weight vector a

 LeakyReLU slope α = 0.2

Output:

 Weighted aggregate label representation \(Q={\left\{{Q}_{i}\right\}}_{i = 1}^{m}\)

1: Initialize \(Q={{\emptyset}}\)

2: for each label embedding node li L do

3: for each image feature node fj F do

4: Compute the attention weight \({e}_{ij}=LeakyReLU\left({a}^{T}\left[{W}_{l}{l}_{i}| | {W}_{f}{f}_{j}\right]\right)\)

5: end for

6: Normalize attention weights across all fj: \({a}_{ij}=\frac{exp\left({e}_{ij}\right)}{\mathop{\sum }\nolimits_{k = 1}^{n}exp\left({e}_{ik}\right)}\)

7: Compute the weighted aggregate representation Qi for label node li

8: Add Qi to Q = \(Q\cup \left\{{Q}_{i}\right\}\)

9: end for

10: return Q

A Graph Attention Network (GAT)37 is then used to learn the weights between the nodes; the weights from the label embedding nodes \({l}_{i}\in L\) to the image feature nodes \({f}_{j}\in F\) are denoted as eij. These weights indicate the degree of correlation between the image features and the label embeddings. GAT calculates the weights via a self-attention mechanism, which can be expressed as:

$${e}_{ij}=LeakyReLU\left({a}^{T}\left[{W}_{l}{l}_{i}| | {W}_{f}{f}_{j}\right]\right)$$
(9)

where a is the attention weight vector, Wl and Wf are weight matrices, \(| |\) represents the concatenation operation, and li and fj are the representations of label embedding node i and image feature node j, respectively. The negative-input slope of the LeakyReLU is set to 0.2. The softmax function is then applied to obtain the normalized attention weights:

$${a}_{ij}=\frac{exp\left({e}_{ij}\right)}{\mathop{\sum }\nolimits_{k = 1}^{n}exp\left({e}_{ik}\right)}$$
(10)

where n is the number of image feature nodes (indexed by k), and aij is an attention coefficient that adaptively learns the importance of different image feature nodes to label nodes. Finally, an alignment matrix is constructed from the attention weights to align the image features and label embeddings, reflecting their correlation and better capturing their semantic relationships. As a result, the feature vectors fj of the input image I are weighted by the attention coefficients aij and aggregated, obtaining the weighted aggregate label representation Qi for each label node li, formulated as:

$${Q}_{i}=\mathop{\sum }\limits_{j=1}^{n}{a}_{ij}{f}_{j}$$
(11)

In this way, the semantic relationships between image features and labels are used to update the label representations while retaining the label embeddings. The updated label representations capture more detailed and richer information, which helps the model consider more contextual information in decision-making rather than relying solely on surface visual features, thereby understanding the complex semantic information of ACA images. In addition, updating the label embeddings through semantic relationships helps better distinguish ACA with similar appearances but different semantics.
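A minimal single-head sketch of Eqs. (9)–(11) is given below; the shared projection dimension d and the dense label-to-region attention (every label node attends to every spatial region) are illustrative assumptions:

```python
# Minimal sketch of Eqs. (9)-(11): GAT-style attention from label embedding
# nodes to image feature nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    def __init__(self, d_l: int, d_i: int, d: int):
        super().__init__()
        self.W_l = nn.Linear(d_l, d, bias=False)   # projects label embeddings
        self.W_f = nn.Linear(d_i, d, bias=False)   # projects image features
        self.a = nn.Parameter(torch.randn(2 * d))  # attention weight vector a

    def forward(self, L: torch.Tensor, Fmap: torch.Tensor) -> torch.Tensor:
        # L: (m, d_l) label embeddings; Fmap: (n, d_i) flattened spatial features
        l = self.W_l(L)                      # (m, d)
        f = self.W_f(Fmap)                   # (n, d)
        m, n = l.size(0), f.size(0)
        # Concatenate every (label, region) pair, then apply Eq. (9)
        pairs = torch.cat([l.unsqueeze(1).expand(m, n, -1),
                           f.unsqueeze(0).expand(m, n, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a, negative_slope=0.2)  # (m, n)
        a = F.softmax(e, dim=1)              # Eq. (10): normalize over regions
        return a @ Fmap                      # Eq. (11): Q, shape (m, d_i)

sa = SemanticAttention(d_l=110, d_i=2048, d=256)
Q = sa(torch.randn(22, 110), torch.randn(196, 2048))  # 22 labels, 14x14 regions
```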

Finally, the aggregated label representation Q is mapped through the fully connected layer fc1 and an activation function. The resulting 2048-dimensional features are then mapped to the final output category space through fc2, yielding the ACA label prediction scores:

$$\hat{y}=f{c}_{2}\left(LeakyReLU\left(f{c}_{1}\left(tanh\left(Q\right)\right)\right)\right)$$
(12)

where fc1 and fc2 denote fully connected layers. The CSA-Net model is trained using the binary cross-entropy loss:

$$L=-\mathop{\sum }\limits_{i=1}^{m}\left[{y}^{i}\log \left(\sigma \left({\hat{y}}^{i}\right)\right)+\left(1-{y}^{i}\right)\log \left(1-\sigma \left({\hat{y}}^{i}\right)\right)\right]$$
(13)

where \({\hat{y}}^{i}\) is the output of the fully connected layer, \({y}^{i}\in \left\{0,1\right\}\) indicates whether label i appears in the image, and \(\sigma \left(\cdot \right)\) is the sigmoid function.
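The prediction head of Eq. (12) and the loss of Eq. (13) can be sketched as follows. Mapping each 2048-dimensional label node to a single logit is one plausible reading of the head described above and is an assumption; BCEWithLogitsLoss folds the sigmoid of Eq. (13) into the loss:

```python
# Minimal sketch of Eqs. (12)-(13); producing one logit per label node is
# an assumption about how fc2 maps to the output category space.
import torch
import torch.nn as nn

m = 22                                 # number of ACA labels
head = nn.Sequential(
    nn.Tanh(),                         # tanh(Q)
    nn.Linear(2048, 2048),             # fc1
    nn.LeakyReLU(0.2),
    nn.Linear(2048, 1),                # fc2: one score per label node
)

Q = torch.randn(m, 2048)               # aggregated label representations
y_hat = head(Q).squeeze(-1)            # (m,) prediction logits
y = torch.randint(0, 2, (m,)).float()  # ground-truth multi-hot vector

# BCEWithLogitsLoss applies the sigmoid internally and includes the leading
# minus sign that makes Eq. (13) a minimizable loss.
loss = nn.BCEWithLogitsLoss(reduction="sum")(y_hat, y)
```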

Experiments

In this section, we verify the effectiveness of the proposed CSA-Net framework through experiments on the ACA dataset. First, we introduce the basic information about the ACA dataset. Next, we describe the evaluation metrics and implementation details. Then, comparative experiments and ablation studies are performed on the proposed CSA-Net framework. Finally, we visualize the multi-label classification results of ACA images.

Datasets

We mainly conduct experiments on the ACA dataset to validate the model’s performance in the multi-label classification task of ancient Chinese architecture images. To evaluate the model’s generalization, the MS-COCO and Gothic architecture datasets are further introduced to demonstrate the model’s performance in different tasks and complex scenes. Specifically:

We used six classes of ACA images (http://vision.ia.ac.cn/data/) from the National Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences, and collected additional images online to extend the ACA dataset for our experiments. The dataset contains a total of 2500 images categorized into six types of ACA. For each type, 80% of the images are used as training samples. Additionally, each type of ACA image carries several labels, such as architectural category (palace, temple), dynasty (Tang_Dynasty, Song_Dynasty, etc.), school (Huizhou_School, Shanxi_School, etc.), geographic region (Central_China, etc.), and religion (Buddhism, Taoism). Detailed information on the ACA dataset is shown in Table 1.

Table 1 The ACA dataset details

MS-COCO dataset is a widely used benchmark for multi-label image classification, containing 82,783 training images and 40,504 validation images, covering 80 different object categories. Each image is annotated with multiple labels, effectively supporting model training and evaluation in complex backgrounds and multi-object scenes.

Gothic architecture dataset consists of 330 training images and 220 test images. We define multi-label categories based on architectural features and task requirements. Additional annotations were made to the existing dataset, covering 10 categories. Each image is annotated with multiple dimensions of labels, including Style (e.g., Gothic), Region (e.g., France, the UK), and Cathedral Name (e.g., Notre-Dame de Paris, Reims Cathedral, etc.). This dataset can be used for architectural style classification, multi-label classification, architectural analysis, and other studies.

Experimental settings

Evaluation metrics

To evaluate the results of multi-label ACA image classification, we use Precision, Recall, and F1 score as evaluation metrics. Precision (P) reflects the model’s accuracy in predicting different categories, evaluating its ability to minimize interference from background or other categories in multi-label tasks. Recall (R) measures the model’s coverage of the target categories, with higher recall indicating better capture of relevant features, especially in complex backgrounds. The F1 score, the harmonic mean of precision and recall, balances these two metrics and helps mitigate bias. In this study, the F1 score is used to evaluate the model’s overall performance in multi-label classification, especially when class distributions are imbalanced. The specific formulas are defined as follows:

$$P=\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}\frac{{N}_{i}^{c}}{{N}_{i}^{p}}\qquad R=\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}\frac{{N}_{i}^{c}}{{N}_{i}^{g}}\qquad F1=\frac{2\times P\times R}{P+R}$$
(14)

where M is the total number of label categories, \({N}_{i}^{c}\) denotes the number of images correctly predicted as the i-th category, \({N}_{i}^{p}\) represents the total number of images predicted as the i-th category, and \({N}_{i}^{g}\) is the total number of images truly labeled as the i-th category. In addition, mean Average Precision (mAP) is used to measure the accuracy of the multi-label ACA image classification task. mAP takes the order of predictions into account, calculates the average precision of each category, and takes the mean over all categories as the final metric.
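A sketch of Eq. (14) together with mAP is shown below; thresholding scores at 0.5 for P/R/F1 and using scikit-learn's average_precision_score for mAP are assumptions about the evaluation protocol:

```python
# Minimal sketch of Eq. (14) and mAP. scores: (N, M) predicted
# probabilities; targets: (N, M) multi-hot ground-truth labels.
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(scores: np.ndarray, targets: np.ndarray, threshold: float = 0.5):
    preds = (scores >= threshold).astype(int)
    tp = (preds * targets).sum(axis=0)                    # N_i^c per category
    P = (tp / np.maximum(preds.sum(axis=0), 1)).mean()    # mean N_i^c / N_i^p
    R = (tp / np.maximum(targets.sum(axis=0), 1)).mean()  # mean N_i^c / N_i^g
    F1 = 2 * P * R / (P + R)
    # mAP: per-category average precision over ranked scores, then averaged
    mAP = average_precision_score(targets, scores, average="macro")
    return P, R, F1, mAP
```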

Implementation details

In the training and testing phases of all experiments, the input images were resized to 448 × 448. The model uses ResNet-101 as a feature extractor to obtain the visual features of each image, and the input features are mapped to the same dimension as the label representations. Since GloVe generates word vectors from global word co-occurrence statistics, it is used to produce word embeddings for the ACA labels, converting the 22 ancient architecture labels into 110-dimensional word embedding vectors. The output dimension of the fully connected layer (FC) equals the number of labels. The proposed CSA-Net framework is implemented in PyTorch, and the server is configured with an Intel(R) Xeon(R) E5-2678 v3 @ 2.50 GHz CPU, 62 GB of memory, and an NVIDIA GeForce RTX 2080 Ti 11 GB GPU. The regularization parameter λ is set to 10−3, and SGD is used as the optimizer with a momentum of 0.9, weight decay of 10−4, initial learning rate of 0.01, and a batch size of 8.
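These settings can be summarized in the following sketch; the normalization statistics and the placeholder module standing in for CSA-Net are assumptions:

```python
# Minimal sketch of the preprocessing and optimizer configuration; a
# placeholder module stands in for the full CSA-Net model.
import torch
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((448, 448)),                    # inputs resized to 448 x 448
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                std=[0.229, 0.224, 0.225]),
])

model = torch.nn.Linear(2048, 22)            # placeholder for CSA-Net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```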

Experimental results

To validate the effectiveness of the model, we compared it with the following multi-label image classification methods on the ACA dataset: softmax38, LSEP39, ranking40, WARP41, AG2E42, VGG19+GCN-A20 (where model A is trained with a multi-label loss through a binary correlation matrix), VGG19+GCN-B20 (where model B is obtained by multi-label loss training of the final correlation matrix), GATN43, MSRN44, and DA-GAT45. The quantitative comparison results are shown in Table 2. With an input resolution of 448 × 448, our CSA-Net model achieves 88.75% mAP, outperforming the other methods on all metrics. In Table 2, the first four methods perform multi-label classification based on image features extracted by Inception-v4, and AG2E uses the VGG16 model for feature extraction on ACA images. Compared with these methods and the VGG19-based models, our CSA-Net model significantly outperforms them on all metrics. Additionally, compared with multi-label image classification methods such as GATN, MSRN, and DA-GAT, which are based on image features extracted by ResNet-101, our CSA-Net model improves mAP by 17.99%, 20.19%, and 10.92%, respectively. Moreover, compared with the recent methods MMDSR and NOAH (based on ResNet-101), our model outperforms them on the evaluation metrics, further verifying its advantages and effectiveness. Notably, due to the different feature distributions and semantic contents of general datasets and the ACA dataset, models that perform well on the MS-COCO dataset may not achieve good results on the ACA dataset. The ACA dataset contains specific architectural structures, artistic styles, and cultural elements that are lacking or rare in general datasets; therefore, our model needs to learn the specific features and semantics present in the ACA dataset. As shown in Table 2, our model not only effectively extracts features specific to the ACA dataset but also focuses on these specific semantic contents. Consequently, the experimental results confirm the effectiveness of the CSA-Net model.

Table 2 Performance comparison between CSA-Net and current methods on the ACA dataset

In addition, the CSA-Net model has been validated on other datasets. As shown in Table 3, the CSA-Net model achieves an mAP of 91.18% on the Gothic Architecture dataset, outperforming existing models. CSA-Net accurately captures the unique features of architecture, effectively improving classification performance. Furthermore, CSA-Net is also applicable to more complex general vision tasks. On the MS-COCO dataset, CSA-Net effectively handles complex backgrounds and multi-object scenes through the combined use of coordinate attention and semantic attention modules, which enhances the ability to capture target regions and semantic relationships. As shown in Table 4, the experimental results demonstrate that the CSA-Net model achieves 85.7% on mAP. In summary, CSA-Net not only shows significant advantages in ancient architecture classification tasks with specific semantic features but also performs well in general visual tasks involving diverse categories and complex scenes. It demonstrates the generalization capability of the CSA-Net model, enabling it to adapt to diverse task scenes and category distributions, further validating its effectiveness and applicability.

Table 3 Performance comparison between CSA-Net and current methods on the Gothic architecture dataset
Table 4 Performance comparison between CSA-Net and current methods on the MS-COCO dataset

Ablation studies

Beyond the overall classification performance on the ACA dataset, we conduct ablation studies on the different components of our proposed model. The following experiments investigate how each module affects classification performance. The experimental results are shown in Table 5. The baseline ResNet-101 model, without CA or SA, achieves 83.7% mAP. When only CA or only SA is added, the mAP improves to 86.91% and 84.41%, respectively. With the complete architecture, the performance of the CSA-Net model further improves to 88.75% mAP. Thus, the experimental results demonstrate the effectiveness of the proposed CA and SA and show that they play complementary roles, working together to improve the model’s ability to understand and express the semantic information of images.

Table 5 Ablation study of different components of the CSA-Net model in the ACA dataset

Furthermore, we investigate the sensitivity of CSA-Net to the regularization parameter λ. We study λ ∈ {10−1, 10−2, …, 10−5}, fixing the other hyperparameters and adjusting only λ. The results for different values of λ are shown in Fig. 4; λ = 10−3 yields the highest mAP, reaching 88.75%. This is therefore the optimal setting of the regularization parameter, leading to the best performance on the evaluation metrics. Appropriate regularization thus helps improve the generalization ability and performance of the model.

Fig. 4
figure 4

Comparison among different values of λ.

Analysis of results

In this section, we use the prediction results of the CSA-Net model for a qualitative analysis of the classification results. The XGrad-CAM46 method is employed for visualization analysis, and the t-SNE47 method is used to analyze the label correlations. Finally, the prediction results obtained with different module configurations are analyzed.

Qualitative results of the multi-label ACA classification

The multi-label classification results demonstrate the significant advantages of the CSA-Net model. The label prediction results shown in Fig. 5 indicate that the CSA-Net model predicts more labels, making the learned labels more consistent with the image content. In contrast, the VGG19+GCN model often recognizes only the most prominent objects in the image and generates relatively incomplete labels. As shown in Fig. 5a, Sichuan_School, Palace, and Jin_Dynasty are effectively predicted by the CSA-Net model. Moreover, the CSA-Net model exhibits high accuracy in predicting ACA image labels based on the co-occurrence relationships between objects. As shown in Fig. 5c and f, Shanxi_School and East_China are incorrectly assigned by the VGG19+GCN model to images that do not contain the corresponding visual features, whereas the CSA-Net model accurately analyzes the inter-relationships between labels and correctly predicts object categories that frequently appear together in the same scene. On this basis, although the NOAH method can capture global features, it is insufficient in modeling the label relationships in ancient architecture images, and thus fails to effectively address label confusion, as shown in Fig. 5d and e. By contrast, the CSA-Net model effectively utilizes the contextual information in the image to achieve a more accurate and comprehensive classification.

Fig. 5: Prediction results for samples.
figure 5

af Represent each class of samples on the 6-class ACA dataset. Prediction results of the other models and our CSA-Net model on the ACA validation set, respectively; labels in red font indicate incorrect results predicted by the model.

Visual analysis of the visual features

To further understand how coordinate attention focuses on specific regions in the image to recognize different architectural features, we use XGrad-CAM46 to generate attention heatmaps of ACA images for visual analysis. The experimental results are shown in Fig. 6. The CSA-Net model not only clearly shows the main areas of concern and locates the key features of ACA but also evenly covers the elements related to multiple labels within the image. For example, in Zixiao_Palace, Fayu_Temple, and Longquan_Temple, the CSA-Net model accurately pinpoints the specific features crucial for classification decisions through positional information, effectively filtering out irrelevant background information. This avoids attending equally to the sky, the ground, the background trees, and the building itself, which would spread attention over too broad a region and cause the model to learn incorrect or irrelevant features. The CSA-Net model thus demonstrates a strong ability to distinguish between target and irrelevant regions, reducing the impact of background interference on classification performance and enhancing its robustness. Furthermore, as shown in Fig. 7, the model consistently focuses on the architectural features even under changes in viewpoint (such as front and side views), reflecting its robustness under complex backgrounds and viewpoint variations. Additionally, as shown in Fig. 6 for Shangqing_Palace and Zhantan_Temple, the CSA-Net model effectively utilizes the long-range spatial dependencies within the images. Unlike the baseline model, which focuses only on small parts of the building (such as columns or windows) and overlooks other potentially distinguishing features (like the overall architectural style), the CSA-Net model avoids misclassification when dealing with visually similar buildings of different categories. Therefore, the experimental results verify that the CSA-Net model correctly understands and effectively utilizes the key visual information of the image, improving classification accuracy.

Fig. 6
figure 6

Visualization results of ResNet-101 and CSA-Net models on the ACA validation set.

Fig. 7
figure 7

Visualization results of the CSA-Net model on the ACA dataset from different viewpoints.

Semantic correlation visualization

To qualitatively analyze whether the learned features carry semantic relations, the t-SNE47 method is used to visualize the sample distributions of the ResNet-101 semantic features and the CSA-Net semantic features. As shown in Fig. 8a, the ResNet-101 semantic features show no distinct structure in the reduced-dimension space, with large overlap and confusion among samples. In contrast, Fig. 8b demonstrates that the semantic features trained by the CSA-Net model cause similar samples to cluster together, indicating that the semantic attention module effectively captures the semantic similarity between samples. Additionally, samples from different categories are more dispersed in the reduced-dimension space, indicating that the CSA-Net model has learned more discriminative feature representations, allowing better differentiation between categories and semantic contents. Therefore, the semantic correlation visualization further confirms the effectiveness and superiority of our CSA-Net model.

Fig. 8: Results of semantic correlation visualization for different models.
figure 8

a Semantic correlation visualization on the ResNet-101 model. b Semantic correlation visualization on the proposed CSA-Net model.
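A minimal sketch of the t-SNE projection behind Fig. 8 is given below, with random stand-ins for the extracted semantic features; the perplexity and plotting choices are assumptions:

```python
# Minimal sketch of the t-SNE visualization; random features stand in for
# the semantic features extracted by ResNet-101 or CSA-Net.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(500, 2048)        # (num_samples, feature_dim)
classes = np.random.randint(0, 6, size=500)  # 6 ACA categories

embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=classes, cmap="tab10", s=8)
plt.savefig("tsne_features.png", dpi=200)
```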

Visual analysis of predicted labels using different modules

From the results, we find that the ResNet-101 model makes predictions primarily based on the visual content of the images and is prone to predicting incorrect labels due to confusion. This indicates that the model has difficulty distinguishing certain visually similar architectural elements, such as decorative carvings and architectural structures. After adding CA to ResNet-101, the model can further recognize easily confusable features. For example, compared with plain ResNet-101, ResNet-101+CA can recognize Zixiao_Palace and Zhantan_Temple in Fig. 9, whereas ResNet-101 easily confuses them with Fayu_Temple. When only the SA module is added to ResNet-101, introducing multimodal data (such as the label set) helps provide more contextual information. Yet, when faced with high feature similarity between different categories, the model may classify more conservatively to avoid outputting confusing labels, causing some correct labels to be missed. For example, only high-confidence results such as Temple, Buddhism, and East_China are output for Zhantan_Temple in Fig. 9. When the CA and SA structures are introduced together, ResNet-101+CA+SA, i.e., the CSA-Net model, is obtained. The complete CSA-Net model effectively resolves confusion in ACA images, and its predicted label scores are higher than those of the other module configurations. The results demonstrate that each module of the CSA-Net model effectively improves performance.

Fig. 9: Sample images from the ACA validation set and the predicted labels using the different modules.
figure 9

Labels in red font indicate incorrect results predicted by the model.

Conclusion and future work

In this paper, we propose a novel coordinate-to-semantic attention network (CSA-Net) for multi-label ACA image classification, which utilizes coordinate attention to capture precise positional information, channel relationships, and long-range dependencies. It also integrates label information and image features, aligning them through graph attention to adjust their relationships and guide the model toward a better understanding of the semantic relationships between ACA images and labels. Through comparative experiments and ablation studies, we validate the effectiveness of the CA and SA modules, which play important roles in capturing key image features and the correlations between labels. Experimental results demonstrate that the CSA-Net model better resolves the confusion caused by the similarity of ancient architecture, not only improving classification accuracy but also enhancing the understanding of multi-label semantic relationships.

Although CSA-Net has demonstrated good performance in the multi-label classification task for ancient architecture, its computational resource requirements and real-time performance have not been thoroughly explored. Additionally, the model’s robustness in complex occlusion scenes still needs to be improved. Future work will focus on enhancing the model’s real-time performance and computational efficiency, as well as exploring the improvement of the attention mechanism to address extreme occlusion situations, further enhancing the model’s robustness in complex scenes.