Introduction

In October 2003, the United Nations Educational, Scientific and Cultural Organization (UNESCO) adopted the Convention for the Safeguarding of the Intangible Cultural Heritage at its 32nd General Conference1. As a typical representative of intangible cultural heritage, embroidery not only carries profound cultural connotations but also embodies exquisite handicraft skills. Cantonese embroidery, one of the four famous embroideries of China and a national-level intangible cultural heritage item, can be traced back to the Tang Dynasty, reached its peak in the Ming and Qing Dynasties, and was especially renowned during the era of the “Thirteen Hongs” of Guangzhou’s foreign trade in the Qing Dynasty. Cantonese embroidery is known for its rich subject matter, full compositions, vivid colors, and varied stitches, highlighting the unique cultural characteristics of the Lingnan region. Its subject matter is particularly distinctive, excelling at depicting scenery unique to the Lingnan region, such as flowers, lychees, kapoks, magpies, partridges, phoenixes, cranes, peacocks, and many other subjects2.

In the field of image recognition, accurately recognizing objects in images has always been a challenge for scholars. Traditional image recognition techniques (e.g., template matching3) localize objects by sliding templates to compute the similarity pixel by pixel; however, they suffer from both low accuracy and efficiency in complex backgrounds. With the iterative development of deep learning techniques, their application in image detection has become increasingly feasible. It has been confirmed in the literature4 that target detection technology enables the automatic identification and localization of objects within images. The feasibility of this technology provides a novel approach for detecting motif elements in Cantonese embroidery images. This technique can not only accurately determine the category and location of motif elements in Cantonese embroidery images but also allow researchers to conduct efficient categorization and precise archiving of embroidery works based on the detection results, thus enhancing the efficiency and accuracy of Cantonese embroidery research and management. In addition, the combination of the target detection technique and generative adversarial networks (GANs) enables intelligent restoration of damaged regions in Cantonese embroidery images5. Moreover, recent developments in stylized description generation models, such as style embedding-based variational autoencoder (SE-VAE)6 and XGL-Transformer (XGL-T)7, combined with object detection techniques, hold great potential for generating stylized textual descriptions of Cantonese embroidery images, thereby enriching their multimodal representations. These approaches provide valuable references for future research that integrates detection and generation. Conducting automatic recognition research on Cantonese embroidery images via object detection holds significant theoretical value and promising application prospects.

With the wide application of deep learning technology in the field of computer vision, convolutional neural networks (CNNs)8 have become a core method for tasks such as image classification, target detection, and semantic segmentation. CNNs can automatically learn image features and efficiently handle images of different sizes, textures, and colors. Current CNN-based target detection methods can be categorized into two-stage and single-stage algorithms. Two-stage algorithms, such as R-CNN9, Faster R-CNN10, and Mask R-CNN11, first generate object candidate boxes via region proposals and then classify those candidates. In contrast, single-stage algorithms such as YOLO12, SSD13, and EfficientDet14 perform regression prediction directly on the image and output the location and category information of the objects in a single pass.

Among two-stage algorithms, early models like R-CNN suffered from inefficiencies due to redundant computation9. Later developments, such as SPPNet15 and Fast R-CNN16, presented pooling strategies to enhance performance. Faster R-CNN10 further improved efficiency by incorporating a region proposal network (RPN), enabling end-to-end training and near real-time detection. Nevertheless, this method still relies heavily on anchor box design, is sensitive to hyperparameters, and has high computational demands when using deep backbones.

Two-stage algorithms have been applied in the domain of embroidery and textile recognition. For example, a Faster R-CNN framework based on a ResNet50 backbone17 has successfully identified multiple categories of patterns in traditional Chinese embroideries. Another study18 compared the performance of a VGG-based Faster R-CNN and an AlexNet-based CNN in fabric pattern recognition and found that both achieved good accuracy and precision. However, these methods are mainly suited to single-label images: when multiple patterns coexist, mutual interference reduces recognition accuracy, and the overall detection speed remains relatively slow. These limitations indicate that traditional two-stage models still struggle to generalize to complex multi-label embroidery image recognition tasks.

Single-stage detection algorithms, characterized by their efficient design, offer outstanding speed advantages in real-time application scenarios and have now surpassed many traditional multistage detection methods in both speed and accuracy. For example, YOLO (you only look once), a classical target detection algorithm proposed by Redmon et al.12 at CVPR 2016, introduced an efficient single-stage detection architecture; SSD, proposed by Liu et al.13 at ECCV 2016, directly predicts bounding boxes and categories on feature maps at different scales, achieving end-to-end detection that is fast enough for real-time use.

Single-stage detection algorithms have been widely used in textile pattern recognition in recent years because of their efficient architecture. A previous study19 applied SSD and YOLOv8 to the recognition of Nantong blue calico fabric patterns and improved mAP by replacing the VGG backbone with MobileNetV2, confirming the positive effect of lightweight architectures on performance. However, the method still performed poorly in recognizing geometric patterns.

To further improve recognition performance, recent research has begun to explore structural optimizations, such as combining lightweight networks with multi-scale feature fusion modules. One study20 introduced a spatial pyramid pooling (SPP) module into a MobileNetV1-based model to enhance the fusion of local and global features, applying it to the recognition of Nantong Shen embroidery patterns. Despite modest improvements in feature extraction, the overall gain was constrained by fundamental shortcomings of the MobileNetV1 architecture. Subsequently, the backbone was upgraded to MobileNetV3, incorporating attention mechanisms and efficient activation functions, which significantly reduced the number of parameters while enhancing the model’s ability to allocate feature weights, achieving both a lightweight design and improved recognition performance21.

In addition to architectural improvements, attention mechanisms have been widely adopted to enhance the model’s sensitivity to key regions in images. For example, an image captioning model integrating Faster R-CNN and Inception V3, known as MAA-FIC22, employed adaptive attention to effectively capture salient objects and their spatial relationships, enabling the generation of coherent and factual textual descriptions. Another study23 incorporated a multi-head attention mechanism into ResNet50 to improve focus on texture details and color variations, combined with data augmentation strategies to enhance model robustness. Although these methods increase computational complexity, they can effectively improve feature extraction capabilities and help achieve more accurate image recognition under complex conditions.

Although CNN-based object detection models have achieved certain success in embroidery image recognition, they still face limitations when dealing with the complex backgrounds, multi-label coexistence, dense textures, and rich colors characteristic of Cantonese embroidery images. In particular, while the mainstream YOLOv8 model demonstrates robust detection performance, its capability to model intricate patterns and semantically complex regions remains limited. To address these challenges, this paper proposes an improved YOLOv8-based model that integrates feature enhancement and attention mechanisms, aiming to improve detection accuracy and model robustness for Cantonese embroidery images.

Methods

To address the scarcity of Cantonese embroidery datasets, low recognition accuracy under complex backgrounds, and significant interclass morphological variation, this study proposes an improved YOLOv8 model named DAL-YOLOv8. First, a dedicated Cantonese embroidery dataset is constructed through image collection, augmentation, and annotation. The YOLOv8 model is subsequently enhanced by introducing DWConv24 to reduce the computational cost, a C2f-AFE module for better feature extraction, an LSKA25 attention mechanism to expand the receptive field, and the WIoU26 loss function to reduce the impact of low-quality samples. Finally, experimental results are used to validate the effectiveness of the proposed model. The overall workflow is illustrated in Fig. 1:

Fig. 1
figure 1

Flowchart of the DAL-YOLOv8 framework. In the first stage, the Cantonese embroidery images are augmented and annotated. In the second stage, the model detects objects and locates their positions. In the third stage, the output results are analyzed.

Construction of an object detection dataset for Cantonese embroidery images

The dataset constructed in this study contains 1860 Cantonese embroidery images from three sources. The first part comes from the designated safeguarding unit for Cantonese embroidery, an intangible cultural heritage item in Guangdong Province, China; it consists of 132 Cantonese embroidery shawl images, which are expanded to 266 images after processing. The second part, provided by the Cantonese Embroidery Craft Factory Co., Ltd., contains 60 images. The third part comprises 1534 representative Cantonese embroidery images screened from the internet, each with a resolution greater than 500 × 500 pixels and a clean background (a uniform solid-color substrate with no folds, stains, or shadows). Figure 2 shows some examples of Cantonese embroidery images randomly selected from the dataset.

Fig. 2
figure 2

Sample images from the Cantonese embroidery dataset, including ten object categories.

To increase the training effectiveness and generalizability of the model, this study applies data augmentation techniques to expand the dataset. The specific methods include horizontal flipping, vertical flipping, the addition of Gaussian noise, and random contrast adjustment, thereby increasing sample diversity. After four data augmentation techniques are applied to each Cantonese embroidery image, the total number of images in the dataset is expanded to 9300. The dataset is subsequently partitioned into training, validation, and test sets at a ratio of 8:1:1. Specifically, the training set comprises 7440 images, whereas the validation and test sets each contain 930 images. These subsets are utilized for model training and performance evaluation. Figure 3 presents several examples illustrating the effects of these augmentation techniques on Cantonese embroidery images.
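As a concrete illustration, the sketch below shows how the four augmentation operations could be scripted with the albumentations library; the library choice, folder layout, and parameter values are assumptions, since the paper does not specify its implementation.

```python
# Minimal sketch of the four augmentation operations described above, using
# albumentations (an assumption; the paper does not state which toolkit was used).
# Each original image yields four augmented copies, expanding 1860 images to 9300.
import os
import cv2
import albumentations as A

AUGMENTATIONS = {
    "hflip": A.HorizontalFlip(p=1.0),               # horizontal flip
    "vflip": A.VerticalFlip(p=1.0),                 # vertical flip
    "gauss": A.GaussNoise(p=1.0),                   # additive Gaussian noise
    "contrast": A.RandomBrightnessContrast(         # random contrast adjustment
        brightness_limit=0.0, contrast_limit=0.3, p=1.0),
}

def augment_folder(src_dir: str, dst_dir: str) -> None:
    """Write four augmented variants of every image in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        image = cv2.imread(os.path.join(src_dir, name))
        if image is None:
            continue
        for tag, aug in AUGMENTATIONS.items():
            out = aug(image=image)["image"]
            stem, ext = os.path.splitext(name)
            cv2.imwrite(os.path.join(dst_dir, f"{stem}_{tag}{ext}"), out)
```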

Fig. 3: Data augmentation diagram.
figure 3

a Original image; b Horizontal flip; c Vertical flip; d Addition of Gaussian noise; e Random contrast adjustment. Images (b–e) are generated through data augmentation.

Cantonese embroidery has a rich and diverse range of themes, often taking the distinctive natural scenery and cultural elements of the Lingnan region as its subjects, resulting in a unique style described as “both rustic and refined”27. Its content covers natural scenery (such as flowers, landscapes, lychees, and kapoks), auspicious totems (such as dragons, phoenixes, cranes, and peacocks), and folk imagery (such as figures, fish, and birds). Among these, lychee, kapok, and three bird species (white crane, magpie, and partridge) are particularly representative of the cultural heritage. This study selects ten typical categories from these culturally significant elements, namely flower, bird, lychee, kapok, lotus, mandarin duck, peacock, crane, dragon, and phoenix, for in-depth analysis. The specific classification is illustrated in Fig. 4. To annotate these ten categories, the LabelImg annotation tool was employed to manually label rectangular regions. Table 1 presents the distribution of the ten categories across the training, validation, and test sets of the Cantonese embroidery dataset.

Fig. 4: Categories of Cantonese embroidery objects.
figure 4

a flower; b bird; c lychee; d kapok; e lotus; f mandarin duck; g peacock; h crane; i dragon; j phoenix.

Table 1 Statistical table of the distribution of the Cantonese embroidery dataset

To investigate the impact of background complexity on model performance in Cantonese embroidery images, this study categorizes the images into two types: simple background and complex background. Images with simple backgrounds are characterized by minimal background elements, uniform colors, simple textures, clear object boundaries, and low visual interference. In contrast, complex background images feature numerous patterns, dense textures, rich colors, and overlapping or occluded objects. To enhance the objectivity of this classification, the spatial density index (SDI) is introduced as a quantitative metric. The SDI takes into account both the number of objects in an image and their spatial distribution, providing a measure of object density. The calculation method is as follows:

$$SDI=\frac{N}{W\times H}\times \frac{1}{\bar{d}}$$
(1)

Here, N represents the number of detected objects in the image, W × H denotes the resolution area of the image, and \(\bar{d}\) refers to the average Euclidean distance between the centers of all detected objects. In this study, the following thresholds are defined: images with SDI < \(1\times {10}^{-5}\) are classified as having a simple background, whereas those with SDI > \(1\times {10}^{-4}\) are classified as having a complex background. This classification provides a quantitative basis for analyzing model performance under different background conditions.
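The following minimal Python sketch illustrates how Eq. (1) and the two thresholds can be applied; the handling of images with fewer than two objects is an assumption not covered in the text.

```python
# Sketch of the SDI computation in Eq. (1). Box centres are assumed to come
# from the LabelImg annotations; the threshold values follow the text.
from itertools import combinations
import math

def spatial_density_index(centers, width, height):
    """centers: list of (x, y) object centres in pixels; width, height: image size."""
    n = len(centers)
    if n < 2:
        return 0.0  # average pairwise distance is undefined for fewer than two objects
    pair_dists = [math.dist(a, b) for a, b in combinations(centers, 2)]
    d_bar = sum(pair_dists) / len(pair_dists)
    return n / (width * height) * (1.0 / d_bar)

def background_label(sdi: float) -> str:
    if sdi < 1e-5:
        return "simple"
    if sdi > 1e-4:
        return "complex"
    return "unclassified"  # images falling between the two thresholds
```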

DAL-YOLOv8: an object detection algorithm with feature enhancement and an attention mechanism for Cantonese embroidery images

The YOLO series of algorithms has evolved over many iterations to continuously optimize the detection performance and inference speed. Among these, YOLOv8 stands out as the most representative version because of its exceptional detection accuracy, high inference speed, and strong system compatibility. Its architecture comprises three core functional modules: the backbone, which is responsible for extracting primary features; the neck, which enables multiscale feature fusion via a feature pyramid; and the head, which generates detection outputs—including bounding boxes, confidence scores, and category labels—based on the refined feature maps. For different application scenarios, YOLOv8 provides five model specifications: n/s/m/l/x28. This study improves upon YOLOv8s.

Cantonese embroidery poses significant challenges for image detection because of its intricate subjects, complex compositions, vivid colors, and diverse stitching techniques. To address these issues, this study proposes an improved detection model, DAL-YOLOv8, which is based on the YOLOv8s architecture. The model integrates feature enhancement and attention mechanisms to improve recognition performance under complex backgrounds. Starting from the original YOLOv8, the network architecture is reconstructed through module replacement and fusion. Specifically, standard convolutional layers are replaced with depthwise separable convolutions (DWConv)24 to reduce the model size and computational cost. A feature enhancement module (C2f-AFE) is designed to strengthen the model’s ability to extract representative features from cluttered backgrounds. Additionally, a large separable kernel attention (LSKA)25 mechanism is introduced, which approximates a 7 × 7 depthwise convolution with separable one-dimensional kernels to expand the receptive field and improve the model’s ability to capture global contextual information and focus on key regions in embroidery images. Furthermore, the original bounding box regression loss function is replaced with the Wise-IoU (WIoU)26 loss to suppress the influence of low-quality samples and enhance localization accuracy. The overall structural improvements are illustrated in Fig. 5, where yellow dashed lines indicate modified convolution modules, red dashed lines represent C2f-AFE replacements, and blue dashed lines highlight the newly added LSKA components.

Fig. 5
figure 5

Structural diagram of the DAL-YOLOv8 model, highlighting the main modules for feature enhancement and attention mechanisms. DWConv reduces model size and computational cost; the C2f-AFE module enhances feature extraction from complex backgrounds; and the LSKA expands the receptive field and improves attention to key regions in embroidery images.

DWConv

In YOLOv8, convolutional operations account for most of the computational resources (GFLOPs). To reduce the number of computations and optimize the model size, this paper replaces the standard convolution with depthwise separable convolution (DWConv)24. As shown in Fig. 6, DWConv is completed in two steps. First, the depthwise convolution operation is carried out. Each channel is processed separately by a single convolution kernel, keeping the number of channels unchanged and only extracting spatial features. Subsequently, the pointwise convolution operation is performed, which mixes the information of all channels and accordingly adjusts the number of channels. This method can significantly reduce the computational resources while retaining the key feature information.
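A minimal PyTorch sketch of this two-step structure is given below; the use of BatchNorm and the SiLU activation (mirroring YOLOv8's standard Conv block) is an assumption for illustration.

```python
# Depthwise separable convolution as described above: a per-channel depthwise
# convolution followed by a 1x1 pointwise convolution that mixes channels.
import torch
import torch.nn as nn

class DWConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, padding=k // 2,
                                   groups=c_in, bias=False)      # spatial features, per channel
        self.pointwise = nn.Conv2d(c_in, c_out, 1, 1, bias=False)  # mixes channels, sets c_out
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: an 80x80 feature map mapped from 64 to 128 channels with stride 2.
# x = torch.randn(1, 64, 80, 80); y = DWConv(64, 128, k=3, s=2)(x)  # -> (1, 128, 40, 40)
```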

Fig. 6
figure 6

DWConv module structure, showing depthwise and pointwise convolution operations for efficient feature extraction.

C2f-AFE

To address the tendency of existing deep learning methods to neglect semantic cues during semantic segmentation in complex scenarios (e.g., cluttered backgrounds, translucent objects), Ali et al.29 proposed the feature amplification network (FANet). The adaptive feature enhancement (AFE) module is the core component of FANet. By integrating four submodules, namely (1) convolutional embedding (CE), (2) a spatial context module (SCM), (3) a feature refinement module (FRM), and (4) a convolutional multilayer perceptron (ConvMLP), the AFE module achieves multilevel enhancement of the input features (the detailed architecture is shown in Fig. 7). Their experiments confirmed the efficacy of this integrated design: the module effectively captures large-scale contextual information along with high- and low-frequency features, significantly improving object boundary discrimination, particularly in scenarios involving translucent objects and cluttered backgrounds.

Fig. 7
figure 7

Structure diagram of the AFE module, including convolutional embedding, spatial context, feature refinement, and ConvMLP components.

The CE module applies LayerNorm and depthwise separable convolution to the input features in sequence and then halves the number of channels through a 1 × 1 convolution, which reduces computation, promotes feature mixing, and improves feature generalization. The SCM module uses a large-kernel group convolution of size 7 × 7 to expand the receptive field and capture a wider range of spatial context, addressing scale changes in complex scenarios. The FRM module captures both low-frequency and high-frequency information through the downsampling and upsampling operations of depthwise separable convolutions to achieve feature refinement. The outputs of the SCM and FRM are then concatenated, passed through the ConvMLP for further feature fusion and enhancement, and finally fused by a 1 × 1 convolution. The output of the AFE module is given in formula (2):

$${X}_{AFE}=Con{v}_{1\times 1}(ConvMLP(Concat({X}_{SCM},{X}_{FRM}))),{X}_{AFE}\in {R}^{H\times W\times C}$$
(2)

In Cantonese embroidery image detection, complex backgrounds often cause detection difficulties. The AFE module addresses this by expanding the receptive field through the large-kernel convolution of the SCM module to capture global contextual information, while fusing high- and low-frequency features via the FRM module to enhance texture details and boundary information; this is pertinent for handling complex textures (e.g., embroidery stitches and pattern contours) and background interference in Cantonese embroidery images. On this basis, this paper combines the AFE module with YOLOv8’s C2f module to design the C2f-AFE module (Fig. 8), which realizes the efficient fusion of multilevel features and thus improves the localization and classification accuracy of target detection.

Fig. 8
figure 8

Structure of the C2f-AFE module, integrating the C2f structure with the AFE module.

The calculation formula of the bottleneck AFE module is as follows:

$${X}^{{\prime} }=Con{v}_{3\times 3}(Con{v}_{3\times 3}(X))$$
(3)
$${X}_{Bottleneck-AFE}=X+AFE({X}^{{\prime} })$$
(4)

The feature map X first passes through two standard 3 × 3 convolutions to obtain the feature map \(X^{\prime}\). \(AFE(X^{\prime})\) denotes the feature map obtained by processing \(X^{\prime}\) with the AFE module, and \({X}_{Bottleneck-AFE}\) denotes the output of the bottleneck AFE module.

The calculation formulas of the C2f-AFE module are given in Eqs. (5) to (9):

$$Y=Con{v}_{1\times 1}(X),Y\in {R}^{H\times W\times 2C}$$
(5)
$${Y}_{1},{Y}_{2}=split(Y,\dim =1)$$
(6)
$${Y}_{2}^{(i)}=Bottleneck\_AF{E}_{i}({Y}_{2}^{(i-1)}),\;i=1,2,\cdots ,n,\;{Y}_{2}^{(0)}={Y}_{2}$$
(7)
$${Y}_{cat}=concat([{Y}_{1},{Y}_{2},{Y}_{2}^{(1)},{Y}_{2}^{(2)},\cdots ,{Y}_{2}^{(n)}],\dim =1),{Y}_{cat}\in {R}^{H\times W\times (2+n)C}$$
(8)
$${Y}_{out}=Con{v}_{1\times 1}({Y}_{cat}),{Y}_{out}\in {R}^{H\times W\times C}$$
(9)

Here, “split” divides Y along the channel dimension into two parts, \({Y}_{1}\) and \({Y}_{2}\). The output \({Y}_{2}^{(i)}\) of each bottleneck AFE block serves as the input to the next. \({Y}_{cat}\) is the concatenated feature map, and \({Y}_{out}\) is the final output feature map.
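The following simplified PyTorch sketch illustrates how Eqs. (3) to (9) fit together; it follows the structure described above rather than the official FANet implementation, and submodule details such as kernel sizes, normalization layers, and the ConvMLP design are assumptions for illustration.

```python
# Reduced sketch of the AFE block, the Bottleneck-AFE block (Eqs. 3-4), and the
# C2f-AFE wrapper (Eqs. 5-9). Internal design choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFE(nn.Module):
    """SCM (large-kernel context) + FRM (high/low-frequency refinement),
    fused by ConvMLP and a 1x1 convolution, as in Eq. (2)."""
    def __init__(self, c: int):
        super().__init__()
        # CE: channel-wise normalisation + depthwise separable conv, channels halved
        self.ce = nn.Sequential(
            nn.GroupNorm(1, c),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.Conv2d(c, c // 2, 1, bias=False),
        )
        # SCM: 7x7 grouped convolution for a wide receptive field
        self.scm = nn.Conv2d(c // 2, c // 2, 7, padding=3, groups=c // 2, bias=False)
        # FRM: down-/up-sampling depthwise path for high/low-frequency cues
        self.frm_down = nn.Conv2d(c // 2, c // 2, 3, stride=2, padding=1,
                                  groups=c // 2, bias=False)
        self.frm_up = nn.Conv2d(c // 2, c // 2, 1, bias=False)
        # ConvMLP + 1x1 fusion back to c channels
        self.mlp = nn.Sequential(nn.Conv2d(c, c, 1), nn.GELU(), nn.Conv2d(c, c, 1))
        self.fuse = nn.Conv2d(c, c, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e = self.ce(x)
        scm = self.scm(e)
        frm = F.interpolate(self.frm_up(self.frm_down(e)), size=e.shape[-2:],
                            mode="bilinear", align_corners=False)
        return self.fuse(self.mlp(torch.cat([scm, frm], dim=1)))   # Eq. (2)

class BottleneckAFE(nn.Module):
    """Two 3x3 convolutions followed by an AFE residual branch (Eqs. 3-4)."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.SiLU(),
        )
        self.afe = AFE(c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.afe(self.conv(x))

class C2fAFE(nn.Module):
    """C2f wrapper with Bottleneck-AFE blocks (Eqs. 5-9)."""
    def __init__(self, c: int, n: int = 1):
        super().__init__()
        self.cv1 = nn.Conv2d(c, 2 * c, 1, bias=False)          # Eq. (5)
        self.blocks = nn.ModuleList(BottleneckAFE(c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * c, c, 1, bias=False)     # Eq. (9)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1, y2 = self.cv1(x).chunk(2, dim=1)                    # Eq. (6)
        outs = [y1, y2]
        for block in self.blocks:                               # Eq. (7)
            outs.append(block(outs[-1]))
        return self.cv2(torch.cat(outs, dim=1))                 # Eq. (8)
```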

LSKA

Cantonese embroidery images present several challenges, including substantial shape variations across categories despite similar textures, frequent missed detections of small objects, interference from complex backgrounds, and partial occlusions. Traditional convolutional neural networks, constrained by limited local receptive fields, struggle to capture the broader contextual information necessary for accurate recognition under these conditions. To address this issue, this study conducts comparative experiments with common attention modules, including CBAM30, SE31, CA32, and ECA33. The results are shown in Table 2; the large separable kernel attention (LSKA) method proposed by Lau et al.25, with the kernel size set to 7 × 7, achieves the best effect.

Table 2 Comparative results of the modules for each attention mechanism

The structure of the LSKA module is shown in Fig. 9. The LSKA decomposes the two-dimensional large convolution kernel into one-dimensional kernels along the horizontal and vertical directions, significantly reducing computational complexity and memory usage. Its expansive receptive field enhances the ability to distinguish Cantonese embroidery patterns that exhibit similar textures yet differ in shape. Moreover, it facilitates the integration of contextual information across multiple scales, strengthens the association between small objects and surrounding patterns, and effectively decreases the likelihood of missed detections.

Fig. 9
figure 9

LSKA module structure, illustrating the large separable kernel attention mechanism.

For a given convolution kernel of size k×k and its feature map F \((F\in {R}^{H\times W\times C})\), where C is the number of input channels, H and W represent the height and width of the feature map, respectively, and d represents the dilation rate, the LSKA module is implemented according to the following steps.

First, the original k×k convolution kernel is decomposed into two cascaded one-dimensional convolution kernels, \({W}_{1}\in {R}^{(1\times (2d-1))}\) and \({W}_{2}\in {R}^{((2d-1)\times 1)}\). Local spatial information is extracted through depthwise convolution to obtain the output feature map \({Z}^{C}\) of the depthwise convolution.

$${Z}^{C}=\mathop{\sum }\limits_{H,W}{W}_{2}^{C}\,\ast \,\left(\mathop{\sum }\limits_{H,W}{W}_{1}^{C}\ast {F}^{C}\right)$$
(10)

Then, the dilated depthwise convolution of size \(\lfloor \frac{k}{d}\rfloor \times \lfloor \frac{k}{d}\rfloor\) is decomposed into finer-grained one-dimensional dilated depthwise convolutions, \({V}_{1}\in {R}^{(1\times \lfloor \frac{k}{d}\rfloor )}\) and \({V}_{2}\in {R}^{(\lfloor \frac{k}{d}\rfloor \times 1)}\), and the output feature \({Z}^{{\prime} C}\), which contains global spatial information, is obtained through depthwise convolution.

$${{Z}^{{\prime} }}^{C}=\mathop{\sum }\limits_{H,W}{V}_{2}^{C}\,\ast \,\left(\mathop{\sum }\limits_{H,W}{V}_{1}^{C}\ast {Z}^{C}\right)$$
(11)

Finally, an attention map \({A}^{C}\) is obtained through a 1 × 1 convolution. The Hadamard product of the attention map \({A}^{C}\) and the input feature map F gives the output \({\overline{F}}^{C}\) of the LSKA module.

$${A}^{C}=Con{v}_{1\times 1}\ast {Z}^{{{\prime} }^{C}}$$
(12)
$${\overline{F}}^{C}={A}^{C}\otimes {F}^{C}$$
(13)

Among them, * represents the convolution operation, \(\lfloor .\rfloor\) represents the floor operation, and \(\otimes\) indicates channelwise broadcasted elementwise multiplication (Hadamard product).
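A minimal PyTorch sketch of this decomposition is given below; the default dilation rate d = 2 and the padding scheme are assumptions chosen to keep the spatial size unchanged for k = 7.

```python
# LSKA following Eqs. (10)-(13): cascaded 1-D depthwise convolutions, cascaded
# 1-D dilated depthwise convolutions, a 1x1 convolution forming the attention
# map, and a channel-wise Hadamard product with the input.
import torch
import torch.nn as nn

class LSKA(nn.Module):
    def __init__(self, c: int, k: int = 7, d: int = 2):
        super().__init__()
        m = 2 * d - 1          # size of the local 1-D kernels, Eq. (10)
        n = k // d             # size of the dilated 1-D kernels, floor(k/d), Eq. (11)
        self.w1 = nn.Conv2d(c, c, (1, m), padding=(0, m // 2), groups=c)
        self.w2 = nn.Conv2d(c, c, (m, 1), padding=(m // 2, 0), groups=c)
        self.v1 = nn.Conv2d(c, c, (1, n), padding=(0, (n // 2) * d),
                            dilation=d, groups=c)
        self.v2 = nn.Conv2d(c, c, (n, 1), padding=((n // 2) * d, 0),
                            dilation=d, groups=c)
        self.conv1x1 = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.w2(self.w1(x))      # local spatial information, Eq. (10)
        z = self.v2(self.v1(z))      # dilated path, global context, Eq. (11)
        attn = self.conv1x1(z)       # attention map, Eq. (12)
        return attn * x              # channel-wise Hadamard product, Eq. (13)
```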

Loss function

Low-quality samples (e.g., repeated labeling of the same object or missing labels) inevitably appear in the Cantonese embroidery dataset, and the original CIoU loss function of YOLOv8 handles such samples poorly. If the model overfits these low-quality samples during training, it prioritizes erroneous localization information and assigns them unreasonably large gradient weights, causing it to ignore valid features in high-quality samples and ultimately degrading the overall localization performance. In contrast, Wise-IoU26 measures the “abnormality” (outlier degree) of each sample and reduces the gradient gain of low-quality samples through a dynamic nonmonotonic focusing mechanism, thus improving the model’s learning efficiency on both high-quality and normal-quality samples. WIoU v1 lacks an effective focusing mechanism and cannot distinguish anchor boxes of different qualities. WIoU v2 adopts a monotonic focusing mechanism in which the gradient gain varies monotonically with the loss value, so the gain shrinks as the loss decreases late in training, slowing model convergence. WIoU v3 introduces a dynamic nonmonotonic focusing mechanism that evaluates anchor box quality via the outlier degree (the ratio of an anchor box’s IoU loss to its running mean) and constructs nonmonotonic focusing coefficients that assign small gradient gains to both very high- and very low-quality anchor boxes, so that learning concentrates on average-quality anchors; this suppresses the harmful gradients from low-quality samples and improves the overall performance of the detector. Experiments by Tong et al.26 also verified its effectiveness: AP75 on the MS-COCO dataset increased from 53.03% to 54.50% when WIoU was applied to YOLOv7.
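The following PyTorch sketch illustrates the WIoU v3 formulation outlined above; the hyperparameters alpha and delta, the momentum of the running-mean IoU loss, and the (x1, y1, x2, y2) box format are assumptions for illustration rather than the exact settings used in this paper.

```python
# Sketch of the WIoU v3 bounding-box loss following the Wise-IoU formulation:
# a distance-based penalty on the IoU loss (v1) rescaled by a nonmonotonic
# focusing coefficient computed from the outlier degree (v3).
import torch

class WIoUv3:
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean of the IoU loss

    def __call__(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Intersection over union for (x1, y1, x2, y2) boxes of shape (N, 4)
        lt = torch.maximum(pred[:, :2], target[:, :2])
        rb = torch.minimum(pred[:, 2:], target[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
        area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
        iou = inter / (area_p + area_t - inter + 1e-7)
        iou_loss = 1.0 - iou

        # WIoU v1: centre-distance penalty normalised by the enclosing box
        # (denominator detached from the gradient graph)
        enc_wh = (torch.maximum(pred[:, 2:], target[:, 2:])
                  - torch.minimum(pred[:, :2], target[:, :2]))
        cp = (pred[:, :2] + pred[:, 2:]) / 2
        ct = (target[:, :2] + target[:, 2:]) / 2
        r_wiou = torch.exp(((cp - ct) ** 2).sum(dim=1)
                           / (enc_wh ** 2).sum(dim=1).detach())
        loss_v1 = r_wiou * iou_loss

        # WIoU v3: outlier degree and nonmonotonic focusing coefficient
        beta = iou_loss.detach() / self.iou_loss_mean
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean
                              + self.momentum * iou_loss.mean().item())
        return (r * loss_v1).mean()
```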

For this reason, this paper adopts WIoU v3 instead of CIoU as the bounding box regression loss function to minimize the negative impact of low-quality samples on model performance. The nonmonotonic focusing mechanism of WIoU v3 has two advantages. First, it reduces the weight of samples with high outlier degrees (e.g., flower petals obscured by stitches) to suppress the interference of texture noise in the loss; second, it strengthens the learning of samples with low outlier degrees (e.g., small targets such as tiny flowers) and improves localization accuracy through adaptive gradients. This loss function compensates for CIoU’s lack of targeted handling of low-quality samples, the insufficient gradient allocation of WIoU v1, and the misjudgment of outlier samples by WIoU v2. Figure 10 compares the effects of CIoU and WIoU v3 on the training and validation sets of the Cantonese embroidery image dataset.

Fig. 10: CIoU and WIoU loss curves during YOLOv8s training.
figure 10

a Bounding box loss; b Classification loss; c DFL loss. Each plot shows four curves corresponding to the training and validation datasets.

As seen from Fig. 10, when YOLOv8s is trained on the Cantonese embroidery image dataset, WIoU outperforms CIoU. For the bounding box loss (Fig. 10a), WIoU decreases faster in the early stage of training, and its loss on both the training and validation sets is lower than that of CIoU in the later stage, indicating faster convergence and a smaller final loss. For the classification loss (Fig. 10b), WIoU drops more steeply in the early stage, converges sooner, and ends with a lower loss than CIoU, reflecting better classification optimization. For the DFL loss (Fig. 10c), WIoU falls sharply in the early stage, converges quickly, and reaches a smaller loss on both the training and validation sets in the later stage. Overall, WIoU is better than CIoU in both convergence speed and loss values, verifying the effectiveness of the improvement.

Results

Experimental environment

All the experiments were performed on a cloud server with the following configuration: a 16-core Intel Xeon(R) Platinum 8474C CPU operating at 2.4 GHz and an NVIDIA GeForce RTX 4090D GPU with 24 GB of memory. The operating system was Ubuntu 22.04, the deep learning framework was PyTorch 2.3.0 with PyCharm as the development environment, the Compute Unified Device Architecture (CUDA) version was 12.1, and the programming language was Python 3.12. All the experiments in this paper were trained and tested in the same environment with the same hyperparameters, data augmentation strategies, and training datasets. The key training parameters are shown in Table 3.

Table 3 Selected key parameters set during model training
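For reference, a hedged example of how such a training run might be launched with the Ultralytics API is shown below; the configuration file names are hypothetical placeholders, the custom modules (DWConv variants, C2f-AFE, LSKA) would need to be registered with the framework, and the exact hyperparameters should follow Table 3.

```python
# Hypothetical training launch for the improved model; file names and
# hyperparameter values below are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("dal-yolov8s.yaml")      # hypothetical model definition with the modified modules
model.train(
    data="canton-embroidery.yaml",    # hypothetical dataset config (train/val/test paths, 10 classes)
    epochs=200,
    imgsz=640,
    batch=16,
    device=0,
)
```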

To evaluate the performance of the improved YOLOv8 model (referred to as the DAL-YOLOv8 model in this study), the experiment employs precision (P), recall (R), mean average precision (mAP), and F1 score as evaluation metrics, providing a comprehensive and objective assessment of the model’s effectiveness.
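For clarity, the following sketch spells out how these metrics are computed from detection counts; the per-class AP values are assumed to come from the detector's evaluation routine rather than being re-derived here.

```python
# Evaluation metrics used in this study: precision, recall, and F1 from
# TP/FP/FN counts, and mAP as the mean of per-class average precision values.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_average_precision(per_class_ap: list[float]) -> float:
    """mAP: average of per-class AP values (e.g., AP at an IoU threshold of 0.5)."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0
```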

As shown in Fig. 11, in the initial stage of DAL-YOLOv8 training, the precision (P) rises rapidly, and both precision and recall (R) stabilize after roughly 50 epochs. At around 180 epochs, the box loss decreases markedly, reflecting the model’s improved ability to locate objects in images accurately. Combined with the steady increase in precision and the continuous decrease in the three loss terms (train/box_loss, train/cls_loss, and train/dfl_loss), these curves indicate that the model trains well.

Fig. 11
figure 11

DAL-YOLOv8 training results, showing training and validation metrics.

Ablation experiments

To evaluate the effectiveness of the improved model, this study designed five sets of ablation experiments on the Cantonese embroidery dataset, using the same equipment and dataset for training and testing to ensure comparability, and the experimental results are shown in Table 4. We selected the feature enhancement module (C2f-AFE), the attention mechanism (LSKA), DWConv, and the loss function (WIoU) and used evaluation metrics for quantitative and qualitative analysis.

Table 4 Ablation experiment results

The results in Table 4 indicate that after the introduction of the C2f-AFE module into the YOLOv8 framework, the model’s mAP, P, and R values increase by 0.5%, 0.5%, and 0.7%, respectively. This improvement is attributed to the adaptive feature enhancement and multilevel feature fusion mechanisms of C2f-AFE, which strengthen the model’s ability to perceive complex texture lines, pattern contours, and global texture distributions in Cantonese embroidery images. Upon further integrating the LSKA attention mechanism on this basis, the R value of the model increases by 1.6%, with a slight improvement in mAP. This is due to the LSKA mechanism’s ability to focus accurately on key regions of targets in Cantonese embroidery images, thereby optimizing the detection effect.

Compared with the YOLOv8s base model, the introduction of DWConv reduces the size of the improved model by 5.12%. Moreover, the application of the WIoU loss function increases the P value by 1.2%, resulting in an mAP of 98.5% and an F1 value of 96.59%, whereas the model size remains stable at 20.4 M. These results demonstrate that the combined model (i.e., DAL-YOLOv8), formed by integrating YOLOv8s with C2f-AFE, LSKA, DWConv, and WIoU, can achieve efficient and accurate target detection in Cantonese embroidery image detection tasks through intermodule synergy. Figure 12 presents a visual comparison of heatmaps for some detection results of DAL-YOLOv8 before and after the addition of the C2f-AFE and LSKA modules.

Fig. 12: Heatmap visualizations.
figure 12

a Original image; b Before adding C2f-AFE and LSKA modules; c After adding C2f-AFE and LSKA modules.

As shown in Fig. 12, after the C2f-AFE module and LSKA attention mechanism are integrated into the YOLOv8 framework, the model’s ability to perceive features in Cantonese embroidery images is significantly improved. A comparison of the heat distributions before and after the addition of these two modules reveals that the integrated model captures targets and pattern regions such as kapok, bird, and phoenix more accurately. Its feature extraction and detection capabilities in complex scenarios are clearly enhanced, verifying the effectiveness of applying these two modules to the Cantonese embroidery image detection task.

Model comparison experiments

To compare the efficiency of the improved model proposed in this paper, Faster R-CNN, SSD, YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10 are selected for comparative experiments. All the experiments are conducted on the same equipment and in the same environment, use the same dataset and data augmentation method, and maintain the same split of training and test sets. Each model is trained for 200 epochs, and the best results are selected for testing. The comparative data for each algorithm in terms of P, R, mAP, F1, and model size are shown in Table 5.

Table 5 Model comparison experiment results

As shown in Table 5, the DAL-YOLOv8 model proposed in this paper has significant advantages over classical target detection models. Its mAP is 8.86 and 8.38 percentage points higher than those of Faster R-CNN and SSD, respectively, it leads comprehensively on indicators such as the P and F1 values, and its model size is smaller. Compared with YOLOv8s, the proposed method improves the P, R, mAP, and F1 values by 0.4%, 2.3%, 0.7%, and 1.37%, respectively, while reducing the model size by 5.12%. Although the model size is slightly larger than those of YOLOv9s and YOLOv10s because of the introduction of complex network modules, such as the feature enhancement module and attention mechanism, and the computational complexity (FLOPs) increases, these additions significantly improve detection performance: the mAP of DAL-YOLOv8 reaches 98.5%, which is 0.3 and 0.2 percentage points higher than those of YOLOv9s and YOLOv10s, respectively. Overall, the performance improvement substantially outweighs the increased computational cost.

To verify the feasibility of DAL-YOLOv8, two representative scene images with different complexities from the test set are selected in this paper to carry out comparison experiments, and the model recognition results before and after the improvement are shown in Fig. 13.

Fig. 13: Detection results across different scene contexts.
figure 13

a YOLOv8s; b DAL-YOLOv8, in simple and complex backgrounds.

Figure 13 shows the comparison results of detection between DAL-YOLOv8 and YOLOv8s. In the left image (YOLOv8s) and the right image (DAL-YOLOv8), the areas circled in purple indicate missed detection targets, and the annotations in blue represent false detection targets. In the first image with a simple background and a multisemantic scene, both models achieve full target detection, but DAL-YOLOv8 is superior in terms of positioning and recognition accuracy. In the second image, which has a complex background and a multisemantic scene, YOLOv8s results in missed and false detections. However, by integrating the C2f-AFE feature enhancement module and the LSKA attention mechanism, DAL-YOLOv8 effectively improves the feature recognition ability. In complex backgrounds, it not only strengthens the perception of global features but also takes into account the extraction of small target details and large target global features. Finally, it achieves accurate recognition of all categories in the image, and the recognition accuracy of each category is better than that of YOLOv8s.

Figure 14 presents a comparison of the detection accuracy across ten categories of Cantonese embroidery images between the proposed DAL-YOLOv8 model and the baseline YOLOv8s model. DAL-YOLOv8 outperforms YOLOv8s in all categories, with the accuracy for the “lotus” category reaching 99.4%, which is 0.9 percentage points higher than that of YOLOv8s. For categories with complex textures and fine embroidery details, such as “peacock” and “phoenix,” the improved model achieves accuracy gains of 2.6% and 0.2%, respectively. In the case of “dragon,” a category with larger patterns, the recognition accuracy reaches 99.2%, whereas even for smaller and less prominent categories such as “lychee,” an improvement of 0.9% is observed. These enhancements can be attributed to the integration of the C2f-AFE feature enhancement module and the LSKA attention mechanism in DAL-YOLOv8, which significantly improve the model’s ability to extract and recognize features under complex conditions.

Fig. 14
figure 14

Detection results for ten categories before and after model improvement. Precision is expressed as a percentage (%).

According to the dataset distribution shown in Table 1, categories such as “dragon,” “phoenix,” and “mandarin duck” have significantly fewer samples across the training, validation, and test sets than categories such as “flower” and “bird”. However, as illustrated in Fig. 14, the DAL-YOLOv8 model achieves recognition accuracies exceeding 98% for these underrepresented categories, even outperforming more frequently occurring categories such as “kapok” and “lychee”. This suggests that although the dataset exhibits class imbalance, it does not substantially affect the model’s recognition performance. In other words, the impact of dataset imbalance on the final detection results is relatively limited.

Discussion

This study constructs a Cantonese embroidery image dataset encompassing ten thematic categories, offering a valuable resource for the digital preservation of intangible cultural heritage, academic research, and technological innovation. Based on this dataset, the DAL-YOLOv8 framework is proposed, introducing four key enhancements: (1) the C2f-AFE module improves feature extraction under complex backgrounds; (2) the LSKA attention mechanism enables a precise focus on key regions of embroidery patterns; (3) DWConv replaces standard convolution to reduce model complexity; and (4) the WIoU loss function improves the localization accuracy of detection boxes. These improvements effectively address core challenges in Cantonese embroidery image recognition, including limited sample size, large morphological variations, and background interference. For the constructed dataset, DAL-YOLOv8 achieves a mean average precision (mAP) of 98.5%, with a 5.12% reduction in model size compared with the baseline YOLOv8, outperforming mainstream methods in both accuracy and efficiency. This work establishes a high-precision, lightweight, automated recognition paradigm for embroidery image detection, providing reliable technical support for the digital preservation of cultural heritage.

Future research will focus on three directions: (1) migrating the core modules of DAL-YOLOv8—feature enhancement, attention mechanisms, and lightweight design—into newer architectures such as YOLOv10, with emphasis on enhancing model stability under varying lighting conditions; (2) developing embedded applications based on this framework for deployment in museum guided tour systems and Cantonese embroidery digital archive platforms, enabling practical implementation; and (3) designing customized feature enhancement modules for other embroidery traditions, such as Miao and Shen embroidery, to validate the scalability and adaptability of the proposed framework. These efforts aim to provide effective technical tools for the digital documentation and sustainable transmission of traditional embroidery craftsmanship.