Introduction

In October 2003, the United Nations Educational, Scientific and Cultural Organization (UNESCO) adopted the Convention for the Safeguarding of the Intangible Cultural Heritage at its 32nd General Conference1. As a typical representative of intangible cultural heritage, embroidery not only carries profound cultural connotations but also embodies exquisite handicraft skills. Cantonese embroidery, one of the four famous embroideries of China and a national-level intangible cultural heritage item, can be traced back to the Tang Dynasty, reached its peak in the Ming and Qing Dynasties, and was especially renowned during the era of the “Thirteen Hongs” of Guangzhou’s foreign trade in the Qing Dynasty. Cantonese embroidery is known for its rich subject matter, full compositions, vivid colors, and varied stitches, highlighting the unique cultural characteristics of the Lingnan region. Its subject matter is particularly distinctive, excelling at depicting scenery unique to the Lingnan region, such as flowers, lychees, kapoks, magpies, partridges, phoenixes, cranes, peacocks, and many other subjects2.

In the field of image recognition, accurately recognizing objects in images has always been a challenge for scholars. Traditional image recognition techniques (e.g., template matching3) localize objects by sliding templates to compute the similarity pixel by pixel; however, they suffer from both low accuracy and efficiency in complex backgrounds. With the iterative development of deep learning techniques, their application in image detection has become increasingly feasible. It has been confirmed in the literature4 that target detection technology enables the automatic identification and localization of objects within images. The feasibility of this technology provides a novel approach for detecting motif elements in Cantonese embroidery images. This technique can not only accurately determine the category and location of motif elements in Cantonese embroidery images but also allow researchers to conduct efficient categorization and precise archiving of embroidery works based on the detection results, thus enhancing the efficiency and accuracy of Cantonese embroidery research and management. In addition, the combination of the target detection technique and generative adversarial networks (GANs) enables intelligent restoration of damaged regions in Cantonese embroidery images5. Moreover, recent developments in stylized description generation models, such as style embedding-based variational autoencoder (SE-VAE)6 and XGL-Transformer (XGL-T)7, combined with object detection techniques, hold great potential for generating stylized textual descriptions of Cantonese embroidery images, thereby enriching their multimodal representations. These approaches provide valuable references for future research that integrates detection and generation. Conducting automatic recognition research on Cantonese embroidery images via object detection holds significant theoretical value and promising application prospects.

With the wide application of deep learning technology in the field of computer vision, convolutional neural networks (CNNs)8 have become a core method for tasks such as image classification, target detection, and semantic segmentation. CNNs can automatically learn image features and efficiently handle images of different sizes, textures, and colors. Current CNN-based target detection methods can be categorized into two-stage and single-stage algorithms. Two-stage algorithms, such as R-CNN9, Faster R-CNN10, and Mask R-CNN11, first generate object candidate boxes via region proposals and then classify those candidates. In contrast, single-stage algorithms such as YOLO12, SSD13, and EfficientDet14 perform regression prediction directly on the image and output the location and category information of the objects in a single pass.

Among two-stage algorithms, early models like R-CNN suffered from inefficiencies due to redundant computation9. Later developments, such as SPPNet15 and Fast R-CNN16, presented pooling strategies to enhance performance. Faster R-CNN10 further improved efficiency by incorporating a region proposal network (RPN), enabling end-to-end training and near real-time detection. Nevertheless, this method still relies heavily on anchor box design, is sensitive to hyperparameters, and has high computational demands when using deep backbones.

Two-stage algorithms have been applied in the domain of embroidery and textile recognition. For example, a Faster R-CNN framework based on a ResNet50 backbone17 has successfully identified multiple categories of patterns in traditional Chinese embroideries. Another study18 compared the performance of a VGG-based Faster R-CNN and an AlexNet-based CNN in fabric pattern recognition and found that both achieved good accuracy and precision. However, these methods are mainly suited to single-label images: when multiple patterns coexist, mutual interference reduces recognition accuracy, and the overall detection speed remains relatively slow. These limitations indicate that traditional two-stage models still struggle to generalize to complex multi-label embroidery image recognition tasks.

Single-stage detection algorithms, characterized by their efficient design, offer outstanding speed advantages in real-time application scenarios and have now surpassed many traditional multistage detection methods in both speed and accuracy. For example, YOLO (you only look once), a classical target detection algorithm proposed by Redmon et al.12 at CVPR 2016, introduced an efficient single-stage detection architecture; SSD, proposed by Liu et al.13 at ECCV 2016, directly predicts bounding boxes and categories on feature maps at different scales, achieving end-to-end detection that is fast enough for real-time use.

Single-stage detection algorithms have been widely used in textile pattern recognition in recent years because of their efficient architecture. A previous study19 applied SSD and YOLOv8 to the recognition of Nantong blue calico fabric patterns and improved mAP by replacing the VGG backbone with MobileNetV2, confirming the positive effect of lightweight architectures on performance. However, the method still performed poorly in recognizing geometric patterns.

To further improve recognition performance, recent research has begun to explore structural optimizations, such as combining lightweight networks with multi-scale feature fusion modules. One study20 introduced a spatial pyramid pooling (SPP) module into a MobileNetV1-based model to enhance the fusion of local and global features, applying it to the recognition of Nantong Shen embroidery patterns. Despite modest improvements in feature extraction, the overall gain was constrained by fundamental shortcomings of the MobileNetV1 architecture. Subsequently, the backbone was upgraded to MobileNetV3, incorporating attention mechanisms and efficient activation functions, which significantly reduced the number of parameters while enhancing the model’s ability to allocate feature weights, achieving both a lightweight design and improved recognition performance21.

In addition to architectural improvements, attention mechanisms have been widely adopted to enhance the model’s sensitivity to key regions in images. For example, an image captioning model integrating Faster R-CNN and Inception V3, known as MAA-FIC22, employed adaptive attention to effectively capture salient objects and their spatial relationships, enabling the generation of coherent and factual textual descriptions. Another study23 incorporated a multi-head attention mechanism into ResNet50 to improve focus on texture details and color variations, combined with data augmentation strategies to enhance model robustness. Although these methods increase computational complexity, they can effectively improve feature extraction capabilities and help achieve more accurate image recognition under complex conditions.

Although CNN-based object detection models have achieved certain success in embroidery image recognition, they still face limitations when dealing with the complex backgrounds, multi-label coexistence, dense textures, and rich colors characteristic of Cantonese embroidery images. In particular, while the mainstream YOLOv8 model demonstrates robust detection performance, its capability to model intricate patterns and semantically complex regions remains limited. To address these challenges, this paper proposes an improved YOLOv8-based model that integrates feature enhancement and attention mechanisms, aiming to improve detection accuracy and model robustness for Cantonese embroidery images.

Methods

To address the scarcity of Cantonese embroidery datasets, low recognition accuracy under complex backgrounds, and significant interclass morphological variation, this study proposes an improved YOLOv8 model named DAL-YOLOv8. First, a dedicated Cantonese embroidery dataset is constructed through image collection, augmentation, and annotation. The YOLOv8 model is subsequently enhanced by introducing DWConv24 to reduce the computational cost, a C2f-AFE module for better feature extraction, an LSKA25 attention mechanism to expand the receptive field, and the WIoU26 loss function to reduce the impact of low-quality samples. Finally, experimental results are used to validate the effectiveness of the proposed model. The overall workflow is illustrated in Fig. 1:

Fig. 1
figure 1

Flowchart of the DAL-YOLOv8 framework. In the first stage, the Cantonese embroidery images are augmented and annotated. In the second stage, the model detects objects and locates their positions. In the third stage, the output results are analyzed.

Construction of an object detection dataset for Cantonese embroidery images

The dataset constructed in this study contains 1860 Cantonese embroidery images from three sources. The first part comes from the designated safeguarding unit for Cantonese embroidery, an intangible cultural heritage item in Guangdong Province, China; it consists of 132 Cantonese embroidery shawl images, which are expanded to 266 images after processing. The second part, provided by the Cantonese Embroidery Craft Factory Co., Ltd., contains 60 images. The third part comprises 1534 representative Cantonese embroidery images screened from the internet, each with a resolution greater than 500 × 500 pixels and a clean background (a uniform solid-color substrate with no folds, stains, or shadows). Figure 2 shows some examples of Cantonese embroidery images randomly selected from the dataset.

Fig. 2
figure 2

Sample images from the Cantonese embroidery dataset, including ten object categories.

To increase the training effectiveness and generalizability of the model, this study applies data augmentation techniques to expand the dataset. The specific methods include horizontal flipping, vertical flipping, the addition of Gaussian noise, and random contrast adjustment, thereby increasing sample diversity. After four data augmentation techniques are applied to each Cantonese embroidery image, the total number of images in the dataset is expanded to 9300. The dataset is subsequently partitioned into training, validation, and test sets at a ratio of 8:1:1. Specifically, the training set comprises 7440 images, whereas the validation and test sets each contain 930 images. These subsets are utilized for model training and performance evaluation. Figure 3 presents several examples illustrating the effects of these augmentation techniques on Cantonese embroidery images.
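As a concrete illustration, the sketch below shows how the four augmentation operations could be scripted with the albumentations library; the library choice, folder layout, and parameter values are assumptions, since the paper does not specify its implementation.

```python
# Minimal sketch of the four augmentation operations described above, using
# albumentations (an assumption; the paper does not state which toolkit was used).
# Each original image yields four augmented copies, expanding 1860 images to 9300.
import os
import cv2
import albumentations as A

AUGMENTATIONS = {
    "hflip": A.HorizontalFlip(p=1.0),               # horizontal flip
    "vflip": A.VerticalFlip(p=1.0),                 # vertical flip
    "gauss": A.GaussNoise(p=1.0),                   # additive Gaussian noise
    "contrast": A.RandomBrightnessContrast(         # random contrast adjustment
        brightness_limit=0.0, contrast_limit=0.3, p=1.0),
}

def augment_folder(src_dir: str, dst_dir: str) -> None:
    """Write four augmented variants of every image in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        image = cv2.imread(os.path.join(src_dir, name))
        if image is None:
            continue
        for tag, aug in AUGMENTATIONS.items():
            out = aug(image=image)["image"]
            stem, ext = os.path.splitext(name)
            cv2.imwrite(os.path.join(dst_dir, f"{stem}_{tag}{ext}"), out)
```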

Fig. 3: Data augmentation diagram.
figure 3

a Original image; b Horizontal flip; c Vertical flip; d Addition of Gaussian noise; e Random contrast adjustment. Images (b–e) are generated through data augmentation.

Cantonese embroidery has a rich and diverse range of themes, often taking the distinctive natural scenery and cultural elements of the Lingnan region as its subjects, resulting in a unique style described as “both rustic and refined”27. Its content covers natural scenery (such as flowers, landscapes, lychees, and kapoks), auspicious totems (such as dragons, phoenixes, cranes, and peacocks), and folk imagery (such as figures, fish, and birds). Among these, lychee, kapok, and three bird species (white crane, magpie, and partridge) are particularly representative of the cultural heritage. This study selects ten typical categories from these culturally significant elements, namely flower, bird, lychee, kapok, lotus, mandarin duck, peacock, crane, dragon, and phoenix, for in-depth analysis. The specific classification is illustrated in Fig. 4. To annotate these ten categories, the LabelImg annotation tool was employed to manually label rectangular regions. Table 1 presents the distribution of the ten categories across the training, validation, and test sets of the Cantonese embroidery dataset.

Fig. 4: Categories of Cantonese embroidery objects.
figure 4

a flower; b bird; c lychee; d kapok; e lotus; f mandarin duck; g peacock; h crane; i dragon; j phoenix.

Table 1 Statistical table of the distribution of the Cantonese embroidery dataset

To investigate the impact of background complexity on model performance in Cantonese embroidery images, this study categorizes the images into two types: simple background and complex background. Images with simple backgrounds are characterized by minimal background elements, uniform colors, simple textures, clear object boundaries, and low visual interference. In contrast, complex background images feature numerous patterns, dense textures, rich colors, and overlapping or occluded objects. To enhance the objectivity of this classification, the spatial density index (SDI) is introduced as a quantitative metric. The SDI takes into account both the number of objects in an image and their spatial distribution, providing a measure of object density. The calculation method is as follows:

$$SDI=\frac{N}{W\times H}\times \frac{1}{\bar{d}}$$
(1)

Here, N represents the number of detected objects in the image, W × H denotes the resolution area of the image, and \(\bar{d}\) refers to the average Euclidean distance between the centers of all detected objects. In this study, the following thresholds are defined: images with SDI < \(1\times {10}^{-5}\) are classified as having a simple background, whereas those with SDI > \(1\times {10}^{-4}\) are classified as having a complex background. This classification provides a quantitative basis for analyzing model performance under different background conditions.
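The following minimal Python sketch illustrates how Eq. (1) and the two thresholds can be applied; the handling of images with fewer than two objects is an assumption not covered in the text.

```python
# Sketch of the SDI computation in Eq. (1). Box centres are assumed to come
# from the LabelImg annotations; the threshold values follow the text.
from itertools import combinations
import math

def spatial_density_index(centers, width, height):
    """centers: list of (x, y) object centres in pixels; width, height: image size."""
    n = len(centers)
    if n < 2:
        return 0.0  # average pairwise distance is undefined for fewer than two objects
    pair_dists = [math.dist(a, b) for a, b in combinations(centers, 2)]
    d_bar = sum(pair_dists) / len(pair_dists)
    return n / (width * height) * (1.0 / d_bar)

def background_label(sdi: float) -> str:
    if sdi < 1e-5:
        return "simple"
    if sdi > 1e-4:
        return "complex"
    return "unclassified"  # images falling between the two thresholds
```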

DAL-YOLOv8: an object detection algorithm with feature enhancement and an attention mechanism for Cantonese embroidery images

The YOLO series of algorithms has evolved over many iterations to continuously optimize the detection performance and inference speed. Among these, YOLOv8 stands out as the most representative version because of its exceptional detection accuracy, high inference speed, and strong system compatibility. Its architecture comprises three core functional modules: the backbone, which is responsible for extracting primary features; the neck, which enables multiscale feature fusion via a feature pyramid; and the head, which generates detection outputs—including bounding boxes, confidence scores, and category labels—based on the refined feature maps. For different application scenarios, YOLOv8 provides five model specifications: n/s/m/l/x28. This study improves upon YOLOv8s.

Cantonese embroidery poses significant challenges for image detection because of its intricate subjects, complex compositions, vivid colors, and diverse stitching techniques. To address these issues, this study proposes an improved detection model, DAL-YOLOv8, which is based on the YOLOv8s architecture. The model integrates feature enhancement and attention mechanisms to improve recognition performance under complex backgrounds. Starting from the original YOLOv8, the network architecture is reconstructed through module replacement and fusion. Specifically, standard convolutional layers are replaced with depthwise separable convolutions (DWConv)24 to reduce the model size and computational cost. A feature enhancement module (C2f-AFE) is designed to strengthen the model’s ability to extract representative features from cluttered backgrounds. Additionally, a large separable kernel attention (LSKA)25 mechanism is introduced, which approximates a 7 × 7 depthwise convolution with separable one-dimensional kernels to expand the receptive field and improve the model’s ability to capture global contextual information and focus on key regions in embroidery images. Furthermore, the original bounding box regression loss function is replaced with the Wise-IoU (WIoU)26 loss to suppress the influence of low-quality samples and enhance localization accuracy. The overall structural improvements are illustrated in Fig. 5, where yellow dashed lines indicate modified convolution modules, red dashed lines represent C2f-AFE replacements, and blue dashed lines highlight the newly added LSKA components.

Fig. 5
figure 5

Structural diagram of the DAL-YOLOv8 model, highlighting the main modules for feature enhancement and attention mechanisms. DWConv reduces model size and computational cost; the C2f-AFE module enhances feature extraction from complex backgrounds; and the LSKA expands the receptive field and improves attention to key regions in embroidery images.

DWConv

In YOLOv8, convolutional operations account for most of the computational resources (GFLOPs). To reduce the number of computations and optimize the model size, this paper replaces the standard convolution with depthwise separable convolution (DWConv)24. As shown in Fig. 6, DWConv is completed in two steps. First, the depthwise convolution operation is carried out. Each channel is processed separately by a single convolution kernel, keeping the number of channels unchanged and only extracting spatial features. Subsequently, the pointwise convolution operation is performed, which mixes the information of all channels and accordingly adjusts the number of channels. This method can significantly reduce the computational resources while retaining the key feature information.
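A minimal PyTorch sketch of this two-step structure is given below; the use of BatchNorm and the SiLU activation (mirroring YOLOv8's standard Conv block) is an assumption for illustration.

```python
# Depthwise separable convolution as described above: a per-channel depthwise
# convolution followed by a 1x1 pointwise convolution that mixes channels.
import torch
import torch.nn as nn

class DWConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, padding=k // 2,
                                   groups=c_in, bias=False)      # spatial features, per channel
        self.pointwise = nn.Conv2d(c_in, c_out, 1, 1, bias=False)  # mixes channels, sets c_out
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: an 80x80 feature map mapped from 64 to 128 channels with stride 2.
# x = torch.randn(1, 64, 80, 80); y = DWConv(64, 128, k=3, s=2)(x)  # -> (1, 128, 40, 40)
```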

Fig. 6
figure 6

DWConv module structure, showing depthwise and pointwise convolution operations for efficient feature extraction.

C2f-AFE

To address the tendency of existing deep learning methods to neglect semantic cues during semantic segmentation in complex scenarios (e.g., cluttered backgrounds, translucent objects), Ali et al.29 proposed the feature amplification network (FANet). The adaptive feature enhancement (AFE) module is the core component of FANet. By integrating four submodules, namely (1) convolutional embedding (CE), (2) a spatial context module (SCM), (3) a feature refinement module (FRM), and (4) a convolutional multilayer perceptron (ConvMLP), the AFE module achieves multilevel enhancement of the input features (the detailed architecture is shown in Fig. 7). Their experiments confirmed the efficacy of this integrated design: the module effectively captures large-scale contextual information along with high- and low-frequency features, significantly improving object boundary discrimination, particularly in scenarios involving translucent objects and cluttered backgrounds.

Fig. 7
figure 7

Structure diagram of the AFE module, including convolutional embedding, spatial context, feature refinement, and ConvMLP components.

The CE module applies LayerNorm and depthwise separable convolution to the input features in sequence and then halves the number of channels through a 1 × 1 convolution, which reduces computation, promotes feature mixing, and improves feature generalization. The SCM module uses a large-kernel group convolution of size 7 × 7 to expand the receptive field and capture a wider range of spatial context, addressing scale changes in complex scenarios. The FRM module captures both low-frequency and high-frequency information through the downsampling and upsampling operations of depthwise separable convolutions to achieve feature refinement. The outputs of the SCM and FRM are then concatenated, passed through the ConvMLP for further feature fusion and enhancement, and finally fused by a 1 × 1 convolution. The output of the AFE module is given in formula (2):

$${X}_{AFE}=Con{v}_{1\times 1}(ConvMLP(Concat({X}_{SCM},{X}_{FRM}))),{X}_{AFE}\in {R}^{H\times W\times C}$$
(2)

In Cantonese embroidery image detection, complex backgrounds often cause detection difficulties. The AFE module addresses this by expanding the receptive field through the large-kernel convolution of the SCM module to capture global contextual information, while fusing high- and low-frequency features via the FRM module to enhance texture details and boundary information; this is pertinent for handling complex textures (e.g., embroidery stitches and pattern contours) and background interference in Cantonese embroidery images. On this basis, this paper combines the AFE module with YOLOv8’s C2f module to design the C2f-AFE module (Fig. 8), which realizes the efficient fusion of multilevel features and thus improves the localization and classification accuracy of target detection.

Fig. 8
figure 8

Structure of the C2f-AFE module, integrating the C2f structure with the AFE module.

The calculation formula of the bottleneck AFE module is as follows:

$${X}^{{\prime} }=Con{v}_{3\times 3}(Con{v}_{3\times 3}(X))$$
(3)
$${X}_{Bottleneck-AFE}=X+AFE({X}^{{\prime} })$$
(4)

The feature map X first passes through two standard 3 × 3 convolutions to obtain the feature map \(X^{\prime}\). \(AFE(X^{\prime})\) denotes the feature map obtained by processing \(X^{\prime}\) with the AFE module, and \({X}_{Bottleneck-AFE}\) denotes the output of the bottleneck AFE module.

The calculation formulas of the C2f-AFE module are given in Eqs. (5) to (9):

$$Y=Con{v}_{1\times 1}(X),Y\in {R}^{H\times W\times 2C}$$
(5)
$${Y}_{1},{Y}_{2}=split(Y,\dim =1)$$
(6)
$${Y}_{2}^{(i)}=Bottleneck\_AF{E}_{i}({Y}_{2}^{(i-1)}),\;i=1,2,\cdots ,n,\;{Y}_{2}^{(0)}={Y}_{2}$$
(7)
$${Y}_{cat}=concat([{Y}_{1},{Y}_{2},{Y}_{2}^{(1)},{Y}_{2}^{(2)},\cdots ,{Y}_{2}^{(n)}],\dim =1),{Y}_{cat}\in {R}^{H\times W\times (2+n)C}$$
(8)
$${Y}_{out}=Con{v}_{1\times 1}({Y}_{cat}),{Y}_{out}\in {R}^{H\times W\times C}$$
(9)

Here, “split” divides Y along the channel dimension into two parts, \({Y}_{1}\) and \({Y}_{2}\). The output \({Y}_{2}^{(i)}\) of each bottleneck AFE block serves as the input to the next. \({Y}_{cat}\) is the concatenated feature map, and \({Y}_{out}\) is the final output feature map.
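The following simplified PyTorch sketch illustrates how Eqs. (3) to (9) fit together; it follows the structure described above rather than the official FANet implementation, and submodule details such as kernel sizes, normalization layers, and the ConvMLP design are assumptions for illustration.

```python
# Reduced sketch of the AFE block, the Bottleneck-AFE block (Eqs. 3-4), and the
# C2f-AFE wrapper (Eqs. 5-9). Internal design choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFE(nn.Module):
    """SCM (large-kernel context) + FRM (high/low-frequency refinement),
    fused by ConvMLP and a 1x1 convolution, as in Eq. (2)."""
    def __init__(self, c: int):
        super().__init__()
        # CE: channel-wise normalisation + depthwise separable conv, channels halved
        self.ce = nn.Sequential(
            nn.GroupNorm(1, c),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.Conv2d(c, c // 2, 1, bias=False),
        )
        # SCM: 7x7 grouped convolution for a wide receptive field
        self.scm = nn.Conv2d(c // 2, c // 2, 7, padding=3, groups=c // 2, bias=False)
        # FRM: down-/up-sampling depthwise path for high/low-frequency cues
        self.frm_down = nn.Conv2d(c // 2, c // 2, 3, stride=2, padding=1,
                                  groups=c // 2, bias=False)
        self.frm_up = nn.Conv2d(c // 2, c // 2, 1, bias=False)
        # ConvMLP + 1x1 fusion back to c channels
        self.mlp = nn.Sequential(nn.Conv2d(c, c, 1), nn.GELU(), nn.Conv2d(c, c, 1))
        self.fuse = nn.Conv2d(c, c, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e = self.ce(x)
        scm = self.scm(e)
        frm = F.interpolate(self.frm_up(self.frm_down(e)), size=e.shape[-2:],
                            mode="bilinear", align_corners=False)
        return self.fuse(self.mlp(torch.cat([scm, frm], dim=1)))   # Eq. (2)

class BottleneckAFE(nn.Module):
    """Two 3x3 convolutions followed by an AFE residual branch (Eqs. 3-4)."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.SiLU(),
        )
        self.afe = AFE(c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.afe(self.conv(x))

class C2fAFE(nn.Module):
    """C2f wrapper with Bottleneck-AFE blocks (Eqs. 5-9)."""
    def __init__(self, c: int, n: int = 1):
        super().__init__()
        self.cv1 = nn.Conv2d(c, 2 * c, 1, bias=False)          # Eq. (5)
        self.blocks = nn.ModuleList(BottleneckAFE(c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * c, c, 1, bias=False)     # Eq. (9)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1, y2 = self.cv1(x).chunk(2, dim=1)                    # Eq. (6)
        outs = [y1, y2]
        for block in self.blocks:                               # Eq. (7)
            outs.append(block(outs[-1]))
        return self.cv2(torch.cat(outs, dim=1))                 # Eq. (8)
```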

LSKA

Cantonese embroidery images present several challenges, including substantial shape variations across categories despite similar textures, frequent missed detections of small objects, interference from complex backgrounds, and partial occlusions. Traditional convolutional neural networks, constrained by limited local receptive fields, struggle to capture the broader contextual information necessary for accurate recognition under these conditions. To address this issue, this study conducts comparative experiments with common attention modules, including CBAM30, SE31, CA32, and ECA33. The results are shown in Table 2; the large separable kernel attention (LSKA) method proposed by Lau et al.25, with the kernel size set to 7 × 7, achieves the best effect.

Table 2 Comparative results of the modules for each attention mechanism

The structure of the LSKA module is shown in Fig. 9. The LSKA decomposes the two-dimensional large convolution kernel into one-dimensional kernels along the horizontal and vertical directions, significantly reducing computational complexity and memory usage. Its expansive receptive field enhances the ability to distinguish Cantonese embroidery patterns that exhibit similar textures yet differ in shape. Moreover, it facilitates the integration of contextual information across multiple scales, strengthens the association between small objects and surrounding patterns, and effectively decreases the likelihood of missed detections.

Fig. 9
figure 9

LSKA module structure, illustrating the large separable kernel attention mechanism.

For a given convolution kernel of size k×k and its feature map F \((F\in {R}^{H\times W\times C})\), where C is the number of input channels, H and W represent the height and width of the feature map, respectively, and d represents the dilation rate, the LSKA module is implemented according to the following steps.

First, the original k×k convolution kernel is decomposed into two cascaded one-dimensional convolution kernels, \({W}_{1}\in {R}^{(1\times (2d-1))}\) and \({W}_{2}\in {R}^{((2d-1)\times 1)}\). Local spatial information is extracted through depthwise convolution to obtain the output feature map \({Z}^{C}\) of the depthwise convolution.

$${Z}^{C}=\mathop{\sum }\limits_{H,W}{W}_{2}^{C}\,\ast \,\left(\mathop{\sum }\limits_{H,W}{W}_{1}^{C}\ast {F}^{C}\right)$$
(10)

Then, the dilated depthwise convolution of size \(\lfloor \frac{k}{d}\rfloor \times \lfloor \frac{k}{d}\rfloor\) is decomposed into finer-grained one-dimensional dilated depthwise convolutions, \({V}_{1}\in {R}^{(1\times \lfloor \frac{k}{d}\rfloor )}\) and \({V}_{2}\in {R}^{(\lfloor \frac{k}{d}\rfloor \times 1)}\), and the output feature \({Z}^{{\prime} C}\), which contains global spatial information, is obtained through depthwise convolution.

$${{Z}^{{\prime} }}^{C}=\mathop{\sum }\limits_{H,W}{V}_{2}^{C}\,\ast \,\left(\mathop{\sum }\limits_{H,W}{V}_{1}^{C}\ast {Z}^{C}\right)$$
(11)

Finally, an attention map \({A}^{C}\) is obtained through a 1 × 1 convolution. The Hadamard product of the attention map \({A}^{C}\) and the input feature map F gives the output \({\overline{F}}^{C}\) of the LSKA module.

$${A}^{C}=Con{v}_{1\times 1}\ast {Z}^{{{\prime} }^{C}}$$
(12)
$${\overline{F}}^{C}={A}^{C}\otimes {F}^{C}$$
(13)

Among them, * represents the convolution operation, \(\lfloor .\rfloor\) represents the floor operation, and \(\otimes\) indicates channelwise broadcasted elementwise multiplication (Hadamard product).
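A minimal PyTorch sketch of this decomposition is given below; the default dilation rate d = 2 and the padding scheme are assumptions chosen to keep the spatial size unchanged for k = 7.

```python
# LSKA following Eqs. (10)-(13): cascaded 1-D depthwise convolutions, cascaded
# 1-D dilated depthwise convolutions, a 1x1 convolution forming the attention
# map, and a channel-wise Hadamard product with the input.
import torch
import torch.nn as nn

class LSKA(nn.Module):
    def __init__(self, c: int, k: int = 7, d: int = 2):
        super().__init__()
        m = 2 * d - 1          # size of the local 1-D kernels, Eq. (10)
        n = k // d             # size of the dilated 1-D kernels, floor(k/d), Eq. (11)
        self.w1 = nn.Conv2d(c, c, (1, m), padding=(0, m // 2), groups=c)
        self.w2 = nn.Conv2d(c, c, (m, 1), padding=(m // 2, 0), groups=c)
        self.v1 = nn.Conv2d(c, c, (1, n), padding=(0, (n // 2) * d),
                            dilation=d, groups=c)
        self.v2 = nn.Conv2d(c, c, (n, 1), padding=((n // 2) * d, 0),
                            dilation=d, groups=c)
        self.conv1x1 = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.w2(self.w1(x))      # local spatial information, Eq. (10)
        z = self.v2(self.v1(z))      # dilated path, global context, Eq. (11)
        attn = self.conv1x1(z)       # attention map, Eq. (12)
        return attn * x              # channel-wise Hadamard product, Eq. (13)
```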

Loss function

Low-quality samples (e.g., repeated labeling of the same object or missing labels) inevitably appear in the Cantonese embroidery dataset, and the original CIoU loss function of YOLOv8 handles such samples poorly. If the model overfits these low-quality samples during training, it prioritizes erroneous localization information and assigns them unreasonably large gradient weights, causing it to ignore valid features in high-quality samples and ultimately degrading the overall localization performance. In contrast, Wise-IoU26 measures the “abnormality” (outlier degree) of each sample and reduces the gradient gain of low-quality samples through a dynamic nonmonotonic focusing mechanism, thus improving the model’s learning efficiency on both high-quality and normal-quality samples. WIoU v1 lacks an effective focusing mechanism and cannot distinguish anchor boxes of different qualities. WIoU v2 adopts a monotonic focusing mechanism in which the gradient gain varies monotonically with the loss value, so the gain shrinks as the loss decreases late in training, slowing model convergence. WIoU v3 introduces a dynamic nonmonotonic focusing mechanism that evaluates anchor box quality via the outlier degree (the ratio of an anchor box’s IoU loss to its running mean) and constructs nonmonotonic focusing coefficients that assign small gradient gains to both very high- and very low-quality anchor boxes, so that learning concentrates on average-quality anchors; this suppresses the harmful gradients from low-quality samples and improves the overall performance of the detector. Experiments by Tong et al.26 also verified its effectiveness: AP75 on the MS-COCO dataset increased from 53.03% to 54.50% when WIoU was applied to YOLOv7.
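The following PyTorch sketch illustrates the WIoU v3 formulation outlined above; the hyperparameters alpha and delta, the momentum of the running-mean IoU loss, and the (x1, y1, x2, y2) box format are assumptions for illustration rather than the exact settings used in this paper.

```python
# Sketch of the WIoU v3 bounding-box loss following the Wise-IoU formulation:
# a distance-based penalty on the IoU loss (v1) rescaled by a nonmonotonic
# focusing coefficient computed from the outlier degree (v3).
import torch

class WIoUv3:
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean of the IoU loss

    def __call__(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Intersection over union for (x1, y1, x2, y2) boxes of shape (N, 4)
        lt = torch.maximum(pred[:, :2], target[:, :2])
        rb = torch.minimum(pred[:, 2:], target[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
        area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
        iou = inter / (area_p + area_t - inter + 1e-7)
        iou_loss = 1.0 - iou

        # WIoU v1: centre-distance penalty normalised by the enclosing box
        # (denominator detached from the gradient graph)
        enc_wh = (torch.maximum(pred[:, 2:], target[:, 2:])
                  - torch.minimum(pred[:, :2], target[:, :2]))
        cp = (pred[:, :2] + pred[:, 2:]) / 2
        ct = (target[:, :2] + target[:, 2:]) / 2
        r_wiou = torch.exp(((cp - ct) ** 2).sum(dim=1)
                           / (enc_wh ** 2).sum(dim=1).detach())
        loss_v1 = r_wiou * iou_loss

        # WIoU v3: outlier degree and nonmonotonic focusing coefficient
        beta = iou_loss.detach() / self.iou_loss_mean
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean
                              + self.momentum * iou_loss.mean().item())
        return (r * loss_v1).mean()
```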

For this reason, this paper adopts WIoU v3 instead of CIoU as the bounding box regression loss function to minimize the negative impact of low-quality samples on model performance. The nonmonotonic focusing mechanism of WIoU v3 has two advantages. First, it reduces the weight of samples with high outlier degrees (e.g., flower petals obscured by stitches) to suppress the interference of texture noise in the loss; second, it strengthens the learning of samples with low outlier degrees (e.g., small targets such as tiny flowers) and improves localization accuracy through adaptive gradients. This loss function compensates for CIoU’s lack of targeted handling of low-quality samples, the insufficient gradient allocation of WIoU v1, and the misjudgment of outlier samples by WIoU v2. Figure 10 compares the effects of CIoU and WIoU v3 on the training and validation sets of the Cantonese embroidery image dataset.

Fig. 10: CIoU and WIoU loss curves during YOLOv8s training.
figure 10

a Bounding box loss; b Classification loss; c DFL loss. Each plot shows four curves corresponding to the training and validation datasets.

As seen from Fig. 10, when YOLOv8s is trained on the Cantonese embroidery image dataset, WIoU outperforms CIoU. For the bounding box loss (Fig. 10a), WIoU decreases faster in the early stage of training, and its loss on both the training and validation sets is lower than that of CIoU in the later stage, indicating faster convergence and a smaller final loss. For the classification loss (Fig. 10b), WIoU drops more steeply in the early stage, converges sooner, and ends with a lower loss than CIoU, reflecting better classification optimization. For the DFL loss (Fig. 10c), WIoU falls sharply in the early stage, converges quickly, and reaches a smaller loss on both the training and validation sets in the later stage. Overall, WIoU is better than CIoU in both convergence speed and loss values, verifying the effectiveness of the improvement.

Results

Experimental environment

All the experiments were performed on a cloud server with the following configuration: a 16-core Intel Xeon(R) Platinum 8474C CPU operating at 2.4 GHz and an NVIDIA GeForce RTX 4090D GPU with 24 GB of memory. The operating system was Ubuntu 22.04, the deep learning framework was PyTorch 2.3.0 with PyCharm as the development environment, the Compute Unified Device Architecture (CUDA) version was 12.1, and the programming language was Python 3.12. All the experiments in this paper were trained and tested in the same environment with the same hyperparameters, data augmentation strategies, and training datasets. The key training parameters are shown in Table 3.

Table 3 Selected key parameters set during model training
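For reference, a hedged example of how such a training run might be launched with the Ultralytics API is shown below; the configuration file names are hypothetical placeholders, the custom modules (DWConv variants, C2f-AFE, LSKA) would need to be registered with the framework, and the exact hyperparameters should follow Table 3.

```python
# Hypothetical training launch for the improved model; file names and
# hyperparameter values below are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("dal-yolov8s.yaml")      # hypothetical model definition with the modified modules
model.train(
    data="canton-embroidery.yaml",    # hypothetical dataset config (train/val/test paths, 10 classes)
    epochs=200,
    imgsz=640,
    batch=16,
    device=0,
)
```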

To evaluate the performance of the improved YOLOv8 model (referred to as the DAL-YOLOv8 model in this study), the experiment employs precision (P), recall (R), mean average precision (mAP), and F1 score as evaluation metrics, providing a comprehensive and objective assessment of the model’s effectiveness.
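For clarity, the following sketch spells out how these metrics are computed from detection counts; the per-class AP values are assumed to come from the detector's evaluation routine rather than being re-derived here.

```python
# Evaluation metrics used in this study: precision, recall, and F1 from
# TP/FP/FN counts, and mAP as the mean of per-class average precision values.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_average_precision(per_class_ap: list[float]) -> float:
    """mAP: average of per-class AP values (e.g., AP at an IoU threshold of 0.5)."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0
```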

As shown in Fig. 11, in the initial stage of DAL-YOLOv8 training, the precision (P) rises rapidly, and both precision and recall (R) stabilize after roughly 50 epochs. At around 180 epochs, the box loss decreases markedly, reflecting the model’s improved ability to locate objects in images accurately. Combined with the steady increase in precision and the continuous decrease in the three loss terms (train/box_loss, train/cls_loss, and train/dfl_loss), these curves indicate that the model trains well.

Fig. 11
figure 11

DAL-YOLOv8 training results, showing training and validation metrics.

Ablation experiments

To evaluate the effectiveness of the improved model, this study designed five sets of ablation experiments on the Cantonese embroidery dataset, using the same equipment and dataset for training and testing to ensure comparability, and the experimental results are shown in Table 4. We selected the feature enhancement module (C2f-AFE), the attention mechanism (LSKA), DWConv, and the loss function (WIoU) and used evaluation metrics for quantitative and qualitative analysis.

Table 4 Ablation experiment results

The results in Table 4 indicate that after the introduction of the C2f-AFE module into the YOLOv8 framework, the model’s mAP, P, and R values increase by 0.5%, 0.5%, and 0.7%, respectively. This improvement is attributed to the adaptive feature enhancement and multilevel feature fusion mechanisms of C2f-AFE, which strengthen the model’s ability to perceive complex texture lines, pattern contours, and global texture distributions in Cantonese embroidery images. Upon further integrating the LSKA attention mechanism on this basis, the R value of the model increases by 1.6%, with a slight improvement in mAP. This is due to the LSKA mechanism’s ability to focus accurately on key regions of targets in Cantonese embroidery images, thereby optimizing the detection effect.

Compared with the YOLOv8s base model, the introduction of DWConv reduces the size of the improved model by 5.12%. Moreover, the application of the WIoU loss function increases the P value by 1.2%, resulting in an mAP of 98.5% and an F1 value of 96.59%, whereas the model size remains stable at 20.4 M. These results demonstrate that the combined model (i.e., DAL-YOLOv8), formed by integrating YOLOv8s with C2f-AFE, LSKA, DWConv, and WIoU, can achieve efficient and accurate target detection in Cantonese embroidery image detection tasks through intermodule synergy. Figure 12 presents a visual comparison of heatmaps for some detection results of DAL-YOLOv8 before and after the addition of the C2f-AFE and LSKA modules.

Fig. 12: Heatmap visualizations.
figure 12

a Original image; b Before adding C2f-AFE and LSKA modules; c After adding C2f-AFE and LSKA modules.

As shown in Fig. 12, after the C2f-AFE module and LSKA attention mechanism are integrated into the YOLOv8 framework, the model’s ability to perceive features in Cantonese embroidery images is significantly improved. A comparison of the heat distributions before and after the addition of these two modules reveals that the integrated model captures targets and pattern regions such as kapok, bird, and phoenix more accurately. Its feature extraction and detection capabilities in complex scenarios are clearly enhanced, verifying the effectiveness of applying these two modules to the Cantonese embroidery image detection task.

Model comparison experiments

To compare the efficiency of the improved model proposed in this paper, Faster R-CNN, SSD, YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10 are selected for comparative experiments. All the experiments are conducted on the same equipment and in the same environment, use the same dataset and data augmentation method, and maintain the same split of training and test sets. Each model is trained for 200 epochs, and the best results are selected for testing. The comparative data for each algorithm in terms of P, R, mAP, F1, and model size are shown in Table 5.

Table 5 Model comparison experiment results

As shown in Table 5, the DAL-YOLOv8 model proposed in this paper has significant advantages over classical target detection models. Its mAP is 8.86 and 8.38 percentage points higher than those of Faster R-CNN and SSD, respectively, it leads comprehensively on indicators such as the P and F1 values, and its model size is smaller. Compared with YOLOv8s, the proposed method improves the P, R, mAP, and F1 values by 0.4%, 2.3%, 0.7%, and 1.37%, respectively, while reducing the model size by 5.12%. Although the model size is slightly larger than those of YOLOv9s and YOLOv10s because of the introduction of complex network modules, such as the feature enhancement module and attention mechanism, and the computational complexity (FLOPs) increases, these additions significantly improve detection performance: the mAP of DAL-YOLOv8 reaches 98.5%, which is 0.3 and 0.2 percentage points higher than those of YOLOv9s and YOLOv10s, respectively. Overall, the performance improvement substantially outweighs the increased computational cost.

To verify the feasibility of DAL-YOLOv8, two representative scene images with different complexities from the test set are selected in this paper to carry out comparison experiments, and the model recognition results before and after the improvement are shown in Fig. 13.

Fig. 13: Detection results across different scene contexts.
figure 13

a YOLOv8s; b DAL-YOLOv8, in simple and complex backgrounds.

Figure 13 shows the comparison results of detection between DAL-YOLOv8 and YOLOv8s. In the left image (YOLOv8s) and the right image (DAL-YOLOv8), the areas circled in purple indicate missed detection targets, and the annotations in blue represent false detection targets. In the first image with a simple background and a multisemantic scene, both models achieve full target detection, but DAL-YOLOv8 is superior in terms of positioning and recognition accuracy. In the second image, which has a complex background and a multisemantic scene, YOLOv8s results in missed and false detections. However, by integrating the C2f-AFE feature enhancement module and the LSKA attention mechanism, DAL-YOLOv8 effectively improves the feature recognition ability. In complex backgrounds, it not only strengthens the perception of global features but also takes into account the extraction of small target details and large target global features. Finally, it achieves accurate recognition of all categories in the image, and the recognition accuracy of each category is better than that of YOLOv8s.

Figure 14 presents a comparison of the detection accuracy across ten categories of Cantonese embroidery images between the proposed DAL-YOLOv8 model and the baseline YOLOv8s model. DAL-YOLOv8 outperforms YOLOv8s in all categories, with the accuracy for the “lotus” category reaching 99.4%, which is 0.9 percentage points higher than that of YOLOv8s. For categories with complex textures and fine embroidery details, such as “peacock” and “phoenix,” the improved model achieves accuracy gains of 2.6% and 0.2%, respectively. In the case of “dragon,” a category with larger patterns, the recognition accuracy reaches 99.2%, whereas even for smaller and less prominent categories such as “lychee,” an improvement of 0.9% is observed. These enhancements can be attributed to the integration of the C2f-AFE feature enhancement module and the LSKA attention mechanism in DAL-YOLOv8, which significantly improve the model’s ability to extract and recognize features under complex conditions.

Fig. 14
figure 14

Detection results for ten categories before and after model improvement. Precision is expressed as a percentage (%).

According to the dataset distribution shown in Table 1, categories such as “dragon,” “phoenix,” and “mandarin duck” have significantly fewer samples across the training, validation, and test sets than categories such as “flower” and “bird”. However, as illustrated in Fig. 14, the DAL-YOLOv8 model achieves recognition accuracies exceeding 98% for these underrepresented categories, even outperforming more frequently occurring categories such as “kapok” and “lychee”. This suggests that although the dataset exhibits class imbalance, it does not substantially affect the model’s recognition performance. In other words, the impact of dataset imbalance on the final detection results is relatively limited.

Discussion

This study constructs a Cantonese embroidery image dataset encompassing ten thematic categories, offering a valuable resource for the digital preservation of intangible cultural heritage, academic research, and technological innovation. Based on this dataset, the DAL-YOLOv8 framework is proposed, introducing four key enhancements: (1) the C2f-AFE module improves feature extraction under complex backgrounds; (2) the LSKA attention mechanism enables a precise focus on key regions of embroidery patterns; (3) DWConv replaces standard convolution to reduce model complexity; and (4) the WIoU loss function improves the localization accuracy of detection boxes. These improvements effectively address core challenges in Cantonese embroidery image recognition, including limited sample size, large morphological variations, and background interference. For the constructed dataset, DAL-YOLOv8 achieves a mean average precision (mAP) of 98.5%, with a 5.12% reduction in model size compared with the baseline YOLOv8, outperforming mainstream methods in both accuracy and efficiency. This work establishes a high-precision, lightweight, automated recognition paradigm for embroidery image detection, providing reliable technical support for the digital preservation of cultural heritage.

Future research will focus on three directions: (1) migrating the core modules of DAL-YOLOv8—feature enhancement, attention mechanisms, and lightweight design—into newer architectures such as YOLOv10, with emphasis on enhancing model stability under varying lighting conditions; (2) developing embedded applications based on this framework for deployment in museum guided tour systems and Cantonese embroidery digital archive platforms, enabling practical implementation; and (3) designing customized feature enhancement modules for other embroidery traditions, such as Miao and Shen embroidery, to validate the scalability and adaptability of the proposed framework. These efforts aim to provide effective technical tools for the digital documentation and sustainable transmission of traditional embroidery craftsmanship.