Introduction

Batik, as a time-honored manual resist dyeing technique, is widely distributed across various cultural regions of the world, including China, Indonesia, Malaysia, Singapore, India, Japan, and other countries. Chinese batik has a history of over two thousand years and maintains living traditions among ethnic minority groups in southeastern Guizhou1,2,3. In 2006, batik was included in the first batch of national-level intangible cultural heritage lists. As the Miao people do not have their own written language, batik images have become a “living fossil” for studying ethnic history, religious beliefs, aesthetic concepts, and social life. However, with accelerated industrialization and transformations in modern lifestyles, batik faces a severe crisis of inheritance.

To protect this precious cultural heritage, scholars have conducted systematic digital research on batik from multiple perspectives. In cultural connotation mining, researchers have analyzed the ethnic cultural information embedded in batik patterns through field investigations and interviews, revealing how these patterns reflect the history, culture, and aesthetic concepts of the Miao ethnic group2,4,5. Knowledge representation methods, including natural language processing and multimodal knowledge graphs, have been developed to systematically organize scattered batik knowledge and intuitively display visual pattern characteristics6,7. For pattern innovation design, various computational methods have been proposed, including fractal geometry-based automatic generation8, shape grammar combined with artificial neural networks9, interactive genetic algorithms with neural network cognitive models10, and semiotics11. These approaches have also extended to pattern re-use optimization using particle swarm algorithms and personalized product recommendation systems12,13. In image feature extraction, researchers have developed techniques such as morphological operations with edge detection for contour extraction14, Zernike Moments combined with curvelet transform for global and local feature extraction15, and distance transformation algorithms combined with EFEM for crack feature extraction and simulation16. These technological innovations have significantly expanded the expression forms of traditional patterns and opened new avenues for the living inheritance of intangible cultural heritage.

Current research on batik patterns is mostly limited to the classification and recognition of individual patterns. For example, Elvitaria et al.17 proposed a batik pattern classification system based on the integration of deep learning and gray level co-occurrence matrix (GLCM) feature extraction. Dewa et al.18 used a convolutional neural network to classify Indonesian batik patterns into five categories. To address the class imbalance problem in pattern classification, Beatrice et al.19 constructed a pattern classification model by introducing oversampling and enhancement operations and using DenseNet169. In our previous research5, we also used an improved ResNet34 model to construct a Chinese batik pattern classification model. These studies all segment patterns from the original image for independent feature extraction and classification. Although this method has demonstrated effectiveness in terms of classification accuracy, it has the following limitations: First, this method overlooks the correlation between patterns and the overall composition, ignoring the spatial relationships between patterns; second, the boundary blurring and pattern overlap during image segmentation lead to substantial loss of detailed information. To address this issue, Wu et al.7 attempted to use RT-DETR for pattern detection. However, this model is only suitable for ideal situations where the image is complete and the patterns are tiled, and struggles to handle complex situations such as pattern occlusion, deformation, and blurring commonly seen in actual batik images. Therefore, existing research lacks systematic recognition and localization of multiple coexisting patterns in batik under complex conditions, which affects the comprehensive interpretation of batik cultural semantics and the effectiveness of digital preservation.

Object detection technology, capable of simultaneously performing pattern localization and recognition, has demonstrated substantial value in cultural heritage protection. On one hand, it enables automated annotation of extensive batik image collections, significantly enhancing digitization efficiency; on the other hand, by recording spatial distributions of patterns, it supports subsequent research in cultural semantic analysis, stylistic evolution, and association mining.

Among object detection methodologies, the You Only Look Once (YOLO) series has emerged as the dominant paradigm due to its optimal speed-accuracy trade-off. In heritage applications, researchers have successfully employed YOLO variants for cultural element detection. Gao et al.20 proposed a YOLOv8-based recognition model for Jiangnan private gardens, optimizing model representation and feature extraction to facilitate garden heritage preservation. They constructed a comprehensive dataset capturing various viewing angles and lighting conditions in complex garden environments, achieving 8.7% accuracy improvement over the baseline through data augmentation. In Portugal, where Azulejos tiles cover heritage buildings with intricate patterns, natural weathering and material degradation pose preservation challenges. Karimi et al.21 constructed a tile defect dataset, employing YOLO for damage detection and MobileNet for severity classification, achieving over 72% overall accuracy. Darma et al.22 introduced a YOLOv5-based method for detecting Balinese carving patterns on sacred architecture, utilizing dataset expansion and network scaling strategies to enable robust detection in complex environments.

YOLOv11, with its enhanced feature pyramid networks and refined detection heads, provides a solid foundation for pattern recognition tasks. However, its convolutional architecture inherently operates on local receptive fields, limiting its ability to model the long-distance spatial dependencies that define cultural semantics in batik compositions. This architectural constraint presents unique challenges when applying baseline YOLOv11 to batik pattern detection:

First, batik images contain exceptionally dense pattern distributions with numerous patterns co-occurring in a single image, where culturally significant symbols often appear at extremely small scales. This combination of high density and small object size exceeds the detection capabilities of baseline YOLOv11. The model’s local receptive fields fail to capture the long-distance spatial relationships that define cultural meanings in pattern combinations, resulting in insufficient detection accuracy for these complex scenarios.

Second, authentic batik samples exhibit substantial quality variations stemming from both historical degradation and collection limitations. Traditional dyeing processes create irregular boundaries through wax-bleeding effects, while age-related fading and surface contamination significantly reduce visual contrast. Additionally, field-collected images suffer from inconsistent lighting conditions and complex textile backgrounds. These factors demand detection models with enhanced robustness and adaptability beyond what baseline YOLOv11 provides. These challenges necessitate architectural innovations that specifically address batik’s unique characteristics while maintaining computational efficiency.

To address these challenges, we present a comprehensive solution that combines dataset construction, architectural innovation, and practical application. Our main contributions are summarized as follows:

(1) We have constructed a large-scale batik pattern dataset containing 861 images, covering 7 major categories of batik patterns with 9933 annotated bounding boxes. To our knowledge, it is currently the largest and most comprehensive Chinese batik detection dataset, addressing the critical data scarcity bottleneck in batik pattern analysis.

(2) We propose an improved YOLOv11 model with two targeted enhancements: Vision Outlooker (VOLO) integration for capturing long-distance feature dependencies with reduced computational complexity, and Fused-MobileNet Bottleneck Convolution (Fused-MBConv) optimization in C3K2 modules that synergistically combines depth-wise convolutions, inverted residuals, and SE Layers. These modifications reduce network complexity while enhancing feature quality, as validated by ablation studies showing substantial improvements in detection accuracy. The model demonstrates robust performance across diverse batik patterns, effectively addressing the robustness and generalizability challenges inherent in heritage preservation applications.

(3) We have developed an integrated prototype system that bridges pattern detection with cultural knowledge interpretation. By linking our detection model with a batik knowledge graph5, the system provides end-to-end functionality from visual recognition to semantic analysis, enabling users to explore pattern meanings, historical contexts, and cultural significance.

The proposed model achieves high accuracy while maintaining minimal hardware dependencies, making it ideally suited for deployment across diverse cultural institutions and educational scenarios.

The remainder of this paper is organized as follows: Section 2 details the improved YOLOv11 model including attention mechanism and convolution optimizations; Section 3 presents experimental results covering dataset construction, ablation studies, comparative experiments, and prototype system implementation; Section 4 concludes the paper.

Methods

Benchmark network

Since batik pattern detection results often serve as inputs for subsequent analysis models, benchmark network selection must balance detection accuracy with computational efficiency. We adopt YOLOv11, released by Ultralytics in 202423, as our baseline model. It has demonstrated superior performance in various computer vision tasks, achieving optimal trade-offs between accuracy and inference speed. Compared with its predecessor YOLOv8, YOLOv11 mainly improves three aspects: the convolutional unit structure, the multi-head attention mechanism, and the lightweight detection head. The basic structure is shown in Fig. 1.

Fig. 1: Basic structure of YOLOv11.

The Conv unit in the figure represents the convolution operation. C3K2, SPPF and other units are module names or operation modes, and some core unit structures are shown in the lower half of the figure.

First, YOLOv11 introduces the C3K2 architecture in its convolutional units. This architecture builds on the C3K module in the Cross Stage Partial (CSP) structure, and its advantages lie in two aspects. On the one hand, the module splits the features into two parts: one part extracts deeper information from the input features through multiple C3K or Bottleneck structures, while the other part passes directly through standard convolutions; the two parts are finally concatenated, balancing parameter reduction against feature extraction effectiveness. On the other hand, C3K2 introduces multi-scale convolution kernels, allowing designers to freely adjust kernel sizes to expand the receptive field, enabling the model to handle detection targets of various sizes and improving feature extraction quality in complex scenes or backgrounds. The structure of C3K2 is shown in Fig. 2.

Fig. 2: C3K2 module structure diagram.

The Conv unit in the figure represents the convolution operation. CBS, Bottleneck and other units are module names or operation methods. Swish/ReLU represents the activation function.

Second, YOLOv11 proposes the C2PSA architecture by combining CSP with the Pyramid Squeeze Attention (PSA) mechanism. The CSP structure employs the C2 module, which is similar to C3K: the input features are divided into two parts after passing through a convolutional layer. One part is passed directly, and the other is processed by the PSA module; the two parts are then concatenated, and another convolution restores the original number of channels. Here, PSA replaces the original Bottleneck module. PSA uses multiple convolution kernels (e.g., 3 × 3, 5 × 5, 7 × 7) to extract multi-scale features in parallel and weights the concatenated feature channels through Squeeze-and-Excitation (SE) modules. This design sharpens the network’s attention to targets of different scales. The structure of C2PSA is shown in Fig. 3.

Fig. 3: C2PSA module structure diagram.

The Conv unit in the figure represents the convolution operation. GAP, SE and other units are module names or operation methods. Softmax/ReLU represents the activation function.

Third, YOLOv11 incorporates a depth-wise separable convolution layer in one branch of the decoupled head. Depth-wise separable convolution consists of a depth-wise convolution and a point-wise convolution. The former processes each input feature channel separately: each channel is convolved with a single kernel and does not interact with other channels, which significantly reduces the number of parameters while improving computational efficiency. To compensate for the lack of cross-channel information exchange in depth-wise convolution, a 1 × 1 point-wise convolution layer then integrates cross-channel feature information, forming the complete depth-wise separable convolution structure. The optimized decoupled head greatly reduces computational complexity and parameter count while maintaining network performance. A schematic of the depth-wise separable structure is shown in Fig. 4.
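The two-stage operation described above can be sketched in NumPy (a toy illustration with arbitrary sizes and random weights, not the YOLOv11 implementation):

```python
import numpy as np

def depthwise_conv(x, dw_kernels):
    """Per-channel 3x3 convolution (valid padding): channel c only sees kernel c."""
    C, H, W = x.shape
    K = dw_kernels.shape[-1]
    out = np.zeros((C, H - K + 1, W - K + 1))
    for c in range(C):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                out[c, i, j] = np.sum(x[c, i:i+K, j:j+K] * dw_kernels[c])
    return out

def pointwise_conv(x, pw_weights):
    """1x1 convolution: mixes channels at each spatial position."""
    # pw_weights: (C_out, C_in)
    return np.einsum('oc,chw->ohw', pw_weights, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))   # C_in = 4 feature map
dw = rng.standard_normal((4, 3, 3))  # one 3x3 kernel per input channel
pw = rng.standard_normal((8, 4))     # C_out = 8 point-wise weights

y = pointwise_conv(depthwise_conv(x, dw), pw)
print(y.shape)  # (8, 6, 6)
```

The depth-wise stage keeps channels independent; only the 1 × 1 stage exchanges information across channels.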

Fig. 4: Schematic diagram of depth separable structure.

The depth separable structure consists of two parts, DepthConv and PointConv, as shown in the figure. The structure and collaboration of the two units are shown in the lower part of the figure.

The loss function of YOLOv11 considers both classification loss and localization loss. The former uses Binary Cross Entropy Loss (BCEL) for classification optimization, while the latter combines Distribution Focal Loss (DFL) and CIoU. The complete loss function can be expressed as:

$${f}_{loss}={\lambda }_{1}\,{f}_{cls}+{\lambda }_{2}\,{f}_{DFL}+{\lambda }_{3}\,{f}_{CIoU}$$
(1)

where BCEL and CIoU are standard optimization objectives in the YOLO series, and \({\lambda }_{1}\), \({\lambda }_{2}\), \({\lambda }_{3}\) are weighting coefficients. DFL generalizes discrete classification results into continuous values through integration:

$${f}_{DFL}({S}_{i},{S}_{i+1})=-(({y}_{i+1}-y)\log ({S}_{i})+(y-{y}_{i})\log ({S}_{i+1}))$$
(2)

where \({y}_{i}\) and \({y}_{i+1}\) are the discrete values immediately to the left and right of the continuous label \(y\), satisfying \({y}_{i} < y < {y}_{i+1}\); \({S}_{i}\) and \({S}_{i+1}\) represent the predicted probabilities for the left and right endpoints, respectively.
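As a concrete check of Eq. (2) (a toy example; all values are arbitrary), when the target y lies between two adjacent bins, the loss is minimized when the predicted endpoint probabilities match the linear interpolation weights:

```python
import math

def dfl(y, y_i, y_ip1, s_i, s_ip1):
    """Distribution Focal Loss for one target between bins y_i < y < y_{i+1} (Eq. 2)."""
    return -((y_ip1 - y) * math.log(s_i) + (y - y_i) * math.log(s_ip1))

# Target y = 4.3 between integer bins 4 and 5; ideal weights are 0.7 and 0.3.
good = dfl(4.3, 4, 5, s_i=0.7, s_ip1=0.3)  # probabilities match the interpolation weights
bad = dfl(4.3, 4, 5, s_i=0.3, s_ip1=0.7)   # probabilities reversed
print(good, bad)
```

The better-calibrated distribution yields the lower loss, which is what drives the predicted distribution toward the continuous target.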

This paper uses YOLOv11n, which has the fewest parameters, as the benchmark model and constructs a batik pattern detection model through two enhancements. First, to address the limitations of convolutional neural networks (CNNs) in capturing long-distance feature correlations, we introduce the global attention mechanism from VOLO. This mechanism performs attention computation within each image block (window) while combining local and cross-block interactions to aggregate information from surrounding blocks, maintaining global perception while reducing computational complexity and achieving fine-grained, efficient feature encoding. Second, we employ Fused-MBConv to optimize the C3k and Bottleneck modules in C3K2. By integrating convolution strategies, inverted residual structures, and SE Layer coordination, this optimization reduces network branches and layers when using depth-wise separable convolutions, giving the entire network stronger feature engineering quality while preserving efficient inference speed. These improvements enable YOLOv11 to achieve lightweight, efficient, and accurate batik pattern detection against complex backgrounds.

Improvement of attention mechanism

VOLO was proposed by Yuan et al.24. It designs an outlook attention mechanism that efficiently encodes fine-grained features and realizes efficient interaction between image blocks through self-attention, thereby modeling global dependencies and achieving better visual feature engineering. Considering the bottlenecks of object detection algorithms on batik patterns, namely large scale variation, high density, and diverse data sources, this paper follows ref. 24 and introduces the core idea of VOLO into the C2PSA module of YOLOv11, replacing the PSA attention mechanism and establishing high-quality relationships between long-distance features, with the aim of improving the feature engineering quality of the model and overcoming the difficulties of complex batik pattern detection. Specifically:

First, for an input feature map \(X\in {{\mathbb{R}}}^{H\times {\rm{W}}\times C}\) with height H and width W, two linear transformation layers with weights \({W}_{A}\in {{\mathbb{R}}}^{C\times {K}^{4}}\) and \({W}_{V}\in {{\mathbb{R}}}^{C\times C}\) map each C-dimensional window feature into a similarity matrix \(A\in {{\mathbb{R}}}^{H\times {\rm{W}}\times {K}^{4}}\) and a value representation \(V\in {{\mathbb{R}}}^{H\times {\rm{W}}\times C}\).

Second, defining \({V}_{\varDelta i,j}\in {{\mathbb{R}}}^{C\times {K}^{2}}\) as all values within the local window at position (i, j), it can be expressed as:

$${V}_{\varDelta i,j}=\{{V}_{i+p-\lfloor K/2\rfloor ,j+q-\lfloor K/2\rfloor }\},\,0\le p,q < K$$
(3)

By rearranging the similarity matrix at position (i, j), \({A}_{i,j}\in {{\mathbb{R}}}^{{K}^{4}}\), into \({\hat{A}}_{i,j}\in {{\mathbb{R}}}^{{K}^{2}\times {K}^{2}}\), applying the attention mechanism in the form of Eq. (4), and summing the weighted values of the different local windows covering the same position, the output is obtained.

$${Y}_{\varDelta i,j}=MatMul(Soft\,\max ({\hat{A}}_{i,j}),{V}_{\varDelta i,j})$$
(4)

Finally, the results of implementing the Outlook attention mechanism will be fused with the original input features to effectively capture the long-distance feature relationships of the model. This process can be represented by the following equations:

$$\hat{X}=Outlook\,Attention(LN(X))$$
(5)
$$Z=MLP(LN(\hat{X}))+\hat{X}$$
(6)

where Outlook Attention is the attention operation in Eq. (4) and LN is layer normalization. The visualization of this process is shown in Fig. 5.
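The per-position computation of Eqs. (3)–(4) can be sketched in NumPy for a single channel-last feature map (a simplified illustration with random projection weights; the actual VOLO implementation uses unfold/fold operations and learned projections):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def outlook_attention(X, W_A, W_V, K=3):
    """Eqs. (3)-(4): each position attends over its local K x K window.

    X: (H, W, C) feature map; W_A: (C, K**4); W_V: (C, C).
    """
    H, W, C = X.shape
    A = X @ W_A   # (H, W, K^4) similarity generated directly from each position
    V = X @ W_V   # (H, W, C) value representation
    r = K // 2
    Y = np.zeros_like(V)
    # Accumulate weighted window values; a position covered by several
    # windows receives the sum of their contributions.
    for i in range(r, H - r):
        for j in range(r, W - r):
            V_win = V[i-r:i+r+1, j-r:j+r+1].reshape(K * K, C)      # V_{Delta i,j}
            A_hat = softmax(A[i, j].reshape(K * K, K * K), axis=-1)  # reshaped A_{i,j}
            Y_win = A_hat @ V_win                                    # Eq. (4)
            Y[i-r:i+r+1, j-r:j+r+1] += Y_win.reshape(K, K, C)
    return Y

rng = np.random.default_rng(0)
H, W, C, K = 6, 6, 4, 3
X = rng.standard_normal((H, W, C))
Y = outlook_attention(X, rng.standard_normal((C, K**4)) * 0.1,
                      rng.standard_normal((C, C)) * 0.1, K)
print(Y.shape)  # (6, 6, 4)
```

Unlike self-attention, the attention weights here are generated directly from the center feature rather than from query-key dot products, which is what keeps the cost low while still aggregating neighborhood context.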

Fig. 5: Schematic diagram of Outlook structure.

The overall structure and operation process of Outlook are shown in detail in the figure, which corresponds to the specific description in the text.

Improvement of convolution method

Similar to the idea embodied in the YOLOv11 decoupled head, we apply depth-wise separable convolution to the C3K2 module. We also refer to the Fused-MBConv structure proposed in EfficientNetV225 and make the following considerations. First, we use depth-wise convolution (channel-wise convolution) and point-wise convolution (1 × 1 convolution) to significantly reduce computation and parameter count. For example, for a standard 3 × 3 kernel with \({C}_{In}\) input channels and \({C}_{Out}\) output channels, the standard convolution has 3 × 3 × \({C}_{In}\) × \({C}_{Out}\) parameters, while the depth-wise separable convolution has only 3 × 3 × \({C}_{In}\) + 1 × 1 × \({C}_{In}\) × \({C}_{Out}\). Second, the inverted residual structure and SE Layer further enhance feature engineering: a 1 × 1 convolution first increases the dimensionality (expanding the number of channels), a depth-wise convolution follows, an SE layer enhances the quality of feature extraction, and a final 1 × 1 convolution reduces the dimensionality. This structure, known as MBConv, preserves more feature information while reducing computational complexity. Finally, if the depth-wise and point-wise convolutions in MBConv are merged into a single standard convolution to reduce branching, feature extraction can be further accelerated. The characteristics of standard convolution, MBConv, and Fused-MBConv are summarized in Table 1, and the structures of MBConv and Fused-MBConv are shown in Fig. 6.
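The parameter counts above can be verified directly; for illustrative channel counts C_In = 64 and C_Out = 128 (values chosen only for this example), the depth-wise separable form uses roughly 8× fewer parameters:

```python
def conv_params(c_in, c_out, k=3):
    """Parameter counts for a k x k standard vs. depth-wise separable convolution."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + 1 * 1 * c_in * c_out  # depth-wise + point-wise
    return standard, separable

std, sep = conv_params(64, 128)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

The saving grows with the kernel size and channel counts, which is why Fused-MBConv only merges the two stages back into a standard convolution where the branching overhead outweighs this saving.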

Fig. 6: Schematic diagram of MBConv/Fused-MBConv structure.

The specific structure of MBConv is shown on the left side of the figure, and that of Fused-MBConv on the right side. Conv, DepthConv, etc. are module names or operation methods.

Table 1 Characteristic parameters of three convolutional structures

Fused-MBConv offers the following advantages. On the one hand, conventional depth-wise separable convolution separates spatial feature extraction (depth-wise convolution) from channel feature fusion (point-wise convolution); the decomposed operations may not fully fuse feature information, especially in highly complex tasks (such as the complex backgrounds of the batik patterns in this paper), leading to insufficient modeling of complex spatial-channel relationships and potential accuracy loss. Moreover, this structure usually has to be compensated for by increasing network width or depth, partially offsetting its efficiency advantage. Fused-MBConv merges the depth-wise and point-wise convolutions of MBConv into a single standard convolution, reducing network branches and layers and thereby combining the advantages of both to some extent. On the other hand, Fused-MBConv improves feature extraction quality through the SE attention mechanism. SE is a lightweight and efficient channel attention mechanism that explicitly models dependencies between channels, adaptively learns per-channel weights, enhances the responses of important channels, and suppresses the contributions of irrelevant or noisy channels. It compresses the spatial dimensions into channel descriptors through global average pooling, capturing global contextual information, avoiding the limitations of local receptive fields, and enhancing the flexibility of the model’s expressiveness by dynamically adjusting the importance of channel features. Therefore, because historical degradation and collection limitations of real batik samples demand better robustness and adaptability from the detection model, this paper uses Fused-MBConv to replace the C3k and Bottleneck modules in C3K2. This improves the feature engineering quality of the model and effectively addresses the inherent robustness and generality challenges in batik heritage protection applications.
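The SE recalibration described above reduces to three steps: squeeze (global average pooling), excite (two small fully connected layers), and scale. A minimal NumPy sketch follows (random weights and a reduction ratio of 4 are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_layer(x, w1, w2):
    """Squeeze-and-Excitation: x is (C, H, W); w1: (C//r, C), w2: (C, C//r)."""
    z = x.mean(axis=(1, 2))                  # squeeze: (C,) channel descriptors
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))  # excite: FC -> ReLU -> FC -> sigmoid
    return x * s[:, None, None]              # scale: reweight each channel

rng = np.random.default_rng(0)
C, r = 8, 4  # channels and reduction ratio (illustrative)
x = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_layer(x, w1, w2)
print(y.shape)  # (8, 16, 16)
```

Because the descriptors come from pooling over the full spatial extent, every channel gate sees global context, which is the property the text contrasts with local receptive fields.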

The improved network structure is shown in Fig. 7. In the figure, the triangle and star marks indicate the locations where VOLO and Fused-MBConv are added. VOLO replaces the attention mechanism of some PSA Blocks in the original C2PSA module, and Fused-MBConv is integrated into the C3K2 module. The two complement each other, giving the entire network stronger feature engineering quality while maintaining efficient inference speed.

Fig. 7: Structure diagram of batik detection model.

The Conv unit in the figure represents the convolution operation. C3K2, SPPF and other units are module names or operation modes, and some core unit structures are shown in the lower half of the figure. Triangle and star markings represent positions where the structure has been improved.

Results

Dataset and its preprocessing

The intangible cultural heritage of China covers traditional skills, folk festivals, music and dance, oral literature, and more, and is the “living gene” accumulated over five thousand years of Chinese civilization. Protecting intangible cultural heritage helps prevent cultural homogenization and maintains the diversity and vitality of Chinese culture. Among these heritages, the batik patterns of the Miao ethnic group in Guizhou, with their mysterious style, beautiful patterns, clear intentions, profound cultural connotations, and rich themes, have been passed down from generation to generation and have become one of the most representative intangible cultural heritages of China. Focusing on the detection of pattern elements and the analysis of meanings in complex batik patterns, we conducted on-site visits, collected public information, and consulted authoritative books (from libraries in multiple parts of Guizhou Province, China), compiling 861 complete complex batik images for constructing batik pattern detection models. The dataset covers 7 pattern categories: butterfly, fish, dragon, bird, drum, flower, and fruit; representative patterns are shown in Fig. 8a. Because collection was difficult, the acquisition channels were diverse, and pattern quality was uneven, the images pose detection challenges including complex backgrounds, large lighting variations, diverse pattern forms, and dense pattern distributions. Each complete batik image contains multiple patterns of various types, with extensive pattern overlap and combination, which increases the complexity of detection. The dataset contains a total of 9933 manually annotated bounding boxes, as shown in Fig. 8b. As Fig. 8c shows, anchor scales in the batik patterns vary greatly, with most concentrated in the 50 × 50 to 250 × 250 pixel range. This scale variation requires the model to have strong multi-scale feature extraction capabilities. In terms of class balance, the flower class has 2486 anchors while the drum class has only 422, which places high demands on model robustness to handle inter-class sample imbalance.

Fig. 8: Schematic diagram of dataset and labels.

The upper part of the figure shows typical batik patterns in the dataset. The lower left part shows the distribution of pattern types, and the lower right part shows the distribution of target sizes. a Schematic diagram of batik patterns. b Number distribution of anchors. c Size distribution of anchors.

The dataset is partitioned and processed as follows. First, 734 complex batik images are randomly assigned to the training set and 127 to the testing set. Second, the resolution of the original images is mostly within 640 × 640 pixels, with a few exceeding this range (up to 1435 × 2292 pixels); to balance training efficiency and effectiveness, all images are resized to 640 × 640 pixels while preserving their features. Third, because complex batik patterns are difficult to collect, the sample size is limited; to ensure the feasibility of training deep learning models and reduce the risk of overfitting, the dataset is augmented with Blur, Advanced Blur, ToGray, CLAHE, Flip, Mosaic, Translation, and Erasing operations. Their meanings and parameter settings are shown in Table 2, where p denotes the probability of applying each augmentation.

Table 2 Parameter setting
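Two of the listed operations, horizontal flip and random erasing, can be sketched in NumPy as follows (a simplified illustration with illustrative probabilities and rectangle sizes; the experiments use library implementations with the parameters in Table 2):

```python
import numpy as np

def random_flip(img, p=0.5, rng=None):
    """Horizontal flip with probability p; img is (H, W, C)."""
    rng = rng or np.random.default_rng()
    return img[:, ::-1] if rng.random() < p else img

def random_erase(img, p=0.5, max_frac=0.25, rng=None):
    """Zero out a random rectangle covering up to max_frac of each side."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return img
    out = img.copy()
    H, W = img.shape[:2]
    h = rng.integers(1, int(H * max_frac) + 1)
    w = rng.integers(1, int(W * max_frac) + 1)
    i = rng.integers(0, H - h + 1)
    j = rng.integers(0, W - w + 1)
    out[i:i+h, j:j+w] = 0
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (640, 640, 3), dtype=np.uint8)
aug = random_erase(random_flip(img, rng=rng), p=1.0, rng=rng)
print(aug.shape, (aug == 0).all(axis=2).any())  # (640, 640, 3) True
```

Erasing forces the detector to rely on partial evidence, which mirrors the occluded and incomplete patterns common in field-collected batik images.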

During model training, the specific hardware, software, and parameter settings are shown in Table 3. The deep learning frameworks used are PyTorch 1.9.1 and Torchvision 0.10.1. To ensure fairness across models in the ablation and comparative experiments, two settings are fixed. First, all models are trained for 300 epochs; repeated experiments confirmed that 300 epochs are sufficient for each model to reach its optimal performance. Second, no model is initialized with pre-trained weights, and all use the same optimizer (SGD) to control experimental variables. In addition, to maximize real-world applicability and reduce hardware dependence, we optimize and validate on YOLOv11n, the smallest model in the YOLOv11 series.

Table 3 Training parameter setting

Ablation experiment

The advantages of the batik pattern detection model designed in this paper lie in two aspects. At the data level, considering the limited amount of complex batik pattern data, we design a data augmentation scheme to prevent overfitting. At the model level, we integrate the core ideas of VOLO, MBConv, and Fused-MBConv into the baseline model to enhance its ability to extract global features and model complex spatial-channel relationships. To fully analyze the effectiveness of both aspects for the batik pattern detection task, we designed the ablation experiments shown in Table 4. Models IV, V, and VI differ as follows: model IV replaces the benchmark network's C3K2 module entirely with MBConv; model V replaces it entirely with Fused-MBConv; model VI replaces it with a mixture of MBConv and Fused-MBConv.

Table 4 Model structure in ablation experiment

To evaluate model performance from multiple dimensions, widely used indicators in the field of object detection are used here: precision rate (P), recall rate (R), average precision (AP), mean average precision (mAP), frames per second (FPS). We quantitatively explore the advantages and disadvantages of the model based on the number of parameters and model size. The performance of the six models mentioned above is shown in Table 5 and Fig. 9.
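As an illustration of how these indicators relate (a toy example with hypothetical confidence-ranked detections; the actual evaluator additionally matches predictions to ground-truth boxes by IoU), AP is the area under the all-point-interpolated precision-recall curve:

```python
import numpy as np

def ap_from_ranked(is_tp, n_gt):
    """Average precision from detections sorted by descending confidence.

    is_tp: 1 if the detection matches a ground-truth box, else 0.
    """
    is_tp = np.asarray(is_tp)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(1 - is_tp)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # all-point interpolation: make precision monotonically non-increasing
    prec = np.maximum.accumulate(precision[::-1])[::-1]
    # area under the PR curve, summed over recall increments
    r_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - r_prev) * prec))

# 5 detections ranked by confidence, 4 ground-truth boxes
ap = ap_from_ranked([1, 1, 0, 1, 0], n_gt=4)
print(ap)
```

mAP is then the mean of these per-category AP values over the 7 pattern classes.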

Fig. 9: The confusion matrix effect of each model.

a Model I confusion matrix. b Model II confusion matrix. c Model III confusion matrix. d Model IV confusion matrix. e Model V confusion matrix. f Model VI confusion matrix.

Table 5 Results of ablation experiment.

Overall, the introduction of data augmentation, the VOLO module, and the Fused-MBConv module progressively improved the detection performance of the benchmark model. Especially for complex batik patterns, where data is precious and scarce, reasonably increasing the training data can to some extent avoid overfitting. In addition, the improvements to the attention mechanism and convolutional structure enhanced the feature engineering quality of the baseline network. With these measures, mAP increased from 0.716 for benchmark model I to 0.748 for model V, an improvement of 4.47%. Although the parameter count and size of the model increased slightly, the FPS is comparable to the benchmark model, with a model size of only 6.962 MB. This means the designed network structure effectively improves detection accuracy while maintaining detection efficiency.

From the perspective of individual modules, benchmark model II, which adds only data augmentation, obtained the best precision among all models but suffers from low recall, making it prone to missed detections. After adding the VOLO module, both the feature extraction ability and the structure of the model were optimized, significantly improving mAP while only slightly increasing the number of parameters. The MBConv and Fused-MBConv modules favor lightweight design and detection accuracy, respectively. For example, model IV, which replaces the C3K2 module in the benchmark network entirely with MBConv, has the smallest parameter count and model size, but its detection results are slightly lower than model III's. Model V, which replaces the C3K2 module entirely with Fused-MBConv, achieved the best recall and mAP at the cost of a slight increase in parameters. Model VI, which combines the two modules, actually exhibits a loss of accuracy.

From the perspective of individual categories, models III, IV, and V all perform impressively on multiple categories. For example, model III has strong detection capability for the Dragon and Flower categories, and model IV performs better on the Butterfly and Drum categories, but model V ultimately achieves the best comprehensive detection accuracy. This indicates that the improvement mechanism for the benchmark model is feasible and effective. Examples of partial detection results are shown in Fig. 10.

Fig. 10: Example of detection effect.

a Example Fig. 1. b Example Fig. 2. c Example Fig. 3. d Example Fig. 4. e Example Fig. 5. f Example Fig. 6.

As Fig. 10 shows, the robustness of model V can meet the requirements of practical intangible cultural heritage application scenarios, for both printed patterns and clear vector images. However, in detection tasks with dense targets and complex backgrounds, the model inevitably suffers from missed or false detections. For example, in Fig. 10a, flower and butterfly patterns alternate and incomplete flower patterns appear, which greatly increases the difficulty of detection and leads to missed detections. In Fig. 10d, the background composition is complex and multiple pattern categories (fruits, birds, butterflies) are present; moreover, the fruit, fish, and flower patterns share certain morphological similarities, leading to false positives.

Comparative experiment

To make the model deployment universal, we chose representative lightweight networks, the YOLO series and SSD series, as comparison targets. Both are classic one-stage object detection systems that balance accuracy and speed. For the YOLO series, we selected the relatively mature and innovative YOLOv8n26,27, YOLOv9t28, YOLOv10n29, and PP-YOLOE-Lite30, as well as two improved networks that have performed well in engineering practice: Ghost YOLO (based on YOLOv8n, optimized with Ghost convolution)31 and MobileVit YOLO (based on YOLOv7s, optimized with MobileViT)32. For SSD, we used the most widely adopted classic structure, MobileNetv2-SSD33,34. To ensure a fair comparison, the experiment followed two settings: first, all models were trained in the hardware environment shown in Table 4 with identical training parameters; second, the evaluation criteria are the same as in the ablation experiment, and no pre-trained weights were used. The experimental results are shown in Table 6 and Fig. 11.

Fig. 11: Normalization effect of comparative experimental indicators.

The figure shows the specific performance of 8 models in 6 indicators. All indicators have been normalized.
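A per-indicator min-max scaling of this kind can be sketched as follows. Whether Fig. 11 inverts lower-is-better indicators (e.g. parameter count) so that 1.0 always denotes the best model is not stated, so the `invert` option below is an assumption, and the sample values are illustrative:

```python
def min_max(values, invert=False):
    """Scale one indicator's values across models to [0, 1].

    invert=True flips the scale for lower-is-better indicators such as
    parameter count -- an assumption about Fig. 11, not a documented choice.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled

map_scores = [0.500, 0.720, 0.680, 0.661, 0.581, 0.748]  # illustrative
norm = min_max(map_scores)
```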

Table 6 Comparison of experimental results

Summarizing the data in Table 6 and Fig. 11, the performance of each model is as follows:

First, although MobileNetv2-SSD is one of the most classic lightweight models, it no longer holds an advantage in real-time performance. Excluding the improved MobileVit YOLO, the YOLO series models comprehensively surpass MobileNetv2-SSD in FPS, parameter count, and model size. In the object detection task, MobileNetv2-SSD achieved the best precision, reaching 0.773. However, its recall is only 0.198, indicating that the model ignores a large number of targets: it detects only a small number of targets with salient features, resulting in obvious missed detections. Therefore, MobileNetv2-SSD is not suitable for detecting patterns in batik images with complex and dense backgrounds.

Second, the improved MobileVit YOLO and Ghost YOLO, based on YOLOv7s and YOLOv8n respectively, also did not achieve good detection results. The former uses more parameters in an attempt to improve the model's feature engineering quality, yet its P, R, and mAP values are the lowest among the YOLO series models tested. The latter uses Ghost convolution to make the model lightweight, giving Ghost YOLO the best lightweight performance, but its detection accuracy is unsatisfactory. For both, the problem of missed detections remains serious.

Third, YOLOv8n, YOLOv9t, YOLOv10n, and PP-YOLOE-Lite achieved relatively good performance, with all four models reducing the probability of missed detections. YOLOv8n's R and mAP metrics are second only to those of our proposed model, and its FPS performance is impressive. YOLOv9t has only 0.227 million more parameters than Ghost YOLO, but its FPS is not outstanding. YOLOv10n's accuracy is actually inferior to YOLOv9t's, with an mAP of only 0.680. Although PP-YOLOE-Lite performs best on the P index, its large parameter count and relatively large model size result in average overall performance. This indicates that the generational differences of the YOLO series on general datasets do not necessarily carry over to the detection scenario in this paper.

The model proposed in this paper is the best in terms of detection accuracy. In particular, its R is markedly higher than that of the other models, enabling it to accurately detect target patterns against complex batik backgrounds. Compared with the five comparison models in Table 6, its mAP is higher by 49.6%, 3.89%, 10.00%, 13.16%, and 28.74%, respectively. On the three lightweight indicators of FPS, parameter count, and model size, the model still demonstrates relatively superior capability, indicating the best overall performance. In summary, our proposed model combines high-precision and high-efficiency detection in batik pattern detection tasks, has low hardware dependence, and can be deployed at low cost in the vast majority of cultural dissemination applications and scenarios.

Batik detection and analysis system

To bring deep learning technology into practical use in the cultural field, we developed a prototype system for batik image detection and analysis. The system integrates the improved YOLOv11 pattern detection model and the batik pattern knowledge graph we previously constructed, providing comprehensive functionality from image recognition to cultural analysis. The core interface of the system is shown in Fig. 12 and primarily consists of two functional modules: image detection and knowledge graph.

Fig. 12: Prototype system for batik image detection and analysis.

a Batik image detection interface. b The correlation aspect of the knowledge graph of batik patterns.

The image detection module (Fig. 12a) consists of three components: “Batik Image Loading”, “Detection Results”, and “Statistics”. Users upload batik images through the input interface, whereupon the improved YOLOv11 model performs automatic pattern detection. The visualization area displays detected patterns with category annotations, while the statistics panel provides quantitative analysis of pattern distribution and frequency.
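The logic behind the "Statistics" panel reduces to counting detections per category. A minimal sketch follows, assuming detections arrive as `(class_name, confidence, box)` tuples; the system's actual data format and confidence threshold are not specified in the paper:

```python
from collections import Counter

def pattern_statistics(detections, conf_threshold=0.25):
    """Frequency and share of each detected pattern category.

    `detections`: iterable of (class_name, confidence, box) tuples,
    an assumed YOLO-style output format. Detections below the
    confidence threshold are ignored.
    """
    counts = Counter(name for name, conf, _ in detections
                     if conf >= conf_threshold)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.most_common()}

dets = [("butterfly", 0.91, (10, 10, 50, 50)),
        ("flower", 0.84, (60, 20, 90, 55)),
        ("butterfly", 0.78, (15, 70, 48, 95)),
        ("bird", 0.12, (0, 0, 5, 5))]          # below threshold, ignored
stats = pattern_statistics(dets)
```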

The knowledge graph module (Fig. 12b) enables semantic exploration of detected patterns. Upon selecting a detected pattern, users access three integrated views: (1) Pattern-related nodes displaying semantic relationships from the knowledge graph, (2) Similar pattern examples retrieved from our dataset, and (3) Cultural information including prototype sources, symbolic meanings, and associated worship consciousness. This multi-dimensional presentation facilitates comprehensive understanding of pattern semantics.
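The pattern-to-knowledge lookup can be sketched as a simple adjacency query. The triples below are invented placeholders for illustration, not entries from the authors' actual knowledge graph:

```python
# Toy knowledge graph: subject -> list of (relation, object) pairs.
# All entries are illustrative placeholders.
KG = {
    "butterfly": [("prototype_source", "butterfly in nature"),
                  ("symbolic_meaning", "ancestor veneration"),
                  ("worship", "Butterfly Mother myth")],
    "drum": [("prototype_source", "bronze drum"),
             ("symbolic_meaning", "ritual and festivity")],
}

def related_nodes(pattern, relation=None):
    """Return (relation, object) pairs for a detected pattern,
    optionally filtered by relation type."""
    pairs = KG.get(pattern, [])
    if relation is not None:
        pairs = [p for p in pairs if p[0] == relation]
    return pairs

hits = related_nodes("butterfly", relation="symbolic_meaning")
```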

Our system can provide practical tools for various heritage preservation needs. In digital archiving, it enables systematic documentation of patterns within batik items, automatically detecting and recording multiple pattern types with their positions and frequencies—replacing labor-intensive manual annotation. For cultural education and exhibition, users can photograph batik images and instantly access cultural connotations of detected patterns, such as symbolic meanings, worship consciousness, and cultural taboos. For restoration work, the system can help identify damaged patterns and retrieve intact references.

Through developing this prototype, we establish a technical framework that connects visual recognition with cultural interpretation. The integration of pattern detection and semantic knowledge graphs preserves both the visual forms and cultural meanings of batik patterns. By transforming pattern identification into comprehensive cultural analysis, the system offers researchers, cultural workers, and educators a practical tool for documenting and transmitting batik heritage knowledge, contributing to the sustainable preservation of this intangible cultural heritage.

Discussion

This paper presents an improved YOLOv11-based model for batik pattern detection that effectively balances detection accuracy with computational efficiency. The proposed method addresses the inherent challenges of detecting culturally significant patterns in complex batik images, where dense pattern distributions, scale variations, and degraded image quality pose substantial obstacles to conventional detection algorithms.

Our main contributions include: (1) construction of a comprehensive batik dataset with detailed pattern annotations, providing an essential resource for future research; (2) development of an improved YOLOv11n model integrating VOLO attention mechanisms and Fused-MBConv modules for improved feature representation; and (3) implementation of a prototype system bridging visual detection with cultural knowledge interpretation. Experimental results demonstrate the effectiveness of our approach, achieving significant improvements in detection accuracy while maintaining real-time performance, particularly for small-scale and densely distributed patterns.

While our work has made meaningful progress, there are still two main limitations that require further research. First, the current model has been trained and evaluated exclusively on Chinese batik patterns. Future work should assess its generalization capability across different batik traditions, including Indonesian, Malaysian, and Indian styles, to establish a cross-cultural detection framework that respects each tradition’s unique characteristics while leveraging shared visual features. Second, although we have developed a prototype system, systematic usability validation remains to be conducted. We plan to gather feedback from heritage professionals, museum curators, and cultural researchers to refine the system’s interface and functionality for practical deployment in heritage institutions.

This work contributes to the growing body of research on computer-assisted cultural preservation, demonstrating that computer vision techniques can effectively support the documentation and interpretation of intangible cultural heritage while respecting its complexity and cultural significance.