Abstract
Zero-shot anomaly detection is crucial in privacy-sensitive scenarios with limited target data. However, prominent methods based on vision-language models suffer from semantic overlap caused by simplistic generic prompts, and their reductive visual representations fail to capture crucial local details and global structures, leading to alignment deviation between text and visual embeddings. In this paper, we propose S2SWCLIP, which integrates semantic-optimized prompts with wavelet-spatial synergy, advancing existing design principles by refining prompt learning, enriching visual representations, and optimizing cross-modal alignment. First, object-agnostic prompts, contrastive normal-anomaly prompts, and anomaly-referenced prompts are combined to delineate sharper semantic boundaries through strongly contrasting vocabulary, and their embeddings are integrated by a cross-informative adaptive fusion mechanism to consolidate comprehensive semantic information. Next, a spatial-to-wavelet transformation module converts spatial features into frequency-domain representations, which act in synergy with hierarchically fused visual features to retain fine-grained, meaningful image details. Finally, an entropy-gain similarity adaptively quantifies information richness, emphasizing features with low entropy disparity to optimize image-text alignment. Large-scale experiments on 14 real-world anomaly detection datasets show that S2SWCLIP outperforms numerous competing methods. The code is available at https://github.com/Huanzh111/S2SW.
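The released implementation may differ in detail; as a rough illustration of the spatial-to-frequency idea behind the spatial-to-wavelet transformation module, the sketch below applies a one-level 2D Haar decomposition to a feature map, separating a low-frequency sub-band (global structure) from three high-frequency sub-bands (local detail). The function name and scaling are illustrative assumptions, not the paper's module.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One-level 2D Haar transform of a 2D feature map with even sides.

    Returns four sub-bands: LL (low-frequency approximation, global
    structure) and LH/HL/HH (high-frequency details, local edges).
    """
    # Average and difference over row pairs.
    a = (x[0::2, :] + x[1::2, :]) / 2.0
    d = (x[0::2, :] - x[1::2, :]) / 2.0
    # Average and difference over column pairs of each intermediate.
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

# A constant map carries no local detail: all high-frequency
# sub-bands vanish, and LL retains the global level.
ll, lh, hl, hh = haar_dwt2(np.ones((8, 8)))
```

In a zero-shot pipeline, sub-bands like these would be fused back with the spatial features before image-text matching, so that anomaly cues visible only as local high-frequency detail are not averaged away.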
Data availability
The datasets generated and/or analysed during the current study are available in the GitHub repository: https://github.com/Huanzh111/S2SW.
Funding
This work is supported by grants from the Natural Science Foundation of Shandong Province (ZR2024MF145), the National Natural Science Foundation of China (62072469), and the Qingdao Natural Science Foundation (23-2-1-162-zyyd-jch).
Author information
Authors and Affiliations
Contributions
Conceptualization, H.Z. and M.Y.J.; methodology, H.Z. and C.L.W.; investigation, H.Z. and C.L.W.; writing - original draft, H.Z. and C.L.W.; writing - review & editing, H.Z. and J.L.; funding acquisition, C.L.W.; resources, C.L.W.; supervision, M.Y.J. and J.L.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, H., Wu, C., Lu, J. et al. S2SWCLIP: semantic-optimized prompts with spatial-wavelet synergy for zero-shot anomaly detection. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43044-3


