Abstract
Zero-shot anomaly detection is crucial in privacy-sensitive scenarios with limited target data. However, prominent methods based on vision-language models suffer from semantic overlap caused by simplistic generic prompts, and their reductive visual representations fail to capture crucial local details and global structures, leading to alignment deviation between text and visual embeddings. In this paper, we propose S2SWCLIP, which integrates semantic-optimized prompts with wavelet-spatial synergy, advancing existing design principles by refining prompt learning, enriching visual representations, and optimizing cross-modal alignment. First, object-agnostic prompts, contrastive normal-anomaly prompts, and anomaly-referenced prompts are combined to delineate sharper semantic boundaries through strongly contrasting vocabulary, and their embeddings are integrated by a cross-informative adaptive fusion mechanism to consolidate comprehensive semantic information. Next, a spatial-to-wavelet transformation module converts spatial features into frequency-domain representations, which act in synergy with hierarchically fused visual features to retain fine-grained, meaningful image details. Finally, an entropy-gain similarity adaptively quantifies information richness, emphasizing features with low entropy disparity to optimize image-text alignment. Large-scale experiments on 14 real-world anomaly detection datasets show that S2SWCLIP outperforms numerous competing methods. The code is available at https://github.com/Huanzh111/S2SW.
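The released implementation may differ in detail; as a rough illustration of the spatial-to-frequency idea behind the spatial-to-wavelet transformation module, the sketch below applies a one-level 2D Haar decomposition to a feature map, separating a low-frequency sub-band (global structure) from three high-frequency sub-bands (local detail). The function name and scaling are illustrative assumptions, not the paper's module.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One-level 2D Haar transform of a 2D feature map with even sides.

    Returns four sub-bands: LL (low-frequency approximation, global
    structure) and LH/HL/HH (high-frequency details, local edges).
    """
    # Average and difference over row pairs.
    a = (x[0::2, :] + x[1::2, :]) / 2.0
    d = (x[0::2, :] - x[1::2, :]) / 2.0
    # Average and difference over column pairs of each intermediate.
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

# A constant map carries no local detail: all high-frequency
# sub-bands vanish, and LL retains the global level.
ll, lh, hl, hh = haar_dwt2(np.ones((8, 8)))
```

In a zero-shot pipeline, sub-bands like these would be fused back with the spatial features before image-text matching, so that anomaly cues visible only as local high-frequency detail are not averaged away.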
Data availability
The datasets generated and/or analysed during the current study are available in the GitHub repository: https://github.com/Huanzh111/S2SW.
Funding
This work is supported by grants from the Natural Science Foundation of Shandong Province (ZR2024MF145), the National Natural Science Foundation of China (62072469), and the Qingdao Natural Science Foundation (23-2-1-162-zyyd-jch).
Author information
Authors and Affiliations
Contributions
Conceptualization, H.Z. and M.Y.J.; methodology, H.Z. and C.L.W.; investigation, H.Z. and C.L.W.; writing - original draft, H.Z. and C.L.W.; writing - review & editing, H.Z. and J.L.; funding acquisition, C.L.W.; resources, C.L.W.; supervision, M.Y.J. and J.L.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, H., Wu, C., Lu, J. et al. S2SWCLIP: semantic-optimized prompts with spatial-wavelet synergy for zero-shot anomaly detection. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43044-3


