Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Advertisement

Nature Communications
  • View all journals
  • Search
  • My Account Login
  • Content Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • RSS feed
  1. nature
  2. nature communications
  3. articles
  4. article
Revealing the intrinsic ethical vulnerability of aligned large language models
Download PDF
Download PDF
  • Article
  • Open access
  • Published: 21 March 2026

Revealing the intrinsic ethical vulnerability of aligned large language models

  • Jiawei Lian  ORCID: orcid.org/0000-0002-3899-07971,2,
  • Jianhong Pan1,
  • Lefan Wang  ORCID: orcid.org/0009-0007-8645-58722,
  • Yi Wang  ORCID: orcid.org/0000-0001-8659-47241 na1,
  • Xiaofei Wang  ORCID: orcid.org/0009-0004-6683-39692,
  • Yingjie Lu  ORCID: orcid.org/0009-0000-9096-12082,
  • Shaohui Mei  ORCID: orcid.org/0000-0002-8018-596X2 na1 &
  • …
  • Lap-Pui Chau  ORCID: orcid.org/0000-0003-4932-05931 na1 

Nature Communications , Article number:  (2026) Cite this article

  • 3110 Accesses

  • 1 Altmetric

  • Metrics details

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computer science
  • Information technology

Abstract

Large language models (LLMs) represent foundational advances toward artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial ethical compliance. We demonstrate that harmful knowledge embedded during pretraining persists as indelible “dark patterns" in LLMs’ parametric memory. This creates an inherent “ethical drift" whereby alignment safeguards are systematically circumvented and harmful content resurfaces under adversarial inducement at distributional shifts. Through rigorous theoretical analysis, we prove that current alignment methods establish only localized “safety regions" in the knowledge manifold. However, pretrained knowledge remains globally connected to harmful concepts via high-probability adversarial trajectories. We empirically validate these theoretical insights through a straightforward yet theoretically grounded methodology-semantic coherence inducement under distributional shifts. The effectiveness of this approach, achieving a 100% attack success rate across 22 out of 26 state-of-the-art aligned LLMs (including DeepSeek-R1, Llama-3, and Qwen3, among others), is not incidental but a direct consequence of our theoretical framework, demonstrating that the vulnerability is architectural rather than implementation-specific and revealing a fundamental structural weakness in current aligned LLMs.

Similar content being viewed by others

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

Article Open access 17 September 2025

Strong and weak alignment of large language models with human values

Article Open access 21 August 2024

Training large language models on narrow tasks can lead to broad misalignment

Article Open access 14 January 2026

Data availability

The datasets used in this study are publicly available and can be accessed through the following link: https://github.com/centerforaisafety/HarmBench

Code availability

The code used in this study is available on Code Ocean under the https://doi.org/10.24433/CO.4583093.v2.

References

  1. Guo, D. et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).

  2. Grattafiori, A. et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).

  3. Achiam, J. et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

  4. Zheng, Y. et al. Large language models for scientific discovery in molecular property prediction. Nature Machine Intelligence 1–11 (2025).

  5. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

    Google Scholar 

  6. Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

    Google Scholar 

  7. Goertzel, B. Artificial general intelligence: concept, state of the art, and future prospects. J. Artif. Gen. Intell. 5, 1 (2014).

    Google Scholar 

  8. Fei, N. et al. Towards artificial general intelligence via a multimodal foundation model. Nat. Commun. 13, 3094 (2022).

    Google Scholar 

  9. Wang, J., Zhang, B., Du, Q., Zhang, J. & Chu, D. A survey on data selection for llm instruction tuning. arXiv preprint arXiv:2402.05123 (2024).

  10. Kirk, H., Bean, A., Vidgen, B., Röttger, P. & Hale, S. The past, present and better future of feedback learning in large language models for subjective human preferences and values. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2409–2430 (2023).

  11. Ullah, E., Parwani, A., Baig, M. M. & Singh, R. Challenges and barriers of using large language models (llm) such as chatgpt for diagnostic medicine with a focus on digital pathology–a recent scoping review. Diagnostic Pathol. 19, 43 (2024).

    Google Scholar 

  12. Li, Y., Katsumata, K., Javanmardi, E. & Tsukada, M. Large language models for human-like autonomous driving: A survey. In 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), 439–446 (IEEE, 2024).

  13. Lin, M.-Y., Lee, O.-W. & Lu, C.-Y. Embodied ai with large language models: A survey and new hri framework. In 2024 International Conference on Advanced Robotics and Mechatronics (ICARM), 978–983 (IEEE, 2024).

  14. Das, B. C., Amini, M. H. & Wu, Y. Security and privacy challenges of large language models: A survey. ACM Comput. Surv. 57, 1–39 (2025).

    Google Scholar 

  15. Ma, X. et al. Safety at scale: A comprehensive survey of large model safety. arXiv preprint arXiv:2502.05206 (2025).

  16. Perez, F. & Ribeiro, I. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop (2022).

  17. Andriushchenko, M. & Flammarion, N. Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969 (2024).

  18. Wei, A., Haghtalab, N. & Steinhardt, J. Jailbroken: How does llm safety training fail? Adv. Neural Info. Proc. Syst. 36, 80079–80110 (2023).

  19. Zou, A. et al. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).

  20. Zhu, S. et al. Autodan: Interpretable gradient-based adversarial attacks on large language models. In First Conference on Language Modeling (2024).

  21. Mehrotra, A. et al. Tree of attacks: Jailbreaking black-box llms automatically. Adv. Neural Inf. Process. Syst. 37, 61065–61105 (2024).

    Google Scholar 

  22. Addepalli, S., Varun, Y., Suggala, A., Shanmugam, K. & Jain, P. Does safety training of llms generalize to semantically related natural prompts? In The Thirteenth International Conference on Learning Representations (2025).

  23. Andriushchenko, M., Croce, F. & Flammarion, N. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In The Thirteenth International Conference on Learning Representations (2025).

  24. Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. neural Inf. Process. Syst. 35, 27730–27744 (2022).

    Google Scholar 

  25. Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).

  26. Shu, Y. & Yu, Z. Distribution shifts are bottlenecks: Extensive evaluation for grounding language models to knowledge bases. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 71–88 (2024).

  27. Kulinski, S. & Inouye, D. I. Towards explaining distribution shifts. In International Conference on Machine Learning, 17931–17952 (PMLR, 2023).

  28. Liu, X., Xu, N., Chen, M. & Xiao, C. Autodan: Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations (2024).

  29. Sadasivan, V. S. et al. Fast adversarial attacks on language models in one gpu minute. In Proceedings of the 41st International Conference on Machine Learning, 42976–42998 (2024).

  30. Mazeika, M. et al. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, 35181–35224 (2024).

  31. Mazeika, M. et al. The trojan detection challenge. In NeurIPS 2022 Competition Track, 279–291 (PMLR, 2023).

  32. Biden, J. R. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence (2023).

  33. Yang, A. et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

  34. Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).

  35. Zheng, L. et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 36, 46595–46623 (2023).

    Google Scholar 

  36. Yang, A. et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023).

  37. Bai, J. et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).

  38. Geng, X. et al. Koala: A dialogue model for academic research. Blog post https://bair.berkeley.edu/blog/2023/04/03/koala/ (2023).

  39. Mitra, A. et al. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045 (2023).

  40. Jiang, A. et al. Mistral 7b. arxiv 2023. arXiv preprint arXiv:2310.06825 (2023).

  41. Tunstall, L. et al. Zephyr: Direct distillation of lm alignment. In First Conference on Language Modeling (2024).

  42. Kim, S. et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), 23–35 (2024).

  43. Wang, G. et al. Openchat: Advancing open-source language models with mixed-quality data. In The Twelfth International Conference on Learning Representations (2024).

  44. Zhu, B., Frick, E., Wu, T., Zhu, H. & Jiao, J. Starling-7b: Improving llm helpfulness & harmlessness with rlaif (2023).

  45. Wen, Y. et al. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Adv. Neural Inf. Process. Syst. 36, 51008–51025 (2023).

    Google Scholar 

  46. Guo, C., Sablayrolles, A., Jégou, H. & Kiela, D. Gradient-based adversarial attacks against text transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 5747–5757 (2021).

  47. Wallace, E., Feng, S., Kandpal, N., Gardner, M. & Singh, S. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019).

  48. Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4222–4235 (2020).

  49. Perez, E. et al. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 3419–3448 (2022).

  50. Chao, P. et al. Jailbreaking black box large language models in twenty queries. In IEEE Conference on Secure and Trustworthy Machine Learning, 23–42 (2025).

  51. Zeng, Y. et al. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 14322–14350 (2024).

  52. Shen, X., Chen, Z., Backes, M., Shen, Y. & Zhang, Y. “ do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 1671–1685 (2024).

  53. Act, E. A. I. The eu artificial intelligence act (2024).

  54. Coglianese, C. People and processes: Ai governance under executive order 14,110. Admin. Reg. L. N. 49, 9 (2023).

    Google Scholar 

  55. Bai, Y. et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022).

  56. Rafailov, R. et al. Direct preference optimization: Your language model is secretly a reward model. Adv. Neural Inf. Process. Syst. 36, 53728–53741 (2023).

    Google Scholar 

  57. Song, X., Duan, S. & Liu, G. Alis: Aligned llm instruction security strategy for unsafe input prompt. In Proceedings of the 31st International Conference on Computational Linguistics, 9124–9146 (2025).

  58. Chakraborty, S. et al. Transfer q-star: Principled decoding for llm alignment. Adv. Neural Inf. Process. Syst. 37, 101725–101761 (2024).

    Google Scholar 

Download references

Acknowledgements

The research work described in this paper was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust (L.C.). This research received partially support from the Global STEM Professorship Scheme from the Hong Kong Special Administrative Region (L.C.), and partially from the National Natural Science Foundation of China (No. 62171381, S.M.).

Author information

Author notes
  1. These authors jointly supervised this work: Yi Wang, Shaohui Mei, Lap-Pui Chau.

Authors and Affiliations

  1. Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China

    Jiawei Lian, Jianhong Pan, Yi Wang & Lap-Pui Chau

  2. School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China

    Jiawei Lian, Lefan Wang, Xiaofei Wang, Yingjie Lu & Shaohui Mei

Authors
  1. Jiawei Lian
    View author publications

    Search author on:PubMed Google Scholar

  2. Jianhong Pan
    View author publications

    Search author on:PubMed Google Scholar

  3. Lefan Wang
    View author publications

    Search author on:PubMed Google Scholar

  4. Yi Wang
    View author publications

    Search author on:PubMed Google Scholar

  5. Xiaofei Wang
    View author publications

    Search author on:PubMed Google Scholar

  6. Yingjie Lu
    View author publications

    Search author on:PubMed Google Scholar

  7. Shaohui Mei
    View author publications

    Search author on:PubMed Google Scholar

  8. Lap-Pui Chau
    View author publications

    Search author on:PubMed Google Scholar

Contributions

J.L. is a Dual PhD candidate in a joint training program between The Hong Kong Polytechnic University and Northwestern Polytechnical University. J.L. and J.P. conceived and designed experiments. J.L. performed experiments, contributed analysis tools, and drafted the paper. J.L. and L.W. analysed data. X.W. and Y.L. performed experiments. Y.W., S.M., and L.C. co-supervised this work; they reviewed and edited the manuscript.

Corresponding authors

Correspondence to Yi Wang, Shaohui Mei or Lap-Pui Chau.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Davide Maiorca, Raja Chatila and the other anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF )

Transparent Peer Review file (download PDF )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Lian, J., Pan, J., Wang, L. et al. Revealing the intrinsic ethical vulnerability of aligned large language models. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70917-y

Download citation

  • Received: 07 May 2025

  • Accepted: 09 March 2026

  • Published: 21 March 2026

  • DOI: https://doi.org/10.1038/s41467-026-70917-y

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Download PDF

Advertisement

Explore content

  • Research articles
  • Reviews & Analysis
  • News & Comment
  • Videos
  • Collections
  • Subjects
  • Follow us on Facebook
  • Follow us on X
  • Sign up for alerts
  • RSS feed

About the journal

  • Aims & Scope
  • Editors
  • Journal Information
  • Open Access Fees and Funding
  • Calls for Papers
  • Editorial Values Statement
  • Journal Metrics
  • Editors' Highlights
  • Contact
  • Editorial policies
  • Top Articles

Publish with us

  • For authors
  • For Reviewers
  • Language editing services
  • Open access funding
  • Submit manuscript

Search

Advanced search

Quick links

  • Explore articles by subject
  • Find a job
  • Guide to authors
  • Editorial policies

Nature Communications (Nat Commun)

ISSN 2041-1723 (online)

nature.com footer links

About Nature Portfolio

  • About us
  • Press releases
  • Press office
  • Contact us

Discover content

  • Journals A-Z
  • Articles by subject
  • protocols.io
  • Nature Index

Publishing policies

  • Nature portfolio policies
  • Open access

Author & Researcher services

  • Reprints & permissions
  • Research data
  • Language editing
  • Scientific editing
  • Nature Masterclasses
  • Research Solutions

Libraries & institutions

  • Librarian service & tools
  • Librarian portal
  • Open research
  • Recommend to library

Advertising & partnerships

  • Advertising
  • Partnerships & Services
  • Media kits
  • Branded content

Professional development

  • Nature Awards
  • Nature Careers
  • Nature Conferences

Regional websites

  • Nature Africa
  • Nature China
  • Nature India
  • Nature Japan
  • Nature Middle East
  • Privacy Policy
  • Use of cookies
  • Legal notice
  • Accessibility statement
  • Terms & Conditions
  • Your US state privacy rights
Springer Nature

© 2026 Springer Nature Limited

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics