Abstract
Large language models (LLMs) represent foundational advances toward artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial ethical compliance. We demonstrate that harmful knowledge embedded during pretraining persists as indelible “dark patterns” in LLMs’ parametric memory, creating an inherent “ethical drift” whereby alignment safeguards are systematically circumvented and harmful content resurfaces under adversarial inducement at distributional shifts. Through rigorous theoretical analysis, we prove that current alignment methods establish only localized “safety regions” in the knowledge manifold, while pretrained knowledge remains globally connected to harmful concepts via high-probability adversarial trajectories. We empirically validate these theoretical insights with a straightforward yet theoretically grounded methodology: semantic coherence inducement under distributional shifts. This approach achieves a 100% attack success rate on 22 of 26 state-of-the-art aligned LLMs (including DeepSeek-R1, Llama-3, and Qwen3). Its effectiveness is not incidental but a direct consequence of our theoretical framework, demonstrating that the vulnerability is architectural rather than implementation-specific and revealing a fundamental structural weakness in current aligned LLMs.
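As an illustration of the attack success rate reported above, the sketch below computes this metric over a set of harmful behaviors. It is a minimal sketch only: the function names and the injected attack, target-model, and judge callables are illustrative assumptions, not the authors’ evaluation pipeline.

```python
# Minimal, hypothetical sketch of an attack-success-rate (ASR) computation.
# `attack`, `query_model`, and `judge_is_harmful` are placeholder callables
# standing in for (i) a jailbreak rewording of a behavior, (ii) the target
# aligned LLM, and (iii) an automated harmfulness judge, respectively.

from typing import Callable, Iterable


def attack_success_rate(
    behaviors: Iterable[str],
    attack: Callable[[str], str],
    query_model: Callable[[str], str],
    judge_is_harmful: Callable[[str, str], bool],
) -> float:
    """Fraction of behaviors for which the attacked model yields harmful output."""
    behaviors = list(behaviors)
    successes = 0
    for behavior in behaviors:
        prompt = attack(behavior)        # e.g. a distribution-shifted rewording
        response = query_model(prompt)   # response from the target aligned LLM
        if judge_is_harmful(behavior, response):
            successes += 1
    return successes / len(behaviors) if behaviors else 0.0
```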
Data availability
The datasets used in this study are publicly available and can be accessed through the following link: https://github.com/centerforaisafety/HarmBench
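For reference, a minimal sketch of retrieving the HarmBench behavior set from the repository above follows; the raw-file path is an assumption about the repository layout and may need adjusting.

```python
import csv
import urllib.request

# Hypothetical path into the HarmBench repository; adjust if the layout differs.
BEHAVIORS_URL = (
    "https://raw.githubusercontent.com/centerforaisafety/HarmBench/main/"
    "data/behavior_datasets/harmbench_behaviors_text_all.csv"
)

with urllib.request.urlopen(BEHAVIORS_URL) as resp:
    lines = (line.decode("utf-8") for line in resp)
    behaviors = list(csv.DictReader(lines))  # one dict per harmful behavior

print(f"Loaded {len(behaviors)} behavior records")
```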
Code availability
The code used in this study is available on Code Ocean at https://doi.org/10.24433/CO.4583093.v2.
Acknowledgements
The research work described in this paper was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust (L.C.). This research received partial support from the Global STEM Professorship Scheme of the Hong Kong Special Administrative Region (L.C.) and from the National Natural Science Foundation of China (No. 62171381, S.M.).
Author information
Authors and Affiliations
Contributions
J.L. is a Dual PhD candidate in a joint training program between The Hong Kong Polytechnic University and Northwestern Polytechnical University. J.L. and J.P. conceived and designed experiments. J.L. performed experiments, contributed analysis tools, and drafted the paper. J.L. and L.W. analysed data. X.W. and Y.L. performed experiments. Y.W., S.M., and L.C. co-supervised this work; they reviewed and edited the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Davide Maiorca, Raja Chatila and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lian, J., Pan, J., Wang, L. et al. Revealing the intrinsic ethical vulnerability of aligned large language models. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70917-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467-026-70917-y


