Table 1 Summary of LLMs and AI scientists (agents) safety concerns and solutions

Type of Safety Risk	LLMs	AI Scientists (agents)
Content Safety Risks	Risks Identified: Issues such as offensiveness, unfairness, illegal activities, and ethical concerns^67,68. Evaluation Methods: SafetyBench with multiple-choice questions covering seven categories of safety risks⁶⁷. Alignment Methods: Reinforcement learning from human feedback (RLHF)^69,70. Safe RLHF, decoupling helpfulness and harmlessness⁷¹. Self-evaluation and training-free alignment via RAIN⁷². Fine-tuning Safety: Adversarial examples and benign data can inadvertently compromise model safety during fine-tuning^73,74. Reassuringly, extra safety examples can improve this concern, an excess may hinder it⁷⁵.	Tool Interaction Risks: Identifying risks of agents with an emulator⁴³.
Jailbreak Vulnerabilities	Alignment-Breaking Attacks: Evaluated under jailbreaking conditions^33,61,62,63. Defenses: Prompt techniques (self-examination)^76,77,78, parameter pruning⁷⁹, fine-tuning⁸⁰.	Evaluation of Risk Awareness: Techniques like AgentMonitor⁵⁴ and R-Judge⁵⁵.

Here, LLMs refer to base language models that primarily process and generate text, while AI scientists are autonomous systems that combine LLMs with the ability to use external tools (e.g., laboratory equipment, scientific software) and take actions in the physical world. For example, while an LLM might generate text describing a chemical reaction, an AI scientist could execute that reaction using robotic equipment.

Quick links

Search