Table 1 Summary of safety concerns and solutions for LLMs and AI scientists (agents)

From: Risks of AI scientists: prioritizing safeguarding over autonomy

Type of Safety Risk: Content Safety Risks

LLMs: Risks Identified: Issues such as offensiveness, unfairness, illegal activities, and ethical concerns [67,68]. Evaluation Methods: SafetyBench, a multiple-choice benchmark covering seven categories of safety risks [67]. Alignment Methods: Reinforcement learning from human feedback (RLHF) [69,70]; Safe RLHF, which decouples helpfulness and harmlessness [71] (an illustrative objective is sketched after the table note); self-evaluation and training-free alignment via RAIN [72]. Fine-tuning Safety: Adversarial examples, and even benign data, can inadvertently compromise model safety during fine-tuning [73,74]. Reassuringly, adding extra safety examples mitigates this risk, although an excess of them can be counterproductive [75].

AI Scientists (agents): Tool Interaction Risks: Identifying the risks of tool-using agents with an emulator [43].

Type of Safety Risk: Jailbreak Vulnerabilities

LLMs: Alignment-Breaking Attacks: Evaluated under jailbreaking conditions [33,61,62,63]. Defenses: Prompt-based techniques such as self-examination [76,77,78] (see the sketch after the table note), parameter pruning [79], and fine-tuning [80].

AI Scientists (agents): Evaluation of Risk Awareness: Techniques such as AgentMonitor [54] and R-Judge [55] (a generic monitoring sketch follows the table note).

  1. Here, LLMs refer to base language models that primarily process and generate text, while AI scientists are autonomous systems that combine LLMs with the ability to use external tools (e.g., laboratory equipment, scientific software) and take actions in the physical world. For example, while an LLM might generate text describing a chemical reaction, an AI scientist could execute that reaction using robotic equipment.
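To make the "decoupling helpfulness and harmlessness" idea behind Safe RLHF [71] concrete, one common way to write such a constrained objective is sketched below. The policy \(\pi_{\theta}\), reward model \(R_{\phi}\) (helpfulness), and cost model \(C_{\psi}\) (harmlessness) are generic notation chosen for illustration; this is a paraphrase of the general approach, not a quotation of reference 71.

```latex
% Illustrative Safe RLHF-style constrained objective:
% maximize expected helpfulness reward subject to a bound on expected harmfulness cost.
\[
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
    \big[ R_{\phi}(x, y) \big] \\
\text{subject to}\quad & \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
    \big[ C_{\psi}(x, y) \big] \le 0,
\end{aligned}
\]
% which is typically optimized through the Lagrangian relaxation
\[
\min_{\lambda \ge 0}\; \max_{\theta}\;
\mathbb{E}\big[ R_{\phi}(x, y) - \lambda\, C_{\psi}(x, y) \big].
\]
```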
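As an illustration of the prompt-based "self-examination" defenses cited in the jailbreak row [76,77,78], the sketch below has the same model screen its own draft response before it is returned. The function name, prompt template, and the `generate` callable are assumed placeholders, not an API from the cited works.

```python
# Minimal sketch of a prompt-based self-examination defense (illustrative only).
# `generate` stands in for any text-generation callable (local model, API client, etc.).
from typing import Callable

REFUSAL = "I can't help with that request."

SELF_CHECK_TEMPLATE = (
    "Does the following response contain harmful, unethical, or illegal content? "
    "Answer with a single word, yes or no.\n\nResponse:\n{response}"
)

def answer_with_self_examination(prompt: str, generate: Callable[[str], str]) -> str:
    """Generate a draft answer, then let the model screen its own output."""
    draft = generate(prompt)
    verdict = generate(SELF_CHECK_TEMPLATE.format(response=draft))
    # If the self-check flags the draft, return a refusal instead of the draft.
    if verdict.strip().lower().startswith("yes"):
        return REFUSAL
    return draft
```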
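In the same illustrative spirit, the agent-side entries (emulator-based risk identification [43], AgentMonitor [54], R-Judge [55]) share a common pattern: score a proposed action for risk before it touches a tool or the physical world. The sketch below is a generic pre-execution check under assumed names (`ProposedAction`, `risk_score`, `guarded_execute`); it is not the interface of any of the cited systems.

```python
# Generic sketch of pre-execution risk monitoring for a tool-using agent.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ProposedAction:
    tool: str                 # e.g. "liquid_handler", "shell", "web_browser"
    arguments: Dict[str, str] # tool-specific parameters
    rationale: str            # the agent's stated reason for the action

def guarded_execute(
    action: ProposedAction,
    risk_score: Callable[[ProposedAction], float],  # 0.0 = benign, 1.0 = clearly dangerous
    execute: Callable[[ProposedAction], str],
    threshold: float = 0.5,
) -> str:
    """Execute the action only if the monitor scores it below the risk threshold."""
    score = risk_score(action)
    if score >= threshold:
        # Defer to a human reviewer instead of acting autonomously.
        return f"Action blocked (risk={score:.2f}); escalated for human review."
    return execute(action)
```

The design choice this illustrates is the one argued for in the article: the safeguard sits between the agent's decision and its execution, so autonomy is traded for a human-in-the-loop check whenever risk is non-negligible.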