Table 1 Summary of LLMs and AI scientists (agents) safety concerns and solutions
From: Risks of AI scientists: prioritizing safeguarding over autonomy
Type of Safety Risk | LLMs | AI Scientists (agents) |
---|---|---|
Content Safety Risks | Risks Identified: Issues such as offensiveness, unfairness, illegal activities, and ethical concerns67,68. Evaluation Methods: SafetyBench with multiple-choice questions covering seven categories of safety risks67. Alignment Methods: Reinforcement learning from human feedback (RLHF)69,70. Safe RLHF, decoupling helpfulness and harmlessness71. Self-evaluation and training-free alignment via RAIN72. Fine-tuning Safety: Adversarial examples and benign data can inadvertently compromise model safety during fine-tuning73,74. Reassuringly, extra safety examples can improve this concern, an excess may hinder it75. | Tool Interaction Risks: Identifying risks of agents with an emulator43. |
Jailbreak Vulnerabilities | Alignment-Breaking Attacks: Evaluated under jailbreaking conditions33,61,62,63. Defenses: Prompt techniques (self-examination)76,77,78, parameter pruning79, fine-tuning80. | Evaluation of Risk Awareness: Techniques like AgentMonitor54 and R-Judge55. |