Nature Medicine

Extended Data Table 2 Stable vs vulnerable sub-datasets of The Pile

From: Medical large language models are vulnerable to data-poisoning attacks

Vulnerable subsets are not rigorously moderated, allowing malicious users to inject poisoned content by hosting web pages (Common Crawl), uploading code (GitHub), or posting comments (HackerNews), or through other channels that an LLM training set may incidentally capture.
