Extended Data Table 2 Stable vs vulnerable sub-datasets of The Pile

From: Medical large language models are vulnerable to data-poisoning attacks

  1. Vulnerable subsets are not rigorously moderated, so malicious users can infect them with poisoned content by hosting web pages (Common Crawl), uploading code (GitHub), or posting comments (HackerNews), or through other routes that an LLM training set may incidentally capture.