Extended Data Fig. 1: Visualization of data source suitability and chemical space diversity.

The top section shows the distributions of molecular text lengths and of their tokenized lengths, along with representative scaffolds. The bottom section shows the distributions of molecular properties and the relationships among them.