Table 1 Summary statistics of SMILES and SELFIES token counts: mean, standard deviation, minimum, quartiles, and maximum. SELFIES strings are longer on average but remain feasible for RoBERTa-based models, since only a small fraction exceeds the 512-token limit.
From: Domain adaptation of a SMILES chemical transformer to SELFIES with limited computational resources
Statistic | SMILES token count | SELFIES token count |
---|---|---|
Mean | 35 | 136 |
Std | 22 | 79 |
Min | 1 | 3 |
25% | 23 | 90 |
50% | 32 | 123 |
75% | 42 | 159 |
Max | 1035 | 3333 |
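
The statistics in the table, together with the share of strings above the 512-token limit, could be computed from a list of per-molecule token counts roughly as follows. This is a minimal sketch, not the authors' code: the function name `summarize_token_counts`, the `limit` parameter, and the example counts are hypothetical, and the tokenization step that produces the counts is not shown.

```python
import statistics


def summarize_token_counts(counts, limit=512):
    """Summarize per-molecule token counts (hypothetical helper)."""
    ordered = sorted(counts)
    # quantiles(n=4) returns the three quartile cut points (25%, 50%, 75%);
    # "inclusive" treats the data as the full population and interpolates linearly.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": ordered[-1],
        # Fraction of sequences longer than the model's context limit.
        "frac_over_limit": sum(c > limit for c in ordered) / len(ordered),
    }


# Made-up token counts for illustration only, not data from the paper.
example_counts = [3, 90, 123, 159, 600]
summary = summarize_token_counts(example_counts)
print(summary)
```

Reporting `frac_over_limit` alongside the quartiles makes the "small fraction" claim checkable directly, rather than inferring it from the gap between the 75th percentile and the maximum.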