Table 1 Statistics of SMILES and SELFIES token counts. Rows report the mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of the token count per string. SELFIES strings are longer on average but remain feasible for RoBERTa-based models, as only a small fraction exceeds the 512-token limit.

From: Domain adaptation of a SMILES chemical transformer to SELFIES with limited computational resources

 

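| Statistic | SMILES token count | SELFIES token count |
|-----------|--------------------|---------------------|
| Mean      | 35                 | 136                 |
| Std       | 22                 | 79                  |
| Min       | 1                  | 3                   |
| 25%       | 23                 | 90                  |
| 50%       | 32                 | 123                 |
| 75%       | 42                 | 159                 |
| Max       | 1035               | 3333                |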
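The rows above follow the usual summary-statistics layout, and the caption's claim about the 512-token limit can be checked the same way. The following is a minimal sketch, not the authors' code: it assumes the per-string token counts have already been computed by some tokenizer, and the function name `summarize_token_counts`, the `limit` parameter, and the example values are hypothetical.

```python
import statistics


def summarize_token_counts(counts, limit=512):
    """Return describe-style statistics and the share of strings above `limit`."""
    ordered = sorted(counts)
    # Quartiles with linear interpolation between the observed values.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": ordered[-1],
        # Fraction of strings that would be truncated at `limit` tokens.
        "frac_over_limit": sum(c > limit for c in ordered) / len(ordered),
    }


# Hypothetical token counts for illustration only, not data from the paper.
example_counts = [3, 90, 123, 159, 600]
summary = summarize_token_counts(example_counts)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Applied to the real SMILES and SELFIES token counts, the `frac_over_limit` entry would quantify how many strings exceed the model's maximum sequence length, which the table itself only bounds through the 75% and Max rows.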