Fig. 2: Distributions of tokens for TSSA codes, SMILES, and SELFIES.
From: t-SMILES: a fragment-based molecular representation framework for de novo ligand design

Panels a, b, c, d, e and f show the token distributions of TSSA_J, TSSA_B, SMILES, TSSA_M, TSSA_S and SELFIES, respectively. The symbols “&^”, which indicate the molecular topology structure in t-SMILES, have the second-highest frequency in the TSSA codes. Unlike the “(” and “)” symbols in SMILES, which must be paired, they do not need to appear in pairs. The number of paired parentheses (highlighted in red) is notably lower in t-SMILES codes, because parentheses are confined to sub-fragments rather than spanning the entire SMILES string. The suffix letters “J”, “B”, “M” and “S” in the t-SMILES code names denote the fragmentation algorithm: JTVAE, BRICS, MMPA and Scaffold, respectively.
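
A token distribution of the kind shown in these panels can be approximated with a simple tokenizer and counter. The following is a minimal sketch, not the authors' implementation: the regular expression is a simplified, assumed SMILES tokenizer extended with the “&” and “^” symbols, and the names `tokenize`, `count_tokens` and `parens_balanced` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Assumed, simplified token pattern: bracket atoms, two-letter halogens,
# single-letter atoms, ring-closure digits, and bond/branch/topology symbols
# (including "&" and "^"). This is not the tokenizer used in the paper.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[A-Za-z]|%\d{2}|\d|[()=#\-+\\/.:~@?*$&^])"
)


def tokenize(code):
    """Split a SMILES-like string into a list of tokens."""
    return TOKEN_PATTERN.findall(code)


def count_tokens(codes):
    """Count token frequencies over an iterable of SMILES-like strings."""
    counts = Counter()
    for code in codes:
        counts.update(tokenize(code))
    return counts


def parens_balanced(code):
    """Return True if every "(" has a matching ")" in the correct order."""
    depth = 0
    for token in tokenize(code):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    # Acetic acid as plain SMILES; parentheses must be paired.
    counts = count_tokens(["CC(=O)O"])
    print(counts.most_common())
    print(parens_balanced("CC(=O)O"))
```

Counting tokens this way over a whole dataset, once per representation, would give one frequency table per panel; the balance check illustrates the pairing constraint that applies to “(” and “)” but not to “&” and “^”.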