Introduction

The chemical space is estimated to contain approximately 10⁶⁰ drug-like compounds1,2. This scale presents a significant opportunity for identifying novel pharmaceuticals, materials, and catalysts, yet most of these compounds have never been experimentally characterized3,4. Computational methods are now central to property prediction and chemical space exploration and help mitigate the limitations of large-scale experimental screening5. Among these methods are descriptor-based prediction6 and graph- or geometry-based methods7. Descriptor-based prediction relies on predefined features (e.g., molecular weight, fingerprints) derived from tools such as RDKit8, which are then fed as inputs to machine learning models9. Graph-based and geometry-based predictions, in contrast, use molecular graphs to model atoms and bonds, incorporating 2D connectivity and 3D spatial geometry to predict molecular and quantum mechanical properties10,11. Of these methods, geometric graph neural networks have shown state-of-the-art performance on several molecular property prediction benchmarks7,12.

However, descriptor-based and graph-based prediction paradigms share a common limitation: they often depend on molecular descriptors (e.g., from RDKit) or privileged 3D information13, both of which may be unavailable or impractical to generate for large, novel chemical libraries14. This requirement impedes scalability when navigating the vast chemical space. Consequently, these methods may struggle to generalize efficiently in high-throughput or early-stage exploratory analyses, where low computational overhead and minimal data prerequisites are critical15.

Recently, transformer architectures16 have assumed a central role, building on their early success in natural language processing (NLP) to address molecular modelling tasks17,18. Initially developed to handle long-range contextual relationships in text, transformers exploit multi-head attention to capture dependencies across entire sequences, a feature that proves advantageous when treating molecules as strings19. Moreover, transformers’ capacity for self-supervised training on large-scale unlabeled data precludes the need for molecular descriptors or expensive 3D information13,20. As a result, they can learn context-rich embeddings that generalize to novel molecules. This scalability is particularly valuable in high-throughput settings, where rapid exploration of vast chemical spaces is crucial. These properties make transformers a compelling alternative to traditional methods for molecular representation and property prediction.

SMILES (Simplified Molecular Input Line Entry System)21 is the most prevalent string-based format for molecules. It encodes chemical graphs as linear notations resembling sentences, allowing NLP models to operate as if they were processing language. Many large-scale molecular transformers, such as ChemBERTa-zinc-base-v122, rely on SMILES for both pretraining and downstream predictions. A more performant variant, ChemBERTa-77M-MLM20, was trained on 77 million SMILES strings and achieved strong performance across a broad range of property prediction tasks, outperforming graph-based models such as D-MPNN23 on 6 out of 8 MoleculeNet24 benchmarks. This improvement can be attributed to the large-scale pretraining dataset (77 million SMILES strings) and to transformers' ability to automate feature extraction directly from molecular strings20.

However, SMILES exhibits notable limitations, including the absence of strict valency checks and multiple valid representations for a single compound25,26. These flaws can degrade performance in tasks requiring stable, unambiguous encodings27. SELFIES (Self-Referencing Embedded Strings)28 was introduced to mitigate these limitations: every SELFIES string is guaranteed to represent a valid molecular structure, avoiding many of the shortcomings associated with SMILES. This could make SELFIES more robust for property prediction, as the encoding is always chemically valid28,29. SELFormer30, a transformer model trained on around 2 million SELFIES strings, demonstrated state-of-the-art results on the ESOL and SIDER benchmarks24. On ESOL, SELFormer improved RMSE by more than 15% over GEM7, a geometry-based graph neural network. On SIDER, it increased ROC-AUC by 10% over MolCLR31. SELFormer also outperformed ChemBERTa-77M in key molecular property prediction tasks such as BBBP, BACE, and SIDER30. These findings suggest that SELFIES-based transformers can perform at least on par with, and in some cases surpass, both SMILES-based and graph-based models. Notably, SELFormer, which was pretrained on 2 million molecules30, outperformed ChemBERTa-77M, despite the latter having been pretrained on 77 million molecules20.

While SELFIES-based transformers such as SELFormer have shown strong predictive performance, their development requires training from scratch on large SELFIES corpora using customized tokenizers30. This process is computationally expensive and often inaccessible to groups without substantial computational resources41. Even well-resourced laboratories must allocate significant time and hardware to build such models, and for smaller research groups the costs may be prohibitive41,42. In natural language processing, domain-adaptive pretraining (DAPT) has been proposed as a lower-cost alternative, in which a pretrained model is further adapted to a new domain using unlabeled data32. This paradigm has shown that the effectiveness of continued pretraining correlates with the degree of divergence between the target domain and the original pretraining corpus, a relationship often assessed through vocabulary overlap32. Even when starting from strong pretrained baselines such as RoBERTa33, continued pretraining on domain-specific corpora (biomedical, scientific, or review texts) has been shown to yield measurable improvements when the target domain diverges from the original training distribution32.

Even though SELFIES is not a new semantic domain, it represents a distinct molecular notation that differs structurally from SMILES28. SELFIES encodes molecules using bracketed syntax and introduces unique tokens such as [Branch1] and [Ring1]. Despite these differences, SMILES and SELFIES share a substantial portion of their chemical vocabulary, including atomic symbols (e.g., C, O, N), bond indicators (e.g., =, #), and punctuation29. In DAPT, vocabulary overlap is used to estimate similarity between source and target domains. By this measure, the overlap between SMILES and SELFIES suggests that the degree of divergence remains within a range that allows effective adaptation using the original tokenizer.
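As an illustration of how this overlap criterion can be estimated in practice (not part of the study's pipeline), the minimal sketch below tokenizes a toy set of molecules in both notations with the publicly released ChemBERTa tokenizer and computes a Jaccard overlap of the resulting token sets; the Hugging Face model identifier and the sample molecules are assumptions.

```python
# Sketch: estimate token-level overlap between SMILES and SELFIES notations
# using the SMILES-trained ChemBERTa tokenizer. Hub ID and molecules are
# illustrative assumptions.
from transformers import AutoTokenizer
import selfies as sf

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")  # assumed hub ID

smiles_sample = ["CC(=O)C", "c1ccccc1O", "CCO"]           # toy SMILES strings
selfies_sample = [sf.encoder(s) for s in smiles_sample]   # same molecules as SELFIES

def token_set(strings):
    tokens = set()
    for s in strings:
        tokens.update(tokenizer.tokenize(s))
    return tokens

smiles_tokens = token_set(smiles_sample)
selfies_tokens = token_set(selfies_sample)

# Jaccard overlap as a rough proxy for domain divergence (cf. DAPT).
overlap = len(smiles_tokens & selfies_tokens) / len(smiles_tokens | selfies_tokens)
print(f"Token-set Jaccard overlap: {overlap:.2f}")
```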

Building on the DAPT paradigm, this study examines whether a transformer pretrained on SMILES can be adapted to SELFIES without changing the tokenizer or reinitializing the model. Although SELFIES uses bracketed tokens and differs in syntax, it shares most of its vocabulary with SMILES, including atoms, bonds, and punctuation. The approach treats SELFIES not as a new domain but as an alternative notation within the same chemical language.

To evaluate this hypothesis, ChemBERTa-zinc-base-v122 was selected as the adaptation base. The model was originally trained on SMILES strings and uses a byte-pair tokenizer34. Approximately 700,000 molecules were sampled from PubChem35 and converted to SELFIES format. Adaptation was performed using masked language modeling36, with no changes made to the tokenizer vocabulary. The training procedure was conducted on a single NVIDIA A100 GPU and completed within 12 h using Google Colab Pro37. The overall methodology is summarized in Fig. 1.

Fig. 1
figure 1

Overview of the methodology for repurposing a SMILES-pretrained transformer to SELFIES. The workflow includes data collection from PubChem, SELFIES conversion, tokenization checks, domain adaptation via masked language modeling, embedding evaluation, and downstream fine-tuning on ESOL, FreeSolv, and Lipophilicity.

The adapted model was evaluated at the embedding level to determine whether it produced chemically coherent representations of SELFIES inputs. t-SNE38 projections and cosine similarity39 were used to analyze clustering and pairwise distances among embeddings for molecules with common functional groups. Also, frozen embeddings were used to predict twelve properties from the QM9 dataset40. These evaluations were designed to assess whether the adapted model retained chemically meaningful structure without requiring end-to-end retraining.

Downstream predictive performance was further tested through full model finetuning on three benchmarks: ESOL, FreeSolv, and Lipophilicity24. All datasets were scaffold-split24 during evaluation. The adapted model matched or exceeded the performance of SMILES-based transformer baselines and graph neural networks across all tasks.

These findings confirm that a SMILES-pretrained model can be adapted to SELFIES through representation-level domain adaptation. The approach avoids the need for specialized tokenization or pretraining from scratch, while producing embeddings that are both structurally coherent and predictive. Although the primary objective is not to establish new state-of-the-art results, the study provides a generalizable framework for extending pretrained molecular transformers to robust notations such as SELFIES.

In addition to achieving high-fidelity performance across multiple benchmarks, the model does not rely on molecular descriptors or three-dimensional structural information. This enables application to novel compounds and use cases where such data are unavailable. Also, the cost-efficient design makes it suitable for research groups with constrained computational resources41,42.

Methods

Tokenization feasibility

Before adaptation, it was necessary to determine whether the tokenizer used in ChemBERTa-zinc-base-v1, originally trained on SMILES, could process SELFIES strings. Two criteria were evaluated: the presence of unrecognized ([UNK]) tokens and the proportion of SELFIES sequences exceeding the 512-token input limit33. The presence of unknown tokens would indicate that certain SELFIES symbols are not represented in the tokenizer's vocabulary, which would break input consistency and lead to a loss of essential information43. Sequences that exceed the maximum length are truncated at the input layer, which also results in information loss44. Both conditions are expected to reduce the reliability of downstream predictions43,44.

To evaluate tokenizer compatibility, approximately 700,000 SMILES strings were sampled from PubChem35 and converted to SELFIES using the Python “selfies” library45. Molecules that failed the conversion step were excluded. The resulting SELFIES strings were passed directly into the ChemBERTa-zinc-base-v1 tokenizer. The tokenizer vocabulary and merges remained unchanged. Each SELFIES string was treated as a complete input sequence, and segmentation followed the merge rules originally learned from SMILES. Sequences were padded or truncated to a fixed length of 512 tokens33. All inputs were formatted using standard transformer conventions. No additional encoding layers or syntax-specific adjustments were applied.
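A minimal sketch of this feasibility check is shown below. It assumes the publicly available ChemBERTa-zinc-base-v1 checkpoint on the Hugging Face Hub and a toy input list; the actual PubChem sampling and bookkeeping used in the study are not reproduced here.

```python
# Sketch of the tokenizer compatibility check: convert SMILES to SELFIES,
# tokenize with the unmodified SMILES-trained tokenizer, and count [UNK]
# occurrences and sequences exceeding the 512-token limit.
import selfies as sf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")  # assumed hub ID

def check_selfies_compatibility(smiles_list, max_len=512):
    n_unk, n_too_long, lengths = 0, 0, []
    for smi in smiles_list:
        try:
            selfies_str = sf.encoder(smi)              # SMILES -> SELFIES
        except sf.EncoderError:
            continue                                   # molecules failing conversion are excluded
        ids = tokenizer(selfies_str)["input_ids"]      # segmentation uses the original merges
        lengths.append(len(ids))
        n_unk += ids.count(tokenizer.unk_token_id)
        n_too_long += int(len(ids) > max_len)
    return n_unk, n_too_long, lengths

# Toy example; the study applied this check to ~700,000 PubChem molecules.
n_unk, n_too_long, lengths = check_selfies_compatibility(["CC(=O)C", "c1ccccc1O"])
print(n_unk, n_too_long, sum(lengths) / len(lengths))
```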

The distribution of token counts for SMILES versus SELFIES, and example molecules with their token counts in the two notations, are summarized in Tables 1 and 2, respectively. The frequency distribution of token lengths for both SMILES and SELFIES is shown in Fig. 2. Notably, none of the SELFIES strings contained [UNK] tokens, and only about 0.5% exceeded 512 tokens. Although SELFIES typically produces longer token sequences (an average of 136 tokens versus 36 for SMILES), the tokenizer consistently decomposed SELFIES expressions into tokens recognizable from the original merges. For instance, acetone spans 5 tokens in SMILES but 19 in SELFIES; such expansions, however, produced neither truncation nor unknown tokens. A higher incidence of truncation or unknown tokens would have introduced systematic information loss and undermined the validity of subsequent adaptation43,44.

Table 1 Statistics of SMILES and SELFIES token counts. Each entry captures average, median, and percentile distributions. SELFIES strings show higher mean lengths but remain feasible for RoBERTa-based models, as only a small fraction surpasses the 512-token limit.
Fig. 2
figure 2

Comparison of SMILES versus SELFIES token distributions. SMILES notation frequently exhibits fewer tokens. SELFIES expansions guarantee valid structures at the cost of increased token lengths. Despite longer SELFIES strings, fewer than 1% exceeded 512 tokens and no [UNK] tokens were introduced.

Although SELFIES strings were processed successfully using the SMILES-trained tokenizer, detailed inspection revealed that SELFIES-specific tokens such as [=Branch1] and [Ring1] were not tokenized as complete semantic units29,46. Instead, the tokenizer segmented these tokens at the character or sub-character level, producing tokens like ‘[‘, ‘=’, ‘Br’, ‘a’, ‘nc’, ‘h’, ‘1’, ‘]’ for [=Branch1] and ‘[‘, ‘R’, ‘i’, ‘n’, ‘g’, ‘1’, ‘]’ for [Ring1]. As expected, this behaviour reflects the fact that the original tokenizer, trained only on SMILES data, lacks specific merges for SELFIES bracketed terms and therefore falls back on character-level decomposition47. Therefore, SELFIES inputs tend to produce longer tokenized sequences compared to SMILES for the same molecule.
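For illustration, this segmentation behaviour can be inspected directly with a short snippet; it assumes the same publicly available tokenizer and is not part of the adaptation pipeline.

```python
# Inspect how the SMILES-trained tokenizer decomposes SELFIES-specific symbols.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")  # assumed hub ID
for symbol in ["[Branch1]", "[Ring1]", "[=Branch1]"]:
    # Expected to fall back to character- or sub-character-level pieces,
    # since the SMILES-trained vocabulary contains no SELFIES-specific merges.
    print(symbol, "->", tokenizer.tokenize(symbol))
```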

However, a closer analysis of the molecules that exceeded the 512-token limit showed that these cases corresponded to compounds of extreme molecular complexity. The molecules exceeding the 512-token threshold exhibited molecular formulas such as C₂₂₈H₃₈₂O₁₉₁ (6179 g/mol), C₁₂₄H₁₈₅N₉O₂₀₇S₃₆ (6268 g/mol), C₉₈H₁₄₇N₇O₁₆₁S₂₈ (4897 g/mol), and C₂₀₅H₃₆₆N₃O₁₁₇P₅ (4900 g/mol), all notably exceeding typical molecule scales40. Moreover, when the original SMILES strings for these molecules were tokenized, they too exceeded 512 tokens, indicating that the sequence-length issue arises from the molecules’ intrinsic sizes and branching complexity. For such ultra-large molecules, existing transformers may not be able to represent all structural information due to unavoidable truncation44; addressing these cases would likely require specialized tokenization schemes or architectural modifications44. However, for the small, medium, and moderately long molecules that dominate benchmark datasets such as ESOL, FreeSolv, Lipophilicity, and QM924,40, current models and tokenization strategies remain sufficient.

Tokenization analysis confirmed that the vocabulary and merges from ChemBERTa-zinc-base-v1 provided full coverage for SELFIES symbols. No additional merges or tokenizer modifications were necessary to process the SELFIES corpus.

Table 2 Example molecules highlighting differences between SMILES and SELFIES representations, and the resulting token counts using the ChemBERTa-zinc-base-v1 tokenizer.

Domain-adapting to SELFIES

Following verification that the ChemBERTa-zinc-base-v1 tokenizer can process SELFIES representations without generating unknown tokens and with minimal sequence truncation, the model was domain-adapted on these SELFIES inputs. The same dataset of approximately 700,000 molecules used in the feasibility assessment was employed for domain adaptation. The adaptation followed a standard masked language modeling (MLM)48 approach, masking 15% of tokens at random and training the model to reconstruct them. All domain adaptation procedures, including masked token sampling and model updates, were performed using the Hugging Face Trainer class49, which streamlines data collator setup, optimization schedules, and logging.

The domain adaptation was conducted on an NVIDIA A100 GPU within Google Colab Pro50, using a per-device batch size of 32. Seventeen epochs of MLM domain adaptation were completed with the Adam optimizer51 (learning rate = 5 × 10⁻⁵, weight decay = 0.01), and mixed-precision FP1652 was enabled to reduce GPU memory usage. The final training loss was 0.0268.
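A condensed sketch of this adaptation setup is given below. It mirrors the reported configuration (15% masking, 17 epochs, batch size 32, learning rate 5 × 10⁻⁵, weight decay 0.01, FP16), but the corpus construction, output path, and hub identifier are assumptions rather than the exact scripts used in the study.

```python
# Sketch of masked-language-modeling domain adaptation with the Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "seyonec/ChemBERTa-zinc-base-v1"          # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)  # vocabulary and merges left unchanged
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Placeholder corpus; the study used ~700,000 SELFIES strings from PubChem.
selfies_corpus = ["[C][C][=Branch1][C][=O][C]", "[C][C][O]"]
dataset = Dataset.from_dict({"text": selfies_corpus}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Mask 15% of tokens at random for reconstruction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="chemberta-selfies-da",   # assumed output path
    num_train_epochs=17,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    weight_decay=0.01,
    fp16=True,
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```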

t-SNE visualization

A chemically diverse panel was selected, comprising alkanes, alkenes, alkynes, aromatics, ketones, thiols, and alcohols. Each molecule’s SELFIES string was fed through the domain-adapted model, generating a 768-dimensional embedding via average-pooling of the final-layer hidden states. A t-SNE projection38 (perplexity = 5, random state = 0) then reduced these embeddings to two dimensions for visualization. This analysis aimed to evaluate whether the repurposed model encodes functional group information in a chemically meaningful way. Distinct molecular classes, such as ketones and alkanes, should occupy separate regions of the two-dimensional projection if the embeddings reflect structural differences. For reference, SMILES strings were processed using the original ChemBERTa-zinc-base-v1, and their embeddings were projected using the same t-SNE configuration to allow direct comparison.
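A sketch of the embedding and projection step is shown below, assuming the adapted checkpoint is saved locally and using an illustrative molecule panel; mean pooling is applied over the final-layer hidden states as described above.

```python
# Mean-pool final-layer hidden states and project to 2D with t-SNE
# (perplexity = 5, random_state = 0).
import selfies as sf
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")  # assumed hub ID
model = AutoModel.from_pretrained("chemberta-selfies-da")                    # assumed local checkpoint
model.eval()

# Illustrative panel: alkanes, an alkene, an alkyne, an aromatic, a ketone, a thiol, an alcohol.
smiles_panel = ["C", "CC", "CCC", "C=C", "C#C", "c1ccccc1", "CC(=O)C", "CS", "CCO"]
selfies_panel = [sf.encoder(s) for s in smiles_panel]

embeddings = []
with torch.no_grad():
    for s in selfies_panel:
        inputs = tokenizer(s, return_tensors="pt", truncation=True, max_length=512)
        hidden = model(**inputs).last_hidden_state        # shape (1, seq_len, 768)
        embeddings.append(hidden.mean(dim=1).squeeze(0))  # average-pool over tokens
embeddings = torch.stack(embeddings).numpy()

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
```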

Cosine similarity

Cosine similarity39 was evaluated using a selected subset of molecules, with all similarities computed relative to methane. Molecules with similar structures, such as ethane and propane, were expected to yield higher similarity scores, while dissimilar compounds like benzene or phenol were expected to yield lower scores. This comparison was used to assess whether the SELFIES-based embeddings reflect chemical similarity in latent space. The same procedure was applied to SMILES-based embeddings produced by the original ChemBERTa-zinc-base-v1 to enable direct comparison between the two representations.
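The comparison can be sketched as follows, reusing the tokenizer and adapted model loaded in the previous snippet; the molecule panel is illustrative rather than the exact subset used in the study.

```python
# Cosine similarity of mean-pooled SELFIES embeddings against methane.
import selfies as sf
import torch
import torch.nn.functional as F

def embed_selfies(selfies_str):
    """Average-pool the final-layer hidden states into one 768-d vector."""
    inputs = tokenizer(selfies_str, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

reference = embed_selfies(sf.encoder("C"))  # methane reference embedding
panel = {"ethane": "CC", "propane": "CCC", "ethene": "C=C", "ethyne": "C#C",
         "acetone": "CC(=O)C", "ethanethiol": "CCS", "benzene": "c1ccccc1", "phenol": "Oc1ccccc1"}
for name, smi in panel.items():
    similarity = F.cosine_similarity(reference, embed_selfies(sf.encoder(smi)), dim=0).item()
    print(f"{name:12s} {similarity:.3f}")
```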

QM9 regression using frozen embeddings

Molecular descriptors are highly effective inputs for machine learning models in property prediction13. Transformers offer an alternative: they take raw SELFIES strings and automatically generate embeddings that act as learned descriptors, without the need for external toolkits or handcrafted features13,20,30. If these embeddings carry enough chemical information to predict properties accurately, they provide a direct and efficient replacement for traditional descriptor pipelines. To test this, we evaluated the domain-adapted model’s performance in property prediction using frozen embeddings53, without any additional fine-tuning.

Embedding-level coherence, while informative, must translate into reliable property predictions to validate any practical advantage of our proposed methodology. The QM9 dataset40, which encompasses a variety of quantum, electronic, thermodynamic, and energetic properties, served as a regression benchmark in this study. Specifically, the following twelve properties were evaluated: Dipole Moment (μ), Isotropic Polarizability (α), HOMO Energy (\(\epsilon_{HOMO}\)), LUMO Energy (\(\epsilon_{LUMO}\)), HOMO–LUMO Gap (\(\epsilon_{gap}\)), Electronic Spatial Extent (\(\langle R^2 \rangle\)), Zero-Point Vibrational Energy (\(zpve\)), Internal Energy at 0 K (\(U_0\)), Internal Energy at 298 K (\(U\)), Enthalpy at 298 K (\(H\)), Free Energy at 298 K (\(G\)), and Heat Capacity at 298 K (\(C_v\)).

To ensure a consistent molecular subset for regression analysis, all QM9 molecules were first converted from SMILES to SELFIES using the standard implementation of the SELFIES Python library45. Molecules that failed the conversion step were removed. The remaining SELFIES strings were input to the domain-adapted model. Hidden states from the final transformer layer were extracted and averaged across all tokens to produce a 768-dimensional embedding for each molecule. These embeddings were used as fixed representations for the regression task.
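A sketch of this frozen-embedding extraction is shown below; the QM9 inputs (the qm9_smiles list and qm9_targets array) are assumed to come from a local copy of the dataset, and the pooling mirrors the description above.

```python
# Convert QM9 SMILES to SELFIES, drop failed conversions, and extract fixed
# 768-d mean-pooled embeddings from the frozen, domain-adapted encoder.
import numpy as np
import selfies as sf
import torch

def frozen_embeddings(smiles_list, tokenizer, model):
    features, kept_indices = [], []
    model.eval()
    with torch.no_grad():
        for i, smi in enumerate(smiles_list):
            try:
                selfies_str = sf.encoder(smi)
            except sf.EncoderError:
                continue                                  # molecules failing conversion are removed
            inputs = tokenizer(selfies_str, return_tensors="pt", truncation=True, max_length=512)
            hidden = model(**inputs).last_hidden_state    # final transformer layer
            features.append(hidden.mean(dim=1).squeeze(0).numpy())
            kept_indices.append(i)
    return np.stack(features), kept_indices

# X, kept = frozen_embeddings(qm9_smiles, tokenizer, model)  # hypothetical QM9 inputs
# y = qm9_targets[kept]                                       # align targets with converted molecules
```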

This generated consistent 768-dimensional inputs (identical to what ChemBERTa-zinc-base-v1 yields with SMILES). A uniform feed-forward network54 with dropout55 and batch normalization56 was then trained on each property under five-fold cross-validation57. The same network architecture and training protocol were applied to embeddings from two other models, ChemBERTa-zinc-base-v1 and ChemBERTa-77M-MLM, to ensure that any observed performance differences could be attributed to the underlying embeddings rather than to variations in the downstream network architecture.

The consistency of the feed-forward network (including dropout, batch normalization, and identical training procedures) isolates the effect of the embeddings from other confounding factors. Root-mean-squared error (RMSE) and the coefficient of determination (R²) were the primary metrics used to gauge prediction accuracy. Figure 3 shows the architecture of the feed-forward network. A single set of hyperparameters58 was used for all models; the configuration was selected through preliminary testing to ensure stable convergence across ChemBERTa-zinc-base-v1, ChemBERTa-77M-MLM, and the domain-adapted model.

Fig. 3
figure 3

Schematic of the feed-forward neural network used for QM9 regression. Embedding vectors (derived from ChemBERTa’s final-layer average pooling) are fed into sequential fully connected layers (512→256→128→64), each followed by batch normalization and dropout.
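A sketch of this regressor, following the layer sizes in Fig. 3, is given below; the dropout rate and hidden-layer activation are assumptions, as they are not specified in the text. In the study, an identically configured network was trained per property, for each embedding source, under five-fold cross-validation.

```python
# Feed-forward regressor over frozen 768-d embeddings: 512 -> 256 -> 128 -> 64,
# each hidden layer followed by batch normalization and dropout, ending in a
# single regression output (one network per QM9 property).
import torch.nn as nn

class PropertyRegressor(nn.Module):
    def __init__(self, in_dim=768, dropout=0.2):          # dropout rate assumed
        super().__init__()
        dims = [in_dim, 512, 256, 128, 64]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(dropout)]     # activation assumed to be ReLU
        layers.append(nn.Linear(dims[-1], 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)
```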

Performance on the QM9 regression tasks was used to compare the domain-adapted model against ChemBERTa-zinc-base-v1 and ChemBERTa-77M-MLM. This evaluation was designed to assess whether adaptation to SELFIES yields improvements over the original SMILES model and whether the adapted model remains competitive with a significantly larger SMILES-pretrained transformer.

Fine-tuning the repurposed model for the Lipophilicity, ESOL, and FreeSolv datasets

While frozen embedding evaluations assessed the structural quality of the learned representations, they do not reflect the model’s full potential under direct supervision. In transformer-based pipelines, task-specific fine-tuning is a standard stage following pretraining or adaptation36,59. This step evaluates whether the domain-adapted model maintains predictive accuracy when trained directly on supervised tasks, and whether the gains observed in frozen embedding evaluations extend to practical property prediction benchmarks. To evaluate practical predictive capacity, the domain-adapted model was fine-tuned60,61 end-to-end on three widely used molecular property prediction benchmarks: ESOL, FreeSolv, and Lipophilicity24.

These datasets are widely used in cheminformatics and drug discovery and represent chemically diverse, application-relevant endpoints62,63,64. The ESOL dataset consists of 1128 molecules with aqueous solubility values reported in log(mol/L). FreeSolv comprises 643 molecules with hydration free energies in water, expressed in kcal/mol63. The Lipophilicity dataset includes 4200 molecules with experimental logD values reflecting the octanol/water distribution coefficient, a lipophilicity measure that informs ADME-related properties64. These datasets provide a practical assessment of the model’s generalizability across real-world molecular property prediction tasks.

The datasets were partitioned using scaffold-based splits (80/10/10) to ensure generalization across distinct molecular scaffolds24. To enable direct comparison with existing benchmarks, the fine-tuning setup followed the SELFormer protocol30, including architecture, optimization strategy, and evaluation procedures. The domain-adapted model was fine-tuned for 25 epochs with a batch size of 16 and a learning rate of 5 × 10⁻⁵, using the Adam optimizer and the Hugging Face Trainer API49. The CLS token65 output from the domain-adapted model was passed through dropout, a tanh-activated dense layer, and a regression head. RMSE, MSE, and MAE were recorded on the validation splits, and final test predictions were inverse-transformed to the original property scale. All hyperparameters and architectural components were held constant across datasets to ensure comparability.
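A condensed sketch of this fine-tuning setup is shown below. It relies on the standard RoBERTa sequence-regression head, which applies dropout and a tanh-activated dense layer to the CLS token before a linear output, matching the description above; dataset preparation and the checkpoint path are assumptions.

```python
# End-to-end fine-tuning on a regression benchmark (e.g., ESOL) with the
# Hugging Face Trainer: 25 epochs, batch size 16, learning rate 5e-5.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")  # unchanged tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "chemberta-selfies-da",        # assumed path to the domain-adapted checkpoint
    num_labels=1,                  # single continuous target -> regression (MSE loss)
    problem_type="regression",
)

args = TrainingArguments(
    output_dir="selfies-ft-esol",  # assumed output path
    num_train_epochs=25,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

# train_ds / val_ds: tokenized SELFIES strings with a float "labels" column,
# built from the scaffold-based 80/10/10 splits described above (hypothetical names).
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```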

Results & discussion

t-SNE clusters

After domain adaptation, a chemically diverse subset of molecules was selected to examine the structure of the learned embedding space using t-distributed stochastic neighbor embedding (t-SNE). The set included linear alkanes (methane, ethane, propane), aromatic compounds, and molecules containing functional groups such as ketones, thiols, and alcohols. As shown in Fig. 4, embeddings obtained from the domain-adapted model produced clearer separations between functional groups compared to those from the SMILES-pretrained baseline. Alkanes formed a more compact cluster, with separation from other chemical classes such as thiols and aromatics. Molecules containing aromatic rings occupied more localized regions, consistent with the presence of explicit bracketed tokens in SELFIES for ring specification28. Although the tokenizer was not trained on SELFIES syntax, continued masked language modeling48 was sufficient for the model to process bracketed tokens consistently. The adapted model produced more coherent latent representations across functional groups, even when operating with a tokenizer derived from SMILES. The shared use of atomic symbols, bond characters, and punctuation across both notations enabled this adaptation without modifying the tokenizer or introducing new merges.

Fig. 4
figure 4

t-SNE clustering of selected functional groups in (top) ChemBERTa-zinc-base-v1-based embeddings versus (bottom) SELFIES-adapted embeddings. Points represent final-layer mean-pooled vectors for diverse small molecules (e.g., alkanes, aromatics, and thiols). More distinct grouping is visible in the SELFIES embeddings, suggesting bracket expansions help clarify substructural features.

Cosine similarities

Cosine similarity39 was used to quantify the distance between molecular embeddings and assess whether chemical structure was preserved in latent space. Methane was used as the reference molecule, and similarity scores were computed for a set of small compounds covering multiple functional classes, including alkanes, alkenes, alkynes, alcohols, ketones, thiols, and aromatics. The objective was to determine whether the model distinguishes between these classes in a manner consistent with established chemical intuition.

As shown in Fig. 5, the SMILES-pretrained model produced inconsistent similarity rankings. Alkynes such as acetylene and propyne received higher similarity scores to methane than linear alkanes, despite their increased structural complexity. Alkenes also ranked closer to methane than some saturated hydrocarbons. These outcomes suggest that the ChemBERTa-zinc-base-v1 embeddings do not reliably encode structural proximity.

Fig. 5
figure 5

Cosine similarity of various molecules against methane’s embedding, comparing the ChemBERTa-zinc-base-v1 model (top) to the SELFIES-domain-adapted model (bottom). Alkanes (ethane, propane, butane) show higher similarity than ring-containing or heteroatom-rich structures (benzene, phenol). The SELFIES-based model often yields sharper functional distinctions.

The domain-adapted model exhibited a consistent and chemically interpretable pattern in cosine similarity values relative to methane. Within each functional group, shorter molecules were assigned higher similarity scores, while larger or more substituted analogues received progressively lower values. This trend was observed across alkanes, alkenes, alkynes, as well as molecules containing functional groups such as ketones and thiols. The gradation in similarity aligns with structural proximity to the reference molecule and suggests that the adapted model encodes molecular relationships in a manner consistent with chemical intuition.

These results are consistent with the clustering behaviour observed in the t-SNE analysis and support the conclusion that adaptation to SELFIES improved the structural resolution of molecular embeddings. The next section examines whether these differences extend to supervised regression tasks.

QM9 regression using frozen embeddings

The final stage of the embedding-level evaluation used frozen transformer embeddings to predict twelve molecular properties from the QM9 dataset40. Representations were extracted from the final transformer layer and were used as input to a feed-forward neural network trained using five-fold cross-validation.

Three models were compared: ChemBERTa-zinc-base-v1 (v1), ChemBERTa-77M-MLM (v2), and the domain-adapted model trained on SELFIES (SELFIES-DA). Table 3 summarizes the coefficient of determination (R²) and root mean squared error (RMSE) for each property. SELFIES-DA outperformed the smaller SMILES baseline (v1) across all twelve targets. For polarizability, R² increased from 0.681 to 0.969, and RMSE was reduced by more than half. Similar improvements were observed for \(\epsilon_{HOMO}\), \(\epsilon_{LUMO}\), and thermodynamic properties such as \(U\), \(H\), and \(C_v\).

Compared to ChemBERTa-77M-MLM, SELFIES-DA achieved similar or slightly better performance on most properties, despite being adapted on a dataset roughly 100 times smaller (700,000 vs. 77 million molecules). These results are consistent with the trends observed in the t-SNE and cosine similarity analyses and show that SELFIES-based adaptation can yield high-quality embeddings under constrained data and compute conditions. The original ChemBERTa study noted that performance improves with pretraining scale22, so further gains might be achieved with a larger SELFIES corpus; the current results nonetheless demonstrate that stable and accurate representations can be obtained under modest adaptation conditions. Since the transformer remained frozen during regression, the gains reflect changes in embedding quality introduced by the adaptation process.

The next section evaluates whether similar performance trends hold when the model is fine-tuned end-to-end on supervised property prediction tasks, in contrast to the frozen-embedding approach used for QM9.

Table 3 Comparative performance of three ChemBERTa-based models on the twelve QM9 properties. Each row corresponds to a specific QM9 property, while the columns list R² and RMSE for (i) ChemBERTa-zinc-base-v1 (“v1”), (ii) ChemBERTa-77M-MLM (“v2”), and (iii) the SELFIES-domain-adapted model (“SELFIES DA”). The SELFIES-domain-adapted model outperforms the smaller SMILES model on all targets and in most cases exceeds the larger model’s performance.

Fine-tuning the repurposed model for the Lipophilicity, ESOL, and FreeSolv datasets

To further assess predictive performance, the domain-adapted model was fine-tuned36,59 on three standard benchmarks: ESOL (aqueous solubility), FreeSolv (hydration free energy), and Lipophilicity (logD). These datasets are widely used for evaluating molecular property prediction24. All tasks used scaffold-based splits, and test performance was measured using root mean squared error (RMSE). Table 4 presents the results alongside established baselines, including SELFormer30, ChemBERTa-77M-MLM20, and a range of graph-based models7,18,23,66,67,68. Performance results for the graph-based models were sourced from the GEM study7, while results for ChemBERTa-77M-MLM20 and SELFormer30 were obtained from their respective original publications.

While the SELFIES-fine-tuned model (SELFIES FT) does not outperform the graph-based models, the results in Table 4 show that it achieves competitive performance across all three benchmarks. On ESOL, the model reached an RMSE of 0.944, outperforming several models, including ChemBERTa-77M-MLM20, PretrainGNN66, and GROVERbase18. On FreeSolv, it achieved an RMSE of 2.511, outperforming SELFormer and several graph-based baselines, though still behind geometry-enhanced models such as GEM7. On Lipophilicity, the model reached an RMSE of 0.746, closely aligned with SELFormer (0.735), slightly behind AttentiveFP68 (0.721), and behind GEM7 (0.660).

Table 4 RMSE on ESOL, FreeSolv, and Lipophilicity. SELFIES FT refers to the domain-adapted model fine-tuned end-to-end on each task. Despite limited pretraining scale and constrained computational resources, it performs competitively across all benchmarks.

Representative predictions are shown in Fig. 6. For ESOL, polar molecules such as 2-Methyloxirane and 2-pyrrolidone were associated with high predicted solubility, while large polyaromatic systems like 3,4-Benzopyrene and 3,4-Benzchrysene yielded low logS values. For FreeSolv, the model assigned relatively high hydration free energy to hydrophobic structures such as Octafluorocyclobutane, and low values to polar, hydrogen-bonding compounds such as caffeine and Dexketoprofen. These examples are consistent with expected behaviour and provide evidence that the model learns interpretable relationships between structure and target properties.

Fig. 6
figure 6

Representative predictions from the ESOL and FreeSolv test sets using the SELFIES FT model. Predicted values reflect known solubility and hydration trends, with polar structures ranked higher in ESOL and more negative in FreeSolv. The observations align with chemical intuition and demonstrate that the model produces interpretable outputs across structurally diverse molecules.

Although the model was trained on only 700,000 molecules and without any architectural modifications or tokenizer extensions, it achieved results comparable to models trained on substantially larger corpora or those augmented with 3D geometry7,20. The SELFIES-based adaptation provides a reliable alternative for settings with limited computational resources41,42. It produces accurate predictions without relying on molecular descriptors or structurally privileged inputs, two considerations of immense importance when dealing with novel molecules13. This characteristic provides an advantage over models that depend on engineered features or 3D structural data, especially in early-stage or high-throughput applications15.

The results presented in this study highlight the effectiveness of adapting a SMILES-pretrained transformer to SELFIES without modifying the tokenizer or model architecture. The SELFIES notation, with its context-free grammar, enforces structural validity and eliminates notational ambiguity28,29. These properties appear sufficient to support the learning of chemically relevant patterns across diverse prediction tasks. The use of a preexisting tokenizer, originally trained on SMILES, did not limit model performance when applied to SELFIES. This may suggest that notational standardization can offset the benefits typically associated with large-scale pretraining.

Conclusion

This study presents a cost-efficient strategy for repurposing a SMILES-pretrained transformer to SELFIES without modifying the tokenizer or model architecture. A default SMILES tokenizer was used to process approximately 700,000 SELFIES-formatted molecules, with no unknown tokens or excessive truncation encountered. The domain adaptation was completed in under 12 h on a single NVIDIA A100 GPU. Embedding-level evaluations—including t-SNE, cosine similarity, and QM9 regression—confirmed that SELFIES-based representations captured chemically meaningful structure with greater consistency than the SMILES baseline. On ESOL, FreeSolv, and Lipophilicity benchmarks, the adapted model performed competitively with both string-based and graph-based models.

These results demonstrate that SELFIES can serve as a robust and scalable alternative to SMILES for transformer-based molecular modeling. The approach requires no molecular descriptors, 3D geometry, or custom tokenizer, which makes it accessible to researchers operating under constrained resources. While the model does not outperform graph-based methods, it achieves comparable and reliable predictions with minimal infrastructure. This work establishes that standardized string-based notations, when paired with domain-adaptive training, can yield practical gains without full-scale retraining. Future work could examine whether scaling the SELFIES corpus improves downstream performance or whether adapting a pretrained language model such as RoBERTa33 provides additional benefits. Given the structured grammar and limited vocabulary of SELFIES relative to natural language, such models may capture its syntax efficiently and converge in fewer training epochs.