SmileyLlama: modifying large language models for directed chemical space exploration

Cavanagh, Joseph M.; Sun, Kunyang; Gritsevskiy, Andrew; Bagni, Dorian; Wang, Yingze; Bannister, Thomas D.; Head-Gordon, Teresa

doi:10.1038/s43588-026-00986-y


Article
Open access
Published: 11 May 2026


Nature Computational Science (2026)


A preprint version of the article is available at arXiv.

Abstract

Here we show that large language models (LLMs) can be transformed via supervised fine-tuning of engineered prompts into SmileyLlama for exploring the chemical space of drug molecules. We benchmark SmileyLlama against pretrained LLMs and chemical language models trained from scratch for generating valid and novel drug-like molecules, and use direct preference optimization to both improve SmileyLlama’s adherence to a prompt and as part of the iMiner reinforcement learning framework to predict molecules with optimized three-dimensional conformations and high binding affinity to drug targets. By training an LLM to speak directly as a chemical language model, while retaining most of its natural language capabilities, we show that SmileyLlama can reliably generate molecules with user-specified properties rather than acting only as a chatbot with knowledge of chemistry or as a virtual assistant. While SmileyLlama is geared toward drug discovery, the supervised fine-tuning/direct preference optimization/LLM framework can be extended to other chemical, biological and materials applications.


Main

Chemical language models (CLMs)¹ trained on string representations such as Simplified Molecular-Input Line-Entry System (SMILES)² and SELF-referencIng Embedded Strings (SELFIES)³ have emerged as a useful tool for de novo generation of molecules, best exemplified by molecules relevant to pharmaceutical applications and drug discovery^4,5,6. Nearly all CLMs for molecular generation have been trained from scratch on large quantities of data such as ChEMBL⁷ and ZINC⁸ and using different model architectures, including variational autoencoders⁹, recurrent neural networks¹⁰, generative pretrained transformers (GPTs)¹¹ and structured state space sequence (S4) models¹². In addition, CLMs have achieved advances in chemical generation tasks through further downstream optimization of molecules with additional training or different model frameworks^13,14.

Language models are statistical models of probability distributions of units of language and can be adapted to generate meaningful text by sampling from these distributions. The most recent advances in language models have come from training scaled-up transformers¹⁵ on massive amounts of data, resulting in the creation of large language models (LLMs) such as the proprietary GPT-4¹⁶ and the open-weight Llama¹⁷. Recently, scientific groups have used frontier LLMs to assist research as virtual lab members, to translate between natural and chemical languages¹⁸, or even to perform research autonomously^19,20,21. Beyond chemical dictionary lookups or lab aides, LLMs have been used to perform mutation and crossover for an evolutionary algorithm to explore chemical space²² or to modify SMILES strings to change the properties of the molecules that they represent^23,24. Others have taken inspiration from LLMs to design CLMs, for example by adding the ability to respond to prompts through a transformer-based architecture²⁵. However, to our knowledge, no CLM derived from a pretrained general-purpose LLM has reached the performance exhibited by modern CLMs that are trained from scratch with chemical data.

Here, we demonstrate that an open-weight LLM, Meta-Llama-3.1-8B-Instruct (‘Llama’ from here on)¹⁷, can be converted into a model for generative tasks in drug discovery. Because Llama is open-weight, users can train and share adapters, perform inference without storing potentially valuable data on a remote server, control the hyperparameters and algorithms used for fine-tuning, and perform interpretability analyses on the model weights. Using supervised fine-tuning (SFT) and direct preference optimization (DPO)²⁶ of the pretrained Llama with SMILES strings derived from ChEMBL, SmileyLlama generates drug-like molecules with desirable, medicinal-chemistry-relevant properties specified in a user-defined prompt, and we show that it can match or exceed the performance of modern CLMs. We further demonstrate that SmileyLlama greatly improves the reinforcement learning component of the iMiner algorithm¹⁴, exploring chemical space more efficiently to create molecules optimized for three-dimensional (3D) binding to target proteins, illustrated with the SARS-CoV-2 (SARS2) main protease (MPro)²⁷. While our dataset and subsequent analyses are created with drug discovery as a downstream application, this general procedure can be extended to other chemical applications such as chemical synthesis planning²⁸ or transition metal complex discovery²⁹.

Results

SFT and DPO of Llama

To steer the outputs of the pretrained Llama model¹⁷ for drug molecule generation, we first use SFT, in which the weights of Llama are further optimized on SMILES strings of approximately 2 million molecules from the ChEMBL Dataset (v33)⁷ to create SmileyLlama. For each molecule in our dataset, we used RDKit³⁰ to calculate a number of molecular properties that are of pharmaceutical interest and relevant for medicinal chemistry. In addition, drug molecules must have suitable characteristics related to relevant biological phenomena, such as obeying the rule-of-five³¹ or having topological polar surface area (TPSA) values in ranges that are associated with oral bioavailability or the ability to cross the placenta or the blood–brain barrier³². If a drug need not meet these criteria, then a user interfacing with SmileyLlama should also be able to adjust the range criterion or eliminate it. Further specifics of these properties and the ranges we choose to specify during training of SmileyLlama can be found in the ‘Details of properties for fine-tuning’ section in the Methods.

After calculating and picking these properties for each SMILES string, we construct a prompt containing values of these properties, with the ‘correct’ completion being the SMILES string that these properties were calculated from. To illustrate, we used a prompt with a system instruction of ‘You love and excel at generating SMILES strings of drug-like molecules’ and a user instruction of the form ‘Output a SMILES string for a drug like molecule with the following properties:’ if properties are specified, or ‘Output a SMILES string for a drug like molecule:’ if no properties are specified. We chose to create prompts that assign SmileyLlama the role of an artificial intelligence that excels at producing SMILES strings, given the effectiveness of role prompting³³; we also chose this prompt format owing to its balance between motivation and brevity. Each property has a 50% chance of being calculated and specified in the prompt so that the trained model learns to operate equally well during inference, whether or not any properties are specified. We structure the prompts used for SFT so that during inference users avoid having to downselect the vast majority of generated molecules for having the correct characteristics—instead, users can simply prompt SmileyLlama to provide molecules with the characteristics they desire. See ‘Prompt formats and examples’ and ‘Additional training details’ sections in the Methods, as well as Supplementary Algorithm 1 and Supplementary Fig. 1 for further elaboration of SFT training.
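The prompt construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training code: the function name `build_example`, the dictionary layout and the comma-separated joining of property phrases are assumptions made here, and the property phrases are taken as already computed (in the paper they come from RDKit).

```python
import random

# System and user instructions quoted from the text above.
SYSTEM_PROMPT = "You love and excel at generating SMILES strings of drug-like molecules"
WITH_PROPS = "Output a SMILES string for a drug like molecule with the following properties:"
NO_PROPS = "Output a SMILES string for a drug like molecule:"


def build_example(smiles, property_phrases, rng, include_prob=0.5):
    """Return one SFT example (hypothetical layout: system, user, completion).

    property_phrases: pre-formatted phrases such as "<= 5 H-bond donors".
    Each phrase is kept independently with probability include_prob.
    """
    kept = [p for p in property_phrases if rng.random() < include_prob]
    if kept:
        # Joining with ", " is an assumption about the exact prompt layout.
        user = WITH_PROPS + " " + ", ".join(kept)
    else:
        user = NO_PROPS
    return {"system": SYSTEM_PROMPT, "user": user, "completion": smiles}


rng = random.Random(0)
example = build_example(
    "CC(=O)Oc1ccccc1C(=O)O",
    ["<= 3 H-bond donors", "<= 300 molecular weight", "no macrocycles"],
    rng,
)
print(example["user"])
```

Because each phrase is dropped independently, the same molecule can appear with anything from a fully specified prompt to the bare instruction, which is what lets the model respond sensibly when a user specifies only some properties.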

We also use DPO²⁶, which further updates the weights of Llama to reinforce our model’s ability to robustly generate molecules for more specific task-oriented goals such as property specification. Algorithmically, we prompt our SFT model to generate molecules with a given property, sample several SMILES strings and use RDKit³⁰ to assess whether they have properties in line with what the prompt requested. We then pair molecules that correctly follow the prompt with those that do not, labeling them as winners and losers, respectively, and use a single epoch of DPO to improve the model’s performance. See Supplementary Algorithm 2 for pseudocode of this scoring and pairing procedure.
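As a rough illustration of the scoring and pairing step (Supplementary Algorithm 2 remains the authoritative description), the sketch below pairs prompt-following samples with failing ones. The function name, the `chosen`/`rejected` keys, the deduplication step and the choice to use each sample at most once are assumptions; `satisfies_prompt` is a placeholder for the RDKit-based check.

```python
import random


def make_preference_pairs(prompt, samples, satisfies_prompt, rng):
    """Pair prompt-following samples (winners) with failures (losers).

    samples: generated SMILES strings for `prompt`.
    satisfies_prompt: callable returning True when a SMILES string has the
        requested properties (a user-supplied stand-in for an RDKit check).
    """
    unique = list(dict.fromkeys(samples))  # drop duplicates, keep order
    winners = [s for s in unique if satisfies_prompt(s)]
    losers = [s for s in unique if not satisfies_prompt(s)]
    rng.shuffle(winners)
    rng.shuffle(losers)
    # zip stops at the shorter list, so each sample is used at most once.
    return [
        {"prompt": prompt, "chosen": w, "rejected": l}
        for w, l in zip(winners, losers)
    ]


# Toy check: pretend the prompt asked for molecules without nitrogen.
pairs = make_preference_pairs(
    "toy prompt",
    ["CCO", "CCN", "CCC", "CCO", "NCC"],
    lambda smiles: "N" not in smiles,
    random.Random(0),
)
print(len(pairs))
```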

Benchmarking SmileyLlama against other LLMs and CLMs

To test the generative ability of SmileyLlama compared with other existing CLMs, we used the GuacaMol suite³⁴ to benchmark the validity, uniqueness and novelty of the molecules as shown in Table 1. In addition, Kullback–Leibler (KL) divergence and Fréchet ChemNet distance (FCD)³⁵ based on the GuacaMol definition (FCD_Guac) are used to analyze the distributional shifts from the ChEMBL training data for drug-like molecules³⁴. More detail is found in the ‘GuacaMol benchmark definitions’ section in the Methods.
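For orientation, the three headline metrics can be sketched as simple ratios. This follows the commonly cited GuacaMol-style definitions (valid among generated, unique among valid, novel among unique) and is only an approximation of the benchmark: the real suite canonicalizes SMILES and fixes the sample size, whereas this sketch compares strings verbatim. The `is_valid` callable is a hypothetical placeholder for a SMILES parser such as RDKit.

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness and novelty as simple ratios.

    generated: list of generated SMILES strings.
    training_set: iterable of training SMILES strings.
    is_valid: callable returning True for a parseable SMILES string.
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


metrics = generation_metrics(
    ["CCO", "CCO", "CCN", "bad"],
    {"CCO"},
    lambda smiles: smiles != "bad",  # toy validity check
)
print(metrics)
```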

Table 1 GuacaMol benchmarks comparing SmileyLlama with LLMs and with common CLM architectures trained on ChEMBL


We first analyze the ability of Llama to produce molecules, relying only on its pretrained knowledge (zero-shot), or by providing it with one or more examples from the ChEMBL database in the formulated prompt (Table 1 and Supplementary Table 1). We find that, without SFT or examples provided in the prompt, the LLM is unable to produce a high percentage of valid SMILES strings compared with other state-of-the-art CLMs and generally performs poorly, even with variations in hyperparameters such as temperature (T). Interestingly, validity is lower when 20 examples are provided in the prompt (20-shot) than it is when no examples are in the prompt (zero-shot). We speculate that zero-shot Llama has had enough exposure to the SMILES syntax to generate valid strings, but it has no intrinsic ability to generalize, so it repeats memorized SMILES, resulting in low uniqueness. When several examples are given in the prompt, this biases Llama away from the known SMILES strings it can produce, but there are few enough examples provided that its grasp of the allowed mutable structure of SMILES strings is poor, and its outputs are thus less often valid. However, because these prompts are so diverse, Llama’s 20-shot uniqueness is very high.

In Table 1, it can be seen that SFT substantially improves SmileyLlama’s ability to generate drug-like molecules. In addition, we experiment with the format of the SmileyLlama prompt, performing SFT on Llama with a less anthropomorphic user prompt and a blank template as an ablation study, showing that changing this prompt format does not substantially affect the GuacaMol benchmarks (Supplementary Table 1). To show the generality of the LLM-SFT approach, we also fine-tune Llama-3.2-3B, Llama-3.2-1B and Qwen-2.5-7B³⁶ using the same SFT workflow (including identical hyperparameters) that we developed for SmileyLlama. Supplementary Table 1 finds that the GuacaMol benchmark results did not change substantially between SmileyLlama and SmileyQwen2.5-7B. We also find, through inspection of SmileyLlama-1B and SmileyLlama-3B, that validity increases with parameter count, while novelty, uniqueness, and the match between the training distribution and the distribution of generated molecules remain largely unchanged.

Figure 1 shows that molecules generated by SmileyLlama agree very well with ChEMBL across a diverse property set. The Uniform Manifold Approximation and Projection (UMAP) plot in Fig. 1a, a popular visualization tool used in drug discovery, shows that SmileyLlama generates molecules in every well-represented region of the chemical space of ChEMBL. We also consider the distribution of molecular properties of interest to medicinal chemistry in Fig. 1b, where the KL-divergence values indicate that all properties are in strong agreement between SmileyLlama-generated molecules and ChEMBL molecules, and are comparable to those of other models, as reflected by low KL divergence in GuacaMol (Table 1 and Supplementary Figs. 3 and 4). Furthermore, small percentages of undesirable molecular scaffolds are present in the ChEMBL training data itself¹⁴, but Supplementary Table 2 shows that SmileyLlama and most robust CLMs do not oversample these unviable chemical structures. Finally, while training was conducted at T = 1.0, exploration of the temperature used at inference on the GuacaMol benchmark (Supplementary Fig. 2) suggests that this temperature is adequate for all tests described in the ‘Results’.

**Fig. 1: Distribution comparisons for different properties of the generated molecules from SmileyLlama (blue) with molecules from the training dataset from ChEMBL (gold).**

Property specification using SmileyLlama under SFT

In Table 2, we show the average percentage of valid, distinct SMILES strings generated for a complete panel of molecular property tasks with SFT. This benchmark is distinct from other conditional molecule generation benchmarks¹⁸ in that we are testing SmileyLlama’s ability to robustly generate molecules with properties in value ranges rather than a specific value. This is of interest to medicinal chemists where numerical ranges of Lipinski violations or hydrogen bond donors and acceptors (and others) are used during chemical exploration. In addition, LLMs tend to struggle with numbers that have many degrees of precision and must be split into several tokens³⁷. Hence, we did not represent this category in the prompt during training.

Table 2 Percentage of valid, distinct generated molecules over a panel of tasks using SmileyLlama


Overall, the SmileyLlama model does very well on tasks on which it was trained through the engineered prompt, especially when contrasted with the model resulting from the ‘prompt ablation’ experiment in Table 2. We note that SmileyLlama can also be run at lower inference temperatures, which can further improve the SFT predictions. Although all individual properties were present in the training data, some were underrepresented, such as the Lipinski rule-of-five, the presence of macrocycles, and certain categories of warhead-related SMARTS and Enamine substructures, resulting in more moderate performance for these categories. As expected, SmileyLlama does poorly on tasks involving exact numerical specifications. More encouragingly, SmileyLlama performs well on compound tasks such as generating molecules similar to existing leads, that is, ‘scaffold hopping’, R-group modification and/or structure-based design to grow molecules from ligand fragments. Figure 2a shows an example of the SmileyLlama model’s ability to generate molecules from all 320 substructures in the Enamine database³⁸ that follow the Lipinski rule-of-five³⁹, which encompasses most of the molecular properties with ranges listed in Table 2.

**Fig. 2: Conditional generation with SmileyLlama for fragment growth and before and after DPO compared with ChEMBL.**

We compare SmileyLlama with a model resulting from an ablation study on the efficacy of prompting. We study this by removing all indications of molecular properties from all of the prompts in the dataset used to train SmileyLlama; each molecule from ChEMBL is treated as a completion to the same prompt, namely the prompt used for SmileyLlama when no molecular properties are specified. When we run SFT with exactly the same hyperparameters as SmileyLlama, we find that the ablated model performs quite poorly in comparison to SmileyLlama on this benchmark, achieving 90+% performance on only three tasks. This becomes especially pronounced when the properties are rarely found in the data, such as the presence of a macrocycle or a warhead-related SMARTS pattern. The stark contrast in performance highlights the necessity of our prompt engineering scheme: we cannot rely solely on the knowledge of the foundation model when fine-tuning for chemical tasks.

Property specification using SmileyLlama under DPO

While SmileyLlama typically performs well on tasks it was trained on using engineered prompts, and can still perform adequately when queried with prompts different from those it was trained on, it can be further optimized for specific tasks using DPO. DPO’s most popular application has been in improving the responses of LLM-derived chatbots, but it has also found use in improving the outputs of CLMs⁴⁰ and avoiding the need to separately train a reward model²⁶. Here, DPO provides a way to further optimize the model by pairing desirable responses with undesirable responses. The model’s weights are then updated to be more likely to produce the ‘winner’ of the pairing and less likely to produce the ‘loser’ of the pairing. We generated our dataset by randomly pairing unsuccessful attempts at generating structures with successful attempts for each task in Table 2.
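To make the winner/loser update concrete, the standard DPO objective²⁶ for a single pair can be written as the negative log-sigmoid of a scaled difference of log-probability ratios between the trained policy and a frozen reference model. The sketch below evaluates that loss from four sequence log-probabilities; the function name is made up for illustration, and the default `beta=0.1` is an arbitrary illustrative value, not a hyperparameter reported in this article.

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) pair.

    logp_w, logp_l: log-probabilities of winner and loser under the policy.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls below log(2) once the policy prefers the winner more
# strongly than the reference model does.
print(dpo_pair_loss(-5.0, -15.0, -10.0, -10.0))
```

Minimizing this quantity raises the policy's probability of the winner relative to the reference and lowers that of the loser, which matches the qualitative description in the text.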

Optimizing SmileyLlama with DPO substantially improved adherence to the prompt across nearly all tasks, as seen in Table 2 and Fig. 2b. Note that, while DPO does cause the model to more robustly obey the rules in the prompt, it also shifts and narrows the property distribution compared to the training set and appears to be largely insensitive to temperature. SmileyLlama without DPO, on the other hand, occasionally does not obey the prompt but more faithfully reproduces the distribution of properties found in the filtered subset of ChEMBL molecules that satisfy Lipinski’s rules. In the context of drug discovery, SFT is primarily beneficial for early exploration of chemical space, whereas DPO is a type of constrained optimization that limits generated molecules to desired subclasses specified by the user.

Binding affinity to protein active sites with SmileyLlama/iMiner

The tests performed in previous sections do not take advantage of the 3D structural information of a putative drug nor its shape and molecular compatibility with a target protein active site. Hence, we use SmileyLlama augmented with DPO to generate unique and valid ligands that undergo further optimization for binding to a specific protein target when embedded in the iMiner framework¹⁴. iMiner combined with SmileyLlama is designed to generate novel inhibitor molecules for target proteins by combining deep reinforcement learning^41,42 with real-time 3D molecular docking using AutoDock Vina⁴³, thereby simultaneously creating chemical novelty while constraining molecules for shape and molecular compatibility with target active sites. Further details of the iMiner reinforcement learning model have been published elsewhere¹⁴ and are briefly summarized in the ‘iMiner reinforcement learning with SmileyLlama’ section in the Methods. To validate the effectiveness of SmileyLlama in the iMiner context, we generate inhibitor molecules for MPro, an enzyme whose function is essential to the SARS2 lifecycle⁴⁴. MPro has readily available experimental 3D structures^44,45, which provide the information needed for structure-based ligand design.

For the unconditional de novo generation case, SmileyLlama learns the user prompt ‘Output a SMILES string for a drug like molecule with the following properties: High SARS2PRO’, which pertains to minimizing the AutoDock Vina score while maximizing the drug likeliness score (S_DL) of the original iMiner reward function¹⁴. Figure 3a compares the docking scores of the original iMiner algorithm against SmileyLlama as a function of epoch number and of the number of generated molecules per iteration. We first notice an improved data efficiency compared to iMiner: SmileyLlama requires only ~25% of the epochs to reach a similar level of improved docking score. Furthermore, iMiner’s diversity crashes with more epochs, which explains the sharpening peaks in later iterations (Fig. 3) and is quantified further in Supplementary Fig. 2 against the GuacaMol benchmarks. This simply reflects convergence in the docking score, that is, there are fewer novel molecules as the docking score reaches the highest values. By contrast, SmileyLlama maintains more diversity while greatly improving the docking score relative to iMiner (Fig. 3a), with minor degradation in validity compared with iMiner (Supplementary Fig. 5).

**Fig. 3: Comparison of the shift in docking score distributions for iMiner compared with SmileyLlama over optimization epochs as illustrated for SARS2 MPro.**

Figure 3b shows the property distributions of the final optimized set of novel molecules from SmileyLlama from the above prompt. While the property distributions are satisfactory for the number of hydrogen bond donors and acceptors, the molecular weight (MW) and logP results do not conform to drug-like values. This indicates some inadequacy of the iMiner reward function, such that the CLM would require a reweighting and/or new terms in the loss/reward function, other hyperparameter tuning and/or expensive retraining. However, a unique advantage of SmileyLlama is that the distribution of generated molecules’ properties can be shifted using nothing more than prompt engineering, with no retraining required. Figure 3b shows that combining prompts such as ‘Output a SMILES string for a drug like molecule with the following properties: High SARS2PRO, <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 molecular weight, <= 5 logP (High SARS2Pro+Ro5)’ improves properties such as MW and logP and drug-likeness scores substantially, with some expected loss in high docking scores because smaller molecules make fewer intermolecular interactions.

Figure 4 and Supplementary Fig. 6 provide a set of novel molecules from SmileyLlama docked in the MPro active site with the two engineered prompts ‘High SARS2Pro’ and ‘High SARS2Pro+Ro5’. Two of the higher-scoring molecules resemble the variations of the perampanel drug with the trefoil structure, which are tested inhibitors optimized by the Jorgensen group⁴⁶. However, unlike the molecules from their study that consistently retained the central pyridinone ring⁴⁶, SmileyLlama molecules have replaced the trefoil hub with the pyrimidine functional group (Fig. 4c). Higher docking scores are found for quite different drug scaffolds (Fig. 4a,b), but in all cases there is no notable homology match found in the Therapeutic Target Database⁴⁷. This would indicate that the generative capabilities of SmileyLlama are robust and outside of the pretrained Llama model. Finally, the proposed drugs are synthetically accessible⁴⁸, as indicated by an average synthetic accessibility (SA) score of approximately 3. Precise details can be found in the ‘example_molecules.csv’ file in the Supplementary Data.

**Fig. 4: SmileyLlama de novo generated molecules in the active site of SARS2 MPro.**

SmileyLlama outside of chemical language modeling

While SFT and DPO alter Llama in the creation of SmileyLlama, we find that SmileyLlama can still converse in English if it is prompted to do so, and some sample conversations are included in the ‘SmileyLlama outside of chemical language modeling’ section in the Methods. As a more quantitative measure of its residual capabilities, we evaluate SmileyLlama’s performance using the Language Model Evaluation Harness on the MMLU, GPQA, Math-Hard, and MMLU-Pro benchmarks^49,50,51,52,53. Supplementary Table 3 and Supplementary Fig. 7 show that SmileyLlama generally performs worse on moral scenarios and, interestingly, also performs worse on chemistry-related subjects than Llama. This is probably due in part to the tendency for SmileyLlama to complete prompts relating to chemistry with a SMILES string. In addition, accuracy errors in the MMLU tests have also been noted recently⁵⁴, and thus SmileyLlama’s degraded performance in chemistry may be partly an artifact of poorly designed evaluation benchmarks. Overall, this result is somewhat encouraging, because it implies the possibility that LLM-derived CLMs can inherit and take advantage of the natural language processing ability of their foundation model. SmileyLlama already does this: we can steer the properties of the molecules it generates and the chemical space it explores using natural language prompts while still retaining some ability to process nonchemical natural language. However, more work is required to develop SmileyLlama as an additional capability of an LLM, which may be achievable with larger foundation models.

Discussion

Our study clarifies a few crucial points for CLMs derived from LLMs going forward. First, it is not necessary to pretrain a specialized model on chemistry-specific text to generate molecules from a text description; a much less resource-intensive SFT training run on prompt-following using a dataset of a few million molecules with a commodity LLM is sufficient to achieve this. Second, DPO provides another resource-efficient way of optimizing the model to produce molecules that score well on a targeted objective without needing in-context examples, instead relying on the generative nature of the model itself for good and bad examples. A corollary to this is the finding that SmileyLlama can combine its knowledge gained during single-objective optimization to perform well at a task specifying multiple objectives, elicited by combining the prompts (as opposed to requiring training on both prompts), which is a welcome outcome. Even so, there are still limitations to and trade-offs within the SmileyLlama framework and within our investigation for drug discovery. Good drug candidates must also avoid ‘off-target effects’ and/or be robust to mutation of the protein or virus, among other downstream requirements. While SmileyLlama was not explicitly optimized for generating molecules with these qualities in this work, the DPO framework laid out here should be extensible to optimizing molecules for these characteristics. At the same time, while DPO improves adherence to the prompt, it does so at the cost of narrowing the distribution of properties or diversity, which may not be desirable in all application areas or early stages of discovery. Furthermore, SmileyLlama still struggles in data-poor regimes, for example in the task of generating macrocycles.

The prompting and optimization framework for modifying LLMs to explore chemical space broadly or to narrow the search to specific regions shown here could also be leveraged for molecular design outside of drug discovery, such as the use of SMILES for elaborating on transition metal complexes⁵⁵. One could also imagine that casting a chemical problem as a linguistic construct could enable other applications, such as our recent work on chemical synthesis²⁸. As with many of the fields touched by LLMs this decade, the newly opened frontier of possibility in chemistry is as vast as it is exciting.

Methods

Details of properties for fine-tuning

Overview of selected properties for fine-tuning

When fine-tuning Llama to generate drug-like molecules, we carefully assess various design choices and proceed with the following properties, emphasizing those that medicinal chemists would consider when proposing de novo drug molecules. We categorized and summarized all 12 properties into 4 subgroups as follows.

Physiochemical properties. Absorption, distribution, metabolism and excretion (ADME) are the crucial criteria to quantify the localization and concentration of drug molecules within the body after administration. As a result, we build on the list of properties proposed in the classical Lipinski’s rule-of-five³⁹ with some modern additions such as TPSA to generate drug-like molecules that could demonstrate decent ADME.
- Number of hydrogen bond donors (#HBD)
- Number of hydrogen bond acceptors (#HBA)
- MW
- log of partition coefficient (logP)
- TPSA
- Fraction of sp³-hybridized carbon atoms (Fsp³)

Structure flexibility features. Binding sites within a targeted biomolecule (most often a protein) display by nature complex 3D geometry, with key potential sites of drug–target interactions (amino acid side chains, as an example) somewhat fixed in space. The protein, however, has a dynamic structure, and even the binding pocket undergoes changes in shape. Drug-like molecules need to be sufficiently rigid to enable efficient interactions with their target protein, including, in most cases, a high degree of selectivity over corresponding interactions with related proteins. Perhaps less intuitive is that drug-like molecules must be flexible enough to maintain those interactions as the protein adapts its conformation. There is a ‘Goldilocks principle’ at play, where too rigid or too flexible are each undesired extremes. Here, we chose the following two properties to account for the flexibility aspect.
- Number of rotatable bonds (#rot)
- Whether the molecule contains a macrocycle (defined as an eight-membered ring or larger)

Pattern-based features. In practical drug discovery, there are always some key patterns and/or scaffolds that medicinal chemists would like to hold onto or get rid of. For instance, in the lead optimization phase, retaining the key moiety and the desired chemical formula is essential. Meanwhile, avoiding chemically unstable groups, PAINS molecules⁵⁶ and molecules that would trigger structural alerts could increase the chance of success in development. Therefore, we have the following three properties for fine-tuning.
- Avoidance of undesirable chemical patterns
- Retention of specified substructure (between 50 Da and 250 Da in MW)
- Chemical formula

Covalent warhead feature. Drugs can be broadly categorized into noncovalent and covalent drugs, depending on whether the drug reacts with its target. That is, an electrophilic group of a covalent inhibitor might form a bond with a nucleophilic amino acid side chain of its target protein. The reactive functional group of a covalent inhibitor is called a warhead. While most drugs are noncovalent, either can be desired. To give the model the ability to generate covalent binders, we also curated common covalent warhead-related SMARTS patterns from the Enamine fragment library³⁸ to indicate whether our generated molecules have the capacity to covalently bind to the target or not.
- Whether the molecule contains common covalent warhead-related SMARTS patterns, and which of these patterns appear in the molecule

Prompting options used in fine-tuning

To incorporate the properties mentioned above into the training, we used several prompting styles to meet the requirements of the intended uses.

For numerical properties, including all physiochemical properties and #rot, we prompted Llama by providing a specific range that the training molecule falls into for that specific category. All the cutoff values used for ranges are either commonly used standards in drug discovery or derived from the training distribution. Besides the range guidance, we added a prompt component that tells Llama exactly how many HBDs and HBAs the training molecule contains, enabling more nuanced generation. If a property falls into multiple valid ranges (for instance, four H-bond donors satisfies all of ≤4, ≤5 and ≤7), we randomly select one of these ranges to include in the prompt (if the property is chosen to be included in the prompt). It is important that the set of ranges for a property spans all possible molecules; otherwise, a prompt that omits information may bias the model toward generating molecules with property values outside the defined ranges. For example, if we never include information in the prompt about molecules with more than seven H-bond donors, but sometimes include the number of H-bond donors when it is seven or fewer, then omitting this information may bias the model toward assuming that the number of H-bond donors is greater than seven. Doing this during training would bias results during inference. This is the same reason we sometimes explicitly specify undesirable properties in the prompt, such as the presence of bad SMARTS patterns. If the random number generator decided that a prompt should contain a substructure but the SMILES in question did not have any BRICS substructures, we added ‘no BRICS substructure’ to the list of properties in lieu of a substructure.

For the remaining categorical properties, we used a combination of RDKit modules, SMILES strings and SMILES arbitrary target specification (SMARTS) strings to recognize whether certain properties or chemical patterns are present in the training input. Unlike the objective of containing a scaffold exactly, chemical pattern avoidance and covalent warhead recognition require matching more general substructures and/or functional groups. Here, we used SMARTS strings as our representation because of their ability to match general chemical patterns. More details about the specific SMARTS patterns used are given later in this section.

Below is a detailed list of the possible components that could appear in a training prompt.

N H-bond donors, N = ≤3, ≤4, ≤5, ≤7, >7
N H-bond acceptors, N = ≤3, ≤4, ≤5, ≤10, ≤15, >15
N MW, N = ≤300, ≤400, ≤500, ≤600, >600
N logP, N = ≤3, ≤4, ≤5, ≤6, >6
N rotatable bonds, N = ≤7, ≤10, >10
N fraction sp³, N = <0.4, >0.4, >0.5, >0.6
N TPSA, N = ≤90, ≤140, ≤200, >200
a macrocycle, no macrocycles
has bad SMARTS, lacks bad SMARTS
has covalent warheads, lacks covalent warheads
substructure of *a_smiles_string*
a chemical formula of *formula*
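The range-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `UPPER_BOUNDS`, `satisfied_ranges` and `sample_range`, the exact phrase format, and the assumption that the random choice is uniform are all hypothetical. The cutoffs are copied from the list above; fraction sp³ is left out because its ranges do not follow the same "≤ cutoff, otherwise > largest cutoff" scheme.

```python
import random

# Cutoffs copied from the list above. Each property also has an open-ended
# "> largest cutoff" bucket, so every possible value falls into some range.
UPPER_BOUNDS = {
    "H-bond donors": [3, 4, 5, 7],
    "H-bond acceptors": [3, 4, 5, 10, 15],
    "MW": [300, 400, 500, 600],
    "logP": [3, 4, 5, 6],
    "rotatable bonds": [7, 10],
    "TPSA": [90, 140, 200],
}


def satisfied_ranges(name, value):
    """Return every range phrase that `value` satisfies for property `name`."""
    bounds = UPPER_BOUNDS[name]
    phrases = [f"<= {b} {name}" for b in bounds if value <= b]
    if value > bounds[-1]:
        # Open-ended bucket: guarantees that the ranges span all molecules.
        phrases.append(f"> {bounds[-1]} {name}")
    return phrases


def sample_range(name, value, rng=random):
    """Pick one satisfied range at random (uniform choice is an assumption)."""
    return rng.choice(satisfied_ranges(name, value))


# Four H-bond donors satisfies <= 4, <= 5 and <= 7, as in the text.
print(satisfied_ranges("H-bond donors", 4))
# Nine donors only falls into the open-ended bucket.
print(satisfied_ranges("H-bond donors", 9))
```

Because the final bucket is open-ended, `satisfied_ranges` never returns an empty list for these properties, which mirrors the requirement that the ranges span all possible molecules.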

SMARTS patterns used to identify bad chemical groups

Li et al. identified a list of bad chemical patterns that exist in the ChEMBL database and that negatively affect compound generation¹⁴. In this work, we used the same list of SMARTS patterns to avoid these bad patterns, which include cyclopentadienes, cyclopentadiene ylidenes, aromaticity-breaking tautomers, antiaromatic systems, unstable halogen–heteroatom bonds, unstable fused rings, allenic systems, thiazyl linkages and peroxide bonds. In Supplementary Table 2, we also present the frequency with which undesirable chemical groups are sampled in ChEMBL and across different generative models.

[C^2]1=[C^2]-[C^2]=[C^2]~[C;!d4]~[C;!^2;d2]1
[C^2]1~[C^2]~[C^2]~[C^2]~[C;!^2;d2]~[N]1
[#6^2]1~[#6^2]~[#6^3;!d4]~[#6^2]2~[#6^2]~[#6^2]~[#6^2]~[#6^2](~[*])~[#6^2]~2~[#6^2]~1
[#6]1(=[*])[#6]=[#6][#6]=[#6]1
[#6]1=[#6][R2-]=[R2-]1
[#6^2]1~[#6^2]~[#6^2]~[#6^2]~[#6^1]~[#6^1]~1
[#7,#8,#16]-[#9,#17,#35,#53]
[r3,r4]@[r5,r6]
[*]=[#6,#7,#8]=[*]
[#7,#16]=[#16]
[#8]-[#8]

In addition to the patterns listed above, we use the following SMARTS patterns to ensure that generated pyrroles take one of the following correct forms.

[N^2]1~[C,N;^2](=[*])~[C,N;^2]~[C,N;^2]~[C^3]1
[N^2]1~[C,N;^2]~[C,N;^2](=[*])~[C,N;^2]~[C;^3]1
[N^2]1~[C,N;^2]~[C,N;^2]~[C,N;^2](=[*])~[C;^3]1
[C,N;^2](=[*])1~[N;^2]~[C,N;^2]~[C,N;^2]~[C;^3]1
[C,N;^2]1~[N;^2]~[C,N;^2](=[*])~[C,N;^2]~[C;^3]1
[C,N;^2]1~[N;^2]~[C,N;^2]~[C,N;^2](=[*])~[C;^3]1

SMARTS patterns used to encode common covalent warhead-related functional groups

Common covalent warheads are extracted from the Enamine Covalent Screening and Covalent Fragment Library³⁸. The list of SMARTS strings is shown below.

sulfonyl fluorides: [#16](=[#8])(=[#8])-[#9]
chloroacetamides: [#8]=[#6](-[#6]-[#17])-[#7]
cyanoacrylamides: [#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
epoxides: [#6]1-[#6]-[#8]-1
aziridines: [#6]1-[#6]-[#7]-1
disulfides: [#16]-[#16]
aldehydes: [#6](=[#8])-[#1]
vinyl sulfones: [#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
boronic acids/esters: [#6]-[#5](-[#8])-[#8]
acrylamides: [#6]=[#6]-[#6](=[#8])-[#7]
cyanamides: [#6]-[#7](-[#6]#[#7])-[#6]
chlorofluoroacetamides: [#7]-[#6](=[#8])-[#6](-[#9])-[#17]
butynamides: [#6]#[#6]-[#6](=[#8])-[#7]-[#6]
chloropropionamides: [#7]-[#6](=[#8])-[#6](-[#6])-[#17]
fluorosulfates: [#8]=[#16](=[#8])(-[#9])-[#8]
beta lactams: [#7]1-[#6]-[#6]-[#6]-1=[#8]

To assess SmileyLlama’s performance on generating molecules with specified properties, as in Table 2, we investigate SmileyLlama’s performance on the following 387 tasks, grouped into the families for which the averages are shown in the table.

exactly k H-bond donors, from k = 0 to k = 5
exactly k H-bond acceptors, from k = 0 to k = 10
≤k H-bond donors, for k = 3, 4, 5, 7
≤k H-bond acceptors, for k = 3, 4, 5, 10, 15
≤k MW, for k = 300, 400, 500, 600
≤k logP, for k = 3, 4, 5, 6
≤7, ≤10, >10 rotatable bonds
>0.4, >0.5, >0.6, <0.4 fraction sp³
≤90, ≤140, ≤200 TPSA
a macrocycle
no macrocycles
has bad SMARTS (not shown in table but included for completeness)
lacks bad SMARTS
lacks covalent warheads
has covalent warheads (one for each of the 16 covalent warheads in the section above)
a substructure of (one of each of 320 Enamine fragments³⁸)
≤5 H-bond donors, ≤10 H-bond acceptors, ≤500 Molecular weight, ≤5 LogP
≤3 H-bond donors, ≤3 H-bond acceptors, ≤300 Molecular weight, ≤3 LogP
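As a consistency check on the total of 387 tasks stated above, the task families can be tallied directly from the list. The snippet below is only an arithmetic sketch; the dictionary keys are informal labels introduced here, and the counts are read off the list entries (for example, k = 0 to 5 gives six tasks).

```python
# Number of tasks in each family, read off the list above.
task_counts = {
    "exactly k H-bond donors, k = 0..5": 6,
    "exactly k H-bond acceptors, k = 0..10": 11,
    "<= k H-bond donors": 4,
    "<= k H-bond acceptors": 5,
    "<= k MW": 4,
    "<= k logP": 4,
    "rotatable bonds": 3,
    "fraction sp3": 4,
    "TPSA": 3,
    "a macrocycle": 1,
    "no macrocycles": 1,
    "has bad SMARTS": 1,
    "lacks bad SMARTS": 1,
    "lacks covalent warheads": 1,
    "has covalent warheads (one per warhead)": 16,
    "substructure of an Enamine fragment": 320,
    "combined multi-property prompts (last two entries)": 2,
}

total_tasks = sum(task_counts.values())
print(total_tasks)  # 387
```

The substructure family accounts for 320 of the 387 tasks, so per-family averages (as reported in Table 2) are more informative than a single pooled average.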

Prompt formats and examples

We assess the ability of Llama to generate SMILES strings as a baseline. Below are examples of system and user prompts to illustrate the methods we used to prompt Llama and SmileyLlama. The Llama prompts are constructed using the Llama instruction-tuning format, while the SmileyLlama, robotic prompt, and blank prompts use the Alpaca format to reproduce the setup used in the most recent supervised fine-tuning of the foundation model.

For the case of Llama zero-shot, we use the following format, with no prefilled responses, when generating data for the GuacaMol benchmark. We chose a user prompt asking for ‘no other output’ because, in our informal experiments, without this explicit instruction Llama would often respond indirectly, including English text discussing SMILES strings rather than generating only SMILES strings.

System prompt:

‘You love and excel at generating SMILES strings of drug-like molecules’

User prompt: