Abstract
Machine learning (ML) offers considerable promise for the design of new molecules and materials. In real-world applications, the design problem is often domain-specific, and suffers from insufficient data, particularly labeled data, for ML training. In this study, we report a data-efficient, deep-learning framework for molecular discovery that integrates a coarse-grained functional-group representation with a self-attention mechanism to capture intricate chemical interactions. Our approach exploits group-contribution concepts to create a graph-based intermediate representation of molecules, serving as a low-dimensional embedding that substantially reduces the data demands typically required for training. Using a self-attention mechanism to learn the subtle but highly relevant chemical context of functional groups, the method proposed here consistently outperforms existing approaches for predictions of multiple thermophysical properties. In a case study focused on adhesive polymer monomers, we train on a limited dataset comprising only 6,000 unlabeled and 600 labeled monomers. The resulting chemistry prediction model achieves over 92% accuracy in forecasting properties directly from SMILES strings, exceeding the performance of current state-of-the-art techniques. Furthermore, the latent molecular embedding is invertible, enabling the design pipeline to automatically generate new monomers from the learned chemical subspace. We illustrate this functionality by targeting several properties, including high and low glass transition temperatures (Tg), and demonstrate that our model can identify new candidates with values that surpass those in the training set. The ease with which the proposed framework navigates both chemical diversity and data scarcity offers a promising route to accelerate and broaden the search for functional materials.
Introduction
Molecular design is at the core of modern science and engineering1, with wide applications that range from the development of new drugs2,3,4,5,6 to the discovery of new functional and sustainable materials7,8,9,10,11. Although considerable progress has been made over decades of sustained effort, it continues to be a daunting endeavor. The construction of a molecule with specific target properties involves a combinatorial problem that consists of selecting the correct atoms and connecting them in an appropriate manner. The available chemical space for molecular design grows exponentially with the molecular size12. However, relevant candidates only populate a very small portion of that space. To identify optimal choices, two crucial questions must be addressed: the relationship between different molecular structures and the dependence of molecular properties on them. This presents the inherent challenge of exploring chemical space, further exacerbating the curse of dimensionality that pervades molecular design.
Molecular embedding13,14,15,16 can facilitate the navigation of chemical space. By evaluating a selection of molecular features, we can encode a molecule M by its corresponding feature vector h, denoted as:
\({\bf{h}}={\mathcal{E}}(M)\)  (1)
This mapping creates a mathematical realization of the chemical space, where the differences between molecules can be quantified via the distance ∥hi − hj∥, and molecular properties can be inferred from a function y = f(h). A good molecular embedding should satisfy the following two requirements. First, it must be chemically meaningful. Molecules with similar chemistry should be arranged close to each other, so that primarily relevant regions can be explored. Second, it should be informative. The feature vector should contain key information for the prediction of molecular properties, which in turn can provide guidance for the optimization of molecular structure.
Molecular fingerprints17,18 are often used in traditional cheminformatics. They are prescribed descriptors that record the statistics of the different chemical groups in a molecule. This type of embedding organizes molecules in chemical space based on their local structures, which play a key role in determining molecular properties. Hence, if a new molecule with a new set of properties were sought, new candidates could be sampled from existing molecules simply by replacing chemical groups and then screening the proposed constructs using chemistry prediction models, such as Quantitative Structure-Activity Relationships (QSARs)19,20. However, such an approach is limited to the exploration of the space in the near vicinity of individual known molecules, which is what local modifications allow. Furthermore, the quality of the resulting designs can be compromised by artifacts resulting from the interplay between chemical groups, especially their interconnectivity, which is disregarded by molecular fingerprints.
Recent advances in machine learning (ML) have enabled the extraction of molecular embedding from data21,22,23,24,25. A widely adopted ML scheme relies on an autoencoder26,27 architecture, where an encoder maps molecules to a continuous latent space, and a consecutive decoder tries to reconstruct them28,29. The schematic in Fig. 1 provides a simple description of the key concepts.
After being trained, an autoencoder can produce latent vectors that represent the global structure of a molecule. This embedding enables coherent nonlocal modifications of molecules. Due to the continuity of the latent space, interpolations can be made between different molecules to acquire combined properties, and directed optimization can be performed through gradient descent. However, since this embedding is primarily designed for molecular reconstruction, it does not necessarily correlate well with molecular properties. Molecular reconstruction focuses on the connectivity of atoms or chemical groups within a molecule, whereas molecular chemistry is also influenced by the interactions between different molecules. To enhance such correlations, the autoencoder should be jointly trained with an additional chemistry prediction model that maps latent vectors onto properties of interest. Doing so would require a large amount of labeled data, which is typically unavailable. In real-world applications, the design problem is often domain-specific; there is a preferred class of molecules, and the subset for which properties have been measured or are available is very limited.
In this work, we introduce a machine-learning pipeline for domain-specific molecular design, anchored by a functional-group-based coarse-graining strategy. The pipeline consists of a hierarchical coarse-grained graph autoencoder to generate relevant candidates, and a chemistry prediction model to efficiently screen the proposed structures and select optimal choices. A key innovation here is the transfer of the self-attention mechanism from natural language processing, where it captures long-range dependencies between tokens in a sequence, to the realm of macromolecules, whose functional groups exhibit similarly intricate spatial and chemical interactions. We anticipate broad applicability of this framework wherever faithful molecular generation is essential, from pharmaceutical discovery, where scaffold and functional-group placements critically affect bioactivity, to materials science, where chain-like and branched architectures frequently govern mechanical and thermal properties. Crucially, by focusing on coarse graining based on functional groups, our hierarchical approach remains data-efficient, allowing robust design and analysis even under data-scarce conditions.
Fig. 1 | The encoder \({\mathcal{E}}\) maps discrete molecules to a continuous latent space, capturing global structural and chemical features. The decoder \({{\mathcal{E}}}^{-1}\) reconstructs the molecular graph from the latent space. This latent embedding enables the prediction of molecular properties via a learned function f, and gradients ∇ f can be exploited for property-driven molecular optimization. The continuity of the latent space allows for smooth interpolations and nonlocal molecular transformations.
Results
Coarse-grained graph autoencoder
Depending on the choice of molecular representation, there are two popular ways to create an autoencoder. By representing a molecule as a SMILES30 string, we can treat molecular embedding as a natural language processing problem and build an autoencoder using a sequence model28. However, because of the one-dimensional nature of a text string, special tricks are always needed to account for the three-dimensional topology of a molecule, especially for features like rings, branches, and stereoisomerism. This poses unnecessary obstacles to the application of string-based autoencoders. Alternatively, by representing a molecule as a graph of atoms, we can naturally preserve molecular topology and embed molecules using graph neural networks. Several autoencoder frameworks29,31,32 have been developed in this context. While these atom-graph-based autoencoders have established important foundations, they face well-recognized challenges: low chemical validity rates due to unconstrained decoding, sensitivity to graph isomorphism, and scalability issues as molecular size increases33.
The construction of molecules using structural motifs provides an effective means for the design of large molecules having a complex topology. Researchers have identified ~100 functional groups, which are local structures that underlie the key chemical properties of molecules. Notably, most synthesizable molecules can be deconstructed into these structural motifs. Thus, this small set of common functional groups (as shown in Fig. 2a) can serve as a standard vocabulary for molecular design. Compared with atoms, they enable a coarse-grained and chemically meaningful representation of a molecule, which simplifies the design process.
Fig. 2 | By identifying “elementary” functional groups, we can construct a hierarchical representation of a molecule, with an atom graph at the finer level and a motif graph at the coarser level. a Vocabulary of functional groups. The vocabulary is composed of 50 key chemical groups selected on the basis of group contribution theory, as well as all the ring structures that appear in the known molecules of interest. b Coarse-grained graph autoencoder. Graph neural networks are applied to both the atom and motif graphs to encode the local environment of individual nodes, which contains the nodes themselves and their neighbors. A multilayer perceptron (MLP) is introduced between two graphs to integrate the encoded information at the atom level into functional groups at the motif level. A variational connection is added to ensure the continuity of the latent space. Note that the decoder operates in an autoregressive manner. It reconstructs the molecule by iteratively generating a new motif and connecting it to the partial molecule built thus far. At each step, it needs an encoder to refresh the embedding of atoms and functional groups based on the current molecular structure. c Chemically meaningful embeddings from our autoencoder. This visualization demonstrates the capability of our autoencoder to categorize 6000 adhesive monomers into four clusters corresponding to their chemical classes: Methyl Methacrylate, Methacrylamide, Methyl Acrylate, and Acrylamide. For illustrative purposes, we use t-SNE to map ten-dimensional embeddings into a two-dimensional space. Even though the types of monomers were not provided during training, our autoencoder can automatically create embeddings that show clear separations based on inherent chemical properties. d Invertible embeddings across different representations. A comparison of our proposed functional-group-based autoencoder to SMILES- and atom-level graph-based approaches shows that molecules used in domain-specific materials often feature extended chains and fewer rings compared to small-molecule drugs. While our coarse-grained model can maintain a 95% reconstruction accuracy under these conditions, SMILES and atom-graph methods achieve around 60% or lower.
Expanding upon recent advances in hierarchical encoder-decoder34 architectures for molecular graphs, we constructed our functional-group-based autoencoder around a multi-level representation of molecular structures. As illustrated in Fig. 2a, a molecule M can be represented with two levels of description: at the fine level, it forms an atom graph \({{\mathcal{G}}}^{{\rm{a}}}(M)\) composed of atoms ai and bonds bij; at the coarse level, it is also a motif graph \({{\mathcal{G}}}^{{\rm{f}}}(M)\) composed of functional groups Fu and their interconnectivity Euv; in between is the hierarchical mapping from each functional group Fu towards the corresponding atomic subgraph \({{\mathcal{G}}}^{{\rm{a}}}({F}_{u})\), which is easily accessible via the cheminformatics software RDKit35. Details are summarized below: (Throughout the paper, superscripts and subscripts are used to denote the hierarchical level and the node index within a graph, respectively.)
Coarse-grained graph representation
| Level | Graph | Nodes | Edges |
|---|---|---|---|
Motif level | \({{\mathcal{G}}}^{{\rm{f}}}(M)=\left({{\mathcal{V}}}^{{\rm{f}}}(M),{{\mathcal{E}}}^{{\rm{f}}}(M)\right)\) | \({{\mathcal{V}}}^{{\rm{f}}}(M)=\left\{{F}_{u}\,| \,{F}_{u}\in M\right\}\) | \({{\mathcal{E}}}^{{\rm{f}}}(M)=\left\{{E}_{uv}\,| \,{E}_{uv}\in M\right\}\) |
Motif → Atom | \({{\mathcal{G}}}^{{\rm{a}}}({F}_{u})=\left({{\mathcal{V}}}^{{\rm{a}}}({F}_{u}),{{\mathcal{E}}}^{{\rm{a}}}({F}_{u})\right)\) | \({{\mathcal{V}}}^{{\rm{a}}}({F}_{u})=\left\{{a}_{i}\,| \,{a}_{i}\in {F}_{u}\right\}\) | \({{\mathcal{E}}}^{{\rm{a}}}({F}_{u})=\left\{{b}_{ij}\,| \,{b}_{ij}\in {F}_{u}\right\}\) |
Atom level | \({{\mathcal{G}}}^{{\rm{a}}}(M)=\left({{\mathcal{V}}}^{{\rm{a}}}(M),{{\mathcal{E}}}^{{\rm{a}}}(M)\right)\) | \({{\mathcal{V}}}^{{\rm{a}}}(M)=\left\{{a}_{i}\,| \,{a}_{i}\in M\right\}\) | \({{\mathcal{E}}}^{{\rm{a}}}(M)=\left\{{b}_{ij}\,| \,{b}_{ij}\in M\right\}\) |
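To make this hierarchy concrete, the sketch below assembles both levels for a single molecule using RDKit and networkx. It is illustrative only: the three SMARTS patterns stand in for the paper's 50-group vocabulary, and overlapping matches are left unresolved, which a production implementation would need to handle.

```python
from rdkit import Chem
import networkx as nx

# Illustrative subset of a functional-group vocabulary (SMARTS patterns);
# the actual dictionary in the paper holds ~50 groups plus all observed rings.
VOCAB = {
    "ester":    "[CX3](=O)[OX2][#6]",
    "amide":    "[CX3](=O)[NX3]",
    "hydroxyl": "[OX2H]",
}

def motif_graph(smiles: str) -> nx.Graph:
    """Coarse-grain a molecule: nodes are matched functional groups (and
    rings); edges link motifs that share an atom or are joined by a bond."""
    mol = Chem.MolFromSmiles(smiles)
    motifs = [(name, frozenset(match))
              for name, smarts in VOCAB.items()
              for match in mol.GetSubstructMatches(Chem.MolFromSmarts(smarts))]
    # Rings enter the vocabulary as self-contained motifs.
    motifs += [("ring", frozenset(r)) for r in mol.GetRingInfo().AtomRings()]
    g = nx.Graph()
    for u, (name, atoms) in enumerate(motifs):
        g.add_node(u, label=name, atoms=atoms)
    for u in range(len(motifs)):
        for v in range(u + 1, len(motifs)):
            a, b = motifs[u][1], motifs[v][1]
            joined = any(mol.GetBondBetweenAtoms(i, j) for i in a for j in b - a)
            if a & b or joined:
                g.add_edge(u, v)  # inter-group connectivity E_uv
    return g

print(motif_graph("CC(=O)OCCO").nodes(data=True))  # ester + hydroxyl motifs
```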
Molecular embedding is introduced by treating the generation of molecules as Bayesian inference:
\(P(M)=\int P(M\,|\,{{\bf{h}}}^{{\rm{m}}})\,P({{\bf{h}}}^{{\rm{m}}})\,{\rm{d}}{{\bf{h}}}^{{\rm{m}}}\)  (2)
P(hm) is a prior distribution of the embedding hm. In the following text, we will explain in more detail how to build an encoder to estimate the posterior distribution P(hm∣M) and a decoder to derive the conditional probability of reconstructing the same molecule P(M∣hm).
The encoder analyzes a molecule from the bottom up. First, a message-passing network (MPN) is used to encode the atom graph:
\(\left\{{{\bf{h}}}_{i}^{{\rm{a}}},{{\bf{h}}}_{ij}^{{\rm{a}}}\right\}={{\rm{MPN}}}^{{\rm{a}}}\left({{\mathcal{G}}}^{{\rm{a}}}(M);{{\bf{x}}}_{i}^{{\rm{a}}},{{\bf{x}}}_{ij}^{{\rm{a}}}\right)\)  (3)
Here the features of individual atoms and bonds, for instance, \({{\bf{x}}}_{i}^{{\rm{a}}}=\) (atom type, valence, formal charge) and \({{\bf{x}}}_{ij}^{{\rm{a}}}=\) (bond type, stereo-chemistry) are taken as inputs and shared with their neighbors through the message passing mechanism on the graph \({{\mathcal{G}}}^{{\rm{a}}}\). Then the embeddings of the atoms and bonds are derived to encode their local environment, denoted as \({{\bf{h}}}_{i}^{{\rm{a}}}\) and \({{\bf{h}}}_{ij}^{{\rm{a}}}\), which include their own properties and those of their neighbors, as well as the way in which they connect with each other. Second, we assemble the feature vectors of functional groups using a multi-layer perceptron (MLP):
\({{\bf{x}}}_{u}^{{\rm{f}}}={\rm{MLP}}\left({\bf{x}}({F}_{u}),\left\{{{\bf{h}}}_{i}^{{\rm{a}}}\,|\,i\in {{\mathcal{V}}}^{{\rm{a}}}({F}_{u})\right\}\right)\)  (4)
which consists of the type embedding of the functional group x(Fu) and the graph embedding of its atomic components \(\left\{{{\bf{h}}}_{i}^{{\rm{a}}}\,| \,i\in {{\mathcal{V}}}^{{\rm{a}}}({F}_{u})\right\}\). Third, we employ another MPN to encode the motif graph:
\(\left\{{{\bf{h}}}_{u}^{{\rm{f}}},{{\bf{h}}}_{uv}^{{\rm{f}}}\right\}={{\rm{MPN}}}^{{\rm{f}}}\left({{\mathcal{G}}}^{{\rm{f}}}(M);{{\bf{x}}}_{u}^{{\rm{f}}},{{\bf{x}}}_{uv}^{{\rm{f}}}\right)\)  (5)
where \({{\bf{h}}}_{u}^{{\rm{f}}}\) and \({{\bf{h}}}_{uv}^{{\rm{f}}}\) are the embeddings of individual functional groups. Lastly, a molecular embedding hm is sampled in a probabilistic manner:
\({{\bf{h}}}^{{\rm{m}}}\sim P({{\bf{h}}}^{{\rm{m}}}\,|\,M)={\mathcal{N}}\left(\mu ({{\bf{h}}}_{0}^{{\rm{f}}}),\,\exp \sigma ({{\bf{h}}}_{0}^{{\rm{f}}})\right)\)  (6)
where \({{\bf{h}}}_{0}^{{\rm{f}}}\) denotes the embedding of the root motif, a functional-group node assigned during graph construction. This assignment is deterministic for each molecule, ensuring reproducibility, but varies across molecules depending on their graph structure. Because the encoder is trained end-to-end, the latent distribution does not depend on the specific identity of the root motif, thereby avoiding systematic bias. The functions μ( ⋅ ) and σ( ⋅ ) map \({{\bf{h}}}_{0}^{{\rm{f}}}\) to the mean and log-variance of the Gaussian distribution, and the reparameterization trick enables a continuous, differentiable latent space.
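This variational step is the standard reparameterization trick. A minimal PyTorch sketch, with illustrative dimensions (the paper's latent space is ten-dimensional), might look as follows.

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Variational connection: map the root-motif embedding h0_f to a
    Gaussian posterior and draw h_m with the reparameterization trick."""
    def __init__(self, d_motif: int = 64, d_latent: int = 10):
        super().__init__()
        self.mu = nn.Linear(d_motif, d_latent)      # mean of the posterior
        self.logvar = nn.Linear(d_motif, d_latent)  # log-variance of the posterior

    def forward(self, h0_f: torch.Tensor):
        mu, logvar = self.mu(h0_f), self.logvar(h0_f)
        eps = torch.randn_like(mu)                   # eps ~ N(0, I)
        h_m = mu + torch.exp(0.5 * logvar) * eps     # differentiable sample
        return h_m, mu, logvar
```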
Once molecular embedding is achieved, the decoder can reconstruct the same molecule motif-by-motif. We model this as an auto-regressive process by factorizing the conditional probability in Eq. (2) into:
\(P(M\,|\,{{\bf{h}}}^{{\rm{m}}})={\prod }_{u}P({M}_{\le u}\,|\,{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})\)  (7)
where M≤u−1 = ⋃v≤u−1Fv and M≤u = ⋃v≤uFv denote the molecule before and after adding functional group Fu, respectively. At each step, the decoder only needs to predict which functional group Fu to choose and which bond bij to form for its attachment onto the partial molecule M≤u−1:
\(P({M}_{\le u}\,|\,{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})=P({F}_{u}\,|\,{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})\,P({b}_{ij}\,|\,{F}_{u},{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})\)  (8)
To quantify the condition, we apply the same encoder to analyze M≤u−1. Assuming that the useful information on the partial molecule for the prediction of the next motif is localized near its growing end, we use the embedding of the last motif \({{\bf{h}}}_{u-1}^{{\rm{f}}}\) to represent M≤u−1. The probability of choosing functional group Fu as the next motif can then be estimated as
\(P({F}_{u}\,|\,{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})={\rm{softmax}}\left({\rm{MLP}}({{\bf{h}}}_{u-1}^{{\rm{f}}},{{\bf{h}}}^{{\rm{m}}})\right)\)  (9)
Similarly, for the prediction of the attachment bond, the atom embeddings \(\left\{{{\bf{h}}}_{i}^{{\rm{a}}}\,| \,{a}_{i}\in {F}_{u}\right\}\) and \(\left\{{{\bf{h}}}_{j}^{{\rm{a}}}\,| \,{a}_{j}\in {F}_{u-1}\right\}\) are used to represent the condition in both Fu and M≤u−1. Then we estimate the probability of choosing bond bij as
\(P({b}_{ij}\,|\,{F}_{u},{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})={\rm{softmax}}\left({\rm{MLP}}({{\bf{h}}}_{i}^{{\rm{a}}},{{\bf{h}}}_{j}^{{\rm{a}}},{{\bf{h}}}^{{\rm{m}}})\right)\)  (10)
Since Fu is not yet attached to M≤u−1, here the embedding \({{\bf{h}}}_{i}^{{\rm{a}}}\) is obtained by applying the encoder to Fu alone.
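The decoding loop can be summarized in the following Python skeleton. All helpers (`encode`, `score_motif`, `score_bond`, and the partial-molecule interface) are hypothetical stand-ins for the trained networks realizing Eqs. (9)-(10), not the authors' implementation.

```python
import torch

def decode(h_m, encode, score_motif, score_bond, vocab, max_steps=30):
    """Skeleton of the autoregressive decoder: grow the molecule
    motif-by-motif, re-encoding the partial structure at each step."""
    partial = vocab.root()                          # seed with a root motif
    for _ in range(max_steps):
        h_atoms, h_motifs = encode(partial)         # refresh embeddings (Eqs. 3-5)
        logits = score_motif(h_motifs[-1], h_m)     # condition on last motif, Eq. (9)
        u = int(torch.distributions.Categorical(logits=logits).sample())
        if u == vocab.STOP:                         # special end-of-molecule token
            break
        motif = vocab[u]
        h_new, _ = encode(motif)                    # embed the detached motif alone
        bond = int(score_bond(h_new, h_atoms, h_m).argmax())  # Eq. (10)
        partial = partial.attach(motif, bond)       # form the attachment bond
    return partial
```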
We train our model by minimizing its evidence lower bound (ELBO), a common loss function for a variational autoencoder. It contains two parts \({{\mathcal{L}}}_{{\rm{ELBO}}}={{\mathcal{L}}}_{1}+\lambda {{\mathcal{L}}}_{2}\), namely a reconstruction loss that quantifies the cross-entropy between the encoder and the decoder
\({{\mathcal{L}}}_{1}=-{{\mathbb{E}}}_{{{\bf{h}}}^{{\rm{m}}}\sim P({{\bf{h}}}^{{\rm{m}}}| M)}\left[\log P(M\,|\,{{\bf{h}}}^{{\rm{m}}})\right]\)  (11)
and a regularizer that measures the Kullback-Leibler (KL) divergence between the prior and posterior distributions of molecular embedding to avoid overfitting
\({{\mathcal{L}}}_{2}={D}_{{\rm{KL}}}\left(P({{\bf{h}}}^{{\rm{m}}}\,|\,M)\,\|\,P({{\bf{h}}}^{{\rm{m}}})\right)\)  (12)
where we postulate the prior distribution \(P({{\bf{h}}}^{{\rm{m}}})={\mathcal{N}}({\bf{0}},\,{\bf{I}})\), and where the probability estimates by the encoder and the decoder, P(hm∣M) and P(M∣ hm), are computed using Eqs. (6)–(7).
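Both terms have standard closed forms. A PyTorch sketch, assuming the decoder's motif and bond choices are scored as logits and the posterior is the diagonal Gaussian of Eq. (6), is:

```python
import torch
import torch.nn.functional as F

def elbo_loss(motif_logits, motif_targets, bond_logits, bond_targets,
              mu, logvar, lam=0.1):
    """ELBO for the coarse-grained VAE (sketch; `lam` is the weight lambda
    on the KL term, a tunable hyperparameter not specified here)."""
    # L1: cross-entropy over the decoder's motif and bond choices.
    l1 = (F.cross_entropy(motif_logits, motif_targets)
          + F.cross_entropy(bond_logits, bond_targets))
    # L2: closed-form KL between N(mu, sigma^2) and the N(0, I) prior.
    l2 = -0.5 * torch.mean(
        torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return l1 + lam * l2
```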
To highlight the capability of handling domain-specific data, we tested the autoencoder on acrylate-based adhesive materials obtained from experiments. This dataset comprises 6,000 known monomers drawn from four different monomer classes. Because the autoencoder is an unsupervised method, no molecular properties or labels are required. Nonetheless, the t-SNE projection of the learned latent space \({\mathcal{S}}({{\bf{h}}}^{{\rm{m}}})\) presents a clear clustering of monomers according to their chemical types, as shown in Fig. 2c. Such automatic grouping of monomers underscores the practical utility of our model in guiding targeted molecule design for industrial adhesive applications.
Besides furnishing a chemically meaningful representation, our molecular embedding also remains fully invertible, a vital feature for generative tasks. Specifically, using the embedding of a molecule produced by the encoder, the decoder can reconstruct the same molecule with an accuracy of ~95%, substantially surpassing string-based and atom-graph-based autoencoders28,36 trained under comparable conditions (Table S1). Although previous studies29,34 reported reasonable reconstruction rates on broader chemical libraries, those architectures were tailored to relatively diverse molecules in which rings are prevalent. Here, in contrast, we seek to capture the richer chain- and branch-dominated chemistry of polymeric adhesives, where ring motifs are relatively scarce. In such data-limited domain-specific settings, we observed that traditional decomposition schemes, which often mine ring and branch substructures based purely on the frequency of occurrence, induce an unbalanced motif vocabulary heavily biased toward rings. This bias not only hampers the model’s ability to reconstruct chain-like polymers accurately, but also leads to less chemically interpretable embeddings as shown in Fig. S1.
Our approach addresses this bottleneck by deliberately constructing a motif vocabulary from functional groups, building on the established principles of group contribution. This strategy provides two main advantages. First, it limits our dictionary to functional groups that comprehensively span typical polymeric architectures, mitigating the overfitting to ring motifs. Second, by focusing on functional groups rather than purely structural motifs, the model inherently encodes information relevant to reactivity and physical properties, resulting in embeddings that more closely align with chemical intuition. The net effect is markedly superior reconstruction fidelity, particularly for the class of polymeric monomers under study. Such a design aligns well with the aims of a data-scarce, domain-focused molecular design, wherein the capture of specialized chemistries can be more important than achieving broad coverage of all possible small-molecule scaffolds. Consequently, the strong reconstruction accuracy reflects the capacity of our model to handle polymer-like structures with minimal data, highlighting the promise of coarse graining based on functional groups for improved generative performance.
Attention-aided chemistry prediction model
Our autoencoder imposes a relationship between different molecular structures by mapping them into a continuous latent space, where they are organized in a chemically meaningful manner. This allows for efficient sampling of chemically relevant candidates. To achieve directed design, an efficient approach is still needed to evaluate the candidate properties. Although high-throughput experiments or simulations can consider hundreds of molecular species at a reasonable cost, they are limited in terms of scalability. To address this challenge, we propose a chemistry prediction model capable of directly deriving thermophysical properties from molecular structures by using the self-attention mechanism.
The model is built on the same coarse-grained graph representation of a molecule, shown in Fig. 3a. Instead of being a simple readout of the molecular embedding, it analyzes the embedding of individual atoms and functional groups. Compared with the global structure of a molecule, the variation of those local structures is much more constrained. Therefore, training a regression model on the latter is more data-efficient. This is particularly useful for domain-specific design, where labeled data is limited.
Fig. 3 | a Chemistry-oriented molecular embedding. Instead of using the latent vector for molecular reconstruction, we develop another molecular embedding focused on chemical properties. It has the same architecture as the encoder mentioned above. The key difference lies in the use of global pooling to reduce graph size and extract molecular embedding with fixed length, rather than using the embedding of the root motif. Two types of pooling are employed here. A direct graph pooling with sigmoid rectification is used to summarize the contributions of individual nodes. A pooling layer with a self-attention mechanism is also used to summarize the contributions of node-node interactions. The chemistry-oriented molecule embedding is the concatenation of individual and interaction embeddings at both the atom and motif levels. We can then use the final embedding to predict molecular properties of interest. b Prediction performance. We apply this model to learn the dependence of monomer properties on their molecular structures. Six properties relevant to the design of adhesive materials are considered and obtained using automated molecular dynamics simulations. The model is trained on 450 labeled monomers (yellow dots) and tested on another 150 unseen monomers (other colored dots). Except for glass transition temperature, all other properties can be predicted with high accuracy (R2 > 0.92).
We first estimate the contributions of individual atoms and functional groups to molecular properties, by applying the following equation to the corresponding level of the graph hierarchy:
\({{\bf{c}}}_{i}^{{\rm{a}}}=\hat{{\bf{c}}}({{\bf{h}}}_{i}^{{\rm{a}}})+{\sum }_{j}{a}_{ij}\,\tilde{{\bf{c}}}({{\bf{h}}}_{i}^{{\rm{a}}},{{\bf{h}}}_{j}^{{\rm{a}}})\)  (13)
\({{\bf{c}}}_{u}^{{\rm{f}}}=\hat{{\bf{c}}}({{\bf{h}}}_{u}^{{\rm{f}}})+{\sum }_{v}{a}_{uv}\,\tilde{{\bf{c}}}({{\bf{h}}}_{u}^{{\rm{f}}},{{\bf{h}}}_{v}^{{\rm{f}}})\)  (14)
where \(\hat{{\bf{c}}}({{\bf{h}}}_{i})\) represents the contribution of node i alone, while \(\tilde{{\bf{c}}}({{\bf{h}}}_{i},{{\bf{h}}}_{j})\) represents the contribution of the interaction between node i and node j that can be weighted by a learnable coefficient, aij. Nodes can refer to either atoms or functional groups depending on the level of the graph, with \({{\bf{h}}}_{i}={{\bf{h}}}_{i}^{{\rm{a}}}\,{\rm{or}}\,{{\bf{h}}}_{i}^{{\rm{f}}}\).
In bulk materials, the likelihood of two local structures interacting with each other is determined by their affinity. To capture this relationship, we introduce the weight coefficient, aij, defined by using the multi-head attention mechanism37:
\({a}_{ij}={{\rm{softmax}}}_{j}\left({\bf{q}}({{\bf{h}}}_{i})\cdot {\bf{k}}({{\bf{h}}}_{j})/\sqrt{{d}_{k}}\right)\)  (15)
The attention weights aij are computed using the query, key, and value projections derived from the node representation hi. In this context, the query q(hi) represents the features that node i requests from its interaction counterparts, while the key k(hj) represents the features that node j possesses and can match. The dimension of the keys is denoted by dk. The affinity score between nodes i and j is obtained by performing a dot product of their characteristics, which is then scaled by \(1/\sqrt{{d}_{k}}\) to stabilize the gradients during training. This scaled dot-product is transformed into attention by applying the softmax function to normalize the influence of each interaction. Finally, the resulting attention weight aij can be interpreted as the probability of node i interacting with node j, taking into account all other nodes in the graph.
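A single-head version of this computation is compact in PyTorch. The sketch below (with illustrative dimensions) returns the full matrix of weights aij and makes the non-reciprocity aij ≠ aji explicit, since queries and keys come from different learned projections.

```python
import math
import torch
import torch.nn as nn

class PairwiseAttention(nn.Module):
    """Single-head sketch of Eq. (15): affinity-weighted interactions
    between nodes (atoms or motifs). In general a_ij != a_ji, because
    the query and key projections differ."""
    def __init__(self, d_model: int, d_k: int = 32):
        super().__init__()
        self.q = nn.Linear(d_model, d_k)  # features node i requests
        self.k = nn.Linear(d_model, d_k)  # features node j possesses

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_nodes, d_model) -> attention matrix a: (n_nodes, n_nodes)
        scores = self.q(h) @ self.k(h).T / math.sqrt(self.k.out_features)
        return torch.softmax(scores, dim=-1)  # row i: how node i attends to each j
```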
We can then derive the molecular properties from the contributions of individual atoms and functional groups, denoted as ca and cf in the equation below:
\(y={\rm{MLP}}\left({\sum }_{i}\sigma ({{\bf{c}}}_{i}^{{\rm{a}}})\,\phi ({{\bf{c}}}_{i}^{{\rm{a}}}),\,{\sum }_{u}\sigma ({{\bf{c}}}_{u}^{{\rm{f}}})\,\phi ({{\bf{c}}}_{u}^{{\rm{f}}})\right)\)  (16)
Here, we use a graph pooling operation with a weighted sum to generate chemical embeddings for the same molecule at both the atom and motif levels, represented by the two expressions enclosed in parentheses. The weights, σ( ⋅ ) and ϕ( ⋅ ), which are the sigmoid and hyperbolic tangent functions respectively, act as gating mechanisms. These functions evaluate the significance of a local structure for the molecular properties of interest. Additionally, we incorporate an MLP structure to introduce more nonlinearity and enhance the model’s expressivity for regression. This improves the model’s ability to capture complex relationships and predict molecular properties more accurately.
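A minimal sketch of one such gated readout is shown below; in the full model, the atom- and motif-level readouts are concatenated before the final MLP regressor. Layer sizes and the exact gating parameterization are assumptions.

```python
import torch
import torch.nn as nn

class GatedReadout(nn.Module):
    """Sketch of the gated graph pooling in Eq. (16): a sigmoid gate scores
    how much each node contributes, a tanh rectifier shapes the contribution,
    and the weighted sum yields a fixed-length chemical embedding."""
    def __init__(self, d_node: int, d_out: int):
        super().__init__()
        self.gate = nn.Linear(d_node, 1)      # significance of each node
        self.feat = nn.Linear(d_node, d_out)  # per-node contribution features

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (n_nodes, d_node) per-node contributions from Eqs. (13)-(14)
        w = torch.sigmoid(self.gate(c))
        return (w * torch.tanh(self.feat(c))).sum(dim=0)  # fixed-length vector
```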
Although our model is inspired by the group contribution theory, it goes beyond it. First, instead of disregarding the interconnectivity of local structures, it accounts for the influence of neighboring environments on individual atoms and functional groups, by taking their graph embeddings as input. This is particularly important for properties such as the partial charge of an atom and ionization of a functional group, which vary significantly with the local environment. Second, our model is more expressive. The contribution analysis of local structures in Eqs. (13)–(14) does not rely on a simple quadratic expansion. Instead, q( ⋅ ), k( ⋅ ), \(\hat{{\bf{c}}}(\cdot )\) and \(\tilde{{\bf{c}}}(\cdot )\) are all modeled by neural networks, providing the capability to represent contributions up to any form of two-body interactions. Third, the self-attention mechanism in our design allows for non-reciprocity in the interaction, meaning that aij ≠ aji. This avoids the need to construct a non-reciprocal interaction energy Uij ≠ Uji, which is often used in models such as the UNIFAC method, but is difficult to justify from the perspective of physical interactions.
To evaluate the performance of our model, we first validate it on the standard QM9 dataset38, which contains ~130,000 molecules labeled with quantum-chemical properties. Our framework demonstrates outstanding data efficiency: when trained on only 5% of the dataset (6,000 molecules), it achieves R2 ≈ 0.97 in predicting HOMO and LUMO energies (Fig. S2). This performance, obtained with 6k samples, is comparable to or better than baseline models trained on over 100k samples39,40. Beyond frontier orbital energies, we further benchmarked the model on additional QM9 targets, including isotropic polarizability, electronic spatial extent, and heat capacity at 298 K, and consistently obtained high predictive accuracy (R2 = 0.98–0.99; Fig. S3). These results confirm both the robustness and generality of our functional-group-based representation, even under data-limited conditions.
To test our model in domain-specific applications, we evaluated a dataset of 600 monomer species relevant to adhesive polymeric materials. For each monomer, thermophysical properties were determined from all-atom molecular dynamics (MD) simulations, including cohesive energy (Ecoh), heat of vaporization (ΔHvap), isothermal compressibility (β), bulk density (ρ), radius of gyration (Rg), and glass transition temperature (Tg). We trained the model on 450 randomly selected monomers and tested on the remaining ones. As illustrated in Fig. 3b, the model achieves high predictive accuracy across nearly all properties, with R2 values above 0.92. For Tg, the accuracy remains significant but lower than for other properties, reflecting the intrinsic difficulty of generating reliable Tg labels from simulations.
To further probe the design of our architecture and verify the contribution of key components, we conducted ablation studies using Ecoh as a representative property (Figs. S4–S5). Removing the attention mechanism reduces the predictive accuracy, indicating that attention improves the model’s ability to emphasize chemically important motifs and higher-order interactions. Excluding atom-level contributions results in a further decrease in performance, demonstrating that fine-grained atomic detail is indispensable for capturing local polarity, bonding environments, and substituent effects. Together, these results confirm that both motif-level and atom-level information, combined with attention-based weighting, are critical to achieving state-of-the-art accuracy.
While our model performs strongly on single-property prediction, polymer design often requires optimizing multiple thermophysical properties simultaneously. This task is challenging because different properties, such as cohesive energy and glass transition temperature, are governed by distinct molecular factors and are rarely modeled together in existing approaches. To test this capability, we trained a multi-property model to predict both Ecoh and Tg jointly. As shown in Fig. S6, the model maintains high accuracy (R2 = 0.90 for Ecoh and R2 = 0.84 for Tg), with only a modest decrease compared to single-property models. These results demonstrate that the hierarchical representation is sufficiently expressive to capture distinct yet correlated molecular determinants governing different physical targets. This capability is particularly important for real-world polymer design, where simultaneous control of multiple properties is essential. The ability to extend seamlessly from single- to multi-property prediction underscores the framework’s robustness and positions it as a practical tool for multi-objective molecular discovery.
Finally, we analyzed prediction robustness for Tg, the most challenging property in our dataset. As shown in Fig. S7, the RMSE of model predictions is comparable to the uncertainty of replicate MD simulations, indicating that the main limitation arises from the intrinsic noise in the training data rather than the model itself. This conclusion is further supported by variability analysis across five random train/test splits (Table S2), where we observe stable performance. Together, these results highlight the reliability of the framework and its suitability for practical molecular design tasks in polymeric materials.
Automated pipeline for molecular design
Our chemistry prediction model can process up to 10⁴ molecules in about an hour, producing property estimates that closely match those of the actual atomistic simulations. This speed and accuracy facilitate broad and cost-effective high-throughput screening across extensive molecular databases. Using the model as an initial filter to pinpoint promising candidates, we can then rely on MD simulations to further refine these selections (Fig. 4a). Here, as a case study, we demonstrate the generative power of the model by discovering new monomers with a target glass transition temperature (Tg), a cornerstone of polymer physics that governs whether a polymer behaves as a rigid solid or as a flexible material. Despite its importance for mechanical performance and processing, Tg remains notoriously difficult to predict due to the intricate interplay of molecular interactions, chain conformation, and packing or free volume considerations; designing polymers with a prescribed Tg is even more challenging, as there is no simple structure-property rule. Our pipeline addresses this gap by learning the underlying molecular chemistry, achieving high predictive accuracy, and enabling targeted molecular design.
Fig. 4 | a We integrate a hierarchical graph autoencoder with a self-attention-based chemistry prediction model to form an autonomous pipeline. The autoencoder employs a coarse-grained functional-group vocabulary to generate chemically valid and structurally diverse candidates, while the chemistry model screens these new molecules based on predicted properties. This division of roles reduces data requirements and increases design flexibility, enabling the pipeline to selectively propose novel compounds aligned with target performance criteria. b As a proof of concept, we apply this pipeline to optimize the glass transition temperature (Tg). The successful identification of new molecules with Tg values beyond the limits of the training data demonstrates the pipeline’s potential for extrapolation and highlights its effectiveness in guiding data-efficient molecular discovery.
Central to our strategy is a latent-space generative model, which explores new polymer architectures beyond the training database. To expand the molecular search space, we sample molecular embeddings from the prior distribution \(P({{\bf{h}}}^{{\rm{m}}})={\mathcal{N}}({\bf{0}},{\bf{I}})\). Concretely, we draw random vectors from this standard normal distribution and feed them into the decoder, which converts these latent representations into valid molecular structures biased toward desired property ranges. To highlight the scope of this generative capability, in this fully automated pipeline (Fig. 4a), we restricted the chemical building blocks to the acrylate functionalities and sampled 50,000 unique candidates not present in the original database, well above simple enumeration of known compounds. Instead of a brute-force approach, the latent-space model systematically navigates chemical space in a manner that favors valid, nonduplicative structures and spans a broad range, rather than merely generating random permutations.
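Schematically, this sampling stage reduces to a short loop. In the sketch below, `decoder` and `predictor` are placeholders for the trained models, and the validity and novelty checks are simplified relative to the full pipeline.

```python
import torch

def generate_candidates(decoder, predictor, n_samples=50_000, d_latent=10,
                        known=frozenset()):
    """Sketch of prior sampling: draw h_m ~ N(0, I), decode to molecules,
    keep valid and novel structures, and rank by predicted property."""
    candidates = {}
    while len(candidates) < n_samples:
        h_m = torch.randn(d_latent)                # sample the standard-normal prior
        smiles = decoder(h_m)                      # latent vector -> molecule
        if smiles is not None and smiles not in known:
            candidates[smiles] = predictor(smiles) # e.g., predicted Tg
    # screen: lowest and highest predicted values go on to MD validation
    ranked = sorted(candidates, key=candidates.get)
    return ranked[:100], ranked[-100:]
```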
We next applied our chemistry-prediction model to all 50,000 generated acrylates, finding that the predicted Tg values spanned and extended beyond the range of the training set. This broad exploration highlights that the model does not rely on random ‘lucky’ hits; rather, it learns molecular motifs associated with Tg. After using our property predictor to screen for particularly high or low Tg values, we selected 100 representative candidates for validation via MD simulations. The “Screening” box in Fig. 4a illustrates the screening workflow, in which the newly generated structures pass through our chemistry prediction model, and then MD simulations validate a subset of particularly high- or low-Tg candidates. As shown in Fig. 4b, the predicted Tg values for these candidates are in close agreement with the simulated results, demonstrating our model’s accuracy. For clearer visualization, we randomly selected half of both the training and test datasets to display, along with 20 random examples from the 100 newly generated molecules that were validated by simulation. In particular, some of the newly generated molecules exhibit Tg values that exceed the low and high limits of the training set. This indicates that the generative model can explore regions of chemical space beyond those directly represented in the original data, rather than merely reproducing existing structures. Moreover, the Screening step in Fig. 4a reveals that these new candidates feature diverse molecular backbones and functional groups that would be difficult to conceive based on chemical intuition alone. This underscores how large-scale sampling and rapid property evaluation can facilitate the discovery of promising novel designs in polymer science.
So far, Tg has served as a critical proof-of-concept target: its sensitivity to both local chemical environments and long-range relaxation processes poses a stringent test for any materials-design model. Importantly, our approach is not confined to this single property and can be readily applied to others. As another example (see Supplementary Information), we used the same pipeline to design materials with high cohesive energy density, as illustrated in Fig. S8, where newly generated candidates outperformed those found in the original database. Together, these two cases demonstrate the model’s ability to autonomously explore chemical space and discover promising structures beyond the training distribution. Building on this foundation, our framework naturally generalizes to multi-property optimization, enabling the concurrent pursuit of multiple requirements vital to industrial practice. Since the predictor has already demonstrated reliable joint accuracy for \({{E}_{\rm{coh}}}\) and \({{T}_{g}}\), the same generative pipeline can be adapted to propose molecules that optimize both properties simultaneously. This extension transforms the current workflow into a scalable, multi-objective design engine. By systematically integrating generative exploration, high-throughput property prediction, and targeted simulation validation, this foundation provides a robust strategy for AI-guided molecular discovery and paves the way for accelerated innovation in materials design.
Discussion
Our approach to molecular design is based on the emerging paradigm of digitizing the chemical space and then directing molecular generation through property prediction models. However, it deviates from conventional strategies that rely on a single embedding vector to represent an entire molecule in an end-to-end framework. Instead, a hierarchical scheme is adopted in which local atomic details, coarse-grained functional groups, and global molecular embeddings each play distinct roles. This architecture not only mitigates the information loss often associated with autoencoders, but also ensures that relevant chemical features are extracted and leveraged when needed, akin to the multiscale design principles used in UNet-like image processing models41.
In our pipeline, the decoder incrementally assembles new molecules by predicting structural motifs and the specific bonds connecting them, guided by both local embeddings (atoms and functional groups) and a global molecular embedding that orchestrates the overall design. The subsequent screening employs a self-attention-based chemistry prediction model, which capitalizes on the same hierarchical representations to evaluate molecular properties. By shifting the main burden of chemical interpretation to local embeddings, the framework remains data efficient: The task of learning chemically meaningful representations at the atomic or group level is considerably more tractable than requiring a single global embedding to capture every subtlety of an entire molecule. This design choice proved to be critical to achieving high accuracy with limited training data.
A key innovation underlying this efficiency is our use of functional groups as a coarse-grained vocabulary for both generation and prediction. The group contribution theory approach identifies < 100 groups that recapitulate the most relevant chemistries, offering three significant advantages. First, it confines the combinatorial explosion of possible motifs, avoiding large data-biased dictionaries. Second, it delegates atom-level connectivity to established cheminformatics toolkits such as RDKit35, alleviating the need for the autoencoder to learn this low-level connectivity from scratch. Finally, we embed these functional groups into a self-attention model that unites domain-specific chemical insights with the flexibility of modern neural networks. As a result, the chemistry prediction model achieves near-simulation-level accuracy while being trained on only a few hundred labeled molecules.
Our demonstration of targeted polymer design through this hierarchical framework showcases its potential to guide molecular discovery efficiently, even in sparse-data regimes. While the glass transition temperature served as a stringent proof-of-concept target, we also demonstrated successful generative design for other properties. Moreover, we have shown that our model can be naturally extended to optimize multiple properties concurrently. In this way, the synergy of coarse-grained functional groups and a self-attention-based architecture can be broadly harnessed for designing biomolecules, polymeric materials, and other hybrid chemical systems. We anticipate that the principles detailed in this work will serve as a useful foundation for the community to build ever more sophisticated machine learning pipelines that illuminate vast expanses of previously unexplored chemical space and drive rapid innovation in materials science.
Methods
Our method consists of three components: a coarse-grained graph autoencoder for latent molecular representation, an attention-aided model for property prediction, and an automated pipeline for functional-group-driven molecular design. Together, they enable interpretable and controllable generation of valid molecular structures from coarse motifs.
Model architecture
Molecular Representation and Functional Group Encoding: Molecules M are represented as atom-level graphs \({{\mathcal{G}}}^{{\rm{a}}}(M)\), where nodes correspond to atoms with features such as type, valence, and formal charge, and edges represent chemical bonds. Functional groups were identified and grouped hierarchically into coarse-grained graphs \({{\mathcal{G}}}^{{\rm{f}}}(M)\), where nodes correspond to functional groups and meta-bonds denote inter-group connectivity. Ring systems (e.g., benzene, pyridine, substituted aromatics) are treated as self-contained motifs within the functional-group vocabulary, ensuring that aromaticity and conjugation are preserved consistently rather than being fragmented across smaller units.
Atom embeddings \(\{{{\bf{h}}}_{i}^{a}\}\) and bond embeddings \(\{{{\bf{h}}}_{ij}^{a}\}\) are learned through a message-passing network (MPN) as in Eq. (3). Functional group embeddings are then constructed via aggregation of atom-level features followed by a multi-layer perceptron (MLP), as in Eq. (4). These are subsequently processed by a second MPN to encode the motif-level graph \({{\mathcal{G}}}^{{\rm{f}}}\), as in Eq. (5).
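For concreteness, one message-passing layer can be sketched as follows. The message and update functions here (a ReLU MLP and a GRU cell) are common choices and are assumptions; the exact parameterization used in this work is given in the Supplementary Information.

```python
import torch
import torch.nn as nn

class MPNLayer(nn.Module):
    """Minimal message-passing layer (sketch): each node aggregates
    messages from its neighbors, conditioned on edge features, then
    updates its hidden state. Stacking T such layers gives each node
    embedding a T-hop receptive field."""
    def __init__(self, d_node: int, d_edge: int):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * d_node + d_edge, d_node), nn.ReLU())
        self.upd = nn.GRUCell(d_node, d_node)

    def forward(self, h, edge_index, e):
        # h: (n, d_node); edge_index: (2, m) source/target pairs; e: (m, d_edge)
        src, dst = edge_index
        m = self.msg(torch.cat([h[src], h[dst], e], dim=-1))  # per-edge message
        agg = torch.zeros_like(h).index_add_(0, dst, m)       # sum into target nodes
        return self.upd(agg, h)                               # GRU node update
```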
Coarse-Grained Graph Autoencoder: We formulate molecular generation as variational inference over a latent embedding hm, with the generative model expressed as
\(P(M)=\int P(M\,|\,{{\bf{h}}}^{{\rm{m}}})\,P({{\bf{h}}}^{{\rm{m}}})\,{\rm{d}}{{\bf{h}}}^{{\rm{m}}}\)
as in Eq. (2). The encoder produces a posterior distribution P(hm∣M) from the final graph representation, approximated as a normal distribution over \({{\bf{h}}}_{0}^{{\rm{f}}}\) in Eq. (6). The decoder reconstructs molecules autoregressively by adding functional groups Fu to a partial graph M≤u−1, following Eqs. (7)–(10).
The choice of the next functional group is predicted by:
\(P({F}_{u}\,|\,{M}_{\le u-1},{{\bf{h}}}^{{\rm{m}}})={\rm{softmax}}\left({\rm{MLP}}({{\bf{h}}}_{u-1}^{{\rm{f}}},{{\bf{h}}}^{{\rm{m}}})\right)\)
as in Eq. (9), while the attachment bond bij is selected using atom-level features in Eq. (10). The model is trained by minimizing the ELBO:
\({{\mathcal{L}}}_{{\rm{ELBO}}}={{\mathcal{L}}}_{1}+\lambda {{\mathcal{L}}}_{2}\)
with \({{\mathcal{L}}}_{1}\) the reconstruction cross-entropy (Eq. 11) and \({{\mathcal{L}}}_{2}\) the KL divergence regularizer (Eq. 12).
Attention-Aided Property Prediction: To enable property-guided molecular design, we implemented an auxiliary prediction head over hm. A multi-head attention mechanism aggregates context from functional group embeddings, followed by global pooling and regression. This module predicts molecular properties such as HOMO-LUMO gaps and is trained jointly with the generative pipeline in a multi-task setting.
Automated Pipeline for Molecular Design: We constructed an automated pipeline that integrates (1) functional group extraction, (2) coarse-grained graph construction, (3) latent encoding, (4) functional group-wise generation, and (5) optional property conditioning. This setup allows both latent sampling and gradient-based optimization in embedding space to steer generation toward property targets.
Data sources
We used the publicly available QM9 dataset, which consists of ~134,000 small organic molecules with DFT-optimized geometries and associated quantum chemical properties. For domain-specific tasks, we additionally adopted a dataset consistent with prior work42. Molecules were preprocessed using RDKit to ensure valency consistency and canonical SMILES formatting.
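A minimal version of this preprocessing step with RDKit (the input list here is a toy example):

```python
from rdkit import Chem

def canonicalize(smiles):
    """Drop unparsable entries and map each molecule to its canonical
    SMILES so that duplicate representations collapse."""
    mol = Chem.MolFromSmiles(smiles)  # returns None on invalid input
    return Chem.MolToSmiles(mol) if mol is not None else None

raw_smiles = ["CC(=O)OC=C", "C1=CC=CC=C1O", "not-a-molecule"]   # toy inputs
dataset = {s for s in map(canonicalize, raw_smiles) if s}       # deduplicated
```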
Molecular dynamics simulations
Molecular dynamics (MD) simulations were performed using GROMACS 2022 with the OPLS-AA force field. Simulation protocols and parameters largely followed previous studies42,43. Each system was first energy-minimized to remove steric clashes and ensure stable starting configurations. The minimized structures were then equilibrated in the NPT ensemble at 250 K and 1 bar for 10 ns with a 1 fs timestep, employing a Langevin thermostat (friction constant = 1 ps−1). Subsequently, the systems were gradually heated to 500 K at a rate of 0.01 K ps−1, equilibrated at this elevated temperature, and then cooled to 100 K using the same rate. Glass-transition behavior was determined from the density-temperature profiles obtained during heating and cooling cycles. To reduce artifacts from quench history, snapshots collected along the cooling trajectory were subjected to an additional 1 ns NPT equilibration before entering production runs. Final production simulations were carried out for 25 ns in both the NPT and NVT ensembles at selected temperatures, and trajectories were analyzed to extract thermophysical properties. Three independent replicates were performed with randomized seeds.
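As an illustration of the density-temperature analysis, the sketch below estimates Tg as the intersection of linear fits to the glassy and melt branches of a cooling profile. The outer-third fitting windows are an assumption for illustration, not the exact protocol used in this work.

```python
import numpy as np

def tg_from_density(T, rho):
    """Estimate Tg from a density-temperature profile (sketch): fit
    separate lines to the glassy (low-T) and melt (high-T) branches
    and return their intersection temperature."""
    T, rho = np.asarray(T), np.asarray(rho)
    order = np.argsort(T)
    T, rho = T[order], rho[order]
    n = len(T) // 3                           # use outer thirds, skip the knee
    a_lo, b_lo = np.polyfit(T[:n], rho[:n], 1)     # glassy branch: rho = a*T + b
    a_hi, b_hi = np.polyfit(T[-n:], rho[-n:], 1)   # melt branch
    return (b_hi - b_lo) / (a_lo - a_hi)           # intersection temperature

# usage: Tg = tg_from_density(temps_K, densities)  # from a cooling ramp
```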
Additional details are provided in the Supplementary Information, including descriptions of model architecture, hyperparameter settings, evaluation protocols, and implementation details. The Supplement also contains extended results and supporting analyses that further validate the robustness and generalizability of our approach across different molecular design tasks.
Data availability
The main data that support the findings of this study have been deposited in Zenodo at https://doi.org/10.5281/zenodo.13147126. Other data are provided within the manuscript or Supplementary Information files.
References
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360–365 (2018).
Patani, G. A. & LaVoie, E. J. Bioisosterism: a rational approach in drug design. Chem. Rev. 96, 3147–3176 (1996).
Anderson, A. C. The process of structure-based drug design. Chem. Biol. 10, 787–797 (2003).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
Nigam, A. et al. Tartarus: A benchmarking platform for realistic and practical inverse molecular design. Adv. Neural Inform. Process. Syst. 36, 3263–3306 (2023).
Beaujuge, P. M. & Fréchet, J. M. Molecular design and ordering effects in π-functional materials for transistor and solar cell applications. J. Am. Chem. Soc. 133, 20009–20029 (2011).
Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023).
Yao, Z. et al. Machine learning for a sustainable energy future. Nat. Rev. Mater. 8, 202–215 (2023).
Gurnani, R. et al. AI-assisted discovery of high-temperature dielectrics for energy storage. Nat. Commun. 15, 6107 (2024).
Hansen, K. et al. Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. J. Phys. Chem. Lett. 6, 2326–2331 (2015).
Cereto-Massagué, A. et al. Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63 (2015).
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Proc. 29th International Conference on Neural Information Processing Systems 2224−2232 (NIPS, 2015).
Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chem. Inform. Model. 57, 1757–1772 (2017).
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
Morgan, H. L. The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. J. Chem. Document. 5, 107–113 (1965).
Glen, R. C. et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADMET. IDrugs 9, 199 (2006).
Nantasenamat, C., Isarankura-Na-Ayudhya, C., Naenna, T. & Prachayasittikul, V. A practical overview of quantitative structure-activity relationship. EXCLI J. 8, 74–88 (2009).
Khan, A. U. et al. Descriptors and their selection methods in QSAR analysis: paradigm for drug design. Drug Discov. Today 21, 1291–1302 (2016).
Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design-a review of the state of the art. Mol. Syst. Design Eng. 4, 828–849 (2019).
David, L., Thakkar, A., Mercado, R. & Engkvist, O. Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminform. 12, 56 (2020).
Walters, W. P. & Barzilay, R. Applications of deep learning in molecule generation and molecular property prediction. Accounts Chem. Res. 54, 263–270 (2020).
Wigh, D. S., Goodman, J. M. & Lapkin, A. A. A review of molecular representation in the age of machine learning. Wiley Interdiscipl. Rev. Comput. Mol. Sci. 12, e1603 (2022).
Li, Z., Jiang, M., Wang, S. & Zhang, S. Deep learning methods for molecular representation and property prediction. Drug Discov. Today 27, 103373 (2022).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (2014).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci. 4, 268–276 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2323–2332 (PMLR, 2018).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inform. Comput. Sci. 28, 31–36 (1988).
Simonovsky, M. & Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, 412–422 (Springer, 2018).
Serratosa, F. Graph regression based on autoencoders and graph autoencoders. In International Conference on Pattern Recognition, 345–360 (Springer, 2024).
Reiser, P. et al. Graph neural networks for materials science and chemistry. Commun. Mater. 3, 93 (2022).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning, 4839–4848 (PMLR, 2020).
RDKit. RDKit : Open-Source Cheminformatics. http://www.rdkit.org (2025).
Liu, Q., Allamanis, M., Brockschmidt, M. & Gaunt, A. Constrained graph variational autoencoders for molecule design. In Proceedings of the 32nd Conference on Neural Information Processing Systems 7806−7815 (NeurIPS, 2018).
Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS, 2017).
Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 1–7 (2014).
Faber, F. A. et al. Machine learning prediction errors are better than DFT accuracy. J. Chem. Theory Comput. 13, 5255–5264 (2017).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Int. Conference on Machine Learning, 1263–1272 (PMLR, 2017).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241 (Springer, 2015).
Schneider, L. et al. In silico active learning for small molecule properties. Mol. Syst. Design Eng. 7, 1611–1621 (2022).
Wang, Z. et al. Water-mediated ion transport in an anion exchange membrane. Nat. Commun. 16, 1099 (2025).
Acknowledgements
This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division.
Author information
Authors and Affiliations
Contributions
J.J.d.P. conceived the project. M.H. and G.S. designed the model architecture. M.H. and G.S. performed model training and simulations. P.F.N. designed the ablation study experiments. All authors wrote and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Han, M., Sun, G., Nealey, P.F. et al. Attention-based functional-group coarse-graining: a deep learning framework for molecular prediction and design. npj Comput Mater 11, 355 (2025). https://doi.org/10.1038/s41524-025-01836-7