Introduction

Coarse-grained molecular dynamics (CGMD)1,2 has emerged as a vital tool for material development, offering crucial insights into complex molecular systems including polymers3, proteins4, and membranes5. The primary advantage of CGMD is its ability to explore molecular phenomena over larger length scales and longer time frames, surpassing the capabilities of traditional all-atom molecular dynamics (AAMD)6,7,8,9,10,11 simulations, which offer higher resolution and, hence, are particularly adept at capturing detailed interfacial interactions12. Specifically, CGMD achieves this speedup by representing groups of atoms as beads13,14,15,16,17,18, thus extending simulation capabilities from picoseconds to microseconds temporally and from nanometers to micrometers spatially. Consequently, coarse-graining provides insights into complex molecular phenomena that remain inaccessible to conventional AAMD, such as the self-assembly behavior of polymers19.

Emergent CGMD modeling toolsets rely on two key components to learn the underlying inter-molecular relationships: bead-mapping schemes and the parametrization of bead-bead interactions. In this work, ‘molecular topology’ specifically refers to the set of bonded parameters (bond lengths, angles, and their associated force constants that define a molecule’s topology) within a given coarse-grained mapping, rather than to the optimization of the bead-mapping scheme itself. These components are developed using two primary approaches: top-down13,14,15 and bottom-up16,17,18. Top-down approaches simplify systems to reproduce macroscopic properties with frameworks like CG-Martini13,20,21,22, where up to four heavy atoms are mapped onto one bead, and the inter-bead interactions are parametrized using experimentally obtained thermodynamic data. In particular, Martini 321,22,23, the most recent version of the CG-Martini force field, typically offers reasonable accuracy across a wide range of biological and material systems22,24 but struggles with materials exhibiting varying degrees of polymerization. Conversely, the bottom-up approach derives parameters directly from AAMD, ensuring microscopic accuracy but often requiring computationally expensive iterative refinement to match target observables. While top-down strategies aim to reproduce macroscopic properties and offer broader applicability, bottom-up methods emphasize fidelity to atomistic interactions. Recent advances in machine learning increasingly automate parameterization, particularly in polymer systems, making the choice between these approaches contingent on system complexity, desired accuracy, and modeling objectives.

Over the past decade, machine learning (ML)25,26 has transformed coarse-grained (CG) mapping and parameterization processes, markedly improving the accuracy and efficiency of CGMD simulations. In particular, ML-driven CGMD approaches leverage advanced algorithms to extract or optimize target parameters from large datasets, while also enabling active learning workflows that iteratively refine models. These methods are especially well-suited for bottom-up methodologies reliant on AAMD data. The relative computational affordability and accessibility of AAMD simulations, compared to experimental measurements, facilitate not only the generation of high-quality training datasets but also the on-demand data acquisition required for active learning, ensuring models remain adaptive and robust. Notable advancements in this pursuit include the Versatile Object-oriented Toolkit for Coarse-graining Applications27, which integrates techniques such as Iterative Boltzmann Inversion, force matching, and Inverse Monte Carlo. Similarly, the software MagiC28 implements a Metropolis Monte Carlo method, providing enhanced robustness against singular parameter values during optimization. Emerging ML-driven frameworks further generalize these capabilities: chemtrain29 enables learning deep potential models via automatic differentiation and statistical physics, while DMFF30 provides an open-source platform for differentiable force field development that supports both top-down and bottom-up approaches. Tools like TorchMD-Net31 and DeepMD32 enable end-to-end differentiable force field training with built-in uncertainty quantification, extending these principles to CG systems. Beyond these, methods33 employing Relative Entropy Minimization have also been integrated with popular simulation engines like GROMACS34 and LAMMPS35. For small molecules, approaches based on partition functions36 and on parameter tuning via quantum chemical calculations37, rather than direct AAMD, have shown promise. Optimization algorithms are central to resolving complex challenges in CG force field development because they systematically explore high-dimensional parameter spaces to minimize discrepancies between CG and reference data, ensuring accurate and transferable models while addressing the inherent nonlinearity and complexity of molecular interactions. Among these optimization methods, gradient-based techniques and Evolutionary Algorithms (EAs), notably the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), have gained prominence. GA has been applied to optimize parameters in ReaxFF reactive force fields38 and CG water models39, while PSO has been used in tools like Swarm-CG40 and CGCompiler41 for CG model parameterization. However, while EAs are effective in exploring vast parameter spaces, they can be computationally expensive and often require numerous evaluations of the objective function. Gradient-based methods42,43 excel in problems with smooth, differentiable objective functions, but they face limitations in CG force field parameterization due to (i) the non-differentiable nature of MD simulation outputs and (ii) their propensity to converge to local minima in complex energy landscapes. Against this backdrop, Bayesian Optimization (BO)44,45,46,47 offers a powerful approach for problems where objective function evaluations are expensive and data acquisition is costly.
By balancing exploration and exploitation through a probabilistic model, BO efficiently converges to optimal solutions with fewer evaluations, making it well-suited for optimizing CG force field parameters where computational cost is critical. Its ability to incorporate prior knowledge and handle noisy or sparse data further enhances its applicability to force field optimization tasks. Recent studies highlight BO’s advantages in diverse contexts. For example, BO optimized chlorine dosing schedules for water distribution systems48 with significantly fewer evaluations than traditional EAs. In materials design, BO identified superior solutions with smaller sample sizes and fewer iterations than GA and PSO49. Additionally, BO-based methods50 have demonstrated strong performance in low-dimensional problems with limited evaluation budgets. These findings underscore BO’s suitability for CGMD parameter optimization in polymer systems, particularly for computationally expensive simulations involving higher degrees of polymerization51,52. Furthermore, previous studies53,54 have highlighted the need to re-parametrize Martini for specific systems, including MOFs, proteins, and polymers, to address its limited accuracy in certain applications. In summary, while the conventional approach involves optimizing parameters at lower degrees of polymerization and validating at higher degrees, effectively addressing mesoscale phenomena requires models that systematically account for and adapt to variations in the degree of polymerization. BO’s advantages over gradient-based methods are particularly pronounced in high-dimensional CG parameter spaces. Although gradient-based optimization scales formally to higher dimensions, it requires precise derivative calculations, a significant challenge when objectives involve computationally expensive MD simulations with inherent noise. BO circumvents this by treating the optimization as a black-box problem, strategically balancing exploration and exploitation through probabilistic surrogate models. This enables global optimization without gradient information, making it robust to noisy evaluations and better suited to identifying transferable parameter sets across polymerization degrees.

Martini3 has excelled as a general-purpose tool for the baseline coarse-graining of a wide range of molecules. However, this generality also limits its ability to provide accurate property predictions for particular sub-classes of molecules. The approach presented in this work aims to remedy this limitation directly with a low-cost, active-learning protocol for the efficient low-dimensional parametrization of the bonded parameters of a CG molecular structure. Specifically, we use BO to enhance the accuracy of the Martini3 force fields for three common polymers across varying degrees of polymerization. The corresponding AAMD calculation results serve as the ground truth in our iterative refinement scheme. We calibrate the BO model on density (\(\rho\)) and radius of gyration (\({R}_{g}\)) and demonstrate a generalizable parametrization scheme for CG force field optimization, independent of the degree of polymerization. Furthermore, because this BO framework optimizes against abstract target properties, it is inherently flexible and can readily be adapted to calibrate CG models against experimental macroscopic data, in addition to the AAMD-based refinement demonstrated here.

Results and discussion

Low-dimensional parametrization of the coarse-grained molecular structure

Molecular Dynamics (MD) models define the geometry of a molecular topology through the bonded parameters of the force field, such as bond lengths (\({b}_{0}\)), bond constants (\({k}_{b}\)), angle magnitudes (\(\varPhi\)), and angle constants (\({k}_{\varPhi }\)). These topological parameters are intrinsically linked to macroscopic properties of molecules, including density (\(\rho\)) and radius of gyration (\({R}_{g}\)). Variations in these topological parameters directly influence molecular geometry, which in turn alters packing efficiency, the spatial distribution of atoms/beads, and overall molecular compactness. For instance, increasing bond lengths (\({b}_{0}\)) or widening bond angles (\(\varPhi\)) typically leads to a larger molecular volume, which impacts bulk properties such as \(\rho\), while bond constants (\({k}_{b}\)) and angle constants (\({k}_{\varPhi }\)) reflect the stiffness of the polymer backbone, which is crucial for determining conformational properties like \({R}_{g}\). Polymer chains containing aromatic rings introduce an additional bond-length parameter (\(c\)) to account for the constant-length aromatic bonds needed to preserve the ring’s topology. Therefore, this set of bonded parameters (\(\theta\)) can be defined as:

$${\boldsymbol{\theta }}=\left\{\begin{array}{ll}\left[{b}_{0},{k}_{b},\varPhi ,{k}_{\varPhi }\right] & {\rm{for}}\; {\rm{non}}-{\rm{aromatic}}\; {\rm{molecules}}\\ \left[{b}_{0},{k}_{b},\varPhi ,{k}_{\varPhi },c\right] & {\rm{for}}\; {\rm{aromatic}}\; {\rm{molecules}}\end{array}\right.$$
(1)

Notably, this study excludes dihedral angles from the parameter set due to the complexity of their conformational space and their non-trivial relationship with the aforementioned macroscopic properties. Even so, the number of topological parameters (\(\theta\)) scales linearly with the degree of polymerization (\(n\)); hence, attempting to optimize every parameter within the molecular topology space would be highly computationally inefficient, even for CG representations. Consequently, reducing the dimensionality of the parameter space becomes essential: fewer design variables enhance the efficiency and tractability of BO, which is typically less effective in higher-dimensional spaces. To this end, we propose an effective low-dimensional parametrization of a CG molecule’s topology, which focuses on capturing the critical degrees of freedom that influence vital macroscopic properties like \(\rho\) and \({R}_{g}\). In particular, we consider the bonded parameters (\(\theta\)) of the first, middle, and end bonds of a polymer chain to sufficiently describe the CG molecular topology. This enables an efficient CG topology representation because the middle bonds capture the essence of the uniform internal structure, while the first and last bonds capture deviations due to functional groups or termination effects. This selection is deemed adequate given the regularity of repeating units in a polymer chain and the unique boundary effects at chain ends. The rationale behind distinguishing parameters for start, middle, and end segments is to capture transferable features: the parameters optimized for these regions are intended to apply when constructing models for other degrees of polymerization of the same polymer, thus promoting transferability. A direct comparison to a model using a single, uniform parameter set for all segments was not performed but would be valuable future work to precisely quantify the impact of modeling these terminal boundary effects.

Furthermore, this low-dimensional parametrization approach balances computational efficiency with the need to capture key topological features of the CG polymer. For instance, a CG styrene monomer contains 5 bonded parameters (see Supplementary Fig. 1) while a 20-polystyrene polymer chain (20-PS) consists of 139 bonded parameters (see Fig. 1). However, if we consider the start, middle, and end bonds of 20-PS (which contains aromatic rings in every monomer), we arrive at 3 sets of \(\theta\), which gives us 15 dimensions to optimize (see Fig. 1).
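To make the dimensionality reduction concrete, the sketch below (with hypothetical helper names and illustrative values, not the authors' released code) broadcasts the three \(\theta\) sets onto every backbone bond of a chain. The actual Martini3 topology distributes these parameters over bonds, angles, and ring constraints, but the bookkeeping is analogous:

```python
import numpy as np

# One theta set for an aromatic polymer: [b0, kb, phi, k_phi, c] (Eq. 1).
# Low-dimensional design vector: theta_start, theta_middle, theta_end -> 15 values.

def expand_theta(theta_lowdim, n_bonds):
    """Broadcast the 15-dimensional design vector onto every backbone bond.

    theta_lowdim : (3, 5) array holding [b0, kb, phi, k_phi, c] for the
                   start, middle, and end segments (assumed layout).
    n_bonds      : number of backbone bonds in the CG chain.
    """
    theta_start, theta_middle, theta_end = theta_lowdim
    full = np.tile(theta_middle, (n_bonds, 1))  # uniform internal structure
    full[0] = theta_start                       # chain-initiation effects
    full[-1] = theta_end                        # chain-termination effects
    return full

# Illustrative values only (nm, kJ mol^-1 nm^-2, deg, kJ mol^-1 rad^-2, nm).
theta = np.array([
    [0.27, 8000.0, 136.0, 50.0, 0.27],  # start
    [0.26, 8000.0, 136.0, 50.0, 0.27],  # middle
    [0.27, 8000.0, 136.0, 50.0, 0.27],  # end
])
print(expand_theta(theta, n_bonds=19).shape)  # (19, 5) for a 20-mer backbone
```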

Fig. 1: Low-dimensional parametrization of a coarse-grained polymer chain.
figure 1

The schematic illustrates the selection of start, middle, and end segments of a 20-polystyrene (20-PS) chain for parameter optimization. This approach reduces the dimensionality of the optimization space from 139 total bonded parameters to 15, focusing on the most influential regions of the polymer.

CG topology optimization framework

Leveraging the strengths of CGMD simulations and BO in an integrated manner addresses a critical challenge: improving the fidelity of Martini3 force fields while maintaining computational efficiency. To this end, we create a synergistic workflow (see Fig. 2) that optimizes the pre-selected topological parameters (\(\theta\)) for a particular polymer chain by calibrating on the CGMD-derived macroscopic properties of the bulk polymer. We initiated the workflow with 20 CGMD simulation runs, based on topological parameters (\(\theta\)) selected with a space-filling Latin Hypercube experimental design55 with maximum projection. This initial dataset was used to train the Gaussian Process (GP) surrogate model, ensuring adequate coverage of the parameter space. Following this, we integrated the Martini3 simulation method with BO, leveraging the Expected Hypervolume Improvement (EHVI) acquisition function56. The optimization proceeded iteratively over ~50 iterations, with two CGMD simulations conducted per iteration. Model convergence was defined as a plateau in the reduction of the loss (i.e., the objective function), indicating that the optimization had identified an optimal set of parameters (\({\theta }_{{optimal}}\)). This dual-run strategy balances computational efficiency with the need for sufficient data to refine the GP model, pairing effective exploration of the parameter space with exploitation of promising regions. The optimal parameters are then obtained by minimizing the objective function, defined as:

$$\widehat{{\boldsymbol{\theta }}}=\mathop{{\rm{argmin}}}\limits_{\theta }{{||}{k}^{{CG}}\left({\boldsymbol{\theta }}\right)-{k}^{{AA}}{||}}_{2}$$
(2)

where the topological parameters (\(\theta\)) serve as input variables, and the properties derived from CGMD runs (\({k}^{{CG}}\)) are compared against reference high-fidelity data (AAMD-derived property estimates, \({k}^{{AA}}\)). By directly targeting macroscopic properties such as density and radius of gyration, this approach is designed to ensure fidelity in key emergent behaviors and to avoid the divergences in these properties that can occur in bottom-up methods focused primarily on matching local structural distributions (e.g., radial distribution functions). In summary, this iterative refinement constitutes a multi-objective Bayesian optimization (MOBO) that minimizes the objective function while maintaining computational feasibility. Table 1 defines the optimization space (constraints) for these \(\theta\). This parametrization can be extended to complex polymer systems by introducing additional parameters while keeping the total number of topological parameters low.
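This workflow maps naturally onto standard MOBO tooling. The sketch below illustrates the loop using BoTorch and a scipy Latin hypercube (the paper uses a maximum-projection design), casting the minimization of Eq. (2) as maximization of negated errors. Here `run_cgmd` is a hypothetical wrapper that launches a CGMD run for a candidate \(\theta\) and returns the percentage errors in \(\rho\) and \({R}_{g}\) against the AAMD references; all names and settings are assumptions rather than the authors' implementation:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition.multi_objective import qExpectedHypervolumeImprovement
from botorch.utils.multi_objective.box_decompositions.non_dominated import (
    FastNondominatedPartitioning,
)
from botorch.optim import optimize_acqf
from botorch.sampling.normal import SobolQMCNormalSampler
from gpytorch.mlls import ExactMarginalLogLikelihood
from scipy.stats import qmc

D = 15  # low-dimensional theta for an aromatic chain (3 segments x 5 parameters)
bounds = torch.tensor([[0.0] * D, [1.0] * D], dtype=torch.double)  # normalized Table 1 ranges

def objective(theta):
    """Negated percentage errors (BoTorch maximizes); run_cgmd is hypothetical."""
    rho_err, rg_err = run_cgmd(theta)  # launch CGMD, compare against AAMD refs
    return torch.tensor([-rho_err, -rg_err], dtype=torch.double)

# 20 initial CGMD runs from a space-filling design (plain LHS here; the paper
# uses a maximum-projection Latin hypercube).
X = torch.tensor(qmc.LatinHypercube(d=D, seed=0).random(20), dtype=torch.double)
Y = torch.stack([objective(x) for x in X])

ref_point = torch.tensor([-100.0, -100.0], dtype=torch.double)  # worst-case errors

for _ in range(50):  # ~50 iterations, two CGMD evaluations per iteration
    model = SingleTaskGP(X, Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    acqf = qExpectedHypervolumeImprovement(
        model=model,
        ref_point=ref_point,
        partitioning=FastNondominatedPartitioning(ref_point=ref_point, Y=Y),
        sampler=SobolQMCNormalSampler(sample_shape=torch.Size([128])),
    )
    cand, _ = optimize_acqf(acqf, bounds=bounds, q=2,
                            num_restarts=10, raw_samples=256)
    X = torch.cat([X, cand])
    Y = torch.cat([Y, torch.stack([objective(x) for x in cand])])
```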

Fig. 2: The Bayesian optimization workflow for refining coarse-grained topologies.
figure 2

This flowchart illustrates the iterative process where properties from all-atom molecular dynamics (AAMD) serve as the ground truth. A Gaussian Process Regression (GPR) surrogate model and an acquisition function are used to intelligently select new coarse-grained (CG) topology parameters to evaluate, progressively minimizing the difference between CG and AA properties to find an optimal topology.

Table 1 Optimization space (constraints) for the low-dimensional parameters of the coarse-grained polymer systems

Pareto optimal property discrepancy frontier

The polymer systems trained in this work exhibit significant structural and chemical diversity across four degrees of polymerization, which necessitates varying training times even on the same computational resource. To this end, we train the integrated CGMD-MOBO models individually on a single NVIDIA A100 GPU with 32 GB of memory, with training times ranging from ~3 h to ~11 h (depending on the CGMD simulation times for a bulk polymer chain system). The models are trained until the objective function plateaus (see Fig. 3), indicating that the total loss has been minimized within the optimization search space (as shown in Table 1). Specifically, every polymer’s Martini3 topology is optimized until the absolute percentage errors converge to less than ~10%. From Fig. 3, we also observe that the radius of gyration (\({R}_{g}\)) converges more quickly, while the interdependencies affecting the density (ρ) demand additional iterations to balance competing influences. These interdependencies are attributed to the complexity of the respective optimization landscapes. Specifically, since \({R}_{g}\) is a molecular-scale property, it is directly influenced by localized, geometry-based structural parameters such as bond lengths and angles, which are simpler to optimize and exhibit a smoother landscape. In contrast, ρ, a bulk property, is affected by both bonded and non-bonded interactions, requiring adjustments to long-range interactions (non-bonded parameters) and packing efficiency (bonded parameters). This leads to a more complex and slower-converging optimization process. One could argue instead that the initial parameter set simply lay closer to the optimal region for \({R}_{g}\), accelerating its convergence and thus introducing a bias. To test this, the model for every polymer chain was re-initialized and re-trained over 5 seeds, given the stochastic selection process of BO. For every seed, \({R}_{g}\) converged more quickly, supporting our interpretation that ρ additionally depends on non-bonded interactions.

Fig. 3: Convergence of the Bayesian optimization for macroscopic properties.
figure 3

The plots show the absolute percentage error relative to all-atom computations as a function of the number of CGMD evaluations for density (\(\rho\)) and radius of gyration (\({R}_{g}\)). The results for multiple polymer systems demonstrate rapid convergence, typically reaching a plateau within 50 iterations (100 CGMD evaluations).

Convergence to Pareto optimal values typically occurred within ~50 iterations (100 CGMD simulation runs), as shown in Fig. 3. The evolution of the convex Pareto front (see Fig. 4) over these iterations represents the set of non-dominated solutions, where improvements in one objective (density or radius of gyration) cannot be made without degrading the other. In particular, the front effectively captures the trade-offs between these competing objectives, allowing for informed decision-making in selecting optimal CG parameters. For instance, Fig. 4 shows the convex Pareto fronts achieved with our workflow for 10-PE and 50-PE. The color gradient denotes the evolution of the front over 60 iterations, including 10 initial iterations (with 20 CGMD runs). This demonstrates that the integration of CGMD with MOBO effectively navigates the parameter space to balance fidelity and computational cost, strategically exploiting known optimal regions while systematically exploring uncertain areas.
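The quality of the front at any iteration can also be tracked quantitatively via the dominated hypervolume, which plateaus at convergence; a minimal sketch using BoTorch's box decompositions (negated-error convention as in the loop sketch above, with illustrative numbers):

```python
import torch
from botorch.utils.multi_objective.box_decompositions.dominated import (
    DominatedPartitioning,
)

ref_point = torch.tensor([-100.0, -100.0], dtype=torch.double)
# Y: all observed (negated % error) pairs so far; values here are illustrative.
Y = torch.tensor([[-35.0, -20.0], [-12.0, -9.0], [-6.0, -7.0]],
                 dtype=torch.double)
bd = DominatedPartitioning(ref_point=ref_point, Y=Y)
print(f"dominated hypervolume: {bd.compute_hypervolume().item():.1f}")
```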

Fig. 4: Pareto fronts from multi-objective Bayesian optimization.
figure 4

The plots show the evolution of the Pareto front for 10-PE and 50-PE over 60 optimization iterations (including 10 iterations for initial design). Each point represents a CG parameter set, plotted by its resulting density (\(\rho\)) and radius of gyration (\({R}_{g}\)). The color of the points indicates the iteration number. The AA target (red circle), initial Martini3 value (green triangle), and an optimal prediction from the final front (cyan circle) are highlighted.

The improvements achieved by our integrated framework are further quantified through a standard-deviation analysis comparing our optimized CG topologies with the raw Martini3 topologies over 5 seeds per polymer. Figure 5 shows the percentage error metrics for density (\(\rho\)) and radius of gyration (\({R}_{g}\)) across the polymers of interest. This figure clearly illustrates the superior performance of our optimized topologies (green line) compared to the raw Martini3 topology (red line). This MOBO solution also demonstrates the generalizability of our proposed framework, efficiently reducing discrepancies between CGMD and AAMD from a bottom-up perspective.

Fig. 5: Performance comparison of optimized topologies against the default Martini3 model.
figure 5

Absolute percentage error for density (\(\rho\)) and radius of gyration (\({R}_{g}\)). Each plot compares the error of the default CG-Martini3 model (red line) with the BO-optimized model (green line) across all 12 polymer systems. The shaded area highlights the significant improvement in accuracy achieved by the optimization framework. Error bars represent the standard deviation over five independent runs.

For PE and PS, the absolute percentage error in ρ decreases by ~30% as n increases from 3 to 50 (Fig. 5, top). However, PMMA exhibits increased errors at higher degrees of polymerization due to Martini3’s limited representation of polar-nonpolar bead interactions in elongated chains23. This aligns with prior studies57 showing that force field accuracy for polar polymers degrades with chain length when non-bonded parameterizations are insufficient. Moreover, the optimized force field consistently reduces errors in \(\rho\) and \({R}_{g}\) across all polymer families. For instance, in 3-PE, coarse-graining three atomic groups into a single bead in the Martini3 model leads to deviations in molecular behavior and conformational flexibility; the proposed framework adjusts the interaction parameters, improving the force field’s fidelity. In PMMA, the Martini3 model demonstrates higher accuracy at low polymerization degrees due to the effective separation of polar and non-polar blocks within beads, but this bead-level separation limits flexibility at higher molecular weights, increasing the error. In PS, significant errors in density are linked to benzene-ring stacking, which is inadequately represented by Martini3. Interestingly, the original Martini3 model aligns better with experimental densities than the optimized version for 50-PS, highlighting the need to balance future optimization efforts between experimental results (high-fidelity) and AAMD data (low-fidelity).

This paper introduces a MOBO framework that significantly enhances CG-Martini3 topologies for common polymers (PE, PMMA, and PS). Our findings challenge conventional raw Martini3 topologies by consistently yielding CG topologies that accurately reproduce AA-calculated macroscopic properties (density and radius of gyration), with improvements observed across all studied materials and degrees of polymerization. Furthermore, this work presents a framework for the low-dimensional parametrization of CG molecular topologies, which increases its generalizability to unseen complex polymer systems. A key strength of the proposed low-dimensional parametrization is its design for transferability across varying degrees of polymerization, moving towards a unified model for a given polymer type. The framework achieves errors under ~10% for both density and radius of gyration for all 12 polymers studied, establishing a benchmark for future model-building efforts. This improvement helps bridge the gap between low- and high-fidelity MD models, enabling accurate predictions with CGMD at a fraction of the corresponding AAMD’s computational expense.

Methods

Molecular dynamics simulations

Molecular dynamics (MD)3,58,59 simulations scale in computational time and complexity with the number of particles simulated. In general, AAMD simulations become expensive and time-consuming when thousands of atoms are simulated, because the larger number of physical interactions between atoms must be captured to provide a realistic representation of the system’s evolution under specific temperature and pressure conditions. The fundamental challenge in running AAMD simulations is thus the high computational budget they demand, which is often infeasible and, hence, a bottleneck. With the advent of CGMD, MD simulations have become more scalable to molecular systems with larger numbers of atoms. In fact, for the same molecule, AAMD simulations involve substantially higher computational costs because they simulate a larger number of particles (atoms, at higher resolution) than CGMD, which simulates fewer particles (beads, at lower resolution). Coarse graining (CG) is fundamentally aimed at simplifying complex systems; in the context of all-atom (AA) structures, it involves grouping multiple atoms into a single bead while retaining as much information as possible from the original structure and composition. Some information is always lost or approximated when one moves from an AA system to a CG system; the goal is to accept a reasonable degree of information loss in exchange for speeding up the simulation by multiple orders of magnitude while also saving computational cost.

Supplementary Fig. 1 shows a schematic of the coarse-graining procedure for a polystyrene monomer. Martini321,22,23 approximates aromatic rings (the three gray beads, which coarse-grain the aromatic ring) as well as molecular chains (the single green bead, which coarse-grains the carbon chain shown as C2H4).

Polymers of interest

The polymers we focus on in this work include polyethylene (PE), polystyrene (PS), and poly(methyl methacrylate) (PMMA) across multiple degrees of polymerization (\(n\in \{3,10,20,50\}\)). The wide applicability of these polymers60,61,62, combined with their reusability in building complex polymer systems, motivated our choice of these specific polymer systems. The AA molecular structure files (.gro/.pdb) for the aforementioned 12 polymer chains were prepared with the J-OCTA software63, with the bonded and non-bonded interactions parametrized by the General Amber Force Field (GAFF)64,65. The necessary electrostatic potential charges for each atom were calculated using Gaussian 1666 revB.01 at the RHF/6-31G(d) level of theory. Coarse-graining was performed on these AA molecular structures using Martini313,22 with the martinize267,68 and vermouth67 python codebases. The Martini3 force field is generated in accordance with the pre-defined Martini Interaction Matrix13, which contains four main types of interaction sites: polar (P), nonpolar (N), apolar (C), and charged (Q). Each main interaction site has sub-levels, and the interactions between sub-levels across different sites are captured in an interaction matrix22, with Lennard-Jones (LJ) potential values assigned to each interaction. Despite Martini’s limited applicability to polymers, it serves as an excellent starting point for mapping AA molecular structures and topologies to CG ones. Table 2 shows, for each polymer, an example of the AA-to-CG structure mapping at a degree of polymerization of 20.

Table 2 All-atom structure to Coarse-grained structure mapping using Martini321,23

Molecular dynamics simulation setup

The AAMD calculations were conducted using the GROMACS34 simulation package. The calculation of derived properties, such as the radius of gyration of the polymer chains and the density, was also performed using GROMACS. We evaluate ensembles of 100 AA molecular structures for the 12 types of polymer chains. Three-dimensional periodic boundary conditions were adopted for the simulation cell, in which 100 polymer chains were placed randomly. MD simulations were performed with a time step of 2 fs in the NPT ensemble for 100 ns, using the V-rescale thermostat (T = 300 K) and C-rescale barostat (P = 1 bar). The convergence of the radius of gyration of each polymer chain was confirmed at ~10 ns; therefore, the rest of the run was used for sampling.

All CGMD calculations were performed with GROMACS using the Martini3 force field as defined earlier. We evaluate ensembles of 100 CG polymer chains of each polymer of interest, representing a baseline level of accuracy. These ensembles were subjected to energy minimization, followed by NVT equilibration and NPT equilibration for 10 ns, with production runs of 100 ns using the V-rescale thermostat (T = 300 K) and C-rescale barostat (P = 1 bar). The convergence of the radius of gyration of each polymer chain was confirmed at ~8 ns; therefore, the rest of the run was used for sampling.
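For reference, the two target properties can be extracted from finished runs with standard GROMACS analysis tools; the sketch below is a hedged outline (file names and the index-group name `Polymer` are placeholders, and the authors' exact post-processing may differ):

```python
import subprocess
import numpy as np

def xvg_mean(path, skip_ps):
    """Average an .xvg time series after discarding the equilibration window."""
    data = np.loadtxt(path, comments=("@", "#"))
    return data[data[:, 0] > skip_ps, 1].mean()

# Bulk density from the energy file of the NPT run ("Density" is a standard
# gmx energy term); file names are placeholders.
subprocess.run("echo Density | gmx energy -f npt.edr -o density.xvg",
               shell=True, check=True)
# Radius of gyration of the chains; "Polymer" is a hypothetical index group.
subprocess.run("echo Polymer | gmx gyrate -f npt.xtc -s npt.tpr -o gyrate.xvg",
               shell=True, check=True)

rho = xvg_mean("density.xvg", skip_ps=10000)  # kg m^-3, skipping first 10 ns
rg = xvg_mean("gyrate.xvg", skip_ps=10000)    # nm, skipping first 10 ns
```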

Supplementary Fig. S2 evaluates the absolute error of the raw Martini3 force field with respect to the corresponding AA computations. We observe that the errors lie in the range of 20–30% for most cases, with a maximum error of over 60%, underscoring the inaccuracy of computing macroscopic properties from CGMD with the raw Martini3 force field.

Bayesian optimization (BO)

Optimization of expensive underlying functions is a problem endemic to multiple scientific fields of study, including material informatics69,70, bioinformatics71, manufacturing72, and economics73. Frequently, these costly predictive tools only allow querying select points in the input space, without any ability to differentiate with respect to the response, resulting in essentially “black-box” functions. BO has emerged as an efficient method for optimizing such black-box functions \(f\), formalized as

$${x}^{* }=\mathop{{\rm{argmax}}}\limits_{x\in {\mathcal{X}}}f\left(x\right)$$
(3)

where \({\mathcal{X}}\) denotes the input space over which a solution is sought and \({x}^{* }\) the input maximizing \(f.\) The gain in efficiency towards addressing problems of this type is primarily due to the ability to exploit information theory74 and Bayesian inference over the underlying function space, frequently through the creation of GP surrogates75.

Multi-output Gaussian process regression

Gaussian Processes (\({\mathcal{G}}{\mathcal{P}}{\rm{s}}\))76 are widely used probabilistic surrogate models. In the context of BO, they are leveraged to approximate the true underlying objective function over which to optimize. \({\mathcal{G}}{\mathcal{P}}{\rm{s}}\) can be viewed as probability distributions over function spaces, providing essential properties for Bayesian analysis77,78. This relationship is denoted as \(f\left(\cdot \right)\,{{\sim }}\,{\mathcal{G}}{\mathcal{P}}\left(\upsilon \left(\cdot \right),k\left(\cdot ,{\cdot }^{{\prime} }\right)\right)\), which is uniquely determined through a mean function υ(·) and a covariance function k(·,·′) parameterized by hyperparameters θ. Often, the mean function is taken to be υ ≡ 0 without loss of generality.

Given a training dataset \(\{\left({{\boldsymbol{x}}}_{n},{y}_{n}\right){\}}_{n=1}^{N}\) of N noise-corrupted observations with assumed Gaussian noise \({\xi }_{n}{\mathscr{\sim }}{\mathcal{N}}(0,{\sigma }_{y}^{2})\), the collection of all training inputs can be denoted as \({\boldsymbol{X}}\in {{\mathcal{R}}}^{N\times M}\), the vector of all outputs as \({\boldsymbol{y}}\), and the latent function values of the underlying process as \({\boldsymbol{f}}\). The particular covariance function used in this work is the automatic relevance determination squared exponential (ARD-SE)76, defined as

$$k\,\left({\boldsymbol{x}},{\boldsymbol{x}}^{\prime} \right)={\sigma }_{f}^{2}\exp \left(-\frac{1}{2}\mathop{\sum }\limits_{m=1}^{M}\frac{{\left({x}_{m}-{x}_{m}^{\prime} \right)}^{2}}{{\lambda }_{m}^{2}}\right)$$
(4)

where \({\lambda }_{m}\) is the lengthscale associated with input dimension \(m\) (of \(M\)), and \({\sigma }_{f}\) is the amplitude. The resulting set of hyperparameters for this covariance function is then \({\boldsymbol{\theta }}=\{{\boldsymbol{\lambda }},{\sigma }_{f}\}\). \(K\left({\boldsymbol{X}},{\boldsymbol{X}}^{\prime} \right)\) denotes the covariance matrix constructed using the covariance function in Eq. (4); for legibility, it is abbreviated as \({{\boldsymbol{K}}}_{{\boldsymbol{f}\,\boldsymbol{f}}}\) when constructed with the available training dataset, defining the latent process. The hyperparameters of this covariance function are inferred by maximizing the log marginal likelihood:

$$\log p\left({\boldsymbol{y}}|{\boldsymbol{X}},{\boldsymbol{\theta }}\right)=-\frac{1}{2}{{\boldsymbol{y}}}^{{\rm{\top }}}{{\boldsymbol{K}}}_{y}^{-1}{\boldsymbol{y}}-\frac{1}{2}\log |{{\boldsymbol{K}}}_{y}|-\frac{N}{2}\log 2\pi$$
(5)

where \({K}_{y}=K\left({\boldsymbol{X}},{\boldsymbol{X}}^{{{\prime}}}\right)+{\sigma }_{y}^{2}{\boldsymbol{I}}\). This base model can be expanded to handle multi-output functions analogously to the scalar-output case, by expanding the covariance matrix to express correlations between related outputs79. Such multi-output Gaussian processes (MOGPs) learn a multi-output function \(f{{:}}\,{\mathcal{X}}\to {{\mathbb{R}}}^{P}\) over the input space \({\mathcal{X}}\subseteq {{\mathbb{R}}}^{D}\). The \(p\)-th output of \(f({\boldsymbol{x}})\) is denoted \({f}_{p}({\boldsymbol{x}})\), with the complete representation given as \(f=\{{f}_{p}\left({\boldsymbol{x}}\right)\}_{p=1}^{P}\). MOGPs are likewise completely defined by their covariance function (assuming \(\upsilon \equiv 0\)), resulting in a covariance matrix \({\boldsymbol{K}}\in {{\mathbb{R}}}^{{NP}\times {NP}}\). In this work, the multi-output covariance matrix is constructed through the Linear Model of Coregionalization (LMC)79,80. This model constructs the multi-output function from a linear transformation \({\boldsymbol{W}}\in {{\mathbb{R}}}^{P\times L}\) of \(L\) independent functions \(g\left({\boldsymbol{x}}\right)={\{{g}_{l}\left({\boldsymbol{x}}\right)\}}_{l=1}^{L}\). Each function is an independent \({\mathcal{G}}{\mathcal{P}}\), \({g}_{l}({\boldsymbol{x}})\sim {\mathcal{G}}{\mathcal{P}}(0,{k}_{l}({\boldsymbol{x}},{\boldsymbol{x}}^{{{\prime}}}))\), with its own covariance function, giving the final expression \(f({\boldsymbol{x}})={\boldsymbol{W}}g({\boldsymbol{x}})\). The multi-output covariance function described by this model is then expressed as:

$$k\left(\{{\boldsymbol{x}},p\},\{{\boldsymbol{x}}^{\prime} ,p^{\prime} \}\right)=\mathop{\sum }\limits_{l=1}^{L}{W}_{{pl}}{k}_{l}\left({\boldsymbol{x}},{\boldsymbol{x}}^{\prime} \right){W}_{p^{\prime} l}$$
(6)

which can be seen to encode correlations between output dimensions.
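Equation (6) can be implemented directly; the following sketch builds the full \(NP\times NP\) covariance matrix from \(L\) independent ARD-SE kernels (Eq. (4)) and a mixing matrix \(W\) (shapes and values are illustrative only):

```python
import numpy as np

def ard_se(X1, X2, lengthscales, amplitude):
    """ARD squared-exponential kernel, Eq. (4)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return amplitude**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def lmc_covariance(X, W, kernel_params):
    """Linear Model of Coregionalization covariance, Eq. (6).

    X : (N, D) inputs;  W : (P, L) mixing matrix;
    kernel_params : list of L (lengthscales, amplitude) tuples.
    Returns the (N*P, N*P) multi-output covariance matrix.
    """
    N, P = X.shape[0], W.shape[0]
    K = np.zeros((N * P, N * P))
    for l, (ls, amp) in enumerate(kernel_params):
        K_l = ard_se(X, X, ls, amp)       # (N, N) latent kernel
        B_l = np.outer(W[:, l], W[:, l])  # (P, P) coregionalization block
        K += np.kron(B_l, K_l)            # sum_l W_pl k_l(x, x') W_p'l
    return K

rng = np.random.default_rng(0)
X = rng.random((4, 15))           # 4 designs, 15-dimensional theta
W = rng.standard_normal((2, 2))   # P = 2 outputs (rho, Rg), L = 2 latents
K = lmc_covariance(X, W, [(np.full(15, 0.5), 1.0), (np.full(15, 0.8), 0.5)])
print(K.shape)                    # (8, 8)
```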

Acquisition function

Acquisition functions are the core machinery by which subsequent points are selected to query the true underlying function. While many such utility functions exist, they all aim to strike a balance between exploration of the input space \({\mathcal{X}}\) and exploitation of prominent subspaces. Out of the variety of acquisition functions available, this work relies upon the well-established EHVI acquisition function81, given the multi-objective optimization problem over a set of target material properties.

Multi-objective optimization involves the simultaneous optimization of multiple conflicting objectives. A common goal is to approximate the Pareto front, which represents the set of non-dominated solutions. In the context of BO, the EHVI acquisition function is a widely used criterion for guiding the selection of which candidate points to evaluate. EHVI balances exploration and exploitation by quantifying the expected improvement in the hypervolume metric, a measure of Pareto front quality.

The hypervolume of a set of points in the objective space is defined as the volume of the region dominated by those points and bounded by a reference point. Let \(P\) denote the current Pareto front and \({\boldsymbol{r}}\) a reference point in the objective space. The hypervolume of \(P\) is given by:

$${HV}(P)={Volume}\,(\cup _{{\boldsymbol{p}}\in P}[{\boldsymbol{p}},{\boldsymbol{r}}])$$
(7)

where \(\left[{\boldsymbol{p}},{\boldsymbol{r}}\right]\) denotes the hyper-rectangle spanned between \({\boldsymbol{p}}\) and \({\boldsymbol{r}}\). A measure of improvement in Hypervolume (HVI) then quantifies the increase in hypervolume achieved by adding a new candidate point \({\boldsymbol{y}}\) to the Pareto front:

$${HVI}\left({\boldsymbol{y}},P\right)=\max (0,{HV}\left(P\cup \left\{{\boldsymbol{y}}\right\}\right)-{HV}(P))$$
(8)

We can then extend this notion to the creation of the EHVI acquisition function, which evaluates the expected value of the HVI under the predictive distribution of the surrogate model. Let \({\boldsymbol{Y}}\) be the random vector representing the predicted objective values at a candidate input \({\boldsymbol{x}}\). The EHVI is defined as:

$${EHVI}\left({\boldsymbol{x}},P\right)={\mathbb{E}}\left[{HVI}\left({\boldsymbol{Y}},P\right)\right]$$
(9)

where the expectation is taken with respect to the posterior distribution of \({\boldsymbol{Y}}\) conditioned on the observed data. The computation of EHVI generally requires an analytically intractable integration over the multi-objective posterior distribution, frequently performed via Monte Carlo integration.
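For the bi-objective case considered here (errors in \(\rho\) and \({R}_{g}\)), this MC estimate is simple to write out explicitly; below is a self-contained sketch under the maximization convention, with illustrative numbers:

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Dominated hypervolume of a 2-D maximization front w.r.t. a reference point."""
    pts = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:                       # staircase sweep; dominated points add 0
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def mc_ehvi(posterior_samples, pareto, ref):
    """Average hypervolume improvement over posterior samples of Y, Eq. (9)."""
    hv0 = hypervolume_2d(pareto, ref)
    gains = [
        max(0.0, hypervolume_2d(np.vstack([pareto, y[None]]), ref) - hv0)
        for y in posterior_samples
    ]
    return float(np.mean(gains))

pareto = np.array([[-6.0, -12.0], [-9.0, -7.0]])  # current front (negated errors)
samples = np.random.default_rng(1).normal([-7.0, -8.0], 1.0, size=(256, 2))
print(mc_ehvi(samples, pareto, ref=np.array([-100.0, -100.0])))
```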

The q-Expected Hypervolume Improvement (q-EHVI) is an extension of the EHVI that enables evaluation of the EHVI across a batch of q candidate points simultaneously. It measures the expected increase in hypervolume when all q candidates are jointly evaluated, incorporating correlations between their predicted objective values. Formally, q-EHVI is defined as:

$$\begin{array}{l}{\alpha }_{q-{EHVI}}\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right)={\mathbb{E}}\left[{HVI}\left(f\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right),P\right)\right]={\int }_{\!\!-\infty }^{\infty }{HVI}\left(f\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right),P\right)p\left(f\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right){|D}\right){df}\end{array}$$
(10)

where \({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}=\{{{\boldsymbol{x}}}_{1},\ldots ,{{\boldsymbol{x}}}_{{\boldsymbol{q}}}\}\) is the set of q candidate points, \(f\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right)\) are the corresponding objective values, and \(p\left(f\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right){|D}\right)\) is the joint posterior predictive distribution of the model conditioned on the observed data \(D\).

Since there is no closed-form solution for \(q > 1\) or when the predicted outcomes are correlated, the expectation is approximated using Monte Carlo (MC) integration. This involves drawing \(N\) samples \({\left\{{f}_{t}\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right)\right\}}_{t=1}^{N}\) from the joint posterior \(p\left(f\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right){|D}\right)\), giving the estimate:

$${\hat{\alpha }}_{q-{EHVI}}\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right)=\frac{1}{N}\mathop{\sum }\limits_{t=1}^{N}{HVI}\left({f}_{t}\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right),P\right)$$
(11)

Each HVI term can be computed exactly via inclusion-exclusion over the cells of a box decomposition. Letting \({z}_{k,{X}_{j},t}^{\left(m\right)}=\min \left[{u}_{k}^{\left(m\right)},\mathop{\min }\limits_{{{\boldsymbol{x}}}^{\prime}\in {X}_{j}}{f}_{t}^{\left(m\right)}\left({{\boldsymbol{x}}}^{\prime}\right)\right]\) and \({\left[\cdot \right]}_{+}=\max \left(0,\cdot \right)\), the estimate expands as:

$${\hat{\alpha }}_{q-{EHVI}}\left({{\boldsymbol{X}}}_{{\boldsymbol{cand}}}\right)=\frac{1}{N}\mathop{\sum }\limits_{t=1}^{N}\mathop{\sum }\limits_{k=1}^{K}\mathop{\sum }\limits_{j=1}^{q}\sum _{{X}_{j}\subseteq {{\boldsymbol{X}}}_{{\boldsymbol{cand}}},\,|{X}_{j}|=j}{\left(-1\right)}^{j+1}\mathop{\prod }\limits_{m=1}^{M}{\left[{z}_{k,{X}_{j},t}^{\left(m\right)}-{l}_{k}^{\left(m\right)}\right]}_{+}$$
(12)

The integration region is divided into \(K\) hyper-rectangular cells based on the current Pareto front, where \({u}_{k}^{\left(m\right)}\) and \({l}_{k}^{\left(m\right)}\) are the upper and lower bounds of the \(m\)-th objective in the \(k\)-th cell. The overall q-EHVI is obtained by summing the contributions of all active cells and accounting for the combinatorial subsets of the \(q\) candidates.

The MC estimation error decreases as \(O\left(1/\sqrt{N}\right)\) with independent samples, regardless of the dimensionality of the search space. To improve efficiency, randomized quasi-Monte Carlo methods are often employed, which reduce variance and provide faster convergence in practice.
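As an illustration of the quasi-MC point, the i.i.d. normal base samples can be replaced by scrambled Sobol points pushed through the inverse normal CDF; a minimal scipy-based sketch (illustrative only):

```python
import numpy as np
from scipy.stats import norm, qmc

# Scrambled Sobol points in (0, 1)^2, mapped to N(0, 1) via the inverse CDF.
# Using these quasi-random base samples in place of i.i.d. draws typically
# lowers the variance of the MC EHVI estimate at the same sample count N.
sobol = qmc.Sobol(d=2, scramble=True, seed=0)
base = norm.ppf(sobol.random(256))  # (256, 2) quasi-normal base samples
# A posterior draw is then, e.g., Y_t = mu + L @ base_t for a Cholesky factor L.
print(base.mean(axis=0))  # approximately zero by construction
```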