Introduction

Polymers are foundational materials that drive innovation across diverse domains. In aerospace, lightweight polymer composites enhance fuel efficiency1, while in medicine, they enable drug delivery systems and tissue engineering scaffolds2. The food industry relies on polymers for packaging, preserving quality, and extending shelf life3, reflecting their broad utility in both everyday and advanced applications4. As the demand for tailored polymeric materials grows, traditional experimental design, which is often slow and resource-intensive, is being complemented by machine learning (ML) techniques5,6 to accelerate property prediction and material optimization. Notable efforts, such as the Polymer Genome project7, demonstrate ML’s potential to transform polymer science and polymer informatics beyond homopolymers8.

Despite these advancements, polymer informatics faces persistent challenges. Scarce and noisy datasets, combined with fragmented tools and inconsistent workflows, complicate the development of reproducible ML models9,10,11,12. A critical bottleneck is the digital representation of polymer structures13. While Morgan fingerprints14,15 are commonly employed, they often fail to capture the nuanced hierarchical features of polymers, such as backbone and side chain contributions16. Additionally, standard data splitting practices, such as random cross-validation17, may not adequately test a model’s ability to extrapolate beyond its training data, a key requirement for materials discovery18: random cross-validation partitions the data into folds that originate from the same distribution. While efforts have been made to create frameworks for specific aspects of digital polymer science, such as running simulations19,20,21, there is no standardized framework for machine learning in polymer science. This contrasts with fields such as materials science22, reticular chemistry23, and small molecule research24.

To overcome these limitations, we introduce PolyMetriX, a Python library tailored for polymer informatics. PolyMetriX provides a unified framework focused on advanced feature engineering, supporting the entire ML workflow from data preparation to modeling. It includes standardized datasets, a curated glass transition temperature (Tg) database with 7367 data points, and custom data splitting strategies optimized for polymer structures. It also provides featurizers, which extract hierarchical features at the full polymer, backbone, and side chain levels25, integrating RDKit26 for robust molecular descriptors and offering specialized polymer-specific representations. Designed with a consistent application programming interface (API) inspired by Matminer27, PolyMetriX is available open-source at github.com/lamalab-org/PolyMetriX. By streamlining workflows and enhancing reproducibility, we hope that PolyMetriX contributes to the foundation for a new era of polymer informatics.


Results and discussion

The PolyMetriX package

In our framework, we aim to cover the entire lifecycle of building a machine learning model for polymer chemistry (see Fig. 1). In the following, we introduce the PolyMetriX package, detailing its dataset curation process, featurization strategies, evaluation techniques, and integration with machine learning models. We also discuss the challenges in polymer data standardization and how PolyMetriX attempts to address these issues.

Fig. 1: Overview of PolyMetriX.

The ecosystem covers the full machine learning workflow for polymer informatics, starting with standardized datasets focused on the Tg. Given a polymer’s PSMILES representation, hierarchical featurizers are applied at various structural levels (full polymer, backbone, and side chains) to generate meaningful descriptors. The framework includes advanced data splitting strategies (e.g., random, LOCOCV, and property-based splits) to support robust model training and evaluation. All components—from dataset loading to featurization and ML model integration—are accessible through a consistent and modular API, enabling seamless downstream applications in polymer property prediction.

Dataset curation and variability impact

The availability of a clean and reliable dataset is crucial for conducting ML studies. In PolyMetriX, we have curated a dataset focusing on the glass transition temperature (Tg) of polymers. We selected Tg because of its critical role in transforming polymers into practical, usable products. However, a major challenge in polymer ML arises from the fact that various models reported in the literature have been trained on diverse datasets28,29,30,31,32,33, many of which remain unpublished34,35,36,37.

To illustrate the issue of dataset incompatibility, we performed cross-tests on several existing datasets using a Gradient Boosting Regression (GBR) model: we train a model on one dataset that has been used in previous studies and test it on all the others. The results (see Supplementary Fig. 1) show large variations in predictive performance upon cross-testing, with mean absolute errors ranging from 13.79 K to 214.75 K. This clearly shows that the datasets currently used to train and test ML models in polymer chemistry are not comparable, thus hampering the reuse of prior work, and underscores the necessity of standard benchmark datasets in polymer informatics.
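The cross-testing protocol can be sketched in a few lines of plain Python. In this illustrative version, a trivial mean predictor stands in for the GBR model, and the dataset names and values are placeholders, not the datasets used in the study:

```python
# Sketch of the cross-testing protocol: train on one dataset, evaluate on all
# others. A mean predictor stands in for the GBR model; dataset contents here
# are illustrative placeholders only.

def mae(y_true, y_pred):
    """Mean absolute error in the same units as the targets (here, Kelvin)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

class MeanPredictor:
    """Trivial baseline standing in for a GradientBoostingRegressor."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

def cross_test(datasets, model_factory):
    """Train on each dataset in turn and report MAE on every other one."""
    results = {}
    for train_name, (X_tr, y_tr) in datasets.items():
        model = model_factory().fit(X_tr, y_tr)
        for test_name, (X_te, y_te) in datasets.items():
            if test_name != train_name:
                results[(train_name, test_name)] = mae(y_te, model.predict(X_te))
    return results

# Toy Tg datasets (features are placeholders; targets in Kelvin).
datasets = {
    "A": ([[0], [1]], [350.0, 360.0]),
    "B": ([[0], [1]], [250.0, 270.0]),
}
scores = cross_test(datasets, MeanPredictor)
```

Large off-diagonal errors in `scores` would signal exactly the incompatibility observed in Supplementary Fig. 1.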

To address this gap, we developed a standardized, curated Tg dataset intended to serve as a robust benchmark for future polymer ML studies, thereby enabling reliable and meaningful model comparisons.

We collected datasets from various literature sources (see Table 1), resulting in a total of nine distinct datasets comprising 8992 data points. A notable observation was the presence of duplicated data points, where polymer samples with the same repeat unit exhibited different Tg values. While this variability is expected, given that a polymer sample’s Tg can fluctuate based on chain length, dispersity38, and experimental methods39, these parameters are often not reported in the literature. Consequently, they are frequently omitted from ML models in polymer science, limiting the interpretability of predictive models. Among these duplicated entries, the maximum Z-score was 9.19. This inherent noise, as shown in Fig. 2, fundamentally limits the performance that ML models can achieve, as it is impossible to overcome this irreducible error with better learning approaches.

Table 1 Data sources
Fig. 2: Variability in Tg values across different sources for the same polymer structures, grouped by their canonical PSMILES representations.

The x-axis represents PSMILES representation of the polymers, sorted in ascending order based on the standard deviation of their experimental Tg values. The y-axis denotes the mean experimental Tg (in Kelvin). Error bars indicate the standard deviation of reported Tg values for each polymer, illustrating the variability in experimental measurements. The marker size reflects the number of available data points for each polymer, with larger markers indicating a higher number of experimental values. This highlights the substantial variability in reported Tg values across different sources, emphasizing the necessity of data curation to minimize noise and improve the reliability of machine learning models.

To enhance data reliability and mitigate variability, we implemented a curation strategy that organizes polymers with their corresponding Tg values and assigns them to one of four reliability categories that we name Black, Yellow, Gold, and Red. For this categorization, we consider—due to lack of information on the molecular weight distribution, end groups, and synthesis conditions—two polymers as identical if they share a PSMILES. This is a meaningful approximation for current work in polymer informatics, where models are commonly trained by deriving features from PSMILES alone. The workflow for this curation process is illustrated in Fig. 3.

  • Black: This category indicates uncertain reliability. It includes polymers that are unique in our dataset, so we cannot estimate robustness based on the variance of their Tg values. This category contains 7088 data points.

  • Red: This category marks unreliable data, applied to polymers with varying Tg values across sources where the Z-score exceeds 2. This category includes 4 data points.

  • Yellow: This category suggests moderately reliable data, assigned to polymers with exactly two different Tg values from distinct sources, provided the Z-score is ≤2, based on our estimate. This category includes 132 data points.

  • Gold: This category includes highly reliable data, reserved for polymers with more than two different Tg values from various sources, all with a Z-score ≤2, based on our estimate. This category comprises 143 data points.

Fig. 3: Workflow for curating the Tg dataset to standardize data and ensure reliability.

This process assigns reliability classes to data points based on their frequency of occurrence and statistical consistency. The workflow groups polymer PSMILES with associated Tg values and categorizes them according to their occurrence count: (1) unique occurrences are assigned the lowest reliability (black), (2) duplicate occurrences undergo a Z-score check where values within ±2 standard deviations are considered reliable (yellow), and (3) multiple occurrences (>2) follow the same Z-score validation to classify them into gold or red reliability categories. This structured approach ensures that the dataset maintains consistency and reduces errors due to outliers.

For a given polymer, different Tg values might be reported across sources. To address this, we considered the median Tg value for each polymer, as the median is less sensitive to extreme values than the mean.
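The categorization and aggregation rules above can be sketched in a few lines of Python. The exact statistics used internally by PolyMetriX may differ; this sketch assumes population z-scores computed per PSMILES group and applies the thresholds from the text, aggregating repeated measurements with the median:

```python
# Minimal sketch of the reliability labelling and median aggregation described
# above. Assumes population z-scores per PSMILES group; the thresholds follow
# the text (z <= 2 is considered reliable).
from statistics import mean, median, pstdev

def reliability(tg_values, z_max=2.0):
    n = len(tg_values)
    if n == 1:
        return "black"                           # unique entry: reliability unknown
    mu, sigma = mean(tg_values), pstdev(tg_values)
    if sigma == 0:
        return "yellow" if n == 2 else "gold"    # identical repeated values
    if any(abs(x - mu) / sigma > z_max for x in tg_values):
        return "red"                             # at least one outlier beyond z = 2
    return "yellow" if n == 2 else "gold"

def aggregate(tg_values):
    """Median: less sensitive to extreme values than the mean."""
    return median(tg_values)
```

For example, a polymer reported once is labelled black, while one with many mutually consistent reports is labelled gold and represented by its median Tg.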

Following this curation, we obtained 7367 unique PSMILES-Tg pairs, with canonicalized PSMILES representations40. We also performed ablation studies with the reliability classes (see Supplementary Table 1).

PolyMetriX featurization and performance

To perform ML, polymers must be represented in a form that models can use. This commonly involves creating descriptors (also known as features) that describe the polymer as a vector. An important contribution of PolyMetriX is that it provides a standardized API for the use, combination, and creation of featurizers, focusing on different aspects of polymers.

PolyMetriX featurizers are categorized into two main types: chemical and topological.

Chemical featurizers describe the composition of polymers, capturing attributes such as the number of rings, rotatable bonds, heteroatoms, and hybridization states, which influence polymer properties and behavior. Topological featurizers focus on the connectivity, describing structural and spatial arrangements like the number of side chains, backbone atom count, diverse side chain count, and side chain length, which are critical for understanding polymer structure-property relationships (Fig. 4). Supplementary Table 2 provides details on the specific featurizers implemented in PolyMetriX.

Fig. 4: Examples of chemical and topological featurizers in PolyMetriX.

Chemical featurizers include attributes such as the number of rings, rotatable bonds, heteroatoms, and hybridization states. Topological featurizers describe the connectivity of the polymer via properties like the number of side chains, backbone atom count, diverse side chain count, and side chain length. The diverse side chain count represents the number of structurally distinct side chains in a polymer.

Morgan fingerprints are widely used in polymer ML, encoding the presence or absence of substructures such as functional groups. However, they are high-dimensional and challenging to interpret. PolyBERT is a DeBERTa-based encoder-only Transformer trained on 100 million hypothetical polymer SMILES strings using masked language modeling. It generates 600-dimensional dense fingerprint vectors from PSMILES strings40. PolyMetriX introduces hierarchical featurizers that provide a compact but targeted polymer representation by considering the full polymer, side chains, and backbone structures in a modular approach.

To evaluate susceptibility to overfitting and generalization capability, we analyzed test error as a function of similarity to the training set using a leave-one-cluster-out cross-validation (LOCOCV) split and a GBR model with default settings (see Fig. 5). Morgan fingerprints were computed using PSMILES and RDKit, while PolyMetriX features were derived through hierarchical featurization at the full polymer, side chain, and backbone levels. To generate PolyBERT fingerprints, we used a pretrained PolyBERT model checkpoint from Hugging Face40. LOCOCV ensured test samples belonged to unseen clusters during training. GBR models were trained on the respective feature representations, and test errors were recorded. To quantify overfitting tendencies, we computed the maximum Tanimoto similarity between each test sample and the training set. Note that Tanimoto similarity was always computed using Morgan fingerprints for all test-training pairs, regardless of whether Morgan fingerprints, PolyBERT fingerprints, or PolyMetriX features were used in the model. This allowed for a consistent measure of structural similarity independent of the feature space used for learning.
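The nearest-neighbour similarity used to bin the test errors can be sketched as follows. Fingerprints are modelled here as sets of "on" bit indices; in practice they would be RDKit Morgan fingerprints of the PSMILES:

```python
# Sketch of the similarity analysis: for each test fingerprint, find the
# maximum Tanimoto similarity to any training fingerprint. Bit vectors are
# modelled as sets of set-bit indices.

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def max_similarity_to_train(test_fp, train_fps):
    """Nearest-neighbour similarity used to bin test errors (cf. Fig. 5)."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)

train = [{1, 2, 3, 4}, {10, 11}]
test_fp = {1, 2, 3, 9}
sim = max_similarity_to_train(test_fp, train)   # 3 shared bits / 5 total = 0.6
```

Binning test samples by this score and plotting the MAE per bin reproduces the analysis shown in Fig. 5.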

Fig. 5: Mean Absolute Error (MAE) in Kelvin (K) as a function of Tanimoto similarity between the test set and the training set.

While Morgan fingerprints exhibit improved performance with increasing similarity, the PolyMetriX featurizers (full polymer, side chain, and backbone) maintain robustness even at low similarity levels. The error bars represent the standard error of the mean for each bin. All the curves demonstrate improved performance as the similarity score increases. Notably, the advantage of the PolyMetriX featurizers at low similarity values highlights their generalization capability compared to Morgan fingerprints while being more compact than both Morgan fingerprints and PolyBERT embeddings.

Results show that Morgan fingerprints yield lower MAE with increasing Tanimoto similarity, indicating strong performance in independently and identically distributed (IID) settings but limited extrapolation to structurally dissimilar compounds. PolyBERT fingerprints show a moderate decrease in MAE with increasing similarity. In contrast, PolyMetriX features maintain relatively consistent performance across varying similarity levels despite their much lower dimensionality (28 and 72, respectively, compared to 600).

Advanced featurization and polymer systems

PolyMetriX currently implements 25 chemical featurizers and 7 topological featurizers, but its strength lies in the hierarchical application of chemical featurizers across different structural levels. This approach allows featurizers to be computed separately at the backbone, side chain, and full polymer levels (see Supplementary Note 4).

Future enhancements to PolyMetriX will aim to expand the topological featurizers, which currently capture polymer-specific connectivity patterns. Incorporating additional descriptors, such as 3D conformational descriptors that account for chain flexibility and packing behavior41, could further add value to the current framework.

Although PSMILES notation does not explicitly represent terminal groups, PolyMetriX’s modular design allows terminal groups—such as hydroxyl, carboxyl, amine, and methyl groups—to be incorporated into the backbone or side chains. This enables backbone-level and side-chain-level featurizers to quantify their chemical contributions (see Supplementary Note 5).

PolyMetriX, primarily designed for polymers, supports featurization of polymer-molecule interactions by processing molecular components such as drugs, solvents, or additives using a dedicated molecule class. This class accepts SMILES notation and generates chemical features aligned with the polymer features. This is particularly useful in combination with a comparator class implemented in PolyMetriX, which compares the chemical descriptors of polymers and molecular components to assess compatibility and interactions, using customizable comparison (e.g., absolute difference, signed difference) and aggregation (e.g., mean, max, min, and sum) methods. This enables the characterization of systems such as polymer-drug formulations or polymer-solvent mixtures (see Supplementary Notes 6 and 7).
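The comparator idea can be sketched as follows. The function name and the descriptor keys are hypothetical, not the actual PolyMetriX API; only the comparison-then-aggregation pattern follows the text:

```python
# Hypothetical sketch of the polymer-molecule comparator: apply a comparison
# function feature-wise over matching descriptor keys, then aggregate.
# Names and keys here are illustrative, not the real PolyMetriX API.

def compare(polymer_feats, molecule_feats,
            comparison=lambda p, m: abs(p - m), aggregation=max):
    """Feature-wise comparison followed by aggregation over shared keys."""
    shared = polymer_feats.keys() & molecule_feats.keys()
    diffs = [comparison(polymer_feats[k], molecule_feats[k]) for k in sorted(shared)]
    return aggregation(diffs)

polymer = {"num_rings": 2, "num_rotatable_bonds": 5, "num_heteroatoms": 1}
drug = {"num_rings": 3, "num_rotatable_bonds": 1, "num_heteroatoms": 4}

worst_mismatch = compare(polymer, drug)           # max |difference| = 4
mean_mismatch = compare(polymer, drug,
                        aggregation=lambda d: sum(d) / len(d))
```

Swapping in a signed difference or a different aggregation changes the compatibility measure without touching the featurization itself, which is the flexibility the comparator class is meant to provide.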

PolyMetriX is primarily designed for homopolymers. However, the featurizers are agnostic to polymer architecture, enabling potential extension to other polymer architectures. In its current implementation, PolyMetriX can process any polymer that can be represented with a PSMILES and hence also supports any degree of polymerization (e.g., by providing a longer PSMILES, see Supplementary Note 8).

Data splitting and model evaluation

In materials discovery, the goal is to identify novel materials with desired properties. To achieve this, machine learning models must be evaluated under realistic data splitting strategies that mimic real-world discovery scenarios42. A well-chosen splitting strategy ensures that models can generalize beyond seen data, predicting properties of structurally novel polymers rather than memorizing previously observed patterns.

Evaluating ML models under different data splitting strategies is essential to ensure their robustness and generalizability. Traditional random splitting often overestimates performance for discovery applications, as similar polymers can appear in both training and test sets. More rigorous approaches, such as LOCOCV43 and Tg-based property extrapolation, better reflect real-world challenges by testing a model’s ability to generalize to structurally dissimilar polymers.

As shown in Fig. 6, PolyMetriX featurizers consistently outperform Morgan and PolyBERT fingerprints across all splitting strategies. As expected, we observe higher errors and variances for property-based and LOCOCV splitting approaches than for random splits. This trend highlights the increased difficulty of these splitting strategies, which impose greater generalization demands on the models. Notably, combining PolyMetriX featurizers at multiple hierarchical levels (full polymer, side chain, and backbone) yields superior predictive performance, making them highly effective in both interpolation and extrapolation settings.

Fig. 6: MAE (K) for different featurization methods across three splitting strategies: random (5-fold), LOCOCV (5 clusters), and Tg-based extrapolation (5 quantile bins) using a GBR model.

PolyMetriX featurizers (side chain, backbone, and full polymer) consistently outperform Morgan fingerprints and PolyBERT fingerprints, with the best performance achieved by combining all PolyMetriX features. Error bars indicate the standard error of the mean, calculated as the standard deviation of the k-fold cross-validation divided by the square root of n, where n = 5.

Fig. 7: Data filtering process applied to the polymer Tg dataset.

The funnel diagram illustrates the sequential filtering steps, beginning with 8992 data points and reducing to 7874 after removing duplicates, handling missing values, and averaging Tg values under defined conditions.

PolyMetriX API design

PolyMetriX integrates the entire ML cycle for polymer chemistry, starting with curated datasets, featurization, and custom splitting strategies to train ML models. In the design of PolyMetriX, we aimed for an easy-to-use and modular API inspired by sklearn44 and matminer27.

The following code snippets demonstrate how to load a curated dataset, perform featurization, and train a model with splitting strategies.

PolyMetriX provides curated datasets to facilitate research in polymer chemistry. The following example demonstrates how to load the glass transition temperature dataset we described above:
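Since the exact PolyMetriX class and method names are not reproduced here, the following stand-in sketches the kind of dataset object the text describes; `TgDataset` and its methods are hypothetical names, not the real API:

```python
# Stand-in sketch of a curated dataset object exposing polymer structures,
# labels, and per-entry metadata (source, Tg range, reliability class, ...).
# The class and method names are hypothetical, not the actual PolyMetriX API.

class TgDataset:
    def __init__(self, records):
        self._records = records

    def get_structures(self):
        return [r["psmiles"] for r in self._records]

    def get_labels(self):
        return [r["tg"] for r in self._records]

    def get_metadata(self, key):
        return [r.get(key) for r in self._records]

dataset = TgDataset([
    {"psmiles": "[*]CC[*]", "tg": 195.0, "reliability": "gold"},
    {"psmiles": "[*]CC([*])c1ccccc1", "tg": 373.0, "reliability": "yellow"},
])
X_smiles = dataset.get_structures()
y = dataset.get_labels()
```

With structures, labels, and metadata behind one object, downstream featurization and model training can be written against a single stable interface.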

The dataset object provides access to polymer structures and their respective properties, making it easy to retrieve features and labels for machine learning models. Additionally, dataset objects include metadata such as the name of the polymer, the source name for the data point, the Tg range for polymers with multiple values, the number of data points contributing to those multiple Tg values, a list of the Tg values for those polymers, their standard deviation, and the reliability classes associated with the data.

PolyMetriX supports hierarchical featurization of polymers at the full polymer, side chain, and backbone levels. The example below illustrates how to perform featurization at different levels:
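As a minimal sketch of the hierarchical idea (under simplifying assumptions), a per-atom property is computed separately over the backbone, the side chains, and the full polymer, then summarized with aggregation functions into a fixed-length vector. Real PolyMetriX featurizers operate on RDKit molecules rather than the toy atom lists used here:

```python
# Hierarchical featurization sketch: aggregate one per-atom property at three
# structural levels with sum/mean/max/min, yielding a fixed-length descriptor.

AGGREGATIONS = {"sum": sum, "mean": lambda v: sum(v) / len(v),
                "max": max, "min": min}

def featurize_level(values, aggregations=("sum", "mean", "max", "min")):
    """Aggregate one per-atom property into a fixed-length feature vector."""
    return [AGGREGATIONS[name](values) for name in aggregations]

# Toy per-atom heteroatom indicator (1 = heteroatom) for each level.
backbone = [0, 0, 1, 0]
side_chain = [1, 1, 0]
full_polymer = backbone + side_chain

features = (featurize_level(backbone)
            + featurize_level(side_chain)
            + featurize_level(full_polymer))   # 12-dimensional descriptor
```

The same aggregation step is what turns variable-size polymers into the fixed-length descriptor vectors mentioned below.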

This featurization process enables obtaining features for polymer structures at different length scales. Aggregation functions (sum, mean, max, min) summarize the features to obtain fixed-length descriptor vectors.

PolyMetriX simplifies the workflow by providing dataset handling, featurization, and integration with machine learning models. Below is an example of training a machine learning model using a property-based splitting strategy:
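As a minimal plain-Python sketch (with a mean predictor standing in for the GBR model and helper names that are illustrative, not the actual PolyMetriX API), a property-based quantile split can be implemented as follows:

```python
# Property-based (quantile-bin) split sketch: targets are sorted into five
# bins; training uses the lower bins and testing the highest, forcing the
# model to extrapolate. A mean predictor stands in for the GBR model.

def quantile_bins(y, n_bins=5):
    """Assign each sample an index 0..n_bins-1 by sorted target position."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(y), n_bins - 1)
    return bins

y = [200.0, 220.0, 250.0, 260.0, 280.0, 300.0, 330.0, 350.0, 380.0, 400.0]
X = [[v / 100.0] for v in y]                           # placeholder features
bins = quantile_bins(y, n_bins=5)

train_idx = [i for i, b in enumerate(bins) if b < 4]   # lower four bins
test_idx = [i for i, b in enumerate(bins) if b == 4]   # extrapolation bin

train_mean = sum(y[i] for i in train_idx) / len(train_idx)
mae = sum(abs(y[i] - train_mean) for i in test_idx) / len(test_idx)
```

Because the test bin lies entirely above the training range, the resulting error reflects extrapolation rather than interpolation performance.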

This workflow not only simplifies polymer chemistry machine learning tasks but also introduces a standardized syntax akin to other fields, which enables rapid iteration by allowing seamless swapping of dataset components, feature extraction methods, and model configurations, fostering reproducibility and efficiency in polymer informatics research.

Conclusions

Polymer design has evolved significantly, transitioning from purely experimental approaches to computational screening and, more recently, ML-enabled discovery. While ML promises rational and accelerated polymer design, its implementation remains highly artisanal due to fragmented workflows, inconsistent featurization methods, and challenges in ensuring robust generalization across datasets. The lack of standardized polymer representations and reliable benchmarking further hinders reproducibility in polymer informatics.

To address these limitations, we introduced PolyMetriX, an open-source ecosystem designed to streamline polymer informatics. By integrating hierarchical feature representations spanning full polymer structures, backbones, and side chains, PolyMetriX surpasses conventional fingerprint-based approaches such as Morgan fingerprints, enabling more accurate structure-property predictions. Additionally, a curated Tg dataset comprising 7367 polymer entries provides a standardized benchmarking resource for future polymer ML studies. PolyMetriX also provides structured data splitting strategies, such as LOCOCV and Tg-based extrapolation splitters, ensuring that models are tested under conditions that reflect real-world material discovery challenges. The framework’s modular architecture facilitates the computational analysis of polymer-organic mixtures and multi-component systems through unified featurization protocols that maintain consistency across both polymeric and small-molecule components.

By making PolyMetriX openly available, we aim to standardize machine learning workflows in polymer informatics, fostering collaboration and accelerating data-driven polymer research. Future work will expand the featurization framework and increase the number of curated datasets. Ultimately, we envision PolyMetriX as a community-driven cornerstone for the next generation of AI-driven polymer discovery.

Methods

Data acquisition and cleaning

The dataset for glass transition temperatures (Tg) was compiled from multiple sources, as summarized in Table 1. The initial dataset contained 8992 data points.

The sources are categorized into three distinct groups: B, P, and Others (including C, D, and E). The B category represents data that we could trace back to the Bicerano Handbook45, while the P category consists of data we could trace back to the PolyInfo database46. The Others category includes data from sources (C, D, and E), where the original references or provenance details were not explicitly reported in the respective publications. As a result, while these sources provide data, their traceability to primary literature remains uncertain.

To ensure the quality and consistency of the dataset, we applied a series of preprocessing steps. Initially, the dataset contained 8992 data points. Through various cleaning stages, we reduced the dataset as follows:

First, we performed canonicalization of the polymer SMILES (PSMILES) representations. Canonicalization involves converting each PSMILES string into a unique, standardized form that represents the same polymer structure, which helps ensure consistency and removes redundancy in the dataset. This process failed for a small subset of entries due to issues such as invalid representations and the presence of more than two stars in the PSMILES (typically indicating branched polymers that could not be properly handled). After canonicalization, the dataset was reduced to 8765 data points (97.4%). Next, we removed duplicate entries based on identical PSMILES and Tg values, resulting in 8199 data points (93.5%).

Subsequently, rows with missing PSMILES values were removed, reducing the dataset to 8156 points (90.7%). Finally, for entries with identical PSMILES but different Tg values, we computed the mean Tg, provided that the standard deviation across measurements was ≤5 K. Importantly, this mean aggregation was applied only to the B sources (B1, B2, B3, and B4), as they originate from the same parent source, the Bicerano Handbook45.
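The cleaning steps above can be sketched on toy records as follows (canonicalization, done with RDKit in practice, is omitted; groups whose spread exceeds the threshold are left to the curation workflow):

```python
# Cleaning sketch: drop entries with missing PSMILES, remove exact
# (PSMILES, Tg) duplicates, and average repeated measurements of the same
# PSMILES when their standard deviation is at most 5 K.
from statistics import mean, stdev

def clean(records, std_threshold=5.0):
    records = [r for r in records if r["psmiles"]]               # drop missing
    pairs = {(r["psmiles"], r["tg"]) for r in records}           # dedupe pairs
    grouped = {}
    for psmiles, tg in pairs:
        grouped.setdefault(psmiles, []).append(tg)
    cleaned = []
    for psmiles, tgs in sorted(grouped.items()):
        if len(tgs) == 1:
            cleaned.append((psmiles, tgs[0]))
        elif stdev(tgs) <= std_threshold:
            cleaned.append((psmiles, mean(tgs)))                 # average close values
        # larger-spread groups are handled by the curation workflow instead
    return cleaned

records = [
    {"psmiles": "[*]CC[*]", "tg": 200.0},
    {"psmiles": "[*]CC[*]", "tg": 204.0},
    {"psmiles": "", "tg": 100.0},          # missing PSMILES, dropped
    {"psmiles": "[*]CO[*]", "tg": 250.0},
    {"psmiles": "[*]CO[*]", "tg": 250.0},  # exact duplicate, deduplicated
]
cleaned = clean(records)
```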

It is crucial to note that this differs from the curation workflow (Fig. 3), where we aggregated Tg values using the median instead of the mean. This distinction was made to ensure internal consistency within the B sources while maintaining robustness in the overall dataset curation process.

After these final steps, the cleaned dataset contained 7874 points, representing 87.5% of the original dataset. These refined data points were then used as the foundation for the data curation process, which is detailed in Fig. 7.

Backbone and side chain classification

In the PolyMetriX package, distinguishing between the backbone and side chains of polymers is crucial for applying chemical featurizers effectively. The Polymer class accomplishes this classification by utilizing graph theory concepts and the NetworkX library47 to analyze the polymer’s structure, represented by its PSMILES notation40.

The polymer backbone is identified based on key graph properties, including the shortest paths between connection points, cycle detection, and node degree analysis. Atoms that are not part of the backbone are classified as side chains. The classification process consists of the following steps (Fig. 8).

Fig. 8: Illustration of backbone and side chain classification in Poly(N-vinyl carbazole) using the PolyMetriX package.

(1) Connection points are denoted by asterisks (*) in the PSMILES. (2) The shortest path between these connection points is computed to form the initial backbone. (3) If cycles (e.g., aromatic rings) are present along the shortest path, their atoms are incorporated into the backbone. (4) Degree-based classification includes terminal groups (nodes with degree 1) that are attached to the backbone. (5) Atoms not included in the backbone are assigned as side chains.

The polymer is represented as a graph, where nodes correspond to atoms, and edges represent chemical bonds. This graph is constructed using RDKit26 and NetworkX47.

In PSMILES notation, asterisks (*) indicate the connection points of the polymer.

To determine the backbone structure, the shortest paths between the connection points (atoms labeled with asterisks) are computed. These paths serve as an initial backbone representation.

If cycles are present within the shortest paths, all atoms forming these cycles are included in the backbone. In this context, a cycle refers to a closed sequence of bonds forming a ring-like structure, such as the benzene ring in styrene-based polymers.

Nodes with a degree of 1 (connected to only one other node) that are attached to the backbone are also included in the backbone. These are typically terminal groups.

Atoms that are not included in the backbone, based on the above criteria, are classified as side chains.
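A toy version of these classification steps is sketched below. PolyMetriX builds the real graph from PSMILES with RDKit and NetworkX; here a hand-written adjacency list stands in, and ring handling (step 3) is omitted for brevity:

```python
# Toy backbone/side-chain classification: the backbone is the shortest path
# between the two connection points plus degree-1 terminals attached to it;
# everything else is a side chain. Ring handling is omitted in this sketch.
from collections import deque

def shortest_path(adj, start, goal):
    """Breadth-first shortest path between two nodes."""
    queue, prev = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nb in adj[node]:
            if nb not in prev:
                prev[nb] = node
                queue.append(nb)
    return []

def classify(adj, stars):
    backbone = set(shortest_path(adj, *stars))
    # Degree-1 terminals attached to the backbone join the backbone (step 4).
    for node, neighbours in adj.items():
        if len(neighbours) == 1 and neighbours[0] in backbone:
            backbone.add(node)
    return backbone, set(adj) - backbone

# Toy graph for [*]CC(CC)[*]: nodes 0 and 5 are the connection points (*),
# nodes 1-2 the backbone carbons, nodes 3-4 an ethyl side chain.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
backbone, side_chains = classify(adj, (0, 5))
```

Here the shortest path 0-1-2-5 becomes the backbone, while the ethyl branch (nodes 3 and 4) is classified as a side chain.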

Splitters in PolyMetriX

In this study, we implemented a random splitter using the scikit-learn package48. Random splitting is a commonly used approach in machine learning, but it might not be the best choice for measuring real-world impact49, as both the training and test sets can share similar structural patterns. This phenomenon is demonstrated in Fig. 9, where the folds in the random splitter appear closely distributed. As a result, machine learning models tend to perform exceptionally well on random splits because they are not rigorously tested on their generalization ability.

Fig. 9: Visualization of different data splitting strategies used in PolyMetriX.

Left: The Random splitter demonstrates high overlap between different folds, making it less challenging for ML models. Right: The LOCOCV splitter assigns structurally similar data points to separate clusters, preventing information leakage. Bottom: The Property-based splitter divides data into five quantile bins based on Tg values, creating a more rigorous test for model generalization.

To provide a more challenging evaluation of model performance, we incorporated a property-based splitter. This method partitions the data into quantized bins based on Tg values, creating five distinct groups. As shown in Fig. 9, the bins are well separated, with extreme values forming distinct clusters at the left and right, while intermediate values are clustered in the central three bins. This splitting strategy presents a significantly more stringent test for model generalization.

Additionally, we employed the LOCOCV splitter, implemented using the mofdscribe package23. LOCOCV ensures that each fold consists of structurally distinct clusters, making it a robust method for evaluating generalization performance. Unlike random splitting, which allows structural overlap between the training and test sets, LOCOCV strictly enforces separation, leading to a more realistic assessment of how models perform on unseen polymer structures.
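The LOCOCV splitting logic can be sketched as follows. Cluster labels are assumed to be precomputed (in PolyMetriX this clustering is delegated to mofdscribe); each fold then holds out one entire cluster so that no test polymer shares a cluster with the training set:

```python
# LOCOCV-style split sketch: hold out one cluster per fold, given precomputed
# cluster labels (one label per data point).

def lococv_folds(cluster_labels):
    """Yield (train_indices, test_indices) with one cluster held out per fold."""
    for held_out in sorted(set(cluster_labels)):
        test = [i for i, c in enumerate(cluster_labels) if c == held_out]
        train = [i for i, c in enumerate(cluster_labels) if c != held_out]
        yield train, test

labels = [0, 0, 1, 2, 1, 2, 0]          # toy cluster assignments
folds = list(lococv_folds(labels))      # three folds, one per cluster
```

Because train and test indices never share a cluster label, every fold measures performance on structurally distinct polymers.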

For the modeling tasks, we used GBR from the scikit-learn package48. GBR50 is an ensemble learning technique that builds a series of weak predictive models, typically decision trees, and combines them to create a strong, accurate predictor.

We employed GBR with its default settings, without performing hyperparameter optimization. This decision was made because the primary focus of this work is on polymer featurization and how these features can be used for training and downstream tasks, rather than optimizing model performance.