Introduction

Polymers are foundational materials that drive innovation across diverse domains. In aerospace, lightweight polymer composites enhance fuel efficiency1, while in medicine, they enable drug delivery systems and tissue engineering scaffolds2. The food industry relies on polymers for packaging, preserving quality, and extending shelf life3, reflecting their broad utility in both everyday and advanced applications4. As the demand for tailored polymeric materials grows, traditional experimental design, which is often slow and resource-intensive, is being complemented by machine learning (ML) techniques5,6 to accelerate property prediction and material optimization. Notable efforts, such as the Polymer Genome project7, demonstrate ML’s potential to transform polymer science and polymer informatics beyond homopolymers8.

Despite these advancements, polymer informatics faces persistent challenges. Scarce and noisy datasets, combined with fragmented tools and inconsistent workflows, complicate the development of reproducible ML models9,10,11,12. A critical bottleneck is the digital representation of polymer structures13. While Morgan fingerprints14,15 are commonly employed, they often fail to capture the nuanced hierarchical features of polymers, such as backbone and side chain contributions16. Additionally, standard data splitting practices, such as random cross-validation17, may not adequately test a model’s ability to extrapolate beyond its training data, a key requirement for materials discovery18: random cross-validation partitions the data into folds that originate from the same distribution. While efforts have been made to create frameworks for specific aspects of digital polymer science, such as running simulations19,20,21, there is no standardized framework for machine learning in polymer science. This contrasts with fields such as materials science22, reticular chemistry23, and small molecule research24.

To overcome these limitations, we introduce PolyMetriX, a Python library tailored for polymer informatics. PolyMetriX provides a unified framework focused on advanced feature engineering, supporting the entire ML workflow from data preparation to modeling. It includes standardized datasets, a curated glass transition temperature (Tg) database with 7367 data points, and custom data splitting strategies optimized for polymer structures. It also provides featurizers, which extract hierarchical features at the full polymer, backbone, and side chain levels25, integrating RDKit26 for robust molecular descriptors and offering specialized polymer-specific representations. Designed with a consistent application programming interface (API) inspired by Matminer27, PolyMetriX is available open-source at github.com/lamalab-org/PolyMetriX. By streamlining workflows and enhancing reproducibility, we hope that PolyMetriX contributes to the foundation for a new era of polymer informatics.


Results and discussion

The PolyMetriX package

In our framework, we aim to cover the entire lifecycle of building a machine learning model for polymer chemistry (see Fig. 1). In the following, we introduce the PolyMetriX package, detailing its dataset curation process, featurization strategies, evaluation techniques, and integration with machine learning models. We also discuss the challenges in polymer data standardization and how PolyMetriX attempts to address these issues.

Fig. 1: Overview of PolyMetriX.

The ecosystem covers the full machine learning workflow for polymer informatics, starting with standardized datasets focused on the Tg. Given a polymer’s PSMILES representation, hierarchical featurizers are applied at various structural levels (full polymer, backbone, and side chains) to generate meaningful descriptors. The framework includes advanced data splitting strategies (e.g., random, LOCOCV, and property-based splits) to support robust model training and evaluation. All components—from dataset loading to featurization and ML model integration—are accessible through a consistent and modular API, enabling seamless downstream applications in polymer property prediction.

Dataset curation and variability impact

The availability of a clean and reliable dataset is crucial for conducting ML studies. In PolyMetriX, we have curated a dataset focusing on the glass transition temperature (Tg) of polymers. We selected Tg because of its critical role in transforming polymers into practical, usable products. However, a major challenge in polymer ML arises from the fact that various models reported in the literature have been trained on diverse datasets28,29,30,31,32,33, many of which remain unpublished34,35,36,37.

To illustrate the issue of dataset incompatibility, we performed cross-tests on several existing datasets using a Gradient Boosting Regression (GBR) model: we train a model on one dataset that has been used in previous studies and test it on all the others. The results (see Supplementary Fig. 1) show large variations in predictive performance upon cross-testing, with mean absolute errors ranging from 13.79 K to 214.75 K. This clearly shows that the datasets currently used to train and test ML models in polymer chemistry are not comparable, thus hampering the reuse of prior work, and underscores the necessity of standard benchmark datasets in polymer informatics.
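The cross-testing protocol can be sketched in a few lines of plain Python. In this illustrative version, a trivial mean predictor stands in for the GBR model, and the dataset names and values are placeholders, not the datasets used in the study:

```python
# Sketch of the cross-testing protocol: train on one dataset, evaluate on all
# others. A mean predictor stands in for the GBR model; dataset contents here
# are illustrative placeholders only.

def mae(y_true, y_pred):
    """Mean absolute error in the same units as the targets (here, Kelvin)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

class MeanPredictor:
    """Trivial baseline standing in for a GradientBoostingRegressor."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

def cross_test(datasets, model_factory):
    """Train on each dataset in turn and report MAE on every other one."""
    results = {}
    for train_name, (X_tr, y_tr) in datasets.items():
        model = model_factory().fit(X_tr, y_tr)
        for test_name, (X_te, y_te) in datasets.items():
            if test_name != train_name:
                results[(train_name, test_name)] = mae(y_te, model.predict(X_te))
    return results

# Toy Tg datasets (features are placeholders; targets in Kelvin).
datasets = {
    "A": ([[0], [1]], [350.0, 360.0]),
    "B": ([[0], [1]], [250.0, 270.0]),
}
scores = cross_test(datasets, MeanPredictor)
```

Large off-diagonal errors in `scores` would signal exactly the incompatibility observed in Supplementary Fig. 1.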

To address this gap, we developed a standardized, curated Tg dataset intended to serve as a robust benchmark for future polymer ML studies, thereby enabling reliable and meaningful model comparisons.

We collected datasets from various literature sources (see Table 1), resulting in a total of nine distinct datasets comprising 8992 data points. A notable observation was the presence of duplicated data points, where polymer samples with the same repeat unit exhibited different Tg values. While this variability is expected, given that a polymer sample’s Tg can fluctuate based on chain length, dispersity38, and experimental methods39, these parameters are often not reported in the literature. Consequently, they are frequently omitted from ML models in polymer science, limiting the interpretability of predictive models. Among these duplicated entries, the maximum Z-score was 9.19. This inherent noise, as shown in Fig. 2, fundamentally limits the performance that ML models can achieve, as it is impossible to overcome this irreducible error with better learning approaches.

Table 1 Data sources
Fig. 2: Variability in Tg values across different sources for the same polymer structures, grouped by their canonical PSMILES representations.

The x-axis represents PSMILES representation of the polymers, sorted in ascending order based on the standard deviation of their experimental Tg values. The y-axis denotes the mean experimental Tg (in Kelvin). Error bars indicate the standard deviation of reported Tg values for each polymer, illustrating the variability in experimental measurements. The marker size reflects the number of available data points for each polymer, with larger markers indicating a higher number of experimental values. This highlights the substantial variability in reported Tg values across different sources, emphasizing the necessity of data curation to minimize noise and improve the reliability of machine learning models.

To enhance data reliability and mitigate variability, we implemented a curation strategy that organizes polymers with their corresponding Tg values and assigns them to one of four reliability categories that we name Black, Yellow, Gold, and Red. For this categorization, we consider—due to lack of information on the molecular weight distribution, end groups, and synthesis conditions—two polymers as identical if they share a PSMILES. This is a meaningful approximation for current work in polymer informatics, where models are commonly trained by deriving features from PSMILES alone. The workflow for this curation process is illustrated in Fig. 3.

  • Black: This category indicates uncertain reliability. It includes polymers that are unique in our dataset, so we cannot estimate robustness based on the variance of their Tg values. This category contains 7088 data points.

  • Red: This category marks unreliable data, applied to polymers with varying Tg values across sources where the Z-score exceeds 2. This category includes 4 data points.

  • Yellow: This category suggests moderately reliable data, assigned to polymers with exactly two different Tg values from distinct sources, provided the Z-score is ≤2, based on our estimate. This category includes 132 data points.

  • Gold: This category includes highly reliable data, reserved for polymers with more than two different Tg values from various sources, all with a Z-score ≤2, based on our estimate. This category comprises 143 data points.

Fig. 3: Workflow for curating the Tg dataset to standardize data and ensure reliability.

This process assigns reliability classes to data points based on their frequency of occurrence and statistical consistency. The workflow groups polymer PSMILES with associated Tg values and categorizes them according to their occurrence count: (1) unique occurrences are assigned the lowest reliability (black), (2) duplicate occurrences undergo a Z-score check where values within ±2 standard deviations are considered reliable (yellow), and (3) multiple occurrences (>2) follow the same Z-score validation to classify them into gold or red reliability categories. This structured approach ensures that the dataset maintains consistency and reduces errors due to outliers.

For a given polymer, different Tg values might be reported across sources. To address this, we considered the median Tg value for each polymer, as the median is less sensitive to extreme values than the mean.
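The categorization and aggregation rules above can be sketched in a few lines of Python. The exact statistics used internally by PolyMetriX may differ; this sketch assumes population z-scores computed per PSMILES group and applies the thresholds from the text, aggregating repeated measurements with the median:

```python
# Minimal sketch of the reliability labelling and median aggregation described
# above. Assumes population z-scores per PSMILES group; the thresholds follow
# the text (z <= 2 is considered reliable).
from statistics import mean, median, pstdev

def reliability(tg_values, z_max=2.0):
    n = len(tg_values)
    if n == 1:
        return "black"                           # unique entry: reliability unknown
    mu, sigma = mean(tg_values), pstdev(tg_values)
    if sigma == 0:
        return "yellow" if n == 2 else "gold"    # identical repeated values
    if any(abs(x - mu) / sigma > z_max for x in tg_values):
        return "red"                             # at least one outlier beyond z = 2
    return "yellow" if n == 2 else "gold"

def aggregate(tg_values):
    """Median: less sensitive to extreme values than the mean."""
    return median(tg_values)
```

For example, a polymer reported once is labelled black, while one with many mutually consistent reports is labelled gold and represented by its median Tg.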

Following this curation, we obtained 7367 unique PSMILES-Tg pairs, with canonicalized PSMILES representations40. We also performed ablation studies with the reliability classes (see Supplementary Table 1).

PolyMetriX featurization and performance

To perform ML, polymers must be represented in a form that models can use. This commonly involves creating descriptors (also known as features) that describe the polymer as a vector. An important contribution of PolyMetriX is that it provides a standardized API for the use, combination, and creation of featurizers, focusing on different aspects of polymers.

PolyMetriX featurizers are categorized into two main types: chemical and topological.

Chemical featurizers describe the composition of polymers, capturing attributes such as the number of rings, rotatable bonds, heteroatoms, and hybridization states, which influence polymer properties and behavior. Topological featurizers focus on the connectivity, describing structural and spatial arrangements like the number of side chains, backbone atom count, diverse side chain count, and side chain length, which are critical for understanding polymer structure-property relationships (Fig. 4). Supplementary Table 2 provides details on the specific featurizers implemented in PolyMetriX.

Fig. 4: Examples of chemical and topological featurizers in PolyMetriX.

Chemical featurizers include attributes such as the number of rings, rotatable bonds, heteroatoms, and hybridization states. Topological featurizers describe the connectivity of the polymer via properties like the number of side chains, backbone atom count, diverse side chain count, and side chain length. The diverse side chain count represents the number of structurally distinct side chains in a polymer.

Morgan fingerprints are widely used in polymer ML, encoding the presence or absence of substructures such as functional groups. However, they are high-dimensional and challenging to interpret. PolyBERT is a DeBERTa-based encoder-only Transformer trained on 100 million hypothetical polymer SMILES strings using masked language modeling. It generates 600-dimensional dense fingerprint vectors from PSMILES strings40. PolyMetriX introduces hierarchical featurizers that provide a compact but targeted polymer representation by considering the full polymer, side chains, and backbone structures in a modular approach.

To evaluate susceptibility to overfitting and generalization capability, we analyzed test error as a function of similarity to the training set using a leave-one-cluster-out cross-validation (LOCOCV) split and a GBR model with default settings (see Fig. 5). Morgan fingerprints were computed using PSMILES and RDKit, while PolyMetriX features were derived through hierarchical featurization at the full polymer, side chain, and backbone levels. To generate PolyBERT fingerprints, we used a pretrained PolyBERT model checkpoint from Hugging Face40. LOCOCV ensured test samples belonged to unseen clusters during training. GBR models were trained on the respective feature representations, and test errors were recorded. To quantify overfitting tendencies, we computed the maximum Tanimoto similarity between each test sample and the training set. Note that Tanimoto similarity was always computed using Morgan fingerprints for all test-training pairs, regardless of whether Morgan fingerprints, PolyBERT fingerprints, or PolyMetriX features were used in the model. This allowed for a consistent measure of structural similarity independent of the feature space used for learning.
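The nearest-neighbour similarity used to bin the test errors can be sketched as follows. Fingerprints are modelled here as sets of "on" bit indices; in practice they would be RDKit Morgan fingerprints of the PSMILES:

```python
# Sketch of the similarity analysis: for each test fingerprint, find the
# maximum Tanimoto similarity to any training fingerprint. Bit vectors are
# modelled as sets of set-bit indices.

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def max_similarity_to_train(test_fp, train_fps):
    """Nearest-neighbour similarity used to bin test errors (cf. Fig. 5)."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)

train = [{1, 2, 3, 4}, {10, 11}]
test_fp = {1, 2, 3, 9}
sim = max_similarity_to_train(test_fp, train)   # 3 shared bits / 5 total = 0.6
```

Binning test samples by this score and plotting the MAE per bin reproduces the analysis shown in Fig. 5.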

Fig. 5: Mean Absolute Error (MAE) in Kelvin (K) as a function of Tanimoto similarity between the test set and the training set.

While Morgan fingerprints exhibit improved performance with increasing similarity, the PolyMetriX featurizers (full polymer, side chain, and backbone) maintain robustness even at low similarity levels. The error bars represent the standard error of the mean for each bin. All the curves demonstrate improved performance as the similarity score increases. Notably, the advantage of the PolyMetriX featurizers at low similarity values highlights their generalization capability compared to Morgan fingerprints while being more compact than both Morgan fingerprints and PolyBERT embeddings.

Results show that Morgan fingerprints yield lower MAE with increasing Tanimoto similarity, indicating strong performance in independently and identically distributed (IID) settings but limited extrapolation to structurally dissimilar compounds. PolyBERT fingerprints show a moderate decrease in MAE with increasing similarity. In contrast, PolyMetriX features maintain relatively consistent performance across varying similarity levels despite their much lower dimensionality (28 and 72, respectively, compared to 600).

Advanced featurization and polymer systems

PolyMetriX currently implements 25 chemical featurizers and 7 topological featurizers, but its strength lies in the hierarchical application of chemical featurizers across different structural levels. This approach allows featurizers to be computed separately at the backbone, side chain, and full polymer levels (see Supplementary Note 4).

Future enhancements to PolyMetriX will aim to expand the topological featurizers, which currently capture polymer-specific connectivity patterns. Incorporating additional descriptors, such as 3D conformational descriptors that account for chain flexibility and packing behavior41, could further add value to the current framework.

Although PSMILES notation does not explicitly represent terminal groups, PolyMetriX’s modular design allows terminal groups—such as hydroxyl, carboxyl, amine, and methyl groups—to be incorporated into the backbone or side chains. This enables backbone-level and side-chain-level featurizers to quantify their chemical contributions (see Supplementary Note 5).

PolyMetriX, primarily designed for polymers, supports featurization of polymer-molecule interactions by processing molecular components such as drugs, solvents, or additives using a dedicated molecule class. This class accepts SMILES notation and generates chemical features aligned with the polymer features. This is particularly useful in combination with a comparator class implemented in PolyMetriX, which compares the chemical descriptors of polymers and molecular components to assess compatibility and interactions, using customizable comparison (e.g., absolute difference, signed difference) and aggregation (e.g., mean, max, min, and sum) methods. This enables the characterization of systems such as polymer-drug formulations or polymer-solvent mixtures (see Supplementary Notes 6 and 7).
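The comparator idea can be sketched as follows. The function name and the descriptor keys are hypothetical, not the actual PolyMetriX API; only the comparison-then-aggregation pattern follows the text:

```python
# Hypothetical sketch of the polymer-molecule comparator: apply a comparison
# function feature-wise over matching descriptor keys, then aggregate.
# Names and keys here are illustrative, not the real PolyMetriX API.

def compare(polymer_feats, molecule_feats,
            comparison=lambda p, m: abs(p - m), aggregation=max):
    """Feature-wise comparison followed by aggregation over shared keys."""
    shared = polymer_feats.keys() & molecule_feats.keys()
    diffs = [comparison(polymer_feats[k], molecule_feats[k]) for k in sorted(shared)]
    return aggregation(diffs)

polymer = {"num_rings": 2, "num_rotatable_bonds": 5, "num_heteroatoms": 1}
drug = {"num_rings": 3, "num_rotatable_bonds": 1, "num_heteroatoms": 4}

worst_mismatch = compare(polymer, drug)           # max |difference| = 4
mean_mismatch = compare(polymer, drug,
                        aggregation=lambda d: sum(d) / len(d))
```

Swapping in a signed difference or a different aggregation changes the compatibility measure without touching the featurization itself, which is the flexibility the comparator class is meant to provide.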

PolyMetriX is primarily designed for homopolymers. However, the featurizers are agnostic to polymer architecture, enabling potential extension to other polymer architectures. In its current implementation, PolyMetriX can process any polymer that can be represented with a PSMILES and hence also supports any degree of polymerization (e.g., by providing a longer PSMILES, see Supplementary Note 8).

Data splitting and model evaluation

In materials discovery, the goal is to identify novel materials with desired properties. To achieve this, machine learning models must be evaluated under realistic data splitting strategies that mimic real-world discovery scenarios42. A well-chosen splitting strategy ensures that models can generalize beyond seen data, predicting properties of structurally novel polymers rather than memorizing previously observed patterns.

Evaluating ML models under different data splitting strategies is essential to ensure their robustness and generalizability. Traditional random splitting often overestimates performance for discovery applications, as similar polymers can appear in both training and test sets. More rigorous approaches, such as LOCOCV43 and Tg-based property extrapolation, better reflect real-world challenges by testing a model’s ability to generalize to structurally dissimilar polymers.

As shown in Fig. 6, PolyMetriX featurizers consistently outperform Morgan and PolyBERT fingerprints across all splitting strategies. As expected, we observe higher errors and variances for property-based and LOCOCV splitting approaches than for random splits. This trend highlights the increased difficulty of these splitting strategies, which impose greater generalization demands on the models. Notably, combining PolyMetriX featurizers at multiple hierarchical levels (full polymer, side chain, and backbone) yields superior predictive performance, making them highly effective in both interpolation and extrapolation settings.

Fig. 6: MAE (K) for different featurization methods across three splitting strategies: random (5-fold), LOCOCV (5 clusters), and Tg-based extrapolation (5 quantile bins) using a GBR model.

PolyMetriX featurizers (side chain, backbone, and full polymer) consistently outperform Morgan fingerprints and PolyBERT fingerprints, with the best performance achieved by combining all PolyMetriX features. Error bars indicate the standard error of the mean, calculated as the standard deviation of the k-fold cross-validation divided by the square root of n, where n = 5.

Fig. 7: Data filtering process applied to the polymer Tg dataset.

The funnel diagram illustrates the sequential filtering steps, beginning with 8992 data points and reducing to 7874 after removing duplicates, handling missing values, and averaging Tg values under defined conditions.

PolyMetriX API design

PolyMetriX integrates the entire ML cycle for polymer chemistry, starting with curated datasets, featurization, and custom splitting strategies to train ML models. In the design of PolyMetriX, we aimed for an easy-to-use and modular API inspired by sklearn44 and matminer27.

The following code snippets demonstrate how to load a curated dataset, perform featurization, and train a model with splitting strategies.

PolyMetriX provides curated datasets to facilitate research in polymer chemistry. The following example demonstrates how to load the glass transition temperature dataset we described above:
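Since the exact PolyMetriX class and method names are not reproduced here, the following stand-in sketches the kind of dataset object the text describes; `TgDataset` and its methods are hypothetical names, not the real API:

```python
# Stand-in sketch of a curated dataset object exposing polymer structures,
# labels, and per-entry metadata (source, Tg range, reliability class, ...).
# The class and method names are hypothetical, not the actual PolyMetriX API.

class TgDataset:
    def __init__(self, records):
        self._records = records

    def get_structures(self):
        return [r["psmiles"] for r in self._records]

    def get_labels(self):
        return [r["tg"] for r in self._records]

    def get_metadata(self, key):
        return [r.get(key) for r in self._records]

dataset = TgDataset([
    {"psmiles": "[*]CC[*]", "tg": 195.0, "reliability": "gold"},
    {"psmiles": "[*]CC([*])c1ccccc1", "tg": 373.0, "reliability": "yellow"},
])
X_smiles = dataset.get_structures()
y = dataset.get_labels()
```

With structures, labels, and metadata behind one object, downstream featurization and model training can be written against a single stable interface.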

The dataset object provides access to polymer structures and their respective properties, making it easy to retrieve features and labels for machine learning models. Additionally, dataset objects include metadata such as the name of the polymer, the source name for the data point, the Tg range for polymers with multiple values, the number of data points contributing to those multiple Tg values, a list of the Tg values for those polymers, their standard deviation, and the reliability classes associated with the data.

PolyMetriX supports hierarchical featurization of polymers at the full polymer, side chain, and backbone levels. The example below illustrates how to perform featurization at different levels:
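As a minimal sketch of the hierarchical idea (under simplifying assumptions), a per-atom property is computed separately over the backbone, the side chains, and the full polymer, then summarized with aggregation functions into a fixed-length vector. Real PolyMetriX featurizers operate on RDKit molecules rather than the toy atom lists used here:

```python
# Hierarchical featurization sketch: aggregate one per-atom property at three
# structural levels with sum/mean/max/min, yielding a fixed-length descriptor.

AGGREGATIONS = {"sum": sum, "mean": lambda v: sum(v) / len(v),
                "max": max, "min": min}

def featurize_level(values, aggregations=("sum", "mean", "max", "min")):
    """Aggregate one per-atom property into a fixed-length feature vector."""
    return [AGGREGATIONS[name](values) for name in aggregations]

# Toy per-atom heteroatom indicator (1 = heteroatom) for each level.
backbone = [0, 0, 1, 0]
side_chain = [1, 1, 0]
full_polymer = backbone + side_chain

features = (featurize_level(backbone)
            + featurize_level(side_chain)
            + featurize_level(full_polymer))   # 12-dimensional descriptor
```

The same aggregation step is what turns variable-size polymers into the fixed-length descriptor vectors mentioned below.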

This featurization process enables obtaining features for polymer structures at different length scales. Aggregation functions (sum, mean, max, min) summarize the features to obtain fixed-length descriptor vectors.

PolyMetriX simplifies the workflow by providing dataset handling, featurization, and integration with machine learning models. Below is an example of training a machine learning model using a property-based splitting strategy:
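As a minimal plain-Python sketch (with a mean predictor standing in for the GBR model and helper names that are illustrative, not the actual PolyMetriX API), a property-based quantile split can be implemented as follows:

```python
# Property-based (quantile-bin) split sketch: targets are sorted into five
# bins; training uses the lower bins and testing the highest, forcing the
# model to extrapolate. A mean predictor stands in for the GBR model.

def quantile_bins(y, n_bins=5):
    """Assign each sample an index 0..n_bins-1 by sorted target position."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(y), n_bins - 1)
    return bins

y = [200.0, 220.0, 250.0, 260.0, 280.0, 300.0, 330.0, 350.0, 380.0, 400.0]
X = [[v / 100.0] for v in y]                           # placeholder features
bins = quantile_bins(y, n_bins=5)

train_idx = [i for i, b in enumerate(bins) if b < 4]   # lower four bins
test_idx = [i for i, b in enumerate(bins) if b == 4]   # extrapolation bin

train_mean = sum(y[i] for i in train_idx) / len(train_idx)
mae = sum(abs(y[i] - train_mean) for i in test_idx) / len(test_idx)
```

Because the test bin lies entirely above the training range, the resulting error reflects extrapolation rather than interpolation performance.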

This workflow not only simplifies polymer chemistry machine learning tasks but also introduces a standardized syntax akin to other fields, which enables rapid iteration by allowing seamless swapping of dataset components, feature extraction methods, and model configurations, fostering reproducibility and efficiency in polymer informatics research.

Conclusions

Polymer design has evolved significantly, transitioning from purely experimental approaches to computational screening and, more recently, ML-enabled discovery. While ML promises rational and accelerated polymer design, its implementation remains highly artisanal due to fragmented workflows, inconsistent featurization methods, and challenges in ensuring robust generalization across datasets. The lack of standardized polymer representations and reliable benchmarking further hinders reproducibility in polymer informatics.

To address these limitations, we introduced PolyMetriX, an open-source ecosystem designed to streamline polymer informatics. By integrating hierarchical feature representations spanning full polymer structures, backbones, and side chains, PolyMetriX surpasses conventional fingerprint-based approaches such as Morgan fingerprints, enabling more accurate structure-property predictions. Additionally, a curated Tg dataset comprising 7367 polymer entries provides a standardized benchmarking resource for future polymer ML studies. PolyMetriX also provides structured data splitting strategies, such as LOCOCV and Tg-based extrapolation splitters, ensuring that models are tested under conditions that reflect real-world material discovery challenges. The framework’s modular architecture facilitates the computational analysis of polymer-organic mixtures and multi-component systems through unified featurization protocols that maintain consistency across both polymeric and small-molecule components.

By making PolyMetriX openly available, we aim to standardize machine learning workflows in polymer informatics, fostering collaboration and accelerating data-driven polymer research. Future work will expand the featurization framework and increase the number of curated datasets. Ultimately, we envision PolyMetriX as a community-driven cornerstone for the next generation of AI-driven polymer discovery.

Methods

Data acquisition and cleaning

The dataset for glass transition temperatures (Tg) was compiled from multiple sources, as summarized in Table 1. The initial dataset contained 8992 data points.

The sources are categorized into three distinct groups: B, P, and Others (including C, D, and E). The B category represents data that we could trace back to the Bicerano Handbook45, while the P category consists of data we could trace back to the PolyInfo database46. The Others category includes data from sources (C, D, and E), where the original references or provenance details were not explicitly reported in the respective publications. As a result, while these sources provide data, their traceability to primary literature remains uncertain.

To ensure the quality and consistency of the dataset, we applied a series of preprocessing steps. Initially, the dataset contained 8992 data points. Through various cleaning stages, we reduced the dataset as follows:

First, we performed canonicalization of the polymer SMILES (PSMILES) representations. Canonicalization involves converting each PSMILES string into a unique, standardized form that represents the same polymer structure, which helps ensure consistency and removes redundancy in the dataset. This process failed for a small subset of entries due to issues such as invalid representations and the presence of more than two stars in the PSMILES (typically indicating branched polymers that could not be properly handled). After canonicalization, the dataset was reduced to 8765 data points (97.4%). Next, we removed duplicate entries based on identical PSMILES and Tg values, resulting in 8199 data points (93.5%).

Subsequently, rows with missing PSMILES values were removed, reducing the dataset to 8156 points (90.7%). Finally, for entries with identical PSMILES but different Tg values, we computed the mean Tg, provided that the standard deviation across measurements was ≤5 K. Importantly, this mean aggregation was applied only to the B sources (B1, B2, B3, and B4), as they originate from the same parent source, the Bicerano Handbook45.
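The cleaning steps above can be sketched on toy records as follows (canonicalization, done with RDKit in practice, is omitted; groups whose spread exceeds the threshold are left to the curation workflow):

```python
# Cleaning sketch: drop entries with missing PSMILES, remove exact
# (PSMILES, Tg) duplicates, and average repeated measurements of the same
# PSMILES when their standard deviation is at most 5 K.
from statistics import mean, stdev

def clean(records, std_threshold=5.0):
    records = [r for r in records if r["psmiles"]]               # drop missing
    pairs = {(r["psmiles"], r["tg"]) for r in records}           # dedupe pairs
    grouped = {}
    for psmiles, tg in pairs:
        grouped.setdefault(psmiles, []).append(tg)
    cleaned = []
    for psmiles, tgs in sorted(grouped.items()):
        if len(tgs) == 1:
            cleaned.append((psmiles, tgs[0]))
        elif stdev(tgs) <= std_threshold:
            cleaned.append((psmiles, mean(tgs)))                 # average close values
        # larger-spread groups are handled by the curation workflow instead
    return cleaned

records = [
    {"psmiles": "[*]CC[*]", "tg": 200.0},
    {"psmiles": "[*]CC[*]", "tg": 204.0},
    {"psmiles": "", "tg": 100.0},          # missing PSMILES, dropped
    {"psmiles": "[*]CO[*]", "tg": 250.0},
    {"psmiles": "[*]CO[*]", "tg": 250.0},  # exact duplicate, deduplicated
]
cleaned = clean(records)
```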

It is crucial to note that this differs from the curation workflow (Fig. 3), where we aggregated Tg values using the median instead of the mean. This distinction was made to ensure internal consistency within the B sources while maintaining robustness in the overall dataset curation process.

After these final steps, the cleaned dataset contained 7874 points, representing 87.5% of the original dataset. These refined data points were then used as the foundation for the data curation process, which is detailed in Fig. 7.

Backbone and side chain classification

In the PolyMetriX package, distinguishing between the backbone and side chains of polymers is crucial for applying chemical featurizers effectively. The Polymer class accomplishes this classification by utilizing graph theory concepts and the NetworkX library47 to analyze the polymer’s structure, represented by its PSMILES notation40.

The polymer backbone is identified based on key graph properties, including the shortest paths between connection points, cycle detection, and node degree analysis. Atoms that are not part of the backbone are classified as side chains. The classification process consists of the following steps (Fig. 8).

Fig. 8: Illustration of backbone and side chain classification in Poly(N-vinyl carbazole) using the PolyMetriX package.

(1) Connection points are denoted by asterisks (*) in the PSMILES. (2) The shortest path between these connection points is computed to form the initial backbone. (3) If cycles (e.g., aromatic rings) are present along the shortest path, their atoms are incorporated into the backbone. (4) Degree-based classification includes terminal groups (nodes with degree 1) that are attached to the backbone. (5) Atoms not included in the backbone are assigned as side chains.

The polymer is represented as a graph, where nodes correspond to atoms, and edges represent chemical bonds. This graph is constructed using RDKit26 and NetworkX47.

In PSMILES notation, asterisks (*) indicate the connection points of the polymer.

To determine the backbone structure, the shortest paths between the connection points (atoms labeled with asterisks) are computed. These paths serve as an initial backbone representation.

If cycles are present within the shortest paths, all atoms forming these cycles are included in the backbone. In this context, a cycle refers to a closed sequence of bonds forming a ring-like structure, such as the benzene ring in styrene-based polymers.

Nodes with a degree of 1 (connected to only one other node) that are attached to the backbone are also included in the backbone. These are typically terminal groups.

Atoms that are not included in the backbone, based on the above criteria, are classified as side chains.
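A toy version of these classification steps is sketched below. PolyMetriX builds the real graph from PSMILES with RDKit and NetworkX; here a hand-written adjacency list stands in, and ring handling (step 3) is omitted for brevity:

```python
# Toy backbone/side-chain classification: the backbone is the shortest path
# between the two connection points plus degree-1 terminals attached to it;
# everything else is a side chain. Ring handling is omitted in this sketch.
from collections import deque

def shortest_path(adj, start, goal):
    """Breadth-first shortest path between two nodes."""
    queue, prev = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nb in adj[node]:
            if nb not in prev:
                prev[nb] = node
                queue.append(nb)
    return []

def classify(adj, stars):
    backbone = set(shortest_path(adj, *stars))
    # Degree-1 terminals attached to the backbone join the backbone (step 4).
    for node, neighbours in adj.items():
        if len(neighbours) == 1 and neighbours[0] in backbone:
            backbone.add(node)
    return backbone, set(adj) - backbone

# Toy graph for [*]CC(CC)[*]: nodes 0 and 5 are the connection points (*),
# nodes 1-2 the backbone carbons, nodes 3-4 an ethyl side chain.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
backbone, side_chains = classify(adj, (0, 5))
```

Here the shortest path 0-1-2-5 becomes the backbone, while the ethyl branch (nodes 3 and 4) is classified as a side chain.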

Splitters in PolyMetriX

In this study, we implemented a random splitter using the scikit-learn package48. Random splitting is a commonly used approach in machine learning, but it might not be the best choice for measuring real-world impact49, as both the training and test sets can share similar structural patterns. This phenomenon is demonstrated in Fig. 9, where the folds in the random splitter appear closely distributed. As a result, machine learning models tend to perform exceptionally well on random splits because they are not rigorously tested on their generalization ability.

Fig. 9: Visualization of different data splitting strategies used in PolyMetriX.

Left: The Random splitter demonstrates high overlap between different folds, making it less challenging for ML models. Right: The LOCOCV splitter assigns structurally similar data points to separate clusters, preventing information leakage. Bottom: The Property-based splitter divides data into five quantile bins based on Tg values, creating a more rigorous test for model generalization.

To provide a more challenging evaluation of model performance, we incorporated a property-based splitter. This method partitions the data into quantized bins based on Tg values, creating five distinct groups. As shown in Fig. 9, the bins are well separated, with extreme values forming distinct clusters at the left and right, while intermediate values are clustered in the central three bins. This splitting strategy presents a significantly more stringent test for model generalization.

Additionally, we employed the LOCOCV splitter, implemented using the mofdscribe package23. LOCOCV ensures that each fold consists of structurally distinct clusters, making it a robust method for evaluating generalization performance. Unlike random splitting, which allows structural overlap between the training and test sets, LOCOCV strictly enforces separation, leading to a more realistic assessment of how models perform on unseen polymer structures.
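The LOCOCV splitting logic can be sketched as follows. Cluster labels are assumed to be precomputed (in PolyMetriX this clustering is delegated to mofdscribe); each fold then holds out one entire cluster so that no test polymer shares a cluster with the training set:

```python
# LOCOCV-style split sketch: hold out one cluster per fold, given precomputed
# cluster labels (one label per data point).

def lococv_folds(cluster_labels):
    """Yield (train_indices, test_indices) with one cluster held out per fold."""
    for held_out in sorted(set(cluster_labels)):
        test = [i for i, c in enumerate(cluster_labels) if c == held_out]
        train = [i for i, c in enumerate(cluster_labels) if c != held_out]
        yield train, test

labels = [0, 0, 1, 2, 1, 2, 0]          # toy cluster assignments
folds = list(lococv_folds(labels))      # three folds, one per cluster
```

Because train and test indices never share a cluster label, every fold measures performance on structurally distinct polymers.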

For the modeling tasks, we used GBR from the scikit-learn package48. GBR50 is an ensemble learning technique that builds a series of weak predictive models, typically decision trees, and combines them to create a strong, accurate predictor.

We employed GBR with its default settings, without performing hyperparameter optimization. This decision was made because the primary focus of this work is on polymer featurization and how these features can be used for training and downstream tasks, rather than optimizing model performance.