Abstract
Polymer property prediction is a critical task in polymer science. Conventional approaches typically rely on a single data modality or a limited set of modalities, which constrains both predictive accuracy and practical applicability. In this paper, we present Uni-Poly, a novel framework that integrates diverse data modalities to achieve a comprehensive and unified representation of polymers. Uni-Poly encompasses all commonly used structural formats, including SMILES, 2D graphs, 3D geometries, and fingerprints. In addition, it incorporates domain-specific textual descriptions to enrich the representation. Experimental results demonstrate that Uni-Poly outperforms all single-modality and multi-modality baselines across various property prediction tasks. The integration of textual descriptions provides complementary information that structural representations alone cannot capture. These findings underscore the value of leveraging multimodal and domain-specific information to enhance polymer property prediction, thereby advancing high-throughput screening and the discovery of novel polymer materials.
Introduction
Polymer materials are essential to a wide range of applications, from packaging and construction to advanced biomedical devices, due to their versatility and tunable properties1,2,3,4,5. A comprehensive understanding of the complex relationships between polymer structures and their properties is vital for the rational design of materials tailored to specific applications. While traditional experimental and computational methods have been employed to study these structure–property relationships, they are often time-consuming and resource-intensive because of the immense diversity within the polymer chemical space. Recently, machine learning (ML) models have emerged as powerful tools for uncovering polymer structure–property relationships, demonstrating remarkable success. These ML-based strategies necessitate the development of effective and robust representations of polymer structures to ensure accurate predictions and meaningful insights. Early methods utilized molecular fingerprints6,7,8,9, which encode structural information into fixed-length bit vectors. While these fingerprints effectively capture polymer substructures, their fixed format limits end-to-end learning capabilities. Advances in deep learning techniques have introduced learned representations, including those based on SMILES sequences10,11,12,13,14,15,16, graph-based representations17,18,19,20,21,22,23,24,25, and 3D geometric data26,27,28,29,30,31. Although these approaches have shown promising results, existing studies have often not considered multimodal methods, limiting their ability to integrate the diverse and complementary information provided by different data types. This limitation typically restricts their effectiveness to specific properties, making it challenging to achieve general-purpose applicability. For example, a recent study by Zhu32 demonstrated that different representation methods exhibit varying strengths depending on the target property: for 2D molecular structures, SMILES-based representations excel in predicting chirality, whereas graph-based representations outperform others in capturing ring-related features. These findings emphasize the modality-specific strengths and limitations of current approaches and highlight the need for a comprehensive framework capable of integrating multiple information sources into a unified and informative representation.
Beyond molecular structures, recent advancements in language models have introduced an emerging representation method: text-based descriptions. Studies in the field of small molecules have demonstrated that leveraging molecule-related captions, i.e., domain-specific knowledge embedded in textual descriptions, can enhance molecular modeling33,34,35. These captions provide contextual information that is often difficult to encode using conventional molecular descriptors alone. However, applying textual knowledge to polymer science remains challenging due to the absence of consistent, large-scale, and high-quality datasets of polymer captions. The advent of large language models (LLMs) offers a promising solution to this issue. General-purpose LLMs, trained on extensive scientific corpora, implicitly store vast amounts of domain knowledge. Recent works36,37 have successfully extracted domain-specific insights and material data from these models, showcasing their capacity to generate meaningful, context-rich captions for polymers. This advancement moves beyond purely structure-based predictions, enabling the integration of human-like domain reasoning into polymer modeling.
This work introduces Uni-Poly, a novel unified framework for generating multimodal and multi-domain polymer representations. The primary contributions of this work are as follows: (i) leveraging large language models and structured data to construct Poly-Caption, a dataset containing over 10,000 textual descriptions of polymers generated through knowledge-enhanced prompting; (ii) developing the Uni-Poly framework, which integrates multiple polymer data modalities—including SMILES, 2D molecular graphs, 3D molecular geometries, molecular fingerprints, and textual descriptions—into a unified representation. To the best of our knowledge, this is the first framework to systematically integrate multimodal and multi-domain information for polymers; and (iii) conducting a comprehensive evaluation of the Uni-Poly framework across a diverse set of polymer property prediction tasks. The results demonstrate Uni-Poly’s superiority in predictive performance compared to existing single-modality and multimodal approaches, highlighting the effectiveness of combining diverse modalities and domain-specific textual knowledge and offering a more comprehensive approach to polymer informatics.
Results
Overall performance
We compared the property prediction performance of Uni-Poly with other single- and multi-modality models. As shown in Table 1, overall, the glass transition temperature (Tg) emerged as the best-predicted property, achieving an R² of ~0.9, highlighting the strong correlation between Tg and polymer structure. This was followed by thermal decomposition temperature (Td) and density (De), with R² values ranging from 0.7 to 0.8. In contrast, properties like electrical resistivity (Er) and melting temperature (Tm) exhibited lower R² values, typically between 0.4 and 0.6, reflecting the challenges of predicting these properties based solely on monomer unit structures.
Among all evaluated models, Uni-Poly consistently outperformed all baselines across the evaluated properties, achieving at least a 1.1% improvement in R² over the best-performing baseline in various tasks. Notably, for Tm, Uni-Poly demonstrated a significant 5.1% increase in R², underscoring the advantage of integrating complementary modalities, particularly for challenging properties where structural data alone is insufficient.
Moreover, we observed that no single-modality model achieved optimal performance across all evaluation metrics. For single-modality models, Morgan excelled in predicting Td and Tm, ChemBERTa performed best for De and Tg, while Uni-mol achieved the highest R² for Er. This finding supports our earlier assertion that single-modality models are constrained by their limited representational capacity, making it challenging to excel across diverse tasks. In contrast, multimodal models generally exhibit superior representational capacity, leading them to outperform single-modality models in most tasks and showcasing the benefits of leveraging multiple data modalities.
Notably, the Text+Chem T5 model demonstrated the weakest overall performance, likely due to the challenges of capturing precise structural and physicochemical information through textual descriptions alone. Despite this limitation, the model achieved R² values exceeding 0.44 across all tasks and performed notably well in predicting Tg, achieving an R² of 0.745. Remarkably, for Tm, it outperformed SchNet, illustrating the rich informational content of textual data—a resource often overlooked in polymer research.
Although Uni-Poly achieves higher accuracy than existing models in predicting polymer properties, its performance remains insufficient for engineering applications. For instance, considering the glass transition temperature (Tg), the best-predicted property, the mean absolute error is approximately 22 °C, which still exceeds the error tolerance generally accepted in the industry. This limitation arises from two major constraints. First, differences in experimental measurement methods (e.g., variations in heating rates in differential scanning calorimetry (DSC) tests can lead to differences in Tg exceeding 10 °C38) introduce inherent noise in the data, setting a theoretical limit on prediction accuracy that is difficult to surpass. More importantly, the current polymer representation in Uni-Poly lacks a multi-scale description of polymer structures. Polymer properties are intrinsically determined by structural features spanning multiple scales, including monomer structure, molecular weight distribution, chain entanglement, and aggregated structures. However, due to limitations in data availability, the current model extracts features solely from the monomer-level inputs. To overcome these accuracy bottlenecks, it is promising to incorporate multi-scale structural information into predictive models. For example, studies such as BigSMILES39 have extended polymer representation by encoding monomer sequence information and have achieved certain improvements in prediction accuracy. Integrating such representations into Uni-Poly holds potential for further enhancing its performance.
Poly-Caption and the impact of multi-domain input
We first examine the generated caption dataset, Poly-Caption. Figure 1 illustrates the overall length distribution of the captions. The generated text lengths generally follow a Gaussian distribution, with most captions falling within the range of 90–135 words.
To further analyze the dataset’s content at a macro level, we created a word cloud to visualize the high-frequency terms appearing in the captions, as shown in Fig. 2. The word cloud highlights key aspects of the captions, including applications (e.g., “application,” “coating,” and “device”), properties (e.g., “thermal,” “electronic,” and “mechanical”), structures (e.g., “backbone,” “ring,” and “aromatic”), and synthesis processes (e.g., “polycondensation” and “polymerization”).
The accuracy of captions is crucial for demonstrating the effectiveness of our approach. We manually evaluated the captions in the Poly-Caption dataset in two aspects. The first aspect focuses on whether the captions accurately retain the information we provided during generation. This includes fundamental polymer details such as polymer names, monomer molecular formulas, monomer molecular weights, and polymer synthesis information. We randomly sampled 100 captions from the Poly-Caption dataset and compared them with the input JSON data. Our analysis confirmed that all captions accurately reproduce these fundamental details. This finding indicates that the LLM demonstrates a precise understanding of the provided information. The verification of the remaining information posed additional challenges, primarily due to the inherent non-uniqueness and subjectivity in describing polymer structures, properties, and applications. Given the inherent challenges in evaluating such captions, the accuracy assessment of certain rare polymers is difficult even for polymer experts. Consequently, we selected captions corresponding to the 30 most common polymers from the dataset for a comprehensive analysis (sample captions are provided in the SI). Following a thorough review and evaluation by polymer experts, it was observed that while certain terms, such as “high-performance,” were general or vague, the captions overall accurately captured the actual performance of the polymers and offered valuable insights into the field of polymer science. This finding validates the effectiveness of our knowledge-enhanced prompting in generating factually accurate captions that integrate domain-specific knowledge.
To assess the specific contribution of the textual modality to prediction performance, we evaluated a variant of Uni-Poly, namely Uni-Poly (w/o text), which integrates multiple structural representations but excludes captions. The results, as shown in Fig. 3, indicate that while Uni-Poly (w/o text) achieves performance comparable to Uni-Poly for Tg prediction, it demonstrates a notable decrease in R² for the other properties, with declines ranging from ~1.6% to 3.9%. These findings suggest that the inclusion of textual data provides valuable domain-specific insights, enhancing predictive accuracy.
For example, captions highlight areas of application, such as advanced devices or coatings, and environmental conditions like high-temperature usage or chemical exposure. These references offer critical information about performance requirements and context, suggesting properties such as resistivity, thermal performance, mechanical strength, and chemical stability. Beyond applications, captions enrich the understanding of chemical structure. While structural representations reveal functional groups, captions incorporate domain-specific knowledge, offering experiential insights into the roles and impacts of these groups on polymer properties. This combination enhances the predictive power of structural data by linking it to practical expertise. Captions also reveal the connection between synthesis methods or processing conditions and resulting material properties. By detailing how fabrication processes influence functional outcomes, they provide a deeper understanding of the relationship between material preparation and performance. Overall, textual information complements structural representations, especially for properties shaped by real-world usage and processing scenarios.
To further quantify the impact of specific words or phrases on predictions and to inform future efforts in more effective prompt engineering, we conducted an in-depth analysis of the attention weights assigned to tokens by the text encoder in Uni-Poly for the best-performing Tg prediction task. These attention weights can be interpreted as indicators of the importance of tokens in the subsequent embedding and final prediction results. Figure 4 presents the top 20 tokens with the highest average attention weights in the Poly-Caption dataset. We observed that high-importance tokens can be broadly categorized into four groups:
Chemical structure-related tokens
This category constitutes the largest proportion of high-importance tokens, including “aromatic”, “benz”, and “phenol”, which primarily reflect the influence of polymer monomer structures or precursor reactants on Tg. Notably, “aromatic” stands out with a significantly higher importance weight than other tokens. This aligns with chemical intuition, as the aromaticity of a polymer affects chain rigidity and intermolecular interactions, which are strongly correlated with Tg.
Application-related tokens
Certain words, such as “aerospace” and “adhesive”, indicate that the intended use of a polymer may implicitly correlate with its Tg. For instance, polymers used in aerospace applications often require high thermal stability, which translates to a higher Tg. Similarly, adhesives may have Tg implications depending on their required flexibility and operating temperature. The presence of these tokens in high-importance rankings suggests that the model captures such contextual information and integrates it into its predictions.
Descriptors of polymer properties
Tokens like “versatility”, “organic”, and “intricate” represent broader or more abstract descriptions of a polymer’s characteristics. These words do not directly specify chemical structures or applications but rather encapsulate generalizable features that may be associated with properties.
Adverbs and emphasizing words
Examples include “paramount” and “robust”. While it is difficult to infer their direct connection to the property from the tokens alone, we examined their usage within the Poly-Caption dataset by retrieving sentences containing these keywords. We found that such words are typically used to emphasize key properties, as in “…with paramount heat resistance,” where the emphasized property is correlated with the predicted one.
The results offer insights for future prompt engineering. For instance, strategically emphasizing words related to structure, application, or property description in input prompts could optimize model performance and improve interpretability.
Task-modality attention weights analysis
To further explain the contribution of different modalities to specific property prediction tasks, we analyzed the attention weight matrix learned by the Uni-Poly model, as visualized in Fig. 5.
Among all modalities in the Uni-Poly representation, SMILES and FP consistently emerge as the dominant contributors across various properties. This underscores that two-dimensional structural features are the primary determinants for recognizing backbone and functional group types, which are essential for polymer property prediction. Other modalities, such as Graph, Geom, and Text, serve primarily complementary roles. Notably, Geom demonstrates higher relevance only in Tm prediction, while Text shows the lowest overall contribution, functioning mainly as supplementary information.
Density (De)
SMILES and FP effectively capture key chemical features, such as monomer mass and side-chain size. Graph and Geom provide moderate contributions, reflecting topological and spatial configurations that influence packing efficiency.
Electrical resistivity (Er)
SMILES and FP dominate by identifying conductive motifs like conjugated systems and functional groups. Due to the relatively low prediction accuracy based solely on structural information, Text becomes more valuable, offering application-related insights into electronic properties.
Decomposition temperature (Td)
FP exhibits strong contributions by detecting thermally stable substructures, such as aromatic groups and cyclic structures, while SMILES complements this by providing backbone information. Text contributes moderately by capturing branching features, which restrict chain segment mobility and enhance thermal stability.
Glass transition temperature (Tg)
The attention weights for SMILES and FP are significantly higher compared to other modalities, while those for Text and Geom are notably lower. This is likely because Tg is influenced by factors such as chain segment mobility, side-group bulkiness, and polar group interactions—all closely related to monomer structure and functional groups. Consequently, two-dimensional descriptors like SMILES and FP, which excel at identifying structural motifs, can largely determine Tg, leaving minimal room for supplementary contributions from Text and Geom.
Melting temperature (Tm)
For Tm, FP remains the most significant modality, but Geom contributes above the average. This reflects its ability to emphasize spatial configurations and molecular symmetry, which govern crystallinity and play a critical role in determining Tm.
Ablation study
In this section, we present ablation experiments to evaluate the effectiveness of various components within the Uni-Poly framework. Specifically, we assess how performance changes relative to the full Uni-Poly model when individual components are removed or altered. The configurations tested are as follows:
“w/o pre-train”
This variant skips the pretraining phase that aligns features across modalities. It evaluates the importance of contrastive learning for feature alignment.
“w/o multi-head attention”
The multi-head attention mechanism for feature refinement is removed, testing the impact of inter-modal feature fusion.
“Mean pooling”
The attention-based weighted pooling layer is replaced with simple average pooling, assessing the impact of attention-based weighting on modality aggregation.
“Frozen encoder”
Encoder parameters are frozen during training, restricting the model to optimize only the downstream components.
The results in Table 2 show that freezing the encoder parameters leads to the most significant performance degradation, particularly for Tm and Er. This underscores the necessity of end-to-end training in fine-tuning the pre-trained encoders to adapt them effectively to downstream tasks. Skipping the pretraining phase also causes a substantial decline, emphasizing the importance of contrastive learning in aligning features across modalities and adapting encoders—originally trained on small-molecule SMILES datasets—to the polymer domain. Similarly, replacing the attention-based weighted pooling layer with mean pooling results in notable performance reduction, confirming that dynamic weighting enhances feature aggregation by prioritizing informative components. Although removing the multi-head attention mechanism results in relatively smaller accuracy reductions, its absence consistently lowers performance across all properties. This indicates that multi-head attention provides beneficial enhancements to inter-modal feature fusion and refinement, even if it is not the most critical component. These findings highlight the importance of each component in the Uni-Poly framework, where the interplay of these elements collectively ensures robust and accurate predictions of polymer properties.
Applicability to copolymers
Regarding its current applicability, this study focuses exclusively on homopolymers, while the extension to copolymer systems, which are of greater relevance in many engineering applications, remains an open area for further exploration. Common approaches to handling copolymers can be broadly categorized into two types. The first treats the copolymer chain as an oligomer or a periodic structure, which is then considered as a monomer for feature extraction and analysis. This method has demonstrated strong applicability, particularly in graph-based representations22,40, and it can be integrated into the Uni-Poly framework by treating the resulting unit as an equivalent “monomer.” However, a significant limitation is that it results in excessively long chain structures, which substantially increase computational costs in the SMILES encoding and 3D coordinate branches of Uni-Poly. As a result, this approach is currently computationally impractical within our framework. The second approach involves computing monomer-level descriptors independently, followed by aggregation based on their composition ratios. In the context of Uni-Poly, this would involve generating embeddings for each monomer across all modalities and then linearly combining them to represent the copolymer. While this strategy has been shown to work well for handcrafted features such as molecular fingerprints41, it is less suitable for learned representations, particularly the textual captions, which encode contextualized chemical semantics. A simple linear averaging of such representations lacks clear physical or chemical meaning. Instead, a more promising direction would be to generate copolymer-specific captions that explicitly reflect monomer composition and sequence arrangement. Future work will therefore focus on developing tailored prompt templates and architectural adjustments to more effectively encode copolymer-specific features, such as monomer composition and sequence.
Discussion
In this work, we introduced Uni-Poly, a comprehensive multimodal framework designed for polymer property prediction by integrating structural representations (SMILES, 2D molecular graphs, 3D geometries, and fingerprints) with textual descriptions embedding domain-specific knowledge. Uni-Poly represents a paradigm shift in polymer informatics by systematically leveraging diverse data modalities to achieve a unified representation of polymers.
To validate Uni-Poly’s effectiveness, we first constructed a polymer caption dataset, Poly-Caption, which incorporates domain knowledge through large language models. Experimental results demonstrate that Uni-Poly consistently outperforms single-modal and existing multimodal approaches across a range of polymer property prediction tasks, including glass transition temperature, density, thermal decomposition temperature, melting point, and electrical resistivity. Notably, the inclusion of textual descriptions provides domain-specific insights that enhance performance, particularly for properties influenced by practical applications or processing conditions. A detailed analysis of modality-specific contributions reveals the complementary strengths of different data types, with SMILES and molecular fingerprints consistently showing the strongest impact, while text and geometry add critical contextual and structural layers for specific properties.
These findings underscore the transformative potential of multimodal and multi-domain approaches in polymer science. By offering a more comprehensive representation of polymers, Uni-Poly enables enhanced predictive accuracy, supporting high-throughput polymer screening and the targeted discovery of novel materials.
Future research will focus on the integration of more comprehensive and information-rich polymer structural features to further enhance predictive accuracy and expand the applicability of our framework to a broader range of properties and more complex polymer systems, such as copolymers. By incorporating these advancements, we aim to improve the model’s capability in predicting diverse polymer behaviors, ultimately enabling more reliable and versatile applications in polymer informatics.
Methods
Polymer representation
Polymer representation aims to convert the chemical information of a polymer into forms that can be processed by machine learning algorithms. A variety of methods have been developed to represent polymers. In Fig. 6, we illustrate several representation methods using polystyrene as an example. These polymer representations are categorized into two types: structural representations and textual representations, which are briefly introduced below.
Due to the complexity of polymer chain segments, structural representations of polymers are often simplified by focusing only on the monomer unit. Commonly, four types of structural representations are used for monomers: molecular fingerprint, sequential representation, 2D graph, and 3D geometry.
Molecular fingerprint
Molecular fingerprints6 encode chemical information into binary vectors that indicate the presence or absence of predefined chemical features and substructures. Their primary advantage lies in their simplicity and computational efficiency, enabling rapid similarity searches and clustering. However, fingerprint-based representations heavily depend on predefined substructures, which can limit their ability to capture the full complexity of diverse polymer structures.
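For illustration, the following minimal sketch (not the exact pipeline used in this work) computes a 1024-bit Morgan fingerprint with radius 2 for the polystyrene repeat unit using RDKit; the bit length and radius match the settings described under Implementation details.

```python
# Minimal sketch: 1024-bit Morgan fingerprint (radius 2) of the polystyrene
# repeat unit with RDKit; '*' marks the connection points of the monomer.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("*CC(*)c1ccccc1")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
fp_vector = np.array(fp)                       # fixed-length binary feature vector
print(fp_vector.shape, int(fp_vector.sum()))   # (1024,) and the number of set bits
```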
Sequential
SMILES (simplified molecular input line entry system)42 is the most popular linear notation used to describe chemical structures. For polymers, SMILES notation often replaces the connection points between repeating units with an asterisk (*) to indicate where monomers are linked. SMILES representations are compact, easily interpretable, and well-suited for sequence-based machine learning models, such as BERT15 and Transformers16. However, due to the limitations of its encoding rules, the linear format may separate adjacent atoms into distant parts of the string, making it challenging to capture some global structural information.
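As a concrete example, the polystyrene repeat unit can be written as the SMILES string *CC(*)c1ccccc1. The minimal RDKit sketch below canonicalizes such a string before tokenization; it is illustrative and not tied to the specific preprocessing code of this work.

```python
# Minimal sketch: canonicalizing a polymer repeat-unit SMILES with RDKit before
# feeding it to a sequence model; '*' denotes the monomer connection points.
from rdkit import Chem

raw_smiles = "C(*)C(*)c1ccccc1"                      # one possible writing of the polystyrene repeat unit
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(raw_smiles))
print(canonical)                                     # a single canonical form, e.g. '*CC(*)c1ccccc1'
```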
2D graph
In 2D topology graphs, atoms are represented as nodes and bonds as edges, providing a detailed depiction of molecular connectivity and bonding relationships. Graph neural networks, including graph convolutional networks (GCNs)17, GraphSAGE18, graph isomorphism networks (GINs)19, and graph attention networks (GATs)20, are commonly employed to encode these graphs into feature vectors that capture both local and global structural information of polymer monomers. Compared to SMILES, 2D graphs explicitly represent bonds, making them more effective at capturing connectivity information.
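A minimal sketch of this conversion is given below, using an assumed, simplified featurization (the node and edge features actually used in this work are listed under Implementation details): atoms become nodes and each bond becomes a pair of directed edges.

```python
# Minimal sketch: converting a repeat-unit SMILES into simple node/edge arrays
# that a graph neural network (e.g., a GIN) could consume.  The features here
# are illustrative, not the exact featurization used in Uni-Poly.
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("*CC(*)c1ccccc1")
node_features = np.array([[atom.GetAtomicNum(), atom.GetDegree()] for atom in mol.GetAtoms()])
edges = []
for bond in mol.GetBonds():
    i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
    edges += [(i, j), (j, i)]                 # each bond becomes two directed edges
edge_index = np.array(edges).T                # shape (2, num_edges)
print(node_features.shape, edge_index.shape)
```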
3D geometry
3D geometry graphs incorporate atomic coordinates, offering a representation of the spatial arrangement of atoms. Geometric deep learning techniques, such as spatial graph networks, are often used to process these 3D geometries. The primary advantage of 3D geometry graphs lies in their ability to capture stereochemistry and conformational details critical for predicting properties related to molecular conformation, such as energy and molecular interactions. However, obtaining accurate 3D geometries can be computationally intensive, and these representations are often sensitive to variations in molecular conformation.
Textual representation
Text-based representations use natural language descriptions to convey domain-specific knowledge about polymers, as illustrated in Fig. 6b. There is no strict definition of what information a caption must contain; captions may include details about chemical composition and structure, morphology, application scenarios, preparation methods, and more. Although captions cannot describe the detailed chemical structure of a monomer as precisely as structural representations, their scope is significantly broader and often closely related to the polymer’s physicochemical properties. These textual descriptions are processed using language models such as BERT33,35 and Transformer34 models, which generate semantic embeddings that encapsulate the context and knowledge expressed in the text. Another critical consideration is data availability. While the small-molecule domain benefits from publicly available caption databases like ChEBI-2043, no large-scale and publicly accessible datasets are currently available for polymers, presenting a significant challenge for utilizing text-based representations.
Multimodal representation
While single-modality methods have shown promise in capturing specific aspects of polymer features, they are limited in their ability to comprehensively represent polymers. Multimodal representations aim to utilize the strengths of different data modalities to create an integrated representation of polymers. Multimodal methods have been explored in various chemical domains, including reaction prediction, molecular property prediction and drug–drug interaction33,44,45,46,47,48. In the context of polymers, recent studies49,50 have proposed to combine both SMILES and 3D geometry representations. However, restricting integration to only a subset of modalities prevents these methods from fully exploiting the diverse structural information available for polymers. Moreover, no current attempts have been made to incorporate text-based representations into polymer modeling, further limiting the expressive power of these approaches.
Datasets
We conducted a comprehensive evaluation of Uni-Poly using datasets collected from the largest publicly available polymer database, PoLyInfo51 (The copyrights of this database are owned by the National Institute for Materials Science [NIMS]). Specifically, we focused on five key polymer properties: glass transition temperature (Tg), density (De), thermal decomposition temperature (Td), melting point (Tm), and electrical resistivity (Er). These properties span a wide range of physical and chemical behaviors, involving different mechanisms such as thermal stability, phase transition, and electronic properties. As such, they test the versatility and robustness of the Uni-Poly model in capturing diverse property relationships. For a given monomer, different samples can exhibit significant variations in their properties due to differences in experimental conditions, synthesis routes, or measurement methods. Such variability introduces inherent noise into the data, which poses an additional challenge for accurate prediction. To reduce noise and provide a more consistent target for model training, we grouped samples for the same monomer and computed the average value for each property to serve as a representative value. After processing, our dataset comprised a total of 15,443 structure–property data points, providing a diverse and comprehensive foundation for evaluating Uni-Poly. The distribution and number of samples for each property are illustrated in Fig. 7.
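As a minimal sketch of this averaging step (with hypothetical column names, not the actual PoLyInfo schema), the grouping can be expressed as follows:

```python
# Minimal sketch: collapse repeated measurements of the same monomer into a
# single averaged target per property.  Column names and values are illustrative.
import pandas as pd

records = pd.DataFrame({
    "smiles":   ["*CC(*)c1ccccc1", "*CC(*)c1ccccc1", "*CC*"],
    "property": ["Tg", "Tg", "Tg"],
    "value":    [95.0, 105.0, -120.0],        # degrees Celsius, illustrative
})
targets = records.groupby(["smiles", "property"], as_index=False)["value"].mean()
print(targets)
```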
We propose Uni-Poly, a comprehensive framework designed to unify multiple data modalities relevant to polymer materials into a single embedding. In this section, we introduce the core components of Uni-Poly, as shown in Fig. 8.
The process begins with knowledge-enhanced prompting using a large language model to generate polymer captions. Multimodal data, including text, geometry, graph representations, fingerprints, and SMILES, are encoded through modality-specific encoders to generate unified modality embeddings. These embeddings are then projected into a shared space and pre-trained using contrastive learning. Subsequently, the framework employs a multi-head attention mechanism and a feed-forward network to facilitate refinement. Finally, an attention-based weighted pooling layer aggregates the refined embeddings into the final Uni-Poly embedding, which can be used for polymer property prediction tasks.
Structural and textual representation generation
The structural and textual representation generation module generates various representations, including SMILES, 2D molecular graphs, 3D geometries, molecular fingerprints, and textual descriptions. For structural representations, we employ established methodologies to extract data directly from available datasets or utilize specialized tools such as RDKit52.
Generating textual representations for polymers is challenging due to the absence of large-scale polymer-specific caption datasets. To address this gap, we adopted a synthetic data approach and constructed Poly-Caption, a comprehensive dataset that integrates extensive domain-specific knowledge in polymer science.
General-purpose LLMs, such as GPTs53, possess vast domain-specific knowledge from pretraining on extensive text corpora. However, these models face issues such as hallucination54,55, where factually inaccurate information is generated, and output variability, which can undermine their reliability for scientific applications. To mitigate these limitations, we implemented a knowledge-enhanced prompting strategy. This approach employs in-context learning to guide the generation process through carefully designed prompts. Our prompting strategy is structured into three key components:
- Role definition and task specification: We first define the model’s role and explicitly outline the task of caption generation. Additionally, we impose constraints on the content and style of the captions to ensure relevance and consistency.
- Structured polymer knowledge: We provide the model with a JSON dataset containing essential polymer information, including SMILES, polymer type, structural name, molecular formula, molecular weight, processing conditions, and synthesis procedures. The selection of this information is critical to balance its reference utility and avoid data leakage. This structured input enhances the factual accuracy of the generated captions.
- Polymer caption examples: To facilitate few-shot learning, we manually created three polymer caption examples that clarify the expected scope and format of the captions, improving output consistency. Considering our ultimate goal of predicting polymer properties, we incorporated various descriptive information directly related to polymer properties.
By adopting this approach, we generated a caption for each polymer that combines the inherent domain knowledge of the LLM with the additional structured information provided. The resulting captions include details about the polymer’s name, type, structural characteristics, synthesis methods, general properties, and potential applications, providing a domain-knowledge view of the polymer. The complete prompt template used in this process is provided in the SI, offering readers a detailed understanding of the specific steps and methodology involved.
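A minimal sketch of how such a prompt might be assembled is shown below; the field names, wording, and example caption are illustrative only, and the exact template is the one provided in the SI.

```python
# Minimal sketch: assembling a knowledge-enhanced prompt from a role definition,
# structured polymer knowledge (JSON), and few-shot caption examples.
# All field names and wording here are illustrative, not the actual template.
import json

polymer_record = {
    "name": "polystyrene",
    "smiles": "*CC(*)c1ccccc1",
    "monomer_molecular_weight": 104.15,
    "synthesis": "free-radical polymerization of styrene",
}

role_and_task = (
    "You are a polymer scientist. Write one factual paragraph describing the "
    "polymer below: structure, synthesis, general properties, and typical "
    "applications. Use only the information provided plus general domain knowledge."
)
example_captions = [
    "Polyethylene is a semi-crystalline polyolefin widely used in packaging ...",  # abbreviated
]

prompt = "\n\n".join([
    role_and_task,
    "Polymer data (JSON):\n" + json.dumps(polymer_record, indent=2),
    "Example captions:\n" + "\n".join(example_captions),
])
# `prompt` is then submitted to the LLM (GPT-4 in this work) to obtain the caption.
```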
Embedding and projection
In this module, each modality is independently encoded to generate an initial embedding that captures its unique information, which is then projected into a joint representation space for further processing. A high-level overview of each encoder is provided below, while detailed architectures are available in the referenced literature.
Given a polymer monomer represented by its SMILES \(\left(S\right)\), graph \(\left(G\right)\), geometry \(\left({\mathcal{G}}\right)\), fingerprints \(\left(F\right)\), and caption \(\left(C\right)\), we define the polymer representation as the tuple \({\mathcal{P}}=\left(S,G,{\mathcal{G}},F,C\right)\).
For each modality \(r\in \left\{S,G,{\mathcal{G}},C\right\}\), a modality-specific encoder \({{\mathscr{E}}}_{r}\) produces the corresponding embedding \({h}_{r}={{\mathscr{E}}}_{r}\left(r\right)\).
SMILES
We employ ChemBERTa14, a transformer model pre-trained on a large SMILES dataset using a masked language modeling strategy, as the SMILES encoder. Following established practices, the embedding of the first token from the encoder’s last hidden state is used as the SMILES embedding, denoted as \({h}_{S}\).
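A minimal sketch of this step is shown below, assuming the publicly available ChemBERTa checkpoint seyonec/ChemBERTa-zinc-base-v1 on Hugging Face; the checkpoint actually used in this work may differ.

```python
# Minimal sketch: SMILES embedding h_S as the first-token vector of ChemBERTa's
# last hidden state.  The checkpoint name is an assumption, not necessarily the
# one used in Uni-Poly.
import torch
from transformers import AutoModel, AutoTokenizer

name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

inputs = tokenizer("*CC(*)c1ccccc1", return_tensors="pt")
with torch.no_grad():
    last_hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
h_S = last_hidden[:, 0, :]                               # first-token embedding
```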
Graph
The graph isomorphism network (GIN)19 is utilized for encoding graph-based representations. GIN takes atom attributes (\({X}_{G}\)), bond attributes (\(E\)), and the adjacency matrix (\(A\)) as inputs. Through multiple interaction layers, it iteratively updates node features by aggregating neighbor information. A readout function aggregates the updated node features into the graph embedding \({h}_{G}\).
Geometry
SchNet26, designed for quantum chemistry tasks, is employed to generate geometry-based embeddings. SchNet processes atom attributes (\({X}_{{\mathcal{G}}}\)) and atom positions (\(R\)) using continuous filter convolutions and interaction blocks to generate the geometry embedding \({h}_{{\mathcal{G}}}\).
Caption
For caption-based representations, we use Text+Chem T556, a language model based on the T557 architecture that has been specifically trained for chemistry-related tasks, including caption generation and generating chemical representations from text. Similar to the SMILES encoder, the embedding of the first token from the last hidden state is used as the caption embedding, denoted as \({h}_{C}\), which captures the semantic meaning of the caption.
Fingerprint
Fingerprints are inherently vectorized representations of molecular structures. We directly use the Morgan fingerprint as the fingerprint embedding \({h}_{F}\).
These embeddings reside in distinct feature spaces due to the differences in their original representations. To obtain alignment and consistency across modalities while preserving their intrinsic information content, we introduce a projection layer \(\rho\) for each modality. The projection layer is modality-specific, transforming the individual embeddings into a shared space representation. The respective embeddings after projection can be represented as follows:
\({H}_{r}={\rho }_{r}\left({h}_{r}\right)\)
where \({H}_{r}\in {{\mathbb{R}}}^{d}\) represents the embedding vector of the modality \(r\) with dimension \(d\). The stacked embedding matrix for all modalities is denoted as \(H\in {{\mathbb{R}}}^{\left|{\mathcal{P}}\right|\times d}\), which is used for subsequent processing.
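A minimal PyTorch sketch of this projection step is given below; the encoder output dimensions are illustrative assumptions.

```python
# Minimal sketch: modality-specific linear projections into the shared
# d-dimensional space, stacked into the matrix H.  Encoder output sizes are
# illustrative assumptions, not the exact Uni-Poly dimensions.
import torch
import torch.nn as nn

d = 256
encoder_dims = {"smiles": 768, "graph": 300, "geom": 128, "fp": 1024, "text": 768}
projections = nn.ModuleDict({name: nn.Linear(dim, d) for name, dim in encoder_dims.items()})

def project(embeddings):
    """Map each modality embedding h_r to H_r and stack into H of shape (num_modalities, d)."""
    return torch.stack([projections[name](h) for name, h in embeddings.items()])

H = project({name: torch.randn(dim) for name, dim in encoder_dims.items()})
print(H.shape)   # torch.Size([5, 256])
```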
Fusion module
To effectively utilize information from all available modalities, we employ a multi-head attention mechanism58 and attention-based pooling to enhance the integration and fusion of inter-modal information. The embedding matrix \(H\) is first updated through multi-head attention:
\({H}^{\prime}=\text{Multihead}\left(H\right)\)
where \(\text{Multihead}(\cdot )\) denotes the multi-head attention mechanism that captures complex inter-modal relationships by computing attention scores across modalities and generating refined feature representations. The output is refined through residual connections and layer normalization to stabilize training, followed by an attention-based weighted pooling mechanism that determines the contribution of each modality. The importance of a modality \(r\) is computed through an attention network:
\({\alpha }_{r}=\frac{\exp \left({\omega }_{r}^{\top }{H}_{r}^{\prime}+{b}_{r}\right)}{{\sum }_{k\in {\mathcal{P}}}\exp \left({\omega }_{k}^{\top }{H}_{k}^{\prime}+{b}_{k}\right)}\)
where \({\omega }_{r}\) and \({b}_{r}\) are learnable parameters. The final Uni-Poly embedding \({H}_{{\rm{Uni}}-{\rm{Poly}}}\in {{\mathbb{R}}}^{d}\) is computed as the weighted sum of the individual modality embeddings:
\({H}_{{\rm{Uni}}-{\rm{Poly}}}={\sum }_{r\in {\mathcal{P}}}{\alpha }_{r}{H}_{r}^{\prime}\)
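The sketch below illustrates this fusion step in PyTorch (multi-head self-attention over the stacked modality embeddings, a residual connection with layer normalization, and attention-based weighted pooling); the head count and dimensions are illustrative, not the exact Uni-Poly configuration.

```python
# Minimal sketch of the fusion module: multi-head attention refinement followed
# by attention-based weighted pooling over modalities.
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, d=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.score = nn.Linear(d, 1)            # attention network producing modality weights

    def forward(self, H):                        # H: (batch, num_modalities, d)
        refined, _ = self.attn(H, H, H)
        refined = self.norm(H + refined)         # residual connection + layer norm
        alpha = torch.softmax(self.score(refined), dim=1)   # per-modality weights
        return (alpha * refined).sum(dim=1)      # weighted sum -> (batch, d)

fusion = FusionModule()
h_uni_poly = fusion(torch.randn(8, 5, 256))      # 8 polymers, 5 modalities
print(h_uni_poly.shape)                          # torch.Size([8, 256])
```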
Output module
The output module generates the model’s final outputs. In our work, we followed the common two-stage paradigm of pretraining and fine-tuning.
In this work, we adopt cross-modal contrastive learning59 as self-supervised pretraining. This method aligns data across different modalities by ensuring that embeddings of the same data point are closely positioned in the vector space, while embeddings of distinct data points are pushed farther apart without the need for explicit annotations. Specifically, we utilize the InfoNCE60 loss to maximize the cosine similarity between positive pairs (representations of different modalities of the same polymer) while minimizing the similarity between negative pairs (representations of different polymers). This enables efficient learning of shared representations across modalities, providing a robust foundation for downstream tasks.
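A minimal sketch of the InfoNCE objective for one pair of modalities is shown below, assuming PyTorch and the temperature τ = 0.07 reported under Implementation details; in pretraining this objective is applied across modality pairs.

```python
# Minimal sketch: InfoNCE loss between two modality embeddings of the same batch
# of polymers.  Matching rows are positive pairs; all other rows are negatives.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, tau=0.07):
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau                 # cosine similarities scaled by temperature
    labels = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
```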
After pretraining, we utilize a simple multi-layer perceptron (MLP) as the prediction head for estimating polymer properties. The output from the fusion module, denoted as \({H}_{{\rm{Uni}}-{\rm{Poly}}}\), is fed into the MLP to make final predictions:
\(\hat{y}={\rm{MLP}}\left({H}_{{\rm{Uni}}-{\rm{Poly}}}\right)\)
where \(\hat{y}\) is the prediction of the target property \(y\).
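A minimal sketch of such a prediction head is given below; the hidden width is an illustrative choice rather than the configuration used in this work.

```python
# Minimal sketch: MLP regression head on top of the fused Uni-Poly embedding.
import torch
import torch.nn as nn

mlp_head = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 1),                 # one scalar per property prediction task
)
y_hat = mlp_head(torch.randn(8, 256))  # predictions for a batch of 8 polymers
```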
Baselines
For a comprehensive comparison, we evaluated Uni-Poly against both single-modal models and multimodal baselines. Specifically, the single-modal models included Morgan Fingerprints6, SMILES-BERT11, ChemBERTa14, GCN17, GIN19, SchNet26, and Uni-mol31, while the multimodal baselines included 3D Infomax45, GraphMVP46, DVMP47, and MMPolymer49. Additionally, we developed a text-based baseline that leverages embeddings generated from Text+Chem T561 for the textual descriptions. For a fair comparison, all baseline models were trained with their default settings.
Implementation details
For the SMILES representations in the dataset, we first convert all SMILES into canonical SMILES, followed by tokenization using ChemBERTa’s tokenizer to ensure consistent and meaningful input sequences. For graphs, we utilized atomic properties, including atom type and chirality tag, as node features, while bond type and bond direction were used as edge features to capture the bonding context. Regarding 3D molecular structures, we replaced connection points represented by asterisks (*) with hydrogen atoms to ensure stability, followed by geometry optimization using the MMFF9462 force field in RDKit to achieve minimum energy conformations and obtain the final atomic coordinates. For molecular fingerprints, we generated 1024-bit binary vectors with a radius of 2 using RDKit’s fingerprinting functionality. For the text captions, considering the capabilities of existing large language models, we utilized the state-of-the-art GPT-453 as the backbone to generate domain-specific captions, leveraging its advanced proficiency in understanding and producing scientific language. A caption was generated for each unique SMILES entry in the database, resulting in a total of 10,142 SMILES-caption pairs. Subsequently, the structural and textual representations were encoded into modality-specific embeddings and projected into the Uni-Poly space with \(d=256\) for further processing.
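A minimal RDKit sketch of the 3D-geometry preprocessing described above (replacing connection-point asterisks with hydrogens, embedding a conformer, and relaxing it with MMFF94) is shown below; it is a simplified illustration rather than the exact script used in this work.

```python
# Minimal sketch: build a minimum-energy conformer for a repeat unit.
# Connection-point '*' atoms are replaced by hydrogens before embedding.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.RWMol(Chem.MolFromSmiles("*CC(*)c1ccccc1"))
for atom in mol.GetAtoms():
    if atom.GetAtomicNum() == 0:       # dummy atom, i.e. a '*' connection point
        atom.SetAtomicNum(1)           # replace with hydrogen
mol = mol.GetMol()
Chem.SanitizeMol(mol)
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=0)    # initial 3D coordinates
AllChem.MMFFOptimizeMolecule(mol)           # MMFF94 geometry optimization
coords = mol.GetConformer().GetPositions()  # (num_atoms, 3) array for the geometry encoder
```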
During the contrastive pretraining phase, the model was optimized for ten epochs on the entire polymer dataset, using a batch size of 32 and the Adam63 optimizer with an initial learning rate of 1e-5. The scale parameter \(\tau\) for contrastive learning was set to 0.07. In the fine-tuning stage, 10% of the dataset was used for testing, while the remaining 90% was allocated for training and validation. To enhance numerical stability, Z-score normalization was applied to each property prediction task. The model was trained for 100 epochs with early stopping, using a batch size of 32 and the Adam optimizer with an initial learning rate of 1e-4. Additionally, a cosine annealing learning rate scheduler was used to adjust the learning rate throughout the training process. For further details regarding the software and hardware platforms, model training procedures, and hyperparameter specifications, please refer to SI.
Evaluation metrics
In our property prediction tasks, all of which are regression tasks, we primarily evaluate model performance using the R-squared (R²) metric. Additionally, root mean square error (RMSE) and mean absolute error (MAE) are also reported in the SI. To ensure the robustness and reliability of our results, we performed fivefold cross-validation and presented both the mean and standard deviation of the metrics.
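As an illustration, the sketch below computes the three reported metrics for a single fold with scikit-learn; the values are illustrative, not data from this study.

```python
# Minimal sketch: R^2, RMSE, and MAE for one cross-validation fold.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([95.0, 150.0, 210.0, 60.0])   # e.g. Tg in degrees Celsius, illustrative
y_pred = np.array([90.0, 162.0, 200.0, 72.0])

r2 = r2_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
mae = mean_absolute_error(y_true, y_pred)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.1f}, MAE = {mae:.1f}")
```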
Data availability
The polymer data utilized for generating the captions and the property data supporting the findings of this study were manually collected from the PoLyInfo database by June 2016. (The copyrights of this database are owned by the National Institute for Materials Science [NIMS].) Other data are available from the authors upon reasonable request. The code developed for this work is available at https://github.com/huang-qi/Uni-Poly.
References
Pendhari, S. S., Kant, T. & Desai, Y. M. Application of polymer composites in civil construction: a general review. Compos. Struct. 84, 114–124 (2008).
Ramakrishna, S., Mayer, J., Wintermantel, E. & Leong, K. W. Biomedical applications of polymer-composite materials: a review. Compos. Sci. Technol. 61, 1189–1224 (2001).
Schmidt, G. & Malwitz, M. M. Properties of polymer–nanoparticle composites. Curr. Opin. Colloid Interface Sci. 8, 103–108 (2003).
Tjong, S. C. Structural and mechanical properties of polymer nanocomposites. Mater. Sci. Eng. R Rep. 53, 73–197 (2006).
Siracusa, V., Rocculi, P., Romani, S. & Rosa, M. D. Biodegradable polymers for food packaging: a review. Trends Food Sci. Technol. 19, 634–643 (2008).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Ma, R., Liu, Z., Zhang, Q., Liu, Z. & Luo, T. Evaluating polymer representations via quantifying structure–property relationships. J. Chem. Inf. Model. 59, 3110–3119 (2019).
Chen, F.-C. Virtual screening of conjugated polymers for organic photovoltaic devices using support vector machines and ensemble learning. Int. J. Polym. Sci. 2019, 4538514 (2019).
Tao, L., Chen, G. & Li, Y. Machine learning discovery of high-temperature polymers. Patterns 2, 100225 (2021).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Wang, S., Guo, Y., Wang, Y., Sun, H. & Huang, J. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In Proc. 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 429–436 (Association for Computing Machinery, 2019).
Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. Preprint at http://arxiv.org/abs/2010.09885 (2020).
Ross, J. et al. Large-scale chemical language representations capture molecular structure and properties. Nat. Mach. Intell. 4, 1256–1264 (2022).
Ahmad, W., Simon, E., Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa-2: towards chemical foundation models. Preprint at http://arxiv.org/abs/2209.01712 (2022).
Kuenneth, C. & Ramprasad, R. polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics. Nat. Commun. 14, 4099 (2023).
Xu, C., Wang, Y. & Barati Farimani, A. TransPolymer: a transformer-based language model for polymer property predictions. npj Comput. Mater. 9, 64 (2023).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (2017).
Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Proc. 31st International Conference on Neural Information Processing Systems 1025–1035 (Curran Associates Inc., 2017).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations (2018).
Veličković, P. et al. Graph attention networks. In International Conference on Learning Representations (2018).
Aldeghi, M. & Coley, C. W. A graph representation of molecular ensembles for polymer property prediction. Chem. Sci. 13, 10486–10498 (2022).
Antoniuk, E. R., Li, P., Kailkhura, B. & Hiszpanski, A. M. Representing polymers as periodic graphs with learned descriptors for accurate polymer property predictions. J. Chem. Inf. Model. 62, 5435–5445 (2022).
Park, J. et al. Prediction and interpretation of polymer properties using the graph convolutional network. ACS Polym. Au. 2, 213–222 (2022).
Queen, O. et al. Polymer graph neural networks for multitask property learning. npj Comput. Mater. 9, 90 (2023).
Xia, J. et al. Mole-BERT: rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations (2023).
Schütt, K. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Proc. 31st International Conference on Neural Information Processing Systems (Curran Associates Inc., 2017).
Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 8, 3192–3203 (2017).
Zhang, L., Han, J., Wang, H., Car, R. & E, W. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120, 143001 (2018).
Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
Gasteiger, J. et al. GemNet-OC: developing graph neural networks for large and diverse molecular simulation datasets. Preprint at arXiv:2204.02782 (2022).
Zhou, G. et al. Uni-Mol: a universal 3d molecular representation learning framework. In Proc. 11th International Conference on Learning Representations (ICLR, 2023).
Zhu, Y. et al. Molecular contrastive pretraining with collaborative featurizations. J. Chem. Inf. Model. 64, 1112–1122 (2024).
Su, B. et al. A molecular multimodal foundation model associating molecule graphs with natural language. Preprint at arXiv:2209.05481 (2022).
Zhao, H. et al. GIMLET: a unified graph-text model for instruction-based molecule zero-shot learning. In Proc. 37th International Conference on Neural Information Processing Systems 5850–5887 (Curran Associates Inc., 2024).
Liu, S. et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nat. Mach. Intell. 5, 1447–1457 (2023).
Liu, S., Wen, T., Pattamatta, A. S. L. S. & Srolovitz, D. J. A prompt-engineered large language model, deep learning workflow for materials classification. Mater. Today 80, 240–249 (2024).
Polak, M. P. & Morgan, D. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nat. Commun. 15, 1569 (2024).
Wellen, R. M. R., Canedo, E. & Rabello, M. S. Nonisothermal cold crystallization of poly(ethylene terephthalate). J. Mater. Res. 26, 1107–1115 (2011).
Lin, T.-S. et al. BigSMILES: a structurally-based line notation for describing macromolecules. ACS Cent. Sci. 5, 1523–1531 (2019).
Huang, Q. et al. Enhancing copolymer property prediction through the weighted-chained-SMILES machine learning framework. ACS Appl. Polym. Mater. 6, 3666–3675 (2024).
Kuenneth, C., Schertzer, W. & Ramprasad, R. Copolymer informatics with multitask deep neural networks. Macromolecules 54, 5957–5961 (2021).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Edwards, C., Zhai, C. & Ji, H. Text2Mol: cross-modal molecule retrieval with natural language queries. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing (eds. Moens, M.-F., Huang, X., Specia, L. & Yih, S. W.) 595–607 (Association for Computational Linguistics, 2021).
Guo, Z., Sharma, P., Martinez, A., Du, L. & Abraham, R. Multilingual molecular representation learning via contrastive pre-training. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (eds. Muresan, S., Nakov, P. & Villavicencio, A.) 3441–3453 (Association for Computational Linguistics, 2022).
Stärk, H. et al. 3D infomax improves GNNs for molecular property prediction. In Proc. 39th International Conference on Machine Learning 20479–20502 (PMLR, 2022).
Liu, S. et al. Pre-training molecular graph representation with 3D geometry. In Proc. 10th International Conference on Learning Representations (ICLR, 2022).
Zhu, J. et al. Dual-view molecular pre-training. In Proc. 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 3615–3627 (ACM, 2023).
Feng, S., Yang, L., Ma, W. & Lan, Y. UniMAP: Universal SMILES-graph representation learning. Preprint at http://arxiv.org/abs/2310.14216 (2023).
Han, S. et al. Multimodal transformer for property prediction in polymers. ACS Appl. Mater. Interfaces 16, 16853–16860 (2024).
Wang, F. et al. MMPolymer: a multimodal multitask pretraining framework for polymer property prediction. Preprint at https://doi.org/10.48550/arXiv.2406.04727 (2024).
Otsuka, S., Kuwajima, I., Hosoya, J., Xu, Y. & Yamazaki, M. PoLyInfo: polymer database for polymeric materials design. In 2011 International Conference on Emerging Intelligent Data and Web Technologies 22–29 (IEEE, 2011).
Landrum, G. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. Greg. Landrum 8, 31 (2013).
OpenAI et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2024).
Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38 (2023).
Bang, Y. et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In Proc. 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 675–718 (Association for Computational Linguistics, 2023).
Christofidellis, D. et al. Unifying molecular and textual representations via multi-task language modelling. In Proc. 40th International Conference on Machine Learning 6140–6157 (PMLR, 2023).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., 2017).
Zolfaghari, M., Zhu, Y., Gehler, P. & Brox, T. Crossclr: cross-modal contrastive learning for multi-modal video representations. In Proc. IEEE/CVF International Conference on Computer Vision 1450–1459 (IEEE, 2021).
van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at arXiv:1807.03748 (2018).
Christofidellis, D. et al. Unifying molecular and textual representations via multi-task language modelling. In Proc. 40th International Conference on Machine Learning 6140–6157 (JMLR.org, 2023).
Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 17, 490–519 (1996).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2017).
Acknowledgements
This study was funded by the National Natural Science Foundation of China, No. 62474183. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Authors and Affiliations
Contributions
Q.H. conceived and designed the study, developed the Uni-Poly model, and wrote the manuscript. Y.L. contributed to the dataset preparation, performed model training and evaluation, and assisted with manuscript preparation. L.Z. supervised the research, provided domain expertise, and critically reviewed the manuscript. Q.Z. provided hardware resources, guidance on framework design, and advice on algorithmic implementation. W.Y. provided overarching project supervision, strategic guidance, and critical feedback to ensure the study's alignment with broader research goals.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Huang, Q., Li, Y., Zhu, L. et al. Unified multimodal multidomain polymer representation for property prediction. npj Comput Mater 11, 153 (2025). https://doi.org/10.1038/s41524-025-01652-z