High-accuracy physical property prediction for pure organics via molecular representation learning: bridging data to discovery

Ou, Qi; Wang, Hongshuai; Zhuang, Minyang; Chen, Shangqian; Liu, Lele; Wang, Ning; Gao, Zhifeng

doi:10.1038/s41524-025-01720-4

Download PDF

Article
Open access
Published: 11 July 2025

High-accuracy physical property prediction for pure organics via molecular representation learning: bridging data to discovery

Qi Ou¹^na1,
Hongshuai Wang²^na1,
Minyang Zhuang¹,
Shangqian Chen²,
Lele Liu¹,
Ning Wang¹ &
…
Zhifeng Gao^2,3

npj Computational Materials volume 11, Article number: 224 (2025) Cite this article

2033 Accesses
1 Citations
1 Altmetric
Metrics details

Subjects

Abstract

The escalating energy crisis has spurred extensive research into organic compounds for energy-efficient applications, taking advantage of their environmental friendliness, cost-effective synthesis, and adaptable molecular structures. Traditional trial-and-error methods for discovering highly functional organic compounds are expensive and time-consuming. We employed a 3D transformer-based molecular representation learning algorithm to create the Org-Mol pre-trained model, using 60 million semi-empirically optimized small organic molecule structures. After fine-tuning with public experimental data, the model can accurately predict various physical properties of pure organics, with test set R² values exceeding 0.92. These fine-tuned models are used in high-throughput screening among millions of ester molecules to identify novel immersion coolants, resulting in the experimental validation of two promising candidates. This work not only demonstrates the potential of Org-Mol in predicting bulk properties for pure organic compounds but also paves the way for the rational and efficient development of ideal candidates for energy-saving materials.

Efficient screening framework for organic solar cells with deep learning and ensemble learning

Article Open access 23 October 2023

Organic materials repurposing, a data set for theoretical predictions of new applications for existing compounds

Article Open access 14 February 2022

Unified and explainable molecular representation learning for imperfectly annotated data from the hypergraph view

Article Open access 30 September 2025

Introduction

Amidst the current energy crisis, the research and development of materials that are energy-efficient and/or offer high energy utilization efficiency have become an immediate imperative and a focal point in scientific inquiry^1,2,3,4. Organic compounds play a pivotal role in this context, owing to their environmental friendliness, low-cost processing, and almost infinite modifiability to meet a wide array of property requirements. Prospective applications lie in the fields such as immersion coolants for data centers^5,6,7, phase change materials for thermal energy storage^8,9, and liquid organic hydrogen carriers^10,11,12. All of these fields raise specific requirements on various physical and chemical properties of the applied organics. Taking immersion coolants for the data center as an example, an ideal coolant should possess a low dielectric constant and viscosity, as well as high heat capacity and thermal conductivity, so as to provide a high thermal exchange efficiency during the circulation while preventing plausible signal loss during the operation of electronic devices^13,14.

Although the measurement of the properties mentioned above for a single compound is straightforward¹⁵, the synthesis of various candidates can be time-consuming. Therefore, the experimental trial-and-error cost of exploring the chemical space to develop ideal energy-saving organic materials remains high. Multiple attempts have been made to predict these electrical, mechanical, and thermal properties of organic compounds via molecular dynamics (MD) and/or machine learning (ML) methods^{16,17,18,19,20,21,22,23,24}. However, the accuracy of such methods relies on the applied force field and/or the algorithm applied in the ML model, as well as the quality and the size of the data set^25,26,27. It can be seen that predicting these physical properties via MD is time-consuming for a single system¹⁶ and becomes unaffordable in a high-throughput screening scenario with thousands of candidates, while the validity of some previously reported ML models is limited to a few specific types of compounds^20,23,28. The requirement of constructing the appropriate descriptors for ML models can also be computationally expensive and sometimes non-trivial^19,21,27,29, raising additional challenges for making an accurate yet efficient predictio,n which is generalizable to a plethora of organic compounds.

With the development of deep learning algorithms, highly efficient screening and rational design of novel materials with required properties have been further made simple^30,31,32. For instance, Uni-Mol, a 3D transformer-based molecular representation learning (MRL) algorithm³³, has been successfully applied in the prediction of photophysical properties of organic photoluminescence systems³⁴ and the adsorption energies of various gaseous molecules on metal-organic-frameworks³⁵. One of the most favored advantages of Uni-Mol is that sufficient and rational descriptors can be automatically constructed via the pre-training process, requiring only a single molecular structures as input, and the pre-trained model can be further optimized by the subsequent fine-tuning with a small number of labeled data^33,34,35. While most of the previously reported properties predicted by the Uni-Mol framework depend solely on a single molecular structure or on a single cell of a periodic crystallized structure, its feasibility in predicting bulk properties for amorphous (or liquid phase) organic compounds is yet to be validated. If proven effective, such a protocol may remarkably streamline the model-constructing process and reduce the computational cost in predicting a variety of physical properties for pure amorphous organics.

Keeping these in mind and to facilitate the discovery of energy-saving materials in terms of pure organics, we apply the Uni-Mol framework to develop a pre-trained model specifically designed for organic compounds, namely, the Org-Mol model, followed by the fine-tuning process over various physical properties with public experimental data. Our results show that even though only the single molecular coordinates are provided during the pre-training process, the fine-tuned models for all bulk properties of pure organic small molecules achieve significantly high accuracy (with the R² value of the test set higher than 0.95). As a proof-of-concept, we also assess how effectively Org-Mol can fine-tune the glass transition temperature of homopolymers using the structure of the repeating unit as input, achieving an R² score above 0.92 on the test set. Notably, these fine-tuned models are applied in a high-throughput screening for the discovery of novel immersion coolants over millions of ester molecules. Hundreds of candidates are finally filtered out, and two of them are experimentally synthesized and validated, highlighting the capability of Org-Mol to directly bridge data to discovery. Notwithstanding that the current study focuses on single-component organic systems, the prediction framework established here can be readily adapted to other energy-saving materials (e.g., phase-change materials) via the same fine-tuning process.

Results

Pre-trained model and fine-tuning data sets

We first perform a t-SNE analysis³⁶ for the molecular weight of the pre-training data set to speculate the effectiveness of Org-Mol pre-trained model in learning molecular structures and its ability to distinguish between different molecules. The 512-dimension embedding layer of the Org-Mol model is projected into two dimensions, and the distributions of the molecular weight of 20,000 randomly selected molecules in the pre-training data are shown in Fig. 1a and b. With the randomly initialized weight (as shown in Fig. 1a), the data points are more evenly distributed, while in Fig. 1b, where the Org-Mol pre-trained weight is applied, the data points with similar molecular weights tend to cluster together, elucidating that a simple structure-property relationship has been constructed by the Org-Mol pre-trained model. As detailed in Fig. 1c, to accurately predict physical properties of various organic compounds, we carry out the fine-tuning process for each physical property of interest so that a more specific and sophisticated structure-property relationship can be established. The size and range of each property are summarized in Table 1.

**Fig. 1: Org-Mol model’s pretraining dataset’s molecular weight distribution via t-SNE analysis and the pre-training/fine-tuning process.**

Table 1 Dataset size and value ranges of the fine-tuning dataset used in this study, including dielectric constant ε near 25 °C, kinematic viscosity ν at 40 °C and 100 °C, density ρ at 40 °C, the specific heat at constant pressure C_p at 25 °C, and thermal conductivity κ at 25 °C for organic small molecules, and the glass transition temperature T_g of homopolymers

Full size table

Prediction of the electrical property

The dielectric constant is chosen as a representative to validate the performance of Org-Mol on predicting the electrical property, which plays a crucial role in the field of energy materials, especially in the areas of electrochemical energy storage as well as the development of novel immersion coolants³⁷. The static dielectric constant ε of a given system is the value of the frequency-dependent dielectric function in the limit of zero frequency³⁸. The numerical value of ε_S can be evaluated via the fluctuations of the total dipole moment of the system, which equals the summation of the molecular dipole moments (μ_i), i.e., M = ∑_iμ_i. Mathematically, one has^39,40

$$\varepsilon =1+\frac{\left\langle {{\bf{M}}}^{2}\right\rangle -{\left\langle {\bf{M}}\right\rangle }^{2}}{3{\epsilon }_{0}V{k}_{{\rm{B}}}T}$$

(1)

where ϵ₀ is vacuum permittivity, V the volume, k_B the Boltzmann constant, and T the temperature. For systems with a totally random and uncorrelated arrangement of dipoles, ${\left\langle {\bf{M}}\right\rangle }^{2}=0$ and ε are proportional to the square of the average molecular dipole moment. However, such straight forward relationship between the static dielectric constant of a bulk organic system and the molecular dipole moment can merely be satisfied in practice due to the unpredictable orientation of the single molecular dipoles. As shown in Fig. 2a, the correlation coefficient (R) between the square of the computed single molecular dipole moment (without averaging over different configurations) and the experimental dielectric constant is only 0.728. A few conspicuous outliers are those with a hydroxyl group and/or an amino group (as labeled in Fig. 2a), which introduce the formation of the hydrogen bond and the corresponding oriented arrangement of the dipoles. Consequently, the single molecular dipole moment, despite being straightforward to calculate, is not an appropriate metric, especially for a quantitative evaluation of the static dielectric constant. It should be noted that previous studies have demonstrated the feasibility of accurately computing the dielectric constant for organic systems using MD simulations via a delicate approach to extract the short-ranged Kirkwood g-factor^16,41. Though with high accuracy, such calculations usually require remarkably long simulation time for a single system (especially for those large flexible organics), and are merely affordable to be applied in high high-throughput screening process.

**Fig. 2: Org-Mol model’s performance in predicting dielectric constant and related analyses.**

Although the static dielectric constant is a bulk property, its value for various organic compounds can be accurately predicted via the Org-Mol fine-tuned model with only the single molecular coordinates as the input. As shown in Fig. 2b, for the static dielectric constant near room temperature, the R² value of the training set is 0.990, and that of the test set is around 0.968, with an MAE of 0.726 (as shown in Fig. 2c). While most of the compounds in the data set are liquids, we further validate Org-Mol’s applicability in organic semiconductors. As shown in Table S1, we compare the dielectric constants of three organic semiconductors (structures given in Fig. S3) predicted by Org-Mol with their experimental values. For all three organic semiconductors, Org-Mol provides reasonable predictions as compared to the experimental values. The capability of Org-Mol to predict bulk properties from single-molecule inputs arises from its unique integration of Uni-Mol’s 3D transformer architecture and fine-tuning with experimental measured properties. During pre-training, the model implicitly learns geometric patterns (e.g., bond angles, atomic distances) from 60 million PM6-optimized single-molecule structures, capturing latent structural motifs associated with intermolecular interactions like hydrogen bonding or dipole-dipole forces. However, challenges persist because single-molecule representations lack explicit information about condensed-phase dynamics, such as dipole correlations governing the dielectric constant (as shown in Eq. (1)). To address this, fine-tuning with experimental bulk data aligns pre-trained features with collective behavior. This process bridges static molecular geometry with bulk phenomena, correcting single-molecule limitations and enabling accurate predictions despite collective-effect complexity.

During the prediction process, it is found that the dielectric constant of a given carboxylic acid is lower than that of its isomeric esters even though the carboxilic group seems “more polar” compared to the ester group. As shown in Table S2, the predicted dielectric constants for butanoic acid, methyl propanoate, ethyl acetate, and propyl methanoate are 2.990, 6.017, 6.266, and 6.973, respectively, which are highly consistent with the corresponding experimental values measured at 303K. Similar observations can be found in other carboxylic acids and their corresponding isomeric esters as demonstrated in Table S1. This seemingly counterintuitive phenomenon can be ascribed to two aspects. First, the molecular dipole moment of the carboxylic acid is actually lower than that of its isomeric esters. As shown in Fig. 2d, although the carboxylic group itself possesses a more polarized charge distribution than the ester group does, the interaction between the positively charged protonic hydrogen and the negatively charged carbonyl oxygen within the carboxylic acid effectively counterbalances the overall molecular dipole moment. Second, due to the intermolecular hydrogen bond, the condensed phase carboxylic acid contains plentiful “symmetric” dimer structures (with a remarkably low dipole moment ~ 0.6 Debye based on density functional theory calculation) as shown in Fig. 2e, which lowers the overall polarity and hence the dielectric constant.

Prediction of the mechanical and thermal properties

We now examine the capability of Org-Mol in predicting the viscosity, one of the crucial metrics in fluid mechanics. Note that compared with the dielectric constant, the viscosity is more sensitive to the temperature. Therefore, we carry out two separate fine-tuning jobs based on the pre-trained model with two sets of kinematic viscosities of various organic liquids measured at 40 °C and 100 °C, respectively. Note also that the relationship between the dynamic viscosity μ and the kinematic viscosity ν is μ = ν × ρ, where ρ is the density⁴². To enable the prediction for both the dynamic viscosity and the kinematic viscosity as well as the transformation between them, we also carry out a multi-regression fine-tuning task to predict the density of organic compounds at 25 °C, 40 °C, 60 °C, and 100 °C.

It can be seen in Fig. 3a–d that the predicted values of the kinematic viscosity at two tested temperatures are highly consistent with the corresponding experimental counterparts. The R² values for the train and test set of the kinematic viscosity at 40 °C are 0.991 and 0.972, respectively, while those at 100 °C are even higher (0.996 and 0.974, respectively). As shown in Fig. 3e, f and Fig. S2, the predicted densities are in extremely good agreement with the target values for both the train set and the test set. Compared to the prediction of dielectric constant and the viscosity, predicting the density is less difficult, due to the fact that the latter is a pure structural property and is less relevant to the dynamical behavior of the system. The slight underestimation of densities in high-density regions can be attributed to the sparsity of experimental data in those areas. In practice, very few common organic liquids reach such elevated densities, thus mitigating any significant impact on potential material design and selection processes.

**Fig. 3: Org-Mol model’s performance in predicting kinematic viscosity and density.**

For thermal properties, we first take the specific heat at constant pressure C_p (or the heat capacity), and the thermal conductivity κ as two representatives to examine the performance of Org-Mol. As shown in Fig. 4a–d, the correlation between the Org-Mol predicted values and the experimental counterparts for both of these two thermal properties remains high. The R² values for the training and test sets are 0.993 and 0.982 for the heat capacity, respectively, with very small MAEs, evincing the capability of Org-Mol in the accurate prediction of thermal properties. Note that even though only 252 data points are collected for thermal conductivity, the performance of Org-Mol on the test set is still satisfactory (with the R² values 0.958 for the test set), demonstrating its strength of tackling data with relatively small sample size. Furthermore, we assess the ability of Org-Mol in predicting properties of polymer systems by taking the glass transition temperature (T_g) as an example. We fine-tune Org-Mol on T_g data for around 300 homopolymers using repeating-unit structures as input (capping by hydrogen). As shown in Fig. 4e and f, the model achieves an R² of 0.926 and MAE of 25.4K on the test set, outperforming previously reported MD-based predictions⁴³ and demonstrating the potential generalizability of Org-Mol to homopolymer systems. It should be noted that predicting complex polymer properties remains challenging, especially for more sophisticated properties and those of copolymers, and specialized models like MMPolymer⁴⁴ or PolyBERT⁴⁵ are better suited for such scenarios.

**Fig. 4: Org-Mol model’s performance in predicting thermal properties.**

Comparison with reference models

To validate the accuracy of Org-Mol across the aforementioned properties of organic small molecules, we conduct comparative experiments using three state-of-the-art methods: two graph neural networks (EGNN⁴⁶ and NequIP⁴⁷), and a tree-based algorithm (XGBoost⁴⁸) with Morgan fingerprints as molecular descriptors. Hyperparameters are optimized for each model to ensure fair comparison. The metrics for the test set are summarized in Table 2. While graph neural network algorithms (EGNN and NequIP) outperform the tree-based model XGBoost, Org-Mol demonstrates superior performance across all evaluated properties. Notably, substantial enhancements are observed for dielectric constant and thermal conductivity, two properties inherently challenging due to their complexity and sparse experimental data, underscoring the efficacy of Org-Mol’s pre-trained architecture in handling complex and/or data-scarce scenarios.

Table 2 Metrics of Org-Mol, EGNN, NequIP, and XGBoost for the dielectric constant ε, the kinematic viscosity ν at 40°C and 100°C, the heat capacity C_p, and the thermal conductivity κ.^a

Full size table

The heat capacity of organic compounds, being a fairly predictable characteristic, has been thoroughly investigated using a multitude of ML techniques^24,49. As summarized in Table 3, while previously reported models have gain relatively satisfying accuracies in predicting the heat capacity of various organics, Org-Mol achieves the highest accuracy among all applied algorithms (with marginal improvement compared to EGNN and NequIP as shown in Table 2) over a larger and more diverse dataset.

Table 3 Comparison of previously reported models and this work for the heat capacity C_p

Full size table

To further validate Org-Mol’s generalizability and prediction accuracy, we adopt a sphere-exclusion partitioning strategy (detailed in the Supporting Information) for all tested properties, representing a more rigorous test than random splitting. As shown in Fig. S4, the model maintained consistently high R² values (>0.90) for small organic molecule property predictions, with the exception of dielectric constant (R²=0.876). In particular, this discrepancy arises from a single outlier. Excluding this point improves the R² to 0.945, likely attributed to experimental measurement error. For homopolymer glass transition temperature prediction, the sphere-exclusion split test set yielded an R² of 0.892, slightly lower than random partitioning results but still robust. Overall, Org-Mol achieves satisfying R² scores across all properties under the more challenging sphere-exclusion scheme, underscoring its robust generalizability within the numerical bounds of the training data. Predictions beyond these property ranges remain challenging and necessitate further exploration.

Experimental validation for the discovery of immersion cooling candidates

As a proof-of-concept of how these property-prediction models can accelerate the discovery of new materials as well as to experimentally validate their accuracy, we apply the Org-Mol fine-tuned models to systematically screen out promising candidates for immersion coolants, which require low dielectric constant (strictly lower than 3.20 at room temperature for electrical safety), low kinematic viscosity (preferably lower than 12 cSt at 40 °C), high thermal conductivity (preferably higher than 0.140 W/m ⋅ K), and environmental friendliness. We focus on synthetic esters for the design of novel immersion coolants owing to their extraordinary biodegradability. Due to its strict limit for immersion coolants, we apply the dielectric constant as a primary criterion and screen out over 6 million saturated monoesters and dibasic esters (constructed from saturated acid and alcohol, checked out from PubChem⁵⁰) with the number of carbon atoms lower or equal to 30. Interestingly, all tested dibasic esters possess dielectric constants larger than 3.50, while 8569 monoesters meet the criterion. The corresponding t-SNE analysis is applied to the predicted dielectric constants of these monoesters to further justify the effectiveness of both the pre-trained and the fine-tuned model. As shown in Fig. 5a), with the randomly initialized weight, data points are evenly distributed with the 2D projected space. The distribution after pre-training, as seen in Fig. 5b), is capable of effectively distinguishing certain structures. For instance, in the upper-right corner of Fig. 5b), molecules with higher dielectric constants are grouped together. Upon fine-tuning, a noticeable decreasing trend of the dielectric constant values emerges from left to right in Fig. 5c) as compared to the pre-trained model, which indicates that the combination of the pre-training and fine-tuning process can more accurately differentiate molecules based on their dielectric constants.

**Fig. 5: Distributions of the dielectric constant of monoesters in 2D projections of Org-Mol embedding layer.**

Monoesters with dielectric constants lower than 3.20 are then fed to the kinetic viscosity model and then the thermal conductivity model. Viscosity is prioritized over thermal conductivity due to its significant impact on system efficiency and operational feasibility. Based on the aforementioned criteria of the kinematic viscosity and thermal conductivity, 461 monoesters are finally filtered out. By further incorporating manual synthetic feasibility assessments based on experimental expertise, we select two candidates for synthesis, and their physical properties are experimentally measured. The whole sequential screening process is summarized in Scheme 1 with the molecular structures of two experimentally synthesized candidates. Detailed synthesis method and characterization analysis are given in the Supporting Information. As shown in Table 4, the predicted values for all tested properties of these two candidates are in good agreement with the experimental results. A slight systematic overestimation is observed for both the kinematic viscosity and the thermal conductivity, presumably due to the inconsistency of the measuring methods between the current work and the referenced works contributing to the data set. (See Supporting Information for testing methods applied in this work.) It should be noticed that even though the thermal conductivity of the second candidates is slightly lower than the preferred value (0.140 W/m ⋅ K), its low dielectric constant ensures superior electrical insulation, which is particularly crucial for practical applications. Overall, the Org-Mol fine-tuned models succeed in screening out promising candidates for immersion coolants over millions of candidates, and can be further applied in the discovery of other energy-saving materials.

Table 4 Predicted and experimentally measured dielectric constants ε, kinematic viscosities ν at 40 °C and 100 °C, and thermal conductivities κ for two filtered-out immersion cooling candidates

Full size table

Discussion

In conclusion, we have applied the framework of Uni-Mol, a 3D MRL algorithm, and developed the Org-Mol pre-trained model, which is specifically designed for organic compounds on a vast dataset of 60 million semi-empirically optimized structures. Leveraging the power of 3D transformer-based algorithms, we have successfully fine-tuned the Org-Mol pre-trained model, achieving R² values exceeding 0.95 for test sets across various physical properties of pure organic small molecules. Furthermore, Org-Mol’s ability to predict glass transition temperatures of homopolymers using repeating-unit structures (test set R² > 0.92) highlights its potential for polymeric systems. This high level of accuracy not only surpasses current state-of-the-art models but also offers a practical solution to the challenges posed by traditional experimental methods, which are often costly and time-consuming.

The application of Org-Mol in high-throughput screening has demonstrated its capability to enable the rapid identification of potential immersion coolants from a pool of millions of automatically constructed ester molecules. The experimental validation of two promising candidates further underscores the practical utility of our approach in accelerating the discovery of new materials for energy-efficient applications.

To summarize, the Org-Mol model provides an important advancement in predicting properties of pure organic compounds, with far-reaching implications for energy-saving applications. Although the scope of the current work is limited to single-component systems, the demonstrated accuracy in predicting electrical, mechanical, and thermal properties-including its generalization to homopolymer glass transition temperatures—offers a foundational framework for rational design of tailored amorphous materials, as evinced by the experimental validation of immersion cooling candidates for energy-efficient electronics. While challenges in complex assemblies (e.g., mixtures, copolymers) necessitate further innovation, the model’s modularity and performance in pure organics paves the way for guiding the design of novel energy-efficient materials.

Methods

Model construction

We apply the framework of Uni-Mol to construct a pre-trained model, particularly designed for organic molecules containing C, H, O, N, S, Se, B, and halogens, namely the Org-Mol pre-trained model. This pre-trained model is constructed based on the PubChemQC PM6 data sets⁵¹. (See ref. ³³ for the detailed architecture of the pre-training process.) Specifically, PM6⁵²-optimized structures of 60 million small organic molecules containing the aforementioned elements are fed into the pre-training framework to initiate the self-supervised task including the 3D coordinates recovery and masked atom prediction. The Org-Mol pre-trained model is then fine-tuned via multiple experimentally measured physical properties of general single-component organic compounds, i.e., the dielectric constant near 25 °C collected from ref. ⁵³ and ref. ⁵⁴, the kinematic viscosity near 40 and 100 °C collected from ref. ⁵⁵ and ref. ⁵⁶, the density at multiple temperatures and two thermal properties at 25 °C (thermal conductivity and the specific heat at constant pressure) collected from ref. ⁵⁵, the glass transition temperature of a set of homopolymers collected from ref. ⁴³. These data, together with the PM6-optimized single molecular structures of the corresponding compounds (repeating unit structures for polymer case), are collected and divided into training set and test set in a 9:1 ratio. The whole model-construction process is schematized in Fig. 1c. Metrics applied to validate the fine-tuned model for the aforementioned properties include the R² value, the mean absolute error (MAE), the root mean square error (RMSE), and the mean absolute percentage error (MAPE) are calculated as follows

$${R}^{2}=1-\frac{\mathop{\sum }\nolimits_{i = 1}^{N}{\left({\hat{y}}_{i}-{y}_{i}\right)}^{2}}{\mathop{\sum }\nolimits_{i = 1}^{N}{\left({\hat{y}}_{i}-\bar{y}\right)}^{2}}$$

(2)

$${\rm{MAE}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}| {\hat{y}}_{i}-{y}_{i}|$$

(3)

$${\rm{MAPE}}( \% )=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\frac{| {\hat{y}}_{i}-{y}_{i}| }{\max (\eta ,| {y}_{i}| )}\times 100 \%$$

(4)

$${\rm{RMSE}}=\sqrt{\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({\hat{y}}_{i}-{y}_{i}\right)}^{2}}$$

(5)

where N is the number of samples, y_i the experimentally measured data for sample i, ${\hat{y}}_{i}$ the Org-Mol predicted data for sample i, $\bar{y}$ the average of the experimental data, η an arbitrary small yet strictly positive number to avoid undefined results when y_i is zero.

Quantum chemistry calculation of molecular dipole moments

To investigate the relationship between the single molecular dipole moment and the dielectric constant of the corresponding liquid phase system, we calculate the dipole moment for randomly picked systems in the dielectric constant data set via density functional theory in the quantum chemistry package Gaussian 16⁵⁷. B3LYP functional (with D3 empirical dispersion correction) and def2-svp basis set are applied to perform the single-molecule ground state geometry optimization and dipole moment calculation.

Data availability

The data that support the findings of this study, i.e., the PM6-optimized molecular structures (60 million) and experimental property datasets (dielectric constant, viscosity, etc.) are available via the referenced sources. Complete fine-tuning datasets can be obtained upon reasonable request by contacting the corresponding authors. Experimental validated results are provided in the Supporting Information, with the 3D coordinates of the tested molecules deposited at https://zenodo.org/records/15258939.

Code availability

Open-source code: The Uni-Mol framework (used as the basis for Org-Mol) is freely available on GitHub (https://github.com/DeepModeling/Uni-Mol).

References

Mirkin, C. A., Sargent, E. H. & Schrag, D. P. Energy transition needs new materials. Science 384, 713–713 (2024).
Article PubMed Google Scholar
Davidson, D. J. Exnovating for a renewable energy transition. Nat. Energy 4, 254–256 (2019).
Article Google Scholar
Zhang, Q., Sun, Y., Xu, W. & Zhu, D. Organic thermoelectric materials: Emerging green energy materials converting heat to electricity directly and efficiently. Adv. Mater. 26, 6829–6851 (2014).
Article CAS PubMed Google Scholar
Kandpal, V., Jaswal, A., Santibanez Gonzalez, E. D. R. & Agarwal, N. Energy Efficiency and Renewable Energy Technologies, 89–123 https://doi.org/10.1007/978-3-031-52943-6_3 (Springer Nature Switzerland, Cham, 2024).
Kuncoro, I. W. et al. Immersion cooling as the next technology for data center cooling: A review. In Journal of Physics: Conference Series, vol. 1402, 044057 (IOP Publishing, 2019).
Kanbur, B. B., Wu, C., Fan, S. & Duan, F. System-level experimental investigations of the direct immersion cooling data center units with thermodynamic and thermoeconomic assessments. Energy 217, 119373 (2021).
Article Google Scholar
Huang, Y. et al. Experimental study on the immersion liquid cooling performance of high-power data center servers. Energy 297, 131195 (2024).
Article Google Scholar
Sharma, R., Ganesan, P., Tyagi, V., Metselaar, H. & Sandaran, S. Developments in organic solid–liquid phase change materials and their applications in thermal energy storage. Energy Convers. Manag. 95, 193–228 (2015).
Article CAS Google Scholar
Li, B. et al. Colossal barocaloric effects in plastic crystals. Nature 567, 506–510 (2019).
Article CAS PubMed Google Scholar
Preuster, P., Papp, C. & Wasserscheid, P. Liquid organic hydrogen carriers (LOHCS): Toward a hydrogen-free hydrogen economy. Acc. Chem. Res. 50, 74–85 (2017).
Article CAS PubMed Google Scholar
Mah, A. X. Y. et al. Targeting and scheduling of standalone renewable energy system with liquid organic hydrogen carrier as energy storage. Energy 218, 119475 (2021).
Article Google Scholar
Tan, R., Ji, Q., Ling, Y. & Li, L. Advances in liquid organic hydrogen carriers: Developing efficient dehydrogenation strategies. Chem. Commun. 60, 8186–8203 (2024).
Article CAS Google Scholar
Chhetri, A. et al. Numerical simulation of the single-phase immersion cooling process using a dielectric fluid in a data server. Mater. Today.: Proc. 51, 1532–1538 (2022).
Google Scholar
Wu, X. et al. Prediction models of saturated vapor pressure, saturated density, surface tension, viscosity and thermal conductivity of electronic fluoride liquids in two-phase liquid immersion cooling systems: A comprehensive review. Appl. Sci. 13, 4200 (2023).
Article CAS Google Scholar
Nadkarni, R. & Nadkarni, R. Guide to ASTM Test Methods for the Analysis of Petroleum Products and Lubricants, vol. 44 (ASTM International West Conshohocken, 2007).
Yamaguchi, T., Yamada, H., Fujiwara, T. & Hori, K. Simulations of dielectric constants and viscosities of organic electrolytes by quantum mechanics and molecular dynamics. J. Mol. Liq. 312, 113288 (2020).
Article CAS Google Scholar
Caleman, C. et al. Force field benchmark of organic liquids: Density, enthalpy of vaporization, heat capacities, surface tension, isothermal compressibility, volumetric expansion coefficient, and dielectric constant. J. Chem. Theory Comput. 8, 61–74 (2012).
Article CAS PubMed Google Scholar
Ju, S.-P., Huang, S.-C., Lin, K.-H., Chen, H.-Y. & Shen, T.-K. Prediction of optical and dielectric properties of 4-cyano-4-pentylbiphenyl liquid crystals by molecular dynamics simulation, coarse-grained dynamics simulation, and density functional theory calculation. J. Phys. Chem. C. 120, 14277–14288 (2016).
Article CAS Google Scholar
Sild, S. & Karelson, M. A general qspr treatment for dielectric constants of organic compounds. J. Chem. Inf. Comput. Sci. 42, 360–367 (2002).
Article CAS PubMed Google Scholar
Meng, X., Jia, M. & Wang, T. Neural network prediction of biodiesel kinematic viscosity at 313 k. Fuel 121, 133–140 (2014).
Article CAS Google Scholar
Deng, J. & Jia, G. Dielectric constant prediction of pure organic liquids and their mixtures with water based on interpretable machine learning. Fluid Phase Equilib. 561, 113545 (2022).
Article CAS Google Scholar
Luo, S., Deng, J. & Jia, G. Physical information neural network combined with the symbolic regression: A machine learning method for prediction of dielectric constants in organic liquids and water mixtures. J. Mol. Liq. 417, 126544 (2024).
Zheng, H., Lv, W., Wang, Y., Feng, Y. & Yang, H. Molecular kinematic viscosity prediction of natural ester insulating oil based on sparse machine learning models. J. Mol. Liq. 385, 122355 (2023).
Article CAS Google Scholar
Shan, Y., Wu, Q., Yuan, H. & Liu, W. Develop machine learning-based model and automated process for predicting liquid heat capacity of organics at different temperatures. Fluid Phase Equilib. 584, 114132 (2024).
Article CAS Google Scholar
Gereben, O. & Pusztai, L. On the accurate calculation of the dielectric constant from molecular dynamics simulations: The case of spc/e and swm4-dp water. Chem. Phys. Lett. 507, 80–83 (2011).
Article CAS Google Scholar
Krishnamoorthy, A. et al. Dielectric constant of liquid water determined with neural network quantum molecular dynamics. Phys. Rev. Lett. 126, 216403 (2021).
Article CAS PubMed Google Scholar
Comesana, A. E., Huntington, T. T., Scown, C. D., Niemeyer, K. E. & Rapp, V. H. A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties. Fuel 321, 123836 (2022).
Article CAS Google Scholar
Amano, T., Yamazaki, T. & Tsuneyuki, S. Chemical bond based machine learning model for dipole moment: Application to dielectric properties of liquid methanol and ethanol. Phys. Rev. B 110, 165159 (2024).
Article CAS Google Scholar
Gao, N., Wang, X., Xuan, Y. & Chen, G. An artificial neural network for the residual isobaric heat capacity of liquid hfc and hfo refrigerants. Int. J. Refrig. 98, 381–387 (2019).
Article CAS Google Scholar
Walters, W. P. & Barzilay, R. Applications of deep learning in molecule generation and molecular property prediction. Acc. Chem. Res. 54, 263–270 (2020).
Article PubMed Google Scholar
Meftahi, N. et al. Machine learning property prediction for organic photovoltaic devices. npj Comput. Mater. 6, 166 (2020).
Article CAS Google Scholar
Pang, C., Tong, H. H. & Wei, L. Advanced deep learning methods for molecular property prediction. Quant. Biol. 11, 395–404 (2023).
Article CAS Google Scholar
Zhou, G. et al. Uni-mol: A universal 3d molecular representation learning framework. In The Eleventh International Conference on Learning Representationshttps://openreview.net/forum?id=6K2RM6wVqKu (2023).
Cheng, Z. et al. Automatic screen-out of ir (iii) complex emitters by combined machine learning and computational analysis. Adv. Opt. Mater. 11, 2301093 (2023).
Article CAS Google Scholar
Wang, J. et al. A comprehensive transformer-based approach for high-accuracy gas adsorption predictions in metal-organic frameworks. Nat. Commun. 15, 1904 (2024).
Article CAS PubMed PubMed Central Google Scholar
van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Google Scholar
Zhang, C., Wang, H., Huang, Y., Zhang, L. & Chen, Y. Immersion liquid cooling for electronics: Materials, systems, applications and prospects. Renew. Sustain. Energy Rev. 208, 114989 (2025).
Article CAS Google Scholar
Maji, D., Indra, S. & Biswas, R. Dielectric relaxations of molten acetamide: Dependence on the model interaction potentials and the effects of system size. J. Chem. Sci. 133, 1–12 (2021).
Article Google Scholar
Neumann, M. Dipole moment fluctuation formulas in computer simulations of polar systems. Mol. Phys. 50, 841–858 (1983).
Article CAS Google Scholar
Neumann, M., Steinhauser, O. & Pawley, G. S. Consistent calculation of the static and frequency-dependent dielectric constant in computer simulations. Mol. Phys. 52, 97–113 (1984).
Article CAS Google Scholar
Zhang, C., Hutter, J. & Sprik, M. Computing the Kirkwood g-factor by combining constant Maxwell electric field and electric displacement simulations: Application to the dielectric constant of liquid water. J. Phys. Chem. Lett. 7, 2696–2701 (2016).
Article CAS PubMed Google Scholar
Alam, M. S., Ashokkumar, B. & Siddiq, A. M. The density, dynamic viscosity and kinematic viscosity of protic polar solvents (pure and mixed systems) studies: A theoretical insight of thermophysical properties. J. Mol. Liq. 251, 458–469 (2018).
Article CAS Google Scholar
Afzal, M. A. F. et al. High-throughput molecular dynamics simulations and validation of thermophysical properties of polymers for various applications. ACS Appl. Polym. Mater. 3, 620–630 (2021).
Article CAS Google Scholar
Wang, F. et al. Mmpolymer: A multimodal multitask pretraining framework for polymer property prediction. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM ’24, 2336-2346 https://doi.org/10.1145/3627673.3679684 (Association for Computing Machinery, New York, NY, USA, 2024).
Kuenneth, C. & Ramprasad, R. polybert: A chemical language model to enable fully machine-driven ultrafast polymer informatics. Nat. Commun. 13, 2453 (2022).
Google Scholar
Satorras, V. G., Hoogeboom, E. & Welling, M. E (n) equivariant graph neural networks. In International conference on machine learning, 9323–9332 (PMLR, 2021).
Batzner, S. et al. E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
Article CAS PubMed PubMed Central Google Scholar
Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 785-794 https://doi.org/10.1145/2939672.2939785 (Association for Computing Machinery, New York, NY, USA, 2016).
Khajeh, A. & Modarress, H. Quantitative structure–property relationship prediction of liquid heat capacity at 298.15k for organic compounds. Ind. Eng. Chem. Res. 51, 6251–6255 (2012).
Article CAS Google Scholar
Kim, S. et al. PubChem 2025 update. Nucleic Acids Res. 53, D1516–D1525 (2024).
Article PubMed Central Google Scholar
Nakata, M., Shimazaki, T., Hashimoto, M. & Maeda, T. Pubchemqc pm6: Data sets of 221 million molecules with optimized molecular geometries and electronic properties. J. Chem. Inf. Model. 60, 5891–5899 (2020).
Article CAS PubMed Google Scholar
Stewart, J. J. Optimization of parameters for semiempirical methods v: Modification of nddo approximations and application to 70 elements. J. Mol. Model. 13, 1173–1213 (2007).
Article CAS PubMed PubMed Central Google Scholar
Haynes, W. CRC Handbook of Chemistry and Physics. CRC Handbook of Chemistry and Physics https://books.google.com.sg/books?id=bNDMBQAAQBAJ (CRC Press, 2014).
Wohlfahrt, C. 2 pure liquids: Data. https://materials.springer.com/lb/docs/sm_lbs_978-3-540-47619-1_2. Copyright 1991 Springer-Verlag Berlin Heidelberg.
Yaws, C. Chemical Properties Handbook: Physical, Thermodynamics, Environmental Transport, Safety & Health Related Properties for Organic &. Chemical Engineering Books https://books.google.com.sg/books?id=PH2KG4cbtzYC (McGraw-Hill Education, 1999).
Wohlfarth, C. Viscosity of methyl decanoate. https://materials.springer.com/lb/docs/sm_lbs_978-3-662-49218-5_364. Copyright 2017 Springer-Verlag Berlin Heidelberg.
Frisch, M. J. et al. Gaussian1̃6 Revision C.01 Gaussian Inc. Wallingford CT (2016).

Download references

Acknowledgements

This work was supported by research grants from China Petroleum & Chemical Corp (funding number 124014). N.W. thanks the financial support from the National Key R&D Program of China (Grant No. 2024YFA1510200).

Author information

These authors contributed equally: Qi Ou, Hongshuai Wang.

Authors and Affiliations

SINOPEC Research Institute of Petroleum Processing Co., Ltd, Beijing, China
Qi Ou, Minyang Zhuang, Lele Liu & Ning Wang
DP Technology, Beijing, China
Hongshuai Wang, Shangqian Chen & Zhifeng Gao
AI for Science Institute, Beijing, China
Zhifeng Gao

Authors

Qi Ou
View author publications
Search author on:PubMed Google Scholar
Hongshuai Wang
View author publications
Search author on:PubMed Google Scholar
Minyang Zhuang
View author publications
Search author on:PubMed Google Scholar
Shangqian Chen
View author publications
Search author on:PubMed Google Scholar
Lele Liu
View author publications
Search author on:PubMed Google Scholar
Ning Wang
View author publications
Search author on:PubMed Google Scholar
Zhifeng Gao
View author publications
Search author on:PubMed Google Scholar

Contributions

Q.O., N.W., and Z.G. conceived the project. Q.O. and H.W. carried out the model training and computational tasks. S.C. helped with the model training process. L.L. and M.Z. carried out the synthesis and characterization of two candidates. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Qi Ou, Ning Wang or Zhifeng Gao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Ou, Q., Wang, H., Zhuang, M. et al. High-accuracy physical property prediction for pure organics via molecular representation learning: bridging data to discovery. npj Comput Mater 11, 224 (2025). https://doi.org/10.1038/s41524-025-01720-4

Download citation

Received: 04 January 2025
Accepted: 27 June 2025
Published: 11 July 2025
DOI: https://doi.org/10.1038/s41524-025-01720-4