Abstract
Thiophenic compounds have significant, multifaceted effects on the environment that cannot be overlooked. As heterocyclic compounds, thiophene and its derivatives play an important role in materials science, particularly in the design of organic semiconductors, pharmaceuticals, and advanced polymers. Accurate prediction of their thermophysical properties is critical because these properties govern structural, thermal, and transport behaviour. This study utilizes state-of-the-art machine learning and deep learning models to predict the high-pressure density of seven thiophene derivatives, namely thiophene, 2-methylthiophene, 3-methylthiophene, 2,5-dimethylthiophene, 2-thiophenemethanol, 2-thiophenecarboxaldehyde, and 2-acetylthiophene. The critical properties, including critical temperature (Tc), critical pressure (Pc), critical volume (Vc), and acentric factor (ω), together with the boiling point (Tb) and molecular weight (Mw), were used as input parameters. Models employed include Decision Tree (DT), Adaptive Boosting Decision Tree (AdaBoost-DT), Light Gradient Boosting Machine (LightGBM), Gradient Boosting (GBoost), TabNet, and Deep Neural Network (DNN). The statistical error evaluation showed that the LightGBM model achieved superior performance, with an average absolute percent relative error (AAPRE) of 0.0231, a root mean square error (RMSE) of 0.3499, and a coefficient of determination (R2) of 0.9999. The leverage method showed that 99.10 percent of the data was valid. These findings highlight the effectiveness of using critical properties as inputs and underscore the potential of the LightGBM model for reliable high-pressure density prediction of thiophene derivatives. This provides a robust tool for advancing materials science applications and offers valuable insights for material design under extreme conditions.
Introduction
Heterocyclic thiophenic compounds1, which contain sulfur in a five-membered ring structure, are increasingly significant in environmental studies due to their widespread use in various industrial applications, including pharmaceuticals, agrochemicals, and organic electronics. Their environmental impact is multifaceted, as they can persist in ecosystems, leading to potential toxicity to aquatic life and soil microorganisms. In aquatic ecosystems, the presence of thiophenic compounds, which can enter water bodies through industrial discharge, runoff, and wastewater, often leads to detrimental effects on aquatic life2. In soil, thiophenic compounds can inhibit microbial activity and alter soil chemistry. The stability of these compounds makes them resistant to typical biodegradation processes, so they accumulate in sediments and in the organic tissues of wildlife. In air, the presence of thiophenic compounds can significantly affect air quality and lead to both direct and indirect environmental consequences, primarily through industrial emissions, combustion processes, and the degradation of fossil fuels. Petroleum-derived transportation fuels contain considerable amounts of organic sulfur compounds, including benzothiophene, dibenzothiophene, and thiophene. Combustion of sulfur-rich fuels releases sulfur oxides (SOx), primarily sulfur dioxide (SO2)3. This colorless, pungent, and corrosive gas poses serious environmental concerns, playing a key role in acid rain formation, the greenhouse effect, photochemical pollution, and eutrophication4,5,6,7,8,9. Consequently, understanding the properties of sulfur compounds and mitigating their effects is crucial for reducing their environmental and industrial impact.
Heterocyclic compounds are a fascinating category of aromatic substances and represent one of the most extensive and structurally diverse families in organic chemistry, characterized by a broad spectrum of intermolecular interactions. Their remarkable diversity makes them a crucial and complex area of study. Alongside carbon and hydrogen, heterocyclic compounds commonly contain heteroatoms like sulfur, oxygen, and nitrogen1,10. The most prevalent types of heterocyclic compounds include five-membered rings (such as furan, thiophene, dioxolane, imidazole, and pyrrole) and six-membered rings (like morpholine and pyridine), which are commonly found in a variety of sources including plants, herbs, animals, coal, and petroleum. Due to their diverse properties and applications across various fields, thiophene-based materials are found in many fields, including medicine11,12,13,14,15,16, materials science17,18,19, and organic electronic devices and molecular electronics20,21,22. Thiophene-based compounds, particularly their derivatives, are widely utilized as chemo-sensors. They serve as effective fluorescence signaling promoters for detecting organic acids, metal ions, and cations23,24. Their unique electronic properties and structural diversity render thiophene compounds vital in advanced technologies, such as optoelectronic devices (OLEDs, OFETs, OTFTs, OSCs, OLFETs) and various sensors25,26,27,28,29,30,31,32,33,34,35,36. 2-Thiophenecarboxaldehyde and 2-thiophenemethanol find applications in materials science, nanoparticles, and biotechnology37,38,39,40,41,42,43,44,45,46,47. 2-Acetylthiophene finds applications in food flavoring and the synthesis of drugs for anxiety, inflammation, and parasitic infections, as well as in the production of metal complexes48,49,50,51,52,53,54,55.
To maximize the performance of these chemicals in industrial processes, it is crucial to enhance our understanding of their various physicochemical properties. It is essential to emphasize volumetric properties like density and its related characteristics, including isobaric expansibility and isothermal compressibility, especially under high-temperature and high-pressure conditions. Density serves as a fundamental material property with significant implications for process mechanics and engineering in chemical plants. Furthermore, understanding a compound’s density offers valuable insights into its molecular arrangement and packing behavior. Analyzing density variations with temperature or pressure allows for the determination of key parameters such as isothermal compressibility, isobaric expansibility, and internal pressure. However, traditional methods for determining density are often labor-intensive and susceptible to experimental errors.
Equations of state (EoS) and empirical relationships, while widely used for predicting thermophysical properties, often suffer from limitations such as the need for simplifying assumptions, poor accuracy under extreme conditions, and reliance on substance-specific constants that may not be available or accurate for all compounds. These models may also lack the flexibility to capture complex, nonlinear relationships inherent in experimental data, especially for structurally diverse compounds like thiophenes56,57,58,59,60. In contrast, machine learning (ML) and deep learning (DL) models offer several advantages, including the ability to learn directly from data without predefined functional forms, adapt to nonlinear patterns, and generalize across a wide range of conditions and molecular structures. They also enable the integration of diverse input features and provide high predictive accuracy, making them powerful tools for property estimation tasks where traditional models fall short61,62,63,64. Recently, machine learning has been widely used for modelling thermophysical properties, enabling faster predictions, enhanced accuracy, and efficient exploration of complex systems63,65,66,67,68,69,70,71,72,73,74,75,76,77.
To address these challenges, this study leverages machine learning (ML) and deep learning (DL) methods for predicting the density of seven thiophene-based heterocyclic compounds. An important innovation of this work is the use of critical properties, including critical temperature (Tc), critical pressure (Pc), critical volume (Vc), and acentric factor (ω), together with boiling point (Tb) and molecular weight (Mw), as input parameters to predict the density of thiophene derivatives. These parameters inherently reflect the molecular structure, intermolecular interactions, and phase transition characteristics, which are essential for accurately predicting density under various conditions. Choosing the right input parameters is crucial for developing accurate and reliable predictive models, as they directly influence the model’s ability to capture the underlying physical and chemical relationships. Using such physically meaningful and experimentally accessible inputs not only improves model performance and generalizability, but also ensures that predictions remain grounded in real-world chemical behaviour. In this work, in addition to machine learning models (DT, AdaBoost-DT, LightGBM and GBoost), we also used two deep learning models (TabNet and DNN) for high-pressure density prediction. By modelling the complex relationships between molecular structure and density, these computational approaches provide an efficient and scalable alternative to experimental methods. The findings not only enhance our understanding of thiophene derivatives but also demonstrate the potential of ML and DL in advancing predictive materials science, particularly for applications in pharmaceuticals, organic electronics, and sustainable energy solutions.
Theory and methodology
Dataset construction and description
This study deals with the density predictions for seven compounds from the thiophene family containing different functional groups (see Fig. S1 in Supplementary Material). Density prediction was studied over a wide temperature range (283.15–338.15 K) and pressure range (0.1–65 MPa), comprising 1336 data points1. The experimental data for these compounds (thiophene, 2-methylthiophene, 3-methylthiophene, 2,5-dimethylthiophene, 2-thiophenemethanol, 2-thiophenecarboxaldehyde, and 2-acetylthiophene) were obtained from the literature78,79,80. The critical properties of the compounds were likewise taken directly from the experimental data reported in the literature81. Because the ML/DL models in this work were trained on thiophenic materials within these temperature and pressure ranges (283.15–338.15 K and 0.1–65 MPa), their potentially reduced applicability to molecules with different functional groups, or to conditions outside these ranges, should be considered.
The thermal map presented in Fig. 1 shows a clear relationship between the density of thiophenes and temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), critical volume (Vc), acentric factor (ω), boiling point (Tb), and molecular weight (Mw). This thermal map shows that P, Mw, Tc, Tb, Vc, and ω have a direct relationship with the density of the thiophenes, whereas T and Pc have an inverse relationship with density. Fig. 1 also reveals strong correlations between some descriptors, which may lead to overfitting by introducing redundancy and multicollinearity into the model. To address this issue, we applied data normalization to ensure all features contribute equally during training, preventing those with larger scales from dominating the learning process. Additionally, we employed k-fold cross-validation to evaluate model performance across multiple data splits, which helps in selecting models that generalize well rather than fitting noise in the training data. Together, these techniques enhance the robustness and reliability of the model by reducing the risk of overfitting and improving predictive performance on unseen data.
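As an illustration of this workflow, the minimal sketch below combines feature scaling and k-fold cross-validation in a single pipeline so that the scaler is fitted only on the training folds; the choice of MinMaxScaler, five folds, an untuned LGBMRegressor, and the placeholder data are assumptions for illustration rather than the exact setup used here.

```python
# A minimal sketch (not the authors' exact pipeline): feature scaling is fitted
# only on the training folds of a k-fold cross-validation, avoiding data leakage.
# MinMaxScaler, 5 folds, and an untuned LGBMRegressor are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, cross_val_score
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.random((1336, 8))   # placeholder columns for T, P, Tc, Pc, Vc, omega, Tb, Mw
y = rng.random(1336)        # placeholder for density

model = make_pipeline(MinMaxScaler(), LGBMRegressor(random_state=42))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```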
Box plots offer valuable insights into outliers, median values, as well as minimum and maximum data points. A box plot summarizes five key statistics: the minimum, Q1 (the median of the lower half of the data), the median (the middle value of the data), Q3 (the median of the upper half of the data), and the maximum. Each plot consists of two main components: a pair of whiskers and a box. The lower whisker represents the minimum value, while the upper whisker indicates the maximum. The box itself spans from Q1 to Q3, illustrating the data distribution. Additionally, the horizontal line in the center marks the median value. The box plots representing the input and target variables for temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), critical volume (Vc), acentric factor (ω), boiling point (Tb), molecular weight (Mw) and density (ρ) are presented in Fig. S2 in Supplementary Material.
Predictive analytics
Model accuracy can be enhanced through grid search cross-validation82. This method systematically explores various models and hyperparameter combinations by testing each one individually and validating the results. The goal of grid search is to identify the optimal combination that yields the best model performance for prediction tasks83. Typically, grid search is integrated with k-fold cross-validation to establish a reliable evaluation metric82,84. In scikit-learn85, the `GridSearchCV` class can be used to implement the grid search algorithm and identify the optimal hyperparameters86. In this study, we used GridSearchCV to tune the hyperparameters. Table 1 shows the hyperparameter search ranges used for the machine learning and deep learning models, along with the optimal values identified through grid search.
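For illustration, a minimal GridSearchCV setup of this kind might look as follows; the parameter names are valid LGBMRegressor arguments, but the ranges are placeholders rather than the grid of Table 1, and X and y are the placeholder arrays from the previous sketch.

```python
# Illustrative GridSearchCV setup; the search ranges are placeholders, not the
# actual grid reported in Table 1.
from sklearn.model_selection import GridSearchCV, KFold
from lightgbm import LGBMRegressor

param_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
}
search = GridSearchCV(
    estimator=LGBMRegressor(random_state=42),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)                       # X, y as in the previous sketch
print(search.best_params_, -search.best_score_)
```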
Machine learning models
Decision tree (DT)
The decision tree (DT) method is a widely recognized machine learning approach for both classification and regression tasks87. It derives its name from its hierarchical, tree-like structure, which operates similarly to a flowchart and is constructed using a partitioning process. Over time, various decision tree algorithms have been introduced, including ID3, C4.5, CART, CHAID, and MARS. The primary aim of DT learning is to establish a framework capable of effectively predicting variations in a response variable or categorizing data within a test dataset. To accomplish this, DT employs a branching structure where internal nodes represent decision points based on attributes, and leaf nodes indicate the predicted outputs88,89. One of the strengths of the DT algorithm is its robustness to missing data and outliers, making it well-suited for both categorical and continuous variables. To prevent overfitting, key hyperparameters such as the minimum number of samples per leaf node and the maximum depth of the tree can be adjusted. Additionally, DT regression provides an intuitive way to examine the relationships between input and output variables, with its graphical representation serving as a practical tool for predicting continuous target values90. Fig. S3 presents a schematic representation of the DT model.
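A minimal decision-tree regression sketch is given below; the 80/20 split mirrors the training/testing partition used later in this work, while the max_depth and min_samples_leaf values (the overfitting-control hyperparameters mentioned above) are illustrative only.

```python
# Decision-tree regression sketch with illustrative hyperparameter values.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt = DecisionTreeRegressor(max_depth=10, min_samples_leaf=3, random_state=42)
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))        # R2 on the hold-out set
```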
Adaptive boosting decision tree (AdaBoost-DT)
Freund and Schapire introduced the adaptive boosting method (AdaBoost) in 199791 to develop a classifier. An adaptive resampling technique selects training samples, and classifiers are trained iteratively; during each iteration, misclassified samples are assigned more weight. The final classifier is then derived from a weighted aggregation of predictions from all trained models in the ensemble92. When paired with the AdaBoost algorithm, the DT, typically considered a weak learner, is expected to achieve notably improved performance. The AdaBoost-DT model is implemented in Python 3.7 using the AdaBoostRegressor class from the scikit-learn library.
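The following sketch shows this pairing in scikit-learn, with a shallow decision tree as the weak learner; the tree depth, number of estimators, and learning rate are illustrative assumptions, not the tuned values of Table 1.

```python
# AdaBoost with a shallow decision tree as the weak learner (AdaBoostRegressor);
# note that in scikit-learn < 1.2 the keyword is base_estimator instead of estimator.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=6),  # illustrative depth
    n_estimators=300,
    learning_rate=0.1,
    random_state=42,
)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```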
Gradient boosting (GBoost)
The Gradient Boosting Regressor (GBoost) is an ensemble learning technique that builds a series of decision trees in a sequential manner, where each successive tree is trained to minimize the errors made by the previous one. GBoost is an iterative learning algorithm designed to enhance predictive performance by combining multiple weak learners into a more robust model93. As the number of weak learners increases, the model’s error progressively decreases94. Furthermore, boosting addresses the bias-variance trade-off by initially constructing a weak learner and progressively enhancing its performance by sequentially adding new trees. Each newly added tree focuses on correcting the errors made by its predecessor by prioritizing the training instances with the highest prediction errors95. In essence, the new tree assigns greater importance to the rows that were predicted poorly in the previous iteration. A schematic representation of the GBoost concept is shown in Fig. S4.
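A minimal GradientBoostingRegressor sketch illustrating this sequential scheme is shown below; the hyperparameter values are illustrative placeholders.

```python
# Sequential gradient boosting: each tree is fitted to the residual errors of
# the current ensemble (hyperparameter values are illustrative).
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=4, random_state=42)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))
```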
Deep neural network (DNN)
A Deep Neural Network (DNN) is a type of neural network that consists of multiple hidden layers. In recent years, DNNs have gained widespread popularity, largely due to advancements in computational resources and increased accessibility to high-performance computing96,97. An appropriate network architecture is essential for ensuring the effective performance of a neural network. A standard DNN comprises an input layer, one or more hidden layers, and an output layer. The input and output layers define the model’s inputs and expected outputs, while the hidden layers play a key role in extracting meaningful features from the given dataset. Each layer consists of numerous neurons that apply mathematical operations to the input data. Throughout the training process, the model refines its performance by adjusting neuron-associated weights (w) and biases (b), a process guided by optimization techniques like gradient descent98.
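A minimal fully connected network of this kind is sketched below in Keras; the framework choice, layer widths, and optimizer settings are assumptions for illustration rather than the tuned configuration used in this study.

```python
# A minimal fully connected DNN in Keras; architecture and training settings are
# illustrative assumptions, not the optimized values of Table 1.
import tensorflow as tf

dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),            # T, P, Tc, Pc, Vc, omega, Tb, Mw
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),                     # predicted density
])
dnn.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
dnn.fit(X_train, y_train, validation_split=0.1, epochs=200, batch_size=32, verbose=0)
```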
Light gradient boosting machine (LightGBM)
In 2017, Guolin Ke et al.99 presented LightGBM, a new machine learning model based on gradient boosting theory. Unlike other machine learning approaches, LightGBM requires less memory. LightGBM and XGBoost support parallel computations, but LightGBM outperforms the earlier XGBoost model with faster training speed and lower memory usage. This reduction in memory occupation results in decreased communication costs during parallel learning. LightGBM stands out due to its decision tree-based architecture, which leverages gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), and a histogram-based learning strategy with a depth-constrained, leaf-wise growth mechanism100. GOSS can strike a desirable balance between the sample size and the accuracy of LightGBM’s decision tree. In LightGBM, an efficient algorithm called EFB is employed to bundle features that rarely take nonzero values simultaneously (see Fig. S5 in Supplementary Material for the leaf-wise tree growth strategy). As decision trees deepen, overfitting tendencies increase, leading to more undesirable leaf directions. LightGBM’s crucial parameters enable it to handle large volumes of data, perform at high speed, and achieve higher accuracy in predictions101. However, when LightGBM tends to overfit, setting a maximum depth limit on the leaf-wise growth can result in higher efficiency99,102. Concerning the construction of a LightGBM model, parameters and computations can be described as follows103,104:
After minimizing the loss function L, the value of f(x) was predicted:
In conclusion, the training process of each tree can be described as follows:
In the given equation, N represents the leaf count in a tree, q indicates the decision rules employed in a single tree, and W signifies the weight term of each leaf node. To minimize the objective function using Newton’s method, the outcome of each stage’s training is adjusted as follows:
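A standard formulation consistent with the description above (given here for completeness as an illustrative reconstruction; \(F_{t-1}\) denotes the ensemble of the first \(t-1\) trees, and \(g_i\) and \(h_i\) the first- and second-order gradients of the loss) is

$$\hat{f} = \underset{f}{\arg\min}\; \mathbb{E}_{y,X}\!\left[L\big(y, f(X)\big)\right],$$

$$\Gamma_t = \sum_{i=1}^{n} L\big(y_i,\, F_{t-1}(x_i) + f_t(x_i)\big), \qquad f_t(x) = w_{q(x)},\; w \in \mathbb{R}^{N},$$

$$\Gamma_t \simeq \sum_{i=1}^{n}\left[g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i)\right].$$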
TabNet
TabNet is a deep learning model specifically designed for tabular data105. Unlike traditional deep learning models, it directly processes raw data without requiring manual feature engineering. TabNet employs a sparse attention mechanism to dynamically select relevant features, enhancing both interpretability and efficiency. Its core components include106:
Feature Transformer: Processes input data and generates complex feature representations.
Attentive Transformer: Determines which features should be selected at each decision step using a sparse attention mechanism.
Masking Mechanism: Guides the feature selection process to improve model transparency and efficiency.
Aggregation: Combines the selected features from multiple steps to produce the final output.
TabNet is built on a multi-step decision-making process, refining its feature selection iteratively107. Its architecture integrates a feature transformer, an attentive transformer, and a masking mechanism, making it a powerful model for structured data tasks. Empirical studies demonstrate its high performance and strong generalization capabilities across various datasets108,109.
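A minimal regression sketch is shown below, assuming the pytorch-tabnet package (the specific TabNet implementation used in this work is not stated); n_d, n_a, and n_steps control the widths of the feature/attentive transformers and the number of sequential decision steps, and the values chosen here are illustrative.

```python
# TabNet regression sketch assuming the pytorch-tabnet implementation.
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

tabnet = TabNetRegressor(n_d=8, n_a=8, n_steps=3, seed=42)
tabnet.fit(
    X_train.astype(np.float32), y_train.reshape(-1, 1).astype(np.float32),
    eval_set=[(X_test.astype(np.float32), y_test.reshape(-1, 1).astype(np.float32))],
    max_epochs=200, patience=20,
)
rho_pred = tabnet.predict(X_test.astype(np.float32))
```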
Statistical error analysis
The following statistical parameters were used to compare the performance of the models used in this study. Here, \(\rho_{pred}\) represents the density predicted by the deep learning and machine learning models, and \(\rho_{exp}\) represents the experimental values of the density.
average percent relative error (APRE)
average absolute percent relative error (AAPRE)
root mean square error (RMSE)
standard deviation (SD)
coefficient of determination (R2)
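Written out in their standard forms (with \(\rho_{exp,i}\) and \(\rho_{pred,i}\) the experimental and predicted densities of the ith point, \(n\) the number of data points, and the commonly used sign convention of the relative error assumed here), these metrics read:

$$E_i = \frac{\rho_{exp,i}-\rho_{pred,i}}{\rho_{exp,i}}\times 100, \qquad \mathrm{APRE}=\frac{1}{n}\sum_{i=1}^{n} E_i, \qquad \mathrm{AAPRE}=\frac{1}{n}\sum_{i=1}^{n}\left|E_i\right|,$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\rho_{exp,i}-\rho_{pred,i}\right)^{2}}, \qquad \mathrm{SD}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{\rho_{exp,i}-\rho_{pred,i}}{\rho_{exp,i}}\right)^{2}},$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(\rho_{exp,i}-\rho_{pred,i}\right)^{2}}{\sum_{i=1}^{n}\left(\rho_{exp,i}-\overline{\rho_{exp}}\right)^{2}}.$$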
Results and discussion
Table 2 provides an overview of the statistical performance metrics for six models: LightGBM, AdaBoost-DT, GBoost, DT, TabNet, and DNN. The assessment was conducted on training (1068 data points), testing (268 data points), and the complete dataset (1336 data points). The performance metrics in the table clearly demonstrate that LightGBM outperforms both deep learning models (TabNet and DNN) as well as the other traditional machine learning models in predicting the density of seven thiophene-based compounds. LightGBM achieves the lowest errors across all key metrics, including AAPRE (0.03034), RMSE (0.44628), and SD (0.00043) on the test set, while maintaining an exceptionally high R2 of 0.99997 on the test set. Although the LightGBM model demonstrates extremely high accuracy and low error metrics, suggesting excellent predictive performance, we acknowledge the importance of assessing the risk of overfitting. To address this, we employed k-fold cross-validation during model training, which ensures the model’s performance is consistent across multiple data subsets and not just the training set. The close alignment between training and test performance, along with low standard deviation in error metrics, indicates that the model generalizes well and is not overfitting. Nonetheless, we remain cautious and have included model validation measures to confirm its robustness and reliability. In contrast, TabNet and DNN show significantly higher prediction errors, with TabNet yielding an RMSE of 2.64649 and DNN an RMSE of 1.46719 on the test set, indicating weaker generalization. This superior performance of LightGBM is primarily due to the inherent suitability of tree-based models for structured, tabular data such as the molecular descriptors used in this study. Tree-based models like LightGBM can naturally model non-linear feature interactions and manage small to medium datasets more efficiently, without requiring extensive tuning. Meanwhile, deep learning models like TabNet and DNN face architectural limitations in tabular contexts: they often struggle to generalize well without large datasets, are prone to overfitting, and require complex hyperparameter optimization. These findings highlight LightGBM’s superior accuracy, as further illustrated in Fig. S6 (Supplementary Material).
The Taylor diagram [78] provides a visual representation of key statistical metrics, namely R2, RMSE, and standard deviation (SD), to assess how well the predicted density aligns with experimental data. In this diagram, models with higher accuracy appear closer to the reference measurement point, while those with greater error deviate further. Among the evaluated models, LightGBM demonstrates the closest alignment with experimental data for both training and test sets, confirming its superior predictive accuracy (see Fig. S7 in Supplementary Material).
Graphical analysis
Graphical error analysis is a method for evaluating a model’s performance. This graphical tool is handy for comparing the performance of multiple models. Various schematic analyses were conducted in this study to demonstrate the effectiveness of the developed model. Graphical curves, including cross-plots, error distributions, group errors, and cumulative frequencies, were used to illustrate the reliability of the developed models.
Cross-plot
A cross plot is a type of scatter plot that visualizes the relationship between actual and predicted values by aligning them along a 45° line that passes through the origin. Fig. 2 plots the predicted values of the models against the experimental data. The greater the concentration of points on the Y = X line, the greater the accuracy of the model. As can be seen in Fig. 2, all models perform well, with the points closely aligned along the ideal line.
Error distribution plot
Fig. 3 shows the distribution of relative errors of the proposed models in the training and testing processes. The lower the data density near the line Y = 0, the greater the model error and the lower the accuracy of the density prediction. As a result, the GBoost and LightGBM models have lower relative errors than the other models for both the training and testing data, and thus higher accuracy for density prediction.
Cumulative frequency graph
The cumulative frequency plot of absolute relative error (%) for the models used in this study is shown in Fig. 4. This figure clearly shows the higher accuracy of the GBoost and LightGBM models compared with the other models for density prediction. In addition, the TabNet model has a higher error than the other models.
This study also explores error frequency by creating histograms of relative error. Fig. 5 displays histograms of relative error for six developed models. In the LightGBM model, most data points have errors between − 0.25 and 0.25, centering around zero relative error. Data with errors outside the range of − 0.25 to 0.25 for the DT, AdaBoost-DT, TabNet and DNN models indicates that these models have less coverage than LightGBM for both the training and testing data.
The obtained results for the error values (see Fig. S8 in Supplementary Material) show that LightGBM model produces the smallest error distribution range, from − 0.1337 to 0.1321. The GBoost model ranges from − 0.2130 to 0.2028, while other models show a higher error distribution range than these two models.
Fig. 6 compares the effect of the input parameters (critical temperature, critical pressure, critical volume, molecular weight, and boiling temperature, together with the operating temperature and pressure) on the absolute relative error (%) for all models. As can be observed, over the full ranges of molecular weight, boiling temperature, critical temperature, critical pressure, critical volume, temperature, and pressure, the LightGBM model has the smallest error compared with the other models, which confirms the high accuracy of this model.
A comparative analysis of the relative error among the proposed models offers valuable insights into identifying the most accurate predictive approach. This visual assessment demonstrates the strong alignment between experimental data and the predictions generated by the LightGBM model, as depicted in Fig. S9 of the Supplementary Materials.
Fig. 7 illustrates the percentage of relative error for the LightGBM model across the studied materials. The consistently low relative error across all materials underscores the model’s high precision in predicting density. This minimal deviation further confirms the reliability and effectiveness of the LightGBM model in accurately estimating density values.
Model trend analysis
To assess how well the developed models capture the expected density trends, Fig. 8 presents the LightGBM model’s predicted values as a function of temperature and pressure. The plots illustrate that at fixed pressures of 7.0 and 65 MPa, density decreases with rising temperature. Conversely, at constant temperatures of 303.15 K and 338.15 K, increasing pressure results in higher density.
Sensitivity analysis
The relevancy factor (r) and the output of the LightGBM model are employed to assess the relative significance of input variables in predicting density. The correlation coefficient for each input parameter is determined using the following formula [63, 64]:
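In its commonly used form, this relevancy factor is

$$r = \frac{\sum_{i=1}^{n}\left(I_{i,k}-\overline{I_{k}}\right)\left(y_{i}-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(I_{i,k}-\overline{I_{k}}\right)^{2}\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}}$$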
\({I}_{i,k}\) and \(\overline{{I}_{k}}\) represent the ith value and the average value of the kth input, respectively, where k denotes pressure, temperature, or another input parameter. \({y}_{i}\) and \(\overline{y}\) represent the ith predicted value and the average of the predictions. The parameter \(r\) varies between − 1 and 1, reflecting the correlation between independent and dependent variables. A positive \(r\) suggests that as the input variable increases, the output also rises, whereas a negative \(r\) implies an inverse correlation. The closer \(|r|\) is to 1, the stronger the association between the model’s input and output values. The findings of the sensitivity analysis on the results of the LightGBM model, as the best-obtained model, are presented visually in Fig. 9. The relevancy factor plot clearly shows how each input parameter influences the model’s prediction of density, with boiling point (Tb), critical volume (Vc), and critical temperature (Tc) having the highest positive relevancy, indicating they are the most influential features. This aligns well with established physicochemical principles of thiophenes, where thermophysical properties such as density are strongly governed by phase behavior and intermolecular interactions, both of which are reflected in critical and boiling point properties. For example, the strong correlation of Tb (0.7302) suggests that vaporization characteristics significantly impact the density profile. Similarly, the contributions of Tc (0.5683), Vc (0.5857), and ω (0.5131) highlight the importance of molecular structure and dispersion forces, which are central to understanding thiophene derivatives due to their aromatic and heterocyclic nature. The negative relevancy of Pc (− 0.4949) and T (− 0.1675) reflects the inverse relationship of these variables with density, consistent with the correlations observed in Fig. 1. Overall, this plot confirms that the selected features not only enhance model performance but also reflect fundamental chemical behavior.
Implementation of the Leverage method
Following the statistical and graphical analyses that confirmed the superiority of the LightGBM model over other approaches, an additional outlier detection method was applied to identify data points that could adversely affect model predictions and to validate the reliability domain of the proposed model. The Williams plot visualizes standardized residuals (R) against hat values (H), providing insights into potential outliers. The key parameters for constructing this plot are determined using the following calculations77,110,111:
Hat matrix (H):
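In its standard form, the hat matrix is

$$H = X\left(X^{T}X\right)^{-1}X^{T}$$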
Here, \(X^{T}\) represents the transpose of the matrix \(X\), which is a (y × z) matrix. In this case, y refers to the number of data points, and z refers to the number of input variables used by the model.
• Leverage limit (H*):
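The warning leverage is commonly taken as

$$H^{*} = \frac{3\,(z+1)}{y}$$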
• Standardized residuals (SR):
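The standardized residuals are commonly computed as

$$SR_{j} = \frac{e_{j}}{\sqrt{\mathrm{MSE}\left(1-H_{j}\right)}}$$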
where \(e_{j}\) is the ordinary residual of the jth data point, MSE is the mean square error, and \(H_{j}\) is the jth leverage value. Data points with H values greater than H* lie outside the applicability domain of the model. In addition, data points with H values less than H* and R values greater than 3 or less than − 3 are considered suspected data. Data points with H values less than H* and R values between − 3 and 3 are considered valid data112. As illustrated in Fig. 10, over 99% of the dataset is deemed valid, with only 12 out of 1336 data points identified as potential anomalies. The Williams plot analysis shows that 99.10% of the data falls within the acceptable range.
Conclusion
In this study, the critical properties, including critical temperature (Tc), critical pressure (Pc), critical volume (Vc), and acentric factor (ω), together with boiling point (Tb) and molecular weight (Mw), were used as input parameters for machine/deep learning models to predict the density of thiophenes. Accurate density prediction is vital for understanding and mitigating the environmental and industrial impact of sulfur compounds in fuels. In this work, in addition to four machine learning models (DT, AdaBoost-DT, LightGBM, and GBoost), we also used two deep learning models (TabNet and DNN) for density prediction. Results revealed that the LightGBM model outperformed the others, with the lowest errors in statistical evaluations (AAPRE = 0.02308, APRE = − 0.00014, RMSE = 0.34998, and R2 = 0.99998). Graphical evaluations further confirmed the LightGBM model’s high accuracy in predicting thiophene density across training and test datasets. In addition, the comparison of the experimental data and predicted values by the LightGBM model at constant temperatures of 303.15 and 338.15 K and constant pressures of 7 and 65 MPa proved the accuracy of the prediction. Using the relevancy factor, the impact of input characteristics on the model’s target parameter was also investigated. The Leverage technique revealed that all data points appeared trustworthy and valid, except for a few that fell into the suspected data region. In summary, applying the Leverage method confirmed the data integrity and effectiveness of the proposed LightGBM model. This study distinguishes itself through its comprehensive dataset, a broader range of thiophene derivatives, and the incorporation of advanced machine/deep learning models. The findings provide a robust foundation for optimizing the properties of thiophene derivatives, supporting innovations in fuel refinement, environmental sustainability, and advanced material applications.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Razavi, E., Khoshsima, A. & Shahriari, R. Phase behavior modeling of mixtures containing N-, S-, and O-heterocyclic compounds using PC-SAFT equation of state. Ind. Eng. Chem. Res. 58, 11038–11059. https://doi.org/10.1021/acs.iecr.9b01429 (2019).
Bai, Y. et al. Research on ultrafast catalytic degradation of heterocyclic compounds by the novel high entropy catalyst. J. Environ. Chem. Eng. 11, 110703. https://doi.org/10.1016/j.jece.2023.110703 (2023).
Abro, R. et al. Extractive desulfurization of fuel oils using deep eutectic solvents—A comprehensive review. J. Environ. Chem. Eng. 10, 107369. https://doi.org/10.1016/j.jece.2022.107369 (2022).
Granger, P. & Parvulescu, V. I. Catalytic NOx abatement systems for mobile sources: From three-way to lean burn after-treatment technologies. Chem. Rev. 111, 3155–3207 (2011).
Zipper, C. E. & Gilroy, L. Sulfur dioxide emissions and market effects under the clean air act acid rain program. J. Air Waste Manag. Assoc. 48, 829–837 (1998).
Peng, Y., Liu, C., Zhang, X. & Li, J. The effect of SiO2 on a novel CeO2–WO3/TiO2 catalyst for the selective catalytic reduction of NO with NH3. Appl. Catal. B 140, 276–282 (2013).
Singh, A. & Agrawal, M. Acid rain and its ecological consequences. J. Environ. Biol. 29, 15 (2007).
Chen, R. et al. Short-term exposure to sulfur dioxide and daily mortality in 17 Chinese cities: The China air pollution and health effects study (CAPES). Environ. Res. 118, 101–106 (2012).
Chiang, T.-Y., Yuan, T.-H., Shie, R.-H., Chen, C.-F. & Chan, C.-C. Increased incidence of allergic rhinitis, bronchitis and asthma, in children living near a petrochemical complex with SO2 pollution. Environ. Int. 96, 1–7 (2016).
Alshareef, F. M. et al. Recent advancement in organic fluorescent and colorimetric chemosensors for the detection of Al3+ ions: A review (2019–2024). J. Environ. Chem. Eng. 12, 114110. https://doi.org/10.1016/j.jece.2024.114110 (2024).
Chen, H.-J. et al. Rational design and synthesis of 2, 2-bisheterocycle tandem derivatives as non-nucleoside hepatitis B virus inhibitors. ChemMedChem 3, 1316–1321 (2008).
Monge Vega, A., Aldana, I., Rabbani, M. & Fernandez-Alvarez, E. The synthesis of 11H–1, 2, 4-triazolo [4, 3-b] pyridazino [4, 5-b] indoles, 11H-tetrazolo [4, 5-b] pyridazino [4, 5-b] indoles and 1, 2, 4-triazolo [3, 4-f]-1, 2, 4-triazino [4, 5-a] indoles. J. Heterocycl. Chem. 17, 77–80 (1980).
Burnett, D. A., Caplen, M. A., Davis, H. R. Jr., Burrier, R. E. & Clader, J. W. 2-Azetidinones as inhibitors of cholesterol absorption. J. Med. Chem. 37, 1733–1736 (1994).
Russell, R. K. et al. Thiophene system. 9. Thienopyrimidinedione derivatives as potential antihypertensive agents. J. Med. Chem. 31, 1786–1793 (1988).
Mohareb, R. M., Abdallah, A. E. & Abdelaziz, M. A. New approaches for the synthesis of pyrazole, thiophene, thieno [2, 3-b] pyridine, and thiazole derivatives together with their anti-tumor evaluations. Med. Chem. Res. 23, 564–579 (2014).
Chen, Z., Ku, T. C. & Seley-Radtke, K. L. Thiophene-expanded guanosine analogues of gemcitabine. Bioorg. Med. Chem. Lett. 25, 4274–4276 (2015).
Chan, H. S. O. & Ng, S. C. Synthesis, characterization and applications of thiophene-based functional polymers. Prog. Polym. Sci. 23, 1167–1231 (1998).
Song, X. & Parish, C. A. Pyrolysis mechanisms of thiophene and methylthiophene in asphaltenes. J. Phys. Chem. A 115, 2882–2891 (2011).
Abdou, M. M. Thiophene-based azo dyes and their applications in dyes chemistry. Am. J. Chem 3, 126–135 (2013).
Perepichka, I. F., Perepichka, D. F., Meng, H. & Wudl, F. Light-emitting polythiophenes. Adv. Mater. 17, 2281–2305 (2005).
Mi, S., Wu, J., Liu, J., Zheng, J. & Xu, C. Donor–π-bridge–acceptor fluorescent polymers based on thiophene and triphenylamine derivatives as solution processable electrochromic materials. Org. Electron. 23, 116–123 (2015).
Kim, C. et al. The effects of octylthiophene ratio on the performance of thiophene based polymer light-emitting diodes. Sci. Adv. Mater. 7, 2401–2409 (2015).
Jeong, H. Y., Lee, S. Y., Han, J., Lim, M. H. & Kim, C. Thiophene and diethylaminophenol-based “turn-on” fluorescence chemosensor for detection of Al3+ and F− in a near-perfect aqueous solution. Tetrahedron 73, 2690–2697 (2017).
Fernandes, R. S., Shetty, N. S., Mahesha, P. & Gaonkar, S. L. A Comprehensive review on thiophene based chemosensors. J. Fluoresc. 32, 19–56. https://doi.org/10.1007/s10895-021-02833-x (2022).
Mishra, A., Ma, C.-Q. & Bauerle, P. Functional oligothiophenes: Molecular design for multidimensional nanoarchitectures and their applications. Chem. Rev. 109, 1141–1276 (2009).
Thompson, B. C. & Fréchet, J. M. Polymer–fullerene composite solar cells. Angew. Chem. Int. Ed. 47, 58–77 (2008).
Allard, S., Forster, M., Souharce, B., Thiem, H. & Scherf, U. Organic semiconductors for solution-processable field-effect transistors (OFETs). Angew. Chem. Int. Ed. 47, 4070–4098 (2008).
Gather, M. C., Köhnen, A. & Meerholz, K. White organic light-emitting diodes. Adv. Mater. 23, 233–248 (2011).
Raposo, M. M. M. et al. Synthesis and characterization of dicyanovinyl-substituted thienylpyrroles as new nonlinear optical chromophores. Org. Lett. 8, 3681–3684 (2006).
Tang, C. W. & VanSlyke, S. A. Organic electroluminescent diodes. Appl. Phys. Lett. 51, 913–915 (1987).
Xu, L., Tang, C. W. & Rothberg, L. J. High efficiency phosphorescent white organic light-emitting diodes with an ultra-thin red and green co-doped layer and dual blue emitting layers. Org. Electron. 32, 54–58 (2016).
Piyakulawat, P. et al. Effect of thiophene donor units on the optical and photovoltaic behavior of fluorene-based copolymers. Sol. Energy Mater. Sol. Cells 95, 2167–2172 (2011).
Klauk, H. Organic thin-film transistors. Chem. Soc. Rev. 39, 2643–2666 (2010).
Lin, Y., Li, Y. & Zhan, X. Small molecule semiconductors for high-efficiency organic photovoltaics. Chem. Soc. Rev. 41, 4245–4272 (2012).
O’Neill, M. & Kelly, S. M. Ordered materials for organic electronics and photonics. Adv. Mater. 23, 566–584 (2011).
Usta, H., Facchetti, A. & Marks, T. J. n-Channel semiconductor materials design for organic complementary circuits. Acc. Chem. Res. 44, 501–510 (2011).
Bae, S.-E., Kim, K.-J., Hwang, Y.-K. & Huh, S. Simple preparation of Pd-NP/polythiophene nanospheres for heterogeneous catalysis. J. Colloid Interface Sci. 456, 93–99 (2015).
Lee, H., Shin, M., Lee, M. & Hwang, Y. J. Photo-oxidation activities on Pd-doped TiO2 nanoparticles: critical PdO formation effect. Appl. Catal. B 165, 20–26 (2015).
Megahed, A. S., Al-Amoudi, M. & Refat, M. A modern technique for preparation of zinc (II) and nickel (II) nanometric oxides using Schiff base compounds: Synthesis, characterization, and antibacterial properties. Res. Chem. Intermed. 40, 1425–1439 (2014).
Xie, J., Dong, H., Yu, Y. & Cao, S. Inhibitory effect of synthetic aromatic heterocycle thiosemicarbazone derivatives on mushroom tyrosinase: Insights from fluorescence, 1H NMR titration and molecular docking studies. Food Chem. 190, 709–716 (2016).
Roy, S. et al. Structural basis for molecular recognition, theoretical studies and anti-bacterial properties of three bis-uracil derivatives. Tetrahedron 70, 6931–6937 (2014).
Caboni, P. et al. Nematicidal activity of 2-thiophenecarboxaldehyde and methylisothiocyanate from caper (Capparis spinosa) against Meloidogyne incognita. J. Agric. Food Chem. 60, 7345–7351 (2012).
Pereira, G. A. et al. A broad study of two new promising antimycobacterial drugs: Ag (I) and Au (I) complexes with 2-(2-thienyl) benzothiazole. Polyhedron 38, 291–296 (2012).
Zhou, Y. et al. Sulfur-rich carbon cryogels for supercapacitors with improved conductivity and wettability. J. Mater. Chem. A 2, 8472–8482 (2014).
Kumar, S., Dutta, P. & Sen, P. Preparation and characterization of optical property of crosslinkable film of chitosan with 2-thiophenecarboxaldehyde. Carbohyd. Polym. 80, 563–569 (2010).
Wang, H., Wang, J.-L., Yuan, S.-C., Pei, J. & Pei, W.-W. Novel highly fluorescent dendritic chiral amines: Synthesis and optical properties. Tetrahedron 61, 8465–8474 (2005).
Yannai, S. Dictionary of Food Compounds with CD-ROM (CRC Press, 2012).
Bredie, W. L., Mottram, D. S. & Guy, R. C. Effect of temperature and pH on the generation of flavor volatiles in extrusion cooking of wheat flour. J. Agric. Food Chem. 50, 1118–1125 (2002).
Mahadevan, K. & Farmer, L. Key odor impact compounds in three yeast extract pastes. J. Agric. Food Chem. 54, 7242–7250 (2006).
Fujima, Y., Ikunaka, M., Inoue, T. & Matsumoto, J. Synthesis of (S)-3-(N-Methylamino)-1-(2-thienyl) propan-1-ol: Revisiting Eli Lilly’s resolution—racemization—recycle synthesis of duloxetine for its robust processes. Org. Process Res. Dev. 10, 905–913 (2006).
White, J. D., Juniku, R., Huang, K., Yang, J. & Wong, D. T. Synthesis of 1, 1-[1-Naphthyloxy-2-thiophenyl]-2-methylaminomethylcyclopropanes and their evaluation as inhibitors of serotonin, norepinephrine, and dopamine transporters. J. Med. Chem. 52, 5872–5879 (2009).
Lazer, E. S., Wong, H. C., Possanza, G. J., Graham, A. G. & Farina, P. R. Antiinflammatory 2, 6-di-tert-butyl-4-(2-arylethenyl) phenols. J. Med. Chem. 32, 100–104 (1989).
Hussain, H., Al-Harrasi, A., Al-Rawahi, A., Green, I. R. & Gibbons, S. Fruitful decade for antileishmanial compounds from 2002 to late 2011. Chem. Rev. 114, 10369–10428 (2014).
Esteruelas, M. A., Lopez, A. M. & Olivan, M. Polyhydrides of platinum group metals: Nonclassical interactions and σ-bond activation reactions. Chem. Rev. 116, 8770–8847 (2016).
Allais, C., Grassot, J.-M., Rodriguez, J. & Constantieux, T. Metal-free multicomponent syntheses of pyridines. Chem. Rev. 114, 10829–10868 (2014).
Ahmadi, M., Chen, Z., Clarke, M. & Fedutenko, E. Comparison of kriging, machine learning algorithms and classical thermodynamics for correlating the formation conditions for CO2 gas hydrates and semi-clathrates. J. Nat. Gas Sci. Eng. 84, 103659. https://doi.org/10.1016/j.jngse.2020.103659 (2020).
von Solms, N., Kouskoumvekaki, I. A., Michelsen, M. L. & Kontogeorgis, G. M. Capabilities, limitations and challenges of a simplified PC-SAFT equation of state. Fluid Phase Equilib. 241, 344–353. https://doi.org/10.1016/j.fluid.2006.01.001 (2006).
Sun, Y., Zuo, Z., Laaksonen, A., Lu, X. & Ji, X. How to detect possible pitfalls in ePC-SAFT modelling: Extension to ionic liquids. Fluid Phase Equilib. 519, 112641. https://doi.org/10.1016/j.fluid.2020.112641 (2020).
Sun, Y., Laaksonen, A., Lu, X. & Ji, X. How to detect possible pitfalls in ePC-SAFT modelling. 2. Extension to binary mixtures of 96 ionic liquids with CO2, H2S, CO, O2, CH4, N2, and H2. Ind. Eng. Chem. Res. 59, 21579–21591. https://doi.org/10.1021/acs.iecr.0c04485 (2020).
Chaparro, G. & Müller, E. A. Development of thermodynamically consistent machine-learning equations of state: Application to the Mie fluid. J. Chem. Phys. https://doi.org/10.1063/5.0146634 (2023).
Ding, J. et al. Machine learning for molecular thermodynamics. Chin. J. Chem. Eng. 31, 227–239. https://doi.org/10.1016/j.cjche.2020.10.044 (2021).
Luo, W. et al. Bridging machine learning and thermodynamics for accurate pKa prediction. JACS Au 4, 3451–3465. https://doi.org/10.1021/jacsau.4c00271 (2024).
Cersonsky, R. K., Cheng, B., Kofke, D. & Müller, E. A. Machine learning for generating and analyzing thermophysical data: Where we are and where we’re going. J. Chem. Eng. Data 69, 2041–2043. https://doi.org/10.1021/acs.jced.4c00207 (2024).
Kevrekidis, G. A. et al. Neural network representations of multiphase equations of State. Sci. Rep. 14, 30288. https://doi.org/10.1038/s41598-024-81445-4 (2024).
Jirasek, F. & Hasse, H. Perspective: Machine learning of thermophysical properties. Fluid Phase Equilib. 549, 113206. https://doi.org/10.1016/j.fluid.2021.113206 (2021).
Saha, A., Basu, A. & Banerjee, S. Enhancing thermal management systems: A machine learning and metaheuristic approach for predicting thermophysical properties of nanofluids. Eng. Res. Express 6, 045537. https://doi.org/10.1088/2631-8695/ad8536 (2024).
Shateri, A., Yang, Z. & Xie, J. Machine learning-based molecular dynamics studies on predicting thermophysical properties of ethanol–octane blends. Energy Fuels 39, 1070–1090. https://doi.org/10.1021/acs.energyfuels.4c05653 (2025).
Liang, W., Lu, G. & Yu, J. Machine-learning-driven simulations on microstructure and thermophysical properties of MgCl2–KCl eutectic. ACS Appl. Mater. Interfaces. 13, 4034–4042. https://doi.org/10.1021/acsami.0c20665 (2021).
Wang, H. & Chen, X. A comprehensive review of predicting the thermophysical properties of nanofluids using machine learning methods. Ind. Eng. Chem. Res. 61, 14711–14730. https://doi.org/10.1021/acs.iecr.2c02059 (2022).
Akilu, S., Sharma, K. V., Baheta, A. T., Kanti, P. K. & Paramasivam, P. Machine learning analysis of thermophysical and thermohydraulic properties in ethylene glycol- and glycerol-based SiO2 nanofluids. Sci. Rep. 14, 14829. https://doi.org/10.1038/s41598-024-65411-8 (2024).
Kanti, P. K., Paramasivam, P., Wanatasanappan, V. V., Dhanasekaran, S. & Sharma, P. Experimental and explainable machine learning approach on thermal conductivity and viscosity of water based graphene oxide based mono and hybrid nanofluids. Sci. Rep. 14, 30967. https://doi.org/10.1038/s41598-024-81955-1 (2024).
Ahmad, H. et al. Numerical and machine learning based evaluation of ethylene glycol based hybrid nano-structured (TiO2-SWCNTs) fluid flow. Sci. Rep. 15, 6084. https://doi.org/10.1038/s41598-025-88789-5 (2025).
Gomaa, S. et al. Machine learning prediction of methane, nitrogen, and natural gas mixture viscosities under normal and harsh conditions. Sci. Rep. 14, 15155. https://doi.org/10.1038/s41598-024-64752-8 (2024).
Abdollahzadeh, M. et al. Estimating the density of deep eutectic solvents applying supervised machine learning techniques. Sci. Rep. 12, 4954. https://doi.org/10.1038/s41598-022-08842-5 (2022).
Yarahmadi, A., Rashedi, A. & Bemani, A. Machine learning based estimation of density of binary blends of cyclohexanes in normal alkanes. Sci. Rep. 15, 8469. https://doi.org/10.1038/s41598-025-92608-2 (2025).
Huwaimel, B., Alanazi, J., Alanazi, M., Alharby, T. N. & Alshammari, F. Computational models based on machine learning and validation for predicting ionic liquids viscosity in mixtures. Sci. Rep. 14, 31857. https://doi.org/10.1038/s41598-024-82989-1 (2024).
Kiani, S. et al. Modeling of ionic liquids viscosity via advanced white-box machine learning. Sci. Rep. 14, 8666. https://doi.org/10.1038/s41598-024-55147-w (2024).
Antón, V., Muñoz-Embid, J., Artigas, H., Artal, M. & Lafuente, C. Thermophysical properties of oxygenated thiophene derivatives: Experimental data and modelling. J. Chem. Thermodyn. 113, 330–339. https://doi.org/10.1016/j.jct.2017.07.008 (2017).
Antón, V., Lomba, L., Cea, P., Giner, B. & Lafuente, C. Densities at high pressures and derived properties of thiophenes. J. Chem. Thermodyn. 109, 16–22. https://doi.org/10.1016/j.jct.2016.08.005 (2017).
Antón, V., Artigas, H., Muñoz-Embid, J., Artal, M. & Lafuente, C. Thermophysical study of 2-acetylthiophene: Experimental and modelled results. Fluid Phase Equilib. 433, 126–134. https://doi.org/10.1016/j.fluid.2016.10.026 (2017).
Piña-Martinez, A., Privat, R. & Jaubert, J.-N. Use of 300,000 pseudo-experimental data over 1800 pure fluids to assess the performance of four cubic equations of state: SRK, PR, tc-RK, and tc-PR. AIChE J. 68, e17518. https://doi.org/10.1002/aic.17518 (2022).
Yan, T., Shen, S.-L., Zhou, A. & Chen, X. Prediction of geological characteristics from shield operational parameters by integrating grid search and K-fold cross validation into stacking classification algorithm. J. Rock Mech. Geotech. Eng. 14, 1292–1303 (2022).
Ranjan, G., Verma, A. K. & Radhika, S. In 2019 IEEE 5th International Conference for Convergence in Technology (I2CT) 1–5 (IEEE).
Mangkunegara, I. S. & Purwono, P. In 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom) 427–432.
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Belete, D. M. & Huchaiah, M. D. Grid search in hyperparameter optimization of machine learning models for prediction of HIV/AIDS test results. Int. J. Comput. Appl. 44, 875–886 (2022).
Loh, W. Y. Classification and regression trees. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1, 14–23 (2011).
Rokach, L. & Maimon, O. Decision trees. Data Min. Knowl. Discov. Handbook 165–192 (2005).
Ding, Y. & Jin, Y. Development of advanced hybrid computational model for description of molecular separation in liquid phase via polymeric membranes. J. Mol. Liq. 396, 123999. https://doi.org/10.1016/j.molliq.2024.123999 (2024).
Song, H., Shao, H., Zhang, Y. & Wang, X. Advancing nanomedicine production via green method: Modeling and simulation of pharmaceutical solubility at different temperatures and pressures. J. Mol. Liq. 411, 125806. https://doi.org/10.1016/j.molliq.2024.125806 (2024).
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139. https://doi.org/10.1006/jcss.1997.1504 (1997).
Hong, H. et al. Landslide susceptibility mapping using J48 decision tree with AdaBoost, bagging and rotation forest ensembles in the guangchang area (China). CATENA 163, 399–413 (2018).
Kumar, R., Rai, B. & Samui, P. Machine learning techniques for prediction of failure loads and fracture characteristics of high and ultra-high strength concrete beams. Innov. Infrastruct. Solut. 8, 219 (2023).
Elith, J., Leathwick, J. R. & Hastie, T. A working guide to boosted regression trees. J. Anim. Ecol. 77, 802–813 (2008).
James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning Vol. 112 (Springer, 2013).
Chong, E., Han, C. & Park, F. C. Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Expert Syst. Appl. 83, 187–205 (2017).
Asim, M., Rashid, A. & Ahmad, T. Scour modeling using deep neural networks based on hyperparameter optimization. ICT Express 8, 357–362 (2022).
Krauss, C., Do, X. A. & Huck, N. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. Eur. J. Oper. Res. 259, 689–702 (2017).
Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017).
Ma, X. et al. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. 31, 24–39 (2018).
Fan, J. et al. Light gradient boosting machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag. 225, 105758 (2019).
Yang, X., Dindoruk, B. & Lu, L. A comparative analysis of bubble point pressure prediction using advanced machine learning algorithms and classical correlations. J. Petrol. Sci. Eng. 185, 106598 (2020).
Sun, X., Liu, M. & Sima, Z. A novel cryptocurrency price trend forecasting model based on LightGBM. Financ. Res. Lett. 32, 101084 (2020).
Zheng, H., Mahmoudzadeh, A., Amiri-Ramsheh, B. & Hemmati-Sarapardeh, A. Modeling viscosity of CO2–N2 gaseous mixtures using robust tree-based techniques: Extra tree, random forest, GBoost, and LightGBM. ACS Omega 8, 13863–13875 (2023).
Arik, S. Ö. & Pfister, T. TabNet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence 35, 6679–6687 (2021).
Mao, S., Liang, Y., Sun, W. & Li, Q. Interpretable transfer learning for small sample coal and gas outburst risk identification using TabNet. Energy Sci. Eng. 13, 909–925. https://doi.org/10.1002/ese3.2049 (2025).
McDonnell, K., Murphy, F., Sheehan, B., Masello, L. & Castignani, G. Deep learning in insurance: Accuracy and model interpretability using TabNet. Expert Syst. Appl. 217, 119543 (2023).
Razavi-Termeh, S. V., Sadeghi-Niaraki, A., Sorooshian, A., Abuhmed, T. & Choi, S.-M. Spatial mapping of land susceptibility to dust emissions using optimization of attentive Interpretable Tabular Learning (TabNet) model. J. Environ. Manag. 358, 120682. https://doi.org/10.1016/j.jenvman.2024.120682 (2024).
Wang, H. et al. Enhancing predictive accuracy for urinary tract infections post-pediatric pyeloplasty with explainable AI: an ensemble TabNet approach. Sci. Rep. 15, 2455. https://doi.org/10.1038/s41598-024-82282-1 (2025).
Rousseeuw, P. J. & Leroy, A. M. Robust Regression and Outlier Detection (Wiley, 2005).
Goodall, C. R. Computation using the QR decomposition. In Handbook of Statistics Vol. 9 (1993).
Sheikhshoaei, A. H., Khoshsima, A. & Zabihzadeh, D. Predicting the heat capacity of strontium-praseodymium oxysilicate SrPr4(SiO4)3O using machine learning, deep learning, and hybrid models. Chem. Thermodyn. Thermal Anal. 17, 100154. https://doi.org/10.1016/j.ctta.2024.100154 (2025).
Author information
Contributions
Amirhossein Sheikhshoaei: Investigation, Writing—original draft, Software, Validation, Data curation, Formal analysis. Ali Khoshsima: Supervision, Project administration, Conceptualization, Methodology, Visualization, Writing—review & editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.