Introduction

Heterocyclic thiophenic compounds1, which contain sulfur in a five-membered ring, are increasingly significant in environmental studies due to their widespread use in industrial applications, including pharmaceuticals, agrochemicals, and organic electronics. Their environmental impact is multifaceted: they can persist in ecosystems and pose potential toxicity to aquatic life and soil microorganisms. In aquatic ecosystems, thiophenic compounds, which enter water bodies through industrial discharge, runoff, and wastewater, often have detrimental effects on aquatic life2. In soil, thiophenic compounds can inhibit microbial activity and alter soil chemistry, and their stability makes them resistant to typical biodegradation processes, so they accumulate in sediments and in the organic tissues of wildlife. In air, thiophenic compounds lead to both direct and indirect environmental consequences, entering the atmosphere primarily through industrial emissions, combustion processes, and the degradation of fossil fuels. Petroleum-derived transportation fuels contain considerable amounts of organic sulfur compounds, including benzothiophene, dibenzothiophene, and thiophene. Combustion of sulfur-rich fuels releases sulfur oxides (SOx), primarily sulfur dioxide (SO2)3. This colorless, pungent, and corrosive gas poses serious environmental concerns, playing a key role in acid rain formation, the greenhouse effect, photochemical pollution, and eutrophication4,5,6,7,8,9. Consequently, understanding the properties of sulfur compounds and mitigating their release is crucial for reducing their environmental and industrial impact.

Heterocyclic compounds represent a fascinating category of aromatic substances and one of the most extensive and structurally diverse families in organic chemistry, characterized by a broad spectrum of intermolecular interactions. This remarkable diversity makes them a crucial and complex area of study. Alongside carbon and hydrogen, heterocyclic compounds commonly contain heteroatoms such as sulfur, oxygen, and nitrogen1,10. The most prevalent types are five-membered rings (such as furan, thiophene, dioxolane, imidazole, and pyrrole) and six-membered rings (such as morpholine and pyridine), which are commonly found in a variety of sources including plants, herbs, animals, coal, and petroleum. Owing to their diverse properties and applications across various fields, thiophene-based materials appear in medicine11,12,13,14,15,16, materials science17,18,19, and organic electronic devices and molecular electronics20,21,22. Thiophene-based compounds, particularly their derivatives, are widely utilized as chemo-sensors, serving as effective fluorescence signaling promoters for detecting organic acids, metal ions, and cations23,24. Their unique electronic properties and structural diversity render thiophene compounds vital in advanced technologies, such as optoelectronic devices (OLEDs, OFETs, OTFTs, OSCs, OLFETs) and various sensors25,26,27,28,29,30,31,32,33,34,35,36. 2-Thiophenecarboxaldehyde and 2-thiophenemethanol find applications in materials science, nanoparticles, and biotechnology37,38,39,40,41,42,43,44,45,46,47. 2-Acetylthiophene finds applications in food flavoring and the synthesis of drugs for anxiety, inflammation, and parasitic infections, as well as in the production of metal complexes48,49,50,51,52,53,54,55.

To maximize the performance of these chemicals in industrial processes, it is crucial to enhance our understanding of their various physicochemical properties. It is essential to emphasize volumetric properties like density and its related characteristics, including isobaric expansibility and isothermal compressibility, especially under high-temperature and high-pressure conditions. Density serves as a fundamental material property with significant implications for process mechanics and engineering in chemical plants. Furthermore, understanding a compound’s density offers valuable insights into its molecular arrangement and packing behavior. Analyzing density variations with temperature or pressure allows for the determination of key parameters such as isothermal compressibility, isobaric expansibility, and internal pressure. However, traditional methods for determining density are often labor-intensive and susceptible to experimental errors.

Equations of state (EoS) and empirical relationships, while widely used for predicting thermophysical properties, often suffer from limitations such as the need for simplifying assumptions, poor accuracy under extreme conditions, and reliance on substance-specific constants that may not be available or accurate for all compounds. These models may also lack the flexibility to capture the complex, nonlinear relationships inherent in experimental data, especially for structurally diverse compounds like thiophenes56,57,58,59,60. In contrast, machine learning (ML) and deep learning (DL) models offer several advantages, including the ability to learn directly from data without predefined functional forms, adapt to nonlinear patterns, and generalize across a wide range of conditions and molecular structures. They also enable the integration of diverse input features and provide high predictive accuracy, making them powerful tools for property estimation tasks where traditional models fall short61,62,63,64. Recently, machine learning has been widely used for modelling thermophysical properties, enabling faster predictions, enhanced accuracy, and efficient exploration of complex systems63,65,66,67,68,69,70,71,72,73,74,75,76,77.

To address these challenges, this study leverages ML and DL methods to predict the density of seven thiophene-based heterocyclic compounds. A key innovation of this work is the use of critical properties, including critical temperature (Tc), critical pressure (Pc), critical volume (Vc), and acentric factor (ω), together with boiling point (Tb) and molecular weight (Mw), as input parameters to predict the density of thiophene derivatives. These parameters inherently reflect the molecular structure, intermolecular interactions, and phase transition characteristics that are essential for accurately predicting density under various conditions. Choosing the right input parameters is crucial for developing accurate and reliable predictive models, as they directly influence the model’s ability to capture the underlying physical and chemical relationships. Using such physically meaningful and experimentally accessible inputs not only improves model performance and generalizability, but also ensures that predictions remain grounded in real-world chemical behaviour. In this work, in addition to machine learning models (DT, AdaBoost-DT, LightGBM, and GBoost), we also used two deep learning models (TabNet and DNN) for high-pressure density prediction. By modelling the complex relationships between molecular structure and density, these computational approaches provide an efficient and scalable alternative to experimental methods. The findings not only enhance our understanding of thiophene derivatives but also demonstrate the potential of ML and DL in advancing predictive materials science, particularly for applications in pharmaceuticals, organic electronics, and sustainable energy solutions.

Theory and methodology

Dataset construction and description

This study deals with density predictions for seven compounds from the thiophene family containing different functional groups (see Fig. S1 in Supplementary Material). Density prediction was studied over a wide temperature range (283.15–338.15 K) and pressure range (0.1–65 MPa), covering 1336 data points1. The experimental data for these compounds (thiophene, 2-methylthiophene, 3-methylthiophene, 2,5-dimethylthiophene, 2-thiophenemethanol, 2-thiophenecarboxaldehyde, and 2-acetylthiophene) were obtained from the literature78,79,80, and the critical properties of the compounds were taken directly from experimental data reported in the literature81. Because the ML/DL models were trained on thiophenic materials within these temperature and pressure ranges (283.15–338.15 K and 0.1–65 MPa), their applicability may be reduced for molecules with different functional groups or for conditions outside the ranges represented in the current dataset, and this limitation should be considered.

The heat map presented in Fig. 1 shows a clear relationship between the density of thiophenes and temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), critical volume (Vc), acentric factor (ω), boiling point (Tb), and molecular weight (Mw). It shows that P, Mw, Tc, Tb, Vc, and ω have a direct relationship with the density of the thiophenes, whereas T and Pc have an inverse relationship with density. Fig. 1 also shows that strong correlations between some descriptors may lead to overfitting by introducing redundancy and multicollinearity into the model. To address this issue, we applied data normalization so that all features contribute equally during training, preventing those with larger scales from dominating the learning process. Additionally, we employed k-fold cross-validation to evaluate model performance across multiple data splits, which helps in selecting models that generalize well rather than fitting noise in the training data. Together, these techniques enhance the robustness and reliability of the model by reducing the risk of overfitting and improving predictive performance on unseen data.

Fig. 1

Effect of input parameters on density.

Box plots offer valuable insights into outliers, median values, and minimum and maximum data points. Each box plot summarizes five key statistics: the minimum, Q1 (the median of the lower half of the data), the median (the middle value of the data), Q3 (the median of the upper half of the data), and the maximum. Each plot consists of two main components: a pair of whiskers and a box. The lower whisker represents the minimum value, while the upper whisker indicates the maximum. The box itself spans from Q1 to Q3, illustrating the data distribution, and the horizontal line in its center marks the median. The box plots representing the input and target variables, namely temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), critical volume (Vc), acentric factor (ω), boiling point (Tb), molecular weight (Mw), and density (ρ), are presented in Fig. S2 in Supplementary Material.

Predictive analytics

Enhancing accuracy can be achieved through grid search cross-validation82. This method systematically explores various models and hyperparameter combinations by testing each one individually and validating the results. The goal of grid search is to identify the combination that yields the best model performance for the prediction task83. Typically, grid search is integrated with k-fold cross-validation to establish a reliable evaluation of model performance82,84. In scikit-learn85, the `GridSearchCV` class can be used to implement the grid search algorithm and identify the optimal hyperparameters86. In this study, we used GridSearchCV to tune hyperparameters. Table 1 shows the hyperparameter search ranges used for the machine learning and deep learning models, along with the optimal values identified through grid search.

Table 1 Hyperparameter search ranges and optimized values for each machine learning model.
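As a concrete illustration, the following snippet sketches this tuning workflow for the LightGBM model. It is a minimal sketch only: the parameter grid, the placeholder data, and the scoring choice are illustrative assumptions rather than the exact settings of this study (the actual search ranges are listed in Table 1).

```python
# Minimal sketch of grid search with 5-fold cross-validation (scikit-learn).
# The grid and data below are illustrative placeholders, not the study's settings.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((200, 8))   # placeholder for the 8 inputs (T, P, Tc, Pc, Vc, w, Tb, Mw)
y = rng.random(200)        # placeholder for density

X_scaled = StandardScaler().fit_transform(X)   # normalization step described above

search = GridSearchCV(
    estimator=LGBMRegressor(random_state=42),
    param_grid={
        "n_estimators": [100, 300, 500],
        "learning_rate": [0.01, 0.05, 0.1],
        "num_leaves": [15, 31, 63],
    },
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_scaled, y)
print(search.best_params_)   # optimal hyperparameter combination
```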

Machine learning models

Decision tree (DT)

The decision tree (DT) method is a widely recognized machine learning approach for both classification and regression tasks87. It derives its name from its hierarchical, tree-like structure, which operates similarly to a flowchart and is constructed through a recursive partitioning process. Over time, various decision tree algorithms have been introduced, including ID3, C4.5, CART, CHAID, and MARS. The primary aim of DT learning is to establish a framework capable of effectively predicting variations in a response variable or categorizing data within a test dataset. To accomplish this, DT employs a branching structure where internal nodes represent decision points based on attributes, and leaf nodes indicate the predicted output labels88,89. One of the strengths of the DT algorithm is its robustness to missing data and outliers, making it well-suited for both categorical and continuous variables. To prevent overfitting, key hyperparameters such as the minimum number of samples per leaf node and the maximum depth of the tree can be adjusted. Additionally, DT regression provides an intuitive way to examine the relationships between input and output variables, with its graphical representation serving as a practical tool for predicting continuous target values90. Fig. S3 presents a schematic representation of the DT model.
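A minimal scikit-learn sketch of such a regressor follows; the descriptors, target values, and hyperparameter settings are illustrative placeholders, with max_depth and min_samples_leaf acting as the overfitting controls mentioned above.

```python
# Minimal sketch: a depth- and leaf-constrained decision tree regressor.
# Data and hyperparameter values are illustrative, not the tuned ones from Table 1.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 8))                       # placeholder molecular descriptors
y = 1000.0 + 50.0 * X[:, 0] - 30.0 * X[:, 1]   # placeholder density values

model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=5, random_state=42)
model.fit(X, y)
print(model.predict(X[:3]))                    # predictions for the first 3 samples
```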

Adaptive boosting decision tree (AdaBoost-DT)

Freund and Schapire introduced the adaptive boosting method (AdaBoost) in 199791 to develop a classifier. An adaptive resampling technique selects training samples, and classifiers are trained iteratively; during each iteration, misclassified samples are assigned more weight. The final classifier is then derived from a weighted aggregation of the predictions of all trained models in the ensemble92. When paired with the AdaBoost algorithm, the DT, typically considered a weak learner, is expected to achieve notably improved performance. The AdaBoost-DT model is implemented in Python 3.7 using the AdaBoostRegressor class from the scikit-learn library.
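The sketch below illustrates this pairing with scikit-learn's AdaBoostRegressor; the hyperparameter values are illustrative assumptions rather than the tuned settings from Table 1.

```python
# Minimal sketch of AdaBoost with a shallow decision tree as the weak learner.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=4),  # weak base learner
    n_estimators=200,                              # boosting iterations
    learning_rate=0.05,                            # weight applied to each learner
    random_state=42,
)
# On scikit-learn < 1.2 the argument is named base_estimator instead of estimator.
# ada.fit(X_train, y_train)
```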

Gradient boosting (GBoost)

The Gradient Boosting Regressor (GBoost) is an ensemble learning technique that builds a series of decision trees sequentially, where each successive tree is trained to minimize the errors made by the previous one. GBoost is an iterative learning algorithm designed to enhance predictive performance by combining multiple weak learners into a more robust model93; as the number of weak models increases, the ensemble’s error progressively decreases94. Furthermore, boosting addresses the bias-variance trade-off by initially constructing a weak learner and progressively enhancing its performance through the sequential addition of new trees. Each newly added tree focuses on correcting the errors made by its predecessor by prioritizing the training instances with the highest prediction errors95; in essence, the new tree assigns greater importance to the rows mispredicted in the previous iteration. A schematic representation of the GBoost concept is shown in Fig. S4.
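A minimal scikit-learn sketch of this sequential, residual-correcting scheme is shown below; all hyperparameter values are illustrative.

```python
# Minimal sketch of gradient boosting regression: each new shallow tree is
# fitted to the residual errors of the current ensemble.
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=300,    # number of sequential weak learners
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=3,         # keeps individual trees weak
    random_state=42,
)
# gbr.fit(X_train, y_train); rho_pred = gbr.predict(X_test)
```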

Deep neural network (DNN)

A Deep Neural Network (DNN) is a type of neural network that consists of multiple hidden layers. In recent years, DNNs have gained widespread popularity, largely due to advancements in computational resources and increased accessibility to high-performance computing96,97. An appropriate network architecture is essential for ensuring the effective performance of a neural network. A standard DNN comprises an input layer, one or more hidden layers, and an output layer. The input and output layers define the model’s inputs and expected outputs, while the hidden layers play a key role in extracting meaningful features from the given dataset. Each layer consists of numerous neurons that apply mathematical operations to the input data. Throughout the training process, the model refines its performance by adjusting neuron-associated weights (w) and biases (b), a process guided by optimization techniques like gradient descent98.
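As an illustration, the following minimal sketch builds such a feed-forward network, assuming a Keras/TensorFlow implementation; the layer widths, activation functions, and optimizer are illustrative choices, not the tuned architecture of this study.

```python
# Minimal sketch of a feed-forward DNN for density regression (Keras/TensorFlow
# assumed); layer sizes and optimizer are illustrative.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),                    # T, P, Tc, Pc, Vc, w, Tb, Mw
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer 2
    tf.keras.layers.Dense(1),                      # predicted density
])
model.compile(optimizer="adam", loss="mse")        # weights/biases tuned by gradient descent
# model.fit(X_train, y_train, epochs=200, batch_size=32, validation_split=0.2)
```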

Light gradient boosting machine (LightGBM)

In 2016, Guolin Ke et al.99 presented LightGBM, a new machine learning model based on gradient boosting theory. Unlike other machine learning approaches, LightGBM requires less memory. LightGBM and XGBoost both support parallel computation, but LightGBM outperforms the earlier XGBoost model with faster training speed and lower memory usage; this reduction in memory occupation also decreases communication costs during parallel learning. LightGBM stands out due to its decision tree-based architecture, which leverages gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), and a histogram-based learning strategy with a depth-constrained, leaf-wise growth mechanism100. GOSS strikes a desirable balance between the sample size and the accuracy of LightGBM’s decision trees. EFB is an efficient algorithm that bundles features that rarely take nonzero values simultaneously (see Fig. S5 in Supplementary Material for the leaf-wise tree growth strategy). As decision trees deepen, the tendency to overfit increases. LightGBM’s key parameters enable it to handle large volumes of data, run at high speed, and achieve high predictive accuracy101; when LightGBM does overfit, setting a maximum depth limit on tree growth improves generalization99,102. Concerning the construction of a LightGBM model, the parameters and computations can be described as follows103,104:

$$X={\left\{\left({x}_{j},{y}_{j}\right)\right\}}_{j=1}^{N}$$
(1)

The prediction function $\widehat{f}\left(x\right)$ is obtained by minimizing the expected value of the loss function $L$:

$$\widehat{f}\left(x\right)=\underset{f}{\mathrm{arg\,min}}\;{E}_{x,y}\left[L\left(y,f\left(x\right)\right)\right]$$
(2)

The output of each trained tree can then be described as follows:

$${W}_{q(x)}, q\in (1, 2, 3, \dots ,N)$$
(3)

In the given equation, N represents the leaf count in a tree, q indicates the decision rules employed in a single tree, and W signifies the weight term of each leaf node. To minimize the objective function using Newton’s method, the outcome of each stage’s training is adjusted as follows:

$${G}_{t}\cong \sum_{i=1}^{N}L[{y}_{i},{F}_{t-1}\left({x}_{i}\right)+{f}_{t}({x}_{i})]$$
(4)
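In practice, such a model can be instantiated with the lightgbm package, as sketched below; num_leaves and max_depth constrain the leaf-wise growth discussed above, and all values shown are illustrative (see Table 1 for the tuned ones).

```python
# Minimal sketch of a LightGBM regressor with leaf-wise growth constraints.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,     # number of leaves per tree (leaf-wise growth)
    max_depth=8,       # depth cap to curb overfitting
    random_state=42,
)
# lgbm.fit(X_train, y_train); rho_pred = lgbm.predict(X_test)
```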

TabNet

TabNet is a deep learning model specifically designed for tabular data105. Unlike traditional deep learning models, it directly processes raw data without requiring manual feature engineering. TabNet employs a sparse attention mechanism to dynamically select relevant features, enhancing both interpretability and efficiency. Its core components include106:

• Feature Transformer: Processes input data and generates complex feature representations.

• Attentive Transformer: Determines which features should be selected at each decision step using a sparse attention mechanism.

• Masking Mechanism: Guides the feature selection process to improve model transparency and efficiency.

• Aggregation: Combines the selected features from multiple steps to produce the final output.

TabNet is built on a multi-step decision-making process, refining its feature selection iteratively107. Its architecture integrates a feature transformer, an attentive transformer, and a masking mechanism, making it a powerful model for structured data tasks. Empirical studies demonstrate its high performance and strong generalization capabilities across various datasets108,109.
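A minimal sketch using the pytorch-tabnet package is given below; the architectural settings (n_d, n_a, n_steps) and the placeholder data are illustrative assumptions, not the tuned configuration of this study.

```python
# Minimal sketch of TabNet regression with the pytorch-tabnet package.
# n_d/n_a set the decision/attention feature dimensions; n_steps is the number
# of sequential decision steps. All values are illustrative.
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

rng = np.random.default_rng(0)
X = rng.random((256, 8)).astype(np.float32)   # placeholder descriptors
y = rng.random((256, 1)).astype(np.float32)   # TabNetRegressor expects 2-D targets

tabnet = TabNetRegressor(n_d=16, n_a=16, n_steps=4, seed=42)
tabnet.fit(X, y, max_epochs=50)
print(tabnet.feature_importances_)            # aggregated attention-based importances
```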

Statistical error analysis

The following statistical parameters were used to compare the performance of the models developed in this study, where (ρpred) represents the density predicted by the deep learning and machine learning models, and (ρexp) represents the experimental density values.

Average percent relative error (APRE)

$$APRE=\frac{100}{n}\sum_{i=1}^{n}\frac{{(\rho }_{iexp}-{\rho }_{ipred})}{({\rho }_{iexp})}$$
(5)

Average absolute percent relative error (AAPRE)

$$AAPRE=\frac{100}{n}\sum_{i=1}^{n}\left|\frac{{(\rho }_{iexp}-{\rho }_{ipred})}{({\rho }_{iexp})}\right|$$
(6)

Root mean square error (RMSE)

$$RMSE=\sqrt{\frac{{\sum }_{i=1}^{n}{\left({\rho }_{iexp}-{\rho }_{ipred}\right)}^{2}}{n}}$$
(7)

Standard deviation (SD)

$$SD=\sqrt{\frac{{\sum }_{i=1}^{n}{\left(\frac{{\rho }_{iexp}-{\rho }_{ipred}}{{\rho }_{iexp}}\right)}^{2}}{n-1}}$$
(8)

Coefficient of determination (R2)

$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{\left({\rho }_{iexp}-{\rho }_{ipred}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({\rho }_{iexp}-{\overline{\rho }}_{iexp}\right)}^{2}}$$
(9)
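For reference, Eqs. (5)–(9) translate directly into the following NumPy sketch, where rho_exp and rho_pred are assumed to be arrays of experimental and predicted densities:

```python
# Direct NumPy implementation of the error metrics in Eqs. (5)-(9).
import numpy as np

def error_metrics(rho_exp, rho_pred):
    rel = (rho_exp - rho_pred) / rho_exp               # relative error
    n = len(rho_exp)
    apre = 100.0 / n * np.sum(rel)                     # Eq. (5)
    aapre = 100.0 / n * np.sum(np.abs(rel))            # Eq. (6)
    rmse = np.sqrt(np.mean((rho_exp - rho_pred) ** 2)) # Eq. (7)
    sd = np.sqrt(np.sum(rel ** 2) / (n - 1))           # Eq. (8)
    ss_res = np.sum((rho_exp - rho_pred) ** 2)
    ss_tot = np.sum((rho_exp - rho_exp.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                         # Eq. (9)
    return apre, aapre, rmse, sd, r2
```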

Results and discussion

Table 2 provides an overview of the statistical performance metrics for six models: LightGBM, AdaBoost-DT, GBoost, DT, TabNet, and DNN. The assessment was conducted on the training set (1068 data points), the testing set (268 data points), and the complete dataset (1336 data points). The performance metrics in the table clearly demonstrate that LightGBM outperforms both deep learning models (TabNet and DNN) as well as the other traditional machine learning models in predicting the density of the seven thiophene-based compounds. LightGBM achieves the lowest errors across all key metrics, including AAPRE (0.03034 test), RMSE (0.44628 test), and SD (0.00043 test), while maintaining an exceptionally high R2 of 0.99997 on the test set. Although the LightGBM model demonstrates extremely high accuracy and low error metrics, suggesting excellent predictive performance, we acknowledge the importance of assessing the risk of overfitting. To address this, we employed k-fold cross-validation during model training, which ensures the model’s performance is consistent across multiple data subsets and not just the training set. The close alignment between training and test performance, along with the low standard deviation in error metrics, indicates that the model generalizes well and is not overfitting. Nonetheless, we remain cautious and have included model validation measures to confirm its robustness and reliability. In contrast, TabNet and DNN show significantly higher prediction errors, with TabNet yielding an RMSE of 2.64649 and DNN 1.46719 on the test set, indicating weaker generalization. The superior performance of LightGBM is primarily due to the inherent suitability of tree-based models for structured, tabular data such as the molecular descriptors used in this study. Tree-based models like LightGBM can naturally model non-linear feature interactions and handle small to medium datasets efficiently, without requiring extensive tuning. Meanwhile, deep learning models like TabNet and DNN face architectural limitations in tabular contexts: they often struggle to generalize without large datasets, are prone to overfitting, and require complex hyperparameter optimization. These findings highlight LightGBM’s superior accuracy, as further illustrated in Fig. S6 (Supplementary Material).

Table 2 Statistical error analysis for the models developed in this work.

The Taylor diagram [78] provides a visual representation of key statistical metrics (R2, RMSE, and standard deviation, SD) to assess how well the predicted density aligns with the experimental data. In this diagram, models with higher accuracy appear closer to the reference measurement point, while those with greater error deviate further from it. Among the evaluated models, LightGBM demonstrates the closest alignment with the experimental data for both the training and test sets, confirming its superior predictive accuracy (see Fig. S7 in Supplementary Material).

Graphical analysis

Graphical error analysis is a method for evaluating a model’s performance and is particularly useful for comparing the performance of multiple models. Various graphical analyses were conducted in this study to demonstrate the effectiveness of the developed models. Graphical curves, including cross-plots, error distributions, group errors, and cumulative frequencies, were used to illustrate the reliability of the developed models.

Cross-plot

A cross plot is a type of scatter plot that visualizes the relationship between actual and predicted values relative to a 45° line passing through the origin. Fig. 2 plots the predicted values of the models against the experimental data. The greater the concentration of points on the Y = X line, the greater the accuracy of the model. As can be seen in Fig. 2, all models perform well, with points closely aligned along the ideal line.

Fig. 2

Cross-plot of the developed models for density prediction.
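A minimal matplotlib sketch of such a cross-plot, assuming rho_exp and rho_pred arrays of experimental and predicted densities, is:

```python
# Minimal cross-plot sketch: predictions vs. experimental values with the
# ideal Y = X reference line.
import matplotlib.pyplot as plt
import numpy as np

def cross_plot(rho_exp, rho_pred, label):
    lo, hi = np.min(rho_exp), np.max(rho_exp)
    plt.scatter(rho_exp, rho_pred, s=10, label=label)
    plt.plot([lo, hi], [lo, hi], "k--", label="Y = X")  # 45-degree ideal line
    plt.xlabel("Experimental density")
    plt.ylabel("Predicted density")
    plt.legend()
    plt.show()
```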

Error distribution plot

Fig. 3 shows the distribution of the relative errors of the proposed models for the training and testing processes. The lower the data density near the line Y = 0, the greater the model error and the lower the accuracy of the density prediction. The GBoost and LightGBM models exhibit lower relative errors than the other proposed models for both the training and testing data, and thus higher accuracy for density prediction.

Fig. 3

Error distribution diagrams of the models.

Cumulative frequency graph

The cumulative frequency plot of absolute relative error (%) for the models used in this study is shown in Fig. 4. This figure clearly shows the higher accuracy of the GBoost and LightGBM models compared with the other proposed models for density prediction. In addition, the TabNet model exhibits a higher error than the other models.

Fig. 4

Cumulative frequency distribution of the models developed in this study.

This study also explores error frequency by creating histograms of the relative error. Fig. 5 displays these histograms for the six developed models. In the LightGBM model, most data points have errors between − 0.25 and 0.25, centered around zero relative error. The presence of data with errors outside the range of − 0.25 to 0.25 for the DT, AdaBoost-DT, TabNet, and DNN models indicates that these models achieve less coverage than LightGBM for both the training and testing data.

Fig. 5

Histograms of relative error for the proposed models in density prediction.

The obtained results for the error values (see Fig. S8 in Supplementary Material) show that the LightGBM model produces the smallest error distribution range, from − 0.1337 to 0.1321. The GBoost model ranges from − 0.2130 to 0.2028, while the other models show wider error distribution ranges than these two models.

Fig. 6 compares the effect of the input parameters (critical temperature, critical pressure, critical volume, molecular weight, and boiling temperature, together with the operating temperature and pressure) on the absolute relative error (%) for all models. As can be observed, across all ranges of molecular weight, boiling temperature, critical temperature, critical pressure, critical volume, temperature, and pressure, the LightGBM model has the lowest error compared to the other models, which confirms its high accuracy.

Fig. 6

AAPRE of all models proposed in this work across different input parameter ranges.

A comparative analysis of the relative error among the proposed models offers valuable insights into identifying the most accurate predictive approach. This visual assessment demonstrates the strong alignment between experimental data and the predictions generated by the LightGBM model, as depicted in Fig. S9 of the Supplementary Materials.

Fig. 7 illustrates the percentage of relative error for the LightGBM model across the studied materials. The consistently low relative error across all materials underscores the model’s high precision in predicting density. This minimal deviation further confirms the reliability and effectiveness of the LightGBM model in accurately estimating density values.

Fig. 7

Box plots displaying the relative error distribution for various thiophene compounds.

Model trend analysis

To assess how well the developed models capture the expected density trends, Fig. 8 presents the LightGBM model’s predicted values as a function of temperature and pressure. The plots illustrate that at fixed pressures of 7.0 and 65 MPa, density decreases with rising temperature. Conversely, at constant temperatures of 303.15 K and 338.15 K, increasing pressure results in higher density.

Fig. 8

Top: Effect of temperature change on density at constant pressures of 7 and 65 MPa; Bottom: Effect of pressure change on density at constant temperatures of 303.15 and 338.15 K.

Sensitivity analysis

The relevancy factor (r) and the output of the LightGBM model are employed to assess the relative significance of input variables in predicting density. The correlation coefficient for each input parameter is determined using the following formula [63, 64]:

$$r\left({I}_{k},y\right)=\frac{{\sum }_{i=1}^{n}({I}_{i,k}-\overline{{I }_{k}})({y}_{i}-\overline{y })}{\sqrt{{\sum }_{i=1}^{n}{({I}_{i,k}-\overline{{I }_{k}})}^{2} {\sum }_{i=1}^{n}{({y}_{i}-\overline{y })}^{2}}}$$
(10)

Here, \({I}_{i,k}\) and \(\overline{{I }_{k}}\) denote the ith value and the mean of the kth input, respectively, where k indexes the input parameters (pressure, temperature, and so on), and \({y}_{i}\) and \(\overline{y }\) denote the ith predicted value and the mean of the predictions. The parameter \(r\) varies between − 1 and 1, reflecting the correlation between the independent and dependent variables. A positive \(r\) indicates that the output rises as the input variable increases, whereas a negative \(r\) implies an inverse correlation, and the closer \(|r|\) is to 1, the stronger the association between the model’s input and output values. The findings of the sensitivity analysis for the LightGBM model, as the best-performing model, are presented visually in Fig. 9. The relevancy factor plot clearly shows how each input parameter influences the model’s prediction of density, with boiling point (Tb), critical volume (Vc), and critical temperature (Tc) having the highest positive relevancy, indicating that they are the most influential features. This aligns well with established physicochemical principles for thiophenes, where thermophysical properties such as density are strongly governed by phase behavior and intermolecular interactions, both of which are reflected in critical and boiling point properties. For example, the strong correlation of Tb (0.7302) suggests that vaporization characteristics significantly influence the density profile. Similarly, the contributions of Tc (0.5683), Vc (0.5857), and ω (0.5131) highlight the importance of molecular structure and dispersion forces, which are central to understanding thiophene derivatives given their aromatic and heterocyclic nature. The negative relevancy of Pc (− 0.4949) and T (− 0.1675) reflects their inverse relationship with density; physically, for instance, density decreases as temperature rises. Overall, this plot confirms that the selected features not only enhance model performance but also reflect fundamental chemical behaviour.

Fig. 9

Sensitivity analysis on the LightGBM model.
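Eq. (10) is equivalent to the Pearson correlation between one input column and the model output, so it can be computed directly; a minimal NumPy sketch:

```python
# Direct implementation of the relevancy factor in Eq. (10) for one input
# column I_k against the model output y.
import numpy as np

def relevancy_factor(I_k, y):
    dI = I_k - I_k.mean()
    dy = y - y.mean()
    return np.sum(dI * dy) / np.sqrt(np.sum(dI ** 2) * np.sum(dy ** 2))
```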

Implementation of the Leverage method

Following the statistical and graphical analyses that confirmed the superiority of the LightGBM model over the other approaches, an additional outlier detection method was applied to identify data points that could adversely affect model predictions and to validate the reliability domain of the proposed model. The Williams plot visualizes standardized residuals (R) against hat values (H), providing insights into potential outliers. The key parameters for constructing this plot are determined using the following calculations77,110,111:

• Hat matrix (H):

$$H=X{\left({X}^{T} X\right)}^{-1}{X}^{T}$$
(11)

Here, XT represents the transpose of the matrix X, which is a (y × z) matrix in which y refers to the number of data points and z to the number of input variables used by the model.

• Leverage limit (H*):

$${H}^{*}=\frac{3\times (z+1)}{y}$$
(12)

• Standardized residuals (SR):

$${R}_{j}=\frac{{e}_{j}}{\sqrt{MSE\left(1-{H}_{j}\right)}}$$
(13)

where ej is the ordinary residual of the jth data point, MSE is the mean square error, and Hj is the jth leverage value. Data points with H values greater than H* lie outside the applicability domain of the model. In addition, data points with H values less than H* but R values greater than 3 or less than − 3 are considered suspect, while those with H values less than H* and R values between − 3 and 3 are considered valid112. As illustrated in Fig. 10, over 99% of the dataset is deemed valid, with only 12 of the 1336 data points identified as potential anomalies; the Williams plot analysis shows that 99.10% of the data falls within the acceptable range.

Fig. 10

Williams plot for the LightGBM model.
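The quantities in Eqs. (11)–(13) can be computed directly from the model's input matrix X and residual vector e, as in the following minimal NumPy sketch:

```python
# Minimal sketch of the Leverage method, Eqs. (11)-(13): hat values, the
# leverage limit H*, and standardized residuals.
import numpy as np

def williams_data(X, e):
    n, z = X.shape                               # n data points, z input variables
    H = X @ np.linalg.inv(X.T @ X) @ X.T         # Eq. (11): hat matrix
    h = np.diag(H)                               # leverage of each data point
    h_star = 3.0 * (z + 1) / n                   # Eq. (12): leverage limit
    mse = np.mean(e ** 2)
    R = e / np.sqrt(mse * (1.0 - h))             # Eq. (13): standardized residuals
    valid = (h < h_star) & (np.abs(R) <= 3.0)    # validity criterion described above
    return h, h_star, R, valid
```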

Conclusion

In this study, the critical properties, including critical temperature (Tc), critical pressure (Pc), critical volume (Vc), and acentric factor (ω), together with boiling point (Tb) and molecular weight (Mw), were used as input parameters for machine/deep learning models to predict the density of thiophenes. Accurate density prediction is vital for understanding and mitigating the environmental and industrial impact of sulfur compounds in fuels. In addition to four machine learning models (DT, AdaBoost-DT, LightGBM, and GBoost), we also used two deep learning models (TabNet and DNN) for density prediction. The results revealed that the LightGBM model outperformed the others, with the lowest errors in the statistical evaluations (AAPRE = 0.02308, APRE = − 0.00014, RMSE = 0.34998, and R2 = 0.99998). Graphical evaluations further confirmed the LightGBM model’s high accuracy in predicting thiophene density across the training and test datasets. In addition, the comparison of the experimental data and the values predicted by the LightGBM model at constant temperatures of 303.15 and 338.15 K and constant pressures of 7 and 65 MPa confirmed the accuracy of the predictions. Using the relevancy factor, the impact of the input features on the model’s target parameter was also investigated. The Leverage technique revealed that all data points appeared trustworthy and valid, except for a few that fell into the suspected data region; in summary, applying the Leverage method confirmed the data integrity and the effectiveness of the proposed LightGBM model. This study distinguishes itself through its comprehensive dataset, a broader range of thiophene derivatives, and the incorporation of advanced machine/deep learning models. The findings provide a robust foundation for optimizing the properties of thiophene derivatives, supporting innovations in fuel refinement, environmental sustainability, and advanced material applications.