Introduction

The coronavirus family, formally known as Coronaviridae, is a large and diverse group of viruses that infect the respiratory tract of a wide range of hosts, including humans and other mammals1,2,3. This family causes respiratory illnesses ranging from mild symptoms to severe diseases such as severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), and the ongoing global pandemic caused by SARS-CoV-2. The coronavirus genome is a positive-sense single-stranded RNA, approximately 26 to 32 kilobases in size4. One of the best-characterized drug targets is the main protease (MPro, also known as 3CLPro)5,6, which is involved in viral replication and has a highly conserved pocket site (about 80%) across clusters of the coronavirus family7.

The emergence of SARS-CoV-2 in Wuhan, China, in December 2019 marked the beginning of a global health emergency that rapidly evolved into a pandemic. This newly emerging disease, for which there were no existing protections or medications, caused a surge in fatalities during the early phase of the pandemic8. In response, the World Health Organization (WHO) announced the official name of the disease as coronavirus disease 2019 (COVID-19), and the new coronavirus was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)9,10. Responses to COVID-19 fall into several approaches, including vaccines such as Pfizer-BioNTech, Moderna, and AstraZeneca11. In tandem with vaccination efforts, antiviral medications have played a crucial role in the therapeutic landscape for COVID-19. For example, Remdesivir, a broad-spectrum antiviral drug previously investigated against viruses such as Ebola, inhibits virus replication within host cells12. Another antiviral medication is Molnupiravir, an oral drug designed to introduce errors during viral RNA replication, impeding the virus’s ability to proliferate13.

Previous studies evaluated the efficacy of ebselen derivatives against HIV14, HSV15, HCV16, and Zika virus infections17, as well as SARS-CoV-218,19,20,21. Ebselen derivatives are potent and effective inhibitors of the SARS-CoV-2 main protease, acting covalently through an S–Se interaction22. Furthermore, computational techniques such as molecular docking and molecular dynamics (MD) simulations have been employed in many studies22,23,24 to confirm the inhibitory efficiency of ebselen and to gain insight into its mechanism. Moreover, a recent report by Sun et al.25 presents a new ebselen and ebsulfur series that was synthesized and tested as inhibitors of SARS-CoV-2 MPro by the fluorescence resonance energy transfer (FRET) technique, which could motivate theoretical research on ligand-based drug design.

Computational approaches have emerged as powerful tools for drug discovery, offering a faster and more cost-effective alternative to purely experimental screening. One such approach is the quantitative structure–activity relationship (QSAR) methodology26, which is used to understand and predict the biological activity of compounds from their molecular structure. Conventional drug discovery, based on extensive laboratory testing and experimentation, is time-consuming and expensive27,28,29,30. Integrating QSAR with machine learning algorithms to predict biological activity from chemical structure allows the rapid screening and prioritization of potential drug candidates31,32,33, enabling researchers to focus on molecules with higher probabilities of exhibiting the desired activity against SARS-CoV-2.

Thus, in this work, QSAR modeling with three distinct methods, Genetic Function Approximation-Multiple Linear Regression (GFA-MLR), Random Forests (RF), and Artificial Neural Network (ANN), was first employed to relate the structural properties of ebselen and ebsulfur derivatives to their biological activity. The models were then validated by predicting the biological activity of an external set of compounds. Next, the models were applied to predict the SARS-CoV-2 inhibitory activity of new ebselen and ebsulfur derivatives reported by Qing-Feng et al.34. We further evaluated these newly synthesized compounds experimentally, focusing on inhibitory activity, drug-likeness, and toxicity in relation to their chemical structure. This process led to the identification of new inhibitor candidates. An overview of the study’s workflow is shown in Fig. 1.

Fig. 1

An overview of the study’s workflow, illustrating the key stages in developing effective SARS-CoV-2 inhibitors: ligand-based QSAR machine learning, enzyme-based assays on newly designed ebselen analogs, and structure-based MD simulations.

Materials and methods

Data set

A data set of twenty-seven ebselen and ebsulfur derivatives, together with their MPro inhibitory activities (IC50), was taken from Sun et al.25, who synthesized and tested the compounds. IC50 was then converted into pIC50 using Eq. (1); the structures and values are shown in Fig. 2. The data set was divided using the Kennard–Stone35,36 algorithm, which assigns compounds to the training set or test set according to the distribution of feature values across the whole data set, based on a distance metric between data points18. The training set was used to construct the QSAR model, while the test set was used for QSAR model validation.
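The Kennard–Stone selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance on the raw descriptor matrix (the distance metric and any scaling are not stated in the text), and the function name `kennard_stone_split` is hypothetical.

```python
import numpy as np

def kennard_stone_split(X, n_train):
    """Pick n_train training rows by the Kennard-Stone rule; the rest form the test set.

    Assumption: Euclidean distance on the descriptor matrix as given.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # For each remaining sample, distance to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample that is farthest from everything already selected.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Toy stand-in for a 27-compound x 35-descriptor table (random numbers, not real data).
rng = np.random.default_rng(0)
toy_descriptors = rng.normal(size=(27, 35))
train_idx, test_idx = kennard_stone_split(toy_descriptors, n_train=21)
```

With 27 rows and `n_train=21`, the split has the same 21/6 proportions as the one reported in the Results.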

Fig. 2

The 2D structures and pIC50 values of ebselen and ebsulfur derivatives25. a is the IC50 (μM) value and b is the pIC50 value.

$${\text{pIC}}_{50} = \log \left(\frac{1}{{\text{IC}}_{50}\text{ (M)}}\right)$$
(1)
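As a worked illustration of Eq. (1), the sketch below converts an IC50 reported in μM to pIC50. It assumes the logarithm in Eq. (1) is base 10, which is the usual convention; the function name is hypothetical.

```python
import math

def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 given in micromolar to pIC50 = log10(1 / IC50 in molar)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_um * 1e-6  # 1 uM = 1e-6 M
    return math.log10(1.0 / ic50_molar)
```

Under this convention, an IC50 of 1 μM corresponds to a pIC50 of 6, and each tenfold increase in IC50 lowers pIC50 by one unit.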

All structures were built and minimized, and thirty-five molecular descriptors (listed in Table S1) were then generated using the Materials Studio version 8.0 program37. The molecular descriptors served as independent variables.

Descriptor selection and model construction

Two algorithms were used to find the descriptors that are critical to SARS-CoV-2 MPro inhibitory activity. The first is the genetic function approximation (GFA), an optimization technique that searches for the variables most suitable for model construction38,39. The MLR model used the GFA to select important descriptors in Materials Studio version 8.0. The GFA-MLR model was constructed with the population and the maximum number of generations set to 100 and 500, respectively, and a mutation probability of 0.100.

The second is Gini importance40,41, which was applied to select the crucial features for the RF and ANN models. Features with the highest Gini importance values correspond to structural properties that most strongly affect potency and bioactivity. Gini importance varies between 0 and 1, with 0 representing the lowest possible importance; a higher value indicates a greater contribution of the feature to the model. The Gini importance for the RF and ANN models was computed in Google Colab42. In addition, Fig. S1 displays the pairwise correlation matrix of the descriptors, which indicates the relationships between the descriptors in the model.
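A minimal sketch of importance-based descriptor ranking is given below, assuming Scikit-learn's `RandomForestRegressor`, whose `feature_importances_` attribute is the impurity-based measure commonly called Gini importance. The data are synthetic and the variable names are illustrative; the study's actual descriptors and settings are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 27 samples, 6 descriptors; only descriptor 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(27, 6))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=27)

rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # descriptor indices, most important first
top4 = ranking[:4]                       # keep the four highest-ranked descriptors
```

Because the importances are normalized to sum to 1, each value lies between 0 and 1, and a larger value marks a descriptor that contributes more to the splits in the forest.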

To find the best RF model, a hyperparameter search was performed by varying four parameters: (1) max_features, the maximum number of features considered for splitting a node; (2) min_samples_leaf, the minimum number of data points allowed in a leaf node; (3) min_samples_split, the minimum number of data points required in a node before the node is split; and (4) the number of estimators, that is, the number of trees in the forest. The hyperparameters tested for the RF model are shown in Table 1(a).
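A search over the four RF hyperparameters named above could be set up as in the sketch below. It assumes Scikit-learn's `GridSearchCV`; the grid values, the 3-fold cross-validation, the scoring metric, and the synthetic data are all illustrative assumptions, not the ranges of Table 1(a).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for 21 compounds x 4 descriptors.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(21, 4))
y_train = X_train @ np.array([0.5, -0.3, 0.2, 0.1]) + 5.0

# Illustrative grid only; these values are assumptions.
param_grid = {
    "max_features": [2, 4],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 4],
    "n_estimators": [10, 30],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
)
search.fit(X_train, y_train)
best_rf_params = search.best_params_
```

Each combination in the grid is scored by cross-validated RMSE, and `best_rf_params` holds the combination with the lowest average error.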

Table 1 Hyperparameters to be tested for (a) RF and (b) ANN.

The ANN is a machine learning algorithm inspired by the structure and function of biological neurons: layers of artificial neurons transform input data into output predictions through weighted sums followed by non-linear activation functions. The ANN model construction was varied in two respects: the number of nodes in the input layer, which equals the number of descriptors, and the number of hidden layers and of nodes in each hidden layer43.

To improve the predictive performance and generalization capacity of the ANN model, we optimized its hyperparameters. Hyperparameter optimization for the ANN covers the number of hidden layers, the number of neurons, the maximum number of iterations (max_iter), the learning rate, and the batch size. These hyperparameters were varied to obtain the most robust and predictive non-linear model, based on an n-fold cross-validation scheme using GridSearchCV of Scikit-learn44,45. The ranges searched for the five ANN hyperparameters are presented in Table 1(b).
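The grid search over the ANN hyperparameters can be sketched as follows, assuming Scikit-learn's `MLPRegressor`, in which `hidden_layer_sizes` encodes both the number of hidden layers and the neurons per layer. The grid values, the 3-fold split, and the synthetic data are assumptions made for illustration and do not reproduce Table 1(b).

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Short training runs may stop before convergence; silence that warning in this sketch.
warnings.simplefilter("ignore", ConvergenceWarning)

# Synthetic training data standing in for 21 compounds x 4 descriptors.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(21, 4))
y_train = X_train @ np.array([0.5, -0.3, 0.2, 0.1]) + 0.05 * rng.normal(size=21)

# Illustrative grid only; these values are assumptions.
ann_grid = {
    "hidden_layer_sizes": [(5,), (5, 5, 5)],  # one or three hidden layers of five neurons
    "max_iter": [200, 400],
    "learning_rate_init": [0.001, 0.01],
    "batch_size": [4, 8],
}

ann_search = GridSearchCV(
    MLPRegressor(random_state=0),
    ann_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
ann_search.fit(X_train, y_train)
best_ann_params = ann_search.best_params_
```

With so few samples, the cross-validated score can vary noticeably between folds, so the selected combination should be read as indicative rather than definitive.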

Model evaluation statistical terms

The degree of linear correlation between two descriptors was investigated by calculating the correlation coefficient (r)46,47, as shown in Eq. (2). A correlation coefficient of 1.0 or − 1.0 indicates that two variables are perfectly linearly correlated, while a coefficient of 0.0 shows no linear correlation. Here, C(x,y) is the covariance of the two variables x and y, and \({V}_{x}\) and \({V}_{y}\) are the variances of x and y, respectively.

$$\text{r}=\frac{{\text{C}}_{\text{(x,y)}}}{\sqrt{{\text{V}}_{\text{x}} \times {\text{ V}}_{\text{y}}}}$$
(2)

The variance inflation factor (VIF) quantifies collinearity between descriptors in multiple regression models48,49,50 and is determined by Eq. (3), where R2 is the coefficient of determination obtained by regressing one descriptor on all the other descriptors.

$$\text{VIF} = \frac{1}{\text{1} - {\text{R}}^{2}}$$
(3)
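Eqs. (2) and (3) can be evaluated directly, as in the sketch below. It assumes NumPy and an ordinary least-squares regression with an intercept for the R2 term in the VIF; the function names are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Eq. (2): covariance divided by the square root of the product of variances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

def vif(X):
    """Eq. (3) for every column: regress it on the other columns and use that R^2."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    values = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other descriptors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - (residual @ residual) / np.sum((target - target.mean()) ** 2)
        values.append(1.0 / (1.0 - r2))
    return np.array(values)
```

A descriptor that is uncorrelated with all the others has R2 = 0 and therefore VIF = 1; the VIF grows without bound as R2 approaches 1.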

The model quality was validated using statistical parameters. R-squared (R2) measures the goodness of fit of the model and should be greater than 0.6; it is given by Eq. (4), where \({y}_{exp}\) are the experimental values, \({y}_{pred}\) are the predicted values, and \(\overline{y}\) is the mean of the experimental values.

$${\text{R}}^{2} = 1 - \frac{\sum_{\text{i=1}}^{\text{n}}{\text{(}{\text{y}}_{\text{exp.}}-{\text{y}}_{\text{pred.}}\text{)}}^{2}}{\sum_{\text{i=1}}^{\text{n}}{\text{(}{\text{y}}_{\text{exp.}}-\overline{\text{y}}\text{)}}^{2}}$$
(4)

Root mean square error (RMSE) is a measure of prediction accuracy calculated as the square root of the average squared errors. A lower RMSE indicates better prediction quality, ideally closer to zero, as shown in Eq. (5).

$$\text{RMSE}=\sqrt{\sum_{\text{i=1}}^{\text{n}}\frac{{\text{(}{\text{y}}_{\text{exp.}}-{\text{y}}_{\text{pred.}}\text{)}}^{2}}{\text{n}}}$$
(5)
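The two validation statistics can be computed as in the sketch below. It assumes NumPy and uses the standard residual-sum-of-squares form of R2 (one minus the ratio of the residual sum of squares to the total sum of squares), together with the RMSE of Eq. (5); the function names are hypothetical.

```python
import numpy as np

def r_squared(y_exp, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of y_exp."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_exp - y_pred) ** 2)
    ss_tot = np.sum((y_exp - y_exp.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_exp, y_pred):
    """Eq. (5): square root of the mean squared difference."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_exp - y_pred) ** 2)))
```

For example, experimental values of 5.0, 5.5, and 6.0 with predictions of 5.1, 5.4, and 6.1 give an RMSE of 0.1 and an R2 of 0.94.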

Enzyme-based assay

The MPro activity and inhibition assay at 100 μM compound concentration was performed exactly as previously described51,52,53. Briefly, SARS-CoV-2 MPro with no tags at the termini was expressed and purified as described for SARS-CoV-1 MPro54. All assays were performed with BioTek Synergy H1 microplate reader using PBS containing 1 mM DTT and 1% DMSO as the reaction buffer. The fluorogenic substrate E(EDANS)TSAVLQSGFRK(DABCYL) (Biomatik) at 25 µM was used with 0.2 µM of MPro in the total reaction volume of 100 µL. The excitation and emission wavelengths employed were 340 and 490 nm, respectively. The percentage of the enzymatic activity was calculated from the initial rate of the reaction when the compound being tested was present relative to the initial rate of the reaction without the inhibitor. PF-07321332 at 100 nM was used as a positive control55. GraphPad Prism 856 (San Diego, California USA, https://www.graphpad.com) was used for graphing.

Cytotoxicity testing

Cytotoxicity (CC50) was evaluated as previously described36. Vero E6 cells were seeded and incubated overnight before the test. The compounds were prepared in DMSO for a final concentration of 500 µM. The compounds were twofold serially diluted to 8 concentrations before addition to Vero E6 cells. Cells were incubated for 48 h, and cytotoxicity was measured using the CellTiter 96® Aqueous One Solution Cell Proliferation Assay Kit (MTS) (Promega, Madison, WI, USA) according to the manufacturer’s instructions and analyzed by spectrophotometry at 490 nm. The concentration required for 50% cell death (CC50) was determined from three independent experiments.

The efficacy study was conducted according to the guidelines of the Declaration of Helsinki and the Chulalongkorn University Institutional Biosafety Committee (CU-IBC 003/2021). The Institutional Review Board of the Faculty of Medicine, Chulalongkorn University certified the protocol exemption (COE 017/2021, IRB No. 297/64). SARS-CoV-2 B.1.617.2 (accession number ON381169) was propagated in Vero E6 cells in MEM supplemented with 1% fetal bovine serum, 100 I.U./ml penicillin, 100 μg/ml streptomycin, 10 mM HEPES, NEAA, and sodium pyruvate at 37 °C in a humidified chamber under 5% CO2. Virus titers were determined as TCID50/ml in confluent cells in 96-well cell culture plates. All experiments with live SARS-CoV-2 were performed in a certified biosafety level 3 facility of the research affair-Medical Research Center (MRC), Faculty of Medicine, Chulalongkorn University.

Seven ebselen analogs were tested against four strains of SARS-CoV-2. Briefly, Vero E6 cells at 5 × 10^4 cells per well were seeded into a 24-well plate and incubated overnight at 37 °C under 5% CO2. Cells were infected with SARS-CoV-2 at 1000 TCID50 for 1 h. After infection, cells were washed with phosphate-buffered saline (PBS) and incubated with 1 ml of maintenance medium. The compounds were prepared at the indicated concentrations in 0.1% DMSO in the maintenance medium and were present during and after infection. Cells were incubated at 37 °C for 72 h in a humidified chamber under 5% CO2. Supernatants were collected for analysis of the viral infectivity by TCID50/ml (v2.1—20-01-2017_MB* by Marco Binder; adapted @ TWC. 5. 6, accessed on 16 May 2022). Each compound was serially diluted to 6–8 different concentrations and added at the final concentrations to SARS-CoV-2-infected cells. Dimethyl sulfoxide at 0.1% was used as the vehicle (no-inhibition) control. Cells were incubated for 72 h and supernatants were collected for subsequent TCID50/ml analysis43,57. Data were plotted and the half-maximal effective concentration (EC50) values were calculated using nonlinear regression analysis.

Molecular dynamic simulations

This study used ligand-binding path sampling parallel cascade selection MD (LB-PaCS-MD)58, an extension of the original PaCS-MD59,60. PaCS-MD was developed to sample the transition paths of proteins between a set of endpoint structures, where multiple short-timescale MD simulations are repeated from reasonable structures to promote their conformational transitions from a reactant to a product61,62,63. In the case of LB-PaCS-MD, this technique repeats short-timescale (about 100 ps) MD simulations from reasonable protein–ligand configurations, focusing on ligand-unbinding states. In this application, configurations are ranked based on the center-of-mass (COM) distance between the Se atom of each ligand and the sulfur atom in the active site (C145) of SARS-CoV-2 Mpro, termed dCOM. The five top-ranked snapshots from each cycle serve as initial structures for the subsequent simulations. LB-PaCS-MD terminates automatically after 100 cycles, and 10 independent replications were conducted with different initial velocities to ensure reliable results.
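The rank-and-restart logic of the cycles described above can be illustrated with a toy sketch. This is not an MD simulation: a one-dimensional random walk in distance stands in for each short MD run, and it is assumed here that the snapshots with the smallest distance are the top-ranked ones. The function names, step size, and number of frames are hypothetical.

```python
import random

def short_run(start, n_steps, rng):
    """Hypothetical stand-in for one short MD run: a 1D random walk in distance (angstrom)."""
    d = start
    frames = [d]  # keep the starting snapshot as the first frame
    for _ in range(n_steps):
        d = max(0.0, d + rng.gauss(0.0, 0.5))
        frames.append(d)
    return frames

def pacs_cycles(d_start=30.0, n_cycles=100, n_select=5, n_steps=20, seed=0):
    """Repeat: run short trajectories from the selected snapshots, rank, reselect."""
    rng = random.Random(seed)
    seeds = [d_start] * n_select
    best_per_cycle = []
    for _ in range(n_cycles):
        snapshots = []
        for s in seeds:
            snapshots.extend(short_run(s, n_steps, rng))
        snapshots.sort()              # assumption: smallest distance ranks first
        seeds = snapshots[:n_select]  # top-ranked snapshots seed the next cycle
        best_per_cycle.append(seeds[0])
    return best_per_cycle

best = pacs_cycles()
```

Because every run keeps its starting snapshot, the best distance can never increase from one cycle to the next in this toy model, which mimics how repeated selection drives the sampling towards the target configuration.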

To parameterize compound P8, its geometry was optimized and the electrostatic potential (ESP) charges were generated at the B3LYP/6-31+G(d,p) level64 using Gaussian 1665. Subsequently, restrained ESP (RESP) charge fitting was performed and the topological parameters of the ligand (frcmod and prep files) were generated using MCPB.py66 in AmberTools2167, together with the generalized Amber force field 2 (GAFF2)68. The 3D structure of ebselen covalently bound to dimeric SARS-CoV-2 Mpro (PDB ID: 7BAK19) was utilized as the protein receptor. To construct the initial structure for the LB-PaCS-MD simulation, P8 was placed around 30 Å from C145, located at the active site of SARS-CoV-2 Mpro on chain A, in a cubic box. The tLEaP module included in AmberTools2167 was used to set up the complex by adding hydrogen atoms, TIP3P water molecules, and neutralizing ions. This complex was converted to the GROMACS input file format to conduct the multiple MD simulations under NPT conditions (T = 300 K and P = 1 bar) in each LB-PaCS-MD cycle using GROMACS (version 2019.6)69. The MD conditions followed those previously described70,71.

All 10 LB-PaCS-MD trajectories were used to calculate the free-energy profile (kBT) as a function of the distances of S(C145)–Se(P8) and Nε(H41)–N(P8), which were then plotted as a two-dimensional free energy landscape (2D-FEL). The complex sampled from the Global Minimum State (GMS) was evaluated for binding interaction energy using the LigandScout 4.4.6 program, following standard protocol72,73. The 3D and 2D interactions of the complex at GMS were visualized using Visual Molecular Dynamics (VMD) version 1.9.474,75 and BIOVIA Discovery Studio Visualizer76.
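A two-dimensional free-energy landscape in units of kBT can be estimated from sampled distances as sketched below. This is a simplified illustration that applies direct Boltzmann inversion to a 2D histogram, F = −ln P, rather than the Markov State Model analysis used in the study; it assumes NumPy, and the synthetic distances are placeholders for the S(C145)–Se(P8) and Nε(H41)–N(P8) values.

```python
import numpy as np

def free_energy_2d(d1, d2, bins=30):
    """Return F = -ln(P) in units of kBT on a 2D grid, shifted so the minimum is 0."""
    hist, xedges, yedges = np.histogram2d(d1, d2, bins=bins)
    prob = hist / hist.sum()
    with np.errstate(divide="ignore"):
        f = -np.log(prob)          # empty bins become +inf
    f -= f[np.isfinite(f)].min()   # global minimum set to zero
    return f, xedges, yedges

# Synthetic distances (angstrom) standing in for trajectory data.
rng = np.random.default_rng(3)
dist_a = rng.normal(loc=3.5, scale=0.5, size=5000)
dist_b = rng.normal(loc=6.0, scale=1.0, size=5000)

fel, xedges, yedges = free_energy_2d(dist_a, dist_b)
gms_bin = np.unravel_index(np.argmin(fel), fel.shape)  # bin of the global minimum
```

The bin stored in `gms_bin` marks the most populated region of the landscape, which corresponds to the global minimum state in this simplified picture.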

Results

Classical QSAR

The Kennard–Stone algorithm was applied to divide the data set into twenty-one training compounds and six test compounds. The MLR model was built using five descriptors selected from the training set by the GFA algorithm, as shown in Eq. (6); the descriptors are defined in Table S2. The predicted pIC50 values show residuals of less than 1. The model validation parameters of Eq. (6) were R2 of the training set = 0.69, RMSE of the training set = 0.16, and RMSE of the test set = 0.56, indicating that the model was predictively accurate and acceptable (Fig. 3).

$$\begin{aligned} {\text{pIC}}_{50} & = - 0.507 *{\text{ Molecular}}\;{\text{flexibility }} + 0.036 *{\text{ Zagreb}}\;{\text{index }} - 0.057 *{\text{ E-state}}\;{\text{keys}}\;({\text{sums}}){:}\,{\text{ S\_aaCH }} \\ & \quad - 0.015 *{\text{ E-state keys}}\;({\text{sums}}){:}\,{\text{ S\_dO }} + 0.336 *{\text{ Shadow}}\;{\text{length:}}\;{\text{LY }} + 2.897 \end{aligned}$$
(6)
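Eq. (6) can be applied to a compound's descriptor values as in the sketch below. The coefficients and intercept are taken from Eq. (6); the dictionary keys and the function name are hypothetical labels chosen for illustration.

```python
# Coefficients of Eq. (6); the key names are illustrative labels for the five descriptors.
COEFFICIENTS = {
    "molecular_flexibility": -0.507,
    "zagreb_index": 0.036,
    "s_aach": -0.057,            # E-state keys (sums): S_aaCH
    "s_do": -0.015,              # E-state keys (sums): S_dO
    "shadow_length_ly": 0.336,
}
INTERCEPT = 2.897

def predict_pic50(descriptors):
    """Evaluate the linear model of Eq. (6) for one compound."""
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )
```

The signs of the coefficients indicate the direction of each effect: within this model, a larger shadow length along Y or Zagreb index raises the predicted pIC50, whereas greater molecular flexibility lowers it.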
Fig. 3

A scatter plot of the predicted pIC50 for each model (Blue circles = Training set, Red triangles = Test set) (A) MLR, (B) RF, and (C) ANN.

Next, the correlation matrix of the descriptors and the VIF values were examined (Table S3). All pairwise correlation coefficients between descriptors were less than 0.7. In addition, the VIF values for each descriptor were less than 10, indicating that the five descriptors were not multicollinear and should not cause problems in model interpretation and stability.

QSAR-ML

The RF and ANN models were developed utilizing the Gini importance method (Fig. S2), with emphasis on key descriptors such as shadow length along the Y-axis, AlogP98, shadow area fraction in the YZ plane, and principal moment of inertia along the Y-axis. Detailed definitions for each descriptor can be found in Table S2. VIF was used to check for multicollinearity among these descriptors; it measures how much the variance of an estimated regression coefficient increases when the predictors are correlated.

The best RF model was obtained with a max depth of 10, a max feature of 4, a min_samples_leaf of 2, a min_samples_split of 2, and 30 estimators. It gave acceptable statistical parameters: R2 of the training set = 0.82, RMSE of the training set = 0.14, and RMSE of the test set = 0.18 (Fig. 3). The ANN model used the same four descriptors, selected by Gini importance from the RF model (Fig. S2). The best-performing ANN architecture was 4-(5-5-5)-1: an input layer of four neurons, one for each descriptor selected by the Gini importance method; three hidden layers of five neurons each; and an output layer of one neuron giving the predicted inhibitory activity. The ANN model in Fig. 3 showed robust and stable performance, with an R2 of 0.89 for the training set, an RMSE of 0.10 for the training set, and an RMSE of 0.05 for the test set. As for the MLR model, the correlation analysis revealed that the four descriptors used in the RF and ANN models were not highly correlated, as shown in Table S4. Compared to the MLR and RF models, the ANN shows higher prediction accuracy (Table S5).

External validation

To evaluate the predictive ability and robustness of the ANN model developed in the previous step, an external validation was conducted using ebselen data reported by Amporndanai et al.19. Three compounds from that work were selected: MR6-17-1, D_MR6-18-4, and D_MR6-26-2. The analysis gave an RMSE of 0.35, which demonstrates good predictive performance (Fig. 4).

Fig. 4

The external compounds with their experimental pIC50 values and the pIC50 values predicted by the ANN model for SARS-CoV-2 MPro inhibition.

Prediction of the inhibitory activity of the new synthetic ebselen analogs

Thirteen new ebselen analogs were synthesized by Qing-Feng et al.34. The external set used in this work came from a collaboration with Osaka University. The molecular descriptors for the new ebselen analogs, calculated with Materials Studio version 8.0, are listed in Table S7. The pIC50 values of each new ebselen analog were then predicted using the ANN model, and the predicted values (Table S8) all fell within the range of the data set, as shown in Fig. 5.

Fig. 5

The new synthetic ebselen structures and the pIC50 values predicted by the ANN model (Blue circles = Training set, Red triangles = Test set, Grey squares = Newly designed set).

Enzyme-based assay

The ebselen analogs were initially tested for their inhibitory activities against SARS-CoV-2 MPro. At 100 μM concentration, the compounds P1, P3, P4, P5, P7, P8, and P12 showed modest inhibitory activity (Fig. 6A). The most potent inhibitor was P8, which reduced the enzymatic activity to 64.5%. Therefore, these ebselen analogs could serve as starting points for further modification to improve inhibitory potency.

Fig. 6

(A) and (B) Effect of the new synthetic ebselen analogs in Vero E6 cells viability. (C) Inhibitory activity of ebselen analogs (100 μM) against SARS-CoV-2 MPro.

Toxicity and efficacy testing

The compounds P1, P3, P4, P5, P7, P8, and P12 were tested for cytotoxicity in Vero E6 cells (Fig. 6B). No cytotoxicity was observed for P1, P3, P4, P5, P7, and P8 at any tested concentration up to 100 µM; therefore, we concluded that the CC50 values of these compounds were higher than 100 µM. However, P12 showed a cytotoxic effect at higher concentrations, with a calculated CC50 of 51.58 ± 5.90 µM. Moreover, the efficacy was tested against SARS-CoV-2 in the BSL-3 facility. The P3 compound showed a 1–1.5 log TCID50 reduction from the initial concentration, and the inhibition was consistent through the higher concentrations. P4, P5, P7, P8, and P12 showed a 1–2 log TCID50 reduction from the initial concentration, but the inhibition decreased at higher concentrations (Fig. 6C). We speculate that this finding is related to limited solubility, as crystals were observed at the respective tested concentrations. Finally, the P1 compound showed fluctuating SARS-CoV-2 titers, suggesting inconsistent solubility of the compound.

P8 binding pathway towards the catalytic dyad region

To sample the plausible binding pathway and configuration of P8 towards the active site of SARS-CoV-2 MPro, LB-PaCS-MD simulation was conducted for 10 individual replications (#1–10) using the same initial coordinates but with varying initial velocities (Fig. 7A). The 2D free-energy landscape (2D-FEL) for each replication, derived from LB-PaCS-MD trajectories based on the Markov State Model (MSM), shows the relative free energy (kBT) of P8 across its conformational space. In the representative trajectory (#1), P8 reaches a global minimum state (GMS, Fig. 7B), indicating its search for an optimal conformation that facilitates binding at the active site of SARS-CoV-2 MPro. The binding process of P8 observed in this study was similar to that of ebselen, as recently reported77. P8 rearranged its conformation by orienting the Se of the benzoselenazole moiety toward the S of C145 (Fig. 7C), resulting in a binding interaction energy of − 16.15 kcal/mol (Fig. 7D). We found that a chalcogen-bonding interaction between S atoms in P8 and C145, together with a π-donor hydrogen interaction with N142, could induce and stabilize the binding mode of P8. Additionally, the naphthalene ring at R1 of the benzoselenazole ring, introduced from the QSAR-ML study, could maintain ligand binding through alkyl-π interactions with M49 and M165, as well as van der Waals interactions with residues within sub-pockets S2 and S4 of the SARS-CoV-2 MPro active site (Fig. 7C). Our findings are congruent with previous studies showing that the naphthalene moiety in compound CDD-1733 leads to full occupation of sub-pocket S278, consistent with increased hydrophobicity. Moreover, the P2 moieties of α-ketoamide inhibitors and nirmatrelvir fit well into sub-pocket S2, contributing to hydrophobic interactions with M49, M165, and D18779,80. This sub-pocket S2 could also accommodate the benzene ring of flavonoids and the bicycloproline moieties of boceprevir and telaprevir through hydrophobic interactions81,82,83.

Fig. 7

(A) The P8 binding pathway towards the catalytic dyad region of SARS-CoV-2 MPro, elucidated using LB-PaCS-MD with 10 independent runs (#1–10), each started with different initial velocities. (B, C) 2D-FEL of the representative trajectory, chosen to visualize the binding pathway and the metastable state at GMS (×). (D) The binding pattern and interactions of P8 in 3D and 2D.

Conclusions

In this study, QSAR modeling with several algorithms, including MLR, RF, and ANN, provided insight into the properties of compounds that inhibit SARS-CoV-2 MPro activity (pIC50). Among all models, the ANN model had the highest R2 (0.89) and the lowest test-set RMSE (0.05), indicating that it was the most accurate. Consequently, the ANN model was used to predict the SARS-CoV-2 MPro inhibitory activity of the thirteen new synthetic ebselen analogs, which were then examined by enzyme-based activity and toxicity testing. Compound P8 showed notable SARS-CoV-2 MPro inhibitory activity in the enzyme-based assay and was non-toxic. The LB-PaCS-MD study, conducted for 10 individual replications, showed from the representative trajectory that P8 reaches a global minimum state with a binding interaction energy of − 16.15 kcal/mol. P8 can effectively bind to the active site of SARS-CoV-2 MPro, while the P2 moiety of α-ketoamide inhibitors plays a role in hydrophobic interactions with the M49, M165, and D187 residues in sub-pocket S2.