Abstract
Drug screening resembles finding a needle in a haystack: identifying a few effective inhibitors from a large pool of potential drugs. Large experimental screens are expensive and time-consuming, while virtual screening trades off computational efficiency against experimental correlation. Here we develop a framework that combines molecular dynamics (MD) simulations with active learning. Two components drastically reduce the number of candidates needing experimental testing to fewer than 20: (1) a target-specific score that evaluates target inhibition and (2) extensive MD simulations to generate a receptor ensemble. The active learning approach reduces the number of compounds requiring experimental testing to fewer than 10 and cuts computational costs by ∼29-fold. Using this framework, we discovered BMS-262084 as a potent inhibitor of TMPRSS2 (IC50 = 1.82 nM). Cell-based experiments confirmed BMS-262084’s efficacy in blocking entry of various SARS-CoV-2 variants and other coronaviruses. The identified inhibitor holds promise for treating viral and other diseases involving TMPRSS2.
Introduction
The efficient discovery of drugs is critical for the development of therapies against rapidly evolving diseases. Despite scientific advancements, drug discovery remains a slow and expensive process, characterized by high failure rates1. The initial phase of drug discovery, known as hit identification, is particularly challenging and can be regarded as a needle-in-a-haystack problem.
Virtual screening has enhanced the efficiency of exploring the chemical space, reducing the number of compounds that require experimental testing. However, relying on brute-force virtual screening might not solve the needle-in-a-haystack problem and is potentially very wasteful. Methods typically employed in virtual screening approaches include pharmacophore modeling2, molecular docking3, molecular dynamics (MD)4 and machine learning (ML)5. While docking methods and their associated scoring functions are highly efficient in screening through vast databases of candidate molecules, the heuristic for binding affinity that they provide can be very inaccurate6. The quantity we actually want to predict, however, is protein function and how to inhibit it. Furthermore, many virtual screening approaches disregard that protein binding pockets can have multiple conformational states, which can play a crucial role7, both in the context of induced fit and conformational selection mechanisms8.
To explore the power of using target-specific information, protein flexibility and active learning in virtual screening, we focused on TMPRSS2, a human serine protease, whose inhibition mechanism was studied in detail9. TMPRSS2 is involved in prostate cancer10, as well as cellular entry of influenza A, SARS-CoV and MERS-CoV viruses11,12. Notably, it facilitates the entry of SARS-CoV-213 (Fig. 1a), a function that is retained for the Omicron variant14,15. Its known inhibitors either form a covalent bond with the enzymatic reactive center16,17,18,19 (Fig. 1b) or establish a stable non-covalent complex20.
Fig. 1. a TMPRSS2 role in Spike (S) protein-driven entry of SARS-CoV-2 into host cells. b Important structural features of TMPRSS2. Catalytic triad residues are depicted in black. Labels of the features included in our target-specific score are in red. c, d Structures belonging to the receptor ensemble from apo and holo MD simulations.
In this manuscript, we present the development of an active learning approach for drug screening and apply it to TMPRSS2 inhibition. Our approach ranks candidates according to a target-specific score and efficiently navigates through chemical space in several cycles. Using our approach, we successfully identified BMS-262084, a potent nanomolar inhibitor of TMPRSS2 which effectively blocks coronavirus entry into Calu-3 human lung cells.
Results
We develop an active learning approach for TMPRSS2 inhibition (Fig. 2a) and we apply it to the DrugBank library and the NCATS in-house library. The DrugBank contains four experimentally verified TMPRSS2 inhibitors: nafamostat, camostat, gabexate and otamixaban. To validate our approach, we simulate a virtual screen on the DrugBank library as follows: starting with 1% of the whole library, we employ an active learning cycle to select subsequent extension sets of the same size until the method scores all four known inhibitors. Our objective is that the known inhibitors receive high rankings – thus minimizing the number of experimental tests – while screening as few compounds as possible from the whole library – thus minimizing the computational cost. Subsequently, we use our approach to find TMPRSS2 inhibitors.
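The simulated screen described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the implementation used in this work: the surrogate model (a nearest-neighbour average over a single made-up descriptor), the synthetic library and all function names are hypothetical stand-ins chosen so that the example runs without external dependencies.

```python
import random

def nearest_neighbor_prediction(x, scored, k=3):
    """Surrogate model (assumption): mean score of the k closest scored compounds."""
    nearest = sorted(scored, key=lambda item: abs(item[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)

def simulate_screen(descriptors, expensive_score, known_inhibitors,
                    fraction=0.01, seed=0):
    """Score compounds in equally sized sets until all known inhibitors are scored.

    descriptors      -- one hypothetical numeric descriptor per compound
    expensive_score  -- callable standing in for docking, MD and scoring
    known_inhibitors -- set of compound indices used as the stopping criterion
    Returns (number of compounds scored, worst rank of a known inhibitor).
    """
    rng = random.Random(seed)
    n = len(descriptors)
    batch = max(1, round(fraction * n))
    unscored = set(range(n))
    scores = {}

    # Initial set: a random 1% of the library.
    selection = rng.sample(sorted(unscored), batch)
    while True:
        for i in selection:
            scores[i] = expensive_score(i)
            unscored.discard(i)
        if known_inhibitors.issubset(scores) or not unscored:
            break
        # Refit the surrogate on everything scored so far and rank the rest.
        scored = [(descriptors[i], s) for i, s in scores.items()]
        ranked = sorted(
            unscored,
            key=lambda i: nearest_neighbor_prediction(descriptors[i], scored),
            reverse=True,
        )
        selection = ranked[:batch]  # extension set of the same size

    ranking = sorted(scores, key=scores.get, reverse=True)
    worst_rank = max(ranking.index(i) + 1 for i in known_inhibitors)
    return len(scores), worst_rank

# Toy library: 300 compounds whose "true" score peaks at descriptor 0.7.
rng = random.Random(1)
descriptors = [rng.random() for _ in range(300)]
true_score = [-(x - 0.7) ** 2 for x in descriptors]
known = set(sorted(range(300), key=true_score.__getitem__)[-4:])

n_screened, worst_rank = simulate_screen(descriptors, true_score.__getitem__, known)
print(n_screened, worst_rank)
```

The two returned numbers correspond to the two objectives named above: the number of compounds scored is a proxy for computational cost, and the worst rank among the known inhibitors is a proxy for the number of experimental tests.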
Fig. 2. a Schematic representation of our active learning approach. b–d Correlation between experimental drug efficacies (as measured by maximal response in the biochemical assay, x-axis) and different virtual screening scores (y-axis): mean normalized Vinardo docking score, h-score of the docking pose and dynamic h-score (averaged over the MD trajectory), respectively. Green and pink denote correctly and incorrectly classified compounds, respectively. Squares mark potentially reactive compounds, circles non-reactive ones. e Correlation between experimental (true) binding affinities and predicted binding affinities for trypsin-domain proteins in PDBbind version 2020.
Target-specific score enables hit discovery
We introduce an empirical score tailored to measure target inhibition, aiming to mitigate the inaccuracy of docking scores. An effective TMPRSS2 inhibitor (Fig. 1b) either physically occludes its active site (non-covalent inhibitor) or forms a stable enzyme-drug complex (covalent inhibitor). Therefore, our proposed score (Eq. (1)) rewards the occlusion of the S1 pocket and the adjacent hydrophobic patch, as well as short distances for features that describe reactive and recognition states (see Target-specific scoring for details). In general, such a score can be learned (see Learned score generalizes to trypsin-domain proteins) when considering many protein targets and their corresponding inhibition data.
To evaluate our target-specific score (Fig. 2b, c), we use a dataset of compounds tested against TMPRSS2 from the NCATS OpenData Portal. Despite not being optimized for this dataset, our target-specific score computed from docking poses (static h-score, sensitivity of 0.5) outperforms the docking score (sensitivity of 0.38), emerging as a better model for serine protease inhibition. The number of false positives is likely underestimated as only the top 50 hits from the dataset were used and, therefore, we do not compute specificity values here.
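For reference, sensitivity is the fraction of experimentally active compounds that a score classifies as active, TP / (TP + FN). The sketch below shows the calculation on made-up labels; the data and the function name are hypothetical, and the classification threshold used in this work is not reproduced here.

```python
def sensitivity(is_active, predicted_active):
    """Fraction of true actives that are also predicted active: TP / (TP + FN)."""
    pairs = list(zip(is_active, predicted_active))
    true_pos = sum(1 for a, p in pairs if a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    return true_pos / (true_pos + false_neg)

# Hypothetical example: 8 active compounds, 4 of which the score recovers.
is_active = [True] * 8 + [False] * 2
predicted = [True, True, True, True, False, False, False, False, True, False]
print(sensitivity(is_active, predicted))  # 0.5
```

Sensitivity depends only on the active compounds, which is why it can still be reported when the number of false positives is uncertain.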
Next, we compare the effectiveness of the docking score and our static h-score in the active learning cycle on the DrugBank library (Table 1, rows 1 and 2). We dock candidates to each of the 20 structures in our receptor ensemble and score the resulting docking poses. On average, the docking score requires computationally screening 2755.2 compounds (simulation time 15,612.8 h) compared to only 262.4 compounds (simulation time 1486.9 h) using the static h-score for ranking candidates. More importantly, using the docking score the four known inhibitors appear within the top 1299.4, whereas they are in the top 5.6 positions using the target-specific score, resulting in a more than 200-fold reduction in the number of compounds that need to be experimentally screened.
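The fold changes quoted above follow directly from the reported averages. The short check below only re-derives them from the numbers in the text; the dictionary keys are illustrative names.

```python
# Averages reported in the text for the DrugBank simulated screen.
docking_score = {"screened": 2755.2, "hours": 15612.8, "top_positions": 1299.4}
static_h_score = {"screened": 262.4, "hours": 1486.9, "top_positions": 5.6}

fold = {key: docking_score[key] / static_h_score[key] for key in docking_score}
for key, value in fold.items():
    print(f"{key}: {value:.1f}-fold")
```

The ratio of top positions (about 232) is the "more than 200-fold" reduction in experimental tests, while the ratios of screened compounds and of simulation time are both about 10.5.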
We examine structures where the docking score and the static h-score substantially differ (Fig. 3a) and find that the h-score effectively captures important structural features, distinguishing good inhibitors far more accurately than the docking score. The latter either overestimates (e.g. nafamostat and otamixaban in Fig. 3a) or underestimates (e.g. camostat and gabexate in Fig. 3a) the inhibitory effect of a candidate.
Fig. 3. a Examples of docking poses of known inhibitors misclassified by docking score with respective static h-score and normalized docking score. Scores are color coded as green for good, blue for intermediate and orange for bad. b, c Examples of docking poses of known inhibitors for good (MD-generated) vs bad (homology model) receptor structure with respective static and dynamic h-scores.
Learned score generalizes to trypsin-domain proteins
We hypothesize that our target-specific scoring method can be extended to other trypsin-domain proteins by learning a score based on the individual observables used in the h-score. Instead of relying on an empirical formulation that aggregates or selects specific features, we adopt a data-driven approach, training a model to predict binding affinities for proteins containing the trypsin domain using ∆SASA values and distances from the ligand for each residue in the S1 pocket and hydrophobic patch.
To evaluate this hypothesis, we select a subset of the PDBbind database containing experimental structures of trypsin-domain proteins bound to ligands, along with their binding affinity data. A simple random forest regressor, trained on the aforementioned observables for this subset, achieves a correlation of 0.80 between true and predicted binding affinities on the test set (Fig. 2e), demonstrating the learned score's ability to generalize to proteins containing the trypsin domain.
Feature importance analysis (Supplementary Fig. S1) identifies ∆SASA of the residue at the S1 pocket entrance (residue Trp461 in TMPRSS2) and the catalytic histidine as key predictive features, consistent with the expectation that strong binders shield this region. Additionally, the distance to the residue opposite the S1 pocket (residue Lys300 in TMPRSS2) emerges as an important factor, supporting the idea that potent inhibitors extend toward this patch.
We recommend using the learned score for future investigations of trypsin-domain proteins. For the subsequent computational and experimental screens, we now resort to the TMPRSS2-specific score.
Molecular dynamics allow accurate candidate ranking and docking
We also conduct experiments to evaluate the significance of MD simulations within our active learning approach. MD was used in two ways: (1) to generate 10-ns simulations of protein-ligand complexes for dynamic h-scoring, totaling 100 ns per ligand and 818 µs for all 8180 ligands and (2) to generate a ≈100-µs simulation of the receptor from which 20 snapshots are used for docking (“receptor ensemble”, Fig. 1c, d).
We first examine the relevance of running MD simulations for inhibitor scoring. MD seeded from docked poses can reduce false positives/negatives by expelling a misposed ligand from the active site or relaxing a non-optimal pose. Indeed, computing the score from MD simulations (dynamic h-score) further improves the classification of TMPRSS2 inhibitors from the NCATS OpenData Portal (Fig. 2c, d), increasing sensitivity to 0.88.
How relevant is MD-based scoring for finding the four known inhibitors in DrugBank? The number of compounds requiring computational and experimental screening is similar between static and dynamic h-scoring (Table 1, rows 2 and 3), suggesting that the dynamic h-score does not provide a significant benefit in our case while doubling the computational cost. However, it increases the correlation of the known inhibitors’ rankings from 0.2 to 1.0, which indicates that dynamic scoring may be more robust and beneficial with other targets.
Next, we evaluate the effect of dynamics on the target by removing the MD-generated receptor ensemble. Instead, we dock candidates to a single homology model and rank them by their dynamic h-score. This increases the average number of computationally screened compounds to 754.4 (simulation time 829.8 h) and, more importantly, results in poor ranking of the known inhibitors (within the top 709.0 compounds). This result underscores the importance of having a receptor ensemble, which increases the likelihood of docking to binding-competent target structures.
Finally, we remove both the receptor ensemble and MD scoring, docking candidates to a single homology model and ranking them by their static h-score. This substantially increases the average number of compounds screened to 2230.4 (simulation time 631.9 h) and produces an almost useless ranking, emphasizing the critical role of MD in achieving meaningful results.
To better understand the role of MD from a structural perspective, we compare the poses of known inhibitors docked to different target structures. Docking to one of the MD-generated receptors (example of a good structure) produces high-scoring poses while docking to the homology model (example of a bad structure) results in consistently low scores, using both static and dynamic h-score (Fig. 3b, c). Moreover, MD scoring can correct docking artifacts when a candidate docks into a sub-optimal pose (e.g. camostat in Fig. 3b) or an unstable high-scoring pose (e.g. otamixaban in Fig. 3b).
Active learning vastly accelerates compound search
Even though our dynamic target-specific score is effective for ranking drug candidates, it requires expensive MD simulations, which may not be feasible for screening large compound libraries. To tackle this problem, we use the active learning cycle (see Active learning cycle).
To explore the effect of the active learning cycle on the computational burden, we simulate another virtual screen where candidates are selected randomly. Without an active learning cycle, random selection requires screening 7166.8 compounds (simulation time 99,140.7 h) and the known inhibitors appear within the top 16.6 positions on average. Therefore, using a machine learning model to select extension sets of candidates reduces the computational burden by ∼29-fold.
Active learning approach identifies inhibitors of TMPRSS2
To identify compounds with promising inhibitory properties against TMPRSS2, we examined the predictions of our active learning approach on the DrugBank after screening 10% of the whole library (Supplementary Table S1).
The best-ranked compound, DB03417, was characterized by a high h-score and showed potential upon visual inspection. This compound maps to a crystal structure (PDB ID: 1RXP21) in which its parent compound (chemical relation in Supplementary Fig. S2), known as BMS-262084, engages in covalent binding with trypsin. Leveraging this information, we moved forward with scoring BMS-262084 (h-score = 1.249) and assessing its inhibitory potential.
We experimentally evaluated the effect of BMS-262084 in a TMPRSS2 biochemical assay (Fig. 4a). The inhibitory profile of BMS-262084 (IC50 = 1.82 nM) was better than that of camostat (IC50 = 3.17 nM) and comparable to that of nafamostat (IC50 = 1.08 nM), the most potent known inhibitor of TMPRSS2.
Fig. 4. a Dose-response curves and IC50 estimates for inhibition in TMPRSS2 biochemical assay. For BMS-262084, the average (mean) ±SD of three technical replicates is shown. b IC50 as a function of pre-incubation time, estimates of IC50 at infinite time and time at which the function reaches its minimum. c, d Inhibition of live SARS-CoV-2 infection of Calu-3 cells. PFU, plaque-forming units. The average (mean) ±SD of three technical replicates is shown. Statistical significance was analyzed by two-way analysis of variance (ANOVA) with Dunnett's post hoc test. P values (for concentrations between 5 and 50,000 nM, from left to right) are as follows: AY.1 (0.9911, 0.9992, <0.0001, 0.0031, <0.0001, <0.0001, <0.0001, <0.0001, <0.0001, <0.0001) and KP.3.1.1 (0.9995, 1.0000, <0.0001, <0.0001, <0.0001, <0.0001, <0.0001, <0.0001, <0.0001, <0.0001). e–l Dose-response curves and IC50 estimates for inhibition of pseudovirus cell entry into Calu-3 cells driven by VSV-G (control, dashed lines) or S protein (solid lines) of SARS-CoV-2 lineages B.1, B.1.617.2, EG.5.1 and BA.2.86 or coronaviruses HCoV-NL63, HCoV-229E, SARS-CoV-1 and MERS-CoV, respectively. The average (mean) ±SD of three biological replicates is shown. Each biological replicate was performed with four technical replicates.
Next, we repeated the same experiment with different pre-incubation times to understand the time-dependence of inhibitor potencies. Across pre-incubation times of 1, 4 and 8 h, the three inhibitors showed similar inhibitory profiles to those of the initial experiment (Fig. 4b). However, from the 18 h pre-incubation mark onward, inhibitor potencies decreased. This decline was more pronounced for camostat and nafamostat, which converged to the same IC50 after 48 h of pre-incubation. In contrast, BMS-262084 showed a 5-fold lower IC50 for the same pre-incubation duration.
We also applied our active learning approach to the NCATS in-house library containing ∼145,000 compounds. In the first round, we used the docking score to select 1100 compounds for experimental validation, while in the second round, we used the dynamic h-score to select an additional 500.
Experimental validation of our predictions on the NCATS in-house library revealed 33 compounds with a maximum response below −40% (molecular graphs in Supplementary Figs. S3 and S4). Among these, otamixaban (IC50 = 0.79 µM), dabigatran ethyl ester (IC50 = 2.24 µM) and two more compounds (IC50 = 8.91 µM for both) exhibited an IC50 below 10 µM, representing promising scaffolds for further optimization.
BMS-262084 blocks coronavirus entry into Calu-3 cells
We investigated the impact of our most potent inhibitor, BMS-262084, on cell entry of live SARS-CoV-2 and pseudovirus particles bearing coronavirus S proteins into Calu-3 human lung cells (TMPRSS2-positive).
We first analyzed the inhibitory effect of BMS-262084 in the context of the live SARS-CoV-2 virus (Supplementary Fig. S5). Preincubation of Calu-3 cells with BMS-262084 at noncytotoxic concentrations (Supplementary Fig. S6) strongly inhibited the relative (compared to no inhibitor) infectivity of SARS-CoV-2 with an IC50 of 0.51 µM, as evidenced by a reduction in SARS-CoV-2 nucleoprotein signals in infected cells at 24 h postinoculation (Supplementary Fig. S7). We also conducted live-virus inhibition experiments comparing BMS-262084 with camostat (Fig. 4c, d and Supplementary Fig. S8). BMS-262084 showed greater efficacy against both AY.1 (Delta) and KP.3.1.1 (recent Omicron sublineage), with IC50 values of 8.66 nM and 8.03 nM, respectively, making it more potent than camostat (22.05 nM and 38.30 nM) in blocking infection of Calu-3 lung cells.
Next, we preincubated Calu-3 cells with different concentrations of BMS-262084 or camostat before adding pseudoviruses carrying diverse coronavirus S proteins. This included S proteins of four SARS-CoV-2 lineages: B.1 (early pandemic), B.1.617.2 (Delta variant), EG.5.1 (XBB sublineage of the Omicron variant, circulating in 2023) and BA.2.86 (Omicron subvariant, dominant lineage in 2024). In addition, we analyzed S proteins of four additional coronaviruses that can infect humans: HCoV-NL63 and HCoV-229E, which are seasonal coronaviruses causing the common cold, as well as the zoonotic coronaviruses SARS-CoV-1 and MERS-CoV, which can cause life-threatening disease in humans.
For Calu-3 cell entry of particles bearing either B.1-S, B.1.617.2-S or BA.2.86-S, similar inhibition profiles were observed for the two inhibitors (Fig. 4e, f, h), with strong inhibition by BMS-262084 (IC50 = 24.47 nM, IC50 = 52.51 nM and IC50 = 48.18 nM for B.1, B.1.617.2 and BA.2.86, respectively) and by camostat (IC50 = 190.10 nM, IC50 = 190.67 nM and IC50 = 148.47 nM for B.1, B.1.617.2 and BA.2.86, respectively). Inhibition of EG.5.1-S protein-driven Calu-3 cell entry by BMS-262084 (IC50 = 1.20 µM) or camostat (IC50 = 8.56 µM) (Fig. 4g) was ∼20–50-fold less efficient compared to particles bearing B.1-S, B.1.617.2-S or BA.2.86-S. Even though both BMS-262084 and camostat showed a robust inhibition profile, BMS-262084 was ∼3–8-fold more potent than camostat. In addition, for inhibiting Calu-3 cell entry of particles bearing either NL63-S, 229E-S, SARS-1-S or MERS-S, BMS-262084 was consistently ∼2-fold more potent than camostat (Fig. 4i–l).
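The ∼3–8-fold potency advantage for the SARS-CoV-2 lineages can be re-derived from the IC50 values quoted above. The snippet below is only an arithmetic check on those reported numbers (all converted to nM); variable names are illustrative.

```python
# (BMS-262084 IC50, camostat IC50) in nM, as quoted in the text.
ic50_nM = {
    "B.1": (24.47, 190.10),
    "B.1.617.2": (52.51, 190.67),
    "EG.5.1": (1200.0, 8560.0),  # 1.20 µM and 8.56 µM
    "BA.2.86": (48.18, 148.47),
}

potency_ratio = {
    lineage: camostat / bms for lineage, (bms, camostat) in ic50_nM.items()
}
for lineage, ratio in potency_ratio.items():
    print(f"{lineage}: camostat/BMS-262084 IC50 ratio = {ratio:.1f}")
```

The ratios fall between roughly 3.1 (BA.2.86) and 7.8 (B.1), consistent with the stated range.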
Structural basis of TMPRSS2 inhibition by BMS-262084
Based on the general mechanism of action of β-lactam inhibitors, we propose a mechanism of action for BMS-262084, which includes an initial binding step and two subsequent reaction steps (Fig. 5a). Upon binding of BMS-262084 to TMPRSS2, a non-covalent substrate enzyme complex is formed. In the first reaction step (acylation), the catalytic histidine (His296) deprotonates the catalytic serine (Ser441), which attacks the carbonyl center of the β-lactam ring to establish an acyl-enzyme intermediate. In the second reaction step (hydrolysis), the catalytic histidine activates an incoming water molecule, which attacks the acyl-enzyme intermediate, releasing the hydrolyzed substrate and reinstating TMPRSS2 in its active form.
Using Markov state modeling, we analyze 40 MD simulations of 1 µs each and identify three metastable states of BMS-262084 binding to TMPRSS2 (Supplementary Fig. S9). In all these, BMS-262084’s head binds to the S1 pocket of TMPRSS2. The three states differ in the position of the β-lactam ring and, consequently, in the orientation of BMS-262084’s tail, which either points toward the hydrophobic patch (state 1) or away from it (states 2 and 3).
BMS-262084 binds to TMPRSS2 (Fig. 5b), with its guanidinobutane group forming a typical salt bridge to the aspartate at the bottom of the S1 pocket (Asp435). The strong interaction between the positively charged guanidine moiety and the negatively charged carboxylate of the aspartate is a known recognition mechanism, also exploited by nafamostat and camostat, both featuring a guanidinobenzoyl head instead of the BMS-262084’s guanidinobutane.
The first metastable state (Fig. 5b) is characterized by the β-lactam ring of BMS-262084 positioned atop the catalytic serine (Ser441). We consider this conformation reactive when the β-lactam ring’s carbonyl center is sufficiently close to the oxygen of the serine, enabling a suitable configuration for a nucleophilic attack. In our simulations, this reactive configuration is rarely observed. Opposite the S1 pocket, BMS-262084’s formylpiperazine extends over the Cys281-Cys297 disulfide bridge, with its hydrophobic N-tert-butylformamide interacting with Val280 from the hydrophobic patch of TMPRSS2. This conformation resembles the crystal structure of BMS-262084 in a covalent complex with bovine trypsin (Supplementary Fig. S10). Alternatively, the tail of BMS-262084 can be found in contact with the catalytic histidine (His296).
Discussion
Targeting TMPRSS2 is significant for developing therapeutics against respiratory viruses, given its integral role in facilitating the cell entry of influenza A11, coronaviruses12 and certain paramyxoviruses (i.e. parainfluenza virus)22 and pneumoviruses (i.e. metapneumovirus)23. Here, we developed an active learning approach for drug screening and applied it to TMPRSS2 inhibition, discovering BMS-262084 as a potent inhibitor.
We introduced a simple target-specific score that addressed the docking score limitations. Remarkably, the h-score was capable of identifying at least three active compounds (nafamostat, camostat and DB03417/BMS-262084) within the top 5 of the entire DrugBank library, showcasing its usefulness in hit discovery.
The static h-score demonstrated clear advantages over the conventional docking score, offering increased accuracy with minimal computational expenses. The learned version of the static h-score generalizes effectively to a broader family of trypsin-domain proteins, providing a framework for hit discovery across serine protease targets that are similar to TMPRSS2.
While the dynamic h-score delivers excellent performance, its reliance on MD simulations incurs higher costs, albeit lower than those of popular free energy calculation methods24. To balance computational cost and predictive accuracy, we suggest employing the static h-score as an initial filter, followed by MD simulations and dynamic h-score calculation for a subset of the most promising compounds.
Our approach enabled a direct comparison between static and dynamic methods using the same scoring metric, allowing for a clearer evaluation of their respective strengths. Based on our insights, we offer a strategy for developing static/dynamic mechanism-informed scores and recommend investing computational resources in running MD simulations to generate receptor ensembles that better capture target flexibility. We also provide a unified scoring framework that applies to both covalent and noncovalent inhibitors, offering an effective and simpler alternative to conventional covalent docking tools.
We demonstrate the scalability of our approach by successfully applying it to the NCATS in-house library of ∼145,000 compounds. This screening effort identified otamixaban, which we further investigated in combination with nafamostat or camostat in a separate study20.
Recent works have shown that TMPRSS2 remains an important host cell factor for lung cell entry of contemporary SARS-CoV-2 lineages14,25, especially in more relevant models such as human airway and intestinal organoids15. In line with those studies, our experimental validation robustly affirmed BMS-262084’s effectiveness in inhibiting SARS-CoV-2 entry across multiple lineages, including two descendants of the Omicron variant.
Although BMS-262084 and camostat were both able to inhibit S protein-driven Calu-3 cell entry, BMS-262084 did so with superior efficiency as indicated by its ∼3–8-fold and ∼2-fold lower IC50 values for all SARS-CoV-2 lineages and coronaviruses tested, respectively. Moreover, it showed a more robust inhibitory profile at longer preincubation times. Our findings confirm BMS-262084’s relevance and encourage further research on this compound to explore its suitability as an antiviral therapeutic.
The identification of a nanomolar inhibitor through virtual screening, as presented in this study, remains a notable accomplishment in computer-aided drug discovery1. Remarkably, our method may extend beyond the specific case of TMPRSS2, presenting a valuable tool for discovering inhibitors targeting other serine proteases. In addition, it shows promise for broader applicability across various systems, provided that their underlying mechanism is understood.
Methods
Target structures
At the time this study was conducted, there was no experimentally determined structure of TMPRSS2 and we use models of the catalytic domain from ref. 26 instead. Specifically, we exploit a homology model based on Enteropeptidase-1 (PDB: 3W94) in apo and holo conformations, the latter being docked complexes of TMPRSS2 with known inhibitors camostat16 or nafamostat17. Our constructs contain residues 256–489 or residues 256–491, for apo and holo structures, respectively.
To eliminate possible artifacts of the models27 and to allow for target flexibility we conduct molecular dynamics (MD) simulations of both apo and holo structures. We run the simulations with OpenMM 7.4.028 using the CHARMM 36 force field (version from 2019)29. Further details of the setup can be found in ref. 9.
We use hidden Markov models30 on the resulting datasets to select representative target conformations. Specifically, we pick a total of 20 target structures – 10 from apo simulations, 4 from nafamostat-bound and 6 from camostat-bound simulations. For selecting structures, we use three features to account for S1-pocket entrance-loop openness, occlusion of S1-pocket by Trp461 and occlusion of Asp435 by charged residues.
Drug libraries
We obtained the publicly available TMPRSS2 enzymatic activity screen from the NCATS Open Data Portal31 (downloaded on October 29, 2020). The top 50 compounds were selected according to AUC.
We download the DrugBank database (version 5.1.6)32 and extract all small molecules that have a specified SMILES string (10,758 compounds). We then filter out: (1) salts, (2) compounds with a molecular weight above 550 Da, (3) compounds with fewer than 6 heavy atoms and (4) compounds containing transition or post-transition metals. The resulting library consists of 8918 compounds.
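The four filtering rules can be written as a single predicate. The sketch below is illustrative only: the record layout, the toy entries (whose property values are made up apart from the aspirin entry) and the use of a "." in the SMILES string as a salt indicator are assumptions, not the procedure used to build the library.

```python
# Element symbols treated as transition or post-transition metals (assumed set).
METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
    "Al", "Ga", "In", "Sn", "Tl", "Pb", "Bi",
}

def keep_compound(record):
    """Apply the four library filters to a hypothetical precomputed record."""
    if "." in record["smiles"]:       # (1) salt; assumption: multi-component SMILES
        return False
    if record["mol_weight"] > 550.0:  # (2) molecular weight above 550 Da
        return False
    if record["heavy_atoms"] < 6:     # (3) fewer than 6 heavy atoms
        return False
    if record["elements"] & METALS:   # (4) transition or post-transition metals
        return False
    return True

# Toy records; values other than the aspirin entry are invented for illustration.
library = [
    {"name": "aspirin", "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
     "mol_weight": 180.16, "heavy_atoms": 13, "elements": {"C", "H", "O"}},
    {"name": "toy salt", "smiles": "CC(=O)[O-].[Na+]",
     "mol_weight": 82.03, "heavy_atoms": 5, "elements": {"C", "H", "O", "Na"}},
    {"name": "toy small molecule", "smiles": "CCO",
     "mol_weight": 46.07, "heavy_atoms": 3, "elements": {"C", "H", "O"}},
    {"name": "toy large molecule", "smiles": "C" * 60,
     "mol_weight": 843.6, "heavy_atoms": 60, "elements": {"C", "H"}},
    {"name": "toy metal complex", "smiles": "C1CCC(CC1)[Pt]",
     "mol_weight": 278.2, "heavy_atoms": 7, "elements": {"C", "H", "Pt"}},
]

kept = [record["name"] for record in library if keep_compound(record)]
print(kept)  # ['aspirin']
```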
Several in-house libraries (Genesis, Sytravon, NPACT and NPC) of the National Center for Advancing Translational Sciences (NCATS) are readily available for high-throughput screening. These libraries together amount to a total of ∼145,000 compounds that we refer to as the NCATS in-house library.
Docking and scoring
To obtain the three-dimensional structure of each ligand for docking, we either retrieve it from the ZINC database33, in case a reference molecule is available, or we generate it from the SMILES string with the optimal ionization states at pH 7.05 using LigPrep from the Schrödinger Suite 2020-2. We prepare all receptor and ligand structures with MGLTools 1.5.634.
We dock each ligand structure against each of the 20 receptor structures using a fork of AutoDock Vina35 called smina (version 2017.11.9)36. Docking is performed in a search space of size 30 Å centered around the catalytic serine (Ser441). We set the exhaustiveness to 10 and use the Vinardo scoring function37 which is suitable for virtual screening purposes. Five residues are kept flexible throughout the docking run, namely: Glu299, Lys300, Asp435, Gln438 and Trp461. For each receptor-ligand pair, we only keep the best pose. The docked complex is finally assembled by combining: (a) the rigid part of the receptor, (b) the flexible part of the receptor associated with the best pose and (c) the best pose of the ligand.
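As an illustration, the docking settings listed above could be assembled into a smina call as follows. This is a hedged sketch: the flag names reflect our reading of the smina command-line help and should be checked against the installed version, and the file names, box centre coordinates, chain identifier and the use of a cubic 30 Å box are assumptions. The helper only builds the argument list and does not run smina.

```python
def smina_command(receptor, ligand, out, center, chain="A"):
    """Build a smina argument list for one receptor-ligand pair (not executed here)."""
    # Flexible side chains named in the text; the chain ID is an assumption.
    flexible = ",".join(f"{chain}:{resid}" for resid in (299, 300, 435, 438, 461))
    cx, cy, cz = center  # assumed to be the coordinates of the Ser441 side chain
    return [
        "smina",
        "-r", receptor,
        "-l", ligand,
        "-o", out,
        "--center_x", str(cx), "--center_y", str(cy), "--center_z", str(cz),
        "--size_x", "30", "--size_y", "30", "--size_z", "30",
        "--exhaustiveness", "10",
        "--scoring", "vinardo",
        "--flexres", flexible,
    ]

# Hypothetical file names and placeholder coordinates.
cmd = smina_command("receptor_01.pdbqt", "ligand.pdbqt", "docked.sdf", (0.0, 0.0, 0.0))
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` once the paths and box centre are replaced with real values.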
To get the score for each ligand we first normalize the raw docking scores per receptor and filter out poses in which the ligand does not form at least 2 contacts (based on heavy atom distance and a threshold of 3.5 Å) with the S1 pocket (residues 435–441 and 459–464). We then compute the mean of the normalized docking score (of retained receptor-ligand poses) for each ligand.
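The aggregation described above can be sketched as follows. The text does not state which normalization is used, so a per-receptor z-score is assumed here; the pose records, with a precomputed count of S1-pocket contacts, are a hypothetical data layout.

```python
from collections import defaultdict
from statistics import mean, pstdev

def mean_normalized_scores(poses, min_contacts=2):
    """Normalize scores per receptor, drop poses without S1 contacts, average per ligand."""
    by_receptor = defaultdict(list)
    for pose in poses:
        by_receptor[pose["receptor"]].append(pose["score"])
    stats = {rec: (mean(vals), pstdev(vals)) for rec, vals in by_receptor.items()}

    per_ligand = defaultdict(list)
    for pose in poses:
        mu, sigma = stats[pose["receptor"]]
        # Assumption: z-score normalization within each receptor.
        z = (pose["score"] - mu) / sigma if sigma > 0 else 0.0
        if pose["s1_contacts"] >= min_contacts:  # contact filter (3.5 Å, done upstream)
            per_ligand[pose["ligand"]].append(z)
    return {ligand: mean(zs) for ligand, zs in per_ligand.items()}

# Toy example: two receptors, three ligands.
poses = [
    {"ligand": "A", "receptor": "R1", "score": -8.0, "s1_contacts": 3},
    {"ligand": "B", "receptor": "R1", "score": -6.0, "s1_contacts": 1},
    {"ligand": "C", "receptor": "R1", "score": -4.0, "s1_contacts": 0},
    {"ligand": "A", "receptor": "R2", "score": -9.0, "s1_contacts": 2},
    {"ligand": "B", "receptor": "R2", "score": -7.0, "s1_contacts": 4},
    {"ligand": "C", "receptor": "R2", "score": -5.0, "s1_contacts": 0},
]
result = mean_normalized_scores(poses)
print(result)
```

Ligand C never contacts the S1 pocket and therefore receives no score, while ligand B is averaged over the single pose that passes the filter.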
Target-specific scoring
We propose a score that summarizes our insights about the mechanism of action of TMPRSS2 inhibitors9, termed the h-score:
where S1 is the S1 pocket (residues 435–441 and 459–464), H is the adjacent hydrophobic patch (residues 279–281 and 296–300), d(react) is the distance between the closest cleavable bond of the ligand and the oxygen of the catalytic serine (Ser441), d(recog) is the minimal distance between the ligand and heavy atoms of Asp435 (major substrate recognition residue at the bottom of the S1 pocket). The factor n^(−4/3) compensates for the bias towards large molecules otherwise posed by the two SASA differences.
The observables for the h-score (Eq. (1)) are computed using MDTraj 1.9.438. The difference in solvent accessible surface area upon binding (∆SASA) is computed with MDTraj’s implementation of the Shrake–Rupley algorithm39.
Possible cleavable bonds of the ligand are detected using RDKit 2020.03.4 and SMARTS patterns for the following classes of compounds: ester, phenylmethylsulfonyl fluoride (PMSF), chloromethyl ketone (CMK), aldehyde, trifluoromethyl ketone (TFK) and β-lactam. If a ligand does not contain a cleavable bond and, therefore, cannot react with the protease, we set this distance to its average value of 0.827 nm to not bias non-covalent compounds versus covalent ones.
The static h-score is computed as the mean of the top 3 equilibrated docking poses, each from a different receptor structure, to rule out docking artifacts, and the dynamic score is calculated as the mean of the top 3 trajectories, each from a different receptor structure, over all frames of the MD simulation (in 1 ns steps).
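The two aggregation rules can be sketched as below. This reflects one reading of the description, namely that each trajectory is first averaged over its frames and the three best trajectory averages are then averaged; it also assumes that a higher h-score is better and that there is one pose or trajectory per receptor. Function names and input values are hypothetical.

```python
from statistics import mean

def top_k_mean(score_by_receptor, k=3):
    """Mean of the k best scores, one score per receptor structure."""
    best = sorted(score_by_receptor.values(), reverse=True)[:k]
    return mean(best)

def static_h_score(pose_scores):
    """pose_scores maps receptor id to the h-score of the equilibrated docking pose."""
    return top_k_mean(pose_scores)

def dynamic_h_score(frame_scores):
    """frame_scores maps receptor id to the per-frame h-scores of its trajectory."""
    per_trajectory = {rec: mean(frames) for rec, frames in frame_scores.items()}
    return top_k_mean(per_trajectory)

# Toy inputs for four receptor structures.
pose_scores = {"r1": 1.0, "r2": 0.4, "r3": 0.7, "r4": 0.1}
frame_scores = {
    "r1": [1.0, 1.0],
    "r2": [0.2, 0.4],
    "r3": [0.6, 0.6],
    "r4": [0.0, 0.0],
}
print(static_h_score(pose_scores), dynamic_h_score(frame_scores))
```

Keying the inputs by receptor enforces the requirement that the three contributing poses or trajectories come from different receptor structures.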
Trypsin-domain-specific scoring
We propose a learned version of the h-score that generalizes to targets containing the trypsin domain. To construct the dataset, PDBbind version 2020 was downloaded and annotated using the InterPro database. Entries with reported binding affinities as exact Ki or Kd values and the annotation IPR001254 (Serine proteases, trypsin domain) were retained. Receptor structures were repaired using PDBFixer to replace non-standard amino acids and add missing heavy atoms, followed by removal of inactive chains (with no residues within 5 Å of the ligand).
Pairwise alignments of the receptors were generated using TM-align40 and receptors with a high mean pairwise distance (above 0.6) were excluded. The pairwise distances were used to conduct hierarchical clustering with SciPy’s average linkage method. Closely related structures were grouped and progressively aligned until all structures were unified into a single aligned group. A representative structure was selected based on the lowest sum of distances to all other receptors. We manually annotated its S1 pocket and hydrophobic patch residues considering the TMPRSS2 structure. A residue correspondence between the representative and each receptor was established using the Needleman–Wunsch algorithm41 on the positions of Cα atoms of the superposed structures. Receptors with missing S1 pocket or hydrophobic patch residues, or a mutated catalytic serine, were excluded.
We calculated ∆SASA and the distance from the ligand for each S1 pocket and hydrophobic patch residue. Complexes in which the ligand was not bound near the S1 pocket (∆SASA(S1) below 0.5) were excluded. The resulting dataset of 651 complexes was stratified by binding affinity, using 20 percentile-based bins, and split into 80% training and 20% test sets. A random forest regressor with 200 estimators was then trained on these observables to predict −log10 of the reported Ki or Kd value.
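The stratified split and regressor described above could look roughly like the sketch below. The data are synthetic stand-ins, and the feature count, random seeds, and use of scikit-learn's `train_test_split` are assumptions. Only the dataset size (651), the 20 percentile bins, the 80/20 split and the 200 estimators come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_complexes, n_features = 651, 12  # 12 features is a made-up placeholder

# Synthetic stand-ins for the per-residue observables and the affinities.
X = rng.random((n_complexes, n_features))
y = 4.0 + 5.0 * X[:, 0] + rng.normal(0.0, 0.3, n_complexes)

# Assign each complex to one of 20 percentile-based affinity bins (labels 0..19).
edges = np.percentile(y, np.linspace(0, 100, 21))
bins = np.digitize(y, edges[1:-1])

# Stratify the 80/20 split on the bin labels so both sets cover the full range.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=bins, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
r2_test = model.score(X_test, y_test)  # coefficient of determination on the test set
```

Stratifying on binned affinities keeps rare, very strong or very weak binders represented in both the training and the test set.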
Molecular dynamics simulations
We use the AMBER ff14SB force field42 for the receptor and the openff-1.1.0 small molecule force field43 for the ligand, which is parameterized from the SMILES string with the openff toolkit (version 0.7.1)44 and the openmmforcefields package (version 0.8.0).
The setup and subsequent production runs are carried out with OpenMM 7.4.0 (ref. 28) in a cubic periodic box of 7.2 nm side length with TIP3P water45 and a 0.1 mol/L NaCl ion concentration (neutral charge).
Molecular dynamics simulations are automatically seeded from the docking pose of the receptor-ligand pair. The solvent is generated for every receptor-ligand pair individually. It is equilibrated with constraints on the heavy atoms of receptor and ligand for 0.1 ns in the NVT ensemble and, subsequently, for 0.9 ns in the NPT ensemble at 310 K (physiological temperature) and 1 bar. We use a Langevin integrator with a time step of 2 fs during the equilibration phase. In the production phase, we apply hydrogen mass repartitioning46 and a 4 fs integration step with constraints on bonds involving hydrogen atoms.
Markov state modeling
We analyzed the MD simulations with PyEMMA 2.5.7 (ref. 47). As input features, we calculated inverse minimal distances between protein residues and several drug groups (guanidinobutane, beta-lactam ring, carboxylate, formylpiperazine, and tert-butylformamide). We then applied a linear VAMP48 dimension reduction with a 5 ns lag time, retaining the 8 dimensions with the highest kinetic variance. Next, we performed k-means clustering with 180 cluster centers and estimated a 4-state hidden Markov model (HMM)30 at a lag time of 1 ns, which enables the assignment of metastable states (binding modes).
Active learning cycle
We define an active learning cycle that iteratively trains a machine learning (ML) model on the already screened candidates and selects new candidates for screening. To this end, it uses a pretrained deep learning autoencoder49 to encode the SMILES of already screened candidates in a continuous latent space. These continuous and data-driven descriptor (CDDD) encodings are then used together with the associated scores (mean normalized docking scores or h-scores) to train a support vector regressor (SVR) using scikit-learn 0.22.1 (ref. 50). Once trained, this model predicts the scores on the subset of the library that has not been screened yet and selects candidates for the next round. We repeat this procedure for a predefined number of rounds. The steps for screening the initial set of candidates and the subsequent extension sets are detailed below, while the algorithm is available in the SI (Alg. S1).
Initial set screening steps are:
1. Get Morgan fingerprints for the whole library;
2. Cluster Morgan fingerprints into n_clusters clusters using the k-means algorithm with 20 initializations and a maximum of 400 iterations;
3. Get a diverse initial set of candidates by taking one representative from each cluster;
4. Score the initial set of candidates.
Extension set screening steps are:
1. Get the set of potential candidates by subtracting the screened set of candidates from the whole library;
2. Train a support vector regressor (SVR) with default parameters on CDDD encodings and scores of screened candidates;
3. Predict scores for potential candidates from their CDDD encodings;
4. Get the extension set of candidates by taking top ext_size candidates with the best predicted score;
5. Score the extension set of candidates.
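The two step lists can be combined into a compact sketch such as the one below. It is an illustration under stated assumptions and not the algorithm from the SI: random arrays stand in for Morgan fingerprints and CDDD encodings, `score` is a hypothetical oracle replacing docking or MD scoring, the cluster representative is taken as the member closest to the centroid, and lower scores are treated as better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_library, n_clusters, ext_size, n_rounds = 300, 10, 10, 3

# Synthetic stand-ins (assumptions): fingerprints, encodings and a hidden score.
fingerprints = rng.integers(0, 2, size=(n_library, 64)).astype(float)
encodings = rng.normal(size=(n_library, 16))
hidden_scores = encodings[:, 0] + 0.1 * rng.normal(size=n_library)


def score(indices):
    """Hypothetical oracle standing in for the expensive scoring step."""
    return hidden_scores[indices]


# Initial set: cluster the fingerprints and take one representative per cluster.
kmeans = KMeans(n_clusters=n_clusters, n_init=20, max_iter=400, random_state=0)
labels = kmeans.fit_predict(fingerprints)
dist_to_center = kmeans.transform(fingerprints)
initial = []
for c in range(n_clusters):
    members = np.flatnonzero(labels == c)
    if members.size:  # representative = member closest to the centroid
        initial.append(int(members[np.argmin(dist_to_center[members, c])]))

screened = list(initial)
scores = score(np.array(screened)).tolist()

# Extension sets: train, predict on the unscreened pool, score the top picks.
for _ in range(n_rounds):
    pool = np.setdiff1d(np.arange(n_library), screened)
    model = SVR().fit(encodings[screened], scores)  # default parameters
    predicted = model.predict(encodings[pool])
    extension = pool[np.argsort(predicted)[:ext_size]]  # lowest predicted first
    screened.extend(extension.tolist())
    scores.extend(score(extension).tolist())
```

After the loop, `screened` holds the indices of every compound that was scored, so the total scoring cost is the initial set plus `n_rounds * ext_size` evaluations instead of the whole library.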
TMPRSS2 biochemical assay
Inhibitors of TMPRSS2 were tested using a high-throughput activity assay. The experiment was performed in a 1536-well black plate according to the published protocol51: Boc-Gln-Ala-Arg-AMC substrate (20 nl) and inhibitor (20 nl in DMSO) were added using an ECHO 655 acoustic dispenser (LabCyte). TMPRSS2 (5 µl, 0.018 mg/ml) in assay buffer (50 mM Tris pH 8, 150 mM NaCl, 0.01% Tween20) was then dispensed using a BioRAPTR (Beckman Coulter), for a total assay volume of 5 µl.
For pre-incubation experiments, DMSO (20 nl) and inhibitor (20 nl, 250×) were added to a 1536-well black plate using an ECHO 655 acoustic dispenser (LabCyte). TMPRSS2 (5 µl, 0.018 mg/ml) in assay buffer (50 mM Tris pH 8, 150 mM NaCl, 0.01% Tween20) was then dispensed using a BioRAPTR (Beckman Coulter) to give a total reaction volume of 5 µl. Following the desired pre-incubation time of the inhibitor with TMPRSS2, the fluorogenic peptide substrate, Boc-Gln-Ala-Arg-AMC, was added at 20 nl (10 µM, 250×). The final assay conditions were 10 µM peptide and 0.018 mg/ml TMPRSS2 in assay buffer (50 mM Tris-HCl, pH 8, 150 mM NaCl, 0.01% Tween20).
After incubation at room temperature for 1 h, fluorescence was measured using a PHERAstar with excitation at 340 nm and emission at 440 nm. Raw plate reads for each titration point were normalized relative to a positive control containing no enzyme (0% activity, full inhibition) and a negative control containing DMSO-only wells (100% activity, basal activity). Data normalization was performed using GraphPad Prism (GraphPad Software, San Diego, CA).
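The two-point normalization described above amounts to a linear rescaling between the control means. A minimal sketch follows; the function name and the fluorescence values are made up for illustration, and the authors performed this step in GraphPad Prism.

```python
def percent_activity(raw, no_enzyme_mean, dmso_mean):
    """Rescale a raw read so that the no-enzyme control maps to 0% activity
    and the DMSO-only control maps to 100% activity."""
    return 100.0 * (raw - no_enzyme_mean) / (dmso_mean - no_enzyme_mean)


# Made-up fluorescence reads (arbitrary units).
no_enzyme_mean = 200.0   # positive control: full inhibition
dmso_mean = 2200.0       # negative control: basal activity
activity = percent_activity(1200.0, no_enzyme_mean, dmso_mean)
inhibition = 100.0 - activity
```

A read halfway between the two controls therefore corresponds to 50% activity, which is the level the IC50 refers to.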
Recombinant human TMPRSS2 protein expressed in yeast (human TMPRSS2 residues 106–492, N-terminal 6x His-tag) (cat. # CSB-YP023924HU) was acquired from Cusabio. The fluorogenic peptide substrate, Boc-QAR-AMC·HCl, was obtained from Bachem (cat. # I-1550).
Analysis of cell viability
Sub-confluent Calu-3 cells were incubated for 24 h in the presence of different concentrations (4-fold serial dilutions, 50,000–0.76 nM) of BMS-262084 or camostat. Cells incubated with a medium containing DMSO (solvent) served as controls. Cell viability was assessed using the CellTiter-Glo® Luminescent Cell Viability Assay (Promega) according to the manufacturer’s instructions. Data normalization was performed as follows: cell viability in the absence of inhibitor (DMSO-only samples) was set as 100% and the relative viability of cells incubated with the respective inhibitor concentrations was calculated.
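As a quick consistency check on the stated concentration range, a serial dilution can be enumerated directly. The helper name is hypothetical, and the number of dilution points (nine) is inferred from the range rather than stated in the text.

```python
def serial_dilution(top_concentration, fold, n_points):
    """Concentrations of a serial dilution, starting from the highest one."""
    return [top_concentration / fold**i for i in range(n_points)]


# 4-fold dilutions starting at 50,000 nM; nine points is an inferred assumption.
viability_series = serial_dilution(50_000.0, 4, 9)
lowest = round(viability_series[-1], 2)
```

With nine points the lowest concentration is 50,000 / 4^8 nM, which rounds to the 0.76 nM quoted above.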
Inhibition of live SARS-CoV-2 cell entry
All infection studies with authentic SARS-CoV-2 were conducted under BSL-3 conditions at the German Primate Center. For stock preparation, virus was propagated on Calu-3 (kindly provided by Stephan Ludwig) or Vero E6-TMPRSS2 cells (kindly provided by Stuart Turville). To ensure that no unwanted S protein mutations occurred during passaging, the S protein sequences of all SARS-CoV-2 lineages were confirmed by Sanger sequencing for each passage.
BMS-262084- and control-treated Calu-3 cells were infected with SARS-CoV-2 isolate hCoV-19/Germany/FI1103201/2020 (GISAID accession: EPI-ISL_463008; kindly provided by Stephan Ludwig) using a multiplicity of infection (MOI) of 1 and fixed after 24 h. Next, the infected cells were stained with SARS-CoV-2 nucleoprotein-specific (Sino Biologicals, 40143-R019) and Alexa-488-conjugated secondary antibodies (Thermo Fisher Scientific, A-21467) and the infection was analyzed by fluorescence microscopy (nuclei of the cells were stained with DAPI). The same experiment was run with a GFP-expressing vesicular stomatitis virus (VSV, MOI = 0.1)52, which does not depend on TMPRSS2 for cell entry and is hence not affected by BMS-262084. Finally, the relative infectivity was quantified (normalized to samples without inhibitor = 100% infection) based on the fluorescence intensities in the green channel of the microscopic images, using the ImageJ software53.
Confirmatory experiments were conducted with BMS-262084 and camostat using a plaque assay for virus quantification. A pre-Omicron variant (Delta, AY.1; obtained from Andrew S. Pekosz through BEI Resources, NIAID, NIH; Catalog No. NR-55691) and a recent Omicron sub-lineage (KP.3.1.1) were included. Calu-3 cells were pretreated with different concentrations of BMS-262084 or camostat (diluted in culture medium) or mock-treated. After an incubation period of 2 h at 37 °C, the culture supernatant was aspirated and cells were inoculated with 5000 plaque-forming units of SARS-CoV-2 Delta or Omicron variants or VSV (diluted in medium containing the respective compound concentration). After an incubation period of 1 h at 37 °C, the inoculum was removed and cells were washed twice with PBS before they received medium containing fresh compound at the desired concentration. At 48 h post-inoculation, culture supernatants were collected and viral titers were determined by plaque titration using the following protocol. Vero E6-TMPRSS2 cells were seeded in 48-well plates and incubated until they reached confluence. Then, the medium was aspirated and cells were incubated for 1 h with ten-fold serial dilutions of supernatant. Next, the inoculum was removed and cells were washed twice with PBS before overlay medium (culture medium containing 1% w/v methylcellulose; Sigma-Aldrich, Catalog No. M0512) was added. The cells were incubated for 36 h (VSV), 72 h (SARS-CoV-2 Delta variant AY.1) or 96 h (SARS-CoV-2 Omicron variant KP.3.1.1) until plaques had formed. For plaque counting, the overlay medium was aspirated and cells were washed twice with PBS and fixed with 4% paraformaldehyde solution (ROTI Histofix, Carl Roth; Catalog No. P087.5) for 1 h at room temperature. Following removal of the paraformaldehyde solution, the cells were stained with crystal violet solution (0.5% crystal violet w/v, 20% ethanol [96%], 79.5% deionized water) for 30 min at room temperature, washed three times with PBS, air-dried and analyzed using a ZEISS Axio Vert.A1 light microscope (equipped with a ZEISS A-Plan 2.5×/0.06 M27 objective).
Inhibition of pseudovirus cell entry
Pseudoviruses bearing VSV-G (control) or the S protein of human coronavirus (HCoV) NL63 (ref. 54), HCoV-229E (ref. 54), SARS-CoV-1 (ref. 55), MERS-CoV (ref. 56), or SARS-CoV-2 lineages B.1 (early pandemic; ref. 57), B.1.617.2 (Delta variant; ref. 58), EG.5.1 (XBB-sublineage, currently circulating; ref. 59), or BA.2.86 (Omicron subvariant; ref. 25) were produced according to a published protocol (ref. 60) and inoculated onto Calu-3 cells, which were preincubated (1 h, 37 °C) with different concentrations (10-fold serial dilutions, 25,000–0.25 nM) of BMS-262084 or camostat. Cells incubated with a medium containing DMSO (solvent) served as controls. At 16–18 h postinoculation, pseudovirus cell entry was analyzed by measurement of the activity of virus-encoded firefly luciferase in cell lysates. Data normalization was performed as follows: cell entry in the absence of inhibitor (DMSO-only samples) was set as 0% inhibition and the relative inhibition of cell entry by the respective inhibitor concentrations was calculated.
Dose-response curve fitting
To determine the inhibitor concentration that causes 50% inhibition of TMPRSS2 activity or (pseudo)virus entry (IC50), we utilize the four-parameter log-logistic model with variable slope. The model is characterized by the following formula:
where x is the concentration of the inhibitor, b is the Hill slope, c is the lower limit (set to 0), d is the upper limit (set to 100) and e is the IC50 value. We employed the curve fitting algorithm available in the SciPy package (version 1.10.1)61 to derive the Hill slope and the IC50 value, along with their corresponding error estimates.
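A fit of this kind can be sketched with SciPy's `curve_fit`. The sketch assumes one common parameterization of the four-parameter log-logistic model, y = c + (d − c) / (1 + exp(b (ln x − ln e))), in which a curve rising from c to d has a negative b. The authors' exact sign convention, starting values and bounds are not given here, so those choices are assumptions, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

LOWER, UPPER = 0.0, 100.0  # c and d are held fixed, as stated in the text


def log_logistic(x, b, e):
    """Four-parameter log-logistic curve with fixed limits c and d.

    b is the Hill slope and e is the IC50 (assumed parameterization).
    """
    return LOWER + (UPPER - LOWER) / (1.0 + np.exp(b * (np.log(x) - np.log(e))))


# Synthetic, made-up data: 4-fold dilutions with a known IC50 of 20 nM.
rng = np.random.default_rng(0)
conc = 50_000.0 / 4.0 ** np.arange(9)
inhibition = log_logistic(conc, -1.2, 20.0) + rng.normal(0.0, 0.5, conc.size)

# Fit only the Hill slope and the IC50; bounds keep the IC50 positive.
popt, pcov = curve_fit(
    log_logistic,
    conc,
    inhibition,
    p0=[-1.0, 100.0],
    bounds=([-10.0, 1e-3], [10.0, 1e6]),
)
hill_slope, ic50 = popt
hill_slope_err, ic50_err = np.sqrt(np.diag(pcov))  # one-sigma error estimates
```

The square roots of the diagonal of the covariance matrix give the error estimates for the Hill slope and the IC50 under the usual least-squares assumptions.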
Pre-incubation time vs IC50 modeling
To model the relationship between the pre-incubation time and the IC50, we use the following equation inspired by the Morse potential:
where t is the time, a is the width of the well, D is the IC50 at infinite time (depth of the well) and te is the time at which the IC50 reaches its minimum. Again, we made use of SciPy’s curve fitting method to obtain parameters a, D and te, with their respective error estimates.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The molecular simulations and the associated scores generated in this study have been deposited in the Zenodo database under accession code 15508214 (ref. 62). Additional computational and experimental data generated in this study are provided in the Supplementary Information file. Source data are provided with this paper.
Code availability
Our protocol is open-sourced under the MIT license at https://github.com/noegroup/tmprss2_structures/tree/master/scripts (ref. 63).
References
Graff, D. E., Shakhnovich, E. I. & Coley, C. W. Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem. Sci. 12, 7866–7881 (2021).
Schaller, D. et al. Next generation 3D pharmacophore modeling. WIREs Comput. Mol. Sci. 10, e1468 (2020).
Fan, J., Fu, A. & Zhang, L. Progress in molecular docking. Quant. Biol. 7, 83–89 (2019).
Menchon, G., Maveyraud, L. & Czaplicki, G. in Methods in Molecular Biology, 145–178 (Springer, 2018).
Gorgulla, C. Recent developments in ultralarge and structure-based virtual screening approaches. Ann. Rev. Biomed. Data Sci. 6, 229–258 (2023).
Chen, Y.-C. Beware of docking! Trends Pharmacological Sci. 36, 78–95 (2015).
Plattner, N. & Noé, F. Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and Markov models. Nat. Commun. 6, 7653 (2015).
Miller, E. B. et al. Reliable and accurate solution to the induced fit docking problem for protein-ligand binding. J. Chem. Theory Comput. 17, 2630–2639 (2021).
Hempel, T. et al. Molecular mechanism of inhibiting the SARS-CoV-2 cell entry facilitator TMPRSS2 with camostat and nafamostat. Chem. Sci. 12, 983–992 (2021).
Lucas, J. M. et al. The androgen-regulated protease TMPRSS2 activates a proteolytic cascade involving components of the tumor microenvironment and promotes prostate cancer metastasis. Cancer Discov. 4, 1310–1325 (2014).
Hatesuer, B. et al. Tmprss2 is essential for influenza H1N1 virus pathogenesis in mice. PLoS Pathog. 9, e1003774 (2013).
Iwata-Yoshikawa, N. et al. TMPRSS2 contributes to virus spread and immunopathology in the airways of murine models after coronavirus infection. J. Virol. 93, e01815-18 (2019).
Hoffmann, M. et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell 181, 271–280.e8 (2020).
Metzdorf, K. et al. TMPRSS2 is essential for SARS-CoV-2 Beta and Omicron infection. Viruses 15, 271 (2023).
Mykytyn, A. Z. et al. SARS-CoV-2 Omicron entry is type II transmembrane serine protease-mediated in human airway and intestinal organoid models. J. Virol. 97, e00851–23 (2023).
Hoffmann, M. et al. Camostat mesylate inhibits SARS-CoV-2 activation by TMPRSS2-related proteases and its metabolite GBPA exerts antiviral activity. EBioMedicine 65, 103255 (2021).
Hoffmann, M. et al. Nafamostat mesylate blocks activation of SARS-CoV-2: new treatment option for COVID-19. Antimicrob. Agents Chemother. 64, e00754-20 (2020).
Azouz, N. P. et al. Alpha 1 antitrypsin is an inhibitor of the SARS-CoV-2–priming protease TMPRSS2. Pathog. Immun. 6, 55–74 (2021).
Shapira, T. et al. A TMPRSS2 inhibitor acts as a pan-SARS-CoV-2 prophylactic and therapeutic. Nature 605, 340–348 (2022).
Hempel, T. et al. Synergistic inhibition of SARS-CoV-2 cell entry by otamixaban and covalent protease inhibitors: pre-clinical assessment of pharmacological and molecular properties. Chem. Sci. 12, 12600–12609 (2021).
Sutton, J. C. et al. Solid-phase synthesis and SAR of 4-carboxy-2-azetidinone mechanism-based tryptase inhibitors. Bioorg. Med. Chem. Lett. 14, 2233–2239 (2004).
Abe, M. et al. TMPRSS2 is an activating protease for respiratory parainfluenza viruses. J. Virol. 87, 11930–11935 (2013).
Shirogane, Y. et al. Efficient multiplication of human metapneumovirus in Vero cells expressing the transmembrane serine protease TMPRSS2. J. Virol. 82, 8942–8946 (2008).
King, E., Aitchison, E., Li, H. & Luo, R. Recent developments in free energy calculations for drug discovery. Front. Mol. Biosci. 8, 712085 (2021).
Zhang, L. et al. SARS-CoV-2 BA.2.86 enters lung cells and evades neutralizing antibodies with high efficiency. Cell 187, 596–608.e17 (2024).
Rensi, S. et al. Homology modeling of TMPRSS2 yields candidate drugs that may inhibit entry of SARS-CoV-2 into human cells. Preprint at https://doi.org/10.26434/chemrxiv.12009582 (2020).
Raval, A., Piana, S., Eastwood, M. P., Dror, R. O. & Shaw, D. E. Refinement of protein structure homology models via long, all-atom molecular dynamics simulations. Proteins Struct. Funct. Bioinforma. 80, 2071–2079 (2012).
Eastman, P. et al. OpenMM 7: rapid development of high performance algorithms for molecular dynamics. PLOS Computational Biol. 13, e1005659 (2017).
Best, R. B. et al. Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone ϕ, ψ and side-chain χ 1 and χ 2 dihedral angles. J. Chem. Theory Comput. 8, 3257–3273 (2012).
Noé, F., Wu, H., Prinz, J.-H. & Plattner, N. Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys. 139, 184114 (2013).
Brimacombe, K. R. et al. An Open-Data portal to share COVID-19 drug repurposing data in real time. Preprint at https://doi.org/10.1101/2020.06.04.135046 (2020).
Wishart, D. S. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006).
Sterling, T. & Irwin, J. J. ZINC 15 ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).
Morris, G. M. et al. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Computational Chem. 30, 2785–2791 (2009).
Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Computational Chem. 31, 455–461 (2009).
Koes, D. R., Baumgartner, M. P. & Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 53, 1893–1904 (2013).
Quiroga, R. & Villarreal, M. A. Vinardo: a scoring function based on autodock vina improves scoring, docking, and virtual screening. PLOS One 11, e0155183 (2016).
McGibbon, R. T. et al. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 109, 1528–1532 (2015).
Shrake, A. & Rupley, J. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 79, 351–371 (1973).
Zhang, Y. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
Maier, J. A. et al. ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713 (2015).
Qiu, Y. et al. Development and benchmarking of open force field v1.0.0, the Parsley small-molecule force field. J. Chem. Theory Comput. 17, 6262–6280 (2021).
Mobley, D. L. et al. Escaping atom types in force fields using direct chemical perception. J. Chem. Theory Comput. 14, 6076–6092 (2018).
Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983).
Hopkins, C. W., Le Grand, S., Walker, R. C. & Roitberg, A. E. Long-time-step molecular dynamics through hydrogen mass repartitioning. J. Chem. Theory Comput. 11, 1864–1874 (2015).
Scherer, M. K. et al. PyEMMA 2: a software package for estimation, validation, and analysis of Markov models. J. Chem. Theory Comput. 11, 5525–5542 (2015).
Wu, H. & Noé, F. Variational approach for learning Markov processes from time series data. J. Nonlinear Sci. 30, 23–66 (2020).
Winter, R., Montanari, F., Noé, F. & Clevert, D.-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Shrimp, J. H. et al. An enzymatic TMPRSS2 assay for assessment of clinical candidates and discovery of inhibitors as potential treatment of COVID-19. ACS Pharmacol. Transl. Sci. 3, 997–1007 (2020).
Brinkmann, C. et al. The glycoprotein of vesicular stomatitis virus promotes release of virus-like particles from tetherin-positive cells. PLOS One 12, e0189073 (2017).
Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).
Hofmann, H. et al. Human coronavirus NL63 employs the severe acute respiratory syndrome coronavirus receptor for cellular entry. Proc. Natl Acad. Sci. 102, 7988–7993 (2005).
Hoffmann, M. et al. Differential sensitivity of bat cells to infection by enveloped RNA viruses: coronaviruses, paramyxoviruses, filoviruses, and influenza viruses. PLOS One 8, e72942 (2013).
Gierer, S. et al. The spike protein of the emerging betacoronavirus EMC uses a novel coronavirus receptor for entry, can be activated by TMPRSS2, and is targeted by neutralizing antibodies. J. Virol. 87, 5502–5511 (2013).
Hoffmann, M. et al. SARS-CoV-2 mutations acquired in mink reduce antibody-mediated neutralization. Cell Rep. 35, 109017 (2021).
Arora, P. et al. B.1.617.2 enters and fuses lung cells with increased efficiency and evades antibodies induced by infection and vaccination. Cell Rep. 37, 109825 (2021).
Zhang, L. et al. Neutralisation sensitivity of SARS-CoV-2 lineages EG.5.1 and XBB.2.3. Lancet Infect. Dis. 23, e391–e392 (2023).
Kleine-Weber, H. et al. Mutations in the spike protein of middle east respiratory syndrome coronavirus transmitted in Korea increase resistance to antibody-mediated neutralization. J. Virol. 93, https://doi.org/10.1128/jvi.01381-18 (2019).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Elez, K. et al. Simulations and active learning enable efficient identification of an experimentally-validated broad coronavirus inhibitor. Zenodo https://doi.org/10.5281/zenodo.15508214 (2025).
Elez, K. et al. Simulations and active learning enable efficient identification of an experimentally-validated broad coronavirus inhibitor. GitHub https://doi.org/10.5281/zenodo.15505754 (2025).
Acknowledgements
K.E. and T.H. acknowledge the financial support of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - SFB 1114, project C03 and SFB/TRR 186, project A12. F.N. acknowledges funding by the European Commission (ERC CoG 772230 “ScaleCell”), the Berlin Mathematics Research Center MATH+ (AA1-10) and the Berlin Institute for the Foundations of Learning and Data (BIFOLD). L.R. acknowledges funding by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 897414. R.W. and T.L. acknowledge Bayer AG’s PhD scholarships. This work was also supported by the National Center for Advancing Translational Sciences, Division of Preclinical Innovation. S.P. acknowledges funding by the EU project UNDINE (grant agreement number 101057100), the COVID-19-Research Network Lower Saxony (COFONI) through funding from the Ministry of Science and Culture of Lower Saxony in Germany (14-76103-184, projects 7FF22, 6FF22, 10FF22) and the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG; PO 716/11-1). We thank Simon Olsson (Chalmers University) and Moritz Hoffmann (Ethereum Foundation). The following reagent was obtained through BEI Resources, NIAID, NIH: SARS-Related Coronavirus 2, Isolate hCoV-19/USA/CA-VRLC086/2021 (Delta Variant), NR-55691, contributed by Andrew S. Pekosz.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
F.N., M.D.H., M.H. and S.P. designed research. K.E., T.H., L.R. and F.N. conducted & analyzed computational experiments. J.H.S. and M.D.H. conducted TMPRSS2 biochemical assay experiments. N.M., C.R., S.P. and M.H. conducted cell entry experiments. R.W. and T.L. conducted preliminary active learning experiments. K.E., T.H., J.H.S., N.M., L.R., C.R., S.P., M.H., M.D.H. and F.N. analyzed experimental results. K.E., T.H., M.H., L.R. and F.N. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Gi Uk Jeong, Yu Kang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Elez, K., Hempel, T., Shrimp, J.H. et al. Simulations and active learning enable efficient identification of an experimentally-validated broad coronavirus inhibitor. Nat Commun 16, 6949 (2025). https://doi.org/10.1038/s41467-025-62139-5
DOI: https://doi.org/10.1038/s41467-025-62139-5