Abstract
Understanding the dynamic regulation mechanisms of RNA is crucial for unlocking the secrets of biological processes and diseases. Existing computational methods primarily focus on static structures and give insufficient attention to RNA. Here, we develop a pipeline based on physics-informed machine learning to uncover regulatory mechanisms and manage the function of RNA complexes. We applied this pipeline to investigate the latent regulatory networks with key regions in response to regulation and detect binding sites underlying the P-TEFb/Tat/TAR system involved in HIV-1 transcriptional activation and the aaRS/tRNA system essential for genetic translation. Additionally, we search for potential small-molecule inhibitors at these sites. By integrating neural relational inference with network models, the pipeline demonstrates high performance, as evidenced by experimental studies, outperforming existing state-of-the-art tools. This pipeline serves as a paradigm for studying regulatory mechanisms in macromolecular RNA complexes, showing the insights provided by physics-informed machine learning in revealing dynamic regulatory processes.

Introduction
The functional regulation of biomolecules is crucial in accelerating various biological processes and is closely linked to their dynamic structural behavior1,2,3,4,5,6. However, the experimental validation of these regulatory processes through wet lab techniques remains costly7,8. Computational approaches offer an alternative, but they still face limitations in characterizing the dynamic structural conformations of RNA, which play a critical role in the functional regulation of biomolecules9,10. Additionally, existing methods cannot effectively identify and analyze potential functional regulatory sites in biomolecules, such as those found in RNA.
For instance, while substantial progress has been made in HIV therapeutics, the current antiretroviral regimens face notable limitations, especially the lack of FDA-approved drugs that can achieve complete eradication of the infection11,12. This issue is specifically related to transcriptional regulation and the identification of functional regulatory sites in biomolecules. This gap in treatment highlights the urgent need for fundamental research on regulation mechanisms, which could aid in developing inhibitors that target crucial macromolecular complexes, specifically viral RNA.
Molecular dynamics (MD) simulations have emerged as powerful tools for investigating dynamic regulatory mechanisms, providing quantitative atomic-level resolution of chemical interactions, thereby significantly enhancing the interpretation of experimental data with the help of classical mechanics13,14,15. However, classical MD simulations require further analysis to capture rare mechanisms accurately. Recent theoretical advancements utilizing machine learning (ML) techniques16,17,18, especially highlighted by AlphaFold19,20, have greatly improved the ability to predict static structures of both monomers and complexes. The ML techniques have emerged as a promising alternative and have opened new opportunities in MD simulation21. However, current research mainly focuses on improving computational efficiency, simulation scale, and accuracy, while relatively overlooking the essential task of elucidating fundamental regulatory principles and molecular mechanisms22. Training ML methods typically demands extensive data, which is not always available for scientific issues. Instead, such methods can be trained using additional information from physical models and data (for example, displacement, velocity, and coarse-grained complex network models)21,23.
Here, we present a comprehensive computational framework ZHMolRM (RNA Mechanism) that combines physics-informed methods and network models. We utilize ML to investigate the functioning of regulatory mechanisms in MD simulations. Our approach has two main benefits: (1) it helps identify key regions that respond to regulation and identify binding sites, and (2) it explores potential small-molecule inhibitors that target these regulatory regions. By training the physics-informed model on simulation trajectories, we can assess molecular regulatory interactions and pinpoint potential regulatory RNA regions within the complexes using network-guided ML. To move therapeutic development forward, we applied ML to evaluate the binding probabilities of small-molecule inhibitors, supporting the rational design of RNA-targeted therapeutics. Overall, this framework provides a strategy for integrating MD simulations with ML to understand RNA regulatory mechanisms and facilitate RNA-targeted drug discovery.
Results
Overview of physics-informed ML paradigm
The physics-informed ML paradigm (see Fig. 1) mainly comprises the following steps. First, this approach employs a neural relational inference (NRI)-informed ML model to examine the potential interactions of structural domains during the dynamic processes of RNA complexes (see Fig. 1a, b) and identify key functional regions in combination with RNet24 (see Fig. 1c). The NRI model learns the network dynamics by minimizing the error between the reconstructed and simulated trajectories, then infers edges between nucleotides and residues as latent variables. These learned embeddings capture the key roles of essential elements in conformational transitions, providing insights into RNA regulation mechanisms. Second, the approach uses a regulatory strategy that evaluates the plausibility of potential small-molecule inhibitors, generating mechanistic insights and guiding rational therapeutic design. Specifically, we use ZHMol-RLinter25 to search for and identify candidate small-molecule inhibitors (see Fig. 1d). Together, this integration bridges fundamental understanding of RNA regulation with practical identification of candidate small-molecule inhibitors. To validate the effectiveness of this paradigm, we applied it to two representative RNA complex systems: (i) the P-TEFb/Tat/TAR system, which plays a critical role in the transcriptional activation of the HIV-1 proviral genome; (ii) the aminoacyl-tRNA synthetase (aaRS)/tRNA system, which is essential for genetic translation. Furthermore, we conducted a search to identify potential inhibitors, aiming for precise functional modulation of RNA target activity.
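For orientation, the training objective of a generic NRI model can be written as follows. This is the standard variational formulation of neural relational inference, stated here as an assumption about the model class rather than as the exact loss used in this pipeline; the symbols \(q_\phi\), \(p_\theta\), \(\mu_j^{t}\), and \(\sigma\) are introduced only for this sketch.

```latex
% Encoder q_\phi(z \mid x): distribution over latent edge types z_{ij} for each node pair
% Decoder p_\theta(x \mid z): predicts each frame from the previous frame and the sampled edges
\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big[q_\phi(z \mid x)\,\|\,p(z)\big],
\qquad
\log p_\theta(x \mid z) = -\sum_{j}\sum_{t=2}^{T}
  \frac{\lVert x_j^{t} - \mu_j^{t} \rVert^{2}}{2\sigma^{2}} + \text{const}
```

With a fixed-variance Gaussian decoder, maximizing the first term is equivalent to minimizing the squared error between the reconstructed positions \(\mu_j^{t}\) and the simulated positions \(x_j^{t}\), which matches the reconstruction criterion described above.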
a The RNA complex system with the tertiary structure. b The MD simulation and neural relational inference (NRI)-guided ML mechanism prediction. Green represents protein, yellow represents RNA in (a) and (b). c Network-informed ML for binding site prediction and d physicochemical property-based inhibitors searching. Green represents potential sites in (c).
Dynamical regulatory identification in the P-TEFb/Tat/TAR complex
The P-TEFb/Tat/TAR complex is widely regarded as a promising therapeutic target with significant potential for the development of anti-HIV strategies26. We ran five independent 200 ns MD simulations, totaling \(1\,\mu\mathrm{s}\), to systematically reveal the dynamic behavior of the P-TEFb/Tat/TAR system. We performed a backbone RMSD analysis to assess the structural stability of the P-TEFb/Tat/TAR complex. The RMSD values were averaged across the five independent trajectories. As shown in Fig. 2b, the system reached equilibrium after approximately 50 ns, as indicated by the stabilization of RMSD values. The secondary structure plot (see Supplementary Fig. 1a) shows that the overall fold is well-preserved over the 200 ns trajectory, indicating stable secondary structure elements throughout the simulation. The Rg-RMSD scatter plot (see Supplementary Fig. 1b) further supports that the system explores a compact and restricted conformational space after equilibration. Together, these analyses demonstrate that the simulations have converged.
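The replica-averaged RMSD analysis described above can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming the frames have already been superimposed on the reference; the function names and the plateau heuristic in `first_stable_frame` (window length and tolerance) are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev

def rmsd(coords, ref):
    """RMSD between two pre-aligned coordinate lists [(x, y, z), ...]."""
    sq = [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(coords, ref)]
    return math.sqrt(sum(sq) / len(sq))

def summarize_replicas(rmsd_series):
    """Per-frame mean and standard deviation across replicas.

    rmsd_series: list of equal-length RMSD time series, one per replica.
    """
    frames = list(zip(*rmsd_series))
    return [mean(f) for f in frames], [pstdev(f) for f in frames]

def first_stable_frame(avg, window=3, tol=0.1):
    """Hypothetical plateau heuristic: first index whose following `window`
    values span a range of at most `tol`. Returns None if no plateau is found."""
    for i in range(len(avg) - window + 1):
        seg = avg[i:i + window]
        if max(seg) - min(seg) <= tol:
            return i
    return None

# Toy data: two replicas, two frames each (values in arbitrary units).
avg, std = summarize_replicas([[1.0, 2.0], [3.0, 4.0]])
print(avg, std)
```

For real trajectories, an MD analysis package would normally handle the superposition and RMSD calculation; the sketch only shows how the mean curve and error bars relate to the individual replicas.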
a Schematic representation of the P-TEFb/Tat/TAR system. Blue and green represent Cdk9 and Cyclin T1, respectively, which together form P-TEFb. Purple indicates the Tat protein, and yellow indicates TAR RNA. The zoomed-in view shows the motif domains in Tat and TAR based on the secondary structure presented in (c). b The backbone root-mean-square deviation (RMSD) for the P-TEFb/Tat/TAR system. The error bars denote the standard deviation of the backbone RMSDs across five trajectories. c Division of domains in the Tat protein and TAR RNA based on secondary structure. d Distribution of learned edges from neural relational inference learning between residues/nucleotides in the Tat/TAR complex MD simulations. e Distribution of learned edges between domains, obtained by aggregating the residues/nucleotides-level learned edges from the Tat/TAR complex MD simulations. The color bar represents the learned interaction strength. f, g The interacting domains between Tat and TAR are identified and mapped from the learned interaction edges. Edge thickness indicates the interaction strength, corresponding to (e), while arrow direction indicates the directionality of a learned edge, representing the influence from the source domain to the target domain.
As demonstrated in our previous study27, the key regulatory region of the system is localized at the interaction interface between TAR and the P-TEFb/Tat complex. Specifically, TAR hijacks Tat’s tail to overcome transcriptional pausing (see Fig. 2a). To investigate the regulatory mechanism further, we applied the NRI model using MD trajectories of the Tat/TAR complex. Across all Tat/TAR ensembles sampled over 50 steps, the model accurately reconstructs the trajectories with a mean squared error (MSE) of 0.006 between the ground-truth and reconstructed RMSF (see Supplementary Fig. 2). We also derived the distribution of learned edges between residues (see Fig. 2d) and then constructed a domain interaction map (see Fig. 2e) by grouping adjacent residues/nucleotides into blocks according to their secondary structure (see Fig. 2c). The learned edges frequently occur between the Tat protein and TAR RNA domains, indicating that the Tat/TAR interactions play a crucial role in functional regulation and demonstrating strong connectivity between TAR RNA and the Tat protein (see Fig. 2f, g). Tat protein exhibits a strong directional preference towards the base-paired regions (R1), bulge loop (R2), and hairpin loop (R4) of TAR RNA, while among the three regions initially considered, only R1 and R2 exhibit this directional preference. This suggests that R1 and R2 act as critical nodes in the Tat/TAR interaction and may serve as potential regulatory targeting sites.
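The step from residue/nucleotide-level edges to a domain interaction map can be illustrated with a short sketch. It assumes the learned edges are available as a square matrix in which `edge[i][j]` is the strength of the directed edge from node `i` to node `j`, and it aggregates by averaging; the paper does not state the aggregation rule here, so the mean (and the exclusion of self-pairs) is an assumption, as are the function and domain names.

```python
def aggregate_edges(edge, domains):
    """Average directed node-level edge strengths into domain-level strengths.

    edge:    square nested list, edge[i][j] = strength of edge i -> j
    domains: dict mapping a domain name to the list of node indices it contains
    Returns a dict {(source_domain, target_domain): mean strength}.
    """
    table = {}
    for src, src_nodes in domains.items():
        for dst, dst_nodes in domains.items():
            # Skip i == j: a node has no edge to itself.
            vals = [edge[i][j] for i in src_nodes for j in dst_nodes if i != j]
            table[(src, dst)] = sum(vals) / len(vals) if vals else 0.0
    return table

# Toy 4-node system split into two hypothetical domains.
edge = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
domains = {"A": [0, 1], "B": [2, 3]}
domain_map = aggregate_edges(edge, domains)
print(domain_map[("A", "B")], domain_map[("B", "A")])  # 4.5 10.5
```

Because the source and target blocks are kept separate, the result is asymmetric, which is what allows a directional preference between two domains to be read off the map.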
We further employ the network-guided ML method, RNet, to identify the potential functional binding sites of RNA molecules. To account for RNA’s flexibility, we computed the binding probabilities of functional sites for TAR RNA based on 5000 frames derived from the P-TEFb/Tat/TAR MD simulation trajectories. As shown in Fig. 3, the curve illustrates the predicted binding probabilities averaged over all MD trajectory frames. The TAR RNA sequence displays two peaks (A7-G10 and C23-C25) that correspond to regions with high binding potential (shown in the blue shaded region of Fig. 3a). These regions precisely align with the bulge loop (R2), indicating that the bulge loop domain (nucleotides highlighted in maroon in Fig. 3b) shows the highest functional binding site probability, emphasizing its potential as a primary target for small-molecule inhibitor regulation. Additionally, we ranked the functional binding site probabilities for each nucleotide, with the top five being A7, G10, U24, U8, and A11. Among these, A7, G10, U24, and U8 are found within the loop region, while A11 is located near the loop region. These nucleotides are likely to play a crucial role in the structural dynamics and functional regulation of TAR RNA. Targeting these nucleotides could interfere with the interactions between Tat and TAR, thereby blocking HIV-1 transcriptional elongation. In fact, the multiple sequence alignment of representative HIV-1 subtypes highlights strong conservation in the U8 of the bulge loop (see Supplementary Fig. 3a). Structurally, a comparison between HIV-1 (PDB code: 6MCE) and HIV-2 (PDB code: 1AKX) TAR reveals bulge loop regions in both the secondary (see Supplementary Fig. 3b, c) and tertiary structures (see Supplementary Fig. 3d, e). Functionally, previous research indicates that nucleotides in the bulge are conserved and essential for Tat interaction.
Specifically, U8 is fully conserved across all HIV isolates and is the only base that cannot be replaced by any of the other three natural bases (A, C, and G). It forms hydrogen bonds with residues in TAR, creating a critical tertiary structure for Tat binding28. Importantly, studies have shown that methylation at U8’s N3 position disrupts high-affinity TAR binding29.
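The frame-averaged binding probabilities and the per-nucleotide ranking described above amount to a column-wise mean followed by a sort. The sketch below assumes a per-frame probability table (one row per MD frame, one column per nucleotide) has already been produced by the site predictor; the function name and the toy labels are hypothetical.

```python
from statistics import mean, pstdev

def rank_sites(frame_probs, labels, k=5):
    """Average per-frame binding probabilities and return the top-k nucleotides.

    frame_probs: list of rows, one per MD frame; row[i] is the predicted
                 binding probability of nucleotide i in that frame
    labels:      nucleotide labels in sequence order
    Returns a list of (label, mean probability, standard deviation).
    """
    columns = list(zip(*frame_probs))
    stats = [(lab, mean(col), pstdev(col)) for lab, col in zip(labels, columns)]
    # Highest mean first; ties keep sequence order because sorted() is stable.
    return sorted(stats, key=lambda s: s[1], reverse=True)[:k]

# Toy example with two frames and three nucleotides.
top = rank_sites([[0.1, 0.9, 0.5], [0.3, 0.7, 0.5]], ["G1", "U2", "C3"], k=2)
print([label for label, _, _ in top])  # ['U2', 'C3']
```

The standard deviation returned alongside each mean corresponds to the spread across frames that a shaded band around the mean curve would display.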
a The predicted probability of RNA functional binding sites along the TAR RNA sequence is determined by the network-informed ML method RNet. The blue shaded areas denote the regions with highest binding probabilities. The gray shading denotes the standard deviation across 5000 frames in the MD simulation trajectory. b The predicted binding probability is mapped onto the tertiary structure of TAR RNA, visualized with a cyan-white-maroon color bar.
Computational validation through removal of the bulge loop
Given the potential of the bulge loop region, we conducted a controlled experiment by removing this bulge loop structural element from the TAR RNA and performing MD simulations (see the “Methods” section for additional details). As illustrated in Fig. 4a, we present the P-TEFb/Tat/TAR system with the bulge loop region removed, referred to as P-TEFb/Tat/TAR-Delta. To quantitatively assess the structural stability of the P-TEFb/Tat/TAR-Delta complex, we also carried out an RMSD analysis. The RMSD values were calculated and then averaged across five independent trajectories to ensure statistical reliability. As shown in Fig. 4c, the system achieved convergence at approximately 100 ns. However, the RMSD values exhibited significantly greater fluctuations compared to those of the P-TEFb/Tat/TAR system with the bulge loop region present. To further investigate the molecular basis of these fluctuations, we conducted RMSF analysis on the backbone atoms of the P-TEFb/Tat/TAR complex. Removing the bulge loop in TAR RNA led to significant fluctuations in the nucleotides, particularly near the bulge loop region, while other areas remained relatively stable (see Supplementary Fig. 4a, b). These findings suggest that the absence of the TAR bulge loop increases structural flexibility near the Tat binding sites, potentially disrupting the Tat/TAR interface, which is essential for the complex’s functional regulation.
a Schematic representation of the P-TEFb/Tat/TAR-Delta system. Blue and green denote Cdk9 and Cyclin T1, respectively, which together constitute the P-TEFb complex. Purple represents the Tat protein, while yellow represents the TAR-Delta RNA. The zoomed-in view highlights the Tat and TAR-Delta motif domains according to the secondary structures shown in (b). b Division of domains in the Tat protein and TAR-Delta RNA based on secondary structure. c Comparative analysis of backbone root-mean-square deviation (RMSD) between the P-TEFb/Tat/TAR (PTT) system (blue) and the P-TEFb/Tat/TAR-Delta (PTTD) system (green). The error bars represent the standard deviations of the backbone RMSDs across five independent trajectories of PTT and PTTD system. d Distribution of learned edges from neural relational inference learning between residues/nucleotides in the Tat/TAR-Delta complex MD simulations. e Distribution of learned edges among domains, obtained by aggregating the residues/nucleotides-level learned edges from the Tat/TAR-Delta complex MD simulations. The color bar represents the learned interaction strength. f Change in interaction strength between domains before and after bulge removal. The color bar shows this difference. g, h The interacting domains between Tat and TAR-Delta are identified and mapped from the learned interaction edges. Edge thickness indicates the interaction strength, corresponding to (e), while arrow direction indicates the directionality of a learned edge, representing the influence from the source domain to the target domain.
To further investigate the mechanism after removing the bulge region, we applied the NRI model using MD trajectories of the Tat/TAR-Delta complex within the P-TEFb/Tat/TAR-Delta system. From all Tat/TAR ensembles sampled over 50 steps, the model achieves highly accurate trajectory reconstruction with a mean squared error (MSE) of 0.004 between the truth and reconstruction RMSF (see Supplementary Fig. 5). We derived the distribution of learned edges between residues and nucleotides (see Fig. 4d) and constructed a domain interaction map (see Fig. 4e) by grouping adjacent residues into blocks based on the secondary structure (see Fig. 4b). Compared to the P-TEFb/Tat/TAR system, the influence of learned edges between Tat and TAR becomes weaker (see Fig. 4g, h). To quantitatively characterize these changes, we measured the differences in interaction strengths between P-TEFb/Tat/TAR-Delta and P-TEFb/Tat/TAR (see Fig. 4f). The heatmap shows a reduction in nearly all interaction strengths. This finding further supports the crucial role of the bulge region as a mediator of the Tat/TAR interface in mechanistic regulation.
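The reconstruction check used above compares per-node RMSF profiles of the simulated and reconstructed trajectories. A minimal sketch of that comparison follows, assuming both trajectories are pre-aligned and stored as nested lists of coordinates; the function names are illustrative and the paper's exact preprocessing (for example, normalization of coordinates) is not reproduced here.

```python
import math

def rmsf(traj):
    """Root-mean-square fluctuation of each node about its mean position.

    traj: list of frames; each frame is a list of (x, y, z) tuples, one per node.
    """
    n_frames, n_nodes = len(traj), len(traj[0])
    out = []
    for i in range(n_nodes):
        avg = [sum(frame[i][d] for frame in traj) / n_frames for d in range(3)]
        msd = sum(
            sum((frame[i][d] - avg[d]) ** 2 for d in range(3)) for frame in traj
        ) / n_frames
        out.append(math.sqrt(msd))
    return out

def rmsf_mse(true_traj, recon_traj):
    """Mean squared error between the RMSF profiles of two trajectories."""
    a, b = rmsf(true_traj), rmsf(recon_traj)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# One node oscillating between x = +1 and x = -1 has RMSF 1.0;
# a frozen reconstruction has RMSF 0.0, so the MSE is 1.0.
true_traj = [[(1.0, 0.0, 0.0)], [(-1.0, 0.0, 0.0)]]
frozen = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
print(rmsf(true_traj), rmsf_mse(true_traj, frozen))  # [1.0] 1.0
```

A small MSE on this measure indicates that the reconstruction reproduces how much each node moves, not necessarily the exact position at every step.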
Regulatory application through ML-based inhibitor identification
Based on our findings, we can evaluate potential small-molecule inhibitors that target the RNA to disrupt the interaction between Tat and TAR (see Fig. 5a). We applied our ZHMol-RLinter method, which can identify the binding probabilities between loop motifs and inhibitors. This approach demonstrated that loop motifs are highly likely to interact with small-molecule inhibitors, making them primary targets for inhibitor binding analysis. Additionally, the secondary structure of TAR RNA has two characteristic loop regions: the bulge loop and the hairpin loop (see Fig. 5b).
a Schematic representation of the P-TEFb/Tat/TAR system with bound inhibitor. Blue and green denote Cdk9 and Cyclin T1, respectively, which together constitute the P-TEFb complex. Purple represents the Tat protein, while yellow represents the TAR RNA. The zoomed-in view shows the inhibitor bound to the TAR RNA. b The secondary structure of TAR RNA (colored by motif) highlights the loop motifs (indicated by arrows) along with a plot showing the binding probability of their respective inhibitors. c The binding probability of inhibitors to the hairpin loop region of TAR RNA. d The binding probability of inhibitors to the bulge loop region of TAR RNA. e A comparative analysis of backbone root-mean-square deviation (RMSD) for the P-TEFb/Tat/TAR system in both inhibitor-free (P-TEFb/Tat/TAR, blue) and 110FA-bound (P-TEFb/Tat/TAR-L1, green) states.
Previously, our research showed five potential inhibitors (110FA, 115FA, F07#13, AM6538, and DB00594) that may target P-TEFb/Tat/TAR30. We applied ZHMol-RLinter to predict the binding preferences between the loop motifs and inhibitors. The analysis with ZHMol-RLinter revealed distinct binding preferences among the selected inhibitors. All five inhibitors demonstrated higher binding probabilities for the bulge loop region compared to the hairpin loop region (see Fig. 5c, d). 110FA exhibited the strongest binding probability toward the bulge loop (see Fig. 5d), suggesting it may be the most promising candidate for targeted TAR RNA inhibition.
We performed molecular docking between 110FA and TAR RNA (labeled as P-TEFb/Tat/TAR-L1), followed by MD simulations (see “Methods” for details). Due to the competitive binding of the small-molecule inhibitor at the RNA binding site, the ARM region of Tat remains free (see Fig. 5a). To evaluate the stability of the simulations upon inhibitor binding, we calculated the backbone RMSD across five independent 200 ns trajectories, amounting to a cumulative simulation time of \(1\,\mu\mathrm{s}\). As shown in Fig. 5e, both systems stabilized after 50 ns of simulation, indicating the convergence of the trajectories. However, the RMSD values of the P-TEFb/Tat/TAR-L1 complex were consistently much higher than those of the P-TEFb/Tat/TAR complex throughout the simulations. This increased RMSD suggests that the binding of inhibitor 110FA enhances the structural fluctuations within the P-TEFb/Tat/TAR complex. To further investigate structural fluctuations, we conducted a backbone RMSF analysis on the P-TEFb/Tat/TAR-L1 complex. As shown in Supplementary Fig. 6a, the system displayed overall structural stability, except for TAR RNA. Significant fluctuations were observed in TAR RNA, especially near the bulge loop region (Supplementary Fig. 6b). These findings further confirm that binding of the inhibitor 110FA induces considerable structural destabilization between TAR RNA and the Tat protein, potentially disrupting the Tat/TAR interface.
Dynamical regulatory identification in the aaRS/tRNA complex
We further extend our analysis to the second system, namely the aaRS/tRNA complex, which is essential for the aminoacylation of tRNA, a critical step in protein synthesis31. In this process, aaRS charge tRNAs with their cognate amino acids, ensuring accurate translation of the genetic code32. Specifically, the amino acid is transferred to the 3’ end of the cognate tRNA, and the synthetase distinguishes among a large pool of cellular tRNAs by recognizing particular nucleotides called identity elements. This precise recognition ensures each tRNA is charged with the correct amino acid, preserving the accuracy and fidelity of protein synthesis. As a result, inhibiting tRNA aminoacylation has been confirmed as an effective antimicrobial strategy33.
We obtained the aaRS/tRNA system (see Supplementary Fig. 7a, right) trajectory from previous research34. To explore the underlying regulatory mechanisms, we applied the NRI model to the MD trajectory of tRNA. Across all tRNA ensembles sampled over 50 steps, the model accurately reconstructed the trajectories, achieving an MSE of 0.0004 between the ground-truth and reconstructed RMSF (see Supplementary Fig. 8). We also derived the distribution of learned edges between nucleotides (see Supplementary Fig. 7b) and built a domain-level interaction map (see Supplementary Fig. 7c) by grouping adjacent nucleotides into blocks based on the secondary structure (see Supplementary Fig. 7a, left). The learned edges among the D-loop, variable loop, and anticodon arm suggest that interactions between these regions are essential in the aminoacylation process of tRNA (see Supplementary Fig. 7d, e). Additionally, we calculated the shortest pathways from nucleotides in the anticodon loop to those in the acceptor arm using the learned edges, representing key allosteric communication routes within the tRNA molecule (see Supplementary Fig. 7f). The pathways mainly pass through the variable loop, indicating its relative importance in improving global connectivity and facilitating interactions that enhance allosteric signaling.
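The shortest-pathway calculation over learned edges can be sketched with Dijkstra's algorithm. The paper does not specify how an edge strength is turned into a path cost, so the sketch assumes strengths lie in (0, 1] and uses the cost \(-\log w\), under which strong edges are cheap; that mapping, the function name, and the toy node labels are all assumptions.

```python
import heapq
import math

def shortest_path(strengths, source, target):
    """Dijkstra shortest path over directed learned edges.

    strengths: dict {(src, dst): w} with 0 < w <= 1 (assumed range);
               the edge cost is -log(w), so stronger edges are cheaper.
    Returns the node list from source to target, or None if unreachable.
    """
    adj = {}
    for (src, dst), w in strengths.items():
        if w <= 0:
            continue  # treat non-positive strengths as "no edge"
        if w > 1:
            raise ValueError("strengths must be <= 1 so that costs stay non-negative")
        adj.setdefault(src, []).append((dst, -math.log(w)))

    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Two strong hops (0.9 each) beat one weak direct edge (0.5).
edges = {("a", "b"): 0.9, ("b", "c"): 0.9, ("a", "c"): 0.5}
print(shortest_path(edges, "a", "c"))  # ['a', 'b', 'c']
```

Counting how often each intermediate node appears across all source-target pairs then gives a simple measure of which region the communication routes pass through.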
We further use the network-guided ML method, RNet, to predict potential functional binding sites for tRNA molecules. We calculated the binding probabilities of functional sites for tRNA, accounting for RNA’s flexibility based on 3200 frames from the aaRS/tRNA MD simulation trajectories. As shown in Supplementary Fig. 9a, the curve represents the average predicted binding probabilities across the entire MD trajectory frames. The tRNA sequence shows four distinct peaks (A7-G10, U19-G23, A46-G48, and G56-U59), which correspond to regions with high binding potential (indicated by the blue shaded areas in Supplementary Fig. 9a). These regions align precisely with the D-loop, T-loop, and variable loop domains (nucleotides highlighted in maroon in Supplementary Fig. 9b), emphasizing their potential as primary targets for small-molecule inhibitors. Indeed, the critical binding regions identified by RNet match the key interaction sites that sustain the L-shaped stability of tRNA, a structural requirement for its functional activity35.
Experimental evidence further supports the finding that small-molecule inhibitors specifically bind to the variable loop and D-loop regions of tRNA, disrupting the structural integrity needed for efficient aminoacylation. For example, the aminoglycoside antibiotic Neomycin B has been reported to inhibit the in vitro aminoacylation of E. coli tRNAPhe 36.
Performance evaluation of the physics-informed ML models
To further validate the physics-informed ML methods, we conducted a comparative analysis of these tools against traditional approaches. One key benefit of using the NRI ML method in our framework is that it learns hidden interaction edges directly from the MD simulation data, rather than relying on conventional correlation-based network analysis measures. The NRI model does not merely construct static interaction networks. Instead, it learns dynamic interaction edges in a way that lets it regenerate accurate MD trajectories. This type of validation demonstrates that the learned edges contain sufficient dynamical information, a benefit that conventional network-based approaches cannot match. The Pearson correlation coefficient between the learned node weights and residue-level RMSF values is 0.59 (see Supplementary Fig. 10a), indicating a clear association between the learned interaction strengths and changes in structural stability. In contrast, traditional network-based methods, such as correlation or contact frequency analysis, including the dynamical cross-correlation matrix (DCCM; see Supplementary Note 1 for details), cannot directly reconstruct trajectories. The DCCM yields a Pearson correlation of only −0.34 (see Supplementary Fig. 10b), which is weaker in magnitude than that of NRI. Further, we compared the NRI model with DCCM using network shortest-path analysis. Specifically, we constructed networks where nodes represent nucleotides or residues, and edges are defined by spatial distances less than 20 Å, with edge weights assigned from either NRI or DCCM. Then, we applied shortest-path analysis, a standard method in network studies, to calculate how often paths pass through RNA nucleotides. The comparison showed that NRI (see Supplementary Fig. 10c) has sharper and more distinct peak patterns than DCCM (see Supplementary Fig. 10d), with most peaks located near regulatory bulge regions. This highlights the advantage of NRI.
Overall, the comparison results show that (i) NRI captures more detailed dynamic information and (ii) it identifies key nucleotides more accurately and more distinctly.
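For readers unfamiliar with the baseline, the sketch below computes a DCCM in its commonly used form (normalized covariance of positional displacements) together with a plain Pearson correlation. It assumes a pre-aligned trajectory stored as nested lists and that every node moves (a node with zero variance would cause a division by zero); the definition in Supplementary Note 1 may differ in detail, and the function names are illustrative.

```python
import math

def dccm(traj):
    """Dynamical cross-correlation matrix of a pre-aligned trajectory.

    traj: list of frames; each frame is a list of (x, y, z) tuples, one per node.
    C[i][j] = <dr_i . dr_j> / sqrt(<dr_i^2> <dr_j^2>), values in [-1, 1].
    """
    n_frames, n = len(traj), len(traj[0])
    mean_pos = [[sum(f[i][d] for f in traj) / n_frames for d in range(3)]
                for i in range(n)]
    disp = [[[f[i][d] - mean_pos[i][d] for d in range(3)] for i in range(n)]
            for f in traj]

    def cov(i, j):
        return sum(sum(a * b for a, b in zip(fr[i], fr[j])) for fr in disp) / n_frames

    var = [cov(i, i) for i in range(n)]
    return [[cov(i, j) / math.sqrt(var[i] * var[j]) for j in range(n)]
            for i in range(n)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Nodes 0 and 1 move in opposite directions; nodes 0 and 2 move together.
traj = [[(1, 0, 0), (-1, 0, 0), (2, 0, 0)],
        [(-1, 0, 0), (1, 0, 0), (-2, 0, 0)]]
C = dccm(traj)
print(C[0][1], C[0][2])  # -1.0 1.0
```

Unlike the learned edges, this matrix is symmetric by construction, so it cannot encode a direction of influence between two nodes.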
We also compared the performance of the binding site prediction method RNet with other methods, including RNAsite, RBind, and Rsite37,38,39. RNAsite is also an ML-based method. RBind is a physics-based method grounded in complex network theory that analyzes binding sites by leveraging network properties like degree and closeness centrality. In contrast, Rsite is an approach based on Euclidean distance that identifies binding sites by calculating the spatial distances between nucleotides and other molecular components. As shown in Supplementary Fig. 11b, Rsite failed to accurately predict the bulge loop region, with its predicted binding sites scattered across various positions on the RNA. The precision of its predictions for the bulge loop region is 42.9%. This indicates that relying solely on simple Euclidean distance metrics is insufficient for accurately identifying specific functional areas. RBind, owing to its strict cutoff criteria, was unable to detect the binding sites effectively, with 0% precision for the bulge loop region (see Supplementary Fig. 11a). The precision of RNAsite in predicting binding sites on the bulge loop is only 33.3% (see Supplementary Fig. 11c). In contrast, RNet, which utilizes ML to capture complex network features, successfully identified high-probability binding sites near the bulge loop region, achieving a precision of 80% in the Top 5 predictions, as shown in Supplementary Fig. 11d. This is an improvement of 37.1 percentage points over Rsite and 46.7 percentage points over RNAsite.
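The Top 5 precision and the reported improvements can be reproduced with simple arithmetic. The sketch below uses the top-five ranking and the loop membership stated earlier in the text (A7, G10, U24, and U8 within the loop region, A11 adjacent to it) purely as an illustration; treating exactly those four nucleotides as the reference set is an assumption, and `precision_at_k` is a hypothetical helper.

```python
def precision_at_k(ranked, true_sites, k=5):
    """Fraction of the top-k ranked nucleotides that belong to the reference set."""
    top = ranked[:k]
    return sum(1 for site in top if site in true_sites) / len(top)

# Illustrative inputs taken from the ranking reported in the text.
ranked = ["A7", "G10", "U24", "U8", "A11"]
loop_sites = {"A7", "G10", "U24", "U8"}  # assumed reference set

p = precision_at_k(ranked, loop_sites)
print(p)  # 0.8

# Differences between precisions are percentage points, not relative gains.
print(round(80.0 - 42.9, 1), round(80.0 - 33.3, 1))  # 37.1 46.7
```

Expressed as relative gains instead, the same numbers would be considerably larger (for example, 80% versus 42.9% is roughly a 1.9-fold increase), which is why the unit matters when comparing tools.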
Additionally, we evaluated the ability of ZHMol-RLinter in the pipeline to identify RNA-small molecule binding preferences using a set of experimentally determined RNA-small molecule PDB structures (see Supplementary Table 1). We started with a non-redundant set of 31 structures from the published benchmark RL9825,40. Subsequently, we added five RNA-small molecule complexes released in the PDB after April 2024. These were combined into a curated test set of 36 structures named RSM36, which was used to assess the pipeline’s performance in identifying the corresponding small-molecule binding sites. On RSM36, the pipeline achieved an accuracy of 0.63, with a precision of 0.62 and a recall of 0.62 in predicting small molecules that bind to the correct RNA motifs. Under a more relaxed criterion, which asks only whether the small molecule binds to the correct RNA chain, the accuracy increased to 0.74, the precision to 0.77, and the recall to 0.94 (see Supplementary Fig. 12). These results show that the pipeline can reliably identify small molecules binding at relevant sites, providing valuable insights for discovering RNA-targeting inhibitors.
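As a reminder of how the three reported metrics relate to one another, the sketch below computes them from binary confusion counts. The counts in the example are made up for illustration and are not the RSM36 results; the paper's motif-level evaluation may also use a multi-class or averaged variant that this binary form does not capture.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(classification_metrics(tp=3, fp=1, fn=1, tn=5))  # (0.8, 0.75, 0.75)
```

A relaxed matching criterion converts some false negatives into true positives, which is why recall can rise much more sharply than precision when the criterion is loosened.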
Discussion
The physics-informed ML approach in MD simulations provides a promising tool for investigating biomolecular regulation and drug discovery. Here, it has been applied to explore the regulatory mechanisms of two RNA complexes and to identify potential inhibitors. Compared to traditional MD methods, ML-enhanced workflows accelerate the analysis process by examining molecular behaviors and interactions, enabling researchers to focus experimental resources on the most promising candidates. This approach facilitates a more efficient exploration of the vast array of inhibitor candidates.
This study provides computational evidence demonstrating the crucial role of the bulge loop region in mediating the binding interaction between TAR RNA and the P-TEFb/Tat complex. These findings align with our previous experimental studies30, which established that the bulge loop motif in TAR RNA is a critical structural element for specific recognition by the Tat protein. Experimentally, deleting the bulge loop or binding inhibitor 110FA prevents Tat from associating with TAR RNA. These experimental results indicate that validated computational pipelines can serve as valuable filters for prioritizing inhibitors for further investigation, thereby reducing both time and costs in drug development.
To interpret the latent edge types of NRI, we conducted an analysis beyond this baseline to better understand the learned interaction patterns. We examined the fluctuations in inter-residue distances (see Supplementary Fig. 13a) and analyzed how they relate to the inferred edge types. The first type (type 1) was explicitly designated as a “non-edge” (indicating no interaction). The sum of the other three edge types (types 2, 3, and 4) is collectively referred to as the “learned-edge”. Non-edge showed a significant negative correlation with distance fluctuations (see Supplementary Fig. 13c, Pearson: −0.41, p-value: 0.0), while learned-edge showed positive correlations (see Supplementary Fig. 13b, Pearson: 0.48, p-value: 0.0). This indicates that the assignment of edge types is closely linked to the system’s dynamic variability. The distribution of mean inter-residue distances shows that type 1 edges generally denote shorter distances compared to types 2, 3, and 4 (see Supplementary Fig. 13d). This implies that type 1 primarily captures close-range interactions such as covalent and hydrogen bonds, which are associated with more rigid molecular motion. Meanwhile, types 2, 3, and 4 better represent long-range relationships, reflecting distant effects within the system.
Looking to the future, research directions should focus on enhancing model interpretability and integrating with MD analysis software. As the field advances, combining ML and MD simulations becomes essential in pursuing biological mechanism research, potentially transforming our approach to addressing persistent viral infections and other complex diseases.
Conclusions
In conclusion, we have developed a physics-informed ML pipeline specifically designed to study RNA regulation mechanisms. This pipeline integrates NRI and network models with ML algorithms, offering a comprehensive and accurate approach to understanding complex RNA interactions and regulation. By analyzing the P-TEFb/Tat/TAR and aaRS/tRNA systems in detail, we have identified the underlying regulatory networks, pinpointed the binding sites of targeted RNA inhibitors, and used computational methods to evaluate potential small-molecule inhibitors. The results from our pipeline align closely with experimental studies, which supports its effectiveness and reliability. We believe this method will be useful for research on RNA–protein complexes and will offer insights into the physics-informed ML approach.
Methods
The P-TEFb/Tat/TAR structural model established for MD simulations
In the P-TEFb/Tat/TAR system, the HIV-1 transactivator protein Tat binds to TAR RNA, facilitating the recruitment of P-TEFb and enhancing the transcription of the integrated proviral genome41. The initial structure of the P-TEFb/Tat/TAR complex for MD simulations was based on the crystal structure of TAR in complex with P-TEFb/Tat (PDB code: 6CYT), as resolved by Schulze-Gahmen and Hurley42. P-TEFb consists of two components: Cdk9 and CyclinT1. Cdk9 comprises an ATP-binding pocket and a CyclinT1-interacting domain, and CyclinT1 binds the Cdk9 T-loop while engaging the TAR loop through the Tat/TAR recognition motif. In the 6CYT structure, the Tat protein is represented only by a short segment spanning residues M1 to G48. Because residues R49-Q60 are absent from the crystal structure, the mechanism of TAR interaction with the Tat ARM region remains unclear. To obtain a complete structural model, we utilized the extended Tat and TAR, following the approach described in our previous study27. The extended Tat and TAR RNA were modeled using Chimera-1.13.1, based on the NMR structure (PDB code: 6MCE)42,43,44. Structural optimizations were performed using SWISS-MODEL to obtain a complete atomic protein model45,46. The P-TEFb/Tat/TAR-Delta structure was generated by cutting the critical position U8-U9 of the TAR RNA bulge loop in the P-TEFb/Tat/TAR complex. The structure with the inhibitor 110FA (P-TEFb/Tat/TAR-L1) was constructed using the P-TEFb/Tat complex (residues M1-I45) from 6CYT, along with Tat (S46-Q60) and TAR derived from 6MCE. The docking of 110FA and TAR was carried out using NLDock47 (a program designed for nucleic acid-ligand docking) with a global-rigid parameter. In the experimentally determined structure (PDB code: 1UTS), the inhibitor associates with the bulge-loop region of TAR48.
In our docking model, 110FA also localizes to these same RNA elements, showing substantial overlap with the known binding sites, thus supporting the credibility of its predicted binding site (see Supplementary Fig. 14a). In addition, we applied the docking protocol NLDock to reproduce the known 1UTS binding structure and successfully generated a near-native conformation comparable to the PDB structure (see Supplementary Fig. 14b). These results indicate that our docking strategy can identify physically realistic RNA-ligand complexes. We utilized the ACPYPE tool to generate the topology for the ligand based on the General Amber Force Field (GAFF)49,50.
Simulation steps and parameters
The MD simulations were conducted using GROMACS-2023.3 with a mixed force field (AMBER03 for the protein and AMBER94 for the nucleic acid) and the SPC/E water model51,52,53. The SPC/E water model has been employed in several RNA–protein simulations with the Amber force field to improve solvent dielectric and dynamical properties54,55. A detailed comparison of different water model choices is provided in Supplementary Note 2 and Supplementary Fig. 15. Initially, AMBER20 was utilized to process the complex structure, removing the initial solvent and ions. The complex was then solvated in a cubic box with periodic boundary conditions. To ensure charge neutrality, Na⁺ and Cl⁻ ions were added. The final box size was 14.01 × 14.01 × 14.01 nm³, containing a total of 277,084 atoms for the P-TEFb/Tat/TAR system. For the P-TEFb/Tat/TAR-Delta system, the final box size was 14.01 × 14.01 × 14.01 nm³ with 277,087 atoms, whereas for the P-TEFb/Tat/TAR-L1 system, it was 15.20 × 15.20 × 15.20 nm³ with 348,688 atoms in total. Each molecular complex was first energy-minimized using the steepest descent method until either 20,000 iterations were reached or the convergence criteria were met. The system was subsequently heated to 300 K in two stages using the Nose-Hoover thermostat56. Following this, the system underwent 1 ns of unrestrained equilibration in the NPT ensemble with the Parrinello-Rahman barostat57. All simulations employed a time step of 2 fs, and a final production run of 200 ns was conducted under the same NPT conditions. In total, we conducted 3 µs of MD simulations, comprising five independent 200 ns trajectories for each of the three systems.
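As a quick consistency check on the protocol above, the following sketch recomputes the number of integration steps per production run, the total simulated time, and the average atom density of each box. All numbers are taken from the text; the variable and function names are illustrative only.

```python
# Bookkeeping for the simulation protocol; values are taken from the text above.
TIME_STEP_FS = 2        # integration time step in femtoseconds
PRODUCTION_NS = 200     # length of one production trajectory in nanoseconds
N_SYSTEMS = 3           # wild-type, Delta, and L1 complexes
N_REPLICAS = 5          # independent trajectories per system
FS_PER_NS = 1_000_000

steps_per_run = PRODUCTION_NS * FS_PER_NS // TIME_STEP_FS
total_ns = N_SYSTEMS * N_REPLICAS * PRODUCTION_NS
total_us = total_ns / 1000

def atoms_per_nm3(n_atoms, box_edge_nm):
    """Average atom number density in a cubic box."""
    return n_atoms / box_edge_nm ** 3

# (number of atoms, cubic box edge in nm) for each simulated system.
systems = {
    "P-TEFb/Tat/TAR": (277_084, 14.01),
    "P-TEFb/Tat/TAR-Delta": (277_087, 14.01),
    "P-TEFb/Tat/TAR-L1": (348_688, 15.20),
}
densities = {name: atoms_per_nm3(n, edge) for name, (n, edge) in systems.items()}

print(f"steps per production run: {steps_per_run:,}")
print(f"total simulated time: {total_us} microseconds")
for name, rho in densities.items():
    print(f"{name}: {rho:.1f} atoms/nm^3")
```

Each 200 ns trajectory corresponds to 100,000,000 steps at 2 fs, the fifteen trajectories sum to 3 µs, and all three boxes come out near 100 atoms per nm³, as expected for a box dominated by water.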
The backbone root-mean-square deviation (RMSD) and the root-mean-square fluctuation (RMSF) of atomic positions were calculated with the GROMACS-2023.3 program51. RMSF quantifies the flexibility of atoms or residues over time in MD simulations, thereby providing insight into structural dynamics. In the RMSD calculation, the protein backbone was represented by the \(N\), \(C_{\alpha}\), and \(C\) atoms, while the RNA backbone comprised the \(C5'\), \(O5'\), \(C3'\), and \(O3'\) atoms. For RMSF calculations, a single backbone atom was used: \(C_{\alpha}\) atoms for proteins and \(C5'\) atoms for RNA.
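To make the two quantities concrete, the sketch below computes RMSD and RMSF for a toy trajectory stored as nested lists of coordinates. It is a simplified illustration, not the GROMACS implementation: it assumes the frames are already superimposed on the reference (tools such as `gmx rms` perform a least-squares fit first), and the function names and coordinates are invented for this example.

```python
import math

def rmsd(frame, reference):
    """RMSD between one frame and a reference, both lists of (x, y, z).

    Assumes the frame is already superimposed on the reference.
    """
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(frame, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(reference))

def rmsf(trajectory):
    """Per-atom RMSF: fluctuation of each atom around its mean position."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        mean_sq_dev = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        result.append(math.sqrt(mean_sq_dev))
    return result

# Toy trajectory (assumed): two frames, two atoms, coordinates in nm.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
reference = trajectory[0]

print([round(rmsd(frame, reference), 3) for frame in trajectory])
print(rmsf(trajectory))
```

RMSD yields one value per frame (overall deviation from the reference), whereas RMSF yields one value per atom (its mobility over the whole trajectory). In the toy data, the first atom never moves and therefore has zero RMSF.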
The NRI-informed ML algorithm
NRI is an unsupervised model in the form of a variational autoencoder (VAE), a generative deep learning framework that learns latent representations of input data by approximating probability distributions. It is designed to infer latent interactions from dynamic trajectory data. The model comprises two jointly trained components: an encoder that infers potential interactions from the observed trajectories, and a decoder that reconstructs the trajectory data according to the predicted latent interaction graph. Further details about the NRI model can be found in Supplementary Note 3 and Supplementary Fig. 16, the training details in Supplementary Note 4, and the hyperparameter optimization in Supplementary Note 5 and Supplementary Fig. 17.
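For orientation, the generic VAE objective (the evidence lower bound) that such an encoder-decoder pair is trained to maximize can be written as follows. The notation is an assumption made for illustration and is not taken from this article: \(\mathbf{x}\) denotes the observed trajectories, \(\mathbf{z}\) the latent edge types, \(q_{\phi}\) the encoder, \(p_{\theta}\) the decoder, and \(p(\mathbf{z})\) a prior over edge types. The exact loss used in this work is described in the Supplementary Notes.

```latex
\[
\mathcal{L} =
\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}
\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\right]
- \mathrm{KL}\left[\, q_{\phi}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z}) \,\right]
\]
```

The first term rewards decoders that reproduce the observed dynamics from the inferred interaction graph, and the second keeps the inferred edge-type distribution close to the prior.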
Network-informed ML prediction of RNA binding sites
In this study, we employed RNet24, an ML-based tool, to identify potential small-molecule inhibitor binding sites on RNA. RNet transforms RNA three-dimensional structures into complex networks and combines complex network features with ensemble machine-learning strategies to predict potential RNA inhibitor binding sites. The algorithm utilizes an ML-based network decomposition approach that analyzes local and global network properties. The local properties are degree and neighborhood connectivity, which describe a node's connectivity to its neighbors and secondary neighbors. The global properties are betweenness centrality, closeness centrality, and eccentricity, which describe the transmission efficiency of nodes (see Supplementary Note 6).
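The five network properties named above can be sketched on a toy graph using only the Python standard library. This is not RNet's implementation: the graph (nodes as nucleotides, edges as contacts), the function names, and the unnormalized definitions chosen here are assumptions made for illustration, and the graph is assumed to be connected and undirected.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):       # accumulate dependencies back to s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: c / 2 for v, c in score.items()}

def node_features(graph):
    """Local and global network properties for every node."""
    bc = betweenness(graph)
    features = {}
    for v in graph:
        others = [d for n, d in bfs_distances(graph, v).items() if n != v]
        degree = len(graph[v])
        neighbor_degrees = sum(len(graph[n]) for n in graph[v])
        features[v] = {
            "degree": degree,
            "neighborhood_connectivity": neighbor_degrees / degree if degree else 0.0,
            "betweenness": bc[v],
            "closeness": len(others) / sum(others) if others else 0.0,
            "eccentricity": max(others, default=0),
        }
    return features

# Toy contact graph (assumed): five nucleotides connected in a chain.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
features = node_features(graph)
for node in sorted(features):
    print(node, features[node])
```

In the chain, the central node has the highest betweenness and closeness and the lowest eccentricity, while the terminal nodes show the opposite pattern. Per-node feature vectors of this kind are what a downstream classifier would consume.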
Physicochemical property-informed inhibitor screening
Additionally, we utilized ZHMol-RLinter to assess the RNA-binding potential of the candidate inhibitor25. ZHMol-RLinter is an ML-based framework designed to predict RNA–small molecule binding interactions, with a specific focus on identifying small molecules that bind to RNA motifs. The method employs a double-layer stacking architecture built on random forest models. ZHMol-RLinter integrates multiple sequence- and structure-based features, including motif sequence patterns, structural geometric parameters, and physicochemical environmental characteristics, as input for small-molecule recognition. The approach is explicitly optimized for screening potential inhibitors by evaluating their binding probabilities with RNA motif regions.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The data that support the findings of this article are publicly available at https://doi.org/10.5281/zenodo.17157778. The source data behind figures in the paper are provided in Supplementary Data 1.
Code availability
The code that supports the findings of this article is publicly available at https://doi.org/10.5281/zenodo.17157778.
References
Cramer, P. Organization and regulation of gene transcription. Nature 573, 45–54 (2019).
Zhao, B. S., Roundtree, I. A. & He, C. Post-transcriptional gene regulation by mRNA modifications. Nat. Rev. Mol. Cell Biol. 18, 31–42 (2017).
Statello, L., Guo, C.-J., Chen, L.-L. & Huarte, M. Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 22, 96–118 (2021).
Liu, H., Jian, Y., Zeng, C. & Zhao, Y. RNA-protein interaction prediction using network-guided deep learning. Commun. Biol. 8, 247 (2025).
Glisovic, T., Bachorik, J. L., Yong, J. & Dreyfuss, G. RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett. 582, 1977–1986 (2008).
Liu, H. & Zhao, Y. Integrated modeling of protein and RNA. Brief. Bioinforma. 25, bbae139 (2024).
Khatter, H., Myasnikov, A. G., Natchiar, S. K. & Klaholz, B. P. Structure of the human 80S ribosome. Nature 520, 640–645 (2015).
Ekeberg, T. et al. Observation of a single protein by ultrafast X-ray diffraction. Light Sci. Appl. 13, 15 (2024).
Ke, A. & Doudna, J. A. Crystallization of RNA and RNA–protein complexes. Methods 34, 408–414 (2004).
Zeng, C., Jian, Y., Vosoughi, S., Zeng, C. & Zhao, Y. Evaluating native-like structures of RNA-protein complexes through the deep learning method. Nat. Commun. 14, 1060 (2023).
Ghosh, A. K. Four decades of continuing innovations in the development of antiretroviral therapy for HIV/AIDS: Progress to date and future challenges. Glob. Health Med. 5, 194–198 (2023).
Sever, B., Otsuka, M., Fujita, M. & Ciftci, H. A review of FDA-approved anti-HIV-1 drugs, anti-gag compounds, and potential strategies for HIV-1 eradication. Int. J. Mol. Sci. 25, 3659 (2024).
De Vivo, M., Masetti, M., Bottegoni, G. & Cavalli, A. Role of molecular dynamics and related methods in drug discovery. J. Med. Chem. 59, 4035–4061 (2016).
Hollingsworth, S. A. & Dror, R. O. Molecular dynamics simulation for all. Neuron 99, 1129–1143 (2018).
Kmiecik, S. et al. Coarse-grained protein models and their applications. Chem. Rev. 116, 7898–7936 (2016).
Baek, M. et al. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 21, 117–121 (2024).
Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, eadl2528 (2024).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Liu, H., Zhuo, C., Gao, J., Zeng, C. & Zhao, Y. AI-integrated network for RNA complex structure and dynamic prediction. Biophys. Rev. 5, 041304 (2024).
Wang, T. et al. Ab initio characterization of protein molecular dynamics with AI2BMD. Nature 635, 1019–1027 (2024).
Alber, M. et al. Integrating machine learning and multiscale modeling—perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences. NPJ Digit. Med. 2, 115 (2019).
Liu, H., Jian, Y., Hou, J., Zeng, C. & Zhao, Y. RNet: a network strategy to predict RNA binding preferences. Brief. Bioinforma. 25, bbad482 (2023).
Zhuo, C., Gao, J., Li, A., Liu, X. & Zhao, Y. A machine learning method for RNA–small molecule binding preference prediction. J. Chem. Inf. Model. 64, 7386–7397 (2024).
Shortridge, M. D. et al. An ultra-high affinity ligand of HIV-1 TAR reveals the RNA structure recognized by P-TEFb. Nucleic Acids Res. 47, 1523–1531 (2018).
Ning, S., Zeng, C., Zeng, C. & Zhao, Y. The TAR binding dynamics and its implication in Tat degradation mechanism. Biophys. J. 120, 5158–5168 (2021).
Delling, U. et al. Conserved nucleotides in the TAR RNA stem of human immunodeficiency virus type 1 are critical for Tat binding and trans activation: model for TAR RNA tertiary structure. J. Virol. 66, 3018–3025 (1992).
Sumner-Smith, M. et al. Critical chemical features in trans-acting-responsive RNA are required for interaction with human immunodeficiency virus type 1 Tat protein. J. Virol. 65, 5196–5202 (1991).
Khatkar, P. et al. HIV-1 transcription inhibition using small RNA-binding molecules. Pharmaceuticals 17, 33 (2024).
Ibba, M. & Söll, D. Quality control mechanisms during translation. Science 286, 1893–1897 (1999).
Rubio Gomez, M. A. & Ibba, M. Aminoacyl-tRNA synthetases. RNA 26, 910–936 (2020).
Wilson, D. N. Ribosome-targeting antibiotics and mechanisms of bacterial resistance. Nat. Rev. Microbiol. 12, 35–48 (2014).
Sethi, A., Eargle, J., Black, A. A. & Luthey-Schulten, Z. Dynamical networks in tRNA:protein complexes. Proc. Natl. Acad. Sci. USA 106, 6620–6625 (2009).
Zagryadskaya, E. I., Kotlova, N. & Steinberg, S. V. Key elements in maintenance of the tRNA L-shape. J. Mol. Biol. 340, 435–444 (2004).
Mikkelsen, N. E., Johansson, K., Virtanen, A. & Kirsebom, L. A. Aminoglycoside binding displaces a divalent metal ion in a tRNA–neomycin B complex. Nat. Struct. Biol. 8, 510–514 (2001).
Wang, K., Jian, Y., Wang, H., Zeng, C. & Zhao, Y. RBind: computational network method to predict RNA binding sites. Bioinformatics 34, 3131–3136 (2018).
Zeng, P., Li, J., Ma, W. & Cui, Q. Rsite: a computational method to identify the functional sites of noncoding RNAs. Sci. Rep. 5, 9179 (2015).
Su, H., Peng, Z. & Yang, J. Recognition of small molecule–RNA binding sites using RNA sequence and structure. Bioinformatics 37, 36–42 (2021).
Sun, S., Yang, J. & Zhang, Z. RNALigands: a database and web server for RNA-ligand interactions. RNA 28, 115–122 (2022).
Wei, P., Garber, M. E., Fang, S.-M., Fischer, W. H. & Jones, K. A. A novel CDK9-associated C-type cyclin interacts directly with HIV-1 tat and mediates its high-affinity, loop-specific binding to TAR RNA. Cell 92, 451–462 (1998).
Schulze-Gahmen, U. & Hurley, J. H. Structural mechanism for HIV-1 TAR loop recognition by Tat and the super elongation complex. Proc. Natl. Acad. Sci. USA 115, 12973–12978 (2018).
Pham, V. V. et al. HIV-1 Tat interactions with cellular 7SK and viral TAR RNAs identifies dual structural mimicry. Nat. Commun. 9, 4266 (2018).
Pettersen, E. F. et al. UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612 (2004).
Bienert, S. et al. The SWISS-MODEL Repository—new features and functionality. Nucleic Acids Res. 45, D313–D319 (2016).
Waterhouse, A. et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46, W296–W303 (2018).
Feng, Y., Zhang, K., Wu, Q. & Huang, S.-Y. NLDock: a fast nucleic acid–ligand docking algorithm for modeling RNA/DNA–ligand complexes. J. Chem. Inf. Model. 61, 4771–4782 (2021).
Murchie, A. I. H. et al. Structure-based drug design targeting an inactive RNA conformation: exploiting the flexibility of HIV-1 TAR RNA. J. Mol. Biol. 336, 625–638 (2004).
Sousa da Silva, A. W. & Vranken, W. F. ACPYPE - AnteChamber PYthon Parser interfacE. BMC Res. Notes 5, 367 (2012).
Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 25, 1157–1174 (2004).
Abraham, M. J. et al. GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2, 19–25 (2015).
Duan, Y. et al. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24, 1999–2012 (2003).
Chatterjee, S., Debenedetti, P. G., Stillinger, F. H. & Lynden-Bell, R. M. A computational investigation of thermodynamics, structure, dynamics and solvation behavior in modified water models. J. Chem. Phys. 128, 124511 (2008).
Baltrukevich, H. & Bartos, P. RNA-protein complexes and force field polarizability. Front. Chem. 11, 1217506 (2023).
Knappeová, B. et al. Comprehensive assessment of force-field performance in molecular dynamics simulations of DNA/RNA hybrid duplexes. J. Chem. Theory Comput. 20, 6917–6929 (2024).
Nosé, S. A unified formulation of the constant temperature molecular dynamics methods. J. Chem. Phys. 81, 511–519 (1984).
Parrinello, M. & Rahman, A. Polymorphic transitions in single crystals: a new molecular dynamics method. J. Appl. Phys. 52, 7182–7190 (1981).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant no. 12175081); Hubei Science Fund for Distinguished Young Scholars (grant no. 2024AFA077); Fundamental Research Funds for the Central Universities (grant no. CCNU25ai010); The Central China Normal University’s excellent postgraduate education innovation funding project (grant no. 2024CXZZ146); Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China (JYB2025XDXM502).
Author information
Contributions
Haoquan Liu conducted the MD simulations, developed the ML-based model, analyzed the data, and wrote the paper; Yanan Zhu contributed to the discussion on model development; Jiaming Gao helped with the data collection; Chen Zhuo contributed to the development of the ZHMol-RLinter; Chengwei Zeng assisted in the collection and analysis of molecular MD simulation data; Yunjie Zhao designed the project and supervised the overall study of the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Physics thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, H., Zhu, Y., Gao, J. et al. RNA regulation mechanisms study using physics-informed machine learning. Commun Phys 9, 19 (2026). https://doi.org/10.1038/s42005-025-02452-z
DOI: https://doi.org/10.1038/s42005-025-02452-z







