Introduction

The identification of therapeutically interesting compounds from the estimated 10⁶⁰ potential drug-like chemicals remains a critical challenge in drug discovery1. High-throughput screening is widely used to evaluate compound–protein binding affinity, but it is time-intensive and costly2. Computational methods have emerged as fast and efficient alternatives for predicting compound–protein interactions (CPI)3,4,5.

CPI prediction aims to assess how small-molecule compounds interact with proteins by determining their interaction pattern (contact map) and strength (binding affinity)6. The accuracy of physics-based CPI prediction methods, such as docking and molecular dynamics simulations, depends heavily on the quality of three-dimensional (3D) protein structures7,8. Their performance often declines significantly when predicted rather than experimentally determined structures are used9,10.

Recent deep learning (DL) models have enabled high-throughput CPI prediction without the need for experimentally determined protein structures, thereby improving the efficiency of binding-affinity estimation3,11,12,13,14. Early models used 1D convolutional neural networks (CNNs) to represent both compounds and proteins14,15, whereas later approaches, such as DeepAffinity16, adopted recurrent neural networks to process compound and protein sequences. Building on these developments, GraphDTA13 incorporates molecular graphs to represent compounds, but its protein representations are still limited to sequences. To further refine feature extraction, TransformerCPI17 and HyperAttentionDTI18 incorporate self-attention mechanisms. More recently, PerceiverCPI19 has utilized cross-attention to enhance CPI representations, whereas Cross–Interaction20 integrates 1D protein sequences with two-dimensional (2D) predicted contact maps for improved binding-affinity and contact predictions.

In addition to CPI prediction models, structure-based Drug Target Affinity (DTA) prediction models directly use protein-ligand complex structures or binding pocket information to predict binding affinities21,22. These structure-based DTA models require accurate complex structures as input, limiting their applicability when experimentally determined structures are unavailable. Recently, Fold-Docking-Affinity23 extended a structure-based DTA model by combining AlphaFold224 and DiffDock25 to predict affinities without relying on experimental structures. However, its application results have been limited to kinase targets23, and the combination of AlphaFold2 structures with DiffDock has shown limited success across diverse targets26, indicating that extending DTA models to CPI tasks remains difficult.

Despite extensive research on DL-based CPI prediction models, their performance has been hindered by insufficient generalization to external datasets. When applied beyond the training domain, DL models often fail to maintain predictive accuracy, a critical limitation of their utility in real-world drug discovery. In the present study, we aimed to overcome this limitation by addressing two main challenges. First, most models are trained on benchmark CPI datasets, such as Davis27, KIBA28, and Metz29. These datasets are biased toward a few well-studied drug target proteins, particularly kinases. Consequently, generalizing DL models trained on these datasets to a broader range of targets is difficult. To address this limitation, You et al.20 pretrained models on unlabeled databases such as Pfam-A30 using masked language modeling and graph completion. In addition, they constructed an expanded CPI dataset encompassing a more diverse protein set. Fine-tuning pretrained models on this extended dataset offers a promising approach to overcome protein-diversity limitations and extrapolate CPI models to a wider array of drug targets.

The second limitation is the inadequate integration of multimodal information, particularly protein sequences, 3D protein structures, and compound properties, all of which are critical for understanding protein functions and their interactions with compounds31. With advances in structure prediction models such as AlphaFold224, structural information can be utilized for CPI prediction without requiring experimentally determined structures. Previous CPI models have typically treated protein structures, protein sequences, and compound properties as independent sources of information, thereby overlooking the interplay among these modalities32. By contrast, the joint representations of PSC–CPI6, which adopts contrastive learning to indirectly capture protein sequences and structures, offer higher generalizability than models relying on sequences or structures alone. Despite its strengths, the representational capacity of PSC–CPI remains insufficient. To enhance both the generalizability and performance of the model, an approach is required that explicitly incorporates multimodal information, including 3D structural information and compound properties, while leveraging large-scale unlabeled datasets of proteins and compounds. Such an approach enables the learning of more robust embeddings, ultimately improving the accuracy of DL models for CPI prediction and drug discovery33.

Here, we present GenSPARC, a model with Generalized Structure- and Property-Aware Representations of protein and chemical language models for CPI prediction. It is designed to integrate multimodal pretrained models that incorporate protein sequences, 3D protein structures, and the biochemical features and structures of compounds. GenSPARC adopts the structure-aware protein language model SaProt34 as a protein encoder and, as a compound encoder, a model integrating the Structure-Property Multi-Modal (SPMM)35 chemical language model with a graph convolutional network (GCN), capturing both structural and biochemical properties. To effectively learn joint representations, a multimodal attention network (MAN) was applied. GenSPARC achieved state-of-the-art performance across multiple CPI benchmarks, excelling in binding-affinity and contact-map prediction. Moreover, we benchmarked virtual screening tasks in which protein-compound complex structures or compound binding pockets are assumed to be known. In this benchmark, our model achieved performance comparable to that of structure-aware approaches, including DrugCLIP, a state-of-the-art virtual screening model that explicitly utilizes the structural information of protein-compound complexes. Notably, when experimental structures were unavailable and AlphaFold2-predicted structures were used instead, our model demonstrated the best performance.

Compared to other models, GenSPARC demonstrated the best performance in scenarios where experimentally determined protein structures were unavailable, highlighting its robustness for CPI prediction independent of experimental structures.

Results

Overview of GenSPARC

As illustrated in Fig. 1, GenSPARC is designed to incorporate both global and local structural information from protein sequences. Using AlphaFold224 and Foldseek36, we predicted protein structures and assigned a structural alphabet to each residue based on tertiary interactions (Fig. 1A). The resulting structural letters and amino acid sequences were fed into SaProt34, generating universal, structure-aware protein representations. For the compounds, we converted SMILES representations into molecular graphs, which were processed using GCNs37 to capture global atomic relationships. These representations were further enriched by 53 structural and biochemical properties extracted using a pretrained property encoder from the SPMM35 chemical language model (Fig. 1B). To effectively integrate these multimodal representations, we introduced a MAN that captures both intra-modal relationships and inter-modal interactions between proteins and compounds, leveraging large-scale pretrained models based on the Transformer38 architecture (Fig. 1C).
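The GCN propagation over a molecular graph can be illustrated with a single, symmetrically normalized message-passing step. The following is a minimal pure-Python sketch for intuition only, not GenSPARC's actual implementation; the adjacency, feature, and weight values in the usage line are hypothetical.

```python
def gcn_layer(adj, feats, weight):
    """One GCN message-passing step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # add self-loops so each atom retains its own features
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # symmetric degree normalization
    norm = [[a_hat[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]
    d_in, d_out = len(weight), len(weight[0])
    # aggregate neighbor features, then apply the linear transform and ReLU
    agg = [[sum(norm[i][k] * feats[k][j] for k in range(n)) for j in range(d_in)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][k] * weight[k][j] for k in range(d_in)))
             for j in range(d_out)] for i in range(n)]

# hypothetical two-atom molecule (one bond), one feature per atom, identity weight
out = gcn_layer([[0, 1], [1, 0]], [[1.0], [0.0]], [[1.0]])
```

Stacking several such layers lets each atom's representation absorb information from progressively larger neighborhoods, which is how global atomic relationships are captured.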

Fig. 1: Overall workflow of GenSPARC.

A Generation of protein structure-aware (SA) sequences. Raw protein sequences were first mapped to UniProt IDs and then used to retrieve structures from the AlphaFold Protein Structure Database (AFDB). These structures were transformed into SA sequences using FoldSeek. B Model architecture. The protein SA sequence was encoded using a pretrained protein encoder, SaProt. Meanwhile, the compound graph, derived from SMILES data, was processed through both a compound encoder and a property encoder, capturing structural and chemical features. Cross-attention mechanisms combined the protein and compound data, and self-attention layers further refined these features. The MAN predicted binding affinity, while the pairwise interaction network predicted the contact map. C MAN for integrating protein and compound representations. This module combined multi-head self-attention and cross-attention mechanisms, followed by a feed-forward network, to capture joint representations between protein residues and compound atoms for predicting binding affinity.

Binding affinity and contact prediction on the Karimi dataset

The general CPI prediction framework focuses on two key tasks: contact-map prediction and binding-affinity prediction. In the Karimi dataset setting, prediction is performed using SMILES strings for compounds together with amino acid sequences and AlphaFold2-predicted structures for proteins, without relying on protein-compound complex structural information. In GenSPARC, these tasks were addressed separately by training dedicated models. In addition, we evaluated a multitask learning version, GenSPARC–MT, which employed a combined loss function to learn both tasks simultaneously. Because GenSPARC–MT exhibited similar performance to GenSPARC, the following sections focus primarily on GenSPARC.

We first evaluated our model against four baseline methods, MONN12, Cross–Interaction20, PSC–CPI6, and GraphBAN39, on the Karimi11 dataset. MONN12 is a sequence-based method that jointly predicts binding affinity and atom–residue contacts, representing one of the most effective sequence-based approaches. Cross–Interaction20 combines protein sequences and structures using 2D contact maps, whereas PSC–CPI6 integrates sequences and structures through contrastive learning. GraphBAN39 is a recently reported graph-based framework that integrates inductive learning and knowledge distillation to robustly predict interactions between unseen compounds and proteins. Notably, PSC–CPI6 is a state-of-the-art model for CPI contact and affinity predictions on the Karimi dataset. Test-set results were reported across four data splits following the original Karimi11 dataset: (1) Seen–Both, with both proteins and compounds present in the training set; (2) Unseen–Comp, with proteins present and compounds absent from the training set; (3) Unseen–Prot, with compounds present and proteins absent from the training set; and (4) Unseen–Both, with both proteins and compounds absent from the training set. To evaluate contact-map prediction, we adopted the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) as metrics. For GraphBAN39, we followed the analytical procedure described in the original study39 for contact prediction, in which the attention weights of a bilinear attention network were used to predict contacts between proteins and compounds. For binding-affinity prediction, the root mean square error (RMSE) and Pearson correlation coefficient (PCC) were applied to evaluate the performance of each model. Following the experimental setup of PerceiverCPI19, we modified the final layers of GraphBAN39, originally designed for binary classification, to perform regression tasks. These values should be interpreted with caution, as this approach differs from direct prediction methods.
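For reference, the two affinity metrics can be computed directly from paired measured and predicted values; a minimal sketch of the standard definitions:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted affinities."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted affinities."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)
```

A constant offset between predictions and measurements inflates RMSE but leaves PCC at 1.0, which is why the two metrics are reported together.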

For contact-map prediction, our model outperformed the baseline methods across all splits (Table 1). Specifically, in the Seen–Both test set, our model achieved an AUPRC of 25.44 and an AUROC of 90.54, significantly surpassing the baselines. In the more challenging Unseen–Prot and Unseen–Both test sets, our model remained robust, with AUPRC values of 20.00 and 18.19, respectively. On average, our model surpassed PSC–CPI6 by 7.01 in AUPRC and 8.66 in AUROC.

Table 1 Performance comparison of CPI on pattern prediction under four data splits on the original Karimi dataset

For binding-affinity prediction, our model achieved a lower RMSE and higher PCC than the baseline methods across all splits. In the Seen–Both split, it achieved an RMSE of 1.357 and a PCC of 0.748. In the Unseen–Comp and Unseen–Prot splits, the model generalized well, with RMSE values of 1.221 and 1.551, and PCCs of 0.787 and 0.526, respectively (Table 2). On average, our model outperformed PSC–CPI6, showing a 0.096 lower RMSE and a 0.050 higher PCC, thereby demonstrating strong performance and generalizability.

Table 2 Performance comparison of CPI on binding-affinity prediction under four data splits on the original Karimi dataset

Binding affinity and contact prediction on a split test set to evaluate sequence and structural similarities

While we expected worse performance on the Unseen–Comp split than on the Seen–Both split, because the former included compounds absent from the training data, the results showed comparable or even slightly better performance (Table 2). This unexpected outcome raised concerns about potential data leakage across splits. To address this issue, we implemented a more stringent dataset-splitting approach. Strict Tanimoto similarity thresholds were enforced for the compounds, while two distinct split settings were introduced for proteins: a sequence-hard setting based on sequence alignment scores and a structure-hard setting based on structural alignment scores.
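The Tanimoto similarity used for the compound thresholds is the ratio of shared to total fingerprint bits. A minimal sketch on sets of on-bit indices follows; the example bit sets are hypothetical stand-ins for real molecular fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0  # empty vs empty: identical

# hypothetical on-bit sets standing in for real fingerprints
sim = tanimoto({1, 4, 7, 9}, {1, 4, 8})  # 2 shared bits / 5 total bits = 0.4
```

In a hard split, a test compound is retained only if its maximum Tanimoto similarity to every training compound falls below the chosen threshold.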

Using both settings, we could more comprehensively evaluate the model's generalizability. The results were consistent with our expectations (Tables 3 and 4). In the sequence-hard setting, all direct prediction models (MONN12, Cross–Interaction20, PSC–CPI6, and GenSPARC) showed improved contact prediction performance (Table 3), with the best results on the Seen–Both test set. The MONN12, Cross–Interaction20, PSC–CPI6, and GraphBAN39 models exhibited the expected decrease in performance on the Unseen–Prot and Unseen–Both test sets. In contrast, GenSPARC outperformed the baseline models, with notable improvements on the Unseen–Prot and Unseen–Both test sets that pointed to robust generalizability, even under stringent split conditions.

Table 3 Performance comparison of CPI on pattern prediction under four data splits on the sequence-hard and structure-hard dataset settings
Table 4 Performance comparison of CPI on binding-affinity prediction under four data splits on the sequence-hard and structure-hard dataset settings

Similar results were obtained in the structure-hard setting (Table 3). While the baseline models showed reduced performance in the structure-hard Unseen–Prot and Unseen–Both configurations, GenSPARC improved in the Unseen–Prot split and maintained consistent effectiveness in the Unseen–Both split. Our model substantially outperformed PSC–CPI6, achieving gains of +7.82 in AUPRC and +9.13 in AUROC in the sequence-hard evaluations, alongside improvements of +7.08 in AUPRC and +8.86 in AUROC in the structure-hard assessments. Although GenSPARC showed slightly worse performance in Unseen–Comp under both sequence-hard and structure-hard configurations, these changes were minimal. Moreover, for binding-affinity prediction (Table 4), GenSPARC generally outperformed all baseline models across all evaluation settings, further validating its robustness under diverse and challenging conditions.

In conclusion, these comprehensive results demonstrate the superiority of GenSPARC for contact-map prediction and binding-affinity estimation. Our model outperformed the baselines on standard evaluation metrics and, more importantly, maintained exceptional performance under rigorous evaluation frameworks specifically designed to test true generalizability.

Case studies for CPI patterns

To evaluate the performance and generalization of GenSPARC, we examined its ability to reconstruct contact maps for two protein examples: prothrombin (UniProt ID: P00734) and macrophage metalloelastase (UniProt ID: P39900). Prothrombin represented the Unseen–Prot split, whereas macrophage metalloelastase represented the more challenging Unseen–Both split. We compared the predicted contact maps of GenSPARC with those of Cross–Interaction20 and PSC–CPI6 to assess the ability of the models to recover ground-truth interaction patterns. To ensure a fair evaluation, we ranked the interaction strengths, set a threshold to match the number of ground-truth interactions, and normalized the results (Fig. 2).
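The ranking-and-thresholding step described above can be sketched as follows: predicted interaction strengths are sorted, and exactly as many top-scoring (residue, atom) pairs are kept as there are ground-truth contacts. This is a simplified illustration of the procedure, not the exact evaluation code:

```python
def binarize_to_truth(scores, truth):
    """Keep the top-k predicted (residue, atom) pairs, where k is the number of
    ground-truth contacts, and return a binary map of the same shape."""
    n_res, n_atom = len(scores), len(scores[0])
    k = sum(v for row in truth for v in row)
    ranked = sorted(((scores[i][j], i, j) for i in range(n_res) for j in range(n_atom)),
                    reverse=True)
    keep = {(i, j) for _, i, j in ranked[:k]}
    return [[1 if (i, j) in keep else 0 for j in range(n_atom)] for i in range(n_res)]
```

Matching the number of predicted contacts to the ground truth makes the visual comparison across models fair, since no model benefits from simply predicting more contacts.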

Fig. 2: Comparison of predicted CPI contact maps across different models.

A, B For two representative protein-compound complexes, the interaction sites of their experimentally determined structures and structural formula of the compound, the ground-truth contact maps, and the predicted contact maps from Cross–Interaction, PSC–CPI, and GenSPARC are illustrated from left to right. In the structure shown in the left panel, the compound is depicted using a ball-and-stick representation, while the side chains and Cα atoms of the interacting amino acids are represented as sticks. Protein structures and compound structural formulae are annotated with residue numbers and atomic numbers corresponding to the contact map. The contact maps depict interactions between residue sites (proteins) and atomic sites (compounds), with darker hues indicating higher prediction accuracy. The AUPRC scores (higher is better) are displayed for each method. Prothrombin (PDB: 1BCU; A) and macrophage metalloelastase (PDB: 3RTT; B) were used to generate the displayed experimental structures and ground-truth contact maps.

GenSPARC demonstrated superior predictive accuracy for prothrombin compared with the baseline methods. Cross–Interaction20 failed to identify key interaction sites in the bottom-left region (corresponding to Val213–Cys220 in the protein and atoms 1–6 in the compound) or in the bottom-right corner (corresponding to Gly226 in the protein and atom 16 in the compound). While PSC–CPI6 showed partial recovery of the bottom-left interaction, it failed to identify contacts in the bottom-right region. In contrast, GenSPARC successfully identified both interaction regions and recovered the interaction at Gly226. For macrophage metalloelastase, all three models identified the upper-right interaction site (corresponding to Gly179–Ala182 in the protein and atoms 17–20 in the compound). However, noticeable differences were observed for the other contact regions. Both Cross–Interaction20 and PSC–CPI6 failed to recover the upper-left interaction (Gly179–Ala182 in the protein and atoms 1–8 in the compound) and the bottom interaction site (Tyr240 in the protein and atoms 13–16 in the compound). Although GenSPARC exhibited a slightly weaker signal in these regions, it successfully detected both contact sites, demonstrating an improved ability to identify CPIs in structurally difficult cases.

Additional binding-affinity prediction using Davis, KIBA, and Metz datasets

Next, we compared GenSPARC with previous models on CPI binding-affinity prediction using three public datasets—Davis27, KIBA28, and Metz29—under the sequence-hard Seen–Both, Unseen–Comp, Unseen–Prot, and Unseen–Both split settings. The mean squared error (MSE) and concordance index were used as evaluation metrics. We compared GenSPARC to DeepConvDTI15, GraphDTA13, HyperAttentionDTI18, TransformerCPI17, PerceiverCPI19, Cross–Interaction20, PSC–CPI6, and GraphBAN39. We followed the experimental setup of PerceiverCPI19, modifying the final layers of the binary classification models (e.g., TransformerCPI, DeepConvDTI, HyperAttentionDTI, and GraphBAN) for the regression tasks.
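The concordance index measures the fraction of comparable pairs that the model ranks in the same order as the true affinities. A minimal sketch of the standard definition, with tied predictions counted as 0.5:

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs (differing true affinities) that the model
    ranks in the same order; tied predictions count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal true affinities: pair is not comparable
            comparable += 1
            hi, lo = (i, j) if y_true[i] > y_true[j] else (j, i)
            if y_pred[hi] > y_pred[lo]:
                concordant += 1.0
            elif y_pred[hi] == y_pred[lo]:
                concordant += 0.5
    return concordant / comparable
```

A value of 1.0 indicates perfect ranking, 0.5 random ranking; unlike MSE, it is insensitive to the absolute scale of the predictions.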

For clarity, we focus on the Unseen–Both split setting shown in Table 5, as it is the most important setting for assessing generalizability; results for all splits (Seen–Both, Unseen–Comp, Unseen–Prot, and Unseen–Both) are provided in the Supplementary Information (Tables S1–S3). GenSPARC exhibited consistently superior performance across most unseen test settings, demonstrating strong generalization beyond the training distribution (Table 5, Tables S2 and S3). Under the Unseen–Both split, both Cross–Interaction20 and PSC–CPI6 demonstrated competitive MSE performance but were slightly outperformed by PerceiverCPI in terms of concordance index (Table 5). Notably, GenSPARC achieved superior MSE results across all datasets, delivering the best performance in four of six metrics and underscoring its effectiveness in CPI prediction under stringent evaluation conditions.

Table 5 Performance comparison of GenSPARC with state-of-the-art baselines for CPI strength prediction

Virtual screening using the DUD-E benchmark

We assessed the performance of GenSPARC on virtual screening tasks using the Database of Useful Decoys-Enhanced (DUD-E) benchmark, comparing it against a diverse set of approaches, including conventional docking methods, machine learning models, and DL approaches (Table 6). In contrast to the CPI prediction task discussed above, this benchmark includes models that utilize the experimentally determined complex structures and binding pocket information provided in the dataset. In the zero-shot setting, GenSPARC was benchmarked against the docking methods Glide-SP40 and AutoDock Vina7, the machine learning model RF-score41, and the DL models NNScore42, Pafnucy43, OnionNet44, Planet45, and the current state-of-the-art model DrugCLIP46. In the cross-validation setting, GenSPARC was compared to RF-score41, NNScore42, AutoDock Vina7, and the DL models 3D-CNN47, Graph CNN2, COSP48, DrugVQA49, and the state-of-the-art methods AttentionSiteDTI50 and DrugCLIP46.

Table 6 Virtual screening on DUD-E in the zero-shot setting

Conventional docking and structure-aware models, which rely heavily on experimentally determined protein-ligand pocket coordinates, often struggle to generalize to predicted structures. To systematically evaluate this limitation, we tested two versions each of AutoDock Vina7, a representative docking model, and DrugCLIP, a representative structure-aware model, under a zero-shot setting, maintaining the same model architecture but altering the input data source. Specifically, we evaluated Vina–PDB, using coordinates from experimentally determined Protein Data Bank (PDB) structures as input, and Vina–AF, using coordinates from AlphaFold2 structures as input. Similarly, we assessed DrugCLIP–PDB, using pocket coordinates from experimentally determined PDB structures, and DrugCLIP–AF, using pocket coordinates from AlphaFold2 structures. This comparison offered insights into the ability of AutoDock Vina and DrugCLIP to generalize from experimental to predicted protein structures, a key challenge in structure-based virtual screening. Likewise, we evaluated GenSPARC in two configurations: GenSPARC–PDB, using FoldSeek's structural alphabet generated from experimentally determined PDB structures, and GenSPARC–AF, using FoldSeek's structural alphabet generated from AlphaFold2 structures.

For cross-validation, GenSPARC used structure-aware sequences derived from experimentally determined PDB structures, following the data splitting and evaluation method of AttentionSiteDTI50. Although AUROC is a widely used metric for classification, it has been criticized for its limitations in virtual screening, as it does not prioritize the early retrieval of active compounds51,52. Therefore, we also employed the enrichment factor (EF)53, Boltzmann-enhanced discrimination of ROC (BEDROC)51, and ROC enrichment (RE)54,55 metrics to better capture early enrichment performance and the actual effectiveness of virtual screening models.
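The enrichment factor at a screening fraction x is the hit rate among the top x% of ranked compounds divided by the overall hit rate. A minimal sketch of this standard definition (the rounding at very small fractions may differ from the benchmark's exact implementation; the scores and labels in any usage are hypothetical):

```python
def enrichment_factor(scores, labels, fraction):
    """EF at a screening fraction: hit rate in the top slice / overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_top = max(1, round(len(ranked) * fraction))
    hit_rate_top = sum(label for _, label in ranked[:n_top]) / n_top
    hit_rate_all = sum(labels) / len(labels)
    return hit_rate_top / hit_rate_all
```

An EF of 1.0 means the screen performs no better than random selection; values well above 1.0 at small fractions indicate strong early retrieval of actives, which is what BEDROC and RE also reward.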

DrugCLIP–PDB achieved strong performance, with a BEDROC (α = 85) of 50.52 and EF values of 38.07 (EF 0.5%), 31.89 (EF 1%), and 10.66 (EF 5%) (Table 6). However, DrugCLIP–AF dropped to an EF value of 2.12 (EF 0.5%), indicating its dependence on precise experimental structures. GenSPARC–PDB achieved competitive results compared with DrugCLIP–PDB, with a BEDROC (α = 85) of 22.32 and EF values of 15.10 (EF 0.5%) and 12.82 (EF 1%). Focusing on the results using AlphaFold2 structures as input, all evaluated methods, including GenSPARC, DrugCLIP, and Vina, exhibited a performance drop compared with their performance using experimentally determined PDB structures, highlighting the limitations of predicted structures. Notably, GenSPARC–AF still maintained robust performance, reaching an EF value of 7.14 (EF 0.5%), the highest among the evaluated methods, including DrugCLIP–AF and Vina–AF. These findings indicate that GenSPARC maintains superior performance in virtual screening scenarios where experimentally determined structures are unavailable.

In the cross-validation setting, GenSPARC demonstrated competitive performance with state-of-the-art methods across all evaluation metrics, with a strong ability to identify hit molecules within a small fraction of the dataset (Table 7).

Table 7 Virtual screening on DUD-E in the cross-validation setting

Overall, GenSPARC exhibited robust performance on virtual screening tasks, whether using experimentally determined or predicted structures. Notably, GenSPARC showed strong performance on hit-identification metrics, highlighting its potential for practical applications in structure-based virtual screening.

Structure and property representations enhance generalizability

To evaluate the impact of structure- and property-aware representations on model performance, we conducted an ablation study comparing the full GenSPARC model with three simplified variants: (A) without protein structure awareness, using raw protein sequences processed with ESM-2 650M56 embeddings; (B) without the GCN, utilizing only SMILES-based transformer embeddings35 for compound representations; and (C) without the property encoder, omitting the 53 additional molecular properties (Table S4). These configurations were evaluated under the Unseen–Both split, the most challenging scenario for generalization. The results for the sequence-hard setting are shown in Fig. 3, while detailed results across all settings, including the original Karimi dataset, sequence-hard, and structure-hard configurations, are provided in Fig. S3. Results for all splits (Seen–Both, Unseen–Comp, Unseen–Prot, and Unseen–Both) are presented in Tables S5 and S6.

Fig. 3: Impact of structure awareness, GCN, and property encoder on model performance under sequence-hard settings.

A RMSE, B PCC, C AUPRC, and D AUROC metrics were evaluated for the full GenSPARC model and three alternative configurations: without structure awareness, without the GCN, and without the property encoder. Data are presented as the mean ± SD over three independent experiments with different random seeds.

In binding-affinity prediction, the full GenSPARC model achieved the best results, yielding the lowest RMSE and highest PCC (Fig. 3A, B). The ablations revealed that removing any single component, whether structure awareness, the GCN, or the property encoder, led to noticeable performance declines, highlighting the necessity of each for predictive accuracy and generalization.

For contact prediction, the full GenSPARC model consistently outperformed the alternatives across all splits. Although removing structure awareness or the property encoder had a minor impact on the original Karimi dataset, it led to significantly worse performance in the sequence-hard and structure-hard settings, emphasizing their role in enhancing model discrimination and generalization (Fig. 3 and S3). Notably, removing the GCN caused a sharp decline to an AUPRC of 10–11 (Fig. 3C), whereas AUROC (Fig. 3D) remained comparable to that in the full model. This indicates that the GCN plays a critical role in reducing false positives, which is crucial for accurate contact prediction.

Overall, the results from the Unseen–Both split highlight the critical role of integrating multiple components—structure awareness, graph-based molecular representations, and molecular properties—to achieve strong generalization in both contact and affinity prediction tasks. While structure awareness and the GCN had the largest impact on performance, the molecular property encoder provided valuable complementary information to further enhance the effectiveness of the model.

Discussion

In this study, we described GenSPARC, a model designed to enhance CPI prediction. GenSPARC achieves this goal by integrating global and local 3D structural information derived from protein sequences. By leveraging AlphaFold2 to predict protein structures, FoldSeek to assign structural letters to residues, and SaProt to generate embeddings, GenSPARC overcomes the limitations of previous methods that were restricted to local pockets. Additionally, the model incorporates the structural and biochemical properties of compounds through GCNs and a multimodal molecular model, significantly boosting its predictive performance. Extensive experiments demonstrated that GenSPARC achieved state-of-the-art performance across multiple benchmarks, including affinity and contact-map prediction tasks, on established datasets (i.e., Karimi, Davis, KIBA, and Metz), even under different splitting criteria. Notably, our stricter dataset splitting demonstrated the strong generalizability of GenSPARC while revealing the shortcomings of existing models, which struggled to perform consistently under more realistic conditions. These results underscore the need for benchmarks and datasets that better reflect the challenges of drug discovery and provide more accurate assessments of model capabilities. GenSPARC also performed comparably well on the DUD-E virtual screening benchmark, demonstrating its robustness in practical applications. These findings highlight the effectiveness of GenSPARC in overcoming the structural limitations of previous methods and tackling the dynamic complexities of CPIs. Nonetheless, certain limitations and challenges remain, and they provide directions for further improvement.

First, the structure and property encoder from SPMM uses primarily high-level molecular descriptors from RDKit, such as aromaticity, Gasteiger charges, and ring counts. Although effective for basic molecular properties, these features may not fully capture the complexity of 3D compound structures. Notably, in the sequence and structure-hard settings, AUPRC under the Unseen–Comp split showed significantly worse contact prediction (Table 3), whereas the Unseen–Prot split demonstrated stability and even improvements. This indicates that the compound encoder can be improved further. One potential approach is to incorporate more 3D compound information, such as spatial embeddings, to enhance performance. However, the direct use of absolute distances or coordinates may not be ideal because it can lead to poor generalization. They tend to overfit the experimental data and struggle to perform well on predicted structures such as those from AlphaFold2. A more robust approach would be to develop representations that remain consistent across both experimental and predicted structures. Developing robust representations, such as FoldSeek’s structural letters, which are consistent across predicted and experimental structures, could address this challenge. Second, although GenSPARC improved accuracy, some contact prediction results were still missing (Fig. 2). A potential strategy is to use predicted interactions as constraints in complex structure prediction models such as AlphaFold357 or Boltz-158, enabling the generation of plausible complex structures consistent with GenSPARC’s predictions, which can then be further refined to the level of molecular dynamics. Third, a related challenge concerns the balance between accuracy and generalization in structural representations. 
Although structure-aware models such as DrugCLIP achieved higher peak accuracy by directly using 3D coordinates, our use of FoldSeek’s 3Di representation inevitably sacrificed some fine-grained structural information. Nevertheless, FoldSeek’s structural alphabet has been shown through mutual information analyses to retain richer features than prior structural encodings, and, importantly, the discretization of local structures provides robustness against coordinate noise and enhances generalizability, as demonstrated in our results. Resolving this trade-off by designing representations that preserve high accuracy while ensuring robust generalization remains an important avenue for future research. One potential direction is the development of integrated pipelines, such as the recently introduced Boltz-259, which jointly perform structure prediction and affinity prediction, aiming for a more seamless integration of structural modeling with compound–protein interaction prediction. Comprehensive benchmarking, including CPI prediction models integrated with structure prediction pipelines, as well as further model development, will be essential to advance the field. Finally, another limitation of this work is the absence of intrinsically disordered proteins (IDPs) in the training datasets. Future work should address this issue by incorporating ensemble-based structural predictions of IDPs60, allowing GenSPARC to better handle proteins with intrinsic disorder and extend its applicability to a wider range of biological contexts.

Methods

Datasets

Original Karimi dataset

The binding affinity and contact-map prediction experiments were conducted primarily on a publicly available dataset known as Cross-Interaction11,20, which includes 4446 compound–protein pairs involving 1287 proteins and 3672 compounds collected from PDBbind61 and BindingDB62. To assess the model’s generalizability, test data were split into four subsets as described for PSC–CPI and Cross–Interaction, based on whether the proteins or compounds were part of the training set: (1) Seen–Both (591 pairs), in which both the compound and protein had been seen; (2) Unseen–Comp (521 pairs), in which only the protein had been seen; (3) Unseen–Prot (795 pairs), in which only the compound had been seen; and (4) Unseen–Both (202 pairs), in which neither had been seen.

To better evaluate the generalizability of the model, we introduced splits based on the similarity between proteins and compounds. First, we combined the four test subsets above into a single test set. For compounds, we calculated Tanimoto similarity between test and training compounds. For proteins, we considered both sequence and structural similarities. The data splitting process is illustrated in Fig. S1, while a detailed description is provided in Supplementary Methods. Fig. S2 shows the distribution of protein and compound similarity scores.

Davis, KIBA, and Metz datasets

Three widely recognized CPI benchmark datasets were employed to assess the efficacy of binding-affinity prediction: Davis27, KIBA28, and Metz29. The Davis dataset, which focuses on interactions between 68 kinase inhibitors and 442 kinases, contains dissociation constant (Kd) values; these were log-transformed to pKd to enhance compatibility with computational models. The KIBA dataset integrates various bioactivity measurements, including Kd, inhibition constant (Ki), and half-maximal inhibitory concentration (IC50). The Metz dataset offers a detailed analysis of kinase inhibitor screening across 170 protein kinases. It employs rigorous statistical methods to ensure the quality and reliability of the interaction data, making it a valuable resource for studying kinase-related interactions.

DUD-E dataset

We applied our method to a classic virtual screening task using the DUD-E63 dataset and selected existing DL-based methods and docking programs for comparison. The DUD-E dataset comprises 102 targets from various protein families. Each target provides a set containing an average of 224 active ligands (positive examples) and 10,000 decoy ligands (negative examples). The decoys were computationally selected to be physically similar but topologically different from the active ligands. This task enabled the assessment of model performance in virtual screening, where distinguishing active compounds from decoys is crucial.

Protein input and encoder (SaProt)

Following the structure-aware vocabulary of SaProt34, we enhanced the protein inputs by incorporating both the amino acid sequence and 3D geometric features. Traditional protein language models such as ESM-256 use only the amino acid sequence as input and therefore lack structural context. In contrast, SaProt enriches each amino acid with a structural token derived from FoldSeek, which encodes 3D interactions at the residue level. This addition introduces crucial structural information absent from conventional approaches.

Specifically, a protein’s primary sequence is represented as \(P=\left({r}_{1},{r}_{2},\ldots ,{r}_{n}\right)\), where \({r}_{i}\in V\) corresponds to the residue at position \(i\) in the amino acid alphabet \(V\). To incorporate the 3D structure, we utilized a structure alphabet \(S\), assigning each residue \({r}_{i}\) a structure token \({s}_{i}\in S\), yielding the structure-aware sequence \(P=\left({r}_{1}{s}_{1},{r}_{2}{s}_{2},\ldots ,{r}_{n}{s}_{n}\right)\). The alphabet \(S\) contained 20 structure tokens plus an “unknown” token (“#”), which denoted missing structural information and enabled a more comprehensive representation of the protein’s tertiary structure.
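As an illustration, the pairing of residue and structure tokens can be sketched as follows (the residue string and 3Di letters below are invented placeholders, not FoldSeek output):

```python
# Minimal sketch of building a structure-aware sequence: each residue r_i
# is paired with its FoldSeek structure letter s_i; '#' marks a residue
# with missing structural information.

def structure_aware_sequence(residues, structure_letters):
    """Return the token list (r_1 s_1, ..., r_n s_n)."""
    assert len(residues) == len(structure_letters)
    return [r + s for r, s in zip(residues, structure_letters)]

# Toy inputs; a real 3Di string would come from FoldSeek.
tokens = structure_aware_sequence("MKT", "dv#")
# tokens == ["Md", "Kv", "T#"]
```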

After constructing the structure-aware sequence, SaProt 650M, which was trained on AlphaFold2-predicted structures, was employed to extract the embeddings. SaProt retains the architecture and parameter size of 650M ESM-2 but expands the embedding layer with 441 structure-aware tokens, which replace the original 20 residue tokens. For each structure-aware sequence, SaProt generated embeddings by taking the last hidden state of the model (1280 dimensions). These embeddings were then projected into a lower-dimensional hidden space through a linear layer, yielding \({F}_{{prot}}\), which integrates seamlessly with the compound embeddings in subsequent model stages.

This structure-aware enhancement provided a richer and more informative representation of proteins, addressing the limitations of sequence-only models such as ESM.

Compound input and encoder

In the context of molecular compounds, we employed RDKit to convert the SMILES strings into molecular graphs \({G}_{C}=\left({V}_{C},{E}_{C}\right)\), where each atom in \({V}_{C}\) was represented as a node with a feature vector. The vector included key atomic properties, such as the atomic symbol, aromaticity, Gasteiger charge, and additional features listed in Table S7. Further details on transforming the SMILES strings into molecular graphs can be found in the Supplementary Information.
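The graph construction can be illustrated with a hand-built toy example (the atoms, bonds, and two-symbol feature vocabulary below are simplifications; the actual pipeline derives nodes, edges, and the full feature set of Table S7 from RDKit):

```python
# Toy sketch of G_C = (V_C, E_C) for ethanol, written out by hand.
# In practice RDKit parses the SMILES string "CCO" and supplies atomic
# properties such as aromaticity and Gasteiger charges.

atoms = ["C", "C", "O"]           # nodes V_C
bonds = [(0, 1), (1, 2)]          # undirected edges E_C

# Toy per-atom features: one-hot over a tiny symbol vocabulary.
vocab = {"C": 0, "O": 1}
features = []
for sym in atoms:
    one_hot = [0.0] * len(vocab)
    one_hot[vocab[sym]] = 1.0
    features.append(one_hot)

# Adjacency matrix A_C built from the bond list.
n = len(atoms)
adj = [[0.0] * n for _ in range(n)]
for i, j in bonds:
    adj[i][j] = adj[j][i] = 1.0
```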

The compound encoder is composed of three main elements: a molecular graph encoder, a molecular property encoder, and a fusion module that incorporates cross-attention and self-attention mechanisms.

The molecular graph encoder processes the molecular graph \({G}_{C}=\left({V}_{C},{E}_{C}\right)\) and learns a D-dimensional representation for each atom. As the compound encoder, we employed GCNs, a powerful and widely used variant of graph neural networks for feature extraction from graph data. Given a molecular graph \({G}_{C}=\left({V}_{C},{E}_{C}\right)\), a GCN takes the adjacency matrix \({{{{\mathcal{A}}}}}_{C}\) and node features \({{{{\mathcal{X}}}}}_{{{{\rm{C}}}}}\) as inputs and outputs a representation for each node. In our model, we used the following 3-layer GCN to process the molecular graph:

$${Z}_{{comp}}=\hat{{{{\mathcal{A}}}}}{{{\rm{\sigma }}}}\left(\hat{{{{\mathcal{A}}}}}{{{\rm{\sigma }}}}\left(\hat{{{{\mathcal{A}}}}}{{{{\mathcal{X}}}}}_{{{{\rm{C}}}}}{W}^{0}\right){W}^{1}\right){W}^{2}$$

where \(\sigma ={{{\rm{ReLU}}}}(\cdot )\) is the ReLU activation function. \(\hat{{{{\mathcal{A}}}}}={\hat{D}}^{-\frac{1}{2}}\left({{{{\mathcal{A}}}}}_{C}+I\right){\hat{D}}^{-\frac{1}{2}}\) represents the normalized adjacency matrix, where \({{{{\mathcal{A}}}}}_{C}\) is the adjacency matrix of the molecular graph, \(I\) the identity matrix, and \(\hat{D}\) the diagonal degree matrix of \({{{{\mathcal{A}}}}}_{C}+I\). \({W}^{0}\in {R}^{d\times D}\), \({W}^{1}\in {R}^{D\times D}\), and \({W}^{2}\in {R}^{D\times D}\) are the learnable parameter matrices, with \(D\) representing the hidden dimension size.
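A minimal NumPy sketch of this 3-layer GCN, with illustrative dimensions (d = 4 input features, D = 8 hidden units) and random weights in place of learned parameters:

```python
import numpy as np

# Sketch of Z_comp = A_hat σ(A_hat σ(A_hat X W0) W1) W2, where A_hat is
# the symmetrically normalized adjacency matrix with self-loops.

rng = np.random.default_rng(0)

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, degrees taken from A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W0, W1, W2):
    A_hat = normalize_adjacency(A)
    relu = lambda x: np.maximum(x, 0.0)   # σ = ReLU
    H = relu(A_hat @ X @ W0)
    H = relu(A_hat @ H @ W1)
    return A_hat @ H @ W2                 # Z_comp: one D-dim row per atom

n, d, D = 5, 4, 8                         # toy sizes
A = np.zeros((n, n)); A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0
X = rng.normal(size=(n, d))
W0, W1, W2 = (rng.normal(size=s) for s in [(d, D), (D, D), (D, D)])
Z = gcn_forward(A, X, W0, W1, W2)         # shape (n, D)
```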

The molecular property encoder was adopted directly from the pretrained property encoder in SPMM35. It processed 53 real-valued molecular properties (Table S4) and generated a 768-dimensional feature vector, denoted as \({Z}_{{prop}}\) (Supplementary Methods).

Finally, in the fusion module, we combined the compound embedding \({Z}_{{comp}}\) and property embedding \({Z}_{{prop}}\) using a molecular fusion module (Supplementary Methods). In this setup, the property embedding \({Z}_{{prop}}\) was treated as the query, whereas the compound embedding \({Z}_{{comp}}\) served as the key and value in the multi-head attention (MHA) mechanism. This strategy refines the compound embeddings based on the relevant molecular properties, making the representation aware of both structural and property information:

$${F}_{{comp}}=M{HA}\left({Z}_{{prop}},{Z}_{{comp}},{Z}_{{comp}}\right)$$
$${F}_{{comp}}={MHA}\left({F}_{{comp}},{F}_{{comp}},{F}_{{comp}}\right)$$
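The two attention steps can be sketched with a single head (the model uses multi-head attention with learned projections; the single head and a property vector already projected to the hidden dimension are simplifying assumptions of this sketch):

```python
import numpy as np

# Single-head scaled dot-product attention: the property embedding queries
# the per-atom compound embeddings (cross-attention), and the result is
# then refined by self-attention, mirroring the two MHA equations above.

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # softmax over keys
    return w @ V

rng = np.random.default_rng(1)
D = 8
Z_comp = rng.normal(size=(12, D))   # one row per atom (from the GCN)
Z_prop = rng.normal(size=(1, D))    # property embedding projected to D

F_comp = attention(Z_prop, Z_comp, Z_comp)   # property-guided cross-attention
F_comp = attention(F_comp, F_comp, F_comp)   # self-attention refinement
```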

MAN

To capture fine-grained binding information between a drug and target, our GenSPARC model utilized a fusion module to learn token-level interactions between the compound and protein representations provided by their respective encoders. As illustrated in Fig. 1C, we introduced a fusion module, termed MAN, which was designed to integrate protein and compound information. This module employs two distinct sets of projection matrices to generate the query, key, and value representations as follows:

$${Q}_{{comp}}={F}_{{comp}}{W}_{q}^{{comp}},{K}_{{comp}}={F}_{{comp}}{W}_{k}^{{comp}},{V}_{{comp}}={F}_{{comp}}{W}_{v}^{{comp}}$$
$${Q}_{{prot}}={F}_{{prot}}{W}_{q}^{{prot}},{K}_{{prot}}={F}_{{prot}}{W}_{k}^{{prot}},{V}_{{prot}}={F}_{{prot}}{W}_{v}^{{prot}}$$

where \({F}_{{comp}}\) and \({F}_{{prot}}\) represent compound and protein embeddings, respectively. The MAN combines MHA and cross-attention mechanisms to refine the representation of each residue (in the protein) and each atom (in the compound). The refined representations were computed as follows:

$${F}_{{prot}}^{* }=\frac{1}{2}\left[{\mbox{MHA}}\left({Q}_{{prot}},{K}_{{prot}},{V}_{{prot}}\right)+{\mbox{MHA}}\left({Q}_{{prot}},{K}_{{comp}},{V}_{{comp}}\right)\right]$$
$${F}_{{comp}}^{* }=\frac{1}{2}\left[{\mbox{MHA}}\left({Q}_{{comp}},{K}_{{comp}},{V}_{{comp}}\right)+{\mbox{MHA}}\left({Q}_{{comp}},{K}_{{prot}},{V}_{{prot}}\right)\right]$$
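A single-head sketch of the MAN update (the learned per-modality projections \(W_q\), \(W_k\), \(W_v\) are replaced by identity mappings to keep the example compact; the model uses multi-head attention):

```python
import numpy as np

# Each refined representation averages a self-attention term and a
# cross-attention term, as in the two equations above.

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # softmax over keys
    return w @ V

rng = np.random.default_rng(2)
D = 8
F_prot = rng.normal(size=(30, D))   # one row per residue
F_comp = rng.normal(size=(12, D))   # one row per atom

F_prot_star = 0.5 * (attention(F_prot, F_prot, F_prot)      # self
                     + attention(F_prot, F_comp, F_comp))   # cross
F_comp_star = 0.5 * (attention(F_comp, F_comp, F_comp)      # self
                     + attention(F_comp, F_prot, F_prot))   # cross
```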

Next, we applied residual connections and layer normalization, followed by a feed-forward network and another Add & Norm layer, to further refine the representations. Finally, we applied mean pooling to \({F}_{{comp}}^{* }\) and \({F}_{{prot}}^{* }\) separately, followed by concatenation. The final representation was computed as follows:

$${F}_{{prot}}^{* }={FFN}\left({LayerNorm}\left({F}_{{prot}}+{F}_{{prot}}^{* }\right)\right);$$
$${F}_{{comp}}^{* }={FFN}\left({LayerNorm}\left({F}_{{comp}}+{F}_{{comp}}^{* }\right)\right);$$
$${F}_{{prot}}^{* }={LayerNorm}\left({F}_{{prot}}+{F}_{{prot}}^{* }\right)$$
$${F}_{{comp}}^{* }={LayerNorm}\left({F}_{{comp}}+{F}_{{comp}}^{* }\right)$$
$${F}_{{joint}}={\mbox{Concat}}\left({\mbox{MeanPool}}\left({F}_{{comp}}^{* }\right),{\mbox{MeanPool}}\left({F}_{{prot}}^{* }\right)\right)$$

where the MeanPool operation aggregates the token-level representations into a fixed-size vector. This fused representation was then used for the final prediction task, in which the binding affinity \({y}^{{pred}}\) was calculated using a multilayer perceptron (MLP) regressor trained with an MSE loss function. The predicted affinity was computed as follows:

$${y}^{{pred}}={MLP}\left({F}_{{joint}}\right)$$
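The pooling and affinity head can be sketched as follows (the two-layer MLP and its layer sizes are illustrative choices, not the paper's configuration):

```python
import numpy as np

# Mean-pool the refined per-atom and per-residue representations,
# concatenate them into F_joint, and regress a scalar affinity.

rng = np.random.default_rng(3)
D = 8
F_comp_star = rng.normal(size=(12, D))   # refined atom embeddings
F_prot_star = rng.normal(size=(30, D))   # refined residue embeddings

F_joint = np.concatenate([F_comp_star.mean(axis=0),
                          F_prot_star.mean(axis=0)])   # shape (2D,)

# Toy 2-layer MLP regressor with random (untrained) weights.
W1, b1 = rng.normal(size=(2 * D, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
hidden = np.maximum(F_joint @ W1 + b1, 0.0)   # ReLU
y_pred = (hidden @ W2 + b2)[0]                # scalar binding affinity
```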

Pairwise interaction network

To train the model to predict the interaction pattern (i.e., contact map) between a protein and a compound, we represented these interactions as a 2D matrix. The protein embedding \({F}_{{\mbox{prot}}}\) and compound embedding \({F}_{{\mbox{comp}}}\) were transformed, and the interaction intensity between each residue and atom was obtained as the inner product of the transformed protein and compound representations. These intensities were normalized to produce the final contact map, denoted \({P}_{{\mbox{inter}}}\left[m,n\right]\), which represents the interaction between the \(m-{th}\) residue and the \(n-{th}\) atom.

This approach enabled the model to capture interaction patterns effectively and generate a detailed contact map. The predicted contact map was compared with the ground-truth contact map to calculate the loss, and was also used for visualization. The interaction intensity was normalized as follows:

$${P}_{{\mbox{inter}}}\left[m,n\right]=\frac{{P}^{* }\left[m,n\right]}{{\sum }_{i,j}{P}^{* }\left[i,j\right]}$$

where \({P}^{* }={\mbox{Sigmoid}}\left(({F}_{{prot}})({{F}_{{comp}}})^{T}\right)\).
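A NumPy sketch of this contact-map head, with illustrative dimensions (30 residues, 12 atoms) and random embeddings standing in for the learned transformations:

```python
import numpy as np

# Element-wise sigmoid of the residue-atom inner products, normalized so
# that all entries of the contact map sum to 1.

rng = np.random.default_rng(4)
F_prot = rng.normal(size=(30, 8))   # residues x D
F_comp = rng.normal(size=(12, 8))   # atoms x D

P_star = 1.0 / (1.0 + np.exp(-(F_prot @ F_comp.T)))  # Sigmoid, shape (30, 12)
P_inter = P_star / P_star.sum()                      # normalized contact map
```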

Training and evaluation

CPI task

The CPI prediction task comprises two key objectives: affinity prediction and interaction pattern prediction. Given the true binding affinity \({y}_{i}^{{true}}\) between the \(i-{th}\) protein and compound pair, the loss function for CPI strength prediction is defined as

$${{{{\mathcal{L}}}}}^{{affn}}=\frac{1}{N}{\sum}_{i=1}^{N}{\left|{y}_{i}^{{true}}-{y}_{i}^{{pred}}\right|}^{2},$$

Similarly, if the true interaction pattern \({P}_{{\mbox{true}}}\left[i,j\right]\) between the \(i-{th}\) residue and the \(j-{th}\) atom is known, the loss function for the CPI pattern prediction is defined as

$${{{{\mathcal{L}}}}}^{{inter}}=\, \frac{1}{M\cdot N} {\sum}_{i=1}^{M} {\sum}_{j=1}^{N}||{P}_{{{\rm{true}}}}\left[i,j\right]-{P}_{{{\rm{inter}}}}\left[i,j\right]{||}_{F}^{2}\\ +\beta \left(||{P}_{{{\rm{inter}}}}\left[i,j\right]{||}_{{{\rm{group}}}}+||{P}_{{{\rm{inter}}}}\left[i,j\right]{||}_{{{\rm{fused}}}}+||{P}_{{{\rm{inter}}}}\left[i,j\right]{||}_{1}\right),$$

where \(||{P}_{{\mbox{inter}}}\left[i,j\right]{||}_{{{{\rm{group}}}}}\)64, \(||{P}_{{\mbox{inter}}}\left[i,j\right]{||}_{{{{\rm{fused}}}}}\)65, and \(||{P}_{{\mbox{inter}}}\left[i,j\right]{||}_{1}\) are the three sparsity regularization terms used to control the sparsity of the interaction contact map \({P}_{{\mbox{inter}}}\left[i,j\right]\), as proposed by Karimi et al.11. We reported the AUPRC and AUROC for this task.

Rather than treating CPI affinity and interaction pattern prediction as two separate, independently trained tasks, we investigated the possibility of concurrently training them as a multitask learning model. We achieved this by integrating the two loss functions, \({{{{\mathcal{L}}}}}^{{affn}}\) and \({{{{\mathcal{L}}}}}^{{{{\rm{inter}}}}}\), into a unified loss function: \({{{{\mathcal{L}}}}}^{{integrate}}={{{{\mathcal{L}}}}}^{{affn}}+{{{{\mathcal{L}}}}}^{{inter}}\). The resulting model, denoted as GenSPARC–MT, enabled the training and prediction of both intermolecular affinity and atom–residue contacts within a unified framework.
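A simplified sketch of the integrated objective (the group- and fused-sparsity terms are omitted here for brevity; only the squared-error terms and the L1 penalty are shown, with an arbitrary β):

```python
import numpy as np

# L_integrate = L_affn + L_inter, where L_affn is the MSE over affinities
# and L_inter combines the contact-map squared error with an L1 sparsity
# penalty (beta and the toy inputs below are illustrative).

def integrated_loss(y_true, y_pred, P_true, P_inter, beta=1e-4):
    l_affn = np.mean((y_true - y_pred) ** 2)
    l_inter = np.mean((P_true - P_inter) ** 2) + beta * np.abs(P_inter).sum()
    return l_affn + l_inter

y_true = np.array([7.1, 5.3]); y_pred = np.array([6.9, 5.6])
P_true = np.eye(3); P_inter = np.full((3, 3), 1.0 / 9)
loss = integrated_loss(y_true, y_pred, P_true, P_inter)
```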

Virtual screening task

For virtual screening on the DUD-E dataset, we evaluated GenSPARC using two approaches: (1) zero-shot and (2) cross-validation. In the zero-shot setting, the model was trained on the PDBBind66 2019 dataset, as similarly employed in DrugCLIP46, with all targets present in DUD-E excluded from the training set. Given that the training set contained only positive protein–compound pairs, we employed the in-batch negative construction strategy from CLIP67 to generate negative examples; this strategy randomly pairs each protein with a different compound, based on a well-founded assumption68. For the cross-validation setting, we followed the 3DCNN47, DrugVQA49, and AttentionSiteDTI50 protocols to compare our model with benchmark methods. Specifically, we fine-tuned our model on the DUD-E dataset, performed 3-fold cross-validation, and reported the average performance across the evaluation metrics. Folds were split by target, with similar targets grouped in the same fold. Random undersampling was applied to the decoys to balance the training set, while the test sets were kept unbalanced for evaluation. We evaluated performance using several metrics from DrugCLIP, including AUROC, BEDROC, EF (at 0.5%, 1%, and 5%), and RE (at thresholds of 0.5%, 1%, 2%, and 5%). BEDROC assigns greater weight to top-ranked results, whereas EF and RE are commonly used virtual screening metrics. Details of these metrics are provided in Supplementary Methods.
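The in-batch negative construction can be sketched as follows (the pair identifiers are placeholders; in training, the re-pairing operates on embedded protein and compound batches):

```python
import random

# Within a batch of positive (protein, compound) pairs, each protein is
# re-paired with a compound drawn from a *different* pair in the same
# batch, yielding presumed-negative examples.

def in_batch_negatives(batch, seed=0):
    """batch: list of (protein_id, compound_id) positive pairs."""
    rng = random.Random(seed)
    compounds = [c for _, c in batch]
    negatives = []
    for i, (prot, _) in enumerate(batch):
        j = rng.choice([k for k in range(len(batch)) if k != i])
        negatives.append((prot, compounds[j]))
    return negatives

batch = [("P1", "C1"), ("P2", "C2"), ("P3", "C3")]
negs = in_batch_negatives(batch)
```

Note that this relies on the assumption that a randomly re-paired protein and compound are unlikely to bind, which reference 68 argues is well founded for diverse batches.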