Abstract
Graph-based molecular representation learning is essential for predicting molecular properties in drug discovery and materials science. Despite its importance, current approaches struggle with capturing the intricate molecular relationships and often rely on limited chemical knowledge during training. Multimodal fusion, which integrates information from graph and other data sources together, has emerged as a promising approach for enhancing molecular property prediction. However, existing studies explore only a narrow range of modalities, and the optimal integration stages for multimodal fusion remain largely unexplored. Furthermore, the reliance on auxiliary modalities poses challenges, as such data is often unavailable in downstream tasks. Here, we present MMFRL (Multimodal Fusion with Relational Learning), a framework designed to address these limitations by leveraging relational learning to enrich embedding initialization during multimodal pre-training. MMFRL enables downstream models to benefit from auxiliary modalities, even when these are absent during inference. We also systematically investigate modality fusion at early, intermediate, and late stages, elucidating their unique advantages and trade-offs. Using the MoleculeNet benchmarks, we demonstrate that MMFRL significantly outperforms existing methods with superior accuracy and robustness. Beyond predictive performance, MMFRL enhances explainability, offering valuable insights into chemical properties and highlighting its potential to transform real-world applications in drug discovery and materials science.
Introduction
Graph representation learning for molecules has gained significant attention in drug discovery and materials science, as it encapsulates molecular structures and enables the investigation of structure-activity relationships1,2,3,4,5,6. In this paradigm, atoms are treated as nodes and chemical bonds as edges, capturing the connectivity that defines molecular behavior. However, this paradigm poses significant challenges because of the intricate relationships among molecules and the limited chemical knowledge used during training.
Contrastive learning (CL) is often employed to study relationships among molecules. Most CL work on molecular graphs centers on 2D–2D graph comparisons. Representative examples include InfoGraph7, which maximizes the mutual information between the representations of a graph and its substructures to guide molecular representation learning; GraphCL8, MoCL9, and MolCLR10, which employ graph augmentation techniques to construct positive pairs; and MoLR11, which establishes positive pairs from reactant-product relationships. Beyond 2D–2D graph CL, there are also noteworthy efforts exploring 2D–3D and 3D–3D CL. 3DGCL12 is a 3D–3D CL model that establishes positive pairs from conformers of the same molecule. GraphMVP13, GeomGCL14, and 3D Infomax15 propose 2D–3D view CL approaches. In summary, 2D–2D and 3D–3D comparisons are intra-modality CL, as only one graph encoder is employed in these studies. However, these approaches often focus on the motif and graph levels, leaving atom-level CL less explored. For example, consider thalidomide: while the (R)- and (S)-enantiomers share the same topological graph and differ only at a single chiral center, their biological activities are drastically different; the (R)-enantiomer is effective in treating morning sickness, whereas the (S)-enantiomer causes severe birth defects. In other words, the (R)- and (S)-enantiomers are similar in topological structure but dissimilar in biological activity. Thus, a more sophisticated approach is required to tackle these scenarios. A potential solution would be to use continuous metrics within a multi-view space, enabling a more comprehensive understanding of these complex molecular relationships.
There are multiple approaches to such similarity learning. One is instance-wise discrimination, which directly assesses the similarity between instances based on their latent representations or features16. Naive instance-wise discrimination relies on pairwise similarity, which led to the development of contrastive loss17. Although improved loss functions exist, such as triplet loss18, quadruplet loss19, lifted structure loss20, N-pairs loss21, and angular loss22, these methods still fall short of thoroughly capturing relationships among multiple instances simultaneously23. To address this limitation, a joint multi-similarity loss has been proposed that weights each pair to enhance instance-wise discrimination23,24. However, these pair weightings require manually categorizing pairs as negative or positive, because distinct weights are assigned to losses based on their categories. Instead, we can borrow the idea of relational learning (RL)25 from computer vision, which encourages different augmented views of the same instance to share similar features while allowing for some variability. This approach captures the essential characteristics of the instance on a continuous scale, promoting relative consistency across the views without requiring them to be identical. By doing so, it enhances the model’s ability to generalize and recognize underlying patterns in the data.
In addition, because multi-view analysis from diverse sources is essential for improving molecular analysis, we can apply multimodal fusion26,27,28,29,30,31,32. It combines heterogeneous data (e.g., text, images, graphs) to create a more comprehensive understanding of complex scenarios. This approach leverages the strengths of each modality, potentially improving performance in tasks such as sentiment analysis or medical diagnosis. While challenging to implement because the different data streams must be aligned, successful fusion can provide insights beyond what is possible with single modalities, advancing AI and data-driven decision-making. In particular, the way different modalities are fused should also depend on the dominance of each modality30. However, when it comes to multimodal learning for molecules, we often encounter data availability and incompleteness issues. This raises a critical question: how can multimodal information be effectively leveraged for molecular property reasoning when such data are absent in downstream tasks? Recent studies have demonstrated the effectiveness of pretraining molecular graph neural networks (GNNs) by integrating additional knowledge sources10,33,34,35. Building on this foundation, a promising solution is to pretrain multiple replicas of molecular GNNs, with each replica dedicated to learning from a specific modality. This approach allows downstream tasks to benefit from multimodal data that are not accessible during fine-tuning, ultimately improving representation learning.
Facing these challenges and opportunities, we propose multimodal fusion (MMF) with relational learning (MMFRL) for molecular property prediction, a framework that combines RL and MMF, as shown in Fig. 1. RL utilizes a continuous relation metric to evaluate relationships among instances in the feature space36,37. Our contribution has three aspects. Conceptually: we introduce a modified relational learning (MRL) metric for molecular graph representation that offers a more comprehensive and continuous perspective on inter-instance relations, effectively capturing both localized and global relationships among instances. Methodologically: our proposed modified relational metric captures complex relationships by converting pairwise self-similarity into relative similarity, which evaluates how the similarity between two elements compares to the similarity of other pairs in the dataset. In addition, we integrate these metrics into a fused multimodal representation, which has the potential to enhance performance, allowing downstream tasks to leverage modalities that are not directly accessible during fine-tuning. Empirically: MMFRL excels in various downstream molecular property prediction tasks. Finally, we demonstrate the explainability of the learned representations through two post-hoc analyses. Notably, we explore minimum positive subgraphs (MPS) and maximum common subgraphs to gain insights for further drug molecule design.
Results
The effectiveness of pre-training
We first illustrate the impact of pre-training initialization on the performance of DMPNN38. As shown in Table 1, the average performance of the pre-trained models exceeds that of the non-pre-trained model in all tasks except Clintox. The results across downstream tasks indicate that different tasks may prefer different modalities. Notably, the model pre-trained with the NMR modality achieves the highest performance across three classification tasks. Similarly, the model pre-trained with the Image modality excels in three tasks, two of which are regression tasks related to solubility, aligning with findings from prior literature35. Additionally, the model pre-trained with the Fingerprint modality achieves the best performance in two tasks, including MUV, which has the largest dataset.
Overall performance of MMFRL
As shown in Tables 2 and 3, MMFRL outperforms all baseline models and the average performance of DMPNN pretrained with extra modalities across all 11 MoleculeNet tasks evaluated. The results in Supplementary Information Tables D.2 and D.3 show that MMFRL also compares favorably with the baseline models on the Directory of Useful Decoys: Enhanced (DUD-E)39 and LIT-PCBA40 datasets. This robust performance highlights the effectiveness of our approach in leveraging multimodal data. In particular, while individual models pre-trained on other modalities fail to outperform the model without pre-training on Clintox, the fusion of these pre-trained models leads to improved performance. Moreover, apart from Tox21 and Sider, the fusion models significantly enhance overall performance. The intermediate fusion model stands out by achieving the highest scores in seven distinct tasks, showcasing its ability to effectively combine features at a mid-level abstraction. The late fusion model achieves the top performance in two tasks. These results underscore the advantages of utilizing various fusion strategies in multimodal learning, further validating the efficacy of the MMFRL framework.
Analysis of the fusion effect
General comparison among various ways of fusions
Early fusion is employed during the pretraining phase and is easy to implement, as it aggregates information from different modalities directly. However, its primary limitation lies in the necessity for predefined weights assigned to each modality. These weights may not accurately reflect the relevance of each modality for the specific downstream tasks, potentially leading to suboptimal performance.
Intermediate fusion is able to capture the interaction between modalities early in the fine-tuning process, allowing for a more dynamic integration of information. This method can be particularly beneficial when different modalities provide complementary information that enhances overall performance. If the modalities effectively compensate for one another’s strengths and weaknesses, intermediate fusion may emerge as the most effective approach.
In contrast, late fusion enables each modality to be explored independently, maximizing the potential of individual modalities without interference from others. This separation allows for a thorough examination of each modality’s contribution. When certain modalities dominate the performance metrics, late fusion can capitalize on these strengths, ensuring that the most impactful information is utilized effectively. This approach is especially useful in scenarios where the dominance of specific modalities can be leveraged to enhance overall model performance.
In addition, we conduct an ablation study to evaluate the performance of our proposed loss functions against two traditional CL losses (contrastive loss and triplet loss) in the context of intermediate fusion. The experimental results in Supplementary Information Table D.1 demonstrate that our proposed methods outperform the baseline approaches across the majority of tasks in the MoleculeNet dataset.
Explainability of learnt representations
To demonstrate the interpretability of the learnt fused representations, we present post-hoc analyses for two tasks, ESOL and Lipo. The results show that the learnt representations can capture task-specific patterns and offer valuable insights for molecular design.
ESOL with intermediate fusion. As presented in Table 3, the intermediate fusion method (described in Section 5.3.2) exhibits superior performance on the ESOL regression task for predicting solubility. To further analyze this performance, we employed t-SNE to reduce the dimensionality of the molecule embeddings from 300 to 2, resulting in a heatmap visualized in Fig. 2. The embeddings derived from individual modalities prior to fusion do not display a clear pattern, showing no smooth transition from low to high solubility. In contrast, the embeddings produced by intermediate fusion reveal a distinct and smooth transition in solubility values: molecules with similar solubility cluster together, forming a gradient that extends from the bottom left (indicating lower solubility) to the upper center (representing higher solubility). This trend underscores the effectiveness of the intermediate fusion approach in accurately capturing the quantitative structure-activity relationships for aqueous solubility.
Fig. 1: This figure shows our proposed idea of how to transfer knowledge from other modalities and use fusion to improve performance further. Unlike the general contrastive learning framework shown in Supplementary Information Figure 1, MMFRL does not need to define positive or negative pairs and is capable of learning continuous ordering from target similarity. In early fusion, a single initialized GNN is created by combining all modality information during pretraining. In intermediate and late fusion, each modality has its own initialized GNN.
Fig. 2: Each point in the heatmap corresponds to the embedding of a molecule in ESOL, with color indicating solubility level. Red denotes higher solubility, while blue indicates lower solubility. The embeddings derived from individual modalities prior to fusion do not display a clear pattern; the embeddings produced by intermediate fusion form a gradient that extends from the bottom left (indicating lower solubility) to the upper center (representing higher solubility).
Additionally, we examined the similarity between the respective embeddings prior to intermediate fusion and the resulting fused embedding, as depicted in Fig. 3. Our analysis indicates that the embeddings from each modality exhibit low similarity with the intermediate-fused representation. This observation suggests that the modalities complement each other, collectively enhancing the resulting representation of the intermediate-fused embedding.
Lipo with late fusion. As detailed in Table 3, the Late Fusion method (described in Section 5.3.3) demonstrates superior performance on the Lipo regression task for predicting solubility in fats, oils, lipids, and non-polar solvents. According to Equation (11), the final prediction is determined by the respective coefficients (wi) and predictions (pi) from each modality.
In Fig. 4, we present the distribution of values for the coefficients, predictions, and their products for each modality. Notably, the simplified molecular input line entry system (SMILES) and Image modalities display a wide range of values, highlighting their potential to significantly influence the final predictions. This observation aligns with the strong performance achieved when pretraining using either of these two modalities, as shown in Table 1. In contrast, the NMRPeak values display a narrower range, indicating its role as a modifier for finer adjustments in the predictions. Furthermore, we observe that the contributions from NMRSpectrum and Fingerprint modalities are minimal, with their corresponding values approaching zero. This outcome highlights the advantages of the Late Fusion approach in effectively identifying and leveraging dominant modalities, thereby optimizing the overall predictive performance.
Substructure analysis with BACE. We explore the binding potential of positive inhibitor molecules targeting BACE and their associated key functional substructures, referred to as MPS. To identify MPS, we employ a Monte Carlo Tree Search (MCTS) approach integrated into our BACE classification model, as implemented in RationalRL41. MCTS, being an iterative process, allows us to evaluate each candidate substructure for its binding potential with our model. Following the determination of MPSs, we categorize the original positive BACE molecules based on their respective MPSs. By computing the binding potential difference between the original molecule and its MPS, we can identify structural features that contribute to changes in binding affinity as shown in Fig. 5.
In the case of the MPS 5 group, the binding score is heavily influenced by steric effects. The top three high-performing designs (5a-5c) all feature a flexible and compact alkylated pyrazole structure (colored green), which likely facilitates better accommodation within the binding pocket. In contrast, the three lowest-performing designs (5n-5p) incorporate a more rigid and bulky (trifluoromethoxy)benzene moiety (colored red), which may introduce steric hindrance and reduce binding efficiency. Additionally, the pyrazole ring contains two nitrogen atoms, offering more potential for hydrogen bonding interactions with the target protein, whereas the (trifluoromethoxy)benzene group has only one oxygen atom, limiting its capacity for such interactions. This comparison highlights the importance of both molecular flexibility and functional group composition in optimizing binding affinity.
Sensitivity Analysis
Choosing the most effective fusion strategy can be empirical. However, our results presented in Tables 2 and 3 and in Supplementary Information Tables D.1, D.2, and D.3 provide strong evidence that our lightweight fusion strategies (early, intermediate, and late fusion) outperform existing approaches in the literature. To guide the selection among these strategies, our intuition is as follows: if a modality is highly relevant to the downstream task, earlier fusion is likely to be more effective; otherwise, later fusion may be preferable.
To test this hypothesis, we performed a retrospective analysis to assess the sensitivity of downstream tasks to different fusion strategies. Since early fusion embeddings often lack the flexibility to adapt to individual samples, we excluded them from this analysis. Instead, we used pretrained encoders to extract embeddings for each modality and performed a simple linear regression between the embeddings and task labels. We then computed the Pearson correlation between the predicted values and the ground truth as a measure of each modality’s relevance.
For each dataset, we recorded the highest correlation across all modalities as the “Top 1” score. We then concatenated the embeddings from all modalities and repeated the regression analysis. The improvement in correlation is reported as the “Pearson Gain.” A higher Pearson Gain suggests that earlier fusion of multiple modalities is more beneficial. As shown in Table 4, datasets where intermediate fusion performs best generally exhibit higher Pearson Gain compared to late fusion, supporting our intuition. However, for ESOL and FreeSolv, the correlation from a single modality is already high, making them less suitable for this analysis.
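The Top 1 and Pearson Gain computation described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' implementation: the helper names (`pearson`, `fit_predict`, `pearson_gain`), the tiny ridge term, and the choice to score in-sample predictions are all assumptions, since the text does not state how the regression was fitted or evaluated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def fit_predict(X, y, ridge=1e-8):
    """Least-squares fit with intercept via normal equations; returns in-sample predictions."""
    A = [[1.0] + list(row) for row in X]
    d = len(A[0])
    # Augmented system [A^T A + ridge * I | A^T y]; the ridge term is a numerical safeguard
    M = [[sum(r[i] * r[j] for r in A) + (ridge if i == j else 0.0) for j in range(d)]
         + [sum(r[i] * t for r, t in zip(A, y))] for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(d):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    w = [M[i][d] for i in range(d)]
    return [sum(wi * ai for wi, ai in zip(w, row)) for row in A]

def pearson_gain(embeddings, labels):
    """embeddings: dict modality -> list of vectors. Returns (top1, gain)."""
    scores = {m: pearson(fit_predict(X, labels), labels) for m, X in embeddings.items()}
    top1 = max(scores.values())
    # Concatenate all modality embeddings per molecule and repeat the regression
    fused_X = [sum((list(embeddings[m][i]) for m in embeddings), [])
               for i in range(len(labels))]
    fused = pearson(fit_predict(fused_X, labels), labels)
    return top1, fused - top1

# Toy data: the label is the sum of two one-dimensional "modalities"
toy = {"a": [[1], [2], [3], [4], [5]], "b": [[2], [1], [4], [3], [6]]}
labels = [3, 3, 7, 7, 11]
top1, gain = pearson_gain(toy, labels)
```

In this toy case neither modality alone explains the label fully, so the concatenated regression yields a positive gain, which is the situation the text associates with earlier fusion being beneficial.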
Conclusion
In summary, we introduce an RL metric for molecular graph representation that enhances the understanding of inter-instance relationships by capturing both local and global contexts. Our method transforms pairwise self-similarity into relative similarity through a weighting function, allowing for complex relational insights. This metric is integrated into a multimodal representation, improving performance by utilizing modalities not directly accessible during fine-tuning. Empirical results show that our approach, MMFRL, excels in various molecular property prediction tasks. We also present a detailed study of the explainability of the learned representations, offering valuable insights for drug molecule design. Despite these accomplishments, further exploration is needed to achieve more effective integration of graph- and node-level similarities. Looking ahead, we are enthusiastic about the prospect of applying our model to additional fields, such as social science, thereby broadening its applicability and impact.
Dataset
Selected modalities for target similarity calculation
The following modalities are used for target similarity calculation. For details on training the corresponding encoders to obtain fixed embeddings for these modalities, please refer to Supplementary Information Section C.
Fingerprint
Fingerprints are binary vectors that represent molecular structures, capturing the presence or absence of particular substructures, fragments, or chemical features within a molecule. In particular, we utilize Morgan fingerprints, which are based on the extended-connectivity fingerprints (ECFP) method introduced by Rogers and Hahn42. Specifically, we generate fingerprints using AllChem.GetMorganFingerprintAsBitVect(mol, 2), which corresponds to ECFP4 (radius = 2). We choose ECFP4 because it is one of the most effective and interpretable molecular representations43.
Simplified molecular input line entry system (SMILES)
SMILES offers a compact textual representation of chemical structures.
Nuclear magnetic resonance (NMR)
NMR spectroscopy provides detailed insights into the chemical environment of atoms within a molecule44. By analyzing the interactions of atomic nuclei with an applied magnetic field, NMR can reveal information about the structure, dynamics, and interactions of molecules, including the connectivity of atoms, functional groups, and conformational changes. In our experiments, NMRSpectrum provides information about the molecule as a whole, while NMRPeak provides information about the individual atoms in the molecule.
Image
Images (e.g., 2D chemical structures) provide a visual representation of molecular structures.
The similarity calculations for all of the modalities above are described in Supplementary Information Section C.
Pre-training
NMRShiftDB-245 is a comprehensive database dedicated to nuclear magnetic resonance (NMR) chemical shift data, providing researchers with an extensive collection of expert-annotated NMR data for various organic compounds with molecular structures (SMILES) (accessed June 2023). Around 25,000 molecules are used for pre-training, and they have no overlap with the downstream task datasets.
Downstream tasks
For downstream tasks, our model was trained on 11 drug discovery-related benchmarks from MoleculeNet46. Eight of these benchmarks are classification tasks (BBBP, BACE, Sider, Clintox, HIV, MUV, Tox21, and ToxCast), and three are regression tasks (ESOL, FreeSolv, and Lipo). The datasets were divided into train/validation/test sets at a ratio of 80%:10%:10% using the scaffold splitter47 from Chemprop38,48, as in previous work. The scaffold splitter groups molecular data by substructure (scaffold), ensuring diverse structures in each set. Molecules are partitioned into bins by scaffold, and bins larger than half of the test set size are assigned to training, which promotes scaffold diversity in the validation and test sets. The remaining bins are randomly allocated until the desired set sizes are reached, creating multiple scaffold splits for comprehensive evaluation.
The DUD-E dataset39 is a widely used benchmark for virtual screening, containing 102 protein targets, thousands of active compounds, and carefully selected decoys that resemble actives in physico-chemical properties but differ topologically. In contrast, Low-Throughput Informatics-Targeted PubChem BioAssay (LIT-PCBA)40 offers a more realistic and challenging benchmark, derived from real experimental assays across 15 targets, with no artificial decoys and inherent data noise and imbalance. Together, they represent two ends of the spectrum in virtual screening evaluation: DUD-E with idealized conditions, and LIT-PCBA with real-world complexity. For the fine-tuning setting, we follow the same split and test approach as ref. 49 for DUD-E and ref. 50 for LIT-PCBA.
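The bin-based scaffold splitting procedure described above can be sketched roughly as follows. This is a simplified illustration loosely modeled on the description, not the Chemprop code itself: the scaffold keys are assumed to be precomputed strings (in practice they would be derived from each molecule's structure), and the function name and signature are hypothetical.

```python
import random
from collections import defaultdict

def scaffold_split(scaffolds, sizes=(0.8, 0.1, 0.1), seed=0):
    """Split molecule indices so that molecules sharing a scaffold stay together.

    scaffolds: list with one (assumed precomputed) scaffold key per molecule.
    Returns three lists of indices: train, validation, test.
    """
    n = len(scaffolds)
    train_size = round(sizes[0] * n)
    val_size = round(sizes[1] * n)
    test_size = n - train_size - val_size

    # Bin molecule indices by scaffold key
    bins = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        bins[key].append(idx)

    # Large bins are placed first, so they fill the training set while it has room
    big, small = [], []
    for group in bins.values():
        if len(group) > val_size / 2 or len(group) > test_size / 2:
            big.append(group)
        else:
            small.append(group)

    rng = random.Random(seed)
    rng.shuffle(big)
    rng.shuffle(small)

    train, val, test = [], [], []
    for group in big + small:
        if len(train) + len(group) <= train_size:
            train += group
        elif len(val) + len(group) <= val_size:
            val += group
        else:
            test += group
    return train, val, test

# Toy example: two large scaffolds and eight singletons (20 molecules in total)
toy_scaffolds = ["A"] * 8 + ["B"] * 4 + list("CDEFGHIJ")
train, val, test = scaffold_split(toy_scaffolds)
```

Changing `seed` reshuffles the bins and produces a different split, which is one way to obtain several scaffold splits for repeated evaluation.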
Methods
We first explain the preliminaries and then present our proposed modified metric in RL, which facilitates smooth alignment between the graph and the reference modality. We then introduce approaches for integrating multiple modalities at different stages of the learning process.
Molecular representation with DMPNN
The message passing neural network (MPNN)51 is a GNN model that processes an undirected graph G with node (atom) features \(x_v\) and edge (chemical bond) features \(e_{vw}\). It operates through two distinct phases: a message passing phase, facilitating information transmission across the molecule to construct a neural representation, and a readout phase, utilizing the final representation to make predictions regarding properties of interest. The primary distinction between DMPNN and a generic MPNN lies in the message passing phase. While MPNN uses messages associated with nodes, DMPNN crucially differs by employing messages associated with directed edges38. This design choice is motivated by the necessity to prevent totters52, eliminating messages passed along paths of the form \(v_1 v_2 \ldots v_n\), where \(v_i = v_{i+2}\) for some i, thereby avoiding unnecessary loops in the message passing trajectory.
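To make the directed-edge idea concrete, the following toy sketch updates one hidden state per directed edge and, when computing the message for edge v→w, aggregates the edges entering v while skipping the reverse edge w→v. Scalar hidden states, a fixed `weight`, and the ReLU update are simplifying assumptions for illustration; the actual DMPNN uses learned weight matrices and vector-valued states.

```python
def directed_message_passing(neighbors, h0, steps=2, weight=0.5):
    """Toy directed-edge message passing on scalar edge states.

    neighbors: dict node -> list of neighbouring nodes (undirected graph).
    h0: dict (v, w) -> initial hidden state of the directed edge v->w.
    """
    h = dict(h0)
    for _ in range(steps):
        new_h = {}
        for (v, w) in h:
            # Sum states of edges k->v entering v, excluding the reverse edge w->v.
            # This exclusion is what prevents a message from bouncing straight back.
            message = sum(h[(k, v)] for k in neighbors[v] if k != w)
            new_h[(v, w)] = max(0.0, h0[(v, w)] + weight * message)  # ReLU update
        h = new_h
    return h

# Path graph a - b - c with every directed edge initialised to 1.0
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h0 = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
h = directed_message_passing(neighbors, h0, steps=3)
```

On this path graph the edge a→b never receives anything, because its only incoming candidate is the reverse edge b→a; a node-based scheme would instead let information travel a→b→a and accumulate with each step.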
MRL in pretraining
Original relational learning25 ensures that different augmented views of the same instance in computer vision tasks share similar features, while allowing for some variability. Suppose \(\mathbf{z}_i\) is the original embedding of the i-th instance. Then \(\mathbf{z}_i^{1}\) is the embedding of the first augmented view of \(\mathbf{z}_i\), and \(\mathbf{z}_i^{2}\) is the embedding of the second augmented view of \(\mathbf{z}_i\). In this case, the RL loss is formulated as follows:
We propose a modified relational metric by adapting the softmax function as a pairwise weighting mechanism. Let \(|\mathcal{S}|\) denote the size of the instance set. The variable \(s_{i,j}\) represents the learned similarity, where \(\mathbf{z}_i\) is the embedding to be trained. In contrast, \(t_{i,j}^{R}\) defines the target similarity that captures the relationship between the pair of instances in the given space or modality R, where \(\mathbf{z}_i^{R}\) is a fixed embedding. The detailed formulation of the MRL loss is provided below:
Notably, unlike other similarity learning approaches23,24, our method does not rely on the categorization of negative and positive pairs for the pair weighting function. Additionally, our use of the softmax function ensures that the generalized target similarity \(t_{i,j}\) satisfies the conditions for convergence, which results in better ranking consistency between the graph modality and the auxiliary modality than the original relational learning, as follows:
Theorem 5.3 (Convergence of MRL Metric)
Let \(\mathcal{S}\) be a set of instances of size \(|\mathcal{S}|\), and let \(\mathcal{P}\) represent the learnable latent representations of the instances in \(\mathcal{S}\) such that \(|\mathcal{P}| = |\mathcal{S}|\). For any two instances \(i, j \in \mathcal{S}\), their respective latent representations are denoted by \(\mathcal{P}_i\) and \(\mathcal{P}_j\). Let \(t_{i,j}\) represent the target similarity between instances i and j in a given domain, and let \(d_{i,j}\) be the similarity between \(\mathcal{P}_i\) and \(\mathcal{P}_j\) in the latent space. If \(t_{i,j}\) is non-negative and \(\{t_{i,j}\}\) satisfies the constraint \(\sum_{j=1}^{|\mathcal{S}|} t_{i,j} = 1\), consider the loss function for an instance i defined as follows:
then, when the loss reaches its ideal optimum, the relationship between \(t_{i,j}\) and \(d_{i,j}\) satisfies:
For a detailed proof, please refer to Supplementary Information Section A.
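As an illustration of why a normalized, non-negative target matters, the sketch below assumes the per-instance loss is a cross-entropy between the target similarities and the softmax of the learned similarities. This specific form is an assumption made for illustration and is not taken from the text; the authors' exact loss and the precise statement at the optimum are given in the paper and its Supplementary Information. Under this assumed loss, plain gradient descent drives softmax(d) toward t.

```python
from math import exp, log

def softmax(row):
    """Numerically stable softmax over one row of similarities."""
    m = max(row)
    e = [exp(x - m) for x in row]
    total = sum(e)
    return [x / total for x in e]

def relational_loss(d_row, t_row):
    """Assumed loss: cross-entropy between targets t_row and softmax(d_row)."""
    s_row = softmax(d_row)
    return -sum(t * log(s) for t, s in zip(t_row, s_row) if t > 0)

def fit_similarities(t_row, steps=500, lr=0.5):
    """Gradient descent on d_row; for this loss the gradient is softmax(d) - t."""
    d_row = [0.0] * len(t_row)
    for _ in range(steps):
        s_row = softmax(d_row)
        d_row = [d - lr * (s - t) for d, s, t in zip(d_row, s_row, t_row)]
    return d_row

t_row = [0.5, 0.3, 0.2]      # toy target similarities for one instance (sum to 1)
d_row = fit_similarities(t_row)
s_row = softmax(d_row)       # approaches t_row as the assumed loss is minimised
```

Because softmax is invariant to adding a constant to every entry, the learned similarities are only determined up to an additive shift in this sketch; what is recovered is the relative ordering and spacing implied by the target.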
Fusion of multi-modality information in downstream tasks
During pre-training, the encoders are initialized with parameters derived from distinct reference modalities. A critical question that arises is how to effectively utilize these pre-trained models during the fine-tuning stage to improve performance on downstream tasks.
Early stage: multimodal multi-similarity
With a set of known target similarities \(\{t^{R}\}\) from various modalities, we can transform them into a multimodal space through a fusion function. There are numerous potential designs for the fusion function. For simplicity, we take a linear combination as a demonstration. The multimodal generalized multi-similarity \(t_{i,j}^{M}\) between the ith and jth objects can be defined as follows:
\[ t_{i,j}^{M} = \sum_{R} w_{R}\, t_{i,j}^{R} \]
where \(t_{i,j}^{R}\) represents the target similarity between the ith and jth instances in the unimodal space R, \(w_R\) is the pre-defined weight for the corresponding modality, and \(\sum_{R} w_R = 1\). We can then set \(t_{i,j} = t_{i,j}^{M}\) in equation (3), so that it still satisfies the convergence requirement (see the proof in Supplementary Information Section A). In this case, the learnt similarity during pretraining will be aligned with this new combined target similarity.
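A small sketch of the linear combination used in early fusion is given below. The modality names, matrices, and weights are made-up toy values, and the function name is hypothetical. It also illustrates the normalization property: if every unimodal target row sums to one and the weights are non-negative and sum to one (non-negativity is an assumption here), each fused row also sums to one.

```python
def fuse_target_similarity(targets, weights):
    """Linearly combine per-modality target similarity matrices.

    targets: dict modality -> square matrix (list of rows), each row summing to 1.
    weights: dict modality -> predefined weight; the weights must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("modality weights must sum to 1")
    n = len(next(iter(targets.values())))
    return [[sum(weights[r] * targets[r][i][j] for r in targets) for j in range(n)]
            for i in range(n)]

# Toy target similarities for two molecules in two modalities
toy_targets = {
    "smiles": [[0.7, 0.3], [0.4, 0.6]],
    "image": [[0.5, 0.5], [0.2, 0.8]],
}
toy_weights = {"smiles": 0.25, "image": 0.75}
fused = fuse_target_similarity(toy_targets, toy_weights)
```

Because the weights are fixed before pretraining, this fused target cannot adapt to a particular downstream task, which is the limitation of early fusion noted in the results.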
Intermediate stage: embedding concatenation and fusion
Intermediate fusion integrates features from various modalities after their individual encoding processes and prior to the decoding/readout stage. Let \(f_1, f_2, \ldots, f_n\) represent the feature vectors obtained from these different modalities. The resulting fused feature vector can be defined as follows:
\[ f_{\text{fused}} = \mathrm{MLP}\left(\mathrm{concat}(f_1, f_2, \ldots, f_n)\right) \]
where concat represents concatenation of the feature vectors. The fused features are then fed into a later readout function or decoder for downstream task prediction or classification. The multi-layer perceptron reduces the dimension to match that of \(f_i\).
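The concatenate-then-project step can be sketched as below. The two-layer perceptron with random placeholder weights, the hidden width, and the helper names are assumptions chosen only to show the shapes involved; in practice the projection would be trained jointly with the downstream task.

```python
import random

def concat(features):
    """Concatenate a list of per-modality feature vectors into one vector."""
    return [x for f in features for x in f]

def linear(in_dim, out_dim, rng):
    """Return a dense layer as a closure; the random weights are placeholders."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) + bi
                      for row, bi in zip(W, b)]

def make_fusion_mlp(n_modalities, dim, hidden=8, seed=0):
    """Build a two-layer MLP mapping n_modalities * dim inputs back to dim outputs."""
    rng = random.Random(seed)
    layer1 = linear(n_modalities * dim, hidden, rng)
    layer2 = linear(hidden, dim, rng)

    def fuse(features):
        h = [max(0.0, v) for v in layer1(concat(features))]  # ReLU activation
        return layer2(h)

    return fuse

# Three toy modality embeddings of dimension 3 each
f1, f2, f3 = [0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]
fuse = make_fusion_mlp(n_modalities=3, dim=3)
f_fused = fuse([f1, f2, f3])
```

The fused vector has the same dimension as each input embedding, so the existing readout function can consume it without modification.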
Late stage: decision-level
Late fusion (or decision-level fusion) combines the outputs of models trained on different modalities after they have been processed independently. Each modality is first processed separately, and its predictions are combined at a later stage.
Let \(p_1, p_2, \ldots, p_n\) be the predictions (e.g., probabilities) from different modalities. The final prediction \(p_{\text{final}}\) can be computed using a weighted sum mechanism:
\[ p_{\text{final}} = \sum_{i=1}^{n} w_i\, p_i \tag{11} \]
where \(w_i\) are the weights assigned to each modality’s prediction, and they can be adjusted based on the importance of each modality. In particular, \(w_i\) is tunable during the learning process for the respective downstream tasks.
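The weighted-sum rule, together with one possible way of tuning the weights, can be sketched as follows. The text only states that the weights are tunable; the squared-error objective and the single gradient step shown here are illustrative assumptions, and the function names are hypothetical.

```python
def late_fusion(predictions, weights):
    """Final prediction as the weighted sum of per-modality predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

def update_weights(predictions, weights, target, lr=0.01):
    """One gradient step on 0.5 * (p_final - target)^2 with respect to the weights."""
    error = late_fusion(predictions, weights) - target
    return [w - lr * error * p for w, p in zip(weights, predictions)]

# Toy per-modality predictions for one molecule and initial fusion weights
preds = [1.0, 2.0, 3.0]
weights = [0.5, 0.25, 0.25]
before = late_fusion(preds, weights)
new_weights = update_weights(preds, weights, target=3.0)
after = late_fusion(preds, new_weights)
```

Inspecting the learned weights and the products of weight and prediction per modality is what allows the dominant modalities to be identified, as in the Lipo analysis.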
Data availability
The pretraining data can be downloaded from NMRShiftDB2. The MoleculeNet dataset is available at MoleculeNet. The DUD-E dataset can be accessed at DUD-E, and the LIT-PCBA dataset can be downloaded from LIT-PCBA.
Code availability
The code is available on GitHub: https://github.com/zhengyjo/MMFRL.
References
Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discov. 19, 353–364 (2020).
Wieder, O. et al. A compact review of molecular property prediction with graph neural networks. Drug Discov. Today. Technol. 37, 1–12 (2020).
Zhang, Z. et al. Graph neural network approaches for drug-target interactions. Curr. Opin. Struct. Biol. 73, 102327 (2022).
Fang, X. et al. Geometry-enhanced molecular representation learning for property prediction. Nat. Mach. Intell. 4, 127–134 (2022).
Wang, Y. et al. Motif-based graph representation learning with application to chemical molecules. In: Proc. Informatics, vol. 10, 8 (MDPI, 2023).
Chen, Y. et al. DrugDAGT: a dual-attention graph transformer with contrastive learning improves drug-drug interaction prediction. BMC Biol. 22, 233 (2024).
Sun, F.-Y. et al. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In: Proc. International Conference on Learning Representations (2019).
You, Y. et al. Graph contrastive learning with augmentations. Adv. Neural Inf. Process. Syst. 33, 5812–5823 (2020).
Sun, M., Xing, J., Wang, H., Chen, B. & Zhou, J. MOCL: data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph. In: Proc. 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 3585–3594 (ACM, 2021).
Wang, Y., Wang, J., Cao, Z. & Barati Farimani, A. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).
Wang, H. et al. Chemical-Reaction-Aware Molecule Representation Learning. In: Proc. 10th International Conference on Learning Representations (ICLR) (2022).
Moon, K., Im, H.-J. & Kwon, S. 3D graph contrastive learning for molecular property prediction. Bioinformatics 39, btad371 (2023).
Liu, S. et al. Pre-training Molecular Graph Representation with 3D Geometry. In: Proc. International Conference on Learning Representations (2022).
Li, S., Zhou, J., Xu, T., Dou, D. & Xiong, H. GeomGCL: Geometric graph contrastive learning for molecular property prediction. In: Proc. Thirty-Sixth AAAI Conference on Artificial Intelligence, 4541–4549 (2022).
Stärk, H. et al. 3D Infomax improves GNNs for molecular property prediction. In: Proc. International Conference on Machine Learning, 20479–20502 (PMLR, 2022).
Wu, Z., Xiong, Y., Yu, S. X. & Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742 (IEEE, 2018).
Hadsell, R., Chopra, S. & LeCun, Y. Dimensionality reduction by learning an invariant mapping. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 1735–1742 (IEEE, 2006).
Hoffer, E. & Ailon, N. Deep metric learning using triplet network. In: Proc. Similarity-Based Pattern Recognition Third International Workshop, SIMBAD, 84–92 (Springer, 2015).
Law, M. T., Thome, N. & Cord, M. Quadruplet-wise image similarity learning. In: Proc. IEEE International Conference on Computer Vision, 249–256 (IEEE, 2013).
Oh Song, H., Xiang, Y., Jegelka, S. & Savarese, S. Deep metric learning via lifted structured feature embedding. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 4004–4012 (IEEE, 2016).
Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. Adv. Neural Inf. Process. Syst. 29, 1857–1865 (2016).
Wang, J., Zhou, F., Wen, S., Liu, X. & Lin, Y. Deep metric learning with angular loss. In: Proc. IEEE International Conference on Computer Vision, 2593–2601 (IEEE, 2017).
Wang, X., Han, X., Huang, W., Dong, D. & Scott, M. R. Multi-similarity loss with general pair weighting for deep metric learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5022–5030 (IEEE, 2019).
Zhang, L. et al. Jointly multi-similarity loss for deep metric learning. In: Proc. IEEE International Conference on Data Mining (ICDM), 1469–1474 (IEEE, 2021).
Zheng, M. et al. RESSL: relational self-supervised learning with weak augmentation. Adv. Neural Inf. Process. Syst. 34, 2543–2555 (2021).
Lahat, D., Adali, T. & Jutten, C. Multimodal data fusion: an overview of methods, challenges, and prospects. Proc. IEEE 103, 1449–1477 (2015).
Khaleghi, B., Khamis, A., Karray, F. O. & Razavi, S. N. Multisensor data fusion: a review of the state-of-the-art. Inf. Fusion 14, 28–44 (2013).
Poria, S., Cambria, E. & Gelbukh, A. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In: Proc. Conference on Empirical Methods in Natural Language Processing, 2539–2544 (2015).
Ramachandram, D. & Taylor, G. W. Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Process. Mag. 34, 96–108 (2017).
Pawłowski, M., Wróblewska, A. & Sysko-Romańczuk, S. Effective techniques for multimodal data fusion: a comparative analysis. Sensors 23, 2381 (2023).
Manzoor, M. A. et al. Multimodality representation learning: a survey on evolution, pretraining and its applications. ACM Trans. Multimed. Comput. Commun. Appl. 20, 1–34 (2023).
Priessner, M. et al. Enhancing molecular structure elucidation: multimodaltransformer for both simulated and experimental spectra (2024).
Wang, Y., Min, Y., Shao, E. & Wu, J. Molecular graph contrastive learning with parameterized explainable augmentations. In: Proc. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1558–1563 (IEEE, 2021).
Liu, H., Huang, Y., Liu, X. & Deng, L. Attention-wise masked graph contrastive learning for predicting molecular property. Brief. Bioinforma. 23, bbac303 (2022).
Yifei, W., Li, Y., Liu, L., Hong, P. & Xu, H. Advancing Drug Discovery with Enhanced Chemical Understanding via Asymmetric Contrastive Multimodal Learning. J. Chem. Inf. Model. ASAP https://doi.org/10.1021/acs.jcim.5c00430 (2025).
Balcan, M.-F. & Blum, A. On a theory of learning with similarity functions. In: Proc. 23rd international conference on Machine learning, 73–80 (ICML, 2006).
Wen, Y. et al. Pairwise similarity learning is simple. In: Proc. IEEE/CVF International Conference on Computer Vision, 5308–5318 (IEEE, 2023).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Mysinger, M. M., Carchia, M., Irwin, J. J. & Shoichet, B. K. Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking. J. Med. Chem. 55, 6582–6594 (2012).
Tran-Nguyen, V.-K., Jacquemard, C. & Rognan, D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J. Chem. Inf. Model. 60, 4263–4273 (2020).
Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. In: Proc. 37th International Conference on Machine Learning (PMLR, 2020).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Zhong, S. & Guan, X. Count-based Morgan fingerprint: a more efficient and interpretable molecular representation in developing machine learning-based predictive regression models for water contaminants’ activities and properties. Environ. Sci. Technol. 57, 18193–18202 (2023).
Bunzel, M. & Ralph, J. NMR characterization of lignins isolated from fruit and vegetable insoluble dietary fiber. J. Agric. Food Chem. 54, 8352–8361 (2006).
Steinbeck, C., Krause, S. & Kuhn, S. NMRShiftDB constructing a free chemical information system with open-source components. J. Chem. Inf. Comput. Sci. 43, 1733–1739 (2003).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 17, 490–519 (1996).
Heid, E. et al. Chemprop: a machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17 (2023).
Gao, B. et al. DrugCLIP: Contrastive protein-molecule representation learning for virtual screening. Adv. Neural Inf. Process. Syst. 36, 44595–44614 (2023).
Cai, H., Zhang, H., Zhao, D., Wu, J. & Wang, L. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief. Bioinforma. 23, bbac408 (2022).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In: Proc. International Conference on Machine Learning, 1263–1272 (PMLR, 2017).
Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L. & Vert, J.-P. Extensions of marginalized graph kernels. In: Proc. twenty-first International Conference on Machine Learning, 70 (ICML, 2004).
Author information
Contributions
Z.Z. contributed to idea generation, algorithm design, code implementation, data analysis, manuscript writing, editing, and revision. H.Xu contributed to idea generation, algorithm design, data analysis, manuscript writing, editing, supervision and revision. Y.L. assisted in designing appropriate experiments, compiling experimental results, and revising both the manuscript and the response letter. P.H. supported the project by providing access to the computational resources essential for conducting large-scale experiments and model development. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Chemistry thanks Dingyan Wang and the other, anonymous, reviewers for their contribution to the peer review of this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhou, Z., Li, Y., Hong, P. et al. Multimodal fusion with relational learning for molecular property prediction. Commun Chem 8, 200 (2025). https://doi.org/10.1038/s42004-025-01586-z