Introduction

Graph representation learning for molecules has gained significant attention in drug discovery and materials science, as it effectively encapsulates molecular structures and enables the investigation of structure-activity relationships1,2,3,4,5,6. In this paradigm, atoms are treated as nodes and chemical bonds as edges, capturing the connectivities that define molecular behavior. However, this task remains challenging because of the intricate relationships among molecules and the limited chemical knowledge utilized during training.

Contrastive Learning (CL) is often employed to study relationships among molecules. Work on CL for molecular graphs has centered on 2D–2D graph comparisons. Representative examples include: InfoGraph7, which maximizes the mutual information between the representations of the graph and its substructures to guide molecular representation learning; GraphCL8, MoCL9, and MolCLR10, which employ graph augmentation techniques to construct positive pairs; and MoLR11, which establishes positive pairs from reactant-product relationships. Beyond 2D–2D graph CL, there are also noteworthy efforts exploring 2D–3D and 3D–3D CL. 3DGCL12 is a 3D–3D CL model that establishes positive pairs from conformers of the same molecule, while GraphMVP13, GeomGCL14, and 3D Informax15 propose 2D–3D view CL approaches. In summary, 2D–2D and 3D–3D comparisons are intra-modality CL, as only one graph encoder is employed in these studies. However, these approaches often focus on the motif and graph levels, leaving atom-level CL less explored. For example, consider Thalidomide: the (R)- and (S)-enantiomers share the same topological graph and differ only at a single chiral center, yet their biological activities are drastically different: the (R)-enantiomer is effective in treating morning sickness, whereas the (S)-enantiomer causes severe birth defects. In other words, the two enantiomers are similar in topological structure but dissimilar in biological activity. A more sophisticated approach is therefore required to handle such scenarios. A potential solution is to use continuous metrics within a multi-view space, enabling a more comprehensive understanding of these complex molecular relationships.

There are multiple approaches for such similarity learning. One of them is instance-wise discrimination, which directly assesses the similarity between instances based on their latent representations or features16. Naive instance-wise discrimination relies on pairwise similarity, which led to the development of contrastive loss17. Although improved loss functions exist, such as triplet loss18, quadruplet loss19, lifted structure loss20, N-pairs loss21, and angular loss22, these methods still fall short of thoroughly capturing relationships among multiple instances simultaneously23. To address this limitation, a joint multi-similarity loss has been proposed, incorporating a weighting for each pair to enhance instance-wise discrimination23,24. However, these pair weightings require the manual categorization of negative and positive pairs, as distinct weights are assigned to losses based on their categories. Here, we can borrow the idea of relational learning (RL)25 from computer vision, which maps different augmented views of the same instance to similar features while allowing for some variability. This approach captures the essential characteristics of the instance on a continuous scale, promoting relative consistency across the views without requiring them to be identical. By doing so, it enhances the model's ability to generalize and recognize underlying patterns in the data.

Besides, since multi-view analysis from diverse sources is essential for improving molecule analysis, we can apply multimodal fusion26,27,28,29,30,31,32. It combines diverse heterogeneous data (e.g., text, images, graphs) to create a more comprehensive understanding of complex scenarios. This approach leverages the strengths of each modality, potentially improving performance in tasks such as sentiment analysis or medical diagnosis. While challenging to implement due to the need to align different data streams, successful fusion can provide insights beyond what is possible with single modalities, advancing AI and data-driven decision-making. In particular, the way different modalities are fused should also depend on the dominance of each unimodality30. However, multimodal learning for molecules often encounters data availability and incompleteness issues. This raises a critical question: how can multimodal information be effectively leveraged for molecular property reasoning when such data are absent in downstream tasks? Recent studies have demonstrated the effectiveness of pretraining molecular graph neural networks (GNNs) by integrating additional knowledge sources10,33,34,35. Building on this foundation, a promising solution is to pretrain multiple replicas of molecular GNNs, with each replica dedicated to learning from a specific modality. This approach allows downstream tasks to benefit from multimodal data that is not accessible during fine-tuning, ultimately improving representation learning.

Facing these challenges and opportunities, we propose multimodal fusion (MMF) with relational learning (MMFRL) for molecular property prediction, a framework featuring RL and MMF, as shown in Fig. 1. RL utilizes a continuous relation metric to evaluate relationships among instances in the feature space36,37. Our contribution comprises three aspects. Conceptually, we introduce a modified relational learning (MRL) metric for molecular graph representation that offers a more comprehensive and continuous perspective on inter-instance relations, effectively capturing both localized and global relationships among instances. Methodologically, our modified relational metric captures complex relationships by converting pairwise self-similarity into relative similarity, which evaluates how the similarity between two elements compares to the similarity of other pairs in the dataset. In addition, we integrate these metrics into a fused multimodal representation, which has the potential to enhance performance by allowing downstream tasks to leverage modalities that are not directly accessible during fine-tuning. Empirically, MMFRL excels in various downstream tasks for molecular property prediction. Last but not least, we demonstrate the explainability of the learned representations through two post-hoc analyses. Notably, we explore minimum positive subgraphs (MPS) and maximum common subgraphs to gain insights for further drug molecule design.

Results

The effectiveness of pre-training

We first illustrate the impact of pre-training initialization on the performance of DMPNN38. As shown in Table 1, the average performance of the pre-trained models outperforms the non-pre-trained model in all tasks except Clintox. The results across downstream tasks indicate that different tasks may prefer different modalities. Notably, the model pre-trained with the NMR modality achieves the highest performance in three classification tasks. Similarly, the model pre-trained with the Image modality excels in three tasks, two of which are regression tasks related to solubility, aligning with findings from prior literature35. Additionally, the model pre-trained with the Fingerprint modality achieves the best performance in two tasks, including MUV, which has the largest dataset.

Table 1 Study on the performances of MMFRL-Unimodality

Overall performance of MMFRL

As shown in Tables 2 and 3, MMFRL demonstrates superior performance compared to all baseline models and to the average performance of DMPNN pretrained with extra modalities across all 11 tasks evaluated in MoleculeNet. Results in Supplementary Information Tables D.2 and D.3 demonstrate strong performance relative to the baseline models on the Directory of Useful Decoys: Enhanced (DUD-E)39 and LIT-PCBA40 datasets. This robust performance highlights the effectiveness of our approach in leveraging multimodal data. In particular, while the individual models pre-trained on other modalities fail to outperform the no-pre-training model on Clintox, the fusion of these pre-trained models leads to improved performance. Moreover, apart from Tox21 and Sider, the fusion models significantly enhance overall performance. In particular, the intermediate fusion model stands out by achieving the highest scores in seven distinct tasks, showcasing its ability to effectively combine features at a mid-level of abstraction. The late fusion model achieves the top performance in two tasks. These results underscore the advantages of utilizing various fusion strategies in multimodal learning, further validating the efficacy of the MMFRL framework.

Table 2 Overall performances (ROC-AUC) on classification downstream tasks
Table 3 Overall performances (RMSE) on regression downstream tasks. The best results are denoted in bold, and the second-best are indicated with underlining

Analysis of the fusion effect

General comparison among fusion strategies

Early fusion is employed during the pretraining phase and is easy to implement, as it aggregates information from different modalities directly. However, its primary limitation lies in the necessity for predefined weights assigned to each modality. These weights may not accurately reflect the relevance of each modality for the specific downstream tasks, potentially leading to suboptimal performance.

Intermediate fusion is able to capture the interaction between modalities early in the fine-tuning process, allowing for a more dynamic integration of information. This method can be particularly beneficial when different modalities provide complementary information that enhances overall performance. If the modalities effectively compensate for one another’s strengths and weaknesses, intermediate fusion may emerge as the most effective approach.

In contrast, late fusion enables each modality to be explored independently, maximizing the potential of individual modalities without interference from others. This separation allows for a thorough examination of each modality's contribution. When certain modalities dominate the performance metrics, late fusion can capitalize on these strengths, ensuring that the most impactful information is utilized effectively. This approach is especially useful in scenarios where the dominance of specific modalities can be leveraged to enhance overall model performance.

In addition, we conduct an ablation study to evaluate the performance of our proposed loss functions against two traditional CL losses, contrastive loss and triplet loss, in the context of intermediate fusion. The experimental results, shown in Supplementary Information Table D.1, demonstrate that our proposed methods outperform the baseline approaches across the majority of tasks in the MoleculeNet dataset, highlighting the superiority of our approach.

Explainability of learnt representations

To demonstrate the interpretability of the learnt fused representations, we present post-hoc analyses for two tasks, ESOL and Lipo. The results show that the learnt representations capture task-specific patterns and offer valuable insights for molecular design.

ESOL with intermediate fusion. As presented in Table 3, the intermediate fusion method (Section 5.3.2) exhibits superior performance on the ESOL regression task for predicting solubility. To further analyze this performance, we employed t-SNE to reduce the dimensionality of the molecule embeddings from 300 to 2, resulting in the heatmap visualized in Fig. 2. The embeddings derived from individual modalities prior to fusion do not display a clear pattern, showing no smooth transition from low to high solubility. In contrast, the embeddings from intermediate fusion reveal a distinct and smooth transition in solubility values: molecules with similar solubility cluster together, forming a gradient that extends from the bottom left (lower solubility) to the upper center (higher solubility). This trend underscores the effectiveness of the intermediate fusion approach in accurately capturing the quantitative structure-activity relationships for aqueous solubility.

Fig. 1: Multimodal fusion with relational learning for molecular property prediction (MMFRL).

This figure illustrates our proposed idea of transferring knowledge from other modalities and using fusion to further improve performance. Unlike the general contrastive learning framework shown in Supplementary Information Figure 1, MMFRL does not need to define positive or negative pairs and is capable of learning a continuous ordering from target similarity. In early fusion, a single initialized GNN is created by combining all modality information during pretraining. In intermediate and late fusion, each modality has its own initialized GNN.

Fig. 2: T-SNE visualization depicting the ESOL molecule embeddings for intermediate fusion in Section 5.3.2 alongside molecules within the highlighted region.

Each point in the heatmap corresponds to the embeddings of respective molecules in ESOL, with color indicating solubility levels. Red denotes higher solubility, while blue indicates lower solubility. The embeddings derived from individual modalities prior to fusion do not display a clear pattern; the embeddings by intermediate fusion form a gradient that extends from the bottom left (indicating lower solubility) to the upper center (representing higher solubility).

Additionally, we examined the similarity between the respective embeddings prior to intermediate fusion and the resulting fused embedding, as depicted in Fig. 3. Our analysis indicates that the embeddings from each modality exhibit low similarity with the intermediate-fused representation. This observation suggests that the modalities complement each other, collectively enhancing the resulting representation of the intermediate-fused embedding.

Fig. 3: This figure shows the distribution of similarities between each modality and the intermediate fusion embedding for ESOL.

In both cosine similarity and dot product, the embeddings from each modality exhibit low similarity with the intermediate-fused representation.

Lipo with late fusion. As detailed in Table 3, the late fusion method (described in Section 5.3.3) demonstrates superior performance on the Lipo regression task, which predicts solubility in fats, oils, lipids, and non-polar solvents. According to Equation (11), the final prediction is determined by the respective coefficients (wi) and predictions (pi) from each modality.

In Fig. 4, we present the distribution of values for the coefficients, predictions, and their products for each modality. Notably, the simplified molecular input line entry system (SMILES) and Image modalities display a wide range of values, highlighting their potential to significantly influence the final predictions. This observation aligns with the strong performance achieved when pretraining with either of these two modalities, as shown in Table 1. In contrast, the NMRpeak values display a narrower range, indicating its role as a modifier for finer adjustments to the predictions. Furthermore, the contributions from the NMRspectrum and Fingerprint modalities are minimal, with their corresponding values approaching zero. This outcome highlights the advantage of the late fusion approach in effectively identifying and leveraging dominant modalities, thereby optimizing overall predictive performance.

Fig. 4: Lipo late fusion contribution analysis reveals that the three primary contributors are SMILES, image, and NMRpeak.

In contrast, NMRspectrum and fingerprint exhibit negligible contributions.

Substructure analysis with BACE. We explore the binding potential of positive inhibitor molecules targeting BACE and their associated key functional substructures, referred to as MPS. To identify MPS, we employ a Monte Carlo Tree Search (MCTS) approach integrated into our BACE classification model, as implemented in RationalRL41. MCTS, being an iterative process, allows us to evaluate each candidate substructure for its binding potential with our model. Following the determination of MPSs, we categorize the original positive BACE molecules based on their respective MPSs. By computing the binding potential difference between the original molecule and its MPS, we can identify structural features that contribute to changes in binding affinity as shown in Fig. 5.

Fig. 5: The left sub-figure is the boxplot of the binding difference for the respective groups of molecules, grouped by the top eight most frequent minimum positive subgraphs.

The right sub-figure shows the detailed structure of the 5th MPS.

In the case of the MPS 5 group, the binding score is heavily influenced by steric effects. The top three high-performing designs (5a-5c) all feature a flexible and compact alkylated pyrazole structure (colored green), which likely facilitates better accommodation within the binding pocket. In contrast, the three lowest-performing designs (5n-5p) incorporate a more rigid and bulky (trifluoromethoxy)benzene moiety (colored red), which may introduce steric hindrance and reduce binding efficiency. Additionally, the pyrazole ring contains two nitrogen atoms, offering more potential for hydrogen bonding interactions with the target protein, whereas the (trifluoromethoxy)benzene group has only one oxygen atom, limiting its capacity for such interactions. This comparison highlights the importance of both molecular flexibility and functional group composition in optimizing binding affinity.

Sensitivity analysis

Choosing the most effective fusion strategy can be empirical. However, our results presented in Tables 2 and 3 and Supplementary Information Tables D.1, D.2, and D.3 provide strong evidence that our lightweight fusion strategies (early, intermediate, and late fusion) outperform existing approaches in the literature. To guide the selection among these strategies, our intuition is as follows: if a modality is highly relevant to the downstream task, earlier fusion is likely to be more effective; otherwise, later fusion may be preferable.

To test this hypothesis, we performed a retrospective analysis to assess the sensitivity of downstream tasks to different fusion strategies. Since early fusion embeddings often lack the flexibility to adapt to individual samples, we excluded them from this analysis. Instead, we used pretrained encoders to extract embeddings for each modality and performed a simple linear regression between the embeddings and task labels. We then computed the Pearson correlation between the predicted values and the ground truth as a measure of each modality’s relevance.

For each dataset, we recorded the highest correlation across all modalities as the "Top 1" score. We then concatenated the embeddings from all modalities and repeated the regression analysis. The improvement in correlation is reported as the "Pearson Gain". A higher Pearson Gain suggests that earlier fusion of multiple modalities is more beneficial. As shown in Table 4, datasets where intermediate fusion performs best generally exhibit higher Pearson Gain than those where late fusion performs best, supporting our intuition. However, for ESOL and FreeSolv, the correlation from a single modality is already high, making them less suitable for this analysis.
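The following is a minimal sketch of this retrospective probe, assuming the per-modality embeddings are available as NumPy arrays keyed by modality name; the in-sample fit is a simplification for illustration.

```python
# Hypothetical sketch of the Pearson Gain computation: fit a linear probe
# per modality, then on the concatenated embeddings, and report the gain.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def probe_correlation(X, y):
    """Pearson r between linear-regression predictions and labels."""
    preds = LinearRegression().fit(X, y).predict(X)
    return pearsonr(preds, y)[0]

def pearson_gain(embeddings_by_modality, y):
    """embeddings_by_modality: dict of modality name -> (n_samples, dim) array."""
    top1 = max(probe_correlation(X, y) for X in embeddings_by_modality.values())
    X_all = np.concatenate(list(embeddings_by_modality.values()), axis=1)
    return probe_correlation(X_all, y) - top1, top1
```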

Table 4 Pearson correlation of different modalities and chosen fusion strategies across datasets

Conclusion

In summary, we introduce an RL metric for molecular graph representation that enhances the understanding of inter-instance relationships by capturing both local and global contexts. Our method transforms pairwise self-similarity into relative similarity through a weighting function, allowing for complex relational insights. This metric is integrated into a multimodal representation, improving performance by utilizing modalities not directly accessible during fine-tuning. Empirical results show that our approach, MMFRL, excels in various molecular property prediction tasks. We also present a detailed study of the explainability of the learned representations, offering valuable insights for drug molecule design. Despite these accomplishments, further exploration is needed to achieve more effective integration of graph- and node-level similarities. Looking ahead, we are enthusiastic about the prospect of applying our model to additional fields, such as social science, thereby broadening its applicability and impact.

Dataset

Selected modalities for target similarity calculation

The following modalities are used for target similarity calculation. For details on training the corresponding encoders to obtain fixed embeddings for these modalities, please refer to Supplementary Information Section C.

Fingerprint

Fingerprints are binary vectors that represent molecular structures, capturing the presence or absence of particular substructures, fragments, or chemical features within a molecule. In particular, we utilize Morgan fingerprints, which are based on the extended-connectivity fingerprints (ECFP) method introduced by Rogers and Hahn42. Specifically, we generate fingerprints using AllChem.GetMorganFingerprintAsBitVect(mol, 2), which corresponds to ECFP4 (radius = 2); we chose ECFP4 because it is one of the most effective and interpretable molecular representations43.
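For concreteness, the snippet below shows this fingerprint generation with RDKit; the 2048-bit width and the Tanimoto comparison are illustrative assumptions, as the exact similarity used is defined in Supplementary Information Section C.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

# ECFP4 fingerprints (Morgan, radius 2); nBits=2048 is an assumed width.
mol_a = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
mol_b = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")      # paracetamol
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

# Bit vectors support standard similarity measures such as Tanimoto.
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))
```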

Simplified molecular input line entry system (SMILES)

SMILES offers a compact textual representation of chemical structures.

Nuclear magnetic resonance (NMR)

NMR spectroscopy provides detailed insights into the chemical environment of atoms within a molecule44. By analyzing the interactions of atomic nuclei with an applied magnetic field, NMR can reveal information about the structure, dynamics, and interactions of molecules, including the connectivity of atoms, functional groups, and conformational changes. In our experiments, NMRspectrum provides information about the molecule as a whole, while NMRpeak provides information about the individual atoms within the molecule.

Image

Images (e.g., 2D chemical structures) provide a visual representation of molecular structures.

All of the similarity calculations for the modalities above are listed in Supplementary Information Section C.

Pre-training

NMRShiftDB-245 is a comprehensive database dedicated to nuclear magnetic resonance (NMR) chemical shift data, providing researchers with an extensive collection of expert-annotated NMR data for various organic compounds together with their molecular structures (SMILES) (accessed June 2023). Around 25,000 molecules are used for pre-training, with no overlap with the downstream task datasets.

Downstream tasks

For downstream tasks, our model was trained on 11 drug discovery-related benchmarks sourced from MoleculeNet46. Eight of these benchmarks were designated for classification tasks, including BBBP, BACE, SIDER, CLINTOX, HIV, MUV, TOX21, and ToxCast, while three were allocated for regression tasks, namely ESOL, FreeSolv, and Lipo. The datasets were divided into train/validation/test sets using a ratio of 80%:10%:10% via the scaffold splitter47 from Chemprop38,48, following previous works. The scaffold splitter categorizes molecular data based on substructures, ensuring diverse structures in each set. Molecules are partitioned into bins, with those exceeding half of the test set size assigned to training, promoting scaffold diversity in the validation and test sets. The remaining bins are randomly allocated until the desired set sizes are reached, creating multiple scaffold splits for comprehensive evaluation.
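A minimal sketch of this splitting logic is given below, assuming Bemis-Murcko scaffolds from RDKit define the bins; the actual experiments use Chemprop's splitter.

```python
import random
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_val=0.1, seed=0):
    # Bin molecules by their Bemis-Murcko scaffold.
    bins = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        bins[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)

    n = len(smiles_list)
    test_size = n - int(n * frac_train) - int(n * frac_val)
    # Bins larger than half the test set go straight to training,
    # keeping the validation/test scaffolds diverse.
    big = [b for b in bins.values() if len(b) > test_size / 2]
    small = [b for b in bins.values() if len(b) <= test_size / 2]
    random.Random(seed).shuffle(small)

    train = [i for b in big for i in b]
    val, test = [], []
    for b in small:  # allocate the remaining bins at random
        if len(train) + len(b) <= frac_train * n:
            train.extend(b)
        elif len(val) + len(b) <= frac_val * n:
            val.extend(b)
        else:
            test.extend(b)
    return train, val, test
```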

The DUD-E dataset39 is a widely used benchmark for virtual screening, containing 102 protein targets, thousands of active compounds, and carefully selected decoys that resemble actives in physico-chemical properties but differ topologically. In contrast, Low-Throughput Informatics-Targeted PubChem BioAssay (LIT-PCBA)40 offers a more realistic and challenging benchmark, derived from real experimental assays across 15 targets, with no artificial decoys and with inherent data noise and imbalance. Together, they represent two ends of the spectrum in virtual screening evaluation: DUD-E with idealized conditions, and LIT-PCBA with real-world complexity. For the fine-tuning setting, we follow the same split and test approach as ref. 49 for DUD-E and ref. 50 for LIT-PCBA.

Methods

We first explain the preliminaries and then present our proposed modified metric for RL, which facilitates smooth alignment between the graph modality and a reference unimodality. We then introduce approaches for integrating multiple modalities at different stages of the learning process.

Molecular representation with DMPNN

The message passing neural network (MPNN)51 is a GNN model that processes an undirected graph G with node (atom) features xv and edge (chemical bond) features evw. It operates in two phases: a message passing phase, which transmits information across the molecule to construct a neural representation, and a readout phase, which uses the final representation to make predictions about properties of interest. The primary distinction between DMPNN and a generic MPNN lies in the message passing phase: while MPNN uses messages associated with nodes, DMPNN employs messages associated with directed edges38. This design choice is motivated by the need to prevent totters52, i.e., messages passed along paths of the form \(v_1 v_2 \cdots v_n\) where \(v_i = v_{i+2}\) for some i, thereby avoiding unnecessary loops in the message passing trajectory.
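The sketch below illustrates the directed-edge aggregation under stated assumptions (edge-indexed hidden states and a precomputed reverse-edge index); it is a simplification of the full DMPNN update, which also re-injects the initial edge features.

```python
import torch

def dmpnn_message_step(h_edge, edges, rev_index, W):
    """One simplified DMPNN message-passing step.

    h_edge:    (E, d) hidden state for each directed edge v->w
    edges:     list of (v, w) node-index pairs, one per directed edge
    rev_index: rev_index[e] = index of the reverse edge w->v
    W:         (d, d) learnable weight matrix
    """
    n_nodes = 1 + max(max(v, w) for v, w in edges)
    incoming = torch.zeros(n_nodes, h_edge.shape[1])
    for e, (v, w) in enumerate(edges):
        incoming[w] += h_edge[e]          # sum of edge states arriving at w
    messages = torch.empty_like(h_edge)
    for e, (v, w) in enumerate(edges):
        # Aggregate edges k->v but subtract the reverse edge w->v,
        # so a message never bounces straight back (no totters).
        messages[e] = incoming[v] - h_edge[rev_index[e]]
    return torch.relu(messages @ W)
```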

MRL in pretraining

Original relational learning25 ensures that different augmented views of the same instance in computer vision tasks share similar features, while allowing for some variability. Suppose \(\mathbf{z}_i\) is the original embedding of the i-th instance. Then \(\mathbf{z}_i^1\) is the embedding of the first augmented view of \(\mathbf{z}_i\), and \(\mathbf{z}_i^2\) is the embedding of the second augmented view. The RL loss is formulated as follows:

$$s_{ik}^{1}=\frac{\mathbb{1}_{i\ne k}\cdot \exp (\mathbf{z}_i^1\cdot \mathbf{z}_k^2/\tau )}{\sum_{j=1}^{N}\mathbb{1}_{i\ne j}\cdot \exp (\mathbf{z}_i^1\cdot \mathbf{z}_j^2/\tau )}$$
$$s_{ik}^{2}=\frac{\mathbb{1}_{i\ne k}\cdot \exp (\mathbf{z}_i^2\cdot \mathbf{z}_k^2/\tau_m )}{\sum_{j=1}^{N}\mathbb{1}_{i\ne j}\cdot \exp (\mathbf{z}_i^2\cdot \mathbf{z}_j^2/\tau_m )}$$
$$L_{RL}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{\substack{k=1\\ k\ne i}}^{N} s_{ik}^{2}\log (s_{ik}^{1}).$$

We propose a modified relational metric by adapting the softmax function as a pairwise weighting mechanism. Let \(|\mathcal{S}|\) denote the size of the instance set. The variable \(s_{i,j}\) represents the learned similarity, where \(\mathbf{z}_i\) is the embedding to be trained. On the other hand, \(t_{i,j}^{R}\) defines the target similarity that captures the relationship between the pair of instances in the given space or modality R, where \(\mathbf{z}_i^R\) is a fixed embedding. The detailed formulation of the MRL loss is provided below:

$$s_{i,j}=\frac{\exp (\mathrm{sim}(\mathbf{z}_i,\mathbf{z}_j))}{\sum_{k=1}^{|\mathcal{S}|}\exp (\mathrm{sim}(\mathbf{z}_i,\mathbf{z}_k))}$$
(1)
$$t_{i,j}^{R}=\frac{\exp (\mathrm{sim}(\mathbf{z}_i^R,\mathbf{z}_j^R))}{\sum_{k=1}^{|\mathcal{S}|}\exp (\mathrm{sim}(\mathbf{z}_i^R,\mathbf{z}_k^R))}$$
(2)
$$L_{MRL}=-\frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}t_{i,j}^{R}\log (s_{i,j}).$$
(3)

Notably, unlike other similarity learning approaches23,24, our method does not rely on the categorization of negative and positive pairs for the pair weighting function. Additionally, our use of the softmax function ensures that the generalized target similarity \(t_{i,j}\) adheres to the principle of convergence, which results in better ranking consistency between the graph modality and the auxiliary modality compared with original relational learning, as follows:

Theorem 5.3

(Convergence of MRL Metric)

Let \(\mathcal{S}\) be a set of instances of size \(|\mathcal{S}|\), and let \(\mathcal{P}\) represent the learnable latent representations of instances in \(\mathcal{S}\) such that \(|\mathcal{P}| = |\mathcal{S}|\). For any two instances \(i,j\in \mathcal{S}\), their respective latent representations are denoted by \(\mathcal{P}_i\) and \(\mathcal{P}_j\). Let \(t_{i,j}\) represent the target similarity between instances i and j in a given domain, and let \(d_{i,j}\) be the similarity between \(\mathcal{P}_i\) and \(\mathcal{P}_j\) in the latent space. If \(t_{i,j}\) is non-negative and \(\{t_{i,j}\}\) satisfies the constraint \(\sum_{j=1}^{|\mathcal{S}|}t_{i,j}=1\), consider the loss function for an instance i defined as follows:

$$L(i)=-\sum_{j=1}^{|\mathcal{S}|}t_{i,j}\log \left(\frac{e^{d_{i,j}}}{\sum_{k=1}^{|\mathcal{S}|}e^{d_{i,k}}}\right)$$
(4)

then, at the ideal optimum, the relationship between \(t_{i,j}\) and \(d_{i,j}\) satisfies:

$$\mathrm{softmax}(d_{i,j})=t_{i,j}$$
(5)

For detailed proof, please refer to Supplementary Information Section A.
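As an illustration, the MRL loss of equations (1)-(3) can be written compactly; the sketch below assumes dot-product similarity and a frozen reference embedding for modality R.

```python
import torch
import torch.nn.functional as F

def mrl_loss(z, z_ref):
    """z: (N, d) trainable graph embeddings; z_ref: (N, d) fixed embeddings
    from the reference modality R. Implements equations (1)-(3)."""
    log_s = F.log_softmax(z @ z.T, dim=1)            # log s_{i,j}, eq. (1)
    with torch.no_grad():
        t = F.softmax(z_ref @ z_ref.T, dim=1)        # t^R_{i,j}, eq. (2)
    return -(t * log_s).sum(dim=1).mean()            # L_MRL, eq. (3)
```

At the optimum characterized by Theorem 5.3, the row-wise softmax of the learned similarities matches the target distribution.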

Fusion of multi-modality information in downstream tasks

During pre-training, the encoders are initialized with parameters derived from distinct reference modalities. A critical question that arises is how to effectively utilize these pre-trained models during the fine-tuning stage to improve performance on downstream tasks.

Early stage: multimodal multi-similarity

With a set of known target similarities {tR} from various modalities, we can transform them into a multimodal space through a fusion function. There are numerous potential designs for the fusion function; for simplicity, we take a linear combination as a demonstration. The multimodal generalized multi-similarity \(t_{i,j}^{M}\) between the ith and jth objects can be defined as follows:

$$t_{i,j}^{M}=\mathrm{fusion}(\{t^{R}\})$$
(6)
$$=\sum_{R}w_{R}\cdot t_{i,j}^{R}$$
(7)

where \(t_{i,j}^{R}\) represents the target similarity between the ith and jth instances in unimodal space R, \(w_R\) is the pre-defined weight for the corresponding modality, and \(\sum w_R = 1\). We can then use \(t_{i,j}^{M}\) in place of \(t_{i,j}^{R}\) in equation (3), and the requirement for convergence is still satisfied (see proof in Supplementary Information Section A). In this case, the similarity learned during pretraining will be aligned with this new combined target similarity.
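A one-line sketch of this weighted combination, assuming the per-modality target matrices are tensors:

```python
def fuse_targets(targets, weights):
    """targets: dict modality -> (N, N) target-similarity matrix t^R;
    weights: dict modality -> scalar w_R, with the weights summing to 1."""
    return sum(w * targets[m] for m, w in weights.items())
```

Because each \(t^{R}\) is row-stochastic and the weights sum to 1, the fused matrix still satisfies the row-sum constraint required by Theorem 5.3, so it can serve directly as the target in equation (3).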

Intermediate stage: embedding concatenation and fusion

Intermediate fusion integrates features from the various modalities after their individual encoding processes and prior to the decoding/readout stage. Let \(\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_n\) represent the feature vectors obtained from the different modalities. The resulting fused feature vector is defined as follows:

$$\mathbf{f}_{\mathrm{fused}}=\mathrm{MLP}(\mathrm{concat}(\mathbf{f}_1,\mathbf{f}_2,\ldots ,\mathbf{f}_n))$$
(8)

where concat denotes concatenation of the feature vectors. The fused features are then fed into a later readout function or decoder for downstream task prediction or classification. The multi-layer perceptron (MLP) is used to reduce the dimension to match that of each \(\mathbf{f}_i\).
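A minimal sketch of equation (8), assuming 300-dimensional per-modality embeddings (the dimensionality used in our t-SNE analysis) and a two-layer MLP, both illustrative choices:

```python
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    def __init__(self, dim=300, n_modalities=5):
        super().__init__()
        # Project the concatenated features back down to one f_i-sized vector.
        self.mlp = nn.Sequential(
            nn.Linear(dim * n_modalities, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, features):
        # features: list of (batch, dim) tensors, one per modality
        return self.mlp(torch.cat(features, dim=-1))
```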

Late stage: decision-level

Late fusion (or decision-level fusion) combines the outputs of models trained on different modalities after they have been processed independently. Each modality is first processed separately, and its predictions are combined at a later stage.

Let \(p_1, p_2, \ldots, p_n\) be the predictions (e.g., probabilities) from the different modalities. The final prediction \(p_{\mathrm{final}}\) can be computed using a weighted sum:

$$w_{i}=T_{i}(\mathbf{f}_i)$$
(9)
$$p_{i}=\mathrm{readout}_{i}(\mathbf{f}_i)$$
(10)
$$p_{\mathrm{final}}=\sum_{i=1}^{n}w_{i}p_{i}$$
(11)

where \(w_i\) is the weight assigned to each modality's prediction; the weights can be adjusted based on the importance of each modality. In particular, \(w_i\) is tunable during the learning process for the respective downstream tasks.
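A sketch of equations (9)-(11) for a regression task, where each gate \(T_i\) and readout are single linear layers (an illustrative choice):

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, dim=300, n_modalities=5):
        super().__init__()
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_modalities)])
        self.readouts = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_modalities)])

    def forward(self, features):
        # features: list of (batch, dim) tensors, one per modality
        w = [g(f) for g, f in zip(self.gates, features)]        # eq. (9)
        p = [r(f) for r, f in zip(self.readouts, features)]     # eq. (10)
        return sum(wi * pi for wi, pi in zip(w, p))             # eq. (11)
```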