
Fig. 1: Overview of the VideoMol foundational model.

From: A molecular video-derived foundation model for scientific drug discovery


a Feature extraction from molecular videos. First, we render 2 million molecules with their 3D conformers. We then rotate each rendered molecule around the \(x\), \(y\), and \(z\) axes and capture a snapshot for each frame of the molecular video. Finally, we feed the molecular frames into a video encoder to extract latent features. b–d Three self-supervised tasks for pre-training the video encoder. The direction-aware pretraining (DAP) task distinguishes the relationship between pairs of molecular frames (such as the axis of rotation, the direction of rotation, and the angle of rotation) using an axis classifier (orange), a rotation classifier (green), and an angle classifier (blue). The video-aware pretraining (VAP) task maximizes intra-video similarity and minimizes inter-video similarity. The chemical-aware pretraining (CAP) task recognizes information related to physicochemical structure in molecular videos using a chemical classifier (gray). e Fine-tuning of VideoMol on downstream benchmarks (such as binding activity prediction and molecular property prediction). A multi-layer perceptron (MLP) is added after the pre-trained video encoder for fine-tuning on four types of downstream drug discovery tasks (20 target prediction, 12 property prediction, 11 SARS-CoV-2 inhibitor prediction, and 4 virtual screening and docking tasks). We assemble the results (logits) of each frame into the prediction for the molecular video (video logit).
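
To make panel a concrete, the sketch below generates rotation frames from a conformer's 3D coordinates by sweeping a full turn around one axis and recording one rotated coordinate set per frame. Rendering to pixels is omitted, and the frame count and angle schedule are illustrative assumptions, not the paper's actual values.

```python
# Sketch of panel a's frame generation (assumptions: 12 frames, uniform
# angle steps; the paper's rendering pipeline is not reproduced here).
import numpy as np

def rotation_matrix(axis: str, theta: float) -> np.ndarray:
    """Rotation matrix for an angle theta (radians) about the x, y, or z axis."""
    c, s = np.cos(theta), np.sin(theta)
    if axis == "x":
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == "y":
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # z axis

def video_frames(coords: np.ndarray, axis: str, n_frames: int = 12):
    """coords: (n_atoms, 3) conformer coordinates. Yields one rotated
    coordinate set per frame, sweeping a full turn around `axis`."""
    for k in range(n_frames):
        theta = 2 * np.pi * k / n_frames
        yield coords @ rotation_matrix(axis, theta).T
```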
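For panels b–d, the VAP objective (maximize intra-video similarity, minimize inter-video similarity) can be sketched as an InfoNCE-style contrastive loss over frame embeddings. The exact loss form and temperature here are assumptions; the caption does not specify the authors' objective.

```python
# Sketch of a VAP-style contrastive loss (assumed InfoNCE form):
# frames from the same molecular video are positives, all others negatives.
import torch
import torch.nn.functional as F

def vap_loss(frame_emb: torch.Tensor, video_ids: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """frame_emb: (N, D) embeddings of frames from several videos;
    video_ids: (N,) index of the source video for each frame."""
    z = F.normalize(frame_emb, dim=1)
    sim = z @ z.t() / temperature                            # pairwise cosine similarities
    same = video_ids.unsqueeze(0) == video_ids.unsqueeze(1)  # same-video mask
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                        # positives, excluding self-pairs
    logprob = F.log_softmax(sim.masked_fill(eye, float("-inf")), dim=1)
    return -logprob[pos].mean()                              # average log-prob of positives
```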
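Finally, panel e's assembly of per-frame logits into a video logit could look like the following: a pretrained frame encoder with an MLP head, averaging the frame logits over all rotation frames. The class name, the stand-in encoder, and the mean aggregation are hypothetical choices for illustration, not the authors' implementation.

```python
# Sketch of panel e's fine-tuning head (assumption: video logit = mean of
# frame logits; the paper's aggregation rule may differ).
import torch
import torch.nn as nn

class VideoMolHead(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.encoder = encoder            # pre-trained frame encoder
        self.mlp = nn.Sequential(         # task-specific MLP added for fine-tuning
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, n_classes),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, C, H, W) snapshots of the rotating molecule
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w))  # per-frame features
        frame_logits = self.mlp(feats).reshape(b, t, -1)      # per-frame logits
        return frame_logits.mean(dim=1)                       # assembled video logit

# Usage with a toy stand-in encoder (the paper's encoder is not specified here):
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
model = VideoMolHead(encoder, feat_dim=512, n_classes=2)
video = torch.randn(4, 12, 3, 64, 64)    # 4 molecules, 12 rotation frames each
print(model(video).shape)                 # torch.Size([4, 2])
```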
