Fig. 1: Illustration of the cross-modal contrastive learning and generative diffusion approach implemented in Chemeleon.

From: Exploration of crystal chemical space using text-guided generative artificial intelligence

a The text-guided denoising diffusion model comprises two key components: (1) Crystal CLIP (Contrastive Language-Image Pretraining), a text encoder pre-trained through contrastive learning to align text embeddings with graph neural network (GNN) embeddings derived from crystal structures, and (2) a classifier-free diffusion model, which iteratively predicts noise at each time step while integrating text embeddings from the pre-trained Crystal CLIP. In this framework, q denotes the forward diffusion process (posterior) that progressively adds noise to the crystal structures, while pθ represents the reverse diffusion process (a learned approximation of the posterior) that generates crystal structures. Ct refers to the crystal structure at time step t. b Illustration of the contrastive learning objective in Crystal CLIP, where positive pairs, consisting of text and graph embeddings from the same crystal structure, are brought closer together in the latent space, while negative pairs are pushed further apart.
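
As a rough illustration of the contrastive objective in panel b, the PyTorch sketch below computes a symmetric InfoNCE-style loss between a batch of text embeddings and GNN structure embeddings, pulling matching pairs together and pushing mismatched pairs apart. The function name, temperature value, and tensor shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def crystal_clip_loss(text_emb: torch.Tensor,
                      graph_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching text/graph pairs (the diagonal) are
    pulled together; all other pairings in the batch act as negatives."""
    text_emb = F.normalize(text_emb, dim=-1)    # (B, D) text embeddings
    graph_emb = F.normalize(graph_emb, dim=-1)  # (B, D) GNN structure embeddings
    logits = text_emb @ graph_emb.T / temperature  # (B, B) cosine-similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2g = F.cross_entropy(logits, targets)    # text -> graph direction
    loss_g2t = F.cross_entropy(logits.T, targets)  # graph -> text direction
    return 0.5 * (loss_t2g + loss_g2t)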
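
The classifier-free conditioning described in panel a can likewise be sketched as a single training step: noise is added through the forward process q, and the denoising network (standing in for pθ) predicts that noise while the Crystal CLIP text embedding is randomly dropped, so the same network learns both conditional and unconditional denoising. The network interface, noise schedule tensor, and drop probability below are assumptions made for illustration only.

import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser: torch.nn.Module,
                            c0: torch.Tensor,        # clean crystal representation C_0
                            text_emb: torch.Tensor,  # Crystal CLIP text embeddings
                            alphas_cumprod: torch.Tensor,
                            drop_prob: float = 0.1) -> torch.Tensor:
    batch = c0.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (batch,), device=c0.device)
    noise = torch.randn_like(c0)

    # Forward process q(C_t | C_0): interpolate toward Gaussian noise at step t.
    a_bar = alphas_cumprod[t].view(batch, *([1] * (c0.dim() - 1)))
    c_t = a_bar.sqrt() * c0 + (1.0 - a_bar).sqrt() * noise

    # Classifier-free guidance: randomly zero out the text conditioning so the
    # network also learns an unconditional noise prediction.
    keep = (torch.rand(batch, device=c0.device) > drop_prob).float().unsqueeze(-1)
    cond = text_emb * keep

    # Reverse process pθ: the network predicts the injected noise (assumed
    # signature denoiser(noisy_input, time_step, condition)).
    pred_noise = denoiser(c_t, t, cond)
    return F.mse_loss(pred_noise, noise)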