Fig. 2: Overview of the study. | Nature Biomedical Engineering

From: A visually grounded language model for fetal ultrasound understanding

a, Illustration of CLIP (ref. 22). b, The coarse-grained video–text alignment method ‘pulls together’ the paired video and text (that is, transcribed audio) features while ‘pushing away’ the unpaired ones. c, The fine-grained frame–sentence alignment method optimizes the textual–visual similarity matrix p (p′) to maximize the similarity score between the sentence and its corresponding visual frames.
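
The ‘pull together / push away’ behaviour described in panel b can be illustrated with a CLIP-style symmetric contrastive loss. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the temperature value of 0.07 and the assumption that each video and each transcript has already been pooled into a single feature vector are all illustrative choices that do not come from the caption.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """CLIP-style loss over a batch of N paired (video, text) features.

    Row i of video_feats is assumed to be paired with row i of text_feats.
    The temperature is a hypothetical default, not a value from the paper.
    """
    v = l2_normalize(np.asarray(video_feats, dtype=np.float64))
    t = l2_normalize(np.asarray(text_feats, dtype=np.float64))

    # N x N similarity matrix: entry (i, j) compares video i with text j.
    logits = (v @ t.T) / temperature
    idx = np.arange(logits.shape[0])

    # Paired entries sit on the diagonal. Raising their probability pulls
    # pairs together; the softmax normalisation pushes unpaired ones away.
    loss_video_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_video = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_video_to_text + loss_text_to_video)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    # Correctly paired features should score a lower loss than mispaired ones.
    print(symmetric_contrastive_loss(feats, feats))
    print(symmetric_contrastive_loss(feats, np.roll(feats, 1, axis=0)))
```

In this sketch the loss is low when each video feature is most similar to its own transcript and high when the pairing is shuffled, which is the behaviour panel b depicts.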