Fig. 2: Overview of the study.
From: A visually grounded language model for fetal ultrasound understanding

a, Illustration of CLIP (ref. 22). b, The coarse-grained video–text alignment method ‘pulls together’ the paired video and text (that is, transcribed audio) features while ‘pushing away’ the unpaired ones. c, The fine-grained frame–sentence alignment method optimizes the textual–visual similarity matrix \(p\) (\(p^{\prime}\)) to maximize the similarity score between the sentence and its corresponding visual frames.
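
The ‘pull together, push away’ behaviour described in panel b can be illustrated with a CLIP-style symmetric contrastive loss. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `symmetric_contrastive_loss`, the temperature value of 0.07 and the use of cosine similarity are all hypothetical choices, and the caption does not specify the exact objective used in the study.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """CLIP-style loss for a batch of paired features (assumed formulation).

    Row i of `video_feats` is assumed to be paired with row i of `text_feats`.
    Paired features sit on the diagonal of the similarity matrix and are
    'pulled together'; every off-diagonal entry is an unpaired combination
    and is 'pushed away'.
    """
    v = l2_normalize(np.asarray(video_feats, dtype=np.float64))
    t = l2_normalize(np.asarray(text_feats, dtype=np.float64))
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8))
    text = video + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
    print(symmetric_contrastive_loss(video, text))
```

With correctly paired inputs the loss is small; permuting the text rows so that the pairs no longer line up makes it larger, which is the signal that drives the two modalities into a shared feature space.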