Figure 1

From: Audio-visual modelling in a clinical setting

Multi-modal representation learning in a clinical setting. (a) Illustration of audio-visual modelling in a natural-scene setting (top) and in a clinical setting (bottom). (b) Pipeline for audio-video modelling and analysis in clinical settings, starting from raw video footage. Video frames and the corresponding speech audio are extracted from the raw footage (I). After the audio is pre-processed (illustrated as the red waveform becoming the green one) and text is generated from the audio signal, the enhanced multi-modal data (II) are fed into a joint fusion framework, in which each data modality is encoded by its own network into the corresponding features (see Fig. 2 for details), to learn multi-modal representations without human annotation (III). The whole system can then be transferred to several downstream tasks and used for large-scale analysis and to support human experts (IV).
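To make the joint fusion stage in panel (b) concrete, the following is a minimal sketch of a three-modality model trained with a contrastive (InfoNCE-style) objective, so that co-occurring video, audio, and transcript clips act as positive pairs and no human annotation is required. All names here (JointFusionModel, contrastive_loss), the toy encoder architectures, and the choice of InfoNCE are illustrative assumptions; the paper's actual networks are the ones described in its Fig. 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFusionModel(nn.Module):
    """Hypothetical joint fusion framework: one encoder per modality,
    all projecting into a shared embedding space of size embed_dim."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Video encoder: stacked frames (B, 3, T, H, W) -> (B, D)
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        # Audio encoder: pre-processed log-mel spectrogram (B, 1, F, T) -> (B, D)
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        # Text encoder: pre-computed token embeddings (B, L, 300) -> (B, D)
        self.text_encoder = nn.Linear(300, embed_dim)

    def forward(self, frames, spectrogram, tokens):
        v = self.video_encoder(frames)
        a = self.audio_encoder(spectrogram)
        t = self.text_encoder(tokens).mean(dim=1)  # mean-pool over tokens
        return v, a, t

def contrastive_loss(x, y, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.
    Positives are embeddings from the same clip; negatives come from
    the rest of the batch, so no labels are needed."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    labels = torch.arange(x.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

if __name__ == "__main__":
    model = JointFusionModel()
    frames = torch.randn(4, 3, 8, 32, 32)   # 4 clips, 8 RGB frames each
    spec = torch.randn(4, 1, 64, 100)       # 4 log-mel spectrograms
    tokens = torch.randn(4, 12, 300)        # 4 transcripts, 12 token vectors
    v, a, t = model(frames, spec, tokens)
    loss = contrastive_loss(v, a) + contrastive_loss(v, t)
    print(f"self-supervised loss: {loss.item():.3f}")
```

After pre-training with such an objective, the per-modality encoders can be reused as feature extractors for the downstream clinical tasks in stage (IV).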