Fig. 2: Schematic representation of the predictive model.

We illustrate our multi-modal late fusion method. 1) Collect triage data and a short video of the patient. 2) Pre-process the tabular data, and encode the chief complaint and the video with the pre-trained ImageBind text and vision encoders, to obtain embeddings for each data modality (tabular, text, video). 3) Independently train a random forest classifier for each data modality. 4) Fuse the predictions of the trained random forests to obtain the final model prediction for patient disposition.
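
Steps 3 and 4 can be sketched in a few lines of Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not specify the fusion rule, so the sketch assumes a plain average of the per-modality class probabilities. The helper names `train_modality_models` and `late_fusion_predict`, the feature dimensions, and the random synthetic arrays standing in for the pre-processed tabular features and the ImageBind embeddings are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_modality_models(features_by_modality, labels, seed=0):
    """Step 3: fit one random forest per modality, independently of the others."""
    models = {}
    for name, X in features_by_modality.items():
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X, labels)
        models[name] = clf
    return models


def late_fusion_predict(models, features_by_modality):
    """Step 4: fuse by averaging class probabilities (an assumed fusion rule)."""
    probs = [
        models[name].predict_proba(features_by_modality[name])
        for name in models
    ]
    fused = np.mean(probs, axis=0)  # shape: (n_samples, n_classes)
    # All forests saw the same labels, so their class order is identical.
    classes = next(iter(models.values())).classes_
    return fused, classes[fused.argmax(axis=1)]


# Synthetic stand-ins for the three modalities (hypothetical sizes and values).
rng = np.random.default_rng(0)
n_patients = 200
labels = rng.integers(0, 2, size=n_patients)  # binary disposition, assumed
features = {
    "tabular": rng.normal(size=(n_patients, 8)) + labels[:, None],
    "text": rng.normal(size=(n_patients, 16)) + 0.5 * labels[:, None],
    "video": rng.normal(size=(n_patients, 16)) + 0.5 * labels[:, None],
}

# Simple hold-out split: first 150 patients for training, the rest for testing.
train = {name: X[:150] for name, X in features.items()}
test = {name: X[150:] for name, X in features.items()}

models = train_modality_models(train, labels[:150])
fused_probs, fused_pred = late_fusion_predict(models, test)
```

Averaging probabilities treats every modality as equally reliable; a weighted average or a small meta-classifier trained on the per-modality outputs are common alternatives when one modality is much noisier than the others.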