Fig. 1: Overview of the workflow and functionality of IOMIDS.

a The Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) is an embodied conversational agent, integrated with ChatGPT, designed for multimodal diagnosis using eye images and medical history. It comprises a text model and an image model. The text model employs a chief complaint classifier together with question and analysis prompts developed from real doctor-patient dialogs. The image model uses eye photographs taken with a slit-lamp and/or a smartphone for image-based diagnosis. The two modules are combined through diagnostic prompts to form a multimodal model. Patients with eye discomfort interact with IOMIDS in natural language; through this interaction, IOMIDS gathers the patient's medical history, guides the patient in photographing eye lesions with a smartphone or uploading slit-lamp images, and ultimately provides a disease diagnosis and ophthalmic subspecialty triage information.

b The text model and the multimodal models follow the same workflow for their text-based modules. After the patient enters a chief complaint, the chief complaint classifier categorizes it by keywords, triggering the relevant question and analysis prompts. The question prompt guides ChatGPT to ask targeted questions that gather the patient's medical history. The analysis prompt then considers the patient's gender, age, chief complaint, and medical history to generate a preliminary diagnosis. If no image information is provided, IOMIDS returns the preliminary diagnosis, together with subspecialty triage and guidance on prevention, treatment, and care, as the final response. If image information is available, the diagnosis prompt integrates the image analysis with the preliminary diagnosis to produce a final diagnosis and corresponding guidance.

c The text + image multimodal model is divided into text + slit-lamp, text + smartphone, and text + slit-lamp + smartphone models according to the image acquisition method. For smartphone-captured images, YOLOv7 segments the image to isolate the affected eye and remove other facial information, after which a ResNet50-based diagnostic model analyzes the crop. Slit-lamp images skip segmentation and are analyzed directly by a separate ResNet50-based model. Both diagnostic outputs undergo threshold processing to exclude irrelevant diagnoses. The resulting image information is then integrated, via the diagnosis prompt, with the preliminary diagnosis derived from the text to form the multimodal model.
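
To make the panel b workflow concrete, the sketch below shows one way the keyword-based chief-complaint routing and prompt assembly could be implemented. The complaint categories, keyword lists, and prompt wording are illustrative placeholders, not the ones used by IOMIDS.

```python
from typing import Optional

# Hypothetical keyword table: category -> trigger keywords (placeholders).
COMPLAINT_KEYWORDS = {
    "red_eye": ["red", "bloodshot", "redness"],
    "vision_loss": ["blurry", "blurred", "vision loss", "cannot see"],
    "eye_pain": ["pain", "ache", "sore"],
}

# Hypothetical question prompts keyed by complaint category.
QUESTION_PROMPTS = {
    "red_eye": "Ask about discharge, itching, contact lens use, and duration.",
    "vision_loss": "Ask about onset, laterality, floaters or flashes, and trauma.",
    "eye_pain": "Ask about severity, photophobia, headache, and nausea.",
}

def classify_chief_complaint(complaint: str) -> Optional[str]:
    """Return the first complaint category whose keywords match the input."""
    text = complaint.lower()
    for category, keywords in COMPLAINT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return category
    return None  # unmatched complaints could fall back to a generic prompt

def build_analysis_prompt(gender: str, age: int, complaint: str, history: str) -> str:
    """Assemble the analysis prompt from the patient profile and dialog history."""
    return (
        f"Patient: {gender}, {age} years old. Chief complaint: {complaint}. "
        f"History gathered: {history}. "
        "Provide a preliminary ophthalmic diagnosis and subspecialty triage."
    )
```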
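Similarly, the two-branch image pipeline in panel c might look like the following sketch, which uses torchvision's ResNet50 as a stand-in for the trained diagnostic models. The detect_eye_region placeholder (standing in for the YOLOv7 segmentation step), the class labels, and the 0.5 confidence threshold are assumptions for illustration, not the trained IOMIDS components.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

CLASSES = ["conjunctivitis", "keratitis", "cataract", "normal"]  # placeholder labels
THRESHOLD = 0.5  # placeholder confidence cutoff for excluding diagnoses

# Standard ImageNet preprocessing for a ResNet50 backbone.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def load_classifier(num_classes: int) -> torch.nn.Module:
    """ResNet50 with its final layer resized to the diagnostic classes."""
    model = models.resnet50(weights=None)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model.eval()

def detect_eye_region(image: Image.Image) -> Image.Image:
    """Placeholder for the YOLOv7 step that crops the affected eye from a
    smartphone photo, discarding other facial information."""
    return image  # a real system would return the detected eye crop

@torch.no_grad()
def diagnose(image: Image.Image, from_smartphone: bool,
             model: torch.nn.Module) -> list[tuple[str, float]]:
    """Run one branch of the image pipeline and apply threshold processing."""
    if from_smartphone:
        image = detect_eye_region(image)  # slit-lamp images skip this step
    probs = F.softmax(model(preprocess(image).unsqueeze(0)), dim=1)[0]
    # Threshold processing: keep only diagnoses above the confidence cutoff.
    return [(c, float(p)) for c, p in zip(CLASSES, probs) if p >= THRESHOLD]
```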
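Finally, the diagnosis prompt that fuses the two modalities could be assembled along these lines; the wording and format are hypothetical.

```python
def build_diagnosis_prompt(preliminary: str,
                           image_findings: list[tuple[str, float]]) -> str:
    """Fuse the text-derived preliminary diagnosis with thresholded image findings."""
    findings = ", ".join(f"{label} ({prob:.2f})" for label, prob in image_findings)
    return (
        f"Preliminary text-based diagnosis: {preliminary}. "
        f"Image-based findings above threshold: {findings or 'none'}. "
        "Integrate both sources into a final diagnosis with subspecialty "
        "triage and guidance on prevention, treatment, and care."
    )
```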