Table 7 Results on treatment planning, diagnosis, clinical outcome prediction, and text classification

Method	Size	SEER	DDXPlus	MIMIC4ED			HoC Classification
				Hospitalization	72h ED Revisit	Critical Triage	Precision	Recall	F1
Closed-source Models
GPT-4	-	84.73	58.13	61.20	58.07	60.13	61.07	80.23	68.06
Claude-3.5	-	92.93	60.24	65.80	57.91	68.53	58.43	79.84	66.74
Open-source Models
MEDITRON	7B	68.27	29.53	56.27	48.47	45.67	19.61	34.61	23.70
InternLM 2	7B	62.33	35.20	58.80	55.13	52.80	20.65	82.24	31.09
Mistral	7B	38.93	34.80	56.27	48.47	45.67	40.39	64.11	48.73
Llama 3	8B	56.07	33.73	39.07	9.27	8.80	32.40	52.03	38.37
Qwen 2	7B	22.27	34.07	57.60	56.67	53.53	37.78	53.81	40.29
Med42-v2	8B	43.87	34.13	57.87	55.20	46.60	49.95	53.12	47.87
Baichuan 2	7B	16.80	34.13	22.73	8.07	2.13	38.54	20.28	23.76
MMedIns-Llama 3	8B	98.47	97.53	74.20	52.73	63.13	89.59	85.58	86.66

The first three tasks (SEER, DDXPlus, and MIMIC4ED) are reported as accuracy, and text classification (HoC) is reported as precision, recall, and F1. Bold indicates the best results.
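
As a rough illustration of the metrics named above, the following sketch computes accuracy for single-answer tasks and precision, recall, and F1 for a multi-label classification task. It is a minimal sketch under stated assumptions, not the evaluation code behind the table: the function names, the label strings, and the choice to average scores per sample are all hypothetical. One hint that some form of averaging is involved is that the F1 column is not the harmonic mean of the precision and recall columns (89.59 and 85.58 would give roughly 87.54, not 86.66).

```python
from typing import Sequence, Set, Tuple


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def sample_prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 (as fractions) for one sample's label sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_prf(
    predicted_sets: Sequence[Set[str]], gold_sets: Sequence[Set[str]]
) -> Tuple[float, float, float]:
    """Average per-sample precision, recall, and F1, as percentages.

    Assumption: scores are averaged over samples. The source does not
    state which averaging scheme was used.
    """
    if len(predicted_sets) != len(gold_sets):
        raise ValueError("predicted_sets and gold_sets must have the same length")
    scores = [sample_prf(p, g) for p, g in zip(predicted_sets, gold_sets)]
    if not scores:
        return 0.0, 0.0, 0.0
    n = len(scores)
    precision = 100.0 * sum(s[0] for s in scores) / n
    recall = 100.0 * sum(s[1] for s in scores) / n
    f1 = 100.0 * sum(s[2] for s in scores) / n
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
    print(mean_prf([{"label_a", "label_b"}, {"label_c"}],
                   [{"label_a"}, {"label_c", "label_d"}]))
```

With per-sample averaging, the mean F1 generally differs from the harmonic mean of the mean precision and mean recall, which is why the three HoC columns need not be mutually derivable.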
