Fig. 2: Fine-tuned LLMs versus ChatGPT-family models.
From: Large language models to identify social determinants of health in electronic health records

Comparison of model performance between our fine-tuned Flan-T5 models against zero- and 10-shot GPT. Macro-F1 was measured using our manually validated synthetic dataset. The GPT-turbo-0613 version of GPT3.5 and the GPT4–0613 version of GPT4 were used. Error bars indicate the 95% confidence intervals. LLM large language model.