Fig. 3: Metadata of the LLM-based diagnostic studies included in the scoping review.
From: Large language models for disease diagnosis: a scoping review

a Quarterly breakdown of LLM-based diagnostic studies. Because the data for 2024-Q3 are incomplete, the statistics cover only up to 2024-Q2. b The five most widely used LLMs for inference and training. c Breakdown of data sources by region. d Breakdown of evaluation methods (note that some papers used multiple evaluation methods). e Breakdown of the employed datasets by privacy status. f Distribution of data sizes used for LLM techniques. The red line indicates the median, and the box limits represent the interquartile range (IQR) from the first to the third quartile. Notably, pre-trained diagnostic models were often further adapted with other LLM techniques (e.g., fine-tuning), yet this figure includes only studies that primarily used fine-tuning or RAG. Statistics for prompting methods are not included because (i) hard prompts generally use zero or very few demonstration samples, and (ii) although soft prompts require more training data, the number of relevant studies is insufficient for a meaningful distribution analysis.