Fig. 2: CASSIA improves annotation accuracy in five benchmark datasets, in complex immune and cancer cell populations, and for rare cell types.
From: CASSIA: a multi-agent large language model for automated and interpretable cell annotation

a Fully correct annotation rates across the five benchmark datasets, where CASSIA (blue) increases annotation accuracy by 12–41% over the next-best-performing method. b Combined correct annotation rates (fully or partially correct annotations) across the same datasets, where CASSIA maintains a 9–20% advantage over the next-best approach (typically GPTCelltype4/4o). c Heatmap of average accuracy across cell types for each dataset (rows) and method (columns), with values ranging from 0 (white) to 1 (dark red). Average accuracy is computed by assigning 1 to fully correct, 0.5 to partially correct, and 0 to incorrect annotations, then averaging these scores across all clusters within each dataset. CASSIA consistently achieves the highest accuracy scores. The “Cancer” and “Immune” rows correspond to the specialized datasets discussed in the Results subsection “CASSIA performance is maintained in more specialized applications.” d Effect of marker set size on annotation accuracy across benchmark datasets. Line plots show the average accuracy (1 = fully correct, 0.5 = partially correct, 0 = incorrect) achieved by CASSIA when the top 10, 30, 50, 70, or 100 ranked marker genes are used as input. For each benchmark dataset (GTEx, Tabula Sapiens (TS), Human Cell Landscape (HCL), Mouse Cell Atlas (MCA), and Azimuth), accuracy is averaged across all clusters. Performance generally plateaus at around 50 markers, suggesting that this number offers an effective trade-off between information content and efficiency. e Performance benchmarking of CASSIA with different LLMs versus GPTCelltype implementations on 100 cell types sampled from the five benchmark datasets. CASSIA-Claude-3.5 achieves the highest score (0.92), followed by CASSIA-GPT4.0 (0.88) and CASSIA-LLAMA-3.2 (0.82), demonstrating the framework’s robustness to the choice of underlying model. Source data are provided as a Source Data file.
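The three summary metrics used in panels a–c can be illustrated with a minimal sketch. It assumes each cluster has already been graded as fully correct, partially correct, or incorrect. The function name `summarize`, the label strings, and the example grades are hypothetical and are not taken from the CASSIA code or the Source Data file; only the 1 / 0.5 / 0 weights come from the caption.

```python
# Score weights as stated in the caption: 1 = fully correct,
# 0.5 = partially correct, 0 = incorrect.
SCORES = {"full": 1.0, "partial": 0.5, "incorrect": 0.0}


def summarize(grades):
    """Summarize per-cluster grades for one dataset and one method.

    `grades` is a list with one entry per cluster, each being
    "full", "partial", or "incorrect" (hypothetical label names).
    """
    n = len(grades)
    if n == 0:
        raise ValueError("need at least one graded cluster")
    n_full = sum(1 for g in grades if g == "full")
    n_partial = sum(1 for g in grades if g == "partial")
    return {
        # Panel a: share of clusters annotated fully correctly.
        "fully_correct_rate": n_full / n,
        # Panel b: share of clusters fully or partially correct.
        "combined_correct_rate": (n_full + n_partial) / n,
        # Panel c: mean of the 1 / 0.5 / 0 scores across clusters.
        "average_accuracy": sum(SCORES[g] for g in grades) / n,
    }


# Made-up grades for four clusters, for illustration only.
example = ["full", "full", "partial", "incorrect"]
result = summarize(example)
print(result)
```

For these made-up grades, the fully correct rate is 2/4, the combined rate is 3/4, and the average accuracy is (1 + 1 + 0.5 + 0)/4. The average accuracy therefore always lies between the fully correct rate and the combined rate.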