Fig. 4: Qualitative assessment of annotation confidence.
From: Benchmarking cell type and gene set annotation by large language models with AnnDictionary

A Inter-rater agreement among the four top-performing LLMs vs. agreement with manual annotation for each manually annotated cell type, with marginal kernel density estimates stratified by tertile of cell type population size. Red, yellow, and green represent the bottom, middle, and top tertiles of cell types by population size, respectively. B Same axes as (A), with dot sizes scaled by the population size of each cell type and kernel density estimates likewise scaled by population size. The manually drawn ellipses outline two regions of interest: (A) the cell types with the highest inter-rater agreement and lowest agreement with manual annotation, which are the subject of Fig. 5, and (B) the cell types with the highest inter-rater agreement and highest agreement with manual annotation, which include the most abundant cell types discussed earlier.