Table 1 LLM performance
From: Benchmarking cell type and gene set annotation by large language models with AnnDictionary

Model | Cell Type: Binary, Cells (%) | Cell Type: Binary, By Cell Type (%) | Cell Type: Perfect Match, Cells (%) | Cell Type: Perfect Match, By Cell Type (%) | Cell Type: Exact String Match, Cells (%) | Cell Type: Exact String Match, By Cell Type (%) | Cell Type: Kappa, With Manual | Cell Type: Kappa, Average With LLMs | Biological Process: Close Match (% of Terms) |
|---|---|---|---|---|---|---|---|---|---|
Claude 3 Haiku | 78.3 ± 0.8 | 66.2 ± 1.5 | 61.8 ± 2.3 | 47 ± 4 | 61.8 ± 2.3 | 47 ± 4 | 0.589 ± 0.023 | 0.652 ± 0.023 | 62.8 ± 0.4 |
Claude 3 Opus | 82.3 ± 0.9 | 69.8 ± 1.0 | 72.8 ± 2.6 | 54 ± 4 | 72.7 ± 2.6 | 54 ± 4 | 0.704 ± 0.026 | 0.711 ± 0.020 | 71.0 ± 0.4 |
Claude 3.5 Sonnet | 84.0 ± 0.7 | 70.5 ± 1.2 | 74.4 ± 2.7 | 54 ± 4 | 74.3 ± 2.6 | 53 ± 4 | 0.721 ± 0.027 | 0.697 ± 0.026 | 81.20 ± 0.32 |
Command R Plus | 77.2 ± 1.0 | 59.4 ± 2.6 | 64.5 ± 2.6 | 40 ± 5 | 64.5 ± 2.6 | 40 ± 5 | 0.616 ± 0.027 | 0.646 ± 0.026 | 58.5 ± 0.7 |
GPT-4 | 79.2 ± 0.9 | 64.3 ± 1.9 | 64 ± 4 | 44 ± 5 | 64 ± 4 | 44 ± 5 | 0.61 ± 0.05 | 0.65 ± 0.04 | 65.24 ± 0.33 |
GPT-4o | 80.9 ± 0.7 | 70.1 ± 2.8 | 70.4 ± 2.5 | 54 ± 6 | 70.4 ± 2.5 | 54 ± 6 | 0.680 ± 0.026 | 0.721 ± 0.021 | 67.04 ± 0.33 |
GPT-4o mini | 76.8 ± 1.0 | 66.2 ± 1.6 | 63.4 ± 3.0 | 47 ± 6 | 63.4 ± 3.0 | 47 ± 6 | 0.605 ± 0.031 | 0.681 ± 0.022 | 64.8 ± 0.5 |
Gemini 1.5 Flash | 68.8 ± 1.3 | 60.8 ± 2.7 | 51.0 ± 2.5 | 41 ± 5 | 51.0 ± 2.5 | 41 ± 5 | 0.478 ± 0.024 | 0.561 ± 0.020 | 60.52 ± 0.18 |
Gemini 1.5 Pro | 77.5 ± 1.8 | 67.9 ± 0.8 | 65.1 ± 2.4 | 50 ± 5 | 65.1 ± 2.4 | 50 ± 5 | 0.625 ± 0.024 | 0.658 ± 0.019 | 66.32 ± 0.11 |
Llama 3.1 405B Instruct | 82.0 ± 1.0 | 64.9 ± 2.7 | 69.5 ± 2.6 | 47 ± 5 | 69.3 ± 2.6 | 47 ± 5 | 0.667 ± 0.027 | 0.690 ± 0.021 | 71.9 ± 0.5 |
Llama 3.1 70B Instruct | 74 ± 4 | 61.6 ± 1.5 | 64 ± 4 | 46 ± 5 | 64 ± 4 | 45 ± 5 | 0.62 ± 0.04 | 0.665 ± 0.022 | 70.8 ± 0.5 |
Llama 3.1 8B Instruct | 59 ± 4 | 53 ± 4 | 47.7 ± 3.2 | 37 ± 6 | 47.6 ± 3.2 | 36 ± 6 | 0.440 ± 0.031 | 0.526 ± 0.030 | 61.7 ± 0.7 |
Mistral Large | 78.1 ± 1.5 | 66.2 ± 2.0 | 64.9 ± 2.7 | 50 ± 5 | 64.8 ± 2.8 | 49 ± 5 | 0.623 ± 0.027 | 0.696 ± 0.024 | 62.76 ± 0.17 |
Plurality Vote | 80.5 ± 0.9 | 69.4 ± 1.7 | 72.4 ± 2.1 | 55 ± 5 | 72.3 ± 2.1 | 55 ± 5 | 0.700 ± 0.022 | 0.770 ± 0.018 | — |
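
The table does not say how the "Plurality Vote" row or the Kappa columns were computed. As a rough illustration only, the sketch below assumes that the plurality vote takes the most frequent label across models for each cluster and that Kappa refers to Cohen's kappa between two label lists. The function names and the toy labels are hypothetical and are not taken from AnnDictionary.

```python
from collections import Counter


def plurality_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of positions where both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: one label per cluster from a manual annotator and an LLM.
manual = ["T cell", "T cell", "B cell", "B cell"]
llm = ["T cell", "B cell", "B cell", "B cell"]

kappa = cohen_kappa(manual, llm)
consensus = plurality_vote(["T cell", "B cell", "T cell"])
```

On this toy data the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5. The consensus label is "T cell". Under these assumptions, an "Average With LLMs" value would be the mean of such pairwise kappas between one model and each of the others.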