Table 1 LLM performance

From: Benchmarking cell type and gene set annotation by large language models with AnnDictionary

 

| Model | Cell Type: Binary (%), Cells | Cell Type: Binary (%), By Cell Type | Cell Type: Perfect Match (%), Cells | Cell Type: Perfect Match (%), By Cell Type | Cell Type: Exact String Match (%), Cells | Cell Type: Exact String Match (%), By Cell Type | Cell Type: Kappa, With Manual | Cell Type: Kappa, Average With LLMs | Biological Process: Close Match (% of Terms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3 Haiku | 78.3 ± 0.8 | 66.2 ± 1.5 | 61.8 ± 2.3 | 47 ± 4 | 61.8 ± 2.3 | 47 ± 4 | 0.589 ± 0.023 | 0.652 ± 0.023 | 62.8 ± 0.4 |
| Claude 3 Opus | 82.3 ± 0.9 | 69.8 ± 1.0 | 72.8 ± 2.6 | 54 ± 4 | 72.7 ± 2.6 | 54 ± 4 | 0.704 ± 0.026 | 0.711 ± 0.020 | 71.0 ± 0.4 |
| Claude 3.5 Sonnet | 84.0 ± 0.7 | 70.5 ± 1.2 | 74.4 ± 2.7 | 54 ± 4 | 74.3 ± 2.6 | 53 ± 4 | 0.721 ± 0.027 | 0.697 ± 0.026 | 81.20 ± 0.32 |
| Command R Plus | 77.2 ± 1.0 | 59.4 ± 2.6 | 64.5 ± 2.6 | 40 ± 5 | 64.5 ± 2.6 | 40 ± 5 | 0.616 ± 0.027 | 0.646 ± 0.026 | 58.5 ± 0.7 |
| GPT-4 | 79.2 ± 0.9 | 64.3 ± 1.9 | 64 ± 4 | 44 ± 5 | 64 ± 4 | 44 ± 5 | 0.61 ± 0.05 | 0.65 ± 0.04 | 65.24 ± 0.33 |
| GPT-4o | 80.9 ± 0.7 | 70.1 ± 2.8 | 70.4 ± 2.5 | 54 ± 6 | 70.4 ± 2.5 | 54 ± 6 | 0.680 ± 0.026 | 0.721 ± 0.021 | 67.04 ± 0.33 |
| GPT-4o mini | 76.8 ± 1.0 | 66.2 ± 1.6 | 63.4 ± 3.0 | 47 ± 6 | 63.4 ± 3.0 | 47 ± 6 | 0.605 ± 0.031 | 0.681 ± 0.022 | 64.8 ± 0.5 |
| Gemini 1.5 Flash | 68.8 ± 1.3 | 60.8 ± 2.7 | 51.0 ± 2.5 | 41 ± 5 | 51.0 ± 2.5 | 41 ± 5 | 0.478 ± 0.024 | 0.561 ± 0.020 | 60.52 ± 0.18 |
| Gemini 1.5 Pro | 77.5 ± 1.8 | 67.9 ± 0.8 | 65.1 ± 2.4 | 50 ± 5 | 65.1 ± 2.4 | 50 ± 5 | 0.625 ± 0.024 | 0.658 ± 0.019 | 66.32 ± 0.11 |
| Llama 3.1 405B Instruct | 82.0 ± 1.0 | 64.9 ± 2.7 | 69.5 ± 2.6 | 47 ± 5 | 69.3 ± 2.6 | 47 ± 5 | 0.667 ± 0.027 | 0.690 ± 0.021 | 71.9 ± 0.5 |
| Llama 3.1 70B Instruct | 74 ± 4 | 61.6 ± 1.5 | 64 ± 4 | 46 ± 5 | 64 ± 4 | 45 ± 5 | 0.62 ± 0.04 | 0.665 ± 0.022 | 70.8 ± 0.5 |
| Llama 3.1 8B Instruct | 59 ± 4 | 53 ± 4 | 47.7 ± 3.2 | 37 ± 6 | 47.6 ± 3.2 | 36 ± 6 | 0.440 ± 0.031 | 0.526 ± 0.030 | 61.7 ± 0.7 |
| Mistral Large | 78.1 ± 1.5 | 66.2 ± 2.0 | 64.9 ± 2.7 | 50 ± 5 | 64.8 ± 2.8 | 49 ± 5 | 0.623 ± 0.027 | 0.696 ± 0.024 | 62.76 ± 0.17 |
| Plurality Vote | 80.5 ± 0.9 | 69.4 ± 1.7 | 72.4 ± 2.1 | 55 ± 5 | 72.3 ± 2.1 | 55 ± 5 | 0.700 ± 0.022 | 0.770 ± 0.018 | |

Cell type annotation: agreement with manual annotations, measured by binary (yes/no) match, quality of match, and exact string agreement, together with kappa against the manual annotation and the average kappa of the given model with every other model. Biological process: annotation of known gene lists. All values are mean ± standard deviation across five replicates. Source data are provided as a Source Data file.
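As an illustration of how a kappa entry and a "mean ± standard deviation" entry of this kind could be computed, the following is a minimal sketch and not the authors' pipeline. It assumes Cohen's kappa and the sample standard deviation, neither of which is specified in the table note. The function names `cohen_kappa` and `summarize`, the toy labels, and the replicate values are all hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences (assumed metric)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def summarize(replicate_values):
    """Return (mean, sample standard deviation) across replicates."""
    return mean(replicate_values), stdev(replicate_values)


# Hypothetical toy labels, not taken from the source data.
manual = ["T cell", "B cell", "T cell", "NK cell"]
model = ["T cell", "B cell", "NK cell", "NK cell"]
kappa = cohen_kappa(manual, model)

# Hypothetical kappa values from five replicates.
m, s = summarize([0.60, 0.62, 0.58, 0.61, 0.59])
print(f"kappa = {kappa:.3f}; replicates: {m:.3f} ± {s:.3f}")
```

The "Average With LLMs" column could be sketched the same way, by averaging `cohen_kappa` of one model's labels against each other model's labels, again under the assumptions stated above.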