Table 2. Error counts, given as mean (standard deviation), of the LLMs’ responses on the 1050 RD exam questions.
| Benchmark | Prompt | GPT-4o | Claude 3.5 S. | Gemini 1.5 P. |
|---|---|---|---|---|
| RD Exam | Zero Shot | 84.8 (2.93) | 104.6 (1.02) | 96.8 (1.17) |
| RD Exam | Chain of Thought | 59.6 (1.85) | 80.6 (2.87) | 117.4 (6.62) |
| RD Exam | Chain of Thought w. Self Consistency | 58.0 (2.28) | 77.0 (1.67) | 104.8 (4.12) |
| RD Exam | Retrieval Augmented Prompting | 75.8 (2.86) | 113.2 (1.94) | 108.6 (1.20) |