Table 2. Error counts, reported as mean (standard deviation), of the LLMs’ responses on the 1050 RD exam questions.

Benchmark	Prompt	GPT-4o	Claude 3.5 Sonnet	Gemini 1.5 Pro
RD Exam	Zero Shot	84.8 (2.93)	104.6 (1.02)	96.8 (1.17)
	Chain of Thought	59.6 (1.85)	80.6 (2.87)	117.4 (6.62)
	Chain of Thought w. Self Consistency	58.0 (2.28)	77.0 (1.67)	104.8 (4.12)
	Retrieval Augmented Prompting	75.8 (2.86)	113.2 (1.94)	108.6 (1.20)
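
The error counts can be read as accuracies by dividing by the 1050 questions. The sketch below does that arithmetic on the mean values from Table 2 and picks the lowest-error prompt for each model. It assumes that each error corresponds to exactly one incorrectly answered question; the names `mean_errors`, `accuracy_pct`, and `best_prompt` are illustrative and do not come from the source.

```python
# Assumption: one error = one incorrectly answered question out of 1050.
TOTAL_QUESTIONS = 1050

# Mean error counts copied from Table 2 (standard deviations omitted).
mean_errors = {
    "Zero Shot": {
        "GPT-4o": 84.8, "Claude 3.5 Sonnet": 104.6, "Gemini 1.5 Pro": 96.8,
    },
    "Chain of Thought": {
        "GPT-4o": 59.6, "Claude 3.5 Sonnet": 80.6, "Gemini 1.5 Pro": 117.4,
    },
    "Chain of Thought w. Self Consistency": {
        "GPT-4o": 58.0, "Claude 3.5 Sonnet": 77.0, "Gemini 1.5 Pro": 104.8,
    },
    "Retrieval Augmented Prompting": {
        "GPT-4o": 75.8, "Claude 3.5 Sonnet": 113.2, "Gemini 1.5 Pro": 108.6,
    },
}


def accuracy_pct(errors: float, total: int = TOTAL_QUESTIONS) -> float:
    """Convert a mean error count into a percentage accuracy."""
    return 100.0 * (1.0 - errors / total)


def best_prompt(model: str) -> str:
    """Return the prompting strategy with the lowest mean error count."""
    return min(mean_errors, key=lambda prompt: mean_errors[prompt][model])


if __name__ == "__main__":
    for model in ("GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"):
        prompt = best_prompt(model)
        acc = accuracy_pct(mean_errors[prompt][model])
        print(f"{model}: best prompt = {prompt} ({acc:.2f}% accuracy)")
```

Under this reading, lower error counts map directly to higher accuracy, so the ranking of prompts within each model column is unchanged by the conversion.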
