Table 3 Performance comparison across different prompt types.
| Model | Prompt type | Accuracy (%) | Accuracy vs. baseline (pp) |
|---|---|---|---|
| DeepSeek-R1 | Baseline (original prompt) | 93.50 | – |
| DeepSeek-R1 | Step-by-step reasoning | 96.00 | +2.50 |
| DeepSeek-R1 | Concise answer-only | 95.50 | +2.00 |
| ChatGPT-4o | Baseline (original prompt) | 63.50 | – |
| ChatGPT-4o | Step-by-step reasoning | 88.00 | +24.50 |
| ChatGPT-4o | Concise answer-only | 77.50 | +14.00 |