Fig. 2: Response accuracy of LLMs on text-based test questions.

Models are grouped by family and ordered from oldest to newest within each family. Stacked bars show the response categories: correct, 2-option selection (2OP), external option selection (EOP), no option selection (NOP), errors, and incorrect responses. For models tested through both the API and the web interface, only the API results are shown, because performance was similar across the two interfaces.