Table 4. Overall evaluation of quality and time-effectiveness.
| Model | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| ChatGPT o4-mini-high | \(\color{yellow} {\bullet}\)/20 min | \(\color{red} {\bullet}\)/15 min | \(\color{red} {\bullet}\)/5 min |
| Claude Sonnet 3.7 with Extended Thinking | \({\bullet}\)/n.a. | \(\color{green} {\bullet}\)/90 min | \(\color{yellow} {\bullet}\)/5 min |
| Google Gemini 2.5 Pro Experimental | \(\color{green} {\bullet}\)/5 min | \(\color{green} {\bullet}\)/15 min | \(\color{yellow} {\bullet}\)/5 min |
| DeepSeek R1 | \(\color{red} {\bullet}\)/30 min | \(\color{green} {\bullet}\)/120 min | \(\color{red} {\bullet}\)/5 min |
| Mistral Le Chat | \(\color{red} {\bullet}\)/5 min | \(\color{yellow} {\bullet}\)/90 min | \(\color{red} {\bullet}\)/5 min |
| Grok 3 | \(\color{yellow} {\bullet}\)/5 min | \({\bullet}\)/n.a. | \(\color{red} {\bullet}\)/5 min |