Table 4. Overall evaluation of quality and time-effectiveness.

Model	Task 1	Task 2	Task 3
ChatGPT o4-mini-high	\(\color{yellow} {\bullet}\)/20 min	\(\color{red} {\bullet}\)/15 min	\(\color{red} {\bullet}\)/5 min
Claude Sonnet 3.7 with Extended Thinking	\({\bullet}\)/n.a.	\(\color{green} {\bullet}\)/90 min	\(\color{yellow} {\bullet}\)/5 min
Google Gemini 2.5 Pro Experimental	\(\color{green} {\bullet}\)/5 min	\(\color{green} {\bullet}\)/15 min	\(\color{yellow} {\bullet}\)/5 min
DeepSeek R1	\(\color{red} {\bullet}\)/30 min	\(\color{green} {\bullet}\)/120 min	\(\color{red} {\bullet}\)/5 min
Mistral Le Chat	\(\color{red} {\bullet}\)/5 min	\(\color{yellow} {\bullet}\)/90 min	\(\color{red} {\bullet}\)/5 min
Grok 3	\(\color{yellow} {\bullet}\)/5 min	\({\bullet}\)/n.a.	\(\color{red} {\bullet}\)/5 min

Overall evaluation of the quality of results on each Task (green: good; yellow: average; red: bad), with approximate time needed to execute each Task.
