Table 4 Overall evaluation on quality and time-effectiveness.

From: Human researchers are superior to large language models in writing a medical systematic review in a comparative multitask assessment

 

| Model | Task 1 | Task 2 | Task 3 |
| --- | --- | --- | --- |
| ChatGPT o4-mini-high | \(\color{yellow} {\bullet}\)/20 min | \(\color{red} {\bullet}\)/15 min | \(\color{red} {\bullet}\)/5 min |
| Claude Sonnet 3.7 with Extended Thinking | \({\bullet}\)/n.a. | \(\color{green} {\bullet}\)/90 min | \(\color{yellow} {\bullet}\)/5 min |
| Google Gemini 2.5 Pro Experimental | \(\color{green} {\bullet}\)/5 min | \(\color{green} {\bullet}\)/15 min | \(\color{yellow} {\bullet}\)/5 min |
| DeepSeek R1 | \(\color{red} {\bullet}\)/30 min | \(\color{green} {\bullet}\)/120 min | \(\color{red} {\bullet}\)/5 min |
| Mistral Le Chat | \(\color{red} {\bullet}\)/5 min | \(\color{yellow} {\bullet}\)/90 min | \(\color{red} {\bullet}\)/5 min |
| Grok 3 | \(\color{yellow} {\bullet}\)/5 min | \({\bullet}\)/n.a. | \(\color{red} {\bullet}\)/5 min |

  1. Overall evaluation of the quality of results on each Task (green: good; yellow: average; red: bad), with approximate time needed to execute each Task.