Table 3. Task 3 results.

From: Human researchers are superior to large language models in writing a medical systematic review in a comparative multitask assessment

 

| Model | Title | Abstract | Introduction | Methods | Results | Discussion | References |
|---|---|---|---|---|---|---|---|
| Round 1 – Feb 2025 | | | | | | | |
| ChatGPT o3-mini-high | A | P.A. | P.A. | I | I | I | I |
| Claude Sonnet 3.7 with Extended Thinking | P.A. | P.A. | P.A. | P.A. | P.A. | P.A. | P.A. |
| Google Gemini 2.0 Flash Thinking Experimental | P.A. | P.A. | P.A. | I | P.A. | P.A. | I |
| DeepSeek R1 | P.A. | P.A. | I | I | I | I | I |
| Mistral Le Chat | A | P.A. | I | I | I | I | I |
| Round 2 – Apr 2025 | | | | | | | |
| ChatGPT o4-mini-high | I | P.A. | I | I | I | I | I |
| Claude Sonnet 3.7 with Extended Thinking | P.A. | P.A. | P.A. | P.A. | P.A. | P.A. | P.A. |
| Google Gemini 2.5 Pro Experimental | P.A. | P.A. | P.A. | I | P.A. | P.A. | I |
| DeepSeek R1 | P.A. | P.A. | I | I | I | I | I |
| Mistral Le Chat | A | P.A. | I | I | I | I | I |
| Grok 3 | A | P.A. | P.A. | I | I | I | I |

Summary of the results of Task 3 – final manuscript production – in the two rounds of evaluation. A: appropriate; P.A.: partially appropriate; I: inappropriate.