Fig. 4: Reviewers use different criteria from the AI output detector when flagging abstracts as generated or original.

The AI detection scores for generated abstracts did not differ significantly (p = 0.45) between abstracts that human reviewers identified as generated and those that they failed to identify as generated.