Fig. 2: Overview of the iterative pipeline improvement and gold-standard set creation process.

After each iteration, the schema, prompts, LLM outputs, and gold-standard set were incrementally versioned (V1, V2, V3, etc.). The gold-standard set was structured in exactly the same format as the table that held the LLM outputs, with columns for report ID, specimen (and block, when applicable) name, item name (e.g., histology), and item label (e.g., clear cell RCC), so that items could be programmatically matched for review. The same LLM backbone (GPT-4o 2024-05-13) was used throughout all iterations.
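
Because the two tables share one format, matching amounts to a join on the identifying columns. The following is a minimal sketch of that idea, not the pipeline's actual code: the `Row` type, the `find_discrepancies` function, the field names, and the sample values are all hypothetical, and exact string equality of labels is an assumed comparison rule.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One row of the shared table format (field names are assumptions)."""
    report_id: str
    specimen: str    # specimen (and block, when applicable) name
    item_name: str   # e.g., "histology"
    item_label: str  # e.g., "clear cell RCC"


def row_key(row):
    """Columns that identify an item independently of its label."""
    return (row.report_id, row.specimen, row.item_name)


def find_discrepancies(llm_rows, gold_rows):
    """Return (key, llm_label, gold_label) for every item that needs review.

    An item is flagged when its labels differ or when it appears in only
    one of the two tables (the missing side is reported as None).
    """
    llm = {row_key(r): r.item_label for r in llm_rows}
    gold = {row_key(r): r.item_label for r in gold_rows}
    flagged = []
    for key in sorted(llm.keys() | gold.keys()):
        llm_label, gold_label = llm.get(key), gold.get(key)
        if llm_label != gold_label:
            flagged.append((key, llm_label, gold_label))
    return flagged


# Hypothetical sample data, for illustration only.
gold_v1 = [
    Row("R1", "A", "histology", "clear cell RCC"),
    Row("R1", "A", "grade", "2"),
]
llm_v1 = [
    Row("R1", "A", "histology", "papillary RCC"),
    Row("R1", "A", "grade", "2"),
    Row("R1", "B", "histology", "clear cell RCC"),
]

for key, llm_label, gold_label in find_discrepancies(llm_v1, gold_v1):
    print(key, "LLM:", llm_label, "| gold:", gold_label)
```

Keying on report ID, specimen, and item name means that items present in only one table surface as discrepancies rather than being silently dropped, which is useful when the schema changes between versions.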