Fig. 1: Study flow diagram.

The study cohort is split into a prompt-refinement dataset and a validation dataset (the latter for evaluating generalizability). The process begins with prompt 0 (P0: Is this note indicative of any cognitive concern, yes or no? \n {note}). AI-generated labels are evaluated against chart-reviewed labels (i.e., the ground truth). An expert-driven approach is developed as a benchmark and tested on three different LLMs; the best-performing LLM is then used for the agentic workflow. The agentic workflow proposes new prompts by evaluating misclassified cases and summarizing suggestions from specialist agents. P0: initial prompt; XP1, XP2: expert prompts 1 and 2; AP1, AP2: agent prompts 1 and 2.
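The evaluation step in the diagram (filling P0 with a note, labeling it, and comparing the label with chart review) can be illustrated with a minimal Python sketch. Only the P0 wording comes from the caption. The function names, the yes/no parsing rule, the use of accuracy as the summary metric, and the keyword-based stand-in for the LLM are all assumptions made for illustration and do not represent the study's implementation.

```python
from typing import Callable, List, Tuple

# P0 as worded in the caption; {note} is replaced by the clinical note text.
P0_TEMPLATE = "Is this note indicative of any cognitive concern, yes or no? \n {note}"


def build_prompt(template: str, note: str) -> str:
    """Fill the prompt template with a single note."""
    return template.format(note=note)


def parse_label(response: str) -> bool:
    """Map a free-text answer to a binary label (assumed rule: starts with 'yes')."""
    return response.strip().lower().startswith("yes")


def evaluate(
    template: str,
    notes: List[str],
    truth: List[bool],
    llm: Callable[[str], str],
) -> Tuple[float, List[int]]:
    """Return accuracy and the indices of misclassified notes.

    `llm` is any callable mapping a prompt string to a response string;
    the misclassified indices are what a refinement step would inspect.
    """
    predictions = [parse_label(llm(build_prompt(template, n))) for n in notes]
    misclassified = [
        i for i, (pred, ref) in enumerate(zip(predictions, truth)) if pred != ref
    ]
    accuracy = 1 - len(misclassified) / len(notes)
    return accuracy, misclassified


def keyword_stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: answers 'Yes' if 'memory' appears."""
    return "Yes" if "memory" in prompt.lower() else "No"
```

In this sketch, the list of misclassified indices plays the role of the cases that the agentic workflow reviews when proposing a revised prompt; a real pipeline would replace `keyword_stub_llm` with a call to the chosen model.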