Fig. 7: Comparison of structured output with and without Pydantic in pathological T (pT) classification of gynecologic cancers using Qwen2.5 72B.

(A) Example of conventional prompt-based structured output. The output may include unnecessary explanations or inconsistent formatting, which requires manual post-processing. (B) Example output using Pydantic-enforced constraints, which ensure format consistency and suppress irrelevant text. (C, D) Confusion matrices showing the accuracy of Qwen2.5 72B in extracting the pT classification from pathology reports of gynecologic cancers (n = 951), with (C) and without (D) Pydantic-based structured decoding. The vertical axis represents the ground-truth pT values obtained by manual annotation, and the horizontal axis represents the pT values extracted by the model. Each cell shows the number of cases (n). “N.S.” denotes “Not specified.” (E) Summary of performance differences between the two approaches. Pydantic improved accuracy, precision, recall, and F1 score; the differences were evaluated using bootstrapped mean differences with 95% confidence intervals, and all improvements were statistically significant.
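As an illustration of how a bootstrapped mean difference and its 95% confidence interval, as summarized in panel E, could be computed, the following is a minimal sketch. It is not the authors' analysis code. It assumes a paired, case-level resampling scheme and a percentile interval, neither of which is specified in the caption. The function name `bootstrap_mean_difference`, its parameters, and the toy correctness vectors are hypothetical and are not data from the study.

```python
import random


def bootstrap_mean_difference(with_pydantic, without_pydantic,
                              n_resamples=2000, seed=0):
    """Paired percentile bootstrap of the difference in accuracy.

    Each input is a list of 0/1 flags (1 = the extracted pT value matched
    the manual annotation), aligned so that index i is the same report in
    both lists. Pairing and the percentile method are assumptions here.
    """
    if len(with_pydantic) != len(without_pydantic):
        raise ValueError("Both inputs must cover the same cases")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(with_pydantic)
    diffs = []
    for _ in range(n_resamples):
        # Resample case indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_with = sum(with_pydantic[i] for i in idx) / n
        acc_without = sum(without_pydantic[i] for i in idx) / n
        diffs.append(acc_with - acc_without)

    diffs.sort()
    # Percentile interval: 2.5th and 97.5th percentiles of the resamples.
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    mean_diff = sum(diffs) / n_resamples
    return mean_diff, lower, upper


# Toy correctness flags for 20 hypothetical reports (NOT study data).
with_pydantic = [1] * 18 + [0] * 2
without_pydantic = [1] * 12 + [0] * 8

mean_diff, lower, upper = bootstrap_mean_difference(with_pydantic,
                                                    without_pydantic)
print(f"mean difference = {mean_diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Under this scheme, a difference would be read as statistically significant when the 95% interval excludes zero. The same resampling loop could be reused for precision, recall, or F1 score by replacing the per-resample accuracy with the corresponding metric.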