Fig. 1: Methodological overview and database summary statistics.

A Depiction of the overall method. Top: Separation of the TCGA dataset into training/validation and held-out test sets, with CUIMC used for external validation. Bottom: Example TCGA pathology reports, input to separate transformer models for TNM stage prediction. B Token distribution for TCGA training set reports. Reports were tokenized with the ClinicalBERT (CB) tokenizer using the pre-defined CB vocabulary. C Per-class distribution of TCGA pathology reports with TNM staging annotation. The distribution of TNM values varied substantially between cancer types. The x-axis is labeled with TCGA cancer-type abbreviations.