Fig. 1: Methodological overview and database summary statistics. | Nature Communications


From: Generalizable and automated classification of TNM stage from pathology reports with external validation


A Depiction of the overall method. Top: Dataset separation into training/validation and held-out test sets (TCGA), plus an external validation set (CUIMC). Bottom: Example TCGA pathology reports, input to separate transformer models for TNM stage prediction. B Token distribution for TCGA training set reports. The ClinicalBERT (CB) tokenizer was used to tokenize reports into the pre-defined CB vocabulary. C Per-class distribution of TCGA pathology reports with TNM staging annotation. The distribution of TNM values varied substantially between cancer types. The x-axis is labeled with TCGA cancer-type abbreviations.
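
The summary statistics in panels B and C (tokens per report, reports per class and cancer type) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the record layout, and the three toy reports are hypothetical, and a plain whitespace split stands in for the ClinicalBERT tokenizer so that the example runs without downloading a model. A subword tokenizer would generally yield larger counts.

```python
from collections import Counter


def token_lengths(reports, tokenize=str.split):
    """Return the number of tokens in each report.

    `tokenize` defaults to a whitespace split (a stand-in assumption);
    any callable mapping a string to a list of tokens can be passed.
    """
    return [len(tokenize(text)) for text in reports]


def class_distribution(records):
    """Count reports per (cancer_type, label) pair."""
    return Counter((r["cancer_type"], r["label"]) for r in records)


# Hypothetical toy records; real inputs would be TCGA pathology reports.
records = [
    {"cancer_type": "BRCA", "label": "T1",
     "text": "Invasive ductal carcinoma, 1.5 cm."},
    {"cancer_type": "BRCA", "label": "T2",
     "text": "Tumor measures 3.2 cm in greatest dimension."},
    {"cancer_type": "LUAD", "label": "T1",
     "text": "Adenocarcinoma, 2.0 cm, margins negative."},
]

lengths = token_lengths([r["text"] for r in records])
counts = class_distribution(records)
```

Passing a different `tokenize` callable changes only the token counts; the per-class counting is independent of the tokenizer.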
