Fig. 1: Unsupervised clustering identifies biological subpopulations within newly diagnosed DLBCL.

Unsupervised clustering was applied to a large cohort of patient-derived RNAseq data to identify biologically homogeneous subtypes of DLBCL. A Schematic of the data transformation, unsupervised clustering, and classifier training methodology. Steps in black represent data objects, while steps in blue represent algorithmic processes. B Co-clustering frequency heatmap showing sample clusters that consistently group together across repeated subsampling runs. C Cluster prevalence and breakdown by COO and TME26 classification. Bar heights represent the observed proportion in each cohort, and error bars represent the 95% confidence interval. D Top 50 up- and down-regulated genes per cluster from the Discovery dataset, replicated in each cohort. Source data are provided as a Source Data file.
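
The co-clustering frequency shown in panel B can be illustrated with a minimal sketch. It assumes the frequency for a pair of samples is the fraction of subsampling runs containing both samples in which they received the same cluster label. That definition, the function name `co_clustering_frequency`, and the toy labels are illustrative assumptions, not the authors' implementation, and the clustering algorithm itself is left out.

```python
from itertools import combinations


def co_clustering_frequency(n_samples, runs):
    """Return an n_samples x n_samples matrix of co-clustering frequencies.

    runs: iterable of (indices, labels) pairs, one per subsampling run.
    indices are the sample positions drawn in that run, and labels are the
    cluster assignments for those samples, in the same order.
    Pairs that were never drawn together get None (assumed convention).
    """
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]

    for indices, labels in runs:
        # Visit every pair of samples that appeared in this run.
        for (i, label_i), (j, label_j) in combinations(zip(indices, labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if label_i == label_j:
                together[i][j] += 1
                together[j][i] += 1

    freq = [[None] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            if i == j:
                freq[i][j] = 1.0  # a sample always clusters with itself
            elif sampled[i][j] > 0:
                freq[i][j] = together[i][j] / sampled[i][j]
    return freq


# Toy example: 4 samples, 3 subsampling runs with hypothetical labels.
# Label names are arbitrary and need not match between runs.
runs = [
    ([0, 1, 2], ["a", "a", "b"]),
    ([0, 1, 3], ["x", "x", "y"]),
    ([1, 2, 3], ["p", "q", "q"]),
]
freq = co_clustering_frequency(4, runs)
```

In this toy input, samples 0 and 1 share a label in every run that contains both, so their frequency is 1.0, while samples 1 and 2 never share a label and get 0.0. Ordering the rows and columns of such a matrix by cluster membership gives the block-diagonal pattern a co-clustering heatmap is meant to reveal.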