Extended Data Fig. 4: Exploring cell misclassification within the COMBAT2022 dataset.
From: Interpretable inflammation landscape of circulating immune cells

(a-b) Normalized confusion matrices, aggregated (left) and one for each cell type (Level 1; excluding Cycling cells, Progenitors, Platelets and RBC) (right), displaying the proportion of predictions belonging to each True Condition. Diagonal values correspond to the Recall metric. XGBoost was trained on the original normalized and log-scaled cell expression profiles from (a) the whole COMBAT dataset and (b) Healthy, Flu and COVID (stratified by disease severity) samples from the COMBAT dataset. (c-d) Agglomerative hierarchical clustering with complete linkage (using the average method and cosine distance) was performed on pseudobulk gene expression at the patient level (c), or at the cell type (Level 1) and patient level (d), using the log-normalized uncorrected count matrix on the 8,253-gene expression universe. Sample covariates, including sequencing pool, sex, and age, were also incorporated.
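
As a reading aid for panels (a-b), the following minimal sketch shows why the diagonal of a confusion matrix normalized over each True Condition equals per-class Recall. It is not the authors' pipeline: the function name `row_normalized_confusion`, the label set and the toy predictions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix in which each row (true condition) sums to 1.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Leave rows with no true cells at zero instead of dividing by zero.
    return np.divide(
        counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0
    )


# Toy per-cell labels (assumed values, not data from the study).
labels = ["Healthy", "Flu", "COVID"]
y_true = ["Healthy", "Healthy", "Flu", "Flu", "Flu",
          "COVID", "COVID", "COVID", "COVID"]
y_pred = ["Healthy", "Flu", "Flu", "Flu", "COVID",
          "COVID", "COVID", "COVID", "Healthy"]

cm = row_normalized_confusion(y_true, y_pred, labels)

# Diagonal entry i = correctly predicted cells of class i / all true cells
# of class i, which is the definition of Recall for that class.
recall = np.diag(cm)
for label, value in zip(labels, recall):
    print(f"{label}: recall = {value:.3f}")
```

In practice the same normalization is available through scikit-learn's `confusion_matrix` with `normalize="true"`; the explicit version above is used only to make the row-wise division visible.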