Fig. 3: Functional gene discovery using interpretable machine learning.
From: Interpretable inflammation landscape of circulating immune cells

a, Normalized confusion matrices showing the proportion of predictions for each true condition. Diagonal values correspond to recall. XGBoost was trained on the scANVI batch-corrected (left) or batch-uncorrected (right) log-scaled cell expression profiles. b, Validation of d-SHAP-based gene selection using XGBoost trained with nested cross-validation on cells from unseen studies. Each point corresponds to the average left-out fold performance for the best configuration of each fold combination. The box plots report the WF1 (top) and the BAS (bottom) computed using the top 5, 10 and 20 genes (among those expressed in at least 5% of the total cells) ranked by d-SHAP value, for each inflammatory condition present in the unseen studies dataset (that is, healthy, sepsis, CD, SLE, HIV, cirrhosis, RA and COVID), across cell types (Level 1). For the same numbers of genes, we report the performance scores of n = 20 randomly selected gene sets. The performance of the classifier trained on the whole gene set, consisting of the genes expressed in at least 5% of the total cells, is also reported. Boxes indicate the interquartile range (IQR) with the median as a center line; whiskers extend to 1.5× IQR; and outliers are shown as individual points. c, Scatter plot of max-normalized gene expression against d-SHAP values computed for the CYBA gene in the monocyte population (Level 1), considering the output of the disease-XGBoost for a given disease (UC, CD, PS and PSA, from left to right). d, Scatter plot of max-normalized gene expression against d-SHAP values computed for the IFITM1 gene in the T non-naive CD4 and ILC populations (annotation Level 1), considering the output of the disease-XGBoost for a given disease (asthma and COPD, left and right). In c and d, we limited the visualization to at most 60,000 cells, sampling an equal percentage from each patient, corresponding to 5% and 7.5% of monocytes and T non-naive CD4 cells, respectively. Cells belonging to samples with or without the given condition (disease) are marked in orange or blue, respectively. CD, Crohn’s disease; MS, multiple sclerosis; PS, psoriasis; PSA, psoriatic arthritis; RA, rheumatoid arthritis; UC, ulcerative colitis.
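
The relationship stated for panel a, that the diagonal of a confusion matrix normalized over the true condition equals per-class recall, can be checked with a minimal sketch. The labels and predictions below are invented toy data, not values from the figure, and the function names are hypothetical; the sketch also assumes every predicted label appears in the label list.

```python
from collections import Counter


def normalized_confusion_matrix(y_true, y_pred, labels):
    """Row-normalize counts so that each row (one true condition) sums to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix.append(
            [counts[(t, p)] / row_total if row_total else 0.0 for p in labels]
        )
    return matrix


def recall_per_class(y_true, y_pred, labels):
    """Recall = correctly predicted cells of a class / all true cells of that class."""
    recall = {}
    for t in labels:
        n_true = sum(1 for y in y_true if y == t)
        n_hit = sum(1 for y, p in zip(y_true, y_pred) if y == t and p == t)
        recall[t] = n_hit / n_true if n_true else 0.0
    return recall


# Toy example (hypothetical labels and predictions, for illustration only).
labels = ["healthy", "sepsis", "CD"]
y_true = ["healthy"] * 4 + ["sepsis"] * 4 + ["CD"] * 2
y_pred = [
    "healthy", "healthy", "healthy", "sepsis",  # true: healthy
    "sepsis", "sepsis", "CD", "healthy",        # true: sepsis
    "CD", "CD",                                 # true: CD
]

cm = normalized_confusion_matrix(y_true, y_pred, labels)
recall = recall_per_class(y_true, y_pred, labels)

# The diagonal of the row-normalized matrix matches per-class recall.
for i, name in enumerate(labels):
    print(f"{name}: diagonal={cm[i][i]:.2f}, recall={recall[name]:.2f}")
```

Normalizing over rows (true conditions) rather than columns (predicted conditions) is what makes the diagonal read as recall; column normalization would instead give precision on the diagonal.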