Fig. 4: Whole slide image (WSI)-level quantitative analysis of the deep learning (DL) models.

The quantitative analysis compares tumor proportion score (TPS) predictions across different training settings. The reported Cohen’s Kappa scores are computed using the pathologists’ labeled category as ground truth. a Macro-averaged Cohen’s Kappa scores of the eight DL models over all stains. b Cohen’s Kappa scores of the DL models on the PD-L1 22C3 Lung dataset. c Cohen’s Kappa scores of the DL models on the 22C3 Pan-cancer dataset. d Cohen’s Kappa scores of the DL models on the PD-L1 SP142 Lung dataset. e Cohen’s Kappa scores of the DL models on the multi-stain Pan-cancer dataset. The x-axis shows the sum of the number of stain types and the number of organ types in each training cohort (e.g., PH-B [22C3 and HER2 of breast] is 3 because it has 2 stains and 1 cancer type). H-B, HER2 of breast; P-L, 22C3 of lung; P-B, 22C3 of breast; P-LUB, 22C3 of lung, urothelium, and breast; PH-B, 22C3 and HER2 of breast; PH-LB, 22C3 of lung and HER2 of breast; PH-LUB, 22C3 and HER2 of lung, urothelium, and breast. PD-L1, Programmed Death-Ligand 1; HER2, Human Epidermal growth factor Receptor 2.
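
For readers who want to reproduce the kind of quantities plotted here, the following is a minimal sketch, not the authors' code. It assumes the unweighted form of Cohen's Kappa (the caption does not state whether a weighted variant was used), and the function names `cohen_kappa`, `macro_average`, and `x_axis_value` are hypothetical, as are the category labels in the usage example.

```python
from collections import Counter
from math import nan


def cohen_kappa(truth, pred):
    """Unweighted Cohen's Kappa between ground-truth and predicted categories.

    Assumption: the unweighted form; the caption does not specify the variant.
    """
    if len(truth) != len(pred) or len(truth) == 0:
        raise ValueError("truth and pred must be non-empty and of equal length")
    n = len(truth)
    # Observed agreement: fraction of slides where both labels match.
    observed = sum(t == p for t, p in zip(truth, pred)) / n
    # Chance agreement: product of marginal frequencies, summed over categories.
    truth_counts = Counter(truth)
    pred_counts = Counter(pred)
    expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    if expected == 1.0:
        return nan  # Kappa is undefined when chance agreement is 1.
    return (observed - expected) / (1.0 - expected)


def macro_average(scores):
    """Unweighted mean of per-dataset Kappa scores (as in a macro average)."""
    scores = list(scores)
    return sum(scores) / len(scores)


def x_axis_value(stains, organs):
    """Number of distinct stain types plus number of distinct organ types."""
    return len(set(stains)) + len(set(organs))
```

As a usage example under the same assumptions, `x_axis_value(["22C3", "HER2"], ["breast"])` gives 3 for the PH-B cohort, matching the example in the caption, and `cohen_kappa` can be applied to any pair of per-slide category lists such as hypothetical TPS bins.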