Fig. 2: Development of an index discriminating breast cancer cases from controls based on cervical samples.

a Distribution of different cell types in samples of the discovery dataset inferred using the HEpiDISH algorithm (*p < 0.05, **p < 0.01, ***p < 0.001 in two-sided Wilcoxon signed-rank test; n = 869 controls, n = 329 breast cancer cases). Exact p values: epithelial cells, p = 0.030; neutrophils, p = 0.052; monocytes, p = 0.0009; NK cells, p = 0.006; CD4+ T cells, p = 0.0013; B cells, p = 0.002; fibroblasts, p = 0.001; eosinophils, p = 0.029; CD8+ T cells, p = 0.475. No adjustment for multiple comparisons was made. Box plots follow the standard Tukey representation, with boxes indicating the median and interquartile range, and lines indicating the smallest and largest values within 1.5 times the interquartile range of the 25th and 75th percentiles, respectively. Dots indicate outlier values. b Example of a CpG with cell-type-specific methylation. c Area under the receiver operating characteristic curve (AUROC) in the internal validation set as a function of the number of CpGs used to train the classifier. d ROC curves of the WID-BC-index in the internal validation set for samples with an immune cell proportion ≤0.5 and >0.5. e Distribution of the WID-BC-index with respect to immune cell proportion in the internal validation set. f Distribution of the estimated variance in epithelial and immune cells across all CpGs used in the WID-BC-index. g AUROC values in the internal validation set after training classifiers on different subsets of the CpGs used in the WID-BC-index. The top n CpGs were either retained or removed. CpGs were also split into separate bins of size 500. h An index developed using data extracted from TCGA breast cancer and normal samples (TCGA-BC-index) discriminates cancer cases from controls in breast tissue (n = 44 controls, n = 222 breast cancer samples) but not in cervical samples (n = 1094 controls, n = 442 breast cancer cases), prompting the development of a cervix-specific index reflecting systemic changes in a hormone-sensitive surrogate tissue.
Source data are provided as a Source Data file.
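
The Tukey box plot convention used in panel a can be illustrated with a minimal sketch. It is not the authors' plotting code. The function name `tukey_box_stats`, the choice of linear-interpolation quartiles, and the example values are all assumptions made for illustration, and plotting libraries may compute quartiles slightly differently.

```python
from statistics import median, quantiles


def tukey_box_stats(values):
    """Return the quantities drawn in a standard Tukey box plot, plus outliers."""
    data = sorted(values)
    # Quartiles by linear interpolation (assumption: other quartile rules exist).
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Fences sit 1.5 x IQR beyond the 25th and 75th percentiles.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences are drawn as individual dots.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical cell-type proportions for one group (illustrative values only).
example = [0.10, 0.12, 0.13, 0.15, 0.16, 0.18, 0.20, 0.55]
stats = tukey_box_stats(example)
print(stats)
```

In this made-up example the value 0.55 lies beyond the upper fence and would be drawn as a dot, while the upper whisker ends at 0.20.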