Extended Data Fig. 10: Development of MMRDetect.

a-e, Distribution of the five parameters across IHC-determined MMR gene abnormal (orange) and MMR gene normal (green) samples. Black dots and error bars represent the mean ± SD of the parameters. NAbnormal = 79 samples (yellow); NNormal = 257 samples (green). a, Exposure of MMRd signatures. b, Cosine similarity between the substitution profile of cancer samples and that of MMR gene knockouts. c, Number of indels in repetitive regions. d, Cosine similarity between the profile of repeat-mediated deletions of cancer samples and that of knockout-generated indel signatures. e, Cosine similarity between the profile of repeat-mediated insertions of cancer samples and that of knockout-generated indel signatures. P values were calculated using two-sided Mann-Whitney tests. f, Distribution of coefficients from 10-fold cross-validation using the training data set. Box plots denote the median (horizontal line) and the 25th to 75th percentiles (boxes). The lower and upper whiskers extend to 1.5× the inter-quartile range. N = 10 iterations. g, MMRDetect-calculated probabilities for 336 colorectal cancers. With a cut-off of 0.7, 77 of the 336 samples were predicted to be MMR-deficient (probability < 0.7). Colour bars represent the MSI status determined by IHC staining: red, abnormal; blue, normal. Four samples with abnormal IHC staining have probabilities > 0.7, whilst two samples with normal IHC staining have probabilities < 0.7. Validation using MSIseq and a search for coding mutations in MMR genes showed that the four samples were false positives, and the two samples false negatives, of IHC staining.
h, Distribution of the number of mutations from repeat-mediated indels, MMRd signatures and non-MMRd signatures across four groups of samples: MMR-deficient samples determined by MMRDetect only (yellow), MMR-deficient samples determined by MSIseq only (purple), MMR-deficient samples determined by both MMRDetect and MSIseq (blue) and non-MMR-deficient samples determined by both MMRDetect and MSIseq (pink). P values were calculated using two-sided Mann-Whitney tests. The numbers of samples called MMR-deficient by MMRDetect only (blue), MSIseq only (pink), both (yellow) and neither (purple) are 34, 20, 587 and 6,718, respectively.
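
As an illustration of two quantities used in this legend, the following minimal Python sketch computes the cosine similarity between two mutation profiles (as in panels b, d and e) and applies the 0.7 probability cut-off from panel g. It is not the MMRDetect implementation: the function names `cosine_similarity` and `call_mmr_status`, the label "MMR-proficient", the handling of all-zero profiles and the toy profiles are assumptions made for illustration only.

```python
import math

CUTOFF = 0.7  # probability cut-off stated in panel g


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two equal-length mutation-count profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(x * y for x, y in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(x * x for x in profile_a))
    norm_b = math.sqrt(sum(y * y for y in profile_b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a profile with no mutations is treated as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def call_mmr_status(probability, cutoff=CUTOFF):
    """Label a sample MMR-deficient when its probability is below the cut-off,
    following the convention stated in panel g (probability < 0.7)."""
    return "MMR-deficient" if probability < cutoff else "MMR-proficient"


if __name__ == "__main__":
    # Toy profiles (hypothetical counts, not data from the study).
    sample_profile = [12, 3, 0, 7]
    knockout_profile = [10, 4, 1, 6]
    print(round(cosine_similarity(sample_profile, knockout_profile), 3))
    print(call_mmr_status(0.15))
    print(call_mmr_status(0.92))
```

A sample whose probability equals the cut-off exactly is labelled MMR-proficient here, because the legend specifies a strict inequality (probability < 0.7).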