Extended Data Fig. 5: Comparison of mutational signatures that were extracted using two algorithms.
From: Tobacco smoking and somatic mutations in human bronchial epithelium

a, Trinucleotide contexts for the signatures extracted by the hierarchical Dirichlet process (HDP) (left) and MutationalPatterns non-negative matrix factorization (right). The six substitution types are shown across the top of each signature. Within each substitution type, the trinucleotide context is shown as four sets of four bars, grouped by whether an A, C, G or T, respectively, is 5′ to the mutated base, and within each group of four by whether an A, C, G or T is 3′ to the mutated base (the order of bars is the same as that shown in Fig. 2b). Signatures that show high cosine-similarity scores between algorithms are lined up horizontally. Signature C in MutationalPatterns does not have a match among the signatures extracted by the HDP algorithm, but appears very similar to Signature A in MutationalPatterns (or SBS-5 from the HDP); it therefore probably represents over-splitting of the signatures. b, Heat map showing the cosine similarities of signatures extracted by MutationalPatterns with those extracted by the HDP. Only cosine-similarity scores greater than 0.75 are coloured. c, Scatter plots showing the fraction of mutations in each colony (n = 632) assigned to each signature by the HDP algorithm (x axis) versus the MutationalPatterns algorithm (y axis). The correlation values quoted are Pearson’s correlation coefficients (R²). d, Transcriptional strand bias of A>G mutations in an N[A]T context before and after transcription start sites (TSSs). Note the absence of transcriptional strand bias in intergenic regions but evidence for both transcription-coupled damage and repair after the TSS, applying similarly in both never-smokers and ex- or current smokers.
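The comparison in panel b can be illustrated with a minimal sketch, assuming each signature is stored as a row of 96 trinucleotide-context proportions. The function names, the variable names and the randomly generated example signatures are hypothetical and are not taken from the study; only the 0.75 colouring threshold comes from the legend.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every signature to unit length so the dot product is the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def mask_at_or_below(sim, threshold=0.75):
    """Keep scores greater than the threshold; set the rest to NaN (uncoloured)."""
    return np.where(sim > threshold, sim, np.nan)

# Hypothetical example data: three signatures per algorithm over 96 contexts.
rng = np.random.default_rng(0)
hdp = rng.dirichlet(np.ones(96), size=3)
# The first "NMF" signature is an exact copy of the first HDP signature.
nmf = np.vstack([hdp[0], rng.dirichlet(np.ones(96), size=2)])

sim = cosine_similarity_matrix(nmf, hdp)   # rows: NMF, columns: HDP
masked = mask_at_or_below(sim)
```

Because signature proportions are non-negative, every score lies between 0 and 1, and an identical pair of signatures scores 1.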