Extended Data Fig. 3: Nanopore sequencing signal processing variable.
From: Discovering multiple types of DNA methylation from bacteria and microbiome using nanopore sequencing

(a) Comparison of current differences across methylation occurrences between datasets base called with Albacore 1.1.0, Albacore 2.3.4, and Guppy 3.2.4, illustrated by t-SNE projection for 46 well-characterized motifs (Supplementary Table 2). Each dot represents one isolated motif occurrence, colored by base caller version. 100,000 motif occurrences were randomly selected from each dataset to reduce the scatter plot density and ease visualization. For each motif occurrence, current differences from 22 positions near methylated bases ([−10 bp, +11 bp]) were used. (b) Performance for de novo methylated site detection between datasets base called with Albacore 1.1.0, Albacore 2.3.4, and Guppy 3.2.4. We evaluated the detection of individual motif occurrences using Precision-Recall curves for H. pylori at 75x coverage. Precision-Recall curves and areas under the curve (AUC) were computed as described in the Methods section. Only confident H. pylori motifs were considered for the evaluation. (c) Comparison of current differences across methylation occurrences (same as a) between datasets produced with or without the outlier removal step (Methods). (d) Performance for de novo methylated site detection (similar to b) with datasets produced with or without the outlier removal step. (e) Variation of current differences across methylation occurrences without the outlier removal step, as illustrated by motif signatures from three motifs: AG4mCT (n = 6550 occurrences), GGW5mCC (n = 1875 occurrences), and GCYYG6mAT (n = 954 occurrences). For each motif, current differences near methylated bases ([−6 bp, +7 bp]) from all isolated occurrences are plotted, preserving relative distances to methylated bases. Distributions of current differences for each relative distance are displayed as violin plots. The current difference axis is limited to the −8 to 8 pA range.
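The Precision-Recall evaluation used in panels (b) and (d) can be illustrated with a minimal sketch. It is not the authors' procedure, which is given in the Methods section. It assumes each candidate position carries a detection score (higher meaning more likely methylated) and a binary label marking whether it overlaps a known motif occurrence. The names `scores`, `labels`, `precision_recall_curve`, and `pr_auc` are hypothetical, and the trapezoidal rule is one of several ways to compute the area.

```python
def precision_recall_curve(scores, labels):
    """Return (recall, precision) points obtained by lowering the score threshold.

    scores: detection scores, higher = more confident (assumed convention).
    labels: 1 for a true methylated site, 0 otherwise (assumed encoding).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Visit candidates from the most to the least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct threshold so tied scores count as a group.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def pr_auc(points):
    """Trapezoidal area under the (recall, precision) points."""
    area = 0.0
    # Extend the first precision value back to recall = 0.
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6]  # illustrative values only
    toy_labels = [1, 0, 1, 0]
    print(pr_auc(precision_recall_curve(toy_scores, toy_labels)))
```

Comparing two processing choices, such as two base callers, then amounts to computing this area once per dataset on the same labeled positions.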
(f) Performance for de novo methylated site detection across current difference datasets generated with different read alignment type filtering: remove alternative alignments (alignments with XA BAM tags filtered out; named No Alt.), remove supplementary alignments (alignments with BAM flag 2048 filtered out; named No Supp.), remove chimeric alignments (alignments with SA BAM tags filtered out; named No Chim.), keep only unique mappings (alignments with XA or SA BAM tags filtered out; named Unique), and keep all alignments (named None). (g) Performance for de novo methylated site detection across datasets normalized with linear regression (lm function), robust regression (rlm function), or no additional normalization (annotated as none). (h) Performance for de novo methylated site detection across datasets generated using a two-sided Mann-Whitney U-test or Student’s t-test. (i) Performance for de novo methylated site detection across datasets generated using different p-value smoothing window sizes: no smoothing (named None), 3 nt, 5 nt, and 7 nt. (j) Performance for de novo methylated site detection across datasets generated using different functions for combining consecutive p-values: Fisher’s method (named sumlog), logit method (named logitp), sum p method (named sump), and sum z method (named sumz). (k) Performance for de novo methylated site detection across peak datasets generated using different peak detection window sizes: 5 nt, 7 nt, and 9 nt. Plots f, g, h, i, j, and k show Precision-Recall curves and areas under the curve (AUC) for various signal processing steps, computed as described in the Methods section. (l) Comparison of current differences across methylation occurrences (same as a) with E. coli datasets (200x) produced using either the reference genome or the de novo assembly (Methods). (m) Performance for de novo methylated site detection in E. coli datasets (200x) using either the reference genome or the de novo assembly. (n) Performance of methylation motif typing and fine mapping on E. coli datasets (200x) produced using either the reference genome or the de novo assembly (motif occurrences: n = 458 for AACNNNNNNGTGC, n = 18451 for CCWGG, n = 28110 for GATC, n = 463 for GCACNNNNNNGTT). Only results for k-nearest neighbors, neural network, and random forest are displayed.
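The p-value smoothing compared in panels (i) and (j) can be sketched for Fisher's method as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one p-value per genomic position and a centered window that is truncated at the sequence ends. The function names `fisher_combine` and `smooth_pvalues` are hypothetical. Fisher's statistic is X = −2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom for k independent p-values; for even degrees of freedom the upper tail has a closed form, so no statistics library is needed.

```python
import math


def fisher_combine(pvalues):
    """Combine p-values with Fisher's method (assumes independence).

    With X = -2 * sum(ln p) and k p-values, the chi-square upper tail with
    2k degrees of freedom equals exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0   # (X/2)^i / i!, starting at i = 0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def smooth_pvalues(pvalues, window=5):
    """Replace each p-value by the Fisher combination of its window.

    window: odd window size in nucleotides (3, 5, or 7 in the caption);
    centering and end truncation are assumptions of this sketch.
    """
    half = window // 2
    n = len(pvalues)
    return [
        fisher_combine(pvalues[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]


if __name__ == "__main__":
    per_position = [0.6, 0.04, 0.01, 0.03, 0.7]  # illustrative values only
    print(smooth_pvalues(per_position, window=3))
```

Neighboring positions in a nanopore signal are affected by the same modified base, so their p-values are not strictly independent; the combined values are best read as a ranking score for peak detection rather than as calibrated p-values.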