Fig. 2: An example of a microhomology-induced chimeric read (MICR)-originated sequencing error.
From: MicroSEC filters sequence errors for formalin-fixed and paraffin-embedded samples

a The genomic sequence visualized by Integrative Genomics Viewer exhibits a T-to-C artifact in the FGFR4 gene found in target sequencing data of an FFPE normal breast tissue sample. In all mutation-supporting reads, only six bases downstream of the mutation were mapped, and the rest were soft-clipped (red line). The blue-colored read has an inferred insert size smaller than expected. The mate reads of the green- or gold-colored reads were mapped to different chromosomes.
b A representative read supporting the T-to-C artifact in Fig. 2a. The upstream sequence of the read (blue arrow) was mapped to the forward strand of the genome, and the downstream sequence of the same read (green arrow) was mapped to the reverse strand. Strangely, the upstream and downstream sequences overlapped, as did the genomic sequences to which each was mapped. Since the upstream sequence was longer than the downstream sequence, only the upstream sequence was eventually mapped and the downstream sequence was soft-clipped. Two palindromic sequences exist in close proximity to each other, and the mismatched base between the two sequences (red box) represents the source of the T-to-C artifact. Most of the downstream bases were soft-clipped.
c Presumed mechanism of the phenomenon observed in Fig. 2b. Two palindromic sequences in a single-stranded DNA (ssDNA) formed a hairpin structure at the end-repair step of library preparation. After nicking and partial denaturation, the double-stranded DNA was regenerated during the end-repair step of library preparation. The mismatched base between the two palindromic sequences was defined as a mutation.
d The MicroSEC algorithm is based on three criteria. Filters 1 and 3: across the reads, the distance from the mutation position to the most distant mapped base is confined to a limited range that is statistically improbable. Filter 2: MICR-originated sequencing errors are generated when two palindromic sequences are in the same DNA fragment.
Filter 4: The mis-annealing of ssDNA derived from other distant homologous regions of the genome also creates chimeric reads and artifacts. Dark-red, green, or light-blue horizontal bars represent sequences of other distant regions of the genome. Chimeric reads with mutated bases were formed.
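
The hairpin scenario in panels b and c can be illustrated with a minimal sketch. This is not the MicroSEC implementation. The function names, the fixed arm length, the loop limit, and the mismatch tolerance are all assumptions chosen for illustration. The code scans a sequence for two nearby windows that are reverse complements of each other apart from a small number of mismatched bases, which is the configuration the caption describes as the source of the artifact.

```python
# Illustrative only: hypothetical helper names and parameters,
# not taken from MicroSEC.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_hairpin_arms(seq, arm_len=8, max_loop=20, max_mismatch=1):
    """Find pairs of windows that could fold into a hairpin.

    Returns a list of (left_start, right_start, mismatch_offsets), where
    mismatch_offsets are positions within the right arm that do not pair
    with the left arm. The genomic position of a mismatch is
    right_start + offset.
    """
    hits = []
    n = len(seq)
    for i in range(n - 2 * arm_len + 1):
        # A perfect partner for the left arm is its reverse complement.
        target = reverse_complement(seq[i:i + arm_len])
        first_j = i + arm_len
        last_j = min(first_j + max_loop, n - arm_len)
        for j in range(first_j, last_j + 1):
            right = seq[j:j + arm_len]
            mismatches = [k for k in range(arm_len) if right[k] != target[k]]
            if len(mismatches) <= max_mismatch:
                hits.append((i, j, mismatches))
    return hits


# Toy sequence: left arm, a 4-base loop, then a right arm that is the
# reverse complement of the left arm except for one base (A -> G).
left_arm = "GATTACAG"
right_arm = "CTGTAGTC"  # perfect partner would be "CTGTAATC"
example = "TT" + left_arm + "AAAA" + right_arm + "TT"
example_hits = find_hairpin_arms(example)
```

In this toy case the single mismatched offset in the right arm plays the role of the base marked by the red box in panel b. A real filter would also need to consider read alignments and soft-clipping, which this sketch ignores.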