Extended Data Fig. 7: The workflow of dynamic clustering and the optimizations of epi-bit sites for large-scale epi-bit storage (the panda image).
From: Parallel molecular data storage by printing epigenetic bits on DNA

a, The dynamic clustering workflow used to reduce misclassified reads. First, the methylation probabilities at the barcode sites were used as the basis for K-means clustering (Step 1). Second, reads in each barcode group were clustered based on the methylation probabilities at the information sites (Step 2). Third, the number of clusters was selected adaptively according to the silhouette score, and each barcode group was updated with the cluster closest to the current barcode in Euclidean space (Step 3). Meanwhile, the remaining reads were assigned to the nearest barcode groups (Step 4). Finally, Megalodon was used to call the methylation probabilities from the grouped reads. b, Per-site accuracies compared pairwise between experiments for the optimization of epi-bit sites. The accuracy of each epi-bit site in each sequencing experiment was calculated from the methylation calling results of single reads. Sites with accuracies below 0.5 were dropped in subsequent experiments. The subplots on the diagonal show the kernel density estimates for each experiment. c, Heatmap of average accuracies at each site (columns) for each template L1-L5 (rows).
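
As an illustration only, the nearest-barcode assignment of Step 4 and the per-site accuracy filter of panel b might be sketched as follows. This is not the authors' code: the function names, the data layout (one list of methylation probabilities per read), the 0.5 calling threshold, and the definition of accuracy as the fraction of single reads whose thresholded call matches the expected bit are all assumptions made for this sketch. Only the Euclidean-distance criterion and the 0.5 accuracy cutoff come from the caption.

```python
import math


def nearest_barcode(read_probs, barcodes):
    """Return the ID of the barcode closest to one read in Euclidean space.

    read_probs: methylation probabilities at the barcode sites of a read.
    barcodes:   dict mapping a barcode ID to its ideal 0/1 pattern
                (hypothetical layout, assumed for this sketch).
    """
    return min(barcodes, key=lambda b: math.dist(read_probs, barcodes[b]))


def site_accuracies(reads, expected_bits, threshold=0.5):
    """Per-site fraction of reads whose thresholded call matches the expected bit.

    The calling threshold is an assumption, not a value from the caption.
    """
    correct = [0] * len(expected_bits)
    for probs in reads:
        for i, p in enumerate(probs):
            call = 1 if p >= threshold else 0
            if call == expected_bits[i]:
                correct[i] += 1
    return [c / len(reads) for c in correct]


def keep_sites(accuracies, min_accuracy=0.5):
    """Indices of retained sites; sites with accuracy below the cutoff are dropped."""
    return [i for i, a in enumerate(accuracies) if a >= min_accuracy]


# Toy data, invented for illustration.
barcodes = {"A": [1, 1, 0, 0], "B": [0, 0, 1, 1]}
group = nearest_barcode([0.9, 0.7, 0.2, 0.1], barcodes)  # -> "A"

reads = [
    [0.90, 0.2, 0.6],
    [0.80, 0.4, 0.3],
    [0.70, 0.6, 0.2],
    [0.95, 0.1, 0.4],
]
acc = site_accuracies(reads, expected_bits=[1, 0, 1])  # -> [1.0, 0.75, 0.25]
kept = keep_sites(acc)                                 # -> [0, 1]
```

In this toy example the third site is called correctly in only one of four reads, so it falls below the 0.5 cutoff and is excluded, mirroring how low-accuracy epi-bit sites were dropped between experiments.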