Figure 2
From: Generator based approach to analyze mutations in genomic datasets

Clustering of Sequences. (a–e) Red and blue sequences are generated by adding substitution, insertion and deletion noise with probability \(0.1\%\) and \(1\%\), respectively. (a–c) show the first two principal components obtained by applying PCA to the state machine representations of the sequences with \(k = 4, b = 1, \beta = 0.5\) for \(L = 2000, 5000, 10000\), respectively. (d) shows the distances between the centers of the red and blue clusters for different combinations of \(b\) and \(k\), while (e) shows the time complexity of the method for different combinations of \(b\) and \(k\); each experiment is repeated 20 times, and the mean and standard deviation are shown. (f) Red and blue sequences are generated by adding substitution, insertion and deletion noise with probability \(0.1\%\) and \(1\%\), respectively, to the root SARS-CoV-2 sequence. With \(k = 4, b = 1, \beta = 0.5\), (f) illustrates clustering performance similar to that observed in (a–c) when noise is applied to a real sequence, i.e. SARS-CoV-2. (g–i) Red and blue sequences are generated by adding substitution, insertion and deletion noise with probability \(10\%\) and \(15\%\), respectively. (g–i) show the first two principal components obtained by applying PCA to the state machine representations of the sequences with \(k = 4, b = 1, \beta = 0.5\) for \(L = 5000, 10000, 30000\), respectively. (j–l) The red sequences have a noise level of \(x\%\) and length \(L = 10000\). The blue sequences, also of length \(L = 10000\), have a noise level of \(x\%\) at locations 1 to 1000 and 4001 to 10000, and a noise level of \(5\%\) at locations 1001 to 4000. (j–l) show the first two principal components obtained by applying PCA to the state machine representations (with \(k = 4, b = 1, \beta = 0.5\)) of the red and blue sequences for \(x = 1, 2, 3\), respectively.
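
The noise model described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: it assumes a DNA alphabet of A, C, G and T, and it assumes that the stated probability applies independently to each of the three event types (substitution, insertion, deletion) at every position. The names `add_noise` and `add_regional_noise` are hypothetical. The second function mirrors panels (j–l), where locations 1001 to 4000 receive a different noise level from the rest of the sequence.

```python
import random

ALPHABET = "ACGT"  # assumption: sequences are DNA over these four bases


def add_noise(seq, p, rng):
    """Return a noisy copy of seq.

    Assumption: at each position, deletion, substitution and insertion
    are each triggered independently with probability p.
    """
    out = []
    for base in seq:
        if rng.random() < p:
            # Deletion: drop this base entirely.
            continue
        if rng.random() < p:
            # Substitution: replace with a different base.
            base = rng.choice([c for c in ALPHABET if c != base])
        out.append(base)
        if rng.random() < p:
            # Insertion: add a random base after the current one.
            out.append(rng.choice(ALPHABET))
    return "".join(out)


def add_regional_noise(seq, base_p, region_p, start, end, rng):
    """Apply region_p inside the 1-based inclusive range [start, end]
    and base_p everywhere else (hypothetical helper for panels j-l)."""
    return (
        add_noise(seq[: start - 1], base_p, rng)
        + add_noise(seq[start - 1 : end], region_p, rng)
        + add_noise(seq[end:], base_p, rng)
    )


# Example in the spirit of panel (j): x = 1, L = 10000.
rng = random.Random(0)
root = "".join(rng.choice(ALPHABET) for _ in range(10000))
red = add_noise(root, 0.01, rng)
blue = add_regional_noise(root, 0.01, 0.05, 1001, 4000, rng)
```

Because insertions and deletions change the length, the noisy sequences are only approximately of length \(L\). The state machine representation and the PCA step are not reproduced here, since the caption does not define them.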