Table 1 Typical distance-based clustering challenges, each illustrated by one example dataset. The table summarizes the results of SI C, Supplementary Fig. 10, and SI D, Supplementary Figs. 11–14. No algorithm is able to reproduce the cluster structures of all problem types with highly stable results. The challenge that no distance-based cluster structures exist is not included in this table because benchmarking is not possible in this case. Note that the benchmarking performed here does not allow one to deduce whether an algorithm fails because of the cluster structures or because of the distribution of the data.
From: Distance-based clustering challenges for unbiased benchmarking studies
Distance-based cluster structures | Exemplary dataset, dimensionality D, range of cluster sizes | Stable clustering solution | Small bias with minor variance | Small bias and unstable clustering solution (multimodality) | Large bias |
---|---|---|---|---|---|
Non-overlapping convex hulls with varying intra-cluster distance | Hepta, D = 3, 14–15% | 24/41 | QT, SOM, | CrossEntropyC, Hartigan, HCL, HDD, LBG, mvnpEM, npEM, Orclus, SOM Sparse k-means, Spectral, | Diana, ProClus, RTC, PPC |
Overlapping convex hulls | Atom, D = 3, 50% | 10/41 | DBS | CrossEntropyC | 29/41 |
Non-overlapping convex hulls with varying geometric shapes and noise | Lsun3D, D = 3, 24–49% (additionally, 4 outliers as noise) | Clustvarsel, , Gini, HDBSCAN, Minimax, ModelBased, mvnpEM, npEM, VarSelLCM, Ward, , , | Fanny, DBS, Orclus, CrossEntropyC, HDD | Spectral, ProClus | 25/41 |
Linear non-separable entanglements | Chainlink, D = 3, 50% | DBS, Gini, HDBSCAN, mvnpEM, SingleL, Spectral, Spectrum, , | Clustvarsel, CrossEntropyC, ModelBased, npEM, VarSelLCM | / | 29/41 |
High dimensionality with highly imbalanced cluster sizes | Leukaemia, D = 7447, range of cluster sizes: 2.7–50% (additionally, 1 outlier as noise) | AverageL, CompleteL, Diana, SingleL, WPGMA | DBS | Clara, HCL, QT | 32/41, with Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RTC, and Spectrum not computable |
High dimensionality with an unstable clustering solution | Cancer, D = 18,167, range of cluster sizes: 10–17% | Gini | Ward | DBS, Hartigan, HDD, LBG, Neural Gas | 34/41, with Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RobustTrimmedC, SparseH, and Spectrum not computable |