Table 1 Typical distance-based clustering challenges, with one example dataset each. The table summarizes the results of SI C, Supplementary Fig. 10, and SI D, Supplementary Figs. 11–14. No algorithm is able to solve all types of problems with highly stable results. The challenge that no distance-based cluster structures exist is not included in this table because benchmarking is not possible in this case. Note that the benchmarking performed here does not allow one to deduce whether an algorithm fails because of the cluster structures or because of the distribution of the data.

From: Distance-based clustering challenges for unbiased benchmarking studies

| Distance-based cluster structures | Exemplary dataset (dimensionality D, range of cluster sizes) | Stable clustering solution | Small bias with minor variance | Small bias and unstable clustering solution (multimodality) | Large bias |
| --- | --- | --- | --- | --- | --- |
| Non-overlapping convex hulls with varying intra-cluster distance | Hepta, D = 3, 14–15% | 24/41 | QT, SOM | CrossEntropyC, Hartigan, HCL, HDD, LBG, mvnpEM, npEM, Orclus, SOM, Sparse k-means, Spectral | Diana, ProClus, RTC, PPC |
| Overlapping convex hulls | Atom, D = 3, 50% | 10/41 | DBS | CrossEntropyC | 29/41 |
| Non-overlapping convex hulls with varying geometric shapes and noise | Lsun3D, D = 3, 24–49% (additionally, 4 outliers as noise) | Clustvarsel, Gini, HDBSCAN, Minimax, ModelBased, mvnpEM, npEM, VarSelLCM, Ward | Fanny, DBS, Orclus, CrossEntropyC, HDD | Spectral, ProClus | 25/41 |
| Linear non-separable entanglements | Chainlink, D = 3, 50% | DBS, Gini, HDBSCAN, mvnpEM, SingleL, Spectral, Spectrum | Clustvarsel, CrossEntropyC, ModelBased, npEM, VarSelLCM | / | 29/41 |
| High dimensionality with highly imbalanced cluster sizes | Leukaemia, D = 7447, 2.7–50% (additionally, 1 outlier as noise) | AverageL, CompleteL, Diana, SingleL, WPGMA | DBS | Clara, HCL, QT | 32/41, with Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RTC, and Spectrum not computable |
| High dimensionality with an unstable clustering solution | Cancer, D = 18,167, 10–17% | Gini | Ward | DBS, Hartigan, HDD, LBG, Neural Gas | 34/41, with Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RobustTrimmedC, SparseH, and Spectrum not computable |
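
The four result columns of the table can be read as bins over the distribution of an algorithm's accuracy across repeated trials. The following minimal sketch illustrates that reading; it is not the procedure used in the study. The function name `categorize` and all three thresholds are hypothetical, and the standard deviation of the trial accuracies is used only as a crude stand-in for a proper multimodality assessment.

```python
from statistics import mean, pstdev

# Hypothetical cut-offs for illustration only; the source states no numeric thresholds.
STABLE_BIAS = 0.01   # mean error rate below which a solution counts as (nearly) exact
SMALL_BIAS = 0.05    # mean error rate still counted as "small bias"
MINOR_SPREAD = 0.02  # standard deviation still counted as "minor variance"


def categorize(accuracies):
    """Map repeated-trial accuracies (fractions in [0, 1]) to a column of the table."""
    bias = 1.0 - mean(accuracies)   # average deviation from a perfect clustering
    spread = pstdev(accuracies)     # crude proxy for instability across trials
    if bias > SMALL_BIAS:
        return "large bias"
    if bias <= STABLE_BIAS and spread <= STABLE_BIAS:
        return "stable clustering solution"
    if spread <= MINOR_SPREAD:
        return "small bias with minor variance"
    return "small bias and unstable clustering solution"


# Example: an algorithm that alternates between two solutions over its trials.
print(categorize([1.0, 0.92, 1.0, 0.92]))
```

Under these assumed thresholds, an algorithm that switches between a perfect and a slightly worse partition falls into the unstable column even though its mean error is small, which is the distinction the third and fourth result columns draw.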