Table 1 Overview of datasets used in SpeakEasy community detection.

From: Identifying robust communities and multi-community nodes by combining top-down and bottom-up approaches to clustering

| Dataset title | Network size (#nodes) | Biological scale | Data type | Cluster validation | Output | Conclusion |
| --- | --- | --- | --- | --- | --- | --- |
| LFR benchmarks | 1000–5000 | NA | unweighted symmetric networks | known/synthetic clusters | benchmark clusters - comparable to other methods | Top recorded performance on LFR benchmarks to date |
| Various real networks | 34–320000 | NA | unweighted symmetric networks | modularity measures | cluster separation statistics - comparable to other methods | Predicted communities are well-separated |
| Human Brain Atlas (HBA); Cancer Cell Line Encyclopedia (CCLE) | 8000–18000 | gene | gene expression | Gene Ontology (GO) | co-regulated gene sets | Possible to robustly detect overlapping gene clusters |
| Gavin et al.; Collins et al. | 700–1100 | protein | AP-MS protein interactions | small-scale experiments | protein complexes and multi-community proteins | Most accurate recovery of true protein complexes to date |
| Immunological Genome Project (Immgen) | 212 | cell-type | cell type-specific gene expression | cell-surface markers | families of cell-types, at multiple resolutions | Canonical cell type classification is mirrored in cluster results |
| Spike-sorting | 9900 | cell activity | extracellular neuron recordings | known/synthetic clusters | spikes associated with specific neurons | SpeakEasy accurately associates spike waveforms with specific neurons |
| Parkinson disease rs-fMRI | 264 | tissue | brain resting state fMRI | permutation testing | groups of synchronized brain regions | SpeakEasy identifies disease-related changes to co-active brain regions |

  1. We test community detection across a range of biological datasets to robustly characterize the ability to define practically useful biological communities.