Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional data

Peng, Dehua; Gui, Zhipeng; Wei, Wenzhang; Li, Fa; Gui, Jie; Wu, Huayi; Gong, Jianya

doi:10.1038/s42256-025-01112-9

Article
Published: 10 September 2025

Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional data

Dehua Peng^1,2,3,
Zhipeng Gui ORCID: orcid.org/0000-0001-9467-9680^1,2,3,
Wenzhang Wei¹,
Fa Li ORCID: orcid.org/0000-0002-0625-5587⁴,
Jie Gui^5,6,
Huayi Wu^2,3 &
…
Jianya Gong^2,3

Nature Machine Intelligence volume 7, pages 1669–1684 (2025)Cite this article

4280 Accesses
3 Citations
23 Altmetric
Metrics details

Subjects

A preprint version of the article is available at arXiv.

Abstract

As a pivotal branch of machine learning, manifold learning uncovers the intrinsic low-dimensional structure within complex non-linear manifolds in high-dimensional space for visualization, classification, clustering and gaining key insights. Although existing techniques have achieved remarkable successes, they suffer from extensive distortions of cluster structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. Here we propose a sampling-based scalable manifold learning technique that enables uniform and discriminative embedding (SUDE) for large-scale and high-dimensional data. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire data and then incorporates the non-landmarks into the learned space by constrained locally linear embedding. We empirically validated the effectiveness of SUDE on synthetic datasets and real-world benchmarks and applied it to analyse single-cell data and detect anomalies in electrocardiogram signals. SUDE exhibits a distinct advantage in scalability with respect to data size and embedding dimension and shows promising performance in cluster separation, integrity and global structure preservation. The experiments also demonstrate notable robustness in embedding quality as the sampling rate decreases.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on SpringerLink
Instant access to the full article PDF.

USD 39.95

Prices may be subject to local taxes which are calculated during checkout

**Fig. 2: Performance of the PPS algorithm on synthetic datasets.**

**Fig. 3: Two-dimensional embeddings from five real-world datasets using six manifold learning methods.**

**Fig. 4: Quantitative comparison of the embedding quality and time efficiency on 13 real-world datasets.**

**Fig. 5: Performance comparison on scRNA-seq and CyTOF data.**

**Fig. 6: Recognition of abnormal heartbeats in ECG signals.**

Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalization

Article 08 January 2024

Cost function for low-dimensional manifold topology assessment

Article Open access 25 August 2022

Tackling the subsampling problem to infer collective properties from limited data

Article 15 November 2022

Data availability

The real-world datasets used in this study are available as follows: Wine via University of California Irvine at http://archive.ics.uci.edu/dataset/109/wine, Dermatology via University of California Irvine at http://archive.ics.uci.edu/dataset/33/dermatology, Breast-Cancer via National Taiwan Univeristy at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/breast-cancer, Mfeat via University of California Irvine at https://archive.ics.uci.edu/dataset/72/multiple+features, Rice via University of California Irvine at https://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik, Spambase via University of California Irvine at https://archive.ics.uci.edu/dataset/94/spambase, Dry-Bean via University of California Irvine at https://archive.ics.uci.edu/dataset/602/dry+bean+dataset, Shuttle via University of California Irvine at https://archive.ics.uci.edu/dataset/148/statlog+shuttle, CIFAR10 via the University of Toronto at https://www.cs.toronto.edu/~kriz/cifar.html, MNIST via Yann LeCun at https://yann.lecun.com/exdb/mnist/, FMNIST via Github at https://github.com/zalandoresearch/fashion-mnist, AG’s News via the University of Pisa at http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html, Yahoo via GitHub at https://github.com/zjulearning/AtSNE, WT via the GEO database with accession no. GSE125708, HRPEAC via the Broad Institute at https://singlecell.broadinstitute.org/single_cell/study/SCP3025/scrna-seq-and-scatac-seq-atlas-of-human-retinal-pigment-epithelium-and-choroid?#study-download), SARS-CoV-2 via the Broad Institute at https://singlecell.broadinstitute.org/single_cell/study/SCP2593/impact-of-variants-and-vaccination-on-nasal-immunity-across-three-waves-of-sars-cov-2#study-download, Levine via FlowRepository with accession no. FR-FCM-ZZPH, QT database via Mathworks at https://www.mathworks.com/supportfiles/SPT/data/QTDatabaseECGData.zip and MIT-BIH long-term database via PhysioNet at https://archive.physionet.org/physiobank/database/ltdb/.

Code availability

The source code for SUDE in MATLAB and Python versions is available via GitHub at https://github.com/ZPGuiGroupWhu/sude (ref. ⁷⁴) and via Code Ocean at https://doi.org/10.24433/CO.9866909.v3 (ref. ⁷⁵). The TriMap algorithm is available via OMIQ at http://www.omiq.ai. The Seurat software is available via the Satija Lab at https://satijalab.org/seurat. The TopoAE model is available via GitHub at https://github.com/BorgwardtLab/topological-autoencoders. The P-UMAP model is available via Read the Docs at https://umap-learn.readthedocs.io/en/latest/parametric_umap.html. The Slingshot toolkit is available via GitHub at https://github.com/kstreet13/slingshot.

References

Busch, E. L. et al. Multi-view manifold learning of human brain-state trajectories. Nat. Comput. Sci. 3, 240–253 (2023).
Article Google Scholar
Floryan, D. & Graham, M. D. Data-driven discovery of intrinsic dynamics. Nat. Mach. Intell. 4, 1113–1120 (2022).
Article Google Scholar
Islam, M. T. et al. Revealing hidden patterns in deep neural network feature space continuum via manifold learning. Nat. Commun. 14, 8506 (2023).
Article Google Scholar
Vogelstein, J. T. et al. Supervised dimensionality reduction for big data. Nat. Commun. 12, 2872 (2021).
Article Google Scholar
Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 25, 417–441 (1933).
Article Google Scholar
Peng, D., Gui, Z., Gui, J., & Wu, H. A robust and efficient boundary point detection method by measuring local direction dispersion. IEEE Trans. Circuits Syst. Video Technol. https://doi.org/10.1109/TCSVT.2025.3546401 (2025).
Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–27 (1964).
Article MathSciNet Google Scholar
Fisher, R. A. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–188 (1936).
Article Google Scholar
Yan, S. et al. Nonlinear discriminant analysis on embedded manifold. IEEE Trans. Circuits Syst. Video Technol. 17, 468–477 (2007).
Article Google Scholar
Zhang, L., Zhou, W. & Chang, P.-C. Generalized nonlinear discriminant analysis and its small sample size problems. Neurocomputing 74, 568–574 (2011).
Article Google Scholar
Tenenbaum, J. B., Silva, V. D. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
Article Google Scholar
Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000).
Article Google Scholar
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27 (1967).
Article Google Scholar
Belkin, M., & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proc. Advanced Neural Information Processing Systems (eds Dietterich, T. G. et al.) 585–591 (2001).
Donoho, D. L. & Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl Acad. Sci. USA 100, 5591–5596 (2003).
Article MathSciNet Google Scholar
Coifman, R. R. et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl Acad. Sci. USA 102, 7426–7431 (2005).
Article Google Scholar
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Google Scholar
McInnes, L., Healy, J., Saul, N. & Groçberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
Article Google Scholar
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
Article Google Scholar
Amid, E. & Warmuth, M. K. TriMap: large-scale dimensionality reduction using triplets. Preprint at https://arxiv.org/abs/1910.00204 (2022).
Fu, C., Zhang, Y., Cai, D. & Ren, X. AtSNE: efficient and robust visualization on GPU through hierarchical optimization. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 176–186 (Association for Computing Machinery, 2019).
Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
Article Google Scholar
Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
Article Google Scholar
Shen, C. & Wu, H.-T. Scalability and robustness of spectral embedding: landmark diffusion is all you need. Inf. Inference 11, 1527–1595 (2022).
MathSciNet Google Scholar
Long, A. W. & Ferguson, A. L. Landmark diffusion maps (L-dMaps): accelerated manifold learning out-of-sample extension. Appl. Comput. Harmon. Anal. 47, 190–211 (2019).
Article MathSciNet Google Scholar
Persad, S. et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat. Biotechnol. 41, 1746–1757 (2023).
Article Google Scholar
Stanoi, I., Riedewald, M., Agrawal, D. & El Abbadi, A. Discovery of influence sets in frequently updated databases. In Proc. 27th International Conference on Very Large Databases (VLDB) (eds. Apers P. M. G. et al.) 99–108 (Morgan Kaufmann Publishers Inc., 2001).
Tao, Y., Papadias, D. & Lian, X. Reverse kNN search in arbitrary dimensionality. In Proc. 30th International Conference on Very Large Databases (VLDB) (eds Nascimento, M. A. et al.) 744–755 (Morgan Kaufmann Publishers Inc., 2004).
Radovanovic, M., Nanopoulos, A. & Ivanovic, M. Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010).
MathSciNet Google Scholar
Loshchilov, I. & Hutter, F. SGDR: stochastic gradient descent with warm restarts. In Proc. 5th International Conference on Learning Representations (ICLR) 1769–1784 (Curran Associates, 2017).
Peng, D., Gui, Z. & Wu, H. MeanCut: a greedy-optimized graph clustering via path-based similarity and degree descent criterion. Preprint at https://arxiv.org/abs/2312.04067 (2023).
Karypis, G., Han, E.-H. & Kumar, V. Chameleon: hierarchical clustering using dynamic modeling. Computer 32, 68–75 (1999).
Article Google Scholar
France, S. & Carroll, D. in Machine Learning and Data Mining in Pattern Recognition (ed. Perner, P.) vol. 4,571, 499–517 (2007).
Moor, M., Horn, M., Rieck, B. & Borgwardt, K. Topological autoencoders. In Proc. 37th International Conference on Machine Learning (ICML) (eds Daumé, H. & Singh, A.) 7045–7054 (PMLR, 2020).
Sainburg, T., McInnes, L. & Gentner, T. Q. Parametric UMAP embeddings for representation and semisupervised learning. Neural Comput. 33, 2881–2907 (2021).
MathSciNet Google Scholar
Dua, D. & Graff, C. UCI machine learning repository. University of California Irvine https://archive.ics.uci.edu (2019).
Krizhevsky, A. & Hinton, G. Learning Multiple Layers of Features from Tiny Images Technical Report (Univ. Toronto, 2009).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Article Google Scholar
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint at https://arxiv.org/abs/1708.07747 (2017).
Corso, G. M. D., Gulli, A. & Romani, F. Ranking a stream of news. In Proc. 14th International World Wide Web Conference (WWW) 97–106 (Association for Computing Machinery, 2005).
Poličar, P. G., Stražar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. J. Stat. Softw. 109, 1–30 (2024).
Article Google Scholar
Huang, H., Wang, Y., Rudin, C. & Browne, E. P. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun. Biol. 5, 719 (2022).
Article Google Scholar
Amir, E. A. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
Article Google Scholar
Khan, S. A. et al. Reusability report: learning the transcriptional grammar in single-cell RNA-sequencing data using transformers. Nat. Mach. Intell. 5, 1437–1446 (2023).
Article Google Scholar
Heng, J. S. et al. Hypoxia tolerance in the Norrin-deficient retina and the chronically hypoxic brain studied at single-cell resolution. Proc. Natl Acad. Sci. USA 116, 9103–9114 (2019).
Article Google Scholar
Satija, R. et al. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Article Google Scholar
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
Article Google Scholar
Peng, D. et al. Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity. Nat. Commun. 13, 5455 (2022).
Article Google Scholar
Wang, S. K. et al. Single-cell multiome and enhancer connectome of human retinal pigment epithelium and choroid nominate pathogenic variants in age-related macular degeneration. Preprint at bioRxiv https://doi.org/10.1101/2025.03.21.644670 (2025).
Ziegler, C. G. et al. Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19. Cell 184, 4713–4733 (2021).
Article Google Scholar
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Article Google Scholar
Weber, L. M. & Robinson, M. D. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytom. A 89, 1084–1096 (2016).
Article Google Scholar
Ferrari, G., Thrasher, A. J. & Aiuti, A. Gene therapy using haematopoietic stem and progenitor cells. Nat. Rev. Genet. 22, 216–234 (2021).
Article Google Scholar
Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
Article Google Scholar
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Article Google Scholar
Young, W. J. et al. Genetic analyses of the electrocardiographic QT interval and its components identify additional loci and pathways. Nat. Commun. 13, 5144 (2022).
Article Google Scholar
Niroshana, S. M. I., Kuroda, S., Tanaka, K. & Chen, W. Beat-wise segmentation of electrocardiogram using adaptive windowing and deep neural network. Sci. Rep. 13, 11039 (2023).
Article Google Scholar
Huo, R. et al. ECG segmentation algorithm based on bidirectional hidden semi-Markov model. Comput. Biol. Med. 150, 106081 (2022).
Article Google Scholar
Oberlin, T., Meignen, S. & Perrier, V. The Fourier-based synchrosqueezing transform. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 315–319 (IEEE, 2014).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Article Google Scholar
Laguna, P., Mark, R. G., Goldberg, A. & Moody, G. B. A database for evaluation of algorithms for measurement of QT and other waveform intervals in the ECG. Comput. Cardiol. 24, 673–676 (1997).
Google Scholar
Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000).
Article Google Scholar
Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
Article Google Scholar
Peng, D., Gui, Z. & Wu, H. Interpreting the curse of dimensionality from distance concentration and manifold effect. Preprint at https://arxiv.org/abs/2401.00422 (2023).
Healy, J. & McInnes, L. Uniform manifold approximation and projection. Nat. Rev. Methods Primers 4, 82 (2024).
Article Google Scholar
Malkov, Y. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2018).
Article Google Scholar
Dong, W., Moses, C. & Li, K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proc. 20th International Conference on World Wide Web (WWW) 577–586 (Association for Computing Machinery, 2011).
Saul, L. K. & Roweis, S. T. Think globally, fit locally: unsupervised learning of low dimensional manifolds. J. Mach. Learn. Res. 4, 119–155 (2003).
MathSciNet Google Scholar
Wang, J. Real local-linearity preserving embedding. Neurocomputing 136, 7–13 (2014).
Article Google Scholar
Wang, Z. et al. Clustering by local gravitation. IEEE Trans. Cybern. 48, 1383–1396 (2018).
Article Google Scholar
von Luxburg, U. A tutorial on spectral clustering. Stat. Comp. 17, 395–416 (2007).
Article MathSciNet Google Scholar
Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).
Article Google Scholar
Kuhn, H. W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2, 83–97 (1955).
Article MathSciNet Google Scholar
Peng, D. Source code of SUDE. Zenodo https://doi.org/10.5281/zenodo.16792257 (2025).
Peng, D. Open code for in-review “Sampling-enabled scalable manifold learning unveils discriminative cluster structure of high-dimensional data”. Code Ocean https://doi.org/10.24433/CO.9866909.v3 (2025).

Download references

Acknowledgements

We thank Y. Tang and Z. Ruan for technical assistance in the analysis of single-cell data. We are also grateful to Y. Liu and R. Wang for language and technical discussion. This paper is supported by the National Natural Science Foundation of China (grant nos. 42090011, 42501573, 42571496, 41971349 and 41930107), Postdoctoral Fellowship Program and China Postdoctoral Science Foundation (grant no. BX20250084), China Postdoctoral Science Foundation (grant no. 2025M770345) and Fundamental Research Funds for the Central Universities (grant nos. 2042024kf0005 and 2042025kf0084).

Author information

Authors and Affiliations

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
Dehua Peng, Zhipeng Gui & Wenzhang Wei
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China
Dehua Peng, Zhipeng Gui, Huayi Wu & Jianya Gong
Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan, China
Dehua Peng, Zhipeng Gui, Huayi Wu & Jianya Gong
Department of Earth System Science, Stanford University, Stanford, CA, USA
Fa Li
School of Cyber Science and Engineering, Southeast University, Nanjing, China
Jie Gui
Purple Mountain Laboratories, Nanjing, China
Jie Gui

Authors

Dehua Peng
View author publications
Search author on:PubMed Google Scholar
Zhipeng Gui
View author publications
Search author on:PubMed Google Scholar
Wenzhang Wei
View author publications
Search author on:PubMed Google Scholar
Fa Li
View author publications
Search author on:PubMed Google Scholar
Jie Gui
View author publications
Search author on:PubMed Google Scholar
Huayi Wu
View author publications
Search author on:PubMed Google Scholar
Jianya Gong
View author publications
Search author on:PubMed Google Scholar

Contributions

Z.G., H.W. and J. Gong envisioned the study. D.P. and Z.G. designed the algorithm and collected datasets. D.P., W.W. and Z.G. conducted experiments and performed the analysis. D.P., Z.G., W.W. and F.L. wrote the manuscript. H.W. and J. Gui provided advice in analysis.

Corresponding author

Correspondence to Zhipeng Gui.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Dmitry Kobak, Samantha Riesenfeld, Ryan Robinett and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–12, Figs. 1–31 and Tables 1–4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Peng, D., Gui, Z., Wei, W. et al. Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional data. Nat Mach Intell 7, 1669–1684 (2025). https://doi.org/10.1038/s42256-025-01112-9

Download citation

Received: 01 November 2024
Accepted: 11 August 2025
Published: 10 September 2025
Version of record: 10 September 2025
Issue date: October 2025
DOI: https://doi.org/10.1038/s42256-025-01112-9