Scientific Reports
Stochastic LASSO for extremely high-dimensional genomic data
  • Article
  • Open access
  • Published: 14 January 2026

  • Beomsu Baek1,
  • Jongkwon Jo2,3,
  • Mingon Kang1 &
  • …
  • Youngsoon Kim2 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational models
  • Statistical methods

Abstract

Accurate identification of significant features in high-dimensional data is indispensable in high-throughput genomic analysis and association studies. The Least Absolute Shrinkage and Selection Operator (LASSO) and its derivatives have been widely adopted as feature selection schemes for discovering potential biomarkers in various biological systems. Recently, bootstrap-based LASSO models, such as Random LASSO and Hi-LASSO, have proven effective for extremely high-dimensional but low sample size (EHDLSS) genomic data. However, bootstrap-based LASSO models still have several drawbacks, including multicollinearity within bootstrap samples, predictors that are missed in the draw, and randomness in predictor sampling. To address these limitations, we propose a new bootstrap-based LASSO, named Stochastic LASSO, that effectively reduces multicollinearity in bootstrap samples and mitigates randomness in predictor sampling, and thereby outperforms benchmark models in feature selection and coefficient estimation. Furthermore, Stochastic LASSO provides a two-stage t-test strategy for selecting statistically significant features. We assessed the performance of Stochastic LASSO against existing benchmark models in extensive simulation experiments, in which it consistently showed significant improvements over state-of-the-art LASSO models in feature selection, coefficient estimation, and robustness. We also applied Stochastic LASSO to gene expression data from publicly available TCGA cancer datasets and identified statistically significant genes associated with survival month prediction. The source code is publicly available at: https://github.com/datax-lab/StochasticLASSO.
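
The abstract names the ingredients shared by bootstrap-based LASSO models (bootstrap samples, random predictor draws, and a t-test on the resulting coefficients) without giving details. As a rough illustration only, the following Python sketch shows a generic bootstrap LASSO of that family. It is not the authors' Stochastic LASSO: it does nothing to reduce multicollinearity or sampling randomness, and it replaces the two-stage t-test with a single one-sample t-statistic per predictor. The function names, the penalty `lam`, the number of bootstraps `n_boot`, the draw size `q`, and the synthetic data are all assumptions made for this example.

```python
import math
import random
import statistics


def soft_threshold(z, gamma):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1.
    No intercept is fitted; X and y are assumed to be centred."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X*beta, with beta = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:  # constant column: leave coefficient at 0
                continue
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def centre(Xb, yb):
    """Subtract column means from Xb and the mean from yb."""
    n, q = len(Xb), len(Xb[0])
    means = [sum(row[j] for row in Xb) / n for j in range(q)]
    y_mean = sum(yb) / n
    return [[row[j] - means[j] for j in range(q)] for row in Xb], [v - y_mean for v in yb]


def bootstrap_lasso(X, y, lam=0.1, n_boot=30, q=5, seed=0):
    """Each round resamples rows with replacement, draws q predictors uniformly
    at random, and fits LASSO on that subset. Predictors that are not drawn
    in a round record a coefficient of 0 for that round (an assumption)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    coefs = [[0.0] * p for _ in range(n_boot)]
    for b in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]
        cols = rng.sample(range(p), q)
        Xb, yb = centre([[X[i][j] for j in cols] for i in rows], [y[i] for i in rows])
        for j, bj in zip(cols, lasso_cd(Xb, yb, lam)):
            coefs[b][j] = bj
    return coefs


def t_statistics(coefs):
    """One-sample t-statistic (against zero) of each predictor's coefficients."""
    n_boot, p = len(coefs), len(coefs[0])
    stats = []
    for j in range(p):
        col = [coefs[b][j] for b in range(n_boot)]
        mean, sd = statistics.mean(col), statistics.stdev(col)
        if sd == 0.0:
            stats.append(0.0 if mean == 0.0 else math.copysign(math.inf, mean))
        else:
            stats.append(mean / (sd / math.sqrt(n_boot)))
    return stats


# Synthetic example: only predictors 0 and 1 influence the response.
rng = random.Random(42)
n, p = 40, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]

coefs = bootstrap_lasso(X, y, lam=0.1, n_boot=30, q=5, seed=0)
t_stats = t_statistics(coefs)
# 2.045 is roughly the two-sided 5% critical value for 29 degrees of freedom.
selected = [j for j, t in enumerate(t_stats) if abs(t) > 2.045]
print(selected)
```

Recording a zero for every predictor that is not drawn in a round is just one simple convention. The repository linked in the abstract should be consulted for how Stochastic LASSO actually draws predictors and tests their coefficients.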

Data availability

The datasets analyzed during the current study are all publicly available online from The Cancer Genome Atlas (TCGA). The open-source code is available at: https://github.com/datax-lab/StochasticLASSO.

Funding

This work was supported by the National Science Foundation Major Research Instrumentation (NSF MRI) program (Grant No. 2117941), the National Research Foundation of Korea (NRF) (NRF-2021R1I1A3048029), and the MSIT (Ministry of Science and ICT) under the ICAN (ICT Challenge and Advanced Network of HRD) support program (IITP-2024-RS-2022-00156409) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) in South Korea.

Author information

Authors and Affiliations

  1. Department of Computer Science, University of Nevada, Las Vegas, 89154, NV, USA

    Beomsu Baek & Mingon Kang

  2. Department of Information and Statistics, Gyeongsang National University, Jinju, Republic of Korea

    Jongkwon Jo & Youngsoon Kim

  3. AI Research Team, BigAI Inc., Changwon, Republic of Korea

    Jongkwon Jo

Contributions

B.B.: Methodology, Software, Validation, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization; J.J.: Methodology, Investigation; M.K.: Conceptualization, Methodology, Validation, Resources, Writing - Original Draft, Writing - Review & Editing, Supervision, Project administration, Funding acquisition; Y.K.: Conceptualization, Methodology, Validation, Resources, Writing - Original Draft, Writing - Review & Editing, Supervision, Funding acquisition.

Corresponding authors

Correspondence to Mingon Kang or Youngsoon Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Baek, B., Jo, J., Kang, M. et al. Stochastic LASSO for extremely high-dimensional genomic data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-35273-3

  • Received: 23 June 2025

  • Accepted: 05 January 2026

  • Published: 14 January 2026

  • DOI: https://doi.org/10.1038/s41598-026-35273-3

Keywords

  • Stochastic LASSO
  • LASSO
  • High-dimensional data
  • Variable selection