Introduction

Identification of a subset of significant predictors is indispensable for understanding biological mechanisms and enhancing predictive performance in high-throughput, high-dimensional genomic data1,2,3. The Least Absolute Shrinkage and Selection Operator (LASSO)4 and its derivatives have been widely used to identify subsets of relevant predictors in several biological applications, such as survival analysis5, metabolomics6, discovering protein–protein interactions7, and Cox proportional hazards modeling8. However, conventional LASSO models, including Adaptive LASSO9, Relaxed LASSO10, and Precision LASSO11, have the following limitations when applied to extremely high-dimensional and low sample size (EHDLSS) data: (1) LASSO selects at most as many predictors as the sample size, and (2) LASSO does not identify all predictors that are highly correlated with each other. These limitations primarily stem from the sparsity-inducing nature of the \(L_1\) regularization. Although Elastic-Net12 mitigates these issues by incorporating an \(L_2\) regularization, the \(L_2\) component introduces its own challenge in EHDLSS settings: it forces highly correlated predictors that have opposite coefficient signs to be estimated with the same sign, thereby reducing feature interpretability. Thus, conventional LASSOs are difficult to apply to omics datasets that include hundreds of patient samples with more than 20,000 genes or 80,000 SNPs and high multicollinearity.

Bootstrap-based LASSOs, such as Random LASSO13, Recursive Random LASSO14, and Hi-LASSO15, addressed the EHDLSS issues by drawing multiple bootstrap samples of lower-dimensionality and then aggregating the results for the final feature selection. Most bootstrap-based LASSOs consist of two procedures: (1) calculating importance scores of predictors for the oracle property16 and (2) estimating coefficients of predictors by prioritizing predictors of high importance scores. The bootstrap-based LASSOs have the advantages of accurate feature selection, precise coefficient estimation, and enhanced predictive performance with EHDLSS data.

However, bootstrap-based LASSOs also have several drawbacks: multicollinearity within bootstrap samples, predictors missed in the draw, and randomness in predictor sampling. First, multicollinearity within bootstrap samples often causes underestimated importance scores of non-zero predictors. Although bootstrap-based LASSOs reduce multicollinearity in the entire set of predictors through bootstrap sampling, the bootstrap samples still exhibit local multicollinearity. For instance, non-zero predictors are often estimated as zero when highly correlated non-zero predictors are drawn into the same bootstrap sample. Moreover, if highly correlated predictors have opposite coefficient signs, only the predictors with the dominant sign are identified; the predictors with the non-dominant sign are estimated as zero or incorrectly assigned the dominant sign. Second, bootstrap-based LASSOs risk missing predictors entirely during the bootstrap process. Let the model construct B bootstrap samples by drawing q predictors from a total of p predictors (\(q<p\)). The number of times each predictor is drawn follows a binomial distribution. When 500 bootstrap samples are generated by drawing 200 predictors from 20,000 predictors (i.e., \(B=500\), \(q=200\), and \(p=20{,}000\)), the expected number of predictors that are never selected is 131.4, and the expected number of predictors selected fewer than three times is 2,467.7. Thus, many predictors' coefficients would be missing or underestimated, since the majority of predictors are seldom drawn. Lastly, missing predictors or imbalanced predictor inclusion in the bootstrap, caused by randomness in predictor sampling, makes coefficient estimation ineffective. Bootstrap-based LASSOs construct bootstrap samples by drawing q predictors with probabilities proportional to the importance scores. However, the randomness of the draw still leaves a chance that several non-zero predictors are missed or seldom considered by the regression model within a bootstrap.
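The expected-count argument above can be reproduced with a short calculation; the sketch below is a minimal illustration that evaluates the binomial probabilities directly (small differences from the quoted values may come from the exact sampling model assumed):

```python
from math import comb

B, q, p = 500, 200, 20_000      # bootstrap samples, predictors per sample, total predictors
pr = q / p                      # chance a given predictor enters one bootstrap sample

def binom_pmf(k, n, prob):
    """P(X = k) for X ~ Binomial(n, prob)."""
    return comb(n, k) * prob**k * (1 - prob)**(n - k)

# Expected number of predictors never drawn, and drawn fewer than three times,
# across the B bootstrap samples (close to the 131.4 and 2,467.7 quoted above).
never = p * binom_pmf(0, B, pr)
under3 = p * sum(binom_pmf(k, B, pr) for k in range(3))
```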

In this paper, we propose an enhanced bootstrap-based LASSO model, named Stochastic LASSO, which remarkably improves the current LASSO solutions. The main contributions of Stochastic LASSO are as follows:

  • Stochastic LASSO significantly enhances feature selection performance and accurately estimates true coefficients, compared to state-of-the-art LASSO models,

  • Stochastic LASSO proposes a parametric statistical test for selecting significant features in high-dimensional data, and

  • Stochastic LASSO produces robust feature selection results.

The rest of the paper is organized as follows: Section Methods describes the proposed Stochastic LASSO in detail, and Section Experimental results assesses Stochastic LASSO by comparing it with existing state-of-the-art LASSO models.

Methods

Overview

Stochastic LASSO consists of the following procedures: (1) constructing lower-dimensional, linearly independent bootstrap samples with a Correlation Based Bootstrapping (CBB) strategy, (2) estimating coefficients on each bootstrap sample, (3) calculating local scores for feature selection, (4) estimating the final coefficients (i.e., global scores) by forward selection, and (5) determining statistical significance by a two-stage t-test in a one-time bootstrapping procedure. Stochastic LASSO improves the LASSO solution by proposing (1) an enhanced bootstrapping algorithm for reducing multicollinearity (§2.2), (2) a forward selection strategy for robust feature selection and coefficient estimation (§2.3), and (3) a statistical strategy for identifying statistically significant features in high-dimensional data (§2.4).

Reducing multicollinearity in bootstrap samples

Stochastic LASSO effectively reduces multicollinearity in bootstrap samples with the proposed Correlation Based Bootstrapping (CBB) algorithm. CBB penalizes predictors highly correlated with others during bootstrapping so that the predictors within a bootstrap sample become close to linearly independent. Whenever Stochastic LASSO draws a predictor during bootstrap sampling, CBB sets the selection probability of each remaining predictor based on its correlation with the predictors already in the bootstrap sample. Let S and Q be sets of predictor indices, where S contains the indices not yet drawn and Q contains the indices already included in the bootstrap sample. Initially, \(S = \{1,...,p\}\) and \(Q=\emptyset\). CBB then computes the selection probabilities of S based on correlation with Q. We define the probability \(Pr(p_i)\) (where \(i\in S\)) that the i-th predictor is selected as follows:

$$\begin{aligned} Pr(p_i)=\frac{1}{\sum _{q\in Q}\left( r_{q,i}\right) ^2}/\sum _{s\in S}\frac{1}{\sum _{q\in Q}\left( r_{q,s}\right) ^2}, \end{aligned}$$
(1)

where \(r_{i,j}\) denotes the Pearson correlation coefficient between the i-th and j-th predictors. Once the i-th predictor is randomly selected with probability \(Pr(p_i)\), CBB updates S and Q as \(S =S \backslash \{i\}\) and \(Q=Q\cup \{i\}\), repeating until the bootstrap sample contains q predictors (i.e., \(\left| Q\right| =q\)), where \(S \backslash \{i\}\) is the set S excluding the element i. When Stochastic LASSO draws the first predictor of a bootstrap sample (i.e., \(Q=\emptyset\)), CBB sets \(Pr(p_i)\) to \(1/\left| S\right|\). Therefore, CBB prioritizes predictors that are likely independent of the other predictors during bootstrapping.
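As a concrete sketch, one round of CBB sampling might look like the following; the function name and interface are hypothetical, and `np.corrcoef` supplies the Pearson correlations of Eq. (1):

```python
import numpy as np

def cbb_sample(X, q, rng):
    """Draw one bootstrap sample of q predictor indices via Correlation Based
    Bootstrapping: predictors correlated with those already drawn are penalized."""
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)           # p x p Pearson correlation matrix
    S = list(range(p))                         # indices not yet drawn
    Q = []                                     # indices in the bootstrap sample
    while len(Q) < q:
        if not Q:                              # first draw: uniform over S
            probs = np.full(len(S), 1.0 / len(S))
        else:                                  # Eq. (1): inverse squared correlation with Q
            inv = 1.0 / np.array([(R[Q, s] ** 2).sum() for s in S])
            probs = inv / inv.sum()            # normalize over the remaining predictors
        i = rng.choice(S, p=probs)
        S.remove(i)
        Q.append(int(i))
    return Q
```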

In addition, Stochastic LASSO guarantees that all predictors are drawn an equal number of times by sampling predictors without replacement. The CBB algorithm constructs \(\lceil p/q \rceil\) bootstrap samples until \(S=\emptyset\), ensuring that each predictor is drawn once. To guarantee that each predictor is drawn exactly the same number of times, Stochastic LASSO repeats the CBB algorithm r times. Consequently, Stochastic LASSO estimates r coefficients for each of the p predictors by applying penalized linear regression to each bootstrap sample. The CBB algorithm was implemented using NumPy's corrcoef() function in Python, which leverages an accelerated matrix multiplication routine based on the Basic Linear Algebra Subprograms (BLAS). We assessed the scalability of CBB with high-dimensional data for computational feasibility. The detailed complexity analysis of CBB is provided in Supplementary Note S1 and Table S1.

Improving coefficient estimation

Stochastic LASSO determines the optimal subset of features and estimates their coefficients using forward selection based on local scores. While traditional forward selection methods require evaluating all possible feature subsets, Stochastic LASSO defines a fixed subset based on the rank of local scores. Let \(B=\{\hat{b}_{ij}\,|\,i=1,\ldots ,r,\,j=1,\ldots ,p\}\) be an \(r\times p\) matrix containing the r coefficient estimates of the p variables obtained from the bootstrapping procedure. We define the local score of the j-th predictor (\(L_j\)) as follows:

$$\begin{aligned} L_j=\sum _{i=1}^r\left| \hat{b}_{ij}\right| /r. \end{aligned}$$
(2)

Let \(S_0=\emptyset\) be an initial feature subset, and the subsequent subsets (\(S_k\)) are defined by prioritizing features with high local scores:

$$\begin{aligned} S_k=S_0\cup \{x_1,\ldots ,x_k\}, \end{aligned}$$
(3)

where \(x_j\) denotes the predictor with the j-th highest local score. Then, Stochastic LASSO estimates the coefficients of the feature subsets (\(S_k\)) by applying penalized linear regression (e.g., Elastic-Net), and determines the optimal subset \(S^*\) as follows:

$$\begin{aligned} S^*=argmin_{S_k} SSE_{val}(S_k), \end{aligned}$$
(4)

where \(SSE_{val}(S_k)\) denotes the sum of squared errors on the validation data calculated with the estimated coefficients of \(S_k\). Thus, Stochastic LASSO systematically constructs a subset consisting of non-zero predictors and precisely estimates their partial correlations, without the additional bootstrapping procedures that most bootstrap-based LASSOs require.
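The forward-selection loop of Eqs. (2)–(4) can be sketched as follows; plain least squares stands in here for the penalized fit (the paper uses, e.g., Elastic-Net), and the function name is illustrative:

```python
import numpy as np

def forward_select(Bhat, X_tr, y_tr, X_val, y_val):
    """Greedy subset search over the local-score ranking (Eqs. 2-4).
    Bhat: r x p matrix of bootstrap coefficient estimates."""
    local = np.abs(Bhat).mean(axis=0)            # Eq. (2): L_j = sum_i |b_ij| / r
    order = np.argsort(local)[::-1]              # predictors by descending local score
    best_sse, best_subset = np.inf, order[:0]
    for k in range(1, len(order) + 1):
        subset = order[:k]                       # Eq. (3): S_k = top-k local scores
        coef, *_ = np.linalg.lstsq(X_tr[:, subset], y_tr, rcond=None)
        sse = ((y_val - X_val[:, subset] @ coef) ** 2).sum()
        if sse < best_sse:                       # Eq. (4): minimize validation SSE
            best_sse, best_subset = sse, subset
    return best_subset, best_sse
```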

Tests of significance for statistically significant features

Stochastic LASSO proposes a statistical test, the two-stage t-test (TSTT), for evaluating the statistical significance of features. TSTT sequentially conducts a one-sample t-test and a two-sample t-test. The one-sample t-test identifies potential significant features whose coefficient means are non-zero, and the two-sample t-test selects significant features with relatively large coefficient magnitudes among the potential significant features. Let \(\beta _j\) be the regression coefficient of the j-th feature. The coefficient estimate of the j-th feature, \(\hat{\beta }_j\), is defined as \(\sum _{i=1}^r\hat{\beta }_{ij}/r\), where \(\hat{\beta }_{ij}\) denotes the i-th coefficient estimate of the j-th feature. The one-sample t-test identifies the potential significant features under the following hypotheses:

$$\begin{aligned} H_0:\beta _j=0,\; H_1:\beta _j\not =0. \end{aligned}$$
(5)

Let K be the set of indices of the m potential significant features selected by the preceding one-sample t-test. The coefficient magnitudes of the j-th feature (\(j\in K\)) and of all potential significant features (i.e., K), denoted as \(\left| \beta _j\right|\) and \(\left| \beta _K\right|\), are estimated as \(\sum _{i=1}^r\left| \hat{\beta }_{ij}\right| /r\) and \(\sum _{j\in K}\sum _{i=1}^r\frac{\left| \hat{\beta }_{ij}\right| }{m\times r}\), respectively. Then, significant features are determined using the two-sample t-test under the following hypotheses:

$$\begin{aligned} H_0:\left| \beta _j\right| \le \left| \beta _{K}\right| ,\; H_1:\left| \beta _j\right| >\left| \beta _{K}\right| . \end{aligned}$$
(6)

The details of TSTT including test statistics and rejection regions are elucidated in Supplementary Note S2.
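A minimal sketch of TSTT, using SciPy's t-tests as stand-ins for the exact test statistics and rejection regions given in Supplementary Note S2 (the function name is illustrative):

```python
import numpy as np
from scipy import stats

def two_stage_t_test(Bhat, alpha=0.05):
    """TSTT sketch: a one-sample t-test flags features whose mean coefficient
    differs from zero (Eq. 5); a two-sample t-test then keeps those whose
    coefficient magnitude exceeds the pooled candidate magnitude (Eq. 6)."""
    # Stage 1: H0 beta_j = 0, over the r bootstrap estimates of each feature
    _, pvals = stats.ttest_1samp(Bhat, popmean=0.0, axis=0)
    K = np.where(pvals < alpha)[0]             # potential significant features
    if K.size == 0:
        return K
    mags = np.abs(Bhat[:, K])                  # r x m coefficient magnitudes
    pooled = mags.ravel()                      # magnitudes of all candidates (|beta_K|)
    keep = []
    for j, col in zip(K, mags.T):
        # Stage 2: one-sided H1 |beta_j| > |beta_K| (Welch two-sample t-test)
        _, pv = stats.ttest_ind(col, pooled, equal_var=False, alternative='greater')
        if pv < alpha:
            keep.append(int(j))
    return np.array(keep)
```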

The proposed TSTT allows Stochastic LASSO to identify more features than the sample size, by individually assessing the statistical significance of each feature based on the bootstrap-derived coefficient estimates. Therefore, Stochastic LASSO not only reliably evaluates the statistical significance of features but also fundamentally addresses the inherent limitations of conventional LASSO models by applying TSTT in extremely high-dimensional settings. The detailed procedure of Stochastic LASSO is described in Algorithm 1.

Algorithm 1: Stochastic LASSO.

Experimental results

We assessed the performance of Stochastic LASSO under various experimental settings, compared to state-of-the-art LASSO models, on the following criteria: (1) feature selection, (2) coefficient estimation, (3) tests of significance, and (4) model robustness. First, we evaluated how accurately Stochastic LASSO identifies non-zero variables across various data dimensionalities. Second, we assessed the performance of coefficient estimation by computing the error between the estimates and the ground truth. Third, we examined whether Stochastic LASSO can identify the statistical significance of non-zero variables. Lastly, we evaluated the consistency and robustness of feature selection.

Table 1 Description of the simulation data.
Fig. 1: Feature selection performance with simulation data. (A) Comparison of F1-score; (B) comparison of AUCPR. On the x-axis, the sample size (n) and dimension (p) are labeled for each dataset.

Feature selection

For the assessment of feature selection, we conducted simulation studies where the ground truth of non-zero variables is known. The simulation study mainly considered high-dimensional but low-sample-size data with high multicollinearity, which is a common setting in genomic data. We followed the simulation settings that have been commonly used in most bootstrap-based LASSO studies13,14,15. We generated four synthetic datasets (i.e., Datasets I-IV) with the following linear regression model:

$$\begin{aligned} y = \beta _1x_1+\beta _2x_2+\ldots +\beta _px_p+\epsilon , \end{aligned}$$
(7)

where \(\epsilon \!\sim \!N(0,\sigma ^2)\), \(x_i\!\sim \!N(0,1)\), and various types of multicollinearity were introduced by predefined covariance matrices. The four datasets consist of varying numbers of variables and samples, as defined in Table 1. Dataset I includes 50 samples of 100 variables, where the first ten coefficients were set to non-zero values. The ground-truth regression coefficients were defined as:

$$\begin{aligned} \beta = \left( 3, 3, -3, 2, 2, -2, 1.5, 1.5, 1.5, -1.5\right) , \end{aligned}$$
(8)

and the multicollinearity was introduced by predefined covariance matrix:

$$\begin{aligned} \begin{bmatrix} \Sigma _{0.9}^{3} & 0 & 0 & 0 \\ 0 & \Sigma _{0.9}^{3} & 0 & 0 \\ 0 & 0 & \Sigma _{0.9}^{4} & 0 \\ 0 & 0 & 0 & I^{90} \end{bmatrix}, \end{aligned}$$
(9)

where \(\Sigma _{v}^{k}\) is a \(k\times k\) matrix with unit diagonal elements and off-diagonal elements of value v, and \(I^k\) is an identity matrix of size k. Dataset II consists of 100 samples of 1,000 variables, where the first 50 non-zero coefficients were drawn from N(0, 4). Multicollinearity of both high and low degrees was designed using the following covariance matrix:

$$\begin{aligned} \begin{bmatrix} \Sigma _{0.9}^{15} & 0 & 0 & 0 \\ 0 & \Sigma _{0.9}^{15} & J_{0.3} & 0 \\ 0 & J_{0.3}^{T} & \Sigma _{0.9}^{20} & 0 \\ 0 & 0 & 0 & I^{950} \end{bmatrix}, \end{aligned}$$
(10)

where \(J_v\) is a matrix whose elements all equal v. Dataset III includes 200 samples of 10,000 variables, where the first 50 non-zero coefficients were drawn from N(0, 4). The covariance matrix of Dataset III is as follows:

$$\begin{aligned} \begin{bmatrix} \Sigma _{0.9}^{15} & 0 & 0 & 0 \\ 0 & \Sigma _{0.9}^{15} & J_{0.3} & 0 \\ 0 & J_{0.3}^{T} & \Sigma _{0.9}^{20} & 0 \\ 0 & 0 & 0 & I^{9950} \end{bmatrix}, \end{aligned}$$
(11)

In Dataset IV, we doubled the number of samples while keeping the same model as Dataset III. To tune the hyper-parameters of the LASSO models, we additionally generated validation data for each dataset, with a size equivalent to 20% of the training samples. Note that test data were not considered, as we evaluated only feature selection performance without assessing predictive error.
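Under the settings above, Dataset I can be generated along the following lines; the noise level \(\sigma =1\) is an assumption for illustration, as the text does not state it, and the function name is hypothetical:

```python
import numpy as np

def make_dataset1(n=50, seed=0):
    """Generate Dataset I: 100 predictors with the block covariance of Eq. (9)
    and the true coefficients of Eq. (8); sigma = 1 assumed for illustration."""
    rng = np.random.default_rng(seed)
    p = 100
    beta = np.zeros(p)
    beta[:10] = [3, 3, -3, 2, 2, -2, 1.5, 1.5, 1.5, -1.5]   # Eq. (8)
    cov = np.eye(p)                                          # I^90 tail
    for start, size in [(0, 3), (3, 3), (6, 4)]:             # Sigma_0.9 blocks of Eq. (9)
        cov[start:start + size, start:start + size] = 0.9
    np.fill_diagonal(cov, 1.0)                               # unit diagonal
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + rng.normal(0.0, 1.0, size=n)              # sigma = 1 assumed
    return X, y, beta
```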

The benchmark models include LASSO4, Elastic-Net12, Adaptive LASSO9, Relaxed LASSO10, Precision LASSO11, Random LASSO13, Recursive Random LASSO14, and Hi-LASSO15. For the non-bootstrap-based LASSO models (i.e., LASSO, Elastic-Net, Adaptive LASSO, Relaxed LASSO, and Precision LASSO), hyper-parameters (e.g., the regularization parameter) were tuned by minimizing the prediction error on the validation data. For the bootstrap-based LASSO models of Random LASSO, Recursive Random LASSO, and Hi-LASSO, the number of variables per bootstrap sample (\(q_1\) and \(q_2\)) was set to the sample size (i.e., n), and the number of bootstrap samples (B) was set to \(p/q\times 30\) to ensure that each variable is included in the bootstrap samples 30 times on average. In Stochastic LASSO, we also set \(q=n\) and \(r=30\) for a fair comparison. This configuration was applied as the default setting for all bootstrap-based LASSO models, and its rationale is supported by the sensitivity analysis of the Stochastic LASSO hyperparameters presented in Supplementary Note S3.

We computed F1-scores and the Area Under the Precision-Recall Curve (AUCPR) for the evaluation. We defined non-zero variables as positive and zero variables as negative. Then, the confusion matrices were computed as follows: True Positive (TP) if a model correctly identifies non-zero variables as non-zero; False Positive (FP) if a model incorrectly identifies zero variables as non-zero; False Negative (FN) if a model incorrectly identifies non-zero variables as zero; and True Negative (TN) if a model correctly identifies zero variables as zero. The F1-score was calculated as 2(\(Precision\times Recall\))/(\(Precision+Recall\)), where Precision and Recall are defined as \(TP/(TP+FP)\) and \(TP/(TP+FN)\), respectively. The AUCPR was calculated from the Precision-Recall curve, which is generated by evaluating the model across varying thresholds. We repeated the experiments ten times on randomly generated synthetic data to assess the reproducibility of the model performance.
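The F1 computation over selected supports follows directly from these definitions; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def f1_score_selection(beta_true, beta_hat):
    """F1 over support recovery: non-zero variables are positives,
    zero variables are negatives, as defined in the text."""
    pos = np.asarray(beta_true) != 0           # ground-truth non-zero support
    sel = np.asarray(beta_hat) != 0            # selected (estimated non-zero) support
    tp = np.sum(pos & sel)
    fp = np.sum(~pos & sel)
    fn = np.sum(pos & ~sel)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```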

Fig. 2: Coefficient estimation performance with simulation data. (A) Comparison of \(RMSE_{ALL}\); (B) comparison of \(RMSE_{Nonzeros}\). On the x-axis, the sample size (n) and dimension (p) are labeled for each dataset.

Stochastic LASSO outperformed all other benchmarks across the synthetic datasets (Fig. 1A, Supplementary Table S2), showing the highest F1-scores of 0.7093±0.0197, 0.6650±0.0139, 0.5251±0.0170, and 0.7777±0.0046 with Datasets I-IV, respectively, representing 8%, 77%, 65%, and 42% improvements over the second-best models. The outperformance of Stochastic LASSO over the second-best benchmark was statistically validated by the Wilcoxon rank-sum test (p-values<0.05) on all synthetic datasets (Supplementary Table S2). Furthermore, we verified the feature selection performance using AUCPR without thresholding. Stochastic LASSO also achieved the highest AUCPR of 0.8284±0.0095, 0.6992±0.0021, 0.5772±0.0089, and 0.6985±0.0018 with Datasets I-IV, respectively, representing statistically significant (p-values<0.05) improvements of 15%, 37%, 31%, and 8% over the second-best models (Fig. 1B, Supplementary Table S2). We observed remarkable performance of the bootstrap-based LASSO models, including Random LASSO and Hi-LASSO, on the high-dimensional data (e.g., Datasets II-IV), which implies that bootstrap-based LASSOs are more suitable for extremely high-dimensional data analysis than non-bootstrap-based LASSOs. However, Recursive Random LASSO showed the lowest F1-scores, despite being a bootstrap-based model, because it introduces initial biases when handling high-dimensional data.

Fig. 3: Coefficient estimation with simulation data.

Coefficient estimation

In this experiment, we verified how precisely the models estimate the ground-truth coefficient values. The performance of coefficient estimation was evaluated by computing Root Mean Squared Errors (RMSE) between the estimated coefficients and the ground truth in the synthetic datasets used in the previous experiment (Datasets I-IV). We computed \(RMSE_{ALL}\) and \(RMSE_{Nonzeros}\), where \(RMSE_{ALL}\) was computed over all coefficients and \(RMSE_{Nonzeros}\) over only the non-zero coefficients. Stochastic LASSO showed the lowest errors in coefficient estimation among the benchmark models (Fig. 2, Supplementary Table S3), achieving the lowest \(RMSE_{ALL}\) of 0.5022±0.0091, 0.3596±0.0041, 0.1099±0.0006, and 0.0777±0.0002 with Datasets I-IV, respectively, which represented 9%, 5%, 9%, and 8% improvements over the second-best models. Stochastic LASSO also exhibited the lowest \(RMSE_{Nonzeros}\) of 1.5286±0.0294, 1.5931±0.0179, 1.5464±0.0098, and 1.0945±0.0025 with Datasets I-IV, respectively, showing 12%, 6%, 9%, and 8% improvements over the second-best models. Stochastic LASSO's improvements in coefficient estimation over the second-best models were statistically validated (p-values<0.05). The second-best models were Hi-LASSO in Datasets I and II, and Random LASSO in Datasets III and IV.

Furthermore, we explored the signs of Stochastic LASSO's coefficients compared to those of the second-best models, Hi-LASSO and Random LASSO. Fig. 3 depicts the coefficient estimates of the models. In Fig. 3, the black circle markers present the ground truth of the non-zero coefficients, while the colored markers indicate the average estimates of the models (Stochastic LASSO: red pentagon; Hi-LASSO: green square; Random LASSO: orange triangle). The figure shows that Stochastic LASSO estimated the signs of the coefficients more accurately than the others. In Dataset I, Random LASSO and Hi-LASSO failed to estimate the negative signs of \(\beta _3\), \(\beta _6\), and \(\beta _{10}\), whereas Stochastic LASSO successfully estimated them (Fig. 3A). Note that the non-zero variables in Dataset I are designed with predominantly positive coefficients and high multicollinearity (Eqs. 8–9). Consequently, the benchmark LASSO models barely identified the negative coefficients (i.e., \(\beta _3\), \(\beta _6\), and \(\beta _{10}\)) in 10 repeated experiments: Random LASSO estimated \(\beta _3\), \(\beta _6\), and \(\beta _{10}\) as negative coefficients 1, 0, and 0 times, respectively, and Hi-LASSO estimated them as negative 2, 0, and 0 times, respectively. However, Stochastic LASSO demonstrated its notable coefficient estimation capability by accurately estimating \(\beta _3\), \(\beta _6\), and \(\beta _{10}\) as negative coefficients 4, 2, and 5 times, respectively. We also observed that Random LASSO and Hi-LASSO tend to estimate the same signs under high collinearity. For instance, in Dataset II, Random LASSO and Hi-LASSO identified only positive coefficients for \(\beta _{16}-\beta _{30}\) and only negative coefficients for \(\beta _{31}-\beta _{50}\), while Stochastic LASSO correctly estimated the signs (Fig. 3B). In Datasets III-IV, only Stochastic LASSO precisely identified \(\beta _{16}-\beta _{30}\), while Random LASSO and Hi-LASSO underestimated \(\beta _{16}-\beta _{30}\), shrinking them close to zero (Fig. 3C-D).

Table 2 Description of the semi-simulation data.
Fig. 4: F1-score of the semi-simulation study. The cancer type and sample size (n) for each dataset are described on the x-axis; the number of features (p) of each dataset is around 20,000.

Tests of significance

We performed a semi-simulation study to assess whether the proposed tests of significance can identify the statistical significance of non-zero variables in EHDLSS data. In this study, we generated semi-simulation data, adapting gene expression data of 18 cancer types from the TCGA database. The gene expression data (i.e., RNA-seq) were used directly as the independent variables, but the dependent variable (e.g., survival months) was synthetically generated as follows: (1) we conducted a correlation analysis between gene expression and the survival months; (2) we selected the 100 genes with the highest Pearson correlation coefficients with survival months; (3) the regression coefficients (\(\beta\)) of the 100 genes were randomly generated from the normal distribution N(0, 4); (4) the coefficients of the other genes were set to 0; (5) the dependent variables were generated from the linear combination of the gene expression (X), the coefficients (\(\beta\)), and errors (\(\epsilon\)) drawn from a normal distribution with a mean of zero and a standard deviation equal to that of the logarithmic survival months (i.e., \(y=X\beta +\epsilon\)). The semi-synthetic cancer datasets are briefly summarized in Table 2. In this experiment, we considered only Recursive Random LASSO and Hi-LASSO as benchmarks, since they provide test-of-significance schemes for feature selection. In Stochastic LASSO, the hyperparameter q was set to the sample size of each cancer type, and r was set to 30 to ensure the normality of coefficient estimates. For a fair comparison, the hyperparameters \(q_1\) and \(q_2\) were set to the sample size, and B was set to \(p/q\times 30\) in Recursive Random LASSO and Hi-LASSO. Although the tests of significance were applied to identify statistically significant variables rather than to estimate their coefficients, the experimental results with RMSE of coefficient estimations are also provided in Supplementary Tables S4–S5.
Since a p-value indicates the statistical significance of individual predictors rather than a ranking among predictors, the feature selection performance was assessed solely based on the F1-score at a significance level of \(\alpha =0.05\).

Stochastic LASSO precisely identified statistically significant non-zero variables, achieving the highest F1-scores across the 18 datasets (Fig. 4, Supplementary Table S6). Stochastic LASSO produced F1-scores above 0.8 for the Breast Invasive Carcinoma (BRCA), Colorectal Adenocarcinoma (COAD), Uterine Corpus Endometrial Carcinoma (UCEC), and Brain Lower Grade Glioma (LGG) datasets, each of which consists of more than 500 samples. Stochastic LASSO also maintained an F1-score of over 0.5 for the Glioblastoma Multiforme (GBM) dataset, where the number of samples (n: 159) is extremely small compared to the number of features (p: 19,787). In contrast, the performance of the second-best model, Hi-LASSO, distinctly declined as the sample sizes decreased. Hi-LASSO showed F1-scores of over 0.5 only on the datasets with the largest sample sizes (i.e., BRCA and COAD). These experimental results demonstrate that Stochastic LASSO can provide reliable feature selection even in extremely high-dimensional settings.

Fig. 5: Feature selection results on the TCGA GBM dataset. (A) Average number of selected genes across 10 repetitions; the dashed line indicates the sample size of the GBM dataset (\(n = 159\)). (B) Comparison of F1-score.

Furthermore, we conducted an additional experiment using the GBM dataset, which had the smallest sample size among the semi-simulation data, to evaluate whether Stochastic LASSO can select more variables than the sample size. The overall experimental settings were consistent with the previous semi-simulation study, but 500 genes were assigned non-zero coefficients, which exceeded the sample size (\(n = 159\)). As a result, Stochastic LASSO selected an average of 307.6 variables across 10 repetitions (Fig. 5A), demonstrating that it overcomes a key limitation of conventional LASSO models. Specifically, Stochastic LASSO identified 206.8 non-zero (True Positive) and 100.8 zero variables (False Positive) on average, while Recursive Random LASSO selected 2.2 non-zero and 36.7 zero variables, and Hi-LASSO did not detect any significant genes. Furthermore, Stochastic LASSO achieved the highest F1-score of 0.5120±0.006 (Fig. 5B), which is consistent with the previous feature selection results on the GBM dataset, indicating that the proposed two-stage t-test (TSTT) performs reliably across various high-dimensional settings.

Fig. 6: Kuncheva Index (KI) of the semi-simulation study. The cancer type and sample size (n) for each dataset are described on the x-axis; the number of features (p) of each dataset is around 20,000.

Robustness of feature selection

We finally assessed whether our bootstrap-based LASSO produces consistent feature selection by computing a pair-wise Kuncheva Index (KI) on the previous semi-synthetic cancer datasets. KI computes a robustness score in the range [-1, 1], where zero indicates that each selection was made independently, a positive value indicates that the feature selection model produces stable selections, and a negative value indicates that the model is unstable17. KI is defined as:

$$\begin{aligned} KI(s_i,s_j)=\frac{\left| s_i\cap s_j\right| -\frac{|s_i|\times |s_j|}{p}}{min\left( \left| s_i\right| , \left| s_j\right| \right) -max\left( 0, \left| s_i\right| +\left| s_j\right| -p\right) }, \end{aligned}$$
(12)

where \(s_i\) and \(s_j\) are two sets of features selected by the model, and p is the dimensionality of the dataset.
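Eq. (12) translates directly into code; a minimal sketch (the function name is illustrative):

```python
def kuncheva_index(s_i, s_j, p):
    """Pair-wise Kuncheva Index (Eq. 12) for two selected feature sets
    drawn from p features; corrects raw overlap by its chance level."""
    s_i, s_j = set(s_i), set(s_j)
    inter = len(s_i & s_j)                     # observed overlap |s_i ∩ s_j|
    expected = len(s_i) * len(s_j) / p         # overlap expected by chance
    denom = min(len(s_i), len(s_j)) - max(0, len(s_i) + len(s_j) - p)
    return (inter - expected) / denom
```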

Stochastic LASSO produced the most consistent feature selection with the highest average KI of 0.7590 and the smallest variance across the 18 cancer types, whereas Hi-LASSO and Recursive Random LASSO showed 0.2158 and 0.0058, respectively (Fig. 6, Supplementary Table S7). Note that Stochastic LASSO also achieved the highest F1-scores on the same datasets. The highest F1-scores and KI together demonstrate Stochastic LASSO's reliable feature selection capability for extremely high-dimensional genomic data.

Glioblastoma & Glioma gene expression data analysis

We applied Stochastic LASSO to Glioblastoma Multiforme (GBM) & Brain Lower Grade Glioma (LGG) data to assess it on real-world biological data, where the ground truth is not known. GBM is the most aggressive and common primary brain tumor in adults, while LGG is a slower-growing, less aggressive brain tumor type with a generally better prognosis18. We downloaded genomic data of GBM and LGG from The Cancer Genome Atlas (TCGA) and combined them. In the experiment, we used the gene expression (i.e., RNA-seq) of GBM & LGG patients as the independent variables and their survival time as the dependent variable. The gene expression data consist of 266 patients and 19,785 genes; only deceased patients were included so that the task could be framed as ordinary regression rather than Cox regression. The average survival time was 28.1 months. We identified genes related to survival time using Stochastic LASSO, where q was set to 266 (i.e., the number of samples) and r was set to 30 to ensure the normality of coefficient estimates. For feature selection, a significance level of 0.05 was used for the statistical tests in Stochastic LASSO.

Table 3 Top-20 ranked genes by Stochastic LASSO in GBM & LGG.

Stochastic LASSO identified 490 statistically significant genes out of 19,785 genes, and we examined the genes in the biological literature. Consequently, a number of genes are shown as associated with biomarkers in GBM & LGG (Table 3). For instance, ONECUT3 was reported to suppress glioblastoma cell proliferation and promote a glial-to-neuronal identity switch, implicating it in GBM reprogramming19. HILS1 was identified as a strong prognostic biomarker for LGG. It was significantly upregulated in glioma tissues and associated with higher tumor grade and worse survival outcomes20,21. HILS1 was also a part of the five-pseudogene prognostic signatures for lower-grade gliomas, with higher expression levels associated with advanced tumor grade and poorer patient survival22. Our analysis showed a negative coefficient, aligning with the report. UCN3 was found to be upregulated upon serum stimulation in glioma cells and transiently increased following proliferative stimuli, suggesting a role in glioma adaptation to environmental stress23,24. COL22A1 was overexpressed in GBM and identified as a key angiogenesis-related gene in grade 4 diffuse gliomas. Elevated COL22A1 expression was consistently associated with poor overall survival and endothelial remodeling in the GBM microenvironment25,26,27. LINC00114 was implicated in temozolomide resistance in GBM through ceRNA network regulation28. VN1R4 exhibited recurrent genomic alterations in glioma, particularly in LGG, suggesting its association with gliomagenesis and structural genome instability29.

Furthermore, we conducted gene set enrichment analysis (GSEA) using 490 genes identified by Stochastic LASSO, based on 844 curated pathways retrieved from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database30. The statistical significance of enrichment was assessed using the hypergeometric test. Consequently, 13 significantly enriched pathways (p-value<0.05) were identified, of which seven pathways have been previously reported to be closely associated with the prognosis of GBM and LGG (Table 4). The olfactory transduction pathway was identified as the most prominently enriched in long-term glioblastoma survivors, as revealed by transcriptomic and epigenetic analyses31. Translation initiation was found to be overrepresented in glioblastoma, indicating its relevance to tumor progression and patient outcomes32. The ribosome pathway was reported to be significantly downregulated in glioblastoma cells with acquired temozolomide resistance33. Retinol metabolism was also downregulated in glioma, suggesting that dysregulation in retinoid processing may contribute to remodeling of the tumor immune microenvironment34. The drug metabolism – cytochrome P450 pathway showed notable enrichment in glioma-associated hub genes, indicating a potential link to altered drug response and resistance mechanisms35. The metabolism of xenobiotics by cytochrome P450 pathway was enriched in glioblastoma, implicating detoxification processes in therapeutic resistance36. The JAK/STAT signaling pathway was reported to play a central role in glioma pathobiology by regulating tumor growth, invasion, immune evasion, and stemness, and thus represented a potential therapeutic target37.

Table 4 Enriched pathway by Stochastic LASSO in GBM & LGG.

Conclusions

In this study, we have proposed Stochastic LASSO, an enhanced LASSO model for feature selection with high-throughput data. Stochastic LASSO improves the bootstrap-based LASSO models by: (1) reducing multicollinearity within bootstrap samples while ensuring that each predictor is included an equal number of times, (2) mitigating randomness in predictor sampling by forward selection without additional bootstrapping procedures, and (3) improving statistical significance tests with the proposed two-stage t-test. The performance of Stochastic LASSO was compared to the state-of-the-art LASSO models in extensive simulation settings and with real genomic data. Stochastic LASSO outperformed the benchmarks for feature selection, coefficient estimation, tests of significance, and robustness in the experiments. Stochastic LASSO was applied to gene expression data from TCGA GBM and LGG, identifying both statistically significant genes and enriched pathways associated with survival outcome. As a general framework applicable to any linear regression–based model, Stochastic LASSO can be extended to survival analysis models such as the Cox proportional hazards model, as well as to other applications including protein–protein interaction analysis and association studies. Furthermore, Stochastic LASSO can be applied to classification analysis using penalized logistic regression.