Abstract
Current metrics for binary classification, such as the Area Under the Receiver Operating Characteristic curve (AUC-ROC) or Log Loss, provide a global performance score. However, they do not quantify predictive quality separately for the event and non-event classes. This limitation is particularly critical in imbalanced settings such as medical diagnostics. To address it, we introduce the U-smile Likelihood Evaluation (LE) method, a substantial extension of the original U-smile framework. The U-smile LE method is based on a new metric called the relative Likelihood Ratio (rLR). This single score measures overall model strength without needing a classification threshold. We decompose this score into two class-specific components, \(rLR_{1}\) for the event class and \(rLR_{0}\) for the non-event class, and visualize them simultaneously in a compact U-shaped plot. We validated the U-smile LE method on synthetic datasets with varying class imbalance and on a real-world clinical Heart Disease dataset. In severely imbalanced scenarios (90/10 class distribution), stepwise variable selection guided by U-smile LE outperformed traditional AUC-based selection, improving minority-class detection by 16% in the Area Under the Precision-Recall curve (AUC-PR) and 21% in F1-score. The evolution of U-smile patterns during variable selection provided clear, interpretable insight into the class-specific contributions of individual predictors. Demonstrated with both logistic regression and random forest models, U-smile LE offers an explainable, model-agnostic framework for evaluating binary classifiers, and is especially valuable where class imbalance and interpretability are key concerns.
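The abstract does not state the rLR formula, so the following is only a minimal sketch of the general idea of a threshold-free likelihood score split by class. It assumes a relative log-likelihood gain of a new model over a reference model, normalized by the reference model's total log-likelihood, in the spirit of McFadden's pseudo-R². The function names `class_log_likelihoods` and `relative_gain` and the normalization are illustrative assumptions, not the authors' definition of \(rLR_{1}\) and \(rLR_{0}\).

```python
import math


def class_log_likelihoods(y_true, p_pred, eps=1e-12):
    """Split the Bernoulli log-likelihood into event and non-event parts.

    y_true: iterable of 0/1 labels; p_pred: predicted event probabilities.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    ll_event = 0.0
    ll_nonevent = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            ll_event += math.log(p)           # contribution of an event
        else:
            ll_nonevent += math.log(1.0 - p)  # contribution of a non-event
    return ll_event, ll_nonevent


def relative_gain(y_true, p_ref, p_new):
    """Hypothetical class-wise relative likelihood gain of p_new over p_ref.

    ASSUMPTION: an illustrative stand-in, not the paper's rLR formula.
    Both components share one denominator (the absolute total reference
    log-likelihood), so they add up to the overall gain.
    """
    ref1, ref0 = class_log_likelihoods(y_true, p_ref)
    new1, new0 = class_log_likelihoods(y_true, p_new)
    scale = abs(ref1 + ref0)
    gain_event = (new1 - ref1) / scale
    gain_nonevent = (new0 - ref0) / scale
    return {
        "event": gain_event,
        "non_event": gain_nonevent,
        "overall": gain_event + gain_nonevent,
    }


# Toy 90/10 example: the reference model predicts the prevalence for everyone.
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p_ref = [0.1] * 10
p_new = [0.6] + [0.05] * 9
gains = relative_gain(y, p_ref, p_new)
print(gains)
```

When the reference is a prevalence-only model, the overall value in this sketch equals one minus the ratio of the new to the reference log-likelihood. No classification threshold is involved at any point, and the two components show separately how much of the gain comes from events and how much from non-events.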
Data availability
We used the Heart Disease dataset [Detrano R, Janosi A, Steinbrunn W, Pfisterer M, Schmid J-J, Sandhu S, et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology 1989;64:304–10. https://doi.org/10.1016/0002-9149(89)90524-9.] from the public Machine Learning Repository [Aha D. UCI Machine Learning Repository: Heart Disease Data Set n.d. https://archive.ics.uci.edu/ml/datasets/heart+disease]. Synthetic data similar to those underlying the results presented in the paper can be regenerated using the code at github.com/bbwieckowska/UsmileLE.
References
Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
Lobo, J. M., Jiménez-Valverde, A. & Real, R. AUC: a misleading measure of the performance of predictive distribution models. Glob. Ecol. Biogeogr. 17, 145–151 (2008).
Huang, J. & Ling, C. X. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17, 299–310 (2005).
Assel, M., Sjoberg, D. D. & Vickers, A. J. The Brier score does not evaluate the clinical utility of diagnostic tests or prediction models. Diagn. Progn. Res. 1, 19 (2017).
Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE 10, e0118432 (2015).
van Lierop, S., Ramos, D., Sjerps, M. & Ypma, R. An overview of log likelihood ratio cost in forensic science – Where is it used and what values can we expect? Forensic Sci. Int. Synergy 8, 100466 (2024).
He, H. & Garcia, E. A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284 (2009).
Hoo, Z. H., Candlish, J. & Teare, D. What is an ROC curve? Emerg. Med. J. 34, 357–359 (2017).
Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 369, m1328 (2020).
Davis, J. & Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning - ICML ’06 233–240. https://doi.org/10.1145/1143844.1143874 (ACM Press, 2006).
Kubiak, K. B., Więckowska, B., Jodłowska-Siewert, E. & Guzik, P. Visualising and quantifying the usefulness of new predictors stratified by outcome class: the U-smile method. PLOS ONE 19, e0303276 (2024).
Więckowska, B., Kubiak, K. B. & Guzik, P. Evaluating the three-level approach of the U-smile method for imbalanced binary classification. PLOS ONE 20, e0321661 (2025).
Kubiak, K. B., Konieczna, A., Tyranska-Fobke, A. & Więckowska, B. Beyond global metrics: the U-Smile method for explainable, interpretable, and transparent variable selection in risk prediction models. Appl. Sci. 15, 8303 (2025).
McFadden, D. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics (ed Zarembka, P.) 105–142 (Academic Press, 1973).
Detrano, R. et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64, 304–310 (1989).
Janosi, A., Steinbrunn, W., Pfisterer, M. & Detrano, R. Heart Disease (1988).
Aha, D. UCI Machine Learning Repository: Heart Disease Data Set. https://archive.ics.uci.edu/ml/datasets/heart+disease.
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Brümmer, N. & du Preez, J. Application-independent evaluation of speaker detection. Comput. Speech Lang. 20, 230–275 (2006).
Branco, P., Torgo, L. & Ribeiro, R. P. A survey of predictive modeling on imbalanced domains. ACM Comput. Surv. 49, 1–50 (2017).
Zhang, L., Geisler, T., Ray, H. & Xie, Y. Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function. J. Appl. Stat. 49, 3257–3277 (2022).
Molnar, C. Interpretable Machine Learning (Lean Publishing Process, 2019).
Gilpin, L. H. et al. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) 80–89 (IEEE, 2018).
Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. Adv. Neural. Inf. Process. Syst. 30, 4765–4774 (2017).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Hosmer, D. W. Jr, Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression (Wiley, 2013).
Niculescu-Mizil, A. & Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05 625–632. https://doi.org/10.1145/1102351.1102430 (ACM Press, 2005).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning 1321–1330 (PMLR, 2017).
Advanced Issues and Deeper Insights. In Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (eds Burnham, K. P. & Anderson, D. R.) 267–351 https://doi.org/10.1007/978-0-387-22456-5_6 (Springer, 2002).
Flach, P. & Kull, M. Precision-recall-gain curves: PR analysis done right. Adv. Neural. Inf. Process. Syst. 28, 838–846 (2015).
Funding
The publication costs were covered by an unrestricted scientific and educational grant from the Ministry of Education and Science, Warsaw, Poland, under the Programme ‘Science for Society.’ This support was provided for the project titled ‘Scientific and Consulting Activities of the University Center for Sports and Medical Research in Poznań’ (Grant No. NdS-II/SP/0207/2024/01, funding amount: 1,399,966.00 PLN, total project value: 1,723,966.00 PLN). PG is the Principal Investigator of the project, while BW is one of the main investigators.
Author information
Contributions
Barbara Więckowska: Conceptualization, study design, data collection, methodology and statistical analysis, literature review, manuscript drafting; Przemysław Guzik: Study design, manuscript editing, interpretation of results, critical revision, supervision, funding acquisition, final approval of the manuscript. All authors read and approved the final version of the manuscript.
Ethics declarations
Consent to participate
Not applicable, as this study is based on publicly available datasets and randomly generated synthetic data.
Ethics approval
This study did not require ethics approval as it was based on publicly available datasets and included randomly generated synthetic data. According to institutional and national guidelines, research using publicly available anonymized data and synthetic data does not fall under the scope of ethics committee review.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Więckowska, B., Guzik, P. Usmile likelihood evaluation provides robust threshold free assessment of binary classification models for balanced and imbalanced datasets. Sci Rep (2026). https://doi.org/10.1038/s41598-026-40545-z
DOI: https://doi.org/10.1038/s41598-026-40545-z