Abstract
Selecting the most suitable Automated Machine Learning (AutoML) tool is pivotal for achieving optimal performance in diverse classification tasks, including binary, multiclass, and multilabel scenarios. The wide range of frameworks with distinct features and capabilities complicates this decision, necessitating a systematic evaluation. This study benchmarks sixteen AutoML tools, including AutoGluon, AutoSklearn, TPOT, PyCaret, and Lightwood, across all three classification types using 21 real-world datasets. Unlike prior studies focusing on a subset of classification tasks or a limited number of tools, we provide a unified evaluation of sixteen frameworks, incorporating feature-based comparisons, time-constrained experiments, and multi-tier statistical validation. We also compared our findings with four representative prior benchmarks to contextualize our results within the existing literature. A key contribution of our study is the in-depth assessment of multilabel classification, exploring both native and label-powerset representations and revealing that several tools lack robust multilabel capabilities. Our findings demonstrate that AutoSklearn excels in predictive performance for binary and multiclass settings, albeit at longer training times, while Lightwood and AutoKeras offer faster training at the cost of predictive performance on complex datasets. AutoGluon emerges as the best overall solution, balancing predictive accuracy with computational efficiency. Our statistical analysis—at per-dataset, across-datasets, and all-datasets levels—confirms significant performance differences among tools, highlighting accuracy-speed trade-offs in AutoML. These insights underscore the importance of aligning tool selection with specific problem characteristics and resource constraints. The open-source code and reproducible experimental protocols further ensure the study’s value as a robust resource for researchers and practitioners.
Introduction
Automated Machine Learning (AutoML), as the name suggests, automates the Machine Learning (ML) process, reducing the need for manual effort and in-depth expertise in the area1. Several AutoML tools—presented in this study—are available, but choosing among them is not trivial, since each has its own particularities and targets different binary, multiclass, and multilabel classification problems. One of ML’s goals is to predict values or categories for new (i.e., unseen) data as accurately as possible. Thus, selecting the right AutoML tool is essential, particularly when balancing model performance and training efficiency. Despite the rapid industrial uptake of AutoML, the community still lacks a single “all-tasks” benchmark that is both statistically rigorous and reproducible.
Despite intense activity, we found no prior study that (i) validates results at three nested levels (per-dataset, across-datasets, and all-datasets) and (ii) contrasts native versus label-powerset multilabel handling. As summarized in Table 2, even the largest benchmarks (Truong et al.2, Wever et al.3, and Gijsbers et al.4) omit one or both facets, making their findings hard to generalize. Specifically, Truong et al. cover only binary and multiclass tasks; Wever et al. treat multilabel classification exclusively; and Gijsbers et al. exclude multilabel datasets and report only aggregate scores without per-dataset or corpus-wide significance testing. In contrast, our work evaluates sixteen widely used Python frameworks on twenty-one real-world datasets that span all three classification types, applies the above three-tier statistical analysis, and is therefore, to the best of our knowledge, the most comprehensive AutoML benchmark to date.
Our contributions are fourfold: (1) the first systematic AutoML comparison that jointly covers binary, multiclass, native multilabel, and powerset multilabel tasks; (2) a tight 5-min, hardware-controlled experimental design with standardized weighted-\(F_1\) and timing metrics; (3) a comprehensive multi-layer validation pipeline (per-dataset, across-datasets, all-datasets) that quantifies accuracy–speed trade-offs with robust significance tests; and (4) the public release of all code, including the end-to-end statistical scripts, so researchers can replicate or extend every figure and table (see Code availability). Crucially, these scripts allow the entire statistical analysis to be rerun with ease.
This paper is organized into seven parts: “Theory review” section explains the main concepts necessary for understanding the article; “Related work” section reviews prior studies on AutoML, positioning our research within existing literature; “Assessment methodology” section presents the work proposal, containing details of the steps performed and why only classification (and not regression) problems were addressed; “Experiments” section describes the experiments conducted, as well as the results obtained and in-depth analysis (quantitative and qualitative) of these results, including benchmarking our findings against four representative prior studies; “Statistical analysis” section provides an extensive statistical validation of the results, employing multi-tier significance tests to compare the performance of AutoML frameworks at different levels; “Threats to validity” section discusses potential limitations of the study, addressing internal, external, and construct validity concerns; and finally, “Conclusion” section presents the main findings derived from this work and proposes future research directions.
Theory review
The comparative study proposed in this paper addresses several areas of study, such as machine learning, hyperparameter optimization, neural architecture search, and AutoML. A brief theoretical review of each area follows.
Machine and supervised learning
Machine learning (ML) is a significant domain within Artificial Intelligence (AI) that enables computer systems to learn from data rather than being explicitly programmed. In supervised learning, a subset of ML, systems use labeled datasets during training to produce models that can classify new, unseen data based on patterns learned from the training data5.
Supervised learning methods are designed to solve classification and regression tasks. This work focuses on classification, where the objective is to assign input data to predefined categories based on the learned patterns. Classification tasks include:
-
Binary classification Involves classifying data into two mutually exclusive categories, such as determining whether an email is spam or non-spam based on word frequency and structure or predicting if a patient has a disease or is healthy based on medical test results like blood pressure and cholesterol levels.
-
Multiclass classification Assigns each sample a single label from a set of more than two mutually exclusive categories, such as classifying animal images as cats, dogs, or birds based on physical features or categorizing movies into genres like drama, comedy, or action based on themes, plot structures, and viewer preferences.
-
Multilabel classification Associates each sample with one or more labels simultaneously, such as tagging an image with “beach,” “sunset,” and “vacation” to describe its multiple attributes, or assigning research keywords like “machine learning,” “AI,” and “data science” to a paper to reflect its diverse topics and focus areas.
The key difference among these tasks lies in the complexity of the output: binary handles two classes (Fig. 1a), multiclass assigns one label from many (Fig. 1b), and multilabel allows multiple labels per sample (Fig. 1c), often requiring models to capture label interdependencies.
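To make the difference in output structure concrete, the following minimal sketch contrasts the target shapes of the three tasks using illustrative, hypothetical labels.

```python
import numpy as np

# Binary: one of two mutually exclusive classes per sample (e.g., spam vs. non-spam).
y_binary = np.array([0, 1, 1, 0])

# Multiclass: one of several mutually exclusive classes per sample (e.g., cat, dog, bird).
y_multiclass = np.array([2, 0, 1, 2])

# Multilabel: a binary indicator matrix; each row may activate several labels at once
# (columns could represent, e.g., "beach", "sunset", "vacation").
y_multilabel = np.array([
    [1, 1, 0],   # beach + sunset
    [0, 0, 1],   # vacation only
    [1, 1, 1],   # all three labels
    [0, 0, 0],   # no label applies
])

print(y_binary.shape, y_multiclass.shape, y_multilabel.shape)  # (4,) (4,) (4, 3)
```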
In addition, several algorithms widely discussed in the literature are often associated with solving specific tasks. For instance, binary classification tasks frequently utilize algorithms such as Support Vector Machines (SVMs)6 and logistic regression7. Multiclass classification often employs decision trees8, k-nearest neighbors (KNN)9, or neural networks10. For multilabel classification, specialized algorithms such as Binary Relevance (BR)11, Classifier Chains (CC)12, Label Powerset (LP)13, or adaptations of neural networks14 are commonly used. The AutoML tools discussed in this paper leverage this knowledge to automate the selection and optimization of ML models or pipelines, enhancing efficiency and reducing the need for extensive manual intervention.
Hyperparameter optimization (HPO)
ML models require not only the learning of parameters during training but also the careful selection of predefined settings called hyperparameters. Training algorithms work by iteratively adjusting model parameters, such as weights in a neural network, to minimize a loss function and improve performance on the training data15. Hyperparameters govern the behavior of this process and can include values such as the learning rate, the number of layers in a neural network, or the depth of a decision tree. Well-chosen hyperparameters can significantly improve model performance and computational efficiency.
Hyperparameter optimization (HPO) aims to identify the best configuration of these settings to maximize the model’s performance, often measured using a validation metric such as accuracy or \(F_1\) score. Unlike parameters, which are adjusted dynamically during training, hyperparameters must be determined before training begins. The HPO process involves defining a search space of potential configurations, evaluating their effectiveness, and iteratively refining the search to approach the optimal solution16,17.
Figure 2 illustrates the general workflow of HPO, emphasizing its iterative nature. The problem setup includes selecting an evaluation metric, defining the range of potential values for each setting, and determining how these configurations will be assessed during the search process.
Methods for exploring the search space can be broadly categorized into manual and automated approaches. Manual methods rely on user expertise and intuition, often requiring extensive trial and error. Automated methods, in contrast, utilize algorithms to evaluate configurations systematically and will be discussed in detail in the remainder of this section18.
Random Search (Fig. 3a) and Grid Search (Fig. 3b) can be used for both classification and regression. Both methods rely on cross-validation and take two inputs: the model whose hyperparameters are to be optimized and the search space16. The difference between them lies in how the search space is explored: Random Search samples and tests a limited number of hyperparameter settings, whereas Grid Search exhaustively tests all settings (i.e., all possible combinations)21.
Figure 3: Random Search, Grid Search, and Bayesian Optimization. Adapted from Chen22.
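As an illustration of the two strategies, the following sketch tunes a small SVM search space with scikit-learn’s GridSearchCV and RandomizedSearchCV; the dataset, estimator, and value ranges are placeholders chosen only for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid Search: exhaustively evaluates every combination in the grid (3 x 2 = 6 settings per fold).
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    cv=5,
    scoring="f1_weighted",
).fit(X, y)

# Random Search: samples a fixed number of configurations, possibly from continuous distributions.
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "kernel": ["linear", "rbf"]},
    n_iter=10,
    cv=5,
    scoring="f1_weighted",
    random_state=2,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```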
Bayesian optimization targets black-box functions that are unknown in closed form and usually expensive to evaluate. The Bayesian strategy (Fig. 3c) combines prior knowledge about the unknown hyperparameter-response function with the sample data observed so far to infer where the function’s optimal values lie. Gaussian processes are typically used as the surrogate model fitted to the observed high and low points. The entire process is repeated until the maximum number of iterations is reached or the difference between the optimal and current values falls below a specified threshold17.
The Tree-structured Parzen Estimator (TPE) algorithm is commonly used in various domains, including image processing, solar radiation prediction, and workplace-accident analysis. It models hyperparameter values with Parzen (kernel-density) estimators, typically mixtures of Gaussians, built from experimental observations. Random search is performed initially to seed the sampling distributions of the response surface. The hyperparameter space is then split into better- and worse-performing observations according to their fitness values and a predefined threshold, as depicted in Fig. 4a. The optimal hyperparameters are drawn from the distribution of the best observations, and increasing the number of initial iterations improves the quality of these distributions20.
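The sketch below illustrates TPE-driven optimization using the Optuna library, whose default sampler implements TPE; Optuna is not among the frameworks benchmarked in this study and is used here only as a convenient, widely available reference implementation. The model and search ranges are illustrative.

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # TPE proposes new values by modeling the densities of "good" and "bad" past trials.
    n_estimators = trial.suggest_int("n_estimators", 10, 200)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=2)
    return cross_val_score(model, X, y, cv=3, scoring="f1_weighted").mean()

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=2))
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```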
Unlike other algorithms, Hyperband adopts an early-stopping strategy and requires two inputs to determine resource allocation. The first specifies the maximum resources allocated to a single configuration, while the second controls the proportion of configurations discarded at each round of successive halving (see Fig. 4b). This approach allows Hyperband to evaluate more configurations in high-cost problems without making strong assumptions. The algorithm begins with its most aggressive bracket to maximize exploration while guaranteeing that at least one configuration is trained on the maximum resources. Subsequent brackets relax the early stopping by raising the minimum resource allocation according to the defined proportion, until the final bracket reduces to plain Random Search. Hyperband can thus adaptively allocate resources and may favor conservative allocations in certain scenarios23.
The process of optimizing hyperparameters encompasses a range of techniques applicable to various machine learning models. These models, along with some of their respective hyperparameters, include (but are not restricted to):
-
Decision trees (DTs) maximum depth limits tree complexity to prevent overfitting, while minimum samples per leaf ensure splits are meaningful and not based on small, noisy subsets.
-
K-Nearest neighbors (KNN) number of neighbors (k) determines decision boundary smoothness; small k captures fine details but risks overfitting, while large k smooths predictions at the cost of local precision.
-
Neural networks (NNs) learning rate governs training speed and stability, with smaller values ensuring steady convergence. Hidden layers and neurons affect model complexity, enabling simple to intricate pattern recognition.
-
Random forests (RFs) number of trees enhances stability by averaging predictions, while maximum features per split control tree diversity, balancing accuracy and robustness.
-
Support vector machines (SVMs) regularization parameter (C) controls the trade-off between overfitting and generalization, with smaller C enforcing simpler models. Kernel type (e.g., linear or polynomial) maps data into higher-dimensional spaces for separating non-linear patterns.
While manually tuning these hyperparameters can be a tedious and time-consuming process, automated hyperparameter optimization contributes significantly to AutoML by reducing the need for human effort, improving the performance of ML algorithms, and being easier to reproduce than manual configuration24. However, some difficulties remain, such as the high cost of evaluations when the models, datasets, or pipelines are complex and the lack of clarity about which hyperparameters need optimization and over which ranges.
Neural architecture search (NAS)
Deep learning’s ability to automatically extract features has revolutionized machine learning25, shifting the focus to designing effective neural architectures—the blueprints for network structure. Hyperparameters define these architectures, such as the number of hidden layers and neurons within each layer. During training, the network’s internal parameters (weights and biases) are iteratively adjusted through gradient-based optimization techniques like backpropagation26 to minimize a loss function. Manually designing neural architectures is labor-intensive and requires significant expertise.
While gradient-based optimization excels at refining a neural network’s parameters through backpropagation26, Neural Architecture Search (NAS) tackles a more fundamental challenge: automatically identifying the architecture itself, i.e., the structural hyperparameters that define the network27. Because manually crafting these architectures is highly specialized and time-consuming, NAS automates the discovery of optimal architectures27.
As shown in Fig. 5, the NAS methodology typically consists of three main components:
-
Search space The set of possible building blocks, such as layer types (e.g., fully connected, convolutional, pooling), activation functions, and other structural elements. A well-defined search space balances flexibility with computational feasibility, enabling the discovery of meaningful architectures.
-
Search strategy The method used to navigate the search space and generate promising architectures. Strategies range from simpler approaches like random search to advanced techniques such as evolutionary algorithms. The choice of strategy affects the speed and quality of the search, with trade-offs between exploration and computational cost.
-
Evaluation strategy The process of assessing a candidate architecture’s performance, typically using metrics like accuracy or loss on a validation set. Results guide the search toward better architectures, balancing efficiency and accuracy to impact both search effectiveness and the quality of the final results.
Figure 5: General NAS methodology. Adapted from Elsken et al.27.
As mentioned earlier, the search strategy is critical in NAS as it determines how the search space is explored. Two common strategies are:
-
Evolutionary optimization Inspired by biological evolution, this approach generates a population of candidate architectures and evaluates their performance. High-performing architectures are combined and mutated to produce new candidates in iterative cycles, gradually leading to improved designs28.
-
Random search This simpler approach randomly samples architectures from the search space, evaluating each using the chosen evaluation strategy (e.g., accuracy on a validation set). After a set number of iterations, the architecture with the best performance is selected. Despite its simplicity, random search can be surprisingly effective, particularly in smaller search spaces29.
Traditional NAS approaches separate the search and evaluation phases, which can be computationally expensive. One-shot methods streamline the process by integrating these phases into a single step to overcome these challenges, significantly reducing computational costs. Two notable one-shot NAS algorithms are:
-
Efficient neural architecture search (ENAS) Treats all possible architectures as sub-graphs of a larger supergraph. Each node in the supergraph contains several candidate operations. Once an architecture is trained, its weights are shared with other architectures that share edges in the supergraph30.
-
Differentiable architecture search (DARTS) Relaxes each node’s choice of operation into a softmax distribution over all candidate operations. This makes the search space continuous, allowing the architecture to be optimized by gradient descent: the architecture parameters minimize the validation loss while the network weights minimize the training loss31.
NAS can be an integral step within an AutoML framework, enhancing the automation of the machine learning pipeline by selecting neural network architectures. The concept of AutoML, including its integration with NAS, will be explained in the next subsection.
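As a concrete illustration of NAS inside an AutoML tool, the sketch below runs a small architecture search with AutoKeras (benchmarked later in this study) on tabular data; the trial and epoch counts are arbitrary placeholders kept small for brevity.

```python
import autokeras as ak
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# max_trials bounds how many candidate architectures the search strategy explores.
clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)
clf.fit(X_train, y_train, epochs=10)   # each candidate architecture is trained and scored
print(clf.evaluate(X_test, y_test))    # evaluation strategy: held-out loss/accuracy
```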
Automated machine learning (AutoML)
As previously mentioned, ML is a well-established discipline of AI that can solve many problems by recognizing patterns or making data-based predictions. However, it requires human knowledge, intervention, and effort from data scientists.
Automated ML emerged to reduce these and other burdens. According to Ref.32, AutoML can eliminate numerous manual tasks and allow domain experts to build and work effectively with ML pipelines without advanced expertise in ML or statistics. Reference33 views AutoML as the conjunction of automation and ML, emphasizing the ease of configuring and controlling ML tools so that they become self-adaptive to the problem at hand.
To explore AutoML’s potential, we selected frameworks based on popularity, community adoption, features, and Python support. Most are open source and actively maintained on GitHub, ensuring accessibility and transparency. To provide a broader view, we included 4intelligence, a commercial framework. These frameworks were chosen for their ability to automate key stages of the ML pipeline for structured data. Below, we outline their main features and capabilities.
-
4intelligence34: A cloud-based framework by 4intelligence for binary and multiclass classification, requiring minimal user input. It automates data cleaning, feature engineering, preprocessing, and model selection using various algorithms.
-
AutoGluon35: An open-source AutoML toolkit with deep learning capabilities for tabular, image, and text data tasks. Automates major steps in the ML process, making it accessible for users with diverse skill levels.
-
AutoKeras36: An open-source system that leverages CPUs and GPUs to train neural networks, focusing on image and text classification and regression tasks. Employs Keras and Bayesian optimization for efficient neural architecture search (NAS) with a user-friendly interface that requires no deep learning expertise.
-
Auto-PyTorch37: Built on PyTorch, this framework automates NAS and HPO for optimal neural networks, primarily for tabular data tasks. It streamlines the development of models for classification and regression problems.
-
AutoSklearn38: Based on scikit-learn, it offers diverse preprocessing methods and supports classification, regression, and multilabel tasks. Incorporates meta-learning to enhance Bayesian optimization efficiency and reduce training time.
-
EvalML39: Automates the entire ML pipeline for structured data, including data loading, cleaning, feature engineering, model selection, HPO, and model evaluation, enabling model building with reduced coding effort.
-
FEDOT40: An AutoML framework designed for complex ML tasks, including classification. It employs evolutionary algorithms to optimize pipeline creation, focusing on interpretable results to support data-driven decision-making.
-
FLAML41: A lightweight AutoML library that prioritizes an efficient, cost-effective search for accurate models, optimizing the search process itself to minimize the total computational cost rather than only the cost of evaluating each configuration.
-
GAMA42: An open-source AutoML framework that optimizes pipelines for supervised tasks using genetic programming, offering an intuitive interface for model selection and HPO in classification and regression.
-
H2O43: Optimized for scalable ML, supporting efficient parallel training of multiple models on large datasets. Aims to achieve competitive performance by utilizing distributed algorithms designed for high efficiency.
-
LightAutoML44: A comprehensive, open-source framework emphasizing transparency in ML. Automates data preparation, feature engineering, model selection, HPO, and model deployment, ensuring interpretability in predictions.
-
Lightwood45: An open-source AutoML library developed by MindsDB, focusing on integrating machine learning models directly within databases. It automates core processes like data preprocessing and feature selection for structured data, emphasizing accessibility and ease of use.
-
mljar-supervised46: Dedicated to supervised learning tasks, focusing on both accuracy and interpretability. Automates model selection, HPO, and training with insights into model performance and feature importance.
-
NaiveAutoML47: A lightweight AutoML tool designed for simplicity, automating tasks like data cleaning, feature selection, and basic model evaluation. Ideal for interpretable applications with minimal setup.
-
PyCaret48: A low-code Python library simplifying the ML workflow. Automates preprocessing, feature engineering, model selection, hyperparameter tuning, and evaluation, supporting classification, regression, clustering, and anomaly detection, and seamlessly integrates with popular deployment tools.
-
TPOT49: Uses genetic programming to generate and refine ML pipelines automatically. Manages data preprocessing, feature selection, and model selection with minimal input, providing user-friendly and modifiable pipelines.
Works such as Ref.50, which uses real-life datasets, show how helpful AutoML can be, enabling people without broad ML knowledge to work with datasets and produce meaningful results. “Related work” section will further confirm the potential and user-friendliness of AutoML tools through a comprehensive analysis of existing works.
In addition, Table 1 summarizes the key features and capabilities of each AutoML framework, such as data handling (missing data, duplicates, noisy data, outliers), categorical data management, scaling, normalization, feature engineering, NAS, and supported classification types. This comparison will provide insights into the strengths and limitations of each framework, aiding in informed decision-making for researchers and practitioners.
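Although their internals differ, most of the frameworks above expose a similar fit/predict workflow governed by a time budget. The sketch below illustrates this pattern with AutoGluon’s TabularPredictor; the file names (train.csv, test.csv), the target column name "class", and the 300-second budget are placeholders chosen to mirror the 5-min cap used later in this study.

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Any pandas DataFrame with a target column works; train.csv and test.csv are
# hypothetical files containing a target column named "class".
train_data = TabularDataset("train.csv")
test_data = TabularDataset("test.csv")

predictor = TabularPredictor(label="class", eval_metric="f1_weighted").fit(
    train_data,
    time_limit=300,  # seconds; analogous to the 5-min budget used in this study
)
predictions = predictor.predict(test_data.drop(columns=["class"]))
print(predictor.leaderboard(test_data))
```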
Thus, this paper uses the domains presented above and their characteristics as comparison categories for the AutoML tools studied, providing detailed insights into how each tool addresses these areas.
Related work
This section situates our research within AutoML. “Overview of existing research” section reviews existing studies on frameworks, surveys, and applications, while “Research gap and our contribution” section identifies research gaps and outlines our contributions through benchmarking and in-depth statistical analysis.
Overview of existing research
We review literature on AutoML, examining its functionalities, applications, and evaluations. Our synthesis highlights assessment methodologies, metrics, scalability challenges, and the transition from domain-specific to general-purpose benchmarking.
Truong et al.2 compared AutoML tools across datasets, assessing their performance, advantages, and limitations. While AutoML tools excel in feature engineering, data preprocessing still requires significant human intervention. The evaluation involved around 300 OpenML datasets, with accuracy as the primary metric for binary and multiclass classification tasks. It was observed that no single tool consistently outperformed others across diverse tasks, indicating performance variations depending on the specific test. Notably, H2O excelled in binary classification, while AutoKeras led in multiclass classification.
Wever et al.3 advanced the study of AutoML by focusing on multilabel classification. They introduced a benchmarking framework ensuring uniform runtime constraints, search space models, and evaluation routines across multiple optimization approaches, including Bayesian optimization and bandit algorithms. Their results underscored the difficulties posed by large search spaces and found that a grammar-based best-first search approach outperformed others for multilabel tasks.
Mustafa and Azghadi51 surveyed AutoML in clinical notes analysis, highlighting its challenges and advantages in healthcare. The most commonly used algorithms for clinical notes analysis were Support Vector Machine, Convolutional Neural Networks, Random Forest, and Linear Regression. It was noted that no single feature selection algorithm consistently provided optimal results across all datasets. As future research, the authors suggest exploring new feature-extraction techniques, comparing feature selection methods, and developing tools that combine Bayesian optimization with random search.
Similarly, Bahri et al.52 reviewed AutoML techniques for unsupervised anomaly detection. The review addresses challenges in building high-quality machine learning models through AutoML, emphasizing scalability, evaluation metrics, and handling high-dimensional data. The methodology involves summarizing representative methods, evaluating their performance, and discussing their advantages and disadvantages. Additionally, various automated methods and strategies for anomaly detection are reviewed, including meta-learning approaches and meta-features. Results include insights into the current state of AutoML, highlighting limitations and open research questions, along with updated overviews of AutoML and HPO methods. Furthermore, the article examines anomaly detection algorithms, highlighting common uses, challenges like hyperparameter sensitivity, and frameworks such as PyOD and PyODDS for scalability and pipeline optimization.
In addition to domain-specific applications, AutoML has shown promise in augmenting model training processes. Lin et al.53 developed a solution to optimize augmentation policies during network training. The solution, named OHL-Auto-Aug, treated the augmentation policy as a parameterized probability distribution, and its strategy for neural architecture search was based on AutoML techniques. It was formulated as a bilevel framework where, in the inner loops, the network parameters were trained using a standard Stochastic Gradient Descent along with augmentation sampling. In the outer loops, augmentation distribution parameters were trained using REINFORCE gradients with trajectory samples. The solution achieved a search efficiency 60x faster on CIFAR-10 and 24x faster on ImageNet than the previous state-of-the-art approach.
Extending AutoML’s application across diverse domains, Angarita-Zapata et al.54 conducted a study on hybrid AutoML for traffic forecasting, focusing on the AutoSklearn framework. The study explored three scenarios: optimization, meta-learning, and ensemble learning; meta-learning combined with ensemble learning; and selecting the best-performing pipeline suggested by meta-learning. The study used speed measurements at 5-min intervals for highways and 15-min intervals for urban areas. AutoSklearn assumes that pipelines ranked near position 1 (those whose meta-features are closest to the input dataset) perform better. However, this distance measure showed a weak correlation with performance, and pipelines ranked below position 25 were excluded based on the similarity metric.
AutoML’s impact on medical imaging was explored by Beduin50, who proposed a new residual model called AutoResCovidNet using AutoKeras. This model aimed to compare X-ray images of healthy, pneumonia, and COVID-19-diagnosed individuals. The chosen metrics for analysis were accuracy, precision, sensitivity, and \(F_1\) score, with parameters designed to enhance image processing accuracy. Before AutoML, preprocessing was performed on the COVIDx-m dataset to optimize the training data. Despite testing 20 models, AutoResCovidNet did not surpass existing results in the literature. However, the study highlighted the advantages of AutoML in accelerating model development and enabling focused research for more specific and satisfactory outcomes. The user-friendly interface, simplicity, and strong processing power of AutoKeras were recognized.
Expanding on domain-specific applications, van Eeden et al.55 compared multinomial logistic regression, Naïve Bayesian classifier, and AutoSklearn for predicting psychiatric diagnoses. They hypothesized that AutoSklearn would outperform the Bayesian classifier, surpassing traditional regression techniques in detecting complex patterns. They also hypothesized that AutoSklearn would be efficient when including single items and follow-up measures. The study used data from the Netherlands Study of Depression and Anxiety (NESDA), excluding non-Dutch fluent patients and other psychiatric disorders. Variables such as gender, age, ethnicity, education level, relationship status, and employment status were considered. AutoSklearn outperformed logistic regression and the Bayesian classifier on more complex predictor sets, but its accuracy varied depending on the predictor variables used. However, AutoSklearn was the most consistent across predictor sets.
Beyond specific applications, Ferreira et al.56 proposed a benchmark study to examine the characteristics of eight open-source AutoML frameworks (AutoKeras, Auto-PyTorch, AutoSklearn, AutoGluon, H2O, TPOT, rminer, and TransmogrifAI) along with twelve popular OpenML datasets (such as 37—Diabetes and 23—Contraceptive Method Choice). These datasets were divided into regression, binary, and multiclass classification tasks. The study compared the performance of these tools across different scenarios: General Machine Learning (GML), Deep Learning (DL), and XGBoost (XGB). In the GML scenario, results indicated that TransmogrifAI excelled in binary classification, AutoGluon in multiclass classification, and rminer in regression tasks. In the DL scenario, H2O performed well in binary classification and regression tasks, while AutoGluon stood out in multiclass classification. For XGB, rminer was the top choice for binary classification and regression tasks, and H2O for multiclass classification. Overall, the GML approach demonstrated superior predictive performance, with tools like TransmogrifAI, AutoGluon, and rminer consistently delivering strong results across tasks.
Romero et al.57 evaluated the performance of three widely used AutoML frameworks—AutoSklearn, H2O, and TPOT—on highly imbalanced binary classification tasks in the healthcare domain. Using de-identified medical claims data, they predicted the occurrence of six different diseases, each framed as a separate binary task. Their evaluation focused on imbalanced-aware metrics, such as the Area Under the Precision-Recall Curve (AUPRC), due to the low prevalence of positive cases. While all AutoML tools outperformed a tuned Random Forest baseline, their relative performances were not statistically distinguishable, suggesting no single framework had a consistent advantage across tasks. This study highlights the need to evaluate AutoML tools under real-world data imbalance, especially in high-stakes domains like healthcare.
Synthesizing these insights, Del Valle et al.58 systematically reviewed AutoML for multilabel classification and multi-target regression. The review analyzed existing AutoML approaches, encompassing search space definition, optimization algorithms, and evaluation metrics while identifying limitations and proposing future research directions. The protocol involved selecting studies from various online data sources, employing a four-step process: submitting a search string, initial and final selection, and snowballing. Initially identifying 94 studies, the review ultimately selected 12 for analysis, which included models such as convolutional networks and grammar-based genetic programming. Key findings include determining suitable evaluation measures for each context, exploring alternative loss functions, enhancing search space definition, developing methods for large datasets (e.g., transfer learning), and investigating meta-learning approaches to expedite the AutoML search process.
Neverov et al.59 applied AutoML to wave data classification, addressing parameter optimization challenges. The authors analyzed frameworks including MLJAR AutoML, AutoGluon, AutoKeras, and TPOT, evaluating performance on datasets like Sonar, Doppler, and Winnipeg, featuring energy bands, radar matrices, and labels like mine/rock, car/human/drone, and crops. Using genetic algorithms and Bayesian optimization, AutoGluon achieved the highest accuracy, outperforming the other frameworks. The study also demonstrated AutoGluon’s ability to combine models and optimize performance, making it faster, more reliable, and more accurate, underscoring AutoML’s effectiveness in wave data classification and real-world applicability.
Salehin et al.60 expanded on AutoML’s broader implications by conducting a comprehensive review of AutoML and Neural Architecture Search (NAS). They analyzed existing literature and provided insights into advancements, challenges, and future directions. Employing a systematic review approach, the authors identified studies from reputable databases. The review covers various aspects, including AutoML methods, industrial applications, and open issues related to performance and accuracy. It highlights the role of these techniques in optimizing deep neural applications and addressing challenges in accuracy, latency, and energy consumption. Additionally, the article discusses the motivation behind AutoML research and its potential impact on innovation across industries. It also examines open issues and limitations, offering insights into potential developments.
To systematically evaluate and compare the performance of various AutoML frameworks, Gijsbers et al.4 introduced the Automated Machine Learning Benchmark (AMLB), a comprehensive benchmarking framework. AMLB provides a standardized environment with a diverse collection of real-world datasets covering classification and regression tasks of varying complexity. By establishing consistent evaluation metrics and protocols, curating datasets from multiple domains to test generalization capabilities, presenting empirical results of state-of-the-art AutoML frameworks, and offering an open-source platform that encourages reproducibility and future research, the authors emphasize the importance of such benchmarks in advancing AutoML development. By identifying performance gaps and guiding practitioners in selecting appropriate tools, AMLB serves as a valuable resource for researchers and industry professionals seeking to understand and improve AutoML technologies.
Eldeeb et al.61 proposed a large-scale AutoML benchmark study across 100 classification datasets using six frameworks, including AutoWEKA, AutoSklearn, TPOT, RECIPE, ATM, and SmartML. The study explored how pipeline design decisions—such as search space size, use of meta-learning, time budgets, and ensemble strategies—impact performance. Unlike many benchmarks that report only aggregate scores, Eldeeb et al. emphasized how tool behavior changes under different constraints. Although they did not focus on multilabel classification, their results provide a detailed comparative analysis of tool robustness and configurability across a wide range of tasks, offering practical guidance for AutoML deployment.
Research gap and our contribution
This section has reviewed various AutoML frameworks, highlighting their transformative potential across diverse applications and methodologies. While these studies demonstrate significant advancements, challenges such as scalability, optimization for multilabel tasks, and the need for standardized benchmarks remain. Prior research has largely focused on theoretical comparisons, often failing to address real-world applications. Notably, although some works incorporate limited performance metrics (e.g., accuracy) or consider only binary and multiclass tasks, few studies adopt multi-level statistical tests or examine native and powerset multilabel approaches in a unified setting.
To bridge these gaps, our study evaluates sixteen AutoML frameworks under stringent time constraints across binary, multiclass, and multilabel problems—including both native and label-powerset classifications. Unlike many prior efforts, we employ extensive, multi-tiered statistical analyses (per-dataset, across-datasets, and all-datasets) to robustly determine significance in both predictive performance and computational efficiency. This hands-on methodology connects theoretical insights with real-world outcomes, offering practitioners actionable guidance for selecting or refining AutoML strategies. Our open-source code and reproducible protocols further ensure that researchers can replicate, extend, and apply our findings to a broad range of classification scenarios.
By evaluating AutoML frameworks using real datasets, our study provides a broader, more practical perspective on their capabilities, thus helping researchers and practitioners select the most appropriate tool based on classification type, computational constraints, and desired performance metrics. Table 2 highlights the key focuses, methods, metrics, and outcomes of prior research. It also outlines how our work addresses existing gaps (such as limited task diversity, lack of multilabel support, and insufficient statistical rigor) through multi-task benchmarking, multi-tier statistical analysis, and the explicit evaluation of both native and label-powerset multilabel strategies.
To complement the study-specific overview in Table 2, Table 3 presents a comparative summary of recent AutoML benchmarking efforts across key methodological dimensions. This feature-level synthesis highlights how our study extends the state of the art in terms of classification coverage, dataset diversity, statistical validation, and practical applicability.
Assessment methodology
This section describes the methodology for evaluating AutoML tools in different classification tasks. It details dataset selection, data preprocessing, tool workflow, performance metrics, and result analysis to ensure a fair and reproducible comparison.
Dataset selection
To evaluate AutoML frameworks across different classification tasks, we selected datasets spanning binary, multiclass, and multilabel classification problems. These datasets are all real, publicly available OpenML benchmarks, not synthetic toy problems. They vary in size and include a mix of quantitative, qualitative, and mixed feature types. The number of predictive features ranges from just a few to over 200, and the label structures differ in complexity: from simple binary outcomes to multi-class categories and multilabel combinations. For multilabel datasets, we report both the number of original labels and the number of unique label combinations resulting from the Label Powerset transformation. We specifically chose these 21 OpenML datasets because they are among the most commonly referenced benchmarks in the literature, ensuring comparability with prior studies, and they collectively cover a broad spectrum of real-world domains and data characteristics.
To quantify dataset characteristics, we used two complementary metrics. The first, complexity, is calculated differently depending on the classification type. For binary and multiclass tasks, complexity62 is defined as the ratio of the product of features and classes to samples. For multilabel classification, complexity11 incorporates label cardinality (the average number of labels per instance) to compute effective classes. This metric helps contextualize the computational difficulty of each dataset, where higher values suggest a more challenging classification task. We also report the imbalance, defined for binary and multiclass tasks as the ratio of the smallest to the largest class size. For multilabel datasets, we report both a global imbalance ratio based on total positive and negative label counts, and a powerset-based ratio computed over unique label combinations. Values near 1 indicate balanced distributions, while values near 0 reflect severe imbalance.
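For binary and multiclass datasets, these two descriptors reduce to a few lines of code. The sketch below follows the definitions above with illustrative numbers; the multilabel variants, which additionally involve label cardinality and label-combination counts, are omitted for brevity.

```python
import numpy as np

def complexity(n_samples: int, n_features: int, n_classes: int) -> float:
    """(features x classes) / samples -- higher values suggest a harder task."""
    return (n_features * n_classes) / n_samples

def imbalance(y: np.ndarray) -> float:
    """Smallest class size divided by largest class size (1 = balanced, near 0 = severely skewed)."""
    counts = np.bincount(y)
    counts = counts[counts > 0]
    return counts.min() / counts.max()

# Example with illustrative numbers (not taken from Table 4):
y = np.array([0] * 80 + [1] * 20)
print(complexity(n_samples=100, n_features=10, n_classes=2))  # 0.2
print(imbalance(y))                                           # 0.25
```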
Our selection aimed for broad diversity across several dimensions. We sampled (i) size tiers—small (\(<1000\) rows), medium (1000–2000), and large (\(>2000\)); (ii) application domains, including finance (credit-g), health (diabetes, cardiotocography), imaging (wdbc, segment, scene), text (reuters), and time-series/audio (birds); (iii) a wide complexity spectrum—from 0.006 for bank-note-authentication to 7.766 for birds; and (iv) a broad imbalance range—from 0.001 for reuters to 1.000 for hill-valley and segment. A complete summary of all datasets is presented in Table 4.
Data preprocessing
Minimal preprocessing was applied prior to feeding data into the AutoML tools. Categorical features were integer-encoded using factorization for compatibility across frameworks. Targets were numerically encoded for binary and multiclass tasks; for multilabel classification, labels were either binarized (0 = absence, 1 = presence) or transformed via Label Powerset into unique class identifiers. All other preprocessing, including missing value handling, scaling, and feature engineering, was delegated to the AutoML frameworks to allow them to optimize data preparation internally.
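For the multilabel targets, the Label Powerset transformation can be obtained with scikit-multilearn (the library used in the “Setup” section); the following is a minimal sketch assuming a binary indicator matrix as input and scikit-multilearn’s transform/inverse_transform interface.

```python
import numpy as np
from skmultilearn.problem_transform import LabelPowerset

# Binary indicator matrix: rows = samples, columns = labels (0 = absence, 1 = presence).
Y = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],   # same label combination as the first row
    [1, 1, 1],
])

lp = LabelPowerset()
y_powerset = lp.transform(Y)               # one integer class id per unique label combination
print(y_powerset)                          # e.g., [0 1 0 2] -- identical rows share a class id
Y_back = lp.inverse_transform(y_powerset)  # recovers the original indicator matrix (sparse)
```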
Tool workflow
Frameworks were selected based on adoption in the research community, open-source availability, and integration with widely used ML libraries to align with standard workflows. Each AutoML framework was executed with minimal intervention, handling data preprocessing, feature selection, model training, and HPO. To ensure fair comparison, key parameters—time budget, evaluation metrics, and parallelization—were unified. Because each framework conducts its own internal hyperparameter search, we gauged consistency by running 20 independent trials per dataset, each started with a different prime-number seed—thereby sampling distinct HPO trajectories without externally altering the tools’ default settings. Table 1 summarizes key features: data handling, classification task coverage, and optimization techniques. Details on dataset partitioning, runtime constraints, and hardware specifications are provided in “Setup” section, while threats to validity are discussed in “Threats to validity” section.
Performance metrics
To evaluate AutoML frameworks, we define key performance metrics. Precision is the proportion of correctly predicted positive instances among all predicted positives for a given class \(a\), while recall measures the proportion of actual positives correctly identified. Here, \(a\) represents a specific class, \(TP(a)\) are correctly predicted positives, \(FP(a)\) are incorrectly predicted as positive, and \(FN(a)\) are actual positives missed by the model. Their formal definitions are given in Eqs. (1) and (2).
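For reference, Eqs. (1) and (2) follow the standard definitions and can be reconstructed from the notation above as:

\[
\mathrm{precision}(a) = \frac{TP(a)}{TP(a) + FP(a)} \quad (1), \qquad
\mathrm{recall}(a) = \frac{TP(a)}{TP(a) + FN(a)} \quad (2)
\]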
A more robust and informative metric is the \(F_1\) score, which represents the harmonic mean of precision and recall for a given class \(a\). This metric balances the trade-off between precision and recall, making it particularly useful for evaluating imbalanced datasets. However, while the \(F_1\) score effectively evaluates individual classes, it does not account for class imbalances across the dataset. To address this, the weighted \(F_1\) score84,85,86 extends the \(F_1\) score by incorporating the support, which is the number of true instances (samples) of each class in the dataset. This ensures that the metric reflects the relative importance of each class proportionally. Here, \(C\) represents the set of all classes, \(n_a\) is the support of class \(a\) (i.e., the number of true samples belonging to class \(a\)), and \(F_1~score(a)\) is the \(F_1\) score for class \(a\). The formulas for both metrics are given in Eqs. (3) and (4).
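Likewise, Eqs. (3) and (4) can be reconstructed from the same notation as:

\[
F_1~score(a) = \frac{2 \cdot \mathrm{precision}(a) \cdot \mathrm{recall}(a)}{\mathrm{precision}(a) + \mathrm{recall}(a)} \quad (3), \qquad
\text{weighted } F_1 = \frac{\sum_{a \in C} n_a \cdot F_1~score(a)}{\sum_{a \in C} n_a} \quad (4)
\]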
The calculation of the weighted \(F_1\) score varies slightly depending on the type of classification problem:
-
Binary: the weighted \(F_1\) score is equivalent to the standard \(F_1\) score when there are only two classes, as the weights are derived directly from the support of the positive and negative classes.
-
Multiclass: the weighted \(F_1\) score considers the \(F_1~score(a)\) for each class \(a\) and weights them by their respective support (\(n_a\)). This ensures that the contribution of each class to the final score reflects its proportion in the dataset.
-
Multilabel: each label is treated as a binary classification problem, and the weighted \(F_1\) score is calculated across all labels by summing the weighted \(F_1\) scores for each label and dividing by the total support.
The weighted \(F_1\) score provides a robust and comprehensive performance measure, making it suitable for various classification tasks with imbalanced datasets. Incorporating the relative importance of each class allows results to be presented and evaluated responsibly and fairly. Each metric ranges from 0 to 1, with values closer to 1 indicating better performance.
Alternative evaluation metrics such as the Area Under the Receiver Operating Characteristic Curve (AUROC), the Area Under the Precision-Recall Curve (AUC-PR), and log-loss were not adopted due to limitations in multiclass and multilabel contexts. AUROC measures the trade-off between true and false positive rates across thresholds, while AUC-PR reflects precision-recall performance for the positive class87,88. Both are difficult to interpret or inconsistently defined when label dependencies are present, as in multilabel classification11. Logarithmic loss (log-loss) assesses the confidence of probabilistic predictions but is sensitive to class imbalance and lacks intuitive interpretability across tasks89. In contrast, the weighted \(F_1\) score is robust, interpretable, and applicable across binary, multiclass, and multilabel settings, making it the most suitable choice for unified AutoML benchmarking. Although some frameworks—such as AutoSklearn, FLAML, and mljar-supervised—support interpretability via SHAP (SHapley Additive exPlanations)90 and feature importance scores91, this study focuses on predictive performance and computational efficiency. A dedicated evaluation of interpretability features is left for future work.
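In practice, this metric can be computed with scikit-learn’s f1_score using average="weighted", which matches the definition above for all three task types; the labels below are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

# Multiclass example with illustrative labels.
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2])
print(f1_score(y_true, y_pred, average="weighted"))

# Multilabel example: binary indicator matrices, one column per label.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])
print(f1_score(Y_true, Y_pred, average="weighted"))
```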
Result analysis
The evaluation of AutoML frameworks considers both performance and efficiency. To assess model reliability, stability, and computational demands, we analyze the following:
-
Weighted \(F_1\) score: Evaluates classification performance by balancing precision and recall. We report: (i) maximum weighted \(F_1\) score, indicating the best-case scenario, (ii) mean weighted \(F_1\) score, representing expected performance, and (iii) standard deviation (SD), measuring variability, where lower values indicate greater stability. Results for binary, multiclass, and multilabel tasks are analyzed separately, with framework rankings aggregated for overall assessment.
-
Training time: Assesses efficiency based on execution speed and consistency. We report: (i) minimum training time, reflecting the fastest execution, (ii) mean training time, estimating typical computational needs, and (iii) standard deviation, indicating runtime stability, where lower values suggest more predictable performance.
To ensure robust comparisons, we apply statistical tests in “Statistical analysis” section to evaluate performance (\(F_1\) score) and efficiency (training time) across frameworks. The analysis considers three scenarios: per-dataset, across-datasets, and all-datasets. Per-dataset analysis examines each dataset individually, across-datasets analysis ranks frameworks based on aggregated results, and all-datasets analysis identifies the best-performing framework overall. These tests confirm that observed differences are statistically meaningful rather than random variations. Additional methodological details are also provided in the section.
Experiments
Setup
The experimental setup involved implementing the experiments in Python and consolidating the code and instructions for reproducibility in the following public GitHub repository: https://github.com/marcelovca90/auto-ml-evaluation. Key functionalities, including data retrieval, sampling, and target encoding, were executed using the Pandas92, scikit-learn21, and scikit-multilearn93 (for label powerset) libraries. Datasets were sourced from the public API of the OpenML tool94.
To ensure robust and random evaluations, each dataset underwent 20 rounds of shuffling using prime number seeds (2–71) for the random number generator (RNG). This shuffling mitigates bias and strengthens the generalizability of the findings. The data was split into an 80/20 ratio for training and testing in each round, ensuring consistency across evaluations. Additionally, the prime number seeds were used to initialize the random state of the AutoML frameworks, enhancing reproducibility in their search processes. Prime numbers were chosen as seeds to minimize unintended correlations in dataset shuffling. Unlike composite numbers, they reduce the risk of systematic bias and improve statistical randomness. This approach aligns with best practices in cryptography and randomized algorithms, reinforcing the robustness and fairness of experimental outcomes95.
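The per-seed evaluation loop can be summarized by the following simplified sketch; run_automl_framework is a stand-in (here a plain random forest) for each tool’s fit/predict cycle, and the returned statistics mirror those reported in the results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# The 20 prime seeds (2-71) used for shuffling and for the frameworks' random states.
PRIME_SEEDS = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
               31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

def run_automl_framework(X_train, y_train, X_test, seed):
    # Placeholder for an AutoML tool's fit/predict cycle; in the actual experiments,
    # each framework receives this seed as its random state and a 5-min budget.
    return RandomForestClassifier(random_state=seed).fit(X_train, y_train).predict(X_test)

def evaluate(X, y):
    scores = []
    for seed in PRIME_SEEDS:
        # 80/20 train/test split, reshuffled with a different prime seed in every round.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed, shuffle=True)
        y_pred = run_automl_framework(X_train, y_train, X_test, seed)
        scores.append(f1_score(y_test, y_pred, average="weighted"))
    return np.max(scores), np.mean(scores), np.std(scores)
```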
For each seed, a shell script orchestrated the experiments in a dedicated Linux environment, executing all AutoML frameworks per dataset with a fixed 5-min timeout. This limit ensured fair comparison across frameworks with varying optimization speeds—some rely on fast heuristics, others on longer runtimes. A fixed cap prevents bias toward either. RAM was monitored throughout, but no framework exceeded system memory or failed due to resource limits, so no cap was enforced. After each iteration, results were aggregated using NumPy96 to compute basic statistics (max, min, mean, and standard deviation), providing insight into framework performance.
The experiments were conducted on a computer with an AMD® Ryzen™ 9 5900X processor, 128 GB of DDR4-3200 RAM, Nvidia GeForce® RTX™ 3090 24 GB GDDR6X dedicated graphics card (driver version 546.17), and Ubuntu operating system (version 22.04.2 LTS) under Windows 11 Professional (version 22H2) using WSL (Windows Subsystem for Linux).
For a more comprehensive understanding of the experiment’s configuration, Table 5 details the custom parameters used for each framework. This table provides insight into the specific settings employed during the construction, fitting, and prediction processes through custom parameters.
Results and discussion
This section presents experimental results for binary, multiclass, and multilabel classification tasks. The primary performance metric is the weighted \(F_1\) score (defined in “Performance metrics” section), which balances precision and recall while accounting for class imbalance. Training time is also analyzed to assess computational efficiency.
Each framework-dataset pair was evaluated over 20 runs with shuffled splits using prime number seeds. We report the maximum, mean, and standard deviation of the weighted \(F_1\) score. A high maximum indicates strong predictive capacity in at least one trial, while a high mean and low standard deviation reflect robustness to hyperparameter and data variation.
While prediction time matters in real-time settings such as Network Traffic Analysis (NTA)97,98, we focus on training time, as excessive durations can hinder model usability.
For each classification type, results are summarized in tables and figures reporting weighted \(F_1\) scores and training times. The discussion is structured around the following key aspects:
-
Performance analysis Comparison of frameworks based on weighted \(F_1\) scores across datasets of varying complexity. The analysis identifies where advanced techniques like ensemble learning or NAS contribute to higher accuracy.
-
Training time analysis Examination of computational efficiency through training time comparisons. This highlights the trade-off between frameworks that maximize accuracy through exhaustive optimization versus those prioritizing speed through adaptive resource allocation.
-
Usability and scalability comparison Evaluation of framework behavior under growing data complexity and resource constraints, including how they manage missing values, categorical features, and class imbalance. We highlight differences in scalability and design focus (e.g., exhaustive search vs. adaptive strategies).
-
Biases, limitations, and failure modes We log crashes, timeouts, and degenerate predictions (e.g., NaNs, single-class outputs), tracing them to issues such as inadequate handling of missing values, categorical encoding, or label imbalance. These cases expose framework brittleness and complement quantitative results with robustness insights.
-
Insights on complexity and imbalance Expanded analysis of the relationship between dataset complexity, class imbalance, and framework performance, with emphasis on cases where traditional complexity metrics fail to explain results. In both standard and multilabel settings, severe imbalance—whether across labels or label combinations—often proved more predictive of failure than complexity alone, highlighting the need for imbalance-aware modeling.
-
Conclusion Summary of trade-offs between frameworks favoring exhaustive search and those optimized for efficiency.
Next, we present and discuss the results for each classification task, followed by overall findings and insights. Finally, we examine threats to validity, considering potential limitations and the reliability of our conclusions.
Binary scenario
From here on, each dataset is identified as “Name (ID, C/I)”, where C denotes complexity and I indicates class imbalance, as per Table 4; scores are reported as “max. (mean ± SD)” for \(F_{1}\), and training times as “min. (mean ± SD).” For the binary datasets, Tables 6a and b, along with Figs. 6 and 7, summarize each framework’s weighted \(F_{1}\) score and training time.
Performance analysis
From a performance perspective, most frameworks achieved strong results on simpler datasets, as shown in Table 6a and Fig. 6—particularly bank-note-authentication (ID 1462, 0.006/0.800), where perfect \(F_{1}\) scores were common. However, as dataset complexity increased, performance variability became more pronounced. On hill-valley (ID 1479, 0.165/1.000), frameworks exhibited a wide range of results, with AutoSklearn and GAMA achieving a perfect score, followed by TPOT with 0.992. In contrast, AutoGluon, despite a maximum of 0.926 (mean: 0.685 ± 0.162), showed better adaptability to training time, whereas LightAutoML struggled, reaching only 0.368. These differences highlight the impact of framework design on handling complex datasets. Frameworks like AutoSklearn and Auto-PyTorch use advanced Bayesian HPO38, while GAMA and TPOT employ genetic programming49, both consistently yielding high scores and robust performance. Meanwhile, AutoGluon leverages stacked ensembles and adaptive model selection35, maintaining competitive results with lower training time. In contrast, LightAutoML and Lightwood, with limited feature engineering and less aggressive HPO44,45, showed performance degradation on high-dimensional datasets.
Training time analysis
Training time analysis revealed a clear trade-off between exhaustive optimization and rapid execution. As shown in Table 6b and Fig. 7, heavy-search frameworks such as AutoSklearn and TPOT consistently used nearly the full 5-min budget (05:00 ± 00:02), achieving the highest \(F_{1}\) scores. In contrast, adaptive pipelines dynamically adjusted their runtimes: AutoGluon completed bank-note-authentication (ID 1462, 0.006/0.800) in just 00:11 ± 00:03 while still reaching \(F_{1}=1.000\), and on hill-valley (ID 1479, 0.165/1.000) it finished in 00:17 ± 00:04 (with \(F_{1}=0.926\)). “Speed-first” tools ran dramatically faster but at lower accuracy – for example, Lightwood ran in about 2 s (\(\approx 150\times\) faster) with \(F_{1}=0.563\), and LightAutoML ran in 02:29 ± 00:18 (\(\approx 2\times\) faster) with \(F_{1}=0.368\). These results underscore how runtime strategies directly affect both speed and accuracy.
Usability and scalability comparison
On spambase (ID 44, 0.025/0.650), frameworks that relied on extensive HPO required the full training budget to achieve high scores, while others produced comparable results in significantly less time. This supports findings that HPO through Bayesian optimization38 or genetic programming49 can increase training time without necessarily yielding substantial performance improvements. On hill-valley (ID 1479, 0.165/1.000), which contains 100 features, scalability differences were more apparent. Some frameworks required the full training budget due to their reliance on exhaustive model selection38, whereas others completed training in a fraction of the time at the cost of lower \(F_{1}\), illustrating the speed–accuracy trade-off in high-dimensional tasks35,44.
Biases, limitations, and failure modes
Four frameworks—FEDOT, GAMA, LightAutoML, and TPOT—crashed on the moderately imbalanced titanic set (ID 40945, 0.020/0.618) for distinct but related reasons: FEDOT stalled on missing-value imputation, TPOT failed to encode non-numeric categoricals, and GAMA and LightAutoML aborted when extreme class imbalance left the target effectively single-class (the latter performing a sanity check). AutoSklearn and AutoGluon completed the task, evidencing resilient pipelines with built-in imputation, encoding, and class-balancing heuristics. Two general patterns emerged across the seven binary datasets: (i) exhaustive Bayesian or genetic search engines (AutoSklearn, Auto-PyTorch, TPOT) achieved the highest \(F_{1}\) scores but often overfit the trivially separable bank-note-authentication (ID 1462, 0.006/0.800); (ii) speed-first pipelines (LightAutoML, Lightwood) ran dramatically faster but suffered major accuracy losses (\(\ge 0.6\) in \(F_{1}\)) on noisy, high-dimensional hill-valley (ID 1479, 0.165/1.000). Tree-ensemble tools (AutoGluon, FLAML) showed only mild dips on imbalanced sets, likely because their 5-minute budgets precluded advanced re-sampling or cost-sensitive learning.
Insights on complexity and imbalance
Dataset difficulty hinges on how complexity interacts with class skew. The noisy, high-dimensional hill-valley (ID 1479, 0.165/1.000) defeated every speed-first tool despite moderate complexity, whereas the separable yet skewed bank-note-authentication (ID 1462, 0.006/0.800) still yielded perfect \(F_{1}\) for most frameworks. Low complexity but moderate skew in credit-g (ID 31, 0.040/0.429) caused dips in tree ensembles, while slight skew in diabetes (ID 37, 0.021/0.536) challenged pipelines without feature-interaction search. High complexity and moderate skew in wdbc (ID 1510, 0.105/0.594) saw uniformly strong scores, confirming that informative feature geometry can override imbalance. Together, these findings show that neither complexity nor skew alone predicts risk; it is their joint profile that decides success.
Conclusion
The binary scenario analysis reveals trade-offs among the evaluated frameworks. AutoSklearn and TPOT, which employ comprehensive model search and HPO38,49, achieve high accuracy but require longer training times, prioritizing performance over efficiency. AutoGluon and FLAML balance accuracy and scalability through efficient model selection and automated ensembling35,41. In contrast, LightAutoML and GAMA, which limit computational resources42,44, exhibit higher performance variability, particularly on complex datasets, due to constraints in feature engineering and optimization. These results highlight the need for adaptable frameworks with low variability to ensure reliable performance across diverse datasets.
Multiclass scenario
Turning from the binary datasets to the multiclass scenarios, summarized in Tables 7a and b as well as Figs. 8 and 9, we observe significant variation in performance and training times across frameworks.
Performance analysis
Performance across multiclass datasets revealed notable differences in framework robustness, as shown in Table 7a and Fig. 8. On cardiotocography (ID 1466, 0.165/0.091), frameworks like AutoSklearn (1.000 (\(1.000\pm 0.000\))) and PyCaret (1.000 (\(0.999\pm 0.002\))) achieved high weighted \(F_{1}\) scores with low variability, indicating reliable performance on high-complexity tasks. In contrast, H2O (0.915 (\(0.706\pm 0.214\))) struggled to handle complex distributions consistently, while LightAutoML performed poorest (0.150 (\(0.104\pm 0.022\))) on that dataset. Trends also varied on simpler datasets such as contraceptive-method-choice (ID 23, 0.018/0.529). Here, AutoSklearn (max 0.631, mean \(0.569\pm 0.031\)) proved more consistent, whereas LightAutoML (max 0.461, mean \(0.415\pm 0.029\)) demonstrated weaker performance and higher variability. Overall, many frameworks (e.g., AutoGluon, Auto-PyTorch, FEDOT, mljar-supervised, NaiveAutoML, TPOT) reached near-perfect \(F_{1}\) on datasets such as segment (ID 36, 0.058/1.000), wine-quality-red (ID 40691, 0.041/0.015), and car (ID 40975, 0.014/0.054), underscoring the capacity of AutoML to excel on diverse multiclass tasks.
Training time analysis
Training times also varied significantly across frameworks and datasets, as shown in Table 7b and Fig. 9. Some frameworks (e.g., Auto-PyTorch, AutoSklearn, EvalML, TPOT) generally utilized the full 5-min budget or close to it, irrespective of dataset complexity, yielding top-tier \(F_{1}\) scores but at the cost of longer, inflexible runtimes. In contrast, LightAutoML took only a few minutes on most datasets (e.g., 02:15 min for contraceptive-method-choice (ID 23, 0.018/0.529), 02:44 min for cardiotocography (ID 1466, 0.165/0.091)), though its performance was inconsistent. Meanwhile, AutoGluon, PyCaret, and mljar-supervised consistently completed tasks in well under a minute on simpler datasets—and even on more complex ones often finished in 1–3 min (e.g., AutoGluon’s minimum of 02:12 min on wine-quality-red (ID 40691, 0.041/0.015)), leveraging adaptive resource usage and ensembling. Finally, Lightwood reported extremely short runtimes (a few seconds) though with more variable performance.
Biases, limitations, and failure modes
No multiclass dataset triggered a hard failure, yet two systematic weaknesses surfaced: (i) NAS-driven frameworks (AutoKeras, Auto-PyTorch) excelled on image-like segment (ID 36, 0.058/1.000) but showed limited generalization across tasks, revealing bias toward convolutional backbones; (ii) pipelines that blindly one-hot encode categoricals (H2O, LightAutoML) suffered severe accuracy drops (up to 0.80 in \(F_{1}\)) on cardiotocography (ID 1466, 0.165/0.091) and yeast (ID 181, 0.054/0.011), where rare classes combine with many sparse features, whereas CatBoost-based engines (FLAML, mljar-supervised) remained unaffected. Exhaustive-search systems (AutoSklearn, Auto-PyTorch, TPOT, GAMA) consistently consumed their full 5-minute budgets, confirming the classic speed-versus-performance trade-off.
Combined variability analysis
In multiclass classification, frameworks displayed differing levels of reliability in performance and training times. AutoSklearn and PyCaret consistently maintained low variability—exemplified on segment (ID 36, 0.058/1.000), where AutoSklearn reached 0.993 (0.985 ± 0.006) and PyCaret 0.994 (0.983 ± 0.005) with limited fluctuations. In contrast, LightAutoML and H2O reported greater variability, especially on cardiotocography (ID 1466, 0.165/0.091), where standard deviations in \(F_{1}\) exceeded 0.020 and 0.200, respectively—suggesting sensitivity to parameter initialization or partial pipeline exploration.
Insights on complexity and imbalance
Imbalance dominated outcomes more than nominal complexity. The most complex set, cardiotocography (ID 1466, 0.165/0.091), yielded near-ceiling scores with robust encoders, while the ostensibly benign yeast (ID 181, 0.054/0.011) remained the chief failure case for one-hot or linear pipelines. Even low-complexity, mid-skew contraceptive-method-choice (ID 23, 0.018/0.529) exposed \(F_{1}\) gaps between exhaustive search engines and fast heuristics. Conversely, balanced but richer segment (ID 36, 0.058/1.000) posed little difficulty for NAS-based tools. Hence, the practical “hardness” surface is shaped by the twin axes of categorical sparsity and minority-class share rather than by the feature-class ratio.
Conclusion
AutoSklearn and PyCaret balanced performance, efficiency, and robustness well on multiclass tasks. AutoGluon, mljar-supervised, and TPOT also demonstrated strong results, especially when time was limited and adaptive resource allocation proved beneficial. H2O and LightAutoML exhibited larger swings in performance, highlighting the need for tuning when faced with high-complexity data or subtle class imbalances. Encouragingly, no failures occurred on any dataset, underscoring these frameworks’ capabilities under controlled conditions, though real-world deployments still demand cautious preprocessing.
Multilabel (Native) scenario
Analyzing the native multilabel cases (i.e., without “label powerset” transformation) reveals framework compatibility and performance insights. Tables 8a and b, as well as Figures 10 and 11, present the experimental results.
Performance analysis
Among the functioning frameworks, AutoSklearn consistently outperformed AutoKeras, AutoGluon, and FEDOT, as shown in Table 8a and Fig. 10. It achieved significantly higher \(F_{1}\) scores across all datasets, with lower or comparable standard deviations, indicating greater robustness3,32. For instance, on yeast (ID 41473, 2.528/0.434), AutoSklearn reported a weighted \(F_{1}\) score of \(0.630\,(0.597\pm 0.019)\), outperforming both FEDOT with \(0.576\,(0.554\pm 0.014)\) and AutoKeras with \(0.248\,(0.234\pm 0.009)\). Even the best performer (AutoSklearn) peaked at just 0.63 on the hardest dataset, versus \(>0.9\) on binary/multiclass tasks—underscoring that native multilabel remains largely unsolved. AutoGluon, while efficient on many binary or multiclass tasks35, struggled to surpass \(F_{1}=0.421\) on image (ID 41468, 0.417/0.328), suggesting less effective internal strategies for multi-output data3. FEDOT performed moderately but remained sensitive to dataset complexity, and AutoKeras maintained relatively short training times but lagged in predictive accuracy on large or complex label spaces.
Training time analysis
As shown in Table 8b and Fig. 11, AutoSklearn consistently consumed the full 5-min budget, yielding thorough HPO and, consequently, higher \(F_1\) scores24,38. By comparison, AutoKeras completed its runs well within the budget (00:45–03:15), leveraging faster neural architecture exploration at the cost of fully optimized configurations22. AutoGluon displayed mixed times (1–5 min), reflecting adaptive resource usage35. Meanwhile, FEDOT sometimes hovered near 5 min (and even exceeded it on reuters (ID 41470, 0.981/0.197)), indicating less stable search strategies for higher-complexity problems40. These findings align with the principle that deeper model searches or ensemble methods typically yield better predictive performance but require proportionally more compute time2,32.
Usability and scalability comparison
From a usability perspective, frameworks with native multilabel support are rare; most either do not implement multi-output methods or rely on wrappers3. In this experiment, only AutoSklearn, AutoGluon, AutoKeras, and FEDOT produced valid native multilabel models. Among these, AutoSklearn’s advanced Bayesian HPO often secured the highest \(F_1\) but demanded the full time budget24,38, while AutoKeras achieved faster builds via NAS but lower accuracy. AutoGluon and FEDOT landed in between, reflecting moderate performance and variable runtimes. This suggests that frameworks balancing thorough pipeline searches with efficient resource usage—via optimized cross-validation or automated ensembling32,35—offer more scalability on larger datasets under time constraints. However, the dearth of native multilabel solutions underscores that specialized tasks can limit an AutoML framework’s applicability in real-world scenarios.
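To make the wrapper route concrete, the sketch below applies the binary-relevance idea with scikit-learn's MultiOutputClassifier, training one independent binary model per label. The RandomForestClassifier is only a stand-in for whatever estimator an AutoML tool would return, and the synthetic data replaces a real benchmark dataset; none of the evaluated frameworks is invoked here.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multilabel data standing in for a dataset such as yeast or flags.
X, Y = make_multilabel_classification(n_samples=500, n_classes=6, n_labels=2,
                                      random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

# Binary relevance: one independent binary classifier per label column.
model = MultiOutputClassifier(RandomForestClassifier(random_state=0))
model.fit(X_tr, Y_tr)
print(f1_score(Y_te, model.predict(X_te), average="weighted"))
```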
Biases, limitations, and failure modes
Only four frameworks—AutoSklearn, AutoGluon, AutoKeras, and FEDOT—could train native multilabel models; the remaining eleven either lack multi-output estimators or crashed during pipeline construction. AutoSklearn’s Binary-Relevance ensembles achieved the highest overall weighted \(F_1\) but still reached only 0.799 (mean \(0.716\pm 0.034\)) on the highly imbalanced flags dataset (ID 285, 4.336/0.524), highlighting struggles with rare labels. Shared-backbone architectures of AutoGluon and AutoKeras also showed lower \(F_1\) on flags, and FEDOT’s performance varied widely, occasionally timing out under high label cardinality. These patterns illustrate how different native multilabel strategies yield varied robustness when faced with label sparsity.
Insights on complexity and imbalance
For native multilabel, rare-label frequency—not dimensionality—dictated variance. Every framework stumbled on flags (ID 285, 4.336/0.524), where half the labels appear in < 5% of samples, yet several handled the far larger and denser birds (ID 41464, 7.766/0.056) without incident. AutoSklearn’s binary-relevance ensembles softened the blow but still posted zero recall on the sparsest tags, while shared-backbone models in AutoGluon/AutoKeras excelled on image (ID 41468, 0.417/0.328). These contrasts show that robustness scales with a tool’s ability to detect and weight ultra-minor labels rather than with search depth or feature count.
Conclusion
In native multilabel classification, AutoSklearn demonstrated the strongest combination of accuracy and stability, leveraging exhaustive model selection and HPO24,38. AutoKeras offered fast training but lower predictive power, while AutoGluon and FEDOT delivered mixed results and times. Only four of the 16 evaluated frameworks provided native multilabel support, leaving most unable to train directly on multi-output data. As a result, real-world users must often resort to label-powerset transformations or custom pipelines. This highlights a structural limitation within current AutoML, where tasks such as multilabel classification still require bespoke solutions to bridge persistent gaps in support3.
Multilabel (powerset) scenario
Moving on to the powerset-transformed multilabel scenarios, we observe varying performance and training efficiency among different frameworks, as shown in Tables 9a and b, as well as Figs. 12 and 13. All imbalance values reported in this section refer to the powerset-transformed label space, as defined in Table 4.
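For readers unfamiliar with the transformation, the snippet below shows a minimal label-powerset encoding: every distinct label combination in the binary indicator matrix becomes one class, which any single-label tool can then model. It is an illustrative sketch rather than the exact preprocessing code used in our experiments.

```python
import numpy as np

def to_label_powerset(Y):
    """Map each row of a binary label matrix (n_samples x n_labels)
    to a single powerset class id."""
    combos, y_powerset = np.unique(Y, axis=0, return_inverse=True)
    return y_powerset, combos  # class ids and the combination each id encodes

def from_label_powerset(y_powerset, combos):
    """Recover the multilabel matrix from powerset class predictions."""
    return combos[y_powerset]
```

Because each unique combination becomes its own class, rare combinations turn into near-singleton classes, which is precisely the imbalance effect analyzed below.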
Performance analysis
Regarding the \(F_1\) score, AutoSklearn consistently achieved the highest values across most datasets, demonstrating robustness and reliability32,38, as shown in Table 9a and Fig. 12. For example, on scene (ID 41471, 0.787/0.003), AutoSklearn reached \(0.796\,(0.749\pm 0.021)\), exceeding both FLAML with \(0.772\,(0.723\pm 0.024)\) and Lightwood with \(0.299\,(0.235\pm 0.040)\). Such superior performance often stems from comprehensive pipeline searches and ensemble-based optimization, which consistently yield higher accuracy but require more computational effort2,24. Meanwhile, FLAML and AutoGluon35,41 also delivered competitive \(F_1\) results on datasets like reuters (ID 41470, 0.981/0.001), highlighting their potential for powerset-transformed tasks. However, frameworks like Lightwood45 and PyCaret48 persistently underperformed in predictive accuracy, likely due to shallower HPO strategies or limited feature-engineering capabilities1.
Training time analysis
Training efficiency varied considerably among frameworks, as shown in Fig. 13 and Table 9b. AutoSklearn exhibited stable training times near the 5-min limit, indicating a thorough optimization process. By contrast, Lightwood often completed tasks in under a minute, emphasizing minimal pipeline overhead at the cost of performance drops3. For instance, on reuters (ID 41470, 0.981/0.001), Lightwood finished in \(00{:}14\pm 00{:}02\) but lagged behind AutoSklearn’s \(F_1\) by over 30%. PyCaret, conversely, suffered extended runtimes on high-complexity datasets, sometimes exceeding 5 min and negatively impacting usability for time-constrained scenarios. Such disparities confirm the trade-off: deeper HPO or ensembling can boost \(F_1\) but require more computation, while faster frameworks may be appealing yet yield lower accuracy24,32.
Usability and scalability comparison
Label-powerset transformations inflate the effective number of classes, leading to pronounced imbalance and sparse label distributions11,13. Frameworks with robust ensembling or Bayesian HPO strategies (e.g., AutoSklearn38) generally cope better with this complexity, albeit at the cost of longer training times2. Meanwhile, tools like AutoGluon and AutoKeras balance speed and accuracy via adaptive model selection35,41, often excelling on moderately complex tasks without exhausting the time budget. In contrast, frameworks such as LightAutoML44, Lightwood45, or PyCaret48 scale well in runtime but exhibit steeper drops when confronted with large powerset label spaces or heavy imbalances1,3. Hence, user priorities—rapid prototyping vs. highest possible \(F_{1}\)—dictate which frameworks are most suitable in powerset scenarios (Table 10).
Biases, limitations, and failure modes
After powerset transformation, the effective class count explodes, exposing three recurrent issues: (i) memory-intensive GP/evolutionary frameworks (Auto-PyTorch, TPOT) struggled on the high-cardinality sets birds (ID 41464, 7.766/0.003) and yeast (ID 41473, 2.528/0.004), achieving low precision on rare combinations; (ii) lighter Bayesian engines (FLAML, AutoGluon) finished within budget but lost precision on rare labels, reflecting limited cost-sensitive tuning; (iii) majority-class fallbacks in Lightwood and PyCaret produced sub-minute runtimes yet ignored over 90% of minority classes, yielding degenerate, mode-seeking predictions. FEDOT and GAMA failed on several powerset tasks due to memory or sparsity issues, whereas AutoSklearn’s bagged one-vs-rest ensembles handled memory pressure and imbalance at the cost of consuming the full 5-min budget. These patterns confirm that robustness under label explosion demands either heavy compute (deep ensembles) or specialized imbalance-aware strategies—capabilities missing in many AutoML frameworks.
Combined variability analysis
AutoSklearn maintained low variability in both \(F_1\) and training times, making it a robust choice for powerset-transformed tasks. On reuters (ID 41470, 0.981/0.001), it achieved \(0.764\,(0.718\pm 0.015)\) at \(05{:}06\pm 00{:}04\). In contrast, Lightwood displayed high instability, finishing swiftly (\(<1\) min) but yielding widely fluctuating results. FLAML offered a middle ground, balancing competitive accuracy (e.g., \(0.745\,(0.717\pm 0.016)\) on reuters (ID 41470, 0.981/0.001)) with moderate time usage, illustrating how partial yet efficient HPO can yield strong performance without monopolizing the runtime41. Hence, frameworks with heavier (or more methodical) optimization loops generally show lower performance variance, while faster solutions introduce a greater risk of suboptimal hyperparameter sets2,24.
Insights on complexity and imbalance
After powerset transformation, imbalance ratios plunge (< 0.05) while the number of effective classes skyrockets, producing thousands of near-singletons that cripple memory-hungry GP engines and majority baselines. The moderately complex emotions (ID 41465, 1.361/0.012) proved tougher than the denser reuters (ID 41470, 0.981/0.001) because rare label combinations outnumber informative ones. AutoSklearn and FLAML contained the long tail via bagged one-vs-rest ensembles or cost-aware boosting, whereas EvalML and FEDOT timed out on the high-cardinality yeast (ID 41473, 2.528/0.004). Once label-space explosion sets in, success depends less on complexity than on specialised imbalance-aware search and memory discipline.
Conclusion
In the multilabel powerset scenario, AutoSklearn generally dominated in \(F_{1}\) performance, while FLAML and AutoGluon provided strong contenders with shorter average training times. Frameworks such as Lightwood and PyCaret, though quick on some datasets, often fell behind in predictive quality, illustrating the classic speed-versus-accuracy trade-off3,32. Additionally, a few tools failed to complete powerset tasks altogether due to label-sparsity-induced pipeline breakdowns3,12. The results confirm that deeper searches and robust ensembling yield higher accuracy but come at the cost of runtime, while less exhaustive approaches can be advantageous in time-sensitive contexts yet require careful consideration of label imbalance.
Benchmark comparison with prior studies
To situate our results within the broader AutoML evaluation landscape, we compare our results against four representative benchmark studies (see Table 3):
-
Truong et al. (2019)2: early analysis of binary and multiclass \(F_1\)-score ranges under varying class imbalance.
-
Wever et al. (2021)3: subset-\(F_1\) performance of advanced hyperparameter optimizers on native multilabel tasks.
-
Gijsbers et al. (2024)4: AUC for binary and log-loss for multiclass across diverse datasets.
-
Eldeeb et al. (2024)61: \(F_1\) ± SD results under extended time budgets (10–240 min).
Note: we did not include Romero et al. (2022)57 in our comparison because their evaluation was performed on a completely disjoint set of datasets with no overlap with the ones used in this study (Table 4).
We summarize each study’s top-reported metrics (and runtimes, where available) alongside our 5-min-budget \(F_1\) scores in Table 3—highlighting advances in predictive quality and persistent gaps, particularly in multilabel classification; see the table notes for full definitions and aggregation details. Having established how our results compare to prior benchmarks, we now turn to a detailed examination—analyzing weighted \(F_1\) scores across binary, multiclass, native multilabel, and powerset multilabel scenarios.
Binary scenario
In this study, most frameworks achieve high \(F_1\) scores (0.75+ minimum), with several reaching perfect 1.0, except for LightAutoML (0.231–0.636) and H2O (0.380–0.985). Comparing with Truong et al.2, AutoSklearn shows improvement from their 0.200–0.930 range, while TPOT maintains consistent performance. Darwin (only in Truong’s study) showed impressive binary performance (0.900–0.940). Eldeeb et al.61 reported strong \(F_1\) scores for TPOT (0.890 ± 0.121) and ATM (0.899 ± 0.121), suggesting these frameworks maintain reliability across different benchmark studies.
Multiclass scenario
Again, most frameworks reach perfect scores on some datasets, with AutoGluon, AutoSklearn, and PyCaret showing strong lower bounds (0.61+). Comparing across studies, AutoSklearn improved from Truong’s findings (0.000–0.950), while H2O still shows variability. Gijsbers et al.4 used different metrics (log-loss) where AutoSklearn slightly outperformed AutoGluon (0.442 ± 0.026 vs 0.456 ± 0.062, lower is better). Interestingly, frameworks like TPOT show consistent performance across three different studies spanning 2019–2025.
Multilabel (Native) scenario
Only AutoSklearn (0.630–0.816), AutoKeras (0.248–0.758), AutoGluon (0.101–0.421), and FEDOT (0.538–0.781) report native multilabel capabilities in this study. Wever et al.3 focused exclusively on multilabel classification with specialized algorithms like HTN-BF showing competitive performance (0.430–0.790). The contrast between studies highlights that multilabel classification has received less attention in AutoML benchmarking, with specialized approaches from Wever et al. potentially offering insights for improving general-purpose frameworks.
Multilabel (powerset) scenario
Our study shows 4intelligence leading (0.258–0.823) using the label powerset approach, with most frameworks achieving moderate performance but showing greater variability than in binary/multiclass scenarios. None of the previous studies evaluated multilabel powerset approaches, making this a unique contribution of the current research and highlighting a gap in AutoML evaluation literature for transformed multilabel problems. Our analysis reveals that the powerset transformation inherently exacerbates class imbalance by creating numerous rare label combinations, which increases modeling difficulty and explains the higher variance observed across frameworks and datasets compared to binary and multiclass scenarios.
Across scenarios and studies, AutoSklearn demonstrates the most consistent performance and has been thoroughly evaluated in multiple benchmarks. Frameworks show progressive improvements between Truong et al. and this study, particularly for binary and multiclass problems. However, multilabel classification remains challenging with lower \(F_1\) scores across all frameworks and studies. The use of different evaluation metrics (\(F_1\), AUC, log-loss), dataset selections, and runtime budgets (our 5-minute constraint versus Eldeeb et al.’s extended 10–240 min) across studies makes direct comparison challenging, but frameworks like TPOT and AutoSklearn demonstrate reliable performance across different research efforts, time periods, and computational constraints. Gijsbers’ use of AUC and log-loss provides complementary performance insights that align with \(F_1\)-based rankings, further validating the overall performance trends observed across different evaluation approaches.
Overall findings and insights
The analysis reveals that no single AutoML framework excels across all scenarios. AutoSklearn consistently demonstrated strong performance across binary, multiclass, and multilabel tasks (both native and powerset), making it a reliable choice for a wide range of classification problems. Its combination of high \(F_{1}\) scores, robustness (low variability across datasets), and relatively stable training times underscores its versatility. However, it often consumes the full 5-min budget38, reflecting deeper hyperparameter searches that can be prohibitive in strict time-sensitive contexts2.
Frameworks like Lightwood and AutoKeras offered faster training times but showed trade-offs in accuracy due to simpler pipeline strategies and more limited hyperparameter exploration1,45. Meanwhile, dataset complexity played a significant role in framework performance, though traditional complexity metrics alone did not fully capture real-world challenges such as label imbalance or complex feature interactions. In multiclass tasks, some frameworks exhibited counterintuitive results on ostensibly simpler datasets (e.g., ID 23) while performing well on more complex ones (e.g., ID 1466), suggesting domain-specific factors can override nominal complexity32.
Challenges like sparse labels, imbalanced data, and preprocessing limitations were particularly notable in multilabel scenarios. In the native multilabel setting, only AutoSklearn, AutoGluon, AutoKeras, and FEDOT produced valid results, highlighting the scarcity of built-in multi-output support3. Meanwhile, powerset transformations further exacerbated class imbalance, leading to partial failures for frameworks like GAMA and FEDOT. EvalML also struggled on certain transformations, reflecting the need for more robust data-handling strategies (e.g., label subset smoothing, meta-label modeling)12,13.
Table 11 summarizes the key findings, challenges, and recommendations. It underscores how advanced hyperparameter optimization techniques (Bayesian or genetic)38,49 can yield top scores at the expense of runtime, while faster methods sacrifice some accuracy. Across binary, multiclass, and multilabel tasks, frameworks like AutoSklearn proved exceptionally consistent but time-consuming, whereas LightAutoML and Lightwood frequently finished in minutes (or seconds) but faced larger performance drops on high-dimensional or imbalanced datasets44,45. Striking a balance between thoroughness and efficiency remains a central challenge in AutoML design.
By addressing these gaps—improving native multilabel support, handling imbalanced classes, and refining preprocessing pipelines—AutoML frameworks can become more versatile and reliable. This is particularly important for real-world tasks where data quality is inconsistent and resource constraints are stringent3,32. The recommendations in Table 11 provide actionable steps to guide framework development toward low-variability, high-adaptability solutions for the wide range of classification challenges encountered in practice.
In summary, each framework balanced accuracy, speed, and ease of use differently across binary, multiclass, and multilabel tasks. The next section provides a statistical analysis for deeper insights into their real-world applicability.
Statistical analysis
A robust, three-stage statistical analysis was conducted to compare AutoML frameworks with respect to both predictive performance (\(F_1\) score) and computational efficiency (training time). In view of the heterogeneous nature of the datasets, classification tasks, and framework behaviors, all decisions regarding the selection of statistical tests and post-hoc comparisons were made automatically, guided by programmatically assessed properties such as normality (e.g., Shapiro–Wilk), variance homogeneity (e.g., Levene’s test), and sphericity (Mauchly’s test). This approach minimized the risk of subjective choices and ensured consistent, repeatable evaluations despite the complexity of the data.
The analysis was organized into three parts:
-
Per-dataset analysis (“Per-dataset analysis” section): Each dataset was handled as a standalone case to highlight framework-specific strengths or deficiencies. We tested for normality and variance homogeneity, then applied either a parametric method (ANOVA with Tukey’s HSD) or a non-parametric counterpart (Kruskal–Wallis with Dunn’s post-hoc). This revealed how each framework performed on every individual dataset, unveiling nuances possibly masked by aggregate analyses.
-
Across-datasets analysis (“Across-datasets analysis” section): We then aggregated results within each classification scenario (binary, multiclass, multilabel) to identify overarching performance trends. Mean \(F_1\) and median training time were tested for normality (Kolmogorov–Smirnov if necessary) before employing either a repeated-measures ANOVA (with Greenhouse–Geisser correction if sphericity was violated) or the Friedman test if parametric assumptions did not hold. This stage provided a broader perspective on how frameworks fared in each scenario as a whole.
-
All-datasets analysis (“All-datasets analysis” section): Lastly, a global comparison was performed by pooling all datasets across all classification types, creating a unified view of framework performance. Given frequent departures from sphericity, a Friedman test was chosen, followed by Nemenyi post-hoc tests for pairwise framework comparisons. This produced a final, scenario-agnostic ranking of methods.
Throughout the analysis, a significance level of \(\alpha = 0.05\) was maintained, following standard hypothesis testing conventions99, ensuring that the observed differences in framework performance were statistically meaningful. To further strengthen the reliability and reproducibility of these findings, we implemented all statistical tests in Python 3.10 using well-established libraries: scipy100 for core hypothesis testing, statsmodels101 for advanced model diagnostics, scikit-posthocs102 for multiple-comparisons procedures, and pingouin103 for additional metrics.
By structuring the analysis into per-dataset, across-datasets, and all-datasets stages, we capture nuances at increasing levels of aggregation, thereby addressing potential confounding factors like dataset complexity or class imbalance. The following subsections detail each step, presenting key results and discussing potential research and practical implications.
Per-dataset analysis
For each dataset, we assessed the performance of multiple AutoML frameworks based on two key metrics: \(F_1\) Score (predictive performance) and Training Time (computational efficiency). Because each framework–dataset combination had at most 20 trials, the Shapiro–Wilk test104 was used to assess normality (appropriate for smaller sample sizes), and Levene’s test105 determined whether variances were homogeneous across frameworks.
When both normality and homogeneity held, we proceeded with parametric statistical tests; otherwise, we used non-parametric methods. To improve normality and stabilize variance, we applied two transformations. First, for bounded \(F_1\) values in \([0,1]\), we used the arcsine square root transformation106, \(X' = \arcsin (\sqrt{X})\). Second, for right-skewed Training Time data, we employed a logarithmic transform107, \(X' = \log (1 + X)\).
If the transformed data satisfied normality and homoscedasticity, we ran a one-way ANOVA99 to test for overall significance (\(p<0.05\)). In cases where ANOVA was significant, Tukey’s Honestly Significant Difference (HSD)108 identified pairwise differences among frameworks. Where assumptions were violated, we relied on the Kruskal-Wallis test109, a rank-based alternative to ANOVA, and then used Dunn’s post-hoc procedure110 with Holm’s correction111 to account for multiple comparisons.
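The following sketch condenses the per-dataset procedure just described (variance-stabilizing transforms, assumption checks, and the parametric versus non-parametric branch) into a single function. Column names and the data layout are assumptions made for illustration; the released code remains the authoritative implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp
from statsmodels.stats.multicomp import pairwise_tukeyhsd

ALPHA = 0.05

def compare_frameworks_on_dataset(results: pd.DataFrame, metric: str = "f1"):
    """`results`: one row per trial with columns 'framework' and the metric
    column ('f1' in [0, 1] or 'time' in seconds) for a single dataset."""
    df = results.copy()
    if metric == "f1":            # arcsine square root for bounded scores
        df["value"] = np.arcsin(np.sqrt(df[metric].clip(0, 1)))
    else:                         # log(1 + x) for right-skewed training times
        df["value"] = np.log1p(df[metric])

    groups = [g["value"].to_numpy() for _, g in df.groupby("framework")]

    # Assumption checks: Shapiro-Wilk per framework, Levene across frameworks.
    normal = all(stats.shapiro(g).pvalue >= ALPHA for g in groups)
    homoscedastic = stats.levene(*groups).pvalue >= ALPHA

    if normal and homoscedastic:  # parametric branch
        omnibus = stats.f_oneway(*groups)
        posthoc = pairwise_tukeyhsd(df["value"], df["framework"], alpha=ALPHA)
    else:                         # non-parametric branch
        omnibus = stats.kruskal(*groups)
        posthoc = sp.posthoc_dunn(df, val_col="value", group_col="framework",
                                  p_adjust="holm")
    return omnibus, posthoc
```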
Results
Table 12 summarizes the statistical test results for \(F_1\) Score and Training Time in the per-dataset setting. Win–Loss matrices were constructed for each dataset to show how often one framework significantly outperformed another, and these were aggregated by scenario into Win–Loss stacked bar plots, as shown in Fig. 14. Additionally, the distribution of post-hoc \(p\)-values is illustrated via boxplots in Fig. 15, highlighting the strength of observed differences in both \(F_1\) Score and Training Time. Most datasets did not meet the criteria for parametric methods, so Kruskal–Wallis with Dunn’s post-hoc predominated. Nonetheless, a few cases (notably in some multilabel datasets) allowed ANOVA and Tukey’s HSD, demonstrating how differing dataset characteristics can necessitate different statistical procedures.
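As a rough illustration of how such a Win–Loss matrix can be assembled, the sketch below counts a win for the row framework over the column framework whenever the post-hoc \(p\)-value falls below \(\alpha\) and the row framework has the better mean metric. The exact bookkeeping in our scripts may differ; names are illustrative.

```python
import pandas as pd

def win_loss_matrix(mean_scores: pd.Series, pvals: pd.DataFrame,
                    alpha: float = 0.05, higher_is_better: bool = True):
    """`mean_scores`: framework -> mean metric on one dataset.
    `pvals`: symmetric matrix of post-hoc p-values (Tukey or Dunn)."""
    frameworks = list(mean_scores.index)
    wins = pd.DataFrame(0, index=frameworks, columns=frameworks)
    for a in frameworks:
        for b in frameworks:
            if a == b or pvals.loc[a, b] >= alpha:
                continue                    # no significant difference
            if higher_is_better:
                better = mean_scores[a] > mean_scores[b]
            else:                           # e.g., training time
                better = mean_scores[a] < mean_scores[b]
            if better:
                wins.loc[a, b] = 1          # row beats column on this dataset
    return wins
```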
Post-hoc p-value distributions for each scenario, illustrating pairwise significance among frameworks for \(F_1\) Score and Training Time. Colors indicate different significance levels, with green for strong significance, yellow for moderate, and red for weak. The red dashed line marks the \(\alpha\) threshold.
Statistical interpretation
Before addressing the research questions in detail, we provide an overview of how the per-dataset results should be interpreted. Each dataset is treated independently to uncover framework-specific strengths and weaknesses in different settings. Because transformations (arcsine or logarithmic) and test choices (parametric vs. non-parametric) can vary from one dataset to another, the following questions and findings capture scenario-by-scenario observations rather than an overall ranking.
RQ1: What statistical test was used for each dataset, and why?
For most datasets, normality and variance assumptions were violated, prompting the Kruskal–Wallis test over ANOVA. As shown in Tables 12a and b, all datasets yielded statistically significant differences in both \(F_1\) Score and Training Time. Extremely small \(p\)-values (e.g., \(3.1 \times 10^{-39}\)) point to pronounced framework performance disparities. When assumptions were met (e.g., dataset 41465), ANOVA + Tukey’s HSD was applied.
RQ2: Which frameworks perform best and worst for each dataset, in terms of \(F_1\) Score and Training Time?
-
Best/Worst in \(F_1\) Score: Binary tasks show that 4intelligence and AutoSklearn often secured the most wins, while LightAutoML ranked poorly (Fig. 14a). In multiclass settings, AutoSklearn dominated, with PyCaret close behind, but H2O and LightAutoML struggled (Fig. 14c). For multilabel (native), AutoSklearn again excelled, whereas AutoGluon recorded the most losses (Fig. 14e). In the powerset approach, 4intelligence led with more significant wins, while Lightwood and H2O lagged (Fig. 14g).
-
Best/Worst in Training Time: Lightwood and AutoGluon were quickest on binary datasets, while Auto-PyTorch and H2O were slowest (Fig. 14b). This pattern continued in multiclass tasks, with Lightwood and PyCaret being the most efficient and Auto-PyTorch/H2O again slower (Fig. 14d). For multilabel (native), AutoKeras proved to be the fastest, whereas FEDOT was consistently slow (Fig. 14f). Finally, Lightwood and AutoKeras also led in powerset tasks, whereas PyCaret and TPOT remained the slowest (Fig. 14h).
RQ3: How strong and consistent are the performance differences among frameworks across datasets?
-
\(F_1\) Score consistency: Across binary tasks, post-hoc \(p\)-values were largely below \(10^{-10}\), confirming robust differences (Fig. 15a). AutoSklearn and 4intelligence were repeatedly strong, while LightAutoML showed higher \(p\)-value medians (i.e., weaker performance). In multiclass tasks, AutoSklearn and PyCaret again featured the lowest \(p\)-values, whereas H2O and Lightwood indicated higher variance (Fig. 15c). Meanwhile, multilabel scenarios often showed broader spreads in significance; AutoSklearn stayed near the top, but AutoGluon had varying results (Figs. 15e, 15g).
-
Training time consistency: For binary tasks, Lightwood and AutoGluon repeatedly showed small medians, while Auto-PyTorch and H2O were slower with high variance (Fig. 15b). Similar patterns emerged in multiclass (Fig. 15d), whereas in multilabel (native) tasks, AutoKeras was the fastest with tight whiskers, and FEDOT was slow (Fig. 15f). Lastly, in powerset tasks, Lightwood and AutoKeras again ranked as the most efficient, while PyCaret and TPOT performed poorly (Fig. 15h).
These results show that certain frameworks (AutoSklearn, 4intelligence) frequently lead in \(F_1\) Score, whereas others (Lightwood, AutoGluon) commonly excel in Training Time. A few (H2O, LightAutoML) display less stable performance across datasets. Because these findings do not provide a definitive ranking over multiple tasks, the subsequent sections aggregate results to identify more general and robust patterns.
Across-datasets analysis
While the per-dataset analysis provided insights into individual datasets, we also assessed framework performance across multiple datasets within each scenario to derive broader rankings. Specifically, we aggregated results per dataset by computing the mean \(F_1\) Score for each framework (suitable for near-symmetric distributions) and the median Training Time (to mitigate skewness) before comparing frameworks across all datasets in a given scenario.
Before deciding on a statistical test, we checked the aggregated data for normality. When the number of datasets was fewer than 50, the Shapiro–Wilk test104 was used; for dataset counts of 50 or more, the Kolmogorov–Smirnov test112 was applied. If normality was violated (\(p < 0.05\)), we proceeded with the Friedman test113, a rank-based alternative that does not assume normality or sphericity. Conversely, if normality was upheld, we ran Mauchly’s test114 to examine sphericity. A satisfactory sphericity result (\(p \ge 0.05\)) permitted the use of Repeated-Measures ANOVA (RM-ANOVA)115, whereas sphericity violations led to applying the Greenhouse-Geisser correction116 for RM-ANOVA. If Mauchly’s test failed altogether (e.g., returning NaN due to a singular matrix or excessive missing data), the procedure automatically reverted to the Friedman test.
After determining the primary test (RM-ANOVA or Friedman), we selected a post-hoc method based on both the statistical approach and the number of pairwise comparisons. Under RM-ANOVA, we employed pairwise \(t\)-tests with Holm–Bonferroni correction111 if there were fewer than 10 comparisons or the Benjamini–Hochberg correction117 if 10 or more. For the Friedman test, the Conover test118 with Holm–Bonferroni correction was used if there were fewer than 10 comparisons (roughly five frameworks), whereas the Nemenyi test119 was applied when 10 or more comparisons existed, since Nemenyi does not require an additional correction.
Framework performance within each scenario was then expressed via an average rank. Under RM-ANOVA, each dataset assigned ranks to frameworks according to higher \(F_1\) Scores (better, i.e., lower rank) and lower Training Times (also better, i.e., lower rank). These ranks were then averaged across all datasets in that scenario. When the Friedman test was used, the same ranking approach was conducted according to the non-parametric procedure inherent to Friedman. In both cases, a lower final rank indicated superior overall performance. If the primary test revealed no statistically significant difference (\(p \ge 0.05\)), the frameworks were simply reported rather than ranked.
Table 13 summarizes this decision process, detailing the thresholds for normality, sphericity checks, and the respective primary/post-hoc tests that govern how frameworks were compared across multiple datasets.
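For transparency, the sketch below mirrors the decision path summarized in Table 13: a normality check, an optional sphericity check, the choice between RM-ANOVA and Friedman, and the corresponding post-hoc test. It is a simplified, illustrative reimplementation; the column names are ours, pingouin's pairwise_tests was called pairwise_ttests in older releases, and edge cases handled in the released code are omitted.

```python
import pandas as pd
import pingouin as pg
import scikit_posthocs as sp
from scipy import stats

ALPHA = 0.05

def across_datasets_test(agg: pd.DataFrame):
    """`agg`: long table with columns 'dataset', 'framework', 'value', where
    'value' is the per-dataset aggregate (mean F1 or median training time)."""
    n_datasets = agg["dataset"].nunique()
    n_frameworks = agg["framework"].nunique()
    n_pairs = n_frameworks * (n_frameworks - 1) // 2
    wide = agg.pivot(index="dataset", columns="framework", values="value")
    values = agg["value"]

    # Normality: Shapiro-Wilk below 50 datasets, Kolmogorov-Smirnov otherwise.
    if n_datasets < 50:
        normal = stats.shapiro(values).pvalue >= ALPHA
    else:
        normal = stats.kstest(values, "norm",
                              args=(values.mean(), values.std())).pvalue >= ALPHA

    use_friedman, gg_correction = not normal, False
    if normal:
        try:
            spher = pg.sphericity(agg, dv="value", within="framework",
                                  subject="dataset")
            gg_correction = not spher.spher   # violation -> Greenhouse-Geisser
            if pd.isna(spher.pval):           # Mauchly unusable -> Friedman
                use_friedman = True
        except Exception:
            use_friedman = True

    if use_friedman:
        omnibus = stats.friedmanchisquare(*[wide[c] for c in wide.columns])
        posthoc = (sp.posthoc_conover_friedman(wide, p_adjust="holm")
                   if n_pairs < 10 else sp.posthoc_nemenyi_friedman(wide))
    else:
        omnibus = pg.rm_anova(data=agg, dv="value", within="framework",
                              subject="dataset", correction=gg_correction)
        posthoc = pg.pairwise_tests(data=agg, dv="value", within="framework",
                                    subject="dataset",
                                    padjust="holm" if n_pairs < 10 else "fdr_bh")
    return omnibus, posthoc
```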
Results
Table 14 presents the statistical test results for \(F_1\) Score and Training Time when performance is aggregated across all datasets within each scenario. Each row corresponds to a scenario–metric combination, indicating which normality test was performed (Shapiro–Wilk or Kolmogorov–Smirnov), whether the data passed normality (\(p \ge 0.05\)), the outcome of Mauchly’s sphericity test if applicable, the final primary test chosen (Friedman or RM-ANOVA), and the post-hoc method. The resulting \(p\)-values confirm statistically significant differences among frameworks (\(p < 0.05\)) for every scenario and metric.
After applying the appropriate statistical test, we computed each framework’s average rank within each scenario. Figure 16 presents these rankings for \(F_1\) Score and Training Time across the binary, multiclass, multilabel (native), and multilabel (powerset) scenarios, offering a comparative view of framework consistency within each scenario.
Overall, these results indicate that framework choice significantly impacts both the \(F_1\) Score and Training Time across all scenarios (binary, multiclass, and both multilabel modes). The statistical tests (Friedman or RM-ANOVA, when applicable) provided strong evidence of these differences. The subsequent average-rank plots (Fig. 16a–h) highlight the top- and bottom-performing frameworks across aggregated datasets.
Average ranks for each framework in \(F_1\) Score (left subplots) and Training Time (right subplots) across the four scenarios (Binary, Multiclass, Multilabel Native, and Multilabel Powerset). Each pair of subplots corresponds to one scenario, illustrating which frameworks tend to rank higher (lower rank value) or lower (higher rank value) when results from multiple datasets are aggregated.
Statistical interpretation
Before addressing the research questions for the across-datasets analysis, it is useful to recall that these results aggregate each framework’s performance across all datasets belonging to a particular scenario. Thus, the mean \(F_1\) Score and median Training Time serve as single representative values per dataset, allowing a scenario-wide comparison that may differ from the conclusions drawn at the per-dataset level.
RQ4: Do framework performance differences remain statistically significant across datasets within each scenario?
Yes. As shown in Table 14, every scenario (binary, multiclass, and multilabel in both native and powerset forms) exhibits significant differences (\(p < 0.05\)) for both \(F_1\) Score and Training Time. The Multiclass scenario shows especially pronounced variations, particularly in Training Time (\(p = 2.4 \times 10^{-13}\)) and, to a slightly lesser extent, \(F_1\) Score (\(p = 3.8 \times 10^{-7}\)). Similarly, the Binary scenario registers considerable discrepancies (\(p = 5.7 \times 10^{-11}\) for Time, \(p = 4.8 \times 10^{-6}\) for \(F_1\) Score).
By contrast, both Multilabel (Native) and Multilabel (Powerset) scenarios show statistically significant yet smaller-effect differences (e.g., \(p \approx 10^{-3}\)). This implies that although frameworks remain distinguishable, their relative performance is somewhat more interchangeable in multilabel tasks than in binary or multiclass classification. Notably, Training Time divergences are consistently more extreme than those observed for \(F_1\) Score, suggesting framework choice impacts computational efficiency even more strongly than predictive performance.
RQ5: Which frameworks perform best and worst across datasets within each scenario?
The statistical test results (Table 14) and the average-rank plots (Fig. 16) confirm that certain frameworks dominate in \(F_1\) Score, while others excel in Training Time. Binary tasks see 4intelligence and AutoSklearn leading on \(F_1\) Score, whereas Lightwood and AutoKeras are fastest. Conversely, Lightwood and Auto-PyTorch trail in \(F_1\) Score, while TPOT and PyCaret are the slowest. In Multiclass tasks, AutoSklearn dominates \(F_1\) Score, followed by FEDOT and AutoKeras, whereas AutoGluon ranks last. For Training Time, AutoKeras is fastest, but FEDOT is consistently slower. Within Multilabel (Native), AutoSklearn stands out in \(F_1\) Score, while Lightwood leads in speed, and in Multilabel (Powerset), AutoSklearn still achieves the top \(F_1\) Score rank, but LightAutoML lags behind. Meanwhile, Lightwood repeats its speed advantage, with Auto-PyTorch and TPOT trailing behind.
RQ6: How do the results from the across-datasets analysis compare to the per-dataset analysis?
Many key patterns remain consistent, such as AutoSklearn’s strong \(F_1\) Score and Lightwood’s speed advantage, yet ranking disparities emerge once multiple datasets are aggregated. For instance, 4intelligence excels in binary tasks on a per-dataset basis but drops in the aggregated standings, indicating potential dataset-specific dependencies. Conversely, H2O improves upon aggregation despite weaker single-dataset results, suggesting scenario-specific variability. These shifts highlight how some frameworks perform well under certain conditions but less so when many datasets are considered simultaneously.
Table 15 summarizes these comparisons by listing the best and worst frameworks in both the per-dataset and across-datasets analyses. While aggregation provides a broader perspective within each scenario, some frameworks remain consistently strong (shown in Bold), while others persist as weaker performers (shown in Italic). However, robust frameworks in one setting may show inconsistencies in another, reinforcing the importance of evaluating performance across different aggregation levels.
The across-datasets analysis offers valuable insights, but it does not establish cross-scenario robustness. The upcoming all-datasets analysis (“All-datasets analysis” section) further consolidates every task and both metrics, aiming to identify frameworks that balance predictive performance and computational efficiency across all classification types.
All-datasets analysis
To derive an overall ranking of frameworks across all classification scenarios, we combined every dataset—binary, multiclass, and both multilabel modes—into one unified analysis. Because each dataset-framework combination constitutes a repeated measure, we first tested sphericity. As shown in Table 16, the sphericity assumptions failed, necessitating a non-parametric approach rather than RM-ANOVA.
Following the procedure used in the across-datasets analyses (“Across-datasets analysis” section), we employed the Friedman test to assess whether frameworks differed significantly in both \(F_1\) Score and Training Time when evaluated over all datasets. Upon finding significant results (\(p<0.05\)), pairwise comparisons were conducted with Nemenyi’s post-hoc test to pinpoint which frameworks were meaningfully distinct. We then calculated each framework’s average rank (from the Friedman procedure) to form a final, comprehensive ranking encompassing all tasks.
To illustrate these results, we generated Critical Difference (CD) diagrams for both \(F_1\) Score and Training Time (Figs. 17a,b). These diagrams concisely show how frameworks cluster together (or stand apart) according to average ranks. Additionally, we plotted a dual-rank scatter (Fig. 19), which shows each framework’s relative position regarding predictive performance versus computational efficiency. This visualization helps readers quickly identify which frameworks strike the best trade-off between high \(F_1\) Scores and low Training Times when every dataset, regardless of classification type, is considered.
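A compact sketch of this final stage is shown below. It assumes the pooled results are already arranged as a dataset-by-framework table and uses scikit-posthocs for the Nemenyi test; names are illustrative rather than taken from our code.

```python
import pandas as pd
import scikit_posthocs as sp
from scipy import stats

def global_ranking(wide: pd.DataFrame, higher_is_better: bool = True):
    """`wide`: one row per dataset, one column per framework, holding the
    aggregated metric (mean F1 or median training time)."""
    stat, pval = stats.friedmanchisquare(*[wide[c] for c in wide.columns])
    nemenyi = sp.posthoc_nemenyi_friedman(wide)    # pairwise p-value matrix
    # Rank 1 goes to the best framework on each dataset, then average per column.
    avg_rank = wide.rank(axis=1, ascending=not higher_is_better).mean()
    return pval, nemenyi, avg_rank.sort_values()
```

Recent scikit-posthocs releases also provide a critical_difference_diagram helper that, given the average ranks and the Nemenyi \(p\)-value matrix, renders plots similar to Figs. 17a,b.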
Results
Table 16 summarizes the statistical outcomes. The Shapiro-Wilk test for normality indicated a pronounced deviation (\(p \ll 0.05\)), and Mauchly’s test confirmed sphericity violations (\(p=0\)), leading us to the Friedman test for both metrics. Post-hoc Nemenyi analyses showed clear differences among frameworks (\(p=1.6 \times 10^{-5}\) for \(F_1\) Score and \(p=1.8 \times 10^{-29}\) for Training Time), indicating that the choice of framework remains highly impactful even when all tasks are considered jointly.
Figures 17a and 17b display the Critical Difference diagrams for \(F_1\) Score and Training Time, respectively. Both diagrams illustrate that certain frameworks group together statistically, whereas others sit far apart, reflecting greater relative differences.
To complement the statistical tests and provide a finer-grained perspective, we visualized the raw pairwise win–loss counts for each framework across all datasets (Fig. 18). Each cell in the matrix shows how often the framework in the row outperformed the one in the column in terms of \(F_1\) Score (Fig. 18a) and Training Time (Fig. 18b). Marginal bar plots summarize the total number of wins (right) and losses (top), offering a clear snapshot of overall competitiveness. This matrix highlights dominant frameworks as well as those that tend to underperform relative to others. By directly counting wins, this view complements the rank-based CD diagrams with a more interpretable metric of comparative dominance.
Finally, Fig. 19 shows a dual-rank scatter plot in which each framework’s rank on \(F_1\) Score (horizontal axis) is plotted against its rank on Training Time (vertical axis). Frameworks located toward the lower-left corner attain consistently high \(F_1\) Score values and low training durations, thus striking an advantageous balance.
With the overall framework rankings established, we now examine their statistical significance to determine whether the observed differences are robust and meaningful.
Statistical interpretation
RQ7: Do framework performance and computational efficiency differences remain statistically significant when considering all datasets together?
When all datasets—across binary, multiclass, and multilabel tasks—are pooled into a single unified analysis, it is essential to determine whether frameworks still exhibit meaningful performance distinctions in both predictive performance (\(F_1\) Score) and computational efficiency (Training Time). We, therefore, conducted normality checks (Shapiro–Wilk) and tested for sphericity (Mauchly’s test). As shown in Table 16, both metrics deviated strongly from normality (\(p \approx 10^{-21}\) for \(F_1\) Score, \(10^{-22}\) for Time), and sphericity violations (\(p=0\)) confirmed that neither a parametric nor sphericity-based approach (e.g., RM-ANOVA) was appropriate.
Accordingly, we used the Friedman test. The resulting \(p\)-values (\(1.6 \times 10^{-5}\) for \(F_1\) Score and \(1.8 \times 10^{-29}\) for Time) indicate that framework selection remains highly significant even when every dataset is considered simultaneously. A post-hoc Nemenyi procedure further revealed substantial pairwise differences between certain frameworks. These findings expand upon those from “Across-datasets analysis” section, implying that the impact of framework choice on predictive performance and computational efficiency is not confined to individual tasks or aggregated scenarios but persists across all classification settings combined.
RQ8: Which framework achieves the best performance across all scenarios in terms of \(F_1\) Score?
We next examined the CD diagram for \(F_1\) Score (Fig. 17a), which aggregates each framework’s average rank (from the Friedman test) while also grouping statistically equivalent frameworks. The diagram shows that AutoSklearn obtained the best average rank (\(\approx 2.4\)), signaling excellent predictive performance across the entire dataset corpus. However, post-hoc analysis indicates that AutoSklearn is not strictly better than several close contenders, including 4intelligence (\(\approx 4.6\)), PyCaret (\(\approx 4.7\)), and NaiveAutoML (\(\approx 5.5\)). Moreover, the aggregated win–loss tallies (Fig. 18a) confirm that AutoSklearn achieved 208 wins against only 36 losses in head-to-head \(F_1\) comparisons, outpacing 4intelligence (186 wins/54 losses) and PyCaret (140/56), and underscoring its consistent dominance.
Below this top group stands a broader “middle tier,” comprising frameworks like TPOT, FLAML, GAMA, and AutoGluon, each showing competent \(F_1\) Scores though lacking the small rank values of the top cluster. Meanwhile, LightAutoML and Lightwood share the worst positions (rank \(\approx 13.0\)), indicating that these frameworks consistently underperformed regarding raw predictive performance. Thus, although AutoSklearn emerges as a top contender, multiple others (4intelligence, PyCaret, NaiveAutoML) can achieve similar predictive performance, emphasizing that training speed and other practical considerations may further influence framework choice.
RQ9: Which framework is the most efficient across all scenarios in terms of Training Time?
We inspected the Training Time CD diagram (Fig. 17b) to identify the fastest framework overall. In stark contrast to the more crowded ranks for \(F_1\) Score, Lightwood stands out as the unequivocally fastest (rank \(\approx 1.3\)), with statistical evidence (Friedman + Nemenyi) confirming its significant lead. Clustering behind it, AutoGluon and AutoKeras (both \(\approx 3.1\)) form the second-best computational efficiency group, while a mid-tier segment includes mljar-supervised (\(\approx 3.9\)) and PyCaret (\(\approx 5.5\)). Another slightly slower band (LightAutoML, NaiveAutoML, GAMA) hovers around ranks 6–8, significantly behind the leaders but not as slow as the bottom group. The pairwise speed heatmap (Fig. 18b) further illustrates that Lightwood recorded 370 wins against just 40 losses in training-time comparisons, vastly exceeding AutoGluon’s 302/24 and AutoKeras’s 314/32, thereby quantifying its clear efficiency edge.
Finally, AutoSklearn (\(\approx 8.2\)), H2O (\(\approx 12\)), TPOT (\(\approx 13\)), and Auto-PyTorch (\(\approx 13\)) fall at the trailing end. These outcomes align with prior observations that Lightwood consistently yields minimal training times, whereas frameworks like TPOT or Auto-PyTorch exhibit much higher computational overhead. From a purely efficiency-driven standpoint, Lightwood is the clear winner.
RQ10: Which framework performs best across all scenarios when considering both \(F_1\) Score and Training Time?
While the separate CD diagrams for \(F_1\) Score (Fig. 17a) and Training Time (Fig. 18a) clarify which frameworks excel in predictive performance versus computational efficiency, practical deployment often requires a framework that balances both. To visualize this trade-off, Fig. 19 plots each framework’s final rank on \(F_1\) Score (x-axis) against its rank on Training Time (y-axis). The lower-left quadrant denotes frameworks that are simultaneously strong in predictive performance and cost-effective in training.
From this scatter, AutoGluon emerges as the most balanced option—outperforming peers with 102 wins vs. 54 losses in \(F_1\) and 302 vs. 24 in Training Time—while AutoSklearn and PyCaret trade speed for slightly higher \(F_1\), and Lightwood trades accuracy for speed. At the other end, frameworks like Auto-PyTorch, TPOT, EvalML, and H2O lag on both axes, and LightAutoML’s training overhead diminishes its overall standing. Thus, AutoGluon offers the best compromise for general use, though specific use-cases might still favor the speed of Lightwood or the raw accuracy of AutoSklearn.
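Such a trade-off view is easy to recreate from the two rank vectors. In the sketch below, the rank values approximate those quoted in this section, except AutoGluon’s \(F_1\) rank, which is a placeholder since only its tier is reported; the plot is illustrative rather than a reproduction of Fig. 19.

```python
# Illustrative rank-vs-rank trade-off view: lower is better on both axes,
# so balanced frameworks sit in the lower-left corner. Rank values below
# approximate those quoted in the text; AutoGluon's F1 rank is a placeholder.
import matplotlib.pyplot as plt
import pandas as pd

f1_rank = pd.Series({"AutoSklearn": 2.4, "PyCaret": 4.7,
                     "AutoGluon": 6.0, "Lightwood": 13.0})
time_rank = pd.Series({"AutoSklearn": 8.2, "PyCaret": 5.5,
                       "AutoGluon": 3.1, "Lightwood": 1.3})

fig, ax = plt.subplots()
ax.scatter(f1_rank, time_rank)
for name in f1_rank.index:
    ax.annotate(name, (f1_rank[name], time_rank[name]))
ax.set_xlabel("F1 Score rank (lower is better)")
ax.set_ylabel("Training Time rank (lower is better)")
ax.set_title("Accuracy-speed trade-off")
plt.show()
```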
Summary of statistical findings
The statistical analyses conducted in this study—covering per-dataset, across-datasets, and all-datasets comparisons—consistently revealed strong, statistically significant distinctions among AutoML frameworks in both \(F_1\) Score (predictive performance) and Training Time (computational efficiency). In simpler tasks such as binary and multiclass classification, these differences tended to be especially pronounced, with Training Time variations often exceeding the gaps observed in predictive performance. This underlines the considerable disparities in computational efficiency across frameworks.
In the per-dataset and across-datasets analyses, certain frameworks emerged as clear leaders in either predictive performance or computational efficiency. AutoSklearn regularly appeared at or near the top in \(F_1\) Score rankings, and Lightwood consistently delivered the fastest Training Times, albeit at the expense of lower predictive performance. AutoGluon and PyCaret generally struck a balance, maintaining relatively strong ranks on both metrics. Other frameworks, including Auto-PyTorch and LightAutoML, struggled in multiple settings, suggesting more limited applicability for those seeking either top-tier performance or fast training.
When pooling all datasets, the head-to-head tallies reinforced these findings: AutoSklearn recorded 208 wins versus 36 losses in \(F_1\) comparisons, Lightwood logged 370 wins against 40 losses in speed, and AutoGluon alone achieved net positive records on both axes (102–54 in \(F_1\), 302–24 in Time).
Once all datasets were pooled into a single analysis, the results confirmed that AutoSklearn retained its role as one of the most accurate solutions, though it fell behind in speed. Lightwood again stood out for its computational efficiency but continued to post weaker predictive outcomes. AutoGluon, meanwhile, rose to the forefront by combining solid \(F_1\) Scores with comparatively low Training Times, making it the most well-rounded choice overall. Less competitive frameworks included Auto-PyTorch, TPOT, and EvalML, which frequently incurred high computational costs while also underperforming on predictive performance.
Taken together, these findings offer practical guidance for selecting an AutoML framework. Users with a strict focus on predictive performance may lean toward AutoSklearn or PyCaret, whereas those needing rapid model training might adopt Lightwood or AutoKeras. For those seeking a balanced compromise, AutoGluon proved to be the strongest general-purpose option, ranking highly in both predictive performance and computational efficiency across the full range of classification tasks.
Threats to validity
No benchmarking study is immune to methodological or contextual limitations, particularly when comparing complex systems like AutoML frameworks. In this section, we outline the primary factors that may affect the validity, reproducibility, or generalizability of our findings. We separate these into two categories: (1) limitations stemming from the design and scope of the study itself, and (2) emergent biases and behavioral patterns observed during evaluation.
Scope and design limitations
-
The evaluation was conducted using specific framework versions available at the time of the study. Because AutoML tools evolve rapidly, results may differ for future releases. Version tracking is essential for reproducibility.
-
Although the selected datasets span binary, multiclass, and multilabel tasks, their domain coverage remains limited. Important areas such as healthcare, finance, or cybersecurity are underrepresented. Future studies could incorporate more specialized benchmarks to broaden generalizability.
-
The runtime constraint of 5 min per tool-task pair was chosen to reflect practical deployment settings. However, this may have limited the performance of frameworks designed for long-running or iterative optimization.
-
The use of default configurations ensured fair and comparable conditions across tools, but may have disadvantaged those that benefit from tuning or expert guidance. Custom parameterization could yield different performance outcomes.
-
Our primary evaluation metrics were weighted \(F_1\) and Training Time (a minimal computation sketch follows this list). While widely used, these do not capture complementary concerns such as fairness, interpretability, or energy efficiency.
-
A small number of tool–dataset combinations failed due to compatibility or stability issues, including timeouts, memory errors, or unsupported feature types. These partial failures may skew aggregate metrics if not properly accounted for.
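As noted in the metrics item above, the two headline measurements reduce to a weighted \(F_1\) on a held-out split and a wall-clock timer around training. The sketch below uses a scikit-learn classifier as a stand-in for an AutoML tool; the model, dataset, and split are illustrative, and real tools would additionally receive the 5-minute budget through their own configuration options.

```python
# Sketch of the per-run measurements: weighted F1 on a held-out split and
# wall-clock training time. RandomForestClassifier is only a stand-in for
# an AutoML tool, and the dataset/split are illustrative.
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42)
start = time.perf_counter()
model.fit(X_train, y_train)            # an AutoML tool would search pipelines here
training_time = time.perf_counter() - start

weighted_f1 = f1_score(y_test, model.predict(X_test), average="weighted")
print(f"weighted F1 = {weighted_f1:.3f}, training time = {training_time:.2f} s")
```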
These limitations reflect deliberate design choices made to ensure fairness and reproducibility across tools. While necessary for methodological clarity, they may constrain how broadly the findings apply, especially in highly specialized or dynamic settings.
Systematic biases and runtime constraints
In addition to these structural constraints, the evaluation revealed recurring behavioral patterns that arise from the internal design of AutoML frameworks. These are not specific to individual tools but rather reflect systemic tendencies in how AutoML methods handle data, allocate resources, and optimize performance under pressure.
-
Many tools made strong preprocessing assumptions, expecting numeric inputs or fully cleaned data. Frameworks that lacked robust handling of categorical variables or missing values frequently failed or underperformed when those conditions were not met.
-
Differences in optimization strategy created trade-offs between accuracy and speed. Exhaustive search methods (e.g., Bayesian or genetic algorithms) often struggled under tight runtime constraints, while lightweight pipelines tended to favor simplicity at the expense of model expressiveness.
-
The level of native support for multilabel classification varied widely. Several frameworks relied on label-powerset transformations, which introduced class imbalance and label sparsity—particularly problematic in high-cardinality settings (see the sketch after this list).
-
We observed notable hardware sensitivity, especially for tools optimized around GPU acceleration. These frameworks often experienced performance drops or memory bottlenecks when restricted to CPU-only environments.
-
There were clear signs of modality bias across tools. Frameworks tailored for image or text classification frequently generalized poorly to tabular problems, and vice versa, due to pipeline assumptions and architecture defaults.
-
Despite using 20 replicate runs with different seeds, we observed significant run-to-run variability in certain frameworks. Methods relying on stochastic optimization or non-deterministic pipeline generation exhibited inconsistent results even under controlled conditions.
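To make the label-powerset issue concrete, the sketch below applies the transformation to a tiny synthetic indicator matrix; libraries such as scikit-multilearn provide an equivalent `LabelPowerset` wrapper, but the mapping itself is simply one class per unique label combination.

```python
# Label-powerset transformation on a synthetic multilabel indicator matrix:
# each unique combination of labels becomes a single multiclass target.
# The number of resulting classes grows with label cardinality, which is
# the source of the imbalance and sparsity problems noted above.
import numpy as np

Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])  # n_samples x n_labels

combos, y_powerset = np.unique(Y, axis=0, return_inverse=True)
print(y_powerset)   # one multiclass label per sample, e.g. [1 0 1 2]
print(len(combos))  # number of powerset classes (3 here)
```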
To mitigate these risks, we recommend selecting frameworks based on alignment with the task type and resource environment, validating preprocessing compatibility in advance, and repeating runs to assess stability—particularly for methods using stochastic or non-deterministic algorithms.
Understanding these biases is essential not only for interpreting our results but also for improving the design of future benchmarks. As AutoML continues to evolve, careful attention to tool behavior under real-world constraints will remain critical for producing evaluations that are both fair and actionable.
Conclusion
This study provides a comprehensive evaluation of AutoML frameworks across binary, multiclass, and multilabel classification tasks, highlighting their strengths, limitations, and trade-offs in predictive performance and computational efficiency. By analyzing datasets of varying complexities, we offer practical insights into framework selection for real-world applications. We also benchmarked our results against four representative prior studies to contextualize our findings within the existing AutoML literature.
Our findings indicate that AutoSklearn and PyCaret achieved the highest \(F_1\) scores, albeit with longer training times, while Lightwood and AutoKeras prioritized speed at the expense of predictive performance. A structured summary of these trade-offs is provided in Table 11.
To ensure that these differences were not due to random variation, we conducted a multi-level statistical analysis. The results confirmed significant performance disparities across all classification tasks. In binary classification, frameworks that perform exhaustive HPO—such as AutoSklearn and TPOT—consistently achieved the highest \(F_1\) scores but required longer runtimes. Meanwhile, faster frameworks like Lightwood and AutoGluon sacrificed some predictive performance for improved efficiency. A similar trend was observed in multiclass classification, where AutoSklearn and PyCaret achieved the best accuracy, while H2O and LightAutoML exhibited greater variability and, in some cases, slower convergence.
Multilabel classification posed a unique challenge, as few frameworks supported it natively. Among those that did, AutoSklearn delivered the strongest accuracy, whereas AutoKeras prioritized speed at the cost of predictive performance. When using powerset transformations, AutoSklearn remained the most effective, but frameworks such as LightAutoML, Lightwood, and GAMA struggled significantly with sparse label spaces. The all-datasets analysis reinforced these findings, showing that AutoSklearn consistently led in \(F_1\) scores but frequently consumed the full training budget, making it less suitable for time-sensitive applications. In contrast, Lightwood was the fastest but least accurate, while AutoGluon emerged as the most well-rounded framework, striking a strong balance between predictive performance and computational efficiency.
Future research should explore meta AutoML frameworks that dynamically optimize framework selection across tasks. Expanding the dataset pool to include domain-specific challenges and integrating techniques like reinforcement or meta-learning could further enhance AutoML adaptability. Ethical considerations, such as bias mitigation and interpretability, remain crucial as AutoML adoption grows.
By addressing these challenges—particularly the need for robust multilabel support and fair, interpretable models—AutoML frameworks can fully realize their potential as universal AI tools. While AutoGluon emerged as the strongest general-purpose solution, users must consider trade-offs between predictive performance, computational efficiency, and framework adaptability based on their specific application needs. All code and analysis scripts are publicly available to ensure full reproducibility and to support future advancements in AutoML.
Data availability
Dataset 31—credit-g63 is available at OpenML (https://www.openml.org/search?type=data&id=31). Dataset 37—diabetes64 is available at OpenML (https://www.openml.org/search?type=data&id=37). Dataset 44—spambase65 is available at OpenML (https://www.openml.org/search?type=data&id=44). Dataset 1462—bank-note-authentication66 is available at OpenML (https://www.openml.org/search?type=data&id=1462). Dataset 1479—hill-valley67 is available at OpenML (https://www.openml.org/search?type=data&id=1479). Dataset 1510—wdbc68 is available at OpenML (https://www.openml.org/search?type=data&id=1510). Dataset 40945—titanic69 is available at OpenML (https://www.openml.org/search?type=data&id=40945). Dataset 23—contraceptive-method-choice70 is available at OpenML (https://www.openml.org/search?type=data&id=23). Dataset 36—segment71 is available at OpenML (https://www.openml.org/search?type=data&id=36). Dataset 54—vehicle72 is available at OpenML (https://www.openml.org/search?type=data&id=54). Dataset 181—yeast73 is available at OpenML (https://www.openml.org/search?type=data&id=181). Dataset 1466—cardiotocography74 is available at OpenML (https://www.openml.org/search?type=data&id=1466). Dataset 40691—wine-quality75 is available at OpenML (https://www.openml.org/search?type=data&id=40691). Dataset 40975—car76 is available at OpenML (https://www.openml.org/search?type=data&id=40975). Dataset 285—flags77 is available at OpenML (https://www.openml.org/search?type=data&id=285). Dataset 41464—birds78 is available at OpenML (https://www.openml.org/search?type=data&id=41464). Dataset 41465—emotions79 is available at OpenML (https://www.openml.org/search?type=data&id=41465). Dataset 41468—image80 is available at OpenML (https://www.openml.org/search?type=data&id=41468). Dataset 41470—reuters81 is available at OpenML (https://www.openml.org/search?type=data&id=41470). Dataset 41471—scene82 is available at OpenML (https://www.openml.org/search?type=data&id=41471). Dataset 41473—yeast83 is available at OpenML (https://www.openml.org/search?type=data&id=41473).
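Each dataset above can be retrieved programmatically by its OpenML ID, for example via scikit-learn’s `fetch_openml` (the `openml` Python package is an equivalent alternative); the snippet below is a minimal illustration.

```python
# Fetch any of the listed datasets by its OpenML ID; 31 (credit-g) is used
# here as an example -- replace data_id with any ID from the list above.
from sklearn.datasets import fetch_openml

bunch = fetch_openml(data_id=31, as_frame=True)
X, y = bunch.data, bunch.target
print(X.shape)
print(y.value_counts())
```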
Code availability
All the code and instructions for reproducing the experiments and their outcomes are available at the public GitHub repository (https://github.com/marcelovca90/auto-ml-evaluation).
References
He, X., Zhao, K. & Chu, X. AutoML: A survey of the state-of-the-art. Knowl. Based Syst. 212, 106622. https://doi.org/10.1016/j.knosys.2020.106622 (2021).
Truong, A. et al. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. Proceedings—International Conference on Tools with Artificial Intelligence, ICTAI 1471–1479. https://doi.org/10.1109/ICTAI.2019.00209 (2019).
Wever, M., Tornede, A., Mohr, F. & Hullermeier, E. AutoML for multi-label classification: Overview and empirical evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3037–3054. https://doi.org/10.1109/TPAMI.2021.3051276 (2021).
Gijsbers, P. et al. AMLB: An AutoML benchmark. J. Mach. Learn. Res. 25, 1–65 (2024).
Choi, R. Y., Coyner, A. S., Kalpathy-Cramer, J., Chiang, M. F. & Peter Campbell, J. Introduction to machine learning, neural networks, and deep learning. Transl. Vis. Sci. Technol. 9, 14. https://doi.org/10.1167/tvst.9.2.14 (2020).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297. https://doi.org/10.1023/A:1022627411411 (1995).
McCullagh, P. Generalized Linear Models (Routledge, 2019).
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. Classification and Regression Trees 1st edn. (Chapman & Hall/CRC, 1984).
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27 (1967).
McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
Tsoumakas, G. & Katakis, I. Multi-label classification: An overview. Int. J. Data Warehousing Mining 3, 1–13 (2007).
Read, J., Pfahringer, B., Holmes, G. & Frank, E. Classifier chains for multi-label classification. Mach. Learn. 85, 333–359. https://doi.org/10.1007/s10994-011-5256-5 (2011).
Tsoumakas, G., Katakis, I. & Vlahavas, I. Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089 (2010).
Zhang, M. L. & Zhou, Z. H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18, 1338–1351. https://doi.org/10.1109/TKDE.2006.162 (2006).
Bishop, C. M. Pattern Recognition and Machine Learning 1st edn. (Springer, 2006).
Brownlee, J. Hyperparameter Optimization With Random Search and Grid Search. https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/ (2020).
Wu, J. et al. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 17, 26–40 (2019).
Yang, L. & Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415, 295–316. https://doi.org/10.1016/j.neucom.2020.07.061 (2020).
Loka, N. et al. Bi-objective Bayesian optimization of engineering problems with cheap and expensive cost functions. Eng. Comput. 1, 1–11. https://doi.org/10.1007/s00366-021-01573-7 (2022).
Nguyen, H. P., Liu, J. & Zio, E. A long-term prediction approach based on long short-term memory neural networks with automatic parameter optimization by Tree-structured Parzen estimator and applied to time-series data of NPP steam generators. Appl. Soft Comput. J. 89, 106116. https://doi.org/10.1016/j.asoc.2020.106116 (2020).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Chen, B. A Practical Introduction to Grid Search, Random Search, and Bayes Search (2021).
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18, 1–52 (2018).
Feurer, M. & Hutter, F. Hyperparameter optimization. In Automated Machine Learning 3–33. https://doi.org/10.1007/978-3-030-05318-5_1 (Springer, 2019).
Shaheen, F., Verma, B. & Asafuddoula, M. Impact of automatic feature extraction in deep learning architecture. In 2016 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2016 1–8. https://doi.org/10.1109/DICTA.2016.7797053 (IEEE, 2016).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536. https://doi.org/10.1038/323533a0 (1986).
Elsken, T., Metzen, J. H. & Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 20, 1 (2019).
Blum, C. et al. Evolutionary optimization. In Variants of Evolutionary Algorithms for Real-World Applications 1–29. https://doi.org/10.1007/978-3-642-23424-8_1 (Springer, 2012).
Li, L. & Talwalkar, A. Random search and reproducibility for neural architecture search. Proc. Mach. Learn. Res. 115, 367–377 (2020).
Pham, H., Guan, M. Y., Zoph, B., Le, Q. V. & Dean, J. Efficient Neural Architecture Search via Parameter Sharing. http://arxiv.org/abs/1802.03268 (2018).
Liu, H., Simonyan, K. & Yang, Y. DARTS: Differentiable Architecture Search. http://arxiv.org/abs/1806.09055 (2019).
Zöller, M. A. & Huber, M. F. Benchmark and survey of automated machine learning frameworks. J. Artif. Intell. Res. 70, 409–472. https://doi.org/10.1613/JAIR.1.11854 (2021).
Yao, Q. et al. Taking human out of learning applications: A survey on automated machine learning. CoRR (2018).
4intelligence. 4intelligence: Plataforma de Modelagem Automática (2024).
Qi, W., Xu, C. & Xu, X. AutoGluon: A revolutionary framework for landslide hazard analysis. Nat. Hazards Res. 1, 103–108. https://doi.org/10.1016/j.nhres.2021.07.002 (2021).
Jin, H., Song, Q. & Hu, X. Auto-keras: Efficient neural architecture search with network morphism. http://arxiv.org/abs/1806.10282 (2018).
Zimmer, L., Lindauer, M. & Hutter, F. Auto-Pytorch: Multi-fidelity metalearning for efficient and robust AutoDL. Pattern Anal. Mach. Intell. 43, 3079–3090. https://doi.org/10.1109/TPAMI.2021.3067763 (2021).
Feurer, M. et al. Efficient and robust automated machine learning. Adv. Neural Inf. Process. Syst. 2962–2970. https://doi.org/10.1007/978-3-030-05318-5_6 (2015).
Alteryx. EvalML: AutoML Library Which Builds, Optimizes, and Evaluates Machine Learning Pipelines Using Domain-Specific Objective Functions. https://github.com/alteryx/evalml (2021).
Nikitin, N. O. et al. Automated evolutionary approach for the design of composite machine learning pipelines. Futur. Gener. Comput. Syst. 127, 109–125 (2022).
Wang, C., Wu, Q., Weimer, M. & Zhu, E. FLAML: A fast and lightweight AutoML library. In Proceedings of Machine Learning and Systems (eds Smola, A. et al.), vol. 3, 434–447. http://arxiv.org/abs/1911.04706 (2019).
Gijsbers, P. & Vanschoren, J. GAMA: Genetic automated machine learning assistant. J. Open Source Softw. 4, 1132. https://doi.org/10.21105/joss.01132 (2019).
LeDell, E. & Poirier, S. H2O AutoML: Scalable automatic machine learning. In 7th ICML Workshop on Automated Machine Learning, vol. 2020, 1–16 (2020).
Vakhrushev, A. et al. LightAutoML: AutoML solution for a large financial services ecosystem. http://arxiv.org/abs/2109.01528 (2021).
MindsDB. Lightwood: Pytorch-Based Automated Machine Learning. https://github.com/mindsdb/lightwood (2020).
Płoński, P. mljar-Supervised: An Open-Source Automated Machine Learning Framework for Tabular Data. https://github.com/mljar/mljar-supervised (2020).
Mohr, F. & Wever, M. Naive automated machine learning. Mach. Learn. 112, 1131–1170 (2022).
Ali, M. PyCaret: An Open Source, Low-Code Machine Learning Library in Python (2020).
Olson, R. S., Bartley, N., Urbanowicz, R. J. & Moore, J. H. Evaluation of a tree-based pipeline optimization tool for automating data science. In GECCO 2016—Proceedings of the 2016 Genetic and Evolutionary Computation Conference, GECCO ’16 485–492. https://doi.org/10.1145/2908812.2908918 (ACM, 2016).
Beduin, I. R. O. Detecção da Covid-19 em imagens de raio-x: construindo um novo modelo de aprendizado profundo utilizando AutoML. https://bdm.unb.br/handle/10483/30574 (2021).
Mustafa, A. & Azghadi, M. R. Automated machine learning for healthcare and clinical notes analysis. Computers 10, 24. https://doi.org/10.3390/computers10020024 (2021).
Bahri, M., Salutari, F., Putina, A. & Sozio, M. AutoML: State of the art with a focus on anomaly detection, challenges, and research directions. Int. J. Data Sci. Anal. 14, 113–126. https://doi.org/10.1007/s41060-022-00309-0 (2022).
Lin, C. et al. Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). http://arxiv.org/abs/1905.07373 (2019).
Angarita-Zapata, J. S., Masegosa, A. D. & Triguero, I. General-purpose automated machine learning for transportation: A case study of auto-sklearn for traffic forecasting. In Communications in Computer and Information Science, vol. 1238 CCIS, 728–744. https://doi.org/10.1007/978-3-030-50143-3_57 (2020).
van Eeden, W. A. et al. Predicting the 9-year course of mood and anxiety disorders with automated machine learning: A comparison between auto-sklearn, naïve Bayes classifier, and traditional logistic regression. Psychiatry Res. 299, 113823. https://doi.org/10.1016/j.psychres.2021.113823 (2021).
Ferreira, L., Pilastri, A., Martins, C. M., Pires, P. M. & Cortez, P. A comparison of AutoML tools for machine learning, deep learning and XGBoost. In Proceedings of the International Joint Conference on Neural Networks, vol. 2021–July, 1–8. https://doi.org/10.1109/IJCNN52387.2021.9534091 (IEEE, 2021).
Romero, R. A. et al. Benchmarking automl frameworks for disease prediction using medical claims. BioData Mining 15, 15. https://doi.org/10.1186/s13040-022-00300-2 (2022).
Del Valle, A. M., Mantovani, R. G. & Cerri, R. A systematic literature review on AutoML for multi-target learning tasks. Artif. Intell. Rev. 56, 2013–2052. https://doi.org/10.1007/s10462-023-10569-2 (2023).
Neverov, E. A., Viksnin, I. I. & Chuprov, S. S. The research of AutoML methods in the task of wave data classification. In Proceedings of 2023 26th International Conference on Soft Computing and Measurements, SCM 2023 156–158. https://doi.org/10.1109/SCM58628.2023.10159058 (IEEE, 2023).
Salehin, I. et al. AutoML: A systematic review on automated machine learning with neural architecture search. J. Inf. Intell. 2, 52–81. https://doi.org/10.1016/j.jiixd.2023.10.002 (2024).
Eldeeb, H., Maher, M., Elshawi, R. & Sakr, S. Automlbench: A comprehensive experimental evaluation of automated machine learning frameworks. Expert Syst. Appl. 243, 122877. https://doi.org/10.1016/j.eswa.2023.122877 (2024).
Ho, T. K. & Basu, M. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 24, 289–300 (2002).
Hofmann, H. UCI Machine Learning Repository: Statlog (German Credit Data) Data Set. UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77 (1994).
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. & Johannes, R. S. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings—Annual Symposium on Computer Applications in Medical Care 261–265 (1988).
Hopkins, M., Reeber, E., Forman, G. & Suermondt, J. UCI Machine Learning Repository: Spambase Data Set. https://doi.org/10.24432/C53G6X (1999).
Lohweg, V. et al. Banknote authentication with mobile devices. In Media Watermarking, Security, and Forensics 2013, vol. 8665, 866507. https://doi.org/10.1117/12.2001444 (2013).
Lee, G. & Oppacher, F. UCI Machine Learning Repository: Hill-Valley Data Set. https://doi.org/10.24432/C5JC8P (2008).
Wolberg, W. H. & Mangasarian, O. L. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc. Natl. Acad. Sci. U.S.A. 87, 9193–9196. https://doi.org/10.1073/pnas.87.23.9193 (1990).
Kaggle. Titanic: Machine Learning from Disaster. https://www.kaggle.com/competitions/titanic (2018).
Lim, T. S., Loh, W. Y. & Shih, Y. S. Comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach. Learn. 40, 203–228. https://doi.org/10.1023/A:1007608224229 (2000).
Brodley, C. UCI Machine Learning Repository: Image Segmentation Data Set. https://doi.org/10.24432/C5GP4N (UCI Machine Learning Repository, 1990).
Turing Institute. Vehicle Recognition Using Rule Based Methods. TIRM-87-018 (Turing Institute, 1987).
Horton, P. & Nakai, K. A Probabilistic classification system for predicting the cellular localization sites of proteins. In Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, ISMB 1996, vol. 4, 109–115 (1996).
Ayres-de Campos, D., Bernardes, J., Garrido, A., Marques-de Sa, J. & Pereira-Leite, L. SisPorto 2.0: A program for automated analysis of cardiotocograms. J. Maternal-Fetal Med. 9, 311–318 (2000).
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. & Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47, 547–553. https://doi.org/10.1016/j.dss.2009.05.016 (2009).
Bohanec, M. & Rajkovic, V. Knowledge acquisition and explanation for multi-attribute decision making. In 8th International Workshop on Expert Systems and Their Applications 59–78 (1988).
Gem, C. UCI Machine Learning Repository: Flags Data Set. https://doi.org/10.24432/C52C7Z (UCI Machine Learning Repository, 1990).
Au, Q. OpenML: Birds Dataset. https://www.openml.org/search?type=data&id=41464 (2019).
Trohidis, K., Tsoumakas, G., Kalliris, G. & Vlahavas, I. Multi-label classification of music by emotion. Eurasip J. Audio Speech Music Process. 2011, 1–9. https://doi.org/10.1186/1687-4722-2011-426793 (2011).
Zhou, Z. H. & Zhang, M. L. Multi-instance multi-label learning with application to scene classification. In NIPS 2006: Proceedings of the 19th International Conference on Neural Information Processing Systems, vol. 19, 1609–1616. https://doi.org/10.7551/mitpress/7503.003.0206 (2006).
Apté, C., Damerau, F. & Weiss, S. M. Automated learning of decision rules for text categorization. ACM Trans. Inf. Syst. 12, 233–251. https://doi.org/10.1145/183422.183423 (1994).
Boutell, M. R., Luo, J., Shen, X. & Brown, C. M. Learning multi-label scene classification. Pattern Recogn. 37, 1757–1771. https://doi.org/10.1016/j.patcog.2004.03.009 (2004).
Elisseeff, A. & Weston, J. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems. https://doi.org/10.7551/mitpress/1120.003.0092 (The MIT Press, 2002).
Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45, 427–437. https://doi.org/10.1016/j.ipm.2009.03.002 (2009).
Koyejo, O., Ravikumar, P., Natarajan, N. & Dhillon, I. S. Consistent multilabel classification. In Proceedings of the 29th International Conference on Neural Information Processing Systems—Volume 2, NIPS’15 3321–3329 (MIT Press, 2015).
Powers, D. M. W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation (2020).
Davis, J. & Goadrich, M. The relationship between precision-recall and roc curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML’06 233–240. https://doi.org/10.1145/1143844.1143874 (Association for Computing Machinery, 2006).
Flach, P. & Kull, M. Precision-recall-gain curves: Pr analysis done right. In Advances in Neural Information Processing Systems (eds Cortes, C. et al.), vol. 28 (Curran Associates, Inc., 2015).
Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms 1st edn. (Chapman & Hall/CRC, 2012).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17 4768–4777 (Curran Associates Inc., 2017).
Molnar, C. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/ (2022).
The pandas development team. pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134 (2020).
Szymański, P. & Kajdanowicz, T. Scikit-multilearn: A scikit-based Python environment for performing multi-label classification. J. Mach. Learn. Res. 20, 1 (2019).
Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: Networked science in machine learning. SIGKDD Explor. 15, 49–60. https://doi.org/10.1145/2641190.2641198 (2014).
L’ecuyer, P. Good parameters and implementations for combined multiple recursive random number generators. Oper. Res. 47, 159–164 (1999).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2 (2020).
Aragão, M. V. C., Mafra, S. & de Figueiredo, F. A. P. Análise de Tráfego de Rede com Machine Learning para Identificação de Ameaças a Dispositivos IoT. In Proceedings of the 40th Brazilian Symposium on Telecommunications and Signal Processing. https://doi.org/10.14209/sbrt.2022.1570824939 (Brazilian Telecommunications Society (SBrT2022), 2022).
Aragão, M. V. C., Ambrósio, G. P. & de Figueiredo, F. A. P. ML-based novelty detection and classification of security threats in IoT networks. In Proceedings of the 41st Brazilian Symposium on Telecommunications and Signal Processing (Brazilian Telecommunications Society (SBrT2023), 2023).
Fisher, R. A. Statistical Methods for Research Workers 66–70 (Springer, 1992).
Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Seabold, S. & Perktold, J. statsmodels: Econometric and statistical modeling with python. In 9th Python in Science Conference, vol. 57, 61 (2010).
Terpilowski, M. scikit-posthocs: Pairwise multiple comparison tests in Python. J. Open Source Softw. 4, 1169 (2019).
Vallat, R. Pingouin: Statistics in Python. J. Open Source Softw. 3, 1026. https://doi.org/10.21105/joss.01026 (2018).
Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52, 591–611. https://doi.org/10.2307/2333709 (1965).
Brown, M. B. & Forsythe, A. B. Robust tests for the equality of variances. J. Am. Stat. Assoc. 69, 364–367. https://doi.org/10.2307/2285659 (1974).
Bartlett, M. S. Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 160, 268–282 (1937).
Box, G. & Cox, D. R. An analysis of transformations. J. R. Stat. Soc. Ser. B (Methodol.) 26, 211–243. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x (1964).
Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics 5, 99–114. https://doi.org/10.2307/3001913 (1949).
Kruskal, W. H. & Wallis, W. A. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 47, 583–621. https://doi.org/10.2307/2280779 (1952).
Dunn, O. J. Multiple comparisons using rank sums. Technometrics 6, 241–252. https://doi.org/10.2307/1266041 (1964).
Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979).
Massey, F. J. Jr. The Kolmogorov–Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951).
Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 32, 675–701 (1937).
Mauchly, J. W. Significance test for sphericity of a normal n-variate distribution. Ann. Math. Stat. 11, 204–209 (1940).
Winer, B. J. Statistical Principles in Experimental Design (McGraw-Hill, 1962).
Greenhouse, S. W. & Geisser, S. On methods in the analysis of profile data. Psychometrika 24, 95–112 (1959).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995).
Conover, W. J. Practical Nonparametric Statistics 3rd edn. (Wiley, 1999).
Nemenyi, B. Distribution-free multiple comparisons. Psychometrika 28, 161–167 (1963).
Funding
This work was partially funded by CNPq (Grant Nos. 403612/2020-9, 311470/2021-1, and 403827/2021-3), by Minas Gerais Research Foundation (FAPEMIG) (Grant Nos. PPE-00124-23, APQ-00810-21, APQ-04523-23, APQ-05305-23, and APQ-03162-24), Brasil 6G project (01245.020548/2021-07), supported by RNP and MCTI, and by the projects XGM-AFCCT-2024-2-5-1 and XGM-AFCCT-2024-9-1-1 supported by xGMobile—EMBRAPII-Inatel Competence Center on 5G and 6G Networks, with financial resources from the PPI IoT/Manufatura 4.0 from MCTI Grant Number 052/2023, signed with EMBRAPII.
Author information
Contributions
M.V.C.A. contributed to this study’s conceptualization, resources, methodology, supervision, validation, review, and editing. A.G.A., R.C.F., R.G.F, and S.G.L. contributed to this study’s data curation, visualization, writing, review, and editing. F.A.P.F. and S.B.M. contributed to methodology, supervision, review, and funding. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Aragão, M.V.C., Afonso, A.G., Ferraz, R.C. et al. A practical evaluation of AutoML tools for binary, multiclass, and multilabel classification. Sci Rep 15, 17682 (2025). https://doi.org/10.1038/s41598-025-02149-x