Table 11 Summary of the results and discussion.

From: A practical evaluation of AutoML tools for binary, multiclass, and multilabel classification

| Scenario | Key findings | Challenges | Recommendations |
| --- | --- | --- | --- |
| Binary | Advanced HPO approaches yielded top performance but required more time. Some faster frameworks or minimal pipelines struggled on complex or missing data and occasionally failed on certain tasks. | Rigid preprocessing and insufficient handling of missing data or class imbalance limited performance and led to failures. | Adopt robust data encoding, improve imbalance mitigation (see the class-weight sketch below), and enable adaptive model selection to address diverse complexities and avoid execution errors. |
| Multiclass | A few frameworks consistently achieved strong results, while others showed unexpected drops on simpler data. No framework failed to run, though performance variability was substantial. | Maintaining stable accuracy across varied distributions was difficult, and certain frameworks always used the full time budget or saw large accuracy swings despite quick runs. | Incorporate adaptive ensembling or selective search to handle different data complexities effectively without monopolizing runtime or suffering marked performance drops. |
| Multilabel (Native) | Only a limited set of frameworks supported native multilabel (multi-output) classification; one generally excelled in accuracy, while another was faster but less accurate. | Sparse label sets, limited native support, and inconsistent training times reduced reliability, with many frameworks providing no results at all. | Enhance native multilabel capabilities (see the indicator-matrix sketch below) to cope with label sparsity and ensure consistent optimization loops for stable performance. |
| Multilabel (Powerset) | More exhaustive pipelines or ensembling achieved higher scores but demanded longer training. Some frameworks finished rapidly but showed significantly lower accuracy or failed under extreme label inflation. | Label powerset transformations (see the sketch below) exposed imbalance and sparse label combinations, causing pipeline instability and partial failures in certain tools. | Adopt specialized balancing or meta-label methods to handle expanded label sets and refine search algorithms to stay robust under label inflation. |
| General | No single tool dominated all tasks. Comprehensive search approaches delivered higher accuracy but often used the entire time limit, while faster methods risked significant degradation on challenging data. | Handling real-world data characteristics, such as missing features and label imbalance, remained a common obstacle, and traditional complexity metrics did not fully capture domain-level issues. | Employ resilient pipelines that combine flexible search with advanced preprocessing and domain-aware strategies, balancing thoroughness against strict time constraints. |
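The imbalance mitigation recommended for the binary scenario can be as simple as reweighting the loss by inverse class frequency. A minimal sketch using scikit-learn follows; the estimator and the toy dataset are illustrative assumptions, not tools or data from the evaluation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced binary problem: 90 negatives vs. 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)

# Inverse-frequency weights: n_samples / (n_classes * count(class)).
classes = np.unique(y)
weights = compute_class_weight("balanced", classes=classes, y=y)
# -> roughly {0: 0.56, 1: 5.0}: errors on the rare class cost ~9x more.

# Pass the weights to the estimator (class_weight="balanced" is the
# built-in shortcut for the same computation).
clf = LogisticRegression(class_weight=dict(zip(classes, weights)))
clf.fit(X, y)
```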
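For the Multilabel (Native) row, "native support" means the model accepts a 2-D label indicator matrix directly, with no problem transformation. A hedged sketch, again with scikit-learn; the random forest is a stand-in for whatever each AutoML tool uses internally:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Y is an n_samples x n_labels 0/1 indicator matrix; rows with few
# 1s illustrate the label sparsity the evaluation highlights.
X = np.random.default_rng(1).normal(size=(6, 4))
Y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
])

# Tree ensembles in scikit-learn accept the 2-D target directly,
# so no label transformation is needed.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, Y)
print(clf.predict(X[:2]))  # one 0/1 label vector per sample
```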
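The powerset findings follow directly from how the transformation works: every distinct label combination becomes one multiclass target, so k labels can inflate into up to 2^k classes, many of them rare; this is the "label inflation" and sparsity the table refers to. A minimal sketch of the transformation itself in plain NumPy (real pipelines would typically use a library such as scikit-multilearn):

```python
import numpy as np

# Five samples over three labels.
Y = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
])

# Map each distinct row (label combination) to one multiclass id.
combos, y_powerset = np.unique(Y, axis=0, return_inverse=True)
print(len(combos))   # 3 distinct combinations -> 3 classes here,
                     # but up to 2**n_labels in the worst case
print(y_powerset)    # [1 1 0 2 0]: the multiclass training target

# Inverse mapping: recover the label vector from a predicted class.
print(combos[y_powerset[0]])  # -> [1 0 0]
```

Combinations that occur only once become singleton classes, which is why the powerset transformation exposes imbalance that native multilabel handling would not.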