Table 11 Summary of the results and discussion.
| Scenario | Key findings | Challenges | Recommendations |
|---|---|---|---|
| Binary | Advanced HPO approaches yielded top performance but required more time. Some faster frameworks or minimal pipelines struggled on complex datasets or missing values and occasionally failed outright on certain tasks. | Rigid preprocessing and insufficient handling of missing data or class imbalance limited performance and led to failures. | Adopt robust data encoding, improve imbalance mitigation, and enable adaptive model selection to address diverse complexities and avoid execution errors (see the first sketch after this table). |
| Multiclass | A few frameworks consistently achieved strong results, while others showed unexpected drops on simpler data. No framework failed to run, though performance variability was substantial. | Maintaining stable accuracy across varied distributions was difficult; certain frameworks always consumed the full time budget, while others showed large accuracy swings despite quick runs. | Incorporate adaptive ensembling or selective search to handle different data complexities without monopolizing the runtime budget or suffering marked performance drops. |
| Multilabel (Native) | Only a limited set of frameworks supported native multi-output classification; one generally excelled in accuracy, while another was faster but less accurate. | Sparse label sets, limited native support, and inconsistent training times reduced reliability, with many frameworks producing no results at all. | Enhance native multilabel capabilities to cope with label sparsity and ensure consistent optimization loops for stable performance (see the second sketch below). |
| Multilabel (Powerset) | More exhaustive pipelines or ensembling achieved higher scores but demanded longer training. Some frameworks finished rapidly but showed markedly lower accuracy or failed under extreme label inflation. | Label powerset transformations exposed class imbalance and sparse label combinations, causing pipeline instability and partial failures in certain tools. | Adopt specialized balancing or meta-label methods to handle expanded label sets, and refine search algorithms to stay robust under label inflation (see the third sketch below). |
| General | No single tool dominated all tasks. Comprehensive search approaches delivered higher accuracy but often used the entire time limit, while faster methods risked significant degradation on challenging data. | Handling real-world data characteristics, such as missing features and label imbalance, remained a common obstacle, and traditional complexity metrics did not fully capture domain-level issues. | Employ resilient pipelines that combine flexible search with advanced preprocessing and domain-aware strategies, balancing thoroughness against strict time constraints. |
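
To make the binary-classification recommendation concrete, the following minimal sketch illustrates robust encoding, missing-value handling, and one simple form of imbalance mitigation with scikit-learn. It is not taken from any of the evaluated tools, and all column names and data are hypothetical.

```python
# Minimal sketch: robust encoding + imputation + class-weighted loss.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical tabular data with missing values in both feature types.
X = pd.DataFrame({
    "age": [25, np.nan, 47, 52],
    "income": [30_000, 52_000, np.nan, 81_000],
    "city": ["a", "b", None, "a"],
})
y = np.array([0, 0, 1, 1])

preprocess = ColumnTransformer([
    # Impute before scaling/encoding so downstream steps never see NaNs.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

# class_weight="balanced" reweights the loss by inverse class frequency,
# one simple instance of the imbalance mitigation the table recommends.
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(class_weight="balanced"))])
model.fit(X, y)
```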
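For the native multilabel setting, the second sketch shows what "native multi-output classification" means in practice: some estimators, such as scikit-learn's RandomForestClassifier, accept a binary label matrix directly, with no problem transformation. The data below is hypothetical.

```python
# Minimal sketch: a classifier that handles a multilabel target natively.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.default_rng(0).normal(size=(8, 4))
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0],
              [1, 0], [0, 1], [1, 1], [0, 0]])  # one column per label

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, Y)              # no wrapper needed: the label matrix is consumed as-is
print(clf.predict(X[:2]))  # predictions keep the (n_samples, n_labels) shape
```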
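Finally, the third sketch illustrates the label powerset transformation itself and why rare label combinations inflate the class space: each distinct combination becomes one class of an ordinary multiclass problem. The helper below is illustrative only, not code from any evaluated framework.

```python
# Minimal sketch: label powerset transformation on a binary label matrix.
import numpy as np

def to_powerset(Y):
    """Map each row of a binary label matrix to a single class id."""
    combos = {}
    classes = np.empty(len(Y), dtype=int)
    for i, row in enumerate(Y):
        # New label combinations get fresh ids; this is where "label
        # inflation" comes from, since every rare combination adds a class.
        classes[i] = combos.setdefault(tuple(row), len(combos))
    return classes, combos

Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 1]])
classes, combos = to_powerset(Y)
print(classes)      # [0 1 0 2]
print(len(combos))  # 3 distinct combinations -> 3 multiclass labels
```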