Table 1 Summary of the state-of-the-art literature.
From: Efficient diagnosis of diabetes mellitus using an improved ensemble method
Name and author | Dataset | Methodology | Results and accuracy | Limitations | Future scope |
---|---|---|---|---|---|
Yadav and Pal31 | UCI Repository | J48, Decision Stump, REP, RF, Gradient Boosting, AdaBoost M1, XGBoost | RF (Parallel) = 100%, XGBoost (Sequential) = 98.05% | Limited to ensemble methods | Explore hybrid models combining sequential and parallel approaches |
Kumari, Kumar, and Mittal39 | PIMA diabetes dataset (UCI) | RF, Logistic Regression, Naive Bayes | 79.08% (PIMA), 97.02% (breast cancer) | Focused on soft voting ensembles | Extend to other medical datasets and more classifiers |
Tewari and Dwivedi32 | UCI dataset | JRIP, OneR, Decision Table, Boosting, Bagging | Bagging = 98% | Limited feature selection methods | Investigate more advanced feature selection techniques |
Ghosh et al. (2021) | PIMA Indians diabetes dataset | Gradient Boosting, SVM, AdaBoost, RF with and without MRMR feature selection | RF = 99.35% with MRMR | High complexity in MRMR feature selection process | Simplify feature selection and test on other datasets |
Atif, Anwer, and Talib44 | PIMA Indians dataset, Early-Stage Diabetes | Hard Voting Classifier (Logistic Regression, Decision Tree, SVM) | 81.17% (PIMA), 94.23% (Early-Stage Diabetes) | Limited voting scheme to hard voting | Explore soft voting or weighted voting for improved results |
Rashid, Yaseen, Saeed, and Alasaady45 | PIMA Indians Diabetes Dataset (PIDD) | Decision Tree, Logistic Regression, KNN, RF, XGBoost | 81% after standardization and imputation | Focused only on ensemble voting techniques | Test other ensemble strategies such as bagging or boosting |
Zhou, Xin, and Li46 | PIMA Indian diabetes dataset | Boruta feature selection, K-Means++ clustering, stacking ensemble learning | 98% | High computational cost in clustering | Reduce computation and test scalability |
Kawarkhe and Kaur47 | PIMA Indians | CatBoost, LDA, LR, RF, GBC with preprocessing techniques | 90.62% | Limited to specific preprocessing techniques | Broaden the preprocessing techniques and methods |
Reza, Amin, Yasmin, Kulsum, and Ruhi48 | PIMA Indian diabetes dataset, local healthcare data | Stacking ensemble with classical and deep neural networks | 77.10% (PIMA), 95.50% (simulation) | Limited to stacking approaches | Explore other ensemble methods or hybrid approaches |
Thongkam et al.17 | Breast Cancer Dataset | AdaBoost | Improved prediction and diagnosis | Initially used only for breast cancer | Extend to other medical conditions |
Velu and Kashwan18 | Various Datasets | SVM, Radial Basis Function, Multi-Layer Perceptron, and Multi-Level Counter Propagation Network. | High accuracy in various applications | Complexity in model selection | Test different combinations and optimizations |
Temurtas et al.19 | PIMA diabetes dataset | Multilayer Neural Network | Improved accuracy | Focused on PIMA dataset | Apply to other chronic disease datasets |
Ayo et al.20 | Heart Disease Dataset | Levenberg–Marquardt approach, Probabilistic Neural Network, Naive Bayes, SVM | High accuracy in diagnosing cardiac disease | Limited to cardiac disease prediction | Broaden to include other comorbidities |
Farvaresh and Sepehri21 | Various medical datasets | Decision Tree C4.5, Bagging with C4.5, and Naive Bayes. | Improved prediction of cardiac illness | Initial focus on cardiac illness | Expand to other diseases and datasets |
Kalman Filter Theory22 | PIMA Indian dataset | Adaptive and personalized insulin recommendation | Enhanced classification accuracy | Focused on insulin recommendation systems | Broaden to other therapeutic recommendations |
Ajagbe et al.23 | Various applications | Multimedia analytic techniques, meta-data annotation, MPEG-7 | Improved semantic analysis | Limited to MPEG-7 | Explore alternative multimedia retrieval frameworks |
Gong and Kim24 | Imbalanced datasets | RHS-Boost for balanced classification | High accuracy and prediction | Designed for imbalanced datasets | Apply to other datasets and test alternative balancing methods |
Purnami et al.25 | Diabetes detection | ANFIS and PCA | Enhanced detection | Initial partitioning approach | Broaden to include other feature extraction methods |
Rani and Jyothi26 | Diabetes dataset | Bayesian Classification, J48, KNN, Filtered Classifier, ANN, Naive Bayes | 77.01% accuracy | Lack of cross-validation | Implement cross-validation and expand dataset usage |
Zheng et al.27 | Various datasets | KNN, Naive Bayes, Decision Tree, RF, SVM, Logistic Regression | Improved recall and accuracy | Filtering criteria could be improved | Enhance feature selection and parameter tuning |
Komi et al.28 | Sample datasets | ELM, ANN, LR, GMM, SVM | Better accuracy with fewer samples | Small sample size | Increase sample data and test on more complex datasets |
Sai et al.29 | Diabetes dataset | Weighted voting approach for ensemble prediction models | Enhanced predictive performance | Focused on ensemble prediction model | Explore ensemble expansion and optimization |
Rustam et al.56 | Multiple datasets | Ensemble of CNN and LSTM for feature extraction, Random Forest model for prediction | Accuracy score of 0.99 using CNN-LSTM features with Random Forest | Limited by dataset size, generalizability issues in existing approaches | Explore other ensemble models, improve dataset diversity, real-world applicability |
Faustin and Zou57 | Pima Indian Diabetes Dataset | Genetic Algorithm (GA) enhanced with a two-step crossover operator for feature selection | Accuracy: 97.5%, Precision: 98%, Recall: 97%, F1-score: 97% | Premature convergence due to insufficient population diversity in GA | Apply the improved GA to other datasets, refine crossover technique |
Reza et al.58 | PIMA Indian Diabetes dataset, Local healthcare data | Stacking ensemble method combining classical and deep neural network models for diabetes classification. | Stacking ensemble with NN architectures: Accuracy 95.50%, Precision 94%, Recall 97%, F1-score 96% | Limited to dataset used in the study; may need more diverse data for generalization. | Explore further with other datasets, and apply to real-time healthcare systems. |
Saihood and Sonuç59 | Pima Indians Diabetes Database | Ensemble machine learning models: Bagging, boosting, and stacking with hyperparameter tuning and data preprocessing. | Stacking (RF & SVM): 97.50% accuracy, Bagging (RF): 97.20%, Boosting (XGB): 97.10% | Framework limited to Pima Indians dataset; real-world data may vary. | Extend framework to include more diverse datasets and explore real-time applications in healthcare. |
Daza et al.60 | Diabetes Dataset (768 patient records) | Stacking ensemble approach using 7 base algorithms; oversampling to balance the dataset and cross-validation for model training | Best accuracy: 91.5%, Sensitivity: 91.6%, F1-score: 91.49%, Precision: 91.5%, ROC AUC: 97% | Performance depends on the dataset and oversampling method | Improve the model's generalizability by testing on other datasets and enhancing the model |
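Most entries in the table rely on voting or stacking ensembles built over classical learners (Logistic Regression, Decision Tree, SVM, Random Forest). The sketch below illustrates the two dominant patterns, soft voting and stacking, using scikit-learn; it is a minimal illustration, not any cited author's pipeline, and the synthetic data merely stands in for a PIMA-like task (8 numeric features, binary outcome). All hyperparameters here are illustrative assumptions.

```python
# Minimal sketch of soft-voting vs. stacking ensembles (assumptions noted
# in the text above; synthetic data stands in for the PIMA dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 768 samples, 8 features, binary label.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
]

# Soft voting averages the base models' predicted class probabilities.
voting = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

# Stacking trains a meta-learner (here RF) on base-model predictions.
stack = StackingClassifier(
    estimators=base,
    final_estimator=RandomForestClassifier(random_state=42),
).fit(X_tr, y_tr)

print(f"soft voting accuracy: {voting.score(X_te, y_te):.3f}")
print(f"stacking accuracy:    {stack.score(X_te, y_te):.3f}")
```

Hard voting (as in Atif, Anwer, and Talib44) is obtained by dropping `voting="soft"`, in which case the ensemble takes a majority vote over predicted labels instead of averaging probabilities.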