Table 4 Performance evaluation metrics.
Metrics | Calculation | Description |
|---|---|---|
Accuracy | \(\:\frac{TP\:+\:TN}{TP\:+\:TN\:+\:FP\:+\:FN}\) | Accuracy measures the overall correctness of the model’s predictions, including both TPs and TNs. |
Precision | \(\:\frac{TP}{TP\:+\:FP}\) | Precision measures the proportion of TP predictions out of the total positive predictions made by the model, i.e., it indicates the model’s ability to identify patients with heart disease correctly. It is useful when minimizing FPs is crucial. |
Recall | \(\:\frac{TP}{TP\:+\:FN}\) | Recall measures the proportion of TP predictions out of the actual positive instances, i.e., it reflects the model’s ability to detect patients with heart disease correctly. It is important in situations where FNs are critical. |
F1-score | \(\:\frac{2\times\:TP}{2\times\:TP\:+\:FP+\:FN}\) | The F1-score gives a single metric that balances both recall and precision by taking the harmonic mean of the two. It is particularly useful when the dataset is imbalanced or when precision and recall are equally important. A high F1-score indicates good precision-recall balance. |
Specificity | \(\:\frac{TN}{TN\:+\:FP}\) | Specificity measures the proportion of actual negative cases that are correctly predicted as TN, i.e., the model’s ability to identify individuals without heart disease correctly. Specificity is important when minimizing FPs is crucial, because FPs may lead to unnecessary medical procedures. |
Macro average (MA) | \(\:\frac{1}{2}\sum\:_{c=0}^{1}{A}_{c}^{m}\) | The MA determines the average performance across all classes or categories. Here, c denotes classes 0 (no heart disease) and 1 (heart disease), and m denotes precision, recall, or F1-score. |
Weighted average (WA) | \(\:\sum\:_{c=0}^{1}{w}_{c}\times\:{A}_{c}^{m}\), where \(\:{w}_{0}+{w}_{1}=1\) | WA summarizes performance while taking the class distribution into account: each class’s metric \(\:{A}_{c}^{m}\) is weighted by the proportion \(\:{w}_{c}\) of instances belonging to class c. It is beneficial for imbalanced datasets, where some classes have far more instances than others. |
Standard deviation (SD) | \(\:\sqrt{\frac{\sum\:{({x}_{i}-\mu\:)}^{2}}{N}}\) | SD evaluates the variability of a performance metric across multiple folds, providing insight into model consistency or stability. A lower SD indicates more consistent outcomes. (N = no. of folds, \(\:{x}_{i}\:\)= metric value for fold i, \(\:\mu\:\:\)= mean over all folds) |
Kappa | \(\:\frac{2\times\:(TP\times\:TN-FP\times\:FN)}{\left(TP+FP\right)\times\:\left(FP+TN\right)+\left(TP+FN\right)\times\:\left(FN+TN\right)}\) | Cohen’s Kappa measures the degree of agreement between actual and predicted class labels while correcting for agreement expected by chance. It is helpful for evaluating model performance when the class distribution is skewed or the majority class is very prevalent. |
Matthews correlation coefficient (MCC) | \(\:\frac{TP\times\:TN-FP\times\:FN}{\sqrt{\left(TP+FP\right)\times\:\left(TP+FN\right)\times\:\left(TN+FP\right)\times\:\left(TN+FN\right)}}\) | The MCC measures the quality of binary classifications. It ranges from −1 to +1, with +1 signifying perfect classification, 0 signifying random classification, and −1 signifying complete misclassification. A greater MCC suggests improved model performance. |
Receiver operating characteristic (ROC) curve | TPR (y-axis) vs. FPR (x-axis) | The ROC curve illustrates the tradeoff between sensitivity (recall) and specificity, showing how well the model performs across threshold settings for heart disease prediction. A curve closer to the top-left corner indicates better model performance. |
Area under the curve (AUC) | \(\:{\int\:}_{0}^{1}TPR\left({FPR}^{-1}\left(t\right)\right)dt\), t is a threshold | The AUC represents the area under the ROC curve and provides a single scalar value that summarizes the overall performance of the prediction model. A higher AUC indicates more accurate discrimination between positive and negative instances of heart disease among the patients. |
Area under the precision-recall curve (AUPRC) | \(\:\int\:p\left(R\right)dR\), where \(\:p\left(R\right)\) is the precision at recall level R. | The AUPRC, the area under the precision-recall curve, indicates how well a model performs on imbalanced datasets. It considers the tradeoff between precision and recall. A higher AUPRC indicates better performance, particularly when correctly identifying positive instances is crucial. |
Misclassification rate (MCR) | \(\:\frac{FP\:+\:FN}{TP\:+\:TN\:+\:FP\:+\:FN}\) | The MCR, also called the error rate, is the proportion of instances that are wrongly classified relative to the total number of instances. It complements accuracy by giving the fraction of misclassified instances. A lower MCR indicates improved model performance. |
Execution time | - | Algorithm execution time in seconds. |
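As a sketch of how the count-based entries in the table relate to one another, the metrics that depend only on the four confusion-matrix cells can be computed directly from those counts. The Python snippet below is illustrative only: the function name `confusion_metrics` and the example counts are ours, not from the paper, and Cohen’s kappa is implemented in its standard chance-corrected form.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Count-based binary classification metrics (cf. Table 4)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # sensitivity / TPR
    f1 = 2 * tp / (2 * tp + fp + fn)              # harmonic mean of precision and recall
    specificity = tn / (tn + fp)
    mcr = (fp + fn) / total                       # misclassification (error) rate = 1 - accuracy
    # MCC ranges from -1 (complete misclassification) to +1 (perfect classification).
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity, "mcr": mcr,
            "mcc": mcc, "kappa": kappa}

# Hypothetical counts for illustration: 50 TP, 40 TN, 10 FP, 0 FN.
m = confusion_metrics(tp=50, tn=40, fp=10, fn=0)
print(f"accuracy={m['accuracy']:.2f}, kappa={m['kappa']:.2f}, mcc={m['mcc']:.3f}")
```

With these example counts, accuracy is 0.90 and kappa is 0.80. The macro and weighted averages in the table would then be obtained by computing a metric per class and averaging with equal weights (MA) or with class-proportion weights (WA), respectively.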