Table 2 Review of supervised machine learning techniques.
| Method | Strengths | Limitations |
|---|---|---|
| Naïve Bayes classifier (NB) | Short computational time for training and very easy to construct^33; requires only a small amount of training data^32; simple and useful for a variety of practical applications^33 | Less accurate than other classifiers^33; classes must be mutually exclusive^32 |
| Decision tree (DT) | Simple and fast to build and interpret^33; does not require any domain knowledge or parameter setting^33; able to handle high-dimensional data^33; robust classifier^32; can be validated using statistical tests^32; self-explanatory tool with a simple schematic representation that can be followed by non-professionals^34; can easily be converted to a set of rules that are often comprehensible to clinicians^34 | Instability^35; prone to overfitting depending on tree depth^34; sensitive to training data and can be error-prone on test data^35; classes must be mutually exclusive^32 |
| Support vector machine/classifier (SVM/SVC) | Robust and well-known algorithm^33; requires minimal data for training^33; training is relatively easy^34; scales well to high-dimensional data^34; robust and can handle multiple feature spaces^32; lower risk of overfitting^32 | Poor interpretability of results^34; poor performance with noisy data^32; slow learner that requires a large amount of training time^33; computationally expensive^32 |
| Boosting algorithms | Investigator's choice of loss function allows greater flexibility^36; improved accuracy from adding to the ensemble sequentially^36; demonstrated success in various practical applications^36; simple to implement and debug^37 | Success depends on the amount of data available^37; can be prone to overfitting^37; can be more sensitive to outliers^37 |
| Random forest (RF) | Lower variance and less overfitting of the training data compared to DT^32; empirically performs better than its individual base classifiers^32; scales well to large datasets^32 | Can easily overfit^32; variable-importance estimation favors attributes that can take many different values^32; computationally expensive^32 |
| k-nearest neighbors (kNN) | Classification technique that is easy to understand and implement^33; training phase is fast and low-cost^34; simple algorithm that can classify instances quickly^32; can handle noisy instances or instances with missing attribute values^32 | Computationally expensive^34; slower classification^33; attributes are given equal importance, with no indication of which attributes are most effective for classification, which can lead to poor performance^32; large storage requirements^33; sensitive to irrelevant features^34 |
| Neural network (NN) | Can detect complex nonlinear relationships^32; requires less formal statistical training to execute^32; multiple training algorithms are available^32; able to tolerate noisy data^33; able to classify patterns on which it has not been trained^33; can be used with little or no prior knowledge of the relationship between attributes and classes^33 | Success of the model depends on the quantity of data^34; lack of clear guidelines on optimal architecture^34; 'black box': the user does not have access to the exact decision-making process^32 |
| Logistic regression (LR) | Easy to implement and straightforward^32; easily updated^32; does not make any assumptions regarding the distribution of the independent variables^32; probabilistic interpretation of model parameters^32 | Poor performance when input variables are highly correlated with one another (multicollinearity)^38; not appropriate when the data cannot be linearly separated^39; can overstate prediction accuracy^32 |
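
As a rough illustration of how the methods reviewed in Table 2 can be compared in practice, the following minimal sketch (not taken from the source) instantiates each classifier and scores it with 5-fold cross-validation. It assumes scikit-learn is available, and the built-in breast-cancer dataset and the hyperparameters shown are purely illustrative stand-ins rather than settings used in any of the cited studies.

```python
# Minimal sketch: cross-validated comparison of the supervised classifiers
# listed in Table 2. Dataset and hyperparameters are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# Public tabular dataset used as a stand-in for clinical data.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Boosting": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "NN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    "LR": LogisticRegression(max_iter=2000),
}

for name, model in models.items():
    # Standardize features; scale-sensitive methods (SVM, kNN, NN, LR) benefit most.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    print(f"{name:8s} mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Such a side-by-side run makes several of the tabulated trade-offs concrete: the tree- and Bayes-based methods train almost instantly, whereas the neural network and boosted ensemble take noticeably longer, consistent with the computational-cost limitations noted above.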