Table 2 Classification statistics (average) for the four segmentation configurations of Table 1 and benefits of majority voting.

From: A generalised framework for detailed classification of swimming paths inside the Morris Water Maze

                                 Segmentation I     Segmentation II    Segmentation III   Segmentation IV
Number of generated classifiers  42                 78                 91                 64

Performance: classifiers
  Average error (%) [min-max]    16.8 [5.4-24.9]    17.5 [3.7-25.0]    13.9 [1.8-21.5]    18.0 [7.3-24.9]
  Unclassified segments (%)      2.5                2.5                1.3                3.7
  Agreement (%)                  58.7               61.0               59.6               56.3

Performance: ensemble(s)
  Error (%)                      0.0                0.2                0.0                0.0
  Unclassified segments (%)      0.0                0.0                0.0                0.1
  Agreement (%)                  83.4               82.6               82.3               80.0

(1) Number of generated classifiers: for each segmentation, only classifiers with a cross-validation error below 25% were selected to take part in the classification analysis procedures (ensemble and binomial confidence intervals). As a rule of thumb, we require a minimum of 40 'strong' classifiers to be generated in order to trust the classification results.

(2) Error: 10-fold cross-validation was used to select 'strong' classifiers based on their validation error. It was also used to compute the average accuracy of the 'strong' classifiers and the accuracy of the ensemble (for the ensemble, the same folds used by the classifiers were re-used). The ensemble substantially improves classification accuracy. Since cross-validation was used in our method for both tuning and testing, we manually assessed the error of the ensembles on two of the four segmentations (see Supplementary Material).

(3) The percentage of unclassified segments was computed separately; because the classifiers are 'strong', only a few segments remain unclassified, and the ensemble almost entirely eliminates them.

(4) The average agreement between the classifiers was computed by first calculating the percentage of agreement within each pair (two classifiers agree when they assign the same label to a particular segment) and then averaging over all pairs (refer to Validity Measurements for more information). To perform the same measurement at the ensemble level, 21 ensembles were created by drawing a random sample of 11 'strong' classifiers from the pool; we chose a sample smaller than 40 to avoid a large overlap of classifiers across ensembles. The agreement between the classifiers is better than moderate and, as expected, the agreement of the ensembles is high. A sketch of the voting and agreement computations is given below.
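To make the protocol in notes (2) and (4) concrete, the following is a minimal Python sketch of the majority-voting ensemble and the pairwise-agreement measure. It is an illustration under stated assumptions, not the authors' implementation: the -1 convention for unclassified segments, the tie handling (ties left unclassified), the exclusion of unclassified decisions from pairwise agreement, and all synthetic data are assumptions; the 40-classifier minimum and the 21 ensembles of 11 classifiers follow the notes above.

```python
import numpy as np
from itertools import combinations


def majority_vote(labels):
    """Ensemble label per segment via majority voting.

    labels: (n_classifiers, n_segments) integer array; -1 = unclassified.
    Returns a (n_segments,) array; -1 where no unique plurality exists
    (the tie-breaking rule is an assumption, not stated in the paper).
    """
    out = np.full(labels.shape[1], -1)
    for s in range(labels.shape[1]):
        votes = labels[:, s]
        votes = votes[votes >= 0]                # drop unclassified votes
        if votes.size == 0:
            continue                             # no classifier labelled it
        values, counts = np.unique(votes, return_counts=True)
        winners = values[counts == counts.max()]
        if winners.size == 1:                    # unique plurality only
            out[s] = winners[0]
    return out


def average_pairwise_agreement(labels):
    """Mean, over all classifier pairs, of the fraction of segments on
    which both classifiers assigned the same label. Segments left
    unclassified by either member of a pair are excluded (an assumed
    convention)."""
    scores = []
    for i, j in combinations(range(labels.shape[0]), 2):
        both = (labels[i] >= 0) & (labels[j] >= 0)
        if both.any():
            scores.append(np.mean(labels[i][both] == labels[j][both]))
    return float(np.mean(scores))


# Synthetic demonstration data (illustrative only, not the paper's data):
# 50 'strong' classifiers labelling 200 segments into 8 classes, with
# roughly 2% of individual decisions left unclassified.
rng = np.random.default_rng(0)
labels_all = rng.integers(0, 8, size=(50, 200))
labels_all[rng.random(labels_all.shape) < 0.02] = -1

assert labels_all.shape[0] >= 40                 # rule-of-thumb minimum

# 21 ensembles, each built from a random sample of 11 classifiers,
# mirroring note (4).
ensemble_labels = np.stack([
    majority_vote(labels_all[rng.choice(labels_all.shape[0], 11,
                                        replace=False)])
    for _ in range(21)
])

print("classifier agreement:", average_pairwise_agreement(labels_all))
print("ensemble agreement:  ", average_pairwise_agreement(ensemble_labels))
```

Because the demonstration labels are random, the printed agreement values are far lower than those in the table; with real classifiers trained on the same segments, agreement rises as described in note (4).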