Table 2 Intraclass correlation results reflecting the model’s ability to reproduce gold-standard measures based on expert review of the videos

From: A machine learning contest enhances automated freezing of gait detection and reveals time-of-day effects

| ICCs (95% CI) | Test set | 1st place | 2nd place | 3rd place | 4th place | 5th place |
| --- | --- | --- | --- | --- | --- | --- |
| % Time frozen | Private test | 0.949** (0.85–0.98) | 0.934** (0.80–0.98) | 0.942** (0.83–0.98) | 0.886** (0.69–0.96) | 0.877** (0.67–0.96) |
| % Time frozen | Private+public test | 0.869** (0.77–0.93) | 0.884** (0.79–0.94) | 0.898** (0.82–0.94) | 0.870** (0.77–0.93) | 0.852** (0.74–0.92) |
| No. of FOG episodes | Private test | 0.763** (0.04–0.94) | 0.869** (0.64–0.96) | 0.717** (0.34–0.90) | 0.093 (−0.22 to 0.50) | 0.885** (0.68–0.96) |
| No. of FOG episodes | Private+public test | 0.500** (0.18–0.71) | 0.456** (0.18–0.67) | 0.597** (0.30–0.78) | 0.084 (−0.12 to 0.32) | 0.346* (0.04–0.59) |
| FOG duration | Private test | 0.991** (0.97–1.00) | 0.991** (0.97–1.00) | 0.985** (0.95–0.99) | 0.965** (0.90–0.99) | 0.985** (0.96–1.00) |
| FOG duration | Private+public test | 0.955** (0.92–0.98) | 0.944** (0.89–0.97) | 0.965** (0.93–0.98) | 0.950** (0.91–0.97) | 0.907** (0.82–0.95) |

  1. *p < 0.05, **p < 0.001 in an ICC2(2,1) test (exact p-values are shown in Supplementary Table 2). Note that for some models and outcome measures (e.g., 1st place model, no. of FOG episodes), the ICC was lower when the public and private test sets were combined than in the private test set alone. Because the data were randomly divided into the different test sets, this finding is somewhat counterintuitive. Notably, it occurred for the number of FOG episodes, an outcome that was generally less robust than FOG duration or % time frozen (perhaps because splitting or lumping adjacent episodes affects the number of episodes much more than it affects the duration or % time frozen).
  2. ICCs: intraclass correlation coefficients.
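
The parenthetical remark in the first footnote, that splitting or lumping adjacent episodes changes the episode count far more than the duration or % time frozen, can be illustrated with a small sketch. This is not the contest's scoring code: the function names (`merge_adjacent`, `summarize`), the 1-second gap threshold, the 100-second trial length, and the episode times are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: hypothetical episode annotations, not study data.

def merge_adjacent(episodes, max_gap):
    """Lump episodes whose gap to the previous episode is <= max_gap seconds."""
    merged = []
    for start, end in sorted(episodes):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous episode instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def summarize(episodes, trial_length):
    """Return the three outcome measures for a list of (start, end) episodes."""
    total = sum(end - start for start, end in episodes)
    return {
        "n_episodes": len(episodes),
        "total_duration": total,
        "pct_time_frozen": 100.0 * total / trial_length,
    }


# Hypothetical reference annotation: two episodes separated by a 0.5 s gap,
# plus a third, well-separated episode (times in seconds).
reference = [(10.0, 14.0), (14.5, 20.0), (40.0, 45.0)]

# A hypothetical detector that bridges the short gap lumps the first two.
lumped = merge_adjacent(reference, max_gap=1.0)

ref_summary = summarize(reference, trial_length=100.0)
lumped_summary = summarize(lumped, trial_length=100.0)

print(ref_summary)
print(lumped_summary)
```

In this toy case the episode count drops from 3 to 2, a one-third change, while the total duration moves only from 14.5 s to 15.0 s and % time frozen from 14.5% to 15.0%. That asymmetry is consistent with the footnote's suggestion that count-based agreement is more fragile than duration-based agreement.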