Extended Data Fig. 3: Replication of classification, internal validation and external validation results in the HCP dataset.
From: Brain–phenotype models fail for individuals who defy sample stereotypes

Results as presented in Fig. 1b, Fig. 2a, b, Fig. 3a, and Fig. 4a. Given the large HCP sample size, 10-fold cross-validation was used (20 partitions, 50 subsampling iterations each), with the requirement that family members be assigned to the same fold. Given that only two measures were classified, we omit measure versus misclassification frequency similarity and hierarchical linkage analyses.
(a) Significance via one-tailed permutation testing (as in Fig. 1b); P values FDR adjusted (2 tests). For sample sizes, see Supplementary Table 6.
(b) Permuted distribution means significantly differed from 0.5 via two-tailed, one-sample t-test (cIQ mean = 0.491 [P < 0.0001], fIQ mean = 0.498 [P = 0.04], both FDR adjusted [2 tests]). All else as in Yale and UCLA analyses: mean and median of original data-based distribution significantly differed from 0.5 (all P < 0.0001, FDR adjusted [4 tests] via two-tailed t- and Wilcoxon signed-rank tests), and the misclassification frequency distributions for original and permuted analyses significantly differed for each measure (all P < 0.0001, FDR adjusted [2 tests] via two-tailed, two-sample Kolmogorov–Smirnov test). MF, misclassification frequency.
(c) **P = 0.001, ****P < 0.0001, FDR adjusted (2 tests) via paired, one-tailed Wilcoxon signed-rank test (as in Fig. 2b).
(d) Results presented as in Fig. 3a. Bar height, grand mean; error bars, s.d. *P < 0.0001, FDR adjusted (9 tests) via two-tailed, nested ANOVA. For each classified measure (cIQ/vocabulary and fIQ/MR for HCP/Yale), six models were trained: 1 using all Yale participants, 1 using Yale CCP, 1 using Yale MCP (see Fig. 3 legend for training set sizes), 1 using all HCP participants (number of participants used for training after excluding intermediate and outlier scores and subsampling to balance classes: 230 and 350 for crystallized and fluid measures, respectively), 1 using HCP CCP (168, 216), and 1 using HCP MCP (62, 134). See Supplementary Tables 4 and 6 for test-set sizes.
(e) Results as presented in Fig. 4a. Covariate relationships presented if they were significantly related to misclassification frequency in low or high scorers (P < 0.05, adjusted). For full results and relationship of covariates to mean score, as well as sample sizes, see Supplementary Tables 9 and 10. ****P < 0.0001; all P values FDR adjusted (22 tests).
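
The family constraint on the 10-fold cross-validation described at the top of this legend can be illustrated with a minimal sketch. It is not the authors' code: the function name `assign_family_folds`, the greedy size-balancing rule, and the toy family labels are all assumptions made for illustration. The only property taken from the legend is that every member of a family lands in the same fold; a different random seed per call would stand in for each of the repeated partitions.

```python
import random
from collections import defaultdict


def assign_family_folds(family_ids, n_folds=10, seed=0):
    """Return one fold index per participant, keeping each family in a single fold.

    Hypothetical helper: the balancing strategy (largest families first,
    each placed in the currently smallest fold) is an assumption.
    """
    # Group participant indices by family label.
    families = defaultdict(list)
    for idx, fam in enumerate(family_ids):
        families[fam].append(idx)

    # Shuffle families reproducibly, then order by size (largest first).
    # The sort is stable, so ties keep their shuffled order.
    rng = random.Random(seed)
    fam_keys = sorted(families)
    rng.shuffle(fam_keys)
    fam_keys.sort(key=lambda f: len(families[f]), reverse=True)

    fold_sizes = [0] * n_folds
    folds = [None] * len(family_ids)
    for fam in fam_keys:
        # Place the whole family in the fold that is currently smallest.
        target = fold_sizes.index(min(fold_sizes))
        for idx in families[fam]:
            folds[idx] = target
        fold_sizes[target] += len(families[fam])
    return folds


# Toy example with made-up family labels (not HCP data).
family_ids = ["F1", "F1", "F2", "F3", "F3", "F3", "F4", "F5", "F6", "F6"]
folds = assign_family_folds(family_ids, n_folds=3, seed=1)
for fam, fold in zip(family_ids, folds):
    print(fam, fold)
```

Splitting by family rather than by participant keeps related individuals from appearing on both sides of a train/test split, which is the stated purpose of the constraint.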