Introduction

As we age, gradual and natural changes can be observed and quantified across diverse cognitive and affective dimensions. Perhaps most importantly, the rate of such changes varies considerably across people, suggesting different aging trajectories caused by either healthy (normal) aging or pathological (abnormal) aging processes that are mediated by a complex interplay between several mechanisms1. Accurate characterization of aging trajectories along the cognitive dimension relies on the accurate assessment of decline in cognitive abilities, including measures of working memory, processing speed, and executive function2, where age-related changes in processing speed are believed to account for changes in several cognitive domains3. Such gradual cognitive decline might occur while most language skills remain intact and semantic knowledge can even improve with age4. To offer an accurate characterization of pathological aging at the individual level, it is important to define what constitutes ‘normal’ aging, that is, the range of performance on cognitive tests that could be considered normal or typical. In other words, fully characterizing normal aging is a prerequisite for clinically useful diagnostic decisions about abnormal aging.

One potential cause of cognitive aging is neural dedifferentiation5,6, which reflects brain responses becoming less precise and selective as people get older. Neural dedifferentiation has been negatively associated with cognitive performance specifically on tasks that require fluid processing and memory in healthy aging6. It has also been reported as an age-related increase in the correlation between different cognitive abilities7,8,9,10. Thus, understanding the role of neural dedifferentiation in cognitive performance is essential to explaining aging trajectories, but dedifferentiation might reflect both age-dependent and age-independent factors (see discussion in6). Therefore, it is of paramount importance to study cognitive aging with tests that are sensitive to age-dependent factors. Here, we show how some widely used cognitive tests can tap into different processes depending on age.

Healthy aging versus pathological aging is commonly reported as significant differences in performance on well-documented cognitive tests, many of them already available in existing validated neuropsychological batteries such as the National Institutes of Health Toolbox Cognitive Battery (NIHTB-CB). Such cognitive tests are not independent, i.e., they are strongly correlated and can thus be assigned to a common underlying domain or construct (e.g., memory, executive function, language, etc.). Indeed, prior work has shown that cognitive scores in different tests can be reduced to a few dimensions (components or latent variables), meaning that such tests are highly dependent across individuals11,12. Consequently, individual scores across many cognitive tests can be reduced to a behavioral representation (profile) with few dimensions. Such a low-dimensional behavioral representation can explain the common variance in the observed measures across subjects, thereby offering a lower-dimensional space into which individual cognitive scores can be projected. One question of interest concerns the consistency of such low-dimensional behavioral spaces across age. Are the identified components similar or different across the lifespan? Do tests measure the same abilities in groups with different demographics? Are behavioral scores, with performance measured as accuracy or reaction time, sensitive enough to subtle age-induced variations? Do age-related differences in behavioral scores reflect changes in domain-specific or domain-general processes13?

These questions are of paramount importance to the study of aging and to the development of reliable neuropsychological assessment tools. To study the consistency of the constructs measured by a battery across age, previous studies examined measurement invariance, which reflects the equivalence of a given construct across age groups. This is because measurement invariance underpins valid interpretation of group differences in cognitive scores14. Measurement invariance is typically assessed at four nested levels: (1) configural invariance estimates the consistency of the latent cognitive structure, (2) metric invariance measures the similarity of the strength of the relationship between the observed behavioral scores and the assumed cognitive domain (e.g. factor loadings), (3) scalar invariance tests whether the starting points (intercepts) are equivalent for the tested groups, and (4) at the most detailed level, strict invariance examines whether the unexplained variance is consistent across groups. For instance, Kiselica et al. (2020) examined the invariance of a hypothesized factor model (attention, visual, processing speed/executive, memory, and language factors) for the Uniform Data Set (UDS) neuropsychological battery using confirmatory factor analysis across sex, age and education in a sample aged 68.33 ± 10.26 years15. They showed that the factor model achieved at minimum metric invariance over a 1-year follow-up assessment and across different demographic groups. In a 4-year longitudinal study with a diverse sample of participants aged over 54 years, a 2-factor model of memory and executive function showed strict invariance of the neuropsychological battery used in the Brazilian Longitudinal Study of Adult Health (ELSA-Brasil)16. Similarly, another study reported metric invariance in healthy older adults aged 65.7 ± 8.2 years over an average period of two years17.
Using the NIHTB-CB tests, a 5-factor model (vocabulary, reading, episodic memory, working memory and executive function/processing speed) and a second-order two-factor model (fluid vs. crystallized skills) were found to consistently explain the measures recorded from different age groups (aged 20–60 years and 60–85 years)18.

In contrast, other studies showed differences across groups when behavioral data were reduced to a low-dimensional space. For instance, by examining the latent structure of executive function across 3 age groups19, it has been shown that a three-factor model of shifting, inhibition, and updating was a better fit for behavioral data in young adults but not in the older groups. Instead, the data of the older groups were better explained by a two-factor model of updating/inhibition and shifting19. Likewise, a recent study evaluated the factor structure of the NIHTB-CB tests in a sample of 411 cognitively healthy and impaired participants aged between 45 and 94 years old20. A 2-factor model (fluid vs. crystallized skills) was a good fit for participants aged 65 and above, while younger participants’ performance was better explained using a three-factor model of language, executive functions, and memory20. These studies suggested that measurement invariance was not met over the tested groups, raising concerns about the consistency across age of the constructs measured by these tests, as identified by the factorial representations. This raises the question of the validity of comparisons in cognitive scores between young and older adults, considering that the tests used to measure these cognitive scores might relate to different behavioral dimensions or factors across age groups.

In this context, we examined the assumption of measurement invariance for some well-established behavioral tests administered to both healthy young and older adults. Specifically, we examined quantitative differences between young and older subjects using validated tests from the NIHTB-CB battery and the Mini-Mental State Examination (MMSE), in addition to validated supplementary tests. We also highlighted differences in cognitive scores that might be influenced by sex or education. Correlations between the different tasks, in the across-subject dimension, were assessed for each group, with an emphasis on their variability between and within each domain (e.g., executive function, memory). Lastly, with factor analyses, we investigated the qualitative equivalence of the constructs measured by validated tests between the young and older groups. Our analysis revealed that not all tasks were functionally (psychometrically) equivalent in both age groups.

Methods

Data description

This study used the cognitive measures from the neurocognitive aging dataset21,22. The dataset was collected from 163 younger adults, aged 18–34 years (males: 70, females: 93), and 120 older adults, aged 60–89 years (males: 54, females: 66), at two different recording sites. All participants had a minimum of 12 years of schooling. The administered battery included tasks from the NIHTB-CB23: the Rey Auditory Verbal Learning, Picture Sequence Memory, Flanker Inhibitory Control and Attention, Dimensional Change Card Sort (DCCS), List Sorting Working Memory (LSWM), Picture Vocabulary, and Oral Reading Recognition tests. In addition, validated auxiliary tasks were also included: the Verbal Paired Associates from the Wechsler Memory Scale-IV24, the Associative Recall Paradigm25, Shipley-2 Vocabulary26, Trail Making Test B-A (TMT)27, the Reading Span Task28, and the Symbol Digit Modalities Test (SDMT)29. The Mini-Mental State Examination (MMSE) was also administered to all participants. These tasks were chosen to target the episodic memory, semantic memory, executive functions and processing speed domains21; a detailed description of the tasks can be found in Table S1 of the Online Resource. Demographic data including age, gender, and education level were also collected.

Data preparation

Raw performance measures in each task were obtained21, where each task was evaluated using either accuracy, reaction time, or a composite score of accuracy and reaction time, as instructed by the tests’ manuals (Table S1). For the purposes of this study, the following preprocessing steps were carried out. First, the Trail Making B-A score was given as the time difference between the two parts of the test; we therefore reversed the sign of this score so that, on every test, a higher score reflected better performance14. Second, missing values were imputed using Multivariate Imputation by Chained Equations (MICE) for each age group separately. In the current version of the dataset, less than 5% of each variable was missing, with a total of 16.56% from the young group and 20.83% from the older group. Third, raw scores were transformed to Z-scores for all measures and used for all the following steps, except when comparing task performance between the two age groups with 2-sample t-tests. Lastly, outliers were detected and removed using the boxplot method with a threshold of 3 (resulting in 17 outliers from the young and 8 outliers from the older group). All variables followed a normal distribution as indicated by the kurtosis and skewness estimates. However, multivariate non-normality was observed in the young adult group.
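The z-scoring and outlier-removal steps can be sketched as follows (an illustrative Python sketch, not the authors’ code; the boxplot method is interpreted here as Tukey’s fences with a multiplier of 3, and the score vector is hypothetical):

```python
import numpy as np

def zscore(x):
    """Standardize a score vector (mean 0, SD 1)."""
    return (x - x.mean()) / x.std(ddof=1)

def boxplot_outliers(x, k=3.0):
    """Flag values beyond k * IQR from the quartiles (Tukey's fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Hypothetical raw scores for one task; the last value is an outlier.
scores = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 40.0])
mask = boxplot_outliers(scores)
clean = scores[~mask]
z = zscore(clean)  # standardized scores used in subsequent analyses
```

Per the pipeline above, standardization would be applied within each age group after imputation, so that group-specific means and variances do not leak across groups.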

Data analysis

We implemented three analyses (noted Analysis A, B and C) to address different but complementary questions.

Analysis A involved between-group comparisons of scores for each task and the assessment of across-subject correlations between tasks. Specifically, multiple t-tests were used to assess performance differences between the two age groups and between genders at a Bonferroni-corrected significance level of p < 0.05. Pairwise Pearson r correlations, in the across-subject dimension, were calculated to identify relationships between tasks as well as correlations between tasks and other variables (age, years of education) for each age group separately. Our rationale here is that tasks that covary across participants (i.e., their scores rising and falling together across participants) can be considered part of the same domain (positive correlation) or opposite domains (negative correlation). This analysis yielded a 17 × 17 correlation matrix for each group. The statistical difference between the correlation matrices was assessed using Fisher’s method at a Bonferroni-corrected significance level of p < 0.05. Finally, we also tested whether all participants, regardless of their age, could be clustered into two groups on the basis of their behavioral scores alone. To that end, K-means clustering was used to reclassify participants into groups based on their cognitive performance rather than their age. Specifically, the z-scores of the cognitive measures from all participants were submitted to K-means clustering with a pre-specified number of clusters of two. The clustering provided a means to examine whether the behavioral score pattern (range, variance, correlation) is specific to each age group, i.e. whether the data-driven groups obtained with K-means match the original young and older groups. A good match between the data-driven K-means groups and the original age groups would provide strong evidence that variation in behavioral scores indeed reflects age-related effects.
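The test for a difference between two independent correlations via Fisher’s r-to-z transform can be sketched as follows (a minimal Python sketch; the correlation values are hypothetical, and the group sizes approximate the post-exclusion samples):

```python
import numpy as np
from scipy import stats

def fisher_r_diff(r1, n1, r2, n2):
    """Two-sided test for the difference between two independent
    correlations using Fisher's r-to-z transform."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Hypothetical example: a task pair correlates at r = 0.55 in the young
# group (n = 146) but at r = 0.20 in the older group (n = 112).
z, p = fisher_r_diff(0.55, 146, 0.20, 112)
```

In the analysis described above, this test would be applied to each of the 17 × 17 cell pairs, with the resulting p-values Bonferroni-corrected.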

Analysis B identified the low-dimensional behavioral representations (i.e. constructs) of the included cognitive measures using factor analyses. Confirmatory factor analysis was used to fit a baseline model in each group using a robust maximum likelihood estimator to account for the violation of the multivariate normality assumption30. The cognitive ability model was based on the Cattell-Horn-Carroll (CHC) theoretical model as well as on prior work on the factor structure of the NIHTB-CB battery18,20,31,32. The hypothesized model contained three factors: (1) long-term storage and retrieval (Glr) (Verbal Paired Associates: Immediate Recall, Delayed Recall, Delayed Free Recall; Associative Recall; NIH Cognition Rey Auditory; NIH Cognition Picture Sequence Memory), (2) crystallized intelligence (Gc) (Shipley Vocabulary; NIH Cognition Picture Vocabulary; NIH Cognition Oral Reading Recognition), (3) visual processing (Gv)/processing speed (Gs) (Trail Making Task: B-A; NIH Cognition Flanker Inhibitory Control and Attention; NIH Cognition Dimensional Change Card Sort; NIH Cognition List Sort Working Memory). This model was developed following the rule of thumb that at least three indicators are required per factor to identify a factor model properly33. Processing speed could not be identified as a separate factor, which is a common problem when an insufficient number of tests is administered to support an independent factor34,35. We relied on the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) to judge the fit of the model. CFI and TLI values larger than 0.9 and RMSEA and SRMR values smaller than 0.07 indicate an adequate fit36,37. Next, we tested for measurement invariance using Multi-Group Confirmatory Factor Analysis (MGCFA), including configural, metric, scalar, and strict models.
These models were compared statistically using the difference in their chi-square statistics38.
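The chi-square difference test between two nested invariance models can be sketched as follows (illustrative Python with hypothetical fit statistics; note that with a robust estimator, as used here, the scaled Satorra-Bentler version of this test would be required in practice):

```python
from scipy import stats

def chi2_diff_test(chi2_constrained, df_constrained, chi2_free, df_free):
    """Ordinary chi-square difference test between two nested CFA models.
    A significant result means the added equality constraints worsen fit,
    i.e. the higher level of invariance is rejected."""
    d_chi2 = chi2_constrained - chi2_free
    d_df = df_constrained - df_free
    p = stats.chi2.sf(d_chi2, d_df)
    return d_chi2, d_df, p

# Hypothetical example: a metric-invariance model adds 11 equality
# constraints and raises chi-square by 25.3 relative to the configural model.
d_chi2, d_df, p = chi2_diff_test(236.6, 155, 211.3, 144)
```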

Since full measurement invariance could not be established across groups (see Results Section below), a follow-up Exploratory Factor Analysis (EFA) was conducted for each age group separately to model the underlying structure of the cognitive measures without imposing strict theoretical constraints39,40. The factorability of the data was evaluated using the Kaiser-Meyer-Olkin (KMO) test and Bartlett’s sphericity test41. KMO represents the percentage of variance among variables that could be considered common variance, with higher KMO values indicating the suitability of the data for factor analysis42. Bartlett’s sphericity test determines whether a correlation matrix differs considerably from the identity matrix43, with significant results indicating the suitability of the variables in the correlation matrix for factor analysis. Several factor extraction methods can be used for EFA, such as Maximum Likelihood (ML) and Principal Axis Factoring (PAF). Here, we used PAF, as it is recommended for relatively small samples44,45,46, with oblique rotation (Promax) for easier interpretation of the results. To determine the optimal number of factors, Parallel Analysis (PA) was implemented44. It compares the results of factor analyses conducted on the original data to those from randomly generated synthetic data with the same dimensions as the original dataset, over multiple iterations (here, 100 iterations). Factors are retained as long as their eigenvalues from the original data exceed the corresponding eigenvalues from the synthetic data. Results from PA were also compared to the scree plot, and every number of factors between the smaller and larger suggestions of these two methods was tested. The best model was selected based on interpretability, where each factor should have a minimum of 3 variables loading saliently on it20,47.
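Parallel analysis can be sketched as follows (an eigenvalue-based Python sketch on synthetic data, not the authors’ implementation; the paper’s PA was run alongside PAF extraction, whereas this minimal version compares correlation-matrix eigenvalues):

```python
import numpy as np

def parallel_analysis(data, n_iter=100, seed=0):
    """Horn's parallel analysis: retain factors whose observed eigenvalues
    exceed the mean eigenvalues of random data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand_eig = np.zeros(p)
    for _ in range(n_iter):
        r = rng.standard_normal((n, p))
        rand_eig += np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    rand_eig /= n_iter
    return int(np.sum(obs_eig > rand_eig))

# Synthetic demo: 200 subjects, 6 variables driven by 2 latent factors.
rng = np.random.default_rng(1)
f = rng.standard_normal((200, 2))
loadings = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
x = f @ loadings.T + 0.5 * rng.standard_normal((200, 6))
n_factors = parallel_analysis(x)
```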
Furthermore, a hierarchical clustering analysis, with correlation as the distance metric, was performed to verify the EFA results in each group. The final step involved comparing the resulting factor model across the two age groups. All factor analyses were performed in RStudio (v1.1.456) using the lavaan package.
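The verification step via hierarchical clustering of tasks with a correlation-based distance can be sketched as follows (illustrative Python on synthetic data; the linkage method is an assumption, as the text does not specify one):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_tasks(scores, n_clusters):
    """Agglomerative clustering of tasks using 1 - r as the distance."""
    r = np.corrcoef(scores, rowvar=False)
    dist = squareform(1 - r, checks=False)  # condensed distance matrix
    tree = linkage(dist, method='average')
    return fcluster(tree, t=n_clusters, criterion='maxclust')

# Synthetic demo: four tasks forming two correlated pairs.
rng = np.random.default_rng(0)
base1, base2 = rng.standard_normal((2, 300))
scores = np.column_stack([
    base1 + 0.3 * rng.standard_normal(300),
    base1 + 0.3 * rng.standard_normal(300),
    base2 + 0.3 * rng.standard_normal(300),
    base2 + 0.3 * rng.standard_normal(300),
])
labels = cluster_tasks(scores, 2)
```

Tasks that load on the same EFA factor should end up in the same cluster, which is the sense in which the clustering "confirms" the factor solution.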

In Analysis C, regression models were built to assess how well age can be predicted from behavioral scores. First, an Ordinary Least Squares (OLS) regression with lasso regularization and leave-one-out cross-validation was used to predict the age of participants based on their performance on all tasks (age as the dependent variable and behavioral scores as the independent variables). Second, we tested whether the relationships between the cognitive measures and age observed in one age group hold true in the other age group. This was implemented in two stages: (1) an OLS regression model was fitted using the behavioral scores as independent variables and age as the dependent variable in one age group (e.g. the young adults group), and (2) that estimated regression model was then used to predict the age of the other group (e.g. the older adults group). Given the recent interest in brain age48,49, and assuming that the behavioral tests reflect brain function, this analysis helped (1) assess how well age is reflected in the cognitive scores measured with the neuropsychological assessment tool used, and (2) test whether scores from older adults represent a natural progression (extrapolation) of scores from the young group. The latter would essentially verify whether the age-behavior relationship is consistent across different age groups.
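The two stages can be sketched as follows (an illustrative Python sketch using scikit-learn on synthetic stand-in data, since the real scores are not reproduced here; group sizes and effect directions are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: 17 behavioral z-scores and ages per group,
# with a deliberately different age-behavior slope in each group.
rng = np.random.default_rng(0)
X_young = rng.standard_normal((140, 17))
age_young = 26 + 4 * X_young[:, 0] + rng.standard_normal(140)
X_old = rng.standard_normal((110, 17))
age_old = 74 - 5 * X_old[:, 0] + rng.standard_normal(110)

# (1) Lasso-regularized regression with leave-one-out cross-validation
#     within one group.
lasso = LassoCV(cv=LeaveOneOut(), n_alphas=30).fit(X_young, age_young)

# (2) Cross-group ("mixed") model: fit on the young group, then predict
#     ages in the older group and evaluate the mean absolute error.
ols = LinearRegression().fit(X_young, age_young)
mae = mean_absolute_error(age_old, ols.predict(X_old))  # large if the
# age-behavior relationship does not extrapolate across groups
```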

Results

Analysis A

All task scores were significantly different between the age groups at the corrected p < 0.05 level. As expected, young adults had higher scores on the MMSE, episodic memory and executive function measures, whereas older adults had significantly better performance on the semantic knowledge tasks. This pattern of differences was also reported by the original authors and held true after controlling for gender, education and recording site21. However, when we examined gender effects within each age group, we found that in the older adult group, female participants had higher scores on the verbal episodic memory tasks (Immediate Recall (p < 0.01), Delayed Recall (p = 0.02), Delayed Free Recall (p = 0.05) and Associative Recall (p = 0.01)) compared to male participants.

Figure 1 shows the correlations between the tasks, including age and years of education, for each age group. Two semantic memory tasks were moderately correlated with education, while the Symbol Digit Modalities Test and the Dimensional Change Card Sort were moderately negatively correlated with age in the older adult group only. In addition, using Fisher’s method, stronger correlations were observed between the list sorting task and the semantic memory tasks in the young adults compared to the older participants (Fig. 2).

Figure 1

Spearman correlation matrix of the neurocognitive measures along with age and education for each age group. vpa_imm_tot: Verbal Paired Associates Immediate Recall, vpa_delay: Verbal Paired Associates Delayed Recall, vpa_fr_delay: Verbal Paired Associates Delayed Free Recall, associative_recall: Associative Recall, nihcog_rey: NIH Cognition Rey Auditory Verbal Learning: Immediate Recall, nihcog_psm: NIH Cognition Picture Sequence Memory, shipley_vocab: Shipley Vocabulary, nihcog_orr: NIH Cognition Oral Reading Recognition, nihcog_pva: NIH Cognition Picture Vocabulary, nihcog_lswm: NIH Cognition List Sorting Working Memory, nihcog_dccs: NIH Cognition Dimensional Change Card Sort, nihcog_flanker: NIH Cognition Flanker Inhibitory Control and Attention, trails_b.a: Trail Making Task: B-A, sdmt_oral: Symbol Digit Modalities Test.

Figure 2

Variables with significant correlation differences (p < 0.05 corrected) between the two age groups. vpa_imm_tot: Verbal Paired Associates Immediate Recall, vpa_delay: Verbal Paired Associates Delayed Recall, vpa_fr_delay: Verbal Paired Associates Delayed Free Recall, associative_recall: Associative Recall, shipley_vocab: Shipley Vocabulary, nihcog_pva: NIH Cognition Picture Vocabulary, nihcog_lswm: NIH Cognition List Sort Working Memory.

Interestingly, the two groups resulting from the data-driven K-means clustering were significantly different in age, with a 90% match to the original age groups. Specifically, Group 1 comprised 13 young and 133 older adults, whereas Group 2 comprised 100 young and 10 older adults.

Analysis B

The proposed cognitive ability model, based on the Cattell-Horn-Carroll (CHC) theoretical model, was tested with CFA in the two groups. CFA results indicated poor fit in both young adults (χ²(74) = 184.424, p < 0.001; RMSEA = 0.1; SRMR = 0.126; CFI = 0.771; TLI = 0.719) and older adults (χ²(74) = 125.741, p < 0.001; RMSEA = 0.08; SRMR = 0.089; CFI = 0.905; TLI = 0.883). Modification indices were examined, and two error terms were allowed to correlate in both groups (nihcog_dccs-nihcog_flanker, vpa_imm_tot-vpa_delay); these scores originate from the same domain, and each pair shares the same scoring method. This resulted in the following fit indices: young adults (χ²(72) = 99.246, p = 0.018; RMSEA = 0.052; SRMR = 0.071; CFI = 0.94; TLI = 0.924) and older adults (χ²(72) = 112.015, p = 0.002; RMSEA = 0.069; SRMR = 0.076; CFI = 0.932; TLI = 0.914), suggesting that the model shows acceptable fit in both groups according to the predefined goodness-of-fit criteria.
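As a sanity check, the reported baseline RMSEA values can be recovered from the chi-square statistics. Assuming the common point-estimate formula RMSEA = sqrt(max(χ² − df, 0) / (df · (n − 1))) and the post-exclusion sample sizes (146 young, 112 older; an assumption, since the text does not state which n entered the CFA):

```python
import numpy as np

def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a model chi-square (common n-1 formula)."""
    return np.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

# Reported baseline models: chi2(74) = 184.424 (young, n ~ 146)
# and chi2(74) = 125.741 (older, n ~ 112).
rmsea_young = rmsea(184.424, 74, 146)  # close to the reported 0.1
rmsea_old = rmsea(125.741, 74, 112)    # close to the reported 0.08
```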

Next, we tested the model with the modification indices using MGCFA. Configural invariance models demonstrated a good fit to the data. The test of the next higher level of invariance, metric invariance, requires that the factor loadings between items and constructs are invariant across groups. This model was rejected according to the chi-square difference test. Partial invariance was tested at this level by relaxing constraints on the variable loadings that showed the largest modification index, one variable at a time. First, the loadings of the LSWM from the NIHTB-CB were freely estimated; however, the chi-square test still indicated a lack of metric invariance. After re-examining the modification indices, the Verbal Paired Associates delayed recall loadings were freely estimated across the two groups. This resulted in an improved fit with p = 0.076. Strong (scalar) invariance was established after constraining the intercepts of the model from the previous step to be equal in both groups. Lastly, partial strict invariance was shown at p = 0.054; see Table 1 for the results of the chi-square difference tests.

Table 1 Fit indices for MGCFA models and difference tests between the two age groups. nihcog_lswm: NIH Cognition List Sorting Working Memory, vpa_fr_delay: Verbal Paired Associates Delayed Free Recall. * indicates significant difference (chi-square difference tests between the original model without assuming measurement invariance and the comparison model).

Next, EFA was conducted to model the underlying structure of the cognitive measures for each group separately. Bartlett and KMO statistics for both groups’ correlation matrices fell within the acceptable ranges for factor analysis: young group (KMO = 0.72, Bartlett’s test p < 0.001), older adults (KMO = 0.73, Bartlett’s test p < 0.001). PA resulted in a 3-factor solution for both groups, while the scree plot supported a 4-factor model. Both were evaluated in each group. The 4-factor models were not clearly interpretable because of cross-loadings for some of the variables (Table S2 of the Online Resource). Hence, the 3-factor models were selected for both age groups. The observed variables’ loadings on the resulting factors are shown in Table 2 along with the eigenvalues, cumulative explained variance, and internal consistency. The factors in both groups could be interpreted as semantic memory, episodic memory, and executive function factors. However, there were differences in the executive function loadings between the two groups. In the young adults’ group, the Symbol Digit Modalities Test was associated with the episodic memory tasks, and the List Sorting Working Memory and Trail Making B-A tasks were associated with the semantic memory tasks, thus leaving a factor with only two loadings. In the older adults’ group, on the other hand, both tasks were associated with the executive functions factor. Also, the Trail Making B-A test had low loading scores in both groups. The resulting clusters from the hierarchical clustering analysis confirmed the EFA results (Figure S1 of the Online Resource).
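Table 2 reports internal consistency for each factor. Assuming this is Cronbach’s alpha (the statistic is not named in the text), it can be computed from a subjects × tasks score matrix as follows (an illustrative sketch with synthetic data):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Synthetic demo: three tasks driven by one latent ability should show
# high internal consistency.
rng = np.random.default_rng(0)
latent = rng.standard_normal(200)
tasks = np.column_stack(
    [latent + 0.5 * rng.standard_normal(200) for _ in range(3)]
)
alpha = cronbach_alpha(tasks)
```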

Table 2 3-factor latent cognitive structure of the two age groups as indicated by exploratory factor analysis. Columns 2 and 3 for the older adults’ group were swapped for easier comparison between the groups. vpa_imm_tot: Verbal Paired Associates Immediate Recall, vpa_delay: Verbal Paired Associates Delayed Recall, vpa_fr_delay: Verbal Paired Associates Delayed Free Recall, associative_recall: Associative Recall, nihcog_rey: NIH Cognition Rey Auditory Verbal Learning: Immediate Recall, nihcog_psm: NIH Cognition Picture Sequence Memory, shipley_vocab: Shipley Vocabulary, nihcog_orr: NIH Cognition Oral Reading Recognition, nihcog_pva: NIH Cognition Picture Vocabulary, nihcog_lswm: NIH Cognition List Sorting Working Memory, nihcog_dccs: NIH Cognition Dimensional Change Card Sort, nihcog_flanker: NIH Cognition Flanker Inhibitory Control and Attention, trails_b.a: Trail Making Task: B-A, sdmt_oral: Symbol Digit Modalities Test.

Analysis C

The leave-one-out cross-validation analysis revealed that the cognitive measures incorporated in the study accounted for 39.6% of the variance in older adults’ age, whereas in the group of young adults, they accounted for a comparatively lower 18.5%. This means that age was better predicted by the behavioral scores in the older group than in the younger group. The results from the mixed regression models are shown in Fig. 3 (i.e., predicting participants’ age in one age group from the regression model of the other age group). The performance of these models was evaluated using the mean absolute error, and both models had very low predictive power, with an error higher than 45 years. This very low predictive power cannot be due to differences in intercepts only, as there was no correlation between true and predicted age (see regression lines of the mixed models in Fig. 3). This implies that the relationship between behavioral scores and age in the young group cannot be simply extrapolated to predict age in the older group, suggesting that age-behavior relationships are age-group-dependent.

Figure 3

Different OLS regression models to predict participants’ age using their cognitive measures. A) Model fitted on the young adults’ data to predict their age. B) Model fitted on the older adults’ data to predict the young group’s age. C) Model fitted on the young adults’ data to predict the older group’s age. D) Model fitted on the older adults’ data to predict their age.

Discussion

Fundamentally, a between-group comparison of behavioral scores assumes that the tests used to collect these scores measure the same concepts in both groups. There is a rich literature in psychometrics on construct validity, which addresses measurement validity by ensuring that multiple observable or measurable indicators (e.g. test scores) adequately reflect the constructs of interest (e.g. cognitive abilities). Here, we showed that, despite using the same tests in both age groups, some executive functions tests did not load onto the same factors in both groups. Our study calls for more caution when comparing scores between different age groups, even when using well-studied tests such as the NIHTB-CB battery.

Overall, the comparisons in task performance between the two age groups were consistent with previous work2,50, including a decline in executive functions presumably due to a significant decrease in brain connectivity51,52, which also mediates53,54 a decline in episodic memory55,56,57. Our results also confirmed previous reports on healthy aging showing that older people exhibit relative preservation or even improvement in semantic memory, in particular in tasks that rely on vocabulary storage56,58,59. Although sex differences in cognitive performance have not been consistent across studies60,61, our results revealed better performance in females than males on verbal episodic memory tasks in the older group. This has important clinical implications, as verbal memory scores are typically used in the diagnosis of amnestic mild cognitive impairment (aMCI) and Alzheimer’s disease (AD) dementia, often without sex adjustment. This might relate to a different underlying neuropathology in the transition from aMCI to AD dementia in women, thus introducing a sex bias in the detection of MCI62,63. This sex effect in the older adult group might be due to differences in comorbidities along the aging process, as women tend to have a lower prevalence of focal lesions (microbleeds, lacunes, and cortical infarcts) than men64.

The difference in the associations (i.e. correlations) between the cognitive scores of the two age groups (Fig. 2) might reflect age-related changes in brain networks7,65. Specifically, there was an increase in the shared variance among verbal episodic memory tasks with increasing age, which is a novel observation not reported in previous studies. On the other hand, the association between the list sorting working memory task and the two semantic memory tasks (Shipley vocabulary and picture vocabulary) did not persist with advancing age. A recent study supports this observation: the capacity of visual working memory was found to depend on semantic knowledge66. Similarly, the regression results further highlight the complex interplay between age and cognition. Cognitive scores in late adulthood were more sensitive to chronological age than in young adulthood, suggesting that age-related differences in cognition are more pronounced in later life. These results are in line with previous research showing that cognition is not a unitary construct and that cognitive abilities have different developmental trajectories that can vary significantly across the lifespan67,68.

Both Analyses B and C provided evidence that the constructs or domains measured by well-validated cognitive tests differed across age groups. The mixed regression models (Analysis C) suggested that brain-behavior relationships might not be the same for both age groups. This can explain why the model used to predict age from behavioral scores in one age group failed to accurately predict age in the other age group. Likewise, our factor analysis (Analysis B) offered important insights into the main constructs measured by the NIHTB-CB battery and other auxiliary tests across the lifespan. In contrast to previous reports on similar tasks (e.g.,18), full measurement invariance of the hypothesized CHC model could not be established, owing to the multifaceted nature of the targeted executive functions processes and the limited number of tests included to measure them. Overall, the NIHTB-CB fluid tests have recently been found to have poor to adequate construct validity, with low correlations between the executive functions tests in particular69,70. Although partial strict invariance was achieved here, other studies argued that partial invariance is insufficient to assure meaningful comparisons, given that the fit relied on data-driven modification indices with a high risk of overfitting the sample data71. Indeed, simulation studies have suggested that modification indices might limit the generalizability of the resulting model72,73,74. Likewise, Steinmetz showed that, while partial invariance allows meaningful comparisons of latent means, it may be unreliable for comparing observed scores75. We acknowledge that interpreting invariance for group comparison is not always straightforward (e.g.,76,77,78). Furthermore, our EFA results revealed inconsistent loadings and latent cognitive models across the two age groups.
More specifically, processing speed, executive function and working memory scores were scattered across the 3 factors in the young adults' group, whereas in the older group they aggregated into a factor separate from the episodic and semantic memory tests. Notably, these three domains are known to account for age-related cognitive decline79. This shift in task loadings could be explained by the fact that memory impairment may manifest at a later age or at a different rate than impairment in executive functions; consequently, these two cognitive constructs may become more distinct with advancing age80. Indeed, recent work has shown that performance in executive functions is better explained by a 2-factor model in older adults compared to a 3-factor model in young adults19,81. Moreover, it has been suggested that as people get older, they rely more on general executive control processes due to the reduction and reallocation of cognitive resources, as opposed to younger adults19. Recent studies have suggested that typical executive function tests may not accurately measure individual differences in executive functions, but rather reflect differences in information processing speed82. Other findings have shown that measures of executive functions do not reveal a distinct latent variable unless they are combined with tasks that assess other cognitive processes, such as processing speed or working memory capacity83. We note that the existing literature acknowledges the difficulty in estimating and interpreting the factor structure of executive functions; overall, it suggests that measures of executive functions might be either unreliable or invalid in measuring individual differences in executive functions82,84.
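The shift in loadings described above can be illustrated with a minimal simulation, assuming synthetic data and generic software (scikit-learn's `FactorAnalysis` with varimax rotation) rather than the present study's actual scores or analysis pipeline; the loading matrices below are hypothetical patterns chosen only to mimic tests that load jointly in one group and separately in another.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

def simulate_group(loadings, n=300, noise=0.5):
    """Draw test scores from a latent factor model: X = F @ L.T + E."""
    factors = rng.normal(size=(n, loadings.shape[1]))
    return factors @ loadings.T + noise * rng.normal(size=(n, loadings.shape[0]))

# Hypothetical loading patterns for six tests on two factors: in one group
# the last two (speed/EF-like) tests spread over both factors, in the other
# they align with the first factor only.
L_young = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0.7, 0.7], [0.7, 0.7]])
L_old = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [1, 0]])

for name, L in [("young", L_young), ("older", L_old)]:
    X = simulate_group(L)
    fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
    loadings = fa.components_.T  # tests x factors
    print(name, "estimated loadings:\n", np.round(loadings, 2))
```

Fitting EFA separately per group and comparing the recovered loading matrices is the basic logic behind detecting group-dependent factor structures, although formal invariance testing (as in Analysis B) constrains parameters across groups instead of eyeballing loadings.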

These findings from the three analyses have implications for how to interpret differences in cognitive scores between young and older adults. For example, the List Sorting Working Memory task was found to have variant intercepts and loadings between the two age groups, and it was also associated with different factors in each group. Therefore, a significant difference in performance on that task between young and older adults may reflect different constructs being measured in the two groups. It is interesting to note that prior work has shown that the List Sorting Working Memory test can load onto different factors, for instance loading similarly to other working memory tasks18 despite significant correlations with executive function tasks85. We acknowledge that the NIHTB-CB battery has been tested and validated in different groups86, and there is already evidence in the current literature of its clinical usefulness for the assessment of cognitive abilities in diverse populations87,88,89,90,91,92. However, in line with previous studies, our findings pointed to the complexity of identifying and measuring individual differences in executive functions82,84, and raised questions about the validity of the constructs measured by the NIHTB-CB20,70. Fundamentally, any comparison between young and older adults should reflect the same construct or domain to ensure valid and interpretable differences; there is no clear interpretation of mean differences in the absence of a common structure93.

Several considerations can explain our findings. For instance, our analysis did not explicitly account for the influence of different subcomponents of a given construct that may vary across age groups94,95, which could translate into the same factor being dominated by different subcomponents. Furthermore, the older group might show more pronounced heterogeneity than the younger group, due for instance to substantial inter-individual variability in the rates of decline across abilities96. Although our K-means clustering produced a grouping that matched the original age groups, we cannot rule out that subgroups might exist within the older group. This possibility raises interesting questions (hypotheses) that warrant future research. For example, if the older group can be subdivided into homogeneous subgroups, one can hypothesize that a subgroup with the same identified low-dimensional representation (i.e., the same factors) as the young group would most likely follow a healthy aging process, whereas a subgroup with a different low-dimensional representation than the young group might indicate a high risk of pathological aging for the individuals belonging to that subgroup.
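The agreement between a data-driven clustering and the a priori age grouping mentioned above can be sketched as follows, assuming entirely synthetic composite scores (the group means, sample sizes and three-dimensional score space are illustrative, not the study's data); the adjusted Rand index quantifies how well the recovered clusters match the known group labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)

# Hypothetical standardized composite scores (e.g., speed, memory, EF) for
# two simulated cohorts whose means differ, mimicking young vs. older groups.
young = rng.normal(loc=[0.6, 0.5, 0.4], scale=0.4, size=(100, 3))
older = rng.normal(loc=[-0.6, -0.5, -0.4], scale=0.4, size=(100, 3))
scores = np.vstack([young, older])
age_group = np.array([0] * 100 + [1] * 100)

# Cluster blindly, then compare cluster labels with the a priori grouping.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
ari = adjusted_rand_score(age_group, km.labels_)
print("agreement with a priori age groups (ARI):", round(ari, 2))
```

Note that a high agreement at K = 2 does not rule out finer structure: re-running the clustering within the older group alone (e.g., scanning K with a silhouette or gap criterion) would be the natural way to probe the hypothesized subgroups.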

There are some conceptual and methodological issues that need to be acknowledged. First, the cross-sectional design of this study limits inferences about any causal associations between age differences and cognitive ability, in addition to the poor interpretability and validity of individual differences due to confounding effects from within-subject associations97. Second, executive functions involve a heterogeneous set of processes that have complex relationships with processing speed and working memory98,99. This might complicate the interpretation of factor analysis results, making executive function tests not well suited for factor analyses (see discussion in7). Third, the dataset14 used here has missing values that we imputed for each age group separately. Fourth, the performance indicators used could also account for the non-invariance between the models of the two age groups: the common time component in the Flanker, DCCS and TMT scores (see scoring criteria in Table S1) may have contributed to the age differences in the latent factor structure, as reaction speed differs significantly between the two age groups; a similar pattern has been reported previously20. Fifth, the similarity between tests was computed as a correlation across subjects. While this covariance-based approach is well suited to capture variability across subjects (e.g.,100), it is also notoriously sensitive to outliers101; we therefore down-weighted their influence before computing the correlations between tests. Sixth, measurement invariance is necessary for an uncomplicated interpretation of validity correlations and differences in test scores across groups102. However, observed correlations (cf. Fig. 2) may also vary across age groups because of (1) real inherent differences in the correlations between the same invariant constructs, or (2) existing differences in the reliability of the test scores.
Put another way, different patterns of correlations between the same set of cognitive scores can be observed in different age groups even under measurement invariance, so variability in correlations does not necessarily reflect differences in cognitive constructs. Last but not least, the generalizability of the 3-factor structure, identified here in a data-driven way with exploratory factor analysis, requires future investigation. Our main assumptions during the identification of the 3-factor structure were that (1) test scores based on reaction times and accuracy represented accurate measurements of task performance in both age groups, (2) correlated tests represented a common factor, (3) weakly correlated tests represented distinct factors, and (4) each identified factor was equated with one broad construct. Discrepancies between studies in the exact underlying factorial structure of the NIHTB-CB battery might be explained by differences in how those assumptions were satisfied.
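The fifth point above, on the outlier sensitivity of between-test correlations, can be illustrated with a minimal sketch. The winsorizing scheme below (clipping robust z-scores before computing the Pearson correlation) is one simple way to down-weight outliers; it is an assumed example, not the specific weighting used in this study.

```python
import numpy as np

def robust_corr(x, y, z_cut=2.5):
    """Pearson correlation after clipping robust z-scores (median/MAD
    based) at z_cut, which down-weights extreme values. Illustrative
    scheme only; many alternatives exist (e.g., rank correlations)."""
    def winsorize(v):
        v = np.asarray(v, float)
        med = np.median(v)
        mad = 1.4826 * np.median(np.abs(v - med))  # consistent with SD under normality
        return np.clip((v - med) / mad, -z_cut, z_cut)
    return np.corrcoef(winsorize(x), winsorize(y))[0, 1]

rng = np.random.default_rng(2)
a = rng.normal(size=200)                       # scores on test A
b = 0.7 * a + rng.normal(scale=0.5, size=200)  # correlated scores on test B
b[0] = 50.0                                    # a single extreme outlier
print("plain  Pearson:", round(np.corrcoef(a, b)[0, 1], 2))
print("robust Pearson:", round(robust_corr(a, b), 2))
```

A single extreme value is enough to drag the plain correlation far from the value computed on the remaining 199 subjects, whereas the down-weighted estimate stays close to the underlying association.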

In conclusion, our study shows that age affects the factorial structure of cognitive assessments in neurotypical adults. It raises questions about the validity of significant differences between age groups when the tests used measure different constructs. It also highlights the challenge of comprehensively measuring age-dependent changes in cognitive abilities with a set of tests that span multiple cognitive dimensions. For instance, some tasks, such as List Sorting Working Memory, can behave differently across age groups, presumably because their successful execution places variable demands on different processes in the two groups. In this context, it is important that users of psychometrically validated cognitive assessment batteries are aware of the tests that are not factorable, as well as of the impact of demographics on task performance (here, age and gender were shown to affect some tests). Overall, existing test batteries have been extremely useful in the clinical setting for both diagnostic and prognostic purposes. Future work needs to investigate how the constructs (factors) measured by these test batteries generalize across populations with diverse demographics, spoken languages, affective skills, socioeconomic backgrounds and co-morbidities.