Abstract
Although the use of short self-report measures is common practice in biobank initiatives, such a phenotyping strategy is inherently prone to reporting errors. To explore challenges related to self-report errors, we first derived a reporting error score in the UK Biobank (UKBB; n = 73,127), capturing inconsistent self-reporting in time-invariant phenotypes across multiple measurement occasions. We then performed genome-wide scans on the reporting error score, applied downstream analyses (linkage disequilibrium score regression and Mendelian randomization) and compared its properties to the UKBB participation propensity. Finally, we improved phenotype resolution for 24 measures and inspected the changes in genomic findings. We found that reporting error was present across all 33 assessed self-report measures, with repeatability levels as low as 47% (childhood body size). Reporting error was not independent of UKBB participation, as evidenced by the negative genetic correlation between the two outcomes (rg = −0.77), their shared causes (for example, education) and the loss in self-report accuracy following participation bias correction. Across all analyses, the impact of reporting error ranged from reduced power (for example, for gene discovery) to biased estimates (for example, if present in the exposure variable) and attenuation of genome-wide quantities (for example, 21% relative attenuation in SNP heritability for childhood height). Our findings highlight that both self-report accuracy and selective participation are competing biases and sources of poor reproducibility for biobank-scale research.
Main
Genomic research is often confronted with large-scale datasets containing error in the phenotypic measures as data collection is optimized towards the recruitment of large numbers of people. To reduce participant burden, save resources and maximize sample size, recruitment schemes often favour minimal phenotyping (that is, the administration of short self-report scales) over precision phenotyping (that is, the application of gold-standard measures). In the UK Biobank (UKBB), such self-report measures serve as the primary data source for commonly studied phenotypes, notably socio-demographic data, health-related information, behavioural outcomes and lifestyles. Although all phenotypes are measured with some degree of error, including those objectively ascertained (for example, biological measures or laboratory assays), error induced by brief self-report measures poses a particular challenge when studying the associations of those phenotypes with genetic or other phenotypic information. As the reported information is influenced by subjective interpretation, misreporting or cognitive limitations, error in self-report measures constitutes a potentially greater threat to the validity of findings.
While the early stages of genome-wide research were dominated by a push towards ever-growing sample sizes, challenges related to phenotype ascertainment are increasingly recognized as a non-negligible source of bias in genomic research1,2. Although random error in phenotypes does not lead to bias in single nucleotide polymorphism (SNP) estimates (Supplementary Fig. 1), the resulting measurement imprecision and increased type-II error rates are one cause of the large sample sizes required in genomic research. If gene discovery is the primary study aim, increasing sample sizes can compensate for random error in the phenotype within the limits of feasibility. However, more problematically, random error puts an upper bound on how much variance can be explained in the phenotype3,4. Downstream genome-wide analyses (GWA) focusing on variance components (for example, heritability estimates5 and polygenic prediction6,7,8) would therefore show (downward) bias in the presence of self-report inconsistencies.
Detecting and correcting self-report errors can be challenging when relying on biobank-scale data because the required validation data are rarely available. However, with the increasing availability of repeated measurements in the UKBB, it is now possible to more systematically explore causes and consequences of self-report inconsistencies across measurement occasions. In this work, we aim to contribute to the growing body of research scrutinizing the impact of study design characteristics and participant behaviour (for example, sampling procedures9,10,11,12, missing data13, study engagement14 and data quality15,16,17) on findings obtained from biobank-scale data. Here we focus on the challenges related to reporting error, defined as inconsistent self-reporting across measurement occasions. To that end, we aim to quantify error in commonly studied UKBB phenotypes and assess its impact on genome-wide quantities. Further, as self-reporting represents only one among numerous participation behaviours, we also explore its interplay with a previously studied participation behaviour known to impact study findings (that is, selective UKBB participation10,12). Such work is not only crucial for the interpretation of findings obtained from existing biobanks but may also help shape strategies aiming to enhance phenotype resolution and recruitment strategies in future biobank initiatives.
Results
Indices of reporting error in the UKBB
We included 33 time-invariant self-report measures to assess inconsistencies in self-reporting. Outlier values were identified and subsequently removed for three of the measures: age at first sexual intercourse (34 outliers removed), childhood sunburns (10 outliers removed) and age of onset of smoking (5 outliers removed). The box and scatter plots of these measures before and after outlier removal are shown in Supplementary Fig. 2. As shown in Fig. 1 (Supplementary Table 1), reporting error was present across all 33 assessed UKBB time-invariant phenotypes, with a mean error estimate of 0.21 (possible range, 0 (absence of error) to 1). High levels of measurement repeatability were present for self-reports providing information about major life events, such as date of birth (R2 > 0.99), number of children (R2 = 0.99) and country of birth (R2 = 0.99). A substantial proportion of self-reports showed questionable levels of repeatability, notably variables that rely heavily on recall of childhood histories, such as childhood sunburns (R2 = 0.53), age at first facial hair (R2 = 0.50) and comparative childhood body size (R2 = 0.47). Figure 1 also shows the level of repeatability for variables containing error due to misreporting and/or temporal variability. Here, self-report measures subject to temporal instability showed particularly low levels of repeatability, notably diet (for example, vitamin D intake in the last 24 hours) and other lifestyle measures (for example, physical activity in the last 24 hours). Although variation among objectively ascertained ‘gold-standard’ measures (for example, height and systolic blood pressure; highlighted in violet in Fig. 1) is free of error due to misreporting, measurement imprecision resulting from other sources (for example, biological fluctuations, technical challenges and data-processing errors) was nevertheless non-negligible for the majority of these measures (for example, sodium concentration and hearing performance). Five UKBB phenotypes had data from directly comparable objective and subjective measures. Estimation of R2 showed that the concordance between the two data sources (objective versus subjective) was low, ranging from R2 = 0.002 (vitamin D, self-report versus blood measure) to R2 = 0.031 (sleep, self-reported versus accelerometer-derived) to R2 = 0.252 (first child’s birth weight, self-reported versus hospital records) (Supplementary Fig. 3 and Supplementary Table 2).
Fig. 1: R2 = variance explained by models regressing PT2 (for example, birth weight reported at follow-up) onto PT1 (for example, self-reported birth weight assessed at baseline), while controlling for follow-up time (timeT2 − T1). Variables with R2 estimates above the grey horizontal line show high levels of repeatability (R2 > 0.92).
Next, we generated phenotype-specific reporting error scores (Fig. 2a), indexing the level of reporting inconsistency per phenotype and UKBB participant. Supplementary Figs. 4 and 5 summarize the contributions of baseline age, follow-up time, their interaction (age × follow-up time) and sex to the reporting error scores, highlighting that the scores varied mostly as a function of follow-up time and its interaction with age. In addition, reporting error was more prevalent among men, as 12 (75%) of the 16 reporting error scores showing significant sex-differential effects were higher in men than in women. The largest sex-differential effect was present for self-reported mother’s age at death, where women showed substantially lower levels of reporting error.
Fig. 2: a, Illustration of a reporting error score for a particular phenotype, derived as the residual scores from a model regressing PT2 (for example, birth weight reported at follow-up) onto PT1 (for example, self-reported birth weight assessed at baseline). The reporting (residual) error scores are shown as the vertical deviations of the observed values (yi) around the fitted line. b, Correlation matrix highlighting significant (P < 0.05, two-sided) Pearson correlation coefficients between the reporting error scores. The labels in bold font highlight variables that were included in PCA. The label highlighted in turquoise (REsum) shows the correlations between the PCA-generated summary score (residualized for follow-up time) and the individual reporting error scores. Darker shades indicate stronger correlations. c, Summary of results from PCA, highlighting the variance explained by PC1 and the loadings of the indicators on PC1.
Assessing the correlations among reporting error scores (Fig. 2b), we found that the majority of correlations were small but positive (170 (96.05%) of the 177 significant correlations). The largest positive correlations were present among measures tapping into similar constructs, such as r(mother’s age at death, father’s age at death) = 0.38 or r(comparative body size at age 10, comparative height size at age 10) = 0.15. Including five of the reporting error scores with n > 50,000 in principal component analysis (PCA; years of education, age when started wearing glasses, father’s age at death, age at first sexual intercourse and year of birth; see Supplementary Fig. 6 for the corresponding box and scatter plots), we found that the first principal component (PC1) explained 21% of the variance. The individual reporting error scores all loaded positively on PC1 (Fig. 2c). On the basis of PC1, we computed the reporting error summary score (REsum), which could be generated for 73,127 individuals taking part in repeat assessments. The demographic characteristics of individuals with and without an available REsum are shown in Supplementary Table 3.
The link between reporting error and UKBB participation
To examine whether reporting error varied as a function of sample representativeness, we first assessed the level of covariation between reporting error and UKBB participation. Phenotypically, we found a negative correlation (rPearson = −0.094) between the REsum and UKBB participation, indicating that a greater willingness to participate in the UKBB links to more consistent self-reporting. Similarly, we observed negative genetic correlations (rg) between reporting error and other participatory behaviours, including the UKBB participation probability (rg = −0.77; 95% confidence interval (CI) = −0.9 to −0.64), recontact availability in the UKBB (rg = −0.67, 95% CI = −0.81 to −0.53) and follow-up (mental health survey) participation (rg = −0.59, 95% CI = −0.72 to −0.45) (Fig. 3b and Supplementary Table 4).
To assess shared and non-shared characteristics between reporting error and UKBB participation, we then tested for associations between a number of baseline characteristics and the two outcomes (Fig. 3a and Supplementary Table 5). Here, significant predictors were differentially linked to the two outcomes: female participants with higher levels of education and lower body mass index (BMI) made fewer reporting errors but showed a greater willingness to take part in the UKBB. Only age predicted the two outcomes in the same direction, with older individuals tending to show more reporting errors while also being more likely to participate in the UKBB. Including all predictors simultaneously in LASSO regression explained around 12% of the variance in UKBB participation and 7% in reporting error (Fig. 3a).
Fig. 3: a, βstd coefficients (and 95% CIs) of variables predicting reporting error (in dark turquoise; n = 73,127) and UKBB participation (in light turquoise; n = 415,066) in univariate regression models. b, rg estimates and corresponding 95% CIs of reporting error (n = 62,131) and UKBB participation (n = 283,749) with other traits. Significant genetic correlations (false-discovery-rate-corrected P (PFDR) < 0.05) are highlighted with black borders. c, Standardized estimates (αstd, with 95% CIs) obtained from MR analyses with reporting error (n = 62,131) and UKBB participation (n = 283,749) as the outcomes. Significant MR estimates (PFDR < 0.05) are highlighted with black borders. Two-tailed tests were used in all instances when testing for statistical significance.
The REsum showed low but significant SNP heritability (\(h^2_{\mathrm{RE_{sum}}}\) = 3.19%, 95% CI = 1.72–4.66%). In line with the phenotypic correlations, reporting error and UKBB participation differentially correlated with most of the socio-educational and behavioural variables included in linkage disequilibrium (LD) score regression (Fig. 3b and Supplementary Table 4). These included intelligence (\(r_{\mathrm{g(reporting)}}\) = −0.85, \(r_{\mathrm{g(participation)}}\) = 0.62), years of education (\(r_{\mathrm{g(reporting)}}\) = −0.81, \(r_{\mathrm{g(participation)}}\) = 0.85) and income (\(r_{\mathrm{g(reporting)}}\) = −0.70, \(r_{\mathrm{g(participation)}}\) = 0.75). Similarly, applying Mendelian randomization (MR) to identify causal factors contributing to reporting error, we found that reporting error and UKBB participation were explained mostly by socio-educational variables, where higher income, years of education and intelligence reduced self-report errors (standardized effects αeducation = −0.34, αincome = −0.33 and αintelligence = −0.27) but increased the probability of UKBB participation (αeducation = 0.59, αincome = 0.54 and αintelligence = 0.32) (Supplementary Table 6).
Figure 4 shows the distributions of the participation (inverse probability) weights and the reporting error (inverse variance) weights. The performance of the inverse variance weights was assessed in terms of reporting error reduction in eight phenotypes, comprising the five used in PCA and three additional phenotypes showing the largest degree of reporting error (that is, body size at age 10, relative age of first facial hair and number of childhood sunburns; Fig. 1). Both the inverse variance weights and the participation weights performed as intended, in that they reduced the error variance in the eight variables inspected for measurement inconsistencies (that is, increasing the level of measurement repeatability R2; Fig. 4b) and made the sample more representative (that is, lowering the mean age and mean level of education; Fig. 4c), respectively. As the variability among the participation weights was large (indicating probable risk of bias due to selective participation), their application resulted in a substantial loss in effective sample size (62%, from n = 63,896 to neffective = 24,437; Fig. 4a). In contrast, the reporting error weights showed little variability, causing a minimal loss in effective sample size (2%, from n = 63,896 to 62,636). The reporting-error-adjusted participation weights (inverse variance weights × participation weights) no longer reduced reporting error in all instances and re-introduced a slight shift towards non-representativeness, resulting in a slight increase in effective sample size compared with the unadjusted participation weights (neffective = 24,623 versus 24,437).
Fig. 4: a, Truncated density curves of the normalized UKBB weights (w), estimated for n = 63,898 participants. The effective sample size was estimated as n × {1/[Var(w) + 1]}. b, R2 = variance explained by standard (ordinary least squares) regression models regressing PT2 onto PT1, while controlling for follow-up time (timeT2 − T1). R2w = variance explained by weighted (weighted least squares) models, incorporating UKBB weights to adjust for selective participation (top: unadjusted participation weights), reporting error (middle: reporting error weights) or both (bottom: reporting-error-adjusted participation weights). Positive values of R2diff (R2 − R2w) index reduced measurement repeatability following weighting. c, Change in means as a function of weighting, obtained for two continuous phenotypes known to link to UKBB participation (age and education). The change in means was expressed as a standardized mean difference, that is, the difference between the weighted mean (mw) and the unweighted mean (m), divided by the unweighted standard deviation ((mw − m)/s.d.).
Simulations
We tested eight simulation scenarios to illustrate the individual and combined impact of reporting error and selective participation on exposure–outcome associations (Fig. 5). The following standardized coefficients for education (E) and BMI (B) on reporting error (R) and the participation probability (P) were estimated and used to simulate (sim) the data: Rsim = −0.48E + 0.03B + ε (with ε as the error term) and \(P_{\mathrm{sim}} = \frac{1}{1 + \exp(-(-2.84 + 0.42E - 0.12B))}\).
Fig. 5: a, Directed acyclic graphs illustrating the different simulation settings, including either the ground-truth scenario (no participation bias or reporting error; 1, highlighted in blue) or scenarios where reporting error (2, highlighted in violet), participation bias (3, highlighted in green) or both (4, highlighted in orange) were present when assessing the effect of BMI on self-reported education (top) and the effect of self-reported education on BMI (bottom). b,c, The impact of the two participatory behaviours (reporting error and selective participation) in each of the simulated scenarios was assessed in terms of bias (b; 1 and 2, showing the difference between the estimated coefficient (y axis) and the true estimate of the exposure–outcome association (grey line, where the true causal effect was set to −0.2)) and RMSE (c; 1 and 2, showing RMSE on the y axis, with the grey line indicating RMSE = 0) when testing the association between education and BMI. Data were simulated to mimic the UKBB response rate, where around 5.5% of the simulated data (for n = 9,000,000 individuals) were selected. All error bars shown in the figure represent the 95% CIs.
We found that deviations from the true causal effect resulted from both selective participation and reporting error in the exposure, in both cases leading to downward bias in the effect estimate (Fig. 5b). Root-mean-square error (RMSE) was most strongly increased by reporting error in the exposure (Fig. 5c), reflecting a large bias of the effect estimate towards the null. Although reporting error in the outcome did not induce bias in the effect estimate, the increased uncertainty in parameter estimates also increased the RMSE, a measure that combines both bias and variance.
Impact of reporting error on SNP effects and trait heritability
To assess the impact of reporting error on genome-wide results, we compared the output obtained from GWA on single-measure phenotypes (for example, self-reported childhood height assessed at baseline) versus repeated-measure phenotypes (using the average across multiple measurement occasions) (Fig. 6). In total, 652 LD-independent SNPs reached significance (P < 5 × 10−8) in genome-wide scans on the 24 traits, of which 149 (22.85%) were only identified in repeated-measure GWA. Among the identified SNPs, the explained variance increased following error correction for 492 SNPs (75.46%). Although the β estimates obtained from the two sets of GWA were the same (Supplementary Table 7), in accordance with the simulations demonstrating that reporting error in the outcome does not induce bias, the reduced error in the phenotype value narrowed the standard errors of the effect estimates, thereby boosting power for genome-wide discovery.
Fig. 6: a, Explained variance (βstd2) per SNP reaching genome-wide significance in error-corrected GWA analyses (y axis; phenotype obtained using means across multiple measurement occasions) or error-uncorrected GWA analyses (x axis; phenotype obtained from a single baseline measure). The colour scheme highlights in which GWA the genetic variant was identified: error-corrected GWA (in red), error-uncorrected GWA (in green) or both (in blue). b, The y axis shows the differences in SNP heritability estimates obtained from error-corrected and error-uncorrected GWA analyses (h2diff = h2repeated measure − h2single measure). The x axis gives the degree of repeatability per phenotype, estimated as the variance (R2) explained by models regressing PT2 on PT1, while controlling for follow-up time (timeT2 − T1) and age. The sample size per phenotype is included in Supplementary Table 8. All error bars shown in the figure represent the 95% CIs.
Finally, with respect to SNP-based heritability estimates, we found that enhanced phenotype resolution increased h2 estimates. Overall, the degree of h2 disattenuation was proportional to the degree of reporting error per phenotype (\(r(h^2_{\mathrm{diff}}, R^2_{\mathrm{repeatability}})\) = −0.75), with h2diff = h2repeated measure − h2single measure, where the largest notable downward bias in h2 estimates was present for self-reported height size at age 10 (\(R^2_{\mathrm{repeatability}}\) = 0.55; \(h^2_{\mathrm{single\,measure}}\) = 23% versus \(h^2_{\mathrm{repeated\,measure}}\) = 29%). The complete set of results is included in Supplementary Fig. 7 and Supplementary Tables 7 and 8.
Discussion
Phenotyping based on short self-report measures is common practice in biobank schemes and has paved the way for large-scale genome-wide discovery studies involving millions of individuals. Although such assessments are cost-effective and minimize the time invested by participants, they are particularly prone to errors resulting from misreporting. In this study, we quantified the extent of reporting error for commonly studied UKBB phenotypes, assessed its properties and links with other participation behaviours, and evaluated its impact on exposure–outcome and genotype–phenotype associations.
Overall, we found that reporting error is non-negligible for many commonly studied self-report measures, notably those relating to early life histories (for example, puberty, education and childhood height/weight), common environmental exposures (for example, number of sunburns) or lifestyles (for example, age when started smoking). Consequently, exploiting large biobank samples does not necessarily enhance the signal-to-noise ratios for these phenotypes, as loss of power resulting from reporting error may equate to discarding up to half the sample (as would be expected if the square of the correlation between the true phenotype and the measured phenotype is around 0.5 (ref. 18)). Considerations on statistical power and sample size requirements should therefore not only focus on the genetic architecture of the trait and the study design but also incorporate phenotype resolution as a parameter of interest.
Examining factors contributing to reporting error, we found that reporting error varied systematically across socio-demographic groups. In particular, young female participants with higher intelligence scores and those from a socio-economically favourable background (higher education and income) tended to provide the most accurate self-report information. This is consistent with the notion of heteroskedastic error, where the error variance depends on certain sample characteristics (for example, the accuracy in reporting level of education depends on education itself; Fig. 5a). The impact of this error structure on study findings will depend on the research question of interest: if gene discovery is the main goal, error in the phenotype reduces power and increases type-II error rates. Although increasing the sample size (that is, reducing sampling error) could compensate for the loss of power, such efforts would not correct for the downward bias in estimates of variance components (for example, SNP heritability and polygenic prediction) resulting from error in the phenotype. For example, for phenotypes with high levels of reporting error, we observed relative h2 attenuation of up to 21% (Fig. 6b). As such, part of the missing heritability problem results from poor phenotype ascertainment, such as the use of minimal phenotyping or misclassification1. Similarly, the higher h2 observed for physical attributes (for example, height and eye colour) than for socio-behavioural traits (for example, smoking and socio-economic status) in the UKBB19 may not solely reflect a stronger genetic component, because measurement problems are mostly inherent to the latter traits. For polygenic scores in particular, attenuation bias due to self-report error can be twofold: First, high phenotypic error in the discovery sample increases measurement error in polygenic indices, leading to attenuation of their effects when tested in replication samples. Second, this bias is expected to be further amplified if self-report accuracy is low in the replication sample.
In classic observational analyses, bias will occur if reporting error is present in the exposure, which attenuates effect estimates towards the null (for example, regression dilution or attenuation bias17,20). In this scenario, the bias in parameter estimates can be particularly large, potentially exceeding bias resulting from other sources (for example, selective participation; Fig. 5b). As such, although large-scale biobanks are imperative for the study of biological pathways of small effects, such minimally phenotyped convenience samples may not be a strong contender for classic (non-genetic) epidemiological research. For that purpose, smaller but more representative samples with gold-standard measures are potentially the more trustworthy alternative.
Finally, we compared features underlying reporting error with those of other participation behaviours, here the UKBB participation propensity. We found that individuals with high self-report quality were more likely to participate in the UKBB and that the application of statistical tools designed to ensure sample representativeness (probability weighting) increased self-report errors. This finding is consistent with survey research, where probability (that is, representative) samples showed more measurement error than volunteer samples21 and where efforts to enhance data quality reduced sample representativeness22,23. Together, these results highlight that biases resulting from response and participation behaviours are not independent and operate in opposite directions, such that adjusting for one type of bias could aggravate bias resulting from other sources. Consequently, design considerations should also focus on finding an optimal trade-off between sampling bias and phenotype precision. For example, the application of reporting error (inverse variance) weights enhanced phenotype resolution without further compromising the level of representativeness in the UKBB (Fig. 4). Collecting quality indicators and metrics of phenotype precision for all individuals at study entry (for example, tools to screen for poor questionnaire responding24) may therefore prove useful in future biobanks to remove some of the noise in the phenotype. Alternatively, researchers may choose to average phenotype scores across multiple measurement occasions if repeat-measurement data are available.
A key consideration when interpreting our results relates to the error structure examined here. More specifically, our work focused on inconsistent self-reporting over time (that is, random fluctuations in the phenotype), rather than sources of consistent misreporting (that is, systematic over- or under-reporting; Supplementary Fig. 1d). Systematic error, documented for numerous traits (for example, self-reported weight, where overweight individuals tend to under-report25), can only be explored if error-free reference data are available. For that reason, it was also not possible to explore error in phenotypes subject to temporal variability (for example, self-reported alcohol use) because the data at hand did not allow us to distinguish reporting error from environmental influences on the observed within-individual variability. In addition, our derived reporting error composite score reflects only an imperfect approximation of its underlying construct (reporting error propensity). Implementing strategies to enhance the resolution of this measure (for example, by using additional follow-up waves when deriving the individual reporting error scores), alongside explorations of alternative structural models (for example, single-trait versus single-factor versus multifactor analyses to capture dimensions of reporting error), could therefore prove useful in future investigations. Finally, the reporting error mechanisms identified in this work may not translate to other cohorts because differences in recruitment schemes and population characteristics probably impact how error in self-report measures is expressed. Future research exploring how self-report error manifests in samples with different characteristics (for example, samples enriched for individuals from disadvantaged backgrounds, in poorer health or of younger age) is therefore needed to assess how different recruitment strategies may impact bias resulting from reporting error and/or sample non-representativeness.
In summary, our findings emphasize that both self-report data quality and sampling features are potential sources of poor reproducibility for biobank-scale research, leading to imprecision and bias that can complicate the interpretation of findings. Analogous to quality control procedures developed for the processing of genetic data, the application of tools designed to enhance phenotype resolution (for example, repeat measurements, regression calibration17, imputation26 and weighted regression) and sample representativeness (for example, probability sampling or weighting) should therefore become an integral part of data collection, pre-analytic data handling and sensitivity checks.
Methods
Indices of reporting error in the UKBB
This research has been conducted with the UKBB resource under application number 16389. The UKBB is a large prospective study of more than 500,000 participants aged between 40 and 69 years who attended one of the baseline assessment centres between 2006 and 201027. Included in this work were individuals with at least one follow-up assessment, that is, individuals taking part in either the first repeat assessment (around 20,000 participants living within 35 km of the Stockport Biobank coordinating centre28) or the brain magnetic resonance imaging assessment (ongoing, inviting back up to 100,000 of the original volunteers29). We first screened all UKBB phenotypes that could be used as indices of reporting error, defined as inconsistent self-reporting over time. To that end, we included phenotypes that were assessed longitudinally but represented time-invariant variables, namely, those that cannot change following the baseline assessment (for example, self-reported birth weight, number of older siblings and age at first sexual intercourse). To minimize the impact of possible outlier values in continuous variables (variables with >10 levels, for example, age when started smoking), we excluded baseline and/or follow-up observations with large deviations (≥10 s.d.) from the baseline mean. For each of the included time-invariant phenotypes, we partitioned the variance of the phenotype into its error-free and reporting error components by regressing the phenotype assessed at time point 2 (PT2; for example, self-reported birth weight at follow-up) onto the phenotype assessed at time point 1 (PT1; for example, self-reported birth weight at baseline). Follow-up time (time between PT1 and PT2; timeT2 − T1) was included as a covariate in this model (PT2 = PT1 + timeT2 − T1). The variance explained by the model (R2) was used as an index of phenotype repeatability, such that 1 − R2 quantifies the level of reporting error per phenotype. For comparison, we also estimated R2 for phenotypes subject to within-person temporal variability (including only objectively ascertained phenotypes, for example, BMI and LDL) and for measures subject to both temporal variability and reporting error (for example, self-reported alcohol use and physical activity).
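To make this repeatability model concrete, the R sketch below reproduces the estimation steps on simulated data; all object names (for example, p_t1, p_t2 and time_diff) are hypothetical stand-ins for a UKBB field and its repeat assessment, not the actual pipeline code.

```r
set.seed(1)

# Toy data standing in for a time-invariant self-report (for example,
# birth weight reported at baseline and again at follow-up).
n     <- 5000
truth <- rnorm(n)                          # error-free phenotype
pheno <- data.frame(
  p_t1      = truth + rnorm(n, sd = 0.3),  # baseline self-report
  p_t2      = truth + rnorm(n, sd = 0.3),  # follow-up self-report
  time_diff = runif(n, 2, 10)              # years between assessments
)

# Exclude observations deviating >= 10 s.d. from the baseline mean
# (applied to continuous measures in the paper).
keep  <- abs(pheno$p_t1 - mean(pheno$p_t1)) < 10 * sd(pheno$p_t1)
pheno <- pheno[keep, ]

# Repeatability model: regress P_T2 on P_T1, controlling for follow-up time.
fit             <- lm(p_t2 ~ p_t1 + time_diff, data = pheno)
r2_repeat       <- summary(fit)$r.squared  # phenotype repeatability (R^2)
reporting_error <- 1 - r2_repeat           # reporting error per phenotype
```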
Next, to explore some of the properties underlying reporting error, we derived individual reporting error scores using a two-stage protocol. In stage 1, we extracted the absolute residuals (|RESi|) from a model regressing PT2 on PT1 (see Fig. 2a for an illustration). In stage 2, the scaled residuals (|RESi|/s.d.T1,T2) from stage 1 were used as input for PCA to obtain a weighted REsum. In PCA, we included only reporting error scores with at least 50,000 non-missing repeated observations. After combining the selected scores, we imputed missing values using row-wise mean imputation and performed PCA. On the basis of PC1, we then generated the weighted summary score (REsum) from the observed indicator items and residualized REsum for follow-up time (timeT2 − T1). This score is a (weighted) average of reporting errors, representing the overall inaccuracy an individual shows when responding to time-invariant questions repeated over time. The resulting summary score was used as the primary outcome in downstream analyses exploring correlates and causes of reporting error. The UKBB resource was approved by the UKBB Research Ethics Committee, and all participants provided written informed consent to participate.
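A minimal sketch of this two-stage protocol on toy data follows; the five simulated phenotypes, their error levels and the follow-up times are hypothetical, and the published implementation (see Code availability) may differ in detail.

```r
set.seed(2)
n <- 1000

# Toy stand-ins for five time-invariant phenotypes with repeat reports
# (the paper uses, for example, years of education and year of birth).
make_pheno <- function(err) {
  truth <- rnorm(n)
  data.frame(p_t1 = truth + rnorm(n, sd = err),
             p_t2 = truth + rnorm(n, sd = err))
}
pheno_list <- lapply(c(0.2, 0.3, 0.4, 0.5, 0.6), make_pheno)

# Stage 1: per-phenotype error score = |residual| from regressing P_T2 on
# P_T1, scaled by the s.d. of the two reports.
re_mat <- sapply(pheno_list, function(p) {
  fit <- lm(p_t2 ~ p_t1, data = p, na.action = na.exclude)
  abs(residuals(fit)) / sd(c(p$p_t1, p$p_t2), na.rm = TRUE)
})

# Stage 2: row-wise mean imputation of missing values, then PCA;
# PC1 provides the weights for the summary score REsum.
row_means   <- rowMeans(re_mat, na.rm = TRUE)
idx         <- which(is.na(re_mat), arr.ind = TRUE)
re_mat[idx] <- row_means[idx[, "row"]]

pca    <- prcomp(re_mat, center = TRUE, scale. = TRUE)
re_sum <- pca$x[, 1]

# Residualize REsum for follow-up time before downstream analyses
# (time_diff is a hypothetical vector of follow-up intervals).
time_diff <- runif(n, 2, 10)
re_sum    <- residuals(lm(re_sum ~ time_diff))
```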
GWA
The REsum was then subjected to a genome-wide scan. For all GWA, we restricted the sample to individuals of European ancestry based on principal components and excluded individuals with a high missing rate (that is, the proportion of genotypes not called) and/or high heterozygosity on autosomes (that is, the proportion of autosomal heterozygous calls). Here, the UKBB30,31 flagged 968 samples as outliers due to high missingness and/or extreme heterozygosity that was not explained by mixed ancestry or increased levels of marriage between close relatives. Extreme values in these metrics can be indicators of poor sample quality (for example, due to DNA contamination) and were therefore discarded. Genetic variants were filtered on the Hardy–Weinberg equilibrium test (retaining variants with P > 1 × 10−15), minor allele frequency (>1%), minor allele count (>100) and call rate (>90%). The association tests were performed in REGENIE v.3.2.6 (ref. 32), adjusting for age, sex and the first ten principal components. The resulting REsum summary statistics file was then included in LD score regression33 (as implemented in GenomicSEM34) to estimate SNP heritability and genetic correlations with other traits. Genetic correlations were estimated for 39 selected traits with publicly available summary statistics files, where the selected traits tapped into participation behaviours (for example, the UKBB participation probability and recontact availability in the UKBB), physical features (for example, height and BMI), biological markers (for example, LDL and systolic blood pressure), lifestyles (for example, smoking and coffee intake), social variables (for example, socio-economic status and education) and mental health/personality (for example, schizophrenia, attention deficit hyperactivity disorder and neuroticism) (see Supplementary Table 9 for details and references). To identify causal factors contributing to reporting error, we performed MR as implemented in the R package TwoSampleMR35. Here, we used the same 39 traits with publicly available summary statistics files to extract genetic instruments for the exposure, selecting LD-independent (--clump-kb 10000 --clump-r2 0.001) SNPs reaching genome-wide significance (P < 5 × 10−8). We only performed MR for exposures with at least five genetic instruments. Tests of causality were performed using the inverse-variance-weighted MR estimator, with the reporting error GWA output included as the outcome. To facilitate comparability of the results, we standardized the SNP effects (βstd) before conducting MR; βstd per SNP j was obtained by dividing the Z-score of the SNP (Zj = β(SNPj)/s.e.(SNPj)) by the square root of the sample size (βstd(SNPj) = \(Z_j/\sqrt{N}\)). The results were corrected for multiple testing using false discovery rate correction (controlled at 5%), correcting for the total number of tests performed per downstream analysis (LD score regression (LDSC) and MR).
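As an illustration of this downstream pipeline, the R sketch below chains LD score regression (via GenomicSEM) with an inverse-variance-weighted MR analysis (via TwoSampleMR). All file paths and column names (for example, resum.sumstats.gz, rsid, ea) are hypothetical placeholders, assuming summary statistics that have already been munged for LDSC.

```r
library(GenomicSEM)   # https://github.com/GenomicSEM/GenomicSEM
library(TwoSampleMR)  # https://mrcieu.github.io/TwoSampleMR/

# LD score regression on munged summary statistics (paths hypothetical);
# eur_w_ld_chr/ holds the LD scores, used here as both ld and wld.
ldsc_res <- ldsc(
  traits          = c("resum.sumstats.gz", "participation.sumstats.gz"),
  sample.prev     = c(NA, NA),       # NA for continuous traits
  population.prev = c(NA, NA),
  ld              = "eur_w_ld_chr/",
  wld             = "eur_w_ld_chr/",
  trait.names     = c("REsum", "UKBB_participation")
)

# Standardize SNP effects before MR: beta_std = Z / sqrt(N), se_std = 1 / sqrt(N).
standardize <- function(d) {
  d$beta_std <- (d$beta / d$se) / sqrt(d$n)
  d$se_std   <- 1 / sqrt(d$n)
  d
}

exp_raw  <- standardize(read.table("exposure_gwas.txt", header = TRUE))
exposure <- format_data(exp_raw, type = "exposure",
                        snp_col = "rsid", beta_col = "beta_std",
                        se_col = "se_std", effect_allele_col = "ea",
                        other_allele_col = "oa", pval_col = "p")
exposure <- clump_data(exposure, clump_kb = 10000, clump_r2 = 0.001)

out_raw <- standardize(read.table("resum_gwas.txt", header = TRUE))
outcome <- format_data(out_raw, type = "outcome",
                       snp_col = "rsid", beta_col = "beta_std",
                       se_col = "se_std", effect_allele_col = "ea",
                       other_allele_col = "oa", pval_col = "p")

dat <- harmonise_data(exposure, outcome)
mr(dat, method_list = "mr_ivw")   # inverse-variance-weighted estimate
```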
The link between reporting error and UKBB participation
To explore patterns of covariation between reporting error and other participatory behaviours known to bias genome-wide estimates, we also included ‘UKBB participation probabilities’ in the analytical pipeline described above. This trait was derived as part of a previous study10 focusing on the impact of participation bias on genome-wide findings. In brief, the participation probabilities are the predicted probabilities of UKBB participation (with 1 = individuals taking part in the UKBB and 0 = individuals taking part in a representative reference sample, the Health Survey for England36) based on 14 harmonized demographic, social and lifestyle variables. Taking the inverse of the participation probabilities serves as a statistical tool to correct for bias induced by selective participation, as is commonly done in surveys37,38, classic epidemiological studies39,40, electronic health record studies41,42 and, more recently, volunteer biobank samples10,12,43. The probability weights included in this work have previously been validated10 based on external data drawn from representative samples (the Health Survey for England36 and UK Census Microdata44) and negative control analyses (genetic analyses on sex45). A more detailed summary of the validation procedures is included in Supplementary Methods. Phenotypically, we estimated the level of covariation between the REsum and the UKBB participation probability. In addition, we obtained the standardized coefficients of the 14 baseline variables predicting UKBB participation (representative sample = 0; UKBB = 1), as was done in our previous work10, to compare the coefficients to those obtained when including the REsum as the outcome. The total variance explained by the 14 predictors was obtained from LASSO regression (5-fold cross-validation) in glmnet46, which also included all possible two-way interaction terms among the categorical (dummy) and continuous variables. To assess if UKBB participation and reporting error share similar genetic and causal structures, we applied the same genome-wide pipeline as described above (that is, LDSC regression and MR analyses) to UKBB participation (n = 283,749) as the outcome of interest. The summary statistics file from the GWA on UKBB participation is accessible via the GWAS Catalog (accession number GCST90267294).
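For illustration, a compact version of the LASSO step described above is sketched in glmnet on toy data; the four predictors and coefficient values are hypothetical stand-ins for the 14 harmonized variables.

```r
library(glmnet)

set.seed(3)
n  <- 2000
df <- data.frame(age = rnorm(n), edu = rnorm(n), bmi = rnorm(n),
                 sex = factor(sample(0:1, n, replace = TRUE)))
y  <- rbinom(n, 1, plogis(-2.8 + 0.4 * df$edu - 0.1 * df$bmi))

# Design matrix with main effects plus all two-way interactions
# (drop the intercept column; glmnet fits its own).
x <- model.matrix(~ .^2, data = df)[, -1]

# 5-fold cross-validated LASSO (alpha = 1) on the binary outcome.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)
coef(cv_fit, s = "lambda.min")            # selected coefficients

# Fraction of null deviance explained at lambda.min (a pseudo-R^2).
i <- which(cv_fit$glmnet.fit$lambda == cv_fit$lambda.min)
cv_fit$glmnet.fit$dev.ratio[i]
```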
Finally, within a regression framework, adjustment for selective participation (unequal inclusion probabilities) and reporting error (unequal error variances, that is, heteroskedasticity) can be achieved through weighting, where over-represented or reporting-error-prone individuals are downweighted and under-represented or reporting-error-free individuals are upweighted. To assess how weighting informed by participation and/or reporting error affects phenotype and sample characteristics, we derived reporting error weights (wRE), indexed as the inverse of the error variance (\(w_{\mathrm{RE}} = 1/(1 + \sigma^2_{\mathrm{RE}})\)). \(\sigma^2_{\mathrm{RE}}\) was obtained by taking the average of the reporting error variances (VarP) across the time-invariant phenotypes (P) selected for PCA: \(\mathrm{Var}_P = (P_{\mathrm{T2}} - \hat{P}_{\mathrm{T2}})^2\), where \(\hat{P}_{\mathrm{T2}}\) are the fitted values from a model regressing the standardized phenotype assessed at follow-up (PT2) on the standardized phenotype assessed at baseline (PT1). We then assessed changes in sample and phenotype characteristics following inverse probability/variance weighting, including either the UKBB participation weights (wP), the error weights (wRE) or the error-adjusted participation weights (wP×RE = wP × wRE). Change was assessed at the level of (1) measurement repeatability in time-invariant phenotypes (that is, comparing estimates of R2 obtained in an unweighted versus a weighted sample) and (2) means of continuous phenotypes known to link to UKBB participation (that is, comparing the weighted and unweighted means obtained for years of education and age).
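The weighting scheme can be sketched in a few lines of R; the inputs below (squared residuals re_mat_sq, participation weights w_p and age) are simulated placeholders rather than the UKBB-derived quantities.

```r
set.seed(4)
n <- 5000

# Hypothetical inputs: squared residuals for five phenotypes, inverse
# probability participation weights and a covariate (age).
re_mat_sq <- matrix(rchisq(n * 5, df = 1), n, 5)
w_p       <- 1 / runif(n, 0.02, 0.30)   # inverse participation probabilities
age       <- rnorm(n, 57, 8)

# Reporting error weights: inverse of 1 + the mean error variance across
# the standardized time-invariant phenotypes selected for PCA.
sigma2_re <- rowMeans(re_mat_sq, na.rm = TRUE)
w_re      <- 1 / (1 + sigma2_re)

# Error-adjusted participation weights.
w_p_re <- w_p * w_re

# Effective sample size, n * 1 / (Var(w) + 1), using normalized weights;
# weighted repeatability would use lm(..., weights = w) in the R^2 models.
eff_n <- function(w) { w <- w / mean(w); length(w) / (var(w) + 1) }
sapply(list(participation = w_p, error = w_re, both = w_p_re), eff_n)

# Change in means as a standardized mean difference, (m_w - m) / s.d.
(weighted.mean(age, w_p) - mean(age)) / sd(age)
```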
Simulations
To illustrate the individual and combined impact of reporting error and participation bias on exposure–outcome associations in a realistic setting, we simulated data for two phenotypes included in exposure–outcome linear regression models (education and BMI), the two participation behaviours of interest (reporting error and study participation) and modelled the relationships among these variables. The two phenotypes of interest, BMI and education, were chosen as these represent two continuous traits with different measurement properties (reporting-error-free versus reporting-error-prone measure, respectively) and have been linked to UKBB participation10.
The following simulation scenarios were tested: (1) the ground truth, where the causal effect of the exposure on the outcome was estimated in a representative sample, and the exposure and outcome were measured without error; (2) reporting-error-only scenario, where reporting error was present in the exposure or outcome measure (but no participation bias); (3) participation-bias-only scenario, where we introduced participation bias (but no measurement error); and (4) a scenario in which both reporting error and participation bias were introduced. These scenarios were then simulated within a bidirectional framework, testing the effects of (error-free) BMI on (error-prone) education and vice versa. The data-generating mechanisms are depicted in the directed acyclic graphs shown in Fig. 5.
The coefficients used in the simulation scenarios were derived as follows from the UKBB data: for UKBB participation, we used the standardized coefficients for education (βedu) and BMI (βBMI) on UKBB participation as estimated in MR (described above). To obtain the coefficients required to simulate reporting error in self-reported years of education, we regressed the reporting error score for education (RESedu, as described above) onto education (E) and BMI (B) and extracted the standardized effect estimates: RESedu = αeduE + αBMIB + ε.
The obtained coefficients were then used to simulate the data, where biases were introduced as follows: for participation bias, we first generated the simulated participation probabilities, \(P_{\mathrm{sim}} = \frac{1}{1 + \exp(-(\beta_0 + \beta_{\mathrm{edu}}E + \beta_{\mathrm{BMI}}B))}\), where E and B denote the simulated variables for years of education and BMI, respectively. The variables were simulated as E ~ N(0, 1) and B ~ N(0, 1) when included as the exposure, and as E = vB + ε and B = vE + ε when included as the outcome, where ε ~ N(0, 1 − v2) and v denotes the true causal effect of the exposure on the outcome. The coefficient β0 was set to mimic the UKBB response rate, where around 5.5% of the 9,000,000 individuals initially invited to take part were recruited into the study27 (\(\beta_0 = -\log(|1 - \frac{1}{0.055}|)\)). Subjects were then assigned a random number U from the uniform distribution U ~ Uniform(0, 1) and were classified as either respondents (U < Psim) or non-respondents (U ≥ Psim).
Reporting error was generated for one self-report measure (education, E) and was simulated as heteroskedastic error. Heteroskedasticity in this context refers to error in the measured phenotype (Emeasured) that is non-constant and varies across individuals: Emeasured = Etrue + εedu, where εedu ~ N(0, R). R was simulated as Rsim = αeduE + αBMIB + ε and then scaled to have a standard deviation of 1 and values of R > 0 (\(R = (R_{\mathrm{sim}} + |\min(R_{\mathrm{sim}})|)/\mathrm{s.d.}(R_{\mathrm{sim}})\)). BMI was modelled as an error-free measure in all simulation scenarios (Bmeasured = Btrue).
The impact of reporting error and selective participation was assessed in terms of bias (that is, deviation of the β coefficient of the exposure–outcome association from the true effect) and RMSE, an index that captures both the severity of the bias and the variance of the estimator: \(\mathrm{RMSE} = \sqrt{\frac{1}{k}\sum_k (\hat{v}_k - v)^2}\), where \(\hat{v}_k\) is the estimated effect of the exposure–outcome association in simulation k and v is the true causal effect of the exposure on the outcome. We performed k = 1,000 simulations, and the true causal effect was set to v = −0.2.
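A condensed R version of one simulation replicate is given below (scaled down from the paper's 9,000,000 invitees to keep it light); reading N(0, R) as specifying R on the standard deviation scale is an assumption of this sketch, flagged in the comments.

```r
set.seed(5)
n_pop <- 1e6    # the paper simulates 9,000,000 invitees
v     <- -0.2   # true causal effect of the exposure on the outcome

# Standardized education (E, exposure) and BMI (B, outcome).
E <- rnorm(n_pop)
B <- v * E + rnorm(n_pop, sd = sqrt(1 - v^2))

# Selective participation (~5.5% response rate), UKBB-derived coefficients.
b0   <- -log(abs(1 - 1 / 0.055))                  # approx. -2.84
part <- runif(n_pop) < plogis(b0 + 0.42 * E - 0.12 * B)

# Heteroskedastic reporting error in education: the error scale varies with
# E and B, rescaled to positive values with s.d. 1. Here R enters as the
# per-individual s.d. of the error term (an assumption of this sketch).
r_sim <- -0.48 * E + 0.03 * B + rnorm(n_pop)
r_pos <- (r_sim + abs(min(r_sim))) / sd(r_sim)
E_obs <- E + rnorm(n_pop, sd = r_pos)             # BMI stays error-free

# One replicate of the four scenarios; repeating this k = 1,000 times and
# averaging (estimate - v)^2 yields the squared RMSE per scenario.
c(truth     = unname(coef(lm(B ~ E))[2]),
  rep_error = unname(coef(lm(B ~ E_obs))[2]),
  selection = unname(coef(lm(B[part] ~ E[part]))[2]),
  both      = unname(coef(lm(B[part] ~ E_obs[part]))[2]))
```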
Impact of reporting error on SNP effects and heritability
To explore the impact of reporting error on genome-wide quantities, we compared the results of GWA tests on error-corrected versus error-prone versions of the same phenotype. We derived error-corrected phenotypes by taking the mean across multiple measurement occasions (for example, the mean of self-reported childhood height) because the within-person average reduces the random error in a variable. The baseline phenotype assessed in the same subset of UKBB participants was used as the error-prone counterpart (for example, baseline self-reported childhood height). Genome-wide tests using REGENIE were then performed on both the repeated-measure and the single-measure phenotype. LD-independent SNPs reaching genome-wide significance (P < 5 × 10−8) were selected via clumping (--clump-kb 250 --clump-r2 0.1), and the explained variance per SNP j was obtained by squaring βstd. We estimated SNP heritability for both the single-measure (h2S) and the repeated-measure GWA (h2R) and calculated the difference (h2diff = h2R − h2S) using the following test statistic: \(Z = (h^2_{\mathrm{R}} - h^2_{\mathrm{S}})/\sqrt{\mathrm{s.e.}(h^2_{\mathrm{R}})^2 + \mathrm{s.e.}(h^2_{\mathrm{S}})^2 - 2\,r(h^2_{\mathrm{R}}, h^2_{\mathrm{S}})\,\mathrm{s.e.}(h^2_{\mathrm{R}})\,\mathrm{s.e.}(h^2_{\mathrm{S}})}\).
The correlation coefficient r(h2R, h2S) was obtained from a 200-block jackknife analysis, where we split the genome into 200 equal blocks of SNPs and removed one block at a time to perform jackknife estimation. h2diff was obtained for traits with at least 2% SNP heritability.
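Given the test statistic above, the difference test reduces to a few lines of R; the standard errors and jackknife correlation plugged in below are hypothetical illustrations, not the paper's estimates.

```r
# Z-test for the difference between two correlated SNP heritability
# estimates, with r(h2_R, h2_S) taken from the 200-block jackknife.
h2_diff_test <- function(h2_r, se_r, h2_s, se_s, r) {
  diff <- h2_r - h2_s
  se   <- sqrt(se_r^2 + se_s^2 - 2 * r * se_r * se_s)
  z    <- diff / se
  c(h2_diff = diff, se = se, z = z, p = 2 * pnorm(-abs(z)))
}

# Childhood height example from the text (29% versus 23%); the standard
# errors (0.015) and correlation (0.8) are made-up values for illustration.
h2_diff_test(h2_r = 0.29, se_r = 0.015, h2_s = 0.23, se_s = 0.015, r = 0.8)
```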
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The reporting error genome-wide association statistics are available through the GWAS Catalog (accession number GCST90448966).
Code availability
The following software was used to run the analyses: REGENIE (https://github.com/rgcgithub/regenie), TwoSampleMR (https://mrcieu.github.io/TwoSampleMR/) and GenomicSEM (https://github.com/GenomicSEM/GenomicSEM). All analytical scripts are available at https://github.com/TabeaSchoeler/TS2023_repErrorUKBB.
References
van der Sluis, S., Verhage, M., Posthuma, D. & Dolan, C. V. Phenotypic complexity, measurement bias, and poor phenotypic resolution contribute to the missing heritability problem in genetic association studies. PLoS ONE 5, e13929 (2010).
Abdellaoui, A. & Verweij, K. J. H. Dissecting polygenic signals from genome-wide association studies on human behaviour. Nat. Hum. Behav. 5, 686–694 (2021).
Tiego, J. et al. Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology. Nat. Ment. Health 1, 304–315 (2023).
Saccenti, E., Hendriks, M. H. W. B. & Smilde, A. K. Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Sci. Rep. 10, 438 (2020).
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
DiPrete, T. A., Burik, C. A. P. & Koellinger, P. D. Genetic instrumental variable regression: explaining socioeconomic and health outcomes in nonexperimental data. Proc. Natl Acad. Sci. USA https://doi.org/10.1073/pnas.1707388115 (2018).
Pingault, J. et al. Research review: how to interpret associations between polygenic scores, environmental risks, and phenotypes. J. Child Psychol. Psychiatry 63, 1125–1139 (2022).
de Vlaming, R. et al. Meta-GWAS Accuracy and Power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 13, e1006495 (2017).
Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
Schoeler, T. et al. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat. Hum. Behav. https://doi.org/10.1038/s41562-023-01579-9 (2023).
Stamatakis, E. et al. Is cohort representativeness passé? Poststratified associations of lifestyle risk factors with mortality in the UK Biobank. Epidemiology 32, 179–188 (2021).
van Alten, S., Domingue, B. W., Faul, J., Galama, T. & Marees, A. T. Reweighting UK Biobank corrects for pervasive selection bias due to volunteering. Int. J. Epidemiol. https://doi.org/10.1093/ije/dyae054 (2024).
Mignogna, G. et al. Patterns of item nonresponse behaviour to survey questionnaires are systematic and associated with genetic loci. Nat. Hum. Behav. https://doi.org/10.1038/s41562-023-01632-7 (2023).
Tyrrell, J. et al. Genetic predictors of participation in optional components of UK Biobank. Nat. Commun. 12, 886 (2021).
Ward, J. et al. Head motion in the UK Biobank imaging subsample: longitudinal stability, associations with psychological and physical health, and risk of incomplete data. Brain Commun. 6, fcae220 (2024).
Brayne, C. & Moffitt, T. E. The limitations of large-scale volunteer databases to address inequalities and global challenges in health and aging. Nat. Aging 2, 775–783 (2022).
Rutter, C. E., Millard, L. A. C., Borges, M. C. & Lawlor, D. A. Exploring regression dilution bias using repeat measurements of 2858 variables in ≤49 000 UK Biobank participants. Int. J. Epidemiol. https://doi.org/10.1093/ije/dyad082 (2023).
Buzas, J. S., Stefanski, L. A. & Tosteson, T. D. in Handbook of Epidemiology (eds Ahrens, W. & Pigeot, I.) 729–765 (Springer, 2014).
Ge, T., Chen, C. Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 13, e1006711 (2017).
Hutcheon, J. A., Chiolero, A. & Hanley, J. A. Random measurement error and regression dilution bias. BMJ 340, c2289 (2010).
Chang, L. & Krosnick, J. A. National surveys via RDD telephone interviewing versus the internet. Public Opin. Q. 73, 641–678 (2009).
Nakash, R. A., Hutton, J. L., Jørstad-Stein, E. C., Gates, S. & Lamb, S. E. Maximising response to postal questionnaires—a systematic review of randomised trials in health research. BMC Med. Res. Methodol. 6, 5 (2006).
Woolf, B., Pedder, H., Rodriguez-Broadbent, H. & Edwards, P. Silence is golden, but my measures still see—why cheaper-but-noisier outcome measures in large simple trials can be more cost-effective than gold standards. Trials 25, 532 (2024).
DeSimone, J. A. & Harms, P. D. Dirty data: the effects of screening respondents who provide low-quality data in survey research. J. Bus. Psychol. 33, 559–577 (2018).
Cawley, J., Maclean, J. C., Hammer, M. & Wintfeld, N. Reporting error in weight and its implications for bias in economic models. Econ. Hum. Biol. 19, 27–44 (2015).
Freedman, L. S., Midthune, D., Carroll, R. J. & Kipnis, V. A comparison of regression calibration, moment reconstruction and imputation for adjusting for covariate measurement error in regression. Stat. Med. 27, 5195–5216 (2008).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Lyall, D. M. et al. Cognitive test scores in UK Biobank: data reduction in 480,416 participants and longitudinal stability in 20,346 participants. PLoS ONE 11, e0154222 (2016).
Miller, K. L. et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat. Neurosci. 19, 1523–1536 (2016).
Genotyping and Quality Control of UK Biobank, a Large-Scale, Extensively Phenotyped Prospective Resource (UK BioBank, 2015); https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/genotyping_qc.pdf
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Grotzinger, A. D. et al. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat. Hum. Behav. 3, 513–525 (2019).
Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife https://doi.org/10.7554/eLife.34408 (2018).
Mindell, J. et al. Cohort profile: the health survey for England. Int. J. Epidemiol. 41, 1585–1593 (2012).
Jensen, H. A. R. et al. The impact of non-response weighting in health surveys for estimates on primary health care utilization. Eur. J. Public Health 32, 450–455 (2022).
Franco, A., Malhotra, N., Simonovits, G. & Zigerell, L. J. Developing standards for post-hoc weighting in population-based survey experiments. J. Exp. Polit. Sci. 4, 161–172 (2017).
Kapteyn, A., Michaud, P. C., Smith, J. P. & Van Soest, A. Effects of Attrition and Non-Response in the Health and Retirement Study. IZA Discussion Paper No. 2246, RAND Working Paper No. WR-407 (IZA, 2006).
Plewis, I. Non-response in a birth cohort study: the case of the Millennium Cohort Study. Int. J. Soc. Res. Methodol. 10, 325–334 (2007).
Beesley, L. J. & Mukherjee, B. Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification. Biometrics 78, 214–226 (2022).
Beesley, L. J. & Mukherjee, B. Case studies in bias reduction and inference for electronic health record data with selection bias and phenotype misclassification. Stat. Med. 41, 5501–5516 (2022).
Salvatore, M. et al. To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice. J. Am. Med. Inform. Assoc. https://doi.org/10.1093/jamia/ocae098 (2024).
2011 Census microdata (Office for National Statistics, 2011); https://www.ons.gov.uk/census/2011census/2011censusdata/censusmicrodata
Pirastu, N. et al. Genetic analyses identify widespread sex-differential participation bias. Nat. Genet. 53, 663–671 (2021).
Hastie, T., Qian, J. & Tay, K. An Introduction to glmnet. https://glmnet.stanford.edu/articles/glmnet.html (2021).
Acknowledgements
We thank all biobank participants for sharing their data. This study would not have been possible without the use of publicly available genome-wide summary data and software tools. We acknowledge these resources and thank the research participants, the research teams and institutions that have contributed to this research. Computations have been performed on the HPC cluster of the Lausanne University Hospital. Z.K. was funded by the Swiss National Science Foundation (number 310030-189147). T.S. is funded by a Wellcome Trust Sir Henry Wellcome fellowship (grant 218641/Z/19/Z). J.-B.P. has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 863981). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
Z.K. and T.S. conceptualized the study. T.S. performed the statistical analyses. Z.K., J.-B.P. and T.S. discussed the results and provided comments on the article. All authors critically reviewed the article.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Human Behaviour thanks Xueyi Shen, Jordan Smoller, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Methods and Figs. 1–7.
Supplementary Tables
Includes Supplementary Tables 1–9.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Schoeler, T., Pingault, JB. & Kutalik, Z. The impact of self-report inaccuracy in the UK Biobank and its interplay with selective participation. Nat Hum Behav 9, 584–594 (2025). https://doi.org/10.1038/s41562-024-02061-w