Inference under outcome misclassification in health risk models using a simulation study with a validation dataset

Liu, Xirui; McComish, Stacey L.; Howard, Sara C.; Zhou, Joey Y.; Tolmachev, Sergey Y.

doi:10.1038/s41598-026-41788-6

Article
Open access
Published: 04 March 2026

Scientific Reports volume 16, Article number: 11981 (2026)

Abstract

Death certificates are commonly used in epidemiological studies investigating the relationship between exposure and health outcomes. It is known that death certificates may misclassify the underlying causes of death, and it is commonly understood that if misclassification is non-differential, it will bias dose-response relationships toward the null or underestimate the association. This simulation study explores the probability that results of an individual study may contradict the general understanding by addressing two key questions: (1) what is the probability that misclassification of disease mortality moves measures of dose-response associations away from the null? and (2) what is the probability that misclassification moves measures of dose-response associations away from the null sufficiently to change the conclusion of a study from statistically non-significant to significant? As the starting point, this simulation study used a small group of radiation-exposed nuclear workers for whom both death certificates and autopsy reports were available. Results suggest that nominally non-differential misclassification can lead to an odds ratio that moves away from the null. For datasets where the initial p-values were slightly non-significant, the percentage of odds ratios that moved away from the null generally decreased with higher levels of misclassification, and the probability that the p-values associated with these odds ratios would change to significant decreased with increasing misclassification rates. The traditional heuristic is more likely to be true when: (1) there is a larger misclassification rate, and (2) there is a high association between dose and disease mortality. This has implications for environmental epidemiology, such as low-dose radiation epidemiology, where estimated effects are often small and conclusions may hinge on marginal statistical significance. 
As another implication, these findings apply broadly to various health outcomes, even if the outcome misclassification rate is low.

Introduction

Misclassification of causes of death on death certificates is a well-documented issue^1,2,3,4,5,6. In general, it is understood that if misclassification is non-differential, it will bias dose-response relationships toward the null^7,8,9. This heuristic is commonly used to suggest that if epidemiological studies with significant dose-response associations had accounted for misclassification, the associations would have been stronger. However, several studies have described exceptions to the conventional assumption that outcome misclassification biases risk estimates toward the null^10,11,12. One of the exceptions provided by Yland et al.¹² demonstrated that even when the underlying misclassification rates are non-differential, the observed misclassification in a population may actually be differential due to the random nature of misclassification. Simulated datasets were used to demonstrate that this effect is more pronounced for smaller population sizes and smaller misclassification rates. Whitcomb & Niami¹¹ made a useful distinction between bias and error. They clarify that bias is “a tendency – the difference between the true value of a parameter and the expected value based on (usually hypothetical) repetitions of a study,” whereas an error is the “difference between a particular study result and the truth.” Both Yland et al.¹² and Whitcomb & Niami¹¹ demonstrate that, regardless of the direction of the overall bias, the error observed for an individual study may not move the measure of the dose-response relationship toward the null. This means that there is some probability that the heuristic will not hold true for an individual study. A thought experiment illustrates the point: if the true odds ratio in a study population is exactly 1 (the null) and the underlying misclassification is non-differential, the observed odds ratio can only move away from the null, and it is equally likely (50%) to end up greater than or less than 1.

While Yland et al.¹² and Whitcomb & Niami¹¹ simulated the effects of dose misclassification, a similar effect would be expected as a result of outcome misclassification. The current study focuses on the impact of non-differential outcome misclassification and its effect on dose-response estimates for individual studies. As such, two questions will be addressed: (1) what is the probability that misclassification of disease mortality moves measures of dose-response associations away from the null? and (2) what is the probability that misclassification moves measures of dose-response associations away from the null sufficiently to change the conclusion of a study from non-significant to significant? The first question represents the probability that the heuristic does not hold true for an individual study. The second question is of unique importance, because it explores the probability that misclassification would artificially enhance the measure of risk sufficiently to change the conclusion from non-significant to significant. By addressing these questions, the present analysis provides a quantitative framework for evaluating the extent of concern warranted in studies that rely on causes of death from death certificates. Specifically, it enables researchers to estimate the probability that an observed significant association may arise from outcome misclassification rather than from a true relationship between dose and response. Quantifying these probabilities is essential for critically interpreting findings and for assessing how outcome misclassification could influence the conclusions drawn from individual epidemiological studies.

Methods

A validation dataset

United States Transuranium and Uranium Registries (USTUR) Registrants are former nuclear workers who had documented exposures to external radiation and/or to internally incorporated actinide elements such as plutonium^13,14. These individuals agreed to have an autopsy at the time of death and voluntarily donated tissues and organs, or their entire bodies, for posthumous research. The USTUR maintains copies of radiation exposure records from Registrant worksites as well as autopsy reports and death certificates. The combination of cumulative external dose (Sv) data, autopsy report cause of death, and death certificate cause of death needed for misclassification analyses was available for 229 Registrants. A study of misclassification of causes of death on death certificates⁶ showed an overall misclassification rate of 25.4% among USTUR Registrants. No association between misclassification and radiation dose was found, suggesting that misclassification of underlying causes of death among USTUR Registrants is non-differential with regard to dose.

This study was performed as part of the USTUR research program, which was reviewed and approved by the Central Department of Energy Institutional Review Board (USA) No. WASU-68–50181. Since the initiation of Registrant recruitment in 1968, the USTUR has routinely obtained authority for autopsy and/or informed consent, as well as a release of medical records, from participants, next of kin, or holders of power of attorney, in accordance with the ethical standards in place at the time of data collection. In addition, a formal informed consent process has been in place for decades. For cases collected prior to the establishment of formal IRB requirements, consent procedures adhered to the prevailing ethical standards and institutional policies at the time.

Initial and misclassified datasets

Datasets were formed by pairing dose and outcome data, where outcomes were either cancer or non-cancer mortality. Two types of datasets were used in this study: initial datasets and misclassified datasets. Initial datasets represented the ‘true’ distribution of diseases in a studied population, as might be found on autopsy reports, and were used as the starting point for misclassification simulations. Misclassified datasets were the result of the misclassification simulations, and represented possible misclassified distributions of disease in a population, such as those typically found on death certificates used in epidemiological studies.

Source of dose data

Two sources of dose data were used: actual cumulative external doses and generated cumulative external doses. Actual cumulative external doses were taken directly from copies of worksite occupational exposure records for USTUR Registrants, and ranged from 0.001 to 0.714 Sv (geometric mean: 0.063 Sv, geometric standard deviation: 4.09, n = 229). To increase the sample size, 5,000 cumulative external doses were generated from a truncated lognormal distribution with the same geometric mean and geometric standard deviation as were observed in the actual USTUR external dose data. An upper limit of 1 Sv was set to avoid extreme or outlier doses. After creating these dose values, the geometric mean and geometric standard deviation of the generated distribution were calculated and compared to those of the actual USTUR external dose data. This process was repeated 100 times using different random seeds. The dose distribution closest to the actual USTUR doses was selected by identifying the iteration that minimized the sum of the absolute differences between the respective geometric means and geometric standard deviations. The cumulative external doses in the generated dataset ranged from 0.004 to 0.98 Sv (geometric mean: 0.057 Sv, geometric standard deviation: 3.94). A sample size of 5,000 was selected because that was the point where patterns in the results became stable.
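The dose-generation procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R script: it assumes that truncation is done by rejecting draws above the upper limit, that seeds 0 to 99 are used, and that the lognormal parameters are the logs of the geometric mean and geometric standard deviation. The function names are hypothetical.

```python
import math
import random

def geometric_stats(values):
    """Return (geometric mean, geometric standard deviation) of positive values."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    var_log = sum((l - mean_log) ** 2 for l in logs) / (len(logs) - 1)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))

def truncated_lognormal(n, gm, gsd, upper, rng):
    """Draw n lognormal doses (Sv), rejecting any draw above `upper`.

    Assumption: rejection sampling stands in for the truncated
    distribution used in the paper; both target the same distribution.
    """
    mu, sigma = math.log(gm), math.log(gsd)
    doses = []
    while len(doses) < n:
        dose = rng.lognormvariate(mu, sigma)
        if dose <= upper:
            doses.append(dose)
    return doses

def best_dose_set(n, gm, gsd, upper=1.0, n_seeds=100):
    """Keep the seed whose GM and GSD are jointly closest to the targets."""
    best, best_score = None, float("inf")
    for seed in range(n_seeds):  # assumed seed values 0..99
        doses = truncated_lognormal(n, gm, gsd, upper, random.Random(seed))
        g, s = geometric_stats(doses)
        score = abs(g - gm) + abs(s - gsd)  # sum of absolute differences
        if score < best_score:
            best, best_score = doses, score
    return best, best_score

# Targets taken from the actual USTUR dose summary quoted in the text.
doses, score = best_dose_set(n=5000, gm=0.063, gsd=4.09)
```

Because truncation removes the upper tail, the selected set will generally have a somewhat smaller geometric mean and geometric standard deviation than the targets, which is consistent in direction with the values reported for the generated dataset.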

Source of ‘true’ outcome data

The outcome of interest in this study was cancer mortality. Outcomes were binary and they were labeled as 1 if an individual’s underlying cause of death was cancer and 0 if the person died from other causes. Two sources of ‘true’ outcome data were used to form initial datasets: actual cancer deaths from USTUR Registrant autopsy reports and generated cancer deaths. Actual cancer deaths were identified by a medical doctor (MD) who reviewed each autopsy report⁶.

Generated ‘true’ outcomes were produced using the logistic probability function,

$$p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}} \tag{1}$$

where $x$ was the radiation dose in Sv,

$p(x)$ was the probability of a cancer death,

β₀ was a constant derived from a baseline cancer mortality rate of 20% using the formula $\beta_0=-\ln\left(\frac{1}{0.2}-1\right)$, and

β₁ was the log of a preset odds ratio.

The preset odds ratio used to calculate β₁ was designed to either force the odds ratio to be close to 1 by using β₁ = log(1.001), or to produce a non-significant dose-outcome dataset with a p-value sufficiently close to 0.05 (0.05 < p-value < 0.0501).

Equation 1 provided the probability of a cancer death as a function of radiation dose. It was used to calculate the probability of a cancer death associated with each value in the dose dataset. Those probabilities were then used to randomly generate outcomes corresponding to each dose value. The resulting paired dose-outcome dataset represented just one possible dataset that could have been generated from the $p(x)$ dose probabilities. Therefore, using the same doses and probabilities, a total of 1,000 possible dose-outcome datasets were randomly generated, and odds ratios and p-values were calculated for each.
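As a rough illustration of this step, the following Python sketch evaluates Equation 1 and draws 1,000 candidate outcome vectors for a fixed set of doses. The five dose values, the random seed, and the function names are assumptions made for the example; with a 20% baseline rate, β₀ works out to −ln(4), or about −1.386.

```python
import math
import random

BASELINE_RATE = 0.20
# beta_0 = -ln(1/0.2 - 1) = -ln(4), roughly -1.386
BETA0 = -math.log(1 / BASELINE_RATE - 1)

def cancer_probability(dose_sv, beta1):
    """Equation 1: logistic probability of a cancer death at a given dose (Sv)."""
    return 1.0 / (1.0 + math.exp(-(BETA0 + beta1 * dose_sv)))

def generate_outcomes(doses, beta1, rng):
    """Draw one binary outcome (1 = cancer death) per dose value."""
    return [1 if rng.random() < cancer_probability(d, beta1) else 0 for d in doses]

rng = random.Random(1)                             # assumed seed
example_doses = [0.001, 0.05, 0.1, 0.5, 0.714]     # illustrative doses only
beta1 = math.log(1.001)                            # preset odds ratio close to 1

# 1,000 possible dose-outcome datasets from the same doses and probabilities
datasets = [generate_outcomes(example_doses, beta1, rng) for _ in range(1000)]
```

At zero dose the function returns the baseline rate of 0.2, which is a quick check that β₀ has the intended sign.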

Afterward, a single initial dataset was selected from among the 1,000 dose-outcome datasets for use as the starting point in the misclassification analysis. This initial dataset was selected to force either the odds ratio to be close to the null value of 1, or the p-value to be slightly larger than 0.05. The scenario where the initial dataset’s odds ratio was close to 1 was designed to mimic a situation where the association between exposure and outcome is minimal or non-existent. The scenario that forced the initial p-value to be slightly larger than 0.05 was designed to create a borderline non-significant initial dataset, representing an extreme situation where the conclusion of a study is most vulnerable to changing from non-significant to significant as a result of death certificate misclassification errors.

Calculation methods

Odds ratios were used as a measure of the association between cancer mortality and dose, and p-values as a measure of the significance of that association. Two methods were used to calculate the odds ratios and p-values: a 2 × 2 contingency table and a logistic regression¹⁵. The 2 × 2 table method used a categorical dose variable to calculate the odds ratio and p-value for a dataset. To create the categorical dose variable, cases were divided into low- and high-dose groups using the median dose (0.076 Sv). The low- and high-dose groups were each further subdivided into cancer and non-cancer cases to make a 2 × 2 contingency table. The contingency table was then used to calculate the odds ratio, and the chi-squared test was used to calculate the p-value. The logistic regression method used the continuous dose variable to calculate the odds ratio and p-value for a dataset, such that each dose was treated as a unique value associated with a specific outcome in calculations.
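The two calculation methods can be sketched as follows. Several details are assumptions rather than statements about the authors' analysis: the high-dose group is taken as doses strictly above the cutoff, the chi-squared test is computed without a continuity correction, and the logistic regression is fitted by plain Newton-Raphson with a Wald p-value for the slope. The toy data and function names are hypothetical.

```python
import math

def odds_ratio_2x2(doses, outcomes, cutoff):
    """Odds ratio and chi-squared p-value (1 df) from a high/low dose split.

    Raises ZeroDivisionError if any table cell or margin needed is zero.
    """
    a = b = c = d = 0
    for dose, y in zip(doses, outcomes):
        if dose > cutoff:              # high dose: a = cancer, b = non-cancer
            a, b = a + y, b + (1 - y)
        else:                          # low dose: c = cancer, d = non-cancer
            c, d = c + y, d + (1 - y)
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-squared upper tail, 1 df
    return odds_ratio, p_value

def odds_ratio_logistic(doses, outcomes, iterations=25):
    """Odds ratio per Sv and Wald p-value from a one-covariate logistic fit."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(doses, outcomes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p                # score (gradient) terms
            g1 += (y - p) * x
            h00 += w                   # information matrix terms
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det      # Newton-Raphson update
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)       # from the information matrix of the final iteration
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return math.exp(b1), p_value

# Toy data: 10 low-dose cases (2 cancers) and 10 high-dose cases (6 cancers).
doses = [0.01] * 10 + [0.5] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 6 + [0] * 4
or_table, p_table = odds_ratio_2x2(doses, outcomes, cutoff=0.1)
or_per_sv, p_logit = odds_ratio_logistic(doses, outcomes)
```

Note that the two odds ratios are on different scales: the 2 × 2 value compares the high-dose group with the low-dose group, while the logistic value is per 1 Sv of dose.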

Misclassification: real data from USTUR registrants

As a first step, the impact of death certificate misclassification among USTUR Registrants was calculated using data that was taken directly from Registrant files: actual external doses (Sv), actual underlying causes of death from autopsy reports, and actual underlying causes of death from death certificates. For the purposes of this work, underlying causes of death from autopsy reports represented the ‘true’ distribution of diseases in a population, and underlying causes of death from death certificates represented the misclassified distribution of diseases. The odds ratio for cancer mortality was calculated based on autopsy reports and compared to the odds ratio based on death certificates. Both of these odds ratios were calculated using two methods, a 2 × 2 table and a logistic regression, to explore how calculation methods influence the findings.

Misclassification: simulated outcomes

Figure 1 illustrates the general approach used to establish an initial dataset and simulate misclassified outcomes. Overall, simulation of misclassified outcomes consisted of three steps: (1) selection or generation of doses, (2) selection or generation of ‘true’ cancer outcomes, and (3) the misclassification simulation.

Six separate scenarios were selected to represent different combinations of actual and generated data as the starting points for six separate misclassification simulations. Each of these scenarios had different initial datasets and/or different calculation methods as summarized in Table 1 and described below.

Scenario 1: The initial dataset consisted of actual doses and actual cancer mortality outcomes from USTUR Registrants (n = 229). The 2 × 2 table method was used for calculations of odds ratios and p-values.

Scenario 2: The initial dataset consisted of actual doses and actual cancer mortality outcomes from USTUR Registrants (n = 229). The logistic regression method was used for calculations of odds ratios and p-values.

Scenario 3: The initial dataset consisted of actual doses from USTUR Registrants (n = 229) and generated mortality outcomes that forced the initial dataset to have an odds ratio close to 1. The logistic regression method was used for calculations of odds ratios and p-values.

Scenario 4: The initial dataset consisted of actual doses from USTUR Registrants (n = 229) and generated mortality outcomes that forced the initial dataset to have a slightly non-significant p-value (p > 0.05). The logistic regression method was used for calculations of odds ratios and p-values.

Scenario 5: The initial dataset (n = 5,000) consisted of generated doses and generated mortality outcomes that forced the initial dataset to have an odds ratio close to 1. The logistic regression method was used for calculations of odds ratios and p-values.

Scenario 6: The initial dataset (n = 5,000) consisted of generated doses and generated mortality outcomes that forced the initial dataset to have a slightly non-significant p-value (p > 0.05). The logistic regression method was used for calculations of odds ratios and p-values.

Table 1 Misclassification simulation scenarios: initial datasets and calculation methods.

Misclassification was simulated for each scenario using over- and under-misclassification rates ranging from 0% to 30%, where over- and under-misclassification rates were defined as follows:

$$\text{Over-misclassification Rate}=\frac{\text{False Positives}}{\text{False Positives}+\text{True Negatives}} \tag{2}$$

$$\text{Under-misclassification Rate}=\frac{\text{False Negatives}}{\text{False Negatives}+\text{True Positives}} \tag{3}$$

Over-misclassified cancer death outcomes were simulated by randomly selecting non-cancer cases in the initial dataset and changing them to cancer cases in the misclassified dataset (i.e. the outcome was changed from 0 to 1). Similarly, under-misclassified outcomes were simulated by randomly selecting cancer cases in the initial dataset and changing them to non-cancer cases (i.e. the outcome was changed from 1 to 0). For example, to generate a combination of 5% over- and 15% under-misclassification of disease mortality, 5% of non-cancer cases in the initial dataset were selected at random and changed to cancer cases, and 15% of cancer cases from the initial dataset were selected at random and changed to non-cancer cases. This process was repeated 20,000 times for each combination of over- and under-misclassification rates, resulting in 20,000 misclassified datasets. These datasets represented the range of possible outcomes that misclassified death certificates could have if misclassification was random. For each scenario, the misclassification was simulated for all combinations of over- and under-misclassification rates of 0%, 5%, 10%, 15%, 20%, 25% and 30%.
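A minimal Python sketch of one misclassification draw is given below. It assumes that a fixed (rounded) fraction of cases is flipped in each direction, which is one reading of the description above; an alternative would be to flip each case independently with the given probability. The function name, seed, and toy dataset are hypothetical.

```python
import random

def misclassify(outcomes, over_rate, under_rate, rng):
    """Return a copy of `outcomes` with simulated death certificate errors.

    over_rate:  fraction of non-cancer cases (0) recoded as cancer (1)
    under_rate: fraction of cancer cases (1) recoded as non-cancer (0)
    Assumption: the number of flipped cases is the rounded fraction.
    """
    non_cancer = [i for i, y in enumerate(outcomes) if y == 0]
    cancer = [i for i, y in enumerate(outcomes) if y == 1]
    n_false_pos = round(over_rate * len(non_cancer))
    n_false_neg = round(under_rate * len(cancer))
    flipped = list(outcomes)
    for i in rng.sample(non_cancer, n_false_pos):
        flipped[i] = 1          # over-misclassification: 0 -> 1
    for i in rng.sample(cancer, n_false_neg):
        flipped[i] = 0          # under-misclassification: 1 -> 0
    return flipped

# Toy initial dataset: 80 non-cancer deaths and 20 cancer deaths.
initial = [0] * 80 + [1] * 20
rng = random.Random(42)         # assumed seed
one_draw = misclassify(initial, over_rate=0.05, under_rate=0.15, rng=rng)
```

In this toy case, 4 of the 80 non-cancer cases and 3 of the 20 cancer cases are flipped. Repeating the call with the same initial dataset gives the collection of misclassified datasets for one combination of rates.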

Odds ratios and p-values were calculated for each of the 20,000 simulated datasets and were used to calculate three summary statistics for each combination of misclassification rates:

Statistic A: The geometric mean of the odds ratios for all simulated datasets, where the geometric mean is equivalent to the exponentiated mean of the log of the odds ratios from all simulated datasets: $e^{\operatorname{mean}[\log(\mathrm{OR})]}$.
Statistic B: The percentage of simulated datasets where the odds ratio moved away from the null value of 1, indicating a strengthened association between dose and disease mortality. In Scenarios 1 and 2, this was calculated as the percentage of odds ratio values that decreased, since the initial odds ratios were less than 1. In Scenarios 3–6, it was calculated as the percentage of odds ratio values that increased, since the initial odds ratios were greater than 1.
Statistic C: The percentage of simulated datasets where the odds ratio moved away from the null value of 1 and the p-value changed from non-significant to significant, indicating that the odds ratio was strengthened sufficiently to change the dose-response relationship to significant. Movement away from the null was again defined as odds ratio values that decreased for Scenarios 1 and 2, and as odds ratio values that increased for Scenarios 3–6.
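The three summary statistics can be computed from a list of simulated (odds ratio, p-value) pairs roughly as follows. This sketch assumes that every simulated odds ratio is finite and positive, that the initial dataset is non-significant, and that a significance level of 0.05 is used; the function name and the four example pairs are hypothetical.

```python
import math

def summary_statistics(initial_or, sim_results, alpha=0.05):
    """Return Statistics A, B, and C for one combination of rates.

    sim_results: list of (odds_ratio, p_value) pairs, one per simulated dataset.
    'Away from the null' means below the initial odds ratio when the initial
    value is less than 1, and above it otherwise.
    """
    if initial_or < 1:
        moved_away = lambda o: o < initial_or
    else:
        moved_away = lambda o: o > initial_or
    n = len(sim_results)
    mean_log_or = sum(math.log(o) for o, _ in sim_results) / n
    stat_a = math.exp(mean_log_or)                                   # geometric mean
    stat_b = 100.0 * sum(1 for o, _ in sim_results if moved_away(o)) / n
    stat_c = 100.0 * sum(1 for o, p in sim_results
                         if moved_away(o) and p < alpha) / n
    return stat_a, stat_b, stat_c

# Four made-up simulated results around an initial odds ratio of 0.5.
results = [(0.25, 0.01), (1.0, 0.50), (0.4, 0.20), (0.625, 0.04)]
stat_a, stat_b, stat_c = summary_statistics(0.5, results)
```

In the example, the last pair is significant but lies closer to 1 than the initial odds ratio, so it does not count toward Statistic C.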

Visualization

For each scenario, Statistics A, B, and C were visualized as heatmap figures. In each heatmap, the x-axis represented the over-misclassification rate and the y-axis represented the under-misclassification rate, both ranging from 0% to 30% in increments of 5%. Each cell displayed the value of a given summary statistic for 20,000 simulations of one specific combination of over- and under-misclassification rates, with a color gradient applied across cells to indicate the magnitude of values. The cell corresponding to 0% over- and 0% under-misclassification was left blank for Statistics B and C, as it represented the initial dataset with no misclassification applied. This resulted in three heatmaps per scenario, one for each statistic.

Software and packages

All data processing, simulations, statistical analyses, and visualization were conducted in R version 4.5.2¹⁶ using RStudio version 2026.01.0¹⁷. Generated dose distributions were produced using a truncated lognormal distribution via the truncdist package version 1.0.2¹⁸. Heatmaps were created using the ggplot2 package version 4.0.1¹⁹. The full R script is provided as Supplementary Material 3.

Results

Observed misclassification impacts

The odds ratios and p-values associated with ‘true’ causes of death found in Registrant autopsy reports and misclassified causes of death from Registrant death certificates are presented in Table 2. With the 2 × 2 table method, misclassification moved the odds ratio slightly away from the null, whereas with logistic regression it moved the odds ratio toward the null. However, for both calculation methods, the p-values for the ‘true’ and misclassified odds ratios were clearly non-significant.

Table 2 Misclassification among USTUR Registrants.

Simulated misclassification impacts

The full results for all six scenarios and all three summary statistics are provided as heatmaps in Supplementary Material 1. Figure 2 provides an example of these heatmaps using Scenario 2 results. Figure 2(a), which displays Statistic A, indicates that the initial odds ratio was 0.360 and the geometric mean of misclassified odds ratios ranged from 0.366 to 0.696. The minimum geometric mean was observed when the over-misclassification rate was 0% and the under-misclassification rate was 5%. The maximum geometric mean was observed when the over-misclassification rate was 30% and the under-misclassification rate was 30%. Figure 2(b), which displays Statistic B, indicates that the percentage of datasets where the odds ratio moved away from the null value of 1 ranged from 22.9% to 43.8%. The minimum percentage was observed when the over- and under-misclassification rates were both 30%. The maximum percentage was observed when the over-misclassification rate was 0% and the under-misclassification rate was 5%. Figure 2(c), which displays Statistic C, indicates that the percentage of datasets where the odds ratio moved away from the null value of 1 and the p-value changed to significant ranged from 0% to 4.5%. The minimum percentage was observed with a 5% over-misclassification and no under-misclassification. The maximum percentage was observed when the over-misclassification rate was 30% and the under-misclassification rate was 15%.

The ranges of values in each scenario’s heatmaps are summarized in Table 3, along with the odds ratio and p-value of the initial dataset used for each scenario.

Table 3 Misclassification simulation summary statistics.

Several trends in the odds ratios can be observed from the full results of all six simulation scenarios (Supplementary Material 1). When the odds ratio for the initial dataset was not close to 1 (Scenarios 1–2, 4, and 6), the probability of moving away from the null was greatest at low misclassification rates: the percentage of datasets where the odds ratio moved away from the null tended to decrease as misclassification increased. Conversely, when the odds ratio for the initial dataset was approximately 1, the percentage of datasets where the odds ratio moved away from the null was approximately 50% regardless of the misclassification rate. It can also be observed that when the initial dataset had a p-value that was slightly larger than the level of significance (Scenarios 4 and 6), there was a wider range in the percentage of datasets where the misclassified odds ratios moved away from the null. Additionally, there was a higher probability that the odds ratio would both move away from the null and change the conclusion of a study from non-significant to significant at a significance level of 0.05.

Discussion

The conventional heuristic states that if misclassification is non-differential, it will bias dose-response relationships toward the null. Statistic A indicates that on average, the odds ratios did move toward the null for scenarios where the odds ratio of the initial dataset was not 1. Thus, if a study could be repeated many times, on average, the dose-response relationship would be expected to move toward the null. However, the heuristic is often used to make inferences about the error in a single study. The error in the odds ratio for a particular study may or may not move the measure of the dose-response relationship toward the null. Statistic B illustrates the probability that the odds ratio for a single study will move away from the null as a consequence of misclassification. It can be seen that for a non-trivial percentage of simulated studies, the odds ratio did not follow the conventional heuristic, but instead moved away from the null. For example, between 4% and 47% of misclassified datasets associated with Scenario 6 contradicted the conventional heuristic. Statistic C further indicated that not only can the odds ratio move away from the null, but it can also do so in such a way that a p-value shifts from non-significant in the absence of outcome misclassification to significant as a result of misclassification. Thus, caution should be exercised when assuming that slightly significant findings would have been more significant if misclassification could have been accounted for. Likewise, additional caution should be exercised when assuming that slightly non-significant findings would have been significant if misclassification had been accounted for.

As suggested by the thought experiment in the introduction, when the odds ratio of the initial dataset was set close to the null value of 1 (Scenarios 3 and 5), the proportion of datasets with odds ratios that moved away from the null was approximately 50%. Consequently, the geometric means of the odds ratios were approximately 1. This trend became clearer when the size of the dataset was increased to 5,000. It is interesting to note that a small percentage (< 3%) of datasets still had odds ratios that moved far enough away from the null to change the p-value from its initial value of 0.99 to a value below 0.05, indicating that these misclassified datasets had statistically significant, but erroneous, associations between dose and disease.

The heatmaps in Supplementary Material 1 indicated that the percentage of datasets where the odds ratios moved away from the null tended to decrease as misclassification increased. Thus, the probability of moving away from the null was greatest at low misclassification rates, which is consistent with the findings from Yland et al.¹² This is concerning for epidemiological studies that are based on death certificate causes of death, because very small misclassification rates can have a relatively large impact on the validity of a study's findings, even for health outcomes with low misclassification rates, such as certain cancers. Additionally, when the initial dataset had a p-value that was slightly larger than the level of significance, there was a higher probability that the odds ratio would both move away from the null and change the conclusion of a study from non-significant to significant. This occurred because the initial p-value was so close to the level of significance that even a small deviation in the odds ratio could tip it over to significance. This is also concerning for epidemiological studies, such as low-dose radiation epidemiological studies, where barely significant associations are often found and published.

Given the various factors that influence the probability that correcting for misclassification would change the conclusion of a study, one might ask when it is reasonable to apply the heuristic to the findings of a particular study. Certainly, if the probability of a conclusion change was 0.01%, the impact of misclassification would be trivial, and the heuristic could be used. However, if the probability of a conclusion change was 40%, there would be a noteworthy chance that the heuristic could be misleading.

Other studies have simulated the effect of outcome misclassification on estimates of risk^11,12,20. Yland et al.¹² investigated the impact of misclassification on risk ratios, and observed a decreasing probability that the risk ratio would move away from the null with increasing misclassification rates in simulated datasets. They simulated non-differential misclassification in hypothetical populations of increasing size, and demonstrated that even when the expected misclassification rate is non-differential, the observed errors may be differential due to chance. However, for a fixed risk ratio, as the sample size and/or the level of misclassification increased, random chance was less likely to move the results away from the null.

Efforts are underway to better understand additional factors that influence the trends in the heatmaps in Supplementary Material 1. A “cancellation effect” may play a role, especially for categorical exposures, where the effect of over-misclassified cases cancels out the effect of under-misclassified cases, such that the net balance of cancer deaths for high- and low-dose cases remains similar. Other factors that may influence the trends on the heatmaps include: the dose distribution, baseline disease rate, the strength of the dose-response association, confounding factors, sample size, etc. Future work is needed to explore the influence of these factors as well as the influence of interactions among the factors. Previous work²¹ indicates that when confounding factors are introduced, the effect of misclassification on measures of risk is more complex.

This simulation study was carried out using cancer mortality among USTUR Registrants as an example. However, the methods and equations involved are not specific to any particular disease, and the findings can be generalized to other outcomes. For example, the baseline rate of deaths due to heart disease (21%) among US residents is similar to that for cancer (19%); therefore, the results of this simulation study can be generalized to heart disease. The impact of misclassification of less common diseases could be explored by changing the coefficient associated with baseline disease rate in Equation 1. The methodology could also be extended to endpoints such as morbidity or specific types of cancer.

Recent methodological discussions have highlighted important limitations of odds ratios as effect-size measures, particularly their non-collapsibility and limited interpretability as population-level risks in multivariate logistic regression models^22,23. In this study, odds ratios were not used for effect-size interpretation, causal inference, or comparison across models. Rather, they served as the statistical quantities underlying hypothesis testing in logistic regression. Because the model specification was held constant across simulations and included radiation dose as the only explanatory variable, variation in estimated odds ratios across simulations reflected only outcome misclassification and finite-sample variability, not non-collapsibility induced by conditioning on additional covariates. The purpose of the analysis was therefore inferential—to examine how outcome misclassification can influence apparent statistical significance in a single realized study—rather than interpretative. This distinction aligns with recent guidance emphasizing the importance of separating inferential and effect-size roles of odds ratios in applied research²².

A substantial methodological literature has developed approaches to mitigate the impact of outcome misclassification under various assumptions. A comprehensive overview of these methods is provided by Fox, MacLehose, and Lash²⁴. One class of methods relies on likelihood-based models that explicitly parameterize misclassification probabilities, allowing simultaneous estimation of disease risk and classification error when sensitivity and specificity are known or estimable. Related approaches incorporate external validation data, in which misclassification parameters are estimated from an independent or nested validation sample and subsequently integrated into the primary likelihood. Alternative strategies include validation-sample–based models that treat true outcome status as partially observed, as well as regression calibration techniques that use estimated misclassification probabilities to adjust regression coefficients. When sensitivity and specificity of outcome classification are available or can be estimated from validation data, simpler correction approaches may also be applied. One such approach is the Rogan–Gladen estimator²⁵, which adjusts observed outcome prevalence to obtain an unbiased estimate of the true prevalence and has been widely used in epidemiologic studies involving screening and surveillance. These methods can reduce bias in effect estimation when model assumptions are satisfied and adequate validation data are available.
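For illustration, the Rogan–Gladen estimator mentioned above can be written in a few lines. The estimator is (observed prevalence + specificity − 1) / (sensitivity + specificity − 1). The sketch below treats sensitivity and specificity as known constants; the numerical values in the usage example are hypothetical and are not taken from the USTUR validation data. Truncating the result to the interval [0, 1] is a common practical convention that is assumed here.

```python
def rogan_gladen(observed_prevalence, sensitivity, specificity):
    """Rogan-Gladen corrected prevalence, truncated to [0, 1].

    Assumes sensitivity and specificity are known and that the
    classification is better than chance (sensitivity + specificity > 1).
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (observed_prevalence + specificity - 1.0) / denominator
    # Sampling error can push the raw estimate outside [0, 1].
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical inputs: 25% of deaths recorded as cancer on death
    # certificates, with sensitivity 0.80 and specificity 0.90.
    print(rogan_gladen(0.25, sensitivity=0.80, specificity=0.90))
```

With perfect classification (sensitivity and specificity both equal to 1), the estimator returns the observed prevalence unchanged.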

The objective of the present study, however, was not to estimate corrected effect measures but rather to examine how outcome misclassification can affect statistical inference in individual realized studies prior to, or in the absence of, formal correction. Understanding the inferential behavior of uncorrected analyses remains important in practice, as correction methods are not always implemented, validated, or reported, and applied interpretations frequently rely on nominal statistical significance from standard regression models. Ongoing work by the authors applies Rogan–Gladen–type correction methods using validation data from the USTUR Registrants to directly examine how misclassification correction alters estimated associations and statistical inference.

Conclusions

It is generally understood that non-differential misclassification biases dose-response relationships toward the null, and this heuristic is often used to suggest that correcting for misclassification would only strengthen these relationships. While this is often the case, the findings of this study indicate that this general belief is not always correct for individual studies. Nominally non-differential misclassification can move the odds ratio away from the null. There is a non-trivial probability that correcting for misclassification would change a dose-response relationship from significant in a misclassified dataset to non-significant in the correctly classified dataset. Consequently, researchers should use caution when assuming that accounting for outcome misclassification in a particular study would have strengthened the dose-response association. Future work will focus on the application of appropriate methods for correcting this type of misclassification error.

Data availability

All data generated or analyzed during this study are included in Supplementary Materials 2 and 3.

Abbreviations

GM: Geometric mean
GSD: Geometric standard deviation
IRB: Institutional Review Board
LR: Logistic regression
OR: Odds ratio
USTUR: United States Transuranium and Uranium Registries

References

1. Engel, L. W., Strauchen, J. A., Chiazze, L. Jr. & Heid, M. Accuracy of death certification in an autopsied population with specific attention to malignant neoplasms and vascular diseases. Am. J. Epidemiol. 111, 99–112 (1980).
2. James, G., Patton, R. E. & Heslin, A. S. Accuracy of cause-of-death statements on death certificates. Public Health Rep. 70, 39–51 (1955).
3. Mieno, M. N. et al. Accuracy of death certificates and assessment of factors for misclassification of underlying cause of death. J. Epidemiol. 26, 191–198 (2016).
4. Modelmog, D., Rahlenbeck, S. & Trichopoulos, D. Accuracy of death certificates: A population-based, complete-coverage, one-year autopsy study in East Germany. Cancer Causes Control 3, 541–546 (1992).
5. Ron, E., Carter, R., Jablon, S. & Mabuchi, K. Agreement between death certificates and autopsy diagnoses among atomic bomb survivors. Epidemiology 5, 48–56 (1994).
6. McComish, S. L., Liu, X., Martinez, F. T., Zhou, J. Y. & Tolmachev, S. Y. Misclassification of causes of death among a small all-autopsied group of former nuclear workers: Death certificates vs. autopsy reports. PLoS One 19, e0302069. https://doi.org/10.1371/journal.pone.0302069 (2024).
7. Copeland, K. T., Checkoway, H., McMichael, A. J. & Holbrook, R. H. Bias due to misclassification in the estimation of relative risk. Am. J. Epidemiol. 105, 488–495 (1977).
8. Rothman, K. J., Greenland, S. & Lash, T. L. Modern Epidemiology 3rd edn (Lippincott Williams & Wilkins, 2008).
9. Linet, M. S., Schubauer-Berigan, M. K. & Berrington de González, A. Outcome assessment in epidemiological studies of low-dose radiation exposure and cancer risks: Sources, level of ascertainment, and misclassification. J. Natl. Cancer Inst. Monogr. 2020, 154–175 (2020).
10. Jurek, A. M., Greenland, S. & Maldonado, G. How far from non-differential does exposure or disease misclassification have to be to bias measures of association away from the null? Int. J. Epidemiol. 37, 382–385 (2008).
11. Whitcomb, B. W. & Naimi, A. I. Things don’t always go as expected: The example of nondifferential misclassification of exposure—bias and error. Am. J. Epidemiol. 189, 365–368 (2020).
12. Yland, J. J., Wesselink, A. K., Lash, T. L. & Fox, M. P. Misconceptions about the direction of bias from nondifferential misclassification. Am. J. Epidemiol. 191, 1485–1494 (2022).
13. Bruner, H. D. A plutonium registry. In Diagnosis and Treatment of Deposited Radionuclides (eds Kornberg, H. A. & Norwood, W. D.) 661–665 (Excerpta Medica Foundation, 1968).
14. Kathren, R. L. & Tolmachev, S. Y. The United States Transuranium and Uranium Registries (USTUR): A five-decade follow-up of plutonium and uranium workers. Health Phys. 117, 118–132 (2019).
15. Dohoo, I. R., Martin, S. W. & Stryhn, H. Methods in Epidemiologic Research (VER Incorporated, 2012).
16. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Version 4.5.2. https://www.R-project.org (2025).
17. Posit team. RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. Version 2026.01.0. http://www.posit.co (2026).
18. Novomestky, F. & Nadarajah, S. Truncated Random Variables. R package version 1.0.2. https://doi.org/10.32614/CRAN.package.truncdist (2016).
19. Wickham, H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. R package version 4.0.1. https://ggplot2.tidyverse.org (2016).
20. French, B. et al. Misclassification of primary liver cancer in the Life Span Study of atomic bomb survivors. Int. J. Cancer 147, 1294–1299 (2020).
21. Greenland, S. The effect of misclassification in the presence of covariates. Am. J. Epidemiol. 112, 564–569 (1980).
22. Norton, E. C., Dowd, B. E. & Maciejewski, M. L. Odds ratios—current best practice and use. JAMA 320, 84–85 (2018).
23. Norton, E. C., Dowd, B. E., Maciejewski, M. L. & Garrido, M. M. Requiem for odds ratios. Health Serv. Res. 59, e14337. https://doi.org/10.1111/1475-6773.14337 (2024).
24. Fox, M. P., MacLehose, R. F. & Lash, T. L. Applying Quantitative Bias Analysis to Epidemiologic Data 2nd edn. (Springer, Cham, 2021).
25. Rogan, W. J. & Gladen, B. Estimating prevalence from the results of a screening test. Am. J. Epidemiol. 107, 71–76 (1978).

Funding

The USTUR is funded by U.S. Department of Energy, Office of Health Studies and Former Worker Programs (EHSS-12), under grant award DE-HS0000073 to Washington State University.

Author information

Authors and Affiliations

United States Transuranium and Uranium Registries, College of Pharmacy and Pharmaceutical Sciences, Washington State University, Richland, WA, USA
Xirui Liu, Stacey L. McComish & Sergey Y. Tolmachev
Oak Ridge Institute for Science and Education, Oak Ridge Associated Universities, Oak Ridge, TN, USA
Sara C. Howard
U.S. Department of Energy, Washington, DC, USA
Joey Y. Zhou

Contributions

XL, SLM, and SCH prepared and analyzed all data. XL, SLM, SCH, JYZ, and SYT prepared the manuscript. JYZ and SYT designed and supervised the study. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xirui Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Institutional review

This study was performed as a part of the USTUR research program, which was reviewed and approved by the Central Department of Energy Institutional Review Board (USA) No. WASU-68-50181.

Supplementary Information

The electronic supplementary material comprises:

Supplementary Material 1 (DOCX)
Supplementary Material 2 (CSV)
Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Liu, X., McComish, S.L., Howard, S.C. et al. Inference under outcome misclassification in health risk models using a simulation study with a validation dataset. Sci Rep 16, 11981 (2026). https://doi.org/10.1038/s41598-026-41788-6

Received: 03 July 2025
Accepted: 23 February 2026
Published: 04 March 2026
Version of record: 10 April 2026
DOI: https://doi.org/10.1038/s41598-026-41788-6
