Abstract
Idiopathic pulmonary fibrosis (IPF) is a lethal fibrosing interstitial lung disease with a mean survival time of less than 5 years. Nonspecific presentation, a lack of effective early screening tools, unclear pathobiology of early-stage IPF and the need for invasive and expensive procedures for diagnostic confirmation hinder early diagnosis. In this study, we introduce a new screening tool for IPF in primary care settings that requires no new laboratory tests and does not require recognition of early symptoms. Using subtle comorbidity signatures identified from the history of medical encounters of individuals, we developed an algorithm, called the zero-burden comorbidity risk score for IPF (ZCoR-IPF), to predict the future risk of an IPF diagnosis. ZCoR-IPF was trained on a national insurance claims database and validated on three independent databases, comprising a total of 2,983,215 participants, with 54,247 positive cases. The algorithm achieved positive likelihood ratios greater than 30 at a specificity of 0.99 across different cohorts, for both sexes, and for participants with different risk states and history of confounding diseases. The area under the receiver-operating characteristic curve for ZCoR-IPF in predicting IPF exceeded 0.88 at 1 year before a conventional diagnosis and was approximately 0.84 at 4 years before a conventional diagnosis. Thus, if adopted, ZCoR-IPF can potentially enable earlier diagnosis of IPF and improve outcomes of disease-modifying therapies and other interventions.
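As a reading aid for the figures quoted above, the following minimal sketch works through the standard definition of the positive likelihood ratio, LR+ = sensitivity / (1 - specificity). It is not part of the ZCoR-IPF software; the function names are hypothetical and chosen for illustration only. Under this definition, an LR+ above 30 at a specificity of 0.99 corresponds to a sensitivity above roughly 0.30.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Standard definition: LR+ = sensitivity / (1 - specificity)."""
    if not (0.0 <= sensitivity <= 1.0):
        raise ValueError("sensitivity must lie in [0, 1]")
    if not (0.0 <= specificity < 1.0):
        raise ValueError("specificity must lie in [0, 1)")
    return sensitivity / (1.0 - specificity)


def min_sensitivity_for_lr(lr_plus: float, specificity: float) -> float:
    """Smallest sensitivity that reaches a given LR+ at a fixed specificity."""
    return lr_plus * (1.0 - specificity)


# Numbers taken from the abstract: LR+ > 30 at specificity 0.99.
required_sensitivity = min_sensitivity_for_lr(30.0, 0.99)

# Pooled fraction of positive cases across all databases (from the abstract).
# This is a study case fraction, not necessarily a population prevalence.
case_fraction = 54_247 / 2_983_215

if __name__ == "__main__":
    print(f"sensitivity needed for LR+ = 30 at specificity 0.99: {required_sensitivity:.2f}")
    print(f"pooled fraction of positive cases: {case_fraction:.4f}")
```

The same helper can be applied to any other operating point read off an ROC curve, which makes it easy to see how quickly LR+ falls as specificity is relaxed.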
Data availability
The Truven, UCM and MAYO datasets cannot be made available due to their commercial nature. D.O. and I.C. had access to the Truven and UCM databases, and I.C. was responsible for maintaining the integrity of these datasets. C.G.N., L.J.F. and A.H.L. had access to the MAYO dataset, and A.H.L. was responsible for maintaining the integrity of that dataset.
Code availability
Methodological details needed to evaluate our conclusions are included in the Methods and Supplementary Information. A working software implementation of the pipeline (free for noncommercial evaluations) is available at https://doi.org/10.5281/zenodo.6040418, which includes installation instructions in standard Python environments. To enable fast execution, some more compute-intensive features are disabled in this version. Results from this software are for demonstration purposes only, and must not be interpreted as medical advice, or serve as replacement for such.
Acknowledgements
This work is funded in part by the Defense Advanced Research Projects Agency under project no. HR00111890043. The claims made in this study do not reflect the position or the policy of the US Government. The UCM dataset is provided by the Clinical Research Data Warehouse (CRDW) maintained by the Center for Research Informatics at the University of Chicago. The Center for Research Informatics is funded by the Biological Sciences Division and the Institute for Translational Medicine/CTSA (National Institutes of Health award no. UL1 TR000430) at the University of Chicago.
Author information
Contributions
D.O. implemented the algorithm and ran validation tests. D.O. and I.C. carried out mathematical modeling and algorithm design. R.J.M., F.J.M. and I.C. wrote the paper. F.J.M., G.M.H. and I.C. interpreted results and guided research. C.G.N., L.J.F. and A.H.L. evaluated the tool on the dataset available at the Mayo Clinic. I.C. procured funding for the research.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Athol Wells, Harold Collard and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Michael Basson, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Performance under delayed updates to participant records.
a, b, Out-of-sample ROC curves when the patient data are delayed by 4 weeks vs the no-delay condition, for the UCM and the Truven datasets, respectively. The 95% confidence bounds about the mean are shown, computed with n=2,053,277 for Truven and n=68,658 for UCM. Note that there is no significant loss of performance with such delayed data. c, d, ZCoR-IPF performance vs an 87-feature baseline model optimized via logistic regression, where these features denote the presence or absence of manually curated risk factors (Supplementary Table 4) and age (over or under 65 years), for the Truven and the UCM datasets, respectively.
Extended Data Fig. 2 Comparison with neural network architectures.
a, b, Out-of-sample AUC achieved in the Truven and UCM datasets, respectively, by a range of neural network architectures, from simple feed-forward networks, LSTMs and CNNs to large state-of-the-art models such as AlexNet, DenseNet and ResNet, along with 95% confidence intervals about the mean (n=2,053,277 for Truven and n=68,658 for UCM).
Extended Data Fig. 3 Performance with broader target definition.
a, b, Out-of-sample ROC curves for the Truven and the UCM datasets, respectively, comparing the results of the primary analysis with those of the secondary analysis (the analysis with the broader target definition specified in Extended Data Table 1). The 95% confidence bounds about the mean are shown, computed with n=2,053,277 for Truven and n=68,658 for UCM. c, Negative vs positive likelihood ratios (LR- vs LR+). d, Positive vs negative predictive values. Note that with the broad target definition we can also choose to operate with LR+ > 30, similar to the target in the primary analysis.
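For readers unfamiliar with the quantities plotted in panels c and d, the sketch below shows how likelihood ratios and predictive values follow from a single operating point using the standard textbook formulas. It is an illustration only and is not drawn from the ZCoR-IPF code; the class and function names, the sensitivity of 0.35 and the prevalence of 0.018 are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One point on an ROC curve (hypothetical container for illustration)."""
    sensitivity: float
    specificity: float

    def lr_positive(self) -> float:
        # LR+ = sensitivity / (1 - specificity)
        return self.sensitivity / (1.0 - self.specificity)

    def lr_negative(self) -> float:
        # LR- = (1 - sensitivity) / specificity
        return (1.0 - self.sensitivity) / self.specificity


def predictive_values(point: OperatingPoint, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a given prevalence via Bayes' rule."""
    true_pos = point.sensitivity * prevalence
    false_neg = (1.0 - point.sensitivity) * prevalence
    false_pos = (1.0 - point.specificity) * (1.0 - prevalence)
    true_neg = point.specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed values for illustration only; they are not results from the study.
example_point = OperatingPoint(sensitivity=0.35, specificity=0.99)
example_ppv, example_npv = predictive_values(example_point, prevalence=0.018)

if __name__ == "__main__":
    print(f"LR+ = {example_point.lr_positive():.1f}, LR- = {example_point.lr_negative():.2f}")
    print(f"PPV = {example_ppv:.3f}, NPV = {example_npv:.3f}")
```

Because PPV depends strongly on prevalence while the likelihood ratios do not, the same operating point can yield very different predictive values in cohorts with different case fractions.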
Extended Data Fig. 4 Co-morbidity Spectra.
a, b, Diseases (recorded ICD codes) that increase the odds of the patient being a ‘true positive’ vs a ‘true negative’ for males and females, respectively. These odds are broadly similar across the sexes, with over-representation of respiratory disorders.
Extended Data Fig. 5 Expected increase in survival times.
a, Survival function lower bounds at two specificity levels (90 and 95%). b, Cumulative hazard function upper bounds. The 95% confidence bounds around the mean are shown for both, generated using the Truven dataset (n=2,053,277). c, Variation of the mean survival time as a function of the specificity at which ZCoR-IPF is operated. d, Variation of estimated raw risk as a function of age for screening four years before the actual recorded diagnosis of IPF, showing that risk increases almost linearly with age for the patients eventually diagnosed with IPF. e, Degradation of out-of-sample AUC as we attempt to screen earlier, stepping back from the time of current diagnosis (in the absence of ZCoR-IPF screening).
Supplementary information
Supplementary Note, Tables 1–11 and Figs. 1–3
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Onishchenko, D., Marlowe, R.J., Ngufor, C.G. et al. Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records. Nat Med 28, 2107–2116 (2022). https://doi.org/10.1038/s41591-022-02010-y
DOI: https://doi.org/10.1038/s41591-022-02010-y