Introduction

Lung cancer (LC) is the leading cause of cancer-related mortality globally and in the United States (US). Although a majority of LC cases are attributable to smoking, 15–25% of all lung cancers in the US occur in persons who have never smoked1. Moreover, LC in individuals who have never smoked (LCINS) is the 8th leading cause of cancer-related mortality in the US (5th globally)2. Because persons who have never smoked (PWHNS) are ineligible for LC screening, they are commonly diagnosed at advanced stages, when 5-year survival is only 26%3,4.

There are notable differences between LC arising in smokers and LC in PWHNS. LCINS is twice as common in women, particularly Asian and Hispanic women, as in men. Further, LC in younger persons is much more likely to occur in PWHNS2,5,6. LCINS appears to be histologically and molecularly distinct from LC in smokers, with a greater proportion of LCINS cases being of adenocarcinoma histology and harboring targetable driver mutations2,5,6,7. Factors associated with LCINS appear to selectively target distal airways, unlike smoking-related carcinogens that target both proximal and distal lung compartments6.

There are several known or suspected risk factors for LCINS, including exposure to second-hand tobacco smoke, indoor and outdoor air pollution, radon, asbestos, silica, and heavy metals, and inherited genetic susceptibility5,6,8,9,10,11. Previous lung diseases likely also increase LCINS risk. A large, pooled analysis found significantly elevated rates of LCINS among individuals diagnosed with pneumonia and tuberculosis 5–10 years prior to the lung cancer diagnosis12. Together, however, these risk factors explain only a small proportion of LCINS in the US and Western Europe13.

Because the majority of LCINS cannot be explained by established risk factors, alternative approaches are needed to gain etiologic insight. Electronic Medical Records (EMRs) contain extensive health-related data on medical diagnoses, laboratory tests, and prescription drug use. These data can be leveraged to screen multiple medical conditions to generate hypotheses for further follow-up14. The aim of this study was to use the United Kingdom’s (UK) Clinical Practice Research Datalink (CPRD) to agnostically identify and independently validate medical conditions associated with the risk of LCINS. This study is part of the Sherlock-Lung project on LCINS15,16. In this work, we show that individuals with LCINS are more likely to have been diagnosed with inflammatory diseases. In addition to providing etiologic insights, these findings may help lay a foundation for future risk stratification and LCINS screening programs.

Results

Discovery stage

We identified 1581 LCINS cases and 14,318 never-smoking controls who had been registered for a median of 15.0 years (interquartile range [IQR] 6.3–28.3) prior to the index date (Table 1). Differences between the feasibility study (“Methods”) and the actual numbers of included cases arose because the final study population included an additional year (2019) and we imposed restrictions (e.g., age at diagnosis 30–89). By design, cases and controls were similar with respect to age, sex, registration year, and region, although the proportion of female controls was slightly higher than female cases (58.2% versus 56.4%).

Table 1 Characteristics of lung cancer cases who have never smoked diagnosed between 1998-2019 and matched controls who have never smoked in CPRD-GOLD and CPRD-Aurum

In the 1–10 years prior to the index date (the date of diagnosis for cases and the date of selection for controls), infections and inflammation (II) (adjusted odds ratio [aOR] = 1.26, 95% confidence interval [CI] 1.12–1.42) and anemia (aOR = 1.44, 95%CI 1.13–1.83) were associated with LCINS in conditional logistic regression analyses (Table 2). Results for all categories and subcategories are listed in Supplementary Table S1. Dementia and Alzheimer’s disease were inversely associated with LCINS risk (aOR = 0.55, 95%CI 0.36–0.84). However, these associations could be confounded by functional status (i.e., individuals who are frail and/or have short life expectancy are unlikely to be evaluated and/or treated for cancer) and we thus elected not to validate them. Although the cardiovascular disease category was not associated with LCINS at a false discovery rate (FDR) p-value < 0.05, a few diseases within that category were strongly associated with LCINS risk, including: peripheral vascular disease (aOR = 2.93, 95%CI 1.74–4.96) and myocardial infarction (aOR = 1.51, 95%CI 1.22–1.88). Because the category was not associated with LCINS overall, these conditions were not considered for validation. Thrombosis, kidney disease, eye conditions, and osteoporosis were significantly associated (aORthrombosis 10-32 years prior = 2.28, 95%CI 1.47-3.52, aORkidney disease 10-32 years prior = 2.67, 95%CI 1.22–5.87, aOReye conditions 10-32 years prior = 1.39, 95%CI 1.00–1.92, aORosteoporosis 10-32 years prior = 1.57, 95%CI 1.01–2.46) with LCINS in the 10–32 years prior to the index date despite not being significantly associated with LCINS in the 1–10 years prior to the index date (aORthrombosis 1-10 years prior = 1.20, 95%CI 0.87-1.64, aORkidney disease 1-10 years prior = 1.07, 95%CI 0.87–1.33, aOReye conditions 1-10 years prior = 1.39, 95%CI 1.00–1.92, aORosteoporosis 10-32 years prior = 1.57, 95%CI 1.01–2.46). 
No other primary categories were associated with LCINS in the 10–32 years prior to the index date.

Table 2 Adjusted odds ratios (aORs) and 95% confidence intervals (CIs) estimated from conditional logistic regression models for the association between primary disease categories and LCINS sorted by the false discovery rate for the 1-10 year associations in the discovery dataset (CPRD-GOLD)

In hierarchical logistic models, the group mean estimate for the Infections and Inflammation (II) category was similar to that from the conditional logistic regression analysis (aOR = 1.22, 95%CI 1.08–1.38). Of the 51 II sub-categories, 12 were associated with LCINS in the 1–10 years prior to the index date (Table 3 and Fig. 1). In hierarchical models, statistically significant associations were observed for COPD (aOR1-10 years prior = 3.43, 95%CI 2.47–4.76); tuberculosis (aOR1-10 years prior = 3.69, 95%CI 1.01–13.38); lupus (aOR1-10 years prior = 3.78, 95%CI 1.36–10.54); gastrointestinal ulcers (aOR1-10 years prior = 1.27, 95%CI 1.01–1.61); gastro-esophageal reflux disease (GERD, aOR1-10 years prior = 1.41, 95%CI 1.13–1.75); gastritis and non-infective gastroenteritis and colitis (gastritis/NIGC, aOR1-10 years prior = 1.79, 95%CI 1.31–2.45); influenza (aOR1-10 years prior = 1.52, 95%CI 1.09–2.10); upper respiratory infections, which largely consisted of bronchitis and tracheitis (aOR1-10 years prior = 1.25, 95%CI 1.09–1.43); psoriasis (aOR1-10 years prior = 1.29, 95%CI 1.02–1.62); diabetes mellitus type 1 (DMT1: aOR1-10 years prior = 1.74, 95%CI 1.05–2.89); other autoimmune conditions NOS (aOR1-10 years prior = 1.77, 95%CI 1.24–2.53); and rosacea (aOR1-10 years prior = 1.52, 95%CI 1.09–2.12). Of these 12 sub-categories, COPD (aOR10-32 years prior = 3.28, 95%CI 1.59–6.80), lupus (aOR10-32 years prior = 3.30, 95%CI 0.89–12.31), and gastritis/NIGC (aOR10-32 years prior = 1.53, 95%CI 0.96–2.43) showed associations of similar magnitude with LCINS in the 10–32 years prior to the index date, although the confidence intervals for lupus and gastritis/NIGC included 1 (Table 3, Fig. 1 and Supplementary Fig. S1).

Table 3 Statistically significant associations (adjusted odds ratios, aORs, and 95% confidence intervals, CIs, estimated from hierarchical logistic regression models) between disease subcategories in the Infection and Inflammation category and LCINS in the discovery dataset
Fig. 1: Forest plots showing the associations between the medical conditions identified in the 1-10 years prior to selection and lung cancer in individuals who have never smoked.

Data are presented as adjusted odds ratios with 95% confidence intervals from conditional logistic regression models. Estimates in blue are the associations from the discovery (CPRD-GOLD) dataset (1581 LCINS cases and 14,318 never smoking controls), and those in red are the associations in the validation dataset (CPRD-Aurum) (2188 LCINS cases and 19,597 never smoking controls). Source data are provided as a Source Data File.

Validation

We identified 2188 LCINS cases and matched 19,597 never-smoking controls who had been enrolled in a CPRD-Aurum general practice for a median of 15.7 years (IQR 7.3–27.5) prior to the index date (Table 1). See Source Data file S1_ValidationAurumCodes for codes used to identify medical conditions. Compared with the CPRD-GOLD population, the CPRD-Aurum population was more likely to have index dates in later years.

Conditions that were significantly associated (p < 0.05) with LCINS risk in the validation dataset, as in the discovery dataset, in either time period included: COPD (aOR1-10 years prior = 2.90, 95%CI 2.23–3.78; aOR10-32 years prior = 2.52, 95%CI 1.34–4.71); upper respiratory infections (aOR1-10 years prior = 1.30, 95%CI 1.16–1.46; aOR10-32 years prior = 1.20, 95%CI 1.01–1.42); DMT1 (aOR1-10 years = 1.62, 95%CI 1.08–2.44; aOR10-32 years = 1.55, 95%CI 0.73–3.32); anemia (aOR1-10 years = 1.27, 95%CI 1.05–1.54; aOR10-32 years = 1.40, 95%CI 1.01–1.95); GERD (aOR1-10 years = 1.32, 95%CI 1.09–1.60; aOR10-32 years = 1.11, 95%CI 0.78–1.58); and gastritis/NIGC (aOR1-10 years = 1.14, 95%CI 0.84–1.53; aOR10-32 years = 1.64, 95%CI 1.08–2.49). Conditions suggestively associated included tuberculosis (aOR1-10 years = 2.01, 95%CI 0.97–4.18; aOR10-32 years = 2.44, 95%CI 0.81–7.35) and psoriasis (aOR1-10 years = 1.27, 95%CI 0.91–1.76; aOR10-32 years = 0.88, 95%CI 0.52–1.51) (Table 4, Fig. 1 and Supplementary Fig. S1).

Table 4 Adjusted odds ratios and 95% confidence intervals from conditional logistic regression models for the associations between medical conditions and LCINS in the validation (CPRD-Aurum) dataset

Sensitivity analyses

Adjusting models for both body mass index (BMI) and socioeconomic status (SES) did not materially alter the associations seen in CPRD-Aurum (Table 5) or CPRD-GOLD data (Supplementary Table S2). We adjusted for medication use (ever use) in both the discovery and validation analyses for COPD, GERD, and gastritis/NIGC. We adjusted for medication use for psoriasis, lupus, and rosacea only in CPRD-GOLD, because these conditions were not statistically significantly associated with LCINS in CPRD-Aurum. We examined use of the following medications with codes available in both CPRD-GOLD and CPRD-Aurum: oral corticosteroids; inhaled corticosteroids; methotrexate; other immunosuppressants; long-acting beta-2 agonists (LABA); short-acting beta-2 agonists (SABA); antimuscarinic bronchodilators; macrolides; proton pump inhibitors (PPIs); H2-receptor antagonists; antacids; nonsteroidal anti-inflammatory drugs (NSAIDs); and tetracyclines. Associations between LCINS and the conditions were only modestly attenuated when medications were added to the statistical models (Table 6).

Table 5 Associations between medical conditions and LCINS in the validation (CPRD-Aurum) dataset 1–10 years before selection adjusted for body mass index (BMI) and socioeconomic status (SES, index of multiple deprivation)
Table 6 Associations between medical conditions and LCINS in both datasets adjusted for ever use of medications

Of note, iron-related anemia appeared to drive the anemia associations with LCINS (Supplementary Table S3). However, larger studies are needed to confirm these findings.

Discussion

In this large, population-based study of PWHNS, we used EMRs to agnostically identify and independently validate medical conditions potentially associated with LCINS risk. To the best of our knowledge, this is the largest study of LCINS risk17. Our approach identified several putative or established risk factors, and some conditions without prior associations with LC.

Among the previously reported risk factors, the strongest and most consistent association observed in this study was for COPD/emphysema. Although COPD is generally associated with smoking, a recent review estimates that > 50% of all cases globally occur in PWHNS, with estimates varying geographically and by SES18. Risk factors include childhood respiratory diseases, air pollution, occupational exposures18,19 as well as dysanapsis, a mismatch of airway tree caliber to lung size20. Several cytokines (e.g., interleukin (IL)-8, IL-6, IL-1β) that are elevated in persons with COPD have been previously associated with LC risk (including LCINS) and are important for LC and/or cancer biology21,22,23,24,25. Our associations are also similar to those reported in a recent retrospective cohort study from South Korea (hazard ratio [HR] = 2.67, 95%CI 2.09–3.40)26. Adjusting for several of the medications potentially used in the treatment of COPD modestly attenuated the association, suggesting that they might mediate some of the association with COPD.

Another previously reported association that we validated was upper respiratory infections, a category largely composed of bronchitis12,27. Tuberculosis, which was statistically significant in the discovery phase, was not statistically significant in the validation phase (p = 0.06). Pneumonia has commonly been associated with LC12, although it was not statistically significantly associated with LCINS in the discovery phase of this study. Asthma was also not associated with LCINS in this study. However, the UK Million Women Study found that only asthma requiring treatment was associated with LCINS risk in women28.

We validated several conditions with little to no previous evidence supporting a relationship with LC, including DMT1, anemia, gastritis, and GERD. There is conflicting evidence of an association between diabetes mellitus (DM; both types) and LC29, but the use of metformin, an antidiabetic medication, has been associated with reduced LC risk30. Further, individuals with DM are at increased risk of lung diseases such as asthma, COPD, and pulmonary fibrosis31. Interestingly, IL-21, deficiency of which can lead to immunosuppression, is dysregulated in several autoimmune conditions including DMT1, and was inversely associated with LCINS risk in a nested case-control study of women23,32.

The association between anemia and LCINS could reflect the presence of latent cancer, although anemia was also associated with LCINS when diagnosed 10–32 years prior to the index date, an interval that makes latent cancer a less likely explanation. Furthermore, our results are similar to those from a study of iron deficiency anemia and LC in a cohort of Taiwanese healthcare beneficiaries. The authors speculated that anemia may render the microenvironment conducive to carcinogenesis33. Biologically, inflammation-related anemia can be caused by the pro-inflammatory cytokine IL-6, which regulates serum iron and has been associated with elevated LC risk (including in PWHNS)23,34. Interestingly, the association between iron-related anemia SNOMED codes and LCINS was similar to the association between anemia overall and LCINS. Furthermore, both diabetes35 and gastritis36 can cause anemia.

To the best of our knowledge, there are no studies supporting an association between gastritis and LCINS, and only a single study reports an association (HR = 1.53, 95%CI 1.19–1.98) between GERD and LC in smokers and non-smokers combined37. GERD is one of the leading causes of chronic cough and is common in several inflammatory lung diseases such as cystic fibrosis, asthma, and idiopathic pulmonary fibrosis (IPF)38. It has been suggested that GERD might play a role in their pathogenesis, perhaps by causing chronic microaspiration39,40,41. Although most individuals with GERD do not experience chronic microaspiration, most persons with occult chronic aspiration have GERD42. Interestingly, hernia, which was associated with LCINS in the discovery stage (albeit not meeting the multiple-testing correction; aOR = 1.29, 95%CI 1.06–1.56, p = 0.01), can cause GERD43.

The associations were slightly attenuated for both GERD and gastritis when we adjusted for the use of PPIs or H2-receptor antagonists, which are common treatments. PPIs, H2-receptor antagonists, and antacids were independently associated with LCINS risk. PPI use has been linked to various cancers, including gastric cancer in the CPRD44. Several mechanisms by which PPIs might contribute to cancer development have been proposed, including alteration of the gut microbiome, which can affect inflammation, immune responses, and the production of carcinogenic metabolites through overgrowth of harmful bacteria. In addition, PPI-induced hypergastrinemia can cause proliferation and hyperfunction of enterochromaffin-like (ECL) cells45. Our observations could also be driven by vague symptoms of a latent, as yet undetected LCINS that led to antacid or PPI/H2-receptor antagonist use.

The association between psoriasis and LCINS observed in CPRD-GOLD was similar in magnitude in CPRD-Aurum, albeit not statistically significant there. A single study that adjusted for smoking status reported that only severe psoriasis was associated with LC46. Severe psoriasis is treated with methotrexate, which can cause pulmonary fibrosis47,48. The association between psoriasis and LCINS was modestly attenuated when methotrexate use was added to the statistical model. Psoriasis is also associated with asthma49.

Unfortunately, we could not examine the association between IPF and LCINS50 because we did not identify any IPF diagnoses among the LCINS cases included in our feasibility study. This was expected because the prevalence of IPF in the UK is estimated at only 0.78 per 10,000 persons (0.38, 1.63)51.

Our study has some limitations. First, our grouping of medical conditions was based on our clinical expertise in interpreting clinical terminologies. The coding and definitions of some diseases are likely to vary across different doctors, general practices, and calendar years. Further, some conditions (e.g., anemia and gastritis) have multiple etiologies and our study was not designed to differentiate between heterogeneous entities. This study also did not account for disease severity and/or manifestations. For example, the elevated risk for lupus in CPRD-GOLD that was not replicated in CPRD-Aurum may have been driven by systemic lupus erythematosus (SLE), which can affect the lungs and is treated with immunosuppressant medications such as azathioprine. SLE is more common in Northern Ireland and Scotland, two regions not represented in the validation dataset52. While we examined the impact of adjusting for medications on condition-LCINS associations, we relied on only single medication codes (ever exposure), although most medications examined are expected to be used on a more chronic basis. Further, data on some medication use may be unavailable in CPRD (e.g., medications administered in a hospital setting or prescribed by a specialist). Moreover, this study was not designed to robustly examine specific medication-LCINS associations, and these findings should be interpreted cautiously.

There are additional limitations. The medical ontology used in the discovery stage differed from that used in the validation stage, which may explain why some conditions did not validate. Also, our study was not linked to cancer registries. Only ~ 50% of individuals in CPRD-GOLD are linkable to cancer registries, and that restriction would have greatly reduced sample size. However, there is high concordance between cancers identified in the CPRD and those found in UK cancer registries53. Finally, this study was not designed to examine how individual LCINS risk factors jointly influence LCINS risk, and several of the conditions co-occur (e.g., autoimmune disease and anemia). Cancer risk factors rarely act alone, and future, more targeted studies need to examine the interplay between the conditions and medications in relation to LCINS risk. The study limitations should be viewed in light of the goal of this hypothesis-generating study, which was to identify etiologic clues to LCINS.

Our study had several strengths. The large study population had a median of 15 years of clinical history available. The CPRD-GOLD dataset is research quality and contains smoking status on ~ 95% of the population, and recorded smoking status has high validity54,55. The identification and validation of associations in expected conditions (e.g., COPD, tuberculosis) gives us confidence in our agnostic approach. Finally, we employed a conservative multiple-testing correction and validated our results in an independent population.

Most of the medical conditions we identified involved pulmonary or systemic inflammation. Furthermore, several of the associations persisted for > 10 years, hinting at the importance of long-term inflammation in the development of LCINS. The identification of medical conditions that were present many years before the LC diagnosis in this study agrees with the findings of a recent study from the Sherlock-Lung project that identified three molecular subtypes of LCINS16. That study reported that a large proportion of these cancers had a long latency of up to a decade before diagnosis, providing a window of time for early detection. Moreover, these cancers had specific genomic alterations suggestive of stem cells that had exited their quiescent state; inflammation (as in the medical conditions we identified here) can cause tissue damage and consequent tissue regeneration with stimulation of stem cells.

Our results are particularly relevant because of the inflammatory lung damage that SARS-CoV-2 can cause, with evidence that extended post-infection symptoms (long COVID) are associated with a two-fold increased risk of COPD, severe asthma, and pulmonary fibrosis56,57. In addition, air pollution, which is associated with an increased risk of COPD in non-smokers18, has been shown to promote lung adenocarcinoma by causing a release of IL-1β, a pro-inflammatory cytokine that plays a role in the pathogenesis of LC and COPD25,58.

The potentially long latency period might provide an opportunity to identify individuals who would benefit from earlier LC detection. The growing burden of LCINS requires developing strategies that are not reliant upon smoking history for LC risk assessment in LC screening ineligible persons. Our findings highlight the potential utility of using routinely collected data from clinical practice to identify signals associated with elevated LCINS risk.

Methods

Ethics

This study uses data from the CPRD, obtained under license from the UK Medicines and Healthcare Products Regulatory Agency. The data are provided by patients and collected by the National Health Service (NHS) as part of routine care. The CPRD Independent Scientific Advisory Committee reviewed and approved this study (proposal #18_160.RAR). Since the National Cancer Institute only received de-identified data from CPRD, had no direct contact or interaction with the study participants, and did not use or generate identifiable private information, Sherlock-Lung has been determined to constitute ‘non-human subject research’ based on the federal Common Rule (45 CFR 46; https://www.ecfr.gov/cgi-bin/ECFR?page=browse).

Study design

We employed a two-stage approach. In the first stage, we identified conditions associated with LCINS risk using CPRD-GOLD. Then we validated these conditions in an independent dataset, CPRD-Aurum.

Study design for the discovery stage

We performed a nested case-control study using the UK’s CPRD-GOLD database54. CPRD-GOLD is a research-quality, population-based database established in 1987. It currently collects EMRs from 985 primary care practices, covering ~ 20 million UK residents59. CPRD is the world’s largest computerized database of anonymized longitudinal patient records from primary care practices and CPRD-GOLD is sex and age representative of the UK population54. It includes demographic characteristics, clinical diagnoses, referral information, specialty consultation notes, laboratory test results, and prescriptions. Importantly, smoking status is available for > 95% of individuals, as primary care physicians are paid to inquire about smoking status and then update that information if necessary54,60. Clinical events are recorded using Read codes61.

Discovery population

We identified all invasive primary first lung cancer cases diagnosed between Jan 1, 1988 and Dec 31, 2019 (n = 77,099). We excluded cases with less than one year of registration prior to diagnosis in a CPRD-GOLD general practice (n = 13,731), or any evidence of any invasive cancer (except basal cell carcinoma and cervical cancer) before diagnosis (n = 10,867). Unlike cutaneous squamous cell carcinoma, basal cell carcinoma does not metastasize to the lung. Cervical cancer was not excluded so that we could test the hypothesis that the human papilloma virus is associated with lung cancer62. Unfortunately, the number of cervical cancer cases was too low (n = 2) for this purpose. We excluded cases with no documented smoking status in the full clinical history or any evidence of smoking in either Read codes or entity codes from practice visits (n = 50,643). Controls who had never smoked with at least one year of general practice registration and no evidence of a prior cancer were identified. Between 5 and 10 controls who were cancer-free, alive, and enrolled in CPRD-GOLD at case diagnosis date (termed selection date for controls) were individually matched to cases on year of birth (+/− 2 years); sex; general practice or region (general practice first, then region if we could not identify a control within the same practice because regions are large); and year of practice registration (+/− 2 years). We further excluded cases with fewer than five matched controls (n = 28), cases diagnosed before 30 or after 89 years of age (n = 129), and cases with evidence of smoking after LC diagnosis (n = 120), resulting in 1581 cases included in our discovery analyses.
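As a consistency check on the case selection described above, the reported exclusion counts can be tallied step by step. The following is a minimal sketch that assumes the exclusions were applied sequentially to non-overlapping groups of cases, in the order listed; the function and variable names are illustrative and are not part of the study's analysis code.

```python
# Exclusion counts as reported in the text for the CPRD-GOLD discovery population.
# Assumption: exclusions are sequential and mutually exclusive.
start = 77_099
exclusions = [
    ("<1 year of registration before diagnosis", 13_731),
    ("prior invasive cancer", 10_867),
    ("no smoking status or any evidence of smoking", 50_643),
    ("fewer than five matched controls", 28),
    ("diagnosed before age 30 or after age 89", 129),
    ("evidence of smoking after diagnosis", 120),
]


def apply_exclusions(n_start, steps):
    """Return the running case count after each sequential exclusion."""
    remaining = n_start
    trail = []
    for label, n_removed in steps:
        remaining -= n_removed
        trail.append((label, remaining))
    return trail


trail = apply_exclusions(start, exclusions)
for label, remaining in trail:
    print(f"{remaining:>6}  after excluding: {label}")

final_cases = trail[-1][1]
```

Under that assumption, the running total ends at the 1581 discovery cases reported in the text.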

Exposure classification for the discovery stage

A feasibility study conducted in late 2018 in the CPRD-GOLD dataset yielded 1478 LCINS cases, from which we extracted all Read codes (indicating diagnoses) present in the clinical history. We grouped case Read codes into clinically meaningful and specific disease categories. We required each disease category (called subcategory below) to include at least 15 cases, corresponding to a case prevalence of ~ 1% (0.01 × 1478 LCINS = ~ 15). This was based on statistical power calculations under a range of reasonable assumptions, including a multiple testing corrected alpha level (Supplementary Table S4). We supplemented our conditions with codes already generated within our Division for prior CPRD studies (Source data file S2_DiscoveryGOLDCodes).
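The minimum-case rule above (at least 15 cases per subcategory, i.e., ~1% of 1478) can be illustrated with a short sketch. The codes, subcategory names, and toy data below are hypothetical placeholders rather than real Read codes, and the function is an illustration of the counting rule only, not the grouping procedure used in the study, which relied on clinical judgment.

```python
import math
from collections import defaultdict

N_FEASIBILITY_CASES = 1478
MIN_PREVALENCE = 0.01
# 0.01 x 1478 = 14.78, rounded up to a whole number of cases
min_cases = math.ceil(MIN_PREVALENCE * N_FEASIBILITY_CASES)


def eligible_subcategories(case_codes, code_to_subcategory, threshold):
    """Count distinct cases per subcategory and keep those meeting the threshold."""
    cases_by_subcat = defaultdict(set)
    for case_id, codes in case_codes.items():
        for code in codes:
            subcat = code_to_subcategory.get(code)
            if subcat is not None:  # codes not tied to a disease are ignored
                cases_by_subcat[subcat].add(case_id)
    return {s: len(ids) for s, ids in cases_by_subcat.items() if len(ids) >= threshold}


# Hypothetical mapping: two codes roll up to one subcategory, one code to another.
code_to_subcategory = {"X1": "condition_a", "X2": "condition_a", "Y1": "condition_b"}

# Toy data: 20 cases carry an "X" code, and only 3 of them also carry "Y1".
case_codes = {i: ["X1"] if i % 2 else ["X2"] for i in range(20)}
for i in range(3):
    case_codes[i] = case_codes[i] + ["Y1"]

retained = eligible_subcategories(case_codes, code_to_subcategory, min_cases)
print(min_cases, retained)
```

In this toy example, condition_b is dropped because only three cases carry it, whereas condition_a is retained.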

These categories were then further restructured into 24 primary categories, in total containing 98 distinct subcategories (e.g., upper respiratory conditions within the infections and inflammation category). Some primary categories, such as infections and inflammation, had many subcategories, whereas others (e.g., anemia) had none. This strategy was used to improve statistical power by lessening the multiple testing burden and to enable the fitting of statistical hierarchical models. See Supplementary Table S5 for the hierarchy of conditions. We excluded codes that were not related to a specific disease or suspected disease. We also did not consider conditions such as obesity or alcoholism, given that they are not uniformly recorded across practices. Moreover, excessive alcohol consumption in the UK might be associated with passive smoking because alcohol consumption commonly occurred in pubs, where smoking was allowed until 2007. Read codes referring to the active management of a disease, for example, management of COPD, were included.

For each condition, we identified the earliest date at which the condition was observed in the 1–10 and 10–32 years prior to the index date. Events identified within the year before the index date were excluded to reduce the potential for reverse causation. Primary analyses were performed for diagnoses occurring in the 1–10 year interval, because it contained the largest sample size (individuals only had to be registered for one year) and it is more representative of the UK population. The 1–10 year exposure interval is also relevant because genomic results show that LCINS may progress from progenitor cells a decade before diagnosis. We explored whether the associations we identified in the 1–10 year assessment period persisted 10–32 years prior to the index date. Diagnoses before Jan 1, 1987 (32 years before 2019) were left-truncated because the data are not considered to be of research standard before this date.
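The exposure windows described above can be sketched as follows. This is an illustration under stated assumptions: years are approximated as days divided by 365.25, an event exactly 10 years before the index date is assigned to the 10–32 year window, and the function name and example dates are hypothetical. The text does not specify how window boundaries were handled.

```python
from datetime import date

DAYS_PER_YEAR = 365.25
DATA_START = date(1987, 1, 1)  # diagnoses before this date are left-truncated


def earliest_in_windows(event_dates, index_date):
    """Return the earliest event date in each exposure window before the index date."""
    windows = {"1-10y": None, "10-32y": None}
    for d in sorted(event_dates):  # ascending, so the first hit per window is the earliest
        if d < DATA_START or d >= index_date:
            continue
        years_before = (index_date - d).days / DAYS_PER_YEAR
        if 1 <= years_before < 10:
            key = "1-10y"
        elif 10 <= years_before <= 32:
            key = "10-32y"
        else:
            continue  # within the year before the index date, or beyond 32 years
        if windows[key] is None:
            windows[key] = d
    return windows


index_date = date(2015, 6, 1)
events = [
    date(2015, 1, 10),  # within the final year: excluded
    date(2010, 3, 1),   # 1-10 year window
    date(2008, 2, 1),   # 1-10 year window, earlier
    date(2001, 5, 5),   # 10-32 year window
    date(1985, 1, 1),   # before the start of research-standard data
]
result = earliest_in_windows(events, index_date)
print(result)
```

A person is then coded as exposed in a window if that window's entry is not empty.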

Statistical analyses

We used a two-stage approach to identify medical conditions associated with LCINS risk in the 1–10 years before the index date. First, we estimated adjusted odds ratios (aORs) and 95% confidence intervals (CIs) for all 24 primary disease categories (coded as 1 if any condition in that category was present in the clinical history 1–10 years prior to the index date and 0 otherwise) using conditional logistic regression models to account for the matched design. We additionally adjusted models for age at index date in single years. For each primary disease category that was associated with LCINS at an FDR p-value < 0.05, we then fit a hierarchical (random effects) logistic regression model (second stage) that included all individual conditions in that primary category. Under that model, the aORs of the individual conditions that comprise the primary category are assumed to vary randomly around the mean of the primary category (details are given in Supplementary methods: Hierarchical analyses). We adjusted the hierarchical models for matching factors to accommodate the matched study design. Because the aORs of the individual conditions are mutually adjusted for all other conditions within the primary category under the hierarchical model, a multiple testing correction is not needed, and p < 0.05 was considered statistically significant. Secondary analyses were performed for conditions diagnosed 10–32 years prior to the index date to explore associations with longer lag periods.
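The text does not name the procedure used to obtain FDR p-values. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up procedure, which is a common choice; the p-values are toy numbers, not study results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four hypothetical primary categories.
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr_p = benjamini_hochberg(raw_p)
selected = [i for i, q in enumerate(fdr_p) if q < 0.05]
print(fdr_p, selected)
```

In the two-stage design, only primary categories passing this threshold would proceed to the hierarchical model of their individual conditions.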

Sensitivity analyses

To assess the robustness of the hierarchical regression model results, we examined associations between all primary categories and individual conditions in both time intervals and LCINS risk using conditional logistic regression. Conditioning on the matching variables removes the effect of any unmeasured confounders correlated with the matching variables and thus is a more robust, although less efficient statistical analytic approach. The aORs from conditional logistic models are not adjusted for other individual conditions in the same primary disease category.
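To make the conditioning step concrete, the sketch below fits a conditional logistic model with a single binary exposure by Newton-Raphson on the conditional log-likelihood. It is a simplified illustration, not the study's implementation: it omits the age adjustment, uses toy 1:1 matched pairs rather than 5 to 10 controls per case, and all names are hypothetical. For 1:1 matching, the estimate reduces to the ratio of discordant pairs, which provides a check.

```python
import math


def conditional_logit_binary(matched_sets, n_iter=50, tol=1e-10):
    """Fit a one-covariate conditional logistic model.

    matched_sets: list of (case_exposed, [control_exposed, ...]) with 0/1 values.
    Returns (beta, standard_error).
    """
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for case_x, controls_x in matched_sets:
            n = 1 + len(controls_x)
            k = case_x + sum(controls_x)  # exposed members of the set
            # Conditional probability that the case is an exposed member
            p = k * math.exp(beta) / (k * math.exp(beta) + (n - k))
            score += case_x - p
            info += p * (1 - p)
        if info == 0:
            raise ValueError("no exposure-discordant matched sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1 / math.sqrt(info)


# Toy data: 30 pairs with only the case exposed, 10 with only the control exposed,
# plus concordant pairs, which contribute nothing to the conditional likelihood.
sets = [(1, [0])] * 30 + [(0, [1])] * 10 + [(1, [1])] * 5 + [(0, [0])] * 50
beta, se = conditional_logit_binary(sets)
odds_ratio = math.exp(beta)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))
```

Because only sets in which exposure differs among members carry information, the approach is robust to factors shared within a matched set but uses fewer data than an unconditional model, consistent with the efficiency trade-off noted above.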

We used conditional logistic regression to examine the 1–10 year condition-LCINS associations additionally adjusted for BMI and SES for those conditions that were statistically significant in the discovery stage. SES is associated with many diseases and has been shown to be independently associated with LC risk63. SES was mapped to deciles of social deprivation and included in the models as a linear term for individuals with available linkage to the CPRD index of multiple deprivation64. In CPRD-GOLD, 43% of the study population had a linkage to SES information. BMI, which has been more consistently recorded over time65, was available for ~ 75% of the study population. The most recent BMI measurements were used. Measurements occurring > 15 years before the index date were set to missing.
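The BMI rule above (most recent measurement, set to missing if recorded more than 15 years before the index date) can be sketched as follows. The sketch assumes that only measurements dated before the index date are considered and approximates years as days divided by 365.25; neither detail is stated in the text, and the function name is hypothetical.

```python
from datetime import date

MAX_BMI_AGE_YEARS = 15
DAYS_PER_YEAR = 365.25


def select_bmi(measurements, index_date):
    """Return the most recent BMI before the index date, or None if absent or too old.

    measurements: list of (measurement_date, bmi_value) tuples.
    """
    prior = [(d, bmi) for d, bmi in measurements if d < index_date]
    if not prior:
        return None
    d, bmi = max(prior, key=lambda rec: rec[0])  # most recent measurement
    years_before = (index_date - d).days / DAYS_PER_YEAR
    return bmi if years_before <= MAX_BMI_AGE_YEARS else None


index_date = date(2015, 6, 1)
bmi_recent = select_bmi([(date(2012, 1, 1), 27.4), (date(2005, 1, 1), 25.0)], index_date)
bmi_stale = select_bmi([(date(1995, 1, 1), 24.0)], index_date)
bmi_none = select_bmi([], index_date)
print(bmi_recent, bmi_stale, bmi_none)
```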

We also examined whether medications used to treat particular conditions might modify the estimated associations between the conditions and LCINS risk66 (see list of medications in Source Data file S3_MedicationsGOLD). We present associations from conditional logistic regression models that include both the condition and the medication(s) that may have been used to treat the condition. UK treatment guidance and the British National Formulary (September 2020-March 2021) were used to compile a list of medications used in the treatment of conditions identified as being associated with LCINS. Medications were then identified from the records of LCINS cases. We required only a single medication prescription within 1–10 years for an individual to be considered exposed to the medication (ever use).

Independent validation in CPRD-Aurum

Specific conditions (not primary disease categories) that were significantly associated (p < 0.05) with LCINS in CPRD-GOLD were validated in the independent CPRD-Aurum dataset. CPRD-Aurum is a population-based, research-quality database covering 10 regions of England59. Case and control selection was identical to the discovery stage. Among 105,679 lung cancer cases diagnosed between Jan 1, 1988 and Dec 31, 2019, we excluded cases with less than one year of registration prior to diagnosis (n = 2737), evidence of invasive cancer (except basal cell carcinoma and cervical cancer) before diagnosis (n = 33,629), evidence of smoking or no documented smoking status according to the full clinical history (n = 65,984), cases with fewer than five controls (n = 36), cases diagnosed before 30 or after 89 years of age (n = 247), and cases with any evidence of smoking after diagnosis (n = 206). Individuals who were present in both databases (23%) were also excluded from the validation dataset (n = 652). After exclusions, the validation dataset included 2188 cases. We removed controls who lost matched cases in the deduplication process. Unlike CPRD-GOLD, CPRD-Aurum does not capture patients in Northern Ireland, Scotland, or Wales and is therefore not geographically representative of the entire UK population67. For each validated condition, we identified the earliest date at which the condition was observed in the 1–10 and 10–32 years prior to the index date in CPRD-Aurum. Conditions only identified within the year before the index date were excluded to reduce the potential for reverse causation. SNOMED codes were used to classify conditions in CPRD-Aurum. Conditional logistic regression models were used to estimate aORs between conditions and LCINS for both the 1–10 and 10–32 year assessment windows. We present all associations but focus on those that were significant (p < 0.05) in CPRD-Aurum.
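The validation rule (p < 0.05 in either assessment window) can be illustrated with the aORs and 95% CIs reported in the Validation results. The sketch below assumes Wald-type intervals, so that a 95% CI excluding 1 corresponds to p < 0.05, and it recovers an approximate p-value from the rounded interval. The helper names are hypothetical, and the recovered p-values are approximations rather than the study's computed values.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

# (aOR, lower, upper) as reported for CPRD-Aurum in the Validation results
validation = {
    "COPD": {"1-10y": (2.90, 2.23, 3.78), "10-32y": (2.52, 1.34, 4.71)},
    "upper respiratory infections": {"1-10y": (1.30, 1.16, 1.46), "10-32y": (1.20, 1.01, 1.42)},
    "DMT1": {"1-10y": (1.62, 1.08, 2.44), "10-32y": (1.55, 0.73, 3.32)},
    "anemia": {"1-10y": (1.27, 1.05, 1.54), "10-32y": (1.40, 1.01, 1.95)},
    "GERD": {"1-10y": (1.32, 1.09, 1.60), "10-32y": (1.11, 0.78, 1.58)},
    "gastritis/NIGC": {"1-10y": (1.14, 0.84, 1.53), "10-32y": (1.64, 1.08, 2.49)},
    "tuberculosis": {"1-10y": (2.01, 0.97, 4.18), "10-32y": (2.44, 0.81, 7.35)},
    "psoriasis": {"1-10y": (1.27, 0.91, 1.76), "10-32y": (0.88, 0.52, 1.51)},
}


def wald_p_value(odds_ratio, lower, upper):
    """Approximate two-sided p-value recovered from an OR and its 95% Wald CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(odds_ratio) / se
    return math.erfc(abs(z) / math.sqrt(2))


def ci_excludes_one(lower, upper):
    return lower > 1.0 or upper < 1.0


validated = sorted(
    name
    for name, windows in validation.items()
    if any(ci_excludes_one(lo, hi) for _, lo, hi in windows.values())
)
p_tb = wald_p_value(*validation["tuberculosis"]["1-10y"])
print(validated)
print(round(p_tb, 2))
```

Under these assumptions, six conditions meet the criterion in at least one window, and the recovered p-value for tuberculosis in the 1–10 year window is approximately 0.06, in line with the value quoted in the Discussion.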

As in the discovery stage, we also performed sensitivity analyses in which conditional regression models were jointly adjusted for BMI and SES for significant findings. In contrast to the CPRD-GOLD population, most of the CPRD-Aurum population is linkable to SES information, thereby providing a more robust interpretation of the results. Following a similar strategy in the discovery stage, we also adjusted for medication use in statistical models for conditions associated with LCINS. See Source Data file S4_MedicationsAurum for the list of medications. For select validated conditions that can have multiple etiologies (e.g., anemia), we attempted to use SNOMED codes to identify different manifestations/etiological subtypes of the disease and then examine associations with LCINS.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.