Introduction

This paper focuses on the use of multiple biomarkers for the assessment of patients, or identification of otherwise healthy individuals, who are at increased risk of cancer. Given the high mortality associated with cancer, significant research has been conducted to identify patients at higher risk, starting with medical conditions that increase the risk of cancer, such as diabetes, or genetic predispositions that promote its development1. Furthermore, various screening procedures have been developed to facilitate early diagnosis, such as the Faecal Immunochemical Test (FIT) and colonoscopy for colorectal cancer (CRC)2, mammography for breast cancer3, and low-dose computed tomography (LDCT) for lung cancer4. However, cancer screening uptake remains lower than desired, e.g., in the US5. While several factors contribute to this low uptake, one of the key factors is a lack of awareness within the general population. This is even more important to address for people who may be at increased risk and would benefit from early and/or regular screening. In other words, there is still a need for convenient tests for early detection of rapidly progressing diseases such as cancer so that intervention can start as early as possible6.

Several cancer risk prediction/assessment tools based on demographic, socioeconomic or blood-based markers have been developed over the years, and studies have shown that cancer risk assessment algorithms could have an impact on early cancer diagnosis7. For instance, the QCancer 10-year risk algorithm8 considers the age, ethnicity, deprivation, body mass index, smoking, alcohol, previous cancer diagnoses, family history of cancer, relevant comorbidities, and medication data for a patient and predicts the risk for 11 cancer types. Nartowt et al.9 reported high concordance in stratifying CRC risk into low, medium and high groups using an artificial neural network trained on patient data comprising age, sex, and complete blood count (CBC). ColonFlag10 can be used to identify individuals at high risk of CRC using specific blood-based markers and refer them to screening procedures such as colonoscopy. More recently, a cell-free DNA-based blood test for the early detection of CRC has been clinically validated in the ECLIPSE study11. Moreover, multi-cancer early detection technologies12 such as the Galleri test13,14 can identify abnormal methylation patterns in cell-free DNA to detect a cancer signal and predict its origin.

Besides algorithm development, the deployment of the algorithm and communication of its findings play a critical role in its acceptance and clinical use15, and must be taken into account to facilitate screening uptake. To this end, instead of defining a test with a specific set of markers tailored to a particular cancer, we propose to use commonly measured blood markers, often obtained during the annual physical exam, to obtain a risk profile for multiple cancers. Furthermore, instead of reporting a risk score, we compute the pre- and post-test odds of a patient developing cancer over the next 12 months.

A key challenge in developing a model that considers several biomarkers is to deal with a significant degree of missingness in the historical data as not all markers may be obtained at each encounter. Although this issue can be partly alleviated by considering the biomarkers obtained at an annual physical exam, we observed that in real world data, there is still a significant degree of missingness, either due to a lack of awareness, insurance coverage or reimbursement, among other reasons. A standard approach to deal with missingness in input data is to impute the missing values using statistical methods, such as expectation maximization and regression16. However, the quality of imputed data is limited and can significantly impact the generalization ability of the trained model.

In this work, we address the aforementioned challenges by training a deep learning model, Deep Profiler, which takes the age, sex, and commonly obtained blood biomarkers included in the CBC and Comprehensive Metabolic Panel (CMP), and outputs the likelihood ratio that a patient will develop cancer over the following 12 months (see Fig. 1). The Deep Profiler architecture employs a variational autoencoder (VAE) model that is pre-trained to impute missing data, similar to the masked language modeling technique. Subsequently, we train cancer-specific risk prediction models from the shared encoded latent space and compute the likelihood ratio for each patient. We validate the proposed method on screening-relevant cohorts for three different cancers: colorectal, liver, and lung. These are among the leading causes of cancer-related mortality in the US (https://seer.cancer.gov/statfacts/html/common.html, accessed April 30, 2024).

Fig. 1

Workflow of using a biomarker-based pre-screening test.

Methods

Study selection

We created a cohort of individuals diagnosed with malignant neoplasms of the colorectum, liver or lung, or with no confirmed diagnosis of any malignant neoplasm, within the period from 2017 to 2021. We further enriched the cohort with additional patients who were diagnosed with chronic kidney disease and/or chronic liver disease but had no personal history of cancer and no new cancer diagnosis during the aforementioned period. The cohort was developed through the Prognos Factor® platform by Prognos Health, a data provider that collaborates with diverse health systems to acquire and securely integrate de-identified data while ensuring compliance with HIPAA regulations. Once the cohort was created, fully de-identified patient records were licensed from Prognos Health. The patient records include laboratory marker measurements (blood or urine based) from various encounters as well as CPT17 and ICD-10 codes18 from claims. ICD-10 codes were used as a surrogate for diagnosis when selecting patients for this study. All records correspond to an anonymized healthcare provider in the US.

Screening-based cohorts

To create cohorts for the pre-screening scenario, we utilized the medical claims data to select the patients and their visits that correspond to a screening procedure. Table 1 lists the CPT and ICD-10 codes used for each cancer type. If screening-related codes were not reported for a patient, we used diagnostic procedure codes (listed in Table 1 as well).

Table 1 Screening and diagnosis relevant CPT and ICD-10 codes used for patient selection and assignment of diagnosis labels for each cancer.

For each patient in the screening cohorts, we next determined if the patient had a positive cancer diagnosis. Although we used ICD-10 codes to select patients for the study from the Prognos Factor® platform, we recognized that the use of ICD-10 codes as confirmed diagnoses may not be sufficiently reliable19. Therefore, we ensured that at least one additional CPT or ICD-10 code associated with cancer diagnosis and/or therapy (chemotherapy and/or radiation therapy) is present after diagnosis. Figures 2a–c show the respective consort flow diagrams for each cancer. The screening cohorts for each cancer type as described above correspond to the first cell in the flow diagram for each cancer.

Fig. 2

Consort flow diagrams for various cancer cohorts. Each cell depicts the number of cancer positive (N+) and cancer negative (N−) patients. For each entry, both the number of patients and their aggregate encounter count has been reported. SIRS, systemic inflammatory response syndrome.

Fig. 3

Schematic of the system output interpretation using likelihood ratio.

For patients with a positive cancer diagnosis, we only considered the records from visits within 12 months prior to the visit associated with the screening/diagnostic procedure. This ensured that the records all had a cancer diagnosis within the next 12 months. For patients with no cancer, we only considered records for which there are subsequent records available. For each cell in the consort flow diagrams, both the aggregate patient and record counts are shown.

To be consistent with the pre-screening scenario, we next identified the records of all selected patients between the ages of 40 and 89 with no prior diagnosis of cancer. The laboratory markers obtained at each visit are tied to the underlying conditions that were diagnosed or monitored and, hence, the number and type of markers varied significantly, e.g., from a single marker to all the markers in the CBC and CMP. Thus, we only considered visits with at least 18 markers for our analysis and discarded all other visits. While this approach potentially results in some data loss, it helps prevent the machine learning model from overfitting to the specific markers measured. Note that the CBC and CMP also include derived markers such as the BUN-creatinine ratio which, if missing in the records, are computed from the reported BUN and creatinine values. Similarly, absolute blood-count markers and their percentage counterparts were computed if either one was not reported.
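As a sketch, this derived-marker fill-in can be done per record; the marker names and record layout here are illustrative, not the study's actual data vocabulary:

```python
def fill_derived_markers(record):
    """Fill in derived CBC/CMP markers when only their components are reported.

    `record` is a dict of marker name -> value (None when missing).
    Hypothetical field names; the study's actual marker vocabulary may differ.
    """
    bun, creat = record.get("bun"), record.get("creatinine")
    if record.get("bun_creatinine_ratio") is None and bun is not None and creat:
        record["bun_creatinine_ratio"] = bun / creat

    # Absolute blood-count markers and their percentages are interconvertible
    # given the total white blood cell count.
    wbc = record.get("wbc")
    if wbc:
        for cell in ("neutrophils", "lymphocytes", "monocytes"):
            abs_v, pct_v = record.get(cell), record.get(f"{cell}_pct")
            if pct_v is None and abs_v is not None:
                record[f"{cell}_pct"] = 100.0 * abs_v / wbc
            elif abs_v is None and pct_v is not None:
                record[cell] = pct_v * wbc / 100.0
    return record
```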

We subsequently enriched the resulting cohorts by including records of additional patients between the ages of 40 and 89 who have no history of cancer and no cancer diagnosis within the next 12 months. While this step enriched the data for model development, more importantly, it also incorporated individuals who are eligible for screening but may not have gone through screening within the period considered in this study. We note that the number of patients (records) used to enrich each cancer cohort is not the same in the consort flow diagrams. This is primarily because some patients have gone through a screening procedure for one or more cancers, but not all.

Data preparation and pre-processing

Given all the patients and their records in the enriched screening cohorts, we next split the data into development and validation cohorts in a 2:1 ratio, stratified by age, sex, and diagnosis. We then performed additional steps to enrich the development cohort to assist in training a robust model, as well as to clean the validation cohort by removing any ambiguous records. Specifically, we filtered out patient records that correspond to severe/acute infection, such as systemic inflammatory response syndrome (SIRS), sepsis, and/or septic shock, since these can result in significant changes in CBC and CMP marker levels and may not be relevant in the pre-screening scenario. We also enriched the development cohort by adding newly diagnosed cancer patients whose records do not include a visit associated with any of the screening codes.
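The 2:1 stratified split can be sketched with a per-stratum shuffle; the stratification key, its granularity (e.g., age bands rather than exact ages), and the rounding of the split point are our assumptions:

```python
import random
from collections import defaultdict

def stratified_split(patients, key, dev_frac=2 / 3, seed=0):
    """Split patients into development and validation sets in a ~2:1 ratio,
    stratified by a key such as (age_band, sex, diagnosis).

    `patients` is a list of records; `key` maps a record to its stratum.
    A minimal sketch; the study's actual stratification may differ.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in patients:
        strata[key(p)].append(p)
    dev, val = [], []
    for group in strata.values():
        rng.shuffle(group)  # randomize within each stratum
        cut = round(len(group) * dev_frac)
        dev.extend(group[:cut])
        val.extend(group[cut:])
    return dev, val
```

Splitting within each stratum, rather than globally, keeps the age, sex, and diagnosis distributions of the two cohorts matched.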

Given the development and validation cohorts, we normalized the distributions of biomarker values by subtracting their median and dividing by their inter-quartile distance (IQD). In addition, we transformed the distributions of the following markers to the logarithmic (log10) scale prior to their normalization: alanine aminotransferase (ALT), alkaline phosphatase (ALP), aspartate aminotransferase (AST), bilirubin, creatinine, erythrocyte distribution width (RDW), glucose, leukocytes, lymphocytes, neutrophils, and urea nitrogen. Note that the pre-processing parameters (i.e., median and IQD values) for all markers were computed on the development cohort only, and the same parameter values were then applied to the validation cohort.
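A minimal sketch of this pre-processing, assuming illustrative marker names and fitting the parameters on the development cohort only:

```python
import numpy as np

# Markers transformed to log10 scale before normalization (illustrative names).
LOG_MARKERS = {"alt", "alp", "ast", "bilirubin", "creatinine", "rdw",
               "glucose", "leukocytes", "lymphocytes", "neutrophils",
               "urea_nitrogen"}

def fit_normalizer(dev_values):
    """Compute per-marker median and inter-quartile distance (IQD) on the
    development cohort only. `dev_values` maps marker -> 1-D array."""
    params = {}
    for marker, v in dev_values.items():
        v = np.asarray(v, float)
        if marker in LOG_MARKERS:
            v = np.log10(v)
        q25, med, q75 = np.nanpercentile(v, [25, 50, 75])
        params[marker] = (med, q75 - q25)
    return params

def normalize(marker, value, params):
    """Apply the development-cohort parameters to any cohort."""
    med, iqd = params[marker]
    if marker in LOG_MARKERS:
        value = np.log10(value)
    return (value - med) / iqd
```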

Deep profiler

Given the pre-processed development cohort, we employed a deep neural network called Deep Profiler20 to train models for each of the three cancer scenarios. This method has previously been applied to train models that predict a severity progression risk score for COVID-19 patients based on biomarkers (age and nine blood markers) obtained at the time of admission20,21. Furthermore, it was demonstrated to be robust to marker missingness as well as competitive with other methods such as logistic regression and boosted forests.

The Deep Profiler model builds on a variational autoencoder architecture22, and consists of three main networks: an encoder for extracting prominent features represented in a latent space, a decoder for reconstructing the input data to ensure data fidelity of the latent feature representation, and, finally, a multi-label classifier network, which is trained to estimate the ordinal risk score. Use of autoencoders to deal with robustness to missing data has been demonstrated to be effective23,24.

In this work, we use the same Deep Profiler architecture for each of the three cancer models, with a symmetric encoder–decoder structure, each comprising blocks of three fully connected layers with 32 kernels. Each fully connected layer in the encoder is followed by batch normalization and a LeakyReLU (0.2) activation, and each fully connected layer in the decoder is followed by batch normalization and a ReLU activation. Network parameters were learned using the ADAM optimizer with an initial learning rate of \(10^{-4}\). The training loss is a combination of a reconstruction loss (only applied to input features whose values are not missing), the VAE regularization loss, and a binary cross-entropy loss based on the patient’s diagnosis.
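The masked reconstruction term and its combination with the other losses can be sketched as follows; the loss weights `beta` and `gamma` are our assumptions, as the paper does not report how the terms are weighted:

```python
import numpy as np

def masked_mse(x, x_hat, observed):
    """Mean squared reconstruction error over observed features only.

    x, x_hat : (batch, n_markers) arrays of inputs and reconstructions;
    observed : boolean mask, True where a marker was actually measured.
    Missing entries contribute nothing, so the network is never penalized
    for its imputations of unmeasured markers.
    """
    return ((x - x_hat) ** 2)[observed].mean()

def deep_profiler_loss(x, x_hat, observed, mu, log_var, y, p,
                       beta=1.0, gamma=1.0):
    """Combined training loss: masked reconstruction + VAE KL regularization
    + binary cross-entropy on the cancer label. Weights are illustrative."""
    recon = masked_mse(x, x_hat, observed)
    # Standard VAE KL divergence between q(z|x) and a unit Gaussian prior.
    kl = -0.5 * np.mean(1 + log_var - mu ** 2 - np.exp(log_var))
    eps = 1e-7
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return recon + beta * kl + gamma * bce
```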

Prediction uncertainty using deep profiler ensembles

While the VAE helps add robustness to data missingness, undesirable biases may still be present due to the patient population/sampling. To this end, we trained an ensemble of Deep Profiler models. Given the ensemble of models, a confidence interval for the risk score (output of the sigmoid layer) is computed as within 0.5 standard deviations of the mean risk score. Each model is trained on a different subset of the data, accounting for variations in the acquisition protocols across hospitals, the availability of various biomarkers, the measurement methods used to assess each biomarker, as well as patient characteristics such as age range and comorbidities. In this study, for each cancer scenario, we used an ensemble of ten Deep Profiler models.
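A sketch of the interval computation, using the reported 0.5-standard-deviation half-width; the clipping to the valid score range is our assumption:

```python
import numpy as np

def ensemble_interval(scores, half_width=0.5):
    """Risk estimate and confidence interval from an ensemble of models.

    `scores` holds one sigmoid risk score per ensemble member; the interval
    is mean +/- 0.5 standard deviations, clipped (our assumption) to [0, 1].
    """
    scores = np.asarray(scores, float)
    mean, sd = scores.mean(), scores.std()
    low = max(0.0, mean - half_width * sd)
    high = min(1.0, mean + half_width * sd)
    return mean, (low, high)
```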

Estimation of likelihood ratios

Given the patient biomarker data and the computed confidence interval over the risk score, the next key step is to present this information such that it can be easily interpreted and acted upon. To this end, we used the risk score and confidence interval to identify the subgroup of patients in the development cohort with similar scores. As a result, we obtained a “similar patient cohort” (i.e., a patient cohort with similar risk scores) in addition to the development cohort. We then computed the pre- and post-test odds for the patient to have the disease and, subsequently, calculated the likelihood ratio. This can be done separately for each medical condition.

Formally, let \(P_i\) and \(O_i\) denote the risk score and known ground-truth outcome, respectively, for each patient \(i\) in the development cohort. Given a new patient with risk score \(P_{new}\) and confidence interval \([P^{low}_{new}, P^{high}_{new}]\), we define the likelihood ratio \(LR_{new}\) as the ratio of the post-test to the pre-test odds of the patient,

$$\begin{aligned} \begin{aligned} LR_{new}&= \frac{P^{post-test}_{new}}{P^{pre-test}_{new}} \\ P^{pre-test}_{new}&= \frac{\sum _{i\in D} O_i}{|D|} \quad \text {where D is the development cohort} \\ P^{post-test}_{new}&= \frac{\sum _{i\in S} O_i}{|S|} \quad \text {where} \quad S = \{i\in D \hspace{2mm} \text {such that} \hspace{2mm} P_i \in [P^{low}_{new}, P^{high}_{new}]\} \end{aligned} \end{aligned}$$
(1)

Figure 3 below shows one way of presenting this information to the clinician. Note that the figure illustrates grouping the LR into low, medium and high, corresponding to values below 2, between 2 and 10, and above 10, respectively25.
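Equation (1) and the LR grading can be sketched directly; `dev_scores` and `dev_outcomes` are the risk scores and binary ground-truth outcomes over the development cohort:

```python
def likelihood_ratio(dev_scores, dev_outcomes, ci_low, ci_high):
    """Likelihood ratio per Eq. (1): post-test over pre-test probability of
    cancer, where the post-test estimate is taken over the development
    patients whose risk scores fall inside the new patient's confidence
    interval."""
    pre = sum(dev_outcomes) / len(dev_outcomes)
    similar = [o for s, o in zip(dev_scores, dev_outcomes)
               if ci_low <= s <= ci_high]
    post = sum(similar) / len(similar)
    return post / pre

def grade(lr):
    """Group the LR into the low/medium/high bands used in Fig. 3."""
    return "low" if lr < 2 else ("medium" if lr <= 10 else "high")
```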

Fig. 4

Quantitative performance assessment on (a) colorectal, (b) liver, and (c) lung cancer validation cohorts. ROC curves represent the predictions of the corresponding models (left). Likelihood ratio plots show the likelihood ratios of corresponding models on the full cohort and the subgroup of patients having at least nine out of the top 15 markers available, for increasing risk thresholds (middle). Likelihood ratio curves are compared to those of baseline models that are either based on the total number of out-of-reference-range (OoR) markers (middle) or single markers (right), corresponding to the top five markers for each cancer type (cf. Fig. 5), including age for CRC. Thin lines depict likelihood ratio curves of base models, while ribbons correspond to one standard deviation of their likelihood ratios. ALP alkaline phosphatase, MCH mean corpuscular hemoglobin, MCV mean corpuscular volume, RDW red cell distribution width.

Subgroup analysis based on comorbidities

We used phecodes26 to analyze comorbidities in each of the cohorts. Phecodes are manually curated groups of ICD-10 codes that are clinically meaningful and relevant to research. They have been widely applied in phenome-wide association studies but, more recently, were also used for rapid electronic health record phenotyping27,28. For the colorectal, liver and lung cancer cohorts, we censored all ICD-10 codes assigned after the date of diagnosis and only mapped the codes assigned before diagnosis to phecodes. This ensured the identification of clinical comorbidities prior to a cancer diagnosis. For control cohorts, we mapped all of the recorded ICD-10 codes to phecodes. Subsequently, we compared the counts of individuals with each phecode in the control group against each of the pre-diagnosis cancer cohorts and identified significantly different phecodes, or comorbidities, based on Fisher’s exact tests. Finally, we selected the most significant comorbidities with at least 50 individuals in both the cancer and the control cohort for subgroup analysis and evaluated the respective Deep Profiler models in those subgroups.
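The per-phecode comparison reduces to a 2×2 contingency table of phecode presence versus cohort membership. A dependency-free sketch of the two-sided Fisher's exact test (summing the hypergeometric probabilities of all tables no more likely than the observed one, the same convention scipy uses):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
        [[a, b], [c, d]],  e.g. [[phecode+ cancer,  phecode- cancer],
                                 [phecode+ control, phecode- control]].
    Conditions on the margins and sums hypergeometric probabilities of all
    tables at most as likely as the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):  # P(top-left cell == x) under fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for p in (p_table(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))  # tolerance for float ties
```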

Results

Patient characteristics

To evaluate Deep Profiler across multiple cancer types, we created a cohort of individuals who either had no diagnosis of cancer or were diagnosed with malignant neoplasms of the colorectum, liver or lung. The validation cohorts each include nearly 5000 patients with no cancer diagnosis, and 94, 189 and 224 patients who develop colorectal, liver or lung cancer, respectively. Note that the patients in the validation set were not used during model development and training. The cohorts used for model development include nearly 10,000 patients with no cancer diagnosis, and 293, 626 and 683 patients who develop colorectal, liver or lung cancer, respectively. Supplementary Table S1 summarizes the statistics of various biomarkers, including age and sex, over the entire cohort covering all three cancer types. Significant differences in the distributions of biomarker measurements in cancer cohorts as compared to control cohorts are indicated.

Likelihood ratio

We train Deep Profiler ensembles to estimate cancer risk for three cancer types: colorectal, liver and lung cancer. Given the biomarker values, the model outputs an array of risk scores (from the ensemble) for each cancer type, which are then utilized to compute the likelihood ratio (LR) of post- to pre-test odds. LRs at increasing risk thresholds, together with Receiver Operating Characteristic (ROC) curves, are shown in Fig. 4a–c for colorectal, liver, and lung cancer, respectively (corresponding precision-recall curves are provided in Supplementary Fig. S1).

Fig. 5

Cohort-level SHAP summaries showing the contribution of the 15 laboratory markers with highest impact on the LR, separately for the (a) colorectal, (b) liver, and (c) lung cancer model. Red and blue points correspond to patients with high and low values of the corresponding laboratory markers, respectively. Laboratory markers are ordered along the y-axis with respect to their overall importance for the prediction, with most important markers at the top. ALP alkaline phosphatase, AST aspartate aminotransferase, BUN blood urea nitrogen, MCH mean corpuscular hemoglobin, MCV mean corpuscular volume, RDW red cell distribution width, WBC white blood cells.

Since Næser et al. reported that the probability of cancer increases with the number of test results outside the reference range29, we used the total number of out-of-reference-range (OoR) markers as a baseline. Indeed, cancer likelihood increases with the total OoR count. However, the increase is significantly lower compared to the Deep Profiler models. Similarly, when compared individually with any of the top five markers identified by the model for each cancer, Deep Profiler provides a 2–3 fold improvement.
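This baseline reduces to a count against per-marker reference intervals; the ranges below are illustrative, not the study's actual reference intervals:

```python
def count_oor(record, reference_ranges):
    """Baseline risk indicator following Naeser et al.: the number of
    measured markers falling outside their reference range.

    `record` maps marker -> value (None when missing);
    `reference_ranges` maps marker -> (low, high), both inclusive.
    """
    n = 0
    for marker, (low, high) in reference_ranges.items():
        value = record.get(marker)
        if value is not None and not (low <= value <= high):
            n += 1
    return n
```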

In CRC, age is the only factor considered to determine screening eligibility according to recommendations by the US Preventive Services Task Force (USPSTF)30. Therefore, we also provide an LR curve for age (normalized between 40 and 85 years) as the single indicator. While CRC shows increased prevalence until the age of about 60, the increase of LR provided by Deep Profiler is significantly higher, with a four-fold improvement in LR at a threshold greater than 0.8.

Overall, the liver-specific Deep Profiler model performs significantly better than our Deep Profiler models for CRC and lung cancer (LR of ~7.5 vs. ~4.5 at a threshold of 0.8, respectively), which is also reflected in the corresponding ROC curves. Importantly, performance is preserved when assessed in cohort-specific subsets that only include cases for which measurements for at least nine of the top 15 markers are available, indicating that the models did not learn patterns related to missing marker measurements.

We further analyzed the model performance on patients below 74 years of age in the validation cohorts, as USPSTF guidelines limit cancer screening procedures beyond this age. In the CRC validation cohort, 18 patients above 74 years of age were diagnosed with cancer and 613 patients above 74 years of age had no cancer diagnosis over the next year. After filtering out these patients, the model achieved an AUC of 0.78 ± 0.02, marginally above the AUC of 0.76 ± 0.02 on the original cohort. Similarly, for the liver and lung cancer validation cohorts, the AUCs over patients below 74 years of age were 0.85 and 0.81, respectively.

Relevant biomarkers by cancer type

To gain insights into the laboratory markers with the highest impact on the LR, we show cohort-level SHAP summaries for our colorectal, liver, and lung cancer models in Fig. 5a–c, respectively. Since the LR can range from zero to infinity, we passed the computed LR for each sample through a logistic function with mean 5 and scale 0.5. This ensures that outliers, i.e., samples with extremely low or high LR, do not unduly skew the SHAP value distribution.
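The squashing step is a standard logistic function with the reported location and scale; a one-function sketch:

```python
import math

def squash_lr(lr, mean=5.0, scale=0.5):
    """Map a likelihood ratio in [0, inf) into (0, 1) with a logistic
    function centered at 5 with scale 0.5, so that extreme LR values
    cannot dominate the SHAP value distribution."""
    return 1.0 / (1.0 + math.exp(-(lr - mean) / scale))
```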

Fig. 6

Comorbidity analysis on (a) colorectal, (b) liver, and (c) lung cancer validation cohorts. Likelihood ratio plots show the likelihood ratios of corresponding models on subgroups of patients suffering from the indicated comorbidities, for increasing risk thresholds on the testing dataset. Barplots illustrate the significance of association of the corresponding comorbidity with cancer type (as negative logarithm of the p-value obtained with Fisher’s exact test) over the entire dataset. The prevalence of the respective comorbidities in the cancer-positive and control cases is listed with each comorbidity. Note that the base prevalence is different for each comorbidity. GI gastrointestinal.

While only three of the 15 most important laboratory markers (age, albumin and hematocrit) are shared across all cancer types (only age among the five most important), at least seven markers are shared between any two cancer types. In contrast, fine-tuned cancer type-specific prediction models are required to give more weight to markers predictive of only one cancer type; in fact, one marker (neutrophils), three markers [total protein, WBC and basophils (%)], and three markers (BUN-creatinine ratio, calcium, and ALT) are most important in only colorectal, liver, and lung cancer, respectively.

Several of the biomarkers identified as important in our SHAP analysis have also been independently studied and reported. For instance, six of the 12 markers that were integrated into one or both sex-specific colon cancer prediction models by Goshen et al.31 (ten of which are also available in our cohort) also rank among the most important markers in our CRC model (RDW, MCV, neutrophils, monocytes, hemoglobin, and AST).

For liver cancer, there are currently no screening recommendations, except for hepatitis B carriers, who are recommended by the American Association for the Study of Liver Diseases (AASLD) to start screening at age 40 and 50 years for men and women, respectively. In our analysis, platelet counts appear as the most important blood marker predictor of liver cancer. In fact, raised platelet counts (> 400 \(\times\) \(10^9\)/l) indicative of thrombocytosis, or even high-normal platelet counts (326–400 \(\times\) \(10^9\)/l) have been reported to be associated with higher cancer incidence, in particular colorectal and lung cancer32. However, patients with high-normal platelet counts had, in general, advanced-stage cancer at diagnosis, which may explain why platelets are particularly important in our liver cancer model, as liver cancer is mostly detected at a late stage, emphasizing the need for improved guidelines concerning screening eligibility.

Of note, the lack of a clear color gradation indicates interactions between different laboratory markers, illustrating the need to train a holistic model by integrating multiple laboratory markers.

Analysis by comorbidities

For each of the cancer cohorts, we used phecodes26 to identify comorbidities that are more likely to be present in the underlying population prior to cancer diagnosis. Relative odds ratios were computed to select comorbidities that are significantly different between cancer and control samples using Fisher’s exact test. Figure 6a–c shows the significant comorbidities over the entire dataset (left column). We found that these are usually conditions or symptoms that raise suspicion of cancer or other organ dysfunction in a cancer population. For instance, the top three significant comorbidities prior to diagnosis in the CRC cohort are other disorders of the intestine, benign neoplasm of the colon, and hemorrhage of rectum and anus (Fig. 6a). For each significant comorbidity in the corresponding cohort, we identified a sub-cohort in the testing set where it is confirmed to be present and reported the likelihood ratio obtained using the Deep Profiler model at different thresholds (Fig. 6a–c, right column).

Since the FIT test screens stool for occult blood from the lower intestines, we use the subgroup of patients with blood in stool as a surrogate to evaluate the added value of our model in comparison to an established screening test. Compared to the base prevalence, we see an increase in LR greater than three-fold, at a risk threshold greater than 0.8 (Fig. 6a). Furthermore, we also see added value in subgroups of patients with conditions such as hemorrhage of rectum and anus or gastrointestinal tract.

For lung cancer, the USPSTF criteria for screening comprise age and smoking history33. While our lung cohort does not include smoking history, we analyzed the performance of our model on the subgroup of patients with a tobacco use disorder. Notably, our model shows added value in identifying high-risk patients (more than a 50% increase in LR compared to the base prevalence, at a risk threshold greater than 0.9, Fig. 6c).

Discussion

Here, we report on the development and evaluation of separate deep learning-based risk prediction models for three cancer types based on routine laboratory marker measurements. Although other blood marker-based pan-cancer or cancer-specific risk prediction models have been reported, a direct comparison of performance and of the markers identified as important for risk assessment is difficult due to varying (i) study designs, e.g., with respect to the follow-up period until cancer diagnosis (90 days34 vs. 365 days in our model), (ii) cancer prevalence in training and validation cohorts (25.2% in35 vs. 5.7% in our lung cohort), (iii) inclusion criteria (e.g., smoking history in lung cancer), (iv) number and availability of markers (e.g., restriction to samples with complete data vs. imputation of missing data), and (v) inclusion of non-routine blood markers (e.g., AFP and CA-12529) as well as patient characteristics other than age and sex36. Although performance is not yet adequate for the model to be used as a diagnostic test, we anticipate it will be a valuable tool for recruiting patients into screening programs, thus increasing screening uptake and efficiency overall, pending additional confirmatory studies. In future work, we also plan to adapt the model to personalize the screening interval after a negative screening result (e.g., based on LDCT).

While the current USPSTF guideline for lung cancer screening only considers age and smoking history for screening eligibility37, a multitude of risk prediction models have been proposed and reported to show improved performance over the USPSTF criteria38. For instance, the PLCOm2012 model includes race, body mass index, education, presence of chronic lung disease, personal history of cancer, and family history of lung cancer, in addition to age and smoking history39. Furthermore, a four-marker protein panel (4MP), measuring a precursor form of surfactant protein B (Pro-SFTPB), cancer antigen 125 (CA-125), carcinoembryonic antigen (CEA), and cytokeratin-19 fragment (CYFRA 21-1), has been combined with PLCOm2012 to predict lung cancer risk40. However, the required biomarkers are not assessed within routine blood testing, posing a challenge for the test’s clinical utility. In contrast, our model is only based on routine laboratory markers and even performs comparable to models that integrate knowledge about smoking status or smoking history35,36.

As the cohorts were enriched with patients diagnosed with chronic liver disease, the better performance of our model in liver cancer might originate from an improved differential diagnosis. Interestingly, ALT and AST are not picked up as important markers for liver cancer either, which might also be tied to the model trying to differentiate cancer from the chronic liver disease cases in the control set. Analogously, the addition of cases with chronic bowel and lung disease could prove helpful in the colorectal and lung cancer settings, respectively.

We further note that the likelihood ratio for a patient is computed using a similar patient cohort identified via the risk scores predicted by the model. This presents a risk that patients with different biomarker values may be mapped to a similar risk profile by the model. Although the use of ensembles reduces the likelihood of such occurrences, further research to mitigate this issue may yield a performance boost. For example, defining a similarity score based on a combination of the risk scores and the VAE manifold vectors could help.

The SHAP analysis in Fig. 5 depicts the contributions of important laboratory markers to the normalized LR at the cohort level. In addition, we provide the relative contributions of laboratory markers to model predictions at the individual level as waterfall plots in Supplementary Fig. S2 for two selected cases: (1) a liver cancer case with only four OoR markers (total CO2, lymphocytes (%), hemoglobin, and hematocrit, Supplementary Fig. S2a), of which only total CO2 and hematocrit are among the 15 laboratory markers with the highest impact on model prediction (Fig. 5b), and (2) a control case with 18 OoR markers (Supplementary Fig. S2b), both correctly classified by the model with a high (\(>0.8\)) and low (\(\sim 0\)) risk, respectively. These example cases illustrate the importance of considering the measurements of a range of markers together to achieve optimal risk assessment.

In this work, we focused on CBC and CMP blood tests as the availability of complete lipid panel blood tests was limited in our cohorts. Although lipid panel use is less common in Europe, it is more frequently ordered in the United States. In fact, we expect its integration into our risk models to further improve cancer risk assessment by providing complementary information on physiological or pathological status. Furthermore, we did not have access to information about the race or ethnicity of all patients and, hence, were unable to characterize any potential bias arising from these factors.

To avoid potential harm resulting from unnecessary follow-up procedures, differential life expectancy, e.g., as a result of competing comorbidities, needs to be taken into account41. This is of particular importance, as chronic diseases can increase the risk of complications from biopsies and cancer treatment. Although we have focused on prioritizing patients for screening eligibility, allowing clinicians to discuss early follow-up screening with patients predicted to be at increased risk, we plan to investigate different risk grading and threshold approaches that can be applied depending on the presence or absence of certain comorbidities.

Similarly, it would be of great value to explore the contributions of laboratory markers with respect to cancer stage. Unfortunately, we did not have access to a cohort-linked cancer registry and, therefore, lack information on cancer stage. In the future, we anticipate that such an analysis will open up the possibility of developing cancer risk prediction models that also provide an assessment of cancer progression.

In summary, this study demonstrates the significant potential of using routine blood marker panels to prioritize patients for cancer screening. We note, however, that the performance achieved using the proposed methods is not sufficient for direct use in cancer screening or diagnosis, and additional tests would be required to ascertain a clinical diagnosis.