Fig. 1: Overview of model development.

a Feature Generation: patches are sampled at random from the tumor-containing regions of a given case in the training set, and each patch is passed through a CNN to obtain an embedding vector. A k-means model is then fit on these embeddings. Note: for demonstration purposes, only 10 cluster centroids are shown in this example; the actual model uses 200 clusters fit on the patch embeddings. b Feature Selection: every patch embedding from a case is passed through the trained k-means model and assigned a cluster id, and the fraction of the case's patches assigned to each cluster is computed (the case-level cluster quantitation vector). This is repeated for all cases in the training set. The top 5 clusters are chosen by greedy stepwise forward selection on the training set to maximize the AUROC of a logistic regression model that combines them with baseline clinical features. c Feature Evaluation: a cluster quantitation vector is computed for the case to be evaluated, and the quantitations for the pre-selected top cluster ids are concatenated with the baseline clinical features. This case-level combined feature vector is then fed through a logistic regression model to obtain a prediction.
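
The quantitation and selection steps in panels b and c can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the authors' implementation: all function names, the toy data, and the plain gradient-descent logistic regression are hypothetical and chosen only to keep the example dependency-free. The CNN embedding step and the k-means fit are omitted, so the centroids are taken as given, and cluster assignment is done by nearest centroid, which is how a fitted k-means model assigns new points.

```python
import math
from collections import Counter


def assign_cluster(embedding, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, cen)) for cen in centroids]
    return dists.index(min(dists))


def cluster_fractions(patch_embeddings, centroids):
    """Case-level cluster quantitation vector: fraction of patches per cluster."""
    counts = Counter(assign_cluster(e, centroids) for e in patch_embeddings)
    return [counts[k] / len(patch_embeddings) for k in range(len(centroids))]


def auroc(scores, labels):
    """Probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain gradient descent on the logistic loss; the last weight is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[-1] + sum(a * b for a, b in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict(w, X):
    return [sigmoid(w[-1] + sum(a * b for a, b in zip(w, xi))) for xi in X]


def training_auroc(clinical, fractions, labels, cluster_ids):
    """Fit on clinical features plus the chosen cluster fractions; score the same cases."""
    X = [c + [f[k] for k in cluster_ids] for c, f in zip(clinical, fractions)]
    return auroc(predict(fit_logistic(X, labels), X), labels)


def greedy_forward_selection(clinical, fractions, labels, n_select=5):
    """At each step, add the cluster whose inclusion gives the highest training AUROC."""
    selected = []
    for _ in range(n_select):
        remaining = [k for k in range(len(fractions[0])) if k not in selected]
        best = max(
            remaining,
            key=lambda k: training_auroc(clinical, fractions, labels, selected + [k]),
        )
        selected.append(best)
    return selected


# Toy example (made-up numbers): 3 centroids in a 2-D embedding space, one case.
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
patches = [[0.1, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 0.8]]
print(cluster_fractions(patches, centroids))

# Toy training set (made up): 6 cases, 1 clinical feature, 4 cluster fractions each.
clinical = [[0.2], [0.4], [0.6], [0.3], [0.5], [0.7]]
labels = [0, 0, 0, 1, 1, 1]
fractions = [
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.3, 0.5, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.3, 0.1, 0.4, 0.2],
    [0.1, 0.3, 0.6, 0.0],
]
selected = greedy_forward_selection(clinical, fractions, labels, n_select=2)
print(selected)
```

For evaluation (panel c), the same `cluster_fractions` call would be applied to the new case, and only the entries at the pre-selected cluster ids would be concatenated with that case's clinical features before calling `predict`. In practice a library implementation of k-means and logistic regression would likely replace the hand-written pieces here.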