Fig. 1: Adaptation of the self-supervised Histological Phenotype Learning pipeline to study cutaneous squamous cell cancer.

a The slides were first tiled into smaller images of 224 × 224 pixels at 0.5 µm/pixel (equivalent to a magnification of 20×). b A subset of those tiles was used to train the self-supervised Barlow Twins architecture. c Once the network was trained, all the tiles from the three cohorts were passed through it to extract their tile vector representations z, a 128-dimensional vector encoding each image. d Those vector representations were then over-clustered using the Leiden approach to obtain homogeneous clusters (called Histomorphological Phenotype Clusters, HPCs) and to visually separate artifacts from tissue representations. In this UMAP of the tile vector representations z, each dot represents a tile and each color a different HPC. e Tiles belonging to HPCs identified as highly enriched in artifacts were removed from the study. f The cleaned dataset was then subjected to a new round of Leiden clustering for more detailed analysis. This UMAP of the cleaned tile vector representations z shows 26 HPCs corresponding to 26 groups of self-identified phenotypes, with representative tiles for the top 5 clusters corresponding to the example slides in panel (c). g The resulting HPCs can then be used to generate heatmaps showing simplified slide representations, and analyzed to identify potential correlations between the phenotypes identified by the self-supervised approach and patients’ outcomes. Here, the heatmap corresponding to the example slide section in panel (a) is shown, with the top 5 clusters numbered as in panel (f). All tiles are shown after Reinhard’s color normalization47.
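As an illustration of the tiling step in panel (a), the following minimal sketch computes a grid of tile positions for one slide. It is not the pipeline's actual code: the function name `tile_grid`, its parameters, and the choice of non-overlapping tiles with partial edge tiles dropped are assumptions made for this example, and background filtering is omitted. At 0.5 µm/pixel, a 224 × 224 pixel tile covers 112 µm × 112 µm of tissue.

```python
def tile_grid(width_px, height_px, native_mpp, target_mpp=0.5, tile_px=224):
    """Return (x, y, size) boxes in native-pixel coordinates.

    Hypothetical helper: each box, once resized to tile_px x tile_px,
    yields a tile at target_mpp micrometers per pixel. Tiles do not
    overlap (an assumption) and partial tiles at the edges are dropped.
    """
    # Side length, in native pixels, of the region covered by one tile.
    native_size = round(tile_px * target_mpp / native_mpp)
    boxes = []
    for y in range(0, height_px - native_size + 1, native_size):
        for x in range(0, width_px - native_size + 1, native_size):
            boxes.append((x, y, native_size))
    return boxes


# Example: a small region scanned at 0.25 um/pixel (roughly 40x).
boxes = tile_grid(width_px=1000, height_px=500, native_mpp=0.25)
```

In this example each box spans 448 native pixels per side, so it would be downsampled by a factor of two to reach 224 × 224 pixels at 0.5 µm/pixel.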