Non-foundation models outperform foundation models at “re-identification”

Most re-identification experiments performed by the authors are in retinal imaging and use a foundation model called RETFound [2]. RETFound is used to obtain “features”, i.e. a numeric vector that describes the content of a retinal image in a meaningful yet abstract way. One of the datasets they consider, the GRAPE dataset [3], is openly available. We replicate their experiments on this dataset, but replace RETFound with a very small convolutional neural network (CNN), a 10-layer ResNet [4], that has never been trained on any retinal images. Our experiment involves no training either; the model is kept frozen. Furthermore, we conduct the same experiment using raw pixel values instead of features from a neural network. Our code is available here: https://gist.github.com/justinengelmann/63d5ad32cad4b57bb31627ded9111093.
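
As a minimal sketch of this setup (assuming the timm library; our actual code, linked above, may differ in details such as preprocessing):

```python
import timm
import torch
import torch.nn.functional as F

# Frozen, ImageNet-pretrained 10-layer ResNet that has never seen retinal
# images; num_classes=0 makes timm return pooled feature vectors.
model = timm.create_model("resnet10t", pretrained=True, num_classes=0)
model.eval()

@torch.no_grad()
def cnn_features(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, H, W), normalised as for ImageNet
    return model(images)  # (N, 512) feature vectors; no training involved

def pixel_features(images: torch.Tensor, size: int = 16) -> torch.Tensor:
    # The naive baseline: downsample each image (cf. Fig. 1c) and flatten
    # the raw pixel values into a vector.
    small = F.interpolate(images, size=(size, size), mode="bilinear")
    return small.flatten(start_dim=1)
```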

The results are shown in Table 1. ResNet10t achieves substantially higher performance across the board. In other words, a CNN that was never trained on any retinal images allows for better re-identification than RETFound. This calls into question the titular thesis of Nebbia et al. that foundation models enable such re-identification. Furthermore, the very naïve approach of comparing pixel values directly achieves performance that is worse than, yet comparable to, RETFound’s, and substantially better than random guessing. This suggests that the images of a given patient in the GRAPE dataset might simply be very similar to each other, making re-identification trivial in this scenario.
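
The matching step itself can be sketched as a nearest-neighbour search in feature space (variable names are ours, and this is illustrative rather than our exact evaluation code; query and pool images are assumed to come from different visits, so the two sets do not overlap):

```python
import torch
import torch.nn.functional as F

def top1_same_patient_rate(query_feats, query_ids, pool_feats, pool_ids):
    # query_feats/pool_feats: (N, D)/(M, D) feature matrices, one row per image;
    # query_ids/pool_ids: lists of patient IDs. Query and pool are disjoint
    # image sets (e.g. baseline vs. follow-up visits).
    q = F.normalize(query_feats, dim=1)
    p = F.normalize(pool_feats, dim=1)
    sims = q @ p.T                          # cosine similarity matrix
    nearest = sims.argmax(dim=1).tolist()   # most similar pool image per query
    hits = [query_ids[i] == pool_ids[j] for i, j in enumerate(nearest)]
    return sum(hits) / len(hits)            # chance level is roughly 1 / #patients
```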

Table 1 Image matching performance on the openly available GRAPE dataset [3]

Figure 1 shows some examples. The successful re-identification examples in Fig. 1a are nearly identical to each other. So it appears that, on the GRAPE dataset, “re-identification” is not particularly complex. This is not surprising: the follow-up interval in GRAPE is relatively short (mean 18 months, min 5, max 53), the images were quality-controlled, and they share the same fixation. In Fig. 1b, we can see that erroneous matches are likewise visually similar, supporting the view that it is this superficial similarity that RETFound matches on.

Fig. 1 Example retinal image pairs from GRAPE, where finding the most similar image using RETFound features retrieved an image from the same patient (a) and a different patient (b), respectively; (c) illustrates a fundus image at normal resolution (top) and at a resolution of 16 × 16 pixels (bottom). The correct matches have very consistent fixation and little apparent change between visits.

Thus, we think that the lack of non-foundation-model baselines in Nebbia et al. might have led to an interpretation of the results that does not hold up to closer examination, namely that foundation models are what enable the re-identification results they obtained.

Predictability of demographic features and “re-identification”

Nebbia et al.’s explanation for why foundation models enable re-identification is that these models learn representations that encode demographic information. So, if non-foundation models can outperform foundation models, as we have shown, what do we then make of the finding that demographic features were marginally better predicted in patients who could be re-identified?

Nebbia et al. argue that this indicates that their “hypothesis that demographic features prediction and re-identification are related was correct”. We would be more cautious. For instance, image quality could explain this difference just as well: some fundus images are blurry or under-illuminated, making most or all of the retina hard to see. From such images, demographic features will be predicted less well, as they contain less information. Poor-quality images would also lead to poorer re-identification, as, e.g., all blurry images look similar to each other, yet a given patient likely has a mix of good- and bad-quality images that look very different from each other.
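
As an illustrative sketch of how one might probe this confounder (this is our suggestion, not an analysis from Nebbia et al. or our gist), a crude sharpness proxy such as the variance of the Laplacian could be correlated with both match success and demographic-prediction error:

```python
import cv2

def sharpness(path: str) -> float:
    # Variance of the Laplacian: a common, crude blur/sharpness proxy.
    # Low values suggest a blurry fundus image.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return float(cv2.Laplacian(img, cv2.CV_64F).var())
```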

Finding similar images is not the same as “re-identification”

A secondary point we want to raise is that the experiments by Nebbia et al., strictly speaking, do not (re-)identify anyone. Instead, they show that given an image of an individual, it is possible, some of the time, to retrieve an image of the same individual from a larger pool of images. However, the identity of said individual remains unknown if it was not already known.

This distinction is not mere sophistry. In our opinion, the phrase “re-identification from medical imaging” is likely to be misunderstood, especially by a lay audience such as patients concerned about their privacy, as it suggests that one could identify a particular person of interest (e.g., Pearse Keane) given an openly available dataset of medical images (e.g., retinal images). But that is not the case. One would need to already possess a medical image of the same type from that very person. Only then can one attempt to find similar images in the dataset.

Consider what the “attacker” gains when they try to re-identify someone in a dataset of retinal images. To attempt their attack, they must already possess an image of the targeted individual. If the attack succeeds, they might retrieve additional images of that individual, as well as associated metadata, from the dataset. For retinal imaging datasets (e.g., GRAPE), such metadata typically comprises age, sex, and information about ocular disease.

However, as Nebbia et al. also point out, age and sex are reasonably well predicted from fundus images. Of course, fundus images further contain rich information about ocular health. Thus, prior to executing the attack, the attacker could already infer the age, sex, and ocular health of the targeted individual from the fundus image they must possess to attempt re-identification in the first place. It is then unclear what additional harm results from the re-identification attack.

In security research, the focus is on “threat models”: under what circumstances an attacker with specific resources can bring about specific harm. In the scenario that Nebbia et al. consider, the harm in question is that the attacker learns protected information about a target individual. But in order to do so, the attacker needs resources that would allow them to infer said information even without any re-identification.

Conclusion

In summary, the view presented by Nebbia et al. that foundation models enable re-identification from medical imaging appears incompatible with the experimental results presented here; on the contrary, non-foundation models might allow better image matching. The responsible use of patient data in medical research is paramount, and studying potential risks is therefore important. However, results need to be carefully interpreted and properly contextualised, especially in an area of particular interest to a lay audience. The re-identification scenario Nebbia et al. consider would require an attacker to already be in a position to infer much of the information that could be learned by re-identifying someone. We recommend that future work consider simple baseline approaches (e.g., matching on raw pixels) to establish whether image-matching results are non-trivial, and carefully spell out the envisioned threat model, including what resources an attacker needs and what potential harm could result.