Introduction

A core function of open science is data sharing. In an ideal setting, scientists should be able to easily upload data to a repository which satisfies the FAIR principles, i.e. makes the data findable, accessible, interoperable, and reusable1. Data sharing is challenging even when the data is not personal data2, i.e. does not contain information about individuals. Sharing of personal data is even more challenging3 due to stricter and more diverse legal frameworks.

In a data sharing pipeline where personal data is involved, there is often an extra step of data anonymization that historically has required individual attention to each data release, for instance in order to find a good balance between privacy and utility4. At the same time, there are constant advances in data anonymization technology. In recent years for instance, a new generation of synthetic data tools have become available5, both as open source and commercial products.

This study assesses the applicability of a small but representative set of data anonymization tools to an open science personal data sharing scenario. For this, we selected a base study, which we tried to replicate using anonymized data. The base study is a research study authored by GJ and BL6 with data controlled by them. The study, titled “Associations of mode and distance of commuting to school with cardiorespiratory fitness in Slovenian school children: a nationwide cross-sectional study”, is challenging for data anonymization tools due to its small sample size (713 records), the amount and depth of the statistical analyses, and overall few significant results and small effect sizes. These conditions are challenging because the distortion necessarily introduced by anonymization must be small so as not to overwhelm the significance of the data.

The evaluation in this study focuses on three questions:

  1. Would the scientific conclusions of the study have changed if anonymized data were used instead of the original data?

  2. How difficult would it have been to use the anonymized data for the scientific study?

  3. How difficult would it have been for non-experts to generate the anonymized data?

The three anonymization tools evaluated are ARX7 (developed by Prasser), SynDiffix8 (developed by Francis), and SDV9. All tools are open source software and implement different methods to create anonymized datasets. ARX provides a variety of anonymization mechanisms, with a focus on K-anonymity and its variants (L-diversity etc.). For this study, K-anonymity was used (K = 2). SDV also offers a variety of mechanisms, but all from the family of AI-based data synthesis techniques that have become popular in recent years5. For this study, we used CTGAN. SynDiffix is one of a class of techniques that uses regression trees to automate aggregation, adding suppression and noise to strengthen anonymity. These three methods are representative of three popular approaches to data anonymization: K-anonymity, AI synthesis, and regression trees. More information about these tools, as well as the procedure used to select them, can be found in the Methods section of this paper.

To our knowledge, this is the first paper to examine anonymized data quality in the context of a specific scientific study, versus prior work which examines data quality in general terms. Another distinguishing feature of this paper is that it analyses the relative ease of generating anonymized data, which is important in an open science setting. While this paper is narrow in that it only analyzes a single scientific study, we believe that it serves as a blueprint for how to evaluate anonymized data quality.

Prior work

The quality and utility of anonymized data have long been studied. It is common to distinguish between general-purpose and special-purpose (or workload-aware) approaches to quantifying anonymized data quality. General-purpose approaches include information loss metrics, like granularity reduction10 or record fidelity11, as well as resemblance metrics, which can be univariate, e.g. comparing distributions of variables in unprotected vs. anonymized data12, or multivariate, e.g. comparing differences in correlation matrices13. Another approach is to study how well classifiers can distinguish unprotected from anonymized records14. The three anonymization methods presented here have previously been explored with respect to the general-purpose utility of the protected data15, whereas workload-aware utility analysis has not yet been explored for all of them. Other prior work proposes that even the anonymization mechanisms themselves should be workload-aware, taking into account what analysis will be done after anonymization16,17. The quality evaluations in this work are therefore based on the workload analysis, which is a novelty of this paper.

All of the quality measures in this prior work, however, are generic in that they do not take into account a specific scientific goal. These generic measures are useful for comparing the quality of different anonymization techniques generally, but they fall short of predicting whether any given scientific conclusion will be correct or not. By contrast, our paper measures quality based on specific scientific conclusions, which is another novelty of this study. Here we directly compare the ability of the anonymization methods to preserve the data characteristics that lead to the scientific conclusions.

Although data quality and utility have been thoroughly explored, research on the privacy risks associated with non-traditional anonymization methods and synthetic data generation is rather sparse18. Few frameworks exist that can estimate residual privacy risks on anonymized and synthetic data alike. One such framework is the Anonymeter framework by Giomi et al.19, originally proposed for large synthetic datasets to evaluate the risks of linkage, attribute inference, and record singling-out.

Results

The base study of Jurak et al. investigated whether active commuting has the potential to improve children’s health6. The cardiorespiratory fitness (CRF) of 713 Slovenian school children aged 12 to 15 years was determined with a 20-m shuttle run test to estimate their maximal oxygen uptake (VO2max). Moreover, information was collected on their distance from home to school and whether the commute was done by walking, wheels (e.g., bicycle or skateboard), public transport, or car, in both directions. The study found that commuting distance minimally affected CRF, except in the Car group, where children living closer to school had significantly lower CRF than those farther away. The study recommends targeting car-driven children within walking or cycling distance of school with interventions promoting active transport. The original paper presented its results in three tables and one figure, which we refer to as base-tables and base-figure. This study replicates these base-tables and base-figure, but in such a way that the original data and the data from the three anonymized datasets are combined for easy comparison.

The following sections discuss the similarities and differences between the anonymized and original data, and comment on the extent to which the conclusions of the original paper6 still hold given the anonymized data.

Dataset comparison

Descriptive univariate statistics such as means and standard deviations are provided in Table 1. The ARX and SynDiffix data show overall similar distributions and no statistically significant differences from the original data. The data generated by SDV, in contrast, showed differences in four out of five variables, with p-values below 0.001; the p-value for the Age variable was 0.006, just above the chosen alpha level of 0.005. Looking at the difference in mean values, we see huge deviations in the SDV dataset for both commute distances, potentially indicating general flaws in the synthetic data generation used here.
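Such a univariate comparison can be sketched with a stdlib-only Welch's t statistic (an illustration only; the paper does not state which specific test produced its p-values, and the sample values below are hypothetical):

```python
import math
from statistics import mean, stdev

def welch_t(orig, anon):
    """Welch's t statistic for the difference in means between the
    original and anonymized values of one continuous variable
    (no equal-variance assumption)."""
    v1, v2 = stdev(orig) ** 2, stdev(anon) ** 2
    n1, n2 = len(orig), len(anon)
    return (mean(orig) - mean(anon)) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical Age samples: similar distributions give a small |t|.
age_orig = [12.1, 13.4, 14.2, 12.8, 15.0, 13.3]
age_anon = [12.3, 13.1, 14.5, 12.6, 14.8, 13.6]
t = welch_t(age_orig, age_anon)
```

The resulting t value would then be compared against the chosen alpha level, as in Table 1.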

Table 1 Descriptive statistics of continuous variables expressed as mean and standard deviation.

Reproduction of commute modes and distances

The original and anonymized data for base-tables 1 and 26 explored the types of commuting in school children and the commute distances traveled. These are given in this paper’s Tables 2 and 3 respectively20.

Table 2 Base-table 16 from the paper showing the counts and percentages for the original data and the three anonymization methods.
Table 3 Base-table 26 from the original paper showing the counts and distances in meters (median and IQR) for the original data and the three anonymization methods6.

Among the three anonymization methods, SynDiffix most closely reproduced the original counts and percentages for the cross-tabulation of commuting modes (Table 2), with only one deviating count in the cross table. It is followed by ARX, which showed deviations of up to 18, while SDV produced large differences of up to 179 in the commute mode counts of the cross table. Nevertheless, the basic description of these results in the original paper6, i.e. that active commuting is more frequent in the direction from school to home than in the other direction, still holds, though this may be a coincidence in the case of SDV, as it substantially over- or underestimates the frequencies for both modes of active commuting (walking and wheels). Note also that the anonymized table of ARX contains 1.4% fewer rows (703 versus 713) due to suppression of rare records.

Similar results were found for commuting distance (Table 3). Here too SDV performs badly, hugely over- or underestimating the distance in most of the table cells, both in the central tendency (median) and the variability (IQR) of the data. SynDiffix and ARX perform much better, but still with large errors in some of the table cells, especially those with low frequencies: e.g., SynDiffix substantially underestimates the median distance of car commuting from school to home (3615 → 2584), and ARX overestimates the distance of wheels commuting in the same direction (1444 → 2235). The verbal description of these results in the original paper6, i.e. that the active commuting groups (Walk, Wheels) typically live close to school, still holds for SynDiffix and ARX, but not for SDV.

The absolute errors of the cross-table counts from Table 2 and the distance values from Table 3 are visualized in the boxplots of Fig. 1.

Fig. 1
figure 1

Absolute error of the three anonymization methods for the counts and distances in Tables 2 and 3. Each box plot data point is taken from a single cell in Tables 2 and 3.

Reproduction of statistical significance of regression coefficients

The original and anonymized data for base-table 36 are given in this paper’s Tables 4 and 5 (the data is spread over two tables for formatting purposes)20. The primary purpose of base-table 36 is to indicate statistical significance. As such, those entries where the original data is significant and the anonymized data is not, or vice versa, are highlighted in red. In total, there are 3 such mismatches for ARX, 8 for SDV, and 5 for SynDiffix.

Table 4 Part 1 (of 2) of the base-table 36 showing the parameters (regression coefficients) of the linear model for prediction of VO2max by group and distance.
Table 5 Part 2 (of 2) of the original paper’s Table 3 showing the parameters (regression coefficients) of the linear model for prediction of VO2max by group and distance6.

Several differences were found between the statistics of the linear prediction models based on the original data and those based on the data produced by the three anonymization methods (Tables 4 and 5). In the original models predicting VO2max, the constant (intercept), the predictors Gender and MVPA, and the derived Car × Gender interaction term were statistically significant in both directions of commuting. This also holds for all parameters in the SynDiffix and ARX models, except for the Car × Gender parameter in the SynDiffix model, which is not significant in one of the directions of commuting. The Gender parameter in the SDV model was found to be non-significant, despite being highly significant (p ≤ .001) in the three other models.

Note that only the SDV method provides an estimate for the Wheels × Males interaction parameter in the school-to-home direction models. ARX and SynDiffix suppressed this data as part of anonymization because there were so few data points. The normalized error for the coefficients in Tables 4 and 5 is illustrated in Fig. 2.

Fig. 2
figure 2

Normalized error for coefficients in Tables 4 and 5, and Fit (VO2max predicted at mode for median distance) for Fig. 3. The normalized error E is computed as \(E=\frac{| O-A| }{\max (| O| ,| A| )}\), where O is the original value and A is the anonymized value. The middle plot displays normalized error only for the data points in Tables 4 and 5 where the original coefficient is significant.
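The normalized error metric is straightforward to compute; a minimal sketch:

```python
def normalized_error(original, anonymized):
    """E = |O - A| / max(|O|, |A|). E is 0 for exact agreement and
    grows toward 1 as the magnitudes diverge (it can exceed 1 when
    O and A have opposite signs)."""
    if original == anonymized:
        return 0.0
    return abs(original - anonymized) / max(abs(original), abs(anonymized))

# Example from Table 3: SynDiffix's car-commute median, 3615 -> 2584.
e = normalized_error(3615, 2584)  # about 0.285
```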

Reproduction of the stratification of children’s cardiorespiratory fitness

The plot for base-figure 16 is given in this paper’s Fig. 3, which replicates the plot from the original paper and gives the corresponding plots for the three anonymized datasets6. In male children, the point predictions and prediction intervals in the original paper6 are quite closely matched by ARX and SynDiffix, but not by SDV. In female children, the original statistics are most closely matched by ARX, followed by SynDiffix and then by SDV. SDV gives similar results for male and female children, although they are clearly separated in the original data. SDV also performs worst in estimating prediction interval widths, which it renders very similar for both sexes, although they are quite different in the original plot. ARX and SynDiffix produce interval widths more similar to the original, but these still differ in some cases, e.g. being too narrow for walk commuting in ARX, and too narrow for females’ wheels commuting from school to home in both ARX and SynDiffix.

Fig. 3
figure 3

Comparison of the VO2max data plots corresponding to the base-figure from the original study6.

Comparison of derived scientific insights

Table 6 summarizes the ability of each anonymization method to produce the same analytic conclusions as those of the base paper. It presents the set of statements made in the base paper, and evaluates whether the statement would be supported (O), negated (X), or neither supported nor negated (?).

Table 6 This table summarizes the ability of each anonymization method to result in the same analytic conclusion as that of the original data.

As mentioned, besides the specific scientific statements, the paper suggests a policy intervention: that active transport should be promoted for children who use passive transport, but live within walking or biking distance to school. We regard this as the most important conclusion of the paper, and it is supported by ARX and SynDiffix, but would have been negated by SDV.

The base paper starts with a number of simple statistical observations about the data (descriptive analytic results). Both ARX and SynDiffix support all six of these observations, while SDV negates four of them.

The main analytic thrust of the paper is the linear regression analysis of CRF. Here, the performance of ARX and SynDiffix is weaker. While none of the statements were negated, about half of them were not supported. This would not have prevented the main policy conclusion from being reached, but the overall quality of the study would have suffered from using either anonymization technique.

Usability of the anonymization tools and data

To enable data sharing in an open science environment, processes are needed that allow scientists to conveniently create and share safe data. As scientists already tend to be undermotivated to share data2, it is hence beneficial if the anonymization step places as little additional burden on scientists as possible. Ideally, there should be little or no extra work required by data collectors or operators of repository systems to anonymize data and prepare them for sharing.

In terms of protected data generation, SynDiffix is the easiest to use of all the methods studied here due to its simplicity and minimal configuration requirements. Using ARX, in contrast, requires expertise in the various anonymization methods that it supports, as well as substantial configuration effort to generate a protected data table of sufficient quality. SDV is easier to use than ARX since it provides default configurations and code templates, but it nevertheless requires a decision about which SDV tool to use, as well as configuration of datatypes. However, in terms of data analysis, SynDiffix places an extra burden on the analyst, who needs to understand that different tables must be generated for different analytic tasks. Neither ARX nor SDV carries this additional burden: the generated tables can be used as is.

In general, any anonymization technique requires that the analyst understand the extent to which the data has been distorted and compensate for that distortion. This always creates an additional burden. Since the anonymization process in ARX is typically explainable and deterministic, and the resulting distortions are visible in univariate and bivariate statistics, they might be easier to understand. In contrast, typical synthetic data generation involves randomness, and it is often not clear how the internal structure of the data might have changed, even if uni- and bivariate distributions appear to be similar.

Risk Evaluation

All three of the data anonymization methods studied in this paper have strong privacy properties. K-anonymity is a well-established technique that has been in use for more than two decades21. AI-based synthetic data tools like CTGAN are offered commercially. Like K-anonymity, SynDiffix uses the mechanisms of aggregation and suppression, and additionally adds noise.

Here we demonstrate the strong privacy properties of the three methods with respect to the commuting dataset using the attribute inference privacy metric from the Anonymeter tool19 (https://github.com/statice/anonymeter). In an attribute inference attack, an attacker knows several attributes of a person known to be in a dataset and tries to predict an unknown attribute from the released anonymized data. Anonymeter makes this prediction by finding the microdata record in the anonymized data that most closely matches the known attributes, and predicting the unknown attribute from that record.
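The nearest-record prediction step can be illustrated with a small sketch. Note the simplifying assumptions: purely numeric attributes and a squared-distance match (Anonymeter itself handles mixed data types), and the records and column names below are hypothetical:

```python
def infer_attribute(known, anon_records, secret_col):
    """Predict a target's unknown attribute from the anonymized record
    that most closely matches the attacker's known attributes."""
    def dist(rec):
        # Squared Euclidean distance over the known (numeric) attributes.
        return sum((rec[col] - val) ** 2 for col, val in known.items())
    return min(anon_records, key=dist)[secret_col]

# Hypothetical anonymized rows with a known 'dist_m' and secret 'vo2max'.
anon = [
    {"dist_m": 500, "vo2max": 48.2},
    {"dist_m": 2400, "vo2max": 44.1},
    {"dist_m": 5100, "vo2max": 46.7},
]
guess = infer_attribute({"dist_m": 2250}, anon, "vo2max")  # matches the 2400 m row
```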

Anonymeter measures the improvement of the predictions over a statistical baseline. For example, the statistical baseline precision for predicting sex would be 50%, assuming an even distribution of male and female. Anonymeter establishes the statistical baseline by making inference predictions on records that have been removed from the dataset prior to anonymization. Any prediction success on these records is only statistical in nature, and does not represent a loss of individual privacy.

Anonymeter measures Risk as R = (Patk − Pbase)/(1 − Pbase), where Patk is the prediction precision for records in the dataset, and Pbase is the precision for records not in the dataset. For example, if the baseline precision for predicting sex is Pbase = 0.5, and the attack precision is Patk = 0.75, then the improvement, or Risk, is R = 0.5. Any Risk below 0.5 can be regarded as strongly anonymous. When Pbase is low, then R = 0.5 leaves substantial uncertainty on the part of the attacker, or equivalently, substantial deniability on the part of the victim. When Pbase is high, the attacker in any event has little uncertainty, and a Risk of 0.5 represents only a modest improvement. Furthermore, a high baseline implies an attribute that is common in the dataset population and therefore unlikely to be a sensitive attribute.
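In code, the Risk computation is a one-line function (an illustrative sketch, not Anonymeter's implementation):

```python
def anonymeter_risk(p_atk, p_base):
    """R = (P_atk - P_base) / (1 - P_base): the attacker's improvement
    over the statistical baseline, rescaled by the headroom above it."""
    return (p_atk - p_base) / (1.0 - p_base)

# The worked example from the text: baseline 0.5, attack precision 0.75.
r = anonymeter_risk(0.75, 0.5)  # R = 0.5
```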

For our measurements, we removed 100 records from the dataset to establish the baseline, and re-anonymized the remaining 613 records. We assumed that the attacker knows the victim’s gender, age, and commuting distance and mode in both directions. These are all potentially publicly-known attributes. The unknown attributes being predicted are VO2max and MVPA. In evaluating SynDiffix, the two tables containing only the known attributes and one unknown attribute each were used. Table 7 presents the Risk scores and the 95% confidence intervals. In all cases, even the upper confidence bound is well below the strong-privacy Risk threshold of 0.5.

Table 7 Privacy risk scores and 95% confidence intervals.

Discussion

In this paper, we compared the suitability of three tools for data anonymization (ARX, SDV and SynDiffix) for use in a public health study in Slovenia. Our study distinguishes itself from the large body of literature on this topic by studying in detail whether the individual analytical findings as well as the resulting conclusions would hold when using anonymized instead of the original data. Moreover, we explicitly considered the burden that would be imposed on researchers by applying one of the studied tools to follow open-science principles in our analysis.

Of the three tools, SDV’s data quality was poor and led to many incorrect scientific conclusions. The data quality of ARX and SynDiffix was good enough to provide definite value. The data generated by both tools supported the main conclusion of the base study (that children within walking or biking commuting distance should not commute with a car), and no incorrect conclusions were drawn. In addition, all of the descriptive analytics (counting and simple statistics) were supported by ARX and SynDiffix.

Regarding researcher burden for data generation, all three tools require some manual configuration. ARX typically requires configuration of data hierarchies per column, among other configuration choices, which requires expertise to find suitable settings. SDV requires selection of an algorithm, definition of data types, and potentially other configuration choices. SynDiffix only requires configuration if the dataset is a time-series, in which case the user must specify the column that identifies individuals in the dataset. For the fitness dataset, ARX required around 200 lines of code (Java), SDV required 5 lines, and SynDiffix required 2 lines (both in Python). Note that ARX is available as a GUI application (Windows, MacOS, or Linux), whereas SDV and SynDiffix require Python.

Regarding researcher burden for data analysis, the output of ARX and SDV can be used as is. SynDiffix generates multiple anonymized datasets, and the analyst must select the one with only the columns necessary for each given analytic task. This can only be done using a SynDiffix blob reader package that runs in Python. For the fitness study, a 22-line Python routine was needed to generate the two datasets used by an R script to produce Fig. 3. In the analysis code, an extra line of code is required each time a dataframe is read.

A major limitation of the current study is that it does not truly replicate an open science scenario, whereby an analyst undertakes a scientific analysis purely from anonymized data. In the current study, the analyst had the advantage of hindsight, was already familiar with the raw data and its analysis, and simply replicated an analysis that had already taken place on the raw data. A more realistic study would be to give the anonymized data to a researcher unfamiliar with the data, have them undertake a scientific study from scratch, and then follow up the study with the raw data to determine not only whether the results are correct, but also whether they would have done the analysis differently given the raw data.

Another limitation of this study is that it is based on only one dataset and analysis. As future work, it would be valuable to undertake additional scientific studies on different datasets, ideally with separate teams working in isolation on the original and anonymized data respectively. It would also be valuable to explore other anonymization methods, or these methods with different parameter settings. In this fashion, we can build up an understanding of whether current anonymization methods can be used in open science settings, and what improvements are needed to reach that goal.

Methods

Choice of base study and dataset

The selection of the base study and corresponding dataset to use for this paper was limited to those in which the authors were involved. Besides the practical matter of having access to the original datasets, the authors of the original studies are in the best position to determine if the anonymized data is fit for purpose. We decided to select a base study where the dataset and the analysis are typical of those used by researchers participating in the SPOZNAJ open science project in Slovenia. Most such datasets are small, consisting of hundreds or a few thousand records. Using a small dataset also challenges the anonymization tools, since the perturbation introduced by anonymization has a proportionally stronger effect on small datasets.

In total, we considered eight studies6,22,23,24,25,26,27,28. Of these, three22,26,27 had too much data, and the analyses in two others23,28 were less interesting than the rest. Of the remaining studies, one6 was selected in part because it involves a sophisticated analysis technique (linear regression), and in part because it has a mix of significant and non-significant regression coefficients. An important and challenging test of anonymized data is how well it preserves significance.

Choice of anonymized data methods

Anonymization mechanisms fall into two broad classes. One class aims to replicate or modify the original data as closely as possible so that statistical statements about the original data remain accurate. Another class aims to behave like the original data, in that it generates data suitable for predictive ML applications, and may even deliberately modify the statistics of the data, for instance by creating additional data, possibly with bias added or removed. This latter class is generally referred to as “synthetic data”, although there is no strict definition of this term29.

There is a long history of open source tools in the replicate class, including sdcMicro30 which implements techniques classically used by statistics agencies (swapping, outlier removal, sampling), ARX31 which implements k-anonymity21 and related techniques among others, Synthpop32 which implements the decision tree approach CART33 among others, and most recently SynDiffix8 which is a multi-table tree-based approach.

Of these, we selected ARX which has demonstrated success particularly in medical domains31,34, and SynDiffix which claims high accuracy and ease of use15. The anonymized data generated by both of these tools is row-level data (also known as microdata) that syntactically is equivalent to the original data. In this specific narrow sense, ARX and SynDiffix can be thought of as synthetic data.

We wished also to select at least one tool from the behave-like class. These techniques have received considerable attention in recent years since the publication of the CTGAN and TVAE techniques35, and a number of open source and commercial tools are available. We tested both the open source tool Synthetic Data Vault (SDV)9 and the commercial product Mostly AI. The two demonstrated similar results for this paper’s dataset, so we selected SDV since open source is better for open science.

Note that the anonymized datasets generated for this paper for ARX and SynDiffix were built by the respective authors of those tools. The SDV dataset was generated by Francis.

The procedures used to generate the three anonymized datasets from ARX, SDV, and SynDiffix, are described in the following sections.

ARX

Overview

The ARX software is intended to anonymize sensitive personal data and supports a wide variety of privacy models and data transformation methods. It can be used either with a graphical user interface as a standalone software, or by using the Java-based ARX library to perform the data anonymization via code34. ARX allows for fine-grained configuration to implement tailored anonymization procedures and offers a wide variety of options to protect the data while providing high performing optimization algorithms to retain the data utility31.

Anonymized data generation

To perform the data anonymization, ARX requires dataset-specific configuration. This includes the privacy models to be used and their thresholds, as well as domain-generalization hierarchies for the variables, which the software can use to aggregate data tailored to the scientific research question or as a distance measure during clustering. Finally, the specific transformations to be performed, e.g., suppression, full-domain or local generalization, aggregation, as well as the algorithm used to perform the optimization process, need to be specified. Parameters often need to be fine-tuned over several iterations.

For the given dataset, the process was quite straightforward. We chose k-Anonymity as a strict privacy model that protects all records and applies to all variables. We chose the threshold k=2, because it is the weakest possible parameterization, and weaker parameters tend to provide higher utility for small datasets. ARX was configured to perform a clustering process where domain-generalization hierarchies are used to determine distances between values. For categorical variables, a simple domain-generalization hierarchy was designed, grouping “walk” and “wheels” together, because they both indicate movement through physical activity. For continuous variables, the associated domain-generalization hierarchies represented increasingly large intervals. The hierarchy for the “gender” variable simply included a common root node “*” for both genders. In each cluster, categorical values were replaced by the mode of all values in the cluster (drawing from the distribution of the input data if there was no mode), while continuous variables were replaced with the arithmetic mean of all values in the cluster.
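The within-cluster replacement rule can be sketched as follows (a stdlib-only illustration of the rule described above, not ARX's actual code; the column names are hypothetical):

```python
import random
from statistics import mean, multimode

def flatten_cluster(records, categorical, continuous):
    """Replace all values within one k-anonymous cluster: categorical
    columns get the cluster mode (ties broken by a random draw, echoing
    ARX's draw from the input distribution), continuous columns the mean."""
    rep = {}
    for col in categorical:
        modes = multimode(r[col] for r in records)
        rep[col] = modes[0] if len(modes) == 1 else random.choice(modes)
    for col in continuous:
        rep[col] = mean(r[col] for r in records)
    return [dict(rep) for _ in records]

cluster = [
    {"mode": "walk", "dist_m": 400},
    {"mode": "walk", "dist_m": 900},
    {"mode": "wheels", "dist_m": 500},
]
flat = flatten_cluster(cluster, ["mode"], ["dist_m"])  # every row: walk, 600
```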

We applied the transformations using ARX’s local optimization strategy31 using the Java library provided by ARX version 3.9.2. The core Java code (i.e. excluding I/O, imports, etc.) for executing the anonymization is around 200 lines. Executing the process took about 2.3 minutes on a laptop with an i5-1135G7 processor running at 2.4GHz and sufficient memory (8GB).

Anonymized data usage

The data anonymization performed by ARX retained the data structure of the original data, therefore no post-processing was needed before use.

Synthetic Data Vault (SDV)

Overview

The Synthetic Data Vault (SDV) is an open source project implementing several synthetic data tools9. SDV offers different tools depending on whether the original table is a single table, multiple tables (relational), or time-series. The single-table case offers several tools, such as Gaussian-Copula, CTGAN, and TVAE. The latter two are promoted by SDV as being suitable for data with a mix of categorical and continuous columns, and the data quality of CTGAN and TVAE is similar35.

We used SDV’s CTGAN tool (Conditional Tabular Generative Adversarial Network) to generate an anonymized dataset. As the name implies, CTGAN uses a Generative Adversarial Network (GAN) approach35. It runs two neural networks, a generator and a discriminator. The generator tries to create anonymized datasets that the discriminator cannot distinguish from the original data, while avoiding overfitting.

There are a number of parameters that can be used to fine-tune CTGAN. Most importantly, the metadata description must be correct, especially the labeling of continuous and categorical columns. There are a number of other parameters related to the GAN itself (e.g., enforced minimums and maximums, enforced rounding, number of epochs) that can improve data quality somewhat.

Anonymized data generation

SDV has an option to auto-generate the metadata. We used this option and checked to ensure that the metadata was correct. Since all of the categorical columns in the original dataset are strings, the auto-generated metadata was correct. Since simplicity of operation is an important requirement in an open science environment, we chose to use the default parameter settings. Note that the commercial synthetic data product Mostly AI, which has sophisticated algorithms to automate the selection of models and parameters and in general outperforms SDV’s CTGAN15, did not in this case perform better than CTGAN with the default settings. We therefore believe that we could not have improved substantially by tweaking the parameters.

from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Auto-detect column types (categorical vs. continuous) from the original data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df_orig)

# Train CTGAN with the default settings and sample a synthetic
# dataset with the same number of rows as the original
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(df_orig)
df_syn = synthesizer.sample(num_rows=len(df_orig))

The core Python code is five lines. We used version 1.14.0 of SDV with the default settings. It took 32 seconds to generate the anonymized data on a laptop with an i7-7820HQ processor running at 2.9GHz and sufficient memory (32GB).

Anonymized data usage

The data anonymization performed by SDV retains the data structure of the original data, therefore no post-processing is needed before use.

SynDiffix

Overview

SynDiffix takes a multi-table approach to synthesizing data8. A key characteristic of all data anonymization methods is that accuracy degrades as the number of columns increases. Therefore it is better to synthesize only those columns needed for a given analytic task. Given that there can be thousands of different combinations of columns that analysts may be interested in, a multi-table approach can lead to the generation of thousands of distinct tables. Unlike other anonymized data methods, SynDiffix is designed to maintain anonymity no matter how many anonymized tables are generated.

When SynDiffix synthesizes a table with more than around 5 or 6 columns, SynDiffix partitions the table into a set of sub-tables each with fewer columns, synthesizes each sub-table individually, and then joins the sub-tables back together. Columns within sub-tables are more strongly correlated than columns across sub-tables. Each sub-table has at least one column in common with another sub-table, and these common columns are used for joining.
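As a purely conceptual illustration (this is not SynDiffix's actual algorithm, which groups strongly correlated columns together), a partitioning scheme with overlapping join columns could be sketched as follows:

```python
# Conceptual sketch only: split a wide table's columns into overlapping
# sub-tables, where each sub-table shares one column with its predecessor
# so that the sub-tables can later be joined back together on that column.
def partition_columns(columns, max_cols=3):
    """Partition columns into groups of at most max_cols columns,
    each group sharing its first column with the previous group."""
    subtables = [columns[:max_cols]]
    i = max_cols
    while i < len(columns):
        # Reuse the last column of the previous sub-table as the join key
        subtables.append([subtables[-1][-1]] + columns[i:i + max_cols - 1])
        i += max_cols - 1
    return subtables

print(partition_columns(["a", "b", "c", "d", "e", "f", "g"]))
# → [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g']]
```

In SynDiffix itself, the grouping is chosen so that strongly correlated columns land in the same sub-table, and the shared columns serve as the join keys.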

SynDiffix is implemented in Python, and has two modes of operation, full-table mode and sub-table blob mode. In full-table mode, the user (data controller or analyst) requests a single table, and SynDiffix returns that table after doing any required partitioning and joining. Full-table mode requires that the original data is available to SynDiffix. Full-table mode uses the Synthesizer class of SynDiffix.

In sub-table blob mode, a single API call is made by the data controller to create the complete set of sub-tables required to generate any table. This set of sub-tables is zipped into a single file, referred to as a SynDiffix blob. The blob may safely be released to the public. Creating the blob uses the SyndiffixBlobBuilder class and requires access to the original data. Subsequently, an analyst with access to a blob uses the SyndiffixBlobReader class to join and retrieve tables from the blob. When the analyst requests a table with a given set of columns, the blob reader selects the appropriate sub-tables from the blob, joins them together, and returns the table. The original data is not required for this operation.

In either mode, no table-specific configuration is required if the table is not longitudinal or time-series. If the table is, then the column that identifies each individual in the dataset must be specified. No other configuration is necessary.

Anonymized data generation

To generate the blob, the user must write a Python script that reads the original data file into a dataframe (df_original below), then generates and saves the blob. Generating the blob required two lines of Python:

sbb = SyndiffixBlobBuilder(blob_name, blob_path)
sbb.write(df_original)

This creates a blob from the full original dataframe df_original and places it in the directory at blob_path with the name blob_name.

The blob file created from the original data with eight columns and 713 records contains 160 tables, took 81 seconds to build on the same laptop as with ARX, and is 1.4MB in size (zipped).

Anonymized data usage

While creating the blob is simple and automatic and requires no expertise from the controller, using the blob for analytics is unfortunately more complex. For each analytic task, the analyst must request from the blob a table with only the required columns. In addition, if the analytic task is to build a predictive model for a given target column, then the target column should be specified in the blob request. The analyst must be aware of these requirements, and must modify their analysis code to satisfy them. The quality of the data is substantially worse if the analyst instead uses a single full table for all of their analysis.

Recreating the three tables and one plot from the original paper6 required seven different tables from the blob: two with one column, three with two columns, and two with six columns plus a target column (VO2max). In the scripts used for generating the tables and plot, a SyndiffixBlobReader was instantiated once, and an additional read call was required prior to each analytic task:

sbr = SyndiffixBlobReader(blob_name, blob_path)

# Each analytic task requests only the columns it needs
df1 = sbr.read([col1, col2])
analytic_task1(df1)

# For predictive tasks, the target column is specified as well
df2 = sbr.read([col3, col4, col5, col6], target_column=col5)
analytic_task2(df2)

...