Introduction

A core function of open science is data sharing. In an ideal setting, scientists should be able to easily upload data to a repository which satisfies the FAIR principles, i.e. makes the data findable, accessible, interoperable, and reusable1. Data sharing is challenging even when the data is not personal data2, i.e. does not contain information about individuals. Sharing of personal data is even more challenging3 due to stricter and more diverse legal frameworks.

In a data sharing pipeline where personal data is involved, there is often an extra step of data anonymization that historically has required individual attention to each data release, for instance in order to find a good balance between privacy and utility4. At the same time, there are constant advances in data anonymization technology. In recent years for instance, a new generation of synthetic data tools have become available5, both as open source and commercial products.

This study assesses the applicability of a small but representative set of data anonymization tools to an open science personal data sharing scenario. For this, we selected a base study, which we tried to replicate using anonymized data. The base study is a research study authored by GJ and BL6 with data controlled by them. The study, titled “Associations of mode and distance of commuting to school with cardiorespiratory fitness in Slovenian school children: a nationwide cross-sectional study”, is challenging for data anonymization tools due to its small sample size (713 records), the amount and depth of the statistical analyses, and overall few significant results and small effect sizes. These conditions are challenging because the distortion necessarily introduced by anonymization must be small so as not to overwhelm the significance of the data.

The evaluation in this study focuses on three questions:

  1. Would the scientific conclusions of the study have changed if anonymized data were used instead of the original data?

  2. How difficult would it have been to use the anonymized data for the scientific study?

  3. How difficult would it have been for non-experts to generate the anonymized data?

The three anonymization tools evaluated are ARX7 (developed by Prasser), SynDiffix8 (developed by Francis), and SDV9. All tools are open source software and implement different methods to create anonymized datasets. ARX provides a variety of anonymization mechanisms, with a focus on K-anonymity and its variants (L-diversity etc.). For this study, K-anonymity was used (K = 2). SDV also offers a variety of mechanisms, but all from the family of AI-based data synthesis techniques that have become popular in recent years5. For this study, we used CTGAN. SynDiffix is one of a class of techniques that uses regression trees to automate aggregation, adding suppression and noise to strengthen anonymity. These three methods are representative of three popular approaches to data anonymization: K-anonymity, AI synthesis, and regression trees. More information about these tools, as well as the procedure used to select them, can be found in the Methods section of this paper.

To our knowledge, this is the first paper to examine anonymized data quality in the context of a specific scientific study, versus prior work which examines data quality in general terms. Another distinguishing feature of this paper is that it analyses the relative ease of generating anonymized data, which is important in an open science setting. While this paper is narrow in that it only analyzes a single scientific study, we believe that it serves as a blueprint for how to evaluate anonymized data quality.

Prior work

The quality and utility of anonymized data have long been studied. It is common to distinguish between general-purpose and special-purpose (or workload-aware) approaches to quantifying anonymized data quality. General-purpose approaches include information loss metrics, like granularity reduction10 or record fidelity11, as well as resemblance metrics, which can be univariate, e.g. comparing distributions of variables in unprotected vs. anonymized data12, or multivariate, e.g. comparing differences in correlation matrices13. Another approach is to study how well classifiers can distinguish unprotected from anonymized records14. The three anonymization methods presented here have previously been explored with respect to the general-purpose utility of the protected data15, whereas workload-aware utility analysis has not yet been explored for all of them. Other prior work proposes that even the anonymization mechanisms themselves should be workload-aware, taking into account what analysis will be done after anonymization16,17. The quality evaluations in this work are therefore based on the workload analysis, which is a novelty of this paper.

All of the quality measures in this prior work, however, are generic in that they do not take into account a specific scientific goal. These generic measures are useful for comparing the quality of different anonymization techniques generally, but they fall short of predicting whether any given scientific conclusion will be correct or not. By contrast, our paper measures quality based on specific scientific conclusions, which is another novelty of this study. Here we directly compare the ability of the anonymization methods to preserve the data characteristics that lead to the scientific conclusions.

Although data quality and utility have been thoroughly explored, research on the privacy risks associated with non-traditional anonymization methods and synthetic data generation is rather sparse18. Few frameworks exist that can estimate residual privacy risks on anonymized and synthetic data alike. One such framework is the Anonymeter framework by Giomi et al.19, originally proposed for large synthetic datasets to evaluate the risks of linkage, attribute inference, and record singling-out.

Results

The base study of Jurak et al. investigated whether active commuting has the potential to improve children’s health6. The cardiorespiratory fitness (CRF) of 713 Slovenian school children aged 12 to 15 years was determined with a 20-m shuttle run test to estimate their maximal oxygen uptake (VO2max). Moreover, information was collected on their distance from home to school and whether the commute was done by walking, wheels (e.g., bicycle or skateboard), public transport, or car, in both directions. The study found that commuting distance minimally affected CRF, except in the Car group, where children living closer to school had significantly lower CRF than those farther away. The study recommends targeting car-driven children within walking or cycling distance of school with interventions promoting active transport. The original paper presented its results in three tables and one figure, which we refer to as base-tables and base-figure. This study replicates these base-tables and base-figure, but in such a way that the original data and the data from the three anonymized datasets are combined for easy comparison.

The following sections discuss the similarities and differences between the anonymized and original data, and comment on the extent to which the conclusions of the original paper6 still hold given the anonymized data.

Dataset comparison

Descriptive univariate statistics such as means and standard deviations are provided in Table 1. The ARX and SynDiffix data show overall similar distributions and no statistically significant differences from the original data. The data generated by SDV, in contrast, showed differences in four out of five variables, with p-values below 0.001; the p-value for the Age variable was 0.006, just above the chosen alpha level of 0.005. Looking at the difference in mean values, we see huge deviations in the SDV dataset for both commute distances, potentially indicating general flaws in the synthetic data generation used here.
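Such a univariate comparison can be sketched with a stdlib-only Welch's t statistic (an illustration only; the paper does not state which specific test produced its p-values, and the sample values below are hypothetical):

```python
import math
from statistics import mean, stdev

def welch_t(orig, anon):
    """Welch's t statistic for the difference in means between the
    original and anonymized values of one continuous variable
    (no equal-variance assumption)."""
    v1, v2 = stdev(orig) ** 2, stdev(anon) ** 2
    n1, n2 = len(orig), len(anon)
    return (mean(orig) - mean(anon)) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical Age samples: similar distributions give a small |t|.
age_orig = [12.1, 13.4, 14.2, 12.8, 15.0, 13.3]
age_anon = [12.3, 13.1, 14.5, 12.6, 14.8, 13.6]
t = welch_t(age_orig, age_anon)
```

The resulting t value would then be compared against the chosen alpha level, as in Table 1.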

Table 1 Descriptive statistics of continuous variables expressed as mean and standard deviation.

Reproduction of commute modes and distances

The original and anonymized data for base-tables 1 and 26 explored the types of commuting in school children and the commute distances traveled. These are given in this paper’s Tables 2 and 3 respectively20.

Table 2 Base-table 16 from the paper showing the counts and percentages for the original data and the three anonymization methods.
Table 3 Base-table 26 from the original paper showing the counts and distances in meters (median and IQR) for the original data and the three anonymization methods6.

Among the three anonymization methods, SynDiffix most closely reproduced the original counts and percentages for the cross-tabulation of commuting modes (Table 2), with only one deviating count in the cross table. It is followed by ARX, which showed deviations of up to 18, while SDV produced large differences of up to 179 in the commute mode counts of the cross table. Nevertheless, the basic description of these results in the original paper6, i.e. that active commuting is more frequent in the direction from school to home than in the other direction, still holds, though this may be a coincidence in the case of SDV, as it substantially over- or underestimates the frequencies for both modes of active commuting (walking and wheels). Note also that the anonymized table of ARX contains 1.4% fewer rows (703 versus 713) due to suppression of rare records.

Similar results were found for commuting distance (Table 3). Here too SDV performs badly, hugely over- or underestimating the distance in most of the table cells, both in the central tendency (median) and the variability (IQR) of the data. SynDiffix and ARX perform much better, but still with large errors in some of the table cells, especially those with low frequencies: e.g., SynDiffix substantially underestimates the median distance of car commuting from school to home (3615 → 2584), and ARX overestimates the distance of wheels commuting in the same direction (1444 → 2235). The verbal description of these results in the original paper6, i.e. that the active commuting groups (Walk, Wheels) typically live close to school, still holds for SynDiffix and ARX, but not for SDV.

The absolute errors of the cross-table counts from Table 2 and the distance values from Table 3 are visualized in the boxplots of Fig. 1.

Fig. 1
figure 1

Absolute error of the three anonymization methods for the counts and distances in Tables 2 and 3. Each box plot data point is taken from a single cell in Tables 2 and 3.

Reproduction of statistical significance of regression coefficients

The original and anonymized data for base-table 36 are given in this paper’s Tables 4 and 5 (the data is spread over two tables for formatting purposes)20. The primary purpose of base-table 36 is to indicate statistical significance. As such, those entries where the original data is significant and the anonymized data is not, or vice versa, are highlighted in red. In total, there are 3 such mismatches for ARX, 8 for SDV, and 5 for SynDiffix.

Table 4 Part 1 (of 2) of the base-table 36 showing the parameters (regression coefficients) of the linear model for prediction of VO2max by group and distance.
Table 5 Part 2 (of 2) of the original paper’s Table 3 showing the parameters (regression coefficients) of the linear model for prediction of VO2max by group and distance6.

Several differences were found between the statistics of the linear prediction models based on the original data and those based on the data produced by the three anonymization methods (Tables 4 and 5). In the original models predicting VO2max, the constant (intercept), the predictors Gender and MVPA, and the derived Car × Gender interaction term were statistically significant in both directions of commuting. This also holds for all parameters in the SynDiffix and ARX models, except for the Car × Gender parameter in the SynDiffix model, which is not significant in one of the directions of commuting. The Gender parameter in the SDV model was found to be non-significant, despite being highly significant (p ≤ .001) in the three other models.

Note that only the SDV method provides an estimate for the Wheels × Males interaction parameter in the school-to-home direction models. ARX and SynDiffix suppressed this data as part of anonymization because there were so few data points. The normalized error for the coefficients in Tables 4 and 5 is illustrated in Fig. 2.

Fig. 2
figure 2

Normalized error for coefficients in Tables 4 and 5, and Fit (VO2max predicted at mode for median distance) for Fig. 3. The normalized error E is computed as \(E=\frac{| O-A| }{\max (| O| ,| A| )}\), where O is the original value and A is the anonymized value. The middle plot displays normalized error only for the data points in Tables 4 and 5 where the original coefficient is significant.
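The normalized error metric is straightforward to compute; a minimal sketch:

```python
def normalized_error(original, anonymized):
    """E = |O - A| / max(|O|, |A|). E is 0 for exact agreement and
    grows toward 1 as the magnitudes diverge (it can exceed 1 when
    O and A have opposite signs)."""
    if original == anonymized:
        return 0.0
    return abs(original - anonymized) / max(abs(original), abs(anonymized))

# Example from Table 3: SynDiffix's car-commute median, 3615 -> 2584.
e = normalized_error(3615, 2584)  # about 0.285
```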

Reproduction of the stratification of children’s cardiorespiratory fitness

The plot for base-figure 16 is given in this paper’s Fig. 3, which replicates the plot from the original paper and gives the corresponding plots for the three anonymized datasets6. In male children, the point predictions and prediction intervals in the original paper6 are quite closely matched by ARX and SynDiffix, but not by SDV. In female children, the original statistics are most closely matched by ARX, followed by SynDiffix and then by SDV. SDV gives similar results for male and female children, although they are clearly separated in the original data. SDV also performs worst in estimating prediction interval widths, which it renders very similar for both sexes, although they are quite different in the original plot. ARX and SynDiffix produce interval widths more similar to the original, but these still differ in some cases, e.g. being too narrow for walk commuting in ARX, and too narrow for females’ wheels commuting from school to home in both ARX and SynDiffix.

Fig. 3
figure 3

Comparison of the VO2max data plots corresponding to the base-figure from the original study6.

Comparison of derived scientific insights

Table 6 summarizes the ability of each anonymization method to produce the same analytic conclusions as those of the base paper. It presents the set of statements made in the base paper, and evaluates whether the statement would be supported (O), negated (X), or neither supported nor negated (?).

Table 6 This table summarizes the ability of each anonymization method to result in the same analytic conclusion as that of the original data.

As mentioned, besides the specific scientific statements, the paper suggests a policy intervention: that active transport should be promoted for children who use passive transport, but live within walking or biking distance to school. We regard this as the most important conclusion of the paper, and it is supported by ARX and SynDiffix, but would have been negated by SDV.

The base paper starts with a number of simple statistical observations about the data (descriptive analytic results). Both ARX and SynDiffix support all six of these observations, while SDV negates four of them.

The main analytic thrust of the paper is the linear regression analysis of CRF. Here, the performance of ARX and SynDiffix is weaker. While none of the statements were negated, about half of them were not supported. This would not have prevented the main policy conclusion from being reached, but the overall quality of the study would have suffered from using either anonymization technique.

Usability of the anonymization tools and data

To enable data sharing in an open science environment, processes are needed that allow scientists to conveniently create and share safe data. As scientists already tend to be undermotivated to share data2, it is hence beneficial if the anonymization step places as little additional burden on scientists as possible. Ideally, there should be little or no extra work required by data collectors or operators of repository systems to anonymize data and prepare them for sharing.

In terms of protected data generation, SynDiffix is the easiest to use of all the methods studied here due to its simplicity and minimal configuration requirements. Using ARX, in contrast, requires expertise in the various anonymization methods that it supports, as well as substantial configuration effort to generate a protected data table of sufficient quality. SDV is easier to use than ARX since it provides default configurations and code templates, but it nevertheless requires a decision about which SDV tool to use, as well as configuration of datatypes. However, in terms of data analysis, SynDiffix places an extra burden on the analyst, who needs to understand that different tables must be generated for different analytic tasks. Neither ARX nor SDV carries this additional burden: the generated tables can be used as is.

In general, any anonymization technique requires that the analyst understand the extent to which the data has been distorted and compensate for that distortion. This always creates an additional burden. Since the anonymization process in ARX is typically explainable and deterministic, and the resulting distortions are visible in univariate and bivariate statistics, they might be easier to understand. In contrast, typical synthetic data generation involves randomness, and it is often not clear how the internal structure of the data might have changed, even if uni- and bivariate distributions appear to be similar.

Risk Evaluation

All three of the data anonymization methods studied in this paper have strong privacy properties. K-anonymity is a well-established technique that has been in use for more than two decades21. AI-based synthetic data tools like CTGAN are offered commercially. Like K-anonymity, SynDiffix uses the mechanisms of aggregation and suppression, and additionally adds noise.

Here we demonstrate the strong privacy properties of the three methods with respect to the commuting dataset using the attribute inference privacy metric from the Anonymeter tool19 (https://github.com/statice/anonymeter). In an attribute inference attack, an attacker knows several attributes of a person known to be in a dataset and tries to predict an unknown attribute from the released anonymized data. Anonymeter makes this prediction by finding the microdata record in the anonymized data that most closely matches the known attributes, and predicting the unknown attribute from that record.
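The nearest-record prediction step can be illustrated with a small sketch. Note the simplifying assumptions: purely numeric attributes and a squared-distance match (Anonymeter itself handles mixed data types), and the records and column names below are hypothetical:

```python
def infer_attribute(known, anon_records, secret_col):
    """Predict a target's unknown attribute from the anonymized record
    that most closely matches the attacker's known attributes."""
    def dist(rec):
        # Squared Euclidean distance over the known (numeric) attributes.
        return sum((rec[col] - val) ** 2 for col, val in known.items())
    return min(anon_records, key=dist)[secret_col]

# Hypothetical anonymized rows with a known 'dist_m' and secret 'vo2max'.
anon = [
    {"dist_m": 500, "vo2max": 48.2},
    {"dist_m": 2400, "vo2max": 44.1},
    {"dist_m": 5100, "vo2max": 46.7},
]
guess = infer_attribute({"dist_m": 2250}, anon, "vo2max")  # matches the 2400 m row
```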

Anonymeter measures the improvement of the predictions over a statistical baseline. For example, the statistical baseline precision for predicting sex would be 50%, assuming an even distribution of male and female. Anonymeter establishes the statistical baseline by making inference predictions on records that have been removed from the dataset prior to anonymization. Any prediction success on these records is only statistical in nature, and does not represent a loss of individual privacy.

Anonymeter measures Risk as R = (Patk − Pbase)/(1 − Pbase), where Patk is the prediction precision for records in the dataset, and Pbase is the precision for records not in the dataset. For example, if the baseline precision for predicting sex is Pbase = 0.5, and the attack precision is Patk = 0.75, then the improvement, or Risk, is R = 0.5. Any Risk below 0.5 can be regarded as strongly anonymous. When Pbase is low, then R = 0.5 leaves substantial uncertainty on the part of the attacker, or equivalently, substantial deniability on the part of the victim. When Pbase is high, the attacker in any event has little uncertainty, and a Risk of 0.5 represents only a modest improvement. Furthermore, a high baseline implies an attribute that is common in the dataset population and therefore unlikely to be a sensitive attribute.
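In code, the Risk computation is a one-line function (an illustrative sketch, not Anonymeter's implementation):

```python
def anonymeter_risk(p_atk, p_base):
    """R = (P_atk - P_base) / (1 - P_base): the attacker's improvement
    over the statistical baseline, rescaled by the headroom above it."""
    return (p_atk - p_base) / (1.0 - p_base)

# The worked example from the text: baseline 0.5, attack precision 0.75.
r = anonymeter_risk(0.75, 0.5)  # R = 0.5
```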

For our measurements, we removed 100 records from the dataset to establish the baseline, and re-anonymized the remaining 613 records. We assumed that the attacker knows the victim’s gender, age, and commuting distance and mode in both directions. These are all potentially publicly-known attributes. The unknown attributes being predicted are VO2max and MVPA. In evaluating SynDiffix, the two tables containing only the known attributes and one unknown attribute each were used. Table 7 presents the Risk scores and the 95% confidence intervals. In all cases, even the upper confidence bound is well below the strong-privacy Risk threshold of 0.5.

Table 7 Privacy risk scores and 95% confidence intervals.

Discussion

In this paper, we compared the suitability of three tools for data anonymization (ARX, SDV and SynDiffix) for use in a public health study in Slovenia. Our study distinguishes itself from the large body of literature on this topic by studying in detail whether the individual analytical findings as well as the resulting conclusions would hold when using anonymized instead of the original data. Moreover, we explicitly considered the burden that would be imposed on researchers by applying one of the studied tools to follow open-science principles in our analysis.

Of the three tools, SDV’s data quality was poor and led to many incorrect scientific conclusions. The data quality of ARX and SynDiffix was good enough to provide definite value. The data generated by both tools supported the main conclusion of the base study (that children within walking or biking commuting distance should not commute with a car), and no incorrect conclusions were drawn. In addition, all of the descriptive analytics (counting and simple statistics) were supported by ARX and SynDiffix.

Regarding researcher burden for data generation, all three tools require some manual configuration. ARX typically requires configuration of data hierarchies per column, among other configuration choices, which requires expertise to find suitable settings. SDV requires selection of an algorithm, definition of data types, and potentially other configuration choices. SynDiffix only requires configuration if the dataset is a time-series, in which case the user must specify the column that identifies individuals in the dataset. For the fitness dataset, ARX required around 200 lines of code (Java), SDV required 5 lines, and SynDiffix required 2 lines (both in Python). Note that ARX is available as a GUI application (Windows, MacOS, or Linux), whereas SDV and SynDiffix require Python.

Regarding researcher burden for data analysis, the output of ARX and SDV can be used as is. SynDiffix generates multiple anonymized datasets, and the analyst must select the one with only the columns necessary for each given analytic task. This can only be done using a SynDiffix blob reader package that runs in Python. For the fitness study, a 22-line Python routine was needed to generate the two datasets used by an R script to produce Fig. 3. In the analysis code, an extra line of code is required each time a dataframe is read.

A major limitation of the current study is that it does not truly replicate an open science scenario, whereby an analyst undertakes a scientific analysis purely from anonymized data. In the current study, the analyst had the advantage of hindsight, was already familiar with the raw data and its analysis, and simply replicated an analysis that had already taken place on the raw data. A more realistic study would be to give the anonymized data to a researcher unfamiliar with the data, have them undertake a scientific study from scratch, and then follow up the study with the raw data to determine not only whether the results are correct, but also whether they would have done the analysis differently given the raw data.

Another limitation of this study is that it is based on only one dataset and analysis. As future work, it would be valuable to undertake additional scientific studies on different datasets, ideally with separate teams working in isolation on the original and anonymized data respectively. It would also be valuable to explore other anonymization methods, or these methods with different parameter settings. In this fashion, we can build up an understanding of whether current anonymization methods can be used in open science settings, and what improvements are needed to reach that goal.

Methods

Choice of base study and dataset

The selection of the base study and corresponding dataset to use for this paper was limited to those in which the authors were involved. Besides the practical matter of having access to the original datasets, the authors of the original studies are in the best position to determine if the anonymized data is fit for purpose. We decided to select a base study where the dataset and the analysis are typical of those used by researchers participating in the SPOZNAJ open science project in Slovenia. Most such datasets are small, consisting of hundreds or a few thousand records. Using a small dataset also challenges the anonymization tools, since the perturbation introduced by anonymization has a proportionally stronger effect on small datasets.

In total, we considered eight studies6,22,23,24,25,26,27,28. Of these, three22,26,27 had too much data, and the analyses in two others23,28 were less interesting than the rest. Of the remaining studies, one6 was selected in part because it involves a sophisticated analysis technique (linear regression), and in part because it has a mix of significant and non-significant regression coefficients. An important and challenging test of anonymized data is how well it preserves significance.

Choice of anonymized data methods

Anonymization mechanisms fall into two broad classes. One class aims to replicate or modify the original data as closely as possible so that statistical statements about the original data remain accurate. Another class aims to behave like the original data, in that it generates data suitable for predictive ML applications, and may even deliberately modify the statistics of the data, for instance by creating additional data, possibly with bias added or removed. This latter class is generally referred to as “synthetic data”, although there is no strict definition of this term29.

There is a long history of open source tools in the replicate class, including sdcMicro30 which implements techniques classically used by statistics agencies (swapping, outlier removal, sampling), ARX31 which implements k-anonymity21 and related techniques among others, Synthpop32 which implements the decision tree approach CART33 among others, and most recently SynDiffix8 which is a multi-table tree-based approach.

Of these, we selected ARX which has demonstrated success particularly in medical domains31,34, and SynDiffix which claims high accuracy and ease of use15. The anonymized data generated by both of these tools is row-level data (also known as microdata) that syntactically is equivalent to the original data. In this specific narrow sense, ARX and SynDiffix can be thought of as synthetic data.

We wished also to select at least one tool from the behave-like class. These techniques have received considerable attention in recent years since the publication of the CTGAN and TVAE techniques35, and a number of open source and commercial tools are available. We tested both the open source tool Synthetic Data Vault (SDV)9 and the commercial product Mostly AI. The two demonstrated similar results for this paper’s dataset, so we selected SDV since open source is better for open science.

Note that the anonymized datasets generated for this paper for ARX and SynDiffix were built by the respective authors of those tools. The SDV dataset was generated by Francis.

The procedures used to generate the three anonymized datasets from ARX, SDV, and SynDiffix, are described in the following sections.

ARX

Overview

The ARX software is intended to anonymize sensitive personal data and supports a wide variety of privacy models and data transformation methods. It can be used either with a graphical user interface as a standalone software, or by using the Java-based ARX library to perform the data anonymization via code34. ARX allows for fine-grained configuration to implement tailored anonymization procedures and offers a wide variety of options to protect the data while providing high performing optimization algorithms to retain the data utility31.

Anonymized data generation

To perform the data anonymization, ARX requires dataset-specific configuration. This includes the privacy models to be used and their thresholds, as well as domain-generalization hierarchies for the variables, which the software can use to aggregate data tailored to the scientific research question or as a distance measure during clustering. Finally, the specific transformations to be performed, e.g., suppression, full-domain or local generalization, aggregation, as well as the algorithm used to perform the optimization process, need to be specified. Parameters often need to be fine-tuned over several iterations.

For the given dataset, the process was quite straightforward. We chose k-Anonymity as a strict privacy model that protects all records and applies to all variables. We chose the threshold k=2, because it is the weakest possible parameterization, and weaker parameters tend to provide higher utility for small datasets. ARX was configured to perform a clustering process where domain-generalization hierarchies are used to determine distances between values. For categorical variables, a simple domain-generalization hierarchy was designed, grouping “walk” and “wheels” together, because they both indicate movement through physical activity. For continuous variables, the associated domain-generalization hierarchies represented increasingly large intervals. The hierarchy for the “gender” variable simply included a common root node “*” for both genders. In each cluster, categorical values were replaced by the mode of all values in the cluster (drawing from the distribution of the input data if there was no mode), while continuous variables were replaced with the arithmetic mean of all values in the cluster.
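The within-cluster replacement rule can be sketched as follows (a stdlib-only illustration of the rule described above, not ARX's actual code; the column names are hypothetical):

```python
import random
from statistics import mean, multimode

def flatten_cluster(records, categorical, continuous):
    """Replace all values within one k-anonymous cluster: categorical
    columns get the cluster mode (ties broken by a random draw, echoing
    ARX's draw from the input distribution), continuous columns the mean."""
    rep = {}
    for col in categorical:
        modes = multimode(r[col] for r in records)
        rep[col] = modes[0] if len(modes) == 1 else random.choice(modes)
    for col in continuous:
        rep[col] = mean(r[col] for r in records)
    return [dict(rep) for _ in records]

cluster = [
    {"mode": "walk", "dist_m": 400},
    {"mode": "walk", "dist_m": 900},
    {"mode": "wheels", "dist_m": 500},
]
flat = flatten_cluster(cluster, ["mode"], ["dist_m"])  # every row: walk, 600
```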

We applied the transformations using ARX’s local optimization strategy31 using the Java library provided by ARX version 3.9.2. The core Java code (i.e. excluding I/O, imports, etc.) for executing the anonymization is around 200 lines. Executing the process took about 2.3 minutes on a laptop with an i5-1135G7 processor running at 2.4GHz and sufficient memory (8GB).

Anonymized data usage

The data anonymization performed by ARX retained the data structure of the original data, therefore no post-processing was needed before use.

Synthetic Data Vault (SDV)

Overview

The Synthetic Data Vault (SDV) is an open source project implementing several synthetic data tools9. SDV offers different tools depending on whether the original table is a single table, multiple tables (relational), or time-series. The single-table case offers several tools, such as Gaussian-Copula, CTGAN, and TVAE. The latter two are promoted by SDV as being suitable for data with a mix of categorical and continuous columns, and the data quality of CTGAN and TVAE is similar35.

We used SDV’s CTGAN tool (Conditional Tabular Generative Adversarial Network) to generate an anonymized dataset. As the name implies, CTGAN uses a Generative Adversarial Network (GAN) approach35. It runs two neural networks, a generator and a discriminator. The generator tries to create anonymized datasets that the discriminator cannot distinguish from the original data, while avoiding overfitting.

There are a number of parameters that can be used to fine-tune CTGAN. Most importantly, the metadata description must be correct, especially the labeling of continuous and categorical columns. There are a number of other parameters related to the GAN itself (e.g., enforced minimums and maximums, enforced rounding, number of epochs) that can improve data quality somewhat.

Anonymized data generation

SDV has an option to auto-generate the metadata. We used this option and checked to ensure that the metadata was correct. Since all of the categorical columns in the original dataset are strings, the auto-generated metadata was correct. Since simplicity of operation is an important requirement in an open science environment, we chose to use the default parameter settings. Note that the commercial synthetic data product Mostly AI, which has sophisticated algorithms to automate the selection of models and parameters and in general outperforms SDV’s CTGAN15, did not in this case perform better than CTGAN with the default settings. We therefore believe that we could not have improved substantially by tweaking the parameters.

from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Auto-detect column types (categorical vs. continuous) from the original data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df_orig)

# Train CTGAN with the default settings and sample a synthetic
# dataset with the same number of rows as the original
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(df_orig)
df_syn = synthesizer.sample(num_rows=len(df_orig))

The core Python code is five lines. We used version 1.14.0 of SDV with the default settings. It took 32 seconds to generate the anonymized data on a laptop with an i7-7820HQ processor running at 2.9GHz and sufficient memory (32GB).

Anonymized data usage

The data anonymization performed by SDV retains the data structure of the original data, therefore no post-processing is needed before use.

SynDiffix

Overview

SynDiffix takes a multi-table approach to synthesizing data8. A key characteristic of all data anonymization methods is that accuracy degrades as the number of columns increases. Therefore it is better to synthesize only those columns needed for a given analytic task. Given that there can be thousands of different combinations of columns that analysts may be interested in, a multi-table approach can lead to the generation of thousands of distinct tables. Unlike other anonymized data methods, SynDiffix is designed to maintain anonymity no matter how many anonymized tables are generated.

When SynDiffix synthesizes a table with more than around 5 or 6 columns, SynDiffix partitions the table into a set of sub-tables each with fewer columns, synthesizes each sub-table individually, and then joins the sub-tables back together. Columns within sub-tables are more strongly correlated than columns across sub-tables. Each sub-table has at least one column in common with another sub-table, and these common columns are used for joining.
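As a purely conceptual illustration (this is not SynDiffix's actual algorithm, which groups strongly correlated columns together), a partitioning scheme with overlapping join columns could be sketched as follows:

```python
# Conceptual sketch only: split a wide table's columns into overlapping
# sub-tables, where each sub-table shares one column with its predecessor
# so that the sub-tables can later be joined back together on that column.
def partition_columns(columns, max_cols=3):
    """Partition columns into groups of at most max_cols columns,
    each group sharing its first column with the previous group."""
    subtables = [columns[:max_cols]]
    i = max_cols
    while i < len(columns):
        # Reuse the last column of the previous sub-table as the join key
        subtables.append([subtables[-1][-1]] + columns[i:i + max_cols - 1])
        i += max_cols - 1
    return subtables

print(partition_columns(["a", "b", "c", "d", "e", "f", "g"]))
# → [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g']]
```

In SynDiffix itself, the grouping is chosen so that strongly correlated columns land in the same sub-table, and the shared columns serve as the join keys.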

SynDiffix is implemented in Python, and has two modes of operation, full-table mode and sub-table blob mode. In full-table mode, the user (data controller or analyst) requests a single table, and SynDiffix returns that table after doing any required partitioning and joining. Full-table mode requires that the original data is available to SynDiffix. Full-table mode uses the Synthesizer class of SynDiffix.

In sub-table blob mode, a single API call is made by the data controller to create the complete set of sub-tables required to generate any table. This set of sub-tables is zipped into a single file, referred to as a SynDiffix blob. The blob may safely be released to the public. Creating the blob uses the SyndiffixBlobBuilder class and requires access to the original data. Subsequently, an analyst with access to a blob uses the SyndiffixBlobReader class to join and retrieve tables from the blob. When the analyst requests a table with a given set of columns, the blob reader selects the appropriate sub-tables from the blob, joins them together, and returns the table. The original data is not required for this operation.

In either mode, no table-specific configuration is required if the table is not longitudinal or time-series. If the table is, then the column that identifies each individual in the dataset must be specified. No other configuration is necessary.

Anonymized data generation

To generate the blob, the user must write a Python script that reads the original data file into a dataframe (df_original below), then generates and saves the blob. Generating the blob required two lines of Python:

sbb = SyndiffixBlobBuilder(blob_name, blob_path)
sbb.write(df_original)

This creates a blob from the full original dataframe df_original and places it in the directory at blob_path with the name blob_name.

The blob file created from the original data with eight columns and 713 records contains 160 tables, took 81 seconds to build on the same laptop as with ARX, and is 1.4MB in size (zipped).

Anonymized data usage

While creating the blob is simple and automatic and requires no expertise from the controller, using the blob for analytics is unfortunately more complex. For each analytic task, the analyst must request from the blob a table with only the required columns. In addition, if the analytic task is to build a predictive model for a given target column, then the target column should be specified in the blob request. The analyst must be aware of these requirements, and must modify their analysis code to satisfy them. The quality of the data is substantially worse if the analyst instead uses a single full table for all of their analysis.

Recreating the three tables and one plot from the original paper6 required seven different tables from the blob: two with one column, three with two columns, and two with six columns plus a target column (VO2max). In the scripts used for generating the tables and plot, a SyndiffixBlobReader was instantiated once, and an additional read call was required prior to each analytic task:

sbr = SyndiffixBlobReader(blob_name, blob_path)

# Each analytic task requests only the columns it needs
df1 = sbr.read([col1, col2])
analytic_task1(df1)

# For predictive tasks, the target column is specified as well
df2 = sbr.read([col3, col4, col5, col6], target_column=col5)
analytic_task2(df2)

...