Validating a membership disclosure metric for synthetic health data

Abstract
Background
One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is the F1 accuracy with which an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter.
Objective
Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation.
Materials and methods
We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack.
Results
The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets.
Conclusions
Our proposed parameterization, as well as interpretation and generative model training guidance, provide a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.


INTRODUCTION
There has been growing interest in using synthetic data generation (SDG) techniques to enable broader privacy-preserving sharing of data for secondary purposes, 1,2 and specifically for health data. [3][4][5][6][7][8][9][10][11][12][13] While patient (re-)consent is one legal basis for making data available for secondary purposes, it is often impractical to obtain retroactive consent, and there is significant evidence of consent bias. 14 Anonymization is another approach for addressing privacy concerns when making health data available for secondary analysis. However, there have been repeated claims of successful reidentification attacks on anonymized data, [15][16][17][18][19][20][21] eroding public and regulator trust in this approach. [21][22][23][24][25][26][27][28][29][30]
There are multiple synthetic health datasets that are currently available to a broad research community, such as: the NIH National COVID Cohort Collaborative (N3C), 31 the CMS Data Entrepreneur's Synthetic Public Use files, 32 synthetic cardiovascular and COVID-19 datasets available from the CPRD in the United Kingdom, 33,34 A&E data from NHS England, 35 cancer data from Public Health England, 36 a synthetic registry from the Dutch cancer registry, 37 synthetic variants of the French public health system claims and hospital dataset (SNDS), 38 and South Korean data from the Health Insurance Review and Assessment service (the national health insurer). 39
The general assumption has been that synthetic data has low identity disclosure risks because there is no unique or one-to-one mapping between the records in the synthetic data and the records in the original (real) data. [40][41][42][43][44][45][46][47] However, there are additional risks beyond identity disclosure that need to be managed for synthetic datasets: (1) attribution risk (attribute disclosure conditional on identity disclosure), 48 and (2) membership disclosure. 49,50 Our primary focus in this article is on evaluating membership disclosure for synthetic data.
There has been a growing literature on assessing membership disclosure risks for synthetic data. 8,[49][50][51][52][53][54][55][56][57] Membership disclosure is when an adversary, using the information in synthetic data, determines that a target individual was included in the real dataset used as input for SDG. Knowing that an individual was in the real data can reveal sensitive attributes about that individual if the dataset pertains to a particular disease, condition, or process. The target individual is assumed to be from the same population as the real dataset.
For example, if the real dataset pertains to a clinical study of HIV patients, membership disclosure would reveal that the target individual has HIV, or that they had participated in the study. Both would be deemed inappropriate disclosures of private information.
A broader type of membership risk, referred to as a membership inference attack, has been used to evaluate privacy risks for discriminative machine learning models. 58,59 There are multiple reasons why membership inference attacks on machine learning models may be performed. For example, an organization may wish to determine whether any of its own data was inappropriately used to train a machine learning model, in order to detect copyright infringement or a breach of contract, or a regulator may attempt to detect whether some information was used without individual consent. In the context of the current study, we are only focused on membership inference attacks for the sole purpose of privacy violations. This distinction is important because the privacy purpose imposes some pragmatic constraints on these attacks.
While we are not aware of real-world membership disclosure attacks on synthetic datasets, the extensive and growing literature on the topic has highlighted the risk. From a legal and compliance perspective, it will arguably not be acceptable to share synthetic data without demonstrating low membership disclosure risks.
One proposed method for estimating membership disclosure requires the training of a shadow model, 51 and using a discriminator, such as a random forest model, to distinguish between records in and not in the training dataset. However, this approach makes a strong assumption about the availability to the adversary of a large reference dataset from the same population as the training data, 60 which may be difficult to meet in practice. Another strong assumption that is made is that the adversary would know the generative model details including all the parametrizations. For example, a data custodian would not generally share all of the trained weights and hyperparameters of their generative models with the data users.
Therefore, in this article, we evaluate another and more commonly used partitioning method for estimating membership disclosure risk of synthetic data, demonstrate through a theoretical and empirical analysis that its default parametrization in the literature could give inaccurate estimates of membership disclosure risk, and define a parameterization that gives the same results as the ground truth. We then provide a general benchmark to evaluate whether membership disclosure is acceptably low or not, and apply the membership disclosure metric to assess the risks for 7 clinical trial datasets.

Notation
We will use the notation in Table 1.
Table 1. Notation.
R: The real dataset
P: The population from which the real dataset is sampled
S: Synthetic datasets
D: The attack dataset
y: A record in the attack dataset (ie, y ∈ D)
y′: A record in the synthetic dataset (ie, y′ ∈ S)
r: A record in the real dataset (ie, r ∈ R)
Dataset sizes
n: The number of records in the real dataset (ie, n = |R|)
m: The number of records in the attack dataset (ie, m = |D|)
N: The size of the population that R is sampled from (ie, N = |P|)
t: The proportion of the attack dataset records that are in the real dataset
Hamming distance
L: Hamming distance function
h: Hamming distance threshold

The partitioning membership disclosure attack method
An assessment of membership disclosure is performed by the data custodian before a dataset is released. The data custodian does not have access to the population that the real dataset was sampled from, and therefore they will use an estimation procedure to compute this disclosure risk. The estimation procedure should accurately reflect the level of success that an adversary performing an attack would achieve on average. If the estimation procedure does not meet that objective, then it would not be useful for decision-making by the data custodian.
Under the partitioning method a membership disclosure attack occurs when an adversary has an attack dataset that is a sample from the same population as the real dataset. The attack dataset consists of one or more target individuals that the adversary intends to compromise. Then the adversary matches records in the attack dataset with the synthetic dataset. Membership disclosure occurs when a matching record is also in the training dataset. This process is illustrated in Figure 1. A key assumption in this process is that the attack dataset is from the same population as the real dataset, otherwise there is no reason for the adversary to expect that attack dataset records would be in the real dataset.
Because the data custodian does not have access to the attack dataset nor the population, they would need to estimate this risk using a different procedure as described below. The starting assumption for the partitioning method is that the synthetic data distribution approximates the real dataset distribution. 61 Therefore, the probability that the attack dataset belongs to the training dataset is proportional to the probability that the attack dataset belongs to the synthetic dataset. The partitioning method does not require a large reference dataset, which explains why it is the most commonly implemented in practice. 8,49,50,53,55,56 The partitioning method is illustrated in Figure 2. Here the real dataset is randomly split into 2 subsets, the training sample, and a holdout sample. The training sample is then synthesized, and a synthetic dataset is created. The holdout data provides records not used in training for inclusion in the attack dataset.
We assume that an adversary has complete information on m patients, 49,50,53 where m × t are drawn from the training sample and m × (1 − t) are drawn from the holdout sample. For example, if t = 0.5 then the attack dataset is half training and half holdout. We set m = s × n, where s is a sampling fraction from the real dataset. Previous work did not demonstrate a pronounced change when the sampling fraction was altered. 49,50 Therefore, we will not consider s to be a key parameter.
We can then compute the minimum distance between every record in D and all the records in synthetic dataset S. In the literature, the distance L is measured using the Hamming distance, and a match for attack record y is considered to have occurred if $\min_{y'} L(y, y') \le h$, where h is a predefined threshold and y′ is a record in the synthetic dataset. Precision and recall metrics are then computed based on the number of matched records that are in the training dataset. These can be combined through their harmonic mean into an F1 score:
$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (1)$$
The advantage of the F1 score is that it provides a single metric that can be used for decision-making and optimization during the training of the generative model.
The F1 score computed using this method is an estimate of the expected success that an adversary would have when performing the membership disclosure attack in Figure 1.
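The estimation procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the integer-coded categorical arrays, and the use of "at or below h" as the matching rule are choices made for the example, and the synthetic dataset is passed in rather than generated.

```python
import numpy as np

def min_hamming_distance(attack, synthetic):
    """Smallest Hamming distance from each attack record to any synthetic record.

    Both inputs are 2-D integer arrays (records x variables). The pairwise
    comparison is done by broadcasting, so this is only suitable for small data.
    """
    differs = attack[:, None, :] != synthetic[None, :, :]
    return differs.sum(axis=2).min(axis=1)

def partition_f1(train, holdout, synthetic, t, m, h, rng):
    """F1 score of the partitioning method for a given proportion t.

    m * t attack records come from the training sample and the remainder
    from the holdout sample. The matching rule (distance <= h) is an assumption.
    """
    n_train = int(round(m * t))
    n_hold = m - n_train
    train_rows = rng.choice(len(train), size=n_train, replace=False)
    hold_rows = rng.choice(len(holdout), size=n_hold, replace=False)
    attack = np.vstack([train[train_rows], holdout[hold_rows]])
    is_member = np.concatenate([np.ones(n_train, dtype=bool),
                                np.zeros(n_hold, dtype=bool)])

    matched = min_hamming_distance(attack, synthetic) <= h
    true_pos = np.sum(matched & is_member)
    n_matched = matched.sum()
    n_members = is_member.sum()
    precision = true_pos / n_matched if n_matched else 0.0
    recall = true_pos / n_members if n_members else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Passing the desired value of t makes it straightforward to repeat the calculation across several proportions and compare the resulting F1 scores.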

Parameterizing the partitioning method
Previous work using the partitioning method had set t = 0.5. 8,49,50,53,55,56 In the analysis below, we show theoretically and empirically that the accuracy of the estimate of the membership disclosure F1 using the partitioning method is dependent on the value of t, and that there is a valid value of t that is consistent with the sample that an adversary would obtain when constructing an attack dataset, irrespective of the size of the attack dataset.
Assume that the real dataset is a set of records R of size n = |R|, and that a synthetic version of this dataset is generated, denoted by the set of records S. The attacker has another dataset represented by D, which is the attack dataset, and we let m = |D|. Both R and D are independent random samples from the same population, which consists of the set of records P, where N = |P| is the size of the population.
Figure 1. The (ground truth) process for a membership disclosure attack which accounts for the fact that the attack dataset will be sampled independently from the same population as the real dataset. The attack dataset is matched with the synthetic dataset to infer which records are in the real dataset.
With the above setup, the probability that there are k individuals in the overlap of R and D, such that k = |R ∩ D|, can be expressed as a hypergeometric distribution:
$$P(k) = \frac{\binom{n}{k}\binom{N-n}{m-k}}{\binom{N}{m}} \qquad (2)$$
and this hypergeometric distribution has an expected value mn/N. Dividing by the attack dataset size m, the proportion of individuals from the attack dataset that can plausibly exist in the real dataset is therefore n/N, which is the sampling fraction of the real dataset from the population. This means that an adversary sampling an attack dataset from the same population as the real dataset will have an expected proportion of t = n/N of records in the attack dataset that are also in the real dataset. For the data custodian to correctly assess membership disclosure, that same proportion that the adversary will have should be used to give a correct estimate of the F1 score. Unless the real dataset represents 50% of the population, setting t = 0.5 will not provide an attack dataset that is reflective of the expected attack dataset that an adversary would have in practice. In the empirical assessment below, we demonstrate the differences in the calculation of the F1 score from the ground truth when t = 0.5.
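The expected overlap can be checked numerically. The sketch below assumes the standard hypergeometric mass function for k = |R ∩ D|, with n records of the population of size N belonging to R and m records drawn for D; the function names and the example sizes are illustrative only. Exact rational arithmetic is used so that the result can be compared with mn/N without rounding error.

```python
from fractions import Fraction
from math import comb

def overlap_pmf(k, N, n, m):
    """Probability that exactly k of the m attack records are also in R."""
    return Fraction(comb(n, k) * comb(N - n, m - k), comb(N, m))

def expected_overlap(N, n, m):
    """Expected size of the overlap between R and D."""
    k_min = max(0, m - (N - n))   # fewest possible shared records
    k_max = min(n, m)             # most possible shared records
    return sum(k * overlap_pmf(k, N, n, m) for k in range(k_min, k_max + 1))

# Illustrative sizes: population of 200, real dataset of 50, attack dataset of 20.
N, n, m = 200, 50, 20
expected_k = expected_overlap(N, n, m)
expected_t = expected_k / m       # expected proportion of D that is in R
print(expected_k, expected_t)     # prints: 5 1/4
```

In this example the expected proportion is 1/4, which equals the sampling fraction n/N = 50/200 rather than 0.5.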

Empirical demonstration
In this empirical demonstration, we simulate an actual membership disclosure adversary attack on synthetic datasets as illustrated in Figure 1 and compute the F1 score for the adversary. This is the ground truth in that it provides the correct F1 success rate of a membership disclosure attack.
This ground truth simulation assumes that the adversary samples 1000 records randomly from the population and matches these records with the synthetic records. Records that match are claimed to also be in the training dataset. The claims are evaluated by computing the F1 score. This process models the adversary behavior of the membership disclosure attack as defined in the literature.
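A single iteration of this ground truth attack might look like the following sketch. It is an illustration under assumptions rather than the code used in the study: the function name, the representation of the population as an integer-coded array, the identification of real records by row index, and the "at or below h" matching rule are all hypothetical choices.

```python
import numpy as np

def ground_truth_f1(population, real_idx, synthetic, m, h, rng):
    """F1 score of an adversary who samples m targets from the population.

    population : 2-D integer array (records x variables)
    real_idx   : row indices of the population records that form the real dataset
    synthetic  : 2-D integer array generated from the real dataset
    """
    # The adversary draws targets from the same population as the real dataset.
    attack_idx = rng.choice(len(population), size=m, replace=False)
    attack = population[attack_idx]
    is_member = np.isin(attack_idx, real_idx)

    # Match each target to its nearest synthetic record (Hamming distance).
    # Broadcasting builds an m x |S| x variables array, so keep inputs small.
    distance = (attack[:, None, :] != synthetic[None, :, :]).sum(axis=2).min(axis=1)
    claimed = distance <= h   # assumed matching rule

    true_pos = np.sum(claimed & is_member)
    if true_pos == 0:
        return 0.0
    precision = true_pos / claimed.sum()
    recall = true_pos / is_member.sum()
    return 2 * precision * recall / (precision + recall)
```

Averaging the returned value over repeated calls gives a mean F1 score for the simulated adversary.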
We then simulate the partitioning method illustrated in Figure 2 while varying the value of t and also compute the F1 score each time. For this simulation, we randomly select a value of t between 0 and 1 for each iteration of the simulation.
The 2 approaches are then compared to determine when they give the same results (ie, at what value of t are the F1 score values the same). This is the value of t that should be used when computing membership disclosure using the partitioning method since that is the value which gives the same result as the ground truth.
For these simulations, we ran 50 iterations for each study point where we varied the parameters as follows: (1) the t parameter was varied randomly from 0 to 1, (2) the size of the attack dataset was fixed at 1000 observations, although when we varied that parameter it had no impact on the results as we just need sufficient observations to get a stable value for F1, (3) the training dataset size was set to 5k, 15k, and 25k, (4) the Hamming distance threshold was set to 5, which is within the range of values commonly used in the literature, 61 (5) 2 generative models were used, and (6) 4 different datasets were used.
The first type of generative model was a sequential tree-based synthesizer. 62 This has been used to synthesize health and social sciences data, [63][64][65][66][67][68][69][70][71] and applied in research studies on synthetic data. 63,72,73 The second is CTGAN, 74 which is a generative adversarial network (GAN) architecture. GANs have been applied often for the synthesis of health data. 41,49,50,53,55,75 These 2 types of generative models are representative of those used in practice.
We used 4 datasets as the population in our simulations summarized in Table 2. These datasets were selected as they reflect heterogeneous data collection contexts including care settings, public health, and surveys. They also vary in data complexity. We set up the dataset sizes so that there is realistic variation in the sampling fractions of the real datasets that were used.

Interpreting the partitioning method F1 score
The F1 score is known to depend on the distribution of positive classes, which in our case is the proportion of real records in the attack dataset. This means that the F1 value by itself will not have a consistent interpretation across different datasets with varying distributions.
We propose to interpret the obtained F1 score relative to the maximum that the adversary would obtain with no background knowledge about the real dataset. The highest F1 score that can be obtained by the adversary with no background knowledge would be if they classify all of the records in the attack dataset as being in the real dataset. This value would be obtained irrespective of any synthesis; its only assumption is that the adversary has drawn a sample of targets from the same distribution as the training dataset, and it does not depend on the availability of a synthetic dataset. In such a case the maximum F1 score from classifying all records in the attack dataset as being in the real dataset would be:
$$F_{max} = \frac{2 \times \frac{n}{N}}{1 + \frac{n}{N}} = \frac{2n}{n + N} \qquad (3)$$
As N grows for a fixed n, $F_{max} \to 0$, and $F_{max} = 1$ when n = N. This maximum F1 score is a function of the proportion of the population that is in the real dataset. The larger that proportion, the greater the success of the adversary using this naïve strategy. That is not surprising: the more individuals there are in the real dataset, the more likely it is that claiming a randomly selected person from the population is in the training dataset will be correct.
Note that if the adversary randomly assigns attack records to a training dataset based on a probability of 0.5, their F1 score would be lower than Equation (3). Therefore, there is no reason for an adversary to follow that suboptimal approach.
The naïve maximum value in Equation (3) will be the case even if another privacy enhancing technology was used instead of synthetic data generation. For example, if risk-based deidentification methods were used 76 the maximum F1 score from a naïve membership disclosure attack would be the same. Under this naïve attack recall is by definition equal to 1 and precision is equal to n/N.
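The naïve baseline can be worked through numerically. The sketch below uses the values stated above for the naïve attack (recall of 1 and precision of n/N) and, for the comparison with random assignment at probability 0.5, assumes an expected precision of n/N and an expected recall of 0.5. The function names and the example sizes are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_max(n, N):
    """F1 when every attack record is claimed to be in the real dataset."""
    return f1_from(precision=n / N, recall=1.0)

def f_random_half(n, N):
    """Approximate F1 when each attack record is claimed with probability 0.5.

    Uses expected precision (n / N) and expected recall (0.5) as a simplification.
    """
    return f1_from(precision=n / N, recall=0.5)

# Illustrative sizes: a real dataset of 5000 records from a population of 20000.
n, N = 5000, 20000
print(f_max(n, N), f_random_half(n, N))
```

With a sampling fraction of 0.25 the naïve strategy gives 2n/(n + N) = 0.4, while random assignment gives roughly 0.33, which illustrates why the adversary has no reason to prefer the random strategy.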
The F1 score produced using the partitioning method can be interpreted with respect to this maximum value. We define a corrected F1 score to reflect membership disclosure that is similar in construction to other metrics such as Cohen's Kappa, 77 and denote it with M:
$$M = \frac{F1 - F_{max}}{1 - F_{max}} \qquad (4)$$
Note that M is undefined when $F_{max} = 1$ since no additional improvements are possible.
Previous researchers have used a 20% improvement over a naïve baseline as an acceptable threshold for membership disclosure risk for synthetic data, 8 and therefore, we can use that as a cutoff. In such a case, we would define acceptable membership disclosure risk as M ≤ 0.2. If the M value is negative then the adversary actively matching the attack dataset with the synthetic dataset would produce results worse than the naïve approach, which means that using the synthetic dataset in a membership disclosure attack reduces the relative success of the adversary.
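As a worked illustration, the corrected score and the acceptability check can be written as two short functions. The sketch assumes a kappa-style form, M = (F1 − F_max) / (1 − F_max), with F_max = 2n/(n + N) following from a recall of 1 and a precision of n/N, and it treats values at or below 0.2 as acceptable. The function names and example numbers are hypothetical.

```python
def corrected_f1(f1_score, n, N):
    """Kappa-style correction of an F1 score against the naive maximum.

    Assumes M = (F1 - F_max) / (1 - F_max) with F_max = 2n / (n + N).
    """
    f_max = 2 * n / (n + N)
    if f_max == 1:
        raise ValueError("M is undefined when F_max equals 1 (n == N)")
    return (f1_score - f_max) / (1 - f_max)

def is_acceptable(m_value, threshold=0.2):
    """True when the corrected score is at or below the threshold."""
    return m_value <= threshold

# Illustrative values: an F1 of 0.45 for n = 5000 records from N = 20000.
m_value = corrected_f1(0.45, n=5000, N=20000)
print(m_value, is_acceptable(m_value))
```

A partitioning F1 below the naïve maximum yields a negative M, which corresponds to the case where matching against the synthetic data performs worse than the naïve strategy.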

RESULTS
The graphs in Figures 3-6 show the results for the COVID, Washington, CCHS, and Nexoid datasets, respectively. The plots show the F1 score using the partitioning method as the t value is varied. The ground truth F1 score based on a simulation of a membership disclosure attack by an adversary is relatively fixed across iterations since it is not affected by the value of t.
Our results show that: (a) The t value for the partitioning method has a nontrivial impact on the F1 score, with the values varying significantly across the range. (b) The partitioning method only gives the same F1 score as the ground truth when t = n/N, which is where the 2 lines in the plots intersect. (c) Setting t = 0.5 would not give us correct estimates of the actual membership disclosure F1 score, and sometimes the error can be quite large. Depending on whether t = 0.5 is above or below the t = n/N point, the F1 score of the partitioning method can be substantially higher or lower than the ground truth value.
These results are consistent across the datasets, generative models, and dataset sizes.
An adversary with a random sample of target individuals from the population will not achieve t = 0.5 all the time and therefore always using that value will not give a true reflection of the performance of an adversary attack. When the sampling fraction is equal to 50%, the default partitioning method with t = 0.5 gives the same result as the ground truth.
To further demonstrate that the correct parameterization of the partitioning method is t = n/N, Table 3(a) shows the mean values of the F1 score from the ground truth simulation and from another simulation where we set t = n/N with the same number of iterations. As can be seen, the F1 score is very similar between the 2, further supporting the conclusion that this is an accurate reflection of the performance of a membership disclosure attack.
The M values for our datasets are shown in Table 3(b). All the values are below the threshold. The specific membership disclosure value is a function of the combination of dataset complexity and the generative model that is used.

Summary
Membership disclosure is considered an important privacy risk for synthetic data, and needs to be evaluated before such datasets can be used and disclosed for secondary purposes. The partitioning method for estimating membership disclosure makes reasonable assumptions about the information that an adversary has access to and is often used in the literature.
The partitioning method splits the real dataset into a training dataset and a holdout dataset. The training dataset is used to generate the synthetic data. An attack dataset is constructed with a certain proportion of it from the training dataset, and the rest from the holdout. The default in the literature is a proportion of 0.5. Then a matching exercise between the attack and synthetic dataset is performed. Matches are predicted to be in the training dataset, and the accuracy of that prediction is evaluated using an F1 score.
We showed theoretically and empirically through simulations that the proportion of training records included in the attack dataset has a nontrivial impact on the accuracy of the F1 score, and to give valid results this proportion must be equal to the sampling fraction of the real dataset from the population. If this condition is not met, the F1 score can be quite inaccurate and does not reflect the results that an adversary would obtain in practice. An interpretable adjustment of the F1 score computed through this approach was proposed. This enables data custodians to determine whether their membership disclosure values are acceptably small or not.
The work in this article built on existing methods for assessing membership disclosure while addressing a common assumption in its calculation that has resulted in potentially inaccurate results. Our approach provides a validated and interpretable metric that can be applied on synthetic datasets.
It is necessary to provide a value for the population size. This can be defined by the prevalence of a disease in a particular geography, for example. We demonstrate this in the applications below.

Applications
We demonstrate the application of this membership disclosure metric on 7 oncology clinical trial datasets. Given the increasing interest in making clinical trial datasets available, [78][79][80] the objective was to determine what the privacy risks would be for synthetic variants, and whether these risks would be deemed acceptably small. The 7 datasets we examine are from Project Data Sphere (see https://data.projectdatasphere.org/). 81 The population was defined as other similar trials, which is consistent with Health Canada recommendations for defining the reference population in privacy risk assessments. 79 For each trial, we identified other trials in the same therapeutic area over the same period and with overlapping geographies from <clinicaltrials.gov>. The sequential synthesis method was used to synthesize these trial datasets.
As can be seen from the results in Table 4, the membership disclosure risks are consistently below the threshold that we had defined earlier. These results suggest that sequential synthesis can be a useful generative approach for mitigating the membership disclosure risks of oncology clinical trial datasets, and can enable their broader sharing within the research community. This is appealing given that previous results have shown that sequential synthesis can have good utility for oncology clinical trial data 11 and for observational datasets. 82
This example application demonstrates how the population size was determined for clinical trial datasets. In the general context of disclosure risk estimation models, it is common to have to provide a population size value. [83][84][85] In cases where determining the population size is not obvious, one default option is to use the geographic population size for the region that is covered by the dataset. If the population size is underestimated then that results in an overestimation of membership disclosure risk, and if the population size is overestimated then that results in an underestimation of membership disclosure risk. Therefore, to err on the conservative side it is preferable to use a lower value for the population size when there is uncertainty. In the case of the default option for determining population size, this means selecting the smallest region that covers a dataset.

Risk mitigation
For a specific dataset, it is possible to ensure that the membership disclosure risk is acceptably small by incorporating the M metric in a risk-utility loss during hyperparameter tuning of the generative model while it is being trained. The following loss metric, $loss_{RU}$, can be used, where U is some validated utility metric 82 and [·] are Iverson brackets. This loss proportionally penalizes the utility if the membership disclosure is above the 0.2 threshold using a sigmoid function. If the risk is at or below 0.2, then the loss is equal to the utility since in that case the risk is deemed acceptable. If the risk is slightly above the 0.2 threshold then the loss is almost equal to the utility, and starts to decrease monotonically as the risk grows. The advantage of $loss_{RU}$ is that privacy considerations are integrated within model development rather than being a post hoc assessment.
Table 4. Trial descriptions, population sizes, and M values.
Imatinib is an FDA approved protein-tyrosine kinase inhibitor for treating certain cancers of the blood cells. This drug is hypothesized to be effective against GIST as imatinib inhibits the kinase which experiences gain of function mutations in up to 90% of GIST patients. 86 At the time of this trial the efficacy of imatinib for GIST as well as the optimal dosage for treatment of GIST was unknown.
Trial #2 (NCT01124786): Clovis Oncology
Most pancreatic cancer patients have advanced inoperable disease and potentially metastases. At the time of this trial the first line therapy for patients with inoperable disease was gemcitabine monotherapy. One transporter (hENT1: human equilibrative nucleoside transporter-1) has been identified as a potential predictor of successful treatment via gemcitabine. This trial compares standard gemcitabine therapy to a novel fatty acid derivative of gemcitabine. This is hypothesized to be superior to gemcitabine in metastatic pancreatic adenocarcinoma patients with low hENT1 activity as it exhibits anticancer activity independent of nucleoside transporters like hENT1, while gemcitabine seems to require nucleoside transporters for anticancer activity.
Trial #6 (NCT00119613): Amgen
This was a randomized and blinded Phase 3 trial aimed at evaluating whether "increasing or maintaining hemoglobin concentrations with darbepoetin alfa" improves survival among patients with previously untreated extensive-stage small cell lung cancer. The treatment group received darbepoetin alfa with platinum-containing chemotherapy, whereas the control group received placebo instead of darbepoetin alfa.
Population size: 16 484 (n = 479). M: −0.0322.
Trial #7 (N0147): NCCTG
This was a randomized trial of 2686 patients with stage 3 colon adenocarcinoma that were randomly assigned to adjuvant regimens with or without Cetuximab. After resection of colon cancer, Cetuximab was added to the modified 6th version of the FOLFOX regimen including oxaliplatin plus 5-fluorouracil and leucovorin (mFOLFOX6), fluorouracil, leucovorin, and irinotecan (FOLFIRI), or a hybrid regimen consisting of mFOLFOX6 followed up by FOLFIRI. 90 Our focus is on the secondary retrospective analysis of N0147 (the published secondary analysis). 91
Population size: 27 526 (n = 1543). M: 0.052.
Note: The population includes the specific study participants. The n value indicates the number of trial participants for which we had data available.

Limitations
In our analysis, we considered the mean results across our simulations. The variation across the iterations was largely driven by sampling variability. These results do not account for the worst-case situation, but only the average performance of our membership disclosure metric.
Our membership disclosure metric is applicable to tabular data, which is consistent with the literature thus far. 8,49,50,51,53,55,56 Future work should evaluate and extend these membership disclosure estimators to longitudinal datasets.
There are other types of privacy risks in synthetic data beyond just membership disclosure, such as attribution risks. 43,48 In practice all privacy risks should be considered when assessing synthetic datasets.

AUTHOR CONTRIBUTIONS
KEE and LM designed the study; KEE, LM, and XF performed the analysis, and wrote the paper. All authors approved the final manuscript.

ETHICS AND CONSENT TO PARTICIPATE
This study was approved by the CHEO Research Institute Research Ethics Board protocol CHEOREB# 21/139X. All research was performed in accordance with relevant guidelines/regulations. This study only used deidentified data. Patient consent was not required by the IRB.