Robust fingerprinting of genomic databases

Abstract Motivation Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme aiming at achieving liability guarantees when sharing genomic databases. Thus, we are motivated to fill in this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint (distort the steganographic marks, i.e. the embedded fingerprint bit-string) by launching effective correlation attacks, which leverage the intrinsic correlations among genomic data (e.g. Mendel’s law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks. Results Via experiments using a real-world genomic database, we first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits by causing a small utility loss (e.g. database accuracy and consistency of SNP–phenotype associations measured via P-values). Next, we experimentally show that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP–phenotype associations, the attacker can only distort about 30% fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques also preserve the utility of the shared genomic databases, e.g. the mitigation techniques only lead to around 3% loss in accuracy. Availability and implementation https://github.com/xiutianxi/robust-genomic-fp-github.


I. INTRODUCTION
Genomic database sharing is critical in modern biomedical research, clinical practice, and customized healthcare.However, it is generally not viable due to the copyright and intellectual property concerns from the database owners.In other words, the requirements of copyright protection and antipiracy may prevent genomic data holders from sharing their data, which may hinder the progress of cooperative scientific research.
Digital fingerprinting is a technology that allows to claim copyright, deter illegal redistribution, and identify the source of data breaches (i.e., the guilty party who is responsible for the leakage) by embedding a unique mark into each shared copy of a digital object.Although the most prominent usage of fingerprinting is for multimedia [6], [7], [15], fingerprinting techniques for databases have also been developed [11], [16], [20], [21].These techniques change database entries at different positions when sharing a database copy with a new service provider (SP).If the SP shares its received copy without authorization, the database owner can use the inserted fingerprints to hold the guilty SP responsible.

A. Challenges in Genomic Database Fingerprinting
Existing fingerprinting schemes for databases have been developed to embed fingerprints in continuous numerical entries (floating points) in relational databases, e.g., [11], [19], [20].Whereas, fingerprinting discrete (or categorical) values is more difficult, as the number of possible values (or instances) for a data point is much fewer.Hence, in such databases, a small change in data points (as a fingerprint) can significantly affect the utility.Fingerprinting becomes even more challenging when it comes to a genomic databases, which contain even fewer values, e.g., 4 instances (A, G, C, T) when considering nucleobases, and 3 instances (0, 1, 2) when considering number of minor alleles for each single nucleotide polymorphism -SNP (Section III gives more details about the type of genomic data considered in this work).In real world, leaked genomic databases often end up being sold or publicly shared on the internet [22].Once that happens, the genomic database owner wants to find out the traitors who should be responsible for the data leakage by extracting the their fingerprints in the leaked databases.Thus, in this paper, we first propose a vanilla genomic database fingerprinting scheme by (i) taking into account of the abundant attributes of genomic sequence, and (ii) extending the state-of-the-art database fingerprinting scheme [20].Our vanilla scheme is more robust against common attacks targeted on database fingerprinting schemes (e.g., random bit flipping attack and subset attack [1], [19]) than previously developed fingerprinting schemes for generic relational databases such as [11], [20], [21], because we can insert denser fingerprint for each selected genomic data by introducing a new parameter, which controls the percentage of fingerprinted entries for selected rows (see Section IV-A).Whereas, [20] only fingerprints one attribute for all selected rows.Compared with [20], we also assign higher confidence score during fingerprint extraction considering that genomic databases usually contains more attributes than generic databases (see Section IV-B).
In addition, existing fingerprinting schemes for databases do not consider various inherent correlations between the data records in a database.In our previous work [14], we have shown that a malicious party having a fingerprinted copy of a database can detect and distort the embedded fingerprints using its knowledge about the correlations in the data entries.Genomic databases contain even richer row-and column-wise correlations due to the biological characteristics.In particular, the row-wise correlations arise from (i) the Mendel's law, and (ii) similarities of genomes among family members.The column-wise correlations are the pairwise correlation between genomic data points at different locations (e.g., linkage disequilibrium [24]).In this paper, we use Atk row (S) and Atk col (J ) to represent the correlation attacks using the row-and column-wise correlations, respectively, where S and J denote the corresponding correlation model and they are assumed to be publicly known (in Section III-B, we describe these two attacks in detail).In Section VI, we consider a real world genomic database and show that by launching Atk row (S) and Atk col (J ) in sequence, a malicious SP can easily compromise more than half of the bits in a fingerprint string at the cost of only changing about 5% of the entries in the genomic databases.As a result, we also need to make the proposed vanilla genomic database fingerprinting scheme be robust against the correlation attacks in order to lay a solid foundation for genomic data sharing.

B. Our Solution
In this work, to address the unique challenges of robust fingerprinting of genomic databases, i.e., mitigating Atk row (S) and Atk col (J ), we develop mitigation techniques for each of them, i.e., Mtg row (S) and Mtg col (J ).These techniques utilize the correlations among genomic data, i.e., Mendel's law, S, and J , and they work as post-processing steps for our developed vanilla scheme.Besides, they only modify nonfingerprinted entries in the genomic databases.Thus, they do not reduce the robustness of the vanilla scheme.Note that the proposed robust genomic database fingerprinting scheme in this paper is not just a simple application of our previous work [14] for genomic databases, because the correlation models considered in this paper are different compared to the generic models we have [14], and thus they require new mitigation techniques to make the fingerprinted genomic databases match the Mendel's law and genome-specific correlations.In Table I, we summarize the differences between the proposed robust genomic database fingerprinting scheme and previous schemes.
In particular, Mtg row (S) is composed of two phases.First, it checks all fingerprinted genomic data-tuples of family Properties [20] [14] this paper

Flexible density in marked attributes
Higher confidence in extraction Genome-specific correlation Mtg col (J ) considers all attributes (columns) of the genomic databases, obtains the empirical marginal distributions after the vanilla fingerprinting, and changes non-fingerprinted entries in each attribute to make the empirical marginal distributions resemble the marginal distributions obtained by marginalizing the joint distributions provided in J .Similar to our previous work [14], the database owner selects the non-fingerprinted entries and modifies their value by solving a linear programming problem which is discussed in Section V-B.
In Section VI, we show that by applying Mtg row (S) and Mtg col (J ) after our proposed vanilla fingerprinting scheme, the malicious SP can hardly distort large portion of the fingerprint bits anymore, even if it introduces significant utility loss in the database (e.g., by decreasing database accuracy and making the SNP-phenotype associations less consistent compared to the ground-truth measured on the original -nonfingerprinted -database).For example, if the malicious SP compromises around 24% accuracy and 20% consistency in SNP-phenotype associations, it can only distort about 30% fingerprint bits.This implies that the malicious SP will be held responsible for the genomic database leakage with high probability [14].
Contributions and Broader Impact.To the best of our knowledge, our work is the first to investigate the liability issue when sharing genomic relational databases (i.e., the collection of genomic data of individuals with the same attributes) and at the same time address the threats of correlation attacks due to the unique biological characteristics.
Our proposed robust genomic database fingerprinting scheme helps facilitate the development of genomic research which requires large-scale genomic data analyses, and is increasingly relying upon the sharing of genomic databases with various SPs.The ideas developed in this paper also shed light on sharing other sensitive biomedical databases, e.g., electrocardiogram and electrooculogram data samples, where the data correlations are determined by the spatio-temporal dependency between data records.In this case the correlations can be characterized by autocorrelation and cross-correlations or they are modeled as a Markov process.We will also study these types of databases in the future.
Roadmap.We review related works on database fingerprinting schemes in Section II.In Section III, we present the system and threat models, and evaluation metrics.Section IV introduces the foundation of our scheme, i.e., the vanilla fingerprinting scheme for genomic databases.Then, in Section V, we consolidate this foundation by developing mitigation techniques against various correlation attacks on the genomic database.In Section VI, we show the vulnerabilities of the vanilla fingerprinting scheme for genomic databases, and also demonstrate the performance of mitigation techniques from both database utility and fingerprint robustness perspective.Section VII discusses the limitations of the proposed scheme and points out future research directions.Finally, Section VIII concludes the paper.

II. RELATED WORK
The seminal work of database fingerprinting is proposed by [1], which assumes that the database consumer can tolerate a small amount of error in the marked databases.Then, based on [1], some variants have been proposed [11], [20], [21].For instance, [20] develop a database fingerprinting scheme by extending [1] to enable the insertion and extraction of arbitrary bit-strings in relations.However, these works do not consider the correlations among data entries, which makes them vulnerable to correlation attacks [14].Records in Genomic databases usually have much stronger correlations caused by Mendel's law and linkage disequilibrium, which make the genomic database prone to correlation attacks.
Recently, some works have explicitly taken the genomic data correlations into account in fingerprinting scheme design.In particular, [31] develop a probabilistic fingerprinting scheme by considering the conditional probabilities between genomic data points of a single individual.[2] propose an optimization-based fingerprinting scheme for sharing personal genomic sequential data by jointly considering collusion attack and data correlation.However, these two works focus on the genomic data of an individual, instead of a genomic database, where individuals may have kinship relationships.[25] develop a watermarking scheme for sequential genomic data based on belief propagation which considers the privacy of data and the robustness of watermark requirements at the same time.[14] propose mitigation techniques against general correlation attacks targeted on generic relational databases and show that the proposed technique can be applied after any existing relational database fingerprinting scheme to achieve robustness against correlation attacks.However, the above mentioned works cannot be directly applied to genomic database fingerprinting, because they fail to consider the characteristics that are unique to the genomic data.Particularly, (i) hereditary units governed by the Mendel's law can be utilized by the malicious SP to further infer the potentially fingerprinted locations.(ii) limited values of genomic data also makes the utility-preserving fingerprinting a challenging task.Thus, in this paper, we first show the vulnerability of genomic database fingerprinting against correlation attacks that take advantage of the Mendel's law and linkage disequilibrium.Then, we discuss how to mitigate the identified attacks in a way that the utility of the database is preserved.III.SYSTEM, THREAT MODEL, AND SUCCESS METRICS Now, we discuss the genomic database fingerprinting system, the considered various threats, and fingerprint robustness and utility metrics.

A. Genomic Database Fingerprinting System Model
We consider a database owner (Alice) with a genomic database (e.g., dbSNP [30]) including single-nucleotide polymorphisms (SNPs) of a certain population, i.e., each row corresponds to the SNP sequence of a specific individual.Each individual has two alleles for each SNP position, and each of these alleles are inherited from one of their parents.Thus, each SNP (i.e., each entry of the database) can be represented by the number of its minor alleles as 0, 1, or 2, and can be encoded as "00", "01", or "10", respectively.In this paper, we focus on sharing SNP databases, because such databases are critical to many genomic and medical research [23], e.g., genome-wide association studies [5].The techniques developed in this paper can be applied to other types of genomic databases (e.g., ones including nucleotides that may contain 4 values A, G, C, or T) by simply changing the data coding.
We present the system model for genomic database fingerprinting in Figure 1.We denote the genomic database owned by Alice as R. When Alice wants to share the database with various medical service providers (SPs), she includes a unique fingerprint in each copy of her database.The fingerprint bit-string customized for each SP is a randomly generated binary bit-string (elaborated in Section IV-A).The fingerprint essentially changes different entries in R at various positions (indicated by the yellow dots).The fingerprint generated for the ith SP (SP i ) is f SPi , and the fingerprinted genomic database received by SP i is represented as R fSP i .We also use R to represent a generic fingerprinted genomic database.
In real world applications, some of the SPs may be malicious (e.g., SP i in Figure 1) who will redistribute its received genomic database copy after conducting certain attacks (discussed in Section III-B) on top of it.System with Vanilla Fingerprinting.If SP i compromise R fSP i via random bit flipping attack, Alice is able to identify SP i as the traitor by extracting a large portion of its fingerprint from the leaked database by only using the proposed vanilla Fig. 1: The genomic database fingerprinting system, where Alice adds a unique fingerprint in each copy of her genomic database R when sharing.The inserted fingerprint will change entries at different locations (indicated by the yellow dots) in R.She is able to identify the malicious SP who pirates and redistributes her database using the customized fingerprint.fingerprint scheme (Section IV).However, if SP i launches correlation-based attacks on R fSP i , it can avoid being accused of data leakage with large probability.System with Robust Fingerprinting.The correlation-based attacks can be effectively mitigated, if R fSP i is generated by adopting our proposed robust fingerprinting scheme, and thus, SP i will still be held responsible for illegal redistribution.We will empirically evaluate these using a real world genomic database in Section VI.

B. Threat Model
Fingerprinted databases are subject to various attacks summarized in the following.Note that in all considered attacks, a malicious SP can change/modify most of the entries in R to distort the fingerprint (and to avoid being accused).However, such a pirated database will have significantly poor utility measured in terms of database accuracy and consistency of SNP-phenotype association (see Section III-D).Thus, a rational SP will try to get away with making pirated copies of the genomic databases by changing as few entries as possible in order to maintain high utility for the pirated database and gain illegal profit.Random Bit Flipping Attack.In this attack, to pirate a database, a malicious SP selects random entries in its received copy of the genomic database and flips their bit values [1].For example, a SNP value 2 ("10") becomes 0 ("00") after the attack.Row-wise correlation attack Atk row (S).As discussed in Section I, a malicious SP may utilize Mendel's law and similarities among family members' genomes to infer the potentially changed loci in the fingerprinted database.Thus, we assume that the malicious SP has access to the sets of families in the database as well as the genome similarities (denoted as S) among family members in each set.Note that this is a valid assumption, because quite a few works have shown that kinship or familial relationships from SNP genotyping data can be inferred with very high accuracy for small and medium size groups (e.g., dozens or hundreds of individuals [10], [26], large size populations [29], or even worldwide [18]) or such information can be obtained directly from the metadata.
Thus, upon receiving a fingerprinted copy of the genomic database, the malicious SP will check all SNPs at the same loci of family members and then flips SNPs at the loci that violate Mendel's law.For example, if both parents have SNP value 0, but their children have SNP value 1 at the same locus, then, the malicious SP knows that one or more family members' SNP values have been changed with high probability due to fingerprint insertion, since such change can be due to a mutation with a slight probability.Next, the malicious SP can further calculates the empirical row-wise correlations (denoted as S ) from the received fingerprinted copy, compares S with S, and changes entries that leads to large discrepancy between them.
Column-wise correlation attack Atk col (J ).We model the publicly known allelic associations (linkage disequilibrium) between SNP values at different loci as a set of joint distributions J .Once a malicious SP receives the fingerprinted database, it can calculate a new set of empirical pair-wise joint distributions J .Then, it compares J and J , and flips the entries in the fingerprinted copy that leads to large discrepancy between them.
In this paper we do not consider other common attacks, such as the subset attack and superset attack [20], because they are usually much weaker than the random bit flipping attack as shown in [31].Another widely investigated attack is the collusion attack [3], [4], [27].Our proposed robust fingerprinting scheme for genomic databases can also be extended to collusion-resistant genomic database fingerprinting by incorporating collusion-resistant codewords when generating the fingerprint [3].We will extend our work in the scenario of colluding medical SPs in future.

C. Fingerprint Robustness Metrics
The primary goal of a malicious SP is to distort the fingerprint in its received copy to avoid being accused.We use the percentage of compromised fingerprint bits, i.e., Per cmp , to measure the robustness of the fingerprint scheme.Per cmp calculates the percentage of mismatches between the fingerprint bit-sting extracted from the compromised fingerprinted database and the original fingerprint bit-string that is used to generate the fingerprinted genomic database.In our previous work [14], we have shown that if the malicious SP can compromise more than 50% of the fingerprint bits, it can cause the database owner to accuse other innocent SPs who also received the databases.In this paper, we only focus on Per cmp , because other robustness metrics (e.g., the accusable ranking of a malicious SP) directly depends on Per cmp [14].

D. Utility Metrics
Fingerprinting naturally changes the content of databases (i.e., the values of the SNPs), and hence degrades its utility.We quantify the utility of a fingerprinted genomic database using the following metrics.Accuracy of the database, i.e., Acc.It calculates the percentage of matched data entries between the original genomic database and the fingerprinted copy (or the compromised fingerprinted copy, i.e., the pirated copy generated by a malicious SP).The higher Acc, the fewer entries are changed due to fingerprint insertion, attack from the malicious SP, or mitigation to resist the attacks, and thus, the higher the utility.Consistency of SNP-phenotype association.GWAS (genome-wide association study) is a widely adopted method to identify genetic variations that are associated with a particular phenotype (e.g., a disease).In GWAS, a researcher usually quantifies the associations between a phenotype and each SNP in the database using a p-value with a confidence level of 95% [12], [28].In particular, SNPs with low p-values (typically smaller than 0.05) are considered to have strong associations with the phenotype (i.e., the association cannot be due to chance).In general, a larger utility loss in terms of accuracy degradation will lead to less accurate SNP-phenotype association.To evaluate the p-value of each SNP in the genomic database, we first randomly divide the database into non-overlapping case (denoted as S) and control (denoted as C) groups, and then follow the steps listed in (1) to perform the calculations.
In particular, in (1), OR is the odd ratio, C 0 , C 1 , and C 2 (or S 0 , S 1 , and S 2 ) are the numbers representing a specific SNP taking a value of 0, 1, and 2 in the control (or case) group.StdErr(ln(OR)) is the standard error of the logarithm of the odd ratio, and z is interpreted as the standard normal deviation (i.e., z-value).Finally, the p-value is the area (probability) of the normal distribution that falls outside ±z, and it can be obtained using Ψ(•); the cumulative distribution function of the standard normal distribution.
To evaluate the utility of the genomic database, we identify the top-50 SNPs (i.e., the 50 SNPs with the lowest p-values) from the original (non-fingerprinted) database.Then, we check how many of such SNPs are preserved (i.e., remains to be the top-50 SNPs) after fingerprinting or various attacks.Note that we only consider the consistency of SNP-phenotype association for individual SNPs (i.e., not the consistency of SNP-phenotype association for SNP tuples).This is because the proposed mitigation techniques can preserve the Pearson's correlations among each pair of SNPs (see Section V-B for details).

IV. THE FOUNDATION: VANILLA GENOMIC FINGERPRINTING SCHEME
Now, we establish the foundation of our robust fingerprinting scheme for genomic database.Our developed vanilla scheme is inspired by [20], which enables the insertion and extraction of arbitrary bit-strings in databases.However, our scheme differs from [20] and its variants, e.g., [11], [14], as they only mark one bit position in each selected row, which leads to less fingerprinting robustness due to significantly larger number of attributes (e.g., number of SNPs in a genomic database) and strong correlation patterns in genomic data.

A. Fingerprint Insertion Phase of the Vanilla Scheme
When the database owner shares a fingerprinted copy of the genomic database R with a SP (whose id is n), it first generates the fingerprint bit string f SPn = Hash(K|n), where K is the secret key of the database owner and | stands for the concatenation operator. 2The database owner uses a cryptographic pseudorandom sequence generator U to select specific bit positions of specific SNPs from some individuals and fingerprint these bits using mark bits m's, which are the result of the XOR operation between the random mask bits (x's) and randomly selected fingerprint bits (f l 's), i.e., m = x ⊕ f l , where f l is the lth bit in f SPn .
To be more specific, for all individuals in the genomic database, the database owner fingerprints the SNP sequence if U 1 (K|r i .primarykey) mod 1/γ r = 0, where γ r ∈ (0, 1) is the row fingerprint density.As a result, the fraction of fingerprinted SNP sequences in R is approximately γ r .For all SNPs in a selected sequence (i.e., r i ), the element with attribute p (i.e., the SNP value at loci p of r i represented by r i [p]) will be fingerprinted if U 2 (K|r i .primarykey|p) mod 1/γ l = 0, where γ l ∈ (0, 1) is the column fingerprint density.Then, the database owner sets the binary mask bit x = U 3 (K|r i .primarykey|p) mod 2, and selects one bit position of f SPn via l = U 4 (K|r i .primarykey|p) mod L + 1. Next, it obtains the mark bit m as m = x ⊕ f SPn [l], and selects a bit position (count backwards) of r i [p] via t = U 5 (K|r i .primarykey|p) mod 2 + 1.Finally, it fingerprints r i [p] by replacing the tth to the last bit of r i [p] with m.We summarize the steps of the fingerprint insertion phase of the vanilla fingerprinting scheme in Algorithm 1.
Different from [11], [14], [20], by involving γ l , our vanilla scheme can mark more bits in each selected row.In our recent work [13], we also derived a closed-form expression to characterize the relationship between fingerprint robustness and database utility.Thus, by jointly tuning γ r and γ l we can 12 Set t = U5(K|ri.primarykey|p) mod 2 + 1.

13
Set the tth to the last bit of ri[p] to m. also achieve desired tradeoff between robustness and utility.We will theoretically and empirically investigate this in the future.

B. Fingerprint Extraction Phase of the Vanilla Scheme
When the database owner observes a leaked (or pirated) genomic database denoted as R, it tries to identify the traitor (i.e., the malicious SP) by extracting the fingerprint from R and comparing it with the fingerprints of all SPs who have received its genomic database.
We present the fingerprint extraction phase of the vanilla scheme in Algorithm 2. Specifically, the database owner first initiates a fingerprint template Here, "?" means that the fingerprint bit at that position remains to be determined. 3Then, the database owner determines the mask bit (x), obtains the corresponding indices of fingerprint bits (l's), locates the bit positions of the fingerprinted SNP elements exactly as in the fingerprint insertion phase, and finally fills in each "?" using a voting scheme.To be more precise, for each fingerprinted SNP r i [p], the database owner obtains the corresponding mark bit m by reading the tth to the last bit of r i [p], and recovers one instance of the lth bit of the fingerprint bit string via f l = m ⊕ x.Since the value of f l may be changed by the malicious SP, the database owner maintains and updates two counting arrays 3 Similar symbol has also been used in other works [1], [3], [14], [20].c 0 and c 1 , where c 0 (l) and c 1 (l) record the number of times f l is recovered as 0 and 1, respectively.Finally, the database owner sets f (l) = 1 if c 1 (l)/(c 1 (l) + c 0 (l)) ≥ τ , and f (l) = 0 if c 0 (l)/(c 1 (l) + c 0 (l)) ≥ τ , otherwise f (l) =? (i.e., this fingerprint bit cannot be determined due to the database owner's low confidence), where τ ∈ (0.5, 1] is the parameter that quantifies the database owner's confidence in the fingerprint recovery phase. 4lgorithm 2: Vanilla scheme: fingerprint extraction 12 forall l ∈ [1, L] do 13 f (l) = 1, if c1(l)/(c1(l) + c0(l)) ≥ τ , and f (l) = 0, if c0(l)/(c1(l) + c0(l)) ≥ τ .

V. CONSOLIDATING THE FOUNDATION: MAKING THE VANILLA GENOMIC FINGERPRINTING SCHEME ROBUST AGAINST CORRELATION ATTACKS
Here, we propose a robust fingerprinting scheme for genomic databases against the correlation attacks identified in Section III-B.The robust scheme is developed by augmenting the vanilla scheme using two mitigation techniques which can serve as the post-processing steps after the vanilla fingerprinting.It brings us two benefits: (i) fingerprint robustness of the vanilla scheme is maintained, because the devised mitigation techniques only change non-fingerprinted entries and (ii) the mitigation techniques can be applied to any vanilla fingerprinting schemes to resist correlation attacks on genomic databases.As discussed in Section IV-A, we choose our developed vanilla scheme to have more control over the fingerprint density on each selected SNP sequence.In practice, one can develop their own vanilla scheme depending on the content of their genomic database and then apply our proposed mitigation techniques to make their scheme also robust against the correlation attack.
In highlevel, to provide robustness against the row-wise correlation Atk row (S) and column-wise correlation attack Atk col (J ), the database owner (Alice) will perform mitigation steps (after the vanilla fingerprinting) by utilizing Mendel's law and her prior knowledge S (correlation of genomic data among different individuals), J (correlation of SNP values at different loci) to reduce the discrepancy caused by fingerprinting insertion.We will show that to implement the proposed mitigation steps, Alice needs to change only a few entries after the vanilla fingerprinting (e.g., less than 3% as shown in Table III in Section VI-C).

A. Mitigating the Row-wise Correlation Attack
To mitigate the row-wise correlation attack (in Section III-B), we develop Mtg row (S), which is composed of two phases.The first phase tries to eliminate all SNP loci that violate the Mendel's law, and the second phase makes the similarities of genome data among family members close to that before the fingerprint (generated by the vanilla is inserted.In particular, for each family set denoted as fmly, Mtg row (S) checks all fingerprinted SNPs of all family members.If the SNP-tuple at a locus violates the Mendel's law, Mtg row (S) changes other non-fingerprinted entries in the tuple to make them comply with the Mendel's law.For example, if a mother-father-child SNP tuple at a specific locus takes value "2-1-0" (2 for mother, 1 for father, and 0 for child, which violates the Mendel's law), and "1" (value of the SNP for the mother) is the fingerprinted SNP, then, Mtg row (S) can modify this tuple as "2-1-1".
In the second phase, for each family set (i.e., fmly) in the genomic database, Alice changes the SNP values of the family members such that the cumulative similarities between individuals with kinship relation is close to their original similarities.This can be formulated as the following optimization problem In (2), s ij fmly stands for the publicly known similarities between two family members in fmly and s ij fmly stands for the similarities after row-wise mitigation. 5We denote the SNP sequence of an individual i after the mitigation as r i .Also, value change(•) is the function that changes each SNP attribute of r i , and it will be elaborated later.( 2) is an integer programming problem.Since each SNP sequence may contain thousands of SNPs, it will be computationally expensive to obtain the optimal value of r i .Thus, we solve it heuristically.Particularly, if s ij fmly > s ij fmly ( s ij fmly is the empirical similarity calculated after the vanilla fingerprinting), the database owner needs to post-process r i , ∀i ∈ fmly to increase the similarity.On the other hand, if s ij fmly < s ij fmly , the database owner needs to post-process r i , ∀i ∈ fmly to decrease the similarity.We use the example of SNP sequences from mother-fatherchild to further explain how to increase or decrease the similarity.To increase s ij fmly , we randomly select a certain number of non-fingerprinted 3-SNP-tuples (i.e., mother-fatherchild) with value "0-0-0" and then change it to "1-0-1" or "0-1-1" depending on whether s ij fmly is a mother-child or fatherchild similarity.We only change 3-SNP-tuples with value "0-0-0", because this is the one of the most common 3-tuples in all families, and modification of mother-child tuples will not have an impact on the father-child similarity (vice-versa).To decrease s ij fmly , we let the database owner change a certain number of 3-tuples with value "1-0-1" (or "0-1-1") to "0-0-0" if s ij fmly is the mother-child (or father-child) similarity.
The reasons are exactly the same as the case of increasing s ij fmly .Although there are also higher degrees of relatedness among family members (e.g., the SNP correlations between grandparents and grandchildren), those correlation (or similarity) are usually much weaker than the first order correlations (e.g., mother-father-child).The vanilla fingerprinting is subject to destroying the strong correlations the most (and hence an attacker can easily infer the fingerprint due to the distorted correlations).For higher degree family members, the original correlations are not high and fingerprint will not destroy such correlation too much.We will experimentally verify this in Section VI-A.Note that the row-wise mitigation technique for genomic databases is different than the one developed for general databases in our previous work [14], which changes entries of non-fingerprinted data records to make the newly obtained similarities as far away from Alice's prior knowledge S to mislead the malicious SP.In contrast, here, we make the new similarities close to S in order to alleviate Atk row (S) (try to make Atk row (S) distort less fingerprint bits), and if the objective function on (2) equals 0, Atk row (S) is completely invalidated.The reason that we can pursue this in genomic database is because each row has much more attributes then general databases and the number of unique values is small (i.e., only 3 options: 0, 1, and 2).Besides, the row-wise mitigation techniques developed in [14] solves an NP-hard combinatorial search problem greedily, which introduces large computation overhead.

B. Mitigating Column-wise Correlation Attack
To make the vanilla scheme robust against column-wise correlation attack, we propose Mtg col (J ), which transforms the vanilla fingerprinted genomic database to have column-wise joint distributions (e.g., linkage disequilibrium between the SNPs) close to the publicly known joint distributions in J .Inspired by [14], we develop Mtg col (J ) using the idea of "optimal transport" [8], which moves the probability mass of the marginal distribution of each SNP attribute of the vanilla fingerprinted genomic database to resemble the distribution obtained from the marginalization of each reference joint distribution in J .Then, the optimal transport plan is used to change the entries in the genomic database after the vanilla fingerprinting.
In particular, for a specific SNP (column, i.e., locus of SNP sequence) p, we denote its marginal distribution obtained after the vanilla fingerprinting as Pr(C p ), and that obtained from the marginalization of a joint distribution J p,q distribution in J as Pr(C p ) = J p,q 1 T (here q can be any attribute that is different from p, because the marginalization with respect to p using different J p,q will lead to the identical marginal distribution of p).To move the mass of Pr(C p ) to resemble Pr(C p ), we need to find another joint distribution (i.e., the mass transport plan) G p,p ∈ R 3×3 , whose marginal distributions are identical to Pr(C p ) and Pr(C p ).Then, G p,p (a, b), a, b ∈ {0, 1, 2} indicates that the database owner should change G p,p (a, b) percentage of entries in the vanilla fingerprinted genomic database whose attribute p (SNP p) takes value a to value b, so as to make Pr(C p ) close to Pr(C p ).In practice, such a transport plan can be obtained by solving a regularized optimal transport problem, i.e., the entropy regularized Sinkhorn distance minimization [8] as follows: where  3) can be efficiently solved by linear programming [8].The obtained G p,p is more heterogeneous for larger values of λ p , i.e., the database owner will change less entries after the vanilla fingerprinting, which preserves more database utility.On the contrary, G p,p is more homogeneous for smaller values of λ p , i.e., it causes more SNP entries to be changed, which leads to more utility loss.Although (3) processes each column (attribute) of the genomic database independently, as shown in [14], the post-processed fingerprinted database will have the Pearson's correlations among attribute pairs that are close to the prior knowledge J .This further suggests that the mitigation step can boost the utility of the fingerprinted genomic databases.
Mitigation against additional auxiliary information.The developed row-and column-wise mitigation techniques focus on the correlation attacks that use generic correlations among genome data.In some task-dependent applications, the malicious SP can also use specific auxiliary information, e.g., racespecific information determined by genome and population structure, to compromise the fingerprinted database.This can be alleviated by involving additional mitigation steps before Mtg row (S) and Mtg col (J ) to make the vanilla fingerprinted database also match those auxiliary information.More discussion on the availability of these information to the database owner is deferred to Section VII.

VI. EXPERIMENT RESULTS
In this section, we first show that the vanilla fingerprinting scheme can resist random bit flipping attacks, but it is vulnerable to the correlation attacks developed specific for genomic databases.The correlation attacks are more powerful, as they can easily distort more than half of the embedded fingerprint bits at only a small cost of database utility (i.e., introducing less error and preserving the SNP-phenotype association).Then, we demonstrate that the proposed mitigation techniques can thwart correlation attacks and make the vanilla scheme robust against them.Since the mitigation techniques only change limited entries on top of the vanilla scheme, they also maintain a high utility for the genomic database.More importantly, we show that if the attacker conducts correlation attacks after the proposed robust fingerprinting scheme, it cannot succeed even if at a significant cost of database utility loss.Similar to [14], since the row-wise correlation attack and mitigation are computationally light and modify less database entries, we let the malicious SP launches Atk row (S) followed by Atk col (J ) when compromising a fingerprinted database, and let the database owner perform Mtg row (S) followed by Mtg col (J ) when making a vanilla fingerprinted database robust.

A. Genomic Database Description
We use the SNP data belonging to 1500 individuals from the HapMap dataset [9].Each individual has 156 data points (i.e., SNPs).In this population, there are 150 families, each of which is composed of 3 individuals, i.e., mother, father, and child.We assume that both the database owner and the malicious SP know the members of each family and the pairwise correlations (e.g., linkage disequilibrium) among SNPs (Section III-B discusses why this assumption is valid in practice).The importance to consider the correlations due to the first degree relationships among family members.As discussed in Section V-A, row-wise correlations are the strongest among the first degree family members and the vanilla fingerprinting potentially destroys such strong correlations the most (compared to the correlations between more distant family members).Here, we verify this claim by generating new generations of family members from the offsprings of the 150 families and checking the SNP similarities between these new generations and the original 150 pairs of parents before and after fingerprint insertion.In Figure 2, by varying the overall fingerprinting density (γ r γ l ∈ {10%, 20%, 30%}), we plot the average absolute difference of cosine similarity between the 150 parents and various generations of their descendants due to the vanilla fingerprinting.Clearly, the average change in the similarity with the first generation is the most significant for all considered fingerprinting density, which suggests that mother-father-child trio has the strongest correlation and it provides the most prior knowledge for the malicious SP to launch the row-wise correlation attack, and hence the row-wise mitigation techniques should preserve the correlations between first degree family members as much as possible.Thus, in the following experiment, we focus on the similarity change between individuals and their first generation of descendants during row-wise correlation attack and mitigation.

B. Vulnerability of Vanilla Fingerprinting Scheme Against
Correlation Attack 1) Performance against Correlation Attack: We first show the vulnerability of the vanilla scheme under the correlation attacks.To change sufficient number of database entries during fingerprint insertion, we let both row-and columnwise fingerprint density (i.e., γ r and γ l ) vary in {0.05, 0.06, 0.07, 0.08, 0.09, 0.1}, which gives 36 different fingerprinted databases.Then, we let these databases be compromised by Atk row (S) followed by Atk col (J ).For each compromised database, we record its percentage of changed entries (i.e., Per chg = 1 − Acc) caused by the correlation attack as well as the resulting fingerprint robustness measured in terms of Per cmp (percentage of compromised fingerprint bits), and scatter the results as blue dots in Figure 3 (we will discuss the black and red dots in the figure in later experiments).As discussed in Section III-C and empirically shown in [14], as long as the malicious SP can compromise more than 50% Percentage of changed entries (%)  fingerprint bits, it is able to avoid being detected as the traitor and cause the database owner to accuse other innocent SPs who have also received the database.Thus, we say an attack is successful if Per cmp > 50%, and the green horizontal line in Figure 3 represents the attack success boundary.From Figure 3, we observe that in most of the cases, the identified correlation attacks can compromise more than 50% fingerprint bits (blue points that are above the green line) at the cost of changing only less 5% SNPs in the vanilla fingerprinted database (i.e., an attacker can distort the majority of the fingerprint bits by also keeping the utility of the database high).
In Table II, we show the consistency of SNP-phenotype association study (discussed in Section III-D) after the vanilla fingerprinting and the correlation attacks.In particular, we first obtain the set of top-50 SNPs having strong associations with a phenotype (i.e., the 50 SNPs with the lowest p-values) from the original (non-fingerprinted) database and denote this set as the ground-truth set.Next, we get the new sets of top-50 SNPs from the vanilla fingerprinted database and the correlation attack-compromised database.Then, we evaluate the consistency by counting the percentage of overlapping SNPs between them and the ground-truth set.From the upper panel of Table II, we observe that the vanilla fingerprinting preserves high utility of the consistency and the correlation attacks on vanilla fingerprinting scheme also maintain such utility.For example, one of the successful attacks happens when γ r = γ l = 0.09 (blue dot indicated by the blue arrow in Figure 3), and the resultant pirated database copy still preserves more than 96% SNP-phenotype association.
2) Performance against Random Bit Flipping Attack: Next, we compare the attack ability of the random bit flipping attack with our identified correlation attacks.To show the effectiveness of the correlation attacks, we let the random bit flipping attack change more percentage of entries in the vanilla fingerprinted genomic database than the correlation attacks.We also record the fingerprint robustness after the vanilla fingerprinted database is subject to random bit flipping attack.In particular, we set γ r = γ l ∈ {0.05, 0.06, 0.07, 0.08, 0.09, 0.1}, and let the malicious SP randomly change a certain percentage (i.e., Per chg ∈ {5%, 7%, 9%, 11%, 13%15%}) of entries in its received copy of the vanilla fingerprinted genomic database.Thus, we also obtain 36 instance of vanilla fingerprinted databases compromised by random bit flipping attacks.We scatter the recorded results as black dots in Figure 3. Clearly, the random bit flipping attack can hardly compromise more than half of the fingerprint bits if the database owner inserts dense fingerprint in the genomic database, i.e., when γ r = γ l ≥ 0.06.In particular, when γ r = γ l = 0.09, the malicious SP can only distort 35.16% fingerprint bits at the cost of changing 15% of the SNPs if it launches the random bit flipping attack (indicated by the black arrow in Figure 3).In contrast, it can compromise 52.34% of the fingerprint bits at the cost of only changing the values of 3.11% of the SNPs if it launches the correlation attacks (indicated by the blue arrow in Figure 3).This clearly suggests that our vanilla fingerprint scheme developed specially for genomic databases is robust against random bit flipping attacks, but is vulnerable to the attacks using correlations among genome data.Compared with our previous work [14], the identified correlation attacks for genomic data are even more powerful than the ones identified for a general relational database.The reason is that the correlation patterns in genomic data (e.g., Mendel's law and linkage disequilibrium) are much stronger than the patterns in a general relational database (i.e., census database in [14]).Thus, the robust fingerprinting for genomic databases is critical.

C. Robust Genomic Database Fingerprinting Against Correlation Attacks
Now, we investigate the impact of the proposed robust genomic database fingerprinting scheme.Recall that the robust fingerprinting is achieved by post-process the vanilla finger-   III, we record the additional percentage of entries being changed due to the post-processing steps (Mtg row (S) and Mtg col (J )).Clearly, as shown in Table III, the mitigation techniques only need to change about 3% of the SNPs in order to make the postprocessed database has row-wise and column-wise correlation close to S and J and at the same time comply with the Mendel's law.Thus, the robust fingerprint scheme can preserve high utility of the genomic database, i.e., the database accuracy and the consistency of SNP-phenotype association.For example, as shown in the lower panel of Table II, in most of the cases, robust scheme achieves more than 90% top-50 SNPs match with the original database.
2) Impact on Fingerprinting Robustness: In Figure 3, using red dots, we scatter the fingerprint robustness and the percentage of changed entries when the robust scheme is under the identified correlation attacks (which are identical with that considered in Section VI-B).In particular, comparing with the blue dots, we see that the number of successful attacks (red dots above the green line) is significantly reduced by the proposed mitigation techniques.This suggests that it is very difficult for the malicious SP to compromise more than half of the fingerprint bits using the correlation attacks under the proposed robust scheme, even if the malicious SP has changed more than 20% of the SNPs in the received database.Moreover, correlation attacks on top of the robust fingerprinted genomic database also significantly reduce the utility of SNP-phenotype association study.As shown in Table II (the row corresponding to robust after correlation attacks), the percentage of matched top-50 SNPs drops more than 10% compared to the original database.This suggests that the proposed robust genomic database fingerprinting scheme can effectively thwart the identified correlation attacks by just changing about 3% of the SNPs in the post-processing steps and at the same time maintain high database accuracy and consistency of SNP-phenotype association.

D. Scalability
Now, we investigate the performance of the proposed robust fingerprinting scheme for larger genomic databases, where each individual has a higher number of SNPs (i.e., 234).In particular, we consider 8000 individuals among which there are 1333 families (due to the same reasoning in Section VI-A, we also focus on the correlation between mother-child-father tuple).In this experiment, we let γ r = γ l ∈ {0.06, 0.08, 0.1}.We scatter the pair of percentage of changed entries and percentage of compromised fingerprint bits in Figure 4. We also plot the p-value consistency before and after the robust scheme is subject to the identified correlation attacks in Figure 5. From Figure 4, we see that the robustness increases as the database increases (in terms of both rows and columns), because the correlation attacks cannot distort more than 12% of the fingerprint bits even though more than 20% entries are modified.Figure 5 further suggests that if the malicious SP launches the correlation attacks on a robust fingerprinted genomic database, the p-value consistency will drop by 10% on average.This experiment shows that our proposed robust fingerprinting scheme is also promising when sharing large genomic databases.VII.DISCUSSION Independent treatment of elements in SNP sequence.Note that by checking the change of inner product before and after fingerprinting, (2) essentially treats each element in the SNP sequence independently.However, in practice, blocks of SNP elements may also contain inherent structure, e.g., two or more individuals are identical by descent (IBD) if they have inherited blocks of SNPs from a common ancestor without genetic recombination.Thus, a malicious SP may also use this structural information during an attack.In future work, we will extend the proposed robust fingerprinting scheme to incorporate the recombination of IBD segments during meiosis.Limited side-effect of row-wise mitigation.In the row-wise mitigation, we post-process each pair of first degree family members in a family set.This may impact the similarity of other pairs in which either individual is involved.For example, updating mother-child pair may increase the similarity of grandfather-grandchild pair.However, as discussed in Section VI-A, such impact is very limited for higher degree family members.Assumption on prior knowledge.To the advantage of the malicious SP, we assume that it has at least equally accurate knowledge about the genomic database (i.e., Mendel's law, row-wise and column-wise correlation) compared with the database owner.We do not consider specific auxiliary information (such as SNP population frequencies, rare disease-associated variants, population stratification, and SNPphenotype associations) in this paper.If the malicious SP has more auxiliary information than the database owner (which rarely happens in real world applications), the robustness of the proposed scheme may be compromised.Such robustness degradation will be limited for generic relational database, i.e., the malicious SP still cannot distort more than half of the fingerprint bits [14].We will empirically investigate this for genomic databases by considering various case studies in the future work.Privacy concerns during genomic database sharing.The primary goal of database fingerprinting is to claim copyright and prevent unauthorized redistribution, however, privacy concerns and regulations may also impede genomic data sharing.In our recent work [13], we developed a novel scheme which can leverage the intrinsic randomness introduced by fingerprinting to provide provable privacy guarantees in relational database sharing, i.e., copyright and privacy protection can be achieved in one shot.In future, we will also study privacypreserving genomic database fingerprinting by adapting the scheme in [13].Computational complexity 0f mitigation techniques.If the genomic database contains M individuals, each of which has N SNPs, then the computation complexity for Mtg row (S) is O(M N ), because solving (2) requires checking all SNPs of each mother-child-father tuple.The computation complexity for Mtg col (J ) is O( 3N α ), where 3 is the number of possible instances of SNP values and α is the desired error in Sinkhornbased optimal transport [17].

VIII. CONCLUSION
In this paper, we have proposed robust fingerprinting for genomic databases composed of SNP sequences.To this end, we first identified the row-wise and column-wise correlation attack which utilize Mendel's law and linkage disequilibrium to distort the embedded fingerprint bits.Next, we developed a vanilla fingerprinting scheme specifically for genomic database by allowing the database owner to embed more fingerprint in each selected SNP sequence.Then, we further made this vanilla scheme robust against the identified correlation attacks by augmenting it with two mitigation techniques, which serve as post-processing steps for the vanilla scheme.In particular, the row-wise mitigation is achieved via solving a cumulative absolute distance minimization, and the column-wise mitigation is realized using optimal mass transport of distributions.Via experiments, we have shown that the identified correlation attacks are much more powerful than common attacks against fingerprinting schemes; they can easily distort more than half of the fingerprint bits at a small cost of database utility.However, these attacks are effectively alleviated by our developed mitigation techniques.The proposed scheme has the potential to further motivate researchers to share their genomic databases with each other, knowing that the shared database is of high utility and the recipient will be hesitant to leak the database due to the provided liability guarantees via the proposed robust fingerprinting scheme.

Algorithm 1 : 4 / 7 //fingerprint the pth SNP of the ith individual 8 Set
Vanilla scheme: fingerprint insertion Input : The original genomic relational database R, row fingerprinting density γr, column fingerprinting density γ l , database owner's secret key K, pseudorandom number sequence generator U, and the SP's series number n (which can be public).Output: The vanilla fingerprinted genomic relational database R. 1 Generate the fingerprint bit string of SP n, i.e., fSP n = Hash(K|n); 2 forall individual ri ∈ R do 3 if U1(K|ri.primary key) mod 1/γr = 0 then /fingerprint the SNP sequence of the ith individual 5 forall SNP element ri[p] ∈ ri do 6 if U2(K|ri.primary key|p) mod 1/γ l = 0 then mask bit x = 0, if U3(K|ri.primary key|p) is even; otherwise set x = 1. 9 fingerprint index l = U4(K|ri.primarykey|p) mod L.10 fingerprint bit f l = fSP n (l).11 mark bit m = x ⊕ f l .
is the set of all joint probability distributions whose marginal distributions are the probability mass functions of Pr(C p ) and Pr(C p ). •, • F denotes the Frobenius inner product of two matrices.Also, Θ p,p is the transport cost matrix and Θ p,p (a, b) > 0 representing the cost to move a unit percentage of mass from Pr(C p = a) to Pr(C p = b).Finally, H(G p,p ) = − G p,p , log G p,p F calculates the information entropy of G p,p and λ p > 0 is a tuning parameter.In practice, (

Fig. 2 :
Fig.2: Average absolute value of the SNP cosine similarity difference, before and after fingerprint insertion, among family members and their different generations of simulated descendants.

Fig. 4 :
Fig. 4: Fingerprint robustness versus utility loss when the proposed robust scheme is compromised by the correlation attacks.

Fig. 5 :
Fig. 5: p-value consistency before and after the proposed robust scheme is compromised by the correlation attacks.

TABLE I :
Differences between the robust genomic database fingerprinting and the previous schemes.indicates the scheme has a certain property, and indicates the opposite.members.If a tuple violates Mendel's law, the database owner changes the non-fingerprinted entries in this tuple to make it compliant with the Mendel's law.Second, it checks all family sets in the genomic database, calculates the empirical correlations among family members after the vanilla fingerprinting, and changes the non-fingerprinted entries in each family set to push the empirical correlations close to the publicly known model (S) by solving a distance minimization problem.Note that the second phase of Mtg [14](S) is different with the rowwise mitigation developed in our previous work[14], because the second phase is able to perfectly defend against Atk row (S) if the objective function of the distance minimization problem reaches to 0. Whereas, our previous row-wise mitigation technique[14](formulated as a set function maximization problem) can only mislead the malicious SP when launching the row-wise correlation attack.More details are deferred to Section V-A.

TABLE II :
Consistency of SNP-phenotype association compared with the ground-truth.

TABLE III :
Additional change caused by mitigation.printedgenomic database using two mitigation techniques, i.e., Mtg row (S) and Mtg col (J ).1) Impact on Database Utility: In Table