A new way to protect privacy in large-scale genome-wide association studies

Motivation: Increased availability of various genotyping techniques has initiated a race for finding genetic markers that can be used in diagnostics and personalized medicine. Although many genetic risk factors are known, the key causes of common diseases with complex inheritance patterns are still unknown. Identification of such complex traits requires a targeted study over a large collection of data. Ideally, such studies bring together data from many biobanks. However, data aggregation on such a large scale raises many privacy issues.
Results: We show how to conduct such studies without violating the privacy of individual donors and without leaking the data to third parties. The presented solution has provable security guarantees.
Contact: jaak.vilo@ut.ee
Supplementary information: Supplementary data are available at Bioinformatics online.


Practical Details of Privacy-Preserving Data Mining
Privacy-preserving data mining is a cover term for a set of techniques that allow sensitive data to be processed with minimal or no leaks. In our work, we use secure multi-party computation which allows us to securely evaluate a function among several parties so that none of them learns anything except for their own inputs and the final outputs. More precisely, we chose the Sharemind secure computation framework [BLW08] as a concrete implementation platform.

Main Properties of the Sharemind Framework
All computations in the Sharemind platform are based on additive secret sharing of 32-bit and 64-bit integers. Hence, data donors can distribute their secret values among the hosts so that no host, nor any proper subset of hosts, can recover anything about the inputs unless all of the hosts pool their databases. More formally, a secret sharing scheme splits a secret value into several pieces (called shares) so that it is infeasible to reconstruct the original value without access to a predetermined number of shares. In the case of additive secret sharing, all shares are needed to reconstruct the secret. The domain of secret sharing determines what type of data can be stored and which operations can be performed on the data without revealing secrets. With Sharemind we can perform secure arithmetic with 32-bit and 64-bit unsigned integers.
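For concreteness, the following Python sketch (ours, not Sharemind code) illustrates additive secret sharing over 32-bit unsigned integers; any proper subset of shares is uniformly random, so only the sum of all shares reveals the secret.

    import secrets

    MOD = 2**32  # Sharemind shares 32-bit (or 64-bit) unsigned integers

    def share(value, n_hosts=3):
        """Split a secret into n additive shares modulo 2^32."""
        shares = [secrets.randbelow(MOD) for _ in range(n_hosts - 1)]
        # The last share is chosen so that all shares sum to the secret.
        shares.append((value - sum(shares)) % MOD)
        return shares

    def reconstruct(shares):
        """Recover the secret; requires ALL shares."""
        return sum(shares) % MOD

    shares = share(42)
    assert reconstruct(shares) == 42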
The general idea of computations on Sharemind is shown in Figure 1. Input values are secret-shared at the source and collected by three servers. The servers run secure multi-party computation protocols (distributed computations) that process input shares and produce results in the form of new shares. These protocols guarantee that nothing is leaked about the inputs and outputs. Thus, data hosts learn only the types of computation steps they perform and nothing more. At the end of the computation, the shares of the end result are published to the parties interested in the result, who can reconstruct it since they hold all the shares.
As the Sharemind platform natively supports basic integer operations, we converted all algorithms into a form where they are specified in terms of addition, multiplication and comparison operations. This makes our algorithms relatively fast, since secure operations for fixed- and floating-point arithmetic are significantly slower.
Another advantage of the Sharemind platform is the universal composability of all operations. Note that not all secure multi-party computation protocols preserve their security guarantees when executed together with other protocols: sometimes it is possible to break confidentiality by combining information gathered from several parallel protocol runs.
Parallel execution of several multi-party protocols is guaranteed to preserve security if all protocols are universally composable. As all Sharemind protocols have been shown to be universally composable [BLW08,BNTW12], we can execute several operations in parallel. This can give us significant speed-ups, as the price of sending a value over the network in Sharemind is similar for single values and for small vectors of values. To use this speedup to one's advantage, Sharemind can be instructed to perform componentwise operations on entire vectors and matrices in addition to processing single values. These parallel operations are significantly more efficient than single operations because several values can be packed into fewer network messages.
For an example of the gains, see Figure 2. The horizontal axis shows input vector sizes from a single value to one million values on a logarithmic scale, and the vertical axis shows the running time of the multiplication in milliseconds. Two lines have been fitted to the average running times: the first, horizontal line shows that processing a single value and processing one thousand values take nearly the same amount of time. The second line shows a steeper increase in running times. From our benchmarking experiments we have determined that the running time becomes nearly linear in the size of the vector once the network bandwidth is used to its limits.

Figure 2: Effect of vectorization on secure multiplication running time.
The efficiency of parallel operations motivates us to design the secure computation algorithms in a manner that emphasizes large vector operations. However, vectorization alone is not the whole story. Working with large vectors also increases the memory footprint. Controlling memory usage is important in the case of genome-wide association studies because of the data size. Moreover, performing more operations in parallel decreases the average cost of an individual operation only up to a certain number of operations, see Figure 2. Therefore, it is reasonable to do operations in batches, cutting larger vectors into smaller pieces.
Although all our implementations use explicit batching to reduce the memory footprint, we omit these details from the algorithm listings. Firstly, batching complicates the overall presentation of the algorithms. Secondly, it is straightforward to add batching to the algorithms by dividing the SNP data into parts and processing these in succession.
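For illustration, the following Python sketch shows the kind of batch loop we mean; secure_componentwise_mul is a hypothetical stand-in for a Sharemind vector operation.

    def batched(vector, batch_size):
        """Yield consecutive pieces of a large vector for batched processing."""
        for start in range(0, len(vector), batch_size):
            yield vector[start:start + batch_size]

    def process_in_batches(xs, ys, batch_size, secure_componentwise_mul):
        """Apply a (hypothetical) secure vector operation batch by batch,
        bounding the memory footprint of any single protocol run."""
        result = []
        for bx, by in zip(batched(xs, batch_size), batched(ys, batch_size)):
            result.extend(secure_componentwise_mul(bx, by))
        return result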

Developer Tools in the Sharemind Framework
Sharemind is a general secure multi-party computation system in the sense that it is not limited to solving a single problem with a dedicated protocol. The universally composable protocol suite enables the virtual-machine-inspired design of Sharemind and allows us to program this machine as we would program a standard computer. Sharemind natively executes an assembly language that has instructions for both public and private operations. Public operations are used for non-sensitive data like loop variables, data structure sizes and conditional expressions. Sharemind does not hide the flow of the executed code: all branching decisions in Sharemind are based on public values. Instead, the framework provides means for hiding the outcomes of data-dependent choices.
The Sharemind framework uses a high-level programming language called SecreC [Jag10]. The main feature of SecreC is a type system that separates public and private data. During execution, private data is processed using secure multi-party computation and public data is processed as usual. Interaction between public and private data is controlled as follows:
• public values can automatically become private, and
• private values can become public only through the declassify() operator.
The use of a dedicated declassification operator explicitly shows the places in the algorithm where private data is made public (public values are reconstructed from shares). To show that an algorithm does not leak sensitive data, the developer needs to show that none of the declassification calls leak information about the sensitive inputs.
SecreC also includes support for vectors and matrices of both private and public data. SecreC allows vectors and matrices of the same size to be used in componentwise operations and also aggregation operations like computing a sum of all the values in a vector. It is possible to work with Sharemind's secure database from within SecreC code.
SecreC also enables an auditing step in the development of secure applications: since a SecreC program specifies what computations a Sharemind node can perform and what it can publish, the SecreC code can be audited before deployment by each secure multi-party computation host.

Algorithm Notation
Let x denote a vector, let x + y denote the componentwise addition of vectors and x · y the componentwise multiplication of vectors. When a scalar value is used in the place of a vector, it is automatically extended to a vector of the same size as the other vector operand. Vector operation precedence is the same as scalar operation precedence: multiplications are performed before additions. We write [[x]] for a secret-shared value x. For instance, [[z]] ← [[x]] · [[y]] means that the shares of z = x · y are securely computed from the shares of x and y. The same notational convention applies to vectors and matrices, e.g., [[x]] + [[y]] denotes componentwise addition of shared vectors.
As in the main article, we assume that the four possible states of a SNP are reduced to two alleles: a dominant base pair and a mutation. The resulting data is stored in the shared genotype database [[SDB]] with 2k rows and n columns, where k is the number of SNPs and n is the number of donors. Rows of [[SDB]] come in pairs: the first row in every pair contains the count of allele A in the SNP and the second row the count of allele B. Each column represents one donor.
We represent case and control groups as shared zero-one index vectors (masks). Let [[mask]] be an index vector for the case group. Then the ith donor belongs to the case group if and only if mask[i] = 1. See Figure 3 for an illustration of the database layout and mask vectors. Algorithm 1 shows how to securely compose the mask based on range filters. The first input of the algorithm is a vector [[attribute]] of length n (the attribute value for each donor). The second and third inputs are the filter's lower and upper bounds, respectively. Since the Sharemind implementations of the smaller-than and greater-than predicates always return zero or one, the mask can be computed directly as the componentwise product of the two comparison results.
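The following plain-value Python sketch mirrors this mask composition; the secure version operates on shares, but the arithmetic is the same.

    def range_mask(attribute, low, high):
        """Plain-value analogue of Algorithm 1: the comparison predicates
        return 0/1, so the conjunction is simply their product."""
        return [int(a >= low) * int(a <= high) for a in attribute]

    ages = [34, 61, 45, 72]
    mask = range_mask(ages, 40, 65)   # -> [0, 1, 1, 0]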

χ² Algorithms
Let x and y be masks for the case and control groups, respectively, and let A and B be the database rows for a particular SNP. Then the allele counts can be expressed as scalar products:

    #A = Σ_i mask_i · A_i   and   #B = Σ_i mask_i · B_i,   (1)

where mask is x for the case group and y for the control group. Based on equations (1), Algorithm 2 calculates for each SNP the number of alleles (A or B) in a given group (cases or controls). The first input to Algorithm 2 is the shared genotype database [[SDB]]. The second input k gives the number of SNPs (row pairs) in the database, and the third input allele ∈ {1, 2} denotes the allele that we are aggregating, where 1 stands for allele A and 2 stands for allele B. The final input is the mask [[mask]].
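A plain-value Python sketch of the counting logic of Algorithm 2 (the secure version performs the same arithmetic on shares):

    def allele_counts(sdb, k, allele, mask):
        """For each of the k SNPs, sum the allele-count row over the donors
        selected by the mask. allele: 1 for A (even rows), 2 for B (odd rows)."""
        counts = []
        for j in range(k):
            row = sdb[2 * j + (allele - 1)]
            counts.append(sum(m * c for m, c in zip(mask, row)))
        return counts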

Cochran-Armitage Test
To compile the contingency table shown in Table 1 for the Cochran-Armitage test, note that the genotype counts can be expressed through componentwise products of the allele-count rows. Since A_i + B_i = 2 for every measured donor, the fourfold genotype counts satisfy

    4·#AA = Σ_i mask_i · A_i·(A_i − B_i),   4·#AB = Σ_i mask_i · 4·A_i·B_i,   4·#BB = Σ_i mask_i · B_i·(B_i − A_i).   (2)

Thus, the contingency table is also easily computable. Based on equations (2), Algorithm 3 calculates the frequencies of allele pairs AA, AB and BB for each SNP. The first input to Algorithm 3 is the shared genotype database [[SDB]] with the number of SNPs given in the second input k. The third input is the group mask [[mask]].
It is important to note that, according to equations (2), all of the calculated counts are fourfold, i.e., four times the actual genotype counts. This will be taken into account in the algorithms where the test statistics are calculated.
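A plain-value Python sketch of the counting logic of Algorithm 3, assuming the genotype-count identities given as equations (2) above:

    def fourfold_genotype_counts(sdb, k, mask):
        """Return 4*#AA, 4*#AB and 4*#BB per SNP for the donors selected
        by the mask, using the identities of equations (2)."""
        aa, ab, bb = [], [], []
        for j in range(k):
            A, B = sdb[2 * j], sdb[2 * j + 1]
            aa.append(sum(m * a * (a - b) for m, a, b in zip(mask, A, B)))
            ab.append(sum(m * 4 * a * b for m, a, b in zip(mask, A, B)))
            bb.append(sum(m * b * (b - a) for m, a, b in zip(mask, A, B)))
        return aa, ab, bb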

Transmission Disequilibrium Test
The transmission disequilibrium test (TDT) is applicable only if the data consist of parent-child trios. The test measures whether one homozygous genotype is over-represented compared to the other among affected children of heterozygous parents. For this we need to construct an index vector for detecting heterozygous parents. Note that A_i · B_i = 0 for homozygous allele combinations and A_i · B_i = 1 for heterozygous ones. Thus, we can construct an index vector

    z_i = (A_i^mother · B_i^mother) · (A_i^father · B_i^father)   (3)

that is 1 if both parents are heterozygous and 0 otherwise. Based on equations (2) and (3), Algorithm 4 computes the fourfold counts of genotypes AA and BB of children. This scaling will, again, be taken into account in the next step. The first input to Algorithm 4 is the shared genotype database [[SDB]] with 2k rows and n columns. The fourth input is a zero-one mask [[childMask]] of length n/3 that contains 1 if the corresponding child has the disease in question. This mask is different from the ones used in the previous algorithms, as it is applied only to 1/3 of the donors, namely the children. For this test, we assume that parent-child trios are stored as consecutive column triples, where the first column corresponds to the mother, the second to the father and the third to the child. This assumption is made purely for notational convenience; one can easily modify the algorithm for an arbitrary donor layout, as long as the information about which columns correspond to each mother-father-child trio is public.
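A plain-value Python sketch of the counting logic of Algorithm 4 under the same assumptions (fourfold genotype counts as in equations (2), trios stored as consecutive columns):

    def tdt_counts(sdb, k, n, child_mask):
        """Fourfold counts of AA (u) and BB (v) genotypes among affected
        children whose parents are both heterozygous. Columns come in
        (mother, father, child) triples."""
        u, v = [0] * k, [0] * k
        for j in range(k):
            A, B = sdb[2 * j], sdb[2 * j + 1]
            for t in range(n // 3):
                mo, fa, ch = 3 * t, 3 * t + 1, 3 * t + 2
                # z = 1 iff both parents are heterozygous (A*B = 1 per parent)
                z = (A[mo] * B[mo]) * (A[fa] * B[fa])
                u[j] += child_mask[t] * z * A[ch] * (A[ch] - B[ch])
                v[j] += child_mask[t] * z * B[ch] * (B[ch] - A[ch])
        return u, v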

Security analysis
The Sharemind framework guarantees that nothing about the inputs and outputs is leaked when the secure computation protocols process input shares to produce output shares [BLW08]. None of the algorithms presented so far declassifies any values, and therefore all these algorithms are secure. More formally, note that the flow of the protocol execution is independent of the genotype data and of the masks used to select donors. Hence, the corresponding simulation strategy in the formal security proof is straightforward: use the canonical simulator for each private operation in the algorithm specification.

General Algorithm Information
All four algorithms given in this section calculate and publish a zero-one vector significance whose entries are set to 1 if the corresponding SNP is significant according to the test statistic in question.
The arguments of all four functions follow the same general pattern. As the first input, the algorithms are given the shared genotype database [[SDB]] with 2k rows and n columns, where k is the number of SNPs and n is the number of donors. The third and fourth inputs [[casemask]] and [[ctrlmask]] are the shared case and control masks that have the value 1 for each donor that belongs to the respective group, and 0 otherwise. These vectors are either given explicitly by the data analyst or computed securely from secret-shared phenotype attributes with algorithms similar to Algorithm 1.
As mentioned in Section 1, secure floating-point arithmetic is very slow and its implementation is too imprecise for our needs. In the algorithms we therefore avoid division and rewrite the significance condition T ≥ T_α in terms of integer operations. Let T_α be represented as a fraction p/q and the test statistic as a fraction x/y. Then, for positive denominators, the condition T ≥ T_α is equivalent to the condition x·q ≥ y·p. The last two inputs p and q of the algorithms are the dividend and divisor that make up the user-specified threshold p/q to which the test statistic is compared.
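A minimal Python illustration of this integer-only comparison:

    from fractions import Fraction

    def exceeds_threshold(x, y, p, q):
        """Test x/y >= p/q using only integer arithmetic (y, q > 0)."""
        return x * q >= y * p

    # Sanity check against exact rational arithmetic:
    assert exceeds_threshold(7, 2, 10, 3) == (Fraction(7, 2) >= Fraction(10, 3))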

Standard χ² Test
Let a, b, c, and d denote the counts of observed alleles A and B in the cases and controls, as specified in the main body of the article. The test statistic for the standard χ²-test can be expressed as

    T₁ = (a + b + c + d)·(a·d − b·c)² / ((a + b)·(c + d)·(a + c)·(b + d)).   (4)
Algorithm 5 uses these four variables to calculate the test statistic T₁ based on equation (4). The data for the contingency table is obtained by using Algorithm 2. Using the reasoning of Subsection 4.1 to avoid floating-point operations, equation (4) gives us a test whose threshold T_α = p/q can be determined by computing upper α-quantiles of the χ²-distribution with one degree of freedom. We discuss how to determine an appropriate significance level α in Subsection 4.7.
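A plain-value Python sketch of the test of Algorithm 5, assuming the form of equation (4) given above:

    def chi2_significant(a, b, c, d, p, q):
        """T1 = n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), compared to p/q
        via the integer condition x*q >= y*p."""
        n = a + b + c + d
        x = n * (a * d - b * c) ** 2                  # numerator of T1
        y = (a + b) * (c + d) * (a + c) * (b + d)     # denominator of T1
        return int(x * q >= y * p)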

Equiproportionality of Allele A in Both Groups
To determine whether allele A is equiproportional in both the case and the control group, we use the same counts a, b, c and d as for the standard χ²-test. Thus, the initial part of Algorithm 6 coincides with Algorithm 5. As the test statistic for determining equiproportionality (equation (6) in the main article) is again a ratio of integer expressions in these counts, Algorithm 6 uses the equivalent integer condition to sieve out significant SNPs.

Cochran-Armitage Test
To calculate the test statistic for the Cochran-Armitage test for trend, we need to compute all counts in Table 1. The data for the contingency table is obtained by using Algorithm 3. The test statistic T₃ for the Cochran-Armitage test for trend is a ratio of integer expressions in the genotype counts (equation (7) in the main article), where the weights α₀, α₁, α₂ are chosen according to the suspected influence mechanism. Algorithm 7 uses the data from Table 1 to calculate the test statistic T₃ based on equation (7). As additional inputs, Algorithm 7 is given the weights α₀, α₁ and α₂. As mentioned in Subsection 3.2, the counts returned by Algorithm 3 are fourfold. The statistic is therefore evaluated on the fourfold counts, and the denominator y must be multiplied by 4 to correct the resulting offset.

Transmission Disequilibrium Test
To calculate the test statistic for the transmission disequilibrium test (TDT), we need to know the count u of AA and the count v of BB genotypes among the affected children. The statistic is computed as

    T₄ = (u − v)² / (u + v).   (8)

The counts u and v are obtained by using Algorithm 4. Algorithm 8 uses this data to evaluate the test T₄ ≥ p/q using equation (8). As an additional input, Algorithm 8 is given the number of donors n. Instead of two masks, the algorithm gets a zero-one mask [[childMask]] for marking children in the case group. This mask vector is three times shorter than the number of donors, as it is applied only to 1/3 of the donors' data, as described for Algorithm 4.
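A plain-value Python sketch of the test of Algorithm 8, using equation (8) and the integer comparison from Subsection 4.1:

    def tdt_significant(u, v, p, q):
        """T4 = (u - v)^2 / (u + v), compared to p/q as integers.
        Note that the fourfold scaling of Algorithm 4's outputs must
        additionally be compensated, as discussed above."""
        x = (u - v) ** 2
        y = u + v
        return int(x * q >= y * p)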

Security analysis
Sharemind ensures that inputs and intermediate values remain secure throughout the computation. However, all the test algorithms presented in this section have a single declassification call at the end of the algorithm. Each call declassifies a vector of boolean values, where ones mark SNPs for which the difference between the case and control groups was significant and zeroes mark SNPs for which no significant difference was observed. As the final declassified vector of zeroes and ones is equivalent to the desired list of significantly differentiated SNPs, the declassification reveals nothing beyond the desired outcome.

Multiple testing and FDR correction
In order to determine a proper lower bound p/q for the statistical tests, we must fix a significance level α. The standard significance threshold used in statistics is 0.05. As we do a genome-wide scan for finding significantly differentiated SNPs, we are in the multiple hypothesis testing scenario. As we know the total number of tested SNPs, we can apply the Bonferroni correction, i.e., set α = 0.05/k, where k is the number of observed SNPs. This correction is often too harsh compared to FDR correction. Unfortunately, FDR correction is not so straightforward to apply in a privacy-preserving manner, as it works on p-values instead of easily computable test statistics.
Next, we sketch how the standard Benjamini-Hochberg procedure [BH95] can be implemented without leaking the individual test statistics. First, recall that the original BH-correction procedure orders the p-values in ascending order and then finds the largest i such that

    p_(i) ≤ i·α/k,   (9)

where p_(i) is the ith smallest p-value and k is the total number of p-values. Second, note that all statistical tests we have devised for secure genome-wide association studies are one-tailed tests whose sampling distribution is independent of the sizes of the case and control groups.¹ As a result, we can reformulate the problem in terms of test statistics.

More precisely, note that all statistical tests for GWAS use the upper tail of the χ²-distribution with one degree of freedom to compute the p-value. Hence, the correspondence between p-values and test statistics is anti-monotone, and we can consider decreasingly ordered test statistics t_(1) ≥ . . . ≥ t_(k) instead of p-values: condition (9) is equivalent to

    t_(i) ≥ Q_upper(i·α/k),

where Q_upper denotes the upper quantile function of the χ²-distribution with one degree of freedom. As the significance threshold α and the function Q_upper are public, we can publicly compute fractional approximations p_(i)/q_(i) of the significance thresholds and use secure computation to evaluate the comparisons t_(i) ≥ p_(i)/q_(i). Let the test statistic t_i be represented as a fraction u_i/v_i and let u and v be the corresponding vectors of numerator and denominator values. Then Algorithm 9 depicts the corresponding privacy-preserving significance testing procedure:

1. Use oblivious sorting to sort the triples ([[u_i]], [[v_i]], [[c_i]]) in descending order with respect to the secret ratios t_i = u_i/v_i. Let ([[u_(1)]], [[v_(1)]], [[c_(1)]]), . . . , ([[u_(k)]], [[v_(k)]], [[c_(k)]]) be the result.
2. Securely compute the comparison bits [[s_i]] = ([[u_(i)]]·q_(i) ≥ [[v_(i)]]·p_(i)) for i = 1, . . . , k.
3. Start opening the values s_k, s_(k−1), . . . , s_1 until the first s_(i*) = 1 is revealed. Open the locations c_(1), . . . , c_(i*) and declare the corresponding SNPs significant.

The procedure reveals only the locations c_(i) and the ordering of the significant SNPs. It is even possible to hide the ordering if the shares [[c_(1)]], . . . , [[c_(i*)]] are obliviously shuffled before opening, so that c_(1), . . . , c_(i*) are opened in an unknown random order.
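For reference, the following plain-value Python sketch captures the selection logic of Algorithm 9 without the cryptographic machinery:

    def bh_significant(stats, thresholds):
        """stats[i] is the test statistic of SNP i; thresholds[i] is the
        publicly computed value Q_upper((i+1)*alpha/k) for rank i+1.
        Returns the locations of the significant SNPs."""
        order = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
        for i in range(len(stats), 0, -1):            # scan s_k, ..., s_1
            if stats[order[i - 1]] >= thresholds[i - 1]:
                return [order[j] for j in range(i)]   # significant locations
        return []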
The most complex step in this algorithm is the oblivious sorting step, which rearranges the triples in descending order according to the ratios t_i = u_i/v_i. This is usually implemented with oblivious compare-exchange blocks, which securely order two triples by their ratios: the block compares u₁·v₂ with u₂·v₁ and obliviously swaps the triples so that the one with the larger ratio comes first. Secure evaluation of such a block requires 8 multiplications and 1 comparison. One of the best parallel sorting networks is the pairwise sorting network, which needs n·(log₂ n)·(log₂ n − 1)/4 + n − 1 such OblCompEx gates [Par92]. In our case, this means that secure oblivious sorting requires approximately 162.6 million secure multiplication and 20.3 million secure comparison operations. A back-of-the-envelope estimate based on Sharemind performance data shows that this would take roughly 10 minutes [BNTW12].
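The following Python sketch shows the arithmetic skeleton of such a compare-exchange block and matches the operation count above:

    def obl_comp_ex(t1, t2):
        """Order two triples (u, v, c) by descending ratio u/v using
        1 comparison and 8 multiplications."""
        (u1, v1, c1), (u2, v2, c2) = t1, t2
        b = int(u1 * v2 >= u2 * v1)          # 2 multiplications, 1 comparison
        # Oblivious selection x2 + b*(x1 - x2): 6 further multiplications.
        first = (u2 + b * (u1 - u2), v2 + b * (v1 - v2), c2 + b * (c1 - c2))
        second = (u1 + b * (u2 - u1), v1 + b * (v2 - v1), c1 + b * (c2 - c1))
        return first, second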
¹ Some classical tests have sampling distributions that do depend on the sizes of the case and control groups. However, a general solution for these cases contains tedious technical details that are not relevant in our context of genome-wide association studies.

Optimisations for the FDR correction procedure
In most cases, we do not expect to see a large number of significant SNPs. In fact, lists of over 1000 elements are hard to interpret and follow up. Hence, we can consider a more conservative BH-correction procedure that always outputs a list of at most 1000 elements with false discovery rate below α. That is, we want to find the largest i ≤ 1000 that still satisfies equation (9). The latter is a much simpler task: we must first find the top 1000 elements, which can be done with O(k) comparisons. After that we must still sort the top 1000 elements, but this takes only approximately 26000 secure comparisons and thus significantly speeds up the FDR correction, so that the running time can be measured in seconds instead of minutes.
Another alternative is to reveal all test statistics in a random order, so that we can perform the BH-correction on public values while not learning which test statistic corresponds to which SNP location. This can be achieved with an oblivious shuffle, for which efficient protocols are known [LWZ11]. Performance benchmarks on shuffling show that for our data size, shuffling can be completed in under a minute. After the appropriate threshold is known, the list of relevant SNPs can be detected as usual.
To conclude, FDR correction is feasible if we control the maximal output size or make trade-offs between privacy and efficiency. However, the exact details and performance results are beyond the scope of this manuscript.

Privacy-Preserving Methods for Quality Control
In this section, we describe example methods for detecting anomalies in genomic data that might influence the results of genome-wide association studies. These simplified examples are meant to show what is possible, how much time it takes, and what trade-offs are possible between performance, anomaly-detection power and privacy.

Algorithms for Monitoring Data Quality
A human genome can have long regions that are missing compared to the reference genome (so-called genomic deletions). Overrepresentation of such deletions in the selected study group can skew the test results or be an interesting finding in itself. Due to the measurement procedure, such deletions induce NN measurements for SNPs located in the deleted region, as the probes for these SNPs have no complementary DNA material to hybridize with. In our setting, NN measurements are represented as (0, 0) pairs. Consequently, an entire block of SNPs is missing if and only if the corresponding block of measurement values sums to zero. Algorithm 10 extends this idea and performs this test in parallel for many blocks at the same time.
The algorithm takes as input the block length t and offset δ, and outputs secret-shared indicator values deletions[i, j] that show whether the ith block with offset δ of the jth donor is a streak of NN pairs. As the result is secret-shared, it can be processed further. For instance, we can compute the number of deletions for a selected set of individuals by summing [[deletions[i, j]]] over the selected donors j for each block i ∈ {0, . . . , k/t}, or test whether this sum exceeds a certain threshold. These types of tests are not computationally intensive, as additions are extremely cheap and the number of comparisons is proportional to the number of entries in SDB. As Sharemind can perform tens of thousands of secure comparisons per second, we estimate that such deletion detection would take under a minute [BNTW12].
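A plain-value Python sketch of the detection logic of Algorithm 10:

    def detect_deletions(sdb, k, n, t, delta):
        """deletions[i][j] = 1 iff the i-th length-t SNP block (shifted by
        offset delta) of donor j consists solely of NN, i.e. (0, 0), pairs."""
        blocks = (k - delta) // t
        deletions = [[0] * n for _ in range(blocks)]
        for i in range(blocks):
            lo, hi = delta + i * t, delta + (i + 1) * t
            for j in range(n):
                # Allele counts are non-negative, so the block sum is zero
                # exactly when every SNP in the block was measured as NN.
                s = sum(sdb[2 * r][j] + sdb[2 * r + 1][j] for r in range(lo, hi))
                deletions[i][j] = int(s == 0)
        return deletions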
The second common problem in SNP measurements is (geographic) variability in allele frequencies. Although each SNP commonly has only two alleles, a basic form and its point mutation, in some subpopulations the mutations are different. Hence, one often filters out SNPs that are effectively constant, i.e., one allele is almost omnipresent in all individuals, or whose minor allele frequency is below a certain threshold. These conditions are straightforward to test. We first compute the frequencies of both alleles or allele pairs using Algorithms 2 and 3. Given the frequencies, we can use thresholding to detect whether the counts of both alleles are over a certain threshold or not. Again, since comparisons are not too expensive, this operation should not take more than a few minutes.

Examples of Outlier Detection and Filtering Algorithms
Note that there are two ways to use the results of the various data quality checks. First, we can publish the end results and study them as in usual data analysis projects. The second option is to use the corresponding zero-one index vectors to filter out invalid outputs without leaking which SNPs or SNP blocks are invalid. This option is required when the results of the data quality tests themselves leak private information. For instance, the fact that a certain block deletion occurs in the study might reveal the identity of a donor when the deletion is rare.
In the following, we sketch how oblivious filtering can be used in the context of phenotypic data quality indicators. It can be applied analogously to the genotypic indicators discussed above by switching rows and columns.

Let the vector x consist of measurements of a specific phenotypic trait such as blood pressure. Then we might want to exclude patients with extreme values of blood pressure from the study. For blood pressure, we can define the threshold based on background knowledge, e.g., declare patients with systolic blood pressure over 170 mmHg inapplicable for the study. In general, we might compute the threshold based on the standard deviation or on the 5% and 95% quantiles.

As the variance can be expressed in terms of simple arithmetic operations, it is straightforward to compute shares of n³·Var(x) = n²·Σ_i x_i² − n·(Σ_i x_i)². Similarly, we can convert the outlier criterion |x_i − x̄| ≤ α·√Var(x) into an equivalent condition without division operations:

    n·(n·x_i − Σ_j x_j)² ≤ α²·n³·Var(x).

Note that this condition can be securely evaluated even if the size n is in secret-shared form. The computational overhead is marginal, as each test requires three secure multiplications and a single comparison operation.

The simplest way to find the upper and lower quantiles is to use oblivious sorting. For instance, if the vector x has 1000 elements, then the elements x_(50) and x_(950) correspond to the lower and upper 5% quantiles. Of course, we can use more efficient algorithms for computing the quantiles; however, even with oblivious sorting this step is practically feasible for 10000 individuals. In more elaborate cases the quantile falls between two elements x_(i) and x_(i+1) and we have to use standard interpolation formulae, but this does not significantly increase the complexity.

The resulting zero-one validity indicators can be obliviously applied to the outputs, e.g., by multiplying the significance vector with them before declassification. Alternatively, we can eliminate the corresponding rows of the SDB by using oblivious filtering techniques [LWZ11]. This approach is useful in conjunction with FDR correction, as it reduces the number of performed statistical tests and thus increases the statistical power. As oblivious filtering is roughly equivalent to an oblivious database shuffle, it is clearly feasible even for 300000 SNPs [LWZ11].
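A plain-value Python sketch of the division-free outlier test, with α given as a fraction alpha_num/alpha_den:

    def outlier_mask(x, alpha_num, alpha_den):
        """Evaluate |x_i - mean| <= alpha * sqrt(Var(x)) as the integer
        condition n*(n*x_i - S)^2 * alpha_den^2 <= alpha_num^2 * n^3*Var(x),
        where S = sum(x) and n^3*Var(x) = n^2*sum(x_i^2) - n*S^2."""
        n, S = len(x), sum(x)
        n3var = n * n * sum(v * v for v in x) - n * S * S
        keep = []
        for v in x:
            lhs = n * (n * v - S) ** 2 * alpha_den ** 2
            keep.append(int(lhs <= alpha_num ** 2 * n3var))
        return keep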

Performance evaluation
We tested all algorithms by splitting the set of donors into two equal parts. In biology, this is a rather uncommon way to test genome-wide association algorithms, as the relevance of the results greatly depends on how well the case and control groups are chosen. For our purposes, however, this strategy is as good as any other, as we do not test the statistical properties of the algorithms. One can wonder whether the algorithms would run slower if we chose the case and control groups differently. This is impossible, as the number of operations depends only on the size of the SNP database. In fact, an algorithm would be insecure if its running time depended on a group's size or its members.