Identifying disease-causing mutations with privacy protection

Abstract

Motivation: The use of genome data for diagnosis and treatment is becoming increasingly common. Researchers need access to as many genomes as possible to interpret the patient genome, to obtain some statistical patterns and to reveal disease–gene relationships. The sensitive information contained in the genome data and the high risk of re-identification increase the privacy and security concerns associated with sharing such data. In this article, we present an approach to identify disease-associated variants and genes while ensuring patient privacy. The proposed method uses secure multi-party computation to find disease-causing mutations under specific inheritance models without sacrificing the privacy of individuals. It discloses only variants or genes obtained as a result of the analysis. Thus, the vast majority of patient data can be kept private.

Results: Our prototype implementation performs analyses on thousands of genomic data in milliseconds, and the runtime scales logarithmically with the number of patients. We present the first inheritance model (recessive, dominant and compound heterozygous) based privacy-preserving analyses of genomic data to find disease-causing mutations. Furthermore, we re-implement the privacy-preserving methods (MAX, SETDIFF and INTERSECTION) proposed in a previous study. Our MAX, SETDIFF and INTERSECTION implementations are 2.5, 1122 and 341 times faster than the corresponding operations of the state-of-the-art protocol, respectively.

Availability and implementation: https://gitlab.com/DIFUTURE/privacy-preserving-genomic-diagnosis.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
In this supplementary document, we present material omitted from the main article because of the page limit: the algorithmic descriptions of our methods and the results of our experiments on synthetic data.

Operations with Boolean Gates
In this section, we show how the RECESSIVE, DOMINANT, COMPHET, MAX, SETDIFF and INTERSECTION operations can be implemented with only boolean gates. With the following figures, we show that the number of non-linear gates required to implement these operations with the boolean circuit increases linearly with the number of participants.
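The relation between a linear gate count and a shallow circuit can be illustrated with a small sketch. It assumes that the per-position AND over n participants is arranged as a balanced binary tree; that arrangement is an assumption made here for illustration, and `and_tree` is a hypothetical helper, not part of the published implementation. Under this assumption the reduction uses n - 1 AND gates (linear in n) while its depth is ceil(log2 n).

```python
from math import ceil, log2

def and_tree(bits):
    """AND-reduce a list of bits with a balanced tree.

    Returns (result, number_of_and_gates, circuit_depth).
    """
    gates = 0
    depth = 0
    layer = list(bits)
    while len(layer) > 1:
        nxt = []
        # Pair up neighbours; each pair costs one non-linear (AND) gate.
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] & layer[i + 1])
            gates += 1
        # An unpaired element is carried to the next layer at no cost.
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], gates, depth

for n in (2, 5, 16, 33):
    result, gates, depth = and_tree([1] * n)
    print(n, result, gates, depth)
```

For a vector of k positions, the same tree is evaluated independently at each position, so the total gate count in this sketch is k(n - 1).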

Algorithms
In this section, we give the algorithmic descriptions of our privacy-preserving methods implemented with both arithmetic and boolean gates.

Notations
We denote a shared value x by x^b_a, where b ∈ {A, B} indicates the sharing type (A denotes Arithmetic Sharing and B denotes Boolean Sharing) and a ∈ {0, 1} is the index of the computing party. We use ⊕, ∧, | and = for the XOR, AND, OR and EQ operations, respectively. x^b_a ← {y} initializes a shared vector whose elements all equal y and whose size equals the size of the variant or the gene vector. x^b_a ← y initializes a shared value equal to y. If x^b_a is a shared vector of random values, l is the bit length of the shares.
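The two sharing types in this notation can be sketched for two computing parties as follows. This is a generic illustration of additive sharing modulo 2^l and XOR sharing, not the authors' code; the function names and the choice l = 32 are assumptions made for the example.

```python
import secrets

L = 32            # assumed bit length l of the shares
MOD = 1 << L      # arithmetic shares live in the ring of integers modulo 2^l

def share_arithmetic(x):
    """Split x into two additive shares: x = s0 + s1 (mod 2^l)."""
    s0 = secrets.randbelow(MOD)
    return s0, (x - s0) % MOD

def reconstruct_arithmetic(s0, s1):
    return (s0 + s1) % MOD

def share_boolean(x):
    """Split x into two XOR shares: x = s0 XOR s1."""
    s0 = secrets.randbits(L)
    return s0, x ^ s0

def reconstruct_boolean(s0, s1):
    return s0 ^ s1

# XOR of two Boolean-shared values is computed locally, share by share.
x0, x1 = share_boolean(0b0101)
y0, y1 = share_boolean(0b0011)
z0, z1 = x0 ^ y0, x1 ^ y1
print(bin(reconstruct_boolean(z0, z1)))
```

Each share on its own is uniformly random, so a single party learns nothing about x. XOR (and addition, for arithmetic shares) needs no communication, whereas AND is the non-linear operation that requires interaction between the parties.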

RECESSIVE operation on Variant Vectors
In this operation, a researcher wants to test the recessive inheritance model to find disease-causing variants. In the recessive model, the mother and the father are heterozygous carriers and the affected siblings have homozygous variants. If there are unaffected siblings, they are considered heterozygous carriers or homozygous reference. Non-family individuals included in the analysis are considered homozygous reference. Figure 1 shows how to perform the RECESSIVE operation on variant vectors with boolean operators. The detailed description of our privacy-preserving RECESSIVE operation is given in Algorithm 2.
Algorithm 2: Secure Recessive Operation
…
for i ← 1 to s do /* find common non-homozygous and non-heterozygous variants of others */
…
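As a plaintext reference for the logic of the RECESSIVE operation (without any secret sharing), the filter can be sketched on 0/1 lists. The data layout is an assumption for illustration: each individual is represented by a heterozygous vector and a homozygous vector over the same variant list, and `recessive` is a hypothetical function name.

```python
def recessive(mother_het, father_het, affected_hom, unaffected_hom, others):
    """Mark positions consistent with the recessive model with 1.

    mother_het, father_het: heterozygous vectors of the parents
    affected_hom:   list of homozygous vectors, one per affected sibling
    unaffected_hom: list of homozygous vectors, one per unaffected sibling
    others:         list of (het, hom) pairs for non-family individuals
    """
    # Both parents must be heterozygous carriers.
    out = [m & f for m, f in zip(mother_het, father_het)]
    # Every affected sibling must be homozygous for the variant.
    for hom in affected_hom:
        out = [o & h for o, h in zip(out, hom)]
    # Unaffected siblings may be carriers or reference, but not homozygous.
    for hom in unaffected_hom:
        out = [o & (1 ^ h) for o, h in zip(out, hom)]
    # Non-family individuals must be homozygous reference (neither het nor hom).
    for het, hom in others:
        out = [o & (1 ^ (a | b)) for o, a, b in zip(out, het, hom)]
    return out

candidates = recessive(
    mother_het=[1, 1, 1, 0],
    father_het=[1, 1, 1, 1],
    affected_hom=[[1, 1, 1, 1]],
    unaffected_hom=[[0, 1, 0, 0]],
    others=[([0, 0, 1, 0], [0, 0, 0, 0])],
)
print(candidates)
```

In the privacy-preserving setting, the AND operations would run on Boolean shares, while the negation (XOR with 1) is linear and can be applied locally.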

DOMINANT operation on Variant Vectors
In this operation, the affected siblings have a single copy of a variant. Thus, the parentship information is not important. If there are unaffected siblings, they are considered heterozygous carriers or homozygous reference. Non-family individuals included in the analysis are considered heterozygous carriers or homozygous reference. Figure 2 shows how to perform the DOMINANT operation on variant vectors with boolean operators. The detailed description of our privacy-preserving DOMINANT operation is given in Algorithm 3.
Algorithm 3: Secure Dominant Operation
input: … are homozygous variant vectors, p is the number of affected siblings, r is the number of unaffected siblings, s is the number of others, t ∈ {0, 1} is the index of the proxy server
output: a^B_t, where a is a vector that gives the locations on which all affected siblings have heterozygous variants and all unaffected siblings and others have homozygous reference variants; the locations are marked with one
…
/* a is a vector that gives the locations on which all affected siblings have heterozygous variants; the locations are marked with zero */
for i ← 1 to r do /* find common homozygous reference variants of unaffected siblings */
…
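A plaintext sketch of the DOMINANT filter, following the output description of Algorithm 3 (all affected siblings heterozygous, all unaffected siblings and others homozygous reference), might look as follows. The (het, hom) pair layout and the name `dominant` are assumptions made for illustration.

```python
def dominant(affected_het, unaffected, others):
    """Mark positions consistent with the dominant model with 1.

    affected_het: list of heterozygous vectors, one per affected sibling
    unaffected:   list of (het, hom) pairs for unaffected siblings
    others:       list of (het, hom) pairs for non-family individuals
    """
    size = len(affected_het[0])
    out = [1] * size
    # All affected siblings must carry the variant heterozygously.
    for het in affected_het:
        out = [o & h for o, h in zip(out, het)]
    # Unaffected siblings and others must be homozygous reference.
    for het, hom in list(unaffected) + list(others):
        out = [o & (1 ^ (a | b)) for o, a, b in zip(out, het, hom)]
    return out

result = dominant(
    affected_het=[[1, 1, 1, 0], [1, 1, 1, 1]],
    unaffected=[([0, 1, 0, 0], [0, 0, 0, 0])],
    others=[([0, 0, 0, 0], [0, 0, 1, 0])],
)
print(result)
```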

COMPHET operation on Variant Vectors
The individual having a compound heterozygous mutation should have at least two heterozygous mutations in a given gene, one from the mother and the other from the father. The unaffected siblings are considered as homozygous carriers or homozygous reference. Non-family individuals included in the analysis are considered as heterozygous carriers or homozygous reference. Figure 3 shows how to perform the COMPHET operation on variant vectors with boolean operators. The detailed description of our privacy-preserving COMPHET operation is given in Algorithm 4.
Algorithm 4 (excerpt):
/* m is a vector that gives the locations on which all affected siblings and the mother have non-heterozygous variants; the locations are marked with zero */
… t ∈ m^B_t and gene g_i …
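The core of the compound heterozygous test can be sketched in plaintext under several stated assumptions: a variant counts as maternally inherited when all affected siblings and the mother are heterozygous and the father is not, and symmetrically for the father; `gene_of` is a hypothetical list that maps each variant position to a gene identifier; and the filters for unaffected siblings and non-family individuals are omitted. This is an illustration of the inheritance pattern, not a reconstruction of Algorithm 4.

```python
def comphet_genes(affected_het, mother_het, father_het, gene_of):
    """Return genes with at least one maternal and one paternal het variant."""
    size = len(mother_het)
    # Positions where every affected sibling is heterozygous.
    common = [1] * size
    for het in affected_het:
        common = [c & h for c, h in zip(common, het)]
    # Maternally inherited: shared with the mother, absent in the father.
    m = [c & mo & (1 ^ fa) for c, mo, fa in zip(common, mother_het, father_het)]
    # Paternally inherited: shared with the father, absent in the mother.
    f = [c & fa & (1 ^ mo) for c, mo, fa in zip(common, mother_het, father_het)]
    from_mother = {gene_of[i] for i in range(size) if m[i]}
    from_father = {gene_of[i] for i in range(size) if f[i]}
    # A gene qualifies when it has one variant from each parent.
    return sorted(from_mother & from_father)

genes = comphet_genes(
    affected_het=[[1, 1, 1, 1, 0]],
    mother_het=[1, 0, 1, 0, 0],
    father_het=[0, 1, 1, 1, 0],
    gene_of=["G1", "G1", "G2", "G2", "G3"],
)
print(genes)
```

In the example, G1 has one variant from each parent, whereas the variants in G2 are either shared by both parents or come from the father only.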

MAX operation on Gene Vectors
In this operation, a researcher is interested in a cohort of individuals with the same disease or phenotype. He or she wants to find the most commonly mutated genes in the cohort, which are potentially associated with the disease. The gene vector indicates whether the genes have rare variants. Therefore, the sum of the gene vectors of the individuals in the cohort allows the researcher to find the most mutated genes. The MAX operation with boolean operators is illustrated in Figure 4. The detailed description of our privacy-preserving MAX operation is given in Algorithm 5.
Algorithm 5: Secure Max Operation
input: … is a gene vector, n is the number of patients, t ∈ {0, 1} is the index of the proxy server
output: m^B_t, where m is a gene vector that gives the locations of the most mutated genes; the locations are marked with one
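In plaintext, the MAX operation amounts to summing the gene vectors and marking the positions that reach the maximum count. The sketch below assumes that ties are all marked with one; `max_genes` is a hypothetical name.

```python
def max_genes(gene_vectors):
    """Mark the most frequently mutated genes in a cohort with 1.

    gene_vectors: list of 0/1 lists, one per patient; entry j is 1 when
    gene j carries a rare variant in that patient.
    """
    # Column-wise sum: number of patients with a rare variant in each gene.
    counts = [sum(column) for column in zip(*gene_vectors)]
    top = max(counts)
    # Equality test against the maximum (the EQ operation in the notation).
    return [1 if c == top else 0 for c in counts]

cohort = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
]
print(max_genes(cohort))
```

The summation is linear and therefore suits arithmetic sharing, while the comparison against the maximum is the part that needs a non-linear protocol.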

SETDIFF operation on Variant Vectors
In this operation, a researcher wants to find the variants that cause a disease in a family. He or she wants to analyze an affected individual and unaffected individuals in the family. Non-family members can be included in the analysis for more accurate results. For the SETDIFF operation, we need a bit vector that indicates, for each variant in the given variant list, whether the individual carries it. We can obtain this bit vector by adding the vector of heterozygous variants to the vector of homozygous variants. The SETDIFF operation with boolean operators is illustrated in Figure 6. The detailed description of our privacy-preserving SETDIFF operation is given in Algorithm 6.
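A plaintext sketch of SETDIFF: the presence vector of each individual is the element-wise sum of the heterozygous and homozygous vectors, and the result keeps the variants of the affected individual that no other individual carries. The sketch assumes that the two vectors are disjoint (a variant is never both heterozygous and homozygous in one individual), so the sum stays in {0, 1}; the function names are hypothetical.

```python
def ownership(het, hom):
    """Presence vector: 1 where the individual carries the variant.

    Assumes het and hom are disjoint, so the sum never exceeds 1.
    """
    return [a + b for a, b in zip(het, hom)]

def setdiff(affected, unaffected):
    """Variants of the affected individual absent from all other individuals.

    affected:   (het, hom) pair of the affected individual
    unaffected: list of (het, hom) pairs (family or non-family members)
    """
    result = ownership(*affected)
    for het, hom in unaffected:
        own = ownership(het, hom)
        # Keep a position only if this individual does not carry the variant.
        result = [r & (1 ^ o) for r, o in zip(result, own)]
    return result

diff = setdiff(
    affected=([1, 0, 1, 0], [0, 1, 0, 0]),
    unaffected=[
        ([1, 0, 0, 0], [0, 0, 0, 0]),
        ([0, 0, 0, 1], [0, 0, 0, 0]),
    ],
)
print(diff)
```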

INTERSECTION operation on Variant Vectors
In this operation, a researcher is interested in a group of unrelated individuals. He or she wants to find variants seen in all the individuals in this group. We have two variant vectors for each patient. For the INTERSECTION operation, we need a bit vector that indicates, for each variant in the given variant list, whether the individual carries it. We can obtain this bit vector by adding the vector of heterozygous variants to the vector of homozygous variants. Variants that are seen in all the individuals are found by computing the logical AND of the variant vectors of all the individuals. The INTERSECTION operation is illustrated in Figure 5. The detailed description of our privacy-preserving INTERSECTION operation is given in Algorithm 7.
Algorithm 7:
input: … is a heterozygous variant vector, p is the number of individuals, t ∈ {0, 1} is the index of the proxy server
output: a^B_t, where a is a variant vector that gives the locations on which all individuals have variants; the locations are marked with one
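The INTERSECTION logic can be sketched in plaintext as an AND over the presence vectors of all individuals. As in the description above, each presence vector is the sum of the heterozygous and homozygous vectors, which are assumed to be disjoint; `intersection` is a hypothetical name.

```python
def intersection(individuals):
    """Mark variants carried by every individual with 1.

    individuals: non-empty list of (het, hom) pairs of 0/1 lists.
    """
    result = None
    for het, hom in individuals:
        # Presence vector; het and hom are assumed disjoint.
        own = [a + b for a, b in zip(het, hom)]
        if result is None:
            result = own
        else:
            result = [r & o for r, o in zip(result, own)]
    return result

shared = intersection([
    ([1, 0, 1], [0, 1, 0]),
    ([1, 0, 0], [0, 0, 1]),
    ([0, 0, 1], [1, 0, 0]),
])
print(shared)
```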

Experiments on Synthetic Data
In this section, we give the detailed results of our experiments on synthetic data. All tests were performed on a mother, a father, 14 unaffected siblings, and 16 unaffected siblings. The number of non-family individuals varies.