Accessibility of covariance information creates vulnerability in Federated Learning frameworks

Abstract

Motivation: Federated Learning (FL) is gaining traction in various fields, such as healthcare, as it enables integrative data analysis without sharing sensitive data. However, the risk of data leakage caused by malicious attacks must be considered. In this study, we introduce a novel attack algorithm that relies on being able to compute sample means and sample covariances, and to construct known linearly independent vectors, on the data owner side.

Results: We show that these basic functionalities, which are available in several established FL frameworks, are sufficient to reconstruct privacy-protected data. Additionally, the attack algorithm is robust to defense strategies that involve adding random noise. We demonstrate the limitations of existing frameworks and propose potential defense strategies, analyzing the implications of using differential privacy. The novel insights presented in this study will aid in the improvement of FL frameworks.

Availability and implementation: The code examples are provided at GitHub (https://github.com/manuhuth/Data-Leakage-From-Covariances.git). The CNSIM1 dataset, which we used in the manuscript, is available within the DSData R package (https://github.com/datashield/DSData/tree/main/data).


Supplementary Material
Proof of correctness for the Covariance-Based Attack Algorithm

In this study, we consider an attack by a malicious client and provide an algorithm for data reconstruction based on covariance information. In the following, we provide the mathematical derivation of the algorithm.
For the attack, it is necessary to compute, on the server side, the sample mean
\[ \mathrm{Mean}(x_{j,k}) = \frac{1}{n_j} \sum_{s=1}^{n_j} x_{j,k}^{(s)} \]
and the sample covariances $\mathrm{Cov}(y_i, x_{j,k})$ for all $i = 1, 2, \ldots, n_j$, and to return them to the client side. The vectors $y_i$ are chosen in a way that ensures their linear independence.
To reconstruct $x_{j,k}$, we exploit the fact that the sample covariance can be reformulated to determine the inner product
\[ \langle y_i, x_{j,k} \rangle = (n_j - 1)\,\mathrm{Cov}(y_i, x_{j,k}) + n_j\,\mathrm{Mean}(y_i)\,\mathrm{Mean}(x_{j,k}). \tag{5} \]
We combine the equations for $i = 1, \ldots, n_j$ from (5) into a system of equations in matrix form,
\[ Y^T x_{j,k} = b, \qquad b_i = (n_j - 1)\,\mathrm{Cov}(y_i, x_{j,k}) + n_j\,\mathrm{Mean}(y_i)\,\mathrm{Mean}(x_{j,k}), \tag{6} \]
where $Y = (y_1, \ldots, y_{n_j})$ collects the known vectors as columns. Since the vectors $y_i$ were chosen to be linearly independent, $Y$, and therefore also $Y^T$, is invertible. Hence, we can multiply both sides of equation (6) by the inverse of $Y^T$ to obtain
\[ x_{j,k} = (Y^T)^{-1} b, \]
where the right-hand side is known to the client. This provides a constructive proof of the recovery of $x_{j,k}$ using the proposed approach.
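The derivation above can be sketched numerically. The following is a minimal illustration, not the paper's reference implementation: the `server_mean` and `server_cov` functions are hypothetical stand-ins for the aggregate queries an FL framework would expose, and the random choice of $Y$ is one convenient way to obtain linearly independent vectors.

```python
# Sketch of the Covariance-Based Attack Algorithm derived above.
# server_mean / server_cov are stand-ins for FL framework aggregate calls.
import numpy as np

rng = np.random.default_rng(0)

n_j = 5                                    # number of samples on server j
x_jk = rng.normal(10.0, 2.0, size=n_j)     # sensitive data held by the server

# Client side: construct n_j known, linearly independent vectors y_i
# (columns of Y); a random Gaussian matrix is almost surely invertible.
Y = rng.normal(size=(n_j, n_j))
assert np.linalg.matrix_rank(Y) == n_j

# Server-side primitives the attacker is allowed to call.
def server_mean(v):
    return v.mean()

def server_cov(a, b):
    return np.cov(a, b, ddof=1)[0, 1]      # sample covariance, divisor n_j - 1

# Client-side reconstruction via equations (5)-(6):
#   <y_i, x> = (n_j - 1) Cov(y_i, x) + n_j Mean(y_i) Mean(x)
mean_x = server_mean(x_jk)
b = np.array([(n_j - 1) * server_cov(Y[:, i], x_jk)
              + n_j * Y[:, i].mean() * mean_x for i in range(n_j)])
x_reconstructed = np.linalg.solve(Y.T, b)  # solve Y^T x = b

print(np.allclose(x_reconstructed, x_jk))  # recovery up to float error
```

Note that the attack needs only $n_j$ covariance queries plus one mean query per variable, which is well within the query budget of a typical analysis session.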
The same procedure can be repeated for every server $j$ and every variable $k$, yielding comprehensive information about potentially sensitive data on the servers.

Robustness of the Covariance-Based Attack Algorithm to noise perturbations in DataSHIELD
As a defense strategy against malicious clients, we consider the perturbation of means and covariances with noise, but without the additional use of differential privacy. More specifically, we consider the addition of zero-mean noise to means and covariances on the server side before sending them to the client side. Given only access to noisy data, one might assume that the client will not be able to reconstruct $x_{j,k}$ exactly. However, running the attack algorithm multiple times on the same variable and averaging the results yields a random variable that converges in probability to $x_{j,k}$, so that a malicious client with an appropriate communication and computational budget is able to retrieve all information about $x_{j,k}$. We prove that the empirical mean of the noisy results of the Covariance-Based Attack Algorithm, $\frac{1}{R} \sum_{r=1}^{R} x^{\text{noisy}}_{j,k,r}$, with $R$ denoting the number of calls in an attack, converges in probability to $x_{j,k}$, i.e., formally, for any $c > 0$,
\[ \lim_{R \to \infty} P\left( \left\| \frac{1}{R} \sum_{r=1}^{R} x^{\text{noisy}}_{j,k,r} - x_{j,k} \right\| > c \right) = 0, \tag{7} \]
where $x^{\text{noisy}}_{j,k,r}$ is the result of the $r$-th run of the Covariance-Based Attack Algorithm.
Let $\varepsilon_r$ be an $n_j$-dimensional random vector with mean $E(\varepsilon_r) = 0$ and covariance matrix $V(\varepsilon_r) = \sigma_\varepsilon^2 I_{n_j}$, with $\sigma_\varepsilon^2 < \infty$, modeling the noise added to the covariances, and let $\gamma_r$ be a random variable with mean $E(\gamma_r) = 0$ and variance $V(\gamma_r) = \sigma_\gamma^2 < \infty$, modeling the noise added to the mean. Furthermore, let $\gamma_r$ and $\varepsilon_r$ be uncorrelated, so that $E(\gamma_r \varepsilon_r) = 0$, and let the noise be drawn independently across runs $r$. Writing $m_y = (\mathrm{Mean}(y_1), \ldots, \mathrm{Mean}(y_{n_j}))^T$, the noisy version of equation (6) is given by
\[ Y^T x^{\text{noisy}}_{j,k,r} = b + (n_j - 1)\,\varepsilon_r + n_j\,\gamma_r\, m_y, \]
so that, with $A = (n_j - 1)(Y^T)^{-1}$ and $v = n_j (Y^T)^{-1} m_y$,
\[ x^{\text{noisy}}_{j,k,r} = x_{j,k} + A \varepsilon_r + \gamma_r v, \tag{8} \]
i.e., $x^{\text{noisy}}_{j,k,r}$ can be decomposed into the true $x_{j,k}$ and a noise term. Combining (7) and (8) shows that (7) is proven if the mean of the noise term converges in probability to zero, such that it is sufficient to show that
\[ \lim_{R \to \infty} P\left( \left\| \frac{1}{R} \sum_{r=1}^{R} \left( A \varepsilon_r + \gamma_r v \right) \right\| > c \right) = 0. \tag{9} \]
This can be shown by applying Markov's inequality to the squared norm:
\[ P\left( \left\| \frac{1}{R} \sum_{r=1}^{R} \left( A \varepsilon_r + \gamma_r v \right) \right\| > c \right) \le \frac{E\left( \left\| \frac{1}{R} \sum_{r=1}^{R} \left( A \varepsilon_r + \gamma_r v \right) \right\|^2 \right)}{c^2}. \tag{10} \]
Since (10) holds for all $R$, it is sufficient to show that the numerator of the right-hand side converges to $0$ as $R \to \infty$ in order to prove (9). To facilitate notation, the entries of $A^T A$ are denoted by $a^{(s,s')}$ and the entries of $\varepsilon_r$ by $\varepsilon_r^{(s)}$. Note that the following holds:
\[ E\left( \varepsilon_r^T A^T A\, \varepsilon_{r'} \right) = \begin{cases} \sigma_\varepsilon^2 \sum_{s=1}^{n_j} a^{(s,s)} & \text{if } r = r', \\ 0 & \text{otherwise}, \end{cases} \]
since the noise is independent across runs and $E\big(\varepsilon_r^{(s)} \varepsilon_r^{(s')}\big) = \sigma_\varepsilon^2 \mathbb{1}(s = s')$. Using this, together with the uncorrelatedness of $\gamma_r$ and $\varepsilon_r$, the numerator of the right-hand side of (10) can be written as
\[ E\left( \left\| \frac{1}{R} \sum_{r=1}^{R} \left( A \varepsilon_r + \gamma_r v \right) \right\|^2 \right) = \frac{1}{R} \left( \sigma_\varepsilon^2 \sum_{s=1}^{n_j} a^{(s,s)} + \sigma_\gamma^2 \|v\|^2 \right). \]
This is a constant multiplied by $R^{-1}$. Accordingly, (9) holds and therefore (7) is proven.
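The averaging argument can be illustrated empirically. This is a minimal sketch under stated assumptions: the noise levels `sigma_eps` and `sigma_gamma` are illustrative choices, and the "server" statistics are simulated locally rather than obtained from a real FL deployment.

```python
# Sketch of the averaging attack against noise perturbation: each call r
# receives means/covariances perturbed by zero-mean noise (eps_r, gamma_r);
# averaging R noisy reconstructions converges to the true x_{j,k}.
import numpy as np

rng = np.random.default_rng(1)

n_j = 5
x_jk = rng.normal(10.0, 2.0, size=n_j)     # sensitive server-side data
Y = rng.normal(size=(n_j, n_j))            # known linearly independent y_i

true_cov = np.array([np.cov(Y[:, i], x_jk, ddof=1)[0, 1] for i in range(n_j)])
mean_y = Y.mean(axis=0)                    # Mean(y_i) for each column
sigma_eps, sigma_gamma = 0.5, 0.5          # illustrative noise levels

def noisy_attack(noise_rng):
    # Server returns perturbed statistics; client solves Y^T x = b as before.
    cov_noisy = true_cov + noise_rng.normal(0, sigma_eps, size=n_j)
    mean_noisy = x_jk.mean() + noise_rng.normal(0, sigma_gamma)
    b = (n_j - 1) * cov_noisy + n_j * mean_y * mean_noisy
    return np.linalg.solve(Y.T, b)

def averaged_error(R):
    # Error of the empirical mean of R noisy reconstructions.
    est = np.mean([noisy_attack(rng) for _ in range(R)], axis=0)
    return np.linalg.norm(est - x_jk)

err_small, err_large = averaged_error(100), averaged_error(10000)
print(err_small, err_large)                # error shrinks roughly like R^(-1/2)
```

Increasing the number of calls $R$ by a factor of 100 reduces the reconstruction error by roughly a factor of 10, consistent with the $R^{-1/2}$ rate derived below.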
To assess the rate of convergence, we can repeat the convergence examination after multiplying the error by $R^{\alpha}$. Repeating the previous steps of the Markov inequality, one obtains
\[ P\left( R^{\alpha} \left\| \frac{1}{R} \sum_{r=1}^{R} x^{\text{noisy}}_{j,k,r} - x_{j,k} \right\| > c \right) \le \frac{R^{2\alpha - 1}\, C}{c^2}, \]
where $C$ is the same constant that appeared in the numerator above. From the last line, it follows that the error converges for $\alpha \in (-\infty, 0.5)$, and therefore the upper bound for the rate of convergence is $0.5$.
In the manuscript, we provide an analysis of the mean squared error for different numbers of calls by an attacker and different levels of noise. The relative mean squared error (RMSE) is defined there.

Derivation of privacy budget consumption of mean and covariances
The privacy budget consumption of the mean (covariance) algorithm is given by
\[ \epsilon = \frac{S_A \sqrt{2 \ln(1.25/\delta)}}{\sigma}, \]
where $\sigma^2$ is the noise level of the normally distributed noise that is added to the output of the mean (covariance). $S_A$ is the algorithm-dependent local sensitivity, which is the maximum amount by which the output of a function can change given a change of a single input. It therefore determines the amount of noise that needs to be added to the output of a function in order to ensure that it satisfies the privacy guarantees of differential privacy (Dwork, Roth, et al., 2014). Mathematically, the local sensitivity $S_A$ of an algorithm $A$ at a data set $x$ is defined as
\[ S_A(x) = \max_{x' : d(x, x') = 1} \left\| A(x) - A(x') \right\|, \]
where the maximum is taken over all data sets $x'$ that differ from $x$ in a single entry.

Fig. 6. Leakage results for TensorFlow Federated. The true data values from the first server of the CNSIM data set are plotted against the corresponding leaked data provided by the Covariance-Based Attack Algorithm.
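The calibration of noise to sensitivity can be sketched for the mean. This is a minimal illustration, not the paper's implementation: the data bounds `lo`, `hi` are illustrative assumptions (for data bounded in $[lo, hi]$, changing one of $n$ entries moves the mean by at most $(hi - lo)/n$), and the noise scale follows the standard Gaussian-mechanism calibration from Dwork & Roth (2014).

```python
# Sketch of differentially private noise calibration for the sample mean.
# Bounds lo, hi are assumed known; sigma follows the Gaussian mechanism.
import numpy as np

def mean_sensitivity(n, lo, hi):
    # Changing a single entry of bounded data moves the mean by at most this.
    return (hi - lo) / n

def gaussian_sigma(sensitivity, eps, delta):
    # Standard-deviation calibration of the Gaussian mechanism:
    # sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps.
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

def dp_mean(x, lo, hi, eps, delta, rng):
    # Clip to the assumed bounds so the sensitivity argument holds,
    # then add calibrated Gaussian noise to the mean.
    s = mean_sensitivity(len(x), lo, hi)
    return np.clip(x, lo, hi).mean() + rng.normal(0, gaussian_sigma(s, eps, delta))

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=1000)
print(dp_mean(x, 0, 10, eps=1.0, delta=1e-5, rng=rng))
```

Note the trade-off the averaging attack exploits: a fixed per-query $\sigma$ without query limits can be averaged away, whereas a differential privacy accountant makes the total budget $\epsilon$ grow with each call, forcing the attack's cumulative cost into the open.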