Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor

In order to mitigate the high communication cost in distributed and federated learning, various vector compression schemes, such as quantization, sparsification and dithering, have become very popular. In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round, while at the same time attempting to impart as little distortion (variance) to the communicated messages as possible, which minimizes the adverse effect of the compression on the overall number of communication rounds. However, intuitively, these two goals are fundamentally in conflict: the more compression we allow, the more distorted the messages become. We formalize this intuition and prove an uncertainty principle for randomized compression operators, thus quantifying this limitation mathematically and effectively providing lower bounds on what might be achievable with communication compression. Motivated by these developments, we call for the search for the optimal compression operator. In an attempt to take a first step in this direction, we construct a new unbiased compression method inspired by the Kashin representation of vectors, which we call Kashin compression (KC). In contrast to all previously proposed compression mechanisms, we prove that KC enjoys a dimension-independent variance bound with an explicit formula even in the regime when only a few bits need to be communicated per vector entry. We show how KC can be provably and efficiently combined with several existing optimization algorithms, in all cases leading to communication complexity improvements on the previous state of the art.


Introduction
In the quest for high-accuracy machine learning models, both the size of the model and, consequently, the amount of data necessary to train it have increased enormously over time (Schmidhuber, 2015; Vaswani et al., 2019). Because of this, performing the learning process on a single machine is often infeasible. In a typical scenario of distributed learning, the training data (and possibly the model as well) is spread across different machines, and thus training is done in a distributed manner (Bekkerman et al., 2011; Vogels et al., 2019). Another scenario, most common in federated learning (Konečný et al., 2016; McMahan et al., 2017; Karimireddy et al., 2019a), is when the training data is inherently distributed across a large number of mobile edge devices due to data privacy concerns.

Communication bottleneck
In all cases of distributed and federated learning, communication of information (e.g. the current stochastic gradient vector or the current state of the model) between computing nodes is inevitable, and it forms the primary bottleneck of such systems (Zhang et al., 2017; Lin et al., 2018). This issue is especially apparent in federated learning, where the computing nodes are devices with limited computing power and the network bandwidth is considerably lower (Li et al., 2019).
There are two general approaches to tackling this problem. One line of research, dedicated to so-called local methods, suggests doing more computational work before each communication in the hope that this increases the value of the information to be communicated (Goyal et al., 2017; Wangni et al., 2018; Stich, 2018; Khaled et al., 2020). An alternative approach investigates inexact (lossy) information compression strategies, which aim to send approximate but relevant information encoded with fewer bits. In this work we focus on the second approach, compressed learning. Research in this latter stream splits into two orthogonal directions. To explore savings in communication, various (mostly randomized) compression operators have been proposed and analyzed, such as random sparsification (Konečný & Richtárik, 2018; Wangni et al., 2018), top-k sparsification (Alistarh et al., 2018), standard random dithering (Goodall, 1951; Roberts, 1962; Alistarh et al., 2017), natural dithering (Horváth et al., 2019a), ternary quantization (Wen et al., 2017), and scaled sign quantization (Karimireddy et al., 2019b; Bernstein et al., 2018, 2019; Liu et al., 2019). Table 1 summarizes the most common compression methods with their variances and the number of encoding bits.

Compressed learning
In order to utilize these compression methods efficiently, a lot of research has been devoted to the study of learning algorithms with compressed communication. The presence of compression in a learning algorithm affects the training process: since a compression operator encodes the original information only approximately, it should be expected to increase the number of communication rounds. Table 2 highlights four gradient-type compressed learning algorithms with their corresponding setup and iteration complexity: (i) distributed Gradient Descent (GD) with compressed gradients, (ii) distributed Stochastic Gradient Descent (SGD) with gradient quantization and compression variance reduction (Horváth et al., 2019b), (iii) distributed SGD with bi-directional gradient compression (Horváth et al., 2019a), and (iv) distributed SGD with gradient compression and twofold error compensation (Tang et al., 2019).
In all cases, the iteration complexity depends on the variance (ω or α) of the underlying compression scheme and grows as more compression is applied. For this reason, we are interested in compression methods which save communication by using fewer bits and keep the iteration complexity low by introducing less variance. However, intuitively and also evidently from Table 1, these two goals are in fundamental conflict: requiring fewer bits to be communicated in each round introduces higher variance, and demanding small variance forces more bits to be communicated.

Contributions
The contributions of our work are:
• Uncertainty Principle. We formalize this intuitive trade-off and prove an uncertainty principle for randomized compression operators, which quantifies this limitation mathematically with the inequality
$$\alpha \cdot 4^{b/d} \ge 1, \qquad (1)$$
where α ∈ [0, 1] is the normalized variance / contraction factor associated with the compression operator (Definition 1), b is the number of bits required to encode the compressed vector and d is the dimension of the vector to be compressed. The notion of an Uncertainty Principle (UP) for compression operators is introduced and proved in this paper. It is a universal property of compressed communication, completely independent of the optimization algorithm and of the problem that distributed training is trying to solve. We visualize this principle in Figure 1, where we computed many possible combinations of the parameters α and b/d for various compression methods. The dashed red line indicating the lower bound (1) bounds all combinations for all compression operators, in agreement with the uncertainty principle for randomized compression operators.
• Kashin Compression. Motivated by this principle, we then focus on the search for the optimal compression operator. In an attempt to take a first step in this direction, we design a new unbiased compression operator inspired by Kashin representation of vectors (Kashin, 1977), which we call Kashin Compression (KC). In contrast to all previously proposed compression methods, we prove that KC enjoys a dimension independent variance bound even in a severe compression regime when only a few bits per coordinate can be communicated. We give an explicit formula for the variance bound and show how KC can be provably and efficiently combined with several existing optimization algorithms, in all cases leading to communication complexity improvements on previous state of the art. We believe that KC has the potential to play a role in the discovery of an optimal compression method, perhaps when composed with some other operators, such as dithering.
• Experimental Validations. In our experiments, we observed the superiority of KC in terms of communication savings and stabilization property when compared against a vast array of compressors proposed in the literature. In particular, Figure 1 shows that KC combined with Top-k sparsification and dithering operators yields a compression method which almost closes the gap to the UP.
Kashin's representation has been used heuristically in federated learning (Caldas et al., 2019) to mitigate the communication cost. In contrast to that work, we generate the initial tight frame of KC randomly, as suggested by the theory, and tune the parameters accordingly. Moreover, we consider combinations of KC with other compression techniques such as ternary quantization, Top-k sparsification and dithering. We believe KC should be of high interest in federated and distributed learning.

Uncertainty principle for compression operators
In general, an uncertainty principle refers to any type of mathematical inequality expressing some fundamental trade-off between two measurements. Heisenberg's classical uncertainty principle in quantum mechanics (Heisenberg, 1927) expresses the trade-off between the position and the momentum of a particle. In harmonic analysis, the uncertainty principle limits the simultaneous localization of a function and its Fourier transform (Havin & Jöricke, 1994). Similarly, in the context of signal processing, signals cannot be simultaneously localized in both the time domain and the frequency domain (Gabor, 1946). The uncertainty principle in communication deals with the quite intuitive trade-off between information compression (encoding bits) and approximation error (variance): more compression forces heavier distortion of the communicated messages, and a tighter approximation requires less information compression.
In this section, we present our UP for communication compression revealing the trade-off between encoding bits of compressed information and the variance produced by compression operator. First, we describe our UP for a general class of biased compressions. Afterwards, we specialize it to the class of unbiased compressions.

UP for biased compressions
We work with the class of biased compression operators which are contractive.

Definition 1 (Biased Compressions) Let B(α) be the class of biased (and possibly randomized) compression operators $C : \mathbb{R}^d \to \mathbb{R}^d$ that are contractive with parameter $\alpha \in [0, 1]$, i.e.
$$\mathbb{E}\left[\|C(x) - x\|_2^2\right] \le \alpha \|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^d. \qquad (2)$$
The parameter α can be seen as the normalized variance of the compression operator. Note that the compression C does not need to be randomized to belong to this class. For instance, the Top-k sparsification operator satisfies (2) without the expectation for α = 1 − k/d. Next, we formalize our uncertainty principle for the class B(α).
Theorem 1 Let $C : \mathbb{R}^d \to \mathbb{R}^d$ be any compression operator from B(α) and b be the total number of bits needed to encode the compressed vector C(x) for any $x \in \mathbb{R}^d$. Then the following form of the uncertainty principle holds:
$$\alpha \cdot 4^{b/d} \ge 1. \qquad (3)$$
One can view the binary32 and binary64 floating-point formats as biased compression methods for the actual real numbers (i.e. d = 1), using only 32 and 64 bits respectively to represent a single number. Intuitively, these formats have limited precision (i.e. $\sqrt{\alpha}$), and the uncertainty principle (3) shows that the precision cannot be better than $2^{-32}$ for the binary32 format and $2^{-64}$ for the binary64 format. Thus, any floating-point format representing a single number with r bits has a precision constraint of $2^{-r}$, where the base 2 stems from the binary nature of the bit.
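The Top-k example and the floating-point remark can be checked numerically. The sketch below is illustrative only: it assumes the bound has the form α · 4^{b/d} ≥ 1 (which matches the precision limit $2^{-r}$ quoted above for d = 1), and the bit count used for Top-k (32 bits per kept value plus an index for the support set) is a hypothetical encoding, not one taken from Table 1.

```python
import math
import random

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]

def alpha_lower_bound(bits, d):
    """Smallest alpha compatible with alpha * 4**(bits/d) >= 1 (assumed form)."""
    return 4.0 ** (-bits / d)

random.seed(0)
d, k = 100, 10
x = [random.gauss(0.0, 1.0) for _ in range(d)]
cx = top_k(x, k)

# Contraction property of Top-k with alpha = 1 - k/d (holds without expectation).
err = sum((a - b) ** 2 for a, b in zip(cx, x))
norm2 = sum(a * a for a in x)
alpha_topk = 1 - k / d
contraction_ok = err <= alpha_topk * norm2

# Hypothetical encoding: 32 bits per kept value plus the index of the support.
bits = 32 * k + math.ceil(math.log2(math.comb(d, k)))
up_ok = alpha_topk >= alpha_lower_bound(bits, d)

# For a single number stored in 32 bits, sqrt(alpha) is bounded below by 2**-32.
precision_32 = math.sqrt(alpha_lower_bound(32, 1))
print(contraction_ok, up_ok, precision_32)
```

With k < d the Top-k operator sits far above the bound, which is expected: the bound is a universal floor, not an estimate of any particular compressor.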
Furthermore, notice that compression operators can achieve zero variance in some settings, e.g. ternary or scaled sign quantization when d = 1 (see Table 1). On the other hand, the UP (3) implies that the normalized variance α > 0 for any finite number of bits b. The reason for this apparent inconsistency is that, for instance, the binary32 format encodes any number with 32 bits and the error $2^{-32}$ is usually ignored in practice. We can adjust our UP to any digital format, using r bits per single number, as

UP for unbiased compressions
We now specialize our UP to the class of unbiased compressions. First, we recall the definition of unbiased compression operators with a given variance.
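Definition 2 (Unbiased Compressions) Denote by U(ω) the class of unbiased compression operators $C : \mathbb{R}^d \to \mathbb{R}^d$ with variance $\omega \ge 0$, i.e. for any $x \in \mathbb{R}^d$
$$\mathbb{E}[C(x)] = x, \qquad \mathbb{E}\left[\|C(x) - x\|_2^2\right] \le \omega \|x\|_2^2.$$
To establish an uncertainty principle for C ∈ U(ω), we show that all unbiased compression operators with the proper scaling factor are included in B(α).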
Theorem 2 Let C : R d → R d be any unbiased compression operator with variance ω ≥ 0 and b be the total number of bits needed to encode the compressed vector C(x) for any x ∈ R d . Then the uncertainty principle takes the form

Compression with regular polytopes
Here we describe an unbiased compression scheme based on regular polytopes. With this particular compression we illustrate that unbiased compressions can have dimension-independent variance bounds while communicating only a few bits per coordinate. Let $x \in \mathbb{R}^d$ be the vector that we need to communicate. First, we project the vector onto the unit sphere, separating its magnitude from its direction:
$$x = \|x\|_2 \cdot \frac{x}{\|x\|_2}.$$
The magnitude $\|x\|_2$ is a dimension-independent scalar value and we can transfer it cheaply, say with 32 bits. To encode the unit vector $x/\|x\|_2$ we approximate the unit sphere by regular polytopes and then randomize over the vertices of the polytope. Polytopes can be seen as generalizations of planar polygons to high dimensions. Formally, let $P_m$ be a regular polytope with vertices $\{v_1, v_2, \dots, v_m\} \subset \mathbb{R}^d$ such that it contains the unit sphere, i.e. $S^{d-1} \subset P_m$, and all vertices lie on the sphere of radius $R > 1$. Then any unit vector $v \in S^{d-1}$ can be expressed as a convex combination $\sum_{k=1}^m w_k v_k$ with some non-negative weights $w_k = w_k(x)$. Equivalently, $v$ can be expressed as the expectation of a random vector taking the value $v_k$ with probability $w_k$. Therefore, the direction $x/\|x\|_2$ can be encoded with roughly $\log m$ bits, and the variance ω of the compression depends on the quality of the approximation, more specifically $\omega = R^2 - 1$. Kochol (1994, 2004) gave a constructive proof of the approximation of the d-dimensional unit sphere by regular polytopes with $m \ge 2d$ vertices for which $\omega = O\left(\frac{d}{\log(m/d)}\right)$. So, choosing the number of vertices to be $m = 2^d$, we get an unbiased compression operator with O(1) variance (independent of the dimension d) and with an encoding of 1 bit per coordinate.
However, this method does not seem to be practical, as the $2^d$ vertices of the polytope either need to be stored or computed each time they are used, which is infeasible for large dimensions.
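To make the vertex randomization concrete, here is a minimal sketch for the simplest case m = 2d, using the cross-polytope with vertices $\pm R e_i$ and $R = \sqrt{d}$, which contains the unit sphere. This particular polytope and the way leftover probability mass is split are choices made for illustration, not a construction taken from the text. Since every vertex has norm R and the mean equals v, the variance is exactly $R^2 - 1 = d - 1$, in line with the $O(d / \log(m/d))$ rate for m = 2d.

```python
import math
import random

def cross_polytope_weights(v):
    """Convex weights over the 2d vertices +-R e_i (R = sqrt(d)) with mean v."""
    d = len(v)
    R = math.sqrt(d)
    w_plus = [max(vi, 0.0) / R for vi in v]    # weight of vertex +R e_i
    w_minus = [max(-vi, 0.0) / R for vi in v]  # weight of vertex -R e_i
    # ||v||_1 <= sqrt(d) for unit v, so the leftover mass is non-negative.
    slack = max(0.0, 1.0 - sum(w_plus) - sum(w_minus))
    # Split the leftover mass over an opposite pair; it cancels in the mean.
    w_plus[0] += slack / 2
    w_minus[0] += slack / 2
    return R, w_plus, w_minus

def sample_vertex(v, rng):
    """Draw one vertex; its index costs about log2(2d) bits to transmit."""
    R, w_plus, w_minus = cross_polytope_weights(v)
    d = len(v)
    idx = rng.choices(range(2 * d), weights=w_plus + w_minus)[0]
    out = [0.0] * d
    out[idx % d] = R if idx < d else -R
    return out

rng = random.Random(0)
d = 6
g = [rng.gauss(0.0, 1.0) for _ in range(d)]
norm = math.sqrt(sum(t * t for t in g))
v = [t / norm for t in g]

R, w_plus, w_minus = cross_polytope_weights(v)
mean = [R * (p - m) for p, m in zip(w_plus, w_minus)]  # exact expectation
vertex = sample_vertex(v, rng)
variance = R ** 2 - 1  # E||C(v) - v||^2 = R^2 - 2<v, v> + 1
```

The exact expectation `mean` reproduces v, so the scheme is unbiased, but the variance grows linearly with d; reaching O(1) variance requires exponentially many vertices, which is the practicality issue noted above.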

Compression with Kashin's representation
In this section we introduce the notion of Kashin's representation and the algorithm of Lyubarskii & Vershynin (2010) for computing it efficiently, and then describe the quantization step.

Representation systems
The most common way of compressing a given vector $x \in \mathbb{R}^d$ is to use its orthogonal representation with respect to the standard basis $(e_i)_{i=1}^d$:
$$x = \sum_{i=1}^d x_i e_i.$$
However, the limitation of orthogonal expansions is that the coefficients $x_i$ are independent, in the sense that if one of them is lost, it cannot be recovered even approximately. Furthermore, each coefficient $x_i$ may carry a very different portion of the total information that the vector $x$ contains; some coefficients may carry more information than others and thus be more sensitive to compression.

Algorithm 1 Computing Kashin's representation
Input: orthogonal d × D matrix U which satisfies RIP with parameters δ, η ∈ (0, 1), a vector x ∈ R d and a number of iterations r.
For this reason, it is preferable to use tight frames and frame representations instead. Tight frames are generalizations of orthonormal bases in which the system of vectors is not required to be linearly independent: a system $(u_i)_{i=1}^D$ in $\mathbb{R}^d$ is a tight frame if every $x \in \mathbb{R}^d$ admits the frame representation
$$x = \sum_{i=1}^D a_i u_i, \qquad a_i = \langle x, u_i \rangle. \qquad (7)$$
Clearly, if D > d (the case we are interested in), then the system $(u_i)_{i=1}^D$ is linearly dependent and hence the representation (7) with coefficients $a_i$ is not unique. The idea is to exploit this redundancy and choose the coefficients $a_i$ so as to spread the information uniformly among them. However, the frame representation may not distribute the information well enough. Thus, we need a particular representation for which the coefficients $a_i$ have the smallest possible dynamic range.
For a frame $(u_i)_{i=1}^D$ define the $d \times D$ frame matrix $U$ by stacking the frame vectors $u_i$ as columns. It is easily seen that being a tight frame is equivalent to the frame matrix being orthogonal, i.e. $U U^\top = I_d$, where $I_d$ is the $d \times d$ identity matrix. Using the frame matrix $U$, the frame representation (7) takes the form $x = U a$.
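A small worked example may help here. The sketch below uses three planar vectors at 120 degrees scaled by $\sqrt{2/3}$ (often called the Mercedes-Benz frame), chosen only for illustration with d = 2 and D = 3. It checks that the frame matrix has orthonormal rows, that the frame coefficients reconstruct x, and that the coefficients are not unique.

```python
import math

# Three unit vectors at 120 degrees, scaled by sqrt(2/3): a tight frame in R^2.
scale = math.sqrt(2.0 / 3.0)
angles = [math.pi / 2 + 2 * math.pi * i / 3 for i in range(3)]
U = [[scale * math.cos(t) for t in angles],   # d x D frame matrix, d = 2, D = 3
     [scale * math.sin(t) for t in angles]]

def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Gram matrix of the rows, U U^T, which should be the 2 x 2 identity.
gram = [[sum(U[i][k] * U[j][k] for k in range(3)) for j in range(2)]
        for i in range(2)]

x = [0.3, -1.2]
a = matvec(transpose(U), x)      # frame coefficients a_i = <x, u_i>
x_rec = matvec(U, a)             # x = U a

# The three frame vectors sum to zero, so shifting all coefficients by a
# constant gives another valid representation of the same x.
a_alt = [ai + 0.5 for ai in a]
x_alt = matvec(U, a_alt)
```

The freedom visible in `a_alt` is exactly the redundancy that Kashin's representation exploits to flatten the coefficients.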

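Definition 3 (Kashin's representation) Let $(u_i)_{i=1}^D$ be a tight frame in $\mathbb{R}^d$. A Kashin's representation of a vector $x \in \mathbb{R}^d$ with level $K$ is an expansion $x = \sum_{i=1}^D a_i u_i$ whose coefficients satisfy
$$\max_{1 \le i \le D} |a_i| \le \frac{K}{\sqrt{D}} \|x\|_2. \qquad (8)$$
Optimality. As noted in (Lyubarskii & Vershynin, 2010), Kashin's representation has the smallest possible dynamic range $K/\sqrt{D}$, which is $\sqrt{d}$ times smaller than the dynamic range of the frame representation (7).
Existence. It turns out that not every tight frame can guarantee Kashin's representation with a constant level. The following existence result is based on Kashin's theorem (Kashin, 1977):
Theorem 3 There exist tight frames in $\mathbb{R}^d$ with redundancy $\lambda = D/d > 1$ arbitrarily close to 1 such that every vector $x \in \mathbb{R}^d$ admits Kashin's representation with level $K = K(\lambda)$ that depends on λ only (not on d or D).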

Computing Kashin's representation
To compute Kashin's representation we use the algorithm developed by Lyubarskii & Vershynin (2010), which transforms the frame representation (7) into Kashin's representation (8). The algorithm requires a tight frame whose frame matrix satisfies the restricted isometry property:
Definition 4 (Restricted Isometry Property (RIP)) A given $d \times D$ matrix $U$ satisfies the Restricted Isometry Property with parameters $\delta, \eta \in (0, 1)$ if for any $x \in \mathbb{R}^D$
$$|\mathrm{supp}(x)| \le \delta D \quad \Longrightarrow \quad \|Ux\|_2 \le \eta \|x\|_2. \qquad (9)$$
In general, for an orthogonal $d \times D$ matrix $U$ we can only guarantee the inequality $\|Ux\|_2 \le \|x\|_2$ for $x \in \mathbb{R}^D$. The RIP requires $U$ to be a strict contraction on sparse $x$. With a frame matrix satisfying RIP, the analysis of Algorithm 1 from (Lyubarskii & Vershynin, 2010) yields a formula for the level of Kashin's representation:
Theorem 4 Let $(u_i)_{i=1}^D$ be a tight frame in $\mathbb{R}^d$ which satisfies RIP with parameters δ, η. Then any vector $x \in \mathbb{R}^d$ admits a Kashin's representation with level
$$K = \frac{1}{(1 - \eta)\sqrt{\delta}}. \qquad (10)$$
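The iteration behind this result can be sketched as follows. This is our reading of the truncation scheme of Lyubarskii & Vershynin (2010), not a transcription of Algorithm 1: at each step the frame coefficients of the current residual are clipped at a level that shrinks by the factor η, and the clipped part is accumulated. The function name, the values of δ and η, and the dimensions are illustrative assumptions; the parameters are not tuned to satisfy RIP.

```python
import numpy as np

def kashin_representation(U, x, delta, eta, iters):
    """Accumulate clipped frame coefficients; returns (a, residual), x = U a + residual."""
    d, D = U.shape
    a = np.zeros(D)
    residual = np.asarray(x, dtype=float).copy()
    level = np.linalg.norm(residual) / np.sqrt(delta * D)  # initial truncation level
    for _ in range(iters):
        b = U.T @ residual                  # frame coefficients of the residual
        b_hat = np.clip(b, -level, level)   # truncate large coefficients
        a += b_hat
        residual = residual - U @ b_hat     # what is still unrepresented
        level *= eta                        # levels form a geometric sequence
    return a, residual

rng = np.random.default_rng(0)
d, D = 32, 64
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
U = Q[:d, :]                                # d x D matrix with orthonormal rows
x = rng.standard_normal(d)

delta, eta = 0.5, 0.9                       # illustrative values, not tuned
a, residual = kashin_representation(U, x, delta, eta, iters=50)

# Summing the geometric series of truncation levels bounds every |a_i| by
# K * ||x|| / sqrt(D) with K = 1 / ((1 - eta) * sqrt(delta)).
K = 1.0 / ((1.0 - eta) * np.sqrt(delta))
coeff_bound = K * np.linalg.norm(x) / np.sqrt(D)
```

The coefficient bound holds by construction for any δ and η; the RIP is what guarantees that the residual shrinks geometrically, so that a finite number of iterations gives an accurate representation.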

Quantizing Kashin's representation
We utilize Kashin's representation to design a compression method that enjoys a dimension-free variance bound on the approximation error. Let $x \in \mathbb{R}^d$ be the vector that we want to communicate and $\lambda > 1$ be the redundancy factor, chosen so that $D = \lambda d$ is a positive integer. First we find Kashin's representation of $x$, i.e. $x = Ua$ for some $a \in \mathbb{R}^D$, and then quantize the coefficients $a_i$ using any unbiased compression operator $C : \mathbb{R}^D \to \mathbb{R}^D$ that preserves the sign and the maximum magnitude:
$$0 \le C(a)_i \, \mathrm{sign}(a_i) \le \|a\|_\infty, \quad i = 1, \dots, D. \qquad (11)$$
For example, ternary quantization or any dithering (standard random, natural) can be applied. The vector that we communicate is the quantized coefficient vector $C(a) \in \mathbb{R}^D$, and KC is defined via
$$C_\kappa(x) = U\, C(a).$$
Due to the unbiasedness of $C$ and linearity of expectation, $C_\kappa$ is unbiased as well:
$$\mathbb{E}[C_\kappa(x)] = U\, \mathbb{E}[C(a)] = U a = x.$$
Then we bound the approximation error uniformly (without the expectation) as follows:
$$\|C_\kappa(x) - x\|_2^2 = \|U(C(a) - a)\|_2^2 \le \|C(a) - a\|_2^2 \le D \|a\|_\infty^2 \le K(\lambda)^2 \|x\|_2^2,$$
where the first inequality uses $UU^\top = I_d$, the second uses (11), and the last uses the coefficient bound (8).
The obtained uniform upper bound $K(\lambda)^2$ does not depend on the dimension d. It depends only on the redundancy factor λ > 1, which should be chosen according to how much we want to reduce communication. Thus, KC $C_\kappa$ with any unbiased quantization satisfying (11) belongs to the class U(ω) with $\omega = K(\lambda)^2$. Note that we are not restricted to using only unbiased compressions with Kashin's representation. For instance, instead of random sparsification (which is unbiased and satisfies (11)) one can use Top-k sparsification, which satisfies (11) and in practice works much better despite having similar theoretical properties.
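The quantize-then-decode step can be sketched with ternary quantization, taken here in its usual form: each coefficient is kept as $\mathrm{sign}(a_i)\|a\|_\infty$ with probability $|a_i| / \|a\|_\infty$ and set to zero otherwise. To keep the example short, the plain frame coefficients $a = U^\top x$ stand in for a Kashin representation; this is a simplification, so the sketch illustrates unbiasedness and the bound $D\|a\|_\infty^2$ rather than the level $K(\lambda)$. All names and sizes are illustrative.

```python
import numpy as np

def ternary_quantize(a, rng):
    """Unbiased ternary quantization: entries become 0 or sign(a_i) * ||a||_inf."""
    m = np.max(np.abs(a))
    if m == 0.0:
        return np.zeros_like(a)
    keep = rng.random(a.shape) < np.abs(a) / m   # keep with probability |a_i| / m
    return m * np.sign(a) * keep

rng = np.random.default_rng(0)
d, D = 8, 16
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
U = Q[:d, :]                         # d x D frame matrix with orthonormal rows

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
a = U.T @ x                          # stand-in coefficients with x = U a

# Decode many independent compressions: each sample is U C(a).
samples = np.array([U @ ternary_quantize(a, rng) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)                      # should approach x
worst_sq_error = np.max(np.sum((samples - x) ** 2, axis=1))
uniform_bound = D * np.max(np.abs(a)) ** 2                # D * ||a||_inf^2
```

The smaller $\|a\|_\infty$ is relative to $\|x\|_2$, the tighter `uniform_bound` becomes, which is why flattening the coefficients with Kashin's representation pays off.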

Measure concentration and orthogonal matrices
The concentration of the measure is a remarkable high-dimensional phenomenon which roughly claims that a function defined on a high-dimensional space and having small oscillations takes values highly concentrated around the average (Ledoux, 2001;Giannopoulos & Milman, 2000). Here we present one example of such concentration for Lipschitz functions on the unit sphere, which will be the key to justify the restricted isometry property.

Concentration on the sphere for Lipschitz functions
Let $S^{d-1} := \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ be the unit sphere. We say that $f : S^{d-1} \to \mathbb{R}$ is a Lipschitz function with constant $L > 0$ (w.r.t. the Euclidean metric) if $|f(x) - f(y)| \le L \|x - y\|_2$ for any $x, y \in S^{d-1}$.
Theorem 5 Let $X \in S^{d-1}$ be a random vector uniformly distributed on the unit Euclidean sphere. If $f : S^{d-1} \to \mathbb{R}$ is an L-Lipschitz function, then for any $t \ge 0$
Informally and rather surprisingly, Lipschitz functions on a high-dimensional unit sphere are almost constant. In particular, the theorem implies that deviations of function values from the average are at most $8L/\sqrt{d}$ with confidence level more than 0.99. We will apply this concentration inequality to the function $x \mapsto \|Ux\|_2$, which is 1-Lipschitz if $U$ is orthogonal.
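A quick simulation illustrates the phenomenon. The sketch below samples uniform points on the sphere by normalizing Gaussian vectors and evaluates the 1-Lipschitz function that returns the norm of the first d coordinates; the ambient dimension D, the value of d and the sample size are arbitrary choices made for illustration.

```python
import math
import random

def uniform_on_sphere(D, rng):
    """Uniform point on the unit sphere in R^D via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(D)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def projected_norm(z, d):
    """Norm of the first d coordinates of z; a 1-Lipschitz function of z."""
    return math.sqrt(sum(v * v for v in z[:d]))

rng = random.Random(0)
d, D, n_samples = 100, 200, 2000
values = [projected_norm(uniform_on_sphere(D, rng), d) for _ in range(n_samples)]

mean_value = sum(values) / n_samples
max_deviation = max(abs(v - mean_value) for v in values)
reference = math.sqrt(d / D)          # square root of the mean of the squared norm
threshold = 8.0 / math.sqrt(D)        # the 8L / sqrt(dimension) scale with L = 1
print(mean_value, reference, max_deviation, threshold)
```

In runs of this kind the values cluster tightly around $\sqrt{d/D}$, well inside the $8L/\sqrt{D}$ window, even though the function ranges over [0, 1] on the whole sphere.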

Random orthogonal matrices
Up to this point we have not discussed how to choose the frame vectors $u_i$, or equivalently the frame matrix $U$, used in the construction of Kashin's representation. We only know that it should be orthogonal and satisfy RIP for some parameters δ, η. We now describe how to construct the frame matrix $U$ and how to estimate the parameters δ, η. Unfortunately, no explicit construction scheme for such matrices is known. There are, however, random generation processes that provide probabilistic guarantees (Candès & Tao, 2005, 2006; Lyubarskii & Vershynin, 2010).
Consider random $d \times D$ matrices with orthonormal rows. Such matrices are obtained by selecting the first d rows of orthogonal $D \times D$ matrices. Let $O(D)$ be the space of all orthogonal $D \times D$ matrices equipped with the unique translation-invariant normalized measure, which is called the Haar measure for that space. Then the space of orthogonal $d \times D$ matrices is
$$O(d \times D) := \{U = P_d V : V \in O(D)\},$$
where $P_d : \mathbb{R}^D \to \mathbb{R}^d$ is the orthogonal projection onto the first d coordinates. The probability measure on $O(d \times D)$ is induced by the Haar measure on $O(D)$. Next we show that, with respect to the normalized Haar measure, randomly generated orthogonal matrices satisfy RIP with high probability.
Theorem 6 Let λ > 1 and D = λd, then with probability at least
Note that the expression for the probability can be negative if λ is too close to 1. Specifically, the logarithmic term vanishes for λ ≈ 1.0005, giving a negative probability. However, the probability approaches 1 quite rapidly for larger λ. To get a sense of how high that probability can be, note that for d = 1000 variables and inflation λ = 2 it is bigger than 0.98. Now that we have explicit formulas for the parameters δ and η, we can combine them with the results of Section 4 and summarize with the following theorem.
Theorem 7 Let λ > 1 be the redundancy factor and C be any unbiased compression operator satisfying (11). Then Kashin Compression C κ ∈ U(ω λ ) is an unbiased compression with dimension independent variance

Experiments
In this section we describe the implementation details of KC and present our experiments of KC compared to other popular compression methods in the literature.

Implementation details of KC
To generate a random (fat) orthogonal frame matrix $U$, we first generate a random matrix with entries drawn independently from a Gaussian distribution. Then we extract an orthogonal matrix by applying the QR decomposition. Note that for large dimensions the generation of the frame matrix $U$ becomes computationally expensive. However, once the dimension of the to-be-compressed vectors is fixed, the frame matrix needs to be generated only once and can be reused throughout the learning process. Afterwards, we turn to the estimation of the RIP parameters δ and η, which are necessary to compute Kashin's representations. These parameters are estimated iteratively so as to minimize the representation level K (10) subject to the RIP constraint (9). For fixed δ we first find the least η such that (9) holds for unit vectors obtained by normalizing Gaussian random vectors (we chose a sample size of $10^4$ to $10^5$, which provided a good estimate). Then we tune the parameter δ (initially chosen to be 0.9) to minimize the level K (10).
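The generation step can be sketched in a few lines of NumPy. The helper names below are ours, and two details are assumptions rather than parts of the described procedure: the sign correction after QR (a standard way to make the orthogonal factor Haar-distributed) and the use of random supports of size δD when probing the RIP inequality, which yields only an empirical lower estimate of η.

```python
import numpy as np

def random_frame(d, D, rng):
    """First d rows of a random D x D orthogonal matrix obtained via QR."""
    G = rng.standard_normal((D, D))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))   # flip column signs so that Q is Haar-distributed
    return Q[:d, :]

def estimate_eta(U, delta, n_samples, rng):
    """Largest observed ||U x||_2 over random unit vectors with delta*D nonzeros."""
    d, D = U.shape
    s = max(1, int(delta * D))    # support size (assumed sampling scheme)
    worst = 0.0
    for _ in range(n_samples):
        x = np.zeros(D)
        support = rng.choice(D, size=s, replace=False)
        x[support] = rng.standard_normal(s)
        x /= np.linalg.norm(x)
        worst = max(worst, float(np.linalg.norm(U @ x)))
    return worst

rng = np.random.default_rng(0)
d, lam = 50, 2
D = lam * d
U = random_frame(d, D, rng)

delta = 0.1                       # illustrative value, not tuned
eta_hat = estimate_eta(U, delta, n_samples=2000, rng=rng)
print(eta_hat)
```

Because the frame depends only on d and D, it can be generated once from a shared random seed on every node rather than transmitted.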

Empirical variance comparison
We empirically compare the variance produced by natural dithering against that of KC with natural dithering and observe that the latter introduces much less variance. We generated n vectors with d independent entries from the standard Gaussian distribution. Then we fix the minimum number of levels s that allows obtaining an acceptable variance for KC with natural dithering. Next, we adjust the number of levels s for natural dithering so that almost the same number of bits is used for transmission of the compressed vector. For each of these vectors we compute the normalized empirical variance (14).
In Figure 2 we provide boxplots of the empirical variances, which show that increasing the parameter λ leads to smaller variance for KC. They also confirm that for natural dithering the variance ω scales with the dimension d, while for KC that scaling is significantly reduced (see also Table 1 for variance bounds). This shows the positive effect of combining KC with other compression methods. For additional insight, we also present swarmplots produced with the Seaborn library. Figure 3 illustrates the strong robustness of KC with respect to outliers.
Figure caption (fragment): empirical variances (14) for natural dithering and KC with natural dithering.

Minimizing quadratics with CGD
To illustrate the advantages of KC in optimization algorithms, we minimized randomly generated quadratic functions (15) for $d = 10^4$ using gradient descent with compressed gradients.
In Figure 4a we show the functional suboptimality, with a logarithmic scale on the vertical axis. These plots illustrate the superiority of KC with ternary quantization: it does not degrade the convergence at all and saves communication compared both to other compression methods and to running without any compression scheme.
To provide more insights into this setting, Figure 4b visualizes empirical variances of the compressed gradients throughout the optimization process, revealing both the low variance feature and the stabilization property of KC.
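For readers who want to reproduce the flavor of this experiment at small scale, here is a minimal sketch of gradient descent with compressed gradients. Everything specific in it is an assumption made for illustration: the quadratic is a small diagonal one rather than the problem (15), the compressor is random sparsification (rand-k, unbiased with variance ω = d/k − 1) rather than KC, and the step size 1/(L(1 + ω)) is a common choice for unbiased compressors.

```python
import random

def rand_k(g, k, rng):
    """Random sparsification: keep k random entries, rescaled by d/k (unbiased)."""
    d = len(g)
    keep = set(rng.sample(range(d), k))
    return [gi * d / k if i in keep else 0.0 for i, gi in enumerate(g)]

def objective(x, diag):
    """f(x) = 0.5 * sum_i diag_i * x_i^2, minimized at x = 0 with f = 0."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(diag, x))

rng = random.Random(0)
d, k = 20, 5
diag = [1.0 + 9.0 * i / (d - 1) for i in range(d)]   # eigenvalues in [1, 10]
L = max(diag)                                        # smoothness constant
omega = d / k - 1                                    # variance of rand-k
gamma = 1.0 / (L * (1.0 + omega))                    # assumed step size

x = [1.0] * d
f_start = objective(x, diag)
for _ in range(2000):
    grad = [a * xi for a, xi in zip(diag, x)]
    compressed = rand_k(grad, k, rng)                # only k entries are "sent"
    x = [xi - gamma * ci for xi, ci in zip(x, compressed)]
f_end = objective(x, diag)
print(f_start, f_end)
```

The step size shrinks with 1 + ω, which is the mechanism by which a high-variance compressor slows convergence; a compressor with dimension-independent variance avoids that penalty.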

Minimizing quadratics with distributed CGD
Consider the minimization problem of the average of n quadratics with synthetically generated matrices $A_i$. We solve this problem with Distributed Compressed Gradient Descent (Algorithm 2) using a selection of compression operators. Figures 5 and 6 show that KC combined with ternary quantization leads to faster convergence and uses fewer bits to communicate than ternary quantization alone. Note that in higher dimensions the gap between KC with ternary quantization and no compression gets smaller in the iteration plot, while in the communication plot it gets bigger. So, in high dimensions KC converges slightly more slowly than the uncompressed scheme, but the savings in communication are huge.

Conclusion and future plans
We formalized, for the first time, the limitation of (randomized) compression operators in communication and mathematically proved an uncertainty principle for communication compression. We also presented a new, highly robust compressor, Kashin compression (KC), and showed that in combination with some other compression methods it gives almost optimal compression, thus nearly closing the gap established by our uncertainty principle. As future work, we plan to implement a sparse and efficient generation of large random orthogonal matrices using block-structured small orthogonal matrices. This should reduce both the storage requirement and the computational effort needed to use KC in practical applications.
Define probability functions $p_k$ as follows:
Then we stack the functions $p_k$ together and get a vector-valued function $p$:
We can express the expectation in (17) as
and, taking into account the inequality (17) itself, we conclude
The above inequality holds for the particular probability function $p$ defined from the compression $C$. Therefore the inequality remains valid if we take the minimum of the left-hand side over all possible probability functions $\hat{p}$:
We then swap the order of min-max by adjusting the domains properly:
where the second minimum is over all probability vectors $\hat{p} \in \Delta_m$ (not over vector-valued functions as in the first minimum). Next, notice that
where $v_x \in \arg\min_{v \in \{v_1, \dots, v_m\}} \|v - x\|_2$ is the closest $v_k$ to $x$. Therefore, we have transformed (19) into
The last inequality means that the set $\{v_1, \dots, v_m\}$ is a $\sqrt{\alpha}R$-net for the ball $B^d(R)$. Using the following result on covering numbers and volume (see Proposition 4.2.12 in (Vershynin, 2018)) we conclude

A.2 Proof of Lemma 1
which concludes the lemma.
B Proofs for Section 5

B.1 Proof of Theorem 5: Concentration on the sphere for Lipschitz functions
Let $S^{d-1}$ be the unit sphere with the normalized Lebesgue measure $\mu$ and the geodesic metric $\mathrm{dist}(x, y) = \arccos\langle x, y \rangle$ representing the angle between $x$ and $y$. Using this metric, we define the spherical caps as the balls in $S^{d-1}$:
$$B_a(r) = \{x \in S^{d-1} : \mathrm{dist}(x, a) \le r\}, \quad a \in S^{d-1},\ r > 0.$$
For a set $A \subset S^{d-1}$ and a number $t \ge 0$, denote by $A(t)$ the t-neighborhood of $A$ with respect to the geodesic metric:
$$A(t) = \{x \in S^{d-1} : \mathrm{dist}(x, A) \le t\}.$$
The famous result of P. Lévy on the isoperimetric inequality for the sphere states that among all subsets $A \subset S^{d-1}$ of a given measure, the spherical cap has the smallest measure of the neighborhood (see e.g. (Ledoux, 2001)).
We also need the following upper bound on the measure of spherical caps.
Lemma 2 Let $t \ge 0$. If $B \subset S^{d-1}$ is a spherical cap with radius $\pi/2 - t$, then
These two results yield a concentration inequality on the unit sphere around the median of a Lipschitz function.
Theorem 9 Let $f : S^{d-1} \to \mathbb{R}$ be an L-Lipschitz function (w.r.t. the geodesic metric) and let $M = M_f$ be its median, i.e.
Continuity of $\mu$ and $f$ gives the result with the relaxed inequality.
B.1.2 Proof of Theorem 5: Concentration around the mean
Now, from (21) we derive a concentration inequality around the mean rather than the median, where the mean is defined via
Again, without loss of generality we assume that $L = 1$ and $d \ge 3$. Fix $\epsilon \in [0, 1]$ and decompose the set $\{x : |f(x) - \mathbb{E}f| \ge t\}$ into two parts:
where $M$ is a median of $f$. From the concentration (21) around the median, we get an estimate for $A_1$:
$$A_1 \le \sqrt{\frac{\pi}{2}} \exp\left(-\frac{(d-2)\,\epsilon^2 t^2}{2}\right).$$
Now we want to estimate the second term $A_2$ with a similar upper bound so as to combine them. The condition in $A_2$ does not depend on $x$, and $A_2$ is a piecewise constant function of $t$. Therefore
where we bounded $\|f - M\|_1$ as follows
We further upper bound $A_2$ to get the same exponential term as for $A_1$:
To check the validity of the latter upper bound, first notice that for $t = \frac{\pi}{2(1-\epsilon)\sqrt{d-2}}$ both are equal to 1. Then, the monotonicity and positivity of the exponential function imply (22) for $0 \le t < \frac{\pi}{2(1-\epsilon)\sqrt{d-2}}$ and $t > \frac{\pi}{2(1-\epsilon)\sqrt{d-2}}$. Combining these two upper bounds for $A_1$ and $A_2$, we get
if we set $\epsilon = 1/2$. To conclude the theorem, note that the normalized uniform measure $\mu$ on the unit sphere can be seen as a probability measure on $S^{d-1}$.
Let $x \in S^{D-1}$ be fixed. Any orthogonal $d \times D$ matrix $U \in O(d \times D)$ can be represented as the projection $U = P_d V$ of a $D \times D$ orthogonal matrix $V \in O(D)$. The uniform probability measure (or Haar measure) on $O(D)$ ensures that if $V \in O(D)$ is random then the vector $z = Vx$ is uniformly distributed on $S^{D-1}$. Therefore, if $U \in O(d \times D)$ is random with respect to the induced Haar measure on $O(d \times D)$, then the random vectors $Ux$ and $P_d z$ have identical distributions. Denote $f(z) = \|P_d z\|_2$ and notice that $f$ is 1-Lipschitz on the sphere $S^{D-1}$. To apply the concentration inequality (23), we compute the expected norm of these random vectors:
$$\mathbb{E} f(z) \le \sqrt{\mathbb{E}\|P_d z\|_2^2} = \sqrt{\sum_{i=1}^d \mathbb{E} z_i^2} = \sqrt{\frac{d}{D}},$$
where we used the fact that the coordinates $z_i^2$ are identically distributed and therefore have the same mean $1/D$. Applying inequality (23) yields, for any $t \ge 0$,
$$\mathrm{Prob}\left(U \in O(d \times D) : \|Ux\|_2 > \sqrt{d/D} + t\right) \le \mathrm{Prob}\left(z \in S^{D-1} : |f(z) - \mathbb{E}f(z)| > t\right) \le 5 \exp\left(-\frac{D t^2}{9}\right).$$