On critical points of quadratic low-rank matrix optimization problems

The absence of spurious local minima in certain nonconvex low-rank matrix recovery problems has been of recent interest in computer science, machine learning and compressed sensing since it explains the convergence of some low-rank optimization methods to global optima. One such example is low-rank matrix sensing under restricted isometry properties (RIPs). It can be formulated as a minimization problem for a quadratic function on the Riemannian manifold of low-rank matrices, with a positive semidefinite Riemannian Hessian that acts almost like an identity on low-rank matrices. In this work new estimates for singular values of local minima for such problems are given, which lead to improved bounds on RIP constants to ensure absence of nonoptimal local minima and sufficiently negative curvature at all other critical points. A geometric viewpoint is taken, which is inspired by the fact that the Euclidean distance function to a rank-k matrix possesses no critical points on the corresponding embedded submanifold of rank-k matrices except for the single global minimum.


Introduction
On the space R^{m×n} of real m × n matrices we consider the quadratic function

f_{A,B}(X) = (1/2)⟨X, A[X]⟩_F − ⟨B, X⟩_F, (1.1)

where A : R^{m×n} → R^{m×n} is a given symmetric linear operator and B ∈ R^{m×n} is a given matrix. The gradient of this function equals ∇f_{A,B}(X) = A[X] − B, so the critical points of f_{A,B} are precisely the solutions of the matrix equation

A[X] = B. (1.2)

Thus, critical points only exist if B is in the range of A. This is, for instance, the case when A is positive definite with respect to the Frobenius inner product, in which case the solution to the matrix equation (1.2) is also unique.
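As a quick sanity check, the relation between the quadratic function and its gradient can be verified numerically. The snippet below is a minimal sketch assuming the standard form f_{A,B}(X) = (1/2)⟨X, A[X]⟩_F − ⟨B, X⟩_F; the symmetric operator A is represented, for illustration only, by a random symmetric mn × mn matrix acting on vectorized matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4

# Symmetric linear operator on R^{m x n}, represented by a symmetric
# (mn x mn) matrix acting on vectorized matrices.
M = rng.standard_normal((m * n, m * n))
A_mat = (M + M.T) / 2

def A(X):
    return (A_mat @ X.ravel()).reshape(m, n)

B = rng.standard_normal((m, n))

def f(X):
    # assumed standard form: f_{A,B}(X) = 1/2 <X, A[X]>_F - <B, X>_F
    return 0.5 * np.sum(X * A(X)) - np.sum(B * X)

def grad(X):
    # claimed gradient: A[X] - B
    return A(X) - B

# Central finite differences of f against <grad(X), D> in random directions.
X = rng.standard_normal((m, n))
h = 1e-6
max_err = 0.0
for _ in range(5):
    D = rng.standard_normal((m, n))
    fd = (f(X + h * D) - f(X - h * D)) / (2 * h)
    max_err = max(max_err, abs(fd - np.sum(grad(X) * D)))
print(f"max finite-difference error: {max_err:.2e}")
```

Since f is quadratic, the central difference is exact up to rounding, so the discrepancy is at the level of floating-point noise.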
For m = n and L a symmetric positive definite n × n matrix, a well-known example of the type described above is the Lyapunov matrix equation

LX + XL = B, (1.3)

which has a unique solution X* for any right-hand side B, since the operator A[X] = LX + XL is symmetric positive definite on R^{n×n}.

Rank constrained quadratic problems
In certain applications the aim is to solve (1.1) or (1.2) with the additional requirement that the solution (or its approximation) is of sufficiently low rank. For instance, when B in the Lyapunov equation (1.3) itself is of low rank, it can be proven (Penzl, 2000, Thm. 1) that X* has exponentially decaying singular values and, hence, can be approximated well by a low-rank matrix. To obtain such approximations there exist a few methods that are built from classical solvers in numerical linear algebra, like the ADI and Krylov methods; see Simoncini (2016) for a recent overview.
Let k ≤ min(m, n), and denote by M_k = {X ∈ R^{m×n} : rank(X) = k} the smooth manifold of fixed rank-k matrices, and by M_{≤k} = {X ∈ R^{m×n} : rank(X) ≤ k} its closure in R^{m×n}. A natural alternative approach for obtaining low-rank (approximate) solutions to the matrix equation (1.2) is to minimize the quadratic function (1.1) on the set M_{≤k}:

min_{X ∈ M_{≤k}} f_{A,B}(X). (1.4)

Since the set M_{≤k} is closed, this problem admits at least one solution if A is positive definite on the cone M_{≤2k}, that is, ⟨X, A[X]⟩_F > 0 for all X ∈ M_{≤2k} with X ≠ 0 (see Proposition 2.5 below). In the case that B = A[X*] for some X* ∈ R^{m×n} it holds that

f_{A,B}(X) = (1/2)⟨X − X*, A[X − X*]⟩_F − (1/2)⟨X*, A[X*]⟩_F, (1.5)

and thus, if, say, A is positive semidefinite, the minimizers of f_{A,B} on M_{≤k} admit an interpretation as best rank-k approximations of X* in the energy (semi)norm.
The constrained optimization problem (1.4) is nonconvex and can be tackled by various methods. While our forthcoming theoretical results do not depend on the particular method used, we point out two popular and efficient approaches. The first is based on the bilinear representation X = UV^T for low-rank matrices, which allows us to optimize the m × k and n × k matrices U and V using local search algorithms. This is also known as the Burer-Monteiro factorization in the more general context of low-rank approximations for semidefinite programs (SDPs) (Burer & Monteiro, 2003) and can be very efficient when k ≪ min(m, n), since it avoids constructing large matrices of size m × n explicitly. It is advisable to break the nonuniqueness of the factorization X = UV^T by adding penalty terms or applying alternating least squares; see, e.g., Li et al. (2019), Park et al. (2017), Wen et al. (2012), Zhu et al. (2018).
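To make the factorized approach concrete, here is a minimal sketch of plain gradient descent on the Burer-Monteiro parametrization X = UV^T, for the simplest instance A = Id, i.e., f(X) = (1/2)‖X − B‖_F² with a rank-k target B. The step size, normalization and iteration count are ad hoc choices for this toy instance, not a tuned method.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 30, 20, 3

# Rank-k target, normalized so that the fixed step size below is safe.
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
B /= np.linalg.norm(B, 2)

# Small random initialization of the factors U, V.
U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((n, k))

step = 0.2
for _ in range(3000):
    R = U @ V.T - B  # Euclidean gradient of f(X) = 1/2 ||X - B||_F^2 at X = U V^T
    U, V = U - step * (R @ V), V - step * (R.T @ U)

err = np.linalg.norm(U @ V.T - B) / np.linalg.norm(B)
print(f"relative error of U V^T: {err:.2e}")
```

Note that only the m × k and n × k factors are updated; the m × n matrix UV^T is formed here merely to report the error.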
Another family of methods is based on exploiting the fact that the set M_k is a smooth Riemannian manifold. This again allows the use of local search algorithms, but now using techniques from Riemannian optimization (Absil et al., 2008). As these methods directly optimize over M_k they do not require regularization, since nonuniqueness of representations is not an issue. In addition, careful implementations using retractions and low-rank factorizations have a similar cost per iteration as algorithms based on bilinear factorizations; see, e.g., Shalit et al. (2012), Vandereycken (2013), Vandereycken & Vandewalle (2010).
There has been considerable interest in the case of positive semidefinite operators A. This concerns, for instance, the nonconvex formulation of low-rank matrix completion problems (Candès & Recht, 2009),

min_{X ∈ M_{≤k}} (1/2) ‖P_Ω[X] − P_Ω[X*]‖_F²,

which corresponds to solutions of P_Ω[X] = P_Ω[X*]. Here, A = P_Ω is the orthogonal projection onto a subset Ω of known entries, which means that A is not invertible. Solvers for these problems with good theoretical guarantees are based on convex relaxation (Candès & Recht, 2009; Candès & Plan, 2011), but they can also be treated well by nonconvex local optimization techniques (Keshavan et al., 2010; Wen et al., 2012; Jain et al., 2013; Vandereycken, 2013; Ge et al., 2016), which are much less costly per iteration; see also Chi et al. (2019) for a recent overview. More generally, problems with a semidefinite operator are instances of a matrix sensing problem (Recht et al., 2010), that is, the recovery of a matrix from a few linear measurements. Here in general one is faced with the problem

min_{X ∈ M_{≤k}} (1/2) ‖F[X] − b‖², (1.6)

where F is a linear operator from R^{m×n} to R^d and d ≪ mn. Up to an inconsequential constant this problem fits our symmetric and semidefinite framework (1.4) using A = F^T F and B = F^T[b].
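The reduction of matrix sensing to the quadratic framework can be checked numerically. The sketch below assumes the standard quadratic form (1.1), builds A = F^T F and B = F^T[b] for a random Gaussian F, and verifies that (1/2)‖F[X] − b‖² and f_{A,B}(X) differ only by the constant (1/2)‖b‖².

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d = 6, 5, 40

F = rng.standard_normal((d, m * n)) / np.sqrt(d)  # sensing map acting on vec(X)
b = rng.standard_normal(d)

def A(X):                     # A = F^T F
    return (F.T @ (F @ X.ravel())).reshape(m, n)

B = (F.T @ b).reshape(m, n)   # B = F^T[b]

def sensing_loss(X):
    return 0.5 * np.linalg.norm(F @ X.ravel() - b) ** 2

def f_AB(X):
    # assumed standard quadratic form (1.1)
    return 0.5 * np.sum(X * A(X)) - np.sum(B * X)

X = rng.standard_normal((m, n))
gap = abs(sensing_loss(X) - (f_AB(X) + 0.5 * b @ b))
print(f"difference up to the constant: {gap:.2e}")
```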

Contributions and existing results
In this paper we focus on problems of the form (1.1) where A is positive semidefinite. In many applications it can be observed in numerical experiments that if one true solution X* of the corresponding matrix equation has exactly low rank, that is, the global minimizer X* of the function (1.1) is an element of M_k for some small rank k, and if this rank is known, then local optimization methods for (1.4) typically do recover this global minimizer. This is somewhat surprising since the problem is nonconvex. Furthermore, in the so-called noisy case, when X* is only close to but not in M_k, such algorithms typically return different local minima on M_k that are, however, all close to X*.
As explained in many works, the reason for this fortunate behavior seems to be a relatively benign optimization landscape: given an objective function f that is sufficiently well conditioned and convex when restricted to the cones M_{≤k}, the local minima always appear to be global minima after restricting f to M_k. In other words, other critical points are either saddle points or local maxima, and are hence unlikely to attract sequences generated by local optimization algorithms that impose a monotonic reduction of the objective. Moreover, the saddle points have directions with sufficiently large negative curvature (called the strict saddle property in Ge et al., 2015), so that algorithms can escape them sufficiently fast. Such remarkable properties have been rigorously proven under suitable assumptions for different low-rank optimization problems like matrix completion (Sun & Luo, 2015; Ge et al., 2016), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2017), more general convex functions on M_k (Zhu et al., 2018; Li et al., 2019), SDPs (Boumal et al., 2016, 2019) and also for some other problems in the context of compressed sensing such as phase retrieval (Sun et al., 2018) and sparse dictionary recovery (Sun et al., 2015, 2017a,b). See also Ge et al. (2017) for an overview.
Our aim here is to provide similar results by studying the critical points of quadratic functions f_{A,B} as in (1.1) on manifolds M_k for semidefinite operators A. We will show that when the restriction of A to the cone M_{≤2k} behaves like a sufficiently small perturbation of the identity, and if a solution of the matrix equation (1.2) lies on M_k, then f_{A,B} has no local minima on M_k except the global one. Additionally, bounds on the negative eigenvalues of the Riemannian Hessian at other critical points are given. This is important for escaping such critical points in local search methods. These results are in Theorem 3.5 and Corollary 3.6, which, to our knowledge, provide improved and simple conditions on the restricted isometry constants δ_k (see Definition 3.1) when applied to matrix sensing, as compared to those we could find in the literature. For example, we obtain that δ_3k ≤ 0.3446 or δ_2k ≤ 0.2807 are each sufficient for the absence of local minima in noiseless matrix sensing with nonsymmetric matrices. This can be compared to the condition δ_4k ≤ 0.0363 in Park et al. (2017), δ_4k ≤ 1/5 in Li et al. (2019), Zhu et al. (2018) and δ_2k < 1/5 in Bhojanapalli et al. (2016) for symmetric positive semidefinite matrices (observe that δ_k ≤ δ_k′ for all integers k ≤ k′, hence our bounds are less restrictive). On the other hand, there exist examples showing that with δ_2 = 1/2 quadratic functions may exhibit nonglobal local minima on the set of positive semidefinite rank-1 matrices, even in the noiseless case (Zhang et al., 2018; Li et al., 2019). Other sufficient conditions in the literature for guaranteed recovery of rank-k matrices using different approaches include δ_2k < 1/3 (Jain et al., 2010) for the singular value projection (iterative hard thresholding (IHT)) algorithm, and even δ_2k < 1/2 (Cai & Zhang, 2013) for nuclear norm minimization, which has the additional theoretical advantage of not requiring the rank k as an input parameter. These last two approaches, however, can become very expensive to implement for large matrices compared to local methods that operate on rank-k matrices directly.
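For illustration, here is a minimal sketch of the singular value projection (IHT) iteration mentioned above: a gradient step on (1/2)‖F[X] − b‖² followed by truncation back to rank k via the SVD. The problem sizes, measurement count and iteration number are ad hoc choices for a small noiseless instance, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, k, d = 10, 10, 2, 800

F = rng.standard_normal((d, m * n)) / np.sqrt(d)
X_true = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
b = F @ X_true.ravel()          # noiseless measurements

def truncate(X, k):
    # projection onto rank-<=k matrices by truncated SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

X = np.zeros((m, n))
for _ in range(200):
    G = (F.T @ (F @ X.ravel() - b)).reshape(m, n)  # gradient of 1/2 ||F[X] - b||^2
    X = truncate(X - G, k)

err = np.linalg.norm(X - X_true) / np.linalg.norm(X_true)
print(f"relative recovery error: {err:.2e}")
```

Each iteration requires a full (here small) SVD, which is the main cost that local methods operating directly on rank-k factorizations avoid.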
The most general version of our analysis is Theorem 3.9, which also deals with the inexact (or noisy) case, where the matrix equation (1.2) admits only an approximate solution on M_k of some accuracy ε ≥ 0. In this case all critical points whose Riemannian Hessians have small or no negative eigenvalues (e.g., local minima) are optimal up to a constant. While the statements are in principle easy to use, it might be difficult to gain intuition about the actual values. We therefore provide some concrete examples on the interplay of restricted spectral bounds, negative eigenvalues of the (Riemannian) Hessian and ε in Section 3.4.
Our strategy for obtaining these results is motivated by an interesting observation on the Euclidean distance function f(X) = ‖X − B‖_F², namely that for B ∈ M_{≤k} it has no critical points on M_k at all except for the global minimum X = B. In order to generalize this rather peculiar behavior of the operator A = Id to more general operators, we introduce a certain norm in which we measure the distance of X − (A[X] − B) from the cone M_{≤k}. By comparing upper and lower bounds for this distance we obtain our results.
Currently, the main results in this paper do not cover the important cases of matrix completion or matrix equations with badly conditioned operators (as they arise in numerical linear algebra), but some of the observations obtained alongside still provide general insight into the problem.
Finally, we hope to make a contribution to the subject by taking a geometric viewpoint on the problem that focuses on the critical points of the constrained problem (1.4), regardless of the method or parametrization used for representing the low-rank matrices. This is in contrast to Li et al. (2017, 2019), Park et al. (2017), Zhu et al. (2018), where an explicit regularization has to be used to cope with the nonuniqueness of the X = UV^T factorization. Also, thanks to the manifold setup, we believe our analysis has potential implications for most local search methods for (1.4). This is illustrated in the last section on numerical examples, where we solve matrix sensing problems with a few popular nonconvex algorithms. The methods are not novel but serve their purpose in confirming our theoretical result on the nonexistence of spurious local minima.

Properties of critical points
In this work we study the critical points of the function f_{A,B} defined in (1.1) on the smooth manifold M_k of fixed rank-k matrices only. We briefly justify this restriction to the smooth part M_k of M_{≤k} in Section 2.2. In general we neither assume that B is in the range of A, nor that A is positive semidefinite. Instead, a so-called restricted positive definiteness on the cones M_{≤k} will play a crucial role for the main results in Section 3 on the absence of local minima of f_{A,B} on M_k.

Tangent space and critical points
Let X ∈ M_k have an SVD X = UΣV^T with U ∈ R^{m×k} and V ∈ R^{n×k}, and let T_X M_k denote the tangent space of M_k at X. This tangent space is known to be the set (see, e.g., Helmke & Shayman, 1995, Prop. 4.1)

T_X M_k = {GV^T + UH^T : G ∈ R^{m×k}, H ∈ R^{n×k}}. (2.1)

Note that all matrices in T_X M_k have rank at most 2k, that is, T_X M_k ⊆ M_{≤2k}. Let P^col_X and P^row_X denote the respective orthogonal projections onto the column and row space of a matrix X. From (2.1) we see that a matrix Z is orthogonal to T_X M_k if P^col_X Z = 0 and Z P^row_X = 0 or, in other words,

U^T Z = 0 and ZV = 0. (2.2)

Hence, with X = UΣV^T and Z = Ũ Σ̃ Ṽ^T two singular value decompositions (SVDs), we obtain that

X + Z = [U Ũ] [Σ 0; 0 Σ̃] [V Ṽ]^T

is also an SVD. A main consequence of this is that

rank(X + Z) = rank(X) + rank(Z) for Z orthogonal to T_X M_k. (2.3)

The seemingly simple observation (2.2) turns out to be quite useful and is the main argument for Lemma 2.3 below. In fact its immediate consequence (2.3) already has some surprising implications for the critical points of the Euclidean distance function f(X) = ‖X − B‖_F², which arises, up to a constant, from f_{A,B} by taking A = Id to be the identity operator.
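The orthogonality condition and the resulting rank additivity (2.3) are easy to confirm numerically. The following sketch constructs a rank-1 matrix Z whose column and row spaces are orthogonal to those of a rank-k matrix X, and checks that the singular values of X + Z are the union of those of X and Z, so that the ranks add up.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 8, 7, 2

# X of rank k with SVD X = U diag(s) V^T
X0 = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
U, s, Vt = np.linalg.svd(X0, full_matrices=True)
X = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rank-1 matrix Z orthogonal to T_X M_k: its column space is orthogonal
# to col(X) and its row space is orthogonal to row(X).
Z = (U[:, k:] @ rng.standard_normal((m - k, 1))) @ \
    (Vt[k:, :].T @ rng.standard_normal((n - k, 1))).T
assert np.allclose(U[:, :k].T @ Z, 0) and np.allclose(Z @ Vt[:k, :].T, 0)

# Singular values of X + Z are the union of those of X and Z ...
sv_sum = np.linalg.svd(X + Z, compute_uv=False)
sv_union = np.sort(np.concatenate([s[:k],
                                   np.linalg.svd(Z, compute_uv=False)[:1]]))[::-1]
union_ok = np.allclose(np.sort(sv_sum[: k + 1])[::-1], sv_union)

# ... and hence rank(X + Z) = rank(X) + rank(Z).
rank_sum = int(np.sum(sv_sum > 1e-10))
print(rank_sum, union_ok)
```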
The following equivalent statement is even more interesting from a geometric point of view. It follows directly from X − Y being the gradient of the function X ↦ (1/2)‖X − Y‖_F². Our aim in this paper is to study how far the observation in Proposition 2.1 for the identity operator carries over to those functions f_{A,B} in which A is a perturbation of the identity, at least in a restricted sense. For this we will have to quantify the 'rank increase' property (2.3). The starting point will be the inequality stated in Lemma 2.3 below, which first requires some definitions.
By σ_1(Z) ≥ σ_2(Z) ≥ ⋯ we denote the singular values of a matrix Z, with the agreement σ_i(Z) = 0 for i > min(m, n). We then consider the norm

‖Z‖_{σ,k} = (σ_1(Z)² + ⋯ + σ_k(Z)²)^{1/2} = max{⟨Z, Y⟩_F : Y ∈ M_{≤k}, ‖Y‖_F ≤ 1}. (2.4)

Here the equality of both expressions is a consequence of the fact that truncated SVD yields best approximations in the Frobenius norm on the cone M_{≤k}, and hence maximizes the orthogonal projection on it. The norm properties of ‖Z‖_{σ,k} then follow easily from the expression on the right-hand side of (2.4). Note that ‖X‖_{σ,k} ≤ ‖X‖_F for every matrix X. The norm (2.4) is a unitarily invariant norm, that is, ‖UZV‖_{σ,k} = ‖Z‖_{σ,k} for all orthogonal U and V. Hence, the truncated SVD also provides best rank-k approximations in this norm; see, e.g., Horn & Johnson (2013, Section 7.4.9). Therefore, for any fixed Z,

min_{Y ∈ M_{≤k}} ‖Z − Y‖_{σ,k} = (σ_{k+1}(Z)² + ⋯ + σ_{2k}(Z)²)^{1/2}. (2.5)

For the case of the identity operator A = Id we have obtained a contradiction to the existence of two critical points X ≠ X* = B on M_k from two facts: on the one hand the matrix X − (A[X] − B) should have a higher rank than X, that is, a positive distance to M_{≤k}, while on the other hand it cannot, since it equals X*. For A close to Id we expect a similar contradiction, but to obtain it, we need both upper and lower bounds for the distance of X − (A[X] − B) from M_{≤k}. Our key idea is to obtain such bounds for the distance in the ‖·‖_{σ,k}-norm, that is, for the quantity

α = min_{Y ∈ M_{≤k}} ‖X − (A[X] − B) − Y‖_{σ,k}.

Lemma 2.3 Let X ∈ M_k be a critical point of f_{A,B} on M_k, and let α be as above. Then:

(i) α ≤ ‖(Id − A)[X − Y]‖_{σ,k} + ‖A[Y] − B‖_{σ,k} for every Y ∈ M_{≤k}.

(ii) Let 0 ≤ j ≤ k be the largest integer such that σ_{k−j+1}(X) ≤ σ_j(A[X] − B). Then

α² ≥ σ_{k−j+1}(X)² + ⋯ + σ_k(X)² + σ_{j+1}(A[X] − B)² + ⋯ + σ_k(A[X] − B)².

Proof. Item (i) is immediate from the definition of α and the triangle inequality: for every Y ∈ M_{≤k},

α ≤ ‖X − (A[X] − B) − Y‖_{σ,k} = ‖(Id − A)[X − Y] − (A[Y] − B)‖_{σ,k} ≤ ‖(Id − A)[X − Y]‖_{σ,k} + ‖A[Y] − B‖_{σ,k}.

To show (ii) we use the characterization

α² = σ_{k+1}(W)² + ⋯ + σ_{2k}(W)², W = X − (A[X] − B),

which holds by (2.5). Let ς_1 ≥ ⋯ ≥ ς_k denote the singular values of X and s_1 ≥ s_2 ≥ ⋯ those of A[X] − B. Since X is a critical point, A[X] − B is orthogonal to T_X M_k, and so by (2.2) the singular values of W are precisely the numbers ς_1, . . ., ς_k, s_1, s_2, . . . By definition of j the largest k of these numbers are ς_1, . . ., ς_{k−j}, s_1, . . ., s_j (the notation is slightly abusive when j = k), and hence α² is the sum of squares of the largest k remaining ones. In particular α² is larger than or equal to any sum of squares of k of the remaining singular values, which implies the asserted lower bound.
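Both characterizations of the ‖·‖_{σ,k}-norm, as well as the distance formula for truncation, can be checked numerically. The sketch below assumes the definition ‖Z‖_{σ,k} = (σ_1² + ⋯ + σ_k²)^{1/2}; it checks that the normalized truncated SVD attains the maximal inner product over the unit ball of M_{≤k}, and that the σ,k-distance of Z to M_{≤k} collects σ_{k+1}, . . ., σ_{2k}.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 9, 7, 3
Z = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(Z)
norm_sk = np.sqrt(np.sum(s[:k] ** 2))  # (sum of the k largest sigma_i(Z)^2)^(1/2)

# The normalized truncated SVD attains the maximal inner product with Z
# over matrices of rank <= k and unit Frobenius norm.
Zk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
attained = np.sum(Z * Zk) / np.linalg.norm(Zk)

# sigma,k-norm of the residual Z - Zk: it collects sigma_{k+1}, ..., sigma_{2k}.
dist_sk = np.sqrt(np.sum(np.linalg.svd(Z - Zk, compute_uv=False)[:k] ** 2))
expected = np.sqrt(np.sum(s[k:2 * k] ** 2))

print(np.isclose(norm_sk, attained), np.isclose(dist_sk, expected))
```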
In agreement with what was pointed out above, the inequalities (i) and (ii) in Lemma 2.3 are contradictory in the case A = Id and Y = B ∈ M_{≤k} unless X = B. Our strategy is to show that they remain contradictory when A acts like a perturbation of the identity on low-rank matrices. However, in contrast to the case A = Id, we will have to confine ourselves to local minima on M_k, or at least critical points with almost positive semidefinite Riemannian Hessian (see Section 2.4), in order to deal with the a priori unknown singular values of X in the lower bound for α.

Restriction to the smooth part M k
We justify why we are ignoring potential local minima of f_{A,B} on M_{≤k} of rank less than k. By the textbook definition (e.g., Rockafellar & Wets, 1998, Theorem 6.12), a point X ∈ M_{≤k} is a critical point of f_{A,B} on M_{≤k} if the negative gradient −∇f_{A,B}(X) belongs to the polar cone of the Bouligand tangent cone at X. In particular, local minima on M_{≤k} are critical points. If rank(X) = s ≤ k the Bouligand tangent cone can be shown to be (Harris, 1995; Cason et al., 2013; Schneider & Uschmajew, 2015)

T^B_X M_{≤k} = T_X M_s + {Z ∈ R^{m×n} : rank(Z) ≤ k − s, U^T Z = 0, ZV = 0},

where X = UΣV^T is an SVD of X. As then follows, when s < k, the polar cone (T^B_X M_{≤k})° is just the point {0}, and hence a critical point satisfies ∇f_{A,B}(X) = 0, that is, solves the equation A[X] = B. Let us repeat this as a proposition.

Proposition 2.4 Let X ∈ M_{≤k} be a critical point of f_{A,B} on M_{≤k}. Then either rank(X) = k, or A[X] = B.
In the latter case, if A is a positive semidefinite operator, then X is a global minimizer of f_{A,B} on R^{m×n}. Now we can make different assumptions regarding the critical points X satisfying rank(X) < k. If we assume A is positive semidefinite, then by the above proposition such X are necessarily unconstrained global minimizers of f_{A,B}. If we assume instead that there exists at least one solution of A[X*] = B with rank(X*) ≤ k and A satisfies the condition λ(A, 2k) > 0 (see (2.6) below for the definition), then it follows that X = X*. Finally, if we simply assume that the equation A[X] = B does not admit solutions of rank strictly less than k at all, such a critical point X cannot exist, and therefore all critical points of f_{A,B} on M_{≤k} in this broader sense in fact lie in M_k. This is for instance the case if A is positive definite and rank(X*) ≥ k, where X* is the unique solution of A[X*] = B, that is, the global unconstrained minimizer of f_{A,B} (Schneider & Uschmajew, 2015).
Based on these facts all subsequent theorems will be formulated for critical points on the smooth manifold M_k only. A key challenge, however, is to bound the distance of critical points X ∈ M_k to M_{≤k−1}, that is, the smallest singular value σ_k(X), from below; see Section 2.4.

Restricted spectral bounds
The central tool for analyzing the local minima of f_{A,B} on M_k are the 'restricted spectral bounds', that is, the minima and maxima of the Rayleigh quotient of the symmetric operator A on cones of low-rank matrices. We use the following definitions:

λ(A, k) = min{⟨X, A[X]⟩_F : X ∈ M_{≤k}, ‖X‖_F = 1} (2.6)

and

Λ(A, k) = max{⟨X, A[X]⟩_F : X ∈ M_{≤k}, ‖X‖_F = 1}. (2.7)

Note that both the minimum and maximum are attained, since the intersection of the closed set M_{≤k} with the unit sphere is compact. Obviously, whenever k′ ≤ k,

λ(A, k) ≤ λ(A, k′) ≤ Λ(A, k′) ≤ Λ(A, k).

In particular, λ(A, k + ℓ) ≤ λ(A, k) and Λ(A, k) ≤ Λ(A, k + ℓ) for all combinations of k and ℓ.
If A ≠ 0 is positive semidefinite then Λ(A, 1) > 0, since the space R^{m×n} possesses an orthonormal basis of rank-1 matrices. Furthermore, one can then show that

Λ(A, k + ℓ) ≤ Λ(A, k) + Λ(A, ℓ).

For this inequality to hold it is sufficient that λ(A, k + ℓ) ≥ 0.¹ We note that the lower spectral bounds provide conditions for the existence of minimizers as follows.
Proposition 2.5 Assume λ(A, 2k) > 0. Then the function f_{A,B} has at least one minimizer on M_{≤k}, that is, problem (1.4) admits at least one solution.
Proof. Since 0 < λ(A, 2k) ≤ λ(A, k), we have f_{A,B}(X) ≥ (1/2) λ(A, k) ‖X‖_F² − ‖B‖_F ‖X‖_F for all X ∈ M_{≤k}. Hence the restriction of f_{A,B} to M_{≤k} has bounded sublevel sets, and so the existence of a minimizer follows from the fact that M_{≤k} is closed.
We will also need upper estimates for mixed products ⟨Y, A[Z]⟩_F in terms of the restricted spectral bounds. They can be derived using the 'parallelogram identity', similar to Candès & Plan (2011, Lemma 3.3).

Lemma 2.6 Let Y ∈ M_{≤k} and Z ∈ M_{≤ℓ} with ⟨Y, Z⟩_F = 0. Then

⟨Y, A[Z]⟩_F ≤ ((Λ(A, k + ℓ) − λ(A, k + ℓ))/2) ‖Y‖_F ‖Z‖_F. (2.8)

Proof. We may assume ‖Y‖_F = ‖Z‖_F = 1. Then ‖Y ± Z‖_F² = 2 by orthogonality, and the parallelogram identity gives

4⟨Y, A[Z]⟩_F = ⟨Y + Z, A[Y + Z]⟩_F − ⟨Y − Z, A[Y − Z]⟩_F ≤ 2Λ(A, k + ℓ) − 2λ(A, k + ℓ),

which easily yields the asserted bound.
The upper bound (2.8) will be required for the shifted operator Id − A. Under the assumptions of the lemma it follows from the definitions that

⟨Y, (Id − A)[Z]⟩_F ≤ ((Λ(A, k + ℓ) − λ(A, k + ℓ))/2) ‖Y‖_F ‖Z‖_F, (2.9)

since λ(Id − A, j) = 1 − Λ(A, j) and Λ(Id − A, j) = 1 − λ(A, j) for every j.

¹ Using SVD, every matrix Z of rank at most k + ℓ can be written as Z = sX + tY, where s, t ∈ R and X and Y are of rank at most k and ℓ, respectively, and orthonormal with respect to the Frobenius inner product. Consider then the 2 × 2 symmetric matrix with entries ⟨X, A[X]⟩_F, ⟨X, A[Y]⟩_F and ⟨Y, A[Y]⟩_F: its largest eigenvalue bounds the Rayleigh quotient of A at Z. If λ(A, k + ℓ) ≥ 0 this matrix is positive semidefinite, so its largest eigenvalue is bounded by its trace, which is at most Λ(A, k) + Λ(A, ℓ).

Let us further introduce the constants

Γ(A, k, ℓ) = max{⟨Y, (Id − A)[Z]⟩_F : Y ∈ M_{≤k}, Z ∈ M_{≤ℓ}, ‖Y‖_F = ‖Z‖_F = 1}.

They can be related to the ‖·‖_{σ,k}-norms introduced in (2.4) in the following way.
Lemma 2.7 Let Z have rank at most ℓ. Then

‖(Id − A)[Z]‖_{σ,k} ≤ Γ(A, k, ℓ) ‖Z‖_F.

The proof is immediate from the right-hand side of (2.4).
The scaling behavior of Γ(A, k, ℓ) with respect to the ranks k and ℓ will turn out to be useful later to relate our results to existing ones.
Lemma 2.8 For positive integers p and q,

Γ(A, pk, qℓ) ≤ √(pq) Γ(A, k, ℓ).

Proof. Let Y and Z be the maximizers in Γ(A, pk, qℓ). Using SVD we can write

Y = a_1 Y_1 + ⋯ + a_p Y_p and Z = b_1 Z_1 + ⋯ + b_q Z_q,

where the matrices Y_1, . . ., Y_p ∈ M_{≤k} are pairwise orthogonal and have Frobenius norm 1, and the scalars a_1, . . ., a_p are nonnegative with a_1² + ⋯ + a_p² = 1 (and analogously for Z). Observe that

⟨Y, (Id − A)[Z]⟩_F = Σ_{i,j} a_i b_j ⟨Y_i, (Id − A)[Z_j]⟩_F ≤ Γ(A, k, ℓ) (Σ_i a_i)(Σ_j b_j),

and the result follows from the Cauchy-Schwarz inequality.

Estimates related to the Riemannian Hessian
Here we provide lower estimates on the smallest singular values of a critical point X ∈ M_k of f_{A,B} on M_k. These estimates are expressed in terms of the restricted spectral bounds (2.6)-(2.7) of the operator A and the singular values of the residual A[X] − B, as well as a lower bound on the eigenvalues of the Riemannian Hessian of f_{A,B} at X. We refer to Absil et al. (2008, Ch. 5) for the concept of the Riemannian Hessian.
Denote by H_X the Riemannian Hessian of f_{A,B} (restricted to M_k) at X ∈ M_k. As the metric on the submanifold M_k we choose the restriction of the Frobenius inner product from the ambient space R^{m×n}.² Let X = UΣV^T be an SVD of X. By (2.1) every tangent vector Z ∈ T_X M_k can be written as

Z = ΔG V^T + U ΔH^T (2.10)

for some matrices ΔG ∈ R^{m×k} and ΔH ∈ R^{n×k}. With this representation of tangent vectors we prove in Appendix A that (2.11) holds, with P^col_X and P^row_X being the orthogonal projections onto the column and row spaces of X, respectively.
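In computations, tangent vectors are often obtained by orthogonally projecting an ambient matrix onto T_X M_k. The sketch below uses the standard projection formula for the fixed-rank manifold (as in, e.g., Vandereycken, 2013) and checks that the projected matrix has rank at most 2k, while the discarded normal component satisfies the orthogonality conditions of Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 8, 6, 2
X0 = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
U, s, Vt = np.linalg.svd(X0)
U, V = U[:, :k], Vt[:k, :].T

Pu, Pv = U @ U.T, V @ V.T      # projections onto column/row space of X

def proj_tangent(W):
    # orthogonal projection onto T_X M_k for the fixed-rank manifold
    return Pu @ W + W @ Pv - Pu @ W @ Pv

W = rng.standard_normal((m, n))
Z = proj_tangent(W)
N = W - Z                      # normal component: N = (I - Pu) W (I - Pv)

rank_Z = np.linalg.matrix_rank(Z, tol=1e-10)
ortho = np.allclose(U.T @ N, 0) and np.allclose(N @ V, 0)
print(rank_Z <= 2 * k, ortho)
```

The projected matrix Z can also be read off in the representation (2.10), which is why its rank never exceeds 2k.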
At a critical point X of f_{A,B} on M_k the residual A[X] − B is orthogonal to T_X M_k (see Section 2.1), and so the Riemannian Hessian at such points reads (2.12). In the case that X is a local minimum the Riemannian Hessian is positive semidefinite, that is,

⟨Z, H_X[Z]⟩_F ≥ 0 for all Z ∈ T_X M_k.

In the following proposition we consider arbitrary critical points X for which the Riemannian Hessian satisfies a nonpositive lower spectral bound.
Proposition 2.9 Let X be a critical point of f_{A,B} on M_k and let s_1 ≥ s_2 ≥ ⋯ denote the singular values of A[X] − B. Assume for some μ ≥ 0 that the Riemannian Hessian satisfies

⟨Z, H_X[Z]⟩_F ≥ −μ ‖Z‖_F² for all Z ∈ T_X M_k.

Then for any j = 1, . . ., k and Λ_2j > 0 with Λ(A, 2j) ≤ Λ_2j, a corresponding lower bound on σ_k(X) in terms of s_j, Λ_2j and μ holds.

Proof. Let p_1, . . ., p_j be the normalized dominant j left singular vectors of A[X] − B, and q_1, . . ., q_j the dominant right singular vectors (or some of them if there are equal singular values). We consider the matrices ΔG = [p_1, . . ., p_j, 0, . . ., 0] ∈ R^{m×k} and ΔH = [q_1, . . ., q_j, 0, . . ., 0] ∈ R^{n×k}, and the corresponding tangent vector

of the form (2.10) for this choice of ΔG and ΔH. Since X is a critical point, that is, A[X] − B is orthogonal to T_X M_k, each vector p_i is orthogonal to all columns u_1, . . ., u_k of U, and each q_i is orthogonal to the columns v_1, . . ., v_k of V (see Section 2.1), or s_i = 0. Therefore, Z is a sum of 2j rank-1 (or zero) matrices that are pairwise orthogonal with respect to the Frobenius inner product. In light of (2.12) one then obtains corresponding inequalities for ⟨Z, H_X[Z]⟩_F; rearranging and applying the Cauchy-Schwarz inequality leads to an inequality that is equivalent to the asserted one.
We present two corollaries of Proposition 2.9 that will not be used later, but are of independent interest. They concern the positive semidefinite case μ = 0, which includes local minima, for the case that B = A[X*] for some X* ∈ R^{m×n}, that is, B is in the range of A.
Corollary 2.10 Let B = A[X*] and let X be a critical point of f_{A,B} on M_k at which the Riemannian Hessian H_X is positive semidefinite. Assume rank(X − X*) ≤ ℓ and λ(A, ℓ) > 0. Then the kth singular value of X satisfies the inequality

σ_k(X) ≥ (λ(A, ℓ)/Λ(A, 2)) ‖X − X*‖_F.

Proof. Since X − X* ∈ M_{≤ℓ} and by (2.4), we have the lower bounds

s_1 = ‖A[X − X*]‖_{σ,1} ≥ ⟨A[X − X*], X − X*⟩_F / ‖X − X*‖_F ≥ λ(A, ℓ) ‖X − X*‖_F.

The assertion now follows from the previous proposition with μ = 0, j = 1, and Λ_2 = Λ(A, 2), which is positive by assumption.
Since the kth singular value of a rank-k matrix equals its distance to M_{≤k−1} in the Frobenius norm, the previous corollary can be rephrased in the following way.
Corollary 2.11 Under the same assumptions as in Corollary 2.10,

dist_F(X, M_{≤k−1}) ≥ (λ(A, ℓ)/(2Λ(A, 2))) σ_k(X*).

Proof. If ‖X − X*‖_F ≤ σ_k(X*)/2, then Weyl's inequality gives σ_k(X) ≥ σ_k(X*) − ‖X − X*‖_F ≥ σ_k(X*)/2, which is stronger than the asserted bound (since λ(A, ℓ) ≤ Λ(A, 2)). If on the other hand ‖X − X*‖_F > σ_k(X*)/2, the previous corollary provides the asserted bound since dist_F(X, M_{≤k−1}) = σ_k(X).
Note that in the case that X* ∈ M_{≤k} we can choose ℓ = 2k, and the lower bounds in both corollaries become independent of the size of the considered matrices.

RPD property and its implications for critical points
We now come to the main results of the paper on the critical points of the function f_{A,B} for the case that A almost acts as an identity operator on cones of low-rank matrices. This property is quantified by the restricted positive definiteness (RPD) constants below, which are equivalent to the restricted isometry property (RIP) constants in matrix sensing. The most notable result concerns the case that B = A[X*] for some X* ∈ M_k. In this so-called 'noiseless case', and under the RPD assumptions, one can show that f_{A,B} has no local minima on M_k except the single global minimum X*. Moreover, at all other critical points the Riemannian Hessian has sufficiently negative eigenvalues, which is important in optimization methods in order to 'escape' such saddle points. The required bounds on the RPD constants for obtaining this conclusion are, to our knowledge, considerably weaker than the ones available in the literature. The results for the noiseless case are stated in Section 3.2. In Section 3.3 the most general version of our analysis is stated, which deals with the case that the equation A[X] = B admits only an approximate solution X_ε on the set M_k. Then local minima or saddle points with small negative curvature may exist, but their distance to X_ε will be bounded.

RPD property
We still consider the family (1.1) of quadratic functions f_{A,B}, and make some assumptions on the restricted spectral bounds of A that quantify its deviation from the identity.

Definition 3.1 (RPD property). Let k ≥ 1. We say that the symmetric operator A satisfies a (k, δ_k)-RPD property if there exists 0 ≤ δ_k < 1 such that

1 − δ_k ≤ λ(A, k) ≤ Λ(A, k) ≤ 1 + δ_k, (3.1)

with λ(A, k) and Λ(A, k) defined in (2.6) and (2.7).
Remark 3.2 In the context of the matrix sensing problem (1.6) the RIP condition

(1 − δ_k) ‖X‖_F² ≤ ‖F[X]‖² ≤ (1 + δ_k) ‖X‖_F² for all X ∈ M_{≤k}

was introduced in Recht et al. (2010) and used in many subsequent works. We have already mentioned that the matrix sensing problem fits our framework using the operator A = F^T F. So the RPD property above is equal to the RIP in this model. In fact, if we assume A to be positive semidefinite, we can always find a decomposition A = F^T F, where F : R^{m×n} → R^d with d = rank(A). Then we can write ‖F[X]‖² = ⟨X, A[X]⟩_F, so that there is no essential difference between the matrix sensing problem and our setup of finding critical points of (then convex) quadratic functions f_{A,B} on M_k. Regarding the existence of operators A obeying suitable RPD bounds we can therefore rely on the well-known results on operators F satisfying RIP conditions; see Recht et al. (2010). A typical result is, for example, that suitably scaled random Gaussian matrices give rise to F that satisfy a (k, δ_k)-RIP with high probability if d ≳ δ_k^{−2} k(m + n); see Candès & Plan (2011). We refer to Davenport & Romberg (2016) for more references on the RIP.
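The concentration behind such RIP results can be observed empirically. The sketch below samples Rayleigh quotients ‖F[X]‖²/‖X‖_F² of a scaled Gaussian operator over random rank-k matrices; note that sampling only probes the restricted spectrum from inside, so this is an illustration, not a certificate of an RIP constant. All sizes are ad hoc.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k, d = 20, 20, 2, 1500

F = rng.standard_normal((d, m * n)) / np.sqrt(d)   # scaled Gaussian sensing map

# Sample Rayleigh quotients <X, A[X]>_F / ||X||_F^2 = ||F vec(X)||^2 / ||X||_F^2
# over random rank-k matrices; for an RIP they should concentrate around 1.
ratios = []
for _ in range(200):
    X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    x = X.ravel() / np.linalg.norm(X)
    ratios.append(np.linalg.norm(F @ x) ** 2)

lo, hi = min(ratios), max(ratios)
print(f"empirical range on sampled rank-{k} matrices: [{lo:.3f}, {hi:.3f}]")
```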
Remark 3.3 Let us comment on operators whose restricted spectral bounds are not centered around 1. We may say that the symmetric operator A is positive definite on the cone M_{≤k} if λ(A, k) > 0. We can then define the restricted condition number as

κ(A, k) = Λ(A, k)/λ(A, k)

and consider the scaled operator ω_k A with

ω_k = 2/(λ(A, k) + Λ(A, k)).
This operator has the spectral bounds

1 − δ_k ≤ λ(ω_k A, k) ≤ Λ(ω_k A, k) ≤ 1 + δ_k, where δ_k = (κ(A, k) − 1)/(κ(A, k) + 1).

Therefore, the results below on operators that satisfy RPD conditions translate to more general operators A if the properly scaled operator ω_k A satisfies the assumptions. Yet this will mean that A must have a rather small restricted condition number κ(A, 2k) or κ(A, 3k). We will comment on this issue at the end of Section 3.2, and in Remark 3.11.
We note that with the RPD bounds (3.1) the estimate (2.9) leads in a straightforward way to the following upper bound (see also Park et al., 2017, Proposition 2.1 for essentially the same result).

Lemma 3.4 Assume A satisfies a (k + ℓ, δ_{k+ℓ})-RPD property. Then Γ(A, k, ℓ) ≤ δ_{k+ℓ}.

Table 1 Error bounds for different values of μ. The second column states a sufficient condition on δ_3k, taken as 0.9 times the upper bound (3.3), to obtain the estimate on ‖X − X_ε‖_F in the third column (choosing δ_2 = δ_2k = δ_3k in (3.4)). The fourth and fifth columns display the results when δ_3k is less than 0.5 times the upper bound.

Noiseless case
In the noiseless case we assume that there exists X* ∈ M_k with A[X*] = B. In other words, it is assumed that a desired low-rank solution X* of the matrix equation A[X] = B can be found among the critical points of f_{A,B} on M_k (provided we know k), which makes it possible to find it, for example using Riemannian optimization methods.
Theorem 3.5 Let X* ∈ M_k be such that B = A[X*], and let μ ≥ 0. Assume A satisfies a (3k, δ_3k)-RPD property with δ_3k below the upper bound (3.3) given below. Then on M_k, X* is the unique solution of A[X] = B and the unique global minimum of f_{A,B}. At all other critical points X ≠ X* of f_{A,B} on M_k the Riemannian Hessian satisfies

⟨Z, H_X[Z]⟩_F < −μ ‖Z‖_F²

for some tangent vector Z ∈ T_X M_k. Alternatively, the same statements hold in the case that A satisfies a corresponding (2k, δ_2k)-RPD condition.

Proof. The uniqueness statements follow from the representation (1.5) together with the RPD assumptions. The statement on the other critical points follows from considering the special case ε = 0 in the more general Theorem 3.9 proved below.
As an example consider the value μ = 1. Then under the conditions δ_3k ≤ 0.2399 or δ_2k ≤ 0.1861 the Riemannian Hessian at all critical points except the global minimum X* has a negative eigenvalue smaller than −1. More examples on the relation between μ and δ are presented in Tables 1 and 2. We highlight the case μ = 0 separately, as it implies the absence of local minima.
Corollary 3.6 Let X* ∈ M_k be such that B = A[X*], and assume A satisfies the conditions of Theorem 3.5 with μ = 0. Then on M_k, X* is the unique solution of A[X] = B and the unique global minimum of f_{A,B}. There exist no other local minima of f_{A,B} on M_k. Alternatively, the same statements hold under the corresponding condition on δ_2k.

For reference we also generalize Theorem 3.5 to operators whose restricted spectral bounds are not centered around 1. The proof follows according to Remark 3.3 by considering the scaled operators ω_3k A and ω_2k A, respectively.

Corollary 3.7 Let X* ∈ M_k and B = A[X*] and μ ≥ 0. Assume λ(A, 3k) > 0 and that the correspondingly scaled condition of Theorem 3.5 holds. Then at all critical points X ≠ X* of f_{A,B} on M_k the Riemannian Hessian satisfies ⟨Z, H_X[Z]⟩_F < −μ ‖Z‖_F² for some tangent vector Z ∈ T_X M_k. Alternatively, the same conclusions hold (with λ(A, 3k), Λ(A, 3k) replaced by λ(A, 2k), Λ(A, 2k)) in the case that λ(A, 2k) > 0 and the analogous condition holds. Taking μ = 0 we get corresponding sufficient conditions for the absence of local minima X ≠ X*.

General case
In the general case the exact solution X* of the matrix equation A[X] = B may have rank larger than k but still admit good low-rank approximations, that is, it will be close to M_{≤k}. Or, the given matrix B may only satisfy A[X*] ≈ B approximately and may not even belong to the range of A. In the matrix sensing problem (1.6) this may correspond to noisy observations b. For linear matrix equations (1.2) this corresponds to a perturbation of the right-hand side.
In the main result of this paper we focus on points X_ε ∈ M_k with ‖A[X_ε] − B‖_{σ,2k} ≤ ε, and estimate the distance of certain other critical points on M_k (including local minima) to X_ε. After giving the proof we calculate some concrete values and make some comments on how to interpret this result in the context of the strict saddle point property. Regarding potential critical points of f_{A,B} on M_{≤k} with rank strictly less than k we refer once more to Proposition 2.4.
We first present a lemma that will allow us to state our conditions on the RPD constants in the main result more conveniently.
Lemma 3.8 Let c, μ > 0. Then the restriction of the function K(c, μ, ·) to the positive axis possesses a single pole and is positive if and only if δ lies below an explicit upper bound depending on c and μ. For fixed c and μ it holds that K(c, μ, δ) → ∞ when δ approaches this upper bound. On the other hand, the bound is monotonically decreasing with respect to c and μ.
Proof. The statement is equivalent to a quadratic inequality in δ under the restriction δ > 0. The parabola is negative at zero, and therefore δ must lie between zero and its positive root, which is the asserted condition.
We now state the main result. Note that we could replace the ‖·‖_{σ,2k}-norm in the assumptions with the Frobenius norm, since ‖A[X_ε] − B‖_F ≤ ε would be a stronger condition.
Theorem 3.9 Let A, B and ε > 0 be given such that there exists an X_ε ∈ M_k with ‖A[X_ε] − B‖_{σ,2k} ≤ ε. The following two statements hold.
(i) If A satisfies RPD properties such that condition (3.3) on δ_3k holds (the first two inequalities pose no restriction), then X satisfies the corresponding estimate.

(ii) If instead condition (3.5) on δ_2k holds (the first inequality is no restriction), then X satisfies the corresponding estimate.

Proof. We assume X ≠ X_ε, otherwise there is nothing to show. We derive upper and lower bounds for a certain quantity α. By the triangle inequality, Lemmata 2.7 and 2.8 give upper bounds; applying Lemma 3.4 to either of these then results in the two estimates (3.6) and (3.7). We turn to the lower bound on α provided by Lemma 2.3. Let ς_1 ≥ ⋯ ≥ ς_k > 0 denote the singular values of X, and s_1 ≥ ⋯ ≥ s_2k the 2k largest singular values (some may be zero) of A[X] − B. By the lemma, there exists some 0 ≤ j ≤ k for which (3.8) holds. (If j = 0 there are no ς_i, while if j = k there are no s_i.) By Proposition 2.9 (with j = 1), we obtain (3.9). Using (2.4) we can further estimate, and with (3.9) we arrive at the lower bound (3.10). Taken together, after multiplying by √2, the bounds (3.6) and (3.10) yield an inequality in ‖X − X_ε‖_F. Since δ_2 ≤ δ_2k ≤ δ_3k, the term in brackets on its left-hand side is positive under the given condition (3.3) on δ_3k by Lemma 3.8 (here c = √2). This leads to the assertion in item (i). Item (ii) is obtained by combining (3.7) and (3.10) instead.
Remark 3.10 The theorem is formulated for the distance ‖X − X_ε‖_F in the Frobenius norm, but in applications the difference in function values f_A,B(X) − f_A,B(X_ε) may be more relevant, in particular if f_A,B takes the form (1.5) of a shifted squared (semi)norm for the operator A up to a constant (one may additionally assume f_A,B(X_ε) = 0). A Taylor expansion at X_ε gives an exact second-order formula, and thus, under the assumptions of the theorem, a bound on this difference. The estimates for ‖X − X_ε‖_F from the theorem can then be used.
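The two displayed formulas of Remark 3.10 are elided above; a possible reconstruction, assuming the quadratic form f_{A,B}(X) = ½⟨X, A[X]⟩_F − ⟨B, X⟩_F from the introduction and that the ‖·‖_{σ,2k}-norm bounds the Frobenius pairing with rank-2k matrices (both are assumptions on our part), reads:

```latex
% Exact second-order Taylor expansion of the quadratic f_{A,B} at X_eps:
\[
  f_{A,B}(X) - f_{A,B}(X_\varepsilon)
    = \langle \mathcal{A}[X_\varepsilon] - B,\ X - X_\varepsilon \rangle_F
    + \tfrac{1}{2}\,\langle X - X_\varepsilon,\ \mathcal{A}[X - X_\varepsilon] \rangle_F .
\]
% Since X, X_eps in M_k implies rank(X - X_eps) <= 2k, the residual bound
% and the restricted upper spectral bound Lambda(A, 2k) then give
\[
  \bigl| f_{A,B}(X) - f_{A,B}(X_\varepsilon) \bigr|
    \le \varepsilon \,\| X - X_\varepsilon \|_F
    + \tfrac{1}{2}\,\Lambda(\mathcal{A},2k)\,\| X - X_\varepsilon \|_F^{2} .
\]
```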
Remark 3.11 Similar to Corollary 3.7, and based on Remark 3.3, the theorem can be generalized to operators whose restricted spectral bounds are not centered around 1, but are otherwise well conditioned on M 2k or M 3k .If the conditions on δ 2k or δ 3k in Theorem 3.9 are fulfilled for a scaled operator ωA where ω > 0, the statement of the theorem remains true for the initial operator A if one replaces μ and ε by μ/ω and ε/ω, respectively.

Some concrete bounds
The constants in the estimates on ‖X − X_ε‖_F provided by Theorem 3.9 become arbitrarily large when δ_3k and δ_2k approach the required upper bounds. Therefore, in order to obtain reasonable estimates, one needs smaller values of δ_3k and δ_2k. To gain some intuition for the actual numbers, we computed, for several values of μ, the guaranteed error bounds on ‖X − X_ε‖_F when A satisfies an RPD property with δ_3k or δ_2k equal to 90% or 50% of the critical upper bounds (3.3) and (3.5), respectively. These values are presented in Table 1 (for δ_3k) and Table 2 (for δ_2k), where ε and X_ε are as in the theorem. Clearly, when μ is fixed, smaller RPD constants lead to better estimates.
In the context of the so-called strict saddle point properties that have been discussed in related work, one can spell out these results as follows: for given μ > 0 and assuming δ_3k (or δ_2k) satisfies the bound asserted in the table, all critical points X of f_A,B on M_k either have the asserted distance ‖X − X_ε‖_F to a point X_ε ∈ M_k satisfying ‖A[X_ε] − B‖_{σ,2k} < ε, or the Riemannian Hessian at X has at least one negative eigenvalue strictly less than −μ. In particular, the rows for μ = 0 in the tables provide bounds on the distance of local minima to the set of all such X_ε.

Numerical experiments
We now report on numerical experiments that verify our main result in Theorem 3.9. The experiments confirm that different algorithms indeed find ε-close solutions of the noisy matrix sensing problem, as predicted by the theory. Note that the conditions on the RPD constants δ_2k and δ_3k obtained in this work are only sufficient, and perhaps still rather loose. The influence of these constants is not explored here in detail.
We consider m = n and construct matrix sensing problems involving two types of RPD operators A on R^{n×n}. The first construction, called deterministic, is of the vectorized form A = Id + δ·QDQ^T with Q a random orthogonal matrix and D a diagonal matrix with random ±1 entries on its diagonal. Such an A is RPD with constant δ_k = δ for all k. The other construction, called random, uses the nearly isometric random matrices from Recht et al. (2010). In particular, A = F^T F with F ∈ R^{d×n²} and F_ij random Gaussian N(0, 1/d). For sufficiently large choices of n and d this A satisfies any desired RPD property with high probability; see Recht et al. (2010). Due to size limitations, however, we took n = 50, k = 5, d = 5nk, which does not yet correspond to δ_3k ≤ δ_3k^crit but still allows us to verify convergence of the algorithms. In all cases we generate an 'exact' solution X* = GH^T ∈ M_k with G, H random Gaussian matrices of size n × k. We then compute B = A[X*] + ε·N with N a random Gaussian matrix, scaled so that ‖N‖_F = 1, and ε ≥ 0 a noise factor. This guarantees in particular that ‖A[X_ε] − B‖_{σ,2k} ≤ ε, as required in Theorem 3.9. When ε > 0 the global minimizer of f_A,B is unknown and we therefore take X_ε = X*.
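As a small-scale sketch, the deterministic construction and the data generation above can be reproduced as follows (we use n = 12, k = 3 instead of the paper's n = 50, k = 5 to keep the example light; the names A, X_star and B below are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, delta, eps = 12, 3, 0.3, 1e-5

# Deterministic operator A = Id + delta * Q D Q^T in vectorized form:
# Q a random orthogonal (n^2 x n^2) matrix, D diagonal with random +/-1.
# The spectrum of A is {1 - delta, 1 + delta}, so A is RPD with
# constant delta_k = delta for every rank k.
Q, _ = np.linalg.qr(rng.standard_normal((n * n, n * n)))
d = rng.choice([-1.0, 1.0], size=n * n)

def A(X):
    x = X.reshape(-1)
    return (x + delta * (Q @ (d * (Q.T @ x)))).reshape(n, n)

# 'Exact' solution X* = G H^T of rank k and noisy data B = A[X*] + eps * N,
# with N Gaussian scaled to unit Frobenius norm.
G, H = rng.standard_normal((n, k)), rng.standard_normal((n, k))
X_star = G @ H.T
N = rng.standard_normal((n, n))
N /= np.linalg.norm(N)
B = A(X_star) + eps * N

# Sanity check: the Rayleigh quotient of A lies in [1 - delta, 1 + delta].
Z = rng.standard_normal((n, n))
q = np.sum(Z * A(Z)) / np.sum(Z * Z)
```

By construction ‖A[X*] − B‖_F = ε, so X_ε = X* satisfies the residual assumption of Theorem 3.9 even in the (stronger) Frobenius norm.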
The methods that minimize f_A,B over rank-k matrices are as follows:

• Embedded SD: Riemannian steepest descent on M_k with the embedded submanifold geometry and the restricted Euclidean metric. This is the same geometry as in Shalit et al. (2012), Vandereycken (2013), Vandereycken & Vandewalle (2010) and Wei et al. (2016).
• Embedded CG: Same as Embedded SD but with nonlinear conjugate gradients. This corresponds to the solver GeomCG from Vandereycken (2013), but applied to sensing instead of completion.
• Quotient SD: Same as the embedded solver but using the quotient geometry from Mishra et al. (2012).
• ALS: Alternating least squares with QR stabilization of the iterates to avoid ill-conditioning. This appears, for example, in Wen et al. (2012, §2.1), where it is called the nonlinear Gauss-Seidel method.
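For concreteness, one Embedded SD step can be sketched as follows, assuming the standard tangent-space projection and metric-projection (truncated SVD) retraction for the embedded geometry of M_k as in Vandereycken (2013); the function name sd_step and the toy objective in the note below are our own:

```python
import numpy as np

def sd_step(X, egrad, k, t):
    """One Riemannian steepest-descent step on the manifold of rank-k matrices."""
    # Tangent-space projection at X: with X = U S V^T,
    # P(G) = U U^T G + G V V^T - U U^T G V V^T.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U, Vt = U[:, :k], Vt[:k, :]
    GV = egrad @ Vt.T @ Vt
    xi = U @ (U.T @ egrad) + GV - U @ (U.T @ GV)
    # Retraction: truncate X - t*xi back to rank k via the SVD.
    U2, s2, V2t = np.linalg.svd(X - t * xi, full_matrices=False)
    return (U2[:, :k] * s2[:k]) @ V2t[:k, :]
```

For the toy objective f(X) = ½‖X − B0‖_F² (i.e. A = Id) the Euclidean gradient is X − B0, and repeated steps drive X toward a best rank-k approximation of B0.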
All methods except ALS were implemented using Manopt (Boumal et al., 2014) with standard options, except that we used exact line search to verify the theoretical convergence rates. For ALS the asymptotic rate is computed as the largest eigenvalue of modulus less than one of the linearized iteration matrix at the limit point, which is obtained from a block triangular decomposition of the Euclidean Hessian of F(G, H) = f(GH^T), as in the general nonlinear Gauss-Seidel method; see Ortega & Rheinboldt (1970, §10.3.4-5). Computations were done in Matlab R2017b using 34 decimal digits to better judge whether the iterates have converged to the global optimum. In the figures we also display estimated asymptotic convergence rates ρ for the iterations ℓ = 1, 2, …; these were computed from the (Riemannian) Hessian H_X at the limit point X, with κ the condition number of H_X as computed by Manopt. The rate ρ_SD of SD corresponds to the Euclidean case with exact line search and is well known. For SD on Riemannian manifolds this rate has been suggested to hold as well; see Udriste (1994, p. 270). The rate ρ_CG is intended for numerical verification and is rigorous for the unconstrained CG method. We could have compared many different solvers but, for simplicity, we have restricted ourselves mainly to Riemannian algorithms, since they are cheap per iteration and typically perform very well. In addition, many other low-rank optimization methods, like iterative hard thresholding or projected gradient descent, have the same asymptotic behavior. Another reason was to verify that the Riemannian Hessian used in Theorem 3.9 indeed captures the correct behavior in the convergence plots.
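The explicit formulas for ρ_SD and ρ_CG are elided above; the classical exact-line-search contraction factors for a quadratic with Hessian condition number κ, which we assume are the intended ones (stated here in the energy norm; the function-value rate for SD is the square of ρ_SD), can be sketched as:

```python
import numpy as np

def rates(kappa):
    """Classical per-iteration contraction factors (energy norm) for a
    quadratic objective with Hessian condition number kappa:
    steepest descent with exact line search, and (linear) CG."""
    rho_sd = (kappa - 1.0) / (kappa + 1.0)
    rho_cg = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    return rho_sd, rho_cg
```

For κ = 100 this gives ρ_SD ≈ 0.98 and ρ_CG ≈ 0.82, matching the usual observation that CG converges markedly faster than SD on ill-conditioned problems.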
The convergence plots are displayed in Figs 1 and 2. The left panels show the convergence of the function values and indicate good agreement with the theoretical asymptotic rates. The right panels verify that the errors of the local minima obtained for the noisy problem are of the order of ε, as predicted by Theorem 3.9. This bound is explored in more detail in Table 3 for the deterministic case. Compared to Fig. 1, we continued for 500 iterations and took the minimal value of ‖X^(ℓ) − X_ε‖_F as an approximation of the limit point of each iteration. Note that Table 1 predicts that the error should be bounded by 24.28·ε, which is always achieved. Indeed (for ε > 0), we observe a much better constant of about 1.2·ε for the final error in this example.

Conclusion
We have studied some properties of critical points of quadratic functions on manifolds of fixed-rank matrices. In particular, estimates for the singular values of local minima have been derived that relate them to the singular values of the gradient at these minima. Then, under certain assumptions on bounds for the Rayleigh quotient of the Hessian on the cones of bounded-rank matrices, which generalize the popular RIP conditions for matrix sensing, our estimates imply that there cannot be spurious local minima far away from the global one. In particular, spurious local minima are absent in the noiseless case.
The required restricted spectral bounds to obtain these results are considerably weaker than in related previous publications.So far our approach does not cover the important cases of matrix completion or the typical matrix equations in numerical linear algebra, as they do not meet the restricted spectral bounds.However, some of the presented techniques may still be useful when studying these cases as well.
∇f_A,B(X) = A[X] − B, and critical (or stationary) points of the function f_A,B thus correspond to solutions of the linear matrix equation A[X] = B.

Fig. 2 .
Fig. 2. Random Gaussian A with n = 50, k = 5, d = 5nk. Both panels show results in open circles for zero noise (ε = 0) and in closed circles for ε = 10⁻⁵ (every fifth iteration shown). The left panel also shows, as a line, the asymptotic convergence rate based on the spectrum of the Hessian.

Table 2
Same as Table 1 but with conditions on δ_2k

Corollary 3.6 Let X* ∈ M_k and A[X*] = B. Assume A satisfies RPD properties such that

Table 3
Statistics for 20 random realizations of the deterministic A operator with n = 50, k = 5 and δ = 0.9 • δ crit