Fast, Robust and Non-convex Subspace Recovery

This work presents a fast and non-convex algorithm for robust subspace recovery. The data sets considered include inliers drawn around a low-dimensional subspace of a higher-dimensional ambient space, and a possibly large portion of outliers that do not lie near this subspace. The proposed algorithm, which we refer to as Fast Median Subspace (FMS), is designed to robustly determine the underlying subspace of such data sets, while having lower computational complexity than existing methods. We prove convergence of the FMS iterates to a stationary point. Further, under a special model of data, FMS converges to a point near the global minimum with overwhelming probability. Under this model, we show that the iteration complexity is globally bounded and locally r-linear. The latter result holds for any fixed fraction of outliers (less than 1) and any fixed positive distance between the limit point and the global minimum. Numerical experiments on synthetic and real data demonstrate the competitive speed and accuracy of FMS.


Introduction
In the modern age, data is collected in increasingly higher dimensions and in massive quantities. An important method for analyzing large, high-dimensional data is to model it by a low-dimensional subspace. By projecting the data onto this subspace, one can significantly reduce its dimension while capturing its most significant information. Classically, this is the problem of principal component analysis (PCA), which finds the subspace of maximum variance. PCA is efficiently implemented for moderate-size data by using the singular value decomposition (SVD) of the data matrix. For larger data, recently developed randomized SVD methods have proved stable, accurate, and fast [21].
Despite the impressive progress with effective algorithms for PCA, the underlying idea of PCA breaks down completely when the data is corrupted. Among the many possible models for corrupted data sets, here we follow an "inliers-outliers" corruption model. More precisely, we assume that some of the data points (the inliers) are sampled around a d-dimensional subspace, whereas the rest of them (the outliers) are sampled from a different (and possibly arbitrary) model. The problem of Robust Subspace Recovery (RSR) is to robustly estimate the underlying low-dimensional subspace in the presence of outliers. We note that this problem is distinct from what is commonly referred to as "robust PCA", that is, recovering the low-rank structure in a matrix with sparse element-wise corruptions (see e.g., the work of Candès et al. [10]). Experience has shown that robust PCA algorithms tend to perform poorly in the RSR regime, especially when the proportion of outliers is high. Much recent work has been devoted to developing numerically efficient solutions of the RSR problem. Current batch RSR formulations are at best comparable to full PCA (which computes all D eigenvectors). That is, their complexity is of order O(T N D^2), where T is the number of iterations until convergence, N is the number of points and D is the ambient dimension. We are unaware of sufficiently accurate batch RSR algorithms that scale like O(T N Dd), where d is the dimension of the approximated subspace.
To address this void, we propose a novel non-convex algorithm for RSR: the Fast Median Subspace (FMS) algorithm. The computational cost of FMS is of order O(T N Dd), which depends only linearly on D (when d is small); moreover, FMS empirically seems to obtain the smallest T and the highest accuracy among RSR algorithms (the Tyler M-estimator [57,65] has comparable accuracy in many cases, but its computational cost per iteration is significantly larger in moderate or high ambient dimensions). Theoretical guarantees under a model of corrupted data and empirical tests demonstrate the merit of the FMS algorithm.

Previous Works
PCA is by now a classic and ubiquitous method in data analysis [29]. Since it is obtained by the SVD of the data matrix, it enjoys a wealth of efficient numerical methods. In the last decade, various randomized methods have been proposed for fast and accurate computation of the top singular vectors and values (see the review by Halko et al. [21]). For example, Liberty et al. [38] demonstrated a randomized algorithm of order O(N D log(d) + (N + D)d^2) for rank-d approximate PCA; and Rokhlin et al. [51] combined random dimension reduction with the power method to obtain a PCA algorithm with complexity CN Dd (where C is a small constant) and with significantly improved accuracy when the singular values decay sufficiently fast. The complexity of state-of-the-art algorithms for online PCA [4,5] is at best of order O(T Dd); however, in practice T is often large and their accuracy is often not competitive.
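The randomized approach just described can be sketched in a few lines of NumPy: form a random range sketch of the data matrix, optionally sharpen it with a few power iterations, and solve a small SVD in the sketched coordinates. This is an illustrative sketch of the general technique, not the exact algorithms of [38] or [51]; the function name and parameters (`n_power`, `oversample`) are our choices.

```python
import numpy as np

def randomized_pca(X, d, n_power=2, oversample=10, seed=0):
    """Sketch of randomized PCA. X is a D x N data matrix; returns a D x d
    orthonormal basis approximating the span of the top d left singular
    vectors (illustrative only)."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    k = min(d + oversample, min(D, N))
    # initial range sketch via a random Gaussian test matrix
    Q = np.linalg.qr(X @ rng.standard_normal((N, k)))[0]
    for _ in range(n_power):
        # power iterations sharpen the sketch when singular values decay slowly
        Q = np.linalg.qr(X @ (X.T @ Q))[0]
    # small SVD in the sketched coordinates
    U_small = np.linalg.svd(Q.T @ X, full_matrices=False)[0]
    return (Q @ U_small)[:, :d]
```

The cost is dominated by the matrix products with X, i.e., O(N Dk) per pass, which is the linear-in-D scaling discussed above.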
While PCA is ubiquitous for subspace modeling without corruption, there is not yet a clear best choice of RSR algorithm. Many strategies for RSR have been established in the last three decades (see the review by Lerman et al. [34] and some of the recent developments by Xu et al. [60,61], McCoy and Tropp [46], Zhang and Lerman [66], Lerman et al. [34], Zhang [65], Feng et al. [19], Hardt and Moitra [23], and Goes et al. [20]). Most of the emphasis of the theoretical analysis of RSR algorithms has been on quantifying the largest percentage of outliers under which the studied algorithm can be sufficiently accurate [61,34,66,65,23]. In particular, Hardt and Moitra [23] have shown that guaranteeing the success of an RSR algorithm with a fraction of outliers larger than (D − d)/D for a broad range of instances is as hard as solving the small set expansion problem; they also showed that this fraction can be achieved in their setting, though it is possible to achieve a better fraction in special instances [66, p. 766]. As opposed to the algorithms of Lerman et al. [34], Zhang and Lerman [66], Zhang [65], and Hardt and Moitra [23], other RSR algorithms may not be accurate with a high percentage of outliers. Table 1 in [66] summarizes theoretical bounds on the ratio of inliers to outliers required for recovery. The bounds for all algorithms in this table depend asymptotically on d and D, and some also depend on the variances of the inliers and outliers.
Many of the successful RSR algorithms involve minimizing an energy, which is robust to outliers. For example, Xu et al. [61,60], McCoy and Tropp [46], Zhang and Lerman [66], Lerman et al. [34], and Goes et al. [20] use convex relaxations of the same energy, which is later formulated in (1) when p = 1 and δ = 0. We believe that since FMS targets the original robust energy and not a convex relaxation of it, FMS achieves higher accuracy and possibly even faster convergence; however, its analysis is difficult due to the non-convexity. The Tyler M-estimator minimizes a possibly more robust energy and thus obtains competitive accuracy (empirically, our method is as accurate as Tyler's M-estimator). However, it cannot obtain sufficiently competitive speed since it requires full eigenvalue decomposition as well as initial dimensionality reduction by PCA onto a subspace whose dimension is of the order of the number of points. While many of these algorithms for RSR are not sufficiently fast, others are also not very well justified in theory. For example, HR-PCA [60] and DHR-PCA [19] quantify their recovery by the "expressed variance" (EV), but their actual bounds seem to be weak. This is evident in Theorem 2 of [19], which gives asymptotic guarantees. Consider the case of 10% outliers drawn from a standard Gaussian, and inliers drawn from a standard Gaussian restricted to a subspace. Then it can be shown that their lower bound for EV is 0.09; an EV of 1 amounts to exact recovery.
On the other hand, the procedures of Hardt and Moitra [23] do not involve robust energy minimization, but instead try to fit many different subspaces until success. They are not sufficiently fast, and we are unaware of truly competitive implementations of them. Online algorithms [67,20] for RSR suffer from the same problems as the online PCA algorithms mentioned above: the number of iterations required can be quite large, and their accuracy is often not competitive.
An important algorithm to compare with is spherical PCA (SPCA). SPCA performs PCA on the data after it is centered and then projected to the unit sphere. Maronna et al. [45] determined that SPCA was their method of choice when compared with various RSR algorithms [44]. Further, the complexity of running SPCA on a data set is O(N Dd), which is faster than FMS by a multiplicative constant. Our tests indicate that while SPCA is faster, it does not achieve the competitive accuracy of FMS on the subspace recovery problems in the numerical tests of §4. Similarly to SPCA, many energy-minimization-based algorithms (in particular, [61,46,66,34,20]) benefit from initial data normalization to the unit sphere (after robust centering). Indeed, while their underlying energies are robust to high percentages of some outliers, they may be sensitive to adversarial outliers of very large magnitude.
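For reference, SPCA can be sketched in a few lines of NumPy. The coordinate-wise median used below is a simple stand-in for a robust centering step (our simplification for illustration; [44,45] discuss robust centering in detail):

```python
import numpy as np

def spherical_pca(X, d):
    """Sketch of spherical PCA for a D x N data matrix X: robustly center
    (here, coordinate-wise median), project points to the unit sphere, then
    take the top d principal directions. Returns a D x d orthonormal basis."""
    Z = X - np.median(X, axis=1, keepdims=True)
    norms = np.linalg.norm(Z, axis=0)
    Z = Z[:, norms > 0] / norms[norms > 0]   # normalize points to the unit sphere
    return np.linalg.svd(Z, full_matrices=False)[0][:, :d]
```

The normalization step is exactly what tames outliers of very large magnitude: after projection to the sphere, a single huge outlier contributes no more than any other point.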
It is also worth noting a couple of recent works that scale to larger data than previous RSR algorithms. The work on Adaptive Compressive Outlier Sampling by Li and Haupt [37] can be viewed as a solution to the RSR problem with drastically reduced complexity that depends on how many rows and columns of the data are selected. However, it is not as effective at precisely identifying the underlying subspace as our method; this stems from the fact that it is an approximate version of Outlier Pursuit (OP) [61], and OP itself could not compete with the accuracy of other RSR algorithms in many of the regimes we test. Another recent algorithm with the potential to scale as well as FMS for RSR is the work on Grassmann Averages [24], provided that the correct robust function µ_rob is chosen. However, Grassmann Averages lacks theoretical justification for both convergence and robustness.
An algorithm similar to FMS is explored by Wang et al. [59], who proposed a non-convex robust PCA algorithm. Although their algorithm is not suited to the RSR problem, it was still relevant to our work on FMS. First, we borrowed from Wang et al. [59] an argument for the proof of convergence of the FMS iterates to a stationary point (it is one of several different arguments used in our proof). Second, the FMS algorithm might be viewed as a soft analog of the alternating least squares (ALS) procedure of [59] (FMS divides by a power of the distance to a subspace, whereas ALS divides by 1 or "infinity"; FMS applies randomized SVD, whereas ALS applies alternating low-rank approximation).
Finally, there are many recent works on the analysis of non-convex algorithms and their surprising effectiveness on problems with structured data. Some examples include works by Sun et al. [53,54], Zhang and Balzano [64], Jain et al. [28], Dauphin et al. [16], Keshavan et al. [31,32], Boumal [8], and Bandeira et al. [6]. In particular, there has been non-convex analysis related to low-rank modeling (see the work of Keshavan et al. [31,32], Jain et al. [27], Hardt [22], Netrapalli et al. [48], Jain and Netrapalli [26], Jain et al. [28], and Zhang and Balzano [64] for some examples). Our analysis of FMS presents yet another example where a non-convex algorithm is surprisingly accurate in low-rank modeling despite potential issues of non-convex optimization, such as slow convergence or convergence to a non-optimal point. We also point the reader to the work of Daubechies et al. [15], which analyzes the convergence of an IRLS method when the energy is non-convex. Although their method is for a different problem, there is a strong similarity in the use of non-convex energies: in their case when τ < 1 and in our case when p < 1.

This Work
The FMS algorithm improves on existing methods due to its fast runtime and state-of-the-art accuracy. However, the underlying minimization of FMS is non-convex and thus difficult to analyze. This work contributes to non-convex analysis and direct optimization on the Grassmannian manifold G(D, d) in the following ways:
1. We prove convergence of the FMS iterates to a stationary point over G(D, d).
2. For two special models of data, we prove:
• the stationary point is sufficiently close to the global minimum with overwhelming probability;
• the convergence rate is globally bounded and locally r-linear.
These two models are: (a) inliers are drawn from a spherically symmetric distribution restricted to a fixed subspace L_1^* and outliers are drawn from a spherically symmetric distribution in the whole space; (b) the subspace dimension is d = 1 and outliers either are symmetrically distributed in the ambient space or lie on another subspace L_2^* (where it is less probable to draw points from L_2^* than from L_1^*).
3. For both models in 2, we guarantee approximate recovery for any percentage of outliers (less than 1); the theory of other RSR algorithms requires bounds on this percentage.
4. Among all RSR algorithms, we provide the only guarantees for model 2(b).
In addition to the theory, we rely on careful numerical experimentation and believe that the results reported in this paper strongly indicate the merit of FMS. The FMS algorithm displays competitive speed and accuracy on synthetic data sets. Unlike other RSR algorithms, FMS also shows strong performance as a dimension reduction tool for clustering data since it scales well, as we demonstrate on the human activity recognition data in §4.2.2.

Structure of The Paper
This paper begins by motivating and outlining our new algorithm in §2. Next, §3 establishes convergence of the FMS iterates to a stationary point, and further gives optimality and rate guarantees for FMS under a certain model of data. Experiments on synthetic and real data (from astrophysics, human activity, and face recognition) are presented in §4 to demonstrate the usefulness of our new approach; §5 contains the proofs of our theorems. Lastly, §6 concludes this work.

The FMS Algorithm
This section presents the FMS algorithm. First, §2.1 presents basic notation used throughout the paper. Then, in §2.2 we describe its underlying minimization problem and its robustness. Next, in §2.3 we propose the FMS algorithm, while motivating it in a heuristic way. Finally, §2.4 summarizes its complexity, and §2.5 discusses the choices of parameters for the FMS algorithm.

Notation
We assume a data set X = {x_i}_{i=1}^N of N points in R^D. We seek an approximating d-dimensional subspace L, or in short, a d-subspace. We denote by G(D, d) the Grassmannian manifold of linear d-subspaces of R^D. For L ∈ G(D, d) we denote by P_L the orthogonal projector onto L, which we view as an element of R^{D×D}. Let ‖·‖ denote the Euclidean norm on R^D. For x ∈ R^D and L ∈ G(D, d), we denote by dist(x, L) the Euclidean distance of x to L, that is, ‖x − P_L x‖. For the distance between L_1, L_2 ∈ G(D, d), denoted dist(L_1, L_2), we use the square root of the sum of the squared principal angles between L_1 and L_2.
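For concreteness, both distances just defined can be computed as follows when a subspace is represented by a D × d matrix with orthonormal columns (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def dist_point_to_subspace(x, B):
    """Euclidean distance ||x - P_L x||, where B is a D x d orthonormal basis of L."""
    return np.linalg.norm(x - B @ (B.T @ x))

def dist_subspaces(B1, B2):
    """Square root of the sum of squared principal angles between two
    d-subspaces, given D x d orthonormal bases B1 and B2. The cosines of the
    principal angles are the singular values of B1^T B2."""
    s = np.clip(np.linalg.svd(B1.T @ B2, compute_uv=False), -1.0, 1.0)
    return np.sqrt(np.sum(np.arccos(s) ** 2))
```

For example, the xy-plane and the xz-plane in R^3 share one direction (angle 0) and differ by a right angle in the other, so their distance is π/2.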

The Underlying Minimization Problem
Many approaches for RSR are motivated by the following minimization problem: for the data set X = {x_i}_{i=1}^N ⊂ R^D, 0 < p < 2 and δ > 0, find a d-subspace L that minimizes among all such subspaces the energy

F_{p,δ}(L) = Σ_{i=1}^N ρ_{p,δ}(dist(x_i, L)), where ρ_{p,δ}(t) = t^p if t^{2−p} ≥ pδ, and ρ_{p,δ}(t) = t^2/(2δ) + (1 − p/2)(pδ)^{p/(2−p)} if t^{2−p} < pδ. (1)

When δ = 0, this energy reduces to F_{p,0}(L) = Σ_{i=1}^N dist^p(x_i, L). Further, taking lim_{p→2} F_{p,δ} results in the PCA energy. Setting p = 1 in this energy results in a natural robust extension of PCA, since the solution to the minimization can be thought of as a geometric median subspace. Even approximate minimization of this energy is nontrivial: it has been shown to be NP-hard for 1 ≤ p < 2 [12] (and is assumed to be even harder for 0 < p < 1). This minimization was suggested with p = 1 and δ = 0 by Osborne and Watson [50], Späth and Watson [52], and Nyquist [49], who also proposed algorithmic solutions when d = D − 1. Later, heuristic solutions were proposed for any d < D by Ding et al. [17] and Zhang et al. [67]. While the domain of this minimization is the set of all affine d-subspaces, experience shows that initial robust centering by the geometric median followed by minimization over G(D, d) is successful. We thus assume in our discussion that the data is centered (if not, we center it at the geometric median) and that the domain of the minimization is G(D, d). It is important to be aware that in the presence of a single outlier of sufficiently large magnitude, the minimizer of (1) (or of any of its convex relaxations) may fail to approximate the underlying subspace. Such a case (and its many variants) can be avoided by normalizing the centered data points according to their Euclidean norms so that they lie on the unit sphere (see more discussion in [66,34]).
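As a concrete reference, the unregularized case δ = 0 of this energy, Σ_i dist^p(x_i, L), can be evaluated with a few lines of NumPy (an illustrative sketch; the function name is ours, and the subspace is represented by an orthonormal basis):

```python
import numpy as np

def f_p(X, B, p=1.0):
    """The robust energy of (1) with delta = 0: sum_i dist(x_i, L)^p,
    for a D x N data matrix X and a D x d orthonormal basis B of L."""
    R = X - B @ (B.T @ X)                       # residuals x_i - P_L x_i
    return np.sum(np.linalg.norm(R, axis=0) ** p)
```

For p = 2 this is the (negated, up to constants) PCA objective; for p = 1 each point contributes its plain distance, so distant outliers are penalized linearly rather than quadratically.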
On the surface, this is a rather natural robust minimization problem, which formally generalizes the notion of the geometric median to subspaces (see the discussion in [66]). However, the minimization is non-convex (since its domain, the Grassmannian, is non-convex). As mentioned above, convex relaxations of it with p = 1 have been studied [61,46,66,34]. Nevertheless, for some real data their solutions are not satisfactory (see §4.2). Furthermore, it is possible that direct approaches to the non-convex minimization, especially with p < 1, can yield even more robust solutions. Finally, we are unaware of sufficiently fast implementations of such convex relaxations. We thus suggest revisiting the original minimization problem while aiming to obtain faster and more accurate algorithms for practical data.
Robustness of the abstract global minimizer of (1) when δ = 0 was analyzed in [36,35] under special assumptions on the data. In particular, Lerman and Zhang [35] showed that for spherically symmetric outliers and spherically symmetric inliers within a d-subspace (possibly with additional outliers within less significant d-subspaces), asymptotic exact recovery is possible even when the fraction of outliers approaches 100%. Similarly, near recovery is possible with a small amount of noise. Furthermore, in the noiseless case with outliers, the theory of Zhang and Lerman [66] and Lerman et al. [34] directly extends to the abstract global minimizer of (1) when p = 1 and δ = 0 (see the remarks of §2.3 in [34] and Theorem 1 of [66]).

Proposed Solution to the Non-convex Minimization
We heuristically develop the FMS algorithm, which iteratively computes subspaces (L_k)_{k∈N} ⊂ G(D, d); a more rigorous treatment of the resulting sequence follows from the proof of Theorem 1 (presented later in §3.1). Assume first that δ = 0. Since dist^p(x_i, L) = dist(x_i, L)^2 / dist(x_i, L)^{2−p}, instead of minimizing (1), we may try to minimize at iteration k + 1 the function

F(L; L_k) = Σ_{i=1}^N dist(x_i, L)^2 / dist(x_i, L_k)^{2−p}. (3)

The minimizer of (3) is easily obtained by weighted PCA, and thus the whole procedure can be viewed as IRLS (iteratively re-weighted least squares). However, since the weight 1/dist(x_i, L_k)^{2−p} may be undefined, we assume that δ > 0 and modify the weight to be 1/max(dist(x_i, L_k)^{2−p}, pδ) (an explanation for this regularized term follows from (29), which appears later in the proof of Theorem 1). To solve the weighted PCA problem, one first needs to scale the centered data points by the square roots of the weights and then apply PCA (without centering) to compute L_{k+1}. The ability to directly apply PCA, or equivalently SVD, to the scaled data matrix is numerically attractive, and we can apply any of the state-of-the-art suites for it. Our procedure at iteration k is outlined as follows. First, form the weighted data points

y_i = x_i / max(dist(x_i, L_k)^{(2−p)/2}, (pδ)^{1/2}), i = 1, . . . , N.

Then, compute the top d singular vectors of the D × N data matrix Y whose columns are the weighted data points {y_i}_{i=1}^N (for SVD, we found the randomized method of Rokhlin et al. [51] to be sufficiently fast without sacrificing accuracy). The subspace L_{k+1} is then the span of these vectors. This procedure is iterated until L_k sufficiently converges. We formally call this iterative procedure Fast Median Subspace (FMS) and summarize it in Algorithm 1 (for simplicity, Algorithm 1 uses the single symbol ε for the regularization threshold (pδ)^{1/2} used above).
Algorithm 1 Fast Median Subspace (FMS)
1: Input: X = [x_1, . . . , x_N]: D × N centered data matrix, d: desired rank, p: robustness power (0 < p < 2; default: p = 1), n_m: maximum number of iterations, τ, ε: convergence and regularization parameters (default: 10^{−10} for both)
2: Output: L: d-subspace in R^D
3: k ← 1
4: L_0 ≡ L_1 ← PCA d-subspace of X
5: while k < n_m and dist(L_k, L_{k−1}) > τ do
6: for i = 1 : N do
7: y_i ← x_i / max(dist(x_i, L_k)^{(2−p)/2}, ε)
8: end for
9: L_{k+1} ← span of the top d singular vectors of Y = [y_1, . . . , y_N] (computed by RandomizedPCA [51])
10: k ← k + 1
11: end while
12: return L ← L_k
In practice, we have found that the rate of convergence of the FMS algorithm is not affected by its particular use of RandomizedPCA [51]. In other words, if RandomizedPCA is replaced with the exact SVD, the convergence and accuracy are the same, but RandomizedPCA results in a shorter runtime. It also seems advantageous to initialize L_0 with the result of RandomizedPCA (or SVD) on the full data set X. Initialization can also be done randomly, but this is not recommended: for example, in a regime where outliers lie on another, weaker subspace, starting too close to the outlier subspace can result in convergence to a local minimum rather than the global minimum. Empirically, with PCA initialization in these cases, FMS converges to the stronger subspace (i.e., to the global minimum). Finally, we denote the FMS algorithm run with a parameter p by FMS_p for the remainder of the paper.
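A compact NumPy sketch of Algorithm 1 follows; it uses an exact SVD in place of RandomizedPCA and represents subspaces by D × d orthonormal basis matrices. This is our illustrative reading of the iteration (scale each centered point by 1/max(dist^{(2−p)/2}, ε), then recompute the top d directions), not a reference implementation:

```python
import numpy as np

def fms(X, d, p=1.0, n_max=100, tau=1e-10, eps=1e-10):
    """Sketch of FMS for a D x N centered data matrix X.
    Returns a D x d orthonormal basis for the recovered subspace."""
    def top_d(M):
        # span of the top d singular vectors (exact SVD in place of RandomizedPCA)
        return np.linalg.svd(M, full_matrices=False)[0][:, :d]
    def subspace_gap(B1, B2):
        # sqrt of sum of squared principal angles
        s = np.clip(np.linalg.svd(B1.T @ B2, compute_uv=False), -1.0, 1.0)
        return np.sqrt(np.sum(np.arccos(s) ** 2))
    B = top_d(X)                                          # PCA initialization
    for _ in range(n_max):
        dist = np.linalg.norm(X - B @ (B.T @ X), axis=0)  # distances to current L_k
        Y = X / np.maximum(dist ** ((2 - p) / 2), eps)    # scaled data points y_i
        B_new = top_d(Y)
        if subspace_gap(B, B_new) <= tau:
            return B_new
        B = B_new
    return B
```

On data with inliers lying on a line plus spherical outliers, this sketch recovers the line essentially exactly, while the initial PCA step alone is biased by the outliers.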

Complexity
At each iteration, FMS_p creates a scaled data matrix of centered data points, which takes O(DN d) operations, although the scaling can be done in parallel. It then finds the top d singular vectors of the scaled data matrix to update the subspace, which also takes O(DN d) operations. Thus, the total complexity is O(T DN d), where T is the number of iterations. Empirically, we have noticed that T can be treated as a relatively small constant. For example, in the special case of Theorem 5, T ≤ O(1/(pδ min(1, η^{3(p−1)}))) for an η-approximation to the limiting stationary point. This further reduces to T = O(log(1/η)) if the iterates are sufficiently close to the limiting stationary point, by Theorem 6. The storage of the FMS_p algorithm involves the D × N data matrix X and the weighted data matrix Y at each iteration. FMS_p must also store the D × d bases for the subspaces L_k and L_{k+1}. Thus, the storage requirement for FMS_p is 2DN + 2Dd.

Choice of Parameters p, δ, and d
In the later experimental sections (see §4), we compare FMS_p run with p = 1 and p = 0.1. Although there is not always a difference, in some cases one of the two choices of p performs better than the other. There also does not appear to be an advantage to using 1 < p < 2. Currently, the theory seems to support p = 1 for the best rates of recovery (see Theorem 2). Further, the theory also seems to indicate that smaller p leads to less robustness to higher levels of noise (see Theorem 3). Later experiments in §4 indicate that with small numbers of points, a choice of small p < 1 can lead to convergence to a non-optimal point, while p = 1 is still able to converge (see Figures 9 and 10).
We believe that p can be optimized for a specific data set, given some prior knowledge of it. In other words, p can be chosen if the user designates a training data set where the truth is known. Due to the low complexity of the method, it is possible to efficiently run it over an array of values of p. Thus, with the specified training set, cross-validation can be used to select the proper value of p for a given type of data.
For the choice of δ, we have not noticed much difference among different values, although there may be certain cases where the choice of δ requires care. In Theorem 2, we see rates of asymptotic recovery for the FMS_p algorithm under a special model of data. This theory seems to point to taking δ as small as possible when 1 ≤ p < 2, but to taking larger values of δ when 0 < p < 1. Further, some experiments with smaller sample sizes indicate that using a parameter δ that is too small with p < 1 can lead to convergence to a non-optimal point (i.e., a local minimum). Thus, while we advocate choosing δ as small as possible, some care must be taken when p < 1 to ensure that δ is not too small.
Finally, one may ask how to select the subspace dimension d for FMS_p. Picking the correct subspace dimension d is not well studied or justified in the literature. Heuristic strategies, such as the elbow method, can be used to guess the best subspace dimension; such strategies usually require a test over a range of possible values for d. On the other hand, in some domains there is prior knowledge of d. For example, in face recognition data sets, images of a person's face with constant pose under differing illumination conditions approximately lie on a d = 9 dimensional subspace [7]. In practice, we advocate using either the elbow method or domain knowledge to select the best value of d.
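As an illustration of the elbow heuristic mentioned above, one can pick the index with the largest drop between consecutive singular values of the centered data (a crude sketch; the function name and the drop-ratio criterion are our choices, not a method from the cited literature):

```python
import numpy as np

def elbow_dimension(X, d_max=None):
    """Heuristic 'elbow' guess for the subspace dimension of a D x N data
    matrix X: the index with the largest ratio between consecutive singular
    values of the centered data."""
    s = np.linalg.svd(X - X.mean(axis=1, keepdims=True), compute_uv=False)
    if d_max is not None:
        s = s[:d_max + 1]
    ratios = s[:-1] / np.maximum(s[1:], 1e-15)  # drop ratio at each candidate d
    return int(np.argmax(ratios)) + 1
```

This simple criterion works when the inlier spectrum is well separated from the noise floor; in heavily corrupted data, a robust subspace fit over a range of d values is more reliable.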

Theoretical Justification
Since FMS_p is an iterative process, it is important to analyze its convergence. In §3.1, we formulate the main convergence theorem, which establishes convergence of FMS_p to a stationary point (or to a continuum of stationary points). Next, §3.2 assumes the data is sampled from a certain distribution and proves that FMS_p converges to a point near the global minimum with overwhelming probability. In §3.3, it is further shown under this model that the rate of convergence is globally bounded and locally r-linear (again with overwhelming probability). The proofs of all theorems are left to §5.

General Convergence Theorem
We establish convergence to a stationary point of the energy F_{p,δ} over G(D, d): Theorem 1 (Convergence to a Stationary Point). Let (L_k)_{k∈N} be the sequence obtained by applying FMS_p without stopping to the data set X for a fixed 0 < p < 2. Then (L_k)_{k∈N} converges to a stationary point L^* of F_{p,δ} over G(D, d), or the accumulation points of (L_k)_{k∈N} form a continuum of stationary points.
The proof of the theorem appears in §5.1. While there are no assumptions on X, it is important to discuss the implications of this theorem. In §3.1.1 we discuss the possibility that FMS_p converges to a continuum of stationary points. Then, §3.1.2 discusses the possibility of convergence to a saddle point, and §3.1.3 discusses convergence to a local minimum.

Convergence to a Continuum of Stationary Points
Theorem 1 proves convergence of the FMS_p iterates to a stationary point or a continuum of stationary points. Another way to think of this issue is that a continuum of stationary points is also a continuum of fixed points of the FMS_p algorithm. It is desirable to know when the algorithm converges to a single point versus a continuum. While we cannot see how to rule out the continuum case, we also cannot construct an example of a discrete data set with a continuum of stationary points when the rank of the data set is at least the subspace dimension d. When the rank of the data is less than d, all subspaces containing the data set are essentially equivalent with respect to the data. We conjecture that we have actual convergence to a single stationary point of G(D, d) for data sets of full rank.

Can (L k ) k∈N Converge to a Saddle Point?
While we prove convergence to a stationary point of F_{p,δ} over G(D, d), we cannot say what kind of stationary point it is. In theory we cannot rule out a saddle point, but we are unaware of an example of a saddle point that is also a fixed point of FMS_p. The following example describes a saddle point of F_{p,δ} which is not a fixed point of FMS_p with probability 1. For this example, we assume that if the solution to PCA is not unique, then the PCA output is selected uniformly at random from the solution set. Consider the data set in R^3: X = {x_1, x_2} = {(1, 0, 0)^T, (0, 1, 0)^T}.
For the FMS_p energy function F_{p,δ}, the line ℓ_sad = Sp((1, 1, 0)^T) is a saddle point, since it is a minimum along the geodesic from ℓ_max = Sp((0, 0, 1)^T) to ℓ_sad, but a maximum along the geodesic from ℓ_min1 = Sp((1, 0, 0)^T) to ℓ_min2 = Sp((0, 1, 0)^T). However, ℓ_sad is not a fixed point of FMS_p. Suppose that ℓ_sad is selected as a candidate subspace by FMS_p. Then the two data points x_1 and x_2 are equidistant from ℓ_sad and are scaled by the same amount. When PCA is applied to the scaled data to find the new subspace, all lines along the geodesic from ℓ_min1 to ℓ_min2 are solutions, and thus ℓ_sad is selected again with probability 0. After a new line is selected, the points are no longer equidistant, and one of the two points dominates the next round of PCA. This gives convergence to either ℓ_min1 or ℓ_min2.
We also see examples of asymptotic saddle points in the model considered in §3.2. However, it can be shown with overwhelming probability that the finite sample counterparts to these asymptotic saddle points are not fixed points. Thus, in general, we are not concerned with saddle points that are also fixed points, since we have no proof that such a point exists.
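The escape from the saddle line in the two-point example above can be checked numerically with a one-dimensional FMS-style iteration (an illustrative sketch, not Algorithm 1 itself): starting from a direction slightly tilted off the diagonal toward one data point, the reweighted PCA step snaps to the span of the nearer point.

```python
import numpy as np

def fms_line(X, v0, p=1.0, eps=1e-10, iters=50):
    """One-dimensional FMS-style iteration (a sketch): reweight each column of
    the D x N matrix X by 1/max(dist^{(2-p)/2}, eps) and take the top
    principal direction. Returns a unit vector."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(iters):
        dist = np.linalg.norm(X - np.outer(v, v @ X), axis=0)
        Y = X / np.maximum(dist ** ((2 - p) / 2), eps)
        v = np.linalg.svd(Y, full_matrices=False)[0][:, 0]
    return v
```

With X = {e_1, e_2} and an initial direction tilted slightly toward e_2, the point e_2 is the closer of the two, receives the larger weight, and the next PCA step returns its span; the diagonal saddle is never revisited.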

Convergence to Local Minima
Unlike PCA (i.e., when p = 2), there can be many local minima for the energy F_{p,δ} when 0 < p < 2. Such local minima can also be fixed points of the FMS_p algorithm. Hence, we cannot guarantee that FMS_p converges to an optimal stationary point in general, since it could converge to one such local minimum. For example, some local minima are discussed by Lerman and Zhang [36, Example 2] when δ = 0. A modified argument can be used to show that local minima still exist when δ > 0. We observe that local minima generally occur when points concentrate around lower-dimensional subspaces. Another simple example with many local minima is a symmetric case with D = 2 and d = 1. Suppose that points are symmetrically distributed on S^1: for some even N ∈ N, the data set consists of the points x_i = (cos(2πi/N), sin(2πi/N))^T, i = 1, . . . , N. Then the span of each pair of antipodal points defines a local minimum of the FMS_p energy F_{p,δ}. However, we note that in this case all local minima are also global minima.
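The circular example can be verified numerically by evaluating the δ = 0 energy along candidate lines through the origin: a point at angle θ_i on the unit circle lies at distance |sin(θ_i − θ)| from the line at angle θ (a small sketch; the function name is ours):

```python
import numpy as np

def circle_energy(theta, N=8, p=1.0):
    """F_{p,0} for a candidate line at angle theta, with N points equally
    spaced on the unit circle: sum of |sin(theta_i - theta)|^p."""
    angles = 2 * np.pi * np.arange(N) / N
    return np.sum(np.abs(np.sin(angles - theta)) ** p)
```

Each grid angle 2πj/N is a local minimum (perturbing θ in either direction raises the energy), and by symmetry all grid angles attain the same value, so every local minimum here is also global.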

Convergence to the Global Minimum for a Special Model of Data
In this section, we show that under a certain model of probabilistic generation of the data, the FMS_p algorithm nearly recovers an underlying subspace with overwhelming probability (w.o.p.). By w.o.p., we mean that the probability of recovery is bounded below by an expression of the form 1 − C_1 e^{−C_2 N}, where C_1 and C_2 are constants with respect to N (but depend on all other parameters, such as d, D, p, and δ). In §3.2.1 we lay out some necessary concepts for the statement of the theorems, and then in §3.2.2 we state the theorem giving near recovery of the underlying subspace. Next, §3.2.3 extends this to a special case of recovery with multiple subspaces.

Preliminaries
This section establishes global convergence for a special model of data. We use a simple version of the most significant subspace model outlined in [36], and much of our notation and concepts are borrowed from that paper. The general setting considers points distributed on the sphere S^{D−1}. In §3.2.2 we consider the special case of one underlying subspace, rather than the more general setting of K distinct underlying subspaces with one most significant. A general theorem for K > 1 has been hard to prove, although §3.2.3 gives a theorem for approximate recovery in the case K = 2 and d = 1. Nevertheless, we conjecture that a general theorem for K > 1 holds for any subspace dimension d, due to the empirical performance of the algorithm on data sets sampled from these distributions.
Let L_1^* be the most significant subspace within our data set. We construct a mixture measure by combining measures µ_i for i = 0, . . . , K, where µ_0 is the uniform distribution on S^{D−1} and, for i = 1, . . . , K, µ_i is the uniform distribution on S^{D−1} ∩ L_i^*. In the noisy case, we add an additive noise distribution ν_{i,ε} such that supp(µ_i + ν_{i,ε}) ⊆ S^{D−1}. We also require that the pth moment of ν_{i,ε} is smaller than ε^p (where p is the robustness parameter of FMS). Finally, we attach weights α_0 ≥ 0 and α_i > 0, i = 1, . . . , K, to the measures µ_i such that Σ_{i=0}^K α_i = 1 and α_1 > Σ_{i=2}^K α_i. The noisy mixture distribution is given by

µ_ε = Σ_{i=0}^K α_i (µ_i + ν_{i,ε}). (6)

We first consider a noiseless version of µ_ε, and then extend the result to the noisy case. The noiseless measure is written as

µ = Σ_{i=0}^K α_i µ_i. (7)

We assume data sampled independently and identically from µ, so that points sampled from µ_0 and from µ_i for i = 2, . . . , K represent pure outliers, and points sampled from µ_1 represent pure inliers. Everything we prove for the spherical model generalizes to a spherically symmetric model, where outliers are spherically symmetric and symmetrically distributed on K − 1 less significant subspaces, and inliers are symmetrically distributed on the most significant subspace L_1^*. Note that in practice, normalizing the latter distribution to the sphere S^{D−1} yields a mixture measure of the form (7).
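Sampling from the noiseless mixture (7) with K = 1 is straightforward; the sketch below takes L_1^* to be the span of the first d coordinates without loss of generality (the function name and interface are ours), using the fact that a normalized Gaussian vector is uniform on the sphere:

```python
import numpy as np

def sample_mixture(N, D, d, alpha1, seed=0):
    """Draw N points from the noiseless mixture with K = 1: with probability
    alpha1 a point is uniform on the unit sphere of the subspace spanned by
    the first d coordinates (inlier); otherwise it is uniform on S^{D-1}
    (outlier). Returns the D x N data matrix and the inlier labels."""
    rng = np.random.default_rng(seed)
    labels = rng.random(N) < alpha1            # True = inlier
    X = rng.standard_normal((D, N))
    X[d:, labels] = 0.0                        # restrict inliers to L_1^*
    X /= np.linalg.norm(X, axis=0)             # normalize to the sphere
    return X, labels
```

Data generated this way is used to probe the phase transitions in the probability of recovery discussed below.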

Global Convergence Theorem for K = 1
We are now ready to state a global convergence theorem for FMS p .
Theorem 2 (Probabilistic Recovery of the Underlying Subspace). Let $\mathcal{X}$ be sampled independently and identically from the mixture measure $\mu$ given in (7) with $K = 1$. Then, for any $0 < \eta \leq \pi/6$ and $0 < p \leq 1$, the FMS$_p$ algorithm converges to an $\eta$-neighborhood of the underlying subspace $L_1^*$ with overwhelming probability (w.o.p.), with the lower bound on this probability stated in (8). For $1 < p < 2$, the FMS$_p$ algorithm converges to an $\eta$-neighborhood of the underlying subspace $L_1^*$ w.o.p., with the lower bound stated in (9). For comparison, using the same techniques to analyze PCA ($p = 2$), PCA outputs a subspace in an $\eta$-neighborhood of $L_1^*$ w.o.p., with the lower bound stated in (10). Here, the constants $C_2$, $C_2'$, and $C_2''$ appearing in these bounds have no dependence on $N$, $\eta$, $p$, or $\delta$, but may depend on $D$ and $d$.
The proof of this theorem is given in §5.2. This theorem gives a probabilistic near-recovery result for the FMS$_p$ algorithm and PCA. We note that the result given for PCA is comparable to the asymptotic result of Vershynin [58, Proposition 2.1], albeit obtained by a different argument. Our result is more restricted, though, since the result of Vershynin [58, Proposition 2.1] applies for any $\eta > 0$. Also, our estimates for the PCA constants (see below) are not ideal (again see [58]); however, this is not an issue since we are more interested in contrasting the dependence of these probabilities on $\eta$. There is no restriction on $\alpha_0$ and $\alpha_1$ in (7), although the probability of recovery depends on these fractions. Bounds for the constants $C_2$, $C_2'$, and $C_2''$ appear later in (79), (80), and (81) respectively. Worst-case estimates of $C_1$, $C_1'$, and $C_1''$ are given later in Proposition 4, where we examine their dependence on $d$, $D$, $\eta$, $p$, and $\delta$.
This theorem shows the benefit of using FMS$_p$ over PCA. For the following discussion, we follow our default choice of the algorithm and assume that $\delta$ is on the order of machine precision. This means that we only need to consider the first term within the minimum function in (8) (since $\eta$ cannot be lower than machine precision). Examining the dependence of the probability bounds on $\eta$, the exponent in the PCA bound is $O(\eta^2 N)$, the FMS$_1$ exponent is $O(N)$, and for $p < 1$ the FMS$_p$ exponent is $O((p\delta)^{2(1-p)/(2-p)} N)$. Altogether, this means that FMS$_p$ is expected to achieve much more precise recovery at vastly smaller sample sizes. We also advocate choosing $p = 1$ when running the FMS$_p$ algorithm for this reason: when $\delta$ is chosen to be very small in this way, the $O((p\delta)^{2(1-p)/(2-p)} N)$ exponent for $p < 1$ leads to a much worse bound than the $O(N)$ exponent for $p = 1$. Another consequence of this theorem is that we generally advocate larger values of $\delta$ for smaller values of $p$ (although we do not have optimal expressions for this choice of $\delta$). For demonstrations of the phase transitions exhibited by the probability of recovery, see Figures 5, 6, and 7; these emphasize the difference between the bounds on the probabilities of $\eta$-recovery for PCA, FMS$_1$, and FMS$_{0.5}$. The theoretical result of Theorem 2 extends to the noisy mixture measure (6) as well.
Theorem 3. Let $\mathcal{X}$ be sampled independently and identically from the noisy mixture measure $\mu_\epsilon$ in (6) with $K = 1$. Then the FMS$_p$ algorithm converges to an $\eta$-neighborhood of $L_1^*$ w.o.p. as stated in (9). For comparison, using the same techniques to analyze PCA, PCA outputs a subspace in an $\eta$-neighborhood of $L_1^*$ w.o.p. as stated in (10).
The proof of Theorem 3 is given in §5.2.4. Among the choices of $p$, larger values seem to give the most robustness to noise. This theorem indicates that PCA has the best stability to noise, although these estimates are not ideal. This result stands in contrast to the result of [13], which shows higher robustness to noise for a convex relaxation of $F_{1,\delta}$. The lesser robustness to noise of FMS$_p$ may be attributable to non-convexity, but we cannot make any definitive statement to this effect. In the future, we plan to follow Coudron and Lerman [13] and establish stronger robustness to noise of FMS$_p$, at least when $p = 1$. For demonstrations of the phase transitions exhibited by the probability of recovery for this noisy model, see Figures 8, 9, and 10.

Global Convergence Theorem for K = 2 and d = 1
Another important setting of the most significant subspace model where PCA does not recover the underlying subspace is $K > 1$. We have found it hard to prove anything in general for the case $K > 1$, because it is hard to characterize the derivative of $F_{p,\delta}$ in this setting. However, we are able to prove near recovery for the case $K = 2$ and $d = 1$.
The proof of Theorem 4 is given in §5.3. It has proven too hard to derive bounds or closed-form expressions for the constants in the probability bound, and so we do not present them here. A stability result similar to that of Theorem 3 also holds in the setting of Theorem 4; however, we do not display it here. We also note that this theorem only holds for $0 < p \leq 1$. This case is particularly important because the guarantees of other algorithms, such as those of Hardt and Moitra [23] and Zhang and Lerman [66], break down in this setting. Further, although it is very specific, it is an important example for us because FMS$_p$ is still able to recover the correct subspace in the presence of another potential local minimum $L_2^*$. Finally, this is a clear example where PCA cannot recover the most significant subspace asymptotically while FMS$_p$ can.

For the remainder of this section, we define $L^*$ to be a stationary limit point of the FMS$_p$ algorithm. We begin with a probabilistic global rate-of-convergence bound for the FMS$_p$ algorithm under (7) when $K = 1$, or when $K = 2$ and $d = 1$. Under these models, Theorems 2 and 4 show that $L^*$ is near $L_1^*$ (the underlying subspace) w.o.p. The proof of Theorem 5 is given in §5.4.

Theorem 5 (Probabilistic Global Convergence Bound).
Suppose that $\mathcal{X}$ is sampled i.i.d. from the mixture measure $\mu$ in (7) with $K = 1$. Then, for $0 < p \leq 1$, the number of iterations $T$ needed so that $\mathrm{dist}(L_T, L_1^*) < \eta$ is bounded; a different bound holds for $1 < p < 2$. For $\mu$ with $K = 2$ and $d = 1$ and $0 < p \leq 1$, an analogous bound holds for the number of iterations $T$ needed so that $\arcsin(\mathrm{dist}(L_T, L_1^*)) < \eta$.

Beyond this, Theorem 6 yields local $r$-linear convergence w.o.p. for the FMS$_p$ algorithm under (7) when $K = 1$ or when $K = 2$ and $d = 1$. The proof of Theorem 6 is given in §5.5.
Theorem 6 (Local $r$-Linear Convergence). Suppose that $\mathcal{X}$ is sampled i.i.d. from the mixture measure $\mu$ in (7) with $K = 1$, or with $K = 2$ and $d = 1$. Then, w.o.p., there exists an index $\kappa$ such that $(L_k)_{k > \kappa}$ converges $r$-linearly to its limit point $L^*$.
The bound on the rate of this $r$-linear convergence can be seen in the proof of Theorem 6: specifically, see (160) and (164). Theorems 5 and 6 can be combined to give a bound on the overall iteration complexity of FMS$_p$: given a choice of $p \leq 1$, the global bound of Theorem 5 controls the number of iterations needed to reach a neighborhood of the limit point, and once the iterates are sufficiently close to the limit point, the iteration complexity becomes $O(\log(1/\eta))$. Figure 11 verifies that, on a data set sampled from (7) with $K = 1$, the convergence of FMS$_1$ and FMS$_{0.5}$ is $r$-linear.
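The $r$-linear behavior reported in Figure 11 can be checked numerically from the sequence of recovery errors: if $\mathrm{dist}(L_k, L^*) \leq C q^k$ with $q < 1$, the $\log_{10}$-errors lie on (or below) a line of slope $\log_{10} q$. The following helper is our own sketch, not part of the paper:

```python
import numpy as np

def rlinear_rate(errors):
    """Least-squares slope of log10(errors) versus iteration index.

    If the error sequence decays r-linearly, i.e. is bounded by C * q^k for
    some q < 1, the log-errors lie (at most) on a line with slope log10(q),
    so a clearly negative fitted slope is consistent with r-linear
    convergence.  Returns the empirical contraction factor 10^slope."""
    e = np.asarray(errors, dtype=float)
    k = np.arange(len(e))
    slope, _ = np.polyfit(k, np.log10(e), 1)
    return 10.0 ** slope
```

For example, `rlinear_rate([1.0, 0.1, 0.01, 0.001])` returns a contraction factor of 0.1.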

Numerical Experiments
In this section, we illustrate how the FMS$_p$ algorithm performs on various synthetic and real data sets in a MATLAB test environment (except for §4.2.4, which was run in Python). It was most interesting for us to test FMS$_p$ with $p = 1$, in order to compare it with various convex relaxations of its energy in this case; FMS$_1$ denotes the algorithm run with $p = 1$. In certain cases, we noticed no difference in performance when choosing lower values of $p$, and we make clear when this is the case. Where we did note such a difference, we report results with $p = 0.1$ and let FMS$_{0.1}$ denote the case $p = 0.1$. We set the regularization parameter $\delta$ to $10^{-10}$.
The algorithms we compare with are the Tyler M-estimator [65], Median K-flats (MKF) [67], Reaper [34], R1-PCA [17], GMS [66], and Robust Online Mirror Descent (R-MD) [20]. We also tried a few other algorithms, in particular HR-PCA and DHR-PCA [60, 19], LLD [46], and Outlier Pursuit [61], but they were not as competitive; we thus do not report their results. For example, both HR-PCA and DHR-PCA were slower and surprisingly worse than PCA in many of our tests. Even though MKF and R-MD were not competitive in many cases either, it was important for us to compare with online algorithms. The comparison with R1-PCA was also important since it aims to directly minimize (1) with $p = 1$ and $\delta = 0$. In addition, we compared with principal component pursuit (PCP) [10, 39], which aims to solve the robust PCA problem. The code chosen for this comparison was the Accelerated Proximal Gradient with partial SVD [40] obtained from http://perception.csl.illinois.edu/matrix-rank/sample_code.html, although similar results were given by the ALM codes [39]. We emphasize that robust PCA methods are designed for the regime of sparse, element-wise corruptions of the data matrix, rather than the wholly corrupted data points that we consider in this paper.
We have noticed that robust PCA algorithms based on this model exhibit quite poor performance compared to algorithms tailored for RSR when data points are wholly corrupted.
Experiments by [34] demonstrate the advantage of scaling each centered data point by its norm, i.e., of "spherizing" each data point (equivalently, projecting it onto the unit sphere). In cases where we examine the effect of spherizing the data, we denote an algorithm run on a spherized data set with the prefix S. For example, as a baseline in many of the experiments we compare with SPCA, performed by using the RandomizedPCA [51] algorithm to find the top $d$ singular vectors of the spherized data. In the same vein, we sometimes also compare with SFMS$_p$, which is FMS$_p$ run on spherized data. Spherizing seems to reduce noise and produce a better subspace in some cases: we display results from SFMS$_p$ and SPCA when this is the case (and omit them when there is no difference).
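Spherizing is a short preprocessing step: center the data and divide each point by its norm. The sketch below uses the coordinate-wise mean as the default center (our choice for illustration; the face experiments in §4.2.3 instead center by the geometric median):

```python
import numpy as np

def spherize(X, center=None):
    """Center the data and scale every point to the unit sphere.

    Rows of X are data points.  `center` defaults to the coordinate-wise
    mean; passing a robust center such as the geometric median is a more
    outlier-resistant choice."""
    if center is None:
        center = X.mean(axis=0)
    Y = X - center
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # leave points that coincide with the center
    return Y / norms
```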
The Tyler M-estimator is used with regularization parameter $10^{-10}$. Median K-Flats passes over the data many times to find a single subspace of dimension $d$, with a step size of 0.01 and a maximum of 10000 iterations. The Reaper algorithm is run with regularization parameter $\delta = 10^{-20}$. R1-PCA uses stopping parameter $10^{-5}$ and is capped at 1000 iterations. R-MD passes over the data 10 times and uses step size $1/\sqrt{k}$ at iteration $k$. The PCP parameter was set as $\lambda = 1/\sqrt{\max(D, N)}$.

Synthetic Experiments
A series of synthetic tests is run to determine how FMS$_p$ compares to other state-of-the-art algorithms. In all of these examples the data is sampled according to variants of the needle-haystack model of [34]. More precisely, inliers are sampled from a Gaussian distribution within a random linear $d$-subspace of $\mathbb{R}^D$, and outliers are sampled from a Gaussian distribution within the ambient space. Noise is also added to all points. In all of these experiments, the fraction of outliers is restricted by Hardt & Moitra's upper bound for RSR.

The first experiment demonstrates the effect of the percentage of outliers on the total runtime and error for subspace recovery. Let $\Sigma_{\mathrm{in}}$ denote the orthogonal projector onto the randomly selected subspace, and let $\Sigma_{\mathrm{out}}$ denote the identity transformation on $\mathbb{R}^D$. In this experiment, inliers are drawn from the distribution $N(0, \Sigma_{\mathrm{in}}/d)$ and outliers from the distribution $N(0, \Sigma_{\mathrm{out}}/D)$. Scaling by $1/d$ and $1/D$ respectively ensures that both samples have comparable magnitudes. Error is measured by calculating the distance between the recovered and ground-truth subspaces. The ambient dimension is fixed at $D = 100$, the subspace dimension is $d = 5$, and the total number of points is fixed at $N = 200$. Every data point is also perturbed by added noise drawn from $N(0, 10^{-6} \Sigma_{\mathrm{out}})$. Figure 1 displays results for recovery error and total runtime versus the percentage of outliers in a data set. At each percentage value, the experiment is repeated on 20 randomly generated data sets and the results are averaged. We note that the runtime of R-MD is too large to be displayed on the graph. It is interesting to note in these figures that GMS fails for lower percentages of outliers; Zhang and Lerman [66] acknowledge that, to be safe, GMS needs at least $1.5(D - d)$ outliers to ensure recovery.
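The synthetic model and error metric used throughout these experiments can be sketched as follows. The function names are our own, and the subspace recovery error is measured here by the spectral norm of the difference of orthogonal projectors, one common choice of distance between subspaces:

```python
import numpy as np

def needle_haystack(N_in, N_out, D, d, noise_std=1e-3, rng=None):
    """Sample the needle-haystack model: inliers N(0, Sigma_in/d) on a random
    d-subspace, outliers N(0, I/D) in the ambient space, plus small noise."""
    rng = np.random.default_rng(rng)
    # random d-dimensional subspace via an orthonormal basis U (D x d)
    U, _ = np.linalg.qr(rng.standard_normal((D, d)))
    inliers = rng.standard_normal((N_in, d)) @ U.T / np.sqrt(d)
    outliers = rng.standard_normal((N_out, D)) / np.sqrt(D)
    X = np.vstack([inliers, outliers])
    X += noise_std * rng.standard_normal(X.shape)
    return X, U

def subspace_error(U, V):
    """Distance between span(U) and span(V) (orthonormal columns), measured
    by the spectral norm of the difference of orthogonal projectors."""
    return np.linalg.norm(U @ U.T - V @ V.T, 2)
```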
The authors advocate either initial dimensionality reduction or the addition of synthetic outliers to increase the chances of finding the correct subspace. In our tests, initial dimensionality reduction was still not competitive, but the addition of synthetic outliers resulted in precise recovery (we do not show this, in order to illustrate the deficiency of GMS at low percentages of outliers).

A second experiment is displayed in Figure 2, where we demonstrate the runtime superiority of FMS$_p$ over other RSR algorithms. In Figure 2a, the runtime is plotted as a function of ambient dimension for the different algorithms, and in Figure 2b the corresponding errors for these runtimes are given. Here we fix the total number of points at $N = 6000$, with 3000 inliers and 3000 outliers. The subspace dimension is fixed at $d = 5$, and the variance model for the sampled points is as before. Again, all points have noise drawn from $N(0, 10^{-6} \Sigma_{\mathrm{out}})$. The ambient dimension is varied from 100 to 2000, and for each method the runtime is cut off at 100 seconds. The plotted runtime is averaged over 20 randomly generated data sets. Robust Online Mirror Descent is not shown in Figure 2 due to the very large runtime required in higher dimensions. For the data sets tested here, FMS$_1$, Reaper, GMS, PCP, and the Tyler M-estimator all precisely found the subspace for each ambient dimension (though we were unable to run PCP in higher ambient dimensions due to its poor computational complexity). Among these, FMS$_1$ has the best runtime at higher ambient dimension due to its lower complexity, while algorithms like Reaper, GMS, PCP, and Tyler do not scale nearly as well. We note that the PCP algorithm is especially slow on such data, requiring a large runtime even for relatively low-dimensional data. Runtimes were calculated on a computer with 8 GB RAM and an Intel Core i5-2450M 2.50 GHz CPU. We remark that not only is FMS$_p$ faster than the Tyler M-estimator, it also does not require initial dimensionality reduction, which the Tyler M-estimator needs when applied to subspace recovery [65].
In Figure 3, we demonstrate the accuracy achieved by different RSR algorithms as a function of elapsed time. The data set has $N = 6000$ points consisting of 3000 inliers and 3000 outliers, with ambient dimension $D = 2000$, subspace dimension $d = 5$, and added noise drawn from $N(0, 10^{-6} \Sigma_{\mathrm{out}})$. Each mark on the graph corresponds to the accuracy achieved by the given algorithm after a certain amount of runtime has passed. Clearly, FMS$_p$ has the fastest convergence among the existing RSR algorithms, and it achieves competitive accuracy in a matter of seconds. PCP is not shown here due to the large amount of time required to complete even one iteration.
A final test on synthetic data displays the accuracy when the scale of the variance differs between the inliers and outliers. For this experiment, inliers are still drawn from the distribution $N(0, \Sigma_{\mathrm{in}}/d)$. The outliers are drawn from $N(0, \lambda \Sigma_{\mathrm{out}}/D)$, where $\lambda$ is a scaling parameter used to change the variance. All points have noise drawn from $N(0, 10^{-6} \Sigma_{\mathrm{out}})$. The plot in Figure 4 displays the resulting error of various algorithms as the scaling parameter is changed. All points are the average error over 20 randomly generated data sets. FMS$_1$ and the Tyler M-estimator both have perfect performance across all scales. FMS$_{0.1}$ is not displayed due to identical performance with FMS$_1$. Again, GMS fails here due to too few outliers: the addition of synthetic outliers leads to better results in this figure (although it is still not as competitive as the Tyler M-estimator and FMS$_p$).
The takeaway from these tests should be that the FMS p algorithm offers state-of-the-art accuracy for the synthetic data model while having a complexity that leads to better runtimes in high dimensions.
To conclude this section, we display some plots verifying that the convergence properties of FMS$_p$ align with the theory in §3. We begin by displaying the phase transition of probabilistic recovery exhibited by FMS$_p$ and PCA under the models (7) and (6) with $K = 1$. These figures validate Theorems 2 and 3, which state that both FMS$_p$ and PCA asymptotically recover the underlying subspace, but the rate of FMS$_p$ is much better than that of PCA (i.e., FMS$_p$ requires smaller sample sizes for accurate recovery). We set $\alpha_0 = \alpha_1 = 1/2$, and sample sizes were varied. For each sample size, 100 data sets were generated, and the recovery error was calculated as the distance between the recovered subspace and the underlying subspace $L_1^*$. In the plots, the value at each $\log_{10}(\eta)$ and sample size $N$ is the percentage of trials in which the recovery error was less than or equal to $\eta$.
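The empirical recovery probabilities plotted in these phase-transition figures are simple trial averages; a sketch of the bookkeeping (names and interface are ours):

```python
import numpy as np

def recovery_grid(errors_by_N, etas):
    """Empirical recovery probabilities: entry (i, j) is the fraction of the
    trials at sample size N_i whose recovery error was <= etas[j].

    `errors_by_N` maps each sample size N to an array of per-trial recovery
    errors (distances between recovered and underlying subspaces)."""
    Ns = sorted(errors_by_N)
    grid = np.array([[(np.asarray(errors_by_N[N]) <= eta).mean()
                      for eta in etas] for N in Ns])
    return Ns, grid
```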
The noiseless case is displayed in Figures 5, 6, and 7. Within these figures, FMS p shows a clear advantage over PCA for the asymptotic rate of recovery for the underlying subspace. Even for very small numbers of points (N ≈ 40), FMS p for p = 1 or p = 0.5 can approximate the underlying subspace to a precision of 10 −7 . On the other hand, PCA can only approximate the subspace to a precision of 10 −1.5 for sample sizes as large as N = 50000. We do note that FMS 0.5 does seem to have some trouble around a sample size of N = 48, which indicates convergence to a non-optimal solution. This fits with earlier theory that indicates this possibility (see Theorem 2 and discussion). Although this may be mitigated with a larger choice of δ, some precision may be lost with larger values of δ.
The noisy case is displayed in Figures 8, 9, and 10. Within these figures, we notice that the rate of recovery for PCA does not change from the noiseless case. Between the two FMS$_p$ plots, $p = 0.5$ seems to have issues with small numbers of points. The issues occur around $N = 48$, which is where we saw slight issues in the noiseless case as well. The convergence of FMS$_p$ to a non-optimal solution for $p < 1$ fits with Theorem 2 and Theorem 3. Again, this may be alleviated by choosing larger values of $\delta$, but the solutions may not be as precise. When comparing $p = 1$ versus $p = 0.5$ for larger $N$, it appears that the rate of asymptotic recovery may be better for smaller $p$ (see $N \approx 204$ in Figures 9 and 10).

Real Data Experiments
One of the real strengths of PCA in reducing dimensionality comes from its denoising effect. Projection to a subspace by PCA has long been a popular preprocessing step for classification and clustering (see e.g. [30,25]) due to this denoising effect. In some cases, RSR and robust PCA algorithms seem to demonstrate higher resistance to noise in data than PCA. The first two experiments displayed in §4.2.1 and §4.2.2 show the viability of FMS p for denoising. We finish in §4.2.3 with a stylized experiment on real data with explicit outliers to demonstrate the accuracy of FMS p , and then §4.2.4 demonstrates the ability of FMS p to scale to truly massive data.

Eigenspectra Calculation from Astrophysics
The first experiment demonstrating the usefulness of the FMS$_p$ algorithm on real data comes from astronomy. The goal of this experiment is to robustly locate eigenspectra in a large set of galaxy spectrum data. The eigenspectra found can be used in the classification of galaxies within the complete data set, since each galaxy spectrum can be decomposed by projection onto the span of the eigenspectra. Budavári et al. [9] provide criteria for determining what makes resulting eigenspectra good. The key attribute of good eigenspectra is that they should not be noisy themselves: noisy eigenspectra would in turn introduce noise into the decomposition of individual galaxy spectra, which leads to inaccurate classification using the reduced spectra. In this experiment, we judge the RSR algorithms on how noisy the eigenspectra they find are.
A data set is taken from the Sixth Data Release of the Sloan Digital Sky Survey [1]. A total of 83686 spectra were taken from its databases using code from [18]. Spectral reduction was performed to account for resampling, rest-frame shifting, and PCA gap correction [62]. The resulting data consisted of 83686 data points in dimension 3841. To use RSR on this data set, we follow the example of previous work using RSR to find eigenspectra [9]. The data is first centered by subtracting the mean spectrum from all values. FMS$_{0.1}$, FMS$_1$, RandomizedPCA [51], and the Tyler M-estimator are then applied to the data to find its top eigenspectra. Additionally, we spherize the data and run PCA and FMS$_1$ (SPCA and SFMS$_1$) to see whether it changes the resulting eigenspectra (SFMS$_{0.1}$ is not shown due to similarity with the results of SFMS$_1$). Other methods are not shown because they either do not do better than standard PCA, or because they take too long to be feasibly run due to the large size of the data set. Figure 12 shows the results from running FMS$_{0.1}$, FMS$_1$, SFMS$_1$, the Tyler M-estimator, PCA, and SPCA on the data. As we can see, parts of the eigenspectra from standard PCA are quite noisy, especially the third, fourth, and fifth eigenspectra. The Tyler M-estimator, although it converges in 3 iterations, makes no improvement on the eigenspectra found by standard PCA. The robustness of the FMS$_{0.1}$ and FMS$_1$ algorithms allows them to find eigenspectra that are not noisy without sacrificing too much speed. We note here that SPCA also shows qualitatively good results, which are comparable to those of FMS$_p$. However, FMS$_p$, SFMS$_p$, and SPCA all have qualitatively different-looking results, which suggests that more comprehensive testing should go into determining which method produces the best eigenspectra.

Clustering Data
FMS p can also be used as a preprocessing step for clustering. In the previous section we examined projection to the robust subspace as a denoising technique, and in this section we demonstrate the gains denoising gives when preprocessing a data set by PCA or FMS p for k-means. Assume that we are given a data set and desire to partition it into k clusters. If one decides to use PCA or FMS p to reduce the dimensionality of the data set, some thought must be given to what dimension of subspace to project to. The literature suggests that there is no good rule of thumb for choosing the subspace dimension d without a clear model for generating the data (see e.g. [42]). In the following experiments, we show results over a range of possible values for d.
The data is taken from the "Daily Sports and Activities" data set available at https://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities [2], and the "Human Activity Recognition Using Smartphones" data set at https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones [3]. The "Daily Sports and Activities" data set consists of sensor data taken over a 5 second period while the subject performs a certain action. Together, there are 19 different actions, and we would like to cluster the points according to action. In total, there are 9120 data points in dimension 5625. We compare three techniques for classifying the activities: k-means, PCA projection + k-means, and FMS projection + k-means. By FMS here we mean FMS$_1$, since observed results were similar for FMS$_1$ and FMS$_{0.1}$. For the projection methods, we find a low-dimensional subspace and project the data onto that subspace before running k-means. For k-means, we use the built-in MATLAB method with default parameters, which initializes using $k$ points of the data set. Clustering accuracy is measured by the number of correct pairwise relations (true positive and true negative) between points over the total number of pairwise relations. This accuracy measure is also known as the Rand index [43, Chapter 16]. Results are averaged over 20 runs. We display the resulting experiment in Figure 13, where the clustering accuracy and approximate 95% confidence intervals (dotted lines) are given for the three methods. For this experiment, FMS is the clear choice of denoising technique for this data. Our second clustering data set is the training set from the "Human Activity Recognition Using Smartphones" data set, which consists of 7352 points in dimension 561. Each point consists of sensor outputs taken in a 2.56 second window from the accelerometer and gyroscope of a Samsung Galaxy S II.
There are six different activities performed by each subject, and we would like to classify the data by activity. Results of the test on this data set are displayed in Figure 14. For each of these data sets, it was not feasible to run other RSR or robust PCA algorithms: due to the large dimension and number of points, runtimes would be very large, and we ran into memory issues trying to run them in MATLAB on a personal machine. In the future, it would be ideal to run more in-depth experiments to determine how other algorithms perform at such dimensionality-reduction tasks. However, we believe that these experiments demonstrate a selling point for FMS$_p$: the data could be processed in MATLAB on a personal machine in a matter of minutes, and we are unaware of any other robust method able to do this. It is worth noting, though, that PCA can be run in a matter of seconds and may be more efficient in some cases. While the first case shows better performance of FMS$_p$, the second case shows an example where FMS$_p$ and PCA both improve results to the same degree.
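The pairwise clustering accuracy used above, i.e., the Rand index, can be computed directly from the true and predicted labelings; this helper is our own sketch (in the experiments, MATLAB's built-in k-means would supply the predicted labels):

```python
import numpy as np

def rand_index(labels_true, labels_pred):
    """Fraction of correct pairwise relations (same-cluster / different-
    cluster agreements) over all unordered pairs, i.e. the Rand index."""
    a = np.asarray(labels_true)
    b = np.asarray(labels_pred)
    same_true = a[:, None] == a[None, :]
    same_pred = b[:, None] == b[None, :]
    agree = same_true == same_pred
    iu = np.triu_indices(len(a), k=1)    # count each unordered pair once
    return agree[iu].mean()
```

Note that two labelings that agree up to a renaming of the clusters receive a Rand index of 1.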

Stylized Application: Faces in a Crowd
The next experiment we run on real data is a stylized example from image processing. The experiment shown here is the 'Faces in a Crowd' experiment outlined by [34]. This experiment is motivated by the fact that images of an individual's face with fixed pose under varying lighting conditions should fall on a subspace of dimension at most 9 [7]. We draw a data set of 64 cropped face images from the Extended Yale Face Database [33]. 32 of these face images are sampled to be the inliers of the data set, and 400 outlier images are selected from the "BACKGROUND Google" folder of the Caltech 101 database.
On this data, spherized algorithms tend to do better than those run on the non-spherized data. Thus, we only report results of algorithms which apply such initial spherizing (after centering by the geometric median) and denote them with the additional prefix "S-". We remark that the Tyler M-estimator implicitly spherizes the data. We fit a 9-dimensional subspace to the data set using SPCA, SFMS$_1$, SFMS$_{0.1}$, the Tyler M-estimator, and S-Reaper. Pictures are downsampled to 20 × 20 and 30 × 30 in our two tests to show the performance of the algorithms on images of different dimensions. Figure 15a demonstrates the accuracy of the found subspaces for pictures of dimension 20 × 20. For each subspace model, we project 4 images onto the subspace: one face from the inliers, one outlier point, and two out-of-sample faces. A better subspace should not distort the original image of the faces, and it is evident that the robust algorithms S-Reaper, Tyler, and both versions of FMS$_p$ appear to work well on this data. However, the first test image appears to be better for FMS$_{0.1}$. Another comparison of the performance of these algorithms is given in Figure 15b. This graph displays the ordered distances of the 32 out-of-sample faces to the robust subspaces against their ordered distances to the PCA subspace; the $i$th point for each algorithm corresponds to the $i$th closest distance to the robust subspace against the $i$th closest distance to the PCA subspace. In general, the closer the faces are to the robust subspace the better. The algorithms all appear to offer robust approximations of the underlying subspace, but SFMS$_{0.1}$ seems to have a slight edge in the lower region. Figures 15c and 15d demonstrate the same experiment, but on faces of dimension 30 × 30. First we note that the S-Reaper algorithm cannot locate the robust subspace in this higher dimension. SPCA and the Tyler M-estimator also struggle in this regime. However, SFMS$_1$ and SFMS$_{0.1}$ do quite well at finding the face subspace. In fact, looking at Figure 15d, SFMS$_1$ and SFMS$_{0.1}$ outperform the other algorithms by a significant margin.

Figure 15 caption: Fig. 15a and Fig. 15b correspond to the experiment run on pictures downsampled to 20 × 20, and Fig. 15c and Fig. 15d correspond to the experiment run on 30 × 30 pictures. On the top, we show projections of the pictures onto the subspaces found by each method. On the bottom, we show the ordered distances to the RSR subspace against the distances to the PCA subspace. Lower distances to the robust subspace signify a greater degree of accuracy in locating the 9-dimensional subspace in the set with outliers.

FMS p Scales to Massive Data
To demonstrate the ability of FMS$_p$ to scale to truly large data sets, we follow the example of [24] and run the FMS$_p$ algorithm on a portion of the Star Wars Episode IV movie. Using $p = 1$, FMS$_p$ is run on 30 minutes of the movie to find a 20-dimensional robust subspace. Each point in this data set is a 720 × 304 RGB frame, which results in a data matrix of size 54000 × 656640. Altogether, it took approximately 130 GB to store this matrix in single precision in memory. The experiment ran on two 1 TB nodes, each with 32 Intel Sandy Bridge processors. FMS$_p$ was implemented in Python with numpy and the randomized TruncatedSVD in sklearn. With this set-up, the run took a total of 33 hours to complete. While there is no good choice of error metric to evaluate the found subspace here, we note that the average peak signal-to-noise ratio for FMS$_p$ was slightly better than that of plain PCA (20.23 vs. 20.13). However, we emphasize that the point of this experiment is to demonstrate that it is possible to run on data sets of this size: to our knowledge, no other truly accurate RSR algorithm is able to do this.

Proof of Theorems
The following sections prove the theorems presented in this paper.

Proof of Theorem 1
The proof of Theorem 1 is given in the following sections. First, in §5.1.1, we prove monotonicity and consequently convergence of (F p,δ (L k ; X )) k∈N . Next, in §5.1.2, we prove that the iterates L k converge to a fixed point. Finally, in §5.1.3, we prove that such a fixed point is necessarily a stationary point.

Monotonicity and Convergence of $(F_{p,\delta}(L_k; \mathcal{X}))_{k \in \mathbb{N}}$

We begin with a proposition demonstrating monotonicity and, consequently, weak convergence of the FMS$_p$ algorithm.

Proposition 1. For a fixed data set $\mathcal{X}$, let $(L_k)_{k \in \mathbb{N}}$ be the sequence obtained by applying FMS$_p$ without stopping, and let $F_{p,\delta}$ be the function expressed in (1). Then $(F_{p,\delta}(L_k; \mathcal{X}))_{k \in \mathbb{N}}$ is non-increasing and converges in $\mathbb{R}$.
Proof (of Proposition 1). For this analysis, it is useful to define a majorizing function $H_{p,\delta}$ for our cost function $F_{p,\delta}$. This function is said to majorize $F_{p,\delta}$ since it satisfies the two properties (18) and (19). The relation in (19) can be shown simply by evaluating $H_{p,\delta}(L_0, L_0; \mathcal{X})$ to find that $H_{p,\delta}(L_0, L_0; \mathcal{X}) = F_{p,\delta}(L_0; \mathcal{X})$. For the relation in (18), we examine $H_{p,\delta}$ and $F_{p,\delta}$ term by term, letting $H_{p,\delta}(L, L_0; x_i)$ and $F_{p,\delta}(L; x_i)$ denote the terms corresponding to the point $x_i$. We will show that
\[
H_{p,\delta}(L, L_0; x_i) \geq F_{p,\delta}(L; x_i) \quad \forall\, x_i \in \mathcal{X},\ \forall\, L, L_0 \in G(D, d).
\]
We first choose an arbitrary x i ∈ X . With β 0 a fixed constant, it is helpful to define two auxiliary functions f : [0, ∞) → R and h : [0, ∞) → R by With these functions in hand, the proof of H p,δ (L, L 0 ; x i ) ≥ F p,δ (L; x i ) follows by looking at cases. It is helpful to note that f is convex for 0 < p < 2, f (β p 0 ) = 0, and f (β p 0 ) = 0. First, suppose dist 2−p (x i , L 0 ) < pδ. In this case, if dist 2−p (x i , L) < pδ, then H p,δ (L, L 0 ; x i ) = F p,δ (L; x i ). On the other hand, if dist 2−p (x i , L) ≥ pδ, consider the function f given in (22). Taking β 0 = (pδ) 1/(2−p) and β = dist(x i , L), proving that H p,δ (L, L 0 ; x i ) ≥ F p,δ (L; x i ) is then equivalent to showing that f (β p ) ≥ 0, which follows from the convexity of f and the fact that f (β p 0 ) = 0 and f (β p 0 ) = 0. Next, suppose that dist 2−p (x i , L 0 ) ≥ pδ. Again, let β = dist(x i , L) and now let β 0 = dist(x i , L 0 ). We must further break the problem down into two more subcases. H p,δ (L, L 0 ; x i ) ≥ F p,δ (L; x i ) becomes equivalent to showing that the following two inequalities hold: If β 2−p ≥ pδ, then we consider the case in (24). In this case, using (22), we again have that f (z) ≥ 0 for all z ∈ [0, ∞), and thus f (β p ) ≥ 0 and H p,δ (L, L 0 ; x i ) ≥ F p,δ (L; x i ) follows. If β 2−p < pδ, we must consider the case in (24). Recall the definition of h given in (23). The function h(z) is a parabola with vertex at z = 0 that opens down since β 0 ≥ pδ. Thus the minimum of h(z) in the interval {z : z 2−p < pδ, z ≥ 0} is at z * = (pδ) 1/(2−p) . It suffices to show that h(z * ) ≥ 0. That is, We notice that the inequality (26) is equivalent to f ((pδ) p/(2−p) ) ≥ 0, which follows from f (z) ≥ 0 for all z ∈ [0, ∞). From the previous analysis, we are able to conclude that for any x i ∈ X From this we finally obtain the relation in (18) as follows: Now consider H p,δ (L, L k ; X ) as a function of L. 
We will show that the minimization of H p,δ (L, L k ; X ) over all L ∈ G(D, d) is simply a least squares minimization that can be solved by PCA on the data set X scaled by max(dist(x i , L k ) (2−p)/2 , √ pδ) −1 . Suppose that we want to use PCA to calculate a subspace L k+1 from this scaled data set. By the definition of PCA, L k+1 is given by (29). Thus, this definition of H p,δ allows us to write the iterates L k of the FMS p algorithm as in (30). The proof of the proposition is completed by noting that (30) and (18) imply that the sequence (F p,δ (L k ; X )) k∈N is non-increasing and bounded below by 0, and so it must converge to a limit in R.
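The scaled-PCA update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation; the function name and the default δ are our own choices:

```python
import numpy as np

def fms_step(X, L, p=1.0, delta=1e-10):
    """One FMS_p update: PCA on the data scaled by
    max(dist(x_i, L)^((2-p)/2), sqrt(p*delta))^(-1).

    X : (N, D) data matrix, rows are points.
    L : (D, d) orthonormal basis of the current subspace estimate.
    Returns a (D, d) orthonormal basis of the next estimate.
    """
    d = L.shape[1]
    residual = X - (X @ L) @ L.T             # component of each point orthogonal to L
    dist = np.linalg.norm(residual, axis=1)  # dist(x_i, L)
    scale = 1.0 / np.maximum(dist ** ((2 - p) / 2), np.sqrt(p * delta))
    # PCA of the rescaled data: the top-d right singular vectors.
    _, _, Vt = np.linalg.svd(X * scale[:, None], full_matrices=False)
    return Vt[:d].T
```

Replacing the full SVD here with a rank-d randomized SVD is what would bring the per-iteration cost down to the O(N Dd) order discussed in the introduction.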

Convergence of (L k ) k∈N to a Fixed Point
The next step is to show convergence of the iterates L k to a fixed point over G(D, d). Before we continue, it is useful to remind ourselves of some general results on algorithms and point-to-set maps. Our discussion closely follows the discussion given by Tropp et al. [56], Wang et al. [59], and Luenberger and Ye [41].
First, given two general spaces U, V, a point-to-set map F is a function F : U → P(V). A point-to-set map F is closed at x if for a sequence {x k } ⊂ U that converges to x, any sequence {y k } ⊂ V such that y k ∈ F(x k ) and y k → y gives y ∈ F(x). For this discussion, it suffices to consider point-to-set maps which take a general space U to P(U). A point x is a fixed point of the map F : U → P(U) if {x} = F(x), and x is a generalized fixed point if x ∈ F(x). Given a point-to-set map F, an associated iterative algorithm F * generates a sequence of points by F * (x k ) = x k+1 ∈ F(x k ). If we have a cost function F : U → [0, ∞), the algorithm is said to be monotonic with respect to F if y ∈ F(x) implies that F (y) ≤ F (x), and strictly monotonic if the equality only holds when y = x. With this in mind, we are able to use two useful theorems on the convergence of monotonic algorithms.
Theorem 7 (Zangwill [63]). Let F : U → P(U) be a point-to-set map with an associated algorithm F * : U → U that is monotonic with respect to F . Suppose also that, given an initial point x 0 , F * generates a sequence {x k } which lies in a compact set. Then the sequence has at least one accumulation point x, and F (x k ) → F (x). Theorem 8 (Meyer [47]). Let F : U → P(U) be a point-to-set map with an associated algorithm F * : U → U that is strictly monotonic with respect to F and generates a sequence {x k } that lies in a compact set. If F is closed at an accumulation point x, then x is a fixed point of F. If U is a metric space with metric d(·, ·), then d(x k+1 , x k ) → 0. It follows that {x k } converges to x or that the accumulation points of {x k } form a continuum.
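As a concrete classical instance of a strictly monotonic algorithm in the sense above (not taken from the paper), power iteration on a positive semidefinite matrix monotonically increases the Rayleigh quotient, and the fixed points of its iteration map are eigenvectors; a minimal sketch:

```python
import numpy as np

def power_iteration(A, x0, iters=300):
    """Power iteration viewed as an algorithm that is monotonic with respect
    to the cost F(x) = x^T A x on the unit sphere: for symmetric positive
    semidefinite A the cost is non-decreasing along the iterates, and the
    fixed points of the iteration map are eigenvectors of A."""
    x = x0 / np.linalg.norm(x0)
    costs = [float(x @ A @ x)]
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)   # the iteration map on the unit sphere
        costs.append(float(x @ A @ x))
    return x, costs
```

As Theorems 7 and 8 suggest, the monotone bounded cost sequence converges, and the iterates accumulate at a fixed point (here, the top eigenvector when the leading eigenvalue is simple).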
Using the results of these two theorems, we are able to prove the following proposition, which establishes convergence of FMS p to a fixed point. Proposition 2. The sequence (L k ) k∈N generated by the FMS p algorithm converges to a fixed point or a continuum of fixed points in G(D, d).
Proof (Proof of Proposition 2). For the given data set {x i } N i=1 , let L k ∈ G(D, d) be the sequence of iterates obtained by applying FMS p . We define an equivalence relation on G(D, d) which declares two subspaces equivalent if they yield the same FMS p iteration for this data set. The corresponding quotient space is denoted G, and for each L k we write [L k ] for its equivalence class in G.
If L k+1 ≁ L k , there are three cases to consider. In each case, we demonstrate strict monotonicity of the sequence L k with respect to F p,δ . In the proof of Proposition 1, we showed that the termwise inequality F p,δ (L k+1 ; x i ) ≤ H p,δ (L k+1 , L k ; x i ) holds. Thus, for the strict inequality F p,δ (L k+1 ; X ) < F p,δ (L k ; X ), it suffices to show that the inequality is strict for one term, that is, F p,δ (L k+1 ; x j ) < H p,δ (L k+1 , L k ; x j ) for some index j. First suppose that for some index j, dist 2−p (x j , L k ) < pδ and dist 2−p (x j , L k+1 ) ≥ pδ. We recall the function f defined in (22), while this time letting β 0 = dist(x j , L k ) and β = dist(x j , L k+1 ). We see that f (β p ) = 0 only if β p = β p 0 , which cannot occur here since β 2−p 0 < pδ ≤ β 2−p ; thus f (β p ) > 0 and F p,δ (L k+1 ; x j ) < H p,δ (L k+1 , L k ; x j ). Next, suppose that for some index j, dist 2−p (x j , L k ) ≥ pδ and dist(x j , L k+1 ) ≠ dist(x j , L k ). If dist 2−p (x j , L k+1 ) < pδ, then note that the function h defined in (23) is a parabola that opens downwards, with infimum on the interval {z : z 2−p < pδ, z ≥ 0} at z * = (pδ) 1/(2−p) . From the previous proof, h(z) ≥ 0 on this interval, and h(z) can only be zero at the infimum z * , which corresponds to the case that β 2−p = pδ = β 2−p 0 , i.e., β = β 0 . Thus, h(β) > 0 when β ≠ β 0 , which then gives that F p,δ (L k+1 ; x j ) < H p,δ (L k+1 , L k ; x j ) and F p,δ (L k+1 ; X ) < F p,δ (L k ; X ). On the other hand, if dist 2−p (x j , L k+1 ) ≥ pδ, then the function f given by (22) satisfies f (z) = 0 if and only if z = β p 0 . In other words, f (β p ) = 0 only if β = β 0 . Since β ≠ β 0 , we must have that f (β p ) > 0 and thus F p,δ (L k+1 ; X ) < H p,δ (L k+1 , L k ; X ) ≤ F p,δ (L k ; X ). Now suppose that L k+1 ∼ L k . It is apparent that H p,δ (L, L k+1 ; X ) = H p,δ (L, L k ; X ) and (29) imply that L k+1 = argmin L∈G(D,d) H p,δ (L, L k+1 ; X ). This then implies that L k+1 is a fixed point.
In the case that there is more than one solution to the minimization of H p,δ (L, L k+1 ; X ), this motivates an additional stopping condition for FMS p : "Stop if consecutive iterates belong to the same equivalence class." In practice, we find that checking this condition is not needed, and therefore do not include it in Algorithm 1.
We have found that if L k+1 ≁ L k , then F p,δ (L k+1 ; X ) < F p,δ (L k ; X ), and if L k+1 ∼ L k , then L k+1 is a fixed point. That is, either the sequence of iterates generated over G(D, d) is strictly monotonic, or FMS p converges to a fixed point in a finite number of iterations.
Let us finally consider the case that FMS p does not converge to a fixed point in a finite number of iterations. Since H p,δ (L, L 0 ; X ) is continuous as a function of L, the infimal map M : G(D, d) → G(D, d) given by M (L 0 ) = argmin L∈G(D,d) H p,δ (L, L 0 ; X ) is closed [14]. Therefore, by Theorem 7, the sequence generated by FMS p over G(D, d) has an accumulation point L * . Strict monotonicity over G(D, d) implies that this accumulation point is in fact a fixed point by Theorem 8. The Grassmannian G(D, d) is a compact metric space, and so by Theorem 8 we get that dist(L k+1 , L k ) → 0. Consequently, FMS p generates a sequence L k over G(D, d) which converges to a fixed point or a continuum of fixed points.

The Fixed Point L * is a Stationary Point
We finish the proof of Theorem 1 by showing that any FMS p fixed point L * is a stationary point of the cost function F p,δ . Since L * is a fixed point, we know that L * = argmin L∈G(D,d) H p,δ (L, L * ; X ). Let L 1 ∈ B(L * , 1) be arbitrary, and parametrize a geodesic between L * and L 1 by L(t), where L(0) = L * and L(1) = L 1 . The fact that L * minimizes H p,δ (·, L * ; X ) implies that the derivative of H p,δ (L(t), L * ; X ) at t = 0 is zero. Examining F p,δ and H p,δ termwise, it is readily apparent that their directional derivatives at t = 0 coincide. Thus, we conclude that the directional derivative of F p,δ (L(t); X ) at t = 0 is zero. Since the point L 1 was arbitrary, L * must be a stationary point of F p,δ over G(D, d). This concludes the proof of Theorem 1.

Proof of Theorems 2 and 3
The proofs of Theorems 2 and 3 proceed in the following sections. First, we prove some preliminary lemmas in §5.2.1. Next, in §5.2.2 we prove a proposition that gives probabilistic estimates on where the stationary points of F p,δ occur when X is sampled from (7) with K = 1. Then, we finish the proof of Theorem 2 in §5.2.3. Next, §5.2.4 gives the proof of Theorem 3. Finally, §5.2.5 gives bounds on some constants used to prove Theorems 2 and 3.

The Limiting Stationary Behavior of F p,δ
We begin with some notation, and then proceed to prove two lemmas. For a fixed subspace L̄, we can parametrize a geodesic L(t) : [0, 1] → G(D, d) from L̄ to L ∈ B(L̄, 1) as in (37), where {θ j } d j=1 are the principal angles between L̄ and L, {v j } d j=1 is a basis for L̄, and {u j } d j=1 is a complementary orthogonal system for L. For a more detailed discussion on the construction of this geodesic, see [36, §3.2.1]. For all arguments in this paper, we assume that the interaction dimension between L(0) and L(1) is greater than or equal to one, which means that θ 1 > 0. When θ 1 = 0, the geodesic is trivial since L(t) = L(0) = L(1), and consequently the proof becomes trivial.
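The principal angles and the geodesic L(t) can be computed numerically from orthonormal bases via an SVD; the following sketch (our own helper functions, following the standard construction, not code from the paper) may clarify the parametrization in (37):

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between subspaces with orthonormal bases A, B (D x d),
    returned in ascending order, via the SVD of A^T B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def geodesic(A, B, t):
    """Point L(t) on the Grassmannian geodesic with L(0) = span(A) and
    L(1) = span(B), expressed through the principal vectors v_j of span(A)
    and a complementary orthogonal system u_j (assumes all angles < pi/2)."""
    U, s, Vt = np.linalg.svd(A.T @ B)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    VA = A @ U                 # principal vectors v_j in span(A)
    VB = B @ Vt.T              # aligned principal vectors in span(B)
    W = VB - VA * np.cos(theta)
    norms = np.linalg.norm(W, axis=0)
    W = np.where(norms > 1e-12, W / np.maximum(norms, 1e-12), 0.0)  # u_j
    return VA * np.cos(t * theta) + W * np.sin(t * theta)
```

Along this parametrization the principal angles between L(0) and L(t) are exactly t θ j , which is the property the extended-geodesic argument later rescales.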
We also consider an asymptotic limit of the cost F p,δ given in (1) when the data set X is sampled i.i.d. from the mixture measure (7) with K = 1. For a given subspace L, let U L,p,δ = {x ∈ S D−1 : dist 2−p (x, L) < pδ}. We then define the asymptotic cost function F * p,δ for 0 < p < 2.
It is readily apparent that F p,δ (L; X )/N → F * p,δ (L; µ) almost surely. On the other hand, the PCA energy F 2,δ (L; X )/N converges almost surely to its asymptotic cost F * 2,δ (L; µ). Lemma 1 gives formulas for the directional derivatives of F p,δ and F * p,δ along the geodesic L(t) given in (37). Lemma 1. The derivatives of F * p,δ and F p,δ for 0 < p < 2 have the forms (41) and (42). For p = 2, F * 2,δ and F 2,δ have the forms (43) and (44). Proof (Proof of Lemma 1). The proof of this lemma borrows from the derivations done in §3.2.2 of [36]. For a given geodesic L(t), the directional derivative of the distance function has a closed form (provided x ∉ L(0)). In the regularized cost, when dist(x, L(0)) < δ, we instead calculate the derivative as in (46). Thus, we can derive the derivative expressions for the cost functions F p,δ and F * p,δ . Finally, when p = 2, we can directly apply the derivative formula of dist 2 (x, L(t)) with respect to t (seen in (46)) to find that the derivatives of F * 2,δ and F 2,δ have the forms (43) and (44).
Let Z F * p,δ denote the set of stationary points of the energy F * p,δ . These are precisely points on G(D, d) at which all geodesic directional derivatives are zero.
For the noiseless mixture measure (7) with K = 1, this set is given precisely by (49), as proved in Lemma 2 below. We notice that in the set Z F * p,δ , L * 1 is the global minimum, the subspaces L ⊂ L * ⊥ 1 are the global maxima, and any L ∈ Z F * p,δ that contains basis vectors in both L * 1 and L * ⊥ 1 is a saddle point.
Proof (Proof of Lemma 2). We first show that L * 1 and the subspaces L ⊂ L * ⊥ 1 are stationary points. The cost F * p,δ (L; µ 0 ) is constant with respect to L. When 0 < p < 2, the application of this observation in (41) leads to the simplification (50). A similar result holds for p = 2. For L(0) = L * 1 and L(1) ∈ B(L * 1 , 1), we have u j ∈ L * ⊥ 1 for all j. Thus, the expression (50) is 0 by the orthogonality of u j and L * 1 . On the other hand, if L(0) ⊆ L * ⊥ 1 , then v j is orthogonal to L * 1 for all j, and the expression (50) is again 0. The same argument can be used to show that d dt F * 2,δ (L(t); µ) t=0 is zero in these cases. The next case to consider is a subspace which has basis vectors in both L * 1 and L * ⊥ 1 , where v 1 , ..., v k ∈ L * 1 and v k+1 , ..., v d ∈ L * ⊥ 1 . Again, we first restrict to the case 0 < p < 2. The derivative formula (50) of F * p,δ can be rewritten as in (51). All the terms in the sum corresponding to v k+1 , ..., v d are zero due to orthogonality with L * 1 . Consider a single integral corresponding to an index 1 ≤ l ≤ k within the sum over j. For the l th term, if u l ∈ L * 1 or u l ∈ L * ⊥ 1 , then the integral is 0 by symmetry or orthogonality, respectively. On the other hand, if u l lies between L * 1 and L * ⊥ 1 , we can write u l = c * w l + c ⊥ w ⊥ l , where w l ∈ L * 1 and w ⊥ l ∈ L * ⊥ 1 . We notice that necessarily w l is orthogonal to v j for 1 ≤ j ≤ k.
Thus, the l th term in the derivative of F * p,δ reduces to the expression in (52). The last integral is zero by symmetry of the integral over L * 1 . Again, the same logic can be applied to the case p = 2.
It remains to show that the right hand side of (49) contains all stationary points of F * p,δ . Consider a subspace L(0) not in the right hand side of (49). The derivative of F * p,δ is negative in the direction of L * 1 and positive in the direction of L(1) ∈ L * ⊥ 1 due to the positive measure on L * 1 . This can be seen from the following logic. First, assume L(1) = L * 1 , and we can assume that 0 < θ j < π/2 for j = 1, . . . , k for some k. Then we can write the integral in (50) as in (53). Notice that in the inner integral, since L * 1 ∩ Sp(v j , u j ) has angle less than π/2 to both v j and u j , the quantities (v j · x) and (u j · x) have the same sign for all x ∈ L * 1 . Thus, the integral inside the sum on the right hand side of (53) is strictly positive and the overall derivative is negative. A similar argument shows that the derivative is positive from L(0) in the direction of L * ⊥ 1 . Again, the same argument can be used for F * 2,δ .

The Non-Stationarity of F p,δ in a Large Region
We will continue the proof of Theorem 2 with a proposition. For a given 0 < η ≤ π/6, we define the set L η as in (54). In other words, L η is the set of all subspaces in G(D, d) which cannot be spanned by vectors in B(L * 1 , η) ∪ B(L * ⊥ 1 , η).

Proof (Proof of Proposition 3).
In the first part of the proof, we will show that the derivative of F p,δ is bounded away from zero on L η w.o.p. For each L̄ ∈ L η , let L̄(t) be the geodesic from L̄ towards L * 1 . In (41) and (43), the derivative along L̄(t) has a dependence on the principal angles θ j between L̄(0) and L̄(1). When these two subspaces are close together, these angles are small, so the directional derivative is made smaller by the fact that the geodesic is shorter. To fix this issue, suppose we want to take the directional derivative between L̄ and L * 1 , which have principal angles θ j . We remind ourselves of the geodesic defined with L̄(0) = L̄ and L̄(1) = L * 1 given in (37). This geodesic can be reparametrized as L̇(t) in (55). The reparametrization (55) now has L̇(0) = L̄ and L̇(2θ 1 /π) = L * 1 . We have in effect lengthened the geodesic so that the maximum principal angle is π/2; now L̇(1) is a subspace that has principal angles θ j π/(2θ 1 ) with L̇(0). We will refer to this as the extended geodesic between L̄ and L * 1 . It is still a geodesic on G(D, d) from L̄ to L * 1 , only now the dependence on θ j is lessened in the derivatives (41), (42), (43), and (44).
From here we fix a point L̄ ∈ L η , and let L̄(t) be the extended geodesic between L̄ and L * 1 . Define a function C η on L η as the magnitude of the directional derivative of F * p,δ at each point L̄ ∈ L η , using the extended geodesic parametrization towards L * 1 . By compactness of L η , C η has a nonzero lower bound, which we use to define C * η in (56): for all subspaces in L η , the magnitude of the derivative of F * p,δ in the direction of L * 1 is greater than C * η (p) (using the extended geodesic parametrization). Using Lemma 1, for 0 < p < 2 we can write the derivative as in (57); when p = 2, the expression (57) becomes (58). For x ∈ S D−1 and 0 < p < 2, let J p (x) be the random variable defined in (59). For p = 2, the random variable J 2 (x) is defined in (60). We plan to use Hoeffding's inequality on the random variable J p (x) to get the overwhelming probability bounds in the theorem. For 0 < p < 2, (59) is absolutely bounded for x ∈ S D−1 , as stated in (61). We must split (61) into the two cases (62) and (63). On the other hand, when p = 2, we have the tighter bound (64) for (60). For 0 < p ≤ 1 and a data set X sampled i.i.d. from (7) with K = 1, we can use (62) and apply Hoeffding's inequality to the random variable J p (x) to find (65). For the case of 1 < p < 2, we use (63) and again apply Hoeffding's inequality to find (66). Finally, when p = 2 we use (64) and apply Hoeffding's inequality to find (67). When 0 < p ≤ 1, from (57) and (65) we conclude the derivative bound (68). For 1 < p < 2, we use (57) and (66) to conclude (69). Finally, for p = 2, (58) and (67) imply (70). In other words, for all L̄ ∈ L η , the derivative of F p,δ is bounded away from zero w.o.p., which concludes the first part of the proof. We next show that for any L̄ ∈ L η , there is a radius R(L̄) such that for all L̃ ∈ B(L̄, R(L̄)), there exists a directional derivative of F p,δ at L̃ that is bounded away from zero. To show this, we must look at the derivative expression given by (42) when L(0) = L̄ for the extended geodesic through L * 1 . This derivative is continuous as a function of L̄, θ j , v j , and u j .
Thus, there is a γ > 0 such that |θ j − θ̃ j | < γ, ‖v j − ṽ j ‖ < γ, ‖u j − ũ j ‖ < γ, and dist(L̄, L̃) < γ imply the bound (71). Now fix γ L̄ = γ with this property and fix a subspace L̃ such that dist(L̄, L̃) < γ L̄ . There is a minimal rotation R which takes vectors in L̄ to L̃. Let L̃ 1 denote the subspace R(L * 1 ). Then, we note that the principal angles between L̃ and L̃ 1 are identical to those between L̄ and L * 1 . Further, we can define an orthonormal basis for L̃ as R(v 1 ), . . . , R(v d ), and a complementary orthogonal system by R(u 1 ), . . . , R(u d ). Then, since dist(L̄, L̃) < γ L̄ , we get the inequalities ∠(v j , R(v j )) < γ L̄ and ∠(u j , R(u j )) < γ L̄ . In turn, this implies that ‖v j − R(v j )‖ < γ L̄ and ‖u j − R(u j )‖ < γ L̄ . Putting this all together, for 0 < p ≤ 1, from (68) and (71) we get the bound (72). Repeating this argument for 1 < p < 2 yields a bound similar to (71) with a new γ L̄ , which we combine with (69) to find (73). Finally, by repeating the continuity argument for p = 2, we again find an expression similar to (71) with a new γ L̄ , which we combine with (70) to find (74). Finally, if we let R p (L̄) = γ L̄ , we can find this relation for all L̄ ∈ L η , and the existence of the function R p is concluded.
The set L η can be covered by the set of open balls {B(L, R p (L)) : L ∈ L η } by Proposition 3. By compactness of the set L η , there is a finite sub-cover {B(L 1 , R(L 1 )), . . . , B(L mp , R(L mp ))}.
Within each of these balls, there are no stationary points w.o.p., depending on the directional derivatives of F * p,δ at L 1 , . . . , L mp towards L * 1 . Using this observation and (72), for 0 < p ≤ 1, we get the desired result by the union bound (76). For 1 < p < 2, from (73) the union bound gives (77), and for p = 2, using (74), it gives (78). The final probability bounds in (8), (9), and (10) follow from (76), (77), and (78) using the bounds derived later for C * η (p) in (102), (103), and (101), which make the probability bounds explicit for 0 < p ≤ 1, 1 < p < 2, and p = 2, respectively. Since FMS p must converge to a stationary point, by Lemma 2 and (76) we conclude that (L k ) k∈N converges to B(Z F * p,δ , η).

Conclusion of Theorem 2
It remains to show that the point recovered by FMS p or PCA lies in B(L * 1 , η). For a subspace L̄, we define the set B(L̄, c) = {L ∈ G(D, d) : θ 1 (L, L̄) < c}.
To show that the PCA solution is componentwise separated from L * ⊥ 1 , we use the following lemma. With abuse of notation, we also write B(L, η) = {v ∈ S D−1 : ∠(v, L) < η}; the context will make clear whether B(L, η) is a set of vectors or of subspaces.
Proof. To prove this lemma, we first look at the asymptotic PCA cost F * 2,δ in each of these neighborhoods. We notice that we can separate this cost by measure, and we define the function ϕ * η accordingly. For η < π/6, we have a bound showing that, for small enough η, ϕ * η (·, ·; µ) is bounded away from zero. Define the random variable ϕ(v, u; x) for x ∈ S D−1 . Then ϕ(v, u; x) ∈ [−1, 1] for all x ∈ S D−1 , and we use Hoeffding's inequality to bound the deviation of the empirical average ϕ(v, u; X ) from its mean. By continuity of ϕ(v, u; x), there exists a γ such that ∠(v, v′) < γ and ∠(u, u′) < γ imply that this bound still holds. Thus, by a covering argument similar to the proof of Proposition 3, we get that ϕ(v, u; X ) > 0 for all (v, u) ∈ B(L * ⊥ 1 , η) × B(L * 1 , η) w.o.p.
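Since the proofs repeatedly apply Hoeffding's inequality to integrands bounded in [−1, 1], a quick numerical sanity check of the bound may be helpful. This is a generic illustration with a simple bounded statistic on the sphere, not the paper's specific ϕ:

```python
import numpy as np

def hoeffding_tail(n, t, a=-1.0, b=1.0):
    """Two-sided Hoeffding bound: for i.i.d. variables bounded in [a, b],
    P(|empirical mean - true mean| >= t) <= 2 exp(-2 n t^2 / (b - a)^2)."""
    return 2.0 * np.exp(-2.0 * n * t ** 2 / (b - a) ** 2)

# Empirical check: for x uniform on S^2, the product 2*x_0*x_1 lies in
# [-1, 1] (since |2 x_0 x_1| <= x_0^2 + x_1^2 <= 1) and has mean 0 by symmetry.
rng = np.random.default_rng(3)
n, trials, t = 200, 2000, 0.2
x = rng.standard_normal((trials, n, 3))
x /= np.linalg.norm(x, axis=2, keepdims=True)
phi = 2.0 * x[:, :, 0] * x[:, :, 1]
frac_exceeding = np.mean(np.abs(phi.mean(axis=1)) >= t)
bound = hoeffding_tail(n, t)  # the empirical tail frequency should not exceed this
```

In the proofs, this one-sample-size bound is combined with a union bound over the finite cover, which is where the covering numbers m p enter.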

Conclusion of Theorem 3
In this section, we analyze the stability of the global convergence result when small noise is added to the data set. We now consider the more general mixture measure µ ε given in (6) and are able to get convergence of L k to a point in B(L * 1 , η), provided that the noise is not too large.
Proof (Proof of Theorem 3). We require that the pth moment of ν 1,ε is less than ε p . Let L(t) be an extended geodesic from a point in L η through L * 1 . Then, for 0 < p < 2, the difference between the derivatives of the asymptotic costs associated with µ and µ ε can be bounded in terms of ε, and an analogous bound holds for p = 2. With these bounds, Proposition 3 holds for K = 1 with the noisy mixture measure (6) for small enough ε; for example, choosing ε < 1/2 for p = 2 allows the proof to go through. Combining this fact with the bounds on C * η (p) given later in (101), (102), and (103), we get the upper bounds on the magnitude of ε for η-recovery given in Theorem 3.
In fact, we can modify the proof of Proposition 3 to hold for any π/6 ≥ η > 0.

Bounds on C * η (p) and m p

This section seeks to provide useful bounds on the constants C * η (p) defined in (56) and m p defined in (75). While (76), (77), and (78) give the overwhelming probability of recovery for both FMS p and PCA, we must examine the constants to see what kind of gain FMS p gives over PCA. In the following analysis, we restrict ourselves to the set of subspaces B(L * 1 , π/6). It is readily apparent that B(L * 1 , π/6) ⊇ B(L * 1 , π/6). We restrict to this set to allow favorable estimates on C * η (p), since the derivative d dt F * p,δ (L(t)) t=0 → 0 as L(0) → L * ⊥ 1 . This restriction is reasonable if we take PCA initialization for FMS p : from Theorem 2 with p = 2, taking η = π/6, the PCA solution lies in this set w.o.p. Thus, FMS p with PCA initialization starts in B(L * 1 , π/6) w.o.p., and so we focus on the probabilistic recovery bounds (76) and (77) within this neighborhood of L * 1 . We begin by noticing the identity (93), where x = (x 1 , . . . , x d ) and σ is the uniform distribution on S d−1 ; by symmetry, this integral is equal for all choices of x 1 , . . . , x d . Now consider the derivative of F * 2,δ from a point L(0) ∈ B(L * 1 , π/6) in the direction of L * 1 using the extended geodesic parametrization. Let x j be a unit vector spanning L * 1 ∩ Sp(u j , v j ) such that ∠(x j , u j ) < π/2 and ∠(x j , v j ) < π/2. We can rewrite (94) using (93), obtaining terms of the form (θ j π/(2θ 1 )) cos(θ j ) sin(θ j ).
From this formulation of the derivative, we obtain the inequality (97). In a similar fashion, we restate the derivative of the asymptotic FMS p cost derived in Lemma 1 using the extended geodesic parametrization. Let c(L(0)) = max(max x∈L(0) (dist 2−p (x, L * 1 )), pδ). Then, we have the bound (98) on the derivative of F * p,δ (L(t); µ) over B(L * 1 , π/6). We will now make the bounds (97) and (98) more explicit: (97) becomes (99), and, using the fact that c(L(0)) ≤ max(dist 2−p (L(0), L * 1 ), pδ), (98) becomes (100). From (99) restricted to the set B(L * 1 , π/6) ∩ L η = B(L * 1 , π/6) \ B(L * 1 , η), the bounds on the PCA constant C * η (2) are given in (101). Over the set B(L * 1 , π/6) ∩ L η , we use (100) to derive two bounds for the FMS p constant C * η (p): (102) for 0 < p ≤ 1 and (103) for 1 < p < 2. From this, we see the dependence of C * η (p) on p, d, α 1 , and η. While it is hard to come up with closed form expressions for the constants m p in the probabilities (76), (77), and (78), it is still important to see the dependence on D, d, p, and δ. Proposition 4 gives our bounds for the covering numbers for FMS p . Proposition 4. At worst, the number of covering balls m p for 0 < p ≤ 1 is of the following order.
At worst, the number of covering balls m p for 1 < p < 2 is of the following order.

At worst, the number of covering balls m 2 for p = 2 is of the following order. Proof. We first remind ourselves of some details of the proof of Proposition 3 used in the proof of Theorem 2. The covering argument relies on finding a function R p : L η → (0, ∞) such that, for each L ∈ L η , there is a directional derivative bounded away from zero at all points in B(L, R p (L)) w.o.p. To prove this proposition, we will bound the radius function from below; this bound then gives a bound on the necessary number of covering balls.
Looking at the continuity statement on the derivative of F p,δ , we can rewrite the difference expressed in (71) when θ̃ j = θ j and dist(L̄, L̃) ≤ γ. Thus, if we choose γ 1 small enough, the radius function R p is bounded below by γ 1 . We can cover G(D, d) by balls of radius γ 1 using Remark 8.4 of [55], for a universal constant C 4 . As a consequence, for 0 < p ≤ 1 we can use the inequality (102) to bound the order of the covering number m p for G(D, d).
For 1 < p < 2, we use the inequality (103) to bound the order of the covering number m p for G(D, d) .
For p = 2, a simpler continuity argument shows that an analogous choice of γ 1 yields the desired continuity. We use the inequality (101) to bound the order of the covering number m 2 for G(D, d).

Proof of Theorem 4
The proof of Theorem 4 follows from the propositions in this section. We assume that 0 < p ≤ 1 in the following discussion. It is useful to define a set M η that is used frequently in what follows. For a fixed L 0 , we compare the global minimum of H p,δ (L, L 0 ; X ) to the global minimum of H * p,δ (L, L 0 ; µ), which will characterize the FMS p sequence.
We omit the proof of this proposition since it is essentially the proof of Theorem 2 for p = 2 with a reweighted measure. We continue with a proposition on the derivative of F * p,δ . To simplify things, let m(η, pδ) = max(η, arcsin((pδ) 1/(2−p) )).
Proof. Let L 0 be a point on the geodesic between L * 2 and L * 1 such that dist(L 0 , L * 1 ) < dist(L 0 , L * 2 ). We note that there are two geodesics between L * 2 and L * 1 , one of length at most π/2 and one of length at least π/2; for this proof, L 0 can lie on either of them. Let U L0,p,δ = {x ∈ S D−1 : dist 2−p (x, L 0 ) < pδ}. For two subspaces L 0 and L, we can write the asymptotic majorizing function H * p,δ (corresponding to H p,δ given in (17)) under the mixture measure. Now define the point L 1 = argmin L∈G(D,1) H * p,δ (L, L 0 ; µ). Let x 0 be a basis vector for L 0 , x 1 a basis vector for L * 1 , and x 2 a basis vector for L * 2 such that ∠(x 1 , x 2 ) ≤ π/2, ∠(x 0 , x 1 ) ≤ π/2, and ∠(x 0 , x 2 ) ≤ π/2. Differentiating the function H * p,δ with respect to its first argument along the geodesic L(t), with L(0) = L 0 and L(1) = L * 1 , we obtain (117). Since the derivative (117) is negative, the stationary point along L(t) must be closer to L * 1 than L 0 is. It is apparent that this stationary point is also the global minimum of the function H * p,δ (L, L 0 ; µ). By Proposition 5, the global minimum of H p,δ (L, L 0 ; X ) is arbitrarily close to the global minimum of H * p,δ (L, L 0 ; µ) w.o.p., and therefore Proposition 7 is proved.
Thus, we can put this all together and finish the proof of Theorem 4. Assume that we are given a data set X sampled i.i.d. from the mixture measure (7) with K = 2 and d = 1, and a number η > 0. By Proposition 9, we can cover M m(η,pδ) by {B(L 0 , ζ L0 ) : L 0 ∈ M m(η,pδ) }, where the next FMS p iterate from each point in each ball is closer to L * 1 w.o.p. This cover has a finite sub-cover by compactness of M m(η,pδ) , and thus there are no fixed points in M m(η,pδ) w.o.p. Finally, since the iterates get closer to L * 1 , FMS p must converge to a point in B(L * 1 , max(η, arcsin((pδ) 1/(2−p) ))) w.o.p.

Proof of Theorem 5
Denote the FMS p sequence by (L k ) k∈N , and assume that X is sampled i.i.d. from the mixture measure (7) with K = 1. Let L * k (t) : [0, 1] → G(D, d) denote the extended geodesic from L k in the direction of L * 1 , and let s * k be the length of this geodesic (i.e., s * k = dist(L * k (0), L * k (1))). We begin by recalling the derivative formula for H p,δ along this geodesic. At a point t̃ ∈ (0, 1), we can instead write the derivative of H p,δ (L * k (t), L k ; X ) for a new set of basis vectors v j and u j . Continuity of the derivative of H p,δ with respect to t implies that if dist(L * k (0), L * k (t̃)) < C * η (p)pδ/8, then the derivative at t̃ remains close to the derivative at 0. The first order Taylor expansion of H p,δ (·, L k ; X ) at L k in the direction of L * 1 is given by (131) for some t̃ ∈ (0, t). Define the quantity λ k as in (132); the inequality λ k < 1 follows from a simple estimate for C * η (p). The first order Taylor expansion (131) evaluated at t = λ k gives (133) for some t̃ ∈ (0, λ k ). Using (18), (30), and (133), we conclude (134). We now split into cases by p. For 0 < p ≤ 1, C * η (p) > O(min((π/6) p−1 , η) pδ) by (102), which implies (135). For 1 < p < 2, C * η (p) = O(min(η p−1 , η) pδ) by (103), which implies (136). Thus, by (135), for 0 < p ≤ 1 the global convergence bound follows from the fact that the cost cannot be negative. On the other hand, by (136), for 1 < p < 2 the global convergence bound is again concluded. A similar proof can be done for the case K = 2 and d = 1. The only difference now is that the constant C * η (p) has a new bound: in this case, we bound the magnitude of the derivative of F * p,δ over the set M m(η,pδ) . Let L(t) be the extended geodesic between a point L(0) ∈ M m(η,pδ) and L * 1 . From (115), we get that for 0 < p ≤ 1, C * η (p) > (π/2)(1/2) p/2 (α 1 − α 2 ). This means that the global convergence bound follows by (140) for 0 < p ≤ 1.

Proof of Theorem 6
In order to prove an r-linear rate of convergence for the FMS p iterates, we need strong geodesic convexity in a neighborhood of the limit point L * (or, for global convergence, geodesic convexity). The following theorem shows that under the mixture measure (7) with K = 1, the FMS p cost is strongly geodesically convex around the global minimum w.o.p., under a condition on α 0 and α 1 . Another consequence of this theorem is that H p,δ (L, L * 1 ; X ) is strongly geodesically convex at L * 1 .
Proof (Proof of Proposition 10). A useful fact for the asymptotic FMS p theory is the separability of the cost function with respect to the mixture measure. We note that F * p,δ (L; µ 0 ) is constant with respect to L due to the spherical symmetry of µ 0 , and therefore any geodesic derivative of this term is zero. If we parametrize a geodesic L(t), t ∈ [0, 1], and take the derivative of F * p,δ (L; µ) with respect to t, the derivative separates over the component measures; for each i, the derivative of the corresponding term is an integral against dµ i with denominator max(dist 2−p (x, L(t)), pδ).
When L(0) = L * 1 , the second derivative is strictly positive, because (u j · x) = 0 for all x ∈ L * 1 . In the case of K = 2 and d = 1, we have the analogous expression (149). When L(0) = L * 1 , letting x 1 be a basis vector for L * 1 and x 2 a basis vector for L * 2 such that ∠(x 1 , x 2 ) ≤ π/2, we can bound (149) as in (150).
Finally, for K = 1 in (7) (or K = 2, d = 1), the second derivative of F * p,δ is continuous. To finish the proof of the proposition, we let b α1 be the minimum second derivative of F * p,δ across all directions. By (148) (and (150) with α 1 > α 2 (2 − p)), b α1 > 0. Each directional second derivative of F p,δ along a geodesic L(t) is close to that of F * p,δ w.o.p., up to a constant C 1 . Further, by continuity of the second derivative of F p,δ , there exists a number ξ L(1) such that for another geodesic L′(t) with L′(0) = L * 1 and dist(L(1), L′(1)) < ξ L(1) , the same bound holds. By another covering argument, for a data set sampled i.i.d. from (7) with K = 1 (or K = 2, d = 1), the second derivative of F p,δ is bounded away from zero w.o.p.
By Proposition 10, the second derivative of F p,δ at L * 1 is positive w.o.p. for data sets sampled i.i.d. from (7) with K = 1 (or K = 2, d = 1, α 1 > α 2 (2 − p), and dist(L * 1 , L * 2 ) > 2 arcsin((pδ) 1/(2−p) )). Due to the fact that in the case K = 2 and d = 1 we cannot guarantee an η-approximation to L * 1 for every η (it is capped at arcsin((pδ) 1/(2−p) )), we must be sure that Proposition 10 can be extended to strong geodesic convexity at the limit point of the FMS sequence. This can be guaranteed if we set η = arcsin((pδ) 1/(2−p) ), let L * be the limit point of FMS that is within B(L * 1 , η), and notice that a modified version of (150) is still positive at such an L * w.o.p.
By continuity of the second derivative, F * p,δ is strongly geodesically convex in a neighborhood of L * 1 w.o.p. Further, strong geodesic convexity of F p,δ (L; X ) at L k implies strong geodesic convexity of H p,δ (L, L k ; X ) at L k , since H p,δ majorizes F p,δ . Let L * be the true limit point of the FMS p algorithm. We have strong geodesic convexity at L * w.o.p. by the previous argument, and further there exists κ > 0 such that for k > κ, all geodesics between L * and L k are strongly convex. Let L * k (t) denote the geodesic from L k to L * for t ∈ [0, 1], and let s * k = dist(L k , L * ). For k > κ, by Taylor's Theorem we can write the expansion (153) for some t̃ k ∈ (0, 1), where C(L * k (t)) is a strictly positive function depending on L * k (t). We now follow the proof of Chan and Mulet [11] for r-linear convergence of the generalized Weiszfeld method, with some slight twists. We define a further majorization H k (t) of H p,δ (L * k (t), L k ; X ) for t ∈ [0, 1], where C(L k ) = max t∈[0,1] C(L * k (t)) with C defined in (153). Define λ k as λ k := (H k (1) − F p,δ (L * ; X )) / ( (1/2) (s * k ) 2 C(L k ) ).
Thus, if we can prove that the λ_k are strictly bounded below 1, then (157) gives linear convergence of the cost iterates (F_{p,δ}(L_k; X))_{k∈N}. First, we can write the first-order Taylor expansion of F_{p,δ} at L_k towards L* as F_{p,δ}(L*_k(t); X) = F_{p,δ}(L_k; X) + t s*_k (d/dt) F_{p,δ}(L*_k(t); X)|_{t=0} + ½ (t s*_k)^2 (d²/du²) F_{p,δ}(L*_k(u); X)|_{u=t̃_k}.
Finally, let L_k*(t) denote the geodesic from L* to L_k. The Taylor expansion of F_{p,δ} at L* towards L_k is given by F_{p,δ}(L_k*(t); X) = F_{p,δ}(L*; X) + ½ (t s*_k)^2 (d²/du²) F_{p,δ}(L_k*(u); X)|_{u=t̃_k} for some t̃_k ∈ (0, t). Using (161) evaluated at t = 1, with the corresponding t̃_k ∈ (0, 1), results in the estimate (162). We rewrite (162) and define y_k as in (163)–(164), so that s*_k ≤ y_k. Therefore, y_{k+1} ≤ √Λ y_k, and so the sequence (L_k)_{k∈N} is r-linearly convergent for k sufficiently large w.o.p. Further, the rate of convergence is at most √Λ, given in (160).
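To make the final step explicit: once λ_k ≤ Λ < 1 for all large k, the cost gap contracts q-linearly, and the bound s*_k ≤ y_k with y_{k+1} ≤ √Λ y_k transfers this contraction to the iterates (a sketch of the standard argument, not a verbatim restatement of (160)–(164)):

```latex
F_{p,\delta}(L_{k+1};\mathcal{X}) - F_{p,\delta}(L^{*};\mathcal{X})
  \;\le\; \Lambda\,\bigl(F_{p,\delta}(L_{k};\mathcal{X}) - F_{p,\delta}(L^{*};\mathcal{X})\bigr),
\qquad
\operatorname{dist}(L_k, L^{*}) \;=\; s_k^{*} \;\le\; y_k
  \;\le\; \bigl(\sqrt{\Lambda}\bigr)^{\,k-k_0}\, y_{k_0},
```

so dist(L_k, L*) is dominated by a q-linearly convergent sequence, i.e., (L_k)_{k∈N} converges r-linearly at rate at most √Λ.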

Conclusions
We have proposed the FMS_p algorithm for fast, robust recovery of a low-dimensional subspace in the presence of outliers. The algorithm aims to solve a non-convex minimization problem that has been studied before. Recent successful methods minimize convex relaxations of this problem. The main reason we chose to attack the non-convex problem directly was the possibility of obtaining a truly fast algorithm for RSR. Indeed, the complexity of the FMS_p algorithm is of order O(T N D d), where the number of required iterations T is empirically small. We also prove globally bounded and locally r-linear convergence for a special model of data. A by-product of minimizing the non-convex problem is that its minimizer seems to be more robust to outliers than the minimizers of convex relaxations of the problem. Furthermore, the formulation can even include non-convex energies when p < 1 (on top of the non-convex domain), which may yield faster convergence, although the theoretical results point to problems with p < 1. Empirically, we see faster convergence for p < 1 in Figure 11, similar to the result of Daubechies et al. [15], although it is not obvious from our theory why this is the case.
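For concreteness, a minimal numerical sketch of an FMS_p-style reweighting iteration is given below. This is an assumption-laden illustration, not the paper's exact algorithm: the weight exponent `max(dist, delta)**((p - 2) / 2)` and the full SVD used here stand in for the precise δ-regularized weighting and the randomized O(NDd)-per-iteration PCA step described in the text.

```python
import numpy as np

def fms(X, d, p=1.0, delta=1e-10, T=100, tol=1e-10):
    """Sketch of a Fast Median Subspace (FMS)-style iteration.

    X: (N, D) data matrix; d: target subspace dimension.
    Each iteration re-scales the points by a power of their distance to
    the current subspace and takes the top-d principal subspace of the
    scaled data. (Weighting scheme is an illustrative stand-in for the
    paper's delta-regularized formulation.)
    """
    # Initialize with the PCA subspace (top-d right singular vectors).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:d].T                          # (D, d) orthonormal basis
    for _ in range(T):
        resid = X - (X @ V) @ V.T         # residuals to current subspace
        dist = np.linalg.norm(resid, axis=1)
        # Down-weight far points; delta guards against division by ~0.
        w = np.maximum(dist, delta) ** ((p - 2) / 2)
        _, _, Vt = np.linalg.svd(w[:, None] * X, full_matrices=False)
        V_new = Vt[:d].T
        # Stop when the projector onto the subspace stops moving.
        if np.linalg.norm(V_new @ V_new.T - V @ V.T) < tol:
            return V_new
        V = V_new
    return V
```

With exact inliers on a d-dimensional subspace plus Gaussian outliers, a few iterations of this scheme typically recover the underlying subspace more accurately than plain PCA, since the weights suppress the outliers' influence; the paper's randomized PCA step keeps each iteration at O(NDd) instead of the full SVD used here for simplicity.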
The non-convexity of the minimization makes it hard to theoretically guarantee the success of FMS_p. We were able to establish convergence of the iterates to a stationary point. Further, in special cases where the data is sampled from the mixture measure (7), the FMS_p algorithm converges to the global minimum w.o.p. There are a few interesting directions in which the theory of FMS_p can be extended. First, we plan to extend the robustness-to-noise results of GMS and Reaper [13] to our setting. Also, we empirically find that FMS_p converges to the correct solution in all cases of the most significant subspace model (7) when N is sufficiently large. We hope to extend our theorems to encompass all cases with K > 1 and d > 1. Finally, Figure 11 shows global linear convergence under (7), which should be theoretically justified.
It was interesting to observe that, on both synthetic and real data reflecting our model, we never had problems with global convergence of the iterates L_k. In view of the current theory and our strong experimental experience, the FMS_p algorithm seems very promising due to its potential for robustly reducing dimension in clustering and classification tasks. The denoising effect of dimensionality reduction by FMS_p seems to have the potential to outperform PCA, as demonstrated in Figure 13. While PCA is a standard technique for dimensionality reduction, FMS_p does not add much complexity and thus can easily be tested anywhere PCA is used. We will make our implementation (including the randomized PCA implementation) available.

Acknowledgments
This work was supported by NSF awards DMS-09-56072 and DMS-14-18386 and the Feinberg Foundation Visiting Faculty Program Fellowship of the Weizmann Institute of Science. We thank Joshua Vogelstein for his recommendation of the astronomy data; Ching-Wa Yip for creating the astronomy dataset used in this paper; Tamás Budavári and David Lawlor for their help with processing and interpreting the astronomy data; Teng Zhang for useful comments on earlier versions of this manuscript and helpful discussions; and Nati Srebro for encouraging us to write up and submit our results.