The Statistical Complexity of Early Stopped Mirror Descent

Recently there has been a surge of interest in understanding implicit regularization properties of iterative gradient-based optimization algorithms. In this paper, we study the statistical guarantees on the excess risk achieved by early stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with squared loss for linear models and kernel methods. We identify a link between offset Rademacher complexities and potential-based analysis of mirror descent that allows disentangling statistics from optimization in the analysis of such algorithms. Our main result characterizes the statistical performance of the path traced by the iterates of mirror descent in terms of offset complexities of certain function classes depending only on the choice of the mirror map, initialization point, step-size, and number of iterations. We apply our theory to recover, in a rather clean and elegant manner, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.


Introduction
Mirror descent (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2009) is increasingly becoming the tool of choice in optimization and machine learning, being applied well beyond the traditional setting of online learning. Among the properties that make mirror descent appealing are its ability to exploit non-Euclidean geometries via properly designed mirror maps, the fact that the algorithm admits a general potential-based convergence analysis in terms of Bregman divergences, and the fact that mirror descent monotonically decreases Bregman divergences regardless of the reference point being used. The latter property reveals the adaptive nature of mirror descent and explains its success in online adversarial learning.
Traditionally, in statistical learning theory, optimization and statistical properties have been considered separately. The statistical guarantees of an estimator are usually established by bounding a complexity measure, such as Rademacher complexity, of the class in which the estimator lies. Localized complexity measures, which work for explicitly regularized empirical risk minimizers, have become default tools in statistical learning theory and empirical processes theory (Bartlett et al., 2005). A rich and general theory regarding these complexity measures has been developed and used to provide excess risk bounds in both classification and regression settings, in several cases yielding minimax-optimal results. Recent work has refined the machinery of localized complexities leading, in particular, to the notion of offset Rademacher complexity, which has provided a very clean theory for developing excess risk bounds of constrained empirical risk minimizers and the two-step estimator under quadratic loss (Liang et al., 2015; Mendelson, 2014).
Recent years have also witnessed an increased interest in the study of the statistical learning capabilities of gradient descent methods, particularly in relation to the notions of implicit regularization and early stopping. However, most results in this area either focus on optimization guarantees that do not provide any direct link to statistical guarantees on out-of-sample prediction (Gunasekar et al., 2017, 2018; Azizan and Hassibi, 2019), or establish a connection to statistics via some forms of explicit regularization (Suggala et al., 2018; Ali et al., 2019). The latter work shows connections between the iterates on the entire path and the solutions on the regularization path for a suitable regularized risk minimization problem. Yet other results have applied solvers directly to constrained problems by encoding regularization-promoting structures into the algorithm (Matet et al., 2017). Earlier work has also established guarantees on the generalization properties of early stopped stochastic gradient via algorithmic stability (Hardt et al., 2016; Bousquet and Elisseeff, 2002).
In some settings, primarily involving the ℓ2 geometry, implicit regularization properties of gradient descent methods have been established without invoking any connection to explicit regularization, and have been related to notions of localized complexities. However, these works rely on exploiting a close interplay between statistics and optimization, such as unraveling the iterates of gradient descent and decomposing the excess risk into a bias and variance term (Raskutti et al., 2014), or controlling the deviation between gradient descent applied to the empirical risk and gradient descent applied to the population risk, using concentration and lower-level arguments involving symmetrization and peeling (Wei et al., 2019). This theory does not easily generalize to non-Euclidean settings. A general theory that connects the notion of early stopping in mirror descent with that of localized complexities is still missing. More broadly, a common language for reasoning about statistical properties of trajectories traced by optimization algorithms applied to the unregularized empirical risk is still lacking.

Contributions
We develop a general theory for learning linear models (including kernel machines) with the squared loss function that shows how the optimization trajectory of unconstrained mirror descent applied to minimize the unregularized empirical risk is inherently connected to excess risk guarantees via offset Rademacher complexity. Unlike in most prior work, the notion of statistical complexity appears naturally from intrinsic properties of mirror descent applied to the unregularized empirical risk, without invoking lower-level arguments related to concentration to the fictitious population version of the algorithm. Furthermore, our theory leads to an explicit characterization of stopping times from the point of view of both optimization and statistics, which directly yields excess risk bounds and allows us to re-derive previously established results, and some new results, in a much simpler fashion.
We consider functions parameterized by a vector $\alpha \in \mathbb{R}^m$. Our main theorem establishes the following two facts for mirror descent, which hold with respect to any given reference point $\alpha'$ due to the adaptive nature of the algorithm. First, we show that up to a prescribed stopping time $t^\star$, the iterates of mirror descent $(\alpha_t)_{t \in [0, t^\star]}$ stay within a Bregman divergence "ball" $\mathcal{B}$ centered at $\alpha'$ with radius $D_\psi(\alpha', \alpha_0)$, where $\psi$ is the mirror map, $D_\psi(\cdot, \cdot)$ is the associated Bregman divergence, and $\alpha_0$ is the initialization point. Second, we show that at the stopping time $t^\star$, $f_{\alpha_{t^\star}}$ satisfies a statistical/geometric property that immediately yields excess risk bounds with respect to $f_{\alpha'}$. These bounds are derived using the offset Rademacher complexity of the ball $\mathcal{B}$.
By choosing $\alpha'$ as the best predictor in a given class $\mathcal{F}$, our result allows us to derive new results and recover previously derived ones in a concise and elegant manner.
The key insight behind our main result is the following identity bridging the potential-based analysis of mirror descent to statistical guarantees derived from offset complexities. The identity is most concisely expressed for the continuous-time variant of mirror descent:
\[
-\frac{d}{dt} D_\psi(\alpha', \alpha_t) = \underbrace{\underbrace{L(\alpha_t) - L(\alpha')}_{\text{optimization}} + P_n(f_{\alpha_t} - f_{\alpha'})^2}_{\text{statistics}}.
\]
Here, $L$ represents the empirical square error and the term $P_n(\cdot)^2$ denotes the empirical ℓ2 distance between the functions represented by $\alpha_t$ and $\alpha'$. This quadratic term coincides with the in-sample prediction error when $\alpha'$ is the "true" parameter in a well-specified model.
From an optimization point of view, mirror descent monotonically decreases the Bregman divergence $D_\psi(\alpha', \alpha_t)$ at a rate lower-bounded by $L(\alpha_t) - L(\alpha')$. Thus, optimization slows down only when $L(\alpha_t)$ is sufficiently close to $L(\alpha')$. In a statistical setting, optimizing the empirical loss function $L$ is, however, not the primary aim. The above identity shows that the argument used to establish that $L(\alpha_t) - L(\alpha')$ gets small in fact establishes the slightly stronger claim that the statistics term gets arbitrarily close to 0. The term $P_n(\cdot)^2$, the empirical prediction error, is of no interest in optimization and is typically dropped there because it is always non-negative; in statistics, however, it plays a vital role. In particular, the statistics term being zero is the starting point from which the theory of offset Rademacher complexities develops (Liang et al., 2015). In our work, we show that bounding the statistics term by an arbitrarily small $\varepsilon > 0$ is sufficient to derive all the results from that theory. The above identity and the resulting observations suffice to decouple the analysis of optimization and statistics. The optimization analysis is based on the Bregman divergence potential, and as a side effect provides a set, the "ball" $\mathcal{B}$ defined above, whose offset Rademacher complexity characterizes the statistical properties of the solution at the stopping time. To summarize:
1. Our main result, in a short and transparent way, yields bounds on the excess risk of the iterates of (both continuous-time and discrete-time) mirror descent using offset Rademacher complexities. We require no low-level tools such as symmetrization, peeling, concentration, or contraction.
2. We recover, in a concise and elegant way, recent results connecting optimization and regularization paths (Section 4.1), as well as those deriving early stopping bounds for kernel methods (Section 4.2).
3. We apply our theory to analyze an unconstrained mirror-descent algorithm yielding, up to a log factor, minimax-optimal rates for in-sample linear prediction error over ℓ1 balls (Section 4.3). Crucially, the ℓ1 norm is non-Euclidean, and to the best of our knowledge this result is the first of its kind.

Preliminaries
For the convenience of the reader, a table of notation is provided in Appendix F.

Problem Setup
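Let $D_n = \{(x_i, y_i) \mid i = 1, \dots, n\}$ denote a dataset of size $n$ sampled i.i.d. from some unknown probability distribution $P$ supported on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}$, and let $\ell(\hat{y}, y) = (\hat{y} - y)^2$ denote the quadratic loss function. For a class of functions $\mathcal{F}$ mapping $\mathcal{X}$ to $\mathbb{R}$ and for any $f \in \mathcal{F}$, denote $\ell_f(x, y) = \ell(f(x), y)$. Expectations of arbitrary integrable functions $g$ mapping $\mathcal{X} \times \mathbb{R}$ to $\mathbb{R}$ will be denoted by $Pg$, and the same notation will be adopted for functions $f$ with domain $\mathcal{X}$:
\[
Pg = \mathbf{E}_{(x, y) \sim P}[g(x, y)], \qquad Pf = \mathbf{E}_{(x, y) \sim P}[f(x)].
\]
Conditionally on the observed data $D_n$, $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i, y_i)}$ denotes the empirical counterpart to $P$, so that
\[
P_n g = \frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i), \qquad P_n f = \frac{1}{n}\sum_{i=1}^{n} f(x_i).
\]
Let $\mathcal{G}$ be a class of functions mapping $\mathcal{X}$ to $\mathbb{R}$ and let $\hat{f} = \hat{f}(D_n)$ be an estimator mapping datasets to functions in $\mathcal{G}$. The performance of an estimator $\hat{f}$ with respect to some reference class of functions $\mathcal{F}$ is formalized by its excess risk, defined as follows:
\[
\mathcal{E}(\hat{f}, \mathcal{F}) = P\ell_{\hat{f}} - \inf_{f \in \mathcal{F}} P\ell_f.
\]
With no loss of generality, we will assume that the infimum above is attained by some function $f_{\mathcal{F}} \in \mathcal{F}$. We will be interested in estimators arising from early-stopped unconstrained mirror-descent algorithms.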
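In the rest of the paper, we will restrict ourselves to estimators that, conditionally on the observed dataset $D_n$, can be parameterized by a vector $\alpha \in \mathbb{R}^m$ such that there exists a matrix $Z \in \mathbb{R}^{n \times m}$ with $f_\alpha(x_i) = (Z\alpha)_i$ for all $\alpha$ and $x_i$. For instance, this could arise from a feature map $x \mapsto \varphi(x)$ with $z_i = \varphi(x_i)$ and $f_\alpha(x) = \alpha^\top \varphi(x)$, or by using kernels. We let $L(\cdot)$ denote the empirical loss function with respect to the parameters $\alpha$, so that
\[
L(\alpha) = P_n \ell_{f_\alpha} = \frac{1}{n}\|Z\alpha - y\|_2^2,
\]
where $y = (y_1, \dots, y_n)^\top$. For example, in a usual linear regression setup with $\mathcal{X} = \mathbb{R}^d$, we can take $m = d$ and $f_\alpha(x) = \langle \alpha, x \rangle$, and let $Z$ be an $n \times d$ matrix such that the $i$-th row is given by $x_i$. We show in Section 4.2 how this setup also admits kernel methods.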

Offset Rademacher Complexity
We begin by defining the offset Rademacher complexity of an abstract function class G.
Definition 2.1 (Offset Rademacher Complexity, Liang et al. (2015)). Let $x_1, \dots, x_n$ be fixed and let $\varepsilon_1, \dots, \varepsilon_n$ be independent Rademacher random variables. For any $c \geq 0$, the offset Rademacher complexity of a function class $\mathcal{G}$ is defined as
\[
\mathfrak{R}_n(\mathcal{G}, c) = \mathbf{E}_{\varepsilon}\left[\sup_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^{n}\left(\varepsilon_i g(x_i) - c\, g(x_i)^2\right)\right].
\]
The offset Rademacher complexity is a decreasing function of the parameter $c$. For $c = 0$, we recover the classical Rademacher complexity. On the other hand, for any $c > 0$, the quadratic term in the above definition has a localization effect by compensating for the fluctuations in the term involving Rademacher variables (cf. Liang et al. (2015, Section 5.2)). The following lemma characterizes a sufficient condition on an arbitrary estimator $\hat{f}$, which allows controlling its excess risk via offset Rademacher complexity theory.

Lemma 2.1. Let F and G be classes of functions such that sup
Suppose an estimator $\hat{f}$ always outputs a function in $\mathcal{G}$ and satisfies the following deterministic inequality:
\[
P_n \ell_{\hat{f}} - P_n \ell_{f_{\mathcal{F}}} + c_1 P_n(\hat{f} - f_{\mathcal{F}})^2 \leq \varepsilon. \tag{1}
\]
When $\varepsilon = 0$, the condition given in Equation (1) coincides with Lemma 1 in Liang et al. (2015), which is the starting point of the analysis leading to the theory of offset Rademacher complexity bounds. In particular, Liang et al. (2015) show that such a condition is satisfied by empirical risk minimizers over convex classes (with $c_1 = 1$) and by a more general "two-step" estimator (with $c_1 = 1/18$), selecting a function over a possibly non-convex class. Since the only difference between Lemma 2.1 above and Lemma 1 in Liang et al. (2015) is the $\varepsilon$ term, the proof of Lemma 2.1 is a simple corollary of Liang et al. (2015) and is deferred to Appendix A.
Remark 2.1. Lemma 2.1 also implies high-probability analogs of the provided bound under a slightly refined notion of offset Rademacher complexity (cf. Liang et al. (2015, Theorem 4)). Such high-probability analogs bypass the application of Talagrand's contraction lemma and yield bounds that scale correctly with the noise level in the well-specified case (cf. Mendelson (2014)). Since obtaining high-probability bounds is not central to our contributions, for simplicity of presentation, we will restrict ourselves to bounded problems and bounds in expectation.
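To make Definition 2.1 concrete, the following is a minimal sketch that estimates the offset Rademacher complexity of a finite function class by Monte Carlo, assuming the form $\mathbf{E}_\varepsilon \sup_g \frac{1}{n}\sum_i (\varepsilon_i g(x_i) - c\, g(x_i)^2)$ from Liang et al. (2015). The function name `offset_rademacher`, the matrix `values` of function evaluations, and the number of draws are illustrative assumptions, not part of the paper.

```python
import numpy as np

def offset_rademacher(values, c, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_g (1/n) sum_i [eps_i g(x_i) - c g(x_i)^2].

    values: array of shape (num_functions, n); row j holds g_j(x_1), ..., g_j(x_n).
    This handles a finite class only (an assumption made for illustration).
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    penalty = c * np.mean(values ** 2, axis=1)        # offset term, one entry per function
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    corr = eps @ values.T / n                         # (n_draws, num_functions)
    return float(np.mean(np.max(corr - penalty, axis=1)))

# Hypothetical class: the zero function plus five random functions on n = 20 points.
rng = np.random.default_rng(1)
values = np.vstack([np.zeros(20), rng.normal(size=(5, 20))])
r_plain = offset_rademacher(values, c=0.0)   # classical Rademacher complexity (c = 0)
r_offset = offset_rademacher(values, c=1.0)  # localized version (c > 0)
```

With the same random signs, the estimate can only decrease as `c` grows, which mirrors the monotonicity noted after Definition 2.1.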

Mirror Descent
We briefly describe the unconstrained mirror descent algorithm, both in continuous and discrete time. An interested reader may refer to the book by Bubeck (2015) for a review. We begin by defining the mirror map, a key object characterizing the geometry of the algorithm. In continuous time, the trajectory of the mirror descent algorithm is characterized by the choice of mirror map $\psi$ and initialization point $\alpha_0$, with the dynamics given by
\[
\frac{d}{dt} \nabla\psi(\alpha_t) = -\nabla L(\alpha_t). \tag{2}
\]
In discrete time, the updates become
\[
\nabla\psi(\alpha_{t+1}) = \nabla\psi(\alpha_t) - \eta \nabla L(\alpha_t), \tag{3}
\]
where $\eta > 0$ is the step size. A key notion in the analysis of mirror descent algorithms is the Bregman divergence defined below.

Definition 2.3 (Bregman Divergence). The Bregman divergence associated with a mirror map $\psi$ is defined as
\[
D_\psi(\alpha', \alpha) = \psi(\alpha') - \psi(\alpha) - \langle \nabla\psi(\alpha), \alpha' - \alpha \rangle.
\]
Let $\alpha'$ be any reference point in the domain of $\psi$. The Bregman divergence enters the analysis of mirror descent algorithms through the following equality derived from an elementary calculation:
\[
-\frac{d}{dt} D_\psi(\alpha', \alpha_t) = \langle -\nabla L(\alpha_t), \alpha' - \alpha_t \rangle. \tag{4}
\]
In the next section, we will show how the above equation (and its discrete-time analog given in Lemma C.1 in Appendix C) can be exploited to characterize the statistical properties of the trajectory traced by mirror descent in terms of offset Rademacher complexities of certain classes of functions.
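As an illustration of the discrete-time algorithm, here is a minimal sketch assuming the standard update $\nabla\psi(\alpha_{t+1}) = \nabla\psi(\alpha_t) - \eta\nabla L(\alpha_t)$ and the squared loss $L(\alpha) = \frac{1}{n}\|Z\alpha - y\|_2^2$. The helper names `bregman` and `mirror_descent`, and the convention of passing $\nabla\psi$ and its inverse as callables, are assumptions made for this example.

```python
import numpy as np

def bregman(psi, grad_psi, alpha_ref, alpha):
    """D_psi(alpha_ref, alpha) = psi(alpha_ref) - psi(alpha) - <grad psi(alpha), alpha_ref - alpha>."""
    return psi(alpha_ref) - psi(alpha) - grad_psi(alpha) @ (alpha_ref - alpha)

def mirror_descent(Z, y, grad_psi, grad_psi_inv, alpha0, eta, steps):
    """Unconstrained mirror descent on L(alpha) = ||Z alpha - y||^2 / n.

    grad_psi maps primal to dual variables; grad_psi_inv maps back.
    Returns the whole path of iterates, shape (steps + 1, m).
    """
    n = Z.shape[0]
    alpha = np.array(alpha0, dtype=float)
    path = [alpha.copy()]
    for _ in range(steps):
        grad = 2.0 / n * Z.T @ (Z @ alpha - y)             # gradient of the empirical loss
        alpha = grad_psi_inv(grad_psi(alpha) - eta * grad)  # step in the dual, map back
        path.append(alpha.copy())
    return np.array(path)

# Euclidean mirror map psi = ||.||_2^2 / 2: both maps are the identity (gradient descent).
identity = lambda a: a
psi_l2 = lambda a: 0.5 * a @ a
rng = np.random.default_rng(0)
Z, y = rng.normal(size=(20, 3)), rng.normal(size=20)
path = mirror_descent(Z, y, identity, identity, np.zeros(3), eta=0.1, steps=50)
```

Other geometries are obtained by swapping the pair `grad_psi`, `grad_psi_inv`; the loop itself does not change.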

Main Results
This section presents the main results of our paper. Section 3.1 contains the key lemma that connects the statistical performance of mirror descent iterates to offset Rademacher complexities. In Section 3.2, we prove our main theorem in continuous time, which provides a transparent way to understand our theory; the discrete-time results appear in Section 3.3.

Key Lemma
The following lemma proved in Appendix B provides a direct link between offset Rademacher complexities and key quantities appearing in the analysis of mirror descent algorithms.
Lemma 3.1. For any $\alpha, \alpha' \in \mathbb{R}^m$, the following holds:
\[
\langle -\nabla L(\alpha), \alpha' - \alpha \rangle = L(\alpha) - L(\alpha') + P_n(f_\alpha - f_{\alpha'})^2.
\]
To understand the implications of the above lemma, let $\alpha'$ correspond to the best parameter in some reference set of interest. For continuous-time mirror descent, Equation (4) together with the above lemma gives
\[
L(\alpha_t) - L(\alpha') + P_n(f_{\alpha_t} - f_{\alpha'})^2 = -\frac{d}{dt} D_\psi(\alpha', \alpha_t). \tag{5}
\]
Note that the left-hand side of the above equation is precisely the quantity appearing in the condition of Lemma 2.1, which connects an arbitrary estimator's statistical performance to the offset Rademacher complexity of some class of functions. Equation (5) puts us in a win-win situation. If the left-hand side is large, then we get closer to the parameter of interest $\alpha'$. On the other hand, if the left-hand side is small, then we satisfy the condition of Lemma 2.1, yielding excess risk bounds.
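The identity behind Lemma 3.1 can be checked numerically. The sketch below assumes $L(\alpha) = \frac{1}{n}\|Z\alpha - y\|_2^2$ and the form $\langle -\nabla L(\alpha), \alpha' - \alpha\rangle = L(\alpha) - L(\alpha') + P_n(f_\alpha - f_{\alpha'})^2$; the random data and variable names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 30, 5
Z = rng.normal(size=(n, m))   # f_alpha(x_i) = (Z alpha)_i
y = rng.normal(size=n)

def loss(alpha):
    """Empirical squared loss L(alpha)."""
    return np.mean((Z @ alpha - y) ** 2)

def grad(alpha):
    """Gradient of L at alpha."""
    return 2.0 / n * Z.T @ (Z @ alpha - y)

alpha = rng.normal(size=m)      # current iterate
alpha_ref = rng.normal(size=m)  # arbitrary reference point alpha'

lhs = -grad(alpha) @ (alpha_ref - alpha)
delta = loss(alpha) - loss(alpha_ref)              # optimization term
r = np.mean((Z @ (alpha - alpha_ref)) ** 2)        # empirical l2 distance P_n(f_alpha - f_alpha')^2
rhs = delta + r
```

The two sides agree up to floating-point error for any choice of `alpha` and `alpha_ref`, since the loss is exactly quadratic.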

Continuous Time
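In this section we show how Lemma 3.1 can be used to show that early-stopped continuous-time mirror descent satisfies the conditions of Lemma 2.1, thus relating its statistical complexity to offset Rademacher complexities. We introduce the following definitions in order to simplify the notation:
\[
\delta_t = L(\alpha_t) - L(\alpha'), \qquad r_t = P_n(f_{\alpha_t} - f_{\alpha'})^2.
\]
Theorem 3.1. Consider continuous-time mirror-descent updates as given in Equation (2). Let $\alpha_0$ be the initialization point, $\alpha'$ be some arbitrary reference point, and let
\[
\mathcal{B} = \{ f_\alpha : D_\psi(\alpha', \alpha) \leq D_\psi(\alpha', \alpha_0) \}.
\]
Then, for any $\varepsilon > 0$ there exists a stopping time $t^\star = t^\star(D_n, \alpha', \alpha_0, \psi) \leq D_\psi(\alpha', \alpha_0)/\varepsilon$ such that for all $t \leq t^\star$ we have $f_{\alpha_t} \in \mathcal{B}$, and $f_{\alpha_{t^\star}}$ is an estimator satisfying the conditions of Lemma 2.1 with $\mathcal{F} = \{f_{\alpha'}\}$, $\mathcal{G} = \mathcal{B}$ and $c_1 = 1$.
Before giving the proof of this theorem, we remark upon a curious property of mirror descent that this theorem reveals. Suppose $\mathcal{F}$ is some class of functions and that the excess risk is minimized by $f_{\mathcal{F}} \in \mathcal{F}$, which is represented by $\alpha'$. If $D_\psi(\alpha', \alpha_0) = R$, then the upper bound on the excess risk of the solution obtained by early-stopped mirror descent depends on the complexity of the class $\mathcal{G} = \{f_\alpha \mid D_\psi(\alpha', \alpha) \leq R\}$. In some cases, this set could be smaller than $\mathcal{F}$. Furthermore, it is entirely unclear how explicit constrained optimization could be performed on such a set, as both $\alpha'$ and the radius $R$ are unknown, and in general the set $\mathcal{G}$ may be non-convex, as Bregman divergences are necessarily convex only in their first argument.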
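The stopping-time claim can be traced through a short calculation. The following is a sketch of that argument, assuming the identity $-\frac{d}{dt} D_\psi(\alpha', \alpha_t) = \delta_t + r_t$ with $\delta_t = L(\alpha_t) - L(\alpha')$ and $r_t = P_n(f_{\alpha_t} - f_{\alpha'})^2$; it is meant as a reading aid rather than a reproduction of the authors' proof.

```latex
\begin{align*}
\int_0^T (\delta_t + r_t)\, dt
  &= D_\psi(\alpha', \alpha_0) - D_\psi(\alpha', \alpha_T)
  \;\le\; D_\psi(\alpha', \alpha_0), \\
\min_{t \in [0, T]} (\delta_t + r_t)
  &\le \frac{1}{T} \int_0^T (\delta_t + r_t)\, dt
  \;\le\; \frac{D_\psi(\alpha', \alpha_0)}{T}.
\end{align*}
```

Taking $T = D_\psi(\alpha', \alpha_0)/\varepsilon$ shows that the first time $t^\star$ at which $\delta_t + r_t \le \varepsilon$ satisfies $t^\star \le D_\psi(\alpha', \alpha_0)/\varepsilon$. Before that time $\delta_t + r_t > \varepsilon > 0$, so $D_\psi(\alpha', \alpha_t)$ is decreasing and the iterates remain in the Bregman ball of radius $D_\psi(\alpha', \alpha_0)$ around $\alpha'$.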

Discrete Time
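Theorem 3.2 is the discrete-time version of Theorem 3.1, which holds for smooth functions. For some general norm $\|\cdot\|$, let $\|\cdot\|_*$ denote the dual norm. We say that $L$ is $\beta$-smooth with respect to $\|\cdot\|$ if its gradient map is $\beta$-Lipschitz with respect to $\|\cdot\|_*$. That is, for any $\alpha, \alpha'$ the following inequality holds:
\[
\|\nabla L(\alpha) - \nabla L(\alpha')\|_* \leq \beta \|\alpha - \alpha'\|.
\]
Additionally, we need $\rho$-strong convexity of the mirror map $\psi$ with respect to $\|\cdot\|$, meaning that for all $\alpha, \alpha'$ we have
\[
\psi(\alpha') \geq \psi(\alpha) + \langle \nabla\psi(\alpha), \alpha' - \alpha \rangle + \frac{\rho}{2}\|\alpha' - \alpha\|^2.
\]
Such a setup is standard in the optimization literature. We remark that for gradient descent ($\psi = \|\cdot\|_2^2/2$) we have $\rho = 1$. For equivalent characterizations of smoothness and strong convexity, see the monograph by Bubeck (2015).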
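For the Euclidean case, the smoothness constant of the squared loss is explicit. The sketch below assumes $L(\alpha) = \frac{1}{n}\|Z\alpha - y\|_2^2$ and the ℓ2 norm, for which $\beta = 2\lambda_{\max}(Z^\top Z)/n$; the helper name `smoothness_constant` and the random data are illustrative assumptions.

```python
import numpy as np

def smoothness_constant(Z):
    """Smoothness of L(alpha) = ||Z alpha - y||^2 / n w.r.t. the l2 norm: 2 * lambda_max(Z^T Z) / n."""
    n = Z.shape[0]
    return 2.0 * np.linalg.norm(Z, ord=2) ** 2 / n  # ord=2 gives the largest singular value

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 8))
y = rng.normal(size=50)
grad = lambda a: 2.0 / Z.shape[0] * Z.T @ (Z @ a - y)

beta = smoothness_constant(Z)
a, b = rng.normal(size=8), rng.normal(size=8)
# Empirical Lipschitz ratio of the gradient map for one pair of points.
lipschitz_ratio = np.linalg.norm(grad(a) - grad(b)) / np.linalg.norm(a - b)
# Largest step size allowed by eta <= rho / beta, with rho = 1 for psi = ||.||_2^2 / 2.
max_step = 1.0 / beta
```

For non-Euclidean mirror maps, both $\beta$ and $\rho$ have to be computed with respect to the chosen norm, so the step-size limit changes accordingly.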
The proof of Theorem 3.2 appears in Appendix D. The proof idea is identical to the continuous-time case, but we also have to handle additional error terms arising from the discretization. The effect of this discretization error appears in the $\zeta$ term below.

Theorem 3.2. Consider discrete-time mirror-descent updates as given in Equation (3). Suppose that $L$ is $\beta$-smooth and $\psi$ is $\rho$-strongly convex with respect to some norm $\|\cdot\|$. Let $\alpha_0$ be the initialization point, $\alpha'$ be some arbitrary reference point, and let
\[
\mathcal{B}(\zeta) = \{ f_\alpha : D_\psi(\alpha', \alpha) \leq D_\psi(\alpha', \alpha_0) + \zeta \}.
\]
Suppose that the step size satisfies $\eta \leq \rho/\beta$. Then, for any $\varepsilon > 0$ there exists a stopping time $t^\star = t^\star(D_n, \alpha', \alpha_0, \psi, \eta) \leq (D_\psi(\alpha', \alpha_0) + \eta L(\alpha'))/\varepsilon$ such that for all $t \leq t^\star$ we have $f_{\alpha_t} \in \mathcal{B}(\eta L(\alpha'))$, and $f_{\alpha_{t^\star}}$ is an estimator satisfying the conditions of Lemma 2.1 with $\mathcal{F} = \{f_{\alpha'}\}$, $\mathcal{G} = \mathcal{B}(C)$ and $c_1 = 1$, where $C$ is some constant such that $\eta L(\alpha') \leq C$ holds uniformly over all datasets.

Comparing Theorems 3.1 and 3.2, we see that we recover the same $O(1/t)$ convergence rates under the same step-size condition $\eta \leq \rho/\beta$ that one would obtain in an optimization setting (controlling $\delta_t$) rather than the statistical setting (controlling $\delta_t + r_t$). However, there is a statistical price in the form of the term $\zeta = \eta L(\alpha')$, which increases the radius of the ball in which our early-stopped estimator ends up (in continuous time $\zeta = 0$, as there is no discretization error). Intuitively, as $\alpha'$ is independent of the data, $L(\alpha') \approx P\ell_{f_{\alpha'}}$, which corresponds to the noise level of the problem. In a bounded setting, we can set $\eta \leq D_\psi(\alpha', \alpha_0)/(B + M)^2$ to recover a radius at most twice as large as the one in the continuous case. We remark that such an increase in the radius proportional to the noise level of the problem also appears in the work of Wei et al. (2019), to which we return in Section 4.2.
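The discrete-time statement can be illustrated with plain gradient descent ($\psi = \|\cdot\|_2^2/2$, $\rho = 1$). The sketch below stops at the first iterate with $\delta_t + r_t \le \varepsilon$ and records $D_\psi(\alpha', \alpha_t)$ along the way. The function name `early_stopped_gd`, the synthetic data, and the iteration budget `horizon` (taken from the quantity $T = (D_\psi(\alpha', \alpha_0) + \eta L(\alpha'))/(\eta\varepsilon)$ used in Appendix D, counted in iterations) are assumptions made for this example.

```python
import numpy as np

def early_stopped_gd(Z, y, alpha_ref, alpha0, eta, eps, max_steps):
    """Gradient descent on L(alpha) = ||Z alpha - y||^2 / n, stopped when delta_t + r_t <= eps."""
    n = Z.shape[0]
    loss = lambda a: np.mean((Z @ a - y) ** 2)
    alpha = np.array(alpha0, dtype=float)
    divergences = []
    for t in range(max_steps + 1):
        delta = loss(alpha) - loss(alpha_ref)               # optimization term
        r = np.mean((Z @ (alpha - alpha_ref)) ** 2)         # empirical prediction error
        divergences.append(0.5 * np.sum((alpha_ref - alpha) ** 2))  # Euclidean Bregman divergence
        if delta + r <= eps:
            return t, alpha, np.array(divergences)
        alpha = alpha - eta * (2.0 / n) * Z.T @ (Z @ alpha - y)
    raise RuntimeError("no stopping time found within max_steps")

# Synthetic well-specified problem (illustrative only).
rng = np.random.default_rng(0)
n, m = 100, 10
Z = rng.normal(size=(n, m))
alpha_ref = rng.normal(size=m)
y = Z @ alpha_ref + 0.5 * rng.normal(size=n)

beta = 2.0 * np.linalg.norm(Z, ord=2) ** 2 / n   # l2 smoothness constant of L
eta = 0.5 / beta                                  # satisfies eta <= rho / beta with rho = 1
eps = 0.05
ref_loss = np.mean((Z @ alpha_ref - y) ** 2)      # L(alpha'), roughly the noise level
d0 = 0.5 * np.sum(alpha_ref ** 2)                 # D_psi(alpha', alpha_0) for alpha_0 = 0
horizon = int(np.ceil((d0 + eta * ref_loss) / (eta * eps))) + 1

t_star, alpha_hat, divergences = early_stopped_gd(
    Z, y, alpha_ref, np.zeros(m), eta, eps, horizon
)
```

In this run every recorded divergence stays below $D_\psi(\alpha', \alpha_0) + \eta L(\alpha')$, which is the enlarged ball $\mathcal{B}(\eta L(\alpha'))$ from the theorem. In practice $\alpha'$ is unknown, so this rule serves to illustrate the analysis rather than as a usable stopping criterion.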

Consequences of the Main Results
In this section we show how our main results can be used to recover and improve upon some of the existing results in the implicit regularization literature. These are illustrative applications, and the way we obtain the results is more interesting than the results themselves.

Implicit vs Explicit Regularization
Theorem 3.1 immediately implies that along its optimization path, continuous-time mirror descent optimally solves, whenever offset Rademacher complexities give optimal bounds, a series of constrained convex optimization problems over balls with varying radii. These balls are centered at $\alpha_0$, the initial iterate, and are defined in terms of the Bregman divergence. For smooth loss functions and sufficiently small step sizes, Theorem 3.2 gives similar results.

Corollary 4.1. For any α
Recent work on implicit regularization has also sought to provide statistical guarantees on solutions along the optimization path of gradient descent and mirror descent. Suggala et al. (2018) establish that the optimization paths and the regularization paths of corresponding regularized problems, when suitably aligned, are point-wise close. This allows them to port existing results on regularized optimization to early-stopped descent algorithms; in contrast, our main result shows that the excess risk of solutions along the optimization path of (continuous-time) mirror descent can be directly bounded by the offset Rademacher complexity. We also do not require the loss function to be strongly convex. Ali et al. (2019) study the optimization path of continuous-time gradient descent for linear regression, which can be computed analytically, and show that the solution at time $t$ has risk at most 1.69 times the risk of the ridge solution with $\lambda = 1/t$. Their results require the model to be well-specified and their out-of-sample analysis requires a certain Bayesian averaging. Gunasekar et al. (2018) also consider the implicit regularization properties of several flavours of descent algorithms. Their work mainly establishes properties of the solutions obtained at convergence. Azizan and Hassibi (2019) also show implicit regularization properties of stochastic gradient and mirror descent at convergence by drawing upon connections to robust control theory.

Early Stopping for Non-Parametric Regression
Statistical and computational properties of iterative regularization for non-parametric regression have been a subject of intense study over the past two decades (Bühlmann and Yu, 2003; Yao et al., 2007; Bauer et al., 2007; Raskutti et al., 2014; Rosasco and Villa, 2015; Wei et al., 2019). The analysis of such problems typically revolves around carefully balancing some choice of bias and variance terms, both as a function of the number of gradient descent iterations. This results in a fairly involved analysis, requiring intricate knowledge of both statistics and optimization.
In contrast, statistical properties of corresponding explicit regularization schemes, such as kernel ridge regression, can be understood in a rather clean manner by employing localized Rademacher complexities (Bartlett et al., 2005;Koltchinskii, 2011), which often yield minimax optimal rates.The work of Raskutti et al. (2014) was the first one to connect iterative regularization to such complexity measures, albeit still resorting to the aforementioned analysis based on a bias-variance trade-off.Recently, Wei et al. (2019) extend these results to general loss functions and make the connections to localized complexity measures more explicit, in particular by resorting to the control of gradient descent iterates similar in spirit to that of Theorem 3.2.
Let $k : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ be a Mercer kernel which induces a Hilbert space of functions $\mathcal{H}$. For $f \in \mathcal{H}$, the Hilbert space norm of $f$ will be denoted by $\|f\|_{\mathcal{H}}$. We assume that there exists some $L > 0$ such that $\sup_{x \in \mathcal{X}} k(x, x) \leq L$, so that for all $f \in \mathcal{H}$ we have $\sup_{x \in \mathcal{X}} |f(x)| \leq \|f\|_{\mathcal{H}} L$. Such a setup is standard in the literature and we refer an interested reader to Scholkopf and Smola (2001) for background on reproducing kernel Hilbert spaces.
Conditionally on the observed data, we let K ∈ R n×n given by K Following Raskutti et al. (2014, Section 2.2), it will be convenient to analyze gradient descent updates in a transformed co-ordinate system α = √ Kω so that .
We now show how to use Theorem 3.2 to obtain excess risk bounds that recover statistical rates proved in Raskutti et al. (2014, Theorem 2) and Wei et al. (2019, Theorem 1) at the same computational cost.In fact, our theory yields a stronger result since it also applies to the random-design misspecified setting.
Theorem 4.1.For any R > 0 let B R = {h ∈ H : h H ≤ R}.Let c 3 and c 4 be some constants depending only on M, L and R that we specify in the proof below.Consider running mirror descent with with constants c 3 and c 4 depending only on and M, L and R.
Proof. Note that our parameter system is conditional on the data, so there is no $\alpha' \in \mathbb{R}^n$ such that $f_{B_R} = f_{\alpha'}$ for all realizations of the observed data. Hence Theorem 3.2 is not directly applicable. We circumvent this issue by working conditionally on the data and relating gradient descent iterates to an empirical risk minimizer of $L$ over $B_R$, which can always be represented using our coordinate system by the representer theorem. This in turn allows us to upper-bound $\|f_{\alpha_{t^\star}}\|_{\mathcal{H}}$ by $3R$ and puts us in the setting of Lemma 2.1. We remark that the above approach is precisely what allows us to derive excess risk bounds in a random-design setting. We provide the details below.
Conditionally on the data, let ω denote any solution to min ω T Kω≤R By the representer theorem we also have Let α = √ K ω.Fix any ε > 0. Theorem 3.2 with α ′ = α shows that there exists a stopping time t * ≤ R 2 /(ηε) such that the following two conditions hold: Recall that f B R denotes the best function in B R with respect to the population loss.Equation ( 7) implies that where we remark that c 5 depends only on constants R, L and M since η ≤ 1.Note that setting smaller η will improve the constant factors in the resulting bound at the expense of increased computational cost.However, the prescribed choice of η is enough to attain optimal rates in terms of the dataset size n.Finally, Equation (8) implies that Where the equation denoted by ( * ) follows from convexity of B R and Equation 6 (cf.Liang et al. (2015, Lemma 1)).Combining the above equation together with Equation 9 and applying Lemma 2.1 (with c 1 = 1/2 and F = B R ) we obtain ] yields the desired result.

Implicit Regularization Under ℓ 1 Norm Constraints
Interest in understanding the generalization properties of neural networks has sparked research into implicit regularization properties of various factorized models, which has led to the emergence of a theory of implicit sparsity (Gunasekar et al., 2017; Li et al., 2018; Zhao et al., 2019; Vaskevicius et al., 2019; Arora et al., 2019; Woodworth et al., 2019; Gidel et al., 2019). All of the existing results, however, require restrictive assumptions, such as zero noise, analysis at convergence, continuous-time analysis, or strong conditions on the data. None of these results provide any reasonable guarantees on the in-sample linear prediction error. The problem is described as follows. Let $X \in \mathbb{R}^{n \times d}$ be a fixed design matrix such that the ℓ2 norms of the columns of $X$ are bounded by some constant $\kappa > 0$. Consider the well-specified case, so that there exists some $w^\star \in \mathbb{R}^d$ such that the observations are given by $y_i = \langle x_i, w^\star \rangle + \xi_i$, where $\xi_i$ are zero-mean independent $\sigma^2$-subGaussian random variables.
A candidate algorithm, known to be optimal for sparse recovery under restricted isometry assumption is defined as follows: Let w t ∈ R d denote the iterate at time t.Let ⊙ denote the Hadamard product and let 1 denote a vector of ones.Consider the following re-parametrization Instead of running gradient descent directly on w t , the algorithm is defined by running gradient descent updates on the concatenated parameter vector (u, v), yielding the updates Noting that 1 + x ≈ e x for small x, we can approximate the above updates (with the step size η rescaled by a constant factor) by the unconstrained EG± algorithm (Kivinen and Warmuth, 1997) with updates given by (10) Ghai et al. (2019, Theorem 24) shows that the above updates correspond to running unconstrained mirror descent initialized at 0 with mirror map given by This puts us directly in the setting of Theorem 3.2, applying which yields minimax-optimal rates up to a factor log γ −1 (Raskutti et al., 2011).We prove the following in Appendix E.
Theorem 4.2.Consider the fixed-design in-sample prediction setting described above and run the EG± algorithm with parameters γ ≤ Then, there exists some stopping time (dependent on the observations) such that the iterate with probability at least 1 − 2e −nc 6 − 1 8d 3 for some absolute constant c 6 .
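The correspondence between exponentiated-gradient style updates and mirror descent can be checked numerically. The sketch below assumes the hypentropy mirror map of Ghai et al. (2019), $\phi_\gamma(w) = \sum_i \big(w_i \operatorname{arcsinh}(w_i/\gamma) - \sqrt{w_i^2 + \gamma^2}\big)$, for which $\nabla\phi_\gamma(w) = \operatorname{arcsinh}(w/\gamma)$, and a multiplicative update with $w = u - v$ and $u_0 = v_0 = \gamma/2$. The exact constants of Equation (10) are not recoverable from this text, so the step-size scaling used here is an assumption, as are the function names and the synthetic data.

```python
import numpy as np

def grad_loss(X, y, w):
    """Gradient of L(w) = ||X w - y||^2 / n."""
    return 2.0 / X.shape[0] * X.T @ (X @ w - y)

def hypentropy_md(X, y, gamma, eta, steps):
    """Unconstrained mirror descent from w = 0 with grad psi(w) = arcsinh(w / gamma)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = np.arcsinh(w / gamma) - eta * grad_loss(X, y, w)  # dual step
        w = gamma * np.sinh(theta)                                 # inverse mirror map
    return w

def eg_pm(X, y, gamma, eta, steps):
    """Multiplicative (EG+- style) updates on w = u - v, started at u = v = gamma / 2."""
    d = X.shape[1]
    u = np.full(d, gamma / 2.0)
    v = np.full(d, gamma / 2.0)
    for _ in range(steps):
        g = grad_loss(X, y, u - v)
        u, v = u * np.exp(-eta * g), v * np.exp(eta * g)
    return u - v

# Synthetic sparse regression problem (illustrative only).
rng = np.random.default_rng(0)
n, d = 40, 10
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:2] = [1.0, -1.0]
y = X @ w_star + 0.1 * rng.normal(size=n)

w_md = hypentropy_md(X, y, gamma=1e-3, eta=0.01, steps=200)
w_eg = eg_pm(X, y, gamma=1e-3, eta=0.01, steps=200)
```

The two trajectories coincide because $u = \frac{\gamma}{2} e^{\theta}$ and $v = \frac{\gamma}{2} e^{-\theta}$ give $u - v = \gamma \sinh(\theta)$, so a dual step $\theta \leftarrow \theta - \eta \nabla L$ is exactly a multiplicative step on $(u, v)$.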

A Proof of Lemma 2.1
Following along the lines of Liang et al. (2015, Corollary 2) we obtain The above term excluding ε is the same as the term obtained in Liang et al. (2015, Corollary 2).From here on, the remainder of the proof is identical to the proof of Liang et al. (2015, Theorem 3) which is based on standard symmetrization and contraction techniques.

B Proof of Lemma 3.1
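Recall that $L(\alpha) = \frac{1}{n}\|Z\alpha - y\|_2^2$, so that $\nabla L(\alpha) = \frac{2}{n} Z^\top (Z\alpha - y)$. We hence have
\begin{align*}
\langle -\nabla L(\alpha), \alpha' - \alpha \rangle
&= \frac{2}{n}\langle Z\alpha - y, Z\alpha - Z\alpha' \rangle \\
&= \frac{1}{n}\left(\|Z\alpha - y\|_2^2 + \|Z\alpha - Z\alpha'\|_2^2 - \|Z\alpha' - y\|_2^2\right) \\
&= L(\alpha) - L(\alpha') + P_n(f_\alpha - f_{\alpha'})^2,
\end{align*}
where the penultimate line follows by applying the equality $2\langle a, b \rangle = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$, which holds for all vectors $a, b$ of the same dimension.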

C Technical Lemmas
This section contains some technical lemmas appearing in the proofs.
The following lemma is a generalization of the identity $\|a\|_2^2 + \|b\|_2^2 = \|a - b\|_2^2 + 2\langle a, b \rangle$, where Euclidean norms are replaced by Bregman divergences.
Lemma C.1. For any mirror map ψ and any points x, y, z in the domain of ψ, the following identity holds:
Lemma C.2. Let the mirror map ψ be ρ-strongly convex with respect to some general norm ‖·‖. Then, for a sequence of mirror descent iterates (α t ) t≥0 we have

D Proof of Theorem 3.2
We follow along the lines of proof of Theorem 3.1.We begin by applying Lemmas C.1 and C.2: Letting T = D ψ (α ′ , α 0 ) + ηL(α ′ ) ηε and summing both sides of the above equation for t = 0, . . ., T we get where we have also used the fact that r 0 ≥ 0 and −ηδ T +1 ≤ ηL(α ′ ).The above equation shows that the following definition of our stopping time is well-defined: 0 then we are done.Otherwise, by telescoping from 0 to t − 1 we obtain which concludes our proof.

E Proof of Theorem 4.2
We begin by stating two lemmas which relate φ γ divergences to ℓ 1 norms.We prove both lemmas at the end of this section.
Condition on the event A 1 = {L(w ⋆ ) ≤ 2σ 2 }. Since the noise random variables are σ 2 -subGaussian, by sub-Exponential concentration we have P(A 1 ) ≥ 1 − 2e −nc 6 , where c 6 is an absolute constant independent of any problem parameters (cf. Vershynin (2010, Section 5.2.4)). By Theorem 3.2, Lemma E.1 and L(w ⋆ ) ≤ 2σ 2 , it is hence enough to set so that there exists a stopping time such that w t ∈ B R ⋆ for all t ≤ t ⋆ and also Rearranging the above inequality, we obtain Since the ℓ 2 norms of the columns of X are bounded by κ and since the noise consists of independent σ 2 -subGaussian random variables, the term ‖X T ξ/n‖ ∞ is upper-bounded by 4κσ √ log d/ √ n with probability at least 1 − 1 8d 3 . By the union bound, events A 1 and A 2 happen simultaneously with probability at least 1 − 2e −nc 6 − 1 8d 3 . Setting ε = R ⋆ κσ √ log d/ √ n concludes our proof.

E.1 Proof of Lemma E.1
The upper-bound is shown in Ghai et al. (2019, Section 3).For the lower-bound we proceed as follows: The result follows by plugging in w 1 = 0 and using γ ≤ w ⋆ 1 /(e 3 d).

E.2 Proof of Lemma E.2
Note that for any x > γ > 0 we have arcsinh(x/γ) ≤ log(3x/γ).Hence, continuing from Equation ( 11) we have The result follows by applying the upper-bound given by Lemma E.1.

F Table of Notation
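Definition 2.2 (Mirror Map). Let $\mathcal{D} \subseteq \mathbb{R}^m$ be some open set. We say that $\psi : \mathcal{D} \to \mathbb{R}$ is a mirror map if $\psi$ is strictly convex, differentiable, and $\{\nabla\psi(\alpha) \mid \alpha \in \mathcal{D}\} = \mathbb{R}^m$.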

Table 1: Table of notation

Symbol                          Description
n                               Number of data points.
P                               Data generating distribution.
(x_i, y_i)                      The i-th data point sampled independently from P.
P ℓ_f                           The population loss of f given by E[(f(x) − y)^2].
P_n ℓ_f                         The empirical loss of f given by (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2.
P_n (f − g)^2                   Empirical ℓ2 distance between f and g given by (1/n) Σ_{i=1}^n (f(x_i) − g(x_i))^2.
m                               Dimensionality of the parameter space.
f_α                             A function parameterized by α ∈ R^m.
Z ∈ R^{n×m}                     A matrix such that for all α ∈ R^m we have f_α(x_i) = (Zα)_i.
δ_t                             A shorthand for L(α_t) − L(α′).
r_t                             A shorthand for P_n (f_{α_t} − f_{α′})^2.
K                               An n × n kernel matrix.
X                               An n × d design matrix for linear prediction.
B_R                             A ball of some function class with radius R.
F                               Some generic class of functions.
f_F                             A function f ∈ F such that P ℓ_f = inf_{f ∈ F} P ℓ_f.