Large Noise in Variational Regularization

In this paper we consider variational regularization methods for inverse problems with large noise, which is in general unbounded in the image space of the forward operator. We introduce a Banach space setting that allows one to define a reasonable notion of solutions for more general noise in a larger space, provided the forward operator has sufficient mapping properties. A key observation, which guides us through the subsequent analysis, is that such a general noise model can be understood within the same setting as approximate source conditions (while a standard model of bounded noise is related directly to classical source conditions). Based on this insight we obtain a quite general existence result for regularized variational problems and derive error estimates in terms of Bregman distances. The latter are specialized for the particularly important cases of one- and p-homogeneous regularization functionals. As a natural further step we study stochastic noise models, in particular white noise, for which we derive error estimates in terms of the expectation of the Bregman distance. The finiteness of certain expectations leads to a novel class of abstract smoothness conditions on the forward operator, which can be easily interpreted in the Hilbert space case. We finally exemplify the approach, and in particular the conditions, for popular examples of regularization functionals given by squared norm, Besov norm and total variation, respectively.


Introduction
Motivated by stochastic modelling of noise, in particular white noise, the treatment of inverse problems with large noise has received strong attention recently [20,21,36,37,45]. Here large noise means that the norm of the data perturbation introduced by the noise is not small or may even be unbounded in the image space of the forward operator. Recently several papers have tackled such problems in the setting of linear regularization methods (corresponding to quadratic variational regularization), but also in those approaches some points were restrictive. The work by Eggermont et al. [21] assumes noise potentially large in the image space of the forward operator, but still being an element of this space. This allows one to gain some insight, but still excludes white noise, for which the latter condition is satisfied with probability zero. Moreover, some difficulties related to the appropriate formulation of the regularized problem with white noise do not appear in this way. Another line of research restricts itself to inverse problems with special settings of function spaces, namely some Sobolev spaces [36,37] or Hilbert scales [41,42,43,44]. In these works, however, estimates are obtained in weaker norms, and the setting still partly shadows the general structure. In this paper we directly tackle the issue of large noise variational regularization with convex regularization functionals in Banach spaces. We derive a rather general theory that can be adapted to special homogeneity properties of the regularization functional, in particular to quadratic (Tikhonov) and one-homogeneous regularizations as popularized via total variation methods [51,8] and sparsity (see e.g. [5,17,47]). We consider the linear ill-posed problem

Ku = f, (1.1)

where noisy data are given by

f δ = Ku † + δn, (1.2)

where n ∈ Z * and δ > 0 models the noise level. Notice carefully that f δ ∈ Z * can be unbounded in the norm of Y , which yields our setting of large noise.
It is crucial that, due to the continuous extension property, K * n is bounded in X * . As usual in variational methods we obtain a regularized solution of (1.2) by computing a minimizer u δ α of a weighted sum of the squared residual (in the norm of Y ) and the regularization functional. However, since the (squared) norm of f δ is not necessarily finite, it is more appropriate to consider an expansion of the squared residual [36,37] and compute u δ α as a minimizer of

J δ α (u) = 1/2 ‖Ku‖ Y 2 − ⟨f δ , Ku⟩ Z * ×Z + αR(u) (1.3)

with a convex regularization functional R : X → R ∪ {∞}.
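To make the role of the expanded functional concrete, here is a minimal finite-dimensional sketch (all sizes, data and the quadratic choice R(u) = 1/2 ‖u‖ 2 are hypothetical): the expansion never evaluates ‖f δ ‖ 2 , which may be infinite in the large-noise setting, and for the quadratic R the minimizer solves a linear system.

```python
import numpy as np

# Finite-dimensional sketch (all sizes and data here are hypothetical): the
# expanded residual
#   J(u) = 1/2 ||K u||^2 - <f_delta, K u> + alpha * R(u)
# never evaluates ||f_delta||^2, which may be infinite in the large-noise setting.
rng = np.random.default_rng(0)
K = rng.standard_normal((20, 10))
f_delta = rng.standard_normal(20)   # stand-in for possibly rough data
alpha = 0.1

def J(u):
    Ku = K @ u
    return 0.5 * Ku @ Ku - f_delta @ Ku + alpha * 0.5 * u @ u

# For the quadratic choice R(u) = 1/2 ||u||^2 the minimizer solves the normal
# equations (K^T K + alpha I) u = K^T f_delta.
u_min = np.linalg.solve(K.T @ K + alpha * np.eye(10), K.T @ f_delta)

# Any perturbation increases J, confirming minimality.
for _ in range(5):
    v = rng.standard_normal(10)
    assert J(u_min + 1e-3 * v) >= J(u_min)
```

Note that J is bounded below here only because the example is finite-dimensional; in the general setting this is exactly the complication addressed below.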
Our main assumptions on R in addition to convexity are (R1) the functional R is lower semicontinuous in some topology τ on X, (R2) the sub-level sets M ρ = {R ≤ ρ} are sequentially compact in the topology τ on X and (R3) the convex conjugate R ⋆ is finite on a ball in X * centered at zero.
The first two are the standard conditions needed for existence proofs and, as we shall see below, together with (R3) they also lead to a general existence result for minimizers of J δ α in the case of positive α. Note that we assume that K : X → Z is continuous with respect to the topology τ . A standard example is R being a power of a norm in a Banach space possessing a predual space. In this case the Banach-Alaoglu theorem yields compactness in the weak-star topology, for which we have genuine lower semicontinuity of the norm. We mention that a major difference to the case of bounded noise is that there is no natural lower bound for J δ α (the lower bound in the case of bounded noise is − 1/2 ‖f δ ‖ Y 2 + αR(u 0 ), with u 0 being a minimizer of R); this is the only complication in the analysis below and needs a suitable approximation of the noise together with (R3). To make some results below more accessible we will further employ the symmetry condition (R4) R(−u) = R(u) for all u ∈ X, which is however not essential for the overall line of arguments.
Our key observation is related to error estimates between u δ α and a solution u † minimizing R among all possible solutions of Ku = f . The usual way to obtain such estimates starts from the optimality condition for a minimizer

K * (Ku δ α − f δ ) + αµ δ α = 0, µ δ α ∈ ∂R(u δ α ), (1.4)

where ∂R stands for the subdifferential of R. Next, the form (1.2) of f δ is inserted and multiples of a subgradient µ † ∈ ∂R(u † ) are added on both sides to arrive at

K * K(u δ α − u † ) + α(µ δ α − µ † ) = δη − αµ † , (1.5)

where η = K * n ∈ X * . The following step is to take a duality product with u δ α − u † and hence derive error estimates in the Bregman distance [4,6]. In doing so one can strongly benefit if µ † satisfies a source condition, i.e., if µ † = K * w † for some w † ∈ Y . Note that in the bounded noise model η also satisfies such a condition, which becomes violated in our setting. Since η and µ † appear in a similar fashion on the right-hand side, we see that the unboundedness of the noise in Y leads to a similar technical issue as the violation of the source condition for µ † . However, the latter is reasonably well understood and has been tackled by the concept of distance functions and approximate source conditions [26,30,32,52], which are related to the growth rate of ‖w † ‖ Y as K * w † approximates µ † . Due to the analogous roles of µ † and η it is natural to use the same paradigm for approximating the large noise, and this is the basic foundation of the analysis in this paper.
Following this idea, our key contribution is to derive Bregman distance based error estimates between u δ α and u † for a general R. Given a deterministic noise model, one can derive explicit convergence rate results from (a variation of) an approximate source condition on µ † and η. In this paper we prove convergence rates for the special cases of the 1-homogeneous R(u) = ‖u‖ X as well as the p-homogeneous R(u) = 1/p ‖u‖ p X for 1 < p < ∞. For our main motivation, random noise, the approximate source condition needs to be reconsidered in a statistical framework. In this work our interest lies in the frequentist risk between the estimator U δ α = U δ α (ω) and the true unknown u † . In this paradigm we find that the expected decay rate of the approximate source condition is sufficient to guarantee a convergence rate result. Here we study and derive the convergence rate of the frequentist risk for three examples: quadratic Tikhonov regularization, Besov norm regularization and total variation regularization. As for the noise, we assume the canonical Gaussian white noise model on the Gelfand triplet (Z, Y, Z * ), which has the well-known property that n is almost surely unbounded in Y .
Let us shortly discuss some earlier work. After introducing the idea in [7], Bregman distances have been frequently used as an error measure for studying convergence rates of regularized solutions in Banach spaces. Convergence rates for the Bregman distance were further developed in e.g. [2,9,25,31,38,39,49,50]. Iterative regularization based on Bregman distances were analysed e.g. in [9,46]. The literature on regularization theory in Banach spaces is quite extensive, but throughout the paper we often refer to an excellent textbook on the topic [52]. For a recent discussion of Bregman distances we refer to [6].
The remainder of the paper is organized as follows: in Section 2 we develop the theory for a general functional R. The main results of this section include the proof of existence of u δ α , together with a related a-priori estimate, in Section 2.1; the general error estimates are given in Section 2.4. In Section 3 we derive convergence rates for different homogeneous examples of R. Next, we turn our focus to random noise in Section 4 and consider examples of regularization by a quadratic Tikhonov functional (Section 4.2), a Besov norm (Section 4.3) and the total variation functional (Section 4.4). Finally, we give an outlook on applications of our work to Bayesian inference in Section 5.

General Estimates
In the following we discuss the general approach for variational regularization under the assumptions above. We start by establishing the existence of a minimizer of J δ α for α > 0, which also yields some a-priori bounds for the solution.

Existence and a-priori Estimates
For general noise the existence of a minimizer of J δ α is not clear from standard arguments. While the usual lower semicontinuity arguments remain unchanged, the key issue is compactness, which follows from an a-priori estimate on R due to the compactness of sublevel sets. In deriving such an estimate we need to bypass the missing lower bound of J δ α . Proposition 2.1. Let R satisfy the assumptions (R1)-(R4). Then the functional J δ α has a minimizer. Moreover, any such minimizer u δ α satisfies the a-priori estimate (2.1) for any γ ∈ (0, 1) and w ∈ Y , where η = K * n.
where 0 < γ < 1 and w ∈ Y is arbitrary. For the definition of the convex conjugate R ⋆ see Appendix A. Due to assumptions (R2), (R3) and Y being dense in Z * , we can now choose w ∈ Y such that the corresponding bound holds with a constant C > 0, and hence we obtain a uniform bound on R along minimizing sequences, which implies that M is compact due to assumption (R2). Now the existence follows by standard arguments. Without loss of generality we can assume that {u j } ∞ j=1 ⊂ M is a minimizing sequence of J δ α . Since M is compact, there exists a subsequence u j k converging in the topology τ to some u ∈ X. Finally, the lower semicontinuity of J δ α yields that u is a minimizer. Note that with the existence of a minimizer u we directly obtain the a-priori estimate (2.1).
Remark 2.2. We can prove a similar a-priori estimate for R also without the symmetry assumption (R4). In that case we get for the minimizer u δ α

Basic Ingredients of Error Estimates
In the following we discuss some basics needed for the derivation of error estimates and the use of the approximate source conditions. The starting point for error estimates is the optimality condition mentioned above. Since the first two terms are linear and quadratic, it is straightforward to verify that they are Fréchet differentiable in our setting. Then the subdifferential of the whole functional equals the sum of the Fréchet derivative of the first part and the subdifferential of the regularization functional (cf. [22]), which immediately implies the following statement: Proposition 2.3. Under the assumptions above, a minimizer u δ α of J δ α satisfies the optimality condition (1.4).
As mentioned above, error estimates are based on rewriting (1.4) and then taking a duality product with u δ α − u † . This naturally leads to estimates in the Bregman distance, whose definition we recall for completeness: Definition 2.4 (Bregman distance). Let R : X → R ∪ {∞} be a convex functional. Then for each µ v ∈ ∂R(v) ⊂ X * we define the generalised Bregman distance between u and v as

D µ v R (u, v) = R(u) − R(v) − ⟨µ v , u − v⟩ X * ×X .

Moreover, for µ u ∈ ∂R(u) we define the symmetric Bregman distance between u and v as

D sym R (u, v) = ⟨µ u − µ v , u − v⟩ X * ×X . (2.2)

Let us now sketch the basic steps in the derivation of error estimates and the standard route in the case of bounded noise. Taking a duality product of (1.4) with u δ α − u † , the symmetric Bregman distance appears on the left-hand side. The nice case leading directly to estimates is η = K * n with n ∈ Y and the additional source condition µ † = K * w † ∈ X * for w † ∈ Y ; then the right-hand side can be estimated via Young's inequality. The problem becomes more difficult if η or µ † are not in the range of K * (if the range is defined as K * Y and not K * on a larger space including the noise). Note that with the notation using η instead of K * n it becomes apparent that η not being in the range of K * is technically as difficult as µ † not being in the range of K * . The latter case is however reasonably well understood, at least in the case of strictly convex functionals R; this is discussed in detail in [52]. The idea is to use a so-called approximate source condition, quantifying how well µ † can be approximated by elements in the range of K * . Since µ † needs to be in the closure of the range, there exists a sequence w n with K * w n → µ † . On the other hand it is not in the range, hence w n necessarily diverges. Thus, one can measure how well µ † , respectively in our case η − αµ † , can be approximated by elements K * w with a given upper bound on ‖w‖ Y . The best estimates are then obtained by balancing the errors containing the approximation of η − αµ † and ‖w‖ Y .
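A minimal numerical sketch of Definition 2.4 in the smooth quadratic case (the setup below is illustrative and not taken from the text): for R(u) = 1/2 ‖u‖ 2 the unique subgradient at v is µ v = v, the generalised Bregman distance reduces to half the squared distance, and the symmetric Bregman distance to the full squared distance.

```python
import numpy as np

# Illustrative sketch of Definition 2.4 (setup is ours, not from the text): for
# the smooth case R(u) = 1/2 ||u||^2 the unique subgradient at v is mu_v = v,
# so D(u, v) = R(u) - R(v) - <mu_v, u - v> = 1/2 ||u - v||^2, and the symmetric
# Bregman distance <mu_u - mu_v, u - v> equals the full squared distance.
def R(u):
    return 0.5 * float(u @ u)

def bregman(u, v):
    mu_v = v                      # subgradient of R at v
    return R(u) - R(v) - float(mu_v @ (u - v))

rng = np.random.default_rng(1)
u, v = rng.standard_normal(5), rng.standard_normal(5)
assert np.isclose(bregman(u, v), 0.5 * np.sum((u - v) ** 2))
# symmetric Bregman distance = sum of the two one-sided distances here
assert np.isclose(float((u - v) @ (u - v)), bregman(u, v) + bregman(v, u))
```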
In the case of no strict source condition and unbounded noise we will approximate µ † and η by separate elements K * w 1 and K * w 2 , respectively. The second term on the right-hand side can then be estimated using Young's inequality as above, while for the first term it is natural to apply the generalized Young's inequality as in the proof of Proposition 2.1. We shall estimate the terms multiplied by δ and α separately and overall study the problem of estimating a term of the form ⟨η, u δ α − u † ⟩ X * ×X . For this sake we could separately estimate the duality products with u δ α and u † as in the proof of Proposition 2.1. However, as we are interested mainly in functionals with some homogeneity properties and in particular (R4), we shall see that it is beneficial to use a direct estimate, which we shall employ further with appropriately chosen ζ > 0. We observe that, proceeding as above, we are left with two terms depending on w 1 , namely α 2 /2 ‖w 1 ‖ 2 and a term measuring how well K * w 1 approximates µ † . Analogous reasoning holds for w 2 , with α replaced by δ. This motivates our approach to the approximate source conditions to be detailed in the following.

A Variation on Approximate Source Condition
The standard concept of an approximate source condition considers the case R(u) = ‖u‖ r X for some power r > 1 (cf. [52]). The key concept is the so-called distance function d ρ (ϑ) and its asymptotics as ρ → ∞. Note that in the case of a fulfilled source condition d ρ (ϑ) = 0 for ρ sufficiently large, while in the genuinely approximate case d ρ (ϑ) decays to zero only at a finite rate. Hence, the speed of decay of d ρ (ϑ) is a natural measure to quantify the approximateness of the source condition. Unfortunately, the existing theory employing approximate source conditions, or the even more implicit variational inequalities, only works for the special norm-type functionals above (cf. [52]) and in addition uses some moduli of strict convexity of the norms. This of course excludes the most interesting cases of one-homogeneous regularizations such as sparsity and total variation. Hence we propose to consider a more general formulation based on convex duality. As we have seen above, it is crucial to approximate elements ϑ ∈ X * by K * w with w ∈ Y in some kind of Fenchel dual problem defined by K and R. More precisely, we are interested in minimal values of the functional E α,ζ (w; ϑ), which we shall denote as e α,ζ (ϑ) = inf w∈Y E α,ζ (w; ϑ).
(2.6) Remark 2.5. Indeed, it can be inferred from the Fenchel duality theorem (cf. [22]) that e α,ζ (ϑ) = − inf v∈X F α,ζ (v; ϑ). Thus, e α,ζ measures how fast a regularization method approximating ϑ (related to the noise or source element) diverges, and it is hence a natural quantity. For R ⋆ (ϑ) being finite, this immediately implies a bound on e α,ζ (ϑ) via the generalized Young inequality. Obviously, this estimate is not optimal under most conditions, since it does not involve the first term in F α,ζ . As we shall see below, the bound can be improved under certain conditions, depending also on the homogeneity properties of R.
In the case of a Hilbert space regularization, the problem of computing the minimizer of E α,ζ is a classical Tikhonov regularization problem. In this example, but also in the more general case, the minimization of E α,ζ is closely related to the minimization in the definition of distance functions; roughly, it can be understood as a kind of Lagrange multiplier formulation of the constrained problem for computing d ρ .
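To illustrate this, here is a hedged finite-dimensional sketch in which we read E α,ζ as a quadratic functional (the precise weights below are our assumption, since the definition of E α,ζ is not reproduced above): its minimization over w is a classical Tikhonov problem with a closed-form solution.

```python
import numpy as np

# Hedged finite-dimensional sketch: for Hilbert space regularization we read
# E_{alpha,zeta} as the quadratic (the precise weights are our assumption)
#   E(w) = alpha^2/2 ||w||^2 + 1/(2 zeta) ||theta - K^T w||^2,
# whose minimization is a classical Tikhonov problem with closed-form solution
#   w* = (K K^T + alpha^2 zeta I)^{-1} K theta.
rng = np.random.default_rng(2)
K = rng.standard_normal((8, 12))
theta = rng.standard_normal(12)
alpha, zeta = 0.3, 0.5

def E(w):
    r = theta - K.T @ w
    return 0.5 * alpha**2 * float(w @ w) + 0.5 / zeta * float(r @ r)

w_star = np.linalg.solve(K @ K.T + alpha**2 * zeta * np.eye(8), K @ theta)

# Any perturbation increases E, confirming minimality of the closed form.
for _ in range(5):
    assert E(w_star) <= E(w_star + 1e-3 * rng.standard_normal(8))
```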
We finally mention that we can also rewrite the a-priori estimate from Proposition 2.1 in terms of the approximate source condition (2.8) with any ζ ∈ (0, 1).

Error Estimates
In order to obtain error estimates we start from the rewritten version of the optimality condition (1.5) and take a duality product with u δ α − u † in the same way as sketched above. Estimating the right-hand side as described there immediately leads to the following error estimates: Proposition 2.6. Let R satisfy (R1)-(R4). Then with the assumptions above we obtain, for any positive real numbers ζ 1 , ζ 2 , an estimate for the symmetric Bregman distance between u δ α and u † . In order to obtain meaningful estimates we need to further estimate R(u δ α − u † ), ideally in terms of the Bregman distance, which however strongly depends on the specific scaling properties of the underlying functional R. Inspired by p-convex functionals (cf. [3]), we shall consider the following assumption: there exists θ ∈ [0, 1] such that (2.14) holds. The canonical examples to be considered are squared norms (leading to θ = 1) and one-homogeneous functionals (leading to θ = 0). Example 2.7. Let X be a Hilbert space, L a bounded linear operator, and R(u) = 1/2 ‖Lu‖ 2 . Then inequality (2.14) holds with θ = 1 and C θ (u, v) ≡ 1/2. Example 2.8. Let R be one-homogeneous, symmetric around zero, and convex. We immediately obtain a triangle inequality R(u − v) ≤ R(u) + R(v), and hence (2.14) holds with θ = 0 and C 0 (u, v) = R(u) + R(v). It is easy to see that for R of the above form no estimate with θ > 0 can hold. As an example consider R : R → R, R(u) = |u|. If u and v differ but have equal sign, we obtain |u − v| > 0, but D R (u, v) = 0.
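The counterexample in Example 2.8 can be checked directly; the following one-dimensional sketch is purely illustrative.

```python
# One-dimensional illustrative check of Example 2.8: for R(u) = |u| and u, v of
# equal sign, the subgradient at v is shared, so the Bregman distance vanishes
# although R(u - v) > 0, hence no estimate with theta > 0 can hold.
def R(u):
    return abs(u)

def bregman(u, v):
    mu_v = 1.0 if v > 0 else -1.0   # a subgradient of |.| at v != 0
    return R(u) - R(v) - mu_v * (u - v)

u, v = 3.0, 1.0                      # differ, but have equal sign
assert bregman(u, v) == 0.0 and R(u - v) > 0.0
# The triangle inequality gives (2.14) with theta = 0: R(u - v) <= R(u) + R(v).
assert R(u - v) <= R(u) + R(v)
```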

Convergence theorems
With assumption (2.14) we can further estimate the right-hand side in the above estimates if θ < 1; in the case θ = 1 the first estimate is the only relevant one. This leads to the following result: Theorem 2.9. Let R satisfy the assumptions of Proposition 2.6 and (2.14). Then for θ < 1 we obtain the corresponding error estimate in the Bregman distance. We finally mention an alternative statement of Theorem 2.9, which also takes into account an estimate of the residual. In the subsequent parts of the paper we will not discuss estimates for the residual, but obviously those can be obtained in the same way using the following result: Theorem 2.10. Let R satisfy the assumptions of Proposition 2.6 and (2.14). Then for θ < 1 we obtain the analogous estimate including the residual. Note that the constant C θ (u δ α , u † ) above depends on R(u δ α ) and hence also on the corresponding a-priori estimate.

Convergence Rates for Homogeneous Regularizations
Let us shortly introduce some notation. Throughout the following sections we write f ≲ g for two functions if there exists a universal constant C > 0 such that f ≤ Cg. Moreover, if the functions f and g are equivalent in this sense, i.e., f ≲ g and g ≲ f, we write f ≃ g. If a random variable X has probability distribution π, we write X ∼ π.

Regularization by one-homogeneous functional
Let us directly proceed to the case of a one-homogeneous functional R such as a Besov 1-norm or the total variation. We assume that X is a suitable space such that R has a trivial nullspace (note that the nullspace of a one-homogeneous convex functional is always a linear space; if it is finite-dimensional, this component can be eliminated via arguments similar to those for the total variation case detailed in [8]).
In this case we can define a dual "norm" S on X * via (3.1). Note that S is again one-homogeneous, and the one-homogeneity of R implies the generalized Cauchy-Schwarz inequality ⟨q, u⟩ ≤ S(q)R(u) for all u ∈ X and q ∈ X * . In the case of one-homogeneous R we can relate R ⋆ and S as follows: Lemma 3.1. Let R : X → R ∪ {∞} be convex, non-negative and one-homogeneous and let S : X * → R ∪ {∞} be defined by (3.1). Then for any c ∈ R + , we have the corresponding identity for R ⋆ . Note that under the convexity condition and the homogeneity R(cu) = |c|R(u), the regularisation functional R is sublinear; hence, the proof follows from general results on sublinear functionals in [29, Section V]. Next we formulate an alternative approximate source condition for the unknown and the noise in the one-homogeneous case.
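As a concrete illustration of the dual functional S defined above, consider the finite-dimensional sketch below with R(u) = ‖u‖ 1 (the setup is ours): then S is the supremum norm, and the conjugate R ⋆ is the indicator of the set {S ≤ 1}, consistent with Lemma 3.1.

```python
import numpy as np

# Finite-dimensional illustration (our sketch): for the one-homogeneous
# R(u) = ||u||_1 one has S(q) = sup_{R(u) <= 1} <q, u> = ||q||_inf, and the
# conjugate R* is the indicator of {S <= 1}, consistent with Lemma 3.1.
rng = np.random.default_rng(7)
n = 6
q = rng.standard_normal(n)

# The sup over the l1 unit ball is attained at an extreme point +-e_i.
vals = [s * q[i] for i in range(n) for s in (1.0, -1.0)]
S_q = max(vals)
assert np.isclose(S_q, np.max(np.abs(q)))

# If S(q) <= 1, then <q, u> - R(u) <= 0 for all u, i.e. R*(q) = 0.
q_small = q / S_q
for u in rng.standard_normal((100, n)):
    assert q_small @ u - np.sum(np.abs(u)) <= 1e-9
```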
Assumption 3.2. We assume an approximate source condition of order r 1 ≥ 0 for the unknown, that is, we require the corresponding bound whenever β > 0 is small enough. We also require a similar condition of order r 2 ≥ 0 for the noise. Notice carefully that in the case when we do not have a strict source condition, the corresponding parameter r j must be strictly positive. Before proceeding, let us record a technical lemma (Lemma 3.3), whose proof follows by elementary variational calculus. Theorem 3.4. Let X be a Banach space and R(u) = ‖u‖ X . Suppose that Assumption 3.2 is satisfied with some orders r 1 , r 2 ≥ 0. Then for the choice α ≃ δ κ , with κ = (1+r 1 )(2+r 2 )/((2+r 1 )(1+r 2 )) for r 1 ≤ r 2 and κ = 1 for r 2 < r 1 , we obtain a convergence rate, which is of order δ 1/(1+r 1 ) in the case r 2 < r 1 .
Proof. Using Lemma 3.1 we can rewrite the error estimate in terms of S. Recall from Example 2.8 that the one-homogeneous case corresponds to the parameter θ = 0 in condition (2.14). The a-priori estimate in Proposition 2.1 holds for any γ ∈ (0, 1) and w ∈ Y . Now it follows from Theorem 2.9 and Assumption 3.2 that the corresponding bounds on the Bregman distance hold. By assuming that α ≃ δ κ with some κ > 0 we can combine these estimates; here r 3 = 2 − κ + r 2 (1 − κ) > 0 when κ ≤ 1. Now Lemma 3.3 yields an estimate in terms of δ and κ. Optimizing over κ we get κ = (1+r 1 )(2+r 2 )/((2+r 1 )(1+r 2 )) when r 1 ≤ r 2 , which gives us the stated convergence rate. In the case r 1 ≥ r 2 we choose κ = 1. Corollary 3.5. If in addition to the assumptions of Theorem 3.4 we assume an exact source condition for the unknown u † , i.e., r 1 = 0, we obtain an improved convergence rate.
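The optimization over κ in the proof is an instance of balancing two powers of δ. The following sketch (with hypothetical exponents p_, q_) illustrates why equalizing the exponents is optimal up to a constant factor.

```python
import numpy as np

# Generic parameter-balancing sketch (exponents p_, q_ are hypothetical): error
# bounds of the form
#   err(alpha) = alpha**p_ + delta**2 / alpha**q_
# are equalized by alpha ~ delta**(2 / (p_ + q_)).
p_, q_, delta = 2.0, 1.0, 1e-4
alpha_bal = delta ** (2.0 / (p_ + q_))

def err(a):
    return a ** p_ + delta ** 2 / a ** q_

# The balanced choice is optimal up to a constant factor over a fine grid.
grid = delta ** np.linspace(0.05, 1.95, 4001)
assert err(alpha_bal) <= 2.0 * min(err(a) for a in grid)
```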

Regularization by p-homogeneous functional for 1 < p < ∞
In this section we consider regularization by functionals of the type R(u) = 1/p ‖u‖ p X for 1 < p < ∞. Below, p, q ∈ (1, ∞) are conjugate exponents, i.e., 1/p + 1/q = 1. Here we utilize additional assumptions regarding the Banach space X. Let J p : X → X * denote the set-valued duality mapping J p (u) = {µ ∈ X * | ⟨µ, u⟩ X * ×X = ‖u‖ X ‖µ‖ X * and ‖µ‖ X * = ‖u‖ p−1 X }. A Banach space X is said to be p-convex if there exists a constant c p > 0 such that the corresponding lower estimate holds for all u, v ∈ X and all j p ∈ J p . Moreover, X is called p-smooth if there exists a constant G p > 0 such that the corresponding upper estimate holds for all u, v ∈ X and all j p ∈ J p . The basic consequences and properties of these geometrical assumptions are listed in [52]. For what follows, an important connection between the convexity and smoothness assumptions is given in [52, Thm 2.52]: X is p-smooth if and only if X * is q-convex; moreover, X is p-convex if and only if X * is q-smooth. Some examples of max{2, p}-convex and min{2, p}-smooth spaces are the sequence spaces ℓ p , Lebesgue spaces L p , and Sobolev spaces W m,p . Notice also that in this section we consider a p-smooth Banach space X for some p > 1. In that case it is well known (see [52, Remark 2.38]) that the duality mapping J p is single-valued. Next we define an alternative approximate source condition for the unknown and the noise in the case R(u) = 1/p ‖u‖ p X for 1 < p < ∞.
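The duality mapping above can be made concrete in the finite-dimensional ℓ p setting; the componentwise formula in the following sketch is the standard one for ℓ p and is our illustration, not taken from the text.

```python
import numpy as np

# Finite-dimensional sketch (standard l^p facts, our illustration): in X = l^p
# the duality mapping is single-valued with componentwise formula
#   J_p(u)_i = |u_i|^(p-2) u_i,
# which is the gradient of R(u) = (1/p) ||u||_p^p. We verify the two defining
# identities <J_p(u), u> = ||u|| * ||J_p(u)||_* and ||J_p(u)||_* = ||u||^(p-1).
p = 1.5
q = p / (p - 1)                      # conjugate exponent
rng = np.random.default_rng(3)
u = rng.standard_normal(6)

jp = np.abs(u) ** (p - 2) * u
norm_u = np.sum(np.abs(u) ** p) ** (1 / p)       # ||u||_p
norm_jp = np.sum(np.abs(jp) ** q) ** (1 / q)     # dual norm ||J_p(u)||_q

assert np.isclose(jp @ u, norm_u * norm_jp)
assert np.isclose(norm_jp, norm_u ** (p - 1))
```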
Assumption 3.6. We assume an approximate source condition of order r 1 ≥ 0 for the unknown, i.e., we require the corresponding bound when β > 0 is small enough. We also require a similar condition of order r 2 ≥ 0 for the noise.

Case 1 < p < 2

Theorem 3.7. Suppose that the Banach space X is p-smooth and 2-convex and R(u) = 1/p ‖u‖ p X for some 1 < p < 2. Moreover, suppose that Assumption 3.6 is satisfied with some orders r 1 , r 2 ≥ 0 and r 1 < 1. Then the convergence rate holds for the choice α ≃ δ κ , where κ = ν 1 ν 2 /(ν 1 ν 2 + q(r 2 − r 1 )) for r 1 ≤ r 2 and κ = 1 for r 2 < r 1 < 1.
Above we have denoted ν i = 2 + r i (q − 2) and q = p/(p − 1). For the constant C p we have C p → ∞ as p → 2.
Proof. We can apply the Xu-Roach inequality II [52, Thm. 2.40 (b)] in X to obtain a lower estimate for the Bregman distance. By applying the trivial upper bound max{‖u‖ X , ‖v‖ X } ≤ ‖u‖ X + ‖v‖ X and the a-priori bound given in (2.10) for any γ ∈ (0, 1), we can invoke Theorem 2.9 with s = 2/(2 − p). Since R(u) = 1/p ‖u‖ p X , we directly obtain from Assumption 3.6 the following estimates, where we have set t i = (q − 1)r i ≥ 0. By assuming further that α ≃ δ κ for some κ > 0, we can reduce the two uppermost estimates to e δ 2 , αγ/δ (η) ≲ δ 1−r 2 +(1−κ)t 2 and e α,ζ 1 (µ † ) ≲ δ κ(1−r 1 ) ζ 1 −t 1 .
When r 1 ≤ r 2 the above expression is minimized at the stated choice of κ. In order to have convergence we have to assume r 1 < 1. If r 2 < r 1 < 1 the optimal convergence rate is achieved for κ = 1. Above, the constant C p → ∞ as p → 2. Note that with the chosen κ the assumption r 3 ≥ 0 is always satisfied when r 1 < 1.
Further, notice that assuming an exact source condition on the noise leads to the standard convergence rate of O(δ) in the classical setting [52].

Next by Xu-Roach inequality IV [52, Thm. 2.42] we obtain
where we have considered the inequality in X * which is 2-smooth by assumption.
Combining the two inequalities above yields the claim.

Case p = 2

Finally we simplify the estimates in the quadratic case: Theorem 3.10. Suppose that X is a Banach space and R(u) = 1/2 ‖u‖ 2 X . Moreover, suppose that Assumption 3.6 is satisfied with some orders r 1 , r 2 ≥ 0 and r 1 < 1. For the choice α ≃ δ κ , where κ = 2/(2 + r 2 − r 1 ) for r 1 ≤ r 2 and κ = 1 for r 2 < r 1 < 1, we get convergence, with rate δ 1−r 1 for r 2 < r 1 < 1.
Proof. Recall from Example 2.7 that the case R(u) = 1/2 ‖u‖ 2 X corresponds to the parameter θ = 1 and C θ (u δ α , u † ) = 1/2 in condition (2.14). Hence the second part of Theorem 2.9 applies, where ζ 1 + (δ/α)ζ 2 < 2 is required in Σ. If we choose ζ 1 = c < 1 and ζ 2 = α/δ we obtain an estimate for which we need to assume r 1 < 1. The convergence is optimized by κ = 2/(2 + r 2 − r 1 ) when r 1 ≤ r 2 . If r 1 > r 2 then we choose κ = 1, which gives us the stated convergence. Corollary 3.11. If we assume that u † fulfills the exact source condition, i.e., r 1 = 0, we obtain the corresponding improved convergence rate.

Case p > 2

Theorem 3.12. Suppose that X is a p-convex Banach space with some p > 2 and R(u) = 1/p ‖u‖ p X . Moreover, suppose that Assumption 3.6 is satisfied with some orders r 1 , r 2 ≥ 0 and r 1 < 1. For the choice α ≃ δ κ , where κ = 2/(2 + r 2 − r 1 ) for r 1 ≤ r 2 and κ = 1 for r 2 < r 1 < 1, we have convergence of order C p δ 1−r 1 for r 2 < r 1 .
Proof. Below we assume that X is a p-convex Banach space with some p > 2 and R(u) = 1/p ‖u‖ p X .
We can give an alternative definition of the generalised Bregman distance in terms of R and the duality mapping, where µ u ∈ ∂R(u). We then get the same kind of estimate for the Bregman distance as in [3], where the last step is given by the Xu-Roach inequalities [57]. The symmetric Bregman distance given by (3.9) coincides with our previous definition (2.2) for any µ u ∈ ∂R(u) and µ v ∈ ∂R(v). Hence we get an estimate showing that (2.14) holds with θ = 1 and C θ (u, v) = C p . Consequently, for p > 2 we obtain the same convergence rate as in the case p = 2.
Remark 3.13. It is straightforward to see that a polynomial decay of the distance function [52] implies an approximate source condition as in Assumption 3.6. Suppose the distance function decays polynomially with exponent k > 0; this yields an estimate with exponent depending on kq + 2. Choosing k = 2(1 − r 1 )/(r 1 q), where r 1 ∈ (0, 1), we see that the last estimate above can be written in a form corresponding to the estimate given by Assumption 3.6 and (3.8).

Hilbert Space Embedding
Since many estimates are crucially simplified by using Hilbert space structures, we discuss in the following an approach to obtain (possibly suboptimal) rates deduced from the results above using embeddings. Hence we consider the case where R is the p-th power of a norm in a Banach space, with p ≥ 1, and there exists a continuous embedding into a Hilbert space X 0 . Indeed, we can assume the slightly weaker condition (3.11) for all u ∈ X. Note that by extending R as infinite outside X we can also state the same condition for arbitrary u ∈ X 0 . Obviously the case p = 1 is of particular interest here, covering e.g. total variation regularization (with the obvious embedding into L 2 for dimension less than or equal to two) and sparsity regularization (with the obvious embedding of ℓ 1 into ℓ 2 ). In order to reduce to a Hilbert space framework, we assume that K can be extended to X 0 and maps this space continuously to Z. Thus, L = K * K is a bounded self-adjoint operator on X 0 and hence has a spectral decomposition. In particular, we can formulate smoothness of a vector ϑ ∈ X 0 via the condition ϑ = L µ ω (3.12) for ω ∈ X 0 and some µ ∈ (0, 1/2). We then use the relation e α,ζ (ϑ) = − inf v∈X F α,ζ (v; ϑ) and estimate F α,ζ (v; ϑ) from below. For this sake we use (3.11) and (3.12); together with the interpolation inequality and Young's inequality it is a straightforward estimate to obtain an upper bound for e α,ζ (ϑ) with some constant C independent of v, ζ, and α. We mention the corresponding estimate (3.14) for the case p = 1.

Examples with Random Noise

Frequentist framework
Let us recall that our work above towards unbounded noise was mostly motivated by random noise models, especially the statistics of white noise. It is hence natural to reinterpret the results of Theorem 2.9 as pointwise estimates for a random variable U δ α , which arises due to the randomness of the noise N . In the frequentist setting one is interested in the model

F δ = Ku † + δN, (4.1)

where the data F δ is generated by a deterministic true solution u † . In (4.1) the measurement F δ = F δ (ω) and the noise N = N (ω) are thought of as random variables; here ω ∈ Ω is an element of a complete probability space (Ω, Σ, P). Following the idea in the earlier sections, we consider a general frequentist risk, denoted by E B , between the estimator U δ α = U δ α (ω) and u † , where our error measure is given by the Bregman distance. From the previous section we directly obtain a bound involving H(ω) = K * N (ω). A canonical example of the frequentist risk (4.2) is the mean integrated squared error (MISE), where a quadratic regularization term R(u) = ‖u‖ 2 X is assumed. Convergence rates of the MISE have been widely studied in the literature, see [10,12]. We observe that a finite estimate can only be obtained if E(e δ,ζ (H)) < ∞ at least for some ζ > 0. Under the typical choices of R, the finiteness for any δ and ζ is obtained under a condition that can be interpreted as an abstract smoothing condition for the operator K; as we shall see, it can be identified with K being a trace-class operator.
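The trace-class interpretation can be illustrated numerically; the following finite-dimensional Monte Carlo setup is our sketch, not the paper's construction.

```python
import numpy as np

# Finite-dimensional Monte Carlo sketch (setup is ours): for white noise
# N = (N_1, ..., N_m) with i.i.d. N(0,1) entries, the expectation
#   E ||K^T N||^2 = trace(K K^T)
# is finite precisely when K K^T has finite trace, so finiteness of such
# expectations is a trace-class type condition on the operator.
rng = np.random.default_rng(5)
K = rng.standard_normal((15, 10)) / 4
trace_KKt = np.trace(K @ K.T)

samples = rng.standard_normal((20000, 15))          # white-noise draws N
mc = np.mean(np.sum((samples @ K) ** 2, axis=1))    # Monte Carlo E ||K^T N||^2
assert abs(mc - trace_KKt) / trace_KKt < 0.1
```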
In order to choose optimal parameters we first have to clarify which of them are random. Since ζ 2 is an auxiliary parameter appearing only in the estimates, not affecting any computation, it can be optimized in dependence on H and hence becomes a random variable itself. The situation is less obvious with respect to α. Indeed, it turns out that this question is exactly related to the issue of a-priori vs. a-posteriori parameter choice in the deterministic setup (cf. [23]). The a-priori parameter choice α = α(δ) leads to a parameter independent of the noise realization N , while the a-posteriori parameter choice α = α(δ, F ) makes the parameter a random variable depending on N . Since the specific choices of α rely on the form of the regularization functional, we shall further investigate the general risk (4.2) in three very prominent cases: classical Tikhonov regularisation (two-homogeneous R), the more general regularisation with a Besov penalty, and the popular total variation regularisation.

Gaussian case
Let us review the implications of our results in the canonical special case of a squared-norm regularization penalty $R(u) = \frac{1}{2}\|u\|_X^2$ for $X = Y = L^2(\mathbb{T}^d)$. We assume that $N$ has generalized white noise statistics in $\mathcal{D}'(\mathbb{T}^d)$, that is, $\mathbb{E}N = 0$ and
$$\mathbb{E}\big(\langle N, \varphi\rangle_{\mathcal{D}'\times\mathcal{D}}\,\langle N, \psi\rangle_{\mathcal{D}'\times\mathcal{D}}\big) = \langle \varphi, \psi\rangle_{L^2(\mathbb{T}^d)}$$
for any test functions $\varphi, \psi \in C^\infty(\mathbb{T}^d)$, where $\langle\cdot,\cdot\rangle_{\mathcal{D}'\times\mathcal{D}}$ denotes the duality pairing. It is well-known that the realizations of $N$ belong to $Z^* = H^{-d/2-\epsilon}(\mathbb{T}^d)$ almost surely for any $\epsilon > 0$; for a sharp result, see [55]. We want to concentrate on the phenomena appearing due to large noise and hence assume an exact source condition for the true unknown $u^\dagger$ in the following.
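The defining covariance identity of white noise can be verified numerically in a truncated orthonormal basis. The sketch below is our own finite-dimensional illustration: test functions are represented by their coefficient vectors, so the duality pairing becomes a Euclidean dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
J, M = 32, 100_000        # basis truncation and number of Monte Carlo samples

# two test functions, represented by coefficient vectors in an orthonormal basis
phi = rng.standard_normal(J); phi /= np.linalg.norm(phi)
psi = rng.standard_normal(J); psi /= np.linalg.norm(psi)

# truncated white noise N = sum_j N_j psi_j with N_j ~ N(0, 1) i.i.d.
N = rng.standard_normal((M, J))

# empirical covariance E <N, phi><N, psi> against the exact L^2 pairing <phi, psi>
emp_cov = np.mean((N @ phi) * (N @ psi))
print(emp_cov, phi @ psi)
```

The empirical covariance matches the $L^2$ inner product up to Monte Carlo error, consistent with the identity above.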
In this example, two factors simplify our analysis remarkably. First, the symmetric Bregman distance coincides with the squared norm (as discussed in Example 2.7), $D^{\mathrm{sym}}_R(u, v) = \|u - v\|_{L^2(\mathbb{T}^d)}^2$. Secondly, the term $e_{\alpha,\zeta}(H)$ can be explicitly estimated. Let us record the following short calculation as a lemma. For precise notation, let us denote $R_\beta = (K^*K + \beta I)^{-1}$, $\beta > 0$, to highlight the restriction of $K^*$ (and $K$) to $X = Y = L^2(\mathbb{T}^d)$.
Lemma 4.1. Consider $K$ as a bounded linear operator $K : L^2(\mathbb{T}^d) \to L^2(\mathbb{T}^d)$; then the trace identity (4.5) holds.

Proof. The minimizing estimator of problem (4.4) is given by $W_{\alpha\zeta} = K R_{\alpha\zeta} H$. Hence we can write the expectation accordingly, where $t > d/2$. It is well-known that $N$, as white noise, has a series representation $N = \sum_{j=1}^\infty N_j \psi_j$ almost surely, where the $N_j \sim \mathcal{N}(0,1)$ are i.i.d. and $\{\psi_j\}_{j=1}^\infty$ constitutes any orthonormal basis of $L^2(\mathbb{T}^d)$. The claim (4.5) now follows easily by applying the series representation together with the independence of $N_i$ and $N_j$ for $i \neq j$.
Let us mention that the quantity on the right-hand side of the estimate (4.5) is known as the effective dimension in the literature (cf. [58]). In the finite-dimensional case it lies between zero (as $\alpha\zeta \to \infty$) and the rank of $KK^*$ (as $\alpha\zeta \to 0$). In the following we use a conservative estimate of the effective dimension in order to illustrate the results; optimal estimates can be achieved under special assumptions, which is beyond our scope (cf. e.g. [40]). Our analysis in the non-Gaussian case indicates that $\mathbb{E}e_{\alpha,\zeta}(H)$ is the basis for understanding a generalization of the effective dimension to such settings; its analysis is a potentially important question for future research.
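To make the effective dimension tangible, here is a small finite-dimensional sketch (our own illustration, with a generic random matrix standing in for $K$ and $\lambda$ playing the role of $\alpha\zeta$). It verifies by Monte Carlo that $\mathbb{E}\,\langle n, KK^*(KK^* + \lambda I)^{-1} n\rangle = \mathrm{Tr}\big(KK^*(KK^* + \lambda I)^{-1}\big)$ for $n \sim \mathcal{N}(0, I)$, and that this quantity approaches the rank of $KK^*$ as $\lambda \to 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, M = 8, 12, 400_000
K = rng.standard_normal((m, d)) / np.sqrt(d)     # generic full-rank forward matrix
lam = 0.1                                        # plays the role of alpha * zeta

A = K @ K.T
B = A @ np.linalg.inv(A + lam * np.eye(m))       # KK*(KK* + lam I)^(-1)

# Monte Carlo: E <n, B n> with n ~ N(0, I) equals the effective dimension Tr(B)
n = rng.standard_normal((M, m))
mc = np.mean(np.sum((n @ B) * n, axis=1))
print(mc, np.trace(B))

# the effective dimension tends to rank(KK*) = m as lam -> 0
print(np.trace(A @ np.linalg.inv(A + 1e-8 * np.eye(m))))
```

The interpolation between 0 and the rank is exactly the behavior described above for the right-hand side of (4.5).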
Theorem 4.2. When the true unknown $u^\dagger$ fulfills the exact source condition $\mu^\dagger = K^* w^\dagger$, where $w^\dagger \in L^2(\mathbb{T}^d)$, we obtain the convergence rate in (4.11) for $\alpha \simeq \delta^{2/3}$.

Proof. Considering (1.4) we now have $u \in \partial R(u)$. Therefore, we can write the optimality condition (4.6), where $H = K^* N$ and $u^\dagger = \mu^\dagger = K^* w^\dagger$ for $w^\dagger \in L^2(\mathbb{T}^d)$. Taking the duality product of $U_\alpha^\delta - u^\dagger$ with equation (4.6) yields (4.7). We estimate the right-hand side terms separately. Following the idea behind the estimate (2.13) we bound the first term on the right-hand side of (4.7) for any $\zeta_2 > 0$, and we bound the last term in (4.7) correspondingly. Since $w = W(\omega) \in L^2(\mathbb{T}^d)$ is arbitrary, using the estimates above we arrive at (4.9), where the last estimate is obtained by choosing $\zeta_2 = \delta/\alpha$. In order to derive a convergence rate we point out that $R_\beta$ is a self-adjoint, semipositive-definite bounded linear operator. Since $K^*$ is a Hilbert-Schmidt operator, we have $\mathrm{Tr}_{L^2(\mathbb{T}^d)}(KK^*) < \infty$. Now it follows from equations (4.9) and (4.10) that the bound (4.11) holds. The bound in (4.11) is optimized by choosing $\alpha \simeq \delta^{2/3}$, which also yields the claim.
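The parameter balancing behind $\alpha \simeq \delta^{2/3}$ can be made explicit with a schematic bound of the form $f(\alpha) = c_1\alpha + c_2\delta^2/\alpha^2$; this is our stand-in for the structure of (4.11), with placeholder constants. Setting $f'(\alpha) = 0$ gives $\alpha_* = (2c_2\delta^2/c_1)^{1/3} \propto \delta^{2/3}$, and the optimal value of the bound also scales like $\delta^{2/3}$.

```python
import numpy as np

# schematic bound f(alpha) = c1 * alpha + c2 * delta^2 / alpha^2
# (source term plus stochastic term, mimicking the structure of (4.11))
c1, c2 = 1.0, 1.0
deltas = np.logspace(-6, -2, 5)
alpha_star = (2 * c2 * deltas**2 / c1) ** (1 / 3)    # stationary point of f
f_star = c1 * alpha_star + c2 * deltas**2 / alpha_star**2

# the optimal bound scales as delta^(2/3): fit the slope on a log-log plot
slope = np.polyfit(np.log(deltas), np.log(f_star), 1)[0]
print(slope)   # -> 2/3
```

The fitted log-log slope is exactly $2/3$, reproducing the rate exponent of the theorem for this schematic bound.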
From the previous theorem we see that the assumption of a finite trace of $KK^* : L^2(\mathbb{T}^d) \to L^2(\mathbb{T}^d)$ is indeed equivalent to the condition
$$\mathbb{E}(e_{\delta,\gamma}(H)) < \infty \qquad (4.12)$$
for some $\delta, \gamma > 0$, as well as to a condition which appears to be a natural requirement. One can obtain better convergence rates if a faster decay of the eigenvalues of $KK^*$ can be verified.
Theorem 4.3. Suppose, in addition, that $KK^*$ satisfies the stronger trace condition $\mathrm{Tr}_{L^2(\mathbb{T}^d)}((KK^*)^m) < \infty$ for some $0 < m \leq 1$. Then, when the true unknown $u^\dagger$ fulfills the exact source condition $\mu^\dagger = K^* w^\dagger$, where $w^\dagger \in L^2$, it follows for $\alpha \simeq \delta^\kappa$, where $\kappa = \frac{2}{2+m}$, that we obtain the corresponding convergence rate.

Proof. Suppose $p$ and $q$ are Hölder conjugates such that $m = \frac{1}{q}$. Applying Young's inequality to (4.14) yields an estimate for the trace term. Therefore, by equation (4.9), the frequentist risk is bounded accordingly. The proof is concluded by optimizing $\alpha \simeq \delta^\kappa$.
As mentioned before, the mean integrated squared error (MISE) of an estimator $U_\alpha^\delta$ is defined as the expected squared $L^2$-error $\mathbb{E}\|U_\alpha^\delta - u^\dagger\|_{L^2}^2$. The minimax risk $r_\delta(H^r(\mathbb{T}^d), M)$ on the Sobolev space $H^r(\mathbb{T}^d)$ is then given by taking the infimum over all estimators of the form $U_\alpha^\delta = g(F^\delta)$, where $g$ belongs to the set of Borel measurable functions from $H^{-d/2-\epsilon}(\mathbb{T}^d)$ to $H^r(\mathbb{T}^d)$. Next we compare the convergence results of Theorems 4.2 and 4.3 to the known minimax convergence rates for the same problems.

Remark 4.4. As an example of a class of operators that fulfills the conditions of Theorem 4.3 we can take bijective elliptic pseudodifferential operators that are $t > \frac{d}{2m}$ orders smoothing (where $m = 1$ in the case described in Theorem 4.2), e.g. $K = (I - \Delta)^{-t/2}$. We assume the exact source condition of Theorems 4.2 and 4.3, that is, $u^\dagger = \mu^\dagger = K^* w^\dagger$ with $w^\dagger \in L^2$; hence we can conclude $u^\dagger \in H^r(\mathbb{T}^d)$ with $r = t$. This means we do not assume extra smoothness of $u^\dagger$, but rather that the smoothness of the unknown and the order of smoothing of the forward operator coincide.
Since $r = t$ we can rewrite the convergence rate $\kappa$ in terms of $r$. Note that $tm = d/2 + \epsilon$ and hence the convergence rates achieved in Theorems 4.2 and 4.3 agree, up to an arbitrarily small $\epsilon > 0$, with the minimax convergence rate, see e.g. [11,33].

Besov prior
Suppose that the functions $\{\psi_\ell\}_{\ell=1}^\infty$ form an orthonormal wavelet basis for $L^2(\mathbb{T})$ on the one-dimensional torus $\mathbb{T}$, where we have utilized a global indexing. We can characterize the periodic Besov space $B^s_{pq}(\mathbb{T})$ using the given basis in the following way: the series $u = \sum_\ell u_\ell \psi_\ell$ belongs to $B^s_{pq}(\mathbb{T})$ if and only if its coefficients satisfy the summability condition (4.17). We assume that the basis is $r$-regular for $r$ large enough in order to provide a basis for a Besov space with smoothness $s$ [16]. Here we are concerned with the special case $p = q = 1$ and use the abbreviation $B^s_p = B^s_{pp}$. It is well-known that an equivalent norm to (4.17) is given by a weighted $\ell^1$-norm of the wavelet coefficients. Let us then explain the framework for our analysis. Suppose that $X = Y = L^2(\mathbb{T})$ with orthonormal basis $\{\psi_\ell\}_\ell$. The noise $N$ is assumed to have the same statistics as in the previous section.
Here, we consider a regularization term given by the $B^s_1(\mathbb{T})$-norm (4.19) for $s \geq 1$, where $u = \sum_{\ell=1}^\infty u_\ell \psi_\ell$. Similarly, the dual norm in (3.1) is simply the norm of $B^{-s}_\infty(\mathbb{T})$. Notice carefully that for parameters $s \geq 1$ the functional $R$ satisfies conditions (R1)-(R4) with the weak topology, since there is a continuous embedding from $B^s_1(\mathbb{T})$ to $L^2(\mathbb{T})$. In addition, we make an assumption on the smoothness of $K$ and $K^*$ by requiring that there exist a constant $C > 0$ and $t > \frac{1}{2}$ such that the mapping bound (4.20) holds (and similarly for $K^*$) for $r \in \mathbb{R}$ and $\psi \in B^r_2(\mathbb{T})$. Under the given assumptions we can write (recall equation (3.5)) the error decomposition with a fixed realization $\eta = K^* n = K^* N(\omega)$.

Lemma 4.5. Let us assume that $K : L^2(\mathbb{T}) \to L^2(\mathbb{T})$ satisfies condition (4.20) with a parameter $t$ such that $\min(s, 2t) > 1$, and that $R$ is defined by (4.19). Then the estimate (4.21) holds.

Proof. From condition (4.20) it follows that $K^* N \in B^{-s}_\infty(\mathbb{T})$ almost surely. The condition $w \in W$ for a fixed realization $n = N(\omega)$ is equivalent to a coefficient-wise bound holding uniformly for all $\ell \in \mathbb{N}$. Now $(K^* N)_\ell = \langle K^* N, \psi_\ell\rangle$ is a normally distributed random variable with zero mean and variance $\sigma_\ell^2$ controlled via (4.20). Therefore, the expectation is bounded by a sum which, due to our assumptions on $s$ and $t$, converges. The sum can further be estimated by a change of variable in the resulting integral, which yields the claim.

Notice that by Lemma 4.5 and the assumption $\alpha = \delta^\kappa$, $\kappa \leq 1$, we obtain the bound (4.22) for a constant $\gamma$, where we denote $s' = \frac{2}{2s+2t-1} > 0$ for convenience. Therefore, we see as in the proof of Theorem 3.4 that the risk is bounded accordingly. The convergence rate is minimized for $\kappa$ satisfying the corresponding balancing condition. When $r_1 \leq \frac{2}{2s+2t-1}$ we conclude that
$$\kappa = \frac{1+r_1}{2+r_1} \cdot \frac{2+s'}{1+s'} = \frac{1+r_1}{2+r_1} \cdot \frac{4s+4t}{2s+2t+1}.$$
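For concreteness, the weighted-$\ell^1$ structure of the $B^s_1(\mathbb{T})$ penalty and its $B^{-s}_\infty$ dual can be sketched on coefficient vectors. The weights below follow the standard wavelet characterization in $d = 1$ with global index $\ell \sim 2^j$, so the level-$j$ weight $2^{j(s-1/2)}$ becomes $\ell^{s-1/2}$; exact constants depend on normalization conventions, so this is an illustration only, not the paper's precise norm.

```python
import numpy as np

def besov_norm(coeffs, s):
    """Equivalent B^s_{1,1}(T) norm: sum_l l^(s-1/2) |u_l| (global wavelet index)."""
    l = np.arange(1, len(coeffs) + 1, dtype=float)
    return float(np.sum(l ** (s - 0.5) * np.abs(coeffs)))

def besov_dual_norm(coeffs, s):
    """Dual B^{-s}_inf(T) norm: sup_l l^(1/2-s) |v_l|."""
    l = np.arange(1, len(coeffs) + 1, dtype=float)
    return float(np.max(l ** (0.5 - s) * np.abs(coeffs)))

rng = np.random.default_rng(2)
u = rng.standard_normal(256) * np.arange(1, 257, dtype=float) ** (-2.0)  # decaying coefficients
v = rng.standard_normal(256)

# Hoelder-type duality: |<u, v>| <= ||u||_{B^s_1} * ||v||_{B^{-s}_inf}
s = 1.0
print(abs(u @ v) <= besov_norm(u, s) * besov_dual_norm(v, s))  # True
```

The duality inequality holds term by term, which is exactly the mechanism by which the dual $B^{-s}_\infty$-norm of $K^*N$ enters the estimates above.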
Remark 4.7. We point out that minimax rates for linear statistical inverse problems in a wavelet basis have been studied for estimators based on Galerkin methods and non-linear thresholding algorithms (see [18,13,34] and references therein). In the first two papers the authors construct a finite-dimensional estimator $u_\epsilon$ for any $\epsilon > 0$ such that the forward operator $K$ is $t$ times smoothing (similar to (4.20)) and the rate (4.23) is obtained. Such rates are also known to be optimal [13]. Compared to (4.23) our method builds upon a more general source condition: we do not necessarily require that the true solution is in the range of $K^*$ or that it has additional smoothness. However, there is an interplay between the smoothness of $K$ and our source condition. In addition, the rate in (4.22) is achieved in an $L^2$-norm, whereas the symmetric Bregman distance of the $B^s_1$-norm in Theorem 4.6 is a pseudo-norm. It remains future work to provide a more thorough comparison between the Galerkin approach and the method presented here.

Total Variation-type Regularization
In the following we discuss the case of total variation regularization (4.24), or related regularizations such as infimal convolutions with higher-order total variation (cf. [8] and references therein), in the case of spatial dimension $d \leq 2$, where there is an embedding into $X_0 = L^2(\mathbb{T}^d)$. Thus, it is natural to use the Hilbert space embedding in this case. We assume that $K$ can be extended to a $t > d/2 + \epsilon$ times smoothing bijective bounded linear operator in the Sobolev scale. We also assume that $N$ is white noise taking values in $H^{-d/2-\epsilon}$ as in the previous sections. We use the estimate (3.14) for realizations of $H = K^* N$ (noting $L = K^* K$) to obtain an estimate for the expectation $\mathbb{E}(e_{\delta,\zeta}(H))$. Subsequently one could use reasoning similar to the previous section, respectively Section 3.1, to obtain full rates, which we leave to the reader.
The key question for the finiteness of the expectation of $e_{\delta,\zeta}(H)$ is the choice of $\nu$. Note that by Fernique's theorem any moment of white noise is finite in $H^{-d/2-\epsilon}(\mathbb{T}^d)$ for any $\epsilon > 0$ [14]. Thus, we do not need to worry about the exponent $\frac{1}{\nu}$ in the expectation, but rather optimize $\nu$. With the above smoothing assumptions, we see that $K^*$ maps $H^{-d/2-\epsilon}(\mathbb{T}^d)$ into $H^{t-d/2-\epsilon}(\mathbb{T}^d)$. For $K$ being the inverse of a translation-invariant differential or pseudodifferential operator one obtains that $L^\nu : L^2(\mathbb{T}^d) \to H^{2t\nu}(\mathbb{T}^d)$. In the following we write $K \in \Psi^\rho$ for a pseudodifferential operator $K$ if its symbol is in $S^\rho(\mathbb{T}^d; \mathbb{T}^d)$ [54]. The condition above means that we should choose $\nu = \frac{t-d/2-\epsilon}{2t}$. As a specific example consider the pseudodifferential operator $K = (-\Delta + I)^{-1}$. Then $K$ is a twice smoothing bijective operator between $H^{-d/2-\epsilon}(\mathbb{T}^d)$ and $H^{2-d/2-\epsilon}(\mathbb{T}^d)$, which gives us $\nu = \frac{2-d/2-\epsilon}{4}$, i.e., one can choose $\nu$ arbitrarily close to $\frac{4-d}{8}$.

Theorem 4.8. Let us assume that $K \in \Psi^{-t}$, where $t > d/2 + \epsilon$ with some $\epsilon > 0$, that is, $K$ is a $t$ orders smoothing pseudodifferential operator. Let the regularization functional $R$ be defined by (4.24) and let $\mu^\dagger$ satisfy the approximate source condition of order $r_1 \geq 0$ in Assumption 3.2. Then for the choice $\alpha \simeq \delta^\kappa$ with a suitable $\kappa$ the convergence rate holds, where the terms $M_1$ and $M_2$ are given in equation (4.21). Similarly, we have for a constant $\gamma$ and $\alpha \simeq \delta^\kappa$, $\kappa \leq 1$, that
$$\mathbb{E}\,e_{\delta, \alpha\gamma/\delta}(H) \lesssim \gamma^{2-\frac{1}{\nu}}\, \delta^{\kappa + (\frac{1}{\nu}-1)(1-\kappa)} \lesssim 1.$$
For $r_1 \geq \frac{d+2\epsilon}{t}$ we choose $\kappa = 1$ and consequently the claim holds.
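The arithmetic for the exponent $\nu$ can be tabulated directly. The sketch below is pure bookkeeping, using only the formula $\nu = (t - d/2 - \epsilon)/(2t)$ derived above; it confirms that for the example $K = (-\Delta + I)^{-1}$ (so $t = 2$) one can choose $\nu$ arbitrarily close to $(4-d)/8$.

```python
# nu = (t - d/2 - eps) / (2t): the exponent derived above
def nu(t, d, eps):
    return (t - d / 2 - eps) / (2 * t)

t = 2.0  # K = (-Laplace + I)^(-1) is twice smoothing
for d in (1, 2):                       # dimensions with an embedding into L^2
    for eps in (1e-1, 1e-3, 1e-6):
        print(d, eps, nu(t, d, eps))   # tends to (4 - d)/8 as eps -> 0
```

For $d = 1$ the limit is $3/8$ and for $d = 2$ it is $1/4$, monotonically approached from below as $\epsilon \to 0$.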

Outlook to the Bayesian approach
In the Bayesian approach to inverse problems the model equation (1.2) is often written in the form
$$F^\delta = KU + \delta N \qquad (5.1)$$
where, in addition to the observational random noise $N$, we describe our prior beliefs about the unknown in terms of the probability distribution of a random variable $U : \Omega \to X$. The solution to the inverse problem is then the probability distribution of $U$ conditioned on a measurement outcome $F^\delta$. The posterior distribution thus provides a means for uncertainty quantification. The analysis of the small-noise limit, in the Bayesian case also known as the theory of posterior consistency, has attracted a lot of interest in the last decade. Posterior convergence rates were first studied in [24,53]. In those two papers Gaussian noise and prior are assumed, and the interest is in the convergence of the approximate solution $U_\alpha^\delta$, generated by a 'true' $u^\dagger$, to the same truth $u^\dagger$. Similar convergence is further studied e.g. in the papers [1,15,35,48,56]. In [36,37] a Bayes cost similar to (5.2) is considered in the Gaussian case.
A widely used approach to extract information from the posterior distribution is to find the so-called maximum a posteriori (MAP) estimator. In finite-dimensional problems, the MAP estimate maximizes the posterior probability density function and is, loosely speaking, the most probable solution to the problem (5.1). In the infinite-dimensional case, the MAP estimator is less well understood. In certain probabilistic models, the MAP estimate is known to minimize a problem of type (1.3). We refer to our earlier work in [27,28] and to other authors in [15,19] for more discussion of the topic. We point out that, in general, the connection between the estimator induced by (1.3) and the MAP estimate is not well-established. Despite this deficit, understanding the Bayes cost in such a case based on the Bregman distance would be highly interesting for practical problems.
Our results in Theorem 2.9 now directly yield that
$$\mathbb{E}_{N,U}\big(D^{\mu^\delta_\alpha,\mu}_R(U^\delta_\alpha, U)\big) \le \mathbb{E}_N \mathbb{E}_U \inf_{(\zeta_1,\zeta_2)\in(\mathbb{R}^+)^2}\left[\Big(\zeta_1 + \frac{\delta}{\alpha}\zeta_2\Big)^{1/(1-\theta)} C_\theta(U^\delta_\alpha, U)^{1/(1-\theta)} + \frac{1}{1-\theta}\, e_{\alpha,\zeta_1}(M) + \frac{\delta}{\alpha(1-\theta)}\, e_{\delta,\zeta_2}(H)\right],$$
where $M : \Omega \to X^*$ formally satisfies $M(\omega) \in \partial R(U(\omega))$ and $H = K^* N$. The Bayes cost for the MAP estimate, however, is not a straightforward matter, since the subgradient set $\partial R(U)$ is not necessarily well-defined. Consider a Gaussian prior $U$ in a Hilbert space $X$ with zero mean and covariance $C_U : X \to X$. In such a case, the functional $R$ induced by the prior satisfies $R(u) = \frac{1}{2}\|C_U^{-1/2} u\|_X^2$. It is known from the earlier work [37] by the last author that in the Gaussian setting the Bregman-distance-based Bayes cost can be estimated using a weaker norm than the one induced by the prior. Hence an intriguing question for future work is to characterize the functionals $R$ for which the Bayes cost (and the bound) in (5.2) makes sense.
Let us finally comment that in a purely Bayesian approach the prior information should be independent of the measurement $F^\delta$. For instance, the MAP estimate of problem (1.2) for a $\delta$-independent prior and a noise distribution $\delta N$ with white noise $N$ formally corresponds to an estimator (1.3) in which $\alpha$ is replaced by $\alpha\delta^2$ for a constant $\alpha$. In the literature this principle is occasionally omitted and general a-priori rules $\alpha = \alpha(\delta)$ are considered. Such an approach, resembling the frequentist method, leads to 'priors' that are scaled with respect to the noise level $\delta$ and hence no longer independent of the measurement. With a general $\alpha(\delta)$ the minimization problem (1.3) cannot be seen as a proper MAP estimate. However, it is a useful estimator to study, since with constant $\alpha$ we often do not get convergence in the original space.