A semismooth Newton method with analytical path-following for the H^1-projection onto the Gibbs simplex

An efficient, function-space-based second-order method for the H^1-projection onto the Gibbs simplex is presented. The method makes use of the theory of semismooth Newton methods in function spaces as well as Moreau-Yosida regularization and techniques from parametric optimization. A path-following technique is considered for the regularization parameter updates. A rigorous first- and second-order sensitivity analysis of the value function for the regularized problem is provided to justify the update scheme. The viability of the algorithm is then demonstrated for two applications found in the literature: binary image inpainting and labeled data classification. In both cases, the algorithm exhibits mesh-independent behavior.


Introduction
In many modern applications such as image inpainting [2,9], topology optimization [4,5], multiclass data segmentation [12], and multiphase fluid dynamics [24,22], the associated variational problems contain multiple phase-field functions, denoted here by u_1, . . . , u_n, that fulfill the conditions u_i ≥ 0 and Σ_{i=1}^n u_i = 1 (in a pointwise almost everywhere (a.e.) sense). This typically leads to situations in which one needs to project onto the so-called Gibbs simplex in the Sobolev space H^1(Ω). Mathematically speaking, this involves the solution of the following infinite-dimensional optimization problem:

min_{u ∈ G} ½‖u − ϕ‖²_V,   (P)

where G, given by

G := { u ∈ V : 1^T u = 1, u_i ≥ 0, i = 1, . . . , n },

is the Gibbs simplex. Here, we let Ω ⊂ R^n be non-empty, open, and bounded with Lipschitz boundary Γ, and set V := H^1(Ω) and V := V^n. In addition, we fix some ϕ ∈ V and note that 1 is the column vector of ones in R^n. The conditions on the functions u in G are always understood pointwise a.e. Moreover, since Ω is bounded, G is in fact a non-empty, closed, and convex subset of H^1(Ω)^n ∩ L^∞(Ω)^n, which is also bounded in L^∞(Ω)^n.
Although the constraint set G is formulated with pointwise a.e. constraints, it is crucial that the correct norm, in this case the H^1-norm, is used in the projection. Indeed, let Γ be smooth, n = 1, and ϕ ∈ H^2(Ω) such that ∇ϕ · n|_Γ = 0. Then (P) reduces to solving an obstacle problem in H^1(Ω) with data f = −∆ϕ + ϕ ∈ L²(Ω). It is well known, cf. [8,18], that the unique optimal solution u lies in H^2(Ω). Furthermore, the pointwise/thresholding/L²-projection of ϕ, given by ϕ + max(0, −ϕ) − max(0, ϕ − 1), is feasible, but only in H^1(Ω), as the pointwise max(0, ·)-operator does not preserve H^2-regularity. Hence, the pointwise projection of ϕ onto the Gibbs simplex is feasible, yet it cannot be the solution to the original problem.
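To make the distinction concrete, the following sketch compares the pointwise (L²-)projection with a discrete H^1-projection on a 1D toy problem. The grid, the choice of ϕ, and the projected-gradient solver are illustrative assumptions, not the paper's method; the point is only that the two projections genuinely differ.

```python
import numpy as np

# Toy 1D comparison (an assumption-laden sketch, not the paper's setup):
# project phi onto {0 <= u <= 1} in the H^1 inner product versus pointwise
# clipping (the L^2-projection).
N = 50
x = np.linspace(0.0, 1.0, N)
h = x[1] - x[0]

# Discrete H^1 inner product A = M + K: lumped mass plus Neumann stiffness.
K = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h
K[0, 0] = K[-1, -1] = 1.0 / h
A = h * np.eye(N) + K

phi = 1.5 * np.sin(np.pi * x)       # exceeds the upper bound u <= 1
u_clip = np.clip(phi, 0.0, 1.0)     # pointwise projection: feasible only

def obj(u):                          # 0.5 * ||u - phi||_A^2
    return 0.5 * (u - phi) @ A @ (u - phi)

# H^1-projection via projected gradient (monotone for step <= 1/lambda_max).
t = 1.0 / np.linalg.eigvalsh(A).max()
u_h1 = u_clip.copy()
for _ in range(5000):
    u_h1 = np.clip(u_h1 - t * (A @ (u_h1 - phi)), 0.0, 1.0)
```

After the loop, `u_h1` is feasible, attains a strictly smaller H^1-distance to ϕ than the clipped function, and differs visibly from it near the free boundary.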
Following the approach in [16,15], we approximate (P) by replacing G with a Moreau-Yosida-type regularization of the associated indicator function. This yields the approximation

min_{u ∈ V, 1^T u = 1} ½‖u − ϕ‖²_V + γβ(u),   (P_γ)

in which β(u) := ½ Σ_{i=1}^n ‖(−u_i)^+‖²_{L²(Ω)} penalizes violations of the inequality constraints. The associated first-order system can be solved by a semismooth Newton method in function space. In order to increase numerical efficiency by avoiding excess parameter updates, we derive a full differential sensitivity analysis of the optimal value function θ(γ). These results lead to an explicit parameter update scheme referred to as "path-following." We note that our analysis is much more compact than that of [16,15]. Moreover, no assumptions on the active sets are needed to derive second-order properties of θ(γ).
A further important factor motivating this work arises from the fact that a first-discretize-then-optimize approach will typically exhibit mesh dependence under adaptive refinements (cf. the examples in Section 5). The main culprit is a lack of multiplier regularity for the constraints in G. By regularizing the problem, we increase this regularity, which enables us to obtain superlinear function-space convergence of the semismooth Newton method for the regularized problems. Moreover, by deriving an efficient approximation of the primal-dual path value functional θ, we avoid problems connected with ill-conditioning.
The rest of the paper is structured as follows. In Section 2 we analyze the parameter dependent problem (P γ ), which lays the foundation for the path-following. Afterwards, we suggest a path-following scheme and present the algorithm in Sections 3 and 4. Finally, in Section 5, we demonstrate the performance of the algorithm on two applications taken from the literature. This is done for both path-following schemes and a comparison is made to a mesh-dependent scheme based on the direct solution of the first-order system of (P).

Sensitivity Analysis
As indicated in the introduction, we need to analyze the first- and second-order differentiability properties of the parameter-dependent quantities related to (P_γ) in order to derive a path-following scheme. We thus begin this section with some basic facts about the optimal solution mapping and optimal value functions associated with (P) and (P_γ). Throughout the discussion, we use the subscript • to indicate dependence on the penalty parameter γ > 0.
Theorem 2.1. Under the standing assumptions, the following facts hold:
1. (P) has a unique solution ū and, for all γ > 0, (P_γ) has a unique solution ū_γ.
2. The mapping ū_• : (0, ∞) → V, γ ↦ ū_γ, is continuous.
3. ū_γ → ū strongly in V as γ → +∞.
Proof. To see that 1. holds, consider that both (P) and (P_γ) involve the minimization of a strongly convex, continuous functional over a non-empty, closed, and convex set in a real Hilbert space; hence, both problems admit unique solutions ū and ū_γ, respectively.
To prove continuity of ū_• on (0, ∞), we need several results. For readability, let θ(γ) denote the optimal value of (P_γ), i.e., θ(γ) := ½‖ū_γ − ϕ‖²_V + γβ(ū_γ). By definition of a global optimum, we have

½‖ū_γ − ϕ‖²_V + γβ(ū_γ) ≤ ½‖ū − ϕ‖²_V + γβ(ū).   (1)

Therefore, it follows from the feasibility of ū (so that β(ū) = 0) and the non-negativity of the objectives that ½‖ū_γ − ϕ‖²_V ≤ ½‖ū − ϕ‖²_V. Furthermore, since β(·) ≥ 0, we deduce the uniform boundedness in H^1(Ω)^n of the set {ū_γ}_{γ>0}.
Next, let γ_k → +∞ and define the sequence {ū_k} by ū_k := ū_{γ_k}. Since {ū_k} ⊂ {ū_γ}_{γ>0}, which is bounded, there exists a weakly convergent subsequence {ū_{k_l}} with limit ū*. Returning to (1) in this setting and dividing both sides by γ_{k_l}, it follows from the weak lower-semicontinuity of β that β(ū*) = 0, from which it can easily be deduced that ū* ∈ G. But then it follows from the weak lower-semicontinuity of ‖· − ϕ‖²_V and the uniqueness of the global minimizer ū that ū* = ū. Moreover, the uniqueness of ū also implies, by the Urysohn property, that the entire sequence {ū_k} converges weakly to ū. Using the previous inequality, we have once again by weak lower-semicontinuity and optimality (respectively) that ‖ū_k − ϕ‖_V → ‖ū − ϕ‖_V. Since V is a Hilbert space, weak convergence together with convergence of the norms yields ū_k → ū strongly in V. This proves 3.
Building on the results of Theorem 2.1, we next show that the value function is smooth.

Theorem 2.2. The function θ : (0, ∞) → R is continuously differentiable with θ'(γ) = β(ū_γ).
Proof. Fix γ, η > 0. According to (4), we have a lower bound on the right difference quotient of θ at γ and, similarly (for sufficiently small η > 0), on the left one. We also have the corresponding upper bounds (for sufficiently small η > 0). This yields a two-sided estimate on both difference quotients. It follows from Theorem 2.1.2, the continuity of β, (5), and (7) that the right difference quotient converges to β(ū_γ); similarly, we have from Theorem 2.1.2, the continuity of β, (6), and (8) that the same holds for the left one. Hence, θ is differentiable on (0, ∞). Finally, as ū_• : (0, ∞) → V and β are continuous, θ is continuously differentiable.

Remark 2.3 (Relation to Danskin's Theorem). Danskin's theorem requires that the objective functional is continuous and inf-compact on some Hausdorff topological vector space. In our case, the level sets of the objective are inf-compact in the weak topology on V. However, the objective is weakly lower-semicontinuous, but not continuous, with respect to the weak topology. Therefore, we cannot simply embed the problem into the setting of Danskin's theorem by replacing (V, ‖·‖_V) with (V, σ_weak).
In order to approximate θ in the path-following scheme by a simple model function, we need second-order information. This will require a differential sensitivity analysis of ū_•. For this discussion, we will make use of the first-order necessary and sufficient optimality conditions (9) for (P_γ), stated for each i = 1, . . . , n. For η > 0, we define the difference quotients

d^η := (ū_{γ+η} − ū_γ)/η,

together with the analogous quotients of the associated multipliers. We then have the following result.
where d̄ ∈ V is the unique solution of the variational problem (D_γ). Moreover, it holds that ū_{γ+η} = ū_γ + η d̄ + o(η).

Proof. We first show that {d^η}_{η>0} is uniformly bounded in V. First note that d^η_i solves an elliptic PDE with homogeneous Neumann boundary conditions, namely (10a). By the monotonicity of the gradients and by testing (10a) with d^η_i and summing over i, it follows from Theorem 2.1.3 that {d^η}_{η>0} is uniformly bounded in V. Next, we derive expansions for λ^η_i and μ^η.
From the Lebesgue dominated convergence theorem, we see that the superposition operator (·)^+ : L²(Ω) → L²(Ω) is Lipschitz continuous and Hadamard directionally differentiable. Moreover, due to [21, Proposition 3.1], it is continuous in the second variable. Since {d^η_i}_{η>0} is uniformly bounded in V, there exists a small o-function such that the stated expansion for λ^η_i holds. Arguing similarly for ν^η, we obtain the analogous expansion. Now, letting η_k ↓ 0, there exists a weakly converging subsequence {d^{k_l}_i} with d^{k_l}_i := d^{η_{k_l}}_i and a weak limit point d̄_i ∈ V. Since V embeds compactly into L²(Ω) and ū_• : (0, ∞) → V is continuous, we may pass to the limit in these expansions. It is then easy to see that the vector d̄ ∈ V solves (D_γ). Since (D_γ) has a unique solution, we can deduce by the Urysohn property that the entire sequence {d^k} converges weakly to d̄.
We are now ready to derive a formula for the directional derivative of θ'(γ). Consider the difference quotient of θ'. Using the chain rule, formula (12), and ū_{γ+η} = ū_γ + η d^η, we can rewrite this quotient. Since η_k → 0 implies d^{η_k} ⇀ d̄ (weakly in V), we can use the compactness of the embedding of V into L²(Ω) and pass to the limit in the previous expression. This yields the desired formula. Finally, it follows from (11) that β'(ū_γ; 1) ≤ 0. This completes the proof.

A Semismooth Newton Iteration
The main component of the algorithm involves the direct solution of (9) using a semismooth Newton method. We refer the reader to the papers [23,14] for the full theory of this method and only note that, in the current setting, the iterates are generated by solving the following (reduced) linear system: given γ > 0 and u_old, solve the Newton system in which A is the PDE-operator associated with the V-inner product and I_{u_old<0} is the characteristic function of the set {u_old < 0}. Once the residual of (9) is sufficiently small, the method terminates and γ is updated. The previous γ-dependent solution serves as the starting value for the next iteration. We summarize this procedure in Algorithm 3.1.
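To illustrate the structure of such a semismooth Newton step, here is a minimal 1D sketch for a Moreau-Yosida-penalized lower bound u ≥ 0 only; the simplex equality constraint, its multiplier, and all discretization choices are simplifying assumptions, so this is not the paper's exact system.

```python
import numpy as np

# Minimal sketch (assumptions: 1D grid, lumped mass, lower bound only):
# solve F(u) = A(u - phi) - gamma*h*max(0, -u) = 0 by semismooth Newton,
# where the generalized Jacobian uses the characteristic function of {u < 0}.
N = 50
x = np.linspace(0.0, 1.0, N)
h = x[1] - x[0]
K = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h
K[0, 0] = K[-1, -1] = 1.0 / h
A = h * np.eye(N) + K                      # discrete H^1 inner product
phi = np.sin(2.0 * np.pi * x) - 0.2        # dips below the bound u >= 0
gamma = 1.0e5

def residual(u):
    return A @ (u - phi) - gamma * h * np.maximum(0.0, -u)

u = phi.copy()
for _ in range(30):
    chi = (u < 0.0).astype(float)          # characteristic function of {u < 0}
    J = A + gamma * h * np.diag(chi)       # generalized Jacobian
    u = u - np.linalg.solve(J, residual(u))
```

Since the residual is piecewise affine, the iteration terminates in finitely many steps once the active set stabilizes, and the remaining bound violation is of order 1/γ.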

Algorithm 3.1 (path-following semismooth Newton method): repeat the inner semismooth Newton iteration and the subsequent γ-update until the stopping criterion is satisfied.
Defining the residuals r^1_k, r^2_k, and r^3_k of the first-order system, the outer loop terminates whenever ‖(r^1_k, r^2_k, r^3_k)‖_2 ≤ tol.
Note that r^1_k and r^2_k are feasibility errors, whereas r^3_k represents the complementarity error. The stopping criteria for the inner loop depend on the type of γ-update strategy; these are detailed below in Section 4.

Feasibility Restoration
In some applications, e.g., topology optimization, where the phase-field functions u_1, . . . , u_n arise as parameters in a second-order PDE-operator, it is imperative that u ∈ G. For such situations, we are forced to employ a mesh-dependent numerical method. However, the (very) warm start provided by Algorithm 3.1 yields a point sufficiently close to the true solution of (P) that the jump to feasibility requires minimal effort. We demonstrate this fact in the examples.
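For reference, nodewise feasibility restoration can be realized with the classical sort-based Euclidean projection onto the unit simplex; this standard routine is an assumption on our part, as the paper does not spell out its restoration procedure.

```python
import numpy as np

# Standard sort-based Euclidean projection onto the unit simplex
# {w : w_i >= 0, sum_i w_i = 1}; applied nodewise, this restores
# feasibility of a discrete phase field.  (Assumed routine: the paper
# does not specify its restoration step.)
def project_simplex(v):
    s = np.sort(v)[::-1]                        # sorted in decreasing order
    css = np.cumsum(s) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(s - css / k > 0.0)[0][-1]  # last index with positive gap
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```

For example, `project_simplex(np.array([0.5, 0.4, 0.3]))` shifts every entry by the common multiplier θ = 0.2/3, and already-feasible vertices such as (1, 0, 0) are left unchanged.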

Path-Following Scheme
We now detail a strategy for efficiently updating γ_k in Algorithm 3.1. As suggested in [16], an ideal update scheme for γ (at step k) would be to choose γ_{k+1} according to (17), where τ_k ↓ 0 is a given null sequence. However, as noted there, this is not a practical strategy, as we cannot evaluate θ(γ_{k+1}) and θ_∞. Nevertheless, our analysis shows that θ : (0, ∞) → R is positive, bounded, increasing, continuously differentiable, and concave. Hence, we first fit a model function using the information available at step k, and then update according to an approximation of (17). This serves as the basis for the path-following approach.
Our approach follows the arguments in [16,15]. We therefore consider a model function m with parameters E, C_1, C_2, fitted from the data available at step k. Clearly, E, C_1, C_2 > 0. Moreover, it can be argued using (3) that γβ(ū_γ) → 0 as γ → +∞. Hence, m(·) is monotonically increasing and m(·) ≤ C_1, which implies that m(δ) → C_1 as δ → ∞ and C_1 → θ_∞ as γ → ∞. Now, we may use (17), which leads to the explicit update formula (19). By choosing three different reference values for γ in (19), we see in Figure 1 that m is an excellent local approximation of θ.
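The fit-and-update step can be sketched as follows. Since the concrete model function and formulas (17)-(19) are not reproduced above, the rational form m(δ) = C1 − C2/(E + δ) and the gap-reduction reading of (17) are assumptions on our part, chosen only to be consistent with the stated properties (increasing, concave, bounded above by C1).

```python
import numpy as np

# Hypothetical sketch: fit m(delta) = C1 - C2/(E + delta) (an assumed model
# form consistent with m increasing, concave, m <= C1) from two value samples
# theta(g1), theta(g2) and one derivative theta'(g2); then pick gamma_{k+1}
# so that the remaining model gap C1 - m(gamma) shrinks by the factor tau_k.
def fit_model(g1, th1, g2, th2, dth2):
    r = (th2 - th1) / (dth2 * (g2 - g1))   # equals (E + g2) / (E + g1)
    E = (g2 - r * g1) / (r - 1.0)
    C2 = dth2 * (E + g2) ** 2
    C1 = th2 + C2 / (E + g2)
    return C1, C2, E

def next_gamma(C1, C2, E, th_k, tau):
    # solve C1 - m(gamma) = tau * (C1 - th_k) for gamma
    return C2 / (tau * (C1 - th_k)) - E
```

As a consistency check, data generated by θ(δ) = 5 − 3/(1 + δ) is reproduced exactly: the fit returns (C1, C2, E) = (5, 3, 1), and the resulting update shrinks the true gap by exactly the prescribed factor.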
In this method, the Newton iteration (inner loop) is terminated whenever the active index sets {u_k < 0} are the same in two subsequent iterations or a prescribed residual tolerance is reached.

Numerical results
In this section, we demonstrate the performance and wide applicability of the proposed methods on three examples. The first example shows the mesh-independent behavior of the path-following methods and compares this to the first-discretize-then-optimize approach suggested above, which utilizes the primal-dual active set method (adapted to our setting) from [14]. This method does not admit a function-space convergence analysis and is, hence, expected to behave mesh-dependently; i.e., utilizing the same stopping rule and initial point on all meshes, the number of iterations until successful termination will grow as the mesh is refined. For the latter two examples, we consider two optimization problems from mathematical image processing and labeled data classification that may be solved using a gradient-flow or projected gradient method, see, e.g., [3]. At each step of the line search in the projected gradient method, a projection onto the Gibbs simplex is required.
In all examples, the underlying domain Ω = (0, 1)² is discretized using a uniform triangular mesh with parameter h > 0. The function space H^1(Ω) is discretized using the associated nodal basis comprised of standard continuous piecewise affine finite elements, see, e.g., [10,7]. Note that the pointwise almost-everywhere constraints are approximated in the finite-dimensional problem by requiring the coefficients of each nodal basis function to lie in the finite-dimensional Gibbs simplex. In order to calculate the H^1(Ω; R^n)^*-norm in (22), we need the Riesz representation of the dual object. This is obtained by solving a (vector-valued) Poisson problem with homogeneous Neumann boundary conditions with the residual of interest on the right-hand side. We make use of the P1AFEM package [11]. Concerning the path-following parameters, we choose τ_k = 10^{−2k}, initial parameter γ_1 = 10^4, maximum parameter γ_max = 10^{14}, and the stopping criterion tol = 10^{−6}. PDAS is stopped once the residual of the optimality conditions of (P) drops below 10^{−10}.
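The dual-norm evaluation via the Riesz map can be sketched as below; the 1D grid, lumped mass, and homogeneous Neumann stiffness are illustrative assumptions (the paper works in 2D with P1AFEM).

```python
import numpy as np

# Sketch of the dual-norm computation: for a residual coefficient vector r
# and the discrete H^1 inner-product matrix A, the Riesz representer z
# solves A z = r (a discrete Neumann problem) and ||r||_* = sqrt(r^T z).
def h1_matrix(N):
    h = 1.0 / (N - 1)
    K = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h
    K[0, 0] = K[-1, -1] = 1.0 / h
    return h * np.eye(N) + K

def dual_norm(A, r):
    z = np.linalg.solve(A, r)     # Riesz representer of the residual
    return np.sqrt(r @ z)
```

Consistency check: for r = A w the dual norm of r coincides with the primal H^1-norm of w, i.e. sqrt(wᵀAw).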

Projection onto the Gibbs Simplex
For this example, we set n = 3 and construct ϕ_i such that the optimal solution of (P) exhibits biactivity, i.e., both the inequality constraint and the associated Lagrange multiplier are equal to zero on a common set. Such problems can be notoriously difficult to solve. We fix a reference pair (v, λ) and construct ϕ by solving the corresponding system. Note that λ = (λ_1, λ_2, λ_3) ∈ L²(Ω)³, that (λ, v)_{L²} = 0, and that the above system forms the Karush-Kuhn-Tucker conditions for (P). Thus, v is its optimal solution, λ is the Lagrange multiplier associated with the inequality constraint, and biactivity is present on the whole third component. The starting point was chosen as ϕ.
The results of Algorithm 3.1 are presented in Table 1. We compare the behavior of the primal-dual active set (PDAS) and path-following (PF) methods on six different meshes with decreasing mesh size.

Applications in Image Processing and Labeled Data Classification
In this subsection, we consider two examples inspired by the work in [2,12]. Having n phases, the common goal is to minimize a criterion J such that the phases are (almost) pure. This amounts to solving problem (23), where ζ > 0 is a scaling parameter and the parameter ε > 0 is proportional to the interfacial width. The second part of the objective is the multiphase Ginzburg-Landau energy, which, along with the indicator function of G, serves as a penalty functional that avoids pathological free boundaries between the phases. Together, the sum of these functionals Γ-converges to the perimeter of the characteristic functions associated with the true phases in the sharp-interface regime [19,20]. As such, straight edges are more favorable. We use the standard projected gradient method with line search to solve (23).
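For orientation, a generic projected gradient method with backtracking line search looks as follows. In the paper, each projection is the H^1-projection computed by Algorithm 3.1; this self-contained sketch instead takes the projection operator as a user-supplied argument and demonstrates it with a simple box projection, which is an assumption made purely for illustration.

```python
import numpy as np

# Generic projected gradient with a backtracking (sufficient-decrease) line
# search.  'project' stands in for the H^1-projection onto G; here it is a
# placeholder argument so the sketch stays self-contained.
def projected_gradient(f, grad, project, u0, t0=1.0, tol=1e-10, max_iter=1000):
    u = project(u0)
    t = t0
    for _ in range(max_iter):
        g = grad(u)
        while True:
            v = project(u - t * g)
            d = v - u
            # Armijo-type sufficient decrease for the projected step
            if f(v) <= f(u) + g @ d + (d @ d) / (2.0 * t):
                break
            t *= 0.5
        if np.linalg.norm(v - u) <= tol:
            return v
        u = v
    return u
```

For example, minimizing ½‖u − b‖² over the box [0, 1]³ with the clipping projection returns the componentwise-clipped b.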

Labeled Data Classification
In the first example, we separate n = 3 different classes of observations with known labels. First, we 'train' three segmentation phase-field functions u on M noisy data points and then test the accuracy of these solutions on further noisy realizations. This amounts to solving (23) with the misfit term (24), where m ∈ {1, . . . , M} is the index of a noisy data point x_m ∈ Ω and B(x_m, δ) is a small ball of radius δ > 0 around this point; j(m) ∈ {1, 2, 3} refers to the known label (class) of this point. Due to the structure of the Gibbs simplex, J is minimized when u_{j(m)} = 1 and u_i = 0 for i ≠ j(m) on B(x_m, δ). Thus, J expresses the number of incorrectly classified points. In (23) we consider ε = 2^{−6} and ζ = 2·10^{−2}, and in (24) we set δ = 3·2^{−7}. The algorithm was started with a uniform mixture of phases.
We randomly generated M = 1500 points (500 for each category: +/left half circle, •/bottom half circle, ×/right half circle), see the left-hand side of Figure 2. We then solved the optimization problem in order to train three phase-field functions, see the right-hand side of Figure 2. In order to test the accuracy of these fields, we again generated the same number of points and checked how many of them lie in the correct region. In this instance, the solution was correct for 99.6% of the data points. The numerical comparison of both methods is shown in Table 2. Even though PF needs more iterations than PDAS, the iteration growth upon mesh refinement is much slower. The difference diminishes when we use a warm start from the previous mesh. As we will see in the next application, on sufficiently fine meshes PF eventually needs fewer iterations than PDAS.
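The accuracy test described above reduces to an argmax decision per test point; a minimal sketch follows, where the matrix of phase values sampled at the test points is synthetic and assumed given.

```python
import numpy as np

# Each test point x_m receives the label of the largest phase value
# u_i(x_m); accuracy is the fraction of points whose predicted label
# matches the known one.  The phase-value matrix used below is synthetic.
def classify(u_at_points):
    # u_at_points: (M, n) array of phase values at the M test points
    return np.argmax(u_at_points, axis=1)

def accuracy(u_at_points, labels):
    return np.mean(classify(u_at_points) == labels)
```

With four synthetic test points of which one falls on the wrong side of the trained interface, the routine reports an accuracy of 0.75.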

Inpainting
For the inpainting application, we assume that an original image (top left of Figure 3) is degraded by adding text on top of it to produce a known image û (top right of Figure 3). The goal is then to recover the original image as faithfully as possible. We follow the suggestion in [2] to use a modified Allen-Cahn equation for image inpainting, which leads to (23) with an appropriate choice of J. In (23) we consider ε = 2^{−7} and ζ = 7·10^{−3}. The algorithm was started with the degraded image.
In contrast to the previous cases, we use a non-uniform mesh that is strongly refined near the phase interface; see the bottom right part of Figure 3, which depicts the mesh near the right panda eye. For simplicity, we consider a locally adapted mesh which is given a priori. A similar mesh may be obtained by developing a posteriori error estimates for adaptive mesh refinement, see, e.g., [1] or [13]; as the latter technique is not the focus of this work, we rely on the a priori mesh. The recovered image is shown in the bottom left part of Figure 3. The numerical evidence is summarized in Table 3, where h is the smallest element size. On the coarsest mesh, PDAS outperforms PF, but as the mesh gets finer, PF needs fewer iterations to converge to the solution. In all cases, the projected gradient method needed 12 iterations with 26 projections.

Summary
In this paper, we motivate the need for a function-space-based algorithm for the H^1-projection onto the multiphase Gibbs simplex. In particular, the wide array of applications involving optimization or control problems with the Gibbs simplex as a constraint alone warrants such a study.
In order to make the best use of fast second-order methods for semismooth equations, an analytical path-following study is carried out. Our analytical approach drastically reduces the length of the arguments and the assumptions required in [16,15], the two works that inspired our approach. Due to the generality of our arguments, they can clearly be extended to similar constraint sets. As in [16,15], we suggest a path-following strategy for the update of the Moreau-Yosida regularization parameter, which is based on a concave model function. In our numerical experiments, we first demonstrate the clear advantage of our methods over a first-discretize-then-optimize approach. Finally, we consider two small examples inspired by recent work in [2,12] that require the H^1-projection onto the Gibbs simplex. Further applications, for example in topology optimization, will be considered in future work.