Rates of Convergence for Chains of Expansive Markov Operators

We provide conditions that guarantee local rates of convergence in distribution of iterated random functions that are not nonexpansive mappings in locally compact Hadamard spaces. Our results are applied to stochastic instances of common algorithms in optimization, stochastic tomography for X-FEL imaging, and a stochastic algorithm for the computation of Fréchet means in model spaces for phylogenetic trees.


Introduction
This work concerns abstract mathematical structures that guarantee convergence in distribution of Markov chains when the transition kernel is generated by mappings that are not nonexpansive. Our motivation for the abstract study pursued here is the very concrete application of X-ray free electron laser (X-FEL) imaging experiments [18,27,73]. Our framework provides conditions under which algorithms for the reconstruction of electronic densities or orbitals of a molecule from X-FEL measurements are guaranteed to be successful. Another motivating application comes from already established stochastic algorithms for computing the Fréchet means of phylogenetic trees in nonlinear model spaces. For this problem our analysis immediately yields improvements to the state of the art for these kinds of algorithms. While our focus is mainly mathematical, we briefly introduce the X-FEL imaging problem to set the scene.
The goal of X-FEL imaging is to determine the electron density of a molecule from experimental samples of its scattering probability distribution (i.e. diffraction pattern). Without going into the intricate aspects of the measurement device or scattering theory, we imagine a three-dimensional measurement device that counts the occurrence of photons in a three-dimensional domain partitioned into voxels of a fixed size far away from (that is, in the far field of) a molecule that has been illuminated by a short X-FEL pulse. The molecule under observation is at a random orientation relative to the measurement device. The illuminating pulse is only long enough to cause a few (averaging between 10 and 100) scattering events from the interaction of a molecule's electrons with the X-ray. The experiment is repeated about 10^9 times, each time with the molecule at a different random orientation.
The physical model mapping the probability of a scattering event at a particular location on the molecule to the resulting electromagnetic field at large distances is, to first order, simply the Fourier transform of the electronic density of the molecule. An X-FEL measurement is truly a random sample of the electromagnetic field far away from the molecule. The measurement device, however, can only measure the occurrence of photons, not their direction as indicated by the real and imaginary parts of the electromagnetic field. If both the probability of observing a photon at a certain location and its direction (i.e. its phase) were known, then the electronic density of the molecule that produced those photons could be reconstructed from the observation by simply inverting the Fourier transform. When the probabilities of observing photons at each location in the far field of the molecule can be measured (as is the case in conventional coherent diffraction imaging [66] and wavefront sensing [52]), then the problem becomes the well-known phase retrieval problem, for which the mathematical analysis has been developed in [23,52,53,56].
The algorithms that we envision for performing such reconstructions from X-FEL data proceed by constructing a projection mapping of the current estimate for the electron density of the molecule onto the nearest electron density that is consistent with each new X-FEL observation; in other words, the projection mapping is itself stochastic. We view this type of algorithmic procedure as a random function iteration generating a Markov chain whose transition kernel maps the current estimate to a new estimate based on new randomly sampled data.
In [38] it is shown that iterations of randomly selected α-firmly nonexpansive mappings in a Euclidean space converge in the Prokhorov-Lévy metric to an invariant probability measure of the corresponding Markov operator. The analysis of Markov chains for the case of nonexpansive transition kernels in queuing theory can be found in [5,6]. The transition kernels that we propose for the X-FEL application, however, are not nonexpansive; our analysis extends the more recent study of nonexpansive transition kernels in [38] to the expansive case using results from the deterministic theory of expansive fixed point mappings developed in [56]. We return to the specific application of X-FEL imaging in Section 4.
Random function iterations (RFI) [28] generalize deterministic fixed point iterations, and are a useful framework for studying a number of important applications. An RFI is a stochastic process of the form X_{k+1} := T_{ξ_k} X_k (k = 0, 1, 2, . . . ) initialized by a random variable X_0 with distribution µ_0 and values on some set G. This of course includes initialization from a deterministic point x_0 ∈ G via the δ-distribution. Here ξ_k (k = 0, 1, 2, . . . ) is an element of a sequence of i.i.d. random variables that map from a probability space into a measurable space of indices I (not necessarily countable), and T_i (i ∈ I) are self-mappings on G. The iterates X_k form a (time-homogeneous) Markov chain that takes values in the space G, which is, for our purposes, a Polish space. Deterministic fixed point iterations are included when the index set I is just a singleton.
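The RFI update X_{k+1} := T_{ξ_k} X_k is easy to sketch in code. The following is a minimal illustration, not from the paper: the family of maps, the index distribution, and the starting point are all hypothetical stand-ins for T_i, ξ and X_0.

```python
import random

def rfi(maps, sample_index, x0, num_iters):
    """Random function iteration: X_{k+1} = T_{xi_k} X_k.

    maps: dict i -> T_i, self-maps on the state space
    sample_index: draws an i.i.d. copy of the index variable xi
    x0: realization of the initial random variable X_0
    """
    x = x0
    trajectory = [x]
    for _ in range(num_iters):
        i = sample_index()      # xi_k ~ xi, i.i.d.
        x = maps[i](x)          # X_{k+1} = T_{xi_k} X_k
        trajectory.append(x)
    return trajectory

# Hypothetical example: T_i x = (x + i)/2 for i in {-1, +1};
# each map is a contraction with rate 1/2, so the iterates are
# eventually confined to the interval [-1, 1].
random.seed(0)
maps = {-1: lambda x: (x - 1.0) / 2.0, 1: lambda x: (x + 1.0) / 2.0}
traj = rfi(maps, lambda: random.choice([-1, 1]), x0=10.0, num_iters=50)
```

With a singleton index set the same routine reduces to a deterministic fixed point iteration.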
Notation, basic facts, and the main results are presented in Section 2. The assumptions on the mappings generating the Markov operators have been shown to be necessary and sufficient for local rates of convergence in the analysis of deterministic algorithms for continuous optimization ([55, Theorem 2] and [48, Theorem 16]), and also necessary for rates for the case of consistent stochastic feasibility [37, Theorem 3.15] in a Euclidean setting. Whether these assumptions are also necessary in the setting considered here is not known, though we conjecture that this is true. In Section 2.4 we present the statement of the main result, Theorem 2.6, which provides sufficient conditions for a quantification of convergence of the RFI. The proof of the main result is delayed until Section 3.4 so that the regularity assumptions can be developed; namely, that the transition kernel is almost α-firmly nonexpansive in expectation (Definition 3.1) and the discrepancy between a given measure and the set of invariant measures of the Markov operator, (22), is metrically subregular (Definition 3.6). We conclude this study with Section 4, where we focus on applications to nonconvex optimization on measure spaces and inconsistent feasibility.

RFI and the Stochastic Fixed Point Problem
This section follows the development of [38]. Our notation is standard. The natural numbers are denoted by N. For G, an abstract topological space, B(G) denotes the Borel σ-algebra and (G, B(G)) is the corresponding measure space. We denote by P(G) the set of all probability measures on G. The notation X ∼ µ ∈ P(G) means that the law of X, denoted L(X), satisfies L(X) := P(X ∈ ·) = µ, where P is the probability measure on some underlying probability space. All of these different ways of indicating a measure µ will be used. We take for granted a familiarity with probability theory and Markov chains. For general background and fundamental results, in particular regarding the convergence of Markov chains, see for instance [15,35,36,39,44,49,58,71,74]. Let us point out that the results of the standard probability texts on Markov chain convergence are not directly applicable to our particular setting, which includes the special conditions that are placed on the operators T_i.
Throughout, the pair (G, d) denotes a separable metric space with metric d. Results concerning existence of invariant measures and convergence of Markov chains require completeness of the metric space, that is, that the space is Polish. In characterizing the regularity of the building blocks, however, completeness is not required. For our main result, Theorem 2.6, we restrict the setting to self-mappings of a compact subset of a Hadamard space, a separable complete uniformly convex metric space with nonpositive curvature. Our application examples (Section 4) are mostly in Euclidean spaces, which are Hadamard spaces with zero curvature, but the example of computing mean phylogenetic trees (Section 4.4) requires the generality of Hadamard spaces. The purpose of starting with abstract separable metric spaces is to make it easier to see which results rely in part on the regularity of the space in addition to those properties we require of the mappings. Markov operators generated from mappings on Hadamard spaces represent the current limits of the direct applicability of the theory presented here.
The distance of a point x ∈ G to a set A ⊂ G is denoted by d(x, A) := inf_{w∈A} d(x, w). Continuing with the development initiated in the introduction, we will consider a collection of mappings T_i : G → G, i ∈ I, on (G, d), where I is an arbitrary index set, not necessarily countable. The measure space of indices is denoted by (I, I), and ξ is an I-valued random variable on a probability space. The pairwise independence of two random variables ξ and η is denoted ξ ⊥⊥ η. The random variables ξ_k in the sequence (ξ_k)_{k∈N} (abbreviated (ξ_k)) are independent and identically distributed (i.i.d.) with ξ_k distributed as ξ (ξ_k ∼ ξ). The method of random function iterations is formally presented in Algorithm 1 (Random Function Iterations (RFI)). We will use the notation (X_k) to denote the sequence of the RFI initialized with X_0 ∼ µ_0. When characterizing sequences initialized with the delta distribution of a point x we use the notation X^x_k. The following assumptions will be employed throughout.
We define the Markov chain corresponding to the random function iteration in terms of its transition kernel: a sequence of random variables (X_k) is called a Markov chain with transition kernel p if for all k ∈ N and A ∈ B(G), P-a.s., P(X_{k+1} ∈ A | X_0, . . . , X_k) = p(X_k, A). Under Assumption 2.1, the sequence of random variables (X_k) generated by Algorithm 1 is a Markov chain with transition kernel p given by p(x, A) := P(Φ(x, ξ) ∈ A) (2) for the measurable update function Φ : G × I → G, Φ(x, i) := T_i x. The Markov operator P is defined pointwise for a measurable function f : G → R by (Pf)(x) := ∫_G f(y) p(x, dy) = E[f(T_ξ x)], when the integral exists. The Markov operator P is said to be Feller whenever Pf ∈ C_b(G) for all f ∈ C_b(G), where C_b(G) is the set of bounded and continuous functions from G to R. This property is central to the theory of existence of invariant measures. An application of Lebesgue's dominated convergence theorem immediately yields that if T_i is continuous for all i ∈ I then the Markov operator P is Feller; see also [35, Theorem 4.22].
We denote the dual Markov operator P* : P(G) → P(G) acting on a measure µ by action on the right by P via (µP)(A) := ∫_G p(x, A) µ(dx) for A ∈ B(G). We can thus write the distribution of the k-th iterate of the Markov chain generated by Algorithm 1 as L(X_k) = µ_0 P^k.
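When G is a finite set, the dual action µ ↦ µP is just multiplication of a row vector by a transition matrix, so L(X_k) = µ_0 P^k can be computed directly. A minimal sketch with a hypothetical two-state kernel:

```python
import numpy as np

# Hypothetical two-state transition matrix: p(x, {y}) = P[x, y].
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

mu0 = np.array([1.0, 0.0])   # X_0 ~ delta distribution at state 0

# L(X_k) = mu0 P^k: push the distribution through the dual operator.
mu = mu0.copy()
for _ in range(200):
    mu = mu @ P

# The limit is the invariant measure pi, the solution of pi P = pi;
# for this matrix pi = (0.8, 0.2).
pi = mu
```

The same row-vector picture underlies the infinite-dimensional setting, with the matrix replaced by the kernel p(x, dy).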

The Stochastic Fixed Point Problem
The stochastic feasibility problem is: find x ∈ G such that P(T_ξ x = x) = 1. This was first studied in [24,25] in the context of stochastic convex set feasibility and has been extended more generally to finding the (almost sure) fixed points of multi-valued mappings in [37,38]. A point x such that x = T_i x is a fixed point of the operator T_i; the set of all such points is denoted by Fix T_i, and the set of almost sure fixed points is C := {x ∈ G | P(T_ξ x = x) = 1}. (3) In [37] it was assumed that C ≠ ∅. If C = ∅, following Butnariu [24] this is called inconsistent stochastic feasibility.
One need only consider linear systems of equations, Ax = b, to demonstrate an inconsistent feasibility problem. Each linear equation ⟨a_j, x⟩ = b_j represents a hyperplane; the solution to Ax = b is the intersection of the hyperplanes. Obviously, when the system is overdetermined, or underdetermined with noise, the intersection is likely empty: an inconsistent feasibility problem. Most readers will reflexively reformulate the problem as a least squares minimization problem, but there are compelling reasons why this is not always the best way around the dilemma of the inconsistent feasibility problem.
Avoiding debate about least squares versus feasibility, what all can surely agree upon is that the inconsistency of the problem formulation is an artifact of asking the wrong question. A fixed point of the dual Markov operator P is called an invariant measure and satisfies πP = π. The set of all invariant probability measures is denoted by inv P. The solution to the dilemma of the inconsistent feasibility problem is instead to solve the following stochastic fixed point problem: find π ∈ P(G) such that πP = π. (4)

Example 2.2 (inconsistent stochastic feasibility). Consider the (trivially convex, nonempty and closed) sets C_{−1} := {−1} and C_1 := {1} together with a random variable ξ such that P(ξ = 1) = P(ξ = −1) = 1/2. The mappings T_i x = P_{C_i} x = i for x ∈ R and i ∈ I = {−1, 1} are the projections onto the sets C_{−1} and C_1. The RFI iteration then amounts to just random jumps between the values −1 and 1. So it holds that P(T_ξ x = i) = 1/2 for all x ∈ R, and hence there is clearly no feasible fixed point to this iteration; that is, the set C defined in (3) is empty. Nevertheless, by disintegration we have P(X_{k+1} = i) = 1/2 for i ∈ {−1, 1} and all k ∈ N. That means the unique invariant distribution to which the distributions of the iterates of the RFI converge is π = (δ_{−1} + δ_1)/2, and this is attained after one iteration. The least squares solution is the mean of this invariant distribution.
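Example 2.2 can be simulated directly: independent of the starting point, after one step the iterate is ±1 with equal probability, so the empirical distribution over many independent runs approximates the invariant measure, which puts mass 1/2 on each of −1 and 1, and whose mean 0 is the least squares solution. A sketch (the sample sizes are arbitrary):

```python
import random

random.seed(1)

def one_run(x0, num_iters=5):
    """One RFI trajectory for Example 2.2; returns the final iterate."""
    x = x0
    for _ in range(num_iters):
        i = random.choice([-1, 1])   # xi_k with P(xi = +-1) = 1/2
        x = i                        # T_i x = P_{C_i} x = i
    return x

samples = [one_run(x0=3.7) for _ in range(100_000)]
frac_plus = samples.count(1) / len(samples)   # approx. 1/2
mean = sum(samples) / len(samples)            # approx. 0, the LS solution
```

Note that the invariant distribution is reached after a single step; the extra iterations merely emphasize that the chain never settles at a point.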

Modes of convergence
Algorithm 1 generates a sequence of random variables whose distributions, when conditions allow, converge to solutions to (4). For this we focus on convergence in distribution. Let (µ_k) be a sequence of probability measures on G. The sequence (µ_k) is said to converge in distribution to µ whenever µ ∈ P(G) and for all f ∈ C_b(G) it holds that µ_k f → µf as k → ∞, where µf := ∫ f(x) µ(dx). Equivalently, a sequence of random variables (X_k) is said to converge in distribution if their laws (L(X_k)) do.
We consider convergence in distribution for the corresponding sequence of measures (L(X_k)) to a probability measure π ∈ P(G), i.e. L(X_k)f → πf for any f ∈ C_b(G). An elementary fact from the theory of Markov chains is that, if the Markov operator P is Feller and π is a cluster point of the sequence of measures (µ_k) := (µ_0 P^k) with respect to convergence in distribution, then π is an invariant probability measure [36, Theorem 1.10]. We will assume existence of invariant measures. For more on this theory readers are referred to [15,35,36]. Quantifying convergence is essential for establishing estimates for the distance of the iterates to the limit point, when this exists. A sequence (x_k) in a metric space (G, d) is said to converge R-linearly to x with rate c ∈ [0, 1) when there is a constant M > 0 such that d(x_k, x) ≤ M c^k for all k. The sequence (x_k) is said to converge Q-linearly to x with rate c ∈ [0, 1) if d(x_{k+1}, x) ≤ c d(x_k, x) for all k. By definition, Q-linear convergence implies R-linear convergence with the same rate; the converse implication does not hold in general. Q-linear convergence is encountered with contractive fixed point mappings, and this leads to a priori and a posteriori error estimates on the sequence. This type of convergence is referred to as geometric or exponential convergence in different communities. The crucial distinction between R-linear and Q-linear convergence is that R-linear convergence permits neither a priori nor a posteriori error estimates. For more on these notions see [62, Chapter 9]. A common metric for spaces of measures is the Wasserstein metric. For p ≥ 1 let P_p(G) denote the set of measures µ ∈ P(G) with finite p-th moment, that is, ∫_G d(x, x̄)^p µ(dx) < ∞ for some x̄ ∈ G. The Wasserstein p-metric on P_p(G), denoted W_p, is defined by W_p(µ, ν) := ( inf_{γ ∈ C(µ,ν)} ∫_{G×G} d(x, y)^p γ(dx, dy) )^{1/p}, (7) where C(µ, ν) is the set of couplings of µ and ν: the measures γ ∈ P(G × G) whose marginals are µ and ν respectively. We will also refer in Proposition 4.1 to the Prokhorov-Lévy distance, d_P, metrizing weak convergence in distribution: d_P(µ, ν) := inf{ε > 0 | µ(A) ≤ ν(A^ε) + ε for all A ∈ B(G)}, where A^ε is the open ε-neighborhood of A. To conclude this section, we state without proof the elementary fact that, for the setting considered here, the set of invariant measures is closed. For proof see for example [35, Section 5].
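On the real line the Wasserstein distance between two empirical measures with equally many atoms reduces to matching sorted samples (the monotone coupling is optimal in one dimension), which gives a quick numerical illustration of the definition; the sample data below are arbitrary.

```python
import numpy as np

def wasserstein_p_1d(xs, ys, p=1):
    """W_p between the empirical measures (1/n) sum_j delta_{x_j} and
    (1/n) sum_j delta_{y_j} on the real line: the optimal coupling
    matches the sorted samples (monotone rearrangement)."""
    xs, ys = np.sort(np.asarray(xs)), np.sort(np.asarray(ys))
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

# Illustrative check: translating a measure by t moves it exactly
# W_p-distance t, for every p.
x = np.array([0.0, 1.0, 2.0, 5.0])
d = wasserstein_p_1d(x, x + 3.0, p=2)   # = 3.0
```

In higher dimensions, or between measures with different numbers of atoms, the infimum over couplings becomes a linear program (optimal transport) rather than a sort.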
Lemma 2.3. Let G be a Polish space and let P be a Feller Markov operator, which is in particular the case under Assumption 2.1 if T_i is continuous for all i ∈ I. Then the set of associated invariant measures inv P is closed with respect to the topology of convergence in distribution.

Regularity
The regularity of T_i depends on the application. In [38] the regularity of T_i is used to obtain generic convergence results for the corresponding Markov operator, but the regularity of the Markov operator is never explicitly defined. We make this explicit here by lifting the regularity of T_i to corresponding notions of regularity of Markov operators in probability spaces in the case that T_i is expansive, following the development of such mappings in [56] in a Euclidean space setting. When working with expansive mappings, multi-valued mappings appear naturally. In the stochastic setting, to ease the notation and avoid certain technicalities, we will consider only single-valued mappings T_i that are only almost α-firmly nonexpansive in expectation. Recent studies define the regularity of fixed point mappings in p-uniformly convex spaces (p ∈ (1, ∞)) with parameter c > 0 [13,48]. These are uniquely geodesic metric spaces (G, d) for which the following inequality holds [60]: d(w, z)^p ≤ (1 − t) d(x, z)^p + t d(y, z)^p − (c/2) t(1 − t) d(x, y)^p, (11) where w = (1 − t)x ⊕ ty for t ∈ (0, 1) denotes the point w on the geodesic connecting x and y such that d(w, x) = t d(x, y). The definitions below hold formally in this nonlinear setting. The only object whose properties are strongly tied to the geometry of the space is the transport discrepancy ψ defined below in (14). We are limited so far to locally compact Hadamard spaces, which are locally compact complete CAT(0) spaces (Alexandrov [1] and Gromov [33]). More generally, a CAT(κ) space is a geodesic metric space with sufficiently small triangles possessing comparison triangles with sides the same length as the geodesic triangle, but for which the distance between points on the geodesic triangle is less than or equal to the distance between corresponding points on the comparison triangle. CAT(κ) spaces are separable, but not complete, and locally 2-uniformly convex with parameter c ≤ 2. A CAT(0) space has p = c = 2 in (11).
The definition below conforms with the same objects defined in [13,48].

Definition 2.4 (pointwise almost (α-firmly) nonexpansive mappings). Let (G, d) be a metric space, let D ⊂ G and let F : D → G.

(i) The mapping F is said to be pointwise almost nonexpansive at x_0 ∈ D on D, abbreviated pointwise ane, whenever ∃ǫ ∈ [0, ∞) : d(Fx, Fx_0) ≤ √(1 + ǫ) d(x, x_0) for all x ∈ D. (12) The violation is a value of ǫ for which (12) holds. When the above inequality holds for all x_0 ∈ D then F is said to be almost nonexpansive on D (ane). When ǫ = 0 the mapping F is said to be (pointwise) nonexpansive.
(ii) The mapping F is said to be pointwise almost α-firmly nonexpansive at x_0 ∈ D on D, abbreviated pointwise aα-fne, whenever ∃ǫ ∈ [0, 1) and α ∈ (0, 1) : d(Fx, Fx_0)^2 ≤ (1 + ǫ) d(x, x_0)^2 − ((1 − α)/α) ψ(x, x_0, Fx, Fx_0) for all x ∈ D, (13) where the transport discrepancy ψ of F at x, x_0, Fx and Fx_0 is defined by (14). When the above inequality holds for all x_0 ∈ D then F is said to be almost α-firmly nonexpansive on D (aα-fne). The violation is the constant ǫ for which (13) holds. When ǫ = 0 the mapping F is said to be (pointwise) α-firmly nonexpansive, abbreviated (pointwise) α-fne.
Nonexpansive and α-firmly nonexpansive mappings have been studied for decades under various names and in various settings [3, 4, 7, 8, 20-22, 30, 32, 34, 47, 57]. The violation ǫ in (12) and (13) is a recently introduced feature in the analysis of fixed point mappings, first appearing in this form in [56]. Many are familiar with mappings for which (12) holds with ǫ < 0 at all x_0 ∈ G, i.e. contraction mappings. In this case, the whole technology of pointwise aα-fne mappings is not required, since an appropriate application of Banach's fixed point theorem delivers existence of fixed points and convergence of fixed point iterations at a linear rate. We will have more to say about this later; for the moment it suffices to note that the mappings associated with one of our target applications are expansive on all neighborhoods of fixed points, and we will therefore require another property to guarantee convergence.
Our definition with α = 1/2 and ǫ = 0 is equivalent to the definition of firmly contractive mappings given in [20, Definition 6]. In linear spaces the mappings with ǫ = 0 are called "averaged" [8]. Ariza-Ruiz, Leuştean and López-Acedo [3] defined λ-firmly nonexpansive operators on subsets D of W-hyperbolic spaces as those operators satisfying ∃λ ∈ (0, 1) : d(Fx, Fy) ≤ d((1 − λ)x ⊕ λFx, (1 − λ)y ⊕ λFy) for all x, y ∈ D. (15) Another notion of regularity in the context of Hadamard spaces that is equivalent to (15) for an operator F : H → H is built on the function φ_F(λ) := d((1 − λ)x ⊕ λFx, (1 − λ)y ⊕ λFy). In [32, Chapter 24] an operator F : H → H is called firmly nonexpansive whenever φ_F is nonincreasing on [0, 1] (see also [12, Definition 2.1.13]). It is clear from the definition that F : D → D satisfies (15) if and only if φ_F is a nonincreasing function on [0, 1] for all x, y ∈ D.
Banert [9, Remark p. 658] shows that any mapping F satisfying (15) for all λ ∈ (0, 1] is α-firmly nonexpansive with constant α = 1/2. The transport discrepancy ψ is the key to identifying the regularity required for convergence of fixed point iterations in metric spaces, and the definition makes clear the contribution of the geometry of the space.

Lemma 2.5 (ψ is nonnegative in CAT(0) spaces, Proposition 4 of [13]). Let (G, d) be a CAT(0) metric space and F : D → G for D ⊂ G. Then the transport discrepancy defined by (14) is nonnegative for all x, y ∈ D. Moreover, if F is pointwise aα-fne at x_0 ∈ D with violation ǫ on D, then F is pointwise ane at x_0 on D with violation at most ǫ.
In CAT(κ) spaces the above statement does not hold. It is well known, for example, that in a CAT(κ) metric space the projector onto a convex set is α-fne with α = 1/2, but it is not nonexpansive [3]. In Hadamard space settings, when a mapping is firmly nonexpansive, it is clear that it is also nonexpansive. This implication was shown in [3] to be a consequence of Busemann convexity and does not hold in general metric spaces. Nevertheless, the implication is recovered for p-uniformly convex spaces for pointwise firmly nonexpansive mappings at their fixed points, since the corresponding transport discrepancy ψ is nonnegative in this case [13, Proposition 4(i)]. It would be interesting to investigate these notions in Busemann spaces where, based on known extensions of the tools of variational analysis to Banach spaces, we conjecture that many of these notions of regularity carry over.
On normed linear spaces, when ‖·‖ is the norm induced by the inner product and d(x, y) = ‖x − y‖, the transport discrepancy ψ defined by (14) has the representation ψ(x, y, Fx, Fy) = ‖(x − Fx) − (y − Fy)‖^2. (16) This representation shows the connection between our definition and more classical notions. Indeed, in a Hilbert space setting (G, ‖·‖), a mapping F is aα-fne with constant α and violation ǫ exactly when ‖Fx − Fy‖^2 ≤ (1 + ǫ)‖x − y‖^2 − ((1 − α)/α)‖(x − Fx) − (y − Fy)‖^2 for all x, y; with ǫ = 0 this recovers the averaged mappings of [8].
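In R^n with the Euclidean norm this can be tested numerically: the metric projector onto a closed convex set is α-fne with α = 1/2 and no violation, i.e. ‖Px − Py‖² + ‖(x − Px) − (y − Py)‖² ≤ ‖x − y‖², the second term being the Hilbert-space form of the transport discrepancy. A sketch checking this for the projector onto the unit ball at random points (the set, dimension, and sample count are arbitrary choices):

```python
import numpy as np

def proj_ball(x):
    """Metric projector onto the closed Euclidean unit ball."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def is_half_fne(F, x, y, tol=1e-10):
    """Check the alpha-fne inequality with alpha = 1/2 (so the factor
    (1 - alpha)/alpha is 1) and violation epsilon = 0."""
    Fx, Fy = F(x), F(y)
    psi = np.linalg.norm((x - Fx) - (y - Fy)) ** 2   # transport discrepancy
    return np.linalg.norm(Fx - Fy) ** 2 + psi <= np.linalg.norm(x - y) ** 2 + tol

rng = np.random.default_rng(0)
checks = [is_half_fne(proj_ball, rng.normal(size=3), rng.normal(size=3))
          for _ in range(1000)]
```

A random numerical check like this of course only probes the inequality; it does not replace the proof of firm nonexpansiveness of convex projectors.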

Main results
We can now state the main theorem concerning Markov operators P with update function Φ(x, i) = T_i(x) and transition kernel p given by (2) for self mappings T_i : G → G. For any µ_0 ∈ P_2(G), we denote the distributions of the iterates of Algorithm 1 by µ_k := L(X_k) = µ_0 P^k. It will be assumed that inv P ≠ ∅.
Convergence is quantified by an implicitly defined gauge function. Recall that ρ : [0, ∞) → [0, ∞) is a gauge function if ρ is continuous, strictly increasing with ρ(0) = 0, and lim_{t→∞} ρ(t) = ∞. The gauge is defined in terms of another nonnegative mapping θ_{τ,ǫ} : [0, ∞) → [0, ∞), with parameters τ > 0 and ǫ ≥ 0, satisfying the conditions collected in (19); the gauge ρ is then given implicitly in terms of θ_{τ,ǫ} by (20) for τ > 0 fixed. The rate of convergence of the sequences of measures will be determined by θ_{τ,ǫ}, with parameters τ and ǫ given by a certain characterization of the regularity of the Markov operator P (see Definition 3.1), which translates to a second regularity of the Markov operator P, metric subregularity (see Definition 3.6), via the gauge ρ. In the case of linear convergence the conditions in (19) simplify to θ_{τ,ǫ}(t) = γt for a constant γ ∈ (0, 1).

Theorem 2.6 (convergence rates). Let (H, d) be a separable Hadamard space and let G ⊂ H be compact. Let T_i : G → G be continuous for all i ∈ I and define Ψ : P(G) → R_+ ∪ {+∞} to be the invariant Markov transport discrepancy (22). Assume furthermore: (a) there is at least one π ∈ inv P ∩ P_2(G), where P is the Markov operator with update function Φ and transition kernel given by (2); (b) P is pointwise aα-fne in measure at each π ∈ inv P ∩ P_2(G) with constant α ∈ (0, 1) and violation ǫ ≥ 0 on P_2(G); and (c) Ψ(π) = 0 ⇐⇒ π ∈ inv P, and Ψ is metrically subregular for 0 on P_2(G) with gauge ρ given implicitly by (20) with τ = (1 − α)/α and the function θ_{τ,ǫ} satisfying (19).
Then for any µ_0 ∈ P_2(G) the W_2 distance of µ_k = µ_0 P^k to inv P is bounded above by the k-fold iterate θ^{(k)}_{τ,ǫ}(t_0), where t_0 is the corresponding distance of the initial distribution; in other words, the distributions of the iterates converge to an invariant measure at the rate dictated by the gauge. An immediate corollary of this theorem is the following specialization to linear convergence.
Corollary 2.7 (linear convergence rates). Under the same assumptions as in Theorem 2.6, if Ψ satisfies (c) with linear gauge ρ(t) = r · t for a constant r > 0 such that the resulting rate constant γ in θ_{τ,ǫ}(t) = γt lies in (0, 1), then the sequence of measures (µ_k) converges R-linearly to an invariant measure with rate γ.

Background theory and proofs of the main results
As indicated by the discussion following Definition 2.4, assumption (b) of Theorem 2.6 has deep roots in fixed point theory. Assumption (c) has a similarly central significance in variational analysis. We develop each of these assumptions for the present setting in order.

Almost α-firm nonexpansive mappings in expectation
To begin, we develop assumption (b) of Theorem 2.6.
Definition 3.1 (pointwise almost (α-firmly) nonexpansive mappings in expectation). Let ψ be defined by (14), let Φ : G × I → G be given by Φ(x, i) := T_i x, and let ξ be an I-valued random variable.
(i) The mapping Φ is said to be pointwise almost nonexpansive in expectation at x_0 ∈ G on G, abbreviated pointwise ane in expectation, whenever ∃ǫ ∈ [0, ∞) : E[d(Φ(x, ξ), Φ(x_0, ξ))^2] ≤ (1 + ǫ) d(x, x_0)^2 for all x ∈ G. (26) When the above inequality holds for all x_0 ∈ G then Φ is said to be almost nonexpansive (ane) in expectation on G. As before, the violation is a value of ǫ for which (26) holds. When the violation is 0, the qualifier "almost" is dropped.
(ii) The mapping Φ is said to be pointwise almost α-firmly nonexpansive in expectation at x_0 ∈ G on G, abbreviated pointwise aα-fne in expectation, whenever ∃ǫ ∈ [0, 1), α ∈ (0, 1) : E[d(Φ(x, ξ), Φ(x_0, ξ))^2] ≤ (1 + ǫ) d(x, x_0)^2 − ((1 − α)/α) E[ψ(x, x_0, Φ(x, ξ), Φ(x_0, ξ))] for all x ∈ G. (27) When the above inequality holds for all x_0 ∈ G then Φ is said to be almost α-firmly nonexpansive (aα-fne) in expectation on G. The violation is a value of ǫ for which (27) holds. When the violation is 0, the qualifier "almost" is dropped and the abbreviation α-fne in expectation is used.
Proposition 3.2. Let (G, d) be a CAT(0) space. The mapping Φ : G × I → G given by Φ(x, i) = T_i x is pointwise aα-fne in expectation at y on G with constant α and violation at most ǫ, and pointwise ane in expectation at y on G with violation at most ǫ, whenever T_i is pointwise aα-fne at y on G with constant α and violation no greater than ǫ for all i.
Proof. By Lemma 2.5, whenever (G, d) is a CAT(0) space, ψ(x, y, Φ(x, i), Φ(y, i)) ≥ 0 for all i and for all x, y ∈ G, so the expectation E[ψ(x, y, Φ(x, ξ), Φ(y, ξ))] is well-defined and nonnegative for all x, y ∈ G (the value +∞ can be attained). This implies that, for all i, T_i is pointwise ane at y on G with violation at most ǫ whenever it is pointwise aα-fne at y on G with constant α and violation at most ǫ. It follows immediately from the definition, then, that Φ is pointwise aα-fne in expectation at y on G with constant α and violation at most ǫ, and also pointwise ane in expectation at y on G with violation at most ǫ.
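Proposition 3.2 can be illustrated numerically on the real line with ψ in its Hilbert-space form ‖(x − T_i x) − (y − T_i y)‖². Below, hypothetically, T_{−1} and T_1 are the projectors onto two disjoint intervals (an inconsistent feasibility problem in the spirit of Example 2.2); each projector is α-fne with α = 1/2 and no violation, so the in-expectation inequality (27) with (1 − α)/α = 1 and ǫ = 0 should hold on average over ξ:

```python
import numpy as np

def proj_interval(x, lo, hi):
    """Metric projector onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

# Hypothetical pair of projectors onto disjoint intervals; xi is
# uniform on {-1, 1}, so expectations are plain averages over i.
T = {-1: lambda x: proj_interval(x, -2.0, -1.0),
      1: lambda x: proj_interval(x, 1.0, 2.0)}

def expectation_check(x, y, tol=1e-10):
    """Check (27) with alpha = 1/2, epsilon = 0, and psi in its
    Hilbert-space form, averaging over i in {-1, 1}."""
    lhs = np.mean([(T[i](x) - T[i](y)) ** 2 for i in (-1, 1)])
    psi = np.mean([((x - T[i](x)) - (y - T[i](y))) ** 2 for i in (-1, 1)])
    return lhs + psi <= (x - y) ** 2 + tol

rng = np.random.default_rng(2)
checks = [expectation_check(*rng.uniform(-5, 5, size=2)) for _ in range(1000)]
```

Here the inequality even holds pointwise for each i, which is exactly the hypothesis of Proposition 3.2.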
Following [38, Definition 2.10] we lift these notions of regularity to Markov operators. Denote the set of couplings where the distance W_2(µ_1, µ_2) is attained by C*(µ_1, µ_2). Even though W_2(µ_1, µ_2) is defined as the infimum over all couplings, whenever this is finite the infimum is attained, and hence in this case C*(µ_1, µ_2) is nonempty [38, Lemma A.7].
Definition 3.3 (pointwise almost (α-firmly) nonexpansive Markov operators). Let (G, d) be a CAT(0) metric space, and let P be a Markov operator with transition kernel p given by (2), where ξ is an I-valued random variable and Φ : G × I → G is a measurable update function.
(i) The Markov operator P is said to be pointwise almost nonexpansive in measure at µ_0 ∈ P(G) on P(G), abbreviated pointwise ane in measure, whenever ∃ǫ ∈ [0, ∞) : W_2(µP, µ_0P)^2 ≤ (1 + ǫ) W_2(µ, µ_0)^2 for all µ ∈ P(G). (29) When the above inequality holds for all µ_0 ∈ P(G) then P is said to be almost nonexpansive (ane) in measure on P(G). As before, the violation is a value of ǫ for which (29) holds. When the violation is 0, the qualifier "almost" is dropped.
(ii) The Markov operator P is said to be pointwise almost α-firmly nonexpansive in measure at µ_0 ∈ P(G) on P(G), abbreviated pointwise aα-fne in measure, whenever ∃ǫ ∈ [0, 1), α ∈ (0, 1) : W_2(µP, µ_0P)^2 ≤ (1 + ǫ) W_2(µ, µ_0)^2 − ((1 − α)/α) inf_{γ ∈ C*(µ, µ_0)} ∫_{G×G} E[ψ(x, y, Φ(x, ξ), Φ(y, ξ))] γ(dx, dy) for all µ ∈ P(G). (30) When the above inequality holds for all µ_0 ∈ P(G) then P is said to be aα-fne in measure on P(G). The violation is a value of ǫ for which (30) holds. When the violation is 0, the qualifier "almost" is dropped and the abbreviation α-fne in measure is employed.
Remark 3.4. By Lemma 2.5, when (G, d) is a CAT(0) space the expectation on the right hand side of (30) is nonnegative, and the corresponding Markov operator is pointwise ane in measure at µ_0 whenever it is pointwise aα-fne in measure at µ_0 (Proposition 3.2). In particular, when µ = µ_0 ∈ inv P the left hand side is zero. Here the optimal coupling is the diagonal of the product space G × G and ψ(x, x, T_ξ x, T_ξ x) = 0 for all x ∈ G.

Metric subregularity
We move now to assumption (c) of Theorem 2.6. In [56] a general quantitative analysis for iterations of expansive fixed point mappings is proposed, consisting of two principal components: the constituent mappings are pointwise aα-fne, and the transport discrepancy of the fixed point operator is metrically subregular. Recall that, for any mapping Ψ : A → B, the inverse mapping is Ψ^{-1}(y) := {z ∈ A | Ψ(z) = y}, which clearly can be set-valued.
Our definition is modeled after [29], where the case where the gauge is just a linear function, ρ(t) = rt for r > 0, is developed. In this case, metric subregularity is one-sided Lipschitz continuity of the (set-valued) inverse mapping Ψ^{-1}. We will refer to the case when the gauge is linear as linear metric subregularity. For connections of this notion to the concept of transversality in differential geometry and its use in variational analysis see [41]. In the case of a finite dimensional linear transition kernel, the constant of metric subregularity characterizes the spectral gap, or the difference between the two largest eigenvalues, of Id − T_i. For another example, derived from [55, Propositions 9-10], the alternating projections operator for finding the intersection of two affine subspaces (assumed nonempty) has a transport discrepancy mapping ψ whose constant of metric subregularity grows to infinity as the (Friedrichs) angle between the sets goes to zero; in other words, the smaller the angle between the subspaces, the slower the rate of convergence of alternating projections.
Metric subregularity plays a central role in the implicit function paradigm for solution mappings [17,29]; it is also notoriously difficult to verify. Linear metric subregularity was shown in [37, Theorem 3.15] to be necessary and sufficient for R-linear convergence in expectation of random function iterations for consistent stochastic feasibility. This result is a stochastic analog of [55, Theorem 2] in the deterministic setting, and all of this has been extended to nonlinear spaces in [48, Theorem 16]. It is an open problem whether metric subregularity is necessary in the present setting, though we see no reason why it should not be.
We apply this to the Markov operator P on the metric space (P_2(G), W_2) in the following manner. Recall the transport discrepancy ψ defined in (14). We construct the surrogate mapping Ψ : P(G) → R_+ ∪ {+∞} defined by (22). We call this the invariant Markov transport discrepancy. It is not guaranteed that both inv P and C*(µ, π) are nonempty; when at least one of these is empty, we define Ψ(µ) := +∞. It is clear that Ψ(π) = 0 for any π ∈ inv P. Whether Ψ(µ) = 0 only when µ ∈ inv P is a property of the space (G, d). Indeed, as noted in the discussion after Lemma 2.5, in CAT(κ) spaces with κ > 0 the transport discrepancy ψ can be negative, and so by cancellation it could happen on such spaces that the invariant Markov transport discrepancy satisfies Ψ(µ) = 0 for µ ∉ inv P. In CAT(0) spaces, and hence Hadamard spaces, ψ is nonnegative, but this is still not enough to guarantee that the invariant Markov transport discrepancy Ψ takes the value 0 at µ if and only if µ ∈ inv P. The regularity we require of P is that the invariant Markov transport discrepancy Ψ takes the value 0 at µ if and only if µ ∈ inv P, and is metrically subregular for 0 relative to P_2(G) on the space P_2(G) equipped with the metric W_2 defined in (7).

Contractive Markov operators
Before moving to our main results, we put the more familiar contractive mappings into the present context. A survey of random function iterations for contractive mappings in expectation can be found in [69]. An immediate consequence of [69, Theorem 1] is the existence of a unique invariant measure and linear convergence in the Wasserstein metric from any initial distribution to the invariant measure.
Contractive Markov operators have been studied in [43,61] using the parallel notion of the coarse Ricci curvature κ(x, y) of the Markov operator P between two points x and y:

κ(x, y) := 1 − W_1(δ_x P, δ_y P) / d(x, y).
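To make the definition concrete, the following is a small numerical sketch of our own (not taken from [43,61]): it estimates κ(x, y) by Monte Carlo for a toy chain on R whose update Φ(x, ξ) = x/2 + ξ is an assumed illustration; with common random numbers, W_1(δ_x P, δ_y P) = |x − y|/2, so κ(x, y) = 1/2.

```python
import numpy as np

def w1_empirical(a, b):
    # Exact 1-Wasserstein distance between two empirical measures on R
    # with the same number of atoms: match sorted samples.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def coarse_ricci(update, x, y, n_samples=10_000, rng=None):
    # Monte Carlo estimate of kappa(x, y) = 1 - W_1(delta_x P, delta_y P)/d(x, y),
    # coupling delta_x P and delta_y P with shared noise samples.
    rng = np.random.default_rng(rng)
    xi = rng.standard_normal(n_samples)
    return 1.0 - w1_empirical(update(x, xi), update(y, xi)) / abs(x - y)

# Toy chain X_{k+1} = X_k/2 + xi (an Ornstein-Uhlenbeck-type update):
# the shared-noise coupling gives W_1 = |x - y|/2, hence kappa = 1/2.
phi = lambda x, xi: x / 2 + xi
kappa = coarse_ricci(phi, x=3.0, y=-1.0, rng=0)
```

For this linear update the estimate is exact up to rounding, since the coupled samples differ by the constant (x − y)/2.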
Generalizing this definition to W_p yields the coarse Ricci curvature with respect to W_p:

κ_p(x, y) := 1 − W_p(δ_x P, δ_y P) / d(x, y).

A few steps lead from this object for the Markov operator P with update function Φ(·, ξ) = T_ξ and transition kernel defined by (2) to the violation ǫ in Proposition 3.5. Indeed, a formal adjustment of the proof of [61, Proposition 2] establishes that the property κ_2(x, y) ≥ κ ∈ R for all x, y ∈ G is equivalent to

W_2(δ_x P, δ_y P) ≤ (1 − κ) d(x, y) for all x, y ∈ G.

When κ > 0, i.e. when the coarse Ricci curvature is bounded below by a positive number, this characterizes contractivity of the Markov operator. The negative of the violation in (29) is just a lower bound on the coarse Ricci curvature in W_2: −ǫ = κ ≤ κ_2(x, y) for all x, y ∈ G. The consequences of Markov operators with Ricci curvature bounded below by a positive number have been extensively investigated. Error estimates for Markov chain Monte Carlo methods under the assumption of positive Ricci curvature in W_1 (i.e. negative violation) are explored in [43]. Applications to waiting queues, the Ornstein-Uhlenbeck process on R^n and Brownian motion on positively curved manifolds, as well as demonstrations of how to verify the assumptions on the Ricci curvature, are developed in [61]. Our approach extends this to expansive mappings, which allows one to treat our target application of electron density reconstruction from X-FEL experiments (see Section 4). The next result shows that update functions Φ that are contractions in expectation generate α-fne Markov operators with metrically subregular invariant Markov transport discrepancy. In this context see [45].

Theorem 3.7. Let (G, ‖·‖) be a Hilbert space, let T_i : G → G for i ∈ I, and let Φ : G × I → G be given by Φ(x, i) := T_i(x). Denote by P the Markov operator with update function Φ and transition kernel p defined by (2). Suppose that Φ is a contraction in expectation with constant r < 1, i.e.

E[‖Φ(x, ξ) − Φ(y, ξ)‖^2] ≤ r^2 ‖x − y‖^2 for all x, y ∈ G.

Suppose in addition that there exists y ∈ G with E[‖Φ(y, ξ) − y‖^2] < ∞. Then the following hold.
Proof. Note that for any pair of distributions µ_1, µ_2 ∈ P_2(G) and an optimal coupling γ ∈ C^*(µ_1, µ_2) (nonempty by [38, Lemma A.7]) it holds that

W_2(µ_1 P, µ_2 P)^2 ≤ ∫_{G×G} E[‖Φ(x, ξ) − Φ(y, ξ)‖^2] γ(dx, dy) ≤ r^2 W_2(µ_1, µ_2)^2,

where ξ is independent of γ. Moreover, P is a self-mapping on P_2(G). To see this, let µ ∈ P_2(G) be independent of ξ and let y be a point in G where E[‖Φ(y, ξ) − y‖^2] < ∞. Then by the triangle inequality and the contraction property the second moment of µP is finite, and therefore µP ∈ P_2(G). Altogether, this establishes that P is a contraction on the separable complete metric space (P_2(G), W_2), and hence Banach's Fixed Point Theorem yields existence and uniqueness of inv P and Q-linear convergence of the fixed point sequence.
To see (ii), note that, by (17), where the last inequality follows from the Cauchy-Schwarz inequality and the fact that Φ(·, ξ) is a contraction in expectation. Again using the contraction property and (34) we have
The right-hand side of this inequality is just the characterization (27) of mappings that are α-fne in expectation with α = (1 + r)/2. The rest of the statement follows from Proposition 3.5.
The proof of (iii) is modeled after the proof of [13, Theorem 32]. By the triangle inequality and part (i) we have
On the other hand, (33) implies that Ψ takes the value zero only at invariant measures, so that by the uniqueness of invariant measures established in part (i)
Combining this with (35) and (33) then yields, for all k ∈ N,
In other words,
Since this holds for any sequence (µ_k) initialized with µ_0 ∈ P_2(G), we conclude that Ψ is metrically subregular for 0 relative to P_2(G) with gauge ρ(t) = (q(1 − r))^{-1} t on P_2(G), as claimed.

Proofs of the main results
We are now in a position to prove the main result.
Proof of Theorem 2.6. First note that by assumption (a) the Markov operator P is a self-mapping on P_2(G), hence W_2(µ, µP) < ∞, and for any µ_1, µ_2 ∈ P_2(G) the set of optimal couplings C^*(µ_1, µ_2) is nonempty [38, Lemma A.7]. Since (H, d) is a Hadamard space and G ⊂ H, the function Ψ(µ) defined by (22) is extended real-valued, nonnegative (see Lemma 2.5), and finite since C^*(µ, π) and inv P are nonempty. Moreover, by assumption (c), Ψ^{-1}(0) = inv P and
On the other hand, note that assumption (b) is that Φ is aα-fne in expectation, so by definition (22), assumption (b) and Proposition 3.5 (which applies because we are on a separable Hadamard space) we have
Incorporating (37) into (38) and rearranging the inequality yields
Since this holds at any µ ∈ P_2(G), it certainly holds at the iterates µ_k with initial distribution µ_0 ∈ P_2(G), since P is a self-mapping on P_2(G). Therefore equation (39) simplifies. Indeed, by Lemma 2.3, inv P is closed with respect to convergence in distribution. Moreover, since G is assumed to be compact, P_2(G) is locally compact ([2, Remark 7.19]), so for every k ∈ N the infimum in (39) is attained at some π_k. This yields
Taking the square root and recalling (19) and (20) yields (24).

Remark 3.8:
The compactness assumption on G can be dropped if (H, d) is a Euclidean space.

Proof of Corollary 2.7. In the case that the gauge ρ is linear with constant r′, the function θ_{τ,ǫ}(t) is linear with a constant, denoted c. Specializing the argument in the proof above to this particular θ_{τ,ǫ} shows that, for any k and m with k < m, we have
Letting m → ∞ in (41) yields R-linear convergence (5) with rate c given above and leading constant β = ((1 + c)/(1 − c)) d_0. If, in addition, inv P is a singleton, then {π_{µ_0}} = inv P in the above and convergence is actually Q-linear, which completes the proof.

Examples: Stochastic Optimization and Inconsistent Nonconvex Feasibility
To fix our attention we focus on the composite optimization problem (42). It is assumed throughout that f_i : R^n → R is continuously differentiable for all i ∈ I_f, and that g_i : R^n → R ∪ {+∞} is proper (not everywhere infinite), lower semicontinuous (g_i(x̄) ≤ lim inf_{x→x̄} g_i(x) for all x̄) for all i ∈ I_g, and subdifferentially regular, i.e. the limiting subdifferential and the regular subdifferential coincide, where the regular subdifferential of g_i at x, denoted ∂g_i(x), is defined by

∂g_i(x) := { v ∈ R^n : g_i(y) ≥ g_i(x) + ⟨v, y − x⟩ + o(‖y − x‖) as y → x }.

In particular, when g_i is differentiable at x, the subdifferential is a singleton consisting of the gradient: ∂g_i(x) = {∇g_i(x)}. For more background on nonsmooth analysis see [67]. The random variable with values in I_f × I_g will be denoted ξ = (ξ^f, ξ^g). This model covers deterministic composite optimization as a special case: I_f and I_g consist of single elements and the measure µ is a point mass.
The algorithms reviewed in this section rely on resolvents of the subdifferentials/gradients of the functions f_i and g_i, denoted J_{∂f_i,λ} and J_{∂g_i,λ}. The resolvent of a multi-valued mapping F from elements in G ⊂ R^n to one or more elements in R^n is defined by

J_{F,λ}(x) := (Id + λF)^{-1}(x).

For proper, lower semicontinuous convex functions f : R^n → R ∪ {+∞}, this is equivalent to the proximal mapping [59] (often just called the prox mapping) defined by

prox_{f,λ}(x) := argmin_{y ∈ R^n} { f(y) + (1/(2λ)) ‖x − y‖^2 }.

In general one has

prox_{f,λ}(x) ⊂ J_{∂f,λ}(x)   (44)

whenever the subdifferential is defined. When λ = 1 in the definitions above, we will just write prox_f or J_{∂f}.
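As a quick illustration of the prox mapping (a sketch of our own, not from the references), the one-dimensional minimization in its definition can be carried out by brute force on a grid and checked against the well-known closed form for f = |·|, whose prox is the soft-thresholding operator.

```python
import numpy as np

def prox(f, x, lam=1.0):
    # Brute-force proximal mapping prox_{f,lam}(x) = argmin_y f(y) + (1/(2 lam))|x - y|^2
    # over a fine grid; a one-dimensional illustration only, not a practical solver.
    grid = np.linspace(x - 10.0, x + 10.0, 200_001)
    vals = f(grid) + (grid - x) ** 2 / (2.0 * lam)
    return grid[np.argmin(vals)]

# For f = |.| the prox is soft-thresholding: sign(x) * max(|x| - lam, 0).
x, lam = 2.5, 1.0
p_numeric = prox(np.abs, x, lam)
p_closed = np.sign(x) * max(abs(x) - lam, 0.0)
```

The grid search and the closed form agree to within the grid spacing, confirming (44) with equality in this convex case.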

Stochastic (nonconvex) forward-backward splitting
A splitting algorithm applied to problem (42) is any algorithm that proceeds by taking steps with respect to f_{ξ^f}(x) and g_{ξ^g}(x) separately and combining these either through convex combinations or compositions of some sort. Classical examples of splitting methods are the Gauss-Seidel and Jacobi algorithms for solving linear systems. We present a general prescription of what is known as the forward-backward splitting algorithm together with abstract properties of the corresponding fixed point mapping, and then specialize this to more concrete instances. Algorithm 2 is called forward-backward because the step in the direction of the negative gradient, −t∇f_{ξ^f_k}(X_k), is interpreted as a "forward" step (in the context of differential equations, this would be an explicit Euler step), while the step taken by applying the resolvent J_{∂g_{ξ^g_k}} is interpreted as a "backward" step (in the context of differential equations, this would be an implicit Euler step).
Algorithm 2: Stochastic Forward-Backward Splitting
Initialization: choose X_0, an i.i.d. sequence (ξ_k) with values in I_f × I_g, and set t > 0.
For k = 0, 1, 2, . . . :

X_{k+1} = J_{∂g_{ξ^g_k}} (X_k − t∇f_{ξ^f_k}(X_k)).   (45)
When f_{ξ^f}(x) = f(x) + ξ^f · x and g_{ξ^g} is the zero function, this is just steepest descent with linear noise, discussed in Section 2.1. More generally, (45) with g_{ξ^g} the zero function models stochastic gradient descent, which is a central algorithmic template in many applications. We show how the approach developed above opens the door to an analysis of this basic algorithmic paradigm for nonconvex problems. The next statement is the nonconvex analog of [38, Proposition 4.1]. The main difference here is that the violation given by (48) depends on the step size t in (45). We will have more to say about this below. Part (i) of the statement is the nonconvex analog of [38, Proposition 4.1(ii)]; part (iv) of the statement covers the convex case already established in [38, Proposition 4.1(iii)] and is included for completeness.

Proposition 4.1. On the Euclidean space (R^n, ‖·‖) suppose the following hold:
(b) there is a τ_g such that for all i ∈ I_g, the (limiting) subdifferential ∂g_i satisfies (47) at all points (x^+, z) ∈ gph ∂g_i and (y^+, w) ∈ gph ∂g_i, where z = x − x^+ for {x^+} = J_{∂g_i}(x) for any x ∈ ∪_{i∈I_f} (Id − t∇f_i)(G), and where w = y − y^+ for {y^+} = J_{∂g_i}(y) for any y in the same set.
Then the following hold.
(i) T^{FB}_i is aα-fne on G with constant α = 2/3 and violation at most that given by (48), for all i ∈ I.
(iv) Suppose that assumption (a) holds with condition (46) being satisfied for τ_f < 0 (that is, ∇f_i is strongly monotone for all i), and that condition (47) holds with τ_g = 0 (for instance, when g_i is convex). Then, whenever there exists an invariant measure for the Markov operator P corresponding to (45), for any fixed step length t ∈ (0, 2α/L] the distributions of the sequences of random variables converge to an invariant measure in the Prokhorov-Lévy metric.
(v) Let G be compact and P_2(G) ∩ inv P ≠ ∅. If Ψ given by (22) takes the value 0 only at points in inv P and is metrically subregular for 0 on P_2(G) with gauge ρ given by (20) with τ = 1/2, ǫ satisfying (48), and θ_{τ,ǫ} satisfying (19), then the Markov chain converges to an invariant distribution in the W_2 metric with rate given by (24).
Before proving the statement, some background for conditions (46) and (47) might be helpful. The inequality (46) is satisfied by functions f that are prox-regular [64]. This traces back to Federer's study of curvature measures [31], where such functions would be called functions whose epigraphs have positive reach. Inequality (47) is equivalent to the property that J_{∂g_i} is aα-fne with constant α_i = 1/2 and violation τ_g on G [56, Proposition 2.3]. Any differentiable function g_i with gradient satisfying (46) with constant τ_g/(2(1 + τ_g)) will satisfy (47) with constant τ_g. In the present setting, if g_i is prox-regular on G, then ∂g_i is hypomonotone on G and therefore satisfies (47) [56]. Convex functions are trivially hypomonotone with constant τ = 0. If the functions g_i are convex, then the violation (48) can be made arbitrarily small by taking the step size t small enough.
(ii). This follows immediately from part (i) above and Proposition 3.5.
The compactness assumption on G in part (v) is just to permit the application of Theorem 2.6. As noted in Remark 3.8, this assumption can be dropped for mappings T_i on Euclidean space. The result narrows the work of proving convergence of stochastic forward-backward algorithms to verifying existence of inv P. The case of convex stochastic gradient descent was presented in [38, Proposition 4.2].
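The forward-backward template above can be sketched in a few lines. The following is our own Euclidean illustration (the data model, step size, and the choice of g as the indicator of the nonnegative orthant are assumptions, not taken from the text): each update T^{FB}_i is a gradient step on one randomly selected least-squares residual followed by the resolvent of ∂g, which here is simply the projection onto the orthant. The problem is built to be consistent, so the iterates contract to the solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((20, n))
xtrue = np.abs(rng.standard_normal(n))   # nonnegative ground truth
b = A @ xtrue                            # consistent data: every row is satisfied

def T_fb(x, i, t):
    # T^FB_i = J_{dg} o (Id - t grad f_i) with f_i(x) = 0.5*(a_i.x - b_i)^2
    # and g the indicator of the nonnegative orthant (resolvent = projection).
    grad = (A[i] @ x - b[i]) * A[i]      # forward (explicit) gradient step
    return np.maximum(x - t * grad, 0.0) # backward (implicit) prox step

x, t = np.zeros(n), 0.02
for k in range(20_000):
    x = T_fb(x, rng.integers(len(A)), t)
err = np.linalg.norm(x - xtrue)
```

Because the problem is consistent, xtrue is a common fixed point of every T^{FB}_i and the random iteration converges to it; in the inconsistent case covered by the theory above, one only obtains convergence of the distributions of the iterates.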

Stochastic Douglas-Rachford
Another prevalent algorithm for nonconvex problems is the Douglas-Rachford algorithm [50]. This is based on compositions of the reflected resolvents R_{∂f_i} := 2J_{∂f_i} − Id and R_{∂g_j} := 2J_{∂g_j} − Id. Algorithm 3 has been studied for solving large-scale convex optimization and monotone inclusions (see for example [19,26]). Proposition 4.3 below opens the analysis to nonconvex, nonmonotone problems; the setting for this statement is given by the following assumptions.

Assumption 4.2. On the Euclidean space (R^n, ‖·‖) the following assumptions hold.
(a) There is a τ_g such that for all i ∈ I_g, the (limiting) subdifferential ∂g_i satisfies (51) at all points (x^+, z) ∈ gph ∂g_i and (y^+, w) ∈ gph ∂g_i, where z = x − x^+ for {x^+} = J_{∂g_i}(x) for any x ∈ G ⊂ R^n, and where w = y − y^+ for {y^+} = J_{∂g_i}(y) for any y ∈ G.
Algorithm 3: Stochastic Douglas-Rachford Splitting

(b) There is a τ_f such that for all i ∈ I_f, the (limiting) subdifferential ∂f_i satisfies (52) at all points (x^+, z) ∈ gph ∂f_i and (y^+, w) ∈ gph ∂f_i, where z = x − x^+ for {x^+} = J_{∂f_i}(x) for any x ∈ ∪_{j∈I_g} J_{∂g_j}(G), and where w = y − y^+ for {y^+} = J_{∂f_i}(y) for any y ∈ ∪_{j∈I_g} J_{∂g_j}(G).
on G.
(iv) Suppose that parts (a) and (b) of Assumption 4.2 hold with conditions (51) and (52) being satisfied for τ_g = τ_f = 0 (i.e., when f_i and g_i are convex for all i). Then, whenever there exists an invariant measure for the Markov operator P corresponding to (50), the distributions of the sequences of random variables converge to an invariant measure in the Prokhorov-Lévy metric.
Proof. (i). By [56, Proposition 3.7], for all j ∈ I_g, J_{∂g_j} is aα-fne with constant α = 1/2 and violation ǫ_g = 2τ_g on G. Likewise, for all i ∈ I_f, J_{∂f_i} is aα-fne with constant α = 1/2 and violation ǫ_f = 2τ_f on ∪_{j∈I_g} J_{∂g_j}(G). By [56], for all i ∈ I_f × I_g the Douglas-Rachford mapping T^{DR}_i is therefore aα-fne with constant α = 1/2 and violation at most
Here, as in Proposition 4.1, the compactness assumption on G in part (v) can be dropped since T_i is a mapping on R^n (see Remark 3.8).
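To illustrate part (i) numerically in the convex case (τ_f = τ_g = 0, zero violation), the following sketch of our own (the two sets in R^2 are assumed for illustration) builds a Douglas-Rachford mapping from reflected resolvents of two convex indicator functions and checks the defining inequality of a firmly nonexpansive mapping, i.e. aα-fne with α = 1/2, on random pairs of points.

```python
import numpy as np

# Two convex sets in R^2: A = the nonnegative orthant, B = the line {x : a.x = b}.
a_vec, b_val = np.array([1.0, 2.0]), 3.0
P_A = lambda x: np.maximum(x, 0.0)                                  # projection onto A
P_B = lambda x: x - (a_vec @ x - b_val) / (a_vec @ a_vec) * a_vec   # projection onto B
refl = lambda P: (lambda x: 2.0 * P(x) - x)                         # reflected resolvent R = 2J - Id
T_DR = lambda x: 0.5 * (refl(P_A)(refl(P_B)(x)) + x)                # Douglas-Rachford mapping

# Firm nonexpansiveness:
# ||Tx - Ty||^2 + ||(Id - T)x - (Id - T)y||^2 <= ||x - y||^2 for all x, y.
rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = 10 * rng.standard_normal(2), 10 * rng.standard_normal(2)
    lhs = np.sum((T_DR(x) - T_DR(y)) ** 2) \
        + np.sum(((x - T_DR(x)) - (y - T_DR(y))) ** 2)
    if lhs > np.sum((x - y) ** 2) + 1e-9:
        violations += 1
```

In the nonconvex setting of Assumption 4.2 the same inequality holds only up to the violation quantified in the proposition.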

Application to X-FEL Imaging
We apply the Stochastic Forward-Backward Algorithm 2 and the Stochastic Douglas-Rachford Algorithm 3 to the problem of X-ray free electron laser imaging discussed in the introduction.
Here, a high-energy X-ray pulse illuminates molecules suspended in a fluid that is streaming across the beam. A low-count diffraction image is recorded for each pulse. The goal is to reconstruct the three-dimensional electron density of the target molecules from the observed diffraction images. This is a stochastic tomography problem with a nonlinear model for the data (stochastic because the molecule orientations are random, and uniformly distributed on SO(3)). Computed tomography with random orientations has been studied for more than two decades [10,11] and has been successfully applied for inverting the Radon transform (a linear operator) with unknown orientations [63,68]. The model for the data in X-FEL imaging is nonlinear and nonconvex: Fraunhofer diffraction (mathematically equivalent to Fourier transformation) with missing phase [16,72]. The problem of recovering an object from diffraction intensity data is the optical phase retrieval problem [65,66]. The most successful and widely applied methods for solving this problem are fixed point algorithms where the fixed point mappings consist of compositions and averages of projection mappings onto nonconvex sets [54]; the connection to the general framework considered here is through the fact that the projector is the proximal mapping of the indicator function [67], and the gradient of the squared distance of a point to a set is twice the difference between that point and its projection onto the set [52, Eq. (5.3)]. With additional constraints ensuring that the reconstructions are confined to a certain region, or are real-valued, the feasibility model for the X-FEL problem, and phase retrieval in general, is an example of an inconsistent, nonconvex feasibility problem: there does not exist a point that simultaneously explains the measurement and satisfies the a priori constraints. A theoretical framework for unifying and extending the first proofs of local convergence of projection methods for inconsistent, nonconvex feasibility, with rates, was established in [56]. This analysis accommodates iterations of averages and compositions of expansive mappings. Moreover, unlike many other approaches, the framework does not require that the constituent mappings have common fixed points. This has been applied to prove, for the first time, local linear convergence of a wide variety of fundamental algorithms for phase retrieval [40,53,56,70].
The discretized mathematical model for the physical experiment is as follows. F_ξ : C^n → C^n is the Fourier transform composed with a rotation ξ ∈ SO(3), accounting for the propagation of an electromagnetic wave with incident angle ξ; ρ ∈ C^n is the unknown electron density in discretized voxels numbered from 1 to n (real-valuedness and nonnegativity will be applied in the qualitative constraints); this density interacts with the wave at one end (the pupil or object plane) of the instrument, and φ^ξ_i ∈ [0, 1] is the probability of observing a scattered photon in the i-th voxel (i = 1, 2, . . . , n) of the imaging volume in the far field under the rotation ξ. The problem is to determine ρ from the observed φ^ξ. The set of possible vectors satisfying such measurements, denoted C(ξ), has been studied extensively in [23, 51-53, 56]. Although this set is nonconvex, it is prox-regular [51] and hence "weakly nonconvex". Projectors onto C(ξ) are therefore pointwise almost α-firmly nonexpansive at any point ρ ∈ C(ξ), with violation ǫ vanishing to zero as the neighborhood of ρ collapses.
To give an idea of the size of this problem, in a typical experiment n is about 10^8. In practice, the observation described above for a randomly selected orientation ξ_k is repeated about 10^9 times, each time with a different orientation.
In addition to the sets generated by the data, there are certain a priori qualitative constraints that can (and should) be added depending on the type of experiment that has been conducted. Often these are support constraints, real-valuedness, or nonnegativity; an electron density, for example, is by definition real-valued and nonnegative. All of these are convex constraints; we reserve the set C_0 for the qualitative constraints.
The problem is a specialization of (42), where I_f = {1}, I_g = SO(3), ξ^f_k = 1 for all k, and ξ^g_k is a uniformly distributed random variable on SO(3) for all k. Both Algorithm 2 and Algorithm 3 can be applied to this problem. Assumptions (a) and (b) of both Proposition 4.1 and Proposition 4.3 hold on all small enough neighborhoods of the sets C_0 and C(ξ) [56, Example 3.6]. Indeed, for this application τ_f = 0 since C_0 is convex, and τ_{g,ξ^g_k} in (51) can be estimated from the distance of ρ to the set C(ξ^g_k), which is easy to calculate. Deterministic versions of Algorithm 2 have been fully studied in [56], though this algorithm is not recommended for this problem in particular due to the frequent occurrence of local minima. In [53] the fixed points of the deterministic version of Algorithm 3 for the phase retrieval problem have been characterized, and metric subregularity of the transport discrepancy (14) has been determined for geometries applicable to cone and sphere problems [54] such as this. So for a majority of relevant instances, there is good reason to expect that Propositions 4.1 and 4.3 can be provably applied to X-FEL measurements. The determination of the local domain G in condition (c) of Proposition 4.1 and Assumption 4.2 is therefore key. There are some unresolved cases, however, that are relevant for optical phase retrieval (see [53, Example 5.4]), and these need further study.
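A hedged one-dimensional sketch of the two projectors involved (our own toy model: the problem size, noiseless data, and the standard magnitude-projection formula are assumptions, and the true projector onto C(ξ) requires extra care at points where Fourier coefficients vanish):

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
rho_true = np.abs(rng.standard_normal(n))   # toy "electron density": real, nonnegative
phi = np.abs(np.fft.fft(rho_true)) ** 2     # noiseless scattering intensities

def P_data(rho, phi, eps=1e-12):
    # Magnitude projection onto {rho : |F rho|^2 = phi}: keep the phase of
    # F(rho), impose the measured magnitudes sqrt(phi).
    F = np.fft.fft(rho)
    return np.fft.ifft(np.sqrt(phi) * F / np.maximum(np.abs(F), eps))

def P_qual(rho):
    # Convex qualitative constraint C_0: real-valued and nonnegative.
    return np.maximum(rho.real, 0.0)

rho = P_qual(P_data(rng.standard_normal(n), phi))   # one alternating-projections step
```

Algorithm 2 with indicator functions reduces to exactly such alternating steps, while Algorithm 3 composes the corresponding reflectors; in either case P_data is the projector onto a nonconvex, prox-regular set.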

Stochastic Proximal Algorithms in Tree Space
We conclude with proximal splitting in locally compact Hadamard spaces. Our target application in this setting is the computation of Fréchet means in what is known as the BHV space, a model space for phylogenetic trees [14]. This is a Hadamard space consisting of stratified Euclidean spaces. Each stratum of the space consists of all trees (connected, acyclic, undirected graphs) with the same sets of nodes and edges, but different edge lengths connecting the various nodes. The leaves of such trees are individual species, and the nodes connecting the species are points of common genetic ancestry; the lengths of the edges between nodes represent the time required for the observed genetic variation to occur. A particular tree relationship depends on what part of the genome is being compared; measurements from two different parts of the genome will generally lead to two different phylogenetic trees. If two trees belong to the same stratum, i.e. if the trees differ only in edge lengths, then a geodesic connecting these trees is simply a line between two points in a Euclidean space; otherwise, the geodesic passes from one stratum to another through the origin, i.e. by shrinking an edge length to zero and/or by extending an edge from the origin.
The distance d(x, y) between two points x and y in this space is the length of the geodesic connecting them, and this can be decomposed into the sum of piecewise Euclidean distances of the geodesic confined to individual strata. The distance function is convex. The task we consider in this example is that of computing the average, or more precisely the Fréchet mean, of more than two trees in the BHV space: given N trees x_1, x_2, . . . , x_N, find x^* that solves

minimize_x Σ_{i=1}^N d(x, x_i)^2.

Readers interested in learning more about the application to phylogenetic trees are referred to [14] and references therein; our primary interest is in the mathematical structure of the problem and demonstrating the theory developed in the previous sections.
Let (H, d) be a Hadamard space, and let f_i : H → R be proper, lower semicontinuous convex functions for i = 1, 2, . . . , N. The Fréchet mean problem is an instance of the following convex optimization problem:

minimize_{x ∈ H} Σ_{i=1}^N f_i(x).

In a Hadamard space the proximal mapping of a function f is defined by (43), with the only difference that the domain of f is H. Note that the prox mapping is well-defined even though the resolvent of the subdifferential is not, since H is not a linear space. This has been studied in CAT(0) spaces in [3,9,42] and in the Hilbert ball in [46]. In these earlier works it was already known that proximal mappings of lower semicontinuous convex functions are (everywhere) α-firmly nonexpansive with α = 1/2. In [12] a randomized proximal splitting algorithm is studied under the assumption that the proximal parameters λ in (43) decay to zero, in which case the prox mapping converges to the identity mapping. The theory developed above allows us to conclude convergence to an invariant probability measure without the assumption that the prox mapping converges to the identity.
Letting ξ be a uniformly distributed random variable with values in {1, 2, . . . , N}, the backward-backward splitting method applied to this problem yields Algorithm 4. A random version of this algorithm with diminishing constants λ_{ξ_k} was shown to converge in [12, Theorem 3.7].
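In Euclidean space, the simplest Hadamard space, the random backward step of Algorithm 4 can be sketched as follows (our own illustration; the data and the fixed prox parameter are assumptions, and the update is the Euclidean specialization of the geodesic prox step toward a randomly chosen point). With a fixed λ the iterates do not converge to a point, but their distributions converge to an invariant measure whose mean is the Fréchet mean.

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.standard_normal((10, 3))   # stand-ins for the trees x_1, ..., x_N
lam = 0.5                            # fixed prox parameter (no decay required)

# In R^n the prox of f_i = d(., x_i)^2 with parameter lam moves along the
# geodesic (here: the line segment) toward x_i:
# prox(x) = (x + 2*lam*x_i) / (1 + 2*lam).
X = pts[0].copy()
running = np.zeros(3)
iters = 200_000
for k in range(iters):
    xi = rng.integers(len(pts))
    X = (X + 2 * lam * pts[xi]) / (1 + 2 * lam)   # backward (prox) step
    running += X
time_avg = running / iters
frechet_mean = pts.mean(axis=0)      # in Euclidean space the Fréchet mean is the arithmetic mean
```

The long-run time average of the chain recovers the Fréchet mean to within Monte Carlo error, illustrating convergence in distribution without diminishing λ.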
In a Euclidean setting this assumption for this problem can be shown to hold by demonstrating that the discrete set of points {x_1, . . . , x_N} is (trivially) subtransversal [56, Definition 3.2 and Proposition 3.4]. We conjecture that this holds in the present setting of randomized algorithms on Hadamard spaces, but to show this would take more space than we have here.
This example does not use the full potential of our framework since the problem is convex, and hence the proximal mappings are all α-firmly nonexpansive with constant α = 1/2. One prominent nonconvex problem in this setting to which our framework can be applied is the K-means clustering problem: find the best assignment of a sample of N trees {x_1, x_2, . . . , x_N} to K clusters. In this problem one begins with N clusters (the individual trees) and reduces the number of clusters by assigning nearest neighbors to a single cluster whose center is the Fréchet mean. The operation of assigning a tree to a cluster is expansive, since two trees arbitrarily close to a point that is equidistant to more than one cluster will be assigned to distinct clusters with possibly very distant Fréchet means.

The next definition uses the update function Φ defined in Assumption 2.1(b) and is an extension of [38, Definition 2.8].

Definition 3.1 (pointwise almost (α-firmly) nonexpansive in expectation). Let (G, d) be a CAT(0) metric space, let T_i : G → G for i ∈ I, and let Φ :

Proposition 4.3.
(c) T^{DR}_i is a self-mapping on G ⊂ R^n for all i.
Under Assumption 4.2 the following hold.
(i) For all i ∈ I_f × I_g the mapping T^{DR}_i defined by (50) is aα-fne on G with constant α = 1/2 and violation at most