Geometric Aspects of Data-Processing of Markov Chains

We examine data-processing of Markov chains through the lens of information geometry. We first establish a theory of congruent Markov morphisms within the framework of stochastic matrices. Specifically, we introduce and justify the concept of a linear right inverse (congruent embedding) for lumping, a well-known operation used in Markov chains to extract coarse information. Furthermore, we inspect information projections onto geodesically convex sets of stochastic matrices, and show that under some conditions, projecting (m-projection) onto doubly convex submanifolds can be regarded as a form of data-processing. Finally, we show that the family of lumpable stochastic matrices can be meaningfully endowed with the structure of a foliated manifold and motivate our construction in the context of embedded models and inference.


Introduction
The information divergence rate of two stochastic processes Y = (Y_t)_{t∈N}, Y' = (Y'_t)_{t∈N} on some finite space Y measures the average discrepancy between the processes in unit time, and is defined by

$$D(\mathbf{Y} \,\|\, \mathbf{Y}') = \lim_{t \to \infty} \frac{1}{t} D\left(Y_1, \dots, Y_t \,\|\, Y'_1, \dots, Y'_t\right),$$

where D denotes the Kullback-Leibler divergence [Kullback and Leibler, 1951]. Information monotonicity dictates that merging symbols in Y and recording the processes on the resulting smaller space X must lead to a decrease in their divergence, originating from an information loss. When the two processes under consideration are independent and identically distributed (iid) according to discrete distributions µ, ν ∈ P(Y), this property can be represented by the action of a memoryless channel W : P(Y) → P(X),

D(µ ‖ ν) ≥ D(µW ‖ νW). (1)

We can alternatively consider embeddings of distributions into some larger space Z, and it is known that we achieve equality in (1) for any pair of distributions if and only if the embedding belongs to the class of congruent Markov morphisms [Čencov, 1978].
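As a small numerical sanity check of the data-processing inequality (1), the following self-contained Python sketch merges two symbols through a deterministic channel and compares divergences (the helper names `kl` and `push` are ours, not from the paper):

```python
import math

def kl(mu, nu):
    """Kullback-Leibler divergence D(mu || nu); conventionally 0*log(0/q) = 0."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def push(mu, W):
    """Action mu -> mu W of a row-stochastic channel W on a distribution mu."""
    return [sum(mu[x] * W[x][z] for x in range(len(mu)))
            for z in range(len(W[0]))]

mu = [0.2, 0.5, 0.3]
nu = [0.4, 0.4, 0.2]
W = [[1.0, 0.0],   # channel merging symbols 1 and 2 into a single output symbol
     [0.0, 1.0],
     [0.0, 1.0]]
```

One can check that `kl(mu, nu)` dominates `kl(push(mu, W), push(nu, W))`, as (1) prescribes.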
While the divergence rate between two Markov chains can still be expressed in terms of their transition kernels, Markov processes are more challenging to consider for several reasons. Firstly, processing them, even by merely merging symbols, can easily break the Markov property. Furthermore, although simulating the process resulting from the action of a Markov morphism in the iid setting is straightforward (existence of an operational definition), it is not clear which actions on Markov kernels would allow us, given some trajectory of observations, to simulate (requiring only basic operations) the trajectory obtained from the processed kernel. These considerations invite us to put forward the desiderata listed below. Data-processing of Markov chains (i) should be expressible in terms of the action of an operator upon its transition kernel, (ii) should preserve Markovianity, (iii) should have an operational definition.
The natural operation of reducing the state space whilst preserving Markovianity is commonly referred to as lumping. Lumping satisfies all of the above requirements, and will be our starting point.
Our first question concerns natural inverse operations of lumping, i.e. embedding a Markov chain in a possibly larger state space. This will lead us to define congruent morphisms with respect to lumping of Markov kernels, and result in a characterization of this class of morphisms in terms of Markov embeddings, the central notion we will introduce in Section 3.1.
Taking an information geometry perspective, we then view irreducible Markov kernels as dually flat information manifolds [Nagaoka, 2005]. In this context, embeddings are, in turn, injective morphisms, i.e. structure-preserving maps. It will be interesting to determine what structure (Fisher information metric, dual affine connections, e-families, ...) is preserved under Markov embeddings. Natural questions we will answer include investigating the existence of larger classes of embeddings that preserve the same structure, albeit possibly sacrificing some of our desiderata; conversely, we will also find sub-classes of embeddings that preserve additional structure of interest, and understand how known embeddings (e.g. Hudson expansions) fit in our theory.
Finally, we will take a thorough look at the geometric structure of families of lumpable Markov kernels. Some families of Markov kernels are known to enjoy favorable properties, e.g. reversible Markov chains form an e-family and an m-family. Although it will become evident that lumpable kernels generally do not form e-families or m-families, we will show that we can still endow the family with the structure of a foliated manifold, and offer an interpretation of this construction.

Related work
The question of whether processing a Markov chain by a function retains its Markovianity can be traced back to Burke and Rosenblatt [1958] (see also Pitman and Rogers [1981]). Chains that satisfy this property under merging some of their states were later termed lumpable by Kemeny and Snell [1983]. A complete survey of lumpability is beyond the scope of this paper, but we mention the works of Rubino [1989] and Buchholz [1994]. The notion was also extended in several ways, for instance quasi-lumpability [Franceschinis and Muntz, 1994], where the transition matrix is lumpable modulo some perturbation, or higher-order lumpability [Gurvits and Ledoux, 2005, Geiger and Temmel, 2014], where a lumped Markov chain may lose its first-order Markov property, but retains a kth-order Markov property. The problem of lumpability is also directly related to that of identifiability of hidden Markov sources [Ito et al., 1992, Kabayashi et al., 1991, Hayashi, 2019].
Following an axiomatic approach, Čencov [1978] first introduced, motivated and analyzed Markov morphisms, as the statistical mappings of interest for data-processing. More recently, similar approaches have been taken for conditional models [Lebanon, 2005, 2004, Montúfar et al., 2014], which put forward several classes of embeddings. Although a Markov kernel corresponds to some conditional model, in this paper we also think of it as a stochastic process. This invites us to consider a more restricted class of natural embeddings than that of the aforementioned works.
Exponential tilting of stochastic matrices can be traced back to Miller [1961]. The large deviation theory for Markov chains was developed to a greater extent by Donsker and Varadhan [1975], Gärtner [1977], Dembo and Zeitouni [1998]. Csiszár et al. [1987] first recognized the exponential structure of the family of irreducible kernels, and Nagaoka [2005] later gave a full treatment in the language of information geometry. We refer the reader to the related work section of Wolfer and Watanabe [2021] for a brief history of the series of works [Ito and Amari, 1988, Takeuchi and Barron, 1998, Takeuchi and Kawabata, 2007, Takeuchi and Nagaoka, 2017, Nakagawa and Kanaya, 1993] that contributed to this construction.
The first appearance of the Pythagorean theorem for e-projections onto m-families of distributions can be found in Čencov [1968]. For a complete treatment of the theory of projections onto α-families of distributions, we refer the reader to the excellent monograph of Amari and Nagaoka [2007]. In the context of Markov chains, Pythagorean identities for orthogonal projections can be found in Hayashi and Watanabe [2016]. In Wolfer and Watanabe [2021], closed-form expressions are given for the e/m-projections onto the e/m-family of reversible Markov chains.
More general information projections onto convex sets of distributions are also a well-studied topic. The Pythagorean inequality in this context is credited to Csiszár [1975], Topsøe [1979] (see also Csiszár [1984]). Inequalities involving the reverse information projection have also been devised, for example a four-point property in Csiszár and Tusnády [1984], or a Pythagorean inequality on log-convex sets in Csiszár and Matúš [2003]. The reader is invited to consult Csiszár et al. [2004] for a complete exposition. In the Markovian setting, we mention Boza [1971], Csiszár et al. [1987] who considered information projections of Markov kernels onto convex sets of edge measures. To the best of our knowledge, reverse information projections onto e-convex sets of Markov kernels had not yet been analyzed.

Outline and main contributions
In Section 2, we set out our notations, recall the information geometric structure of irreducible Markov kernels introduced by Nagaoka [2005], and briefly discuss the alternative characterization of lumping of Markov chains.
In Section 3.1, we first constructively extend the well-known notion of a Markov morphism in the context of distributions, to that of a Markov morphism in the context of Markov kernels (Definition 3.2). We further show that the latter preserves the Fisher information and dual affine connections on irreducible kernels (Lemma 3.1). In Section 3.2, we develop a theory of lumpable linear operators, and define congruent embeddings (Definition 3.4) in this context. We proceed to state and prove that congruent embeddings are exactly the Markov morphisms we constructed (Theorem 3.1). In Section 3.3, we expand the class of Markov embeddings to the larger class of exponential embeddings. Although the latter are generally not isometric, we show that they still preserve e-structures within families of irreducible kernels (Theorem 3.2). In Section 3.4, we explore a few notable classes of embeddings. We begin with the special case of Hudson embeddings, presented in Kemeny and Snell [1983] as the natural inverse operation to lumping. We express Hudson embeddings as Markov embeddings, and propose an interpretation of the embedding of a family of irreducible kernels as a first-order subfamily of second-order Markov kernels. We continue by inspecting a natural subset of Markov embeddings, which we term memoryless Markov embeddings, and which also preserve the m-structure of irreducible kernels, as well as reversibility. In particular, we show that this class is already rich enough to embed any rational stochastic kernel into the set of bistochastic kernels (Corollary 3.2). We close this section by more systematically investigating reversibility preservation of Markov embeddings, and discuss some of the advantages that arise from this property. The reader is invited to consult Table 1 for a complete nomenclature of the numerous classes of embeddings we treat, and Figure 3 for an illustration of their hierarchical structure.
In Section 4, we complement the theory of information projection on geodesically convex sets of Markov kernels by procuring a Pythagorean inequality for reverse information projections onto e-convex sets (Proposition 4.2). We further explore the Markovian equivalent of the four-point property, and show that under favorable conditions, the m-projection of two kernels onto an em-family can be regarded as a form of data-processing.
In Section 5, we analyze the family of lumpable kernels from an information geometric standpoint. Although this family forms neither an e-family nor an m-family in general, we show that it can be endowed with the structure of a foliated manifold (Theorem 5.1) and determine its dimension. We motivate this construction by considering Pythagorean projections on leaves, and propose an interpretation in the context of estimation for embedded models.
Finally, Section 6 briefly discusses compositions of embeddings, and higher-order lumping/embedding. We show, by example, that composition opens the door to a class of embeddings that is significantly richer than the Markov embeddings.

Notation and preliminaries
We write [n] = {1, 2, . . . , n} for n ∈ N, and δ[·] for the {0, 1}-valued predicate indicator function. Let X, Y be sets such that |X| = n, |Y| = m with n < ∞, m < ∞, where to avoid trivialities, we additionally assume that n, m > 1. We denote P(X) the probability simplex over X, and P_+(X) = {µ ∈ P(X) : ∀x ∈ X, µ(x) > 0}. All vectors will be written as row-vectors; for x ∈ X, e_x is the unit vector verifying ∀x' ∈ X, e_x(x') = δ[x = x']. For real matrices A and B, ρ(A) is the spectral radius of A, A • B is the Hadamard product of A and B, and A > 0 (resp. A ≥ 0) means that A is an entry-wise positive (resp. non-negative) matrix. We will routinely identify a function f : X² → R with the linear operator f : R^X → R^X, and consider implicit sums A(x, S) = Σ_{s∈S} A(x, s) when S ⊂ X.

Irreducible Markov chains
We let (Y, E) be a strongly connected directed graph, where Y is the set of vertices, and E ⊂ Y² is the set of directed edges. Let F(Y, E) be the set of all real functions over the set E, identified with the totality of functions over Y² that are null outside of E, and let F_+(Y, E) ⊂ F(Y, E) be the subset of positive functions over E. For a fully connected graph, we write F(Y) = F(Y, Y²), and we identify F(Y) with the set of real square matrices of size |Y|. In particular, the following inclusions hold:

$$F_+(Y, E) \subset F(Y, E) \subset F(Y).$$

We write W(Y) for the set of (not necessarily irreducible) row-stochastic transition kernels over the state space Y, W(Y, E) for the subset of irreducible kernels whose support is E, and W_+(Y) when E = Y². For P ∈ W(Y), P(y, y') corresponds to the transition probability from state y to state y'. For P ∈ W(Y, E), there exists a unique π ∈ P_+(Y), such that πP = π [Levin et al., 2009, Corollary 1.17], called the stationary distribution of P. We write Q = diag(π)P for the edge measure matrix [Levin et al., 2009, (7.5)], which encodes stationary pair-probabilities of P, i.e. Q(y, y') = P_π(Y_t = y, Y_{t+1} = y'). Following notations in Wolfer and Watanabe [2021], we will respectively denote the subsets of reversible, doubly stochastic, symmetric, and memoryless transition kernels that are irreducible over (Y, E). We define a stochastic rescaling mapping s that constructs a proper irreducible stochastic matrix from any non-negative irreducible matrix over (Y, E),

$$s(A)(y, y') = \frac{A(y, y')\, v_A(y')}{\rho_A\, v_A(y)},$$

where ρ_A and v_A are respectively the Perron-Frobenius (PF) root and right PF eigenvector of A, which we will henceforth refer to as the right PF pair. The mapping s is invariant under scaling of the argument by a positive constant or conjugation of the argument by a diagonal matrix. Namely, for any c > 0 and any positive diagonal matrix Δ,

$$s(c\, \Delta^{-1} A \Delta) = s(A). \qquad (3)$$
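The rescaling s can be sketched in pure Python via power iteration for the right PF pair. The explicit formula for s is partially garbled in our copy; the sketch below uses the standard rescaling A(y, y') v(y') / (ρ v(y)), which matches the stated invariances, and it assumes the input matrix is primitive so that power iteration converges (all names are ours):

```python
def pf_pair(A, iters=3000):
    """Approximate the right Perron-Frobenius pair (rho, v) of a non-negative
    primitive matrix A by power iteration with max-normalization."""
    n = len(A)
    v = [1.0] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)
        v = [wi / rho for wi in w]
    return rho, v

def rescale(A):
    """Stochastic rescaling: s(A)(y, y') = A(y, y') v(y') / (rho * v(y))."""
    rho, v = pf_pair(A)
    n = len(A)
    return [[A[i][j] * v[j] / (rho * v[i]) for j in range(n)] for i in range(n)]
```

For a matrix that is already stochastic, ρ = 1 and v is the all-ones vector, so s acts as the identity, as expected.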

Information geometry of Markov kernels
Following Nagaoka [2005], we take a differential geometry perspective, and view W(Y, E) as a smooth manifold of dimension

$$d = |E| - |Y|, \qquad (4)$$

where for each P ∈ W(Y, E), we endow the tangent space T_P at P with a d-dimensional vector space structure (see e.g. Wolfer and Watanabe [2021, Section 3.1] for an argument on how to derive (4)). On W(Y, E), we can introduce the Riemannian metric g, expressed in some chart θ by

$$g_{ij}(\theta) = \sum_{(y, y') \in E} Q_\theta(y, y')\, \partial_i \log P_\theta(y, y')\, \partial_j \log P_\theta(y, y'), \qquad (5)$$

where ∂_i = ∂/∂θ_i, as well as the pair of torsion-free affine connections ∇^{(e)} and ∇^{(m)}, respectively termed e-connection and m-connection, expressed by their Christoffel symbols as

$$\Gamma^{(e)}_{ij,k}(\theta) = \sum_{(y, y') \in E} Q_\theta(y, y')\, \partial_i \partial_j \log P_\theta(y, y')\, \partial_k \log P_\theta(y, y'), \quad \Gamma^{(m)}_{ij,k}(\theta) = \sum_{(y, y') \in E} \partial_i \partial_j Q_\theta(y, y')\, \partial_k \log P_\theta(y, y'). \qquad (6)$$

The Fisher metric and connections defined in (5) and (6) are natural counterparts of the ones defined in the context of distributions [Amari and Nagaoka, 2007]. The connections ∇^{(e)} and ∇^{(m)} are dual with respect to g in the sense that for any vector fields X, Y, Z,

$$X g(Y, Z) = g(\nabla^{(e)}_X Y, Z) + g(Y, \nabla^{(m)}_X Z).$$
The tuple (W(Y, E), g, ∇^{(e)}, ∇^{(m)}) encodes the information geometric structure of W(Y, E), in the sense that it defines the notions of straight lines, parallelism, and distances. The information divergence of a kernel P from another kernel P' is given by

$$D(P \,\|\, P') = \sum_{(y, y') \in E} \pi(y) P(y, y') \log \frac{P(y, y')}{P'(y, y')}, \qquad (7)$$

while the dual divergence verifies D^*(P ‖ P') = D(P' ‖ P). Notably, (7) corresponds to the divergence rate of the Markov processes induced from P and P', where for k ∈ N,

$$P^{(k)}(y_1, \dots, y_k) = \pi(y_1) \prod_{t=1}^{k-1} P(y_t, y_{t+1})$$

defines the distribution of stationary paths of length k induced from P.
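For concreteness, the divergence (7) between two kernels on a common support can be computed numerically as follows. This is an illustrative pure-Python sketch (not from the paper): it assumes aperiodicity so that the stationary distribution can be obtained by power iteration, and the function names are ours:

```python
import math

def stationary(P, iters=3000):
    """Stationary distribution of an irreducible aperiodic kernel,
    approximated by iterating pi <- pi P from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[y] * P[y][x] for y in range(n)) for x in range(n)]
    return pi

def divergence_rate(P, Pp):
    """Information divergence D(P || P') of equation (7):
    sum over edges of pi(y) P(y,y') log(P(y,y') / P'(y,y'))."""
    pi = stationary(P)
    n = len(P)
    return sum(pi[y] * P[y][z] * math.log(P[y][z] / Pp[y][z])
               for y in range(n) for z in range(n) if P[y][z] > 0)
```

As a sanity check, the divergence of a kernel from itself vanishes, and the divergence between distinct kernels is positive.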

Mixture family and exponential family
Definition 2.1 (m-family of transition kernels [Amari and Nagaoka, 2007, 2.35]). We say that a family of irreducible transition kernels V_m is a mixture family (m-family) of irreducible transition kernels on (Y, E) when the following holds. There exist C, F_1, . . . , F_d ∈ F(Y, E), with Σ_{(y,y')∈E} C(y, y') = 1 and Σ_{(y,y')∈E} F_i(y, y') = 0 for all i ∈ [d], such that

$$V_m = \left\{P_\xi : Q_\xi = C + \sum_{i=1}^{d} \xi_i F_i,\ \xi \in \Xi\right\},$$

where Q_ξ is the edge measure that pertains to P_ξ. Note that Ξ is an open set, ξ is called the mixture parameter, and d is the dimension of the family V_m. See also [Wolfer and Watanabe, 2021, Definition 1] for alternative equivalent definitions of an m-family.
Definition 2.2 (e-family of transition kernels). Let Θ ⊂ R^d be some open connected parameter space. We say that the parametric family of irreducible transition kernels V_e = {P_θ : θ ∈ Θ} is an exponential family (e-family) of transition kernels with natural parameter θ, when there exist functions K, g_1, . . . , g_d ∈ F(Y, E), R : Θ × Y → R and ψ : Θ → R, such that for any θ ∈ Θ and (y, y') ∈ E,

$$\log P_\theta(y, y') = K(y, y') + \sum_{i=1}^{d} \theta_i\, g_i(y, y') + R(\theta, y') - R(\theta, y) - \psi(\theta).$$

Fixing some θ ∈ Θ, we will write for convenience ψ_θ for ψ(θ) and R_θ for R(θ, ·) ∈ R^Y.
Remark 2.1. An e-family V_e can be identified with some affine space as follows. Denote

$$N(Y, E) = \left\{f \in F(Y, E) : \exists c \in \mathbb{R},\ r \in \mathbb{R}^Y,\ f(y, y') = c + r(y') - r(y)\right\}.$$

Then N(Y, E) is an |Y|-dimensional vector space [Nagaoka, 2005, Section 3], and we can consider the quotient linear space

$$G(Y, E) = F(Y, E) / N(Y, E).$$

We identify a coset of G(Y, E) with a representative function in that coset. For Definition 2.2, and unless stated otherwise, we will assume that g_1, . . . , g_d form an independent family in G(Y, E). In this case dim V_e = d [Nagaoka, 2005, Theorem 2].

Lumping of Markov chains
Let P ∈ W(Y, E), and let Y_1, Y_2, . . . , Y_k be sampled from P. For φ : Y → X, a possibly random mapping, we call data processing the application of φ onto the trajectory sampled from P.
This corresponds to the action of a memoryless black box that takes a Markovian trajectory as input, and returns the image stochastic process. Note that the output process is generally not Markovian itself [Kelly, 1982], but corresponds to a functional hidden Markov model. We will consider multiletter extensions of this model in Section 6.
Lumping is a particular type of deterministic processing that projects a chain onto a state space of smaller size where some symbols are merged together. In the distribution setting, this operation is also referred to as a statistic [Ay et al., 2017]. More formally, let us write a lumping as a surjective map

$$\kappa : Y \to X.$$

Observe that a lumping is completely characterized by a partition ⊔_{x∈X} S_x = Y, where for x ∈ X, S_x = κ^{-1}({x}) is the collection of symbols in Y that are mapped to the new symbol x. When the Markovian nature of the lumped process is preserved, regardless of the initial distribution, we say that P is κ-lumpable. In this case, defining the lumped edge set

$$D = \kappa^2(E) = \left\{(\kappa(y), \kappa(y')) : (y, y') \in E\right\},$$

there exists an irreducible "push-forward" kernel, denoted by κ_⋆P ∈ W(X, D), such that (κ(Y_t))_{t∈N} is sampled according to κ_⋆P. For a fixed lumping κ, we can then consider W_κ(Y, E) ⊂ W(Y, E), the set of all κ-lumpable irreducible kernels over (Y, E), as well as the push-forward map

$$\kappa_\star : W_\kappa(Y, E) \to W(X, D), \qquad P \mapsto \kappa_\star P.$$

The following theorem characterizes lumpable chains in terms of their transition kernel.
Theorem 2.1 (Kemeny and Snell [1983, Theorem 6.3.2]). Let κ : Y → X be a lumping function with associated partition ⊔_{x∈X} S_x = Y, and let P ∈ W(Y, E). Then P ∈ W_κ(Y, E) if and only if for all x, x' ∈ X, and for all y_1, y_2 ∈ S_x,

$$P(y_1, S_{x'}) = P(y_2, S_{x'}),$$

and in this case, κ_⋆P(x, x') = P(y, S_{x'}) for any y ∈ S_x.
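Theorem 2.1 translates directly into a finite check. A minimal Python sketch (our own function names; `kappa` is given as a list mapping each state y to its lump x):

```python
def is_lumpable(P, kappa, tol=1e-12):
    """Kemeny-Snell criterion: P is kappa-lumpable iff the row sum
    P(y, S_x') is constant over representatives y of each block S_x."""
    blocks = {}
    for y in range(len(P)):
        blocks.setdefault(kappa[y], []).append(y)
    for S in blocks.values():
        for Sp in blocks.values():
            sums = [sum(P[y][z] for z in Sp) for y in S]
            if max(sums) - min(sums) > tol:
                return False
    return True

def lump(P, kappa):
    """Lumped kernel: kappa_* P(x, x') = P(y, S_x') for any y in S_x.
    Assumes P is kappa-lumpable, so any representative y works."""
    blocks = {}
    for y in range(len(P)):
        blocks.setdefault(kappa[y], []).append(y)
    xs = sorted(blocks)
    return [[sum(P[blocks[x][0]][z] for z in blocks[xp]) for xp in xs]
            for x in xs]
```

For instance, merging states 1 and 2 of a 3-state kernel (`kappa = [0, 1, 1]`) succeeds exactly when rows 1 and 2 assign equal total mass to each block.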
Corollary 2.1. Denoting κ_⋆π and κ_⋆Q the respective stationary distribution and edge measure of the lumped kernel κ_⋆P, it is straightforward to verify that

$$\kappa_\star\pi(x) = \pi(S_x), \qquad \kappa_\star Q(x, x') = Q(S_x, S_{x'}).$$

When defining the lumping of a class of Markov kernels, it can be that W_κ(Y, E) = ∅. To avoid this trivial case, we must assume that E and κ are compatible in the sense of the next proposition.
Proof. This is easily verified with Theorem 2.1.

Kernel embeddings
Theorem 2.1 enables us to formulate lumping of a chain as a function of its kernel. In a similar spirit, we now define embeddings of Markov chains by viewing their kernels as first-class citizens. Namely, a Markov kernel embedding E_⋆ is a map from a submanifold of irreducible kernels V ⊂ W(X, D) to another family W(Y, E), such that E_⋆ is a diffeomorphism onto its image.
Lumping and embedding can thus essentially be seen as inverse operations. By abuse of notation, E_⋆π and E_⋆Q will denote respectively the stationary distribution and edge measure of the embedded kernel E_⋆P.

Markov embeddings
We first recall the notion of a Markov morphism in a probability distribution setting.

Definition 3.1 (Čencov [1978], Campbell [1986]). Let there be some partition ⊔_{x∈X} S_x = Y. To each x ∈ X, we associate W_x ∈ P(Y) concentrated on S_x. We then define the induced mapping, referred to as a Markov morphism, W_⋆ : P(X) → P(Y), where for any µ ∈ P_+(X), and y ∈ Y,

$$\mu W(y) = \mu(\kappa(y))\, W_{\kappa(y)}(y),$$

with κ sending y to the unique x such that y ∈ S_x.

Example 3.1. Let X = {0, 1} and µ = (η, 1 − η). We can embed µ into the larger space Y = {0, 1, 2} by considering the mapping induced from the channel with rows W_0 = (1, 0, 0) and W_1 = (0, p, 1 − p). The resulting distribution is then

$$\mu W = (\eta,\ (1 - \eta)\, p,\ (1 - \eta)(1 - p)).$$

We can also give a probabilistic definition for this embedding, where for a sequence of observations X_1, X_2, . . . sampled iid from µ, we record Y_t = 0 when X_t = 0, and flip a coin with bias p when X_t = 1, recording Y_t = 1 on heads and Y_t = 2 on tails. The new process Y_1, Y_2, . . . is drawn from the embedded distribution.
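A quick numerical illustration of Example 3.1 in Python, under our reading of the elided channel (W_0 puts all mass on symbol 0, and W_1 splits the mass of symbol 1 between the new symbols 1 and 2 with bias p); names are ours:

```python
def markov_morphism(mu, blocks, W):
    """Congruent embedding of a distribution: (mu W)(y) = mu(x) W_x(y) for
    y in S_x.  blocks[x] lists the symbols of S_x; W[x] is a distribution
    supported on blocks[x]."""
    out = {}
    for x, S in enumerate(blocks):
        for y, w in zip(S, W[x]):
            out[y] = mu[x] * w
    return [out.get(y, 0.0) for y in range(max(out) + 1)]

# Example 3.1 with eta = 0.3 and coin bias p = 0.6 (illustrative values):
eta, p = 0.3, 0.6
mu = [eta, 1 - eta]
res = markov_morphism(mu, blocks=[[0], [1, 2]], W=[[1.0], [p, 1 - p]])
```

The embedded distribution is (η, (1 − η)p, (1 − η)(1 − p)), and lumping symbols 1 and 2 back together recovers µ, consistently with congruency.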
We now introduce Markov morphisms in the context of Markov kernels, which we term Markov embeddings, and which embed a stochastic matrix over some space X into a stochastic matrix over a larger space Y.
Definition 3.2 (Markov embedding). We call a Markov embedding a mapping Λ_⋆ : W(X, D) → W(Y, E), specified by (i) a lumping function κ : Y → X, (ii) a support condition on Λ ensuring that irreducibility is preserved, and (iii) writing ⊔_{x∈X} S_x = Y for the associated partition of κ, a matrix Λ such that for any y ∈ Y and x' ∈ X,

$$(\kappa(y), x') \in D \implies (\Lambda(y, y'))_{y' \in S_{x'}} \in P(S_{x'}).$$
Note that if E and κ fail to satisfy the condition of Proposition 2.1, then no Λ satisfying (iii) exists. It is instructive to observe that a valid Λ corresponds to a block matrix, where each block W_{x,x'} is either a channel from S_x to S_{x'} when (x, x') ∈ D, or is set to 0 when (x, x') ∉ D. It is straightforward to verify that Markov embeddings are well-defined in the sense that a stochastic matrix irreducible over (X, D) is mapped to a stochastic matrix in W(Y, E). Condition (ii) ensures that Markov embeddings preserve irreducibility. Crucially, when P ∈ W(X, D), Λ_⋆P ∈ W_κ(Y, E), where κ is the lumping function associated with the embedding defined at Definition 3.2-(i). We say that an embedding is κ-compatible when it produces κ-lumpable kernels.

Suppose we are interested in a finer model, where we now consider two types of rainy states, Showers and Thunderstorm. This corresponds to splitting the state Rain, and refining the transition probabilities to the newly defined states. We can represent this splitting operation naturally by a Markov embedding where S_0 = {0}, S_1 = {1, 2}; see Figure 1. Any elements P ∈ W(X, D) and P' ∈ W(Y, E) can be written in terms of some p, q, q_1, q_2 ∈ (0, 1), and the only possible Markov embedding is defined by the matrix Λ. A typical element P_κ ∈ W_κ(Y, E) can then be expressed in terms of some p, q ∈ (0, 1) and the lumping function κ. In this case, Markov embeddings are defined by κ and a matrix Λ that enjoys a degree of freedom λ ∈ (0, 1).
We now give a more operational definition of a Markov embedding by showing that it can be interpreted as randomly embedding the sequence of observations of a Markov chain into a larger space.
Let P ∈ W(X, D), and µ ∈ P_+(X) some initial distribution. Consider a single Markovian trajectory X_1, X_2, . . . sampled according to P and started from µ. Let ν ∈ P_+(Y) be such that for any x ∈ X, ν(S_x) = µ(x), and for x ∈ X, define the conditional probability distribution ν_{|x} ∈ P(S_x),

$$\nu_{|x}(y) = \frac{\nu(y)}{\nu(S_x)}, \qquad y \in S_x.$$

We can verify that one can simulate a trajectory Y_1, . . . , Y_t, . . . sampled according to the embedded kernel Λ_⋆P and with initial distribution ν as follows: sample Y_1 from ν_{|X_1}, and for t ≥ 1, sample Y_{t+1} within S_{X_{t+1}} according to (Λ(Y_t, y'))_{y' ∈ S_{X_{t+1}}}. Markov embeddings of chains can therefore essentially be simulated from a single trajectory of the original chain, similar to lumpings. In certain cases it is possible to obtain an expression for the stationary distribution Λ_⋆π of the embedded chain (see e.g. Lemma 3.5, Lemma 3.8). Setting ν = Λ_⋆π then starts the embedded chain stationarily.
In the subsequent lemma, we show that the Fisher metric, dual connections and information divergence are preserved under Markov embeddings.
and consider the embedded kernels P̃_θ ≐ Λ_⋆P_θ and P̃'_θ ≐ Λ_⋆P'_θ. For any i, j ∈ [d], the Fisher metric, the Christoffel symbols of ∇^{(e)} and ∇^{(m)}, and the information divergence are preserved under the embedding.

Proof. Let κ and ⊔_{x∈X} S_x = Y be the associated lumping function and partition of the Markov embedding. For all y, y' ∈ Y and i ∈ [d], it holds that ∂_i log P̃_θ(y, y') = ∂_i log P_θ(κ(y), κ(y')), since Λ does not depend on θ. It follows from Corollary 2.1 that g is preserved. We proceed to prove conservation of the e-connection similarly, where the last equality follows from Corollary 2.1. Invariance of the m-connection and information divergence can be proven using similar arguments.
We conclude this section by showing that we can always view a lumpable matrix as the image of its lumped version by some canonical Markov embedding.

Lemma 3.2 (Canonical embedding). Let P ∈ W_κ(Y, E). There exists a κ-compatible Markov embedding, denoted by Λ^{(P)}_⋆, such that P = Λ^{(P)}_⋆ κ_⋆P. The embedding Λ^{(P)}_⋆ is determined by κ and Λ^{(P)} ∈ F_+(Y, E) such that for any (y, y') ∈ E,

$$\Lambda^{(P)}(y, y') = \frac{P(y, y')}{\kappa_\star P(\kappa(y), \kappa(y'))}.$$
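Lemma 3.2 suggests a simple round-trip test: recover Λ^{(P)} from a lumpable P, then verify that embedding the lumped kernel gives back P. The sketch below uses the product form Λ_⋆Q(y, y') = Q(κ(y), κ(y')) Λ(y, y'), which is what Lemma 3.2 implies (the explicit formula of Definition 3.2 is not fully legible in our copy); function names are ours:

```python
def lump(P, kappa):
    """Lumped kernel kappa_* P, assuming P is kappa-lumpable."""
    blocks = {}
    for y in range(len(P)):
        blocks.setdefault(kappa[y], []).append(y)
    xs = sorted(blocks)
    return [[sum(P[blocks[x][0]][z] for z in blocks[xp]) for xp in xs]
            for x in xs]

def canonical_embedding(P, kappa):
    """Lemma 3.2: Lambda^(P)(y, y') = P(y, y') / kappa_* P(kappa(y), kappa(y'))
    on the support of P."""
    Q = lump(P, kappa)
    n = len(P)
    return [[P[y][z] / Q[kappa[y]][kappa[z]] if P[y][z] > 0 else 0.0
             for z in range(n)] for y in range(n)]

def apply_embedding(Q, Lam, kappa):
    """Markov embedding in product form:
    (Lambda_* Q)(y, y') = Q(kappa(y), kappa(y')) * Lambda(y, y')."""
    n = len(Lam)
    return [[Q[kappa[y]][kappa[z]] * Lam[y][z] for z in range(n)]
            for y in range(n)]
```

Applying `apply_embedding` to `lump(P, kappa)` with `canonical_embedding(P, kappa)` reconstructs P exactly, as Lemma 3.2 asserts.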

Congruent embeddings
Let us fix a lumping function κ : Y → X, and compatible edge sets E and D = κ²(E). Consider an arbitrary embedding E_⋆ : W(X, D) → W(Y, E), that does not necessarily follow the prescribed structure of Definition 3.2. When composing the embedding with its associated lumping always yields back the original chain (κ_⋆E_⋆P = P), E_⋆ is a right-inverse of κ_⋆. Adapting the terminology of Campbell [1986], Čencov [1981], we will say in this case that E_⋆ is a κ-congruent embedding (Definition 3.4). In a finite space distribution setting, Markov morphisms and congruent embeddings are known to coincide (see e.g. Example 5.2 in Ay et al. [2017]). The proof strategy for this claim consists in expanding the notion of Markov morphisms to general positive measures, and proving the claim for morphisms seen as linear operators over a real vector space. We will show that a similar fact holds in the Markovian setting and for our definition of Markov morphisms (Definition 3.2). We begin by extending the definition of lumpable Markov kernels to more general lumpable matrices.
Definition 3.3 (κ-lumpable matrix). Let κ : Y → X be a lumping function with associated partition ⊔_{x∈X} S_x = Y, and let A ∈ F(Y, E). Then A is a κ-lumpable matrix if and only if for all x, x' ∈ X, and for all y_1, y_2 ∈ S_x, A(y_1, S_{x'}) = A(y_2, S_{x'}).
In this case, the lumped matrix κ_⋆A is such that for any (x, x') ∈ D and any y ∈ S_x, κ_⋆A(x, x') = A(y, S_{x'}). We write F_κ(Y, E) ⊂ F(Y, E) for the subset of all κ-lumpable matrices.
Recall that F(Y, E) can be endowed with a real vector space structure of dimension dim F(Y, E) = |E|. Our next step consists in viewing F_κ(Y, E) as a linear subspace of F(Y, E).
Then, for all x, x' ∈ X, and for all y_1, y_2 ∈ S_x, by operations on matrices,

$$(\alpha A + \beta B)(y_1, S_{x'}) = \alpha A(y_1, S_{x'}) + \beta B(y_1, S_{x'}) = \alpha A(y_2, S_{x'}) + \beta B(y_2, S_{x'}) = (\alpha A + \beta B)(y_2, S_{x'}),$$

so F_κ(Y, E) is a subspace of F(Y, E), and (i) holds. Moving on to (ii), let A, B ∈ F(Y, E), and α, β ∈ R. For any x, x' ∈ X, and y ∈ S_x, κ_⋆(αA + βB)(x, x') = α κ_⋆A(x, x') + β κ_⋆B(x, x'), thus κ_⋆ is a linear map. In order to prove surjectivity in (ii) and claim (iii), we proceed to construct a basis. Taking the total order on Y = [m] induced from the natural numbers, and for (x, x') ∈ D, we write

$$R_{x, x'} = \left\{(y, y') \in E : y \in S_x,\ y' \in S_{x'},\ y' \neq \bar{y}(S_{x'}, y)\right\}, \quad \text{where } \bar{y}(S_{x'}, y) \doteq \max\left\{y' \in S_{x'} : (y, y') \in E\right\}.$$
For simplicity, we will use the shorthands ȳ' = ȳ(S_{x'}, y) and ȳ'_0 = ȳ(S_{x'_0}, y_0).
and there exist real coefficients (a_{y,y'})_{(y,y')∈E} such that A = Σ_{(y,y')∈E} a_{y,y'} E_{y,y'}. Since A is κ-lumpable, it holds that for any (x, x') ∈ D, and any y_1, y_2 ∈ S_x, A(y_1, S_{x'}) = A(y_2, S_{x'}). By decomposition, B(κ) is also a generating family for F_κ(Y, E). Hence κ_⋆ is surjective, and from the rank-nullity theorem,

$$\dim F_\kappa(Y, E) = \dim F(X, D) + \dim \operatorname{Ker} \kappa_\star,$$

which is oblivious to the exact partition defined by κ, and only depends on the alphabet sizes of its domain and range.
We can expand the domain of Markov embeddings in Definition 3.2 to subsets of F(X, D), and verify that embedded matrices are lumpable (Definition 3.3). It is noteworthy that Proposition 2.1 seamlessly extends to F_κ(Y, E). Inspired by the definition of congruent mappings in the context of distributions [Ay et al., 2017, Definition 5.1] and statistics in the sense of [Ay et al., 2017, Section 5.1.1], we introduce embeddings of matrices that are congruent for lumpings.

Definition 3.4 (κ-congruent embedding). Let K : F(X, D) → F(Y, E) be a mapping. We say that K is a κ-congruent embedding when (i) K is a linear map.
(ii) K is monotonic in the sense that non-negative matrices are mapped to non-negative matrices, i.e. for any A ∈ F(X, D), A ≥ 0 ⟹ K A ≥ 0. (iv) K is a right inverse of κ_⋆, i.e. for any A ∈ F(X, D), κ_⋆K A = A.

The surjectivity of κ_⋆ together with Ker κ_⋆ ≠ {0} (Lemma 3.3) guarantees the existence of multiple right inverses to κ_⋆, i.e. potential candidates for congruent embeddings. We will now show that κ-congruent embeddings are exactly the Markov embeddings whose partition of states coincides with the one defined by κ.

Proof. Let K be a κ-congruent embedding. Recall the basis B(κ) of F_κ(Y, E) introduced in (10). Since K is a linear map, we can define it uniquely by the coordinates of the images of the basis vectors of F(X, D) onto the basis B(κ). Namely, for (x_0, x'_0) ∈ D, we write the coordinates as K^{x_0,x'_0}_{y,y'} and K^{x_0,x'_0}_{x,x'}, which are real numbers. Let (x, x') ∈ D, and (y, y') ∈ (S_x × S_{x'}) ∩ E. Since E_{x_0,x'_0} is non-negative, it follows from monotonicity that when y' ≠ ȳ', K^{x_0,x'_0}_{y,y'} is non-negative. From the requirement that κ_⋆K = Id_{F(X,D)}, linearity of κ_⋆ (Lemma 3.3-(ii)), and (12), we have for any (x_0, x'_0) ∈ D: on one hand, for (x, x') ≠ (x_0, x'_0), K^{x_0,x'_0}_{x,x'} = 0, and from (13), it follows that for any (y, y') ∈ R_{x,x'}, K^{x_0,x'_0}_{y,y'} = 0. On the other hand, Σ_{(y_0, ȳ'_0)∈E} K^{x_0,x'_0}_{y_0, ȳ'_0} ≤ 1 for any y_0 ∈ S_{x_0}, and from non-negativity, each individual coefficient K^{x_0,x'_0}_{y_0, ȳ'_0} is also in [0, 1]. We therefore obtain that for any (x, x') ∈ D, K E_{x,x'} expands over the basis vectors F_{y,y'}.

Exponential embeddings
We fix a lumping function κ : Y → X, compatible edge sets E and D = κ²(E) such that W_κ(Y, E) ≠ ∅, and consider κ_⋆ : W_κ(Y, E) → W(X, D). We now introduce another class of embeddings, which we call exponential embeddings. We show that this class preserves certain geometric features of families of kernels, and strictly encompasses the previously defined Markov embeddings.

Definition 3.5 (Exponential embedding). Let P̄ ∈ W_κ(Y, E), such that P' ≐ κ_⋆P̄ ∈ W(X, D). For a given P ∈ W(X, D), we let P̃ ∈ W(Y, E) be such that for any y, y' ∈ Y,

$$\tilde{P}(y, y') \doteq s(\bar{P} \bullet P^{\uparrow})(y, y'), \qquad \text{where } P^{\uparrow}(y, y') \doteq P(\kappa(y), \kappa(y')).$$

The mapping Φ_⋆ : W(X, D) → W(Y, E), P ↦ P̃, is called the κ-compatible exponential embedding with origin P̄.

Figure 2: Exponential embedding of a family V, with origin P̄. When V forms an e-family, so does Φ_⋆V.
Proposition 3.1. Let P ∈ W(X, D), and consider the exponential embedding Φ_⋆ with κ-lumpable origin P̄.

(i) For any y, y' ∈ Y,

$$\Phi_\star P(y, y') = \frac{\bar{P}(y, y')\, P(\kappa(y), \kappa(y'))\, v(y')}{\rho\, v(y)},$$

where (ρ, v) is the right PF pair of P̄ • P^↑, and P^↑ is as in Definition 3.5.

(ii) Φ_⋆P is κ-lumpable, and κ_⋆Φ_⋆P = s(P' • P).

Proof. Let y ∈ Y, and let (ρ, v) be the right PF pair of P̄ • P^↑. Claim (i) follows by unfolding the definition of the rescaling s applied to P̄ • P^↑, where one equality stems from κ_⋆P̄ = P'. Furthermore, for all x, x' ∈ X, and for all y ∈ S_x, summing (i) over y' ∈ S_{x'} yields an expression that is independent of y; thus the chain is κ-lumpable, and in fact, by definition of lumping, κ_⋆Φ_⋆P = s(P' • P), whence (ii).
Remark 3.3. While κ-lumpability of the origin P̄ ensures κ-lumpability of every exponentially embedded chain, note that composing the embedding with κ-lumping does not generally recover the original chain P, but rather some translated version s(P' • P). This leads to non-congruency of the embedding, except for some well-chosen origin (Theorem 3.3).
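Putting the pieces together, the exponential embedding with origin P̄ can be sketched as the rescaling s applied to the Hadamard product of P̄ with the lift of P, the form appearing in Proposition 3.1. As before, this is a pure-Python illustration assuming the product matrix is primitive, with our own function names:

```python
def pf_pair(A, iters=3000):
    """Right PF pair (rho, v) of a non-negative primitive matrix, by power iteration."""
    n = len(A)
    v = [1.0] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)
        v = [wi / rho for wi in w]
    return rho, v

def rescale(A):
    """Stochastic rescaling s(A)(y, y') = A(y, y') v(y') / (rho * v(y))."""
    rho, v = pf_pair(A)
    n = len(A)
    return [[A[i][j] * v[j] / (rho * v[i]) for j in range(n)] for i in range(n)]

def exp_embed(Pbar, P, kappa):
    """Exponential embedding with origin Pbar: Phi_* P = s(Pbar . P^lift),
    where P^lift(y, y') = P(kappa(y), kappa(y')) and '.' is entrywise product."""
    n = len(Pbar)
    H = [[Pbar[y][z] * P[kappa[y]][kappa[z]] for z in range(n)] for y in range(n)]
    return rescale(H)
```

The output is a proper stochastic matrix on the larger space; by Proposition 3.1-(ii) it is moreover κ-lumpable, with lump s(P' • P) rather than P itself.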
For P_0, P_1 ∈ W(Y, E), the e-geodesic between P_0 and P_1 can be defined [Nagaoka, 2005, Corollary 2] by

$$\gamma^{(e)}_{P_0, P_1}(t) = s\left(P_0^{\bullet(1-t)} \bullet P_1^{\bullet t}\right), \qquad t \in \mathbb{R},$$

where the powers are taken entry-wise. Essentially, γ^{(e)}_{P_0,P_1} is the straight line in W(Y, E) with respect to the e-connection that goes through P_0 and P_1, and forms the simplest kind of (1-dimensional) e-family. Similarly, the m-geodesic that goes through P_0, P_1 with respective edge measures Q_0, Q_1 can be defined as

$$\gamma^{(m)}_{P_0, P_1}(t) = P_t, \qquad \text{where } Q_t = (1 - t) Q_0 + t Q_1,\ t \in [0, 1],$$

and P_t is the unique kernel that pertains to Q_t.
A compelling property for an embedding E_⋆ : W(X, D) → W(Y, E) is to preserve the geometric structure, in the sense of mapping an e-family to an e-family or an m-family to an m-family. This quality reduces to that of being a geodesically affine map.

Definition 3.6 (Geodesically affine map). Let E_⋆ : W(X, D) → W(Y, E) be an embedding. When for all P_0, P_1 ∈ W(X, D) and t ∈ R, E_⋆γ^{(e)}_{P_0,P_1}(t) = γ^{(e)}_{E_⋆P_0, E_⋆P_1}(t), then E_⋆ is said to be e-geodesic affine. When for all P_0, P_1 ∈ W(X, D) and for t ∈ [0, 1], E_⋆γ^{(m)}_{P_0,P_1}(t) = γ^{(m)}_{E_⋆P_0, E_⋆P_1}(t), then E_⋆ is said to be m-geodesic affine.
Theorem 3.2. Let Φ be the exponential embedding with origin P̄.
(i) Φ is an e-geodesic affine map.
(ii) Φ is generally not an m-geodesic affine map.
Proof. To prove (i), we rely on the fact that the mapping s induces an equivalence class for diagonally similar matrices; see (3). For any P0, P1 ∈ W(X, D) and t ∈ R, the claim then follows by direct computation. Statement (ii) stems from Lemma 3.4 below.
Remark 3.4. The example in Lemma 3.4 actually shows the stronger, and somewhat surprising, statement that even Markov embeddings are not generally m-geodesic affine. This is in stark contrast with Markov morphisms in the context of distributions, which can be shown to be geodesically affine for both the m-connection and the e-connection. We will later construct (Section 3.4.2) a non-trivial subset of Markov embeddings that also preserves the m-structure.
Remark 3.5. In addition, it is not difficult to see that extending the invariance Lemma 3.1 to all exponential embeddings is not possible, as the latter distort the Fisher metric and the affine connections. In particular, for exponential embeddings Φ P, Φ P′ of P, P′, it holds that D(Φ P ‖ Φ P′) ≥ D(κ Φ P ‖ κ Φ P′) = D(s(P̄ • P) ‖ s(P̄ • P′)).
Let us introduce the special element Ū ∈ W(X, D), obtained by PF-normalizing the characteristic (connection) matrix of D. Observe that when D = X², Ū = (1/|X|) 𝟙𝟙ᵀ is the Markov kernel that induces a uniform iid process over X. We will later recall in Section 5.2.1 that Ū corresponds to the maximum entropy rate kernel defined in W(X, D).
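Assuming Ū is the PF-normalization of the characteristic matrix of D (the Parry construction, which maximizes the entropy rate among kernels supported on D), it can be computed in a few lines; the three-state edge set below is a hypothetical example:

```python
import numpy as np

def parry(A):
    """Maximum entropy rate (Parry) kernel supported on the edge set with
    characteristic matrix A: U(x,x') = A(x,x') v(x') / (rho v(x)),
    where (rho, v) is the right PF pair of A."""
    w, V = np.linalg.eig(A)
    i = np.argmax(w.real)
    rho, v = w[i].real, np.abs(V[:, i].real)
    return A * v[None, :] / (rho * v[:, None])

# Characteristic matrix of a hypothetical edge set D over X = {0, 1, 2},
# excluding the self-loop at state 0.
A = np.array([[0., 1., 1.],
              [1., 1., 1.],
              [1., 1., 1.]])

U = parry(A)
U_full = parry(np.ones((3, 3)))  # D = X^2 recovers the uniform iid kernel
```

When D = X², the construction reduces to the uniform kernel (1/|X|) 𝟙𝟙ᵀ, matching the observation above.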
This corresponds to the canonical Markov embedding (Lemma 3.2) constructed from P̄. Conversely, for any κ-compatible Markov embedding Λ, by setting P̄ = Λ Ū, we can create the exponential embedding Φ.

Hudson embeddings
In this section, we discuss a particular expansion of a Markov process that appears in Kemeny and Snell [1983, Section 6.5, p. 140], which they consider to be the natural inverse of lumping. The first analysis of this expansion being credited to S. Hudson, we henceforth refer to it as the Hudson embedding, and denote it by H. We invite the reader to consult [Kemeny and Snell, 1983, Example 6.5.1] for an illustrative example of this expansion.
Hudson embedding as a Markov embedding. Our first order of business is to show that the Hudson embedding is a very particular Markov embedding, where the target space is nothing but the set of edges in the directed graph defined by the original chain; namely, the Hudson embedding maps W(X, D) into W(D, E) for some E ⊂ D² and a lumping function that we now define. Let us introduce the Hudson lumping h, outputting the destination vertex of an edge: h((x1, x2)) = x2. For x ∈ X, we let S_x = {e = (x1, x2) ∈ D : x2 = x}. Then ⋃_{x∈X} S_x = D is the partition associated with h. We further define H_D ⊂ D² as the set of pairs of consecutive edges, together with its characteristic function H(e, e′) = δ[(e, e′) ∈ H_D].
Proof. We verify that for x, x′ ∈ X and e = (x1, x2) ∈ D, the defining properties of a Markov embedding congruent with h are satisfied. It is easy to obtain [Kemeny and Snell, 1983, Theorem 6.5.2] a closed-form expression for the stationary distribution of the embedded chain H π in terms of the edge measure Q that pertains to P: H π(e) = Q(e).
Moreover, being a Markov embedding, H is isometric and preserves the dual affine connections. However, although H is e-geodesic affine (Theorem 3.3), we now show that it fails to preserve the m-structure, proving that Markov embeddings (and a fortiori exponential embeddings) are not generally m-geodesic affine (Theorem 3.2).
Proof. Let p ∈ (0, 1), p ≠ 1/2, X = {0, 1}, and consider two positive kernels P0, P1 ∈ W(X, X²) parametrized by p. We compute successively their Hudson embeddings and the m-geodesic midpoint, which turns out not to be lumpable; i.e., the m-geodesic leaves the image of the embedding.
Hudson embeddings as sliding windows of observations. We can view the Hudson embedding more operationally as considering sliding windows of observations of a Markov chain. Namely, for X1, X2, . . . an irreducible Markov chain with dynamics governed by P and stationary distribution π, the stochastic process defined by (X1, X2), (X2, X3), . . ., (X_t, X_{t+1}), . . . also defines a Markov chain, whose dynamics are governed by H P, and whose stationary distribution is Q (see for example [Wolfer and Kontorovich, 2021, Lemma 6.1] or Qiu et al. [2020]). In particular, it is straightforward to simulate a trajectory of H P as a deterministic function of a trajectory from P. Furthermore, let us consider second-order Markov chains over some state space X, whose dynamics can be encoded in a kernel P(2) that specifies, for any t ∈ N and x, x′, x″ ∈ X, the probability of transitioning to x″ given the two latest states (x, x′). Following the identification in Csiszár et al. [1987, Section IV], we can then introduce P̄ ∈ W(D, H_D) such that P̄((x, x′), (x′, x″)) = P(2)((x, x′), x″); i.e., we can regard a second-order Markov kernel on X as a first-order kernel on (D, H_D). This allows us to view the Hudson embedding of W(X, D) as a first-order Markov sub-family of W(D, H_D), the family identified with second-order kernels. Note that Lemma 3.4 also implies that the Hudson embedding of W(X, D) does not form an m-family in W(D, H_D).
Higher-order Hudson embeddings. For an observation window of size k > 1, the process (X1, X2, . . ., X_k), (X2, X3, . . ., X_{k+1}), . . . still defines a Markov chain, inviting us to extend the definition of H to higher orders. We write D(k) for the collection of all possible paths s = (x1, x2, . . ., x_k) of length k over the connection graph of P; in particular, D(2) = D and D(1) = X. The definition of the Hudson lumping extends seamlessly, outputting the last vertex of a path, and the edge set of the target space is naturally defined by pairs of consecutive overlapping paths. The k-th order Hudson embedding H(k) is then defined accordingly, and Theorem 3.4 can be extended to any k-th order. Finally, we can also view the k-th order Hudson embedding of W(X, D) as a first-order subfamily of W(D(k), H(k)_D), the family identified [Csiszár et al., 1987, Section IV] with k-th order Markov kernels.
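A minimal sketch of the (second-order) Hudson embedding, built directly from the definitions above; the example kernel is hypothetical:

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    return pi / pi.sum()

def hudson(P):
    """Hudson embedding H: states are the edges D = {(x1, x2) : P(x1, x2) > 0},
    and HP(e, e') = H(e, e') P(h(e), h(e')), with h((x1, x2)) = x2 the Hudson
    lumping and H the characteristic function of consecutive edge pairs."""
    n = P.shape[0]
    edges = [(a, b) for a in range(n) for b in range(n) if P[a, b] > 0]
    HP = np.zeros((len(edges), len(edges)))
    for i, (a, b) in enumerate(edges):
        for j, (c, d) in enumerate(edges):
            if b == c:                 # e' = (c, d) can follow e = (a, b)
                HP[i, j] = P[c, d]     # = P(h(e), h(e')) on allowed pairs
    return HP, edges

P = np.array([[0.9, 0.1], [0.2, 0.8]])
HP, edges = hudson(P)
pi = stationary(P)
Q = np.array([pi[a] * P[a, b] for (a, b) in edges])  # edge measure of P
```

The assertions below check that HP is row-stochastic and that its stationary distribution is the edge measure, H π(e) = Q(e). Operationally, a trajectory of H P is obtained by sliding a window of length two over a trajectory of P.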

Memoryless Markov embeddings
Recall that there is a natural one-to-one embedding of the positive simplex P+(X) into positive Markov kernels that forms W_iid(X), the family of irreducible memoryless kernels [Wolfer and Watanabe, 2021, Section 8]. In the same spirit, we now define memoryless Markov embeddings as the one-to-one embedding of all Markov morphisms in the context of distributions into Markov morphisms in the context of kernels. For some fixed family of kernels W(X, D), every memoryless Markov embedding, as a Markov embedding, is associated with a lumping function κ and a linear operator in F+(Y, E) (Definition 3.2), with the additional property that its entries depend only on the arrival state; that is, L : Y → R essentially defines a Markov embedding in the context of distributions. We write L : W(X, D) → W(Y, E), (L P)(y, y′) = P(κ(y), κ(y′)) L(y′).
The stationary distribution of a kernel embedded with L has a closed-form expression (Lemma 3.5): L π(y) = π(κ(y)) L(y).
Proof. We simply verify that L π is stationary for L P. Let y′ ∈ Y; then ∑_y L π(y) (L P)(y, y′) = ∑_x ∑_{y∈S_x} π(x) L(y) P(x, κ(y′)) L(y′) = L(y′) ∑_x π(x) P(x, κ(y′)) = π(κ(y′)) L(y′), where we used that L sums to one within every block. A direct consequence of Lemma 3.5 is that for Markov chains with rational stationary distributions, we can construct a natural embedding that produces a doubly stochastic matrix.
Namely, suppose the stationary distribution of P satisfies π(i) = p_i/m for m ∈ N and p1, . . ., p_n ∈ N with ∑_{i=1}^n p_i = m. There exists a lumping function κ and a memoryless embedding L such that L P is doubly stochastic.
Proof. We construct κ and L as follows. For any j ∈ [m], let κ(j) be the block index obtained by duplicating state i into p_i copies, and set L(j) = 1/p_{κ(j)}. Then, for any j ∈ [m], L π(j) = π(κ(j)) L(j) = (p_{κ(j)}/m)(1/p_{κ(j)}) = 1/m. The stationary distribution is uniform, thus L P is bistochastic.
Remark 3.6. Since a transition kernel with rational entries enjoys a rational stationary distribution, any transition kernel can be embedded into a doubly stochastic one, modulo some rational approximation.
Lemma 3.6. Memoryless Markov embeddings are m-geodesic affine.
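The construction in the proof above can be sketched as follows, for a hypothetical kernel with stationary distribution (2/3, 1/3), so that m = 3 and (p1, p2) = (2, 1):

```python
import numpy as np

def bistochastic_embedding(P, p, m):
    """Memoryless embedding for a kernel P with rational stationary distribution
    pi(i) = p[i]/m: duplicate state i into p[i] copies (lumping kappa), and set
    L(j) = 1/p[kappa(j)].  The embedded kernel
        LP(j, j') = P(kappa(j), kappa(j')) L(j')
    then has uniform stationary distribution, hence is doubly stochastic."""
    kappa = [i for i in range(len(p)) for _ in range(p[i])]
    LP = np.array([[P[kappa[j], kappa[jp]] / p[kappa[jp]]
                    for jp in range(m)] for j in range(m)])
    return LP, kappa

P = np.array([[0.5, 0.5], [1.0, 0.0]])   # stationary distribution (2/3, 1/3)
LP, kappa = bistochastic_embedding(P, p=[2, 1], m=3)
```

The assertions confirm that LP is row- and column-stochastic, i.e. doubly stochastic, as the lemma predicts.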
Since memoryless Markov embeddings are also e-geodesic affine, they preserve the entire em-structure of Markov kernels.

Reversible embeddings
A Markov chain is reversible when its transition kernel P ∈ W(Y, E) with stationary distribution π satisfies the detailed balance equation, i.e. for any y, y′ ∈ Y, π(y)P(y, y′) = π(y′)P(y′, y).
Recall that W_rev(Y, E) denotes the subset of W(Y, E) of reversible Markov chains. While lumping always preserves reversibility of Markov kernels (Proposition 3.2), embedding a reversible chain (even by a Markov embedding) can yield a non-reversible one. To illustrate this fact, consider for example the Hudson embedding of P0 ∈ W_rev(X = {0, 1}, X²) as defined in Lemma 3.4, and notice that H(X²) is not symmetric, precluding the reversibility of H P0. Some Markov embeddings, however, do preserve reversibility (Lemma 3.7).
Proof. Write P = κ P̄, and π for the stationary distribution of P. The claim immediately follows from Corollary 2.1 for any (x, x′) ∈ D.
Remark 3.7. Interestingly, Proposition 3.2 implies that Markov embeddings are "non-reversibility preserving", in the sense that a non-reversible Markov kernel cannot be congruently embedded into a reversible one.
Proof. Let P ∈ W_rev(X, D), with corresponding stationary distribution π, and let P̄ = L P, with stationary distribution L π. We verify that P̄ satisfies the detailed balance equation. For y, y′ ∈ Y, L π(y) P̄(y, y′) = π(κ(y)) L(y) P(κ(y), κ(y′)) L(y′) = π(κ(y′)) L(y′) P(κ(y′), κ(y)) L(y) = L π(y′) P̄(y′, y), where (i) the first equality follows from Lemma 3.5, and (ii) the second from reversibility of P.
This last observation enables us to isometrically embed elements of W_rev(X) with rational stationary distributions into W_sym(Y).
Table 1: Nomenclature of embedding classes (including the Hudson embedding of order k, Section 3.4.1, p. 24).
Lemma 3.7 and Corollary 3.2 yield the claim.
Finally, we show that the canonical Markov embedding induced from a reversible kernel is reversibility preserving.

Figure 3: Landscape of the different classes of embeddings, as summarized in Table 1. It is instructive to observe that, unlike in the distribution setting, congruent Markov embeddings are not necessarily m-geodesic affine.

Information projection on geodesically convex sets
Geodesic convexity generalizes the familiar notion of convexity in Euclidean space to Riemannian manifolds. Recall that in the Riemannian setting, straight lines, termed geodesics, are defined with respect to some affine connection ∇. A submanifold C is geodesically convex with respect to ∇ whenever all ∇-geodesics joining two points in C remain in C at all times.
We complement these results by noting that unlike the m-geodesic, the e-geodesic is defined in W(X, D) for any t ∈ R. We readily obtain the general expression D(P ‖ γ(e)_{P0,P1}(t)) = (1 − t)D(P ‖ P0) + t D(P ‖ P1) + log ρ_t, where ρ_t is the PF root of P0^{•(1−t)} • P1^{•t}. When |t| > 1, the strong convexity of t ↦ ρ_t together with ρ_0 = ρ_1 = 1 immediately implies the following property of the information divergence: D(P ‖ γ(e)_{P0,P1}(t)) > (1 − t)D(P ‖ P0) + t D(P ‖ P1).
Remark 4.2. It is also noteworthy that the information divergence for Markov kernels does not seem to enjoy any joint m-convexity. This contrasts with the distribution setting, where the information divergence belongs to the class of f-divergences, hence is jointly m-convex.
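The displayed decomposition can be verified numerically; the helper names and example kernels below are ours:

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    return pi / pi.sum()

def kl_rate(P, Q):
    """Information divergence rate D(P || Q) between irreducible kernels."""
    pi = stationary(P)
    return float(np.sum(pi[:, None] * P * np.log(P / Q)))

def pf_pair(A):
    w, V = np.linalg.eig(A)
    i = np.argmax(w.real)
    return w[i].real, np.abs(V[:, i].real)

def e_geodesic(P0, P1, t):
    """s(P0^{.(1-t)} * P1^{.t}), with s the PF-normalization into a stochastic matrix."""
    A = P0 ** (1.0 - t) * P1 ** t
    rho, v = pf_pair(A)
    return A * v[None, :] / (rho * v[:, None])

P  = np.array([[0.6, 0.4], [0.3, 0.7]])
P0 = np.array([[0.7, 0.3], [0.4, 0.6]])
P1 = np.array([[0.2, 0.8], [0.5, 0.5]])

def decomposition_gap(t):
    """D(P || gamma(t)) - [(1-t) D(P||P0) + t D(P||P1) + log rho_t]; should vanish."""
    rho_t, _ = pf_pair(P0 ** (1.0 - t) * P1 ** t)
    return (kl_rate(P, e_geodesic(P0, P1, t))
            - (1 - t) * kl_rate(P, P0) - t * kl_rate(P, P1) - np.log(rho_t))
```

The identity holds for every t, and for |t| > 1 the log ρ_t term is strictly positive, giving the strict inequality above.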

Pythagorean inequalities
In Euclidean geometry, the projection of a point onto a convex body yields a natural Pythagorean inequality involving the point, its projection, and other points on the surface of the convex body. An information-geometric analogue of this fact, with the information divergence in lieu of the squared Euclidean distance, is also well known to hold in the simplex (see e.g. Csiszár et al. [2004, Theorem 3.1]). We briefly recall an extension of this Pythagorean inequality for m-convex sets of Markov kernels [Csiszár et al., 1987, Lemma 1]. Let C ⊂ W(X, D′) with D′ ⊂ D be non-empty, closed and m-convex; we define the I-projection (information projection) onto C as the mapping I^(C) : P ↦ arg min_{P̄∈C} D(P̄ ‖ P). For a fixed P ∈ W(X, D), the function P̄ ↦ D(P̄ ‖ P) is continuous and strictly m-convex.
Proposition 4.1 (Pythagorean inequality, I-projection onto an m-convex set [Csiszár et al., 1987, Lemma 1]). Let P ∈ W(X, D), and let C ⊂ W(X, D′) with D′ ⊂ D be non-empty, closed and m-convex (Definition 4.1). Then I^(C) P exists, in the sense that the minimum is attained for a unique element of C. Let P0 ∈ C; the following two statements are equivalent: (i) P0 = I^(C) P; (ii) for any P̄ ∈ C, D(P̄ ‖ P) ≥ D(P̄ ‖ P0) + D(P0 ‖ P). See Figure 4 (left).
Let now C ⊂ W(X, D) be non-empty, closed and e-convex. In the distribution setting, a log-convexity counterpart of Proposition 4.1 is known to hold [Csiszár and Matúš, 2003, Theorem 1]. For fixed P ∈ W(X, D), adapting the terminology therein to the Markovian setting, we define the rI-projection (reverse information projection) onto C as the mapping rI^(C) : P ↦ arg min_{P̄∈C} D(P ‖ P̄). For a fixed P ∈ W(X, D), the function P̄ ↦ D(P ‖ P̄) is continuous and strictly e-convex. We show the following counterpart to Proposition 4.1.
where (a) follows from the preceding identity, and (b) stems from Nagaoka [2005, Theorem 4]. By a first-order Taylor expansion around P0, there exists s ∈ [0, t] at which the expansion holds. Moreover, since P0, P̄ ∈ C, it follows by e-convexity that P_t ∈ C as well, and P0 being the minimizer implies (1/t)(D(P ‖ P_t) − D(P ‖ P0)) ≥ 0.
Taking the limit t → 0 yields the desired inequality. Uniqueness follows from strict e-convexity, whence the theorem.
Remark 4.3. When C forms an m-family (resp. e-family), the Pythagorean inequality in Proposition 4.1 (resp. Proposition 4.2) becomes an equality [Hayashi and Watanabe, 2016, Corollary 4.7, Corollary 4.8]. When C is m-convex (resp. e-convex), the I-projection (resp. rI-projection) is commonly referred to as the e-projection (resp. m-projection) [Amari and Nagaoka, 2007]. This can be understood by verifying via the Pythagorean inequality that for an m-convex C and P0 = I^(C) P, letting P_t = s(P^{•(1−t)} • P0^{•t}) leads to I^(C) P_t = P0 for any 0 ≤ t ≤ 1. Similarly, for an e-convex C with rI^(C) P = P0, letting P_t pertain to the mixture Q_t = (1 − t)Q + tQ0 of the edge measures of P and P0 yields rI^(C) P_t = P0 for any 0 ≤ t ≤ 1.

Information projection as a form of data processing
A natural question is whether taking the rI-projection onto an m-convex set also yields practical inequalities. This is the case, for example, in the context of distributions, where the following four-point property is known to hold with respect to the information divergence.
We first wish to extend the aforementioned property to the geometry of Markov kernels. We let C ⊂ W(X, D) be non-empty, closed and m-convex, and for any P, P′ ∈ W(X, D) and P̄ ∈ C, we will say that the four-point property holds for the quadruple (P, P′, C, P̄) whenever it holds that D(P ‖ P′0) ≤ D(P ‖ P̄) + D(P ‖ P′), with P′0 = rI^(C) P′.
Example 4.2. We derive an inequality for divergences involving two irreducible chains and their respective additive and multiplicative reversiblizations (Figure 5, left). Let C = W_rev(X, D), which is known to form an em-family (both an e-family and an m-family) [Wolfer and Watanabe, 2021]. Consider P+ = (P + P*)/2, the m-projection (i.e. the rI-projection) of P onto W_rev, where P* denotes the time reversal of P, and P× = P′(P′)*, the multiplicative reversiblization of P′ (note that P× is not the I-projection of P′). Writing π, π′, π×, π+ for the stationary distributions of the chains under consideration, we note that by construction π = π+ and π′ = π×, while π and π′ generally need not be equal. Straightforward calculations yield D(P′ ‖ P+) ≤ D(P′ ‖ P×) + D(P′ ‖ P).
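Both reversiblizations are easy to compute and to check for detailed balance; the example kernel is hypothetical:

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    return pi / pi.sum()

def reversal(P):
    """Time reversal P*(x, x') = pi(x') P(x', x) / pi(x)."""
    pi = stationary(P)
    return (P.T * pi[None, :]) / pi[:, None]

P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
pi = stationary(P)
P_add = (P + reversal(P)) / 2   # additive reversiblization: m-projection onto W_rev
P_mult = P @ reversal(P)        # multiplicative reversiblization

# Detailed balance matrices: pi(x) P(x, x') should be symmetric for reversible kernels.
db_add = pi[:, None] * P_add
db_mult = pi[:, None] * P_mult
```

Both reversiblizations are stochastic, keep π as their stationary distribution, and satisfy detailed balance with respect to it.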
We now show that under the four-point property, the operation of taking m-projections onto a doubly autoparallel submanifold of kernels can only bring kernels closer together (Figure 5, right).
Proposition 4.4 (Contraction property for m-projection onto an em-family). Let V_em be an em-family in W(X, D) (doubly autoparallel submanifold), let P, P′ ∈ W(X, D), and let P_m (resp. P′_m) be the m-projection of P (resp. P′) onto V_em. Suppose that the four-point property holds for the quadruple (P, P′, V_em, P_m). Then it holds that D(P_m ‖ P′_m) ≤ D(P ‖ P′).
Proof. Since P_m is the m-projection of P onto the e-family V_em, the Pythagorean identity yields D(P ‖ P′_m) = D(P ‖ P_m) + D(P_m ‖ P′_m). Combining with (16) yields the claim. This contractive property is similar to the one enjoyed by nearest-point projections onto convex sets in Hilbert spaces. We now briefly see how m-projecting can be interpreted as a form of data-processing. Let Λ be a memoryless Markov embedding, and define J = {Λ P : P ∈ W(X, D)}.
Since W(X, D) forms an em-family, and Λ is both e-geodesically and m-geodesically affine (Theorem 3.2, Lemma 3.6), J also forms an em-family. Let P̄, P̄′ ∈ W_κ(Y, E), and let P̄_m, P̄′_m be the m-projections of P̄, P̄′ onto J. We henceforth suppose that the four-point property holds in our context for the quadruple (P̄, P̄′, J, P̄_m). In this case, the m-projections of P̄ and P̄′ are readily obtained by composition of lumping and embedding by Λ (Lemma 4.1). By Proposition 4.4, we then recover the data-processing inequality D(P̄ ‖ P̄′) ≥ D(Λ κ P̄ ‖ Λ κ P̄′) = D(κ P̄ ‖ κ P̄′).
Example 4.3. Irreducible kernels can only be brought closer together by taking their additive reversiblization, which corresponds to the m-projection onto W_rev(X, D). Indeed, without even relying on the four-point property, we directly prove by joint convexity of the KL divergence in the context of distributions that D((P0 + P0*)/2 ‖ (P1 + P1*)/2) ≤ D(P0 ‖ P1).
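The lump-and-compare inequality D(P̄ ‖ P̄′) ≥ D(κ P̄ ‖ κ P̄′) can be checked numerically on a pair of lumpable kernels; the matrices below are our own:

```python
import numpy as np

def stationary(P):
    w, V = np.linalg.eig(P.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    return pi / pi.sum()

def kl_rate(P, Q):
    """Information divergence rate D(P || Q) between irreducible kernels."""
    pi = stationary(P)
    return float(np.sum(pi[:, None] * P * np.log(P / Q)))

def lump(P, kappa):
    """kappa-lump a kappa-lumpable kernel: block-sum any representative row
    of each block (rows agree within blocks by lumpability)."""
    n, m = len(kappa), max(kappa) + 1
    out = np.zeros((m, m))
    for x in range(m):
        y = kappa.index(x)             # a representative of S_x
        for yp in range(n):
            out[x, kappa[yp]] += P[y, yp]
    return out

kappa = [0, 0, 1]
# Two kappa-lumpable kernels on Y = {0, 1, 2} with different intra-block structure
Pbar = np.array([[0.2, 0.3, 0.5],
                 [0.4, 0.1, 0.5],
                 [0.3, 0.3, 0.4]])
Pbar_p = np.array([[0.1, 0.3, 0.6],
                   [0.2, 0.2, 0.6],
                   [0.35, 0.35, 0.3]])
```

The assertion below confirms the data-processing inequality: lumping the two chains can only decrease their divergence rate.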

Information geometry of lumpable kernels
Many important families of Markov kernels (e.g. doubly stochastic matrices, symmetric matrices, reversible matrices, ...) are known to enjoy favorable geometrical features [Wolfer and Watanabe, 2021, Table 1]. In this section we analyze the geometrical structure of the family W_κ(Y, E) of lumpable kernels.

The foliated manifold of lumpable kernels
Recall that in the example of Lemma 3.4, we inspected the nature of the geodesic midpoint γ(m)_{P̄0,P̄1}(1/2), where P̄0, P̄1 ∈ W_h(X, H_{X²}) with X = {0, 1} and h the Hudson lumping. There, we found that this point is not lumpable, i.e. the m-geodesic leaves the manifold of lumpable kernels. It is then a consequence of Nagaoka [2005, Corollary 3] that lumpable kernels do not form e-families either.
A foliation is the decomposition of a manifold into a union of connected but disjoint submanifolds, called leaves, all sharing the same dimension. See for example Lee [2013, Chapter 19] for a thorough exposition. The concepts of mutually dual foliations and mixed coordinate systems play a significant role in information geometry [Amari and Nagaoka, 2007, Section 3.7]. Let us fix some origin P0 ∈ W(X, D), and let L(P0) be the submanifold in W_κ(Y, E) of all kernels that κ-lump into P0. For any P̄ ∈ L(P0), following Lemma 3.2, recall that we can construct the canonical embedding Λ^(P̄), which verifies Λ^(P̄) κ P̄ = P̄. For P̄′ ∈ W_κ(Y, E), we can then define J(P̄′) to be the image of the entire manifold W(X, D) by the embedding Λ^(P̄′).
Proof. To prove (i), notice that since W(X, D) forms an e-family [Nagaoka, 2005, Corollary 1], and Λ^(P̄′) preserves the e-structure (Theorem 3.3), J(P̄′) also forms an e-family. Since Λ^(P̄′) is an embedding, it is a diffeomorphism onto its image, thus dim J(P̄′) = dim W(X, D). It remains to prove (ii), i.e. that L(P0) is closed under affine combinations. Consider two Markov embeddings induced by Λ1, Λ2 that embed P0 respectively into P̄1, P̄2 ∈ L(P0): P̄1(y, y′) = P0(κ(y), κ(y′))Λ1(y, y′) and P̄2(y, y′) = P0(κ(y), κ(y′))Λ2(y, y′).
We now prove that the manifold of κ-lumpable kernels can be foliated, with the collection of submanifolds {J(P̄)}_{P̄∈L(P0)} acting as leaves, for any base point P0. Fixing P0, we can then refer to a κ-lumpable kernel P̄ in two steps. We first specify the leaf it belongs to, i.e. its coordinate along the family L(P0), which corresponds to some lumpable P̄′. As a second step, we indicate the coordinates of P̄ in J(P̄′).
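Statement (ii) above (the leaf L(P0) is closed under affine combinations) can be illustrated numerically; the kernels below, which both κ-lump into the same P0, are hypothetical:

```python
import numpy as np

def lump(P, kappa):
    """kappa-lump a stochastic matrix; returns (lumped kernel, lumpability flag)."""
    n, m = len(kappa), max(kappa) + 1
    B = np.zeros((n, m))
    for yp in range(n):
        B[:, kappa[yp]] += P[:, yp]
    lumped = np.zeros((m, m))
    ok = True
    for x in range(m):
        rows = B[[y for y in range(n) if kappa[y] == x]]
        ok = ok and np.allclose(rows, rows[0])
        lumped[x] = rows[0]
    return lumped, ok

kappa = [0, 0, 1]
# Two members of the leaf L(P0): both kappa-lump into the same P0
P1 = np.array([[0.2, 0.3, 0.5],
               [0.4, 0.1, 0.5],
               [0.3, 0.3, 0.4]])
P2 = np.array([[0.1, 0.4, 0.5],
               [0.25, 0.25, 0.5],
               [0.45, 0.15, 0.4]])
P0 = lump(P1, kappa)[0]            # [[0.5, 0.5], [0.6, 0.4]]
Pt = 1.2 * P1 - 0.2 * P2           # affine (not convex) combination, entrywise >= 0
```

As long as the affine combination keeps all entries non-negative, it remains a stochastic matrix in the same leaf, i.e. it still κ-lumps into P0.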
Remark 5.3. We can intuitively relate the dimension of W_κ(Y, E) as a manifold to that of the vector space F_κ(Y, E) (Lemma 3.3). Indeed, setting |X| additional constraints on a lumpable matrix ensures that it is row-stochastic.

Interpretations & Applications
In this section, we first illustrate how L(P0) forming an m-family in W(Y, E) enables us to efficiently select a chain on a finer state space that lumps into P0, while making the fewest additional assumptions (Section 5.2.1). We then proceed to show how the foliation introduced in Theorem 5.1 leads to identifying the lumpable frequency matrices constructed from types that would have resulted in P̄_m being selected as the minimizer (see Figure 9). In fact, the minimizer admits a closed form from a straightforward computation, where Y1, . . ., Yn is sampled according to P̄_θ with arbitrary initial distribution µ.
Figure 9: Embedded model J(P̄′). The Pythagorean leaf L(P̄_m) contains all lumpable frequency matrices P̄_{T1}, . . ., P̄_{TN} that would result in P̄_m being the minimizer.

Extension to higher-order data processing, and composite embeddings
We can naturally extend the data-processing model to the multi-letter case,
and such that E and κ satisfy the condition of Proposition 2.1.

Figure 5: Illustration of the four-point property for multiplicative and additive reversiblization (left). Interpretation of m-projection as a form of data-processing (right).

and introduce the characteristic function H ∈ F(D, H_D), H(e, e′) = δ[(e, e′) ∈ H_D].
The map P ↦ H P, with H P(e, e′) = H(e, e′) P(h(e), h(e′)), is a Markov embedding congruent with the Hudson lumping h.
P̄_t(y, y′) = P0(κ(y), κ(y′)) Λ_t(y, y′). Non-negativity of P̄_t(y, y′) and P0(κ(y), κ(y′)) implies that Λ_t(y, y′) is non-negative. Moreover, for every y ∈ Y and x′ ∈ X, the entries of Λ_t(y, ·) over the block S_{x′} sum to one, as an affine combination of quantities satisfying this property. As a result, Λ_t defines a proper Markov embedding, and P̄_t ∈ L(P0). The dimension of L(P0) is obtained by considering its one-to-one correspondence with the canonical embedding map, which has the same number of degrees of freedom as a Markov embedding (Remark 3.2).