Distributed information-theoretic biclustering

This paper investigates the problem of distributed biclustering of memoryless sources and extends previous work [1] to the general case with more than two sources. Given a set of distributed stationary memoryless sources, the encoders' goal is to find rate-limited representations of these sources such that the mutual information between two selected subsets of descriptions (each of them generated by distinct encoders) is maximized. This formulation is fundamentally different from conventional distributed source coding problems since here redundancy among descriptions should actually be maximally preserved. We derive non-trivial outer and inner bounds to the achievable region for this problem and further connect them to the CEO problem under logarithmic loss distortion. Since information-theoretic biclustering is closely related to distributed hypothesis testing against independence, our results are also expected to apply to that problem.


Introduction
Recent decades have witnessed a rapid proliferation of digital data in a myriad of repositories such as internet fora, blogs, web applications, news, emails and the social media bandwagon. A significant part of this data is unstructured, and it is thus hard to extract relevant information from it. This results in a growing need for a fundamental understanding of, and efficient methods for, analyzing data and discovering valuable and relevant knowledge from it in the form of structured information.
When specifying certain hidden (unobserved) features of interest, the problem then consists of extracting those relevant features from a measurement, while neglecting other, irrelevant features. Formulating this idea in terms of lossy source compression [44], we can quantify the complexity of a representation via its compression rate and its fidelity via the information it provides about specific (unobserved) features.
In this paper, we introduce and study the distributed clustering problem from a formal information-theoretic perspective. Given correlated samples $X_1$, $X_2$ observed at two different encoders, the aim is to extract a description from each sample such that the descriptions are maximally informative about each other. In other words, each encoder tries to find a (lossy) description $W_j = f_j(X_j^n)$ of its observation $X_j^n$ subject to a complexity requirement (coding rate), maximizing the mutual information $I(W_1; W_2)$. Our goal is to characterize the optimal tradeoff between the relevance (mutual information between the descriptions) and the complexity of those descriptions (encoding rate).

Related work
Biclustering (or co-clustering) was first explicitly considered by Hartigan [26] in 1972. A historical overview of biclustering including additional background can be found in [33, Section 3.2.4]. In general, given an $S \times T$ data matrix $(a_{st})$, the goal of a biclustering algorithm [32] is to find partitions $B_k \subseteq \{1, \dots, S\}$ and $C_l \subseteq \{1, \dots, T\}$, $k = 1, \dots, K$, $l = 1, \dots, L$, such that all the elements of the 'biclusters' $(a_{st})_{s \in B_k, t \in C_l}$ are homogeneous in a certain sense. The measure of homogeneity of the biclusters depends on the specific application. The method received renewed attention when Cheng and Church [6] applied it to gene expression data. Many biclustering algorithms have been developed since (e.g., see [48] and the references therein). An introductory overview of clustering algorithms for gene expression data can be found in the lecture notes [45]. The information bottleneck (IB) method, which can be viewed as a uni-directional information-theoretic variant of biclustering, was successfully applied to gene expression data as well [46].
In 2003, Dhillon et al. [10] adopted an information-theoretic approach to biclustering. They used mutual information to characterize the quality of a biclustering. Specifically, for the special case when the underlying matrix represents the joint probability distribution of two discrete random variables $X$ and $Y$, i.e., $a_{st} = P\{X = s, Y = t\}$, their goal was to find clustering functions $f\colon \{1, \dots, S\} \to \{1, \dots, K\}$ and $g\colon \{1, \dots, T\} \to \{1, \dots, L\}$ that maximize $I(f(X); g(Y))$ for specific $K$ and $L$. This idea has since been successfully employed in numerous research papers, e.g., [20,30,35,47], where mutual information is typically estimated from samples.
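The objective of this approach can be illustrated with a short sketch. It evaluates the mutual information between the clustered variables for two candidate pairs of clustering functions on a small joint pmf. The matrix `a`, the cluster maps, and all function names are hypothetical and chosen for illustration only; this is not the algorithm of [10], merely an evaluation of its objective in nats.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, pab in enumerate(row):
            if pab > 0:
                mi += pab * math.log(pab / (pa[a] * pb[b]))
    return mi

def cluster_joint(joint, f, g, K, L):
    """Joint pmf of (f(X), g(Y)): add up a_st over each bicluster."""
    out = [[0.0] * L for _ in range(K)]
    for s, row in enumerate(joint):
        for t, a_st in enumerate(row):
            out[f[s]][g[t]] += a_st
    return out

# Hypothetical 4 x 4 joint pmf a_st = P{X = s, Y = t} with two visible blocks.
a = [
    [0.10, 0.10, 0.02, 0.03],
    [0.10, 0.10, 0.03, 0.02],
    [0.02, 0.03, 0.10, 0.10],
    [0.03, 0.02, 0.10, 0.10],
]

# Cluster maps as lists: f[s] is the cluster index of row s, g[t] of column t.
f_good, g_good = [0, 0, 1, 1], [0, 0, 1, 1]  # follows the block structure
f_bad, g_bad = [0, 1, 0, 1], [0, 1, 0, 1]    # cuts across the blocks

i_full = mutual_information(a)
i_good = mutual_information(cluster_joint(a, f_good, g_good, 2, 2))
i_bad = mutual_information(cluster_joint(a, f_bad, g_bad, 2, 2))
```

By the data processing inequality neither clustering can exceed `i_full`; the clustering aligned with the block structure retains most of it, while the misaligned one retains very little.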
In the present work, we investigate a theoretical extension of the approach in [10], where we consider blocks of n i.i.d. sources and S n , K n , T n , and L n scale exponentially in the blocklength n. The resulting information-theoretic biclustering problem turns out to be equivalent to hypothesis testing against independence with multi-terminal data compression [23] and to a pattern recognition problem [52]. Both these problems are not yet solved in general (for a survey on the hypothesis testing problem, see [24]). The pattern recognition problem has been extensively studied on doubly symmetric binary and jointly Gaussian sources.
A special case of the information-theoretic biclustering problem is given by the IB problem, studied in [18], based on the IB method [49]. This problem is solved in terms of a single-letter characterization and is known to be equivalent to source coding under logarithmic loss. A generalization to multiple terminals, the CEO problem under logarithmic loss [8], is currently only solved under specific Markov constraints.

Contributions
The aim of this paper is to characterize the achievable region of the information-theoretic biclustering problem, its extensions and special cases, and connect them to known problems in network information theory. This problem is fundamentally different from 'classical' distributed source coding problems like distributed lossy compression [13, Chapter 12]. Usually, one aims at reducing redundant information, i.e., information that is transmitted by multiple encoders, as much as possible, while still guaranteeing correct decoding. By contrast, in the biclustering problem we are interested in maximizing this very redundancy.
In this sense, it is complementary to conventional distributed source coding and requires adapted proof techniques.
More specifically, the main contributions are as follows.
• We formally prove the equivalence of the information-theoretic biclustering, the hypothesis testing [23], and the pattern recognition problem [52] (Theorem 3.4) and connect it to the IB problem [18,49] (Proposition 6.2).
• We extensively study the doubly symmetric binary source (DSBS) as a special case (Section 5).
In order to perform this analysis, we require stronger cardinality bounds than the ones usually obtained using the convex cover method [13, Appendix C].
• We are able to improve upon the state-of-the-art cardinality bounding techniques by combining the convex cover method with the perturbation method [13, Appendix C] and leveraging ideas similar to [37], which allow us to restrict our attention to the extreme points of the achievable region. The resulting bounds (Proposition 4.3) allow for the use of binary auxiliary random variables in the case of binary sources.
• Based on a weaker conjecture (Conjecture 5.2), we argue that there is indeed a gap between the outer and the inner bound for a DSBS (Conjecture 5.1).
• We propose an extension of the CEO problem under an information constraint, studied in [8], which requires multiple description (MD) coding [14] (see [21] for applications) to account for the possibility that descriptions are not delivered. Using tools from submodularity theory and convex analysis, we are able to provide a complete single-letter characterization of the resulting achievable region (Theorem 7.3), which has the remarkable feature that it allows exploiting rate that is in general insufficient for successful typicality decoding.

Notation and conventions
For a total order $\sqsubset$ on a set $E$ (cf. [41, Definition 1.5]) and $e \in E$ we will use the notation ${\sqsupset}e := \{e' \in E : e' \sqsupset e\}$ and accordingly for $\sqsupseteq$, $\sqsubset$, and $\sqsubseteq$. For example, given the total order $\sqsubset$ on $\{1, 2, 3\}$ with $3 \sqsubset 1 \sqsubset 2$, we have ${\sqsupset}3 = \{1, 2\}$, ${\sqsupset}1 = \{2\}$ and ${\sqsupset}2 = \varnothing$. We will use the shorthand $[l : k] := \{l, l + 1, \dots, k - 1, k\}$. The notation $\mathbb{1}_A$, $\overline{A}$, $\operatorname{conv}(A)$, and $|A|$ is used for the indicator, topological closure, convex hull, and cardinality of a set $A$, respectively. When there is no possibility of confusion we identify a singleton set with its element, e.g., we write $\{1, 2, 3\} \setminus 1 = \{2, 3\}$. Let $\mathbb{R}_+$ and $\mathbb{R}_-$ be the sets of non-negative and non-positive reals, respectively.
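The order-related notation can be made concrete with a small sketch. It represents a total order as a list from smallest to largest element and returns the set of elements that come strictly after a given element, which reproduces the example with the order 3, 1, 2 given above. The function names are hypothetical.

```python
def strictly_after(order, e):
    """Elements that come strictly after e; `order` lists the set from smallest to largest."""
    return set(order[order.index(e) + 1:])

def int_range(l, k):
    """The shorthand [l : k] = {l, l + 1, ..., k - 1, k}."""
    return set(range(l, k + 1))

# The example order on {1, 2, 3}: 3 comes first, then 1, then 2.
order = [3, 1, 2]
after_3 = strictly_after(order, 3)
after_1 = strictly_after(order, 1)
after_2 = strictly_after(order, 2)
```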
We denote random quantities and their realizations by capital and lowercase letters, respectively. Furthermore, vectors are indicated by bold-face type and have length $n$, if not otherwise specified. Random variables are assumed to be supported on finite sets and, unless otherwise specified, the same letter is used for the random variable and for its support set, e.g., $Y$ takes values in $\mathcal{Y}$ and $X_3$ takes values in $\mathcal{X}_3$. Given a random variable $X$, we write $p_X$ for its probability mass function (pmf), where the subscript might be omitted if there is no ambiguity. The notation $X \sim p$ indicates that $X$ is distributed according to $p$, and $X \sim \mathcal{B}(p)$ and $Y \sim \mathcal{U}(\mathcal{Y})$ denote a Bernoulli distributed random variable $X$ with parameter $p \in [0, 1]$ and a uniformly distributed random variable $Y$ on its support set $\mathcal{Y}$. We use $E[X]$ and $P\{A\}$ for the expectation of the random variable $X$ and the probability of an event $A$, respectively. Subscripts indicate parts of vectors; we further use the common notation $\boldsymbol{x}_i^j := \boldsymbol{x}_{\{i, \dots, j\}}$, $\boldsymbol{x}^j := \boldsymbol{x}_1^j$. If a vector is already carrying a subscript, it will be separated by a comma, e.g., $\boldsymbol{x}_{3,1}^5$. Let $\boldsymbol{0}$ denote the all-zeros vector and $\boldsymbol{e}_i = (e_{i,1}, e_{i,2}, \dots, e_{i,n}) \in \mathbb{R}^n$ the $i$th canonical base vector, i.e., $e_{i,j} = \mathbb{1}_i(j)$. We use the notation of [9, Chapter 2] for information-theoretic quantities. In particular, given random variables $(X, Y, Z)$ and pmfs $p$ and $q$, $H(X)$, $H(X|Y)$, $I(X;Y)$, $I(X;Y|Z)$, and $D(p \| q)$ denote entropy, conditional entropy, mutual information, conditional mutual information, and Kullback-Leibler divergence, respectively. All logarithms in this paper are to base $e$ and therefore all information-theoretic quantities are measured in nats. The notation $h_2(p) := -p \log p - (1 - p) \log(1 - p)$ is used for the binary entropy function, $a * b := a(1 - b) + (1 - a)b$ is the binary convolution operation, and the symbol $\oplus$ denotes binary addition. The notation $X$ −∘− $Y$ −∘− $Z$ indicates that $X$, $Y$, and $Z$ form a Markov chain in this order and $X \perp Y$ denotes that $X$ and $Y$ are independent random variables.
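As a hedged illustration of these conventions, the following sketch implements the binary entropy function, the binary convolution, and the Kullback-Leibler divergence with natural logarithms, so that all values are in nats. The definitions used are the standard ones; the function names `h2`, `bconv`, and `kl` are chosen here for illustration.

```python
import math

def h2(p):
    """Binary entropy in nats, with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def bconv(a, b):
    """Binary convolution a * b = a(1 - b) + (1 - a)b."""
    return a * (1.0 - b) + (1.0 - a) * b

def kl(p, q):
    """D(p || q) in nats for pmfs given as lists; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A fair bit carries log 2 nats of entropy.
fair_bit_entropy = h2(0.5)
# Cascading two binary symmetric channels gives crossover probability a * b.
cascade = bconv(0.1, 0.2)
```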
Slightly abusing notation we consider ∅ to be a degenerate random variable that is almost surely a constant, such that, e.g., To simplify the presentation (cf. [13]) when generating codebooks, we will assume that the codebook size is an integer. We will use superscript to indicate that a relation follows from a specific equation. For example, the inequality a

Problem statement
In this section we will introduce the information-theoretic biclustering problem (or biclustering problem for short) with two sources and provide bounds for its achievable region. A schematic overview of the problem is presented in Fig. 1. Let $(X, Y)$ be two random variables. The random vectors $(\boldsymbol{X}, \boldsymbol{Y})$ consist of $n$ i.i.d. copies of $(X, Y)$. Given a block length $n \in \mathbb{N}$ and coding rates $R_1, R_2 \in \mathbb{R}_+$, an $(n, R_1, R_2)$-code consists of two functions $f\colon \mathcal{X}^n \to \mathcal{M}_1$ and $g\colon \mathcal{Y}^n \to \mathcal{M}_2$ such that the finite sets $\mathcal{M}_k$ satisfy $\log|\mathcal{M}_k| \le nR_k$, $k \in \{1, 2\}$. Thus, the coding rates $R_1$ and $R_2$ limit the complexity of the encoders. In contrast to rate-distortion theory, we do not require a specific distortion measure; rather, we quantify the quality of a code in pure information-theoretic terms, namely via mutual information. The idea is to find functions $f$ and $g$ that extract a compressed version of the common randomness in the observed data $\boldsymbol{X}$ and $\boldsymbol{Y}$. To this end, we use the normalized mutual information $I(f(\boldsymbol{X}); g(\boldsymbol{Y}))/n$ to quantify the relevance of the two encodings.
The achievable region $\overline{\mathcal{R}}$ is defined as the closure of the set $\mathcal{R}$ of achievable triples.
REMARK 2.1 Note that a standard time-sharing argument shows that $\overline{\mathcal{R}}$ is a convex set (cf. [13, Section 4.4]).
We also point out that stochastic encodings cannot enlarge the achievable region, as any stochastic encoding can be represented as a convex combination of deterministic encodings and $\overline{\mathcal{R}}$ is convex.
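To make the notion of a code and its relevance concrete, the following sketch evaluates the normalized mutual information of three toy encoder pairs by exhaustive enumeration. It assumes a doubly symmetric binary source, i.e., a uniform bit X observed through a binary symmetric channel with crossover probability p; the block length n = 3, the value p = 0.1, and the encoders `first_bit`, `majority`, and `identity` are hypothetical choices for illustration.

```python
import itertools
import math

def dsbs_block_pmf(x, y, p):
    """Probability of a pair of length-n binary blocks (x, y) under the assumed DSBS."""
    n = len(x)
    d = sum(xi != yi for xi, yi in zip(x, y))  # number of positions where x and y differ
    return (0.5 ** n) * (p ** d) * ((1.0 - p) ** (n - d))

def normalized_mi(f, g, n, p):
    """I(f(X); g(Y)) / n in nats, by enumerating all pairs of blocks."""
    joint = {}
    for x in itertools.product((0, 1), repeat=n):
        for y in itertools.product((0, 1), repeat=n):
            key = (f(x), g(y))
            joint[key] = joint.get(key, 0.0) + dsbs_block_pmf(x, y, p)
    pu, pv = {}, {}
    for (u, v), q in joint.items():
        pu[u] = pu.get(u, 0.0) + q
        pv[v] = pv.get(v, 0.0) + q
    mi = sum(q * math.log(q / (pu[u] * pv[v])) for (u, v), q in joint.items() if q > 0)
    return mi / n

def first_bit(x):
    return x[0]

def majority(x):
    return int(sum(x) >= 2)

def identity(x):
    return x

n, p = 3, 0.1
# A binary message set satisfies log|M_k| <= n R_k as soon as R_k >= log(2) / n.
rate_one_bit = math.log(2) / n
mu_first = normalized_mi(first_bit, first_bit, n, p)
mu_major = normalized_mi(majority, majority, n, p)
mu_ident = normalized_mi(identity, identity, n, p)  # needs rate log 2 per encoder
```

Both one-bit encoders operate at the same rate `rate_one_bit`, yet they achieve different relevance, which shows that the choice of encoding functions matters and not only their rates.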

Equivalent problems
The biclustering problem turns out to be equivalent to a hypothesis testing and a pattern recognition problem. In this section we will clarify this equivalence by showing that the multi-letter regions agree. These equivalences will provide us with the achievability of $\mathcal{R}_*$, the 'multi-letter' region of the biclustering problem.
DEFINITION 3.1 Let $\mathcal{R}_*$ be the set of triples $(\mu, R_1, R_2)$ such that there exist $n \in \mathbb{N}$ and random variables
Next, we consider the hypothesis testing problem with data compression when testing against independence [23, Section 6] and the pattern recognition problem [52]. For completeness' sake we briefly describe the problem setups.
DEFINITION 3.2 (Hypothesis testing against independence) Given the potentially dependent sources $(X, Y)$, define the independent random variables $(\bar{X}, \bar{Y}) \sim p_X \times p_Y$. An $(n, R_1, R_2)$ hypothesis test consists of an $(n, R_1, R_2)$-code $(f_n, g_n)$ and a set $\mathcal{A}_n \subseteq \mathcal{M}_1 \times \mathcal{M}_2$, where $\mathcal{M}_1$ and $\mathcal{M}_2$ are the ranges of $f_n$ and $g_n$, respectively. The type I and type II error probabilities of $(f_n, g_n, \mathcal{A}_n)$ are defined as $\alpha_n := P\{(f_n(\boldsymbol{X}), g_n(\boldsymbol{Y})) \in \mathcal{A}_n\}$ and $\beta_n := P\{(f_n(\bar{\boldsymbol{X}}), g_n(\bar{\boldsymbol{Y}})) \notin \mathcal{A}_n\}$, respectively. A triple $(\mu, R_1, R_2)$ is HT-achievable if, for every $\varepsilon > 0$, there is a sequence of $(n, R_1, R_2)$ hypothesis tests $(f_n, g_n, \mathcal{A}_n)$, $n \in \mathbb{N}$ such that $\lim_{n \to \infty} \alpha_n \le \varepsilon$,
Let $\mathcal{R}_{\mathrm{HT}}$ denote the set of all HT-achievable triples.
DEFINITION 3.3 (Pattern recognition) Let $(\boldsymbol{X}(i), \boldsymbol{Y}(i))$ be $n$ i.i.d. copies of $(X, Y)$, independently generated for each $i \in \mathbb{N}$. A triple $(\mu, R_1, R_2)$ is said to be PR-achievable if, for any $\varepsilon > 0$, there is some $n \in \mathbb{N}$, such that there exists an $(n, R_1, R_2)$-code $(f, g)$ and a function φ :
where $\mathcal{C} := (f(\boldsymbol{X}(i)))_{i \in [1 : e^{n\mu}]}$ is the compressed codebook and $(\boldsymbol{X}(i), \boldsymbol{Y}(i))_{i \in \mathbb{N}} \perp W \sim \mathcal{U}([1 : e^{n\mu}])$. Let $\mathcal{R}_{\mathrm{PR}}$ denote the set of all PR-achievable triples.
To see this, note that (using the notation of [52]) the point for any $b > 0$ even if the random variables $X$ and $Y$ are independent. But this point is clearly not achievable in general. However, the region $\mathcal{R}_{\mathrm{in}}$ defined in the right column of [52, p. 303] coincides with our findings and the proof given in [52, Appendix A] holds for this region.
The biclustering, hypothesis testing and pattern recognition problems are equivalent in the sense that their 'multi-letter' regions agree. The proof of this result is given in Appendix A.1.

Bounds on the achievable region
The following inner and outer bounds on the achievable region follow from the corresponding results on the hypothesis testing and pattern recognition problems.
with U and V any pair of random variables satisfying U The region R o is convex since a time-sharing variable can be incorporated into U and V . The inner bound R i , however, can be improved by convexification.
Numerical evaluation of R o and R i requires the cardinalities of the auxiliary random variables to be bounded. We therefore complement Theorems 4.1 and 4.2 with the following result, whose proof is provided in Appendix A.2.
where $U$, $V$, and $Q$ are random variables such that $p_{X,Y,U,V,Q} = p_Q \, p_{X,Y} \, p_{U|X,Q} \, p_{V|Y,Q}$, $|\mathcal{U}| \le |\mathcal{X}|$, $|\mathcal{V}| \le |\mathcal{Y}|$, and $|\mathcal{Q}| \le 3$. The cardinality bound $|\mathcal{Q}| \le 3$ follows directly from the strengthened Carathéodory theorem [12, Theorem 18(ii)] because $\operatorname{conv}(\mathcal{R}_i)$ is the convex hull of a connected set in $\mathbb{R}^3$.
Note that the cardinality bounds in this result are tighter than the usual bounds obtained with the convex cover method [13,Appendix C], where the cardinality has to be increased by one. We will exploit this fact with binary sources in Section 5, to show that binary auxiliary random variables suffice. The smaller cardinality bounds come at the cost of convexification for the outer bound since in contrast to R o , the region S o is not necessarily convex.
A tight bound on the achievable region can be obtained if µ is not greater than the Gács-Körner common information (cf. [17,51,54]) of X and Y , as stated in the following corollary.

Doubly symmetric binary source
In this section, we analyze the achievable region for a DSBS. The same region (cf. Theorem 3.4) was previously analyzed in [52] in the context of a pattern recognition problem. We obtain additional results, disproving [52, Conjecture 1]. In particular, we conjecture that there is a gap between the inner bound $\operatorname{conv}(\mathcal{S}_i)$ and the outer bound $\mathcal{R}_o$ for the DSBS. To support this conjecture, we analyze a region $\mathcal{S}_b$, previously introduced by the authors of [52], with the property that $\mathcal{S}_b \subseteq \mathcal{S}_i$. We prove $\operatorname{conv}(\mathcal{S}_b) \ne \mathcal{R}_o$ and subsequently conjecture that $\operatorname{conv}(\mathcal{S}_b) = \operatorname{conv}(\mathcal{S}_i)$, based on numerical evidence.
Subsequently we will provide evidence, supporting the following conjecture.
Let S b be defined as and of each other, it follows that S b ⊆ S i . To illustrate the tradeoff between complexity (R 1 , R 2 ) and relevance (µ), the boundary of S b is depicted in Fig. 2 for p = 0.1. Based on numerical experiments, we conjecture the following.

Proof.
For a ∈ [0, 1] we define (U,V ) by the binary channels depicted in Fig. 3, satisfying  (2). For a = 0.8 we have µ ≈ 0.291103 and R ≈ 0.42281. On the other hand, we obtain µ b := max{μ : This argument can be verified numerically using interval arithmetic [34]. Code written in the Octave Programming Language [11] using its interval package [27] can be found at [38].
Note that Proposition 5.3 does not impact Conjecture 5.2 as it concerns the case $p = 0$. For $p = 0$ we have $X = Y$ and Corollary 4.1 implies $\overline{\mathcal{R}} = \{(\mu, R_1, R_2) : R_1, R_2 \ge 0 \text{ and } \mu \le \min\{R_1, R_2, \log 2\}\}$. It is easily verified that $\overline{\mathcal{R}} = \operatorname{conv}(\mathcal{S}_b)$ and thus Conjecture 5.2 holds for $p = 0$ by Proposition 4.3.
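The degenerate case p = 0 admits a direct membership test. The sketch below assumes the region is read as all triples with non-negative rates and relevance at most min{R_1, R_2, log 2}, in nats; the function name `in_region_p0` is hypothetical.

```python
import math

def in_region_p0(mu, r1, r2):
    """Membership test for the achievable region when p = 0 (so that X = Y)."""
    return r1 >= 0 and r2 >= 0 and mu <= min(r1, r2, math.log(2))

inside = in_region_p0(0.5, 0.6, 0.7)        # relevance below both rates and below log 2
above_entropy = in_region_p0(0.7, 1.0, 1.0) # relevance cannot exceed H(X) = log 2
above_rate = in_region_p0(0.3, 0.2, 1.0)    # relevance cannot exceed the smaller rate
```

The cap at log 2 reflects that two descriptions of the same uniform bit cannot share more information than the bit itself carries, no matter how large the rates are.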
In fact, it can be shown that the entire statement
To prove Proposition 5.4 we will construct a point $(\mu, R, R) \in \mathcal{R}_o$ that satisfies $(\mu, R, R) \notin \operatorname{conv}(\mathcal{S}_b)$. To this end, define the concave functions μ_b(R) := max{µ :
We can numerically compute an upper bound for the function μ_b. For $\alpha, \beta \in [0, \tfrac{1}{2}]$, we calculate on a suitably fine grid and upper bound the upper concave envelope of the implicitly defined function
On the other hand, we can obtain a lower bound for μ_o by computing (4.2) for specific pmfs that satisfy the Markov constraints in Theorem 4.2. Note that based on the cardinality bound in Proposition 4.3, we can restrict the auxiliary random variables $U$ and $V$ to be binary. We randomly sample the binary pmfs that satisfy the Markov constraints in Theorem 4.2 (but not necessarily the long Markov chain $U$ −∘− $X$ −∘− $Y$ −∘− $V$) and in doing so encounter points strictly above the graph of μ_b. Figure 4 shows the resulting bounds for $p = 0.1$ in the vicinity of $R = \log 2$. Albeit small, there is clearly a gap between μ_b and μ_o outside the margin of numerical error.
Proof of Proposition 5.4. We observed the largest gap between the two bounds at a rate of R ≈ 0.675676. The particular distribution of $(U, V)$ at this rate, resulting from optimizing over the distributions that satisfy the Markov constraints in Theorem 4.2, is given in Table 1 for reference. Note that this is an exact 985673 · $10^{-4}$ above the inner bound, thus proving Proposition 5.4. Using interval arithmetic [34] this claim can be verified numerically. Code written in the Octave Programming Language [11] using its interval package [27] can be found at [38]. It uses the distribution given in Table 1.
We firmly believe that a tight characterization of the achievable region requires an improved outer bound. However, using current information-theoretic tools, it appears very challenging to find a manageable outer bound based on the full Markov chain $U$ −∘− $X$ −∘− $Y$ −∘− $V$.
REMARK 5.1 Recently, Kumar and Courtade introduced a conjecture [7,31] concerning Boolean functions that maximize mutual information. Their work was inspired by a similar problem in computational biology [29]. A weaker form of their conjecture [7, Section IV, 2)], which was solved in [39], corresponds to a zero-rate/one-bit variant of the binary example studied here.

The information bottleneck
The information-theoretic problem posed by the IB method [49] can be obtained as a special case from the biclustering problem. We will introduce the problem setup and subsequently show how it can be derived as a special case of Definition 2.1. Note that the definition slightly differs from [10, Definition 1]. However, the achievable region is identical.
DEFINITION 6.1 A pair $(\mu, R_1)$ is IB-achievable if, for some $n \in \mathbb{N}$, there exists $f\colon \mathcal{X}^n \to \mathcal{M}_1$ with $\log|\mathcal{M}_1| \le nR_1$ and
Let $\mathcal{R}_{\mathrm{IB}}$ be the set of all IB-achievable pairs.
PROPOSITION 6.2 For a pair $(\mu, R_1)$, the following are equivalent:
3. There exists a random variable $U$ such that $U$ −∘− $X$ −∘− $Y$, $I(X;U) \le R_1$ and $I(Y;U) \ge \mu$.
Proof. The equivalence '1 ⇔ 2' holds as Definition 2.1 collapses to Definition 6.1 for $R_2 = \log|\mathcal{Y}|$. To show '2 ⇔ 3' apply Theorems 4.1 and 4.2 with $V = Y$.
The tradeoff between 'relevance' and 'complexity' can equivalently be characterized by the IB function (cf. [8,18]) $\mu_{\mathrm{IB}}(R) := \sup\{\mu : (\mu, R) \in \mathcal{R}_{\mathrm{IB}}\}$. Proposition 6.2 provides
Interestingly, the function (6.2) is the solution to a variety of different problems in information theory. As mentioned in [18], (6.2) is the solution to the problem of lossless source coding with one helper [2,58]. Witsenhausen and Wyner [55] investigated a lower bound for a conditional entropy when simultaneously requiring another conditional entropy to fall below a threshold. Their work was a generalization of [57] and furthermore related to [2,3,53,56]. The conditional entropy bound in [55] turns out to be an equivalent characterization of (6.2). Furthermore, $\mu_{\mathrm{IB}}$ characterizes the optimal error exponent when testing against independence with one-sided data compression [1, Theorem 2]. Also in the context of gambling in the horse race market, (6.2) occurs as the maximum incremental growth in wealth when rate-limited side-information is available to the gambler [15, Theorem 3].
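The single-letter condition of Proposition 6.2 can be evaluated numerically for a given test channel. The sketch below assumes a doubly symmetric binary source with crossover probability p and takes U as the output of a binary symmetric channel with crossover probability alpha applied to X, so that U, X, Y form a Markov chain by construction. The values p = 0.1 and alpha = 0.2 and all function names are hypothetical; the sketch evaluates one achievable pair and does not compute the IB function itself.

```python
import math

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log(p) - (1 - p) * math.log(1 - p)

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(q * math.log(q / (pa[a] * pb[b]))
               for a, row in enumerate(joint) for b, q in enumerate(row) if q > 0)

def ib_point(p_xy, p_u_given_x):
    """Return (I(X;U), I(Y;U)) when U depends on (X, Y) only through X."""
    nx, ny, nu = len(p_xy), len(p_xy[0]), len(p_u_given_x[0])
    p_x = [sum(row) for row in p_xy]
    p_xu = [[p_x[x] * p_u_given_x[x][u] for u in range(nu)] for x in range(nx)]
    p_yu = [[sum(p_xy[x][y] * p_u_given_x[x][u] for x in range(nx)) for u in range(nu)]
            for y in range(ny)]
    return mutual_information(p_xu), mutual_information(p_yu)

p, alpha = 0.1, 0.2
p_xy = [[0.5 * (1 - p), 0.5 * p], [0.5 * p, 0.5 * (1 - p)]]  # assumed DSBS pmf
bsc = [[1 - alpha, alpha], [alpha, 1 - alpha]]                # test channel p(u|x)
rate, relevance = ib_point(p_xy, bsc)

# Closed forms for this symmetric choice, using the binary convolution alpha * p.
expected_rate = math.log(2) - h2(alpha)
expected_relevance = math.log(2) - h2(alpha * (1 - p) + (1 - alpha) * p)
```

Any pair with R_1 at least `rate` and µ at most `relevance` then satisfies the third condition of the proposition for this particular U.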

Multiple description CEO problem
In [8, Appendix B] Courtade and Weissman considered a multi-terminal extension of the IB problem, as introduced in Section 6, the CEO problem under an information constraint. Analogous to how the IB problem is a special case of the biclustering problem (cf. Proposition 6.2), this CEO problem presents a special case of a multi-terminal generalization of the biclustering problem [40]. Under a conditional independence assumption, Courtade and Weissman were able to provide a single-letter characterization of the achievable region. In what follows we will extend this result by incorporating MD coding for the CEO problem. Loosely speaking, we require the CEO to obtain valuable information from the message of just one agent alone. Surprisingly, this extension also permits a single-letter characterization under the same conditional independence assumption.
In what follows, let (X J ,Y ) be J + 1 random variables, where J : Denote the set of all MI-achievable points by R MI .
To shorten notation we will introduce the set of random variables ⊆ R 2J+1 be the set of tuples (ν 0 , ν J , R J ) such that there exist random variables (U J , ∅) ∈ P * with We are now able to state the single-letter characterization of R MI , the proof of which is provided in Appendix A.3. ν 1 I(Y ;U 1 |U 2 ) ν 1 I(Y ;U 1 ) (7.9) ν 2 I(Y ;U 2 |U 1 ) ν 2 I(Y ;U 2 ) ν 2 I(Y ;U 2 ) (7.10) ν 0 I(Y ;U 1 U 2 ) ν 0 I(Y ;U 1 U 2 ) ν 0 I(Y ;U 1 U 2 ) (7.11) R 1 I(U 1 ; X 1 ) R 1 I(U 1 ; X 1 |U 2 ) R 1 I(U 1 ; X 1 ) (7.12) REMARK 7.2 Note that the total available rate of encoder 2 is R 2 = I(X 2 ;U 2 |U 1 ) to achieve a point in R MI . Interestingly, this rate is in general less than the rate required to ensure successful typicality decoding of U 2 . However, ν 2 = I(Y ;U 2 |U 1 ) can still be achieved. MI shows another interesting feature of this region. The achievable values for ν 1 and ν 2 vary across i ∈ {1, 2, 3} and hence do not only depend on the chosen random variables U 1 and U 2 , but also on the specific rates R 1 and R 2 . 1 For the notation regarding total orders refer to Section 1.3.
It is worth mentioning that by setting ν j = 0 for j = 1, 2, . . . , J, the region R MI reduces to the rate region in [8,Appendix B].
The following proposition shows that $\mathcal{R}_{\mathrm{MI}}$ remains unchanged if the cardinality bound $|\mathcal{U}_j| \le |\mathcal{X}_j| + 4^J$ is imposed for every $j \in \mathcal{J}$.
The proof of Proposition 7.4 is provided in Appendix A.4.

Summary and discussion
We introduced a multi-terminal generalization of the IB problem, termed information-theoretic biclustering. Interestingly, this problem is related to several other problems at the frontier of statistics and information theory and offers a formidable mathematical complexity. Indeed, it is fundamentally different from 'classical' distributed source coding problems where the encoders usually aim at reducing, as much as possible, redundant information among the sources while still satisfying a fidelity criterion. In the considered problem, however, the encoders are interested in maximizing precisely such redundant information.
While an exact characterization of the achievable region is mathematically very challenging and still remains elusive, we provided outer and inner bounds to the set of achievable rates. We thoroughly studied the special case of two symmetric binary sources for which novel cardinality bounding techniques were developed. Based on numerical evidence we formulated a conjecture that entails an explicit expression for the inner bound. This conjecture provides strong evidence that our inner and outer bounds do not meet in general. We firmly believe that an improved outer bound, satisfying the adequate Markov chains, is required for a tight characterization of the achievable region.
Furthermore, we considered an MD CEO problem which surprisingly permits a single-letter characterization of the achievable region. The resulting region has the remarkable feature that it allows exploiting rate that is in general insufficient to guarantee successful typicality decoding.
The interesting challenge of the biclustering problem lies in the fact that one needs to bound the mutual information between two arbitrary encodings solely based on their rates. Standard information-theoretic manipulations seem incapable of handling this requirement well.

Data availability statement
No new data were generated or analysed in support of this review.

A. Proofs
A.1 Proof of Theorem 3.4
To prove $\mathcal{R} \subseteq \mathcal{R}_*$, assume $(\mu, R_1, R_2) \in \mathcal{R}$ and choose $n$, $f$ and $g$ according to Definition 2.1. Defining $U := f(\boldsymbol{X})$ and $V := g(\boldsymbol{Y})$ yields inequalities (3.1)-(3.3) and satisfies the required Markov chain.
The inclusions $\mathcal{R}_* \subseteq \mathcal{R}_{\mathrm{HT}}$ and $\mathcal{R}_* \subseteq \mathcal{R}_{\mathrm{PR}}$ follow by applying the achievability results [23, Corollary 6] and [52, Theorem 1], respectively, to the vector source $(\boldsymbol{X}, \boldsymbol{Y})$.

A.2 Proof of Proposition 4.3
We start with the proof of conv(S o ) = R o . For fixed random variables (X,Y ) define the set of pmfs (with finite, but arbitrarily large support) From the definition of R o , we have ψ(λ λ λ ) = −∞ if λ λ λ / ∈ O, and ψ(λ λ λ ) = inf p∈Q λ λ λ · F F F(p) otherwise. This shows, that and using the same argument, one can also show that We shall now prove that ψ(λ λ λ ) = ψ(λ λ λ ) for λ λ λ ∈ O. For arbitrary λ λ λ ∈ O and δ > 0, we can find random variables ( U, X,Y, V ) ∼ p ∈ Q with λ λ λ · F F F( p) ψ(λ λ λ ) + δ . By compactness of Q(a, b) and continuity of We now show that there existsp ∈ Q(|X |, |Y|) with As a consequence of the inequalities F 1 F 2 and F 1 F 3 we have λ λ λ · F F F(p) = 0 if λ 1 + max{λ 2 , λ 3 } 0. Thus, we only need to show (A.15) for λ λ λ ∈ O with λ 1 + λ 2 < 0 and λ 1 + λ 3 < 0. To this end we use the perturbation method [19,28] and perturb p, obtaining the candidate We require The Here, we used the shorthand H φ (UX) := − ∑ u,x p(u, x)φ (u) log p(u, x) and analogous for other combinations of random variables. By (A.14), we have ∂ 2 ∂ ε 2 λ λ λ · F F F(p) ε=0 0. Observe that and consequently, Here we already used that and thus, taking into account that λ 1 + λ 3 < 0, From (A.31) we can conclude where we used Substituting in (A.21) shows that λ λ λ · F F F(p) is linear in ε. And by the optimality of p it must be constant.

A.3 Proof of Theorem 7.3
We will prove Theorem 7.3, by showing an inner and an outer bound (Lemmas A.1 and A.2, respectively) and subsequently prove tightness.
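LEMMA A.1 We have $\mathcal{R}^{(\sqsubset, \mathcal{I})}_{\mathrm{MI}} \subseteq \mathcal{R}_{\mathrm{MI}}$ for any $\mathcal{I} \subseteq \mathcal{J}$ and any total order $\sqsubset$ on $\mathcal{J}$.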
Proof. In part, the proof of this lemma closely follows the proof of [25, Theorem 1]. We will use T ε (X) to denote the ε typical sequences [25, Section III]. Pick a total order on J , a set I ⊆ J , (U J , ∅) ∈ P * , and (ν 0 , ν J , R J ) satisfying (7.4)-(7.8). We will use typicality coding and deterministic binning to obtain a code. LettingR j = I(U j ; X j U j ), we verify for A = j, j ∈ J , and any B ⊆ A, that Following the proof of [25,Theorem 1] and applying the conditional typicality lemma [25, Lemma 3.1.(iv)], we can thus for any ε > 0 and n large enough obtain an (n,R J + ε)-codef J and for any A = j, j ∈ J , a decoding function g A , such that, For j / ∈ I, we set f j :=f j , but for j ∈ I, we let f j be typicality encoding without binning, in total yielding an (n, R J + ε)-code. Moreover, for n large enough and j ∈ I, we find decoding functions g j , such that P S j 1 − ε also for the 'success' events To shorten notation, let W j = f j (X X X j ) andŴ j :=f j (X X X j ) for j ∈ J . Pick an arbitrary 2 ε > 0. Provided that n is large enough and ε small enough, we have for any A = j ( j ∈ J ), For j ∈ J and A = j we obtain the following chain of inequalities, where (A.77) and (A.78) will be justified subsequently.
LEMMA A.2 If (ν 0 , ν J , R J ) ∈ R MI , then for all i ∈ J and A ⊆ J , for some random variables (U J , Q) ∈ P * .
Proof. For (ν 0 , ν J , R J ) ∈ R MI we apply Definition 7.1, choosing an (n, R J )-code f J for X J and define U j := f j (X X X j ) for j ∈ J . In the following, let either A = J , or A = { j} for j ∈ J . Slightly abusing notation, we define ν A := ν j for A = { j} and ν A := ν 0 for A = J . We thus have With U j,i := (U j , X X X i−1 j,1 ) and The result now follows by a standard time-sharing argument. Note that the required Markov chains and the independence constraints are satisfied.
The following result is a simple corollary of Lemma A.2 and will suffice for us.
COROLLARY A.1 For any (ν 0 , ν J , R J ) ∈ R MI there are random variables (U J , Q) ∈ P * with In the following proof, we will make use of some rather technical results on convex polyhedra, derived in Appendix A.5.
Proof of Theorem 7.3. Assume (ν 0 , ν J , R J ) ∈ R MI . We can then find (U J , Q) ∈ P * such that (A.113)-(A.116) hold. We define ( ν 0 , ν J ) := −(ν 0 , ν J ) to simplify notation. It is straightforward to check that where supermodularity follows via standard information-theoretic arguments. By the extreme point theorem [16,Theorem 3.22], every extreme point of H (0) is associated with a total order on K. Such an extreme point is given by Pick arbitrary j, j ∈ J . For nonempty B ⊆ J with j ∈ B we can write H( Observe that f j,B and g B are continuous functions of p X j |U j ( · u j ). Apply the support lemma [13,Appendix C] with the functions f j,B and g B for all j ∈ J , j ∈ B ⊆ J , and X j − 1 test functions, which guarantee that the marginal distribution p X j does not change. We obtain a new random variablê U j with H(X j |U B\ jÛ j ) = H(X j |U B ) and H(Y |U B\ jÛ j ) = H(Y |U B ). By rewriting (7.4)-(7.8) in terms of conditional entropies, it is evident that the defining inequalities for R ( ,I) MI remain the same when replacing U j byÛ j . The support ofÛ j satisfies the required cardinality bound 3 |Û j | X j − 1 + J2 J−1 + 2 J−1 (A.134) The same process is repeated for every j ∈ J .

A.5 Results on convex polyhedra
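We start this appendix with a simple lemma, which will be used in several proofs.
$\boldsymbol{A} = (\boldsymbol{a}_{(1)}, \boldsymbol{a}_{(2)}, \dots, \boldsymbol{a}_{(m)})^T$ and $\boldsymbol{b} \in \mathbb{R}^m$, where $\boldsymbol{a}_{(j)}^T$ is the $j$th row of $\boldsymbol{A}$. In this section we will use the notation of [22]. In particular, we shall call a closed convex set line-free if it does not contain a (straight) line. The characteristic cone of a closed convex set $C$ is defined as $\operatorname{cc}(C) := \{\boldsymbol{y} : \boldsymbol{x} + \lambda \boldsymbol{y} \in C \text{ for all } \lambda \ge 0\}$ ($\boldsymbol{x} \in C$ arbitrary) and $\operatorname{ext}(C)$ is the set of all extreme points of $C$, i.e., points $\boldsymbol{x} \in C$ that cannot be written as $\boldsymbol{x} = \lambda \boldsymbol{y} + (1 - \lambda) \boldsymbol{z}$ with $\boldsymbol{y}, \boldsymbol{z} \in C$, $\boldsymbol{y} \ne \boldsymbol{z}$ and $\lambda \in (0, 1)$.
LEMMA A.4 A point $\boldsymbol{y}$ is in $\operatorname{cc}(H)$ if and only if $\boldsymbol{A}\boldsymbol{y} \ge \boldsymbol{0}$.
Proof. If $\boldsymbol{A}\boldsymbol{y} \ge \boldsymbol{0}$, $\boldsymbol{x} \in H$ and $\lambda \ge 0$, then $\boldsymbol{A}(\boldsymbol{x} + \lambda \boldsymbol{y}) \ge \boldsymbol{A}\boldsymbol{x} \ge \boldsymbol{b}$. On the other hand, for $\boldsymbol{a}_{(j)}^T \boldsymbol{y} < 0$, we have $\boldsymbol{a}_{(j)}^T (\boldsymbol{x} + \lambda \boldsymbol{y}) < b_j$ for $\lambda > \frac{b_j - \boldsymbol{a}_{(j)}^T \boldsymbol{x}}{\boldsymbol{a}_{(j)}^T \boldsymbol{y}} \ge 0$.
LEMMA A.5 If, for every $i \in [1 : n]$, there exists $j \in [1 : m]$ such that $\boldsymbol{e}_i = \boldsymbol{a}_{(j)}$ and for every $j \in [1 : m]$, $\boldsymbol{a}_{(j)} \ge \boldsymbol{0}$, then $H$ is line-free and $\operatorname{cc}(H) = \mathbb{R}^n_+$.
Proof. For any $\boldsymbol{y} \in \mathbb{R}^n_+$, clearly $\boldsymbol{A}\boldsymbol{y} \ge \boldsymbol{0}$ and hence $\boldsymbol{y} \in \operatorname{cc}(H)$ by Lemma A.4. If $\boldsymbol{y} \notin \mathbb{R}^n_+$ we have $y_i < 0$ for some $i \in [1 : n]$ and choose $j \in [1 : m]$ such that $\boldsymbol{a}_{(j)} = \boldsymbol{e}_i$, resulting in $\boldsymbol{a}_{(j)}^T \boldsymbol{y} = y_i < 0$. To show that $H$ is line-free assume that $\boldsymbol{x} + \lambda \boldsymbol{y} \in H$ for all $\lambda \in \mathbb{R}$. This implies $\pm\boldsymbol{y} \in \operatorname{cc}(H)$, i.e., $\boldsymbol{y} = \boldsymbol{0}$.
DEFINITION A.1 A point $\boldsymbol{x}$ is on an extreme ray of the cone $\operatorname{cc}(H)$ if the decomposition $\boldsymbol{x} = \boldsymbol{y} + \boldsymbol{z}$ with $\boldsymbol{y}, \boldsymbol{z} \in \operatorname{cc}(H)$ implies that $\boldsymbol{y} = \lambda \boldsymbol{z}$ for some $\lambda \in \mathbb{R}$.
Proof. Assuming that less than $n$ linearly independent inequalities are satisfied with equality at $\boldsymbol{x}$, we find $\boldsymbol{0} \ne \boldsymbol{c} \perp (\boldsymbol{a}_{(j)})_{j \in \mathcal{A}(\boldsymbol{x})}$ and thus $\boldsymbol{x} \pm \varepsilon \boldsymbol{c} \in H$ for a small $\varepsilon > 0$, showing that $\boldsymbol{x} \notin \operatorname{ext}(H)$. Conversely assume $\boldsymbol{x} \notin \operatorname{ext}(H)$, i.e., $\boldsymbol{x} = \lambda \boldsymbol{x}' + (1 - \lambda) \boldsymbol{x}''$ for $\lambda \in (0, 1)$ and $\boldsymbol{x}', \boldsymbol{x}'' \in H$, $\boldsymbol{x}' \ne \boldsymbol{x}''$. For any $j \in \mathcal{A}(\boldsymbol{x})$, we then have $\lambda \boldsymbol{a}_{(j)}^T \boldsymbol{x}' + (1 - \lambda) \boldsymbol{a}_{(j)}^T \boldsymbol{x}'' = b_j$, which implies $\boldsymbol{a}_{(j)}^T \boldsymbol{x}' = \boldsymbol{a}_{(j)}^T \boldsymbol{x}'' = b_j$ and therefore $\boldsymbol{0} \ne \boldsymbol{x}' - \boldsymbol{x}'' \perp (\boldsymbol{a}_{(j)})_{j \in \mathcal{A}(\boldsymbol{x})}$.
Proof. We obtain $\boldsymbol{0} \ne \boldsymbol{r} \perp (\boldsymbol{a}_{(j)})_{j \in \mathcal{A}(\boldsymbol{x})}$. Define $\lambda_1 := \inf\{\lambda : \boldsymbol{x} + \lambda \boldsymbol{r} \in H\}$ and $\lambda_2 := \sup\{\lambda : \boldsymbol{x} + \lambda \boldsymbol{r} \in H\}$. Clearly $\lambda_1 \le 0 \le \lambda_2$. As $H$ is line-free, we may assume without loss of generality $\lambda_1 = -1$ (note that $\boldsymbol{x} \notin \operatorname{ext}(H)$) and set $\boldsymbol{c} = \boldsymbol{x} - \boldsymbol{r}$. We now have $\boldsymbol{c} \in \operatorname{ext}(H)$ as otherwise $\boldsymbol{c} - \varepsilon \boldsymbol{r} \in H$ for some small $\varepsilon > 0$.
If $\lambda_2 < \infty$, define $\boldsymbol{d} = \boldsymbol{x} + \lambda_2 \boldsymbol{r}$, which yields $\boldsymbol{d} \in \operatorname{ext}(H)$ and $\boldsymbol{x} = \lambda \boldsymbol{c} + (1 - \lambda) \boldsymbol{d}$ with $\lambda = \frac{\lambda_2}{\lambda_2 + 1}$. Note that $\lambda_2 \ne 0$ as $\boldsymbol{x} \notin \operatorname{ext}(H)$. If $\lambda_2 = \infty$ we have $\boldsymbol{x} - \boldsymbol{c} = \boldsymbol{r} \in \operatorname{cc}(H)$. We need to show that $\boldsymbol{r}$ is also on an extreme ray of $\operatorname{cc}(H)$.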
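The characterization of the characteristic cone in Lemma A.4 and its special case in Lemma A.5 can be checked numerically on a small example. The sketch below assumes that H is the polyhedron of all x with Ax ≥ b componentwise; this reading of H is an assumption, as is the example polyhedron, and the function names are hypothetical.

```python
def dot(a, y):
    """Inner product of two vectors given as lists."""
    return sum(ai * yi for ai, yi in zip(a, y))

def in_polyhedron(A, b, x, tol=1e-12):
    """x is in H if every inequality a_(j)^T x >= b_j holds."""
    return all(dot(row, x) >= bj - tol for row, bj in zip(A, b))

def in_characteristic_cone(A, y, tol=1e-12):
    """Lemma A.4: y is in cc(H) if and only if A y >= 0 componentwise."""
    return all(dot(row, y) >= -tol for row in A)

# Hypothetical H in the plane: x1 >= 0, x2 >= 0, x1 + x2 >= 1.
# The rows contain both canonical base vectors and are all non-negative,
# so Lemma A.5 predicts that cc(H) is the non-negative orthant.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]
x = [1.0, 0.5]

y_in = [2.0, 3.0]    # non-negative direction
y_out = [1.0, -1.0]  # has a negative component

def moved(x, y, lam):
    """The point x + lam * y."""
    return [xi + lam * yi for xi, yi in zip(x, y)]

stays = all(in_polyhedron(A, b, moved(x, y_in, lam)) for lam in (0.0, 1.0, 10.0, 100.0))
leaves = not in_polyhedron(A, b, moved(x, y_out, 1.0))
```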