Minimum Probability of Error of List M-ary Hypothesis Testing

We study a variation of Bayesian M-ary hypothesis testing in which the test outputs a list of L candidates out of the M possible hypotheses upon processing the observation. We study the minimum error probability of list hypothesis testing, where an error is defined as the event that the true hypothesis is not in the list output by the test. We derive two exact expressions of the minimum probability of error. The first is expressed as the error probability of a certain non-Bayesian binary hypothesis test and is reminiscent of the meta-converse bound. The second is expressed as the tail probability of the likelihood ratio between the two distributions involved in the aforementioned non-Bayesian binary hypothesis test.


I. INTRODUCTION
Statistical hypothesis testing is the problem of deciding one of M possible statistical hypotheses after processing some observation data modeled by a random variable. Hypothesis testing is one of the main problems in statistics and inference and finds applications in areas such as the social, biological, medical and computer sciences, signal processing and information theory. Depending on the subject area and underlying assumptions, it can be referred to as model selection, classification, discrimination or detection. Hypothesis testing problems are typically classified as binary or non-binary, depending on the number of hypotheses, and Bayesian or non-Bayesian, depending on whether or not priors on the hypotheses are known.
The minimum average probability of error of Bayesian binary hypothesis testing is attained by the likelihood ratio test. Similarly, the minimum average error probability of Bayesian M-ary hypothesis testing is attained by the maximum a posteriori (MAP) test [1]. For non-Bayesian binary hypothesis testing, Neyman and Pearson formulated the optimal tradeoff between the pairwise error probabilities and showed that the likelihood ratio test attains the optimum tradeoff [2].
We study a variation of Bayesian M-ary hypothesis testing. Specifically, we allow the test to output a list with L candidate hypotheses. This setting is helpful when the number of hypotheses is very large and, for complexity reasons, one might wish to implement staggered or iterative testing. At each stage, a bank of tests of smaller dimension is run, but a candidate list is output instead of a single candidate, in order to facilitate information exchange at the next stage or iteration. List hypothesis testing is also implicitly employed in approximate recovery problems related to statistical estimation where a reduction to multiple hypothesis testing is performed (see e.g. [3, Sec. 16.2.2]). In reliable data transmission or storage, list decoding is employed in order to improve the performance of error-correcting codes [4]. In communications, list detection is employed in large linear multiple-input multiple-output systems that iteratively exchange information with the decoders of error-correcting codes (see e.g. [5]).
From a theoretical perspective, it is important to understand the minimum error probability in order to establish a performance benchmark for practical tests. In this paper, we study the minimum probability of error of list hypothesis testing. We provide two new families of bounds on the minimum probability of error. The first family bounds the minimum probability of error by that of a suitably optimized non-Bayesian binary hypothesis test and is reminiscent of the meta-converse bound in [6]. The second family instead bounds the minimum probability of error by the tail probability of the likelihood ratio, or the information spectrum [7]. When these bounds are optimized over an auxiliary output distribution, inspired by the work in [8], we show that the bounds are actually tight and provide two different expressions of the minimum probability of error. We show that the solution of the optimization of the second bound is unique and provide an expression for the optimal auxiliary distribution. In turn, these identities not only help in better understanding the minimum probability of error, but also help in assessing the tightness of the bounds.
This paper is structured as follows. Section II introduces the relevant notation for binary hypothesis testing. Section III describes the list hypothesis testing problem and derives the minimum probability of error. Section IV proves the first identity for the minimum probability of error and connects it with non-Bayesian binary hypothesis testing. Section V proves the second identity for the minimum probability of error and connects it with the information spectrum. In proving this result, it is shown that the optimal auxiliary distribution is unique. Proofs of auxiliary results can be found in the Appendix.
E. Asadi Kangarshahi is with the Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K. (e-mail: ea9972@gmail.com). A. Guillén i Fàbregas is with the Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K. and also with the Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona 08018, Spain (e-mail: guillen@ieee.org).
This work was supported in part by the European Research Council under Grant 725411.
II. BINARY HYPOTHESIS TESTING

Let Y be a random variable taking values on a set Y. We consider two hypotheses, H = 0 and H = 1, which correspond to Y being distributed according to one of two distributions, P or Q, respectively. A binary hypothesis test is a probabilistic mapping from Y to {0, 1} that, upon observing a certain y, decides which of the two hypotheses represents the observation. We let Ĥ be the random variable associated with the output of the test, and we let T, the test mapping, denote the conditional distribution P_{Ĥ|Y}.
The performance of a binary hypothesis test is characterized by the type-0 and type-1 error probabilities, defined respectively as

    \epsilon_0(T, P) = \sum_{y} P(y)\, T(1|y),
    \epsilon_1(T, Q) = \sum_{y} Q(y)\, T(0|y).

In the Bayesian setting, given prior probabilities P_H(0), P_H(1), the smallest average probability of error is given by

    \min_{T} \bigl\{ P_H(0)\, \epsilon_0(T, P) + P_H(1)\, \epsilon_1(T, Q) \bigr\},

and the minimizing T is known to be the likelihood ratio test [1]; the likelihood ratio P(y)/Q(y) is checked against the ratio of the priors. In the non-Bayesian setting, no knowledge about the prior probabilities P_H(h), h = 0, 1, is assumed. The tradeoff between the pairwise error probabilities ǫ_0(T, P) and ǫ_1(T, Q) is characterized by the function α_β(P, Q), defined as

    \alpha_\beta(P, Q) = \min_{T :\, \epsilon_1(T, Q) \le \beta} \epsilon_0(T, P).    (4)

Similarly, one can define the alternative tradeoff β_α(P, Q) as

    \beta_\alpha(P, Q) = \min_{T :\, \epsilon_0(T, P) \le \alpha} \epsilon_1(T, Q).

It is well known that a minimizing test for (4) is the likelihood-ratio threshold test [2]. Every optimal test is a threshold test in which the likelihood ratio between the two distributions is compared to a threshold λ_NP ∈ R, so that the optimal test can be expressed as

    T(0|y) = \begin{cases} 1 & P(y) > \lambda_{\rm NP}\, Q(y) \\ \delta_y & P(y) = \lambda_{\rm NP}\, Q(y) \\ 0 & P(y) < \lambda_{\rm NP}\, Q(y), \end{cases}    (6)

where, in order to solve (4), δ_y and λ_NP are chosen such that ǫ_1(T, Q) = β. The minimizing test is not unique in general, since all values of δ_y and λ_NP with the property ǫ_1(T, Q) = β yield an optimal test.

III. LIST HYPOTHESIS TESTING

Consider now a Bayesian M-ary hypothesis testing problem with two random variables X, Y jointly distributed according to P_{XY}, such that X, Y take values on X, Y, respectively, with |X| = M. The observation alphabet Y is a general alphabet that encompasses the Cartesian product of n observations and many other standard settings. Upon observing y ∈ Y we wish to decide what X was. Standard M-ary hypothesis tests output a single candidate hypothesis X̂ ∈ {1, . . ., M}. Instead, we consider list hypothesis testing. A list hypothesis test with list size L is a possibly random mapping P_{X̂|Y}, where X̂ = (X̂_1, . . ., X̂_L) ∈ X^L denotes the random vector containing a list of candidates {X̂_1, . . ., X̂_L}. For simplicity of the presentation, we assume that all candidates in the list are distinct; this does not have an effect on the structure of the test that minimizes the probability of error. We say that the true hypothesis has been successfully estimated if the true X is one of the entries of the list vector X̂ = (X̂_1, . . ., X̂_L), i.e., if X ∈ {X̂_1, . . ., X̂_L}. The problem is obviously of interest when L ≪ M.
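Before proceeding, we note that for finite observation alphabets the tradeoff α_β(P, Q) in (4), which underlies the bounds derived in the following sections, can be evaluated numerically by sorting likelihood ratios and randomizing on the boundary symbol. The following minimal sketch is our own illustration and not part of the development above; the function name, the NumPy array conventions and the toy example are assumptions made purely for illustration.

import numpy as np

def alpha_beta(P, Q, beta):
    """Neyman-Pearson tradeoff alpha_beta(P, Q) of (4): the smallest type-0
    error over all randomized tests whose type-1 error does not exceed beta.
    P and Q are probability vectors on a common finite alphabet."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Sort symbols by decreasing likelihood ratio P/Q (symbols with Q = 0 first).
    ratio = np.where(Q > 0, P / np.maximum(Q, 1e-300), np.inf)
    order = np.argsort(-ratio)
    budget = beta        # remaining type-1 error probability we may spend
    accepted = 0.0       # P-mass of the region where the test decides 0
    for y in order:
        if Q[y] <= budget:                 # accept y fully: T(0|y) = 1
            accepted += P[y]
            budget -= Q[y]
        else:                              # randomize: T(0|y) = budget / Q[y]
            accepted += P[y] * budget / Q[y]
            break
    return 1.0 - accepted

For instance, alpha_beta([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], 0.25) returns 0.45: the optimal test accepts the first symbol, randomizes on the second, and rejects the third.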
Since the joint distribution P_{XY} defines a prior distribution P_X over the alternatives, the problem is naturally cast within the Bayesian framework. The average probability of error of a given list hypothesis test P_{X̂|Y}, denoted by ǭ(P_{X̂|Y}), is written as

    \bar\epsilon(P_{\hat X|Y}) = \Pr\bigl[ X \notin \{\hat X_1, \ldots, \hat X_L\} \bigr] = 1 - \sum_{\ell=1}^{L} \Pr\bigl[ \hat X_\ell = X \bigr],

where the probabilities are computed with respect to the joint distribution of the true hypothesis X, the observation Y and the list X̂, and where the last equality holds since all elements on the list are assumed to be distinct, and thus the events {X̂_ℓ = X} for ℓ = 1, . . ., L are disjoint. Further define

    P_{\bar XY}(x_1, \ldots, x_L, y) \triangleq \frac{1}{\binom{M-1}{L-1}} \sum_{\ell=1}^{L} P_{XY}(x_\ell, y),    (15)

where \binom{a}{b} = \frac{a!}{b!\,(a-b)!}. Observe that (15) is, by assumption, defined only for distinct x_1, . . ., x_L ∈ X. In order to show that the above definition induces a probability distribution on X^L × Y, we write

    \sum_{x_1, \ldots, x_L, y} P_{\bar XY}(x_1, \ldots, x_L, y) = \frac{1}{\binom{M-1}{L-1}} \sum_{x_1, \ldots, x_L, y} \sum_{\ell=1}^{L} P_{XY}(x_\ell, y) = \frac{1}{\binom{M-1}{L-1}} \binom{M-1}{L-1} \sum_{x, y} P_{XY}(x, y) = 1,

where the first step follows from the definition of P_{\bar XY}(x_1, . . ., x_L, y) in (15), and the second step follows from the fact that, for any given x ∈ X, there are \binom{M-1}{L-1} possible list configurations containing x. We now turn to the minimum probability of error over all tests, defined as

    \bar\epsilon = \min_{P_{\hat X|Y}} \bar\epsilon(P_{\hat X|Y}).

The following result finds a test that achieves the minimum probability of error.

Lemma 1: An optimal test achieving the minimum probability of error ǭ chooses distinct (x̂_1, . . ., x̂_L) ∈ X^L such that

    \sum_{\ell=1}^{L} P_{XY}(\hat x_\ell, y) \ge \sum_{\ell=1}^{L} P_{XY}(x_\ell, y) \quad \text{for all distinct } (x_1, \ldots, x_L) \in X^L.

Proof: Any test that maximizes P_{\bar XY}(x̂_1, . . ., x̂_L, y) will maximize the probability of success and thus minimize the probability of error. Thus, we set the test to place all of its probability mass on the set of list vectors that maximize P_{\bar XY}; there might be more than one maximizing list. With this particular choice, the contribution of observation y to the probability of success equals the largest value of \sum_{\ell=1}^{L} P_{XY}(x_\ell, y) over distinct lists, and the final result is obtained from definition (15). Finally, observe that in order for the optimal test to maximize \Pr[X ∈ {x̂_1, . . ., x̂_L}, Y = y] it is needed that the x̂_ℓ for ℓ = 1, . . ., L are distinct, since otherwise there would be fewer than L summands in the sum above.
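For finite alphabets, Lemma 1 gives a direct numerical recipe for ǭ: for each observation y, the optimal list collects the L hypotheses with the largest joint mass P_{XY}(x, y). The following minimal sketch is our own illustration; the function name and the convention that P_xy[x, y] stores P_{XY}(x, y) are assumptions made for illustration only.

import numpy as np

def min_list_error(P_xy, L):
    """Minimum probability of error of list hypothesis testing, following
    Lemma 1: for each observation y (each column), the optimal list keeps
    the L hypotheses with the largest joint mass, so the success probability
    is the sum of the L largest entries of every column."""
    P_xy = np.asarray(P_xy, dtype=float)
    top_L = np.sort(P_xy, axis=0)[-L:, :]   # L largest entries of each column
    return 1.0 - top_L.sum()

For L = 1 this reduces to the usual MAP error probability 1 - \sum_y \max_x P_{XY}(x, y).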

IV. META-CONVERSE
In reference [6], Polyanskiy, Poor and Verdú introduced a lower bound to the minimum probability of error of conventional M-ary hypothesis testing. The bound, termed meta-converse, is expressed as the error probability of a non-Bayesian binary hypothesis test: the minimum error probability of M-ary hypothesis testing is lower bounded by α_{1/M}(P_{XY}, Q_X × Q_Y), where Q_X(x) = 1/M for every x ∈ X and Q_Y is an arbitrary auxiliary output distribution. It was shown in [8] that optimizing over Q_Y results in the bound being tight, thus providing the exact minimum probability of error. In this section, we show a similar family of bounds for list hypothesis testing and provide an identity that connects the minimum error probability of list hypothesis testing and the proposed bound by means of an optimization over the auxiliary distribution.
First, define an auxiliary probability distribution over the list vector as

    Q_{\bar X}(x_1, \ldots, x_L) \triangleq \frac{1}{\binom{M}{L}},    (28)

for distinct x_1, . . ., x_L ∈ X, where \bar X is a random vector defined on X^L.
The following theorem states the main result of this paper for list hypothesis testing.
Theorem 1: The minimum probability of error ǭ of Bayesian M-ary list hypothesis testing with list size L can be bounded as

    \bar\epsilon \ge 1 - \binom{M-1}{L-1}\Bigl( 1 - \alpha_{\frac{1}{\binom{M}{L}}}\bigl( P_{\bar XY}, Q_{\bar X} \times Q_Y \bigr) \Bigr),    (29)

where P_{\bar XY} and Q_{\bar X} are defined in (15) and (28), respectively, and Q_Y is an arbitrary distribution over the observation alphabet Y. In addition,

    \bar\epsilon = \max_{Q_Y} \Bigl\{ 1 - \binom{M-1}{L-1}\Bigl( 1 - \alpha_{\frac{1}{\binom{M}{L}}}\bigl( P_{\bar XY}, Q_{\bar X} \times Q_Y \bigr) \Bigr) \Bigr\},    (30)

where the following distribution is a maximizer of expression (30):

    Q_Y^*(y) = \frac{\max_{(x_1, \ldots, x_L)} P_{\bar XY}(x_1, \ldots, x_L, y)}{\sum_{y'} \max_{(x_1, \ldots, x_L)} P_{\bar XY}(x_1, \ldots, x_L, y')}.    (31)

Proof: We proceed by defining a binary hypothesis test T between two distributions on X^L × Y such that

    hypothesis 0: (\bar X, Y) \sim P_{\bar XY},    (34)
    hypothesis 1: (\bar X, Y) \sim Q_{\bar X} \times Q_Y.

The binary test T chooses hypothesis 0 whenever the list test under consideration outputs the list (x_1, . . ., x_L) upon observing y, and chooses hypothesis 1 in all other cases. Thus, the pairwise error probabilities are given by

    \epsilon_0(T, P_{\bar XY}) = 1 - \Pr\bigl[ \hat H = 0 \,|\, H = 0 \bigr]    (37)

and by the corresponding expression for ǫ_1(T, Q_{\bar X} × Q_Y), where (39) and (43) follow from the definition of the binary test T(0|x_1, . . ., x_L, y) and (44) follows from the definition of Q_{\bar X} in (28). Therefore, from the conditions above we can see that, for any distribution Q_Y, the lower bound (29) holds, since the pairwise error probability ǫ_0(T, P_{\bar XY}) of the above binary test cannot be better than the Neyman-Pearson optimal tradeoff (4). This proves (29). In addition, since Q_Y is arbitrary, the bound also holds for the maximizing distribution. In order to prove the tightness of the bound, we now need to show the reverse inequality (50). In order to show (50), we set Q_Y = Q*_Y defined in (31) and rewrite the α_β function as in (51) (see e.g. [9, Ch. 11]), where the first probability is computed with respect to P_{\bar XY} and the second one is computed with respect to Q_{\bar X} × Q*_Y. Evaluating this expression at the threshold λ* defined in (52), we obtain (53)-(56). From Lemma 1, we then have (57), and thus (58), which implies (50), proving the desired result.
The identity established by Theorem 1 can be rewritten in terms of the alternative pairwise error probability tradeoff.

Corollary 1: Identity (30) can be rewritten in terms of the tradeoff β_α(P_{\bar XY}, Q_{\bar X} × Q_Y).

The proof of Theorem 1 suggests a broad family of lower bounds to the probability of error, parametrized by the auxiliary distribution Q_Y. In particular, for a fixed auxiliary distribution Q_Y, the right-hand side of (29), or any equivalent rewriting of it, is a valid lower bound on ǭ. In order to efficiently compute these bounds, one must choose a convenient Q_Y. The specific choice will naturally depend on the specifics of the problem at hand.
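As a numerical illustration of this family of bounds, the following sketch (our own, with assumed array conventions and function names) evaluates the right-hand side of (29) for a given Q_Y on finite alphabets by flattening P_{\bar XY} and Q_{\bar X} × Q_Y into vectors over X^L × Y; it reuses the alpha_beta routine sketched in Section II.

import numpy as np
from math import comb
from itertools import combinations

def list_meta_converse_bound(P_xy, Q_y, L):
    """Evaluate the lower bound (29) for a given auxiliary distribution Q_y.
    P_xy[x, y] stores P_XY(x, y) on finite alphabets; alpha_beta is the
    Neyman-Pearson routine from the Section II sketch."""
    P_xy = np.asarray(P_xy, dtype=float)
    M = P_xy.shape[0]
    lists = list(combinations(range(M), L))      # unordered lists of L distinct hypotheses
    # P_barXY(list, y) = sum_{x in list} P_XY(x, y) / C(M-1, L-1), cf. (15)
    P_bar = np.array([P_xy[list(s), :].sum(axis=0) for s in lists]) / comb(M - 1, L - 1)
    # Q_barX x Q_Y, with Q_barX uniform over the C(M, L) lists, cf. (28)
    Q_bar = np.tile(np.asarray(Q_y, dtype=float), (len(lists), 1)) / comb(M, L)
    a = alpha_beta(P_bar.ravel(), Q_bar.ravel(), 1.0 / comb(M, L))
    return 1.0 - comb(M - 1, L - 1) * (1.0 - a)  # right-hand side of (29)

Setting Q_y to the maximizing distribution Q*_Y in (31) recovers ǭ itself, which can be cross-checked against the min_list_error routine sketched in Section III.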

V. INFORMATION SPECTRUM
In this section, we show an alternative identity for the probability of error of list hypothesis testing. Specifically, this identity is expressed as a function of the tail probability that the likelihood ratio exceeds a certain threshold. These expressions have sometimes been termed information spectrum [7].
Theorem 2: For a fixed auxiliary distribution Q_Y and constant λ ≥ 0, the minimum probability of error ǭ satisfies the information-spectrum lower bound (68). In addition, maximizing (68) over Q_Y and λ ≥ 0 yields the exact characterization (69), where Q*_Y defined in (31) is the unique maximizer of (69).
Proof: As shown in the proof of Theorem 1, the α_β function appearing in (29) can be lower bounded by the tail probability of the likelihood ratio minus a penalty term, as in (71)-(73), where (72) holds for any fixed λ ≥ 0 and (73) follows since the second term is always non-negative. Applying this to (29), the bound (68) follows.
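For completeness, one standard way to obtain a bound of this type, which we include here as a generic derivation and which need not coincide line by line with the chain (71)-(73), lower bounds the Neyman-Pearson function by a tail probability: for any test T, any λ ≥ 0 and any pair of distributions (P, Q),

    \epsilon_0(T, P) = \sum_{y} P(y)\, T(1|y)
                     \ge \sum_{y :\, P(y) \le \lambda Q(y)} P(y)\, T(1|y)
                     = \Pr_P\bigl[ P(Y) \le \lambda Q(Y) \bigr] - \sum_{y :\, P(y) \le \lambda Q(y)} P(y)\, T(0|y)
                     \ge \Pr_P\bigl[ P(Y) \le \lambda Q(Y) \bigr] - \lambda \sum_{y} Q(y)\, T(0|y)
                     = \Pr_P\bigl[ P(Y) \le \lambda Q(Y) \bigr] - \lambda\, \epsilon_1(T, Q),

so that minimizing over tests with ǫ_1(T, Q) ≤ β gives α_β(P, Q) ≥ Pr_P[P(Y) ≤ λ Q(Y)] - λβ. Applied with P = P_{\bar XY} and Q = Q_{\bar X} × Q_Y, this turns the bound of Theorem 1 into a tail probability of the likelihood ratio between P_{\bar XY} and Q_{\bar X} × Q_Y, as in (68).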
For the particular choice of λ* in (52), we obtain the chain (74)-(77), where (75) and (76) follow from (57) and (58), respectively. Equations (74)-(77) imply that λ* in (52) is a maximizer of (74), and thus that (69) holds. We now proceed with the proof that Q*_Y as defined in (31) is the unique maximizer of (69). We divide the proof into two parts, depending on whether or not Q_{\bar X} × Q_Y is absolutely continuous with respect to P_{\bar XY}.

1) Q_{\bar X} × Q_Y is absolutely continuous with respect to P_{\bar XY}: Using (71), we rewrite (30) as (79). Both (79) and (69) are exact characterizations of the error probability. However, (79) has an additional nonnegative term compared to (69). Thus, any maximizing distribution and constant Q*_Y and λ* of (69) are also maximizers of (79). As a result, by comparing both equations, we conclude that the additional term must vanish at the maximizers. Using the definition of Q_{\bar X} in (28) and the absolute continuity of Q_{\bar X} × Q*_Y with respect to P_{\bar XY}, this implies a condition that holds for all (x_1, . . ., x_L) ∈ X^L and y ∈ Y. Since this expression holds for arbitrary (x_1, . . ., x_L) ∈ X^L, in particular it holds for the maximizing (x_1, . . ., x_L) ∈ X^L, leading to (84), where (84) follows from the fact that Q*_Y is a probability distribution. We have shown that, for the maximizing Q*_Y, λ* must satisfy (84). In addition, the maximizing λ* must minimize the second term of (69).
Therefore, since the first term of (69) is increasing in λ, the smallest λ satisfying (84) is the maximizer of (69), which leads to the condition in (86). Observe that the left-hand side of (86) is itself a probability distribution on Y and thus (86) holds with equality for all y ∈ Y, recovering (31).
2) Q_{\bar X} × Q_Y is not absolutely continuous with respect to P_{\bar XY}: Consider a distribution V_Y on Y and a non-Bayesian binary hypothesis test between P_{\bar XY} and Q_{\bar X} × V_Y. Then, if there exists some ŷ ∈ Y such that V_Y(ŷ) = 0, any optimal test T in the Neyman-Pearson setting is such that T(1|x_1, . . ., x_L, ŷ) = 0 whenever P_{\bar XY}(x_1, . . ., x_L, ŷ) > 0, for every (x_1, . . ., x_L) ∈ X^L. The interpretation of this statement is that, whenever V_Y(ŷ) = 0, any optimal test would not choose hypothesis 1, unless P_{\bar XY}(x_1, . . ., x_L, ŷ) = 0 for all (x_1, . . ., x_L) ∈ X^L. We have the following result, whose proof can be found in Appendix A.

Lemma 2: Let ȳ ∈ Y be such that P_{XY}(x_1, ȳ) P_{XY}(x_2, ȳ) > 0 for at least two distinct hypotheses x_1, x_2 ∈ X. Then, an auxiliary distribution Q_Y that places zero mass on ȳ cannot be optimal.
The above lemma shows that, if there are two (or more) hypotheses for which P_{XY}(x_1, ȳ) P_{XY}(x_2, ȳ) > 0, an auxiliary distribution Q_Y that associates zero mass to observation ȳ cannot be optimal. In particular, the lemma shows the existence of a distribution that places non-zero mass on all y ∈ Y and that is better than one placing zero mass at ȳ, thus bringing us back to the case where Q_{\bar X} × Q*_Y is absolutely continuous with respect to P_{\bar XY}. There is a remaining trivial case, in which there are observations y ∈ Y that can be obtained from only one individual hypothesis. In this case, there is no ambiguity as to which hypothesis caused the observation. Thus, the problem reduces to removing those observations, i.e., the optimal distribution places zero mass on those observations and non-zero mass on the others.

APPENDIX A
PROOF OF LEMMA 2

Let (T, λ*) be an optimal non-Bayesian likelihood-ratio test and the corresponding threshold for testing between P_{\bar XY} and Q_{\bar X} × Q_Y with fixed type-1 error probability ǫ_1(T, Q_{\bar X} × Q_Y). Consider the distribution Q̃_Y given, for y ≠ ȳ, by

    \tilde Q_Y(y) = \frac{\binom{M}{L}}{\mu} \max_{(x_1, \ldots, x_L) \in X^L} P_{\bar XY}(x_1, \ldots, x_L, y), \qquad y \neq \bar y,    (89)

where µ is a normalization constant involving \binom{M}{L} \max_{(x_1, \ldots, x_L)} P_{\bar XY}(x_1, \ldots, x_L, \cdot) and the threshold λ*. We first show that Q̃_Y is a probability distribution on Y, i.e., that \sum_y \tilde Q_Y(y) = 1, as stated in (90). The remaining steps compare the tests associated with Q_Y and Q̃_Y, where (115) follows from the definition of Q̃_Y in (89) and (116) follows from the definitions of Q_{\bar X} in (28) and of (x̂_1, . . ., x̂_L) in (99). This implies that, when y = ȳ and (x_1, . . ., x_L) ≠ (x̂_1, . . ., x̂_L), the behavior of the test is governed by P_{\bar XY}(x_1, . . ., x_L, ȳ), whereas when y = ȳ and (x_1, . . ., x_L) = (x̂_1, . . ., x̂_L), according to the definition of T in (100), we have that T(1|x̂_1, . . ., x̂_L, ȳ) = 0.