Saddlepoint approximations for the normalizing constant of Fisher–Bingham distributions on products of spheres and Stiefel manifolds



INTRODUCTION
The density of the Fisher-Bingham distribution on the unit sphere S^{d-1} = {y ∈ R^d : y^T y = 1}, with respect to Lebesgue measure on S^{d-1}, is

f_0(y; δ, G) = C_0(δ, G)^{-1} exp(δ^T y + y^T G y),    (1)

where the elements of δ ∈ R^d and the d × d real symmetric matrix G are parameters and C_0(δ, G) is the normalizing constant. The family of distributions (1) contains a number of subfamilies of importance in directional statistics and shape analysis, including the von Mises-Fisher distribution, the Bingham distribution, the complex Bingham distribution and the Kent distribution; see, for example, Bingham (1974), Kent (1982, 1994) and, for a general review, Mardia & Jupp (2000). Kume & Wood (2005) provided a convenient and accurate saddlepoint method for approximating the normalizing constant C_0(δ, G).
The primary purpose of this paper is to extend the Kume & Wood (2005) saddlepoint approach to enable the accurate approximation of the normalizing constant of two types of generalization of (1): to Fisher-Bingham distributions on Cartesian products of spheres, and to Fisher-Bingham distributions on Stiefel manifolds. The common feature in the two cases is that we consider the most general exponential family in which the exponent consists of the sum of linear and quadratic terms. These distributions may be defined by conditioning a multivariate Gaussian distribution on suitable quadratic constraints. Both first- and second-order saddlepoint approximations are considered; as in Kume & Wood (2005), it turns out that the second-order approximations typically have higher accuracy, often by an order of magnitude. In the present setting, this finding is perhaps rather surprising, given the complexity of the second-order term.
As a particular case of the Fisher-Bingham distribution on a product of spheres, we briefly consider a multicomponent generalization of the Kent (1982) distribution. It turns out that for this distribution the limiting relative errors of the saddlepoint approximations are zero as the concentration goes to infinity, provided the distribution is unimodal.
Let S_{d-1}^p denote the Cartesian product of p copies of S^{d-1}, and consider x = (x_1^T, ..., x_p^T)^T ∈ S_{d-1}^p, i.e., each x_i is a unit vector in S^{d-1}. We define the density, with respect to Lebesgue measure on S_{d-1}^p,

f_1(x; a, B) = C_1(a, B)^{-1} exp(a^T x + x^T B x),    (2)

where a = (a_1^T, ..., a_p^T)^T ∈ R^{dp}, B = (B_{ij})_{i,j=1}^p is a block matrix with p^2 blocks B_{ij} of dimension d × d, and C_1(a, B) is the normalizing constant chosen so that the integral of f_1(x; a, B) over S_{d-1}^p is 1. Without loss of generality we assume B is symmetric, so that B_{ij} = B_{ji}^T. Mardia (1975) considered (2) in the case p = 2; see also Jupp & Mardia (1980), Mardia & Jupp (2000), Singh et al. (2002) and Mardia et al. (2008).
The second case we consider is a generalization of (1) to a Stiefel manifold. For d ≥ q, the Stiefel manifold V_{d,q} is defined by V_{d,q} = {Y (d × q) : Y^T Y = I_q}, where I_q is the q × q identity matrix. Write X = (x_1, ..., x_q) for the d × q matrix with columns x_1, ..., x_q and define the vec operator by vec(X) = (x_1^T, ..., x_q^T)^T. The density of the Fisher-Bingham distribution on V_{d,q}, with respect to Lebesgue measure on V_{d,q}, is

f_2(X; A, B) = C_2(A, B)^{-1} exp{tr(A^T X) + vec(X)^T B vec(X)},    (3)

where tr(·) denotes trace, A = (a_1, ..., a_q) is a d × q matrix, B = (B_{ij})_{i,j=1}^q has the same structure as in (2), and C_2(A, B) is chosen so that the integral of f_2(X; A, B) over V_{d,q} is 1. The difference between (2) and (3) is that in the latter we have X^T X = I_q, so that not only are the x_i unit vectors, but they also satisfy the orthogonality conditions x_i^T x_j = 0 if i ≠ j. The Fisher matrix distribution, corresponding to the case where B in (3) is a matrix of zeros, has been considered by a number of authors, including Downs (1972), Khatri & Mardia (1977), Jupp & Mardia (1979), Prentice (1986) and Wood (1993). It turns out that in the Fisher matrix case the proposed saddlepoint approximations are available in closed form. The Bingham version of (3), corresponding to the case where A is the matrix of zeros, has been studied by Arnold & Jupp (2013); see also Hoff (2009) for simulation from (3). In § 5 we consider a parametric model for rigid body motion data in which (3) arises as a marginal distribution.
The purpose of this paper is to present saddlepoint approximations to C_1(a, B) and C_2(A, B). Matlab code for performing the calculations is available in the Supplementary Material.

REPRESENTATION OF THE NORMALIZING CONSTANTS
First, a brief heuristic outline of our approach is given, with the aim of explaining how and why it works. The key theoretical result is then presented formally in Proposition 1 below.
Consider a d × q matrix X with real components and assume d ≥ q. Provided U = X^T X has full rank, we can write X = Y U^{1/2}, where U^{1/2} is the upper-triangular matrix square root of U. Multiplying on the right by the inverse of U^{1/2} yields Y = X(U^{1/2})^{-1}, which implies Y^T Y = I_q and therefore Y ∈ V_{d,q}. If X has a matrix Gaussian density f(X) and is such that U = X^T X has full rank with probability 1, then, ignoring regions with probability zero,

f(X) dX = f(Y U^{1/2}) J(U, Y) dU dV_{d,q}(Y),    (4)

where dU is the implied volume element on the space of positive-definite matrices, dV_{d,q}(Y) is the implied volume element on V_{d,q} and J(U, Y) is the relevant Jacobian determinant.
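As a quick numerical check of this decomposition, the sketch below takes an arbitrary full-rank 3 × 2 matrix X, forms the upper-triangular square root of U = X^T X by hand for the q = 2 case, and confirms that Y = X(U^{1/2})^{-1} lies on the Stiefel manifold V_{3,2}; the particular matrix used is illustrative only.

```python
import math

# Arbitrary full-rank 3x2 matrix X (d = 3, q = 2)
X = [[1.0, 0.5],
     [0.2, 1.5],
     [-0.7, 0.3]]

# U = X^T X (2x2, symmetric positive definite)
U = [[sum(X[k][i] * X[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]

# Upper-triangular square root R with R^T R = U (Cholesky factor, q = 2 case)
r11 = math.sqrt(U[0][0])
r12 = U[0][1] / r11
r22 = math.sqrt(U[1][1] - r12 ** 2)

# Y = X R^{-1}; inverse of the upper-triangular [[r11, r12], [0, r22]]
Rinv = [[1.0 / r11, -r12 / (r11 * r22)],
        [0.0, 1.0 / r22]]
Y = [[sum(X[i][k] * Rinv[k][j] for k in range(2)) for j in range(2)]
     for i in range(3)]

# Y^T Y should be the 2x2 identity, i.e. Y lies on V_{3,2}
YtY = [[sum(Y[k][i] * Y[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
```

The same construction works for any d ≥ q with full-rank X, which is exactly the probability-one event assumed in the text.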
The key points are now summarized without proof: (a) J(U, Y) ≡ J(U) does not depend on Y, and J(U) may be calculated explicitly using Jacobian theory presented in the Supplementary Material; (b) since J(U, Y) ≡ J(U) and f(X) is a Gaussian density, the conditional distribution of Y given U = U† is of the form (3); (c) without loss of generality we may take U† = I_q, the q × q identity matrix; (d) integrating the right-hand side of both (3) and (4) over Y shows that the normalizing constant C_2(A, B) is equal to the marginal density of U at U = I_q; (e) the marginal density of U is equal to a known function of U multiplied by the joint density, evaluated at U = I_q, of a set of quadratic forms in Gaussian variables; (f) although this density is not known in closed form, its joint moment generating function is known, so that a saddlepoint density approximation may be used; (g) this way of expressing the normalizing constant can be extended from a single Y ∈ V_{d,q} to (Y_1, ..., Y_p) in a Cartesian product of Stiefel manifolds, not necessarily of the same dimension. Kume & Wood (2005) cover the case where q = 1, Y is a unit vector and U is a positive scalar. With the general integer q ≥ 1 considered here, points (a) and (e) are more challenging, and the calculations involved in (f) are more difficult to implement because a multivariate saddlepoint approximation is needed and the second-order term is far more complex.
Before stating Proposition 1, some definitions are needed. Let V_{d_1,q_1} × ··· × V_{d_p,q_p} denote a Cartesian product of p Stiefel manifolds, not necessarily of the same dimension. The Fisher-Bingham density with respect to Lebesgue measure on this product is

f_3(X_1, ..., X_p; A_1, ..., A_p; B) = C_3(A_1, ..., A_p; B)^{-1} exp{ Σ_{i=1}^p tr(A_i^T X_i) + Σ_{i,j=1}^p vec(X_i)^T B_{ij} vec(X_j) },    (5)

where for i, j = 1, ..., p, X_i ∈ V_{d_i,q_i}, A_i is a d_i × q_i matrix and B_{ij} is a (d_i q_i) × (d_j q_j) matrix; C_3(A_1, ..., A_p; B) is the normalizing constant, and (2) and (3) are particular cases of (5). This density (5) was mentioned by Mardia (1975, Reply to Discussion). Let V^{-1} = (V_{ij})_{i,j=1}^p denote a symmetric matrix consisting of p^2 blocks, where V_{ij} is a (d_i q_i) × (d_j q_j) matrix given by V_{ij} = -2B_{ij}. We shall assume that V^{-1}, and therefore V, is positive definite; it will be seen in § 3•5 that there is no loss of generality in doing so. Now define μ via the equation μ = V {vec(A_1)^T, ..., vec(A_p)^T}^T. The vech operator, designed for symmetric matrices, stacks the lower-triangular part of the matrix, with the diagonal included, using the ordering implied by vec. In this case, we have the following result.

PROPOSITION 1. The normalizing constant in (5) may be written as an explicitly known factor multiplied by g(s_1, ..., s_p; μ, V), where s_i = vech(I_{q_i}) and g(u_1, ..., u_p; μ, V) is the joint density of the quadratic forms {vech(X_i^T X_i)}_{i=1}^p when {vec(X_1)^T, ..., vec(X_p)^T}^T ∼ N(μ, V).

Although the density g(u_1, ..., u_p; μ, V) is not available in explicit form, the moment generating function of the distribution with density function g is available in convenient form. Consequently, g can be approximated using saddlepoint methods, given by (6) or (7). Although there is no problem in principle in developing saddlepoint approximations in the general setting of Proposition 1, we have chosen to focus on the particular cases (2) and (3) in the remainder of the paper, as these are likely to be the cases of most practical interest.
Case 1. C_1(a, B). This arises when q_1 = ··· = q_p = 1 and d_1 = ··· = d_p = d, so that each X_i = x_i is a d × 1 unit vector, A_i = a_i is d × 1, and a = (a_1^T, ..., a_p^T)^T, in which case C_1(a, B) = C_3(A_1, ..., A_p; B). Moreover, in this case g is the joint density of u_1 = x_1^T x_1, ..., u_p = x_p^T x_p, where x = (x_1^T, ..., x_p^T)^T ∼ N_{dp}(μ, V), and the distribution with density (2) is obtained by conditioning x on x_1^T x_1 = ··· = x_p^T x_p = 1.

Case 2. C_2(A, B). This arises when p = 1, and here g is the joint density of vech(X^T X), where x = vec(X) ∼ N_{dq}(μ, V). The distribution with density (3) is obtained by conditioning x ∼ N_{dq}(μ, V) on X^T X = I_q.

SADDLEPOINT APPROXIMATIONS
3•1. Review of multivariate saddlepoint density approximations
Later in this section we derive saddlepoint approximations for C_1(a, B) and C_2(A, B). First, general multivariate saddlepoint density approximations are briefly summarized; see, for example, Butler (2007) for further details.
Let M(θ) and K(θ) = log M(θ) denote, respectively, the moment generating function and cumulant generating function of a random m-vector x. If x has an absolutely continuous distribution on R^m, then the first-order saddlepoint approximation, f̂_1(x), of the density of x is

f̂_1(x) = (2π)^{-m/2} det{K''(θ̂)}^{-1/2} exp{K(θ̂) - θ̂^T x},

where θ̂ is the unique solution to the saddlepoint equation K'(θ̂) = x and, here and below, a circumflex means evaluation of a function of θ at θ = θ̂. Two versions of the second-order saddlepoint approximation are

f̂_2(x) = f̂_1(x)(1 + T̂),    (6)
f̂_3(x) = f̂_1(x) exp(T̂),    (7)

where

T̂ = ρ̂_4/8 - (3 ρ̂²_{13} + 2 ρ̂²_{23})/24,    (8)

with

ρ̂_4 = K̂_{ijkl} K̂^{ij} K̂^{kl},  ρ̂²_{13} = K̂_{ijk} K̂_{lmn} K̂^{ij} K̂^{kl} K̂^{mn},  ρ̂²_{23} = K̂_{ijk} K̂_{lmn} K̂^{il} K̂^{jm} K̂^{kn},    (9)

where K̂_{ij...} denotes a partial derivative of K with respect to the indicated components of θ, evaluated at θ̂, K̂^{ij} is component (i, j) of the inverse of the Hessian K̂'' = (K̂_{ij}), and the summation convention is assumed in (9), i.e., when an index occurs as both a subscript and a superscript in a product of terms, then summation over that index is implied. The approximation f̂_3 in (7) has the potential advantage that it is positive even when T̂ < -1, although in the extensive numerical calculations performed for this paper we never encountered an example where T̂ < -1. Our general findings are that f̂_3 and f̂_2 are typically very close, but f̂_3 tends to be slightly more accurate.
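As a univariate illustration of (6)-(8), the sketch below applies the approximations to the Gamma(α, 1) density, whose cumulant generating function -α log(1 - θ) is available in closed form, so the approximations can be compared with the truth; the parameter values are illustrative only.

```python
import math

alpha, x = 3.0, 2.0   # Gamma(alpha, 1) density, approximated at the point x

def K(t):  return -alpha * math.log(1.0 - t)       # CGF, valid for t < 1
def K1(t): return alpha / (1.0 - t)                # K'
def K2(t): return alpha / (1.0 - t) ** 2           # K''

t_hat = 1.0 - alpha / x            # closed-form root of the saddlepoint equation K'(t) = x

# First-order saddlepoint density approximation
f1 = math.exp(K(t_hat) - t_hat * x) / math.sqrt(2.0 * math.pi * K2(t_hat))

# Univariate second-order correction: T = K4/(8 K2^2) - 5 K3^2/(24 K2^3),
# the one-dimensional specialization of (8)-(9)
K3 = 2.0 * alpha / (1.0 - t_hat) ** 3
K4 = 6.0 * alpha / (1.0 - t_hat) ** 4
T = K4 / (8.0 * K2(t_hat) ** 2) - 5.0 * K3 ** 2 / (24.0 * K2(t_hat) ** 3)
f2, f3 = f1 * (1.0 + T), f1 * math.exp(T)

# Exact Gamma(alpha, 1) density for comparison
f_true = x ** (alpha - 1.0) * math.exp(-x) / math.gamma(alpha)
```

For α = 3 the first-order relative error is a constant of roughly 3% (the familiar Stirling-series error, the same at every x), while f̂_2 and f̂_3 bring it below 0.1%, with f̂_3 marginally the better, in line with the remarks above.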
3•2. Saddlepoint approximation of C_1(a, B)

Here we need the joint moment generating function of u_i = x_i^T x_i (i = 1, ..., p), where x = (x_1^T, ..., x_p^T)^T ∼ N_{dp}(μ, V), which is

M(θ) = E exp( Σ_{i=1}^p θ_i x^T J_i x ) = det{V C(θ)}^{-1/2} exp{ ½ μ^T V^{-1} C(θ)^{-1} V^{-1} μ - ½ μ^T V^{-1} μ },  C(θ) = V^{-1} - 2 Σ_{i=1}^p θ_i J_i,    (10)

where J_i is a block diagonal matrix with p^2 blocks, with the d × d identity matrix in the ith diagonal block and zeros elsewhere. Consequently, the cumulant generating function is

K(θ) = -½ log det{V C(θ)} + ½ μ^T V^{-1} C(θ)^{-1} V^{-1} μ - ½ μ^T V^{-1} μ.    (11)

Then

K^{(i)}(θ) = ∂K/∂θ_i = tr{C(θ)^{-1} J_i} + μ^T V^{-1} C(θ)^{-1} J_i C(θ)^{-1} V^{-1} μ,

where we have used the standard results

∂ log det C(θ)/∂θ_i = -2 tr{C(θ)^{-1} J_i},  ∂C(θ)^{-1}/∂θ_i = 2 C(θ)^{-1} J_i C(θ)^{-1},    (12)

and the saddlepoint equation is K^{(i)}(θ̂) = 1 for i = 1, ..., p. Higher derivatives of K(θ), which are needed to obtain T̂ in (8) via (9), may be calculated using (12) repeatedly. Simplified expressions for these derivatives are given in the Supplementary Material; see (S1)-(S3).
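The cumulant generating function (11) and the derivative identities in (12) can be checked numerically. The following sketch takes the scalar case p = 2, d = 1 (so u_i = x_i^2), with an illustrative 2 × 2 covariance V and mean μ, hand-rolls the 2 × 2 linear algebra to stay self-contained, and compares the analytic derivative with a central finite difference.

```python
import math

# 2x2 linear-algebra helpers
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def quad(u, A, v):  # u^T A v
    return sum(u[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

V = [[1.0, 0.3], [0.3, 2.0]]     # covariance of x = (x_1, x_2)^T
mu = [0.7, -0.4]
Vinv = inv2(V)
J = ([[1.0, 0.0], [0.0, 0.0]],   # J_1: picks out x_1^2
     [[0.0, 0.0], [0.0, 1.0]])   # J_2: picks out x_2^2

def C(th):  # C(theta) = V^{-1} - 2 (theta_1 J_1 + theta_2 J_2)
    return [[Vinv[i][j] - 2.0 * (th[0] * J[0][i][j] + th[1] * J[1][i][j])
             for j in range(2)] for i in range(2)]

def K(th):  # cumulant generating function (11)
    Cm = C(th)
    Cinv = inv2(Cm)
    w = [sum(Vinv[i][j] * mu[j] for j in range(2)) for i in range(2)]  # V^{-1} mu
    return (-0.5 * math.log(det2(matmul(V, Cm)))
            + 0.5 * quad(w, Cinv, w) - 0.5 * quad(mu, Vinv, mu))

def dK(th, i):  # analytic derivative of K, obtained via the identities (12)
    Cinv = inv2(C(th))
    CJ = matmul(Cinv, J[i])
    w = [sum(Vinv[a][b] * mu[b] for b in range(2)) for a in range(2)]
    M = matmul(matmul(Cinv, J[i]), Cinv)
    return CJ[0][0] + CJ[1][1] + quad(w, M, w)

th = (-0.2, 0.1)                 # a feasible point: C(th) is positive definite
h = 1e-6
fd = (K((th[0] + h, th[1])) - K((th[0] - h, th[1]))) / (2.0 * h)
```

Two checks fall out: K vanishes at θ = 0, as any cumulant generating function must, and the finite difference agrees with the analytic derivative to many digits.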
Applying Proposition 1 and (11), and depending on which of the saddlepoint approximations in (6) and (7) is used, we obtain the following approximations for the normalizing constant C_1(a, B):

Ĉ_{1,k}(a, B) = 2^p (2π)^{dp/2} |V|^{1/2} exp(½ μ^T V^{-1} μ) ĝ_k(1, ..., 1; μ, V)  (k = 1, 2, 3),    (13)

where ĝ_1 denotes the first-order and ĝ_2, ĝ_3 the second-order saddlepoint approximations of g, and T̂ is defined in (8).

3•3. Saddlepoint approximation of C_2(A, B)

Here we need the joint moment generating function of {x_i^T x_j : 1 ≤ i ≤ j ≤ q}, which is

M(θ) = E exp( Σ_{1 ≤ i ≤ j ≤ q} θ_{ij} x^T J_{ij} x ),    (14)

i.e., of the same form as (10), but where now θ = {θ_{ij} : 1 ≤ i ≤ j ≤ q}, and J_{ij} is a (dq) × (dq) block matrix with q^2 blocks each of dimension d × d, with block (i, j) equal to the d × d identity matrix I_d and all other blocks equal to the d × d matrix of zeros. The cumulant generating function K(θ) is given by the expression (11), but it should be noted that μ, V and

C(θ) = V^{-1} - 2 Σ_{1 ≤ i ≤ j ≤ q} θ_{ij} J_{ij}    (15)

are defined differently. Then, using the identities in (12), we obtain

K^{(rs)}(θ) = tr{C(θ)^{-1} J_{rs}} + μ^T V^{-1} C(θ)^{-1} J_{rs} C(θ)^{-1} V^{-1} μ,

and the saddlepoint equation in this case is K^{(rs)}(θ̂) = 0 if r < s and K^{(rr)}(θ̂) = 1. The resulting approximations of C_2(A, B), obtained as in (13), are

Ĉ_{2,k}(A, B) = 2^q (2π)^{dq/2} |V|^{1/2} exp(½ μ^T V^{-1} μ) ĝ_k{vech(I_q); μ, V}  (k = 1, 2, 3).    (16)

Further applications of (12) yield the higher derivatives needed to obtain T̂ in (8) via (9); see (S4)-(S6) in the Supplementary Material.
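To see the whole pipeline in the simplest instance of (2), the sketch below takes p = 1 and d = 2, i.e., the von Mises distribution on the circle, choosing B = -½ I_2 so that V = I_2 and μ = a = (κ, 0)^T. In this special case u = x^T x is noncentral chi-squared, the constant relating C_1 to g(1) follows from a direct polar-coordinate calculation, and the exact answer 2π e^{-1/2} I_0(κ) is available for comparison; variable names are illustrative.

```python
import math

d, kappa = 2, 2.0   # circle S^1, von Mises concentration

# CGF of u = x^T x for x ~ N(mu, I_d) with ||mu|| = kappa
def K(t):
    s = 1.0 / (1.0 - 2.0 * t)
    return -(d / 2.0) * math.log(1.0 - 2.0 * t) + kappa ** 2 * t * s

def K1(t):
    s = 1.0 / (1.0 - 2.0 * t)
    return d * s + kappa ** 2 * s * s

def K2(t):
    s = 1.0 / (1.0 - 2.0 * t)
    return 2.0 * d * s ** 2 + 4.0 * kappa ** 2 * s ** 3

# Solve the saddlepoint equation K'(t) = 1 by bisection (K' is increasing)
lo, hi = -50.0, 0.49
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if K1(mid) < 1.0:
        lo = mid
    else:
        hi = mid
t_hat = 0.5 * (lo + hi)

# First-order saddlepoint density approximation of g at u = 1
g1 = math.exp(K(t_hat) - t_hat) / math.sqrt(2.0 * math.pi * K2(t_hat))

# Univariate second-order correction T = K4/(8 K2^2) - 5 K3^2/(24 K2^3)
s = 1.0 / (1.0 - 2.0 * t_hat)
K3 = 8.0 * d * s ** 3 + 24.0 * kappa ** 2 * s ** 4
K4 = 48.0 * d * s ** 4 + 192.0 * kappa ** 2 * s ** 5
T = K4 / (8.0 * K2(t_hat) ** 2) - 5.0 * K3 ** 2 / (24.0 * K2(t_hat) ** 3)
g2 = g1 * (1.0 + T)

# C_1 = 2 (2 pi)^{d/2} exp(kappa^2/2) g(1; mu, I) in this special case
factor = 2.0 * (2.0 * math.pi) ** (d / 2.0) * math.exp(0.5 * kappa ** 2)
C_hat1, C_hat2 = factor * g1, factor * g2

# Exact value: 2 pi exp(-1/2) I_0(kappa), with I_0 computed by its power series
I0 = sum((kappa / 2.0) ** (2 * m) / math.factorial(m) ** 2 for m in range(40))
C_exact = 2.0 * math.pi * math.exp(-0.5) * I0
```

Here the first-order approximation overestimates C_1 by roughly 10%, and the second-order correction reduces the relative error to about 0.1%, consistent with the accuracy gains reported in § 6.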

3•4. The Fisher matrix case
The Fisher matrix distribution is the special case of (3) in which B = 0_{dq,dq}, a matrix of zeros, and the density with respect to Lebesgue measure on V_{d,q} simplifies to

C_2(A, 0_{dq,dq})^{-1} exp{tr(A^T X)}.    (17)

Consider the isometry X ↦ Q^T X R, where Q and R are, respectively, d × d and q × q orthogonal matrices obtained from the singular value decomposition A = Q Ω R^T, and Ω = (ω_{ij}) is d × q, with ω_{ij} = 0 if i ≠ j and ω_{ii} = ω_i > 0 for i = 1, ..., q. Applying this isometry, the density (17) simplifies to

C_2(Ω, 0_{dq,dq})^{-1} exp{tr(Ω^T X)},    (18)

where C_2(Ω, 0_{dq,dq}) = C_2(A, 0_{dq,dq}). For the underlying multivariate normal distribution in § 2, we take μ = (ω_1 e_1^T, ..., ω_q e_q^T)^T, where e_j is a d-vector with all components zero except for component j, which is 1.
Therefore the saddlepoint equation is K^{(rs)}(θ̂) = δ_{rs}, where δ_{rs} is the Kronecker delta. As Φ̂ = (φ̂_{rs})_{r,s=1}^q is diagonal, so is Φ̂†, and the unique solution satisfies θ̂_{rs} = 0 if r < s and θ̂_{rr} = -1/(2φ̂_{rr}) = -ω_r^2/{(d^2 + 4ω_r^2)^{1/2} - d}. Further calculations using (S4) in the Supplementary Material give the higher derivatives K̂ with respect to the index pairs (r_1, s_1), (r_2, s_2), ..., where always r_i ≤ s_i by convention. If, as is the case here at the saddlepoint, K̂_{ij} = K̂^{ij} = 0 unless i = j, i.e., (K̂_{ij}) and (K̂^{ij}) are diagonal, then the formulae in (9) simplify to the following:

ρ̂_4 = Σ_{i,j} K̂_{iijj} K̂^{ii} K̂^{jj},  ρ̂²_{13} = Σ_{i,j,k} K̂_{iij} K̂_{jkk} K̂^{ii} K̂^{jj} K̂^{kk},    (19)

ρ̂²_{23} = Σ_{i,j,k} (K̂_{ijk})^2 K̂^{ii} K̂^{jj} K̂^{kk}.    (20)

In the case of the Stiefel manifold V_{d,q}, the indices i, j and k are of the form (r, s) with 1 ≤ r ≤ s ≤ q, and m = q(q + 1)/2. Convenient expressions for (19) and (20) in the Fisher matrix case are given in the Supplementary Material; see (S7)-(S12).
3•5. Two identities

In (2) we have the constraints x_i^T x_i = 1 for i = 1, ..., p, from which we conclude that C_1(a, B) satisfies the following identity: for any diagonal matrix Λ = diag(λ_1, ..., λ_p),

C_1(a, B + Λ ⊗ I_d) = exp( Σ_{i=1}^p λ_i ) C_1(a, B),    (21)

where Λ ⊗ I_d is the block matrix which adds λ_i I_d to the ith diagonal block B_{ii}. The corresponding identity for C_2(A, B) is that, for any symmetric matrix Λ = (λ_{ij}),

C_2(A, B + Λ ⊗ I_d) = exp{tr(Λ)} C_2(A, B).    (22)

It is straightforward to check that the approximations (13) and (16) satisfy (21) and (22), respectively. A practical consequence is that we can always arrange for V to be positive definite when calculating (6) and (7).
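The first identity is easy to verify numerically in the simplest case: for p = 1, d = 2 the quadratic form with B = λI_2 is constant on the circle, so the factor exp(λ) comes out of the integral exactly. A sketch by direct quadrature (parameter values illustrative):

```python
import math

def C1_circle(kappa, lam, n=20000):
    # Normalizing constant C_1(a, B) for p = 1, d = 2 with a = (kappa, 0)^T
    # and B = lam * I_2, computed by quadrature over the circle: on S^1 the
    # quadratic term y^T (lam I_2) y equals lam identically.
    h = 2.0 * math.pi / n
    return h * sum(math.exp(kappa * math.cos(t * h) + lam) for t in range(n))

kappa, lam = 2.0, 0.7
ratio = C1_circle(kappa, lam) / C1_circle(kappa, 0.0)   # should equal exp(lam)
```

Because the quadrature is over a smooth periodic integrand, the trapezoidal rule here is extremely accurate, and the ratio matches exp(λ) to machine precision, illustrating (21).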
Finally, we offer some brief comments on numerical implementation. One convenient and numerically robust way to solve the saddlepoint equation is to minimize K(θ) - Σ_i θ_{ii} over feasible θ; the saddlepoint θ̂ is the unique minimum of this convex function. An important point in practice is to ensure that we restrict attention to θ such that the cumulant generating function K(θ) is finite. A necessary and sufficient condition for θ to be feasible is that C(θ) in (10) or (15) is positive definite.

THEORETICAL RESULTS
Some results on the relative errors of the saddlepoint approximations of C_1(a, B) and C_2(A, B) are now presented; all proofs are given in the Supplementary Material. Propositions 2 and 3 describe relative accuracy in the neighbourhood of the relevant uniform distribution, while Propositions 4 and 5 describe limiting behaviour for, respectively, highly concentrated Fisher matrix distributions and highly concentrated multicomponent Kent distributions, defined below.
Proposition 4 concerns, for any d × q matrix A, the limiting behaviour of the saddlepoint approximations in (16) for highly concentrated Fisher matrix distributions.

For our final result, in (2) define

a_i = κ_i μ_i,  B_{ii} = Σ_{j=1}^{d-1} β_{ij} μ_{ij} μ_{ij}^T,  B_{ii'} = Σ_{j,j'=1}^{d-1} a_{ij,i'j'} μ_{ij} μ_{i'j'}^T  (i ≠ i'),    (24)

where, without loss of generality, we assume κ_i ≥ 0, Σ_{j=1}^{d-1} β_{ij} = 0 and a_{ij,i'j'} = a_{i'j',ij} for all i, i' = 1, ..., p and j, j' = 1, ..., d - 1; and, for each i, the unit d-vectors μ_i, μ_{i1}, ..., μ_{i,d-1} form an orthonormal basis of R^d, so that each block B_{ii'} is a d × d matrix.

PROPOSITION 5. Consider the distribution (2) with vector a and block matrix B = (B_{ii'}) defined by (24), with κ_i, β_{ij} and a_{ij,i'j'} replaced by ψκ_i, ψβ_{ij} and ψa_{ij,i'j'}. Suppose that the matrix Δ = (Δ_{ii'}) defined in (25) is positive definite. Then the relative errors of the saddlepoint approximations of C_1(a, B) tend to zero as ψ → ∞.

The distribution defined via (2) and (24) may be thought of as a multicomponent generalization of the Kent (1982) distribution when p > 1. When p = 1 and d = 3 we obtain FB_5, the standard Kent distribution, and with p = 1 and general d ≥ 3 we obtain the Kent distribution on S^{d-1}. This distribution has p(d - 1){3 + p(d - 1)}/2 free parameters: p parameters from the κ_i, p(d - 2) from the β_{ij}, p(p - 1)(d - 1)^2/2 from the a_{ij,i'j'} and pd(d - 1)/2 from the μ_i and μ_{ij}. This is the same number of free parameters as in the p(d - 1)-dimensional multivariate normal distribution. The unit d-vectors μ_i and μ_{ij} determine the orientation of the distribution but do not appear in the normalizing constant C_1(a, B). Also, from symmetry considerations, it follows that there is always a stationary point at x = (μ_1^T, ..., μ_p^T)^T. Some further calculations show that a necessary and sufficient condition for this stationary point to be a local maximum, and therefore the global maximum of the density, is that Δ in (25) is positive definite. Moreover, if we define ξ_{ij} = x_i^T μ_{ij} for i = 1, ..., p and j = 1, ..., d - 1, and replace κ_i, β_{ij} and a_{ij,i'j'} by ψκ_i, ψβ_{ij} and ψa_{ij,i'j'} respectively, then, under the assumption that Δ in (25) is positive definite, it may be shown with some further calculations that, as ψ → ∞, the scaled coordinates ψ^{1/2} ξ_{ij} converge to a multivariate normal limit (26) in distribution. Proposition 5 above generalizes Kume & Wood (2005, Proposition 3) in two senses: when p = 1, from the von Mises-Fisher distribution on S^{d-1} to the unimodal Kent distribution on S^{d-1}; and to p > 1.
Limiting asymptotic normality of the type shown in (26) is not sufficient for the limiting relative error to be zero, as is the case in Proposition 5 above. A counterexample is Proposition 4 of Kume & Wood (2005), in which a central limit theorem analogous to (26) holds but the limiting relative error is nonzero. In saddlepoint terminology, the reason that the limiting relative error is zero in the present setting is that the relevant tilted distribution, standardized to have mean zero and covariance matrix the identity, converges as ψ → ∞ to a multivariate Gaussian distribution, for which the saddlepoint approximation is exact.

A MODEL FOR RIGID BODY MOTION DATA

Oualkacha & Rivest (2012) develop statistical methods for the analysis of rigid body motion data, focusing on the question of how to define an average rigid body motion. A rigid body motion in R^d relative to a fixed coordinate system is defined by a pair (X, y), where X is a d × d rotation matrix that represents the change in orientation of the object, while y ∈ R^d is a vector that represents the translation of the object. An issue not considered by Oualkacha & Rivest (2012) is the development of parametric models that incorporate dependence between X and y. One possibility, which so far as we are aware has not been considered before, is to consider an exponential family model of the form

f(X, y) ∝ exp{ tr(Ã^T X) + x^T B̃ x - ½(y - μ̃)^T Ṽ^{-1}(y - μ̃) + y^T C x },    (27)

where x = vec(X). Dependence between X and y is governed by the cross-product term y^T C x.
Absorbing this cross-product term into the quadratic form in y by completing the square, and recalling that vec_d^{-1}, the inverse of the vec operator, maps a d^2 × 1 vector argument onto a d × d matrix, we obtain the representation (28). From (28), y | X ∼ N_d(μ + Cx, V) and the marginal density function of X is given by f_2(X; A, B) in (3). Moreover, from (27), X | y ∼ f_2{X; Ã + vec_d^{-1}(C^T y), B} and the marginal density function of y is proportional to C_2{Ã + vec_d^{-1}(C^T y), B} multiplied by a Gaussian factor in y, where C_2 is the normalizing constant given in (3).
A convenient estimation approach is to estimate A and B using the marginal likelihood based on f_2 in (3), with C_2(A, B) approximated using one of the saddlepoint approximations described above. In the data analysis of this section, the sample analogues of the σ²_{ij} are given by σ̂²_{12} = 0·0008, σ̂²_{13} = 0·0352, σ̂²_{23} = 0·0708, which corresponds to a negative ω̂_3. This explains why the Fisher matrix model does not produce a good fit here, and it is interesting that the introduction of the one-parameter quadratic term in M_6 produces a much improved fit. More generally, further work is needed to identify useful subfamilies of (3) with nonzero B.

Table 1. Numerical results showing the performance of the normalizing constant approximations for (i) a density with respect to Lebesgue measure on S_1^2, given by (2) with a = (ω_1, 0, ω_2, 0)^T and B having all elements zero except b_{24} = ω_3; (ii) a density on S_2^2, given by (2) with a = (0, 0, ω_1, 0, 0, ω_2)^T; (iii) the density on V_{3,3} defined by (18). To investigate performance at various concentrations, for each of (i)-(iii) we considered three values of the parameter Ω = (ω_i). The parameter values used were Ω_1 = (10, 5, 1), Ω_2 = (15, 5, 4, 3, 2, 1), Ω_3 = (30, 10, 6).

NUMERICAL ACCURACY
We present results on the numerical accuracy of the normalizing constant approximations for two particular Fisher-Bingham distributions, on (i) S_1^2 and (ii) S_2^2, and for (iii) the Fisher matrix distribution on V_{3,3}. Table 1 shows results in terms of Y_k = log(C) - log(Ĉ_k), where C is the true normalizing constant and Ĉ_k the corresponding saddlepoint approximation; a Wilks statistic, W_k, defined below; and the sample size, n_k, necessary for a likelihood ratio test to detect a difference between the true maximum likelihood estimate and an estimate that comes from maximizing an approximate likelihood based on Ĉ_k.
The loglikelihood corresponding to (2) or (3) is ℓ(Θ) = -log C(Θ) + Θ^T t, where Θ = {a^T, vec(B)^T}^T and t = {E(x)^T, E vec(x x^T)^T}^T in the case of (2), and Θ = {vec(A)^T, vec(B)^T}^T and t = [E vec(X)^T, E vec{vec(X) vec(X)^T}^T]^T in the case of (3), where E denotes expectation with respect to the true density. An approximate loglikelihood, based on Ĉ_k in place of C, is

ℓ̂_k(Θ) = -log Ĉ_k(Θ) + Θ^T t.    (29)

Denote Θ̂_MLE = argmax ℓ(Θ) and Θ̂_{MLE,k} = argmax ℓ̂_k(Θ), and define W_k = 2|ℓ(Θ̂_MLE) - ℓ(Θ̂_{MLE,k})|. Then the sample size necessary for a difference between Θ̂_MLE and Θ̂_{MLE,k} to be detected at the 5% level by a large-sample likelihood ratio test is n_k = χ²_p(0·95)/W_k, where χ²_p(0·95) is the 95% quantile of a chi-squared distribution with degrees of freedom equal to p, the number of parameters in Θ.
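The sample-size calculation is elementary; a minimal sketch, where the quantiles are standard table values and the value of W_k is hypothetical:

```python
# 95% chi-squared quantiles for a few degrees of freedom (standard table values)
CHI2_95 = {1: 3.841, 6: 12.592, 9: 16.919}

def detection_sample_size(W_k, p):
    # n_k = chi2_p(0.95) / W_k: sample size at which a large-sample likelihood
    # ratio test would detect the bias induced by the approximate likelihood
    return CHI2_95[p] / W_k

n = detection_sample_size(0.001, 9)   # hypothetical W_k = 0.001 with 9 parameters
```

A small W_k thus translates directly into a large required sample size, which is how Table 1 should be read: the larger n_k is, the more benign the approximation error.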
In the absence of a convenient method for calculating C(Θ) and t exactly, we estimated them using 10^7 Monte Carlo samples. We substituted the estimate of t into (29) and used an optimization routine to calculate a Monte Carlo estimate, Θ̂*_{MLE,k}, of Θ̂_{MLE,k}, then used a further 10^7 Monte Carlo samples to estimate C(Θ̂*_{MLE,k}), ℓ(Θ̂*_{MLE,k}) and hence W_k. To generate the data for each row of Table 1, we repeated the foregoing procedure 100 times, each run yielding a Monte Carlo estimate of Y_k and W_k, over which we averaged to obtain the Ỹ_k and W̃_k shown in Table 1. The relative standard errors, not shown, are negligible for Ỹ_k and in all cases less than ten percent for W̃_k. We estimated the critical sample size for the large-sample test by ñ_k = χ²_p(0·95)/W̃_k. The results for the first-order approximation, k = 1, indicate good accuracy, but the second-order approximation, k = 3, is typically a major improvement, in some cases having values of Ỹ_k and W̃_k an order of magnitude smaller than for k = 1. Further results given in the Supplementary Material show there is little to choose between k = 2 and k = 3, although the latter is slightly superior. The results show that in most cases considered, which span a range of concentrations, a large sample size would be necessary to detect a significant difference, at the 5% level, between the exact and approximate maximum likelihood estimates. The dimensions of the underlying saddlepoint approximation for distributions (i)-(iii) are 2, 2 and 6, respectively. Even in the more challenging higher-dimensional setting, performance is strong. However, the computational cost of the second-order approximations grows rapidly with p; typical timings to compute a second-order approximation of C_1 for the product of p = 2, 4, 6, 8, 10 circles are 0·2, 1, 5, 21 and 70 seconds.