A Sampling Strategy for High-Dimensional Spaces Applied to Free-Form Gravitational Lensing

We present a novel proposal strategy for the Metropolis-Hastings algorithm designed to efficiently sample general convex polytopes in 100 or more dimensions. This improves upon previous sampling strategies used for free-form reconstruction of gravitational lenses, but is general enough to be applied to other fields. We have written a parallel implementation within the lens modeling framework GLASS. Testing shows that we are able to produce uniform uncorrelated random samples which are necessary for exploring the degeneracies inherent in lens reconstruction.


INTRODUCTION
Some inversion problems in astrophysics make it desirable to search or sample a high-dimensional solution domain S ⊂ R^n, where S is bounded by linear constraints of the form Ax ≤ b (equation 1), with A ∈ R^{m×n} and b a constant vector. A classic application is Schwarzschild's construction of triaxial stellar systems in equilibrium (Schwarzschild 1979). Given a three-dimensional discretized target density function ρ_j, the number of stars c_i on a given orbit i is found by solving ρ_j = Σ_i c_i σ_{i,j} with c_i ≥ 0, where σ_{i,j} is the orbit density. The orbit density is calculated a priori using test particles in a fixed potential corresponding to ρ. However, searching the model space was not feasible at the time and only some particular models were considered. More recent work has further developed this technique (Schwarzschild 1982; Merritt 1999; Cappellari et al. 2006, and references therein).
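Schwarzschild's superposition step — finding non-negative orbit weights c_i that reproduce a target density ρ_j — is, in modern terms, a non-negative least-squares problem. A minimal sketch, in which the orbit density matrix σ is random toy data rather than a real orbit library:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

n_orbits, n_cells = 40, 15
# sigma[i, j]: density that one star on orbit i contributes to grid cell j (toy data)
sigma = rng.random((n_orbits, n_cells))

# Build a target density from known non-negative weights, then try to recover them.
c_true = rng.random(n_orbits)
rho = sigma.T @ c_true            # rho_j = sum_i c_i sigma_{i,j}

c, rnorm = nnls(sigma.T, rho)     # least-squares solution constrained to c_i >= 0

assert np.all(c >= 0) and rnorm < 1e-8
```

Because the system is underconstrained here (40 orbits, 15 cells), `nnls` returns only one of many admissible weight vectors — precisely the degeneracy that motivates sampling the full solution space rather than finding a single solution.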
In this paper, we consider applications to gravitational lensing. Lensing has a long history, beginning with the first direct evidence of general relativity, but until the 1979 discovery of the extra-solar lens Q0957+561 (Young et al. 1981) the field was largely of theoretical interest (Refsdal 1964a,b). Today more than 100 strong lensing objects are known, with many studied in great detail (e.g. Kochanek et al. 1999; Faure et al. 2008; Auger et al. 2009). Future surveys promise to deliver thousands more.
Crucially, the equations governing gravitational lensing are linear in the projected mass density κ. As detailed in Section 2, one can discretize κ on to a grid of pixels and solve for physically motivated solutions by imposing constraints in the form of equation (1). Several versions of this idea have been developed by Saha & Williams (2004), Coe et al. (2008) and Koopmans (2005).
This free-form approach is more flexible than simple analytic models, which assume a functional form of the mass profile and may unintentionally break degeneracies. However, this creates a large system of linear equations that is highly underconstrained. To understand the range of degeneracies we therefore require a technique that can explore the space of solutions S. One possible technique is to choose a random point x and accept it if x lies in S. This might be a reasonable method in low dimensions n, as is done in Monte Carlo integration, but the probability of acceptance rapidly approaches zero as n increases. Each of the pixels in the discretization of κ represents one dimension, and typically n is greater than 100. Complex systems, where multiple lenses are used, can easily have more than 1000 dimensions. The priors can also be arbitrary, so the solution space will have a very complex shape, although by construction it will always be convex.
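The collapse of naive accept/reject sampling with dimension can be made concrete. As an illustrative stand-in for S, take the unit n-ball inscribed in the cube [−1, 1]^n; the acceptance probability is then the exact volume ratio:

```python
import math

def accept_prob(n):
    """Probability that a uniform random point in the cube [-1, 1]^n
    lands inside the inscribed unit n-ball (exact volume ratio)."""
    return math.pi ** (n / 2) / (math.gamma(n / 2 + 1) * 2 ** n)

for n in (2, 10, 50, 100):
    print(n, accept_prob(n))   # ~0.785, ~0.0025, then vanishingly small
```

At n = 100 the acceptance probability is below 10^−60, so rejection sampling never produces a single point; the shape chosen here is arbitrary, but the same collapse occurs for any bounded convex region.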
General sampling of probability distributions has been a topic of statistics research for many years (e.g. Chib & Greenberg 1995; Robert & Casella 2005). In the case of lensing, the PIXELENS algorithm (Saha & Williams 2004) is frequently used. We show, however, that the sampling of this algorithm is not uncorrelated. We address these details and related issues in Section 3 and suggest an alternative based on the Metropolis-Hastings algorithm in Section 4. In Section 5, we discuss the implementation and demonstrate in Section 6 that even for high dimensions we are able to sample our solution space to achieve a uniform uncorrelated random sample. We also achieve significant speed improvements over PIXELENS. In Section 7, we discuss future work and applications.

FRAMEWORK
There are two primary equations in gravitational lensing (Blandford & Narayan 1986; Schneider, Ehlers & Falco 1992; Schneider 2006). The lens equation (3), β = θ − ∇ψ(θ), maps an observed position θ to an unobservable source position β through the potential ψ, which satisfies ∇²ψ = 2κ, where κ is the dimensionless projected mass density of the lensing object. The Fermat potential (5), τ(θ; β) = (θ − β)²/2 − ψ(θ), measures, up to an affine transformation, the time a photon takes to travel from the source to the observer. Equation (3) corresponds to the stationary points of equation (5). If the source varies in brightness and the arrival times are different for different images, then one can measure the physical time delay t_21 between the light curves of θ_1 and θ_2. As in Saha & Williams (2004), we discretize κ into grid cells, or pixels, centred on the lensing object and construct a system of linear equations Cx = d (equation 6) from τ ∝ t and equation (3), where C ∈ R^{p×k}, d is a constant vector and x is a vector containing the free parameters κ and β. These equations only serve to reduce the dimension of the problem k by the number of equalities p, since in general p ≤ k, where p is equal to twice the number of observed images plus the number of measured time delays. This reduction is performed with the orthogonal projection x ↦ C⁺d + (1 − C⁺C)x (equation 7), which takes a point x to the solution set of equation (6). The matrix C⁺ is the Moore-Penrose pseudoinverse of C. A basis of this affine space is given by those eigenvectors of (1 − C⁺C) with eigenvalue equal to 1. We can therefore, without loss of generality, take this reduced space of dimension n = k − p to be our problem domain. Since the space is unbounded, we must impose constraints in the form of equation (1) to limit ourselves to reasonable and physical solutions. These constraints may derive from data, such as arrival time order or image parity, or from Bayesian priors.
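The projection of equation (7) is simple to verify numerically. A minimal sketch with NumPy, using a random matrix C in place of a real lensing constraint matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

k, p = 10, 4                       # free parameters and equality constraints, p <= k
C = rng.standard_normal((p, k))    # stand-in for the lensing constraint matrix
d = rng.standard_normal(p)

C_pinv = np.linalg.pinv(C)         # Moore-Penrose pseudoinverse C+

def project(x):
    """Orthogonal projection of x onto the affine solution set {x : C x = d}."""
    return C_pinv @ d + (np.eye(k) - C_pinv @ C) @ x

x = rng.standard_normal(k)
y = project(x)

assert np.allclose(C @ y, d)       # y satisfies the equality constraints
assert np.allclose(project(y), y)  # projecting twice changes nothing
```

The eigenvectors of (1 − C⁺C) with unit eigenvalue span the reduced space; their number is k − p = 6 here, matching the dimension count n = k − p in the text.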
We consider only modest priors: the mass must be positive everywhere, variations in κ must be smooth and the local density gradient must point within 45° of the centre. A complete discussion can be found in Coles (2008). Our choice of constraints yields a non-empty compact solution space S, which is a convex polytope, or simplex. By our definition of the solution space, the Bayesian posterior distribution is P(x) ∝ 1 for x ∈ S and P(x) = 0 otherwise, since all x ∈ S are equally probable. We are interested in an uncorrelated uniform random sample drawn from S, which we will simply refer to as a random sample.

REVISITING THE PIXELENS ALGORITHM
Earlier work used the program PIXELENS (Saha & Williams 2004) to model gravitational lenses and estimate the Hubble time H −1 0 . The sampling strategy employed in PIXELENS is a type of random walk explained in detail in Saha & Williams (2004) and Coles (2008). Here we summarize the algorithm and discuss some problems.
To build a set of sample points X = {x 1 , x 2 , . . .}, one begins by selecting a set of vertices V = {v 0 , v 1 , . . .} of S. The first sample point x 1 is chosen uniformly from the chord connecting v 0 , v 1 . Each new point x i with i > 1 is chosen randomly and uniformly from the chord from v i through x i−1 to the boundary of S. In the limit of infinite samples this algorithm will explore the entire simplex.
To construct V, PIXELENS uses the simplex algorithm (Dantzig 1963; Press et al. 2007). This algorithm was designed to maximize (or equivalently minimize) a linear objective function f(x) subject to linear constraints as in equation (1) by moving from vertex to vertex of the simplex in a direction that always increases f. It is a standard algorithm in the field of linear programming, where the vertex that maximizes f is the desired result. Finding a particular vertex is not the goal of PIXELENS, and so each vertex v_i is found by maximizing a random objective function f(x) = r · x, where r is a randomly chosen vector. One issue is that randomly choosing an objective function does not randomly choose a vertex. If the simplex is not a regular polytope, some vertices will be chosen with higher probability than others. This is demonstrated in Fig. 1 with a simplex in seven dimensions with 32 vertices. The vertices have been enumerated and sorted by the number of times they were chosen. Clearly some vertices are highly preferred. Even in high dimensions, where the choice of a particular vertex is unlikely to occur again, particularly acute vertices will be chosen more often. By not choosing vertices at random, the algorithm prefers some regions over others, which leads to correlations in the final sample, and not all directions will be adequately explored. Vertex selection is also not invariant under a general coordinate transformation x → Tx for an invertible matrix T ∈ R^{n×n}. Even a change in units may affect the sample distribution and therefore the inferred physical parameters.
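This vertex bias is easy to reproduce. In a toy 2-D polytope we can bypass the simplex algorithm: a linear objective attains its maximum at a vertex, so evaluating f(x) = r · x directly at the vertices yields the same vertex the LP would return. The thin triangle here is an arbitrary, deliberately irregular choice:

```python
import numpy as np

rng = np.random.default_rng(2)

# Vertices of a deliberately irregular polytope: a thin right triangle.
verts = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 1.0]])

counts = np.zeros(len(verts), dtype=int)
for _ in range(3000):
    r = rng.standard_normal(2)            # random objective f(x) = r . x
    counts[np.argmax(verts @ r)] += 1     # vertex maximizing f = the LP optimum

print(counts)   # far from uniform: the acute vertex (4, 0) is strongly preferred
```

Each vertex is selected with probability proportional to the angular size of its normal cone, not 1/3, so the sharpest vertex dominates the counts.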
Correctly choosing vertices at random is itself a difficult problem. There exist several methods to enumerate the vertices (e.g. Dyer 1983; Avis & Fukuda 1992), but unfortunately the number of vertices has a huge combinatorial upper bound, of order m^⌊n/2⌋, where m is the number of inequalities (McMullen & Shephard 1971).

Figure 2. A comparison of sample sets X_P and X_R for the PIXELENS sampling algorithm (red) and random sampling (black), respectively. Each set has 100 000 items. The means of the plotted distributions (vertical lines) for both the hypercube and n-ball samples are nearly identical, whereas the deviations are not. Top-left: we take the sample space S to be a unit cube centred at the origin in 100 dimensions and show the PDF of r = |x| for all x ∈ X. Top-right: the PDF for the single coordinate x_1 of each sample point. Marginalizing over the other coordinates we expect the probability to be uniformly 1. Points from X_P clump along the chords connecting vertices. Bottom-left: similarly, we plot the PDF where S is an n-ball. Bottom-right: marginalizing out a single coordinate we see that a random sample has a broader distribution.
To avoid the enumeration we modified the simplex algorithm to randomly walk between the vertices. This dispenses with the objective function and simply selects a neighbouring vertex to which to move. While this does improve the sampling of the vertices (see Fig. 1), it is not without its own problems. If a vertex has many close neighbours, as is likely in high dimensions, the random walk will tend to stay in one region before moving large distances. To compensate, one must run for a long time. The process of moving to a new vertex is computationally costly, however, and incurs numerical error that quickly dominates after too many iterations.
Another issue is that the PIXELENS algorithm is based on the vertices of S. The algorithm produces samples that do not follow the target probability distribution function (PDF) P(x), even if the vertices are chosen randomly. In Fig. 2, we demonstrate this for a 100-dimensional hypercube and n-ball, where the vertices have been randomly selected a priori to avoid the PIXELENS vertex-selection algorithm. In the case of the n-ball, we have chosen a random set of points on the surface to be the vertices. Points chosen from the hypercube tend to lie in the corners, while points from the n-ball are more closely clustered in the centre. In general, the points tend to clump along the chords connecting vertices. In both cases the PIXELENS sampling is markedly different from a random sample, although the means are nearly identical.
The sample should be uniformly distributed in S in order to be able to perform statistical analysis on it. For this reason, we chose to explore an alternative method based on a random walk that does not depend on the vertices of the simplex.

A NEW MCMC PROPOSAL DENSITY
The Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) is a well-known method to sample the PDF P(x) by generating a Markov chain. From the current point x_i, a candidate x is drawn from a proposal density Q(x | x_i). If α ≤ P(x)/P(x_i), where 0 ≤ α ≤ 1 is chosen from a uniform distribution, then x is accepted and x_{i+1} = x; otherwise x_{i+1} = x_i. The proposal density is taken to be symmetric, Q(x | x_i) = Q(x_i | x), so that the chance of moving from x_i to x_{i+1} is the same as moving from x_{i+1} to x_i. One possibility for Q is to simply move by an arbitrary amount in a random direction, but the chain may become trapped in narrow regions, especially in high dimensions. To account for the shape of S, Q is often taken to be a multivariate Gaussian distribution N(x_i, Σ), where Σ is the covariance matrix of the sample X. This matrix is an estimate of the covariance matrix of S and can be progressively calculated as the chain is built. Such adaptive chains are no longer Markovian because reversibility is broken, but one can run an initial adaptive burn-in phase before beginning the Markovian chain with Σ fixed at its last value.

Figure 3. The sorted eigenvalues of the sample covariance matrix of three sets of random points in a 100-dimensional cube with side length √12. The set size is |X|. With infinite samples we expect all eigenvalues to be equal to unity, but due to the high dimensionality, a random sample does not fully estimate each direction, especially for low |X|.
Selecting x from N(x_i, Σ) is equivalent to selecting y from the distribution N(0, 1)¹ and setting x = x_i + EΛ^{1/2}y, where E is the matrix of the eigenvectors e_j of Σ and Λ is the diagonal matrix of the corresponding eigenvalues λ_j. In other words, we move along a randomly selected direction, accounting for the shape of S through the eigenvalues.
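A minimal sketch of this shaped-proposal Metropolis step for a uniform target: since P(x) is constant on S and zero outside, the acceptance test reduces to checking that the proposal lies inside S. The rectangle and the pilot points are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# S = {x : A x <= b}, here the rectangle [-2, 2] x [-0.5, 0.5].
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([2.0, 2.0, 0.5, 0.5])
inside = lambda x: np.all(A @ x <= b)

# Shape the proposal with the covariance of a small pilot sample of S.
pilot = np.array([[-1.5, 0.1], [1.0, -0.3], [0.5, 0.4], [-0.2, -0.1]])
lam, E = np.linalg.eigh(np.cov(pilot.T))   # eigenvalues/eigenvectors of Sigma

x = np.zeros(2)
chain = []
for _ in range(20000):
    y = rng.standard_normal(2)
    prop = x + E @ (np.sqrt(lam) * y)      # x' = x_i + E Lambda^{1/2} y
    if inside(prop):                       # uniform P: accept iff x' is in S
        x = prop
    chain.append(x)                        # rejected steps repeat the old point

print(np.mean(chain, axis=0))              # close to (0, 0), the centre of S
```

Note that rejected proposals must repeat the previous point in the chain; dropping them instead would oversample the boundary regions.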
For a reasonable estimate of Σ, particularly in high-dimensional spaces, it is important to have many more than n points. When only ∼n points are known, some of the eigenvalues of Σ are strongly underestimated simply due to poor sampling of the space. Even if true random samples were to be drawn directly from S, the shape would be incorrectly estimated. In Fig. 3, all the eigenvalues of a 100-dimensional cube should be equal, but for small sample sizes |X| this is clearly not the case. Poorly estimated eigenvalues cause the standard proposal density to undersample S in the direction of the corresponding eigenvectors, even as new points are added to the chain; the new points reinforce the bias that was present in the original Σ.
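The effect shown in Fig. 3 can be reproduced in a few lines: uniform points in a 100-dimensional cube of side √12 have unit variance along every axis, yet the sample covariance eigenvalues scatter widely when |X| is comparable to n. The sample sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
half = np.sqrt(12) / 2     # cube side sqrt(12) gives unit variance per coordinate

for size in (101, 1000, 20000):
    X = rng.uniform(-half, half, size=(size, n))
    eig = np.linalg.eigvalsh(np.cov(X.T))
    print(size, eig.min(), eig.max())   # all ~1 only once size >> n
```

With size = 101 the smallest eigenvalues are close to zero even though the points are genuinely random: the deficit is a sampling artefact, not a property of the cube.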
The key improvement of this paper is to use the constraint information of equation (1) to hint at the shape of S and achieve a reasonable proposal density despite having a small sample size. We do this in the following way. Let X = {x_1, x_2, ..., x_k} with k ≥ n + 1 be a set of points in S ⊂ R^n. These points may be chosen in any fashion, but the sample covariance matrix Σ must be invertible, i.e. the x_i must not all lie on the same hyperplane of R^n; the set of eigenvectors of Σ is then an orthonormal basis of R^n. Since S is convex, the sample mean x̄ will also be in S. Extending the eigenvector e_j from x̄ until it intersects the boundary of S gives two points: one in the positive and one in the negative direction. The distance d_j between each pair of boundary points is taken as an estimate of the size of S along e_j.
Our modification takes Q to be the multivariate Gaussian distribution N(x_i, Σ̃), where Σ̃ = EDE^T and D is the diagonal matrix of the new variances σ_j² = d_j²/12, the variance of a uniform distribution over an interval of length d_j. We therefore select x = x_i + ED^{1/2}y. The ellipsoidal shape of Q is thus adjusted by substituting Λ with D to better approximate the shape of S. In Fig. 4, we show the modification schematically. The initial set of points inadequately samples the horizontal direction, and therefore the eigenvalues λ_1, λ_2 of the covariance matrix are small in that direction. The new distances d_1, d_2 are a much better approximation and are taken in place of the eigenvalues. This strategy encourages movement along directions that have been poorly sampled and therefore have a very small variance. This is an important distinction from so-called Hit-and-Run algorithms (Belisle, Romeijn & Smith 1993), which step in a random direction and may require many iterations to move through narrow spaces.

1 The probability density function for y is f(y) = (2π)^{−n/2} e^{−|y|²/2}.
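Computing the d_j requires only a ray-polytope intersection, which is cheap given the constraints of equation (1). A sketch under our own naming (the helper function and cube test case are illustrative):

```python
import numpy as np

def widths_along(A, b, centre, directions):
    """For S = {x : A x <= b}, return the distance d_j between the two points
    where the line through `centre` along each column of `directions` meets
    the boundary of S.  `centre` must lie strictly inside S."""
    slack = b - A @ centre                  # positive for an interior point
    d = []
    for e in directions.T:
        Ae = A @ e
        t_pos = np.min(slack[Ae > 0] / Ae[Ae > 0])     # first wall hit along +e
        t_neg = np.min(slack[Ae < 0] / -Ae[Ae < 0])    # first wall hit along -e
        d.append(t_pos + t_neg)
    return np.array(d)

# Unit cube centred at the origin, probed along the coordinate axes.
A = np.vstack([np.eye(3), -np.eye(3)])
b = 0.5 * np.ones(6)
print(widths_along(A, b, np.zeros(3), np.eye(3)))   # [1. 1. 1.]
```

The proposal variances are then d_j²/12 in place of the eigenvalues λ_j.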
Metropolis algorithms in high dimensions can often be inefficient. If the average step length is too big, almost all proposals will fall into low-probability regions and be rejected, whereas if the step length is too small, almost all proposals will be accepted but sampling the space will be very slow. The optimum lies somewhere in between and can be reached if we regulate the step length by multiplying the σ_j with a scaling constant c such that the acceptance rate is roughly 25 per cent (Gelman, Roberts & Gilks 1996). In our case P(x) is not a multivariate Gaussian distribution and its shape varies from case to case, so there is no single c. However, we found that for c ∼ 1/n the acceptance rate remains reasonable.
Assuming that c = 4/n, the random walk will typically translate an average distance of t_1 = 4(Σ_j σ_j²/n²)^{1/2} after one step and t_s = t_1 √s/2 after s steps, assuming an acceptance rate of 25 per cent. In order to produce two uncorrelated points in S, we must make s = N_s steps such that the average travelled distance is of the order of the simplex diameter D. In the case of the hyperrectangle, or similarly shaped simplices, D ≈ (12 Σ_j σ_j²)^{1/2} (equation 11), and therefore N_s = O(n²) steps are needed. In other cases, such as a regular n-simplex, where the approximation for D in equation (11) does not hold, N_s > O(n²) steps may be required. This is not a typical scenario in lensing, however, as we demonstrate in Section 6. The strength of this algorithm is that it is not sensitive to the dispersion of the starting set of points X but only to its mean x̄. When x̄ is not a good estimate of the centre of S, the algorithm will need some time to remove the starting bias.

IMPLEMENTATION
We have implemented our modified sample strategy in a new gravitational lens modelling framework called GLASS. This framework is specifically designed for free-form lens modelling and to allow for easy modification of modelling strategies and priors. Furthermore, we are able to immediately test the implementation by comparing with lensing theory and published results.
As discussed in the previous section, the proposal density Q depends on an estimate of the size of the simplex S. We estimate the size by measuring the distances d_j from the current sample chain mean x̄ to the boundary of S along the estimated eigenvectors e_j. These diameters are best estimated if x̄ is close to the mean of S and the vectors e_j are aligned with the true eigenvectors of S.
As is often done (Press et al. 2007), the Markov chain is restricted to move coordinate-wise along the eigenvectors e_j, rather than in a random direction. We rotate S such that these eigenvectors coincide with the standard basis. This provides a significant performance improvement since only one coordinate needs to be updated. In addition, the constraints of equation (1), typically numbering a few times n in lensing, must only be checked in one coordinate. With N_s = O(n²) steps per sample and O(n) work per step, the time to produce one sample is O(n³). While PIXELENS also has a theoretical running time of O(n³), our implementation has a smaller scaling constant, resulting in significant performance gains. Our implementation begins by finding the point x_0 where the temporary variable t is maximized subject to Ax_0 + t ≤ b. For this we use the simplex algorithm, but any linear programming algorithm will suffice. The point x_0 is inside S and in some sense 'far' from the boundaries.

Figure 4. The eigenvalues λ_1, λ_2 of the covariance matrix of an initial set of points do not accurately capture the shape of the simplex S. This is a problem only in high dimensions; we show a two-dimensional drawing for simplicity. Since the boundaries of S are known, we extend the eigenvectors until we reach the boundary and substitute the eigenvalues with d_1²/12, d_2²/12.
Initially, the chain walks along the eigenvectors of the matrix 1 − C⁺C. The chain is run for a burn-in phase during which we collect N_b samples; we typically let N_b = 10n. After the first 2n samples, and subsequently after every N_b/10 samples, we update Q by recalculating Σ and then continue walking along the new eigenvectors. The scaling constant c is adjusted to hold the acceptance rate around 25 per cent.
After the burn-in phase, we fix Σ and c at their final values and run a new chain for as many samples as are desired. In both phases we can run several Markov chains in parallel as long as we ensure that all threads use the same Σ. We have tested this on a shared memory machine using up to 48 CPUs.
As we move to higher dimensional spaces we expect that accumulation of round-off error will tend to produce departures from the equality constraints in equation (6). To remove this numerical error we ensure that a sample point x lies on the simplex by using equation (7) to project it back on to the simplex.

SAMPLING EVALUATION
The stationary distribution of the sample set X from any general MCMC strategy is the target probability distribution P(x). This is only reached, though, for a sample size much greater than n. In practice this is not feasible in high dimensions, and we must limit our sample size to ∼n. As we demonstrated in Fig. 3, the eigenvectors of a small random sample cannot fully describe the solution space, but for lensing statistics this is sufficient since we typically marginalize over many parameters. As we want to obtain a uniform uncorrelated random sample, we compare our MCMC implementation to a random sample from a hyperrectangle and a regular n-simplex in 100 dimensions, while varying N_s to measure convergence.

The hyperrectangle H_n is straightforward to sample directly, and the regular n-simplex S_n is only slightly more involved. We expect the simplex of a real lens system to be similar to H_n, since it has 2n inequalities and 2^n vertices. In addition, we also test S_n as an extreme example. We repeated the tests of PIXELENS from Section 3 using the hypercube and S_n with N_s = n² and n^{2.5}, respectively, and show the results in Fig. 5. We are able to match the expected distributions of a random sample perfectly. We test the global properties of our samples X_H, X_S against the random samples R_H, R_S by comparing the eigenvalues of the respective sample covariance matrices. Each sample set contains 1000 points. In Fig. 6, we show the sorted eigenvalues for these samples. As expected from equation (11), the hyperrectangle converges at N_s ∼ n². The n-simplex requires a larger N_s, as mentioned in Section 4, because the estimate of the diameter from equation (11) is no longer valid. In the right-hand panels, we have taken several hundred random sample sets and plotted the 1σ deviations. The plots are normalized to the mean of these random sets. The volume of a simplex can be well approximated by (det Σ)^{1/2} = Π_j λ_j^{1/2}. Table 1 shows the volumes of all the computed samples. The uncertainties have been calculated from 1000 independent runs. Our sampling strategy is in excellent agreement with random samples.
While the eigenvalues paint a global picture, we also tested the local properties of our sample. In particular, we looked at the distribution of nearest-neighbour distances. In Fig. 7, we show this distribution for H_n and S_n and for increasing N_s. Misalignment with a random sample, or multiple peaks, indicates a correlated sample, such as clumped points. Chains with too small an N_s are unable to traverse the simplex, resulting in sample points which are too close to each other. By the time the eigenvalues converge, the local distribution also converges.

Table 1. Estimated volumes for the tested simplices H_n, S_n and the gravitational lens for various values of N_s, demonstrating convergence. The volumes are calculated from the eigenvalues shown in Figs 6 and 9. The column labelled 'Random' is calculated via direct random sampling, which is not possible for the lens. Where the value is in bold face, the volume is underestimated compared with the random sample.

Finally, we tested our implementation with a simulated triaxial lens mass. The four image positions and respective time delays were calculated using a root-finding algorithm built into GLASS. We supplied the value of H_0, all the time delays and all the image positions to the algorithm without error bars to test the effectiveness of the method. The problem, however, still remains heavily underconstrained. In the near future, we will explore in detail the effects of relaxing these assumptions in a variety of different lens systems.
For the reconstruction, we used a grid of 225 pixels, but we assumed radial symmetry to reduce the number of independent pixels to 113. Together with the unknown source position, the problem lies in a 104-dimensional affine space. The reconstructed average arrival time surface and images are shown in the left-hand panel of Fig. 8. The right-hand panel compares the inferred surface density profile with uncertainties (black error bars) to the original lens profile. The constraint information on the mass profile is contained within the image positions (vertical lines) and therefore the pixels outside are not expected to be well fit. In general, though, this reconstruction is excellent where the information content is highest. As discussed in Section 1, by sampling the solution space we are able to explore the degeneracies which simple models cannot. For comparison, we also show the results for the same lens obtained using PIXELENS (red boxes). Although the results are similar, the PIXELENS error estimates favour shallower or nearly flat models, again suggesting that the old algorithm oversamples some regions of the parameter space, as shown in the upper-right panel of Fig. 2.
We also performed the same eigenvalue and nearest-neighbour analysis as before, but because we are unable to directly sample the solution space we can only increase N_s until we converge. As expected, and as shown in Fig. 9, we converge when N_s ∼ n².

Figure 9. A convergence test of the sample eigenvalues (left) and the probability of nearest-neighbour distances (right) using the realistic lens model depicted in Fig. 8. The number of steps N_s used in the MCMC walk is increased until the histograms of nearest-neighbour distances converge. An accept/reject random sample is impossible to obtain given the unknown shape of the simplex. The left-hand plot has been normalized to the N_s = n^{2.5} case. For N_s ≳ n² the eigenvalues are stable.
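The nearest-neighbour diagnostic can be sketched with a brute-force distance matrix; here a short-step random walk stands in for a poorly mixed chain, and all data are toy:

```python
import numpy as np

rng = np.random.default_rng(5)

def nn_distances(X):
    """Nearest-neighbour distance of every point in X (brute force)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)      # a point is not its own neighbour
    return D.min(axis=1)

n, size = 10, 300
uniform = rng.uniform(0.0, 1.0, (size, n))
# A poorly mixed chain: tiny steps leave consecutive points clumped together.
clumped = 0.5 + np.cumsum(0.01 * rng.standard_normal((size, n)), axis=0)

print(nn_distances(uniform).mean(), nn_distances(clumped).mean())
# the clumped chain's distribution peaks at far smaller separations
```

A histogram of these distances reproduces the qualitative signature described above: the correlated chain piles up at small separations, while the uniform sample peaks at a much larger characteristic distance.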

OUTLOOK
Our novel proposal strategy for the Metropolis-Hastings algorithm allows sampling of general convex polytopes in 100 or more dimensions. We have implemented an efficient parallel version of the algorithm in the gravitational lens modelling framework GLASS so that we may apply the strategy to large lensing problems exceeding 1000 dimensions. GLASS will be publicly available for download in the near future.
Several future applications are possible. Sources at multiple redshifts carry information about the cosmological distances, which in turn depend on the cosmological parameters. Given the statistical dispersion of the parameter space, one could in principle infer the cosmological parameters in a Bayesian framework. In order to achieve this, a uniform sample of the solution space is needed.
Previous work on estimating the Hubble time has used systems of up to 18 lenses (Paraficz & Hjorth 2010). New lenses can be included to further constrain this value but each additional lens increases the dimensionality of the space by ∼100 making this current work essential for such upcoming projects.