Dimension constraints improve hypothesis testing for large-scale, graph-associated, brain-image data

Summary For large-scale testing with graph-associated data, we present an empirical Bayes mixture technique to score local false-discovery rates (FDRs). Compared to procedures that ignore the graph, the proposed Graph-based Mixture Model (GraphMM) method gains power in settings where non-null cases form connected subgraphs, and it does so by regularizing parameter contrasts between testing units. Simulations show that GraphMM controls the FDR in a variety of settings, though it may lose control with excessive regularization. On magnetic resonance imaging data from a study of brain changes associated with the onset of Alzheimer’s disease, GraphMM produces greater yield than conventional large-scale testing procedures.


Introduction
Local false discovery rate (lfdr_2) in the toy problem

Supplementary Table S1: Mixture structure in the toy problem. Here h denotes a Gaussian joint density of its arguments, with mean zero in all components, marginal variance 1 + σ^2, and exchangeable covariance with all correlations equal to 1/(1 + σ^2); the "# free" column indicates the number of free parameters integrated out to give the last column.

Pair (Unit 1, Unit 2) | mix prob                    | # free | predictive density
block (null, null)    | p_block p_0                 | 1      | h(x_1, x_2, y_1, y_2)
block (alt, alt)      | p_block (1 − p_0)           | 2      | h(x_1, x_2) h(y_1, y_2)
split (null, null)    | (1 − p_block) p_0^2         | 2      | h(x_1, y_1) h(x_2, y_2)
split (null, alt)     | (1 − p_block) p_0 (1 − p_0) | 3      | h(x_1, y_1) h(x_2) h(y_2)
split (alt, null)     | (1 − p_block) (1 − p_0) p_0 | 3      | h(x_1) h(y_1) h(x_2, y_2)
split (alt, alt)      | (1 − p_block) (1 − p_0)^2   | 4      | h(x_1) h(x_2) h(y_1) h(y_2)

On the toy problem described in Section 1 of the main paper, we assume observables X_1 and X_2 from the first condition and Y_1 and Y_2 from the second, with means µ_X1, µ_X2, µ_Y1, and µ_Y2, respectively. Conditional on these means, the observables are independent normals with those means and common variance σ^2 > 0. The four latent means vary over the system according to a discrete mixture that encodes strict equality constraints; the unconstrained common levels then vary independently, each according to the standard normal distribution.
Discrete mixing is governed by two coins. The first has success probability p_block; in that event, called blocking, µ_X1 = µ_X2 and µ_Y1 = µ_Y2. Otherwise, the units are split. A second coin has success probability p_0; it is tossed once if there is blocking and twice if there is not. Success on this second coin indicates µ_X1 = µ_Y1 (the null hypothesis of interest). Supplementary Table S1 shows the six discrete outcomes, their mixing probabilities, and the joint density of the observables conditional on each mixture component but marginal to the latent mean values themselves. From this table, the local FDR is

lfdr_2 = P(µ_X1 = µ_Y1 | x_1, x_2, y_1, y_2)
       ∝ p_block p_0 h(x_1, x_2, y_1, y_2) + (1 − p_block) p_0^2 h(x_1, y_1) h(x_2, y_2) + (1 − p_block) p_0 (1 − p_0) h(x_1, y_1) h(x_2) h(y_2),

with proportionality resolved by summing over the six discrete possibilities. The vertical axis of Fig. 1 (main) is the conditional FDR of the list which, for list size n, we compute as

FDR(n) = (1/n) Σ_{i=1}^{n} lfdr_(i),

where lfdr_(i) is the i-th smallest lfdr.
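As a concrete check of this computation, the following R sketch evaluates lfdr_2 by summing the six weighted predictive densities of Table S1. The function names are ours, sigma2, p_block, and p0 are assumed known, and the mvtnorm package supplies the multivariate normal density.

```r
# Minimal sketch of the toy-problem lfdr_2 (assumes mvtnorm is installed).
library(mvtnorm)

# h(): predictive density from Table S1. Arguments sharing one latent N(0,1)
# mean have covariance sigma2 * I + J, i.e. marginal variance 1 + sigma2 and
# all pairwise correlations 1/(1 + sigma2).
h <- function(x, sigma2) {
  d <- length(x)
  dmvnorm(x, mean = rep(0, d), sigma = sigma2 * diag(d) + matrix(1, d, d))
}

lfdr2 <- function(x1, x2, y1, y2, p_block, p0, sigma2) {
  w <- c(block_null = p_block * p0 * h(c(x1, x2, y1, y2), sigma2),
         block_alt  = p_block * (1 - p0) * h(c(x1, x2), sigma2) * h(c(y1, y2), sigma2),
         null_null  = (1 - p_block) * p0^2 * h(c(x1, y1), sigma2) * h(c(x2, y2), sigma2),
         null_alt   = (1 - p_block) * p0 * (1 - p0) *
                        h(c(x1, y1), sigma2) * h(x2, sigma2) * h(y2, sigma2),
         alt_null   = (1 - p_block) * (1 - p0) * p0 *
                        h(x1, sigma2) * h(y1, sigma2) * h(c(x2, y2), sigma2),
         alt_alt    = (1 - p_block) * (1 - p0)^2 *
                        h(x1, sigma2) * h(x2, sigma2) * h(y1, sigma2) * h(y2, sigma2))
  # Null components are the three rows with mu_X1 = mu_Y1
  sum(w[c("block_null", "null_null", "null_alt")]) / sum(w)
}

lfdr2(0.1, 2.0, 0.0, 2.1, p_block = 0.5, p0 = 0.5, sigma2 = 1)

# Conditional FDR of the top-n list: mean of the n smallest lfdr values,
# e.g. FDRhat <- cumsum(sort(lfdr_values)) / seq_along(lfdr_values)
```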

Software versions
For statistical parametric and nonparametric mapping we used SPM12 and SnPM13, respectively.

Data structure and inference problem
Graph-respecting partitions: Let G = (V, E) denote a simple undirected graph. Over the n nodes in V we are interested in partitions π = {b}: collections of disjoint subsets b of V (i.e., blocks) whose union is the whole vertex set V. We say that π respects G if each block b induces a connected subgraph G_b, and we are interested in probability distributions over this restricted class of partitions as an avenue to improved statistical inference. Observe first that a graph-respecting partition π may be encoded by a vector of binary edge variables Z = {Z_e : e ∈ E} such that any two neighboring nodes in the same block have Z_e = 1, and any two neighboring nodes in different blocks have Z_e = 0. In general, these requirements (within-block edges on, Z_e = 1; between-block edges off, Z_e = 0) mean that some binary vectors Z are not achievable. One exception is when G is a tree. Lemma 1: There is a 1-1 correspondence between the graph-respecting partitions of a tree on n nodes and the set of binary vectors of length n − 1.
Notice that we can recover the blocks of the encoded partition by creating a new graph G′ from the original G through the removal of the edges with Z_e = 0. The connected components of G′, which we call the decimated graph, are then the blocks of π. In the special case where G is a tree, we could develop MCMC proposals by randomizing the representation vector Z. But how can we take advantage of this scheme when G is not a tree?
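For illustration, here is a small R sketch of the decoding step just described, using the igraph package: delete the off edges and read the blocks from the connected components of the decimated graph. The function name is ours.

```r
# Sketch: decode a partition from binary edge labels (z is ordered as E(g)).
library(igraph)

decode_partition <- function(g, z) {
  g_dec <- delete_edges(g, E(g)[z == 0])  # the decimated graph G'
  components(g_dec)$membership            # block label for every node
}

# Example on a path (a tree) with 4 nodes: keep edge 1-2, cut edges 2-3, 3-4
g <- make_ring(4, circular = FALSE)
decode_partition(g, z = c(1, 0, 0))       # blocks {1,2}, {3}, {4} -> 1 1 2 3
```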
Suppose that G is connected. (If not, we need to consider the following computation performed separately on the different connected components.) Let S denote the set of spanning trees of G and Z S = {Z e : e ∈ edges(S)} be a binary edge labeling of spanning tree S ∈ S.
Together, the tree S and its edge labeling Z_S provide a data augmentation of the partition π. Clearly a partition can be derived from φ = (S, Z_S) by decimating the edges of G that are either not in S, or are in S but carry edge-label 0, and then associating blocks with the connected components of the decimated graph. In general, there will be multiple φ's corresponding to a given partition π, each associated with a different spanning tree of G; we define Φ_π = {φ = (S, Z_S) : π(φ) = π}, where π(φ) denotes the partition induced by φ. We find a formula for the cardinality of Φ_π, which is useful in deriving MCMC samplers against a particular target distribution over partitions.
Important: Material in this subsection describes a Markov chain Monte Carlo algorithm that could be used for GraphMM, but that was not used in the local posterior computations presented in the main paper. Conveniently, the sampler also allows prior sampling, which we used to enumerate the graph-respecting partitions for subsequent non-Monte-Carlo computation. We include the subsection for completeness.
For each block b of a graph-respecting partition π, the induced subgraph G_b is connected and has at least one spanning tree; the total number N_b of such trees equals any cofactor of the Laplacian D_b − A_b, where D_b is a diagonal matrix holding the node degrees in G_b and A_b is the adjacency matrix of this same subgraph. This is an application of Kirchhoff's matrix-tree theorem. Of importance in relating the different subgraphs G_b is the hypergraph H, in which nodes are blocks of π and multi-edges between nodes correspond to all the between-block edges in G. By another application of the matrix-tree theorem, the number N_H of spanning trees of H equals any cofactor of the Laplacian of H, with multi-edges counted by multiplicity. Lemma 2: The cardinality of Φ_π, denoted N(π), satisfies N(π) = N_H ∏_{b∈π} N_b.
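A short R sketch makes the matrix-tree computations and Lemma 2 concrete. The function names are ours; the block multigraph H is built with igraph's contract(), keeping multi-edges and dropping loops.

```r
# Sketch: spanning-tree counts via Kirchhoff's matrix-tree theorem.
library(igraph)

count_spanning_trees <- function(g) {
  A <- as_adjacency_matrix(g, sparse = FALSE)   # multi-edges give entries > 1
  L <- diag(rowSums(A)) - A                     # graph Laplacian D - A
  det(as.matrix(L[-1, -1]))                     # any cofactor counts the trees
}

# Lemma 2: N(pi) = N_H * prod_b N_b, where memb gives the block of each node
n_phi <- function(g, memb) {
  nb <- sapply(unique(memb), function(b)
    count_spanning_trees(induced_subgraph(g, which(memb == b))))
  H <- simplify(contract(g, memb),
                remove.multiple = FALSE, remove.loops = TRUE)  # block multigraph
  count_spanning_trees(H) * prod(nb)
}

count_spanning_trees(make_ring(4))  # a 4-cycle has 4 spanning trees
```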
We augment the partition π of G to the pair φ = (S, Z_S) holding both a spanning tree S and a vector of binary labels Z_S that encodes the partition on S. For any target distribution p(π) we develop a Markov chain sampler by running a chain over the space of φ's, which by construction is a union of the sets Φ_π. The idea is that the target distribution of the Markov chain in the augmented space is

p(φ) = p(π(φ)) / N(π(φ)).   (2.1)

Generatively, this is equivalent to realizing π from p(π) and then selecting one of its N(π) representations φ ∈ Φ_π uniformly at random. Of course we cannot easily generate from p(π); the point is that the data augmentation offers a variety of proposal mechanisms that might drive a Metropolis-Hastings sampler. We envision two coupled proposals: 1. Update Z_S for a fixed spanning tree S (and thus change the partition).
2. Update spanning tree S but keep the same partition π(φ) (and thus update Z S ).
Partition update: Suppose the chain is at state φ = (S, Z_S) and we propose a new state φ* = (S, Z*_S) by randomizing the representation vector Z_S while keeping the spanning tree fixed. For example, we could generate entries Z*_e as independent Bernoulli trials with some edge-specific success probability. Data-dependent probabilities might improve mixing, and we might randomize only a fraction of the entries in order to simplify the update and improve its acceptance rate. Call q_S(Z*_S | Z_S) the probability mass associated with this proposal; then the Metropolis-Hastings ratio for this case is

[p(π(φ*)) / p(π(φ))] × [N(π(φ)) / N(π(φ*))] × [q_S(Z_S | Z*_S) / q_S(Z*_S | Z_S)].

The first ratio on the right may be relatively simple, especially for product-partition models, since only the blocks incurring some change are retained. The third ratio, of q's, may also be simple if we use independent Bernoulli trials to randomize the entries. What is critical, then, is the middle ratio, whose value is available according to Lemma 2.
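The following R sketch shows the bones of such a partition update. Here p_partition() is a stand-in for the user's target p(π); the proposal flips a random subset of entries of Z_S to fresh fair-coin values, a symmetric move, so the q ratio cancels; decode_partition() and n_phi() are the helper sketches given earlier.

```r
# Sketch of the partition update. Assumes: tree is a spanning tree of g as an
# igraph graph on the same vertex set, with z ordered as E(tree);
# p_partition() is a stand-in for the target distribution over partitions.
propose_z <- function(z, eps = 0.5) {
  flip <- runif(length(z)) < eps          # choose entries to randomize
  z[flip] <- rbinom(sum(flip), 1, 0.5)    # fresh fair-coin values: symmetric
  z
}

partition_update <- function(g, tree, z, p_partition) {
  z_star <- propose_z(z)
  pi_cur <- decode_partition(tree, z)
  pi_star <- decode_partition(tree, z_star)
  log_r <- log(p_partition(pi_star)) - log(p_partition(pi_cur)) +  # first ratio
           log(n_phi(g, pi_cur)) - log(n_phi(g, pi_star))          # middle ratio
  if (log(runif(1)) < log_r) z_star else z                         # q ratio is 1
}
```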
Tree update: Suppose the chain is at state φ = (S, Z S ) and we propose a new spanning tree S * uniformly over the set of possibilities S, for example via Wilson's loop-erased random walk.
Relative to the current partition π = π(φ), there is exactly one binary encoding vector Z*_{S*} corresponding to S*, by Lemma 1, and so we have proposed the state φ* = (S*, Z*_{S*}) that induces the same partition and thus is an element of Φ_π. The Metropolis-Hastings ratio for this move equals 1, since the target (2.1) depends on φ only through π(φ) and N(π(φ)), both of which are unchanged, and the uniform proposal over spanning trees is symmetric. Thus, any proposal obtained by randomizing the spanning tree while keeping the partition fixed is surely accepted.
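Wilson's algorithm is easy to implement directly. The sketch below (our own helper, not code from the paper) draws a uniform spanning tree of a connected graph by loop-erased random walks, given an adjacency list.

```r
# Sketch of Wilson's algorithm: a uniformly random spanning tree of a
# connected graph given as an adjacency list (adj[[v]] = neighbors of v).
wilson_spanning_tree <- function(adj) {
  n <- length(adj)
  in_tree <- rep(FALSE, n)
  parent <- rep(NA_integer_, n)
  root <- sample.int(n, 1)
  in_tree[root] <- TRUE
  for (start in seq_len(n)) {
    v <- start
    while (!in_tree[v]) {                      # random walk; overwriting the
      nb <- adj[[v]]                           # parent pointer erases loops
      parent[v] <- nb[sample.int(length(nb), 1)]
      v <- parent[v]
    }
    v <- start
    while (!in_tree[v]) {                      # commit the loop-erased path
      in_tree[v] <- TRUE
      v <- parent[v]
    }
  }
  cbind(child = which(!is.na(parent)), parent = parent[!is.na(parent)])
}

# Example: a uniform spanning tree of the 4-cycle 1-2-3-4-1
adj <- list(c(2L, 4L), c(1L, 3L), c(2L, 4L), c(1L, 3L))
wilson_spanning_tree(adj)  # n - 1 = 3 edges forming a random spanning tree
```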
The spanning tree representation above could be used to construct a posterior MCMC sampler. We experimented with such a sampler, but the main paper reports local computations based on explicit summations over the discrete states; ordinary MCMC was not required. However, we did use a prior-sampler version of the spanning-tree sampler as a simple way to enumerate all graph-respecting partitions on a given local graph.

Graph-based mixture model
Local subgraph calculations: We do not apply the model of Section 2 (main) to the entire graph, due to complexity and dimensionality. We expect most of the information relevant to inference about one node to come through nearby nodes, so we deploy local graph calculations amenable to parallel computing. For each node v, consider a local subgraph N_v containing node v, chosen to be a connected subgraph of G. GraphMM is then applied to model the data on this subgraph; the local procedure is illustrated by Fig. S1.

Supplementary Figure S1: Local subgraph calculations. (a) Pre-processed MRI images for two conditions. (b) Lattice graphs associated with the data. (c) The local procedure, in which the local subgraph N_v includes v and eight adjacent nodes; the GraphMM model is applied to the subgraph data to obtain the posterior probability l_v of a null effect, as in (2.2).

Algorithm 1 Local subgraph calculation
Input: pre-processed MRI data X, Y.
Output: per voxel posterior probability of differential structure.
1: procedure GraphMM(X, Y)
2:   for v in nodes do
3:     [steps 3 to 9: apply GraphMM to the data on the local subgraph N_v and record l_v, as in (2.2)]
10:  end for

For results in Sections 3.1 and 3.2 (main), we performed the local approximation on a 3 × 3 neighborhood of each node, as illustrated in Fig. S1. In this case we can enumerate all the graph-respecting partitions (we devised a data-augmentation sampling scheme that makes use of spanning trees within the input graph; see Supplementary Section 2.1). We are then able to enumerate all pairs (Ψ, ∆) and compute the exact posterior distribution. In addition to the 3 × 3 local, within-slice subgraph, we used two other local star subgraphs in the calculations reported in Figure 9.
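In outline, the per-node loop amounts to the following R sketch; get_local_subgraph() and graphmm_posterior() are hypothetical stand-ins for the neighborhood extraction and for the exact posterior computation of (2.2). The loop parallelizes trivially across nodes.

```r
# Sketch of Algorithm 1 (function names are stand-ins, not the released code).
local_fdr <- numeric(n_nodes)
for (v in seq_len(n_nodes)) {
  nv <- get_local_subgraph(v)            # e.g. v and its 8 lattice neighbors
  local_fdr[v] <- graphmm_posterior(X[nv, ], Y[nv, ], center = v)  # l_v, as in (2.2)
}
```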
Marginal likelihood: We derive the marginal likelihood using a Laplace approximation. Consider the notation of Section 2 of the main paper. Let K_Ψ be the number of blocks of partition Ψ and K_diff the number of changed blocks, i.e., blocks whose means differ between the two conditions. Denote the ordered indices of the changed blocks by (j_1, j_2, ..., j_{K_diff}). We re-parametrize the model in order to remove the clustering constraints on the means. The free parameters are then (ϕ, δ), with ϕ the vector of K_Ψ block means for group 1 and δ the vector of K_diff block-mean differences, and the marginal likelihood can be written as the integral over (ϕ, δ) of exp{F(ϕ, δ)}, where F is the log of the joint density of data and parameters. The Laplace method approximates this integral as

∫ exp{F(ϕ, δ)} dϕ dδ ≈ (2π)^{(K_Ψ + K_diff)/2} |−H(ϕ̂, δ̂)|^{−1/2} exp{F(ϕ̂, δ̂)},

where H is the Hessian of F and (ϕ̂, δ̂) its maximizer. In the next step, we derive the explicit formula for the gradient and Hessian matrix of F.
Let C be the allocation matrix of size N × K_Ψ, with entries c_vk = 1 if and only if node v belongs to block k. Let R be a K_Ψ × K_diff matrix such that the l-th column of R has value 1 at position j_l and 0 at other positions. Then we can relate the mean vectors to the new parameters (ϕ, δ) by µ_X = Cϕ and µ_Y = C(ϕ + Rδ).
The gradient of F is available in closed form and involves J_{K_Ψ}, the vector of ones of length K_Ψ; the Hessian matrix of F likewise has an explicit formula. Finally, the maximizer (ϕ̂, δ̂) can be found using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
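The Laplace step can be sketched in R with optim(). Here F_fun and F_grad are stand-ins for the closed-form log integrand and its gradient; we minimize −F so that the returned numerical Hessian equals −H at the mode.

```r
# Sketch: Laplace approximation to the log marginal likelihood via BFGS,
# with theta = c(phi, delta).
laplace_log_ml <- function(theta0, F_fun, F_grad) {
  fit <- optim(theta0,
               fn = function(th) -F_fun(th),
               gr = function(th) -F_grad(th),
               method = "BFGS", hessian = TRUE)
  d <- length(fit$par)                        # d = K_Psi + K_diff
  log_det_negH <- as.numeric(determinant(fit$hessian, logarithm = TRUE)$modulus)
  F_fun(fit$par) + (d / 2) * log(2 * pi) - 0.5 * log_det_negH  # fit$hessian = -H
}
```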
We point out that the calculation above takes the sample covariance matrices S_1 and T_1 and adjusts them by adding a scaled hyperparameter covariance matrix (A or B):

S_3 = (1/(M_X − 1)) A + S_1 and T_3 = (1/(M_Y − 1)) B + T_1,

where M_X and M_Y are the sample sizes of group 1 and group 2, respectively. We require these adjusted covariance matrices to be of full rank in order to deploy the calculations above. This has not been difficult in the context of local graph computations, where the dimensionality is low compared to the sample size.
Hyperparameters: The analysis of GraphMM involves estimating several hyperparameters: the prior null probability p_0; the prior mean µ_0 and variance τ^2 of the block means of group 1; the prior mean δ_0 and variance σ^2 of the difference in block means between the 2 groups; and the prior covariance matrices A for group 1 and B for group 2. Different strategies for estimating the hyperparameters have been considered.
• Estimating the prior null probability p_0: We experimented with both qvalue and ashr to estimate p_0 (see the sketch following this list). The qvalue package produces a conservative estimate of p_0 without any assumption on the distribution of effects, and hence is a safe choice in general settings. The ashr package, on the other hand, provides a conservative estimate under the assumption that the distribution of effects is unimodal. In our graph-based mixture model, the distribution of effects δ_k is a mixture of a point mass at 0 and a normal distribution, which satisfies the unimodality assumption.
Therefore, using the ashr package to estimate p_0 meshes well with GraphMM. The estimation of p_0 is based on the whole dataset and is computed prior to the local approximation procedure. In reported computations we used the ashr package for p_0.
• Estimating the other hyperparameters: We consider 3 approaches: global, local, and mixed estimation. With global estimation, the hyperparameters are estimated from the whole dataset, prior to the local approximation procedure. With local estimation, hyperparameters are estimated for each neighborhood during the local approximation procedure. With mixed estimation, all hyperparameters are estimated locally except for the matrices A and B, which are estimated globally. These approaches, local, mixed, and global, provide increasingly conservative estimates, in that order. In the brain simulation and application, we present results using mixed estimation.
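For reference, a minimal R sketch of the p_0 step, assuming per-voxel p-values pvals, or effect estimates betahat with standard errors sebetahat, have already been computed:

```r
# Sketch: two ways to estimate the prior null probability p0.
library(qvalue)
library(ashr)

p0_qvalue <- qvalue(pvals)$pi0               # conservative, assumption-light
p0_ashr <- get_pi0(ash(betahat, sebetahat))  # assumes unimodal effect distribution
```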

Brain MRI study: ADNI-2
The Alzheimer's Disease Neuroimaging Initiative (ADNI) was launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and nonprofit organizations. The overarching goals of the ADNI study are to detect Alzheimer's disease (AD) at the earliest possible stage, to identify ways to track disease progression with biomarkers, and to support advances in AD intervention, prevention, and treatment. ADNI is the result of the efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada.

Data-driven simulations
Scenarios 1-5: Figure S2 and Algorithm 2 show the general structure of the empirically-guided simulation study that is varied in 5 specific scenarios (Tables S2 and S3). Three data sets are drawn in each scenario. Simulations are of data on a single coronal MRI slice.
Algorithm 2 General framework for all the simulation scenarios
Input: MRI dataset for condition 1, G1, an N × M_X matrix; MRI dataset for condition 2, G2, an N × M_Y matrix.
Output: synthetic dataset 1, X; synthetic dataset 2, Y.
Step 1: S_1 ← sample-covariance(G1)
Step 2: S_2 ← sample-covariance(G2)
Step 3: V ← S_1 + 0.5 I
Step 4: W ← S_2 + 0.5 I  # Add a small value to the diagonals of S_1, S_2 to get positive definite matrices.
Steps 5-9: [construct the latent partition and mean structure; see the explanation of Step 6 below and Tables S2, S3]
Step 10: X ← MultivariateNormal(µ_X, V)
Step 11: Y ← MultivariateNormal(µ_Y, W)

For constructing the graph-respecting partition, let us explain Step 6 more carefully. We construct a "meta dataset": the data at each node is a vector with the following components: (data values from group 1, data values from group 2, α · x-coordinate, α · y-coordinate), where α is a scale parameter. We use the "greedy clustering" method to cluster the meta dataset (Collins, 2014, http://pages.cs.wisc.edu/~mcollins/software/greedy-clustering.html). This method clusters on Euclidean similarity, so two nodes fall in the same block when they are close in terms of both data values and xy-coordinates. A larger α increases the weight of the xy-coordinates in the clustering result, and including the xy-coordinates makes the blocks more likely to respect the lattice. Call the resulting partition P1; no graph structure is involved at this stage. Next, consider the lattice graph associated with the data. Loop over each block of P1 and check whether the block is graph-respecting (we do this with the igraph package, by counting connected components). If a block is not graph-respecting, split it into multiple blocks so that each new block is a connected component. In the end, every block is a connected component of the lattice graph, so the final partition, P2, is graph-respecting.
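Step 6's final adjustment, splitting clusters until every block is connected, can be sketched with igraph as follows; p1 is any initial clustering of the nodes and g is the lattice graph, with a shared node ordering. The function name is ours.

```r
# Sketch: turn an arbitrary partition p1 into a graph-respecting partition p2
# by splitting each block into the connected components it induces in g.
library(igraph)

make_graph_respecting <- function(g, p1) {
  p2 <- integer(vcount(g))
  next_label <- 0L
  for (b in unique(p1)) {
    nodes <- which(p1 == b)
    comp <- components(induced_subgraph(g, nodes))$membership
    p2[nodes] <- next_label + comp        # one new block per component
    next_label <- next_label + max(comp)
  }
  p2
}
```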
Supplementary Table S2: Description of simulations 1, 2, and 3. The blue text and the figures emphasize that these simulations differ in the average size of the latent blocks. In the figures, the magenta areas show changed blocks. The size of the changed blocks decreases from simulation 1 to simulation 3; in particular, simulation 3 has no clustering effect, i.e., the block size is 1 for all blocks.
Step 6:
  Scenario 1: use the greedy clustering method; the partition is adjusted to respect the lattice graph; there are 1313 blocks, with average block size 3.9.
  Scenario 2: same as Scenario 1.
  Scenario 3: each node is itself a block; there are 5236 blocks, each of size 1.
Step 7:
  Scenario 1: 50 blocks of sizes 12 to 14 are chosen to be changed blocks; the average changed-block size is 13.6; percentage of changed nodes: 14.3%.
  Scenario 2: 300 blocks of sizes 2 to 5 are chosen to be changed blocks; the average changed-block size is 2.6; percentage of changed nodes: 14.9%.
  Scenario 3: 15% of the nodes are chosen to be changed nodes.
Steps 8 and 9 (Scenario 1; Scenarios 2 and 3 are the same):
  m_x ← block averages of MRI data, group 1
  sd_x ← sample block standard deviations, group 1
  mar.m ← mean(m_x); mar.sd ← mean(sd_x)
  ϕ ~ Normal(mar.m, mar.sd)
  m_y ← block averages of MRI data, group 2
  max.d ← max(m_y − m_x); min.d ← min(m_y − m_x)
  For changed blocks: δ ~ Beta(2, 2), then δ ← δ (max.d − min.d) + min.d

The graph-respecting partition P2 is used for both groups.
Robustness to graph structure: We follow the simulation algorithm in Figure S2, except that we modify the clustering so that blocks may not be graph-respecting, and shifts between conditions may not be constant over whole blocks. In Step 6, we construct a meta dataset for each of groups 1 and 2 and use greedy clustering to get two separate partitions, P and Q, one per group. P and Q are not graph-respecting when we downweight the spatial coordinates, and they also differ between the 2 groups. We use the data from each group to get the block mean for each block of P and Q. Next, we "merge" a block of P with a block of Q when their block means are sufficiently close. For group 1, consider block 1 of P, named Pb1. In group 2, the nodes of Pb1 belong to several blocks of Q, say Qb1, Qb3, and Qb7. If the means of Pb1 and Qb7 are very close, we "merge" Pb1 and Qb7 in the sense that the nodes in Pb1 (group 1) and the nodes in Qb7 (group 2) are given the same mean (and thus are null).

After this merging step, the partitions for the 2 groups are still different, and neither is graph-respecting. The latent mean vectors are low-dimensional, but they are not as spatially coherent as in the model, and they are not consistent between groups. Other aspects of the sampling model are retained.

Supplementary Table S3: Description of simulations 4 and 5. The blue text and the figures emphasize that these simulations differ in the percentage of changed nodes. The figures show histograms of the block-average shifts across the 2 groups, for all blocks (red area) and for changed blocks (green area).

Step 6:
  Simulation 4: use the greedy clustering method; the partition is adjusted to respect the lattice graph; there are 1313 blocks, with average block size 3.9.
  Simulation 5: same as Simulation 4.
Step 7:
  Simulation 4: m_x ← block averages of MRI data, group 1; m_y ← block averages of MRI data, group 2; diff ← m_y − m_x; prob ← an increasing function of |diff| with values in (0, 1); ∆ ~ Bernoulli(prob); percentage of changed nodes: 16.4%.
  Simulation 5: similar to Simulation 4, except that the percentage of changed nodes is 50.3%.
Steps 8 and 9:
  Simulation 4: if block k is a changed block: …
  Simulation 5: same as Simulation 4.

ADNI-2 data analysis
Number of significant voxels at various thresholds and cluster sizes: Expanding the analysis of Section 3.2 (main), Figure S4 presents the number of significant voxels at various FDR thresholds, and Figure S5 shows the sizes of clusters of significant voxels. In both cases we consider the 3-d brain lattice, with calculations using the 2-d 3 × 3 local graph.
Results from different local graphs: Figure S8 compares the full-brain ADNI-2 example under two different GraphMM computations, which differ in the local graph used. In one case we use the 3 × 3 2-d slice graph; in the second case we use the 3-d star graph centered at each voxel. See Figure S9 for the empirical distributions of the local FDR statistics prior to ranking.

Discussion
Compute time: The prototype GraphMM code is not especially fast. For the 3-d star-local-graph whole-brain computation …

Importance of the graph-respecting constraint: We performed a sanity check to confirm that statistical efficiency gains may arise from regularizing the expected values through the graph-respecting assumption. In a predictive simulation on the 3 × 3 lattice, we generated synthetic Gaussian data D as follows: expected values are guided by a generative graph-respecting partition Ψ* (drawn from a prior); block-specific means are realized as i.i.d. Gaussian(0, σ^2 = 1/4) variables; and the 9 data elements in D deviate from these means by realizations of i.i.d. Gaussian(0, σ^2 = 1) variables. Each simulated data set is thus one instance of data from a graph-respecting generative setting. We take each such simulated data set …

Supplementary Figure S2: Structure of the data-driven simulation (Scenarios 1-5). Steps 1-4 make the correlation structure of the synthetic data similar to that of the MRI data. Steps 5-7 mimic the mean structure and clustering pattern of the MRI data. Steps 8-11 simulate data from a multivariate normal distribution with the specified correlation and mean structure.

Additional simulations on a line graph: To further investigate resistance to violations of model assumptions, we simulated from the GraphMM generative model, except that we altered the emission distribution of the measurements to allow heavy-tailed or skewed data. We used a line graph with 1000 nodes, so all internal nodes have exactly 2 neighbors. To set the latent blocks, we applied a fair coin toss to each graph edge. We then applied a further fair coin toss to each block to decide whether or not it is null. The mean value in both artificial groups at a null node was zero, and the two mean values at any non-null block were draws from a normal distribution with mean zero and standard deviation δ = 1/4. Supplementary Figures S11 and S12 consider two sample-size cases, the first with n = 50 samples per group and the second with n = 100 samples per group. Data were generated either as standard normal deviations from the group means (top panels) or as heavy-tailed or highly skewed deviations. For the heavy-tailed case, we took the normal measurements and cubed them. For the skewed case, we added standard exponential variates to the means. The central column of both figures shows normal qq-plots of the data from one group. We see generally good FDR control.
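The line-graph simulation just described is compact enough to sketch in full; delta = 1/4 and the two alternative emission distributions follow the description above.

```r
# Sketch of the line-graph simulation (1000 nodes, n samples per group).
set.seed(1)
n_nodes <- 1000; n <- 50; delta <- 1/4

edge_on <- rbinom(n_nodes - 1, 1, 0.5)    # fair coin on each line-graph edge
block <- cumsum(c(1L, 1L - edge_on))      # runs of on-edges form latent blocks
n_blocks <- max(block)
is_null <- rbinom(n_blocks, 1, 0.5) == 1  # fair coin per block

mu1 <- mu2 <- rep(0, n_blocks)            # null blocks keep mean zero
mu1[!is_null] <- rnorm(sum(!is_null), 0, delta)
mu2[!is_null] <- rnorm(sum(!is_null), 0, delta)

X <- matrix(rnorm(n_nodes * n, mean = mu1[block]), n_nodes, n)  # group 1
Y <- matrix(rnorm(n_nodes * n, mean = mu2[block]), n_nodes, n)  # group 2

X_heavy <- X^3                            # heavy-tailed variant: cube the normals
X_skew <- mu1[block] + matrix(rexp(n_nodes * n), n_nodes, n)    # skewed variant
```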
Supplementary Figure S7: Bar plot summarizing the sizes of significant clusters. By definition, a significant region is a collection of significant voxels that is spatially connected.
Supplementary Figure S8: Histogram showing the full-brain voxel ranks by GraphMM local FDR when the local graph is a 3-d, 7-node star graph (horizontal) or a 2-d, within-slice, 5-node star graph (vertical). Voxel prioritization differs between the two methods, but there is a great deal of consistency. (Ranks count from the bottom, so the smallest local FDR, and hence the greatest group difference, is at the bottom left.)

Supplementary Figure S9: The empirical cumulative distribution function (ecdf) of voxel-specific local FDR statistics from three computations, over the 464258 voxels in the ADNI-2 data (those voxels from the giant component of the filtered lattice graph). The local subgraph in GraphMM is either the one holding all first neighbors (GraphMM-3*) or the one holding all such neighbors in a 2-d coronal slice (GraphMM-2*). Note the black curve rises slightly faster from the origin, indicating more local FDR statistics close to zero for GraphMM-3*. The mean values of the distributions are similar (approximating π_0). ASH statistics, which do not attempt to use the brain structure, are included for comparison.

Supplementary Figure S10: Shown are predictive averages of posterior similarity distributions between the generative mean partition and the posterior distribution over partitions, for two analysts. For each similarity value s (Adjusted Rand Index), each curve records the predictive average E[P(S(Ψ, Ψ*) ≥ s | D) | OK], where OK is the event that the true partition Ψ* is graph-respecting. One analyst uses a prior that ignores the graph; the other uses a graph-respecting prior. The analyst who has regularized posterior computations tends to place more posterior probability near the generative partition.

Supplementary Figure S11: FDR control (left), sensitivity (right), and non-normality (center) for three simulation scenarios on a line graph with 1000 nodes. Ten data sets are simulated in each scenario (separate lines); the normal qq-plot of data from one of the two synthetic groups is shown in the center. The GraphMM-computed local FDR is on the horizontal axis in the left and right columns.