FPLS-DC: functional partial least squares through distance covariance for imaging genetics

Abstract

Motivation: Imaging genetics integrates imaging and genetic techniques to examine how genetic variations influence the function and structure of organs such as the brain or heart, providing insights into their impact on behavior and disease phenotypes. Organ-wide imaging endophenotypes are increasingly used to identify potential genes associated with complex disorders. However, analyzing organ-wide imaging data alongside genetic data presents two significant challenges: high dimensionality and complex relationships. We propose a novel nonlinear inference framework designed to partially mitigate these issues.

Results: We propose a functional partial least squares through distance covariance (FPLS-DC) framework for efficient genome-wide analyses of imaging phenotypes. It consists of two components. The first component uses FPLS-derived basis functions to reduce image dimensionality while screening genetic markers. The second component uses a simulated annealing algorithm to maximize the distance correlation between genetic markers and the projected imaging data, which is a linear combination of the FPLS basis functions. In addition, we propose an iterative FPLS-DC method based on the FPLS-DC framework, which effectively overcomes the influence of inter-gene correlation on inference. We efficiently approximate the null distribution of the test statistics using a gamma approximation. Compared to existing methods, FPLS-DC offers computational and statistical efficiency for handling large-scale imaging genetics. In real-world applications, our method successfully detected genetic variants associated with the hippocampus, demonstrating its value as a statistical toolbox for imaging genetic studies.

Availability and implementation: The FPLS-DC method we propose opens up new research avenues and offers valuable insights for analyzing functional and high-dimensional data. In addition, it serves as a useful tool for scientific analysis in practical applications within the field of imaging genetics research. The R package FPLS-DC is available on GitHub: https://github.com/BIG-S2/FPLSDC.


Introduction
With the advancement of imaging and genetic technologies, there has been a surge in large-scale biomedical studies like the UK Biobank (UKB) study (Miller et al. 2016). These studies have amassed a wide array of data spanning imaging, genetics, health factors, and electronic health records. Among this vast spectrum of information, organ-specific imaging traits have emerged as essential biomarkers for understanding organ development, aging, and the diagnosis and prognosis of diseases (Zhou et al. 2021). Furthermore, a joint analysis of comprehensive organ-wide imaging and genetic data not only aids in deciphering the genetic architectures behind organ structure and function (Bearden and Thompson 2017, Elliott et al. 2018, Zhao et al. 2021, 2023), but also helps in detecting pertinent genetic markers associated with various organ-related disorders, such as osteoarthritis (Le and Stein 2019, Wilkinson and Zeggini 2021). Over time, this extensive collection of data could pave the way for mapping potential biological pathways that link genetics to imaging endophenotypes for different organs, such as the brain and heart, and relate these to clinical outcomes that are confounded with health factors. Nevertheless, the joint analysis of imaging and genetic data poses considerable challenges to existing statistical methods (Liu and Calhoun 2014, Shen and Thompson 2020, Zhu et al. 2023). As elucidated by Vounou et al. (2010), methods used in imaging genetics can be classified into four categories: candidate phenotype-candidate gene association (CPCGWA), candidate phenotype-genome-wide association (CPGWA), brain-wide candidate gene association (BWCGA), and brain-wide genome-wide association (BWGWA). These categories are distinguished by the dimensionality of the genotypes and imaging phenotypes involved. Our primary research interest lies in addressing several major computational and methodological challenges related to BWCGA and BWGWA. These challenges primarily stem from the high dimensionality of genetic and imaging data, along with the intricate spatial structures inherent to them.
A large body of literature exists on the development of statistical methods for BWGWA and BWCGA (Liu and Calhoun 2014, Shen and Thompson 2020). Mass-univariate linear modeling (MULM) is a commonly utilized method for detecting linear genotype-phenotype relationships (Stein et al. 2010, Hibar et al. 2011, Huang et al. 2015). However, MULM involves a substantial number of comparisons, which may limit the power to detect even moderate signals. To partially mitigate this problem, various multivariate approaches, such as partial least squares correlation and canonical correlation analysis, have been introduced to identify multivariate linear genotype-phenotype associations (Liu and Calhoun 2014, Grellmann et al. 2015). Moreover, regularization methods have been used to handle high-dimensional scenarios, with the goal of selecting a small number of features (Vounou et al. 2010, Kohannim et al. 2012, Yang et al. 2015). Recently, Huang et al. (2017) proposed a functional genome-wide association analysis (FGWAS) framework designed to detect linear genotype-phenotype relationships, while accounting for functional features, such as functional smoothness and correlation, inherent in imaging data. Lastly, there is a growing interest in delineating nonlinear genotype-phenotype relationships. This has been pursued by using distance correlation metrics for random vectors (Székely et al. 2007, Székely and Rizzo 2013), including the greedy projected distance correlation [G-PDC, Fang et al. (2018)] and weighted distance correlation [wdCor, Wen et al. (2020)] methods.
In this article, we propose a functional partial least squares through distance covariance (FPLS-DC) framework. Our goal is to delineate nonlinear genotype-phenotype relationships while explicitly accounting for the functional features present in imaging data. Our FPLS-DC framework consists of three main steps. First, we utilize an alternative partial least squares (APLS) approach (Delaigle and Hall 2012) to extract a low-dimensional informative projection vector (IPV) from the imaging data. Next, we use a simulated annealing algorithm to maximize the distance covariance between genetic markers and the IPV. Finally, we implement a rapid testing procedure for conducting BWCGA and BWGWA efficiently. To examine the finite sample properties of FPLS-DC, we carry out extensive simulations and real data analyses. Moreover, to address the challenge posed by correlation between predictors, we introduce the iterative FPLS-DC (I-FPLS-DC) algorithm, which extends the framework from marginal screening to the joint selection of correlated predictors.
The structure of the remainder of the paper is as follows. Section 2 lays the groundwork by providing necessary preliminary information for the subsequent discussions. In Section 3, we provide a detailed description of the FPLS-DC and I-FPLS-DC algorithms. Section 4 presents the outcomes of our Monte Carlo simulations. In Section 5, we carry out a large-scale real data analysis using UKB. Finally, Section 6 discusses potential extensions of the proposed algorithms and outlines possible directions for future work.

Preliminary
In this section, we introduce two fundamental components of the FPLS-DC and I-FPLS-DC algorithms: the APLS algorithm and distance covariance.

APLS algorithm
The functional linear model (FLM) is a statistical framework designed to model the relationship between a scalar response variable $V$ and a functional predictor variable $U(s)$, where $U(s)$ is defined on a nondegenerate and compact set $S$ and satisfies $\int_S E\{U^2(s)\}\,ds < \infty$. In this framework, the response variable is hypothesized to be a function of the predictor variable, and the relationship is modeled linearly through a coefficient function:

$$V = a_0 + \int_S b(s)\,U(s)\,ds + \varepsilon,$$

where $a_0$ is a scalar parameter, $\varepsilon$ is a scalar random variable with $E(\varepsilon \mid U) = 0$, and $b(s)$ is an unknown coefficient function on $S$.
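In practice the functional predictor is observed on a finite grid, so the integral in the FLM is replaced by a numerical quadrature. The following is a minimal sketch of that discretization, assuming a one-dimensional domain $S = [0, 1]$ and a trapezoidal rule; the helper names `trapezoid` and `flm_mean`, the grid, and the toy choices $b(s) = 1$ and $U(s) = s$ are illustrative assumptions, not part of the paper.

```python
def trapezoid(values, grid):
    """Trapezoidal approximation of the integral of a function sampled on `grid`."""
    return sum(
        0.5 * (values[k] + values[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )


def flm_mean(a0, b_vals, u_vals, grid):
    """Conditional mean a0 + integral of b(s) U(s) ds, with b and U sampled on `grid`."""
    integrand = [b * u for b, u in zip(b_vals, u_vals)]
    return a0 + trapezoid(integrand, grid)


# Toy check (hypothetical inputs): b(s) = 1 and U(s) = s on S = [0, 1],
# so the integral is 1/2 and the conditional mean is a0 + 1/2.
grid = [k / 100 for k in range(101)]
b_vals = [1.0 for _ in grid]
u_vals = [s for s in grid]
mean_v = flm_mean(2.0, b_vals, u_vals, grid)
```

The same quadrature applies to any inner product of two functions on $S$, for example the projection of an image onto a basis function.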
The partial least squares (PLS) algorithm can be used in the FLM to find a series of orthogonal basis functions $\{\psi_j(s)\}_{j \ge 1}$ to approximate the functional coefficient $b(s)$. However, the computational implementation of PLS can prove challenging, particularly in the context of high-dimensional and complex functional data. To address these limitations in terms of computational efficiency and stability, the APLS approach was designed. Specifically, APLS replaces the singular value decomposition (SVD) used in PLS with an eigenvalue decomposition, which may be more efficient for specific types of data.
Step 3: Estimate the prediction function $E\{V_i \mid U_i(s)\}$ by

The APLS algorithm has certain limitations that need to be considered. First, it may not be suitable when the functional predictor variable exhibits a complex dependency structure, such as nonstationarity or nonlinearity. This is because the algorithm assumes a linear relationship between the predictor and the response variable, which may not be valid in these cases. In addition, APLS might not perform optimally with datasets having a low signal-to-noise ratio or with highly correlated predictors.

Distance covariance
Distance covariance (dcov) is a statistical measure of the degree of dependence between two random vectors, $W \in R^p$ and $V \in R^q$. It was introduced by Székely et al. (2007). We now give the formal definition of distance covariance and its empirical unbiased variant.

Definition 2.1. The distance covariance (dcov) between random vectors $W$ and $V$ with finite first moments is the nonnegative number $\mathrm{dcov}(W, V)$ defined by

$$\mathrm{dcov}^2(W, V) = \frac{1}{c_p c_q} \int_{R^{p+q}} \frac{|\phi_{W,V}(t, s) - \phi_W(t)\,\phi_V(s)|^2}{|t|_p^{1+p}\,|s|_q^{1+q}}\,dt\,ds,$$

where $|t|_p$ and $|s|_q$ are the Euclidean norms of $t \in R^p$ and $s \in R^q$, respectively, $\phi_{W,V}$, $\phi_W$ and $\phi_V$ are the characteristic functions of $(W, V)$, $W$ and $V$, respectively, and $c_d = \pi^{(1+d)/2} / \Gamma((1+d)/2)$.

Let $(W_i, V_i)$, $i = 1, 2, 3$, be i.i.d. copies of $(W, V)$. There exists an equivalent form of distance covariance that involves absolute differences of $W$ and $V$. Specifically, the distance covariance between $W$ and $V$ can be expressed as

$$\mathrm{dcov}^2(W, V) = E\big(|W_1 - W_2|_p\,|V_1 - V_2|_q\big) + E|W_1 - W_2|_p\,E|V_1 - V_2|_q - 2\,E\big(|W_1 - W_2|_p\,|V_1 - V_3|_q\big).$$

Empirical unbiased versions of distance covariance can be obtained from finite samples of $W$ and $V$ by substituting the expected values with sample means.

Definition 2.2. Let $\{(W$ ... , and the form of $B_{ij}$ is similar to that of $\tilde{A}_{ij}$.
The distance covariance has shown promising results in detecting both linear and nonlinear relationships between random vectors.
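To make the empirical unbiased variant concrete, the sketch below implements one commonly used unbiased estimator of the squared distance covariance, based on U-centered distance matrices. Whether this coincides exactly with the estimator in Definition 2.2 is an assumption; the function names (`pairwise_dist`, `u_center`, `dcov2_unbiased`) and the toy data are illustrative only.

```python
import math
import random


def pairwise_dist(rows):
    """Euclidean distance matrix for a list of equal-length vectors."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def u_center(d):
    """U-centering of a symmetric distance matrix (diagonal set to zero)."""
    n = len(d)
    row = [sum(r) / (n - 2) for r in d]  # equals column sums by symmetry
    total = sum(sum(r) for r in d) / ((n - 1) * (n - 2))
    return [
        [0.0 if i == j else d[i][j] - row[i] - row[j] + total for j in range(n)]
        for i in range(n)
    ]


def dcov2_unbiased(w, v):
    """Unbiased estimate of dcov^2 from paired samples w and v (lists of vectors)."""
    n = len(w)
    if n < 4 or len(v) != n:
        raise ValueError("need at least 4 paired observations")
    a = u_center(pairwise_dist(w))
    b = u_center(pairwise_dist(v))
    s = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    return s / (n * (n - 3))


# Toy data (hypothetical): a purely nonlinear dependence versus independence.
rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0)] for _ in range(120)]
y_dep = [[xi[0] ** 2] for xi in x]                    # depends on x, zero linear correlation in theory
y_ind = [[rng.gauss(0.0, 1.0)] for _ in range(120)]   # independent of x
dep = dcov2_unbiased(x, y_dep)
ind = dcov2_unbiased(x, y_ind)
```

Because the estimator is unbiased, it can be slightly negative under independence, which is expected behavior rather than a numerical error.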

Materials and methods
In this section, we introduce the FPLS-DC algorithm and the I-FPLS-DC algorithm. We first establish some essential notation. Let $X = (X_1, \ldots, X_p)^T$ be a $p \times 1$ random vector, and let $Y(s)$ be a random functional response variable defined on $S$. We denote by $(\mathbf{X}, \mathbf{Y}(s)) = (\mathbf{X}_1, \ldots, \mathbf{X}_p, \mathbf{Y}(s)) = \{(X_{i1}, \ldots, X_{ip}, Y_i(s))\}_{i=1,\ldots,n}$ a random sample from $(X, Y(s))$. For $b = (\beta_1, \ldots, \beta_p)^T$, we define the zero norm $\|b\|_0 = \sum_{j=1}^{p} I(\beta_j \neq 0)$ and the $r$-norm $\|b\|_r = (\sum_{j=1}^{p} |\beta_j|^r)^{1/r}$ for $r \in [1, \infty)$, where $I(A)$ is the indicator function of an event $A$. For any index set $A \subseteq \{1, \ldots, p\}$, we denote by $|A|$ its cardinality and define $\mathbf{X}_A = (X_{ij}, j \in A)^T \in R^{|A| \times n}$.

FPLS-DC algorithm
The FPLS-DC algorithm includes a dimension reduction step and a nonlinear dependence step, aiming to characterize the dependence between the functional response variable $Y(s)$ and a scalar predictor variable $X$. The FPLS-DC algorithm proceeds as follows:

Step 1: Estimate $\psi_1(s), \ldots, \psi_q(s)$ using the sample $(\mathbf{X}, \mathbf{Y}(s))$ and Equation (2) from Section 2.1. This step results in a low-dimensional informative projection vector (IPV) $Z = (Z_1, \ldots, Z_q)^T$ of projections of $Y(s)$ onto the basis functions $\{\psi_j(s)\}_{j=1}^{q}$, where $Z_j = \int_S Y(s)\,\psi_j(s)\,ds$ for $j = 1, \ldots, q$.
Step 2: Optimize the empirical unbiased distance covariance:

$$\max_{c \in R^q:\ c^T c = 1} \mathrm{dcov}_n^2(\mathbf{X}, \mathbf{Z}^T c). \quad (4)$$

The optimization problem (Equation 4) can be transformed into an unconstrained optimization problem by using a spherical coordinate transformation. This is because the constraint $c^T c = 1$ corresponds to a hypersphere in $q$-dimensional space, which can be parameterized by spherical coordinates. In spherical coordinates, the parameter $c$ is represented by a vector $\theta = (\theta_1, \ldots, \theta_{q-1})^T$ of length $q-1$, where each $\theta_i$ signifies the angle between $c$ and the $i$th coordinate axis. The function $J(\theta)$ then transforms $\theta$ to $c$, mapping the spherical coordinates to a point on the unit hypersphere.

Utilizing spherical coordinates, the optimization problem (Equation 4) can be reformulated as

$$\max_{\theta \in R^{q-1}} \mathrm{dcov}_n^2(\mathbf{X}, \mathbf{Z}^T J(\theta)). \quad (5)$$

This becomes an unconstrained nonconvex optimization problem that can be solved using standard optimization methods, such as simulated annealing. Simulated annealing is a stochastic optimization algorithm that searches for the global maximum of a function by randomly perturbing the current solution: moves that improve the objective are accepted, while moves that worsen it are accepted with a probability governed by a temperature parameter that decreases over the course of the run.
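The two ingredients of Step 2, the spherical map $J(\theta)$ and a simulated annealing search over $\theta$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the FPLS-DC package: the particular hyperspherical parameterization, the Gaussian proposal, the geometric cooling schedule, and all names and default values (`sphere_point`, `anneal_max`, `step`, `cooling`) are assumptions. To keep the block runnable on its own, a toy linear objective stands in for $\mathrm{dcov}_n^2(\mathbf{X}, \mathbf{Z}^T J(\theta))$.

```python
import math
import random


def sphere_point(theta):
    """Map q-1 angles to a unit vector in R^q via hyperspherical coordinates."""
    c, sin_prod = [], 1.0
    for angle in theta:
        c.append(sin_prod * math.cos(angle))
        sin_prod *= math.sin(angle)
    c.append(sin_prod)
    return c


def anneal_max(objective, dim, n_iter=3000, step=0.5, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing for maximizing `objective` over theta in R^dim."""
    rng = random.Random(seed)
    theta = [rng.uniform(0.0, math.pi) for _ in range(dim)]
    value = objective(theta)
    best_theta, best_value = list(theta), value
    temp = t0
    for _ in range(n_iter):
        cand = [t + rng.gauss(0.0, step) for t in theta]
        cand_value = objective(cand)
        delta = cand_value - value
        # Always accept improvements; accept a worse move with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            theta, value = cand, cand_value
            if value > best_value:
                best_theta, best_value = list(theta), value
        temp *= cooling  # geometric cooling
    return best_theta, best_value


# Toy objective (hypothetical stand-in for the distance covariance criterion):
# maximize the inner product with a fixed unit vector; the maximum value is 1.
target = [0.6, 0.0, 0.8]


def toy_objective(theta):
    return sum(ci * ti for ci, ti in zip(sphere_point(theta), target))


best_theta, best_value = anneal_max(toy_objective, dim=2)
```

To use this for Step 2, `toy_objective` would be replaced by a function that forms the projected scores $\mathbf{Z}^T J(\theta)$ and returns their empirical unbiased distance covariance with the predictor.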
Algorithm 1 presents a new feature screening procedure for all $p$ components of $X$ that is based on the optimization problem (Equation 5). This marginal screening approach assumes that the predictors are independent of each other. However, if there are correlations between predictors, then the FPLS-DC method may identify redundant variables instead of the most relevant ones. In the next section, we will present a novel iterative algorithm to address such redundancy.

I-FPLS-DC algorithm
We propose an iterative FPLS-DC algorithm, known as I-FPLS-DC, by introducing another constraint on $X$. Let $c$ and $b$ be the coefficients of the response variables and predictor variables, respectively. For each $b$, we apply the APLS algorithm to $(\mathbf{X}^T b, \mathbf{Y}(s))$ in order to compute the basis functions $\psi_1(s), \ldots, \psi_q(s)$ and subsequently calculate the IPV $Z = (Z_1, \ldots, Z_q)$ as the projections of $Y(s)$ along $\{\psi_j(s)\}_{j \le q}$. It is important to note that the IPV $Z$ is a function of $b$.
For I-FPLS-DC, our optimization problem becomes:

Moreover, we can incorporate sparsity into (6) by adding an $L_0$ penalty term on the predictor variables, leading to:

By using spherical coordinates $b = J(\alpha)$ and $c = J(\theta)$, and selecting sparse variables in $X$, we can rewrite (7) as:

where $A$ is the active set of $X$, consisting of selected features of size $K$. Optimizing Equation (8) can be computationally challenging when the total number of basis functions and the number of significant variables are large. One strategy to manage this challenge involves fixing either the $\theta$ or the $\alpha$ parameters and then updating the other set of parameters in an iterative manner.
Determining the $K$ significant variables from $p$ predictors remains a complex task. One intuitive approach is to rank $\mathrm{dcov}_n^2(\mathbf{X}_l, \mathbf{Y}(s))$ for $l = 1, \ldots, p$ and select the top-$K$ variables. However, this method could be influenced by the correlation among predictor variables. An enhanced approach would be to update the importance score of each variable iteratively and constantly adjust the ranking of the significant variables to supersede the irrelevant ones. This method may provide a more accurate solution by considering the correlation among predictor variables. Given an estimated active set $A$, we can utilize two sacrifices to evaluate the importance of a variable:

where $\hat{c}$ and $\hat{b}$ are the solutions of Equation (7) based on the active set $A$.
The I-FPLS-DC procedure can be divided into two steps. The detailed steps of the proposed I-FPLS-DC method are summarized in Algorithm 2. To enhance comprehension of Algorithms 1 and 2, we present the corresponding flowchart in Supplementary Fig. S1.
To determine the size of the active set in practice, information criteria such as the Akaike information criterion (AIC) can be used. AIC provides a criterion for balancing the complexity of model estimation and the goodness of fit. However, when using an $L_2$ penalty to constrain the maximization of unbiased distance variance, there may be an imbalance between the complexity of model estimation and the goodness of fit. To better determine the size of the active set, we integrate the sample size with the maximum distance covariance. For any active set $A$, we define a modified AIC as follows:

where $\mathrm{dcov}_{\max} = \max_{\theta \in R^{q-1},\, \alpha \in R^{|A|-1}} \mathrm{dcov}_n^2(\mathbf{Z}^T J(\theta), \mathbf{X}_A^T J(\alpha))$.

Hypothesis testing
To explore the relationship between a functional response variable and a predictor, we conduct a hypothesis test for independence. Specifically, for $l = 1, \ldots, p$, we can write the null and alternative hypotheses as

$$H_0: X_l \perp\!\!\!\perp Y(s) \quad \text{vs} \quad H_1: X_l \not\perp\!\!\!\perp Y(s). \quad (10)$$

To associate the predictor variable $X_l$ with the most pertinent regions of the functional data $Y(s)$, we can assign weights to each voxel $s$ in the image $Y(s)$ based on the directional influence of $b(s)$. The signal pattern across $S$ that is linked to the variable $X_l$ can be captured by $\mathrm{dcov}(X_l, \int_S Y(s)\,b(s)\,ds)$. Consequently, the hypothesis test in (10) transforms into examining the independence between $X_l$ and $\int_S Y(s)\,b(s)\,ds$. Conventionally, a permutation procedure is used, but its computational demands can be prohibitive in large-scale GWAS. An alternative, as explored in prior literature (Gretton et al. 2005, Székely et al. 2007), entails approximating the null distribution of distance covariance through a gamma distribution. The shape and scale parameters of this distribution are estimated from the sample distance covariance $\mathrm{dcov}_n$, providing a practical and efficient approximation. The FPLS-DC test procedure can be divided into four parts:

Step 4: Calculate the P-value, denoted as $p$, and reject $H_0$ if $p < \alpha$, where $\alpha$ is the prespecified significance level.
During GWAS with the FPLS-DC test, we can simplify the approximation of the gamma distribution parameters to expedite testing. For a reference variable $(X, Y(s))$, the gamma distribution's mean and variance hinge solely on

Applying this, we can estimate the gamma distribution mean and variance of $(X_l, Y(s))$. Then, we can estimate the null distribution of $(X_l, Y(s))$.
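The gamma approximation above can be illustrated by moment matching: given a mean and variance for the null distribution of the test statistic, the shape and scale follow directly, and the P-value is the upper tail probability at the observed statistic. The sketch below assumes that the null mean and variance are already available (how the paper estimates them is not reproduced here) and evaluates the gamma CDF with a simple power series; the function names `gamma_params`, `gamma_cdf`, and `gamma_pvalue` are illustrative.

```python
import math


def gamma_params(mean, var):
    """Moment matching: a gamma with shape k and scale s has mean k*s and variance k*s^2."""
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    k = 0
    while term > 1e-17 * total and k < 100000:
        k += 1
        term *= z / (shape + k)
        total += term
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, math.exp(log_prefactor) * total)


def gamma_pvalue(stat, null_mean, null_var):
    """Upper-tail P-value of `stat` under the moment-matched gamma null."""
    shape, scale = gamma_params(null_mean, null_var)
    return 1.0 - gamma_cdf(stat, shape, scale)


# Toy check: mean 2 and variance 4 give shape 1 and scale 2 (an exponential
# distribution), so the upper tail at 2 is exp(-1).
p_toy = gamma_pvalue(2.0, 2.0, 4.0)
```

This series is adequate for moderate P-values only. For the very small P-values that arise at genome-wide thresholds, computing `1 - cdf` loses precision, and a dedicated survival-function routine such as `scipy.stats.gamma.sf` would be a safer choice.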
Similar to the FPLS-DC test, I-FPLS-DC can conduct group genetics testing following (7). The details are summarized in the supplemental material. In the context of the I-FPLS-DC test procedure, which focuses on a conditional test given an active set $A$, we define our hypotheses as follows:

$$H_0: X_l \perp\!\!\!\perp Y(s) \mid X_{A \setminus \{l\}} \quad \text{vs} \quad H_1: X_l \not\perp\!\!\!\perp Y(s) \mid X_{A \setminus \{l\}}. \quad (11)$$

The I-FPLS-DC test procedure then varies slightly depending on the specific variable. Assuming that we have obtained the estimated direction $\hat{b}(s)$ and the estimated active set $\hat{A}$ through the I-FPLS-DC method, the procedure for a variable $X_l$ closely resembles the FPLS-DC process, with a minor modification in Step 3. If $l \in \hat{A}$, the adjustment lies in the determination of $\hat{b}^{*(m)}(s)$ during Step 3: we optimize the objective function (8) using the permuted sample $(\mathbf{X}, \mathbf{Y}^{*}(s))$ and the estimated active set $\hat{A}$ to compute $\hat{b}^{*(m)}(s)$. Conversely, if $l \notin \hat{A}$, the only variation is in setting the value of $\hat{b}^{*(m)}(s)$ during Step 3, where it is set to be $\hat{b}(s)$.

Simulation
In this section, we evaluate the finite sample performance of FPLS-DC and I-FPLS-DC through several numerical experiments. We set four types of examples to contrast the performance of these methods against APLS (Delaigle and Hall 2012), wdCor (Wen et al. 2020), and FGWAS (Huang et al. 2017). The comparison is conducted using two evaluation metrics: (i) the quantile of the minimum number of chosen variables that encompasses all the causal variables; (ii) the sensitivity rate, defined as the proportion of significant causal SNPs over the total number of causal SNPs, and the nonsensitivity rate (NSR), defined as the ratio of significant noncausal SNPs to the total number of noncausal SNPs, with significance levels of 0.001, 0.01, and 0.05.
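For a single simulation replicate, the two evaluation metrics can be computed as in the sketch below. The function names and the toy scores and P-values are hypothetical; in the paper, the first metric is further summarized by quantiles over the Monte Carlo replicates.

```python
def min_model_size(scores, causal):
    """Smallest number of top-ranked variables that contains every causal variable."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])  # best score first
    rank = {j: r + 1 for r, j in enumerate(order)}
    return max(rank[j] for j in causal)


def sensitivity_rates(pvalues, causal, alpha):
    """Return (sensitivity rate, nonsensitivity rate) at significance level alpha."""
    causal = set(causal)
    noncausal = [j for j in range(len(pvalues)) if j not in causal]
    sr = sum(pvalues[j] < alpha for j in causal) / len(causal)
    nsr = sum(pvalues[j] < alpha for j in noncausal) / len(noncausal)
    return sr, nsr


# Toy replicate (hypothetical values): four SNPs, of which SNPs 0 and 2 are causal.
scores = [0.9, 0.1, 0.5, 0.7]
pvalues = [0.0005, 0.2, 0.03, 0.04]
causal = [0, 2]

size = min_model_size(scores, causal)                 # SNP 2 is ranked third
sr_05, nsr_05 = sensitivity_rates(pvalues, causal, 0.05)
sr_001, nsr_001 = sensitivity_rates(pvalues, causal, 0.001)
```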
In our comparisons, we apply sample sizes of 200, while the dimension of genetic data is set at 1000 or 2000 across all scenarios. We perform 100 Monte Carlo simulations in each case.
The genetic data $X = (X_1, \ldots, X_p)^T$, where $p = 1000$ or $2000$, is simulated as follows. Initially, $n$ subjects are created by randomly combining haplotypes of HapMap CEU subjects. We then use PLINK software to establish linkage disequilibrium (LD) blocks, based on the genotypes of these simulated subjects. We randomly select 200 or 400 blocks from the resultant LD blocks and amalgamate the haplotypes of HapMap CEU subjects within each chosen block to create genotype variables for these subjects. Finally, 5 SNPs are randomly selected from each block to formulate the genetic data $X$.
Table 1 reports the performance of the methods in the four examples. The results for the minimum model size show that I-FPLS-DC surpasses all other methods in discerning the most consequential variables while excluding irrelevant ones. FPLS-DC also displays its effectiveness by maintaining a smaller model size than the wdCor method in the majority of the examples. The wdCor method falls short of the performance levels exhibited by both FPLS-DC and I-FPLS-DC across both sample sizes. In contrast, APLS and FVGWAS tend to incorporate more variables into the selected models, particularly when dealing with smaller sample sizes.
Table 2 presents the sensitivity rates of all methods. It suggests that FPLS-DC and I-FPLS-DC are highly likely to accurately identify important variables, with wdCor demonstrating comparable performance, albeit slightly inferior to FPLS-DC and I-FPLS-DC. Both APLS and FVGWAS tend to manifest lower sensitivity rates, implying they might incorporate fewer pertinent variables in their selected models. However, all methods exhibit similar performance when it comes to irrelevant variables, which indicates the stability of their tests when dealing with such variables. The decision to use a gamma distribution as an approximation in our test is due to the complexities of the null distribution, potentially resulting in higher NSR of FPLS-DC and I-FPLS-DC at times.
In Supplementary Fig. S2, a detailed comparison of computation times for various methods across 100 repeated simulations under the null hypothesis is shown. The analysis reveals that FVGWAS is the most time-efficient method. However, FPLS-DC and I-FPLS-DC also demonstrate commendable speed, significantly surpassing wdCor and APLS in computational efficiency. This evidence implies that integrating FPLS-DC and I-FPLS-DC into existing analysis workflows could enhance the speed of detecting genetic variants linked to complex diseases in extensive genetic association studies.

UK Biobank data analysis
In this section, we apply our FPLS-DC framework to analyze hippocampus surface data obtained from phases 1 and 2 of the UK Biobank (UKB). The UKB's principal goal is to collect extensive biomedical information from around 500 000 participants. This initiative aims to assess the influence of genetics, lifestyle choices, and environmental factors on various diseases and health conditions. Our analysis uses a dataset comprising 10 000 unrelated individuals of British ancestry, encompassing two key components. The first component includes the left and right hippocampus radial distances for each subject, with each surface represented by 15 000 vertices. This detailed representation allows for a comprehensive analysis of the hippocampal structure. The second component consists of SNPs located on 22 autosomes. The primary objective of this data analysis is to explore the genetic impact of these SNPs on the hippocampus. Given the hippocampus's pivotal role in memory processes, our study focuses on understanding how genetic variations might influence hippocampal structure and function, and consequently, memory lapses. This research is instrumental in deepening our understanding of the genetic underpinnings of memory-related aspects of the brain.
We processed the MRI data using standard procedures, including AC-PC correction, skull-stripping, cerebellum removal, intensity inhomogeneity correction, segmentation, and registration. This was followed by automatic regional labeling of the template, transferred to subject images through deformable registration, enabling the computation of volumes for 93 ROIs per subject. For the hippocampus, we used a subregional analysis package utilizing surface fluid registration (Shi et al. 2013), which leverages isothermal coordinates and fluid registration for hippocampal surface mapping. This allowed us to calculate various surface statistics, such as multivariate tensor-based morphometry and radial distance measures on the registered surfaces. Further methodological details are available in Wang et al. (2011).
We downloaded the UKB imputed genotype data and applied the following standard quality control procedures: excluding subjects with more than 10% missing genotypes, and only including SNPs with MAF > 0.01, genotyping rate > 90%, and passing the Hardy-Weinberg test (P-value > $1 \times 10^{-7}$). After quality control, we limited our analysis to 653 171 SNPs that overlap with the HapMap3 reference panel (International HapMap 3 Consortium et al. 2010) to balance accuracy and computational burden (Ge et al. 2019).
We adjusted for key covariates in our study, including the top 20 genetic PCs, age at imaging, gender, interaction terms (age × gender, age$^2$, age$^2$ × gender), education, study site, motion, image quality, scale, and brain position. To minimize their impact on the FPLS-DC analysis, we preprocessed the imaging data by consolidating IDs into one file and imputing missing data with subject-specific mean values. The residualized imaging data was then calculated using the formula $[I - X(X^T X)^{-1} X^T]Y$, where $X$ is the design matrix (the covariate matrix with a constant column prepended) and $Y$ is the image data. Given the longer processing time for wdCor and APLS tests on real data, the residualized data, along with the genetic data, were analyzed using the FPLS-DC, I-FPLS-DC, and FVGWAS methods.

Supplementary Table S1 provides an overview of the top 10 SNPs that have been detected by FPLS-DC, I-FPLS-DC, and FVGWAS in both the right and left hippocampus. These results were obtained at a significance level of $7.655 \times 10^{-8}$, which corresponds to the threshold of 0.05 divided by the total number of SNPs (653 171). Within the findings generated by FPLS-DC, as highlighted in Table 1, SNP rs11245347 exhibited the highest significance with a P-value of $9.992 \times 10^{-15}$. This SNP, located on the FAM53B gene, demonstrates a remarkable association with the right hippocampus. In addition, SNP rs9321028 on the TPD52L1 gene exhibited a significant association with the left hippocampus (P-value = $9.881 \times 10^{-15}$). Liu et al. (2023) indicated that SNP rs11245347 (chr10) showed significant effects on several hippocampal and subfield volumes in both EAS and EUR, including the right hippocampal tail volume. Zhao et al. (2019) and Van der Meer et al. (2020) both found significant gene-level associations between FAM53B and hippocampal subfield volumes.
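The residualization formula used in the covariate adjustment above is the standard least-squares residual-maker matrix. As a short worked note, assuming $X^T X$ is invertible and writing $M$ for the matrix in brackets (a symbol introduced here for illustration), the following identities show why the residualized images carry no linear covariate signal:

```latex
\begin{align*}
M &= I - X (X^T X)^{-1} X^T, \\
M X &= X - X (X^T X)^{-1} X^T X = X - X = 0, \\
M^2 &= M - X (X^T X)^{-1} X^T M = M
      \quad \text{since } X^T M = (M X)^T = 0, \\
X^T (M Y) &= (M X)^T Y = 0 .
\end{align*}
```

The last line says that every column of the residualized data $MY$ is orthogonal to every column of the design matrix. Because the design matrix includes a constant column, the residualized values at each vertex also sum to zero across subjects.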
In the I-FPLS-DC results detailed in Supplementary Table S1, two prominent SNPs stand out: rs853169 and rs7998301, each displaying a remarkably low P-value of $1.110 \times 10^{-16}$. These SNPs, respectively situated in the ARHGAP26 and SPATA13 genes, showcase the most robust relationships with the right and left hippocampus among the identified SNPs. Specifically, the ARHGAP26 gene, associated with SNP rs853169, has been observed to exhibit expression in a distinct subset of hippocampal neurons, as highlighted by Jarius et al. (2015) and Uhlén et al. (2015).
Unlike FVGWAS, which fails to identify any SNPs with P-values below $7.655 \times 10^{-8}$, FPLS-DC has discovered 142 significant SNPs linked to the right hippocampus and 257 to the left hippocampus. I-FPLS-DC, on the other hand, has identified 222 significant SNPs associated with the right hippocampus and 220 with the left hippocampus. Moreover, a Venn diagram in Supplementary Fig. S5 illustrates that FPLS-DC and I-FPLS-DC share 55 significant SNPs connected to the left hippocampus and 60 to the right hippocampus. The overlap of significant SNPs detected by both methods is detailed in Supplementary Tables S2 and S3 (left hippocampus) and Supplementary Tables S4 and S5 (right hippocampus). Within these overlaps, genes such as FAM53B, ADAMTS18, and Tll-1, among others, are implicated in hippocampal function and have been previously mentioned in literature such as Zhao et al. (2019), Van der Meer et al. (2020), Zhu et al. (2019), Tamura et al. (2005), and other pertinent studies. Remarkably, I-FPLS-DC is capable of identifying SNPs with P-values below $10^{-15}$, a sensitivity not shared by FPLS-DC or FVGWAS.
In addition, we provide Manhattan plots (Fig. 1) created using FPLS-DC, I-FPLS-DC, and FVGWAS to highlight significant SNPs. Supplementary Fig. S3a focuses on the influence of the most notable SNPs, rs9321028 and rs11245347, identified by FPLS-DC, on the left and right hippocampus. Supplementary Fig. S3b depicts the effect of the most significant SNPs, rs7998301 and rs853169, discovered by I-FPLS-DC, on the left and right hippocampus. It is clear from Supplementary Fig. S3 that I-FPLS-DC pinpoints subregions linked to the most significant SNPs similarly to FPLS-DC. Supplementary Fig. S4 demonstrates that our proposed methods achieve notably lower P-values when testing significant SNPs compared to those yielded by FVGWAS. Furthermore, we examined the polygenic effects of each active set derived from individual chromosomes (Supplementary Table S6) and conducted a pathway enrichment analysis (Supplementary Fig. S6) utilizing the Gene Ontology analysis approach. Supplementary Fig. S6 shows that FPLS-DC and I-FPLS-DC jointly identify genes associated with 'axon extension involved in axon guidance,' 'juxtaparanode region of axon,' and so on. These pathways are related to the human hippocampus through their roles in neural circuit development and function.

Conclusion
Our research introduces two powerful techniques, FPLS-DC and I-FPLS-DC, for assessing genetic markers in high-dimensional imaging data. These methods use advanced statistical approaches to handle complex relationships between genetic and imaging data. FPLS-DC uses a thorough screening process and permutation testing, while I-FPLS-DC uses an iterative splicing technique for nonlinear model analysis. Our findings show these strategies efficiently handle ultrahigh dimensional imaging genetic data, expanding the scope of nonlinear genetic analysis.
To enhance computational efficiency, future work could focus on developing faster approximations for estimating the null distributions of FPLS-DC and I-FPLS-DC, speeding up whole-genome analyses of whole-brain data. In addition, extending these methods to handle more complex models, such as graph, longitudinal, and Riemannian manifold models, would enable researchers to explore intricate interactions between genetic variants and imaging phenotypes. Furthermore, while currently applied to MRI, FPLS-DC and I-FPLS-DC have the potential to extend to other imaging data types like CT, PET, and EEG. This extension would broaden the range of data available, providing a more comprehensive understanding of the genetic basis of complex disorders, such as schizophrenia.