FunctanSNP: an R package for functional analysis of dense SNP data (with interactions)

Abstract

Summary: Densely measured SNP data are routinely analyzed but pose challenges due to their high dimensionality, especially when gene–environment interactions are incorporated. In the recent literature, a functional analysis strategy has been developed that treats dense SNP measurements as realizations of a genetic function and can "bypass" the dimensionality challenge. However, the lack of portable and user-friendly software hinders the practical utilization of these functional methods. We fill this gap and develop the R package FunctanSNP. This comprehensive package encompasses estimation, identification, and visualization tools and has undergone extensive testing on both simulated and real data, confirming its reliability. FunctanSNP can serve as a convenient and reliable tool for analyzing SNP and other densely measured data.

Availability and implementation: The package is available at https://CRAN.R-project.org/package=FunctanSNP.


Methods
To study the associations between quantitative traits and dense molecular measurements, in particular SNPs, we adopt techniques from functional data analysis. We comprehensively consider both partially functional linear and interaction analysis models. It is noted that, with minor modifications, the analysis techniques and software can also be applied to other dense/functional variables. The partially functional linear model, referred to as SNPlm, was first proposed in Fan et al. (2013) for testing the associations between quantitative traits and genetic variants. SNPlm treats densely measured genetic variants as realizations of smooth stochastic functions, and genetic variant functions (GVFs) are then estimated.
An association analysis is then conducted on the GVFs. Interaction analysis has been routinely conducted with SNPs and other molecular data. We advance the SNPlm model and introduce a partially functional interaction model with local sparsity, named SNPinter. This model inherits the advantages of SNPlm and can effectively "bypass" the curse of dimensionality of parametric interaction models. A challenge here is the "main effect, interaction" selection hierarchy, which postulates that, if an interaction effect is selected as important, the corresponding main genetic effect also needs to be selected. This hierarchy has been well studied under parametric modeling. However, its functional counterpart has received limited attention.

SNPgvf
SNPgvf conducts ordinary least squares smoothing and models the genetic variants of an individual as a genetic variant function.
Consider a dataset with $n$ independent individuals, where individual $i$ is sequenced in a genetic region containing $m_i$ variants. Assume that the $m_i$ variants have been ordered according to their physical locations $t_{i1} < \cdots < t_{im_i}$. To simplify notation, we normalize the region $[\min_{1\le i\le n}\{t_{i1}\}, \max_{1\le i\le n}\{t_{im_i}\}]$ to $[0, T]$. Denote the genotypes of the $m_i$ variants as $G_i = (g_i(t_{i1}), \ldots, g_i(t_{im_i}))^\top$, where $g_i(t_{ij}) \in \{0, 1, 2\}$ represents the number of minor alleles of the $i$th individual at the $j$th variant located at $t_{ij}$, $j = 1, \ldots, m_i$.
With functional data modeling, the genotypes $G_i = (g_i(t_{i1}), \ldots, g_i(t_{im_i}))^\top$ are modeled as a discrete realization of a genetic variant function, denoted as $X_i(t)$, $t \in [0, T]$. Three discrete realization patterns for $X_i(t)$ and $G_i$ have been considered (Fan et al., 2013):
1. Additive effects of minor alleles: $X_i(t_{ij}) = g_i(t_{ij})$.
2. Dominant effects of minor alleles: $X_i(t_{ij}) = 1$ if $g_i(t_{ij}) \ge 1$, and $0$ otherwise.
3. Recessive effects of minor alleles: $X_i(t_{ij}) = 1$ if $g_i(t_{ij}) = 2$, and $0$ otherwise.
To estimate $X_i(t)$ from the discrete realizations $\mathbf{X}_i = (X_i(t_{i1}), \ldots, X_i(t_{im_i}))^\top$, we adopt least squares smoothing with a suitable set of basis functions, such as B-spline, Fourier, and wavelet functions (Wang et al., 2016). Specifically, with a set of basis functions $\phi_l(t)$, $l = 1, \ldots, L_X$, we generate the $m_i \times L_X$ matrix $\Omega_i$, whose $(j, l)$th entry is $\phi_l(t_{ij})$, $j = 1, \ldots, m_i$, $l = 1, \ldots, L_X$. Subsequently, $X_i(t)$ can be approximated as
$$X_i(t) \approx \phi(t)^\top \left(\Omega_i^\top \Omega_i\right)^{-1} \Omega_i^\top \mathbf{X}_i,$$
where $\phi(t) = (\phi_1(t), \ldots, \phi_{L_X}(t))^\top$. According to the three discrete realization patterns, the estimated GVFs are classified as additive, dominant, and recessive.
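The least squares smoothing step can be illustrated with a short, self-contained sketch. Note that this is in Python rather than R, with a hypothetical helper name `smooth_gvf` and a Fourier basis standing in for the package's basis options; it illustrates the computation, not the package's API.

```python
import numpy as np

def smooth_gvf(t, x, L=7, T=1.0):
    """Least squares smoothing: fit basis coefficients a so that
    Omega @ a approximates the discrete genotype realizations x.
    Assumes L is odd (intercept plus (L-1)/2 sine/cosine pairs)."""
    def basis(tt):
        # Fourier basis evaluated at the observation points (Omega_i in the text)
        cols = [np.ones_like(tt)]
        for l in range(1, (L - 1) // 2 + 1):
            cols.append(np.sin(2 * np.pi * l * tt / T))
            cols.append(np.cos(2 * np.pi * l * tt / T))
        return np.column_stack(cols)
    Omega = basis(t)
    a, *_ = np.linalg.lstsq(Omega, x, rcond=None)   # OLS basis coefficients
    return lambda tt: basis(np.atleast_1d(tt)) @ a  # the estimated GVF X_i(t)

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 100))
x = rng.integers(0, 3, 100).astype(float)  # genotypes in {0, 1, 2} (additive coding)
gvf = smooth_gvf(t, x)                     # a smooth function of t
```

A signal that lies exactly in the basis span is recovered essentially exactly, which is a convenient sanity check for the fit.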

SNPlm
This model adopts the aforementioned functional modeling strategy. It can accommodate low-dimensional covariate effects when associating genetic variations with a quantitative trait. For subject $i (= 1, \ldots, n)$, let $y_i$ denote the quantitative trait and $Z_i = (z_{i1}, \ldots, z_{iq})^\top$ denote the covariates. Consider the following functional linear model:
$$y_i = \alpha + Z_i^\top \gamma + \int_0^T X_i(t)\beta(t)\,dt + \varepsilon_i,$$
where $\alpha$ is the intercept, $\gamma = (\gamma_1, \ldots, \gamma_q)^\top$ is a $q \times 1$ vector of regression coefficients, $\beta(t)$ is a coefficient function describing the association between the genetic variants and the trait, and $\varepsilon_i$ is an error term following a normal distribution with mean zero and finite variance $\sigma_e^2$. For $\beta(t)$, we also adopt the basis expansion technique. With a set of basis functions $\psi(t) = (\psi_1(t), \ldots, \psi_{L_\beta}(t))^\top$ defined on the domain $[0, T]$, $\beta(t) \approx \psi(t)^\top b$, where $b = (b_1, \ldots, b_{L_\beta})^\top$ are the unknown basis coefficients. Note that the basis functions used for $\beta(t)$ can be different from those used for $X_i(t)$. With the basis expansions, SNPlm can be rewritten as
$$y_i = \alpha + Z_i^\top \gamma + H_i^\top b + \varepsilon_i,$$
where $H_i = \int_0^T X_i(t)\psi(t)\,dt$.
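To illustrate the reduction to an ordinary linear model, the following minimal Python sketch computes $H_i$ by numerical integration and recovers $(\alpha, \gamma, b)$ by least squares on noise-free simulated data. All names and the polynomial stand-in basis are illustrative assumptions, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, Lb = 200, 2, 5
grid = np.linspace(0, 1, 101)
w = np.full(grid.size, grid[1] - grid[0])
w[[0, -1]] *= 0.5                                   # trapezoidal-rule weights

# polynomial basis as a simple stand-in for psi(t) (the package uses B-splines etc.)
Psi = np.column_stack([grid**l for l in range(Lb)])  # 101 x Lb
A = rng.normal(1.0, 1.0, (n, Lb))
X = A @ Psi.T                                        # GVFs X_i(t) evaluated on the grid
Z = rng.normal(size=(n, q))
b_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
beta = Psi @ b_true                                  # beta(t) on the grid
y = 0.5 + Z @ np.array([0.4, 0.8]) + (X * beta) @ w  # noise-free responses

# H_i = int_0^T X_i(t) psi(t) dt, computed by the trapezoidal rule
H = (X * w) @ Psi                                    # n x Lb
design = np.column_stack([np.ones(n), Z, H])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, gamma_hat, b_hat = coef[0], coef[1:1 + q], coef[1 + q:]
```

Since the data are generated exactly from the basis-expanded model, the least squares fit recovers the true coefficients, mirroring how SNPlm treats $(Z_i, H_i)$ as ordinary covariates.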

SNPinter
With SNPinter, we advance from the above model and accommodate the interactions between the genetic variant function $X_i(t)$ and the covariates $Z_i$. This is the functional counterpart of the genetic interaction model, in particular including the routinely considered gene–environment interaction model as a special case.

Modeling and estimation
Consider the model:
$$y_i = \alpha + Z_i^\top \gamma + \int_0^T X_i(t)\beta_0(t)\,dt + \sum_{k=1}^q z_{ik}\int_0^T X_i(t)\beta_k(t)\,dt + \varepsilon_i.$$
Here, $\beta_0(t)$ and the $\beta_k(t)$'s, $k = 1, \ldots, q$, are the coefficient functions. The $\beta_k(t)$ functions capture the interactions between $Z_i$ and $X_i(t)$. The error terms $\varepsilon_i$'s are assumed to be independent of both $Z_i$ and $X_i(t)$.
In genetic data analysis, sparsity is usually assumed, under which only a subset of genetic variants is relevant for a specific trait. Here, we consider its functional counterpart and adopt a locally sparse estimation technique for $\beta_0(t)$ and the $\beta_k(t)$'s. Under this estimation, the functions are exactly zero on certain subregions (which are unknown a priori). Accordingly, the corresponding genetic variants make no contribution to the response. To make the results more interpretable, we consider the functional counterpart of the "main effect, interaction" variable selection hierarchy. In particular, if a subregion has $\beta_0(t) = 0$, then all corresponding $\beta_k(t)$ functions for $k = 1, \ldots, q$ are zero. In our analysis, we consider the common scenario where the covariates are manually selected and likely to be important. As such, sparsity in the $\gamma_k$'s is not considered; it can be easily added if needed.
For estimation, determination of the local sparsity structure, and respect of the variable selection hierarchy, we consider the following objective function:
$$\frac{1}{2n}\sum_{i=1}^n \left(y_i - \alpha - Z_i^\top \gamma - \int_0^T X_i(t)\beta_0(t)\,dt - \sum_{k=1}^q z_{ik}\int_0^T X_i(t)\beta_k(t)\,dt\right)^2 + \frac{\kappa}{T}\int_0^T p_{\lambda_2}\left(\|\beta(t)\|_2\right) dt + \sum_{k=1}^q \frac{\kappa}{T}\int_0^T p_{\lambda_1}\left(|\beta_k(t)|\right) dt + \frac{\eta}{2}\sum_{k=0}^q \int_0^T \left(\beta_k''(t)\right)^2 dt, \qquad (5)$$
where $\|\beta(t)\|_2 = \left(\sum_{k=0}^q \beta_k^2(t)\right)^{1/2}$, the $p_{\lambda_j}(\cdot)$'s are penalty functions with tuning parameters $\lambda_j$ (for $j = 1, 2$), $\kappa$ is a modifier that will be discussed below, $\eta$ is a tuning parameter, and $\beta_k''(t)$ is the second order derivative of $\beta_k(t)$ with respect to $t$. In the numerical study, the minimax concave penalty (MCP) is adopted for both $p_{\lambda_1}(\cdot)$ and $p_{\lambda_2}(\cdot)$. It is defined as $p_{\lambda_j}(t) = \lambda_j \int_0^{|t|} \left(1 - x/(\lambda_j \xi)\right)_+ dx$, with $\lambda_j \ge 0$ as a tuning parameter and $\xi > 0$ as a regularization parameter (Zhang, 2010). In the context of locally sparse estimation in functional analysis, penalties similar to $\int_0^T p_{\lambda_1}(|\beta_k(t)|)\,dt$ have been proposed (Lin et al., 2017). To respect the hierarchy, we treat $(\beta_0(t), \beta_1(t), \ldots, \beta_q(t))$ as a "group". The term $\frac{\kappa}{T}\int_0^T p_{\lambda_2}(\|\beta(t)\|_2)\,dt$ determines whether this group of functions has an overall zero effect on a subregion. If not, $\sum_{k=1}^q \frac{\kappa}{T}\int_0^T p_{\lambda_1}(|\beta_k(t)|)\,dt$ determines whether (some of) the $q$ interaction effects are nonzero. Note that no individual penalty is applied directly to $\beta_0(t)$, ensuring that, once the group is selected, the corresponding estimate remains nonzero and the hierarchy is respected.
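Evaluating the MCP integral above gives a simple closed form: $p_{\lambda}(t) = \lambda|t| - t^2/(2\xi)$ for $|t| \le \lambda\xi$, and $\lambda^2\xi/2$ otherwise. A minimal Python sketch of the penalty and its derivative (illustrative helper names, not part of the package):

```python
import numpy as np

def mcp(t, lam, xi=6.0):
    """MCP: lam * integral_0^|t| (1 - x/(lam*xi))_+ dx, in closed form."""
    t = np.abs(t)
    return np.where(t <= lam * xi,
                    lam * t - t**2 / (2.0 * xi),  # quadratic part
                    0.5 * xi * lam**2)            # constant beyond lam*xi

def mcp_deriv(t, lam, xi=6.0):
    """Derivative with respect to |t|: (lam - |t|/xi)_+, zero beyond lam*xi."""
    t = np.abs(t)
    return np.maximum(lam - t / xi, 0.0)
```

Unlike the lasso penalty, the MCP derivative vanishes for $|t| > \lambda\xi$, so large effects are left unpenalized, which reduces estimation bias.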
To simplify computation, we adopt the B-spline expansion technique and approximate $\beta_k(t)$ using a set of B-spline basis functions $\psi(t) = (\psi_1(t), \ldots, \psi_{L_\beta}(t))^\top$ of degree $d$. These basis functions are defined over $M_n$ equally spaced knots in the interval $[0, T]$, where $L_\beta = M_n + d$. The approximation is $\beta_k(t) \approx \psi(t)^\top \theta_k$, where $\theta_k$ contains the coefficients of the B-spline basis functions. Denote $\theta$ as the vector stacking all of the $\theta_k$'s. We set the modifier in (5) as $\kappa = M_n$, so that each penalty integral, scaled by $\kappa/T$, is approximated by a sum of the penalty evaluated over the $M_n$ knot intervals. For the smoothness penalty, let $W$ be the $L_\beta \times L_\beta$ matrix whose $(i, j)$th entry is $\int_0^T \psi_i''(t)\psi_j''(t)\,dt$; then $\int_0^T (\beta_k''(t))^2\,dt = \theta_k^\top W \theta_k$. When incorporating the discrete realizations $\mathbf{X}_i$'s into the objective function (6), SNPinter performs partially functional interaction regression analysis with local sparsity by minimizing the penalized objective function. Let $U = (u_1, \ldots, u_n)^\top$ be an $n \times L_\beta$ matrix whose $(i, j)$th entry is $\int_0^T X_i(t)\psi_j(t)\,dt$, and let $V$ be the $n \times q_n$ matrix with $i$th row $(u_i^\top, z_{i1}u_i^\top, \ldots, z_{iq}u_i^\top)$, where $q_n = (q + 1) \times L_\beta$. With these definitions, the objective function can be more concisely and intuitively expressed as
$$\frac{1}{2n}\left\|y - \alpha \mathbf{1}_n - Z\gamma - V\theta\right\|_2^2 + \sum_{m=1}^{M_n} p_{\lambda_2}\left(\|\beta(t_m^*)\|_2\right) + \sum_{k=1}^q \sum_{m=1}^{M_n} p_{\lambda_1}\left(|\beta_k(t_m^*)|\right) + \frac{\eta}{2}\,\theta^\top \left(I_{q+1} \otimes W\right)\theta, \qquad (7)$$
where $t_m^*$ denotes a point in the $m$th knot interval and $\beta_k(t_m^*) = \psi(t_m^*)^\top \theta_k$.

Computation
We develop an iterative algorithm based on the local quadratic approximation (LQA) technique, which optimizes objective function (8) by approximating the penalty terms through their first and second derivatives at a given point. At the $(s+1)$th iteration, with the estimate $\theta_k^{(s)}$ from the $s$th iteration, we have the local quadratic approximation for the penalty term $p_{\lambda_1}(\cdot)$:
$$p_{\lambda_1}\left(|\beta_k(t)|\right) \approx p_{\lambda_1}\left(|\beta_k^{(s)}(t)|\right) + \frac{p_{\lambda_1}'\left(|\beta_k^{(s)}(t)|\right)}{2\,|\beta_k^{(s)}(t)|}\left(\beta_k^2(t) - \left(\beta_k^{(s)}(t)\right)^2\right),$$
which is quadratic in $\theta_k$ up to a term $G_0(\theta_k^{(s)})$ that does not depend on $\theta_k$. Similarly, for the penalty term $p_{\lambda_2}(\cdot)$, we have
$$p_{\lambda_2}\left(\|\beta(t)\|_2\right) \approx p_{\lambda_2}\left(\|\beta^{(s)}(t)\|_2\right) + \frac{p_{\lambda_2}'\left(\|\beta^{(s)}(t)\|_2\right)}{2\,\|\beta^{(s)}(t)\|_2}\left(\|\beta(t)\|_2^2 - \|\beta^{(s)}(t)\|_2^2\right),$$
where $G_1(\theta^{(s)})$, the remainder, is a function of $\theta^{(s)}$ that does not depend on $\theta$. As such, we can obtain the LQA approximation of the sparse group penalty as a quadratic form $\frac{1}{2}\theta^\top \Sigma(\theta^{(s)})\,\theta + G(\theta^{(s)})$, where $\Sigma(\theta^{(s)})$ is a $q_n \times q_n$ block diagonal matrix built from the penalty derivative weights for each $k = 0, 1, \ldots, q$, and $G(\theta^{(s)})$ collects $G_0(\theta_k^{(s)})$'s and $G_1(\theta^{(s)})$. Let $\omega = (\alpha, \gamma^\top, \theta^\top)^\top$, let $D$ be the $n \times p_n$ matrix with $i$th row $(1, Z_i^\top, u_i^\top, z_{i1}u_i^\top, \ldots, z_{iq}u_i^\top)$, where $p_n = 1 + q + q_n$, and define the $p_n \times p_n$ block diagonal matrix $\Lambda(\omega^{(s)}) = \mathrm{diag}\left(0_{q+1},\ \Sigma(\theta^{(s)}) + \eta\, I_{q+1} \otimes W\right)$, where $0_{q+1}$ is a $(q+1) \times (q+1)$ matrix with all elements being 0. The overall objective function at the $(s+1)$th iteration is
$$Q\left(\omega\,\middle|\,\omega^{(s)}\right) = \frac{1}{2n}\left\|y - D\omega\right\|_2^2 + \frac{1}{2}\,\omega^\top \Lambda(\omega^{(s)})\,\omega + G(\omega^{(s)}),$$
where $y = (y_1, \ldots, y_n)^\top$. Setting the first order derivative of $Q(\omega\,|\,\omega^{(s)})$ with respect to $\omega$ equal to 0, we have
$$\omega^{(s+1)} = \left(\frac{1}{n}D^\top D + \Lambda(\omega^{(s)})\right)^{-1}\frac{1}{n}D^\top y.$$
Overall, the proposed computational algorithm is summarized as follows:
Algorithm 1 Computational algorithm for SNPinter in (8).
1: Initialize $\omega^{(0)}$.
2: Repeat: for $s = 0, 1, 2, \ldots$, update $\omega^{(s+1)}$ as above, until convergence, where the criterion for convergence is that the norm of the difference between two consecutive estimates is smaller than a pre-specified threshold.
3: The final estimates $\hat\alpha$, $\hat\gamma$, and $\hat\beta_k(t) = \psi(t)^\top \hat\theta_k$ can then be obtained from $\hat\omega$.
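The LQA iteration can be illustrated on the simpler problem of MCP-penalized least squares without the functional structure. The sketch below (Python, hypothetical function name, not the package's solver) follows the same approximate-then-solve pattern: quadratic weights from the penalty derivative, a closed-form linear-system update, and the norm-difference convergence criterion.

```python
import numpy as np

def lqa_mcp(X, y, lam, xi=6.0, n_iter=100, tol=1e-8, eps=1e-6):
    """LQA for MCP-penalized least squares:
    minimize (1/2n)||y - X theta||^2 + sum_j p_lam(|theta_j|)."""
    n, p = X.shape
    XtX, Xty = X.T @ X / n, X.T @ y / n
    theta = np.linalg.solve(XtX + eps * np.eye(p), Xty)  # ridge-like start
    for _ in range(n_iter):
        a = np.abs(theta)
        dp = np.maximum(lam - a / xi, 0.0)            # MCP derivative p'(|theta_j|)
        Sigma = np.diag(dp / (a + eps))               # LQA quadratic weights
        theta_new = np.linalg.solve(XtX + Sigma, Xty) # closed-form update
        if np.linalg.norm(theta_new - theta) < tol:   # convergence criterion
            theta = theta_new
            break
        theta = theta_new
    return theta
```

Coefficients at the origin receive very large quadratic weights and are driven toward zero, while coefficients beyond $\lambda\xi$ are left essentially unpenalized, which is the mechanism behind the locally sparse estimates.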

Tuning parameter selection
As argued in the literature, the value of $M_n$ is not crucial, as the smoothness of the estimates is primarily controlled by the roughness penalty, as opposed to the number of knots. In our numerical study, we set $M_n = 71$. Following published studies (Shi et al., 2014; Wu et al., 2020), we set $\lambda_2 = \sqrt{q+1}\,\lambda_1$ and $\xi = 6$, and subsequently conduct a grid search to determine the optimal values of $(\eta, \lambda_1)$ based on prediction performance.
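A prediction-based grid search of this kind can be sketched generically. In the toy Python example below, ridge regression stands in for the penalized estimator purely for illustration; a validation set is held out and the tuning value with the smallest prediction error is kept.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 10
X = rng.normal(size=(n, p))
theta = np.array([2.0, -1.5] + [0.0] * (p - 2))
y = X @ theta + rng.normal(scale=0.5, size=n)

# hold out a validation set and pick the tuning parameter by prediction error
tr, va = slice(0, 100), slice(100, None)
best = (np.inf, None)                                  # (validation MSE, lam)
for lam in np.logspace(-3, 1, 20):                     # grid of candidate values
    th = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(p), X[tr].T @ y[tr])
    err = np.mean((y[va] - X[va] @ th) ** 2)
    if err < best[0]:
        best = (err, lam)
```

For $(\eta, \lambda_1)$ the same scheme runs over a two-dimensional grid; each candidate pair is fitted on the training portion and scored on the held-out portion.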

Software development
The FunctanSNP package contains two main functions, SNPlm and SNPinter, designed for performing partially functional linear and interaction analysis, respectively. Additionally, to facilitate a better understanding of the models/approaches and future numerical studies, we have also developed a function for generating simulated data, with which users can specify the number of genetic variants m and the number of samples n. The simulated data generation process is described in Section 3. With simple modifications, users will be able to generate data under other models. The simulation function simX is illustrated as follows.

SNPlm
When using the SNPlm function, the quantitative trait y needs to be organized as a vector, the covariates z need to be organized as a matrix, and the genetic variants need to be specified by a position vector location and a corresponding genotype matrix X. For the genetic variant function $X_i(t)$ and the genetic effect function $\beta(t)$, users can choose different types (type) and numbers (nbasis) of basis functions, where the suffix 1 corresponds to the relevant parameters for $X_i(t)$ and 2 corresponds to those for $\beta(t)$. Below, we provide an example to demonstrate how to use these options.

Simulation
We conduct simulations to gain a deeper understanding of the analysis and "validate" the functions in the package.

SNPlm
We generate data from the following model:
$$y_i = \alpha + Z_i^\top \gamma + \int_0^1 X_i(t)\beta(t)\,dt + \varepsilon_i,$$
where the scalar covariates $z_{ik}$ for $k = 1, 2$ are independently generated from the standard normal distribution, with the corresponding coefficient vector $\gamma = (0.4, 0.8)^\top$. Additionally, $\varepsilon_i$ follows a normal distribution $N(0, 0.1)$. We then generate the sequence genotypes $\mathbf{X}_i = (X_i(t_{i1}), \ldots, X_i(t_{im_i}))^\top$ by matching each observation to the closest genotype value, $X_i(t_{ij}) = \arg\min_{v \in \{0,1,2\}} |v - X_i^*(t_{ij})|$. Here, $X_i^*(t)$ is generated as $X_i^*(t) = \sum_j a_{ij}\phi_j(t)$, where $a_{ij}$ follows a normal distribution with mean 1 and standard deviation 1, and each $\phi_j(t)$ is a B-spline basis function of order 4 with 71 equally spaced knots. For the genetic effect function $\beta(t) = \beta_0(t)$, a prespecified smooth function is adopted, and the response is generated from the above model. In all simulations, we consider two cases: (a) Case I, where $m_i = m$ and $0 < t_1 < t_2 < \cdots < t_m = 1$ with $m = 100$ equally spaced points on $(0, 1]$; and (b) Case II, with 35% random missing values for each $X_i(t)$ based on Case I.
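The genotype generation described above can be sketched as follows. This Python sketch uses a cosine basis and nearest-genotype rounding as illustrative assumptions; it is not the package's exact simX implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 100
t = np.linspace(0.01, 1.0, m)          # m equally spaced points on (0, 1]

# smooth latent functions X*_i(t) = sum_j a_ij phi_j(t), with a_ij ~ N(1, 1);
# a cosine basis stands in for the order-4 B-splines used in the text
L = 10
Phi = np.column_stack([np.cos(np.pi * j * t) for j in range(L)])  # m x L
A = rng.normal(1.0, 1.0, (n, L))
X_star = A @ Phi.T                      # latent smooth curves on the grid

# match each X*_i(t_j) to the closest genotype value in {0, 1, 2}
X = np.clip(np.round(X_star), 0, 2)

# Case II: 35% missing values at random
mask = rng.random((n, m)) < 0.35
X_missing = np.where(mask, np.nan, X)
```

Because the $a_{ij}$'s are centered at 1, the latent curves hover around the heterozygous value, so all three genotype values occur with nontrivial frequency.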
For comparison, we consider two alternatives: (a) Alt.1 adopts a parametric (as opposed to functional) model; and (b) Alt.2 takes individual genetic variants as covariates and assumes that the effects are functions. It is noted that the advantages of the functional modeling-based approaches have been established elsewhere (based on, for example, extensive numerical comparisons), and that the simulation and comparison here serve the completeness of this study. We evaluate performance using the following criteria: (a) the integrated squared error of $\beta(t)$ (ISE$_\beta$): ISE$_\beta = \frac{1}{l_T}\int \left(\hat\beta(t) - \beta(t)\right)^2 dt$, where $l_T$ is the length of the region; and (b) the root mean squared error of $\gamma$ (RMSE$_\gamma$): RMSE$_\gamma = \|\hat\gamma - \gamma\|_2$. It is observed that SNPlm performs significantly better than the two alternative methods in estimating the genetic effect function $\beta(t)$, and its estimation of the scalar effect $\gamma$ is more stable and consistent.
For comparison, we consider the following alternatives: (a) Alt.1 adopts a parametric interaction model and conducts least squares estimation; (b) Alt.2 adopts a parametric interaction model and applies MCP for selection, which may not respect the selection hierarchy; (c) Alt.3 adopts a parametric interaction model and applies MCP + group MCP for selection, which respects the selection hierarchy; (d) Alt.4 takes individual genetic variants and the corresponding interactions as covariates, assumes that the genetic main and interaction effects are functions, and applies a smoothness penalty of the same form as the third penalty term in (5); (e) Alt.5 adopts the same model as Alt.4 and applies a smoothness penalty and a functional MCP penalty of the same forms as the third and first penalty terms in (5); (f) Alt.6 adopts the same model as Alt.4 and the same penalty as the proposed approach; (g) Alt.7 adopts the same model as the proposed approach and only the smoothness penalty; and (h) Alt.8 adopts the same model as the proposed approach and only the smoothness and functional MCP penalties.
SNPlm can be effectively viewed as a "traditional" linear model, with $(Z_i, H_i)$ as the covariates and $(\gamma, b)$ as the coefficient vectors. We can obtain the estimates $(\hat\gamma, \hat b)$ via standard linear regression techniques and then estimate the genetic effect function by $\hat\beta(t) = \psi(t)^\top \hat b$.

Figure 1 :
Figure 1: Simulated data for genotypes of two subjects. Left: discrete realizations. Right: genetic variant functions.
(b) Integrated squared errors of the $\beta_k(t)$'s over their nonnull regions: ISE$_k = \frac{1}{l_{I_k^c}}\int_{I_k^c}\left(\hat\beta_k(t) - \beta_k(t)\right)^2 dt$, where $l_{I_k^c}$ is the length of the nonnull region $I_k^c$ of $\beta_k(t)$; (c) root mean squared error of $\gamma$ (RMSE$_\gamma$): RMSE$_\gamma = \|\hat\gamma - \gamma\|_2$; (d) average proportion of null regions that are correctly identified (PNRCI), which is the functional counterpart of the true negative rate in parametric variable selection; and (e) average proportion of nonnull regions that are falsely identified (PNRFI), which is the functional counterpart of the false negative rate in parametric variable selection.
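The two identification criteria can be computed on a dense grid. The sketch below (Python, with a hypothetical helper name) evaluates PNRCI and PNRFI for a toy locally sparse function and an over-shrunk estimate.

```python
import numpy as np

def null_region_metrics(beta_true, beta_hat, tol=1e-8):
    """Grid-based counterparts of TNR/FNR for locally sparse estimates:
    PNRCI: share of the true null region estimated as zero;
    PNRFI: share of the true nonnull region (falsely) estimated as zero."""
    true_null = np.abs(beta_true) < tol
    est_null = np.abs(beta_hat) < tol
    pnrci = np.mean(est_null[true_null]) if true_null.any() else np.nan
    pnrfi = np.mean(est_null[~true_null]) if (~true_null).any() else np.nan
    return pnrci, pnrfi

t = np.linspace(0, 1, 1000)
beta_true = np.where(t < 0.5, np.sin(2 * np.pi * t), 0.0)   # null on [0.5, 1]
beta_hat = np.where(t < 0.4, np.sin(2 * np.pi * t), 0.0)    # over-shrunk estimate
```

Here the estimate correctly zeroes out the whole true null region (PNRCI close to 1) but also wrongly zeroes about a fifth of the nonnull region (PNRFI near 0.2).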

Figure 5 :
Figure 5: Estimation of β 0 (t) under Case I and n = 500. In each plot, the black line represents the true function β 0 (t), and the red line represents the mean function of the 500 estimated β 0 (t).

Figure 6 :
Figure 6: Estimation of β 1 (t) under Case I and n = 500. In each plot, the black line represents the true function β 1 (t), and the red line represents the mean function of the 500 estimated β 1 (t).

Figure 7 :
Figure 7: Estimation of β 2 (t) under Case I and n = 500. In each plot, the black line represents the true function β 2 (t), and the red line represents the mean function of the 500 estimated β 2 (t).

Table 3 :
Comparison of selection performance under different sample sizes: mean (sd) based on 500 replicates.

Table 5 :
Comparison of selection performance under different cases: mean (sd) based on 500 replicates.

Table 6 :
Average estimates and prediction errors for the intercept and scalar covariate effects based on 100 random partitions, with sd in parentheses.