Interpretable discriminant analysis for functional data supported on random nonlinear domains with an application to Alzheimer’s disease

Abstract We introduce a novel framework for the classification of functional data supported on nonlinear, and possibly random, manifold domains. The motivating application is the identification of subjects with Alzheimer's disease from their cortical surface geometry and associated cortical thickness map. The proposed model is based upon a reformulation of the classification problem as a regularized multivariate functional linear regression model. This allows us to adopt a direct approach to the estimation of the most discriminant direction while controlling for its complexity with appropriate differential regularization. Our approach does not require prior estimation of the covariance structure of the functional predictors, which is computationally prohibitive in our application setting. We provide a theoretical analysis of the out-of-sample prediction error of the proposed model and explore the finite sample performance in a simulation setting. We apply the proposed method to a pooled dataset from the Alzheimer's Disease Neuroimaging Initiative and the Parkinson's Progression Markers Initiative. Through this application, we identify discriminant directions that capture both geometric and cortical thickness features predictive of Alzheimer's disease that are consistent with the existing neuroscience literature.


Introduction
Functional discriminant analysis, a statistical framework used to predict categorical outcomes from functional predictors, has been extensively studied within the field of Functional Data Analysis (FDA) (Ramsay and Silverman 2015; Ferraty and Vieu 2006; Horváth and Kokoszka 2012; Hsing and Eubank 2013) and has motivated a large body of literature (see, e.g., James and Hastie 2001; Müller 2005; Preda 2007; Delaigle and Hall 2012; Dai et al. 2017; Berrendero et al. 2018; Kraus and Stefanucci 2019; Park et al. 2021). However, most of the existing methods are concerned with the classification of functions supported on one-dimensional linear domains, which can be a limiting assumption in many modern biomedical applications (Zhu et al. 2023). On the other hand, recent work on the analysis of functional data with manifold structure has mostly focused on vector-valued functions with non-Euclidean constraints in the image space (see, e.g., Su et al. 2014; Dai and Müller 2018; Lin et al. 2017; Dubey and Müller 2020; Kim et al. 2021).
In this paper, motivated by the analysis of modern multi-modal imaging data, we propose a novel functional discriminant analysis model that can handle functional predictors supported on nonlinear sample-specific manifold domains, which we term Functions on Surfaces (FoSs) (Lila and Aston 2020). An example of such 'object data' (Marron and Dryden 2021) is provided in Figure 1A, illustrating our motivating application of identifying subjects with Alzheimer's disease from FoSs that are subject-specific cortical surfaces coupled with cortical thickness measurements. The statistical analysis of these object data poses unique statistical challenges. This is mainly due to the non-Euclidean structure of each individual domain, which makes it difficult to define appropriate spatial regularization, and due to the more abstract non-Euclidean structure of the latent space where the random domains are supported, which further invalidates traditional linear statistical models (Kendall 1984; Grenander and Miller 1998; Dryden and Mardia 2016; Younes 2019).

Figure 1. Panel A: M_i is a two-dimensional manifold encoding the geometry of the cerebral cortex, and z_i : M_i → R is a real function, supported on M_i, describing cortical thickness (in mm). Panel B: linear representation (v_i, x_i) of each FoS (M_i, z_i) shown in Panel A. Here v_i : R^3 → R^3 is a vector-valued function encoding the geometry of the ith individual, depicted as a collection of 3D vectors {v_i(p_j)} for a dense set of points {p_j} ⊂ R^3; for clarity, the function v_i is displayed only on half of its domain R^3. The function x_i : M → R describes the spatially normalized thickness map of the ith individual on the fixed template M. Panel C: the FoS (φ_{v_i}(M), x_i ∘ φ_{v_i}^{-1}) parametrized by the associated functions (v_i, x_i) in Panel B; this is a close approximation of the FoS (M_i, z_i) in Panel A.
Current approaches in the literature do not comprehensively address these challenges. Although various methods have been proposed to model functional data on multi-dimensional domains, they often focus on flat domains (Goldsmith et al. 2014; Wang and Zhu 2017; Kang et al. 2018; Feng et al. 2020; Yu et al. 2021) or assume nonlinear but fixed domains (Chung et al. 2015; Chung et al. 2021; Lila et al. 2016; Mejia et al. 2020).
There has also been considerable work on the simpler setting of random surfaces that are not coupled with functional data. These efforts can be broadly grouped into three main approaches. The first leverages global parametrizations to represent surfaces, employing either an L^2 metric (Chung et al. 2008; Epifanio and Ventura-Campos 2014; Ferrando et al. 2020) or a non-Euclidean metric (Jermyn et al. 2012; Jermyn et al. 2017; Kurtek and Drira 2015; Zhang et al. 2022). The second, which is more closely related to the one adopted in this work, uses diffeomorphic deformation functions of the surfaces' embedding space (Vaillant et al. 2004; Younes 2019; Arguillère et al. 2016), allowing for the inclusion of topological constraints. The third, prevalently used in neuroimaging studies, employs pre-specified or spectrum-based descriptors of shape (Reuter et al. 2006; Im et al. 2008; Wachinger et al. 2015; Hazlett et al. 2017; Wang and Wang 2017; Dong et al. 2019; Dong et al. 2020). A critical drawback of the latter approach is the inability to uniquely map the discrete representations back to the original space of random surfaces.
The statistical analysis of random surfaces that are coupled with functional data has not been extensively explored. One exception is the model in Zaetz and Kurtek (2015), which focuses on annotated surfaces. In addition, unsupervised models have been investigated by Charlier et al. (2017) and Lila and Aston (2020). Lee et al. (2017) have dealt with the classification problem by employing the fshape framework (Charlier et al. 2017) to represent the data and by using a linear discriminant model on the resulting finite-dimensional representations. In contrast, the statistical framework for the discriminant analysis of FoSs presented in this paper avoids any dimension reduction of the predictors and instead employs spatial penalties to regularize the discriminant direction. It tackles the non-Euclidean nature of the latent space of random domains by defining appropriate linear functional representations of FoSs, effectively reframing the problem of classifying FoSs as the problem of classifying bivariate functional data supported on general, but fixed, domains. A direct model for the estimation of a functional discriminant direction is then defined on the representation space, where differential spatial regularizations are introduced to produce interpretable and well-defined estimates.
A key feature of the proposed representation is its invertibility, which enables us to map estimates from the representation space back to the original space.This allows us to explore and interpret the estimated classification rule in the context of the original neurobiological objects of our motivating application.
The rest of the paper is organized as follows. In Section 2, we describe the representation model adopted to parametrize the non-Euclidean space of FoSs using linear function spaces. In Sections 3 and 4, we develop a novel discriminant analysis model on the parametrizing linear function spaces and provide theoretical guarantees for the prediction performance of the proposed model. We introduce an efficient discretization approach in Section 5 and apply the proposed model to the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Parkinson's Progression Markers Initiative (PPMI) datasets in Section 6. Proofs and simulations are left to the appendices.

Functional data supported on general random domains
The data considered in this work are a sample of triplets {(g_i, M_i, z_i) : i = 1, ..., n}, where g_i is a binary label, M_i ⊂ R^3 is a sample-specific closed two-dimensional manifold embedded in R^3, and z_i : M_i → R is a scalar function supported on the geometric object M_i. We refer to the pairs {(M_i, z_i)} as FoSs. We moreover assume that the points on the geometries {M_i} of the observed FoSs are in one-to-one correspondence across subjects.
In Figure 1A, we display three observations of the training sample of our final application, where g_i encodes the disease state of the subject, M_i encodes the geometry of the cerebral cortex, and z_i encodes the cortical thickness map supported on M_i. Our goal is to build a classifier from the given training sample that can predict the binary label g* of a previously unseen FoS (M*, z*).

Linear functional representations
In our motivating application, the geometric objects {M_i : i = 1, ..., n} are topologically equivalent to a sphere and, therefore, do not display holes or self-intersections. To inform our model of such physical non-Euclidean constraints, we define a convenient unconstrained representation model for the FoSs {(M_i, z_i) : i = 1, ..., n} in terms of objects belonging to linear function spaces.
Let M be a template two-dimensional manifold embedded in R^3 that is topologically equivalent to a sphere. We denote by L^2(M) the space of square integrable functions over M, equipped with the standard inner product ⟨·, ·⟩_{L^2(M)} and norm ∥·∥_{L^2(M)}, and denote by L^2(R^3, R^3) the space of square integrable vector-valued functions from R^3 to R^3. Let V(R^3) ⊂ L^2(R^3, R^3) be a Reproducing Kernel Hilbert Space (RKHS) of smooth functions with compact support in R^3. We then introduce a diffeomorphic operator φ such that φ_v : R^3 → R^3 is a diffeomorphic function for every choice of v ∈ V(R^3) (Younes 2019). We denote by φ_v(M) the geometric object given by displacing every point p ∈ M ⊂ R^3 to the new location φ_v(p). A direct consequence of φ_v being diffeomorphic is that φ_v(M) is topologically equivalent to a sphere for every choice of v ∈ V(R^3). The construction of the diffeomorphic operator adopted in this paper and the computation of the deformation functions v_i, chosen so that φ_{v_i}(M) closely approximates M_i, are detailed in Appendix A.1. Next, we use the estimated v_i and the observed z_i to define the spatially normalized function x_i = z_i ∘ φ_{v_i} : M → R, so that each FoS is represented by the pair (v_i, x_i), where v_i ∈ V(R^3) and x_i ∈ L^2(M).
We can depict the representation model introduced as follows: given a FoS (M_i, z_i), we can compute a loss-less representation (v_i, x_i) as described earlier, and vice-versa, given the representation (v_i, x_i), we can compute the associated FoS (φ_{v_i}(M), x_i ∘ φ_{v_i}^{-1}) through equation (2). Hence, the pair of functions (v_i, x_i) provides us with a linear representation of the original FoS (M_i, z_i), where every geometric object M_i is modeled as a (diffeomorphic) deformation of the template, i.e. φ_{v_i}(M), while the associated function z_i is given by 'transporting' the spatially normalized function x_i onto M_i with such a deformation. The approach described allows us to recast the original non-Euclidean problem of learning a classifier from the training sample {(g_i, M_i, z_i) : i = 1, ..., n} as the problem of learning a classifier from {(g_i, v_i, x_i) : i = 1, ..., n}, where v_i ∈ V(R^3) and x_i ∈ L^2(M) are two functional predictors both belonging to linear function spaces. In Figure 1B, we show the functional linear representations associated with the three FoSs in Figure 1A. Crucially, the representation mapping employed here is 'invertible', meaning that any pair of estimates (β_G, β_F) ∈ V(R^3) × L^2(M), such as the 'direction' that optimally discriminates between two classes, can be mapped back to the original space of FoSs using equation (2). This mapping defines the associated trajectories of FoSs in equation (5), where φ_{c_1 β_G}(M) is guaranteed to be topologically equivalent to a sphere, thereby satisfying the physical constraints of the problem considered.
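The inverse direction of this representation, mapping a pair (v_i, x_i) back to a FoS, can be sketched in a few lines of code. The sketch below is a deliberate small-deformation caricature: it adds a single displacement field to the template vertices, whereas the paper's diffeomorphic operator φ_v is obtained by flowing along a velocity field. The displacement field `v` and the toy data are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def deform_template(template_vertices, v):
    """Move each template vertex p to p + v(p).

    Small-deformation caricature only: the paper's operator phi_v is a
    diffeomorphic flow, not a single additive displacement.
    """
    return template_vertices + v(template_vertices)

def represent_fos(template_vertices, v, x_values):
    """Map a linear representation (v, x) to a FoS (phi_v(M), z).

    x_values[j] is the normalized thickness at template vertex j; since
    vertex j is moved to its deformed location, z = x o phi_v^{-1} simply
    carries the same value along with the vertex.
    """
    return deform_template(template_vertices, v), x_values

# Toy template vertices and an assumed smooth displacement field.
M_template = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
v = lambda p: 0.1 * np.sin(p)            # illustrative displacement field
x = np.array([2.4, 2.7, 2.5])            # normalized thickness values (mm)

geometry, z = represent_fos(M_template, v, x)
```

Because the value of z at a deformed vertex equals the value of x at the corresponding template vertex, the thickness array is carried along unchanged; only the geometry moves.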
In contrast to methods that require computing shape features, such as the spectrum of the Laplace-Beltrami operator (Reuter et al. 2006; Wachinger et al. 2015), the approach adopted here provides us with interpretable discriminant directions in the space of the original neurobiological objects, as described by equation (5). In addition, unlike approaches that work with global parametrizations of FoSs (see, e.g., Chung et al. 2008; Epifanio and Ventura-Campos 2014; Zaetz and Kurtek 2015; Ferrando et al. 2020), the representation model used in this study is independent of the imaging data type, as long as we can specify how φ_{v_i} deforms our objects and a suitable similarity measure. Therefore, the framework proposed in this work has the potential to accommodate additional types of data, such as streamlines generated from diffusion tensor images, where there may not be a one-to-one correspondence across subjects, but for which optimal transport similarity measures have been developed (Feydy et al. 2017). Although our work assumes that the FoSs are in one-to-one correspondence, this is not strictly necessary. Assuming a one-to-one correspondence simplifies the definition of a similarity measure and makes it easier to compute these representations for complex objects such as cortical surfaces, as detailed in Appendix A.1. In Section 6, we compare the performance of our representation model with alternative models in the context of our motivating application.

Discriminant analysis on the parametrizing linear function spaces
The aim of Sections 3 and 4 is to provide methodology for learning a linear classifier, from the training data {(g_i, v_i, x_i) : i = 1, ..., n} displayed in Figure 1B, by introducing a novel functional classification model that has the following crucial characteristics:
• It does not rely on Functional Principal Components Analysis (FPCA), or related dimension reduction methods, to reduce the dimension of the functional predictors, bypassing the intrinsic assumption that the discriminant direction is well represented by the space spanned by the first few unsupervised PC functions.
• It can be applied to bivariate, and possibly multivariate, functional predictors, each supported on a different domain.
• It allows for explicit spatial regularization of the estimates on potentially nonlinear manifold domains, yielding well-defined and interpretable estimates.
• It provides a direct approach to estimating the discriminant directions without relying on prior computation of the covariance structure, which is prohibitive in our application setting.

Linear discriminant analysis over general domains
We begin by focusing on the sub-problem of defining a classifier for the training data {(g_i, x_i)}, i.e., for the spatially normalized functional predictors supported on the fixed nonlinear manifold M. In Section 4, we then extend the proposed model to account for the geometric component v_i in an additive fashion. Assume the training set {(g_i, x_i)} consists of n independent copies of (G, X), a pair of random variables with X a zero-mean random function taking values in L^2(M) and G a binary random variable such that P(G = 1) = π_1 and P(G = 2) = π_2. Let µ_1 = E[X | G = 1] and µ_2 = E[X | G = 2] denote the conditional means of X and assume µ_1 ≠ µ_2. Moreover, let C(p, q) = E[X(p) X(q)], p, q ∈ M, denote the covariance function of X and assume this is square integrable, i.e., ∫_M ∫_M C(p, q)^2 dp dq < ∞. For a real, symmetric, square-integrable, and non-negative function R ∈ L^2(M × M), let the integral operator L_R : L^2(M) → L^2(M) be defined as (L_R f)(p) = ∫_M R(p, q) f(q) dq. Consequently, L_C denotes the covariance operator of X, which is a compact self-adjoint operator and therefore admits the spectral representation L_C = Σ_{k≥1} θ_k e_k ⊗ e_k in terms of the eigenvalues θ_1 ≥ θ_2 ≥ ... ≥ 0 and associated eigenfunctions e_1, e_2, ... ⊂ L^2(M) of L_C. Let L_C^{-1} denote the linear inverse covariance operator. Assuming Σ_{k≥1} θ_k^{-2} ⟨µ_2 − µ_1, e_k⟩²_{L^2(M)} < ∞, define the population quantity β_0 ∈ L^2(M) such that L_C β_0 = µ_2 − µ_1, i.e. β_0 = L_C^{-1}(µ_2 − µ_1). Note that this is an assumption on the underlying population quantities and will not have practical implications. However, it allows us to have a unique well-defined variable β_0 to study the convergence properties of the proposed model. For a discussion of the case where this series diverges, which is related to the perfect classification phenomenon, see Delaigle and Hall (2012), Berrendero et al. (2018), Chen and Jiang (2018), and Kraus and Stefanucci (2019).
The function β_0 can be understood as a functional analog of the multivariate discriminant vector of a linear discriminant analysis (Shin 2008). For a new observation with predictor x* ∈ L^2(M), it can be used to predict the associated label g* with the linear classification rule ⟨β_0, x*⟩_{L^2(M)} > c_th, with c_th an appropriately chosen threshold. Moreover, if X is a Gaussian random function within each group in G, it can be shown that the function β_0 defines the linear classifier that minimizes the misclassification error rate, and it is therefore optimal (Delaigle and Hall 2012). The discriminant direction β_0 can also be equivalently defined as the minimizer of the functional ⟨L_C β, β⟩_{L^2(M)} − 2⟨µ_2 − µ_1, β⟩_{L^2(M)}. In practice, the population quantities C, µ_1, and µ_2 are unknown and need to be estimated from the data. The goal of a classification model is therefore to recover β_0 from the training sample {(g_i, x_i) : i = 1, ..., n} of n independent copies of (G, X). This can be achieved by using the sample covariance L_Ĉ, with Ĉ(p, q) = (1/n) Σ_{i=1}^n x_i(p) x_i(q), p, q ∈ M, and the sample conditional means µ̂_1 and µ̂_2 to replace the population counterparts in equation (7). An estimate β̂ of β_0 can then be defined as a minimizer of ⟨L_Ĉ β, β⟩_{L^2(M)} − 2⟨µ̂_2 − µ̂_1, β⟩_{L^2(M)} + P(β), where a penalty term P(β) is typically added to overcome the ill-posedness of the minimization problem, which is due to the low-rank structure of L_Ĉ. For instance, Park et al. (2021) define a penalty P that encourages the estimate β̂ to be smooth and sparse, while Kraus and Stefanucci (2019) consider a ridge-type penalty.
The functional discriminant model in equation (8) requires precomputing the empirical covariance function, which is generally not possible for dense functional data supported on multidimensional domains and is ultimately not feasible in our application setting. We therefore propose a direct regularized approach to estimating β_0. This will be possible thanks to the following simple observation. As noted, for instance, in Delaigle and Hall (2012), the discriminant direction β_0 can be equivalently characterized as the solution to the minimization problem min_{β ∈ L^2(M)} E[(Y − ⟨X, β⟩_{L^2(M)})²], where Y is an auxiliary scalar random variable such that Y = −1/π_1 if G = 1 and Y = 1/π_2 otherwise. In other words, the classification problem considered can be reformulated as a functional regression problem. This motivates the adoption of a least-squares approach to estimating β_0, based on the empirical counterpart of the objective function in equation (9), where an additional differential regularization term is introduced to incorporate information on the geometric domain M and overcome the ill-posedness of the problem. For multivariate data, analogous least-squares formulations have also been adopted, for instance, in Hastie et al. (1994), Mai et al. (2012), and Gaynanova (2020).
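The equivalence between the discriminant direction and a least-squares fit under the auxiliary coding of Y can be verified numerically in a finite-dimensional analog. A minimal sketch with simulated two-class Gaussian data standing in for the functional predictors (all data are illustrative): with the coding y_i ∈ {−n/n_1, n/n_2}, the least-squares coefficient on centered predictors is exactly proportional to the classical LDA direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-class Gaussian sample: a finite-dimensional stand-in for (G, X).
n1, n2, d = 60, 40, 5
X1 = rng.normal(size=(n1, d)) + np.array([1.0, 0.5, 0.0, 0.0, 0.0])
X2 = rng.normal(size=(n2, d))
X = np.vstack([X1, X2])
n = n1 + n2

# Auxiliary response: y_i = -n/n1 for class 1, y_i = n/n2 for class 2
# (the empirical counterpart of Y = -1/pi_1, 1/pi_2); note sum(y) = 0.
y = np.concatenate([np.full(n1, -n / n1), np.full(n2, n / n2)])

# Least-squares coefficient on centered predictors.
Xc = X - X.mean(axis=0)
beta_ls = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

# Classical LDA direction: pooled within-class covariance vs mean difference.
Sw = ((n1 - 1) * np.cov(X1, rowvar=False)
      + (n2 - 1) * np.cov(X2, rowvar=False)) / (n - 2)
beta_lda = np.linalg.solve(Sw, X2.mean(axis=0) - X1.mean(axis=0))

# The two directions coincide up to a positive scale factor.
cos = beta_ls @ beta_lda / (np.linalg.norm(beta_ls) * np.linalg.norm(beta_lda))
```

The exact proportionality follows from the decomposition of the total scatter into within-class and rank-one between-class parts, so the cosine between the two directions equals one up to floating-point error.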

Regularized estimation
Given the training sample {(g_i, x_i)}, introduce a scalar variable y_i such that y_i = −n/n_1 if g_i = 1 and y_i = n/n_2 otherwise, where n_1 and n_2 represent the sample sizes of classes 1 and 2, respectively. Observe that −n/n_1 and n/n_2 are estimates of the values that can be taken by the random variable Y. Let W^2(M) be the Sobolev space of functions in L^2(M) with first and second distributional derivatives in L^2(M). We define an estimate β̂ ∈ W^2(M) of the population quantity β_0 as the solution to the following minimization problem

β̂ = arg min_{β ∈ W^2(M)} (1/n) Σ_{i=1}^n (y_i − ⟨x_i, β⟩_{L^2(M)})² + λ J(β),

where the first term is a least-squares estimate of the objective function in equation (9) and the second term is a differential regularization term. The parameter λ controls the trade-off between the least-squares term of the objective function and the penalty term. Our choice of the regularization term is

J(β) = ∫_M (∆_M β)² dp + ε ∫_M β² dp,

with ε ≥ 0. This is a linear combination of two terms. The first is based on the Laplace-Beltrami operator ∆_M : W^2(M) ⊂ L^2(M) → L^2(M) and quantifies the smoothness of the function β : M → R on the nonlinear manifold domain M. Specifically, it allows the model estimate β̂, at any fixed point p ∈ M, to borrow strength from the other points on M while constraining the 'information' to propagate coherently with the nonlinear manifold structure of the anatomical object M. The second term is a generic shrinkage-type regularization.
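After discretization, this penalized least-squares problem reduces to solving a linear system. A minimal sketch on a flat 1-D grid, with a squared second-difference penalty standing in for the Laplace-Beltrami term and all data simulated (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Each row of A holds a predictor x_i sampled on an s-point grid over [0, 1];
# the inner product <x_i, beta> is approximated by h * (A @ beta).
n, s = 80, 50
h = 1.0 / (s - 1)
grid = np.linspace(0.0, 1.0, s)
signal = np.sin(2 * np.pi * grid)            # direction separating the classes
labels = rng.integers(1, 3, size=n)          # classes 1 and 2
A = rng.normal(size=(n, s)) + np.where(labels == 1, -0.5, 0.5)[:, None] * signal

n1, n2 = np.sum(labels == 1), np.sum(labels == 2)
y = np.where(labels == 1, -n / n1, n / n2)   # auxiliary response coding

# Squared second-difference penalty: a flat-domain stand-in for the
# Laplace-Beltrami term; on a mesh this comes from mass/stiffness matrices.
L = (np.diag(np.full(s, -2.0)) + np.diag(np.ones(s - 1), 1)
     + np.diag(np.ones(s - 1), -1)) / h**2
D = h * (L.T @ L)                            # approximates int (lap beta)^2

def fit(lam):
    # Normal equations of (1/n) sum_i (y_i - h*(A beta)_i)^2 + lam * beta' D beta.
    lhs = (h * A).T @ (h * A) / n + lam * D
    return np.linalg.solve(lhs, (h * A).T @ y / n)

roughness = lambda b: float(b @ D @ b)
beta_small, beta_large = fit(1e-9), fit(1e-3)
```

Consistent with the role of λ as a smoothness trade-off, the roughness J of the exact minimizer is non-increasing in λ, which the two fits above illustrate.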
It is worth noting that the function space L^2(M) is linear, even though each function f ∈ L^2(M) is supported on a nonlinear domain. For Euclidean domains, it is common to define a smooth subspace of L^2(M) by forming an RKHS from a positive-definite kernel. However, constructing a positive-definite kernel that is compatible with the geometry of a manifold is a non-trivial task (Feragen et al. 2015; Jayasumana et al. 2015). To overcome this challenge, we constructively define a Sobolev norm J^{1/2}(·) and an associated Sobolev space W^2(M) by leveraging a local differential operator, namely the Laplace-Beltrami operator. The discrete counterpart of this local operator is a sparse matrix, reducing our problem to sparse linear algebra and enabling us to solve equation (10) for the large data of our final application. We provide more details about the relationship of our approach with the RKHS approach in Section 4.2.

Theory
The aim of this section is to provide theoretical guarantees for the performance of the proposed model. Specifically, we provide a probability bound for the out-of-sample risk, i.e., the random variable

E*[(⟨X*, β̂⟩_{L^2(M)} − ⟨X*, β_0⟩_{L^2(M)})²],

where X* is a copy of X that is independent of the training data and E* is the expectation taken over X*. Equation (12) measures the discrepancy between the prediction made with the estimated parameter β̂ and the 'optimal' prediction made with the unknown population quantity β_0.
Assume for simplicity that ε > 0. Then, thanks to the Sobolev embedding theorem (Brezis 2011), there exists M ≥ 0 such that |f(p)| ≤ M J^{1/2}(f) for any f ∈ W^2(M) and p ∈ M; that is, the evaluation operator is a continuous functional. A direct consequence is that the space W^2(M) is an RKHS with a symmetric, positive definite kernel function K_M : M × M → R (Berlinet and Thomas-Agnan 2004). The kernel function K_M is used only for theoretical investigation here; obtaining its explicit form is, in general, not computationally feasible and not necessary. We will, however, take advantage of standard properties of the associated integral operator L_{K_M} (Cucker and Smale 2002). For ε = 0, the functional J^{1/2} defines a semi-norm rather than a norm, and similar arguments hold by restricting ourselves to the subspace of L^2(M) that is orthogonal to the null space of J^{1/2}.
Next, we define the sandwich operator T = L_{K_M}^{1/2} L_C L_{K_M}^{1/2} (Cai and Yuan 2012) and make the following assumptions.
Assumption 3.1 allows us to use a Hoeffding-type inequality for Hilbert space valued random elements and has no practical implications. This condition is met, for example, when ∥X∥_{L^2(M)} is bounded. However, it is more general, given that L_{K_M}^{1/2} X represents a smoothed version of X. Assumption 3.2 guarantees that the population quantity β_0 is well-defined and belongs to the space of smooth functions W^2(M). Assumption 3.3 is expressed in terms of properties of the effective dimension D(·). For our final choice of λ, it is straightforward to check that this assumption holds by assuming that the eigenvalues {τ_k} of T decay as τ_k ≍ k^{−2r}, with r > 1/2. This is a typical assumption in the literature on functional linear models (Cai and Yuan 2012) and is related to the rate of decay of the eigenvalues of L_{K_M} and L_C, and their alignment.
The following theorem provides an upper bound for the out-of-sample risk.
Theorem 3.1. Under Assumptions 3.1-3.3, if λ ≍ n^{−1/(1+θ)}, the estimator β̂ in equation (10) satisfies E*[(⟨X*, β̂⟩_{L^2(M)} − ⟨X*, β_0⟩_{L^2(M)})²] = O_P(n^{−1/(1+θ)}).

Similar rates of convergence have been shown to hold for regularized estimates in the functional linear regression setting (see, e.g., Cai and Yuan 2012; Tong and Ng 2018; Sun et al. 2018; Reimherr et al. 2018). However, a key difference in our model is that the residual random variable ε = Y − ⟨X, β_0⟩_{L^2(M)} and the functional predictor X are not independent, which prevents the direct application of such results. Theorem 3.1 therefore shows that, in spite of such a dependence structure, we are nevertheless able to recover the functional linear model rates of convergence. The proof is provided in Appendix C.

Nonlinear extensions
To incorporate nonlinearity into the model described in equation (10), one can substitute the term ⟨x_i, β⟩_{L^2(M)} with a nonlinear function of the data, such as a polynomial term or a single-index model, as done in the context of functional regression models in Yao and Müller (2010) and Jiang and Wang (2011), respectively. However, these extensions come at the cost of estimating additional functional parameters or optimizing a more complex objective function.
If the covariance structures of the two classes are believed to differ, the proposed functional linear discriminant model can be generalized to an approximate functional quadratic discriminant model, following the approach proposed by Gaynanova and Wang (2019). Specifically, we estimate the discriminant rule by minimizing, with respect to β_1, β_2 ∈ L^2(M), a penalized least-squares objective with class-specific discriminant directions, where λ_1, λ_2 > 0 are tuning parameters. A modified version of Fisher's criterion (Gaynanova and Wang 2019) is then employed to assign the class by first projecting the data along the estimated directions.
As expected, the simulations presented in Appendix B demonstrate that the approximate functional quadratic discriminant model outperforms the functional linear discriminant model when the covariance structures of the two classes differ. Examining the theoretical properties of this extension is beyond the scope of this paper and is left to future work.

Additive multivariate generalizations
We now consider a bivariate extension of the functional model introduced in Section 3, which incorporates the geometric component of the original data. We therefore consider the training sample {(g_i, v_i, x_i) : i = 1, ..., n}, where, as before, v_i ∈ V(R^3) and x_i ∈ L^2(M). Recall that V(R^3) is an RKHS of smooth functions with compact support in R^3. Moreover, denote by K_{R^3} its reproducing kernel.
Suppose the training set {(g_i, v_i, x_i)} consists of n independent copies of (G, V, X), a triplet of random variables with V a zero-mean random function taking values in V(R^3), X a zero-mean random function taking values in L^2(M), and G a binary random variable such that P(G = 1) = π_1 and P(G = 2) = π_2.
We now adopt the multivariate functional data notation from Happ and Greven (2018) and define the multivariate functional predictor X := (V, X), taking values in the product space H := V(R^3) × L^2(M), whose elements f = (f^{(1)}, f^{(2)}) are pairs of functions supported on the domains D_1 = R^3 and D_2 = M, with D = D_1 × D_2. Here f^{(j)}, with j ∈ {1, 2}, denotes the jth functional component of the multivariate function f. For p, q ∈ D, define the matrix of covariances C(p, q) = E(X(p) ⊗ X(q)) with elements C_lj(p_l, q_j) = E[X^{(l)}(p_l) X^{(j)}(q_j)], where p_l ∈ D_l, q_j ∈ D_j, l ∈ {1, 2}, and j ∈ {1, 2}. Denote the conditional means of X by µ_1 and µ_2 and assume µ_1 ≠ µ_2. Similar to the univariate case, we assume that the population quantity β_0 ∈ H is well-defined and satisfies the equation L_C β_0 = µ_2 − µ_1. This can be viewed as a multivariate generalization of the linear discriminant direction defined in the previous section. We now turn to the problem of defining an estimator for β_0.

Table 1: Function spaces, norms, and kernels associated with the two components of the estimate.

Estimate | Function space | Norm | Kernel
β̂_F : M → R | W^2(M) | J^{1/2}(·) | K_M (implicitly defined)
β̂_G : R^3 → R^3 | V(R^3) | ∥·∥_{V(R^3)} | K_{R^3} (explicitly defined)

Regularized estimation
Let the variable y_i be such that y_i = −n/n_1 if g_i = 1 and y_i = n/n_2 otherwise. We define the multivariate functional estimate β̂ = (β̂_G, β̂_F) of the population quantity β_0 to be the solution to the following minimization problem

(β̂_G, β̂_F) = arg min_{β_G ∈ V(R^3), β_F ∈ W^2(M)} (1/n) Σ_{i=1}^n (y_i − ⟨v_i, β_G⟩_{L^2(R^3, R^3)} − ⟨x_i, β_F⟩_{L^2(M)})² + λ_1 ∥β_G∥²_{V(R^3)} + λ_2 J(β_F),

with λ_1, λ_2 tuning parameters. Equation (15) extends the model proposed in Section 3 to account for both the geometric functional descriptor v_i and the function x_i in an additive fashion. The regularization terms in the equation enforce smoothness on the functional estimates β̂_G and β̂_F in their respective function spaces.
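In discretized form, the additive bivariate model amounts to stacking the two predictor blocks and applying a block-diagonal penalty, one block per regularization term. A minimal sketch with simulated basis-coefficient matrices; the identity and squared-difference penalties below are stand-ins for the RKHS-norm and Sobolev penalties, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Basis coefficients of the two functional predictors: rows of A_g stand in
# for the geometric descriptors v_i, rows of A_f for the normalized maps x_i.
n, s_g, s_f = 60, 30, 40
A_g = rng.normal(size=(n, s_g))
A_f = rng.normal(size=(n, s_f))
labels = rng.integers(1, 3, size=n)
n1, n2 = np.sum(labels == 1), np.sum(labels == 2)
y = np.where(labels == 1, -n / n1, n / n2)   # auxiliary response coding

# Block penalties: identity as a stand-in for the RKHS-norm penalty on the
# geometric part, squared second differences for the Sobolev penalty.
P_g = np.eye(s_g)
L = (np.diag(np.full(s_f, -2.0)) + np.diag(np.ones(s_f - 1), 1)
     + np.diag(np.ones(s_f - 1), -1))
P_f = L.T @ L
lam1, lam2 = 1e-2, 1e-2

# Additive predictor <v_i, beta_G> + <x_i, beta_F>: stack the blocks and
# solve one penalized least-squares system with a block-diagonal penalty.
A = np.hstack([A_g, A_f])
P = np.block([[lam1 * P_g, np.zeros((s_g, s_f))],
              [np.zeros((s_f, s_g)), lam2 * P_f]])
lhs, rhs = A.T @ A / n + P, A.T @ y / n
coef = np.linalg.solve(lhs, rhs)
b_g, b_f = coef[:s_g], coef[s_g:]
```

The block-diagonal structure keeps the two regularization terms decoupled while the least-squares term couples the two coefficient blocks through the stacked design matrix.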

Differential regularization and kernel penalty: a unified modeling perspective
In Section 4.1, we have adopted two different approaches to produce smooth estimates β̂_G : R^3 → R^3 and β̂_F : M → R. The smoothness of β̂_F is enforced by means of a penalty J(·) defined in terms of a Sobolev norm, which implicitly defines a kernel K_M. On the other hand, the smoothness of β̂_G is enforced by means of a norm ∥·∥_{V(R^3)}, defined implicitly through the direct definition of a kernel K_{R^3}. For clarity, we summarize the relevant function spaces and associated norms and kernels in Table 1.
From a methodological perspective, the reproducing kernel K_{R^3}(p, q) can be understood as a measure of the influence of the function value at p ∈ R^3 on the function value at q ∈ R^3, and vice-versa. Defining a smooth function space through a kernel has an arguable advantage when it comes to discretizing an infinite-dimensional minimization problem over that function space. In fact, thanks to the well-known representer theorem (Wahba 1990; Yuan and Cai 2010), under mild conditions, its exact solution can be expressed as a linear combination of the elements of an n-dimensional basis, which involves the explicit expression of the kernel.
Hence, it is natural to wonder whether a similar approach could be adopted for β̂_F : M → R. In other words, can we define a smooth real function space on M by explicitly defining a kernel K_M : M × M → R encoding a measure of influence that is coherent with the nonlinear geometry of M? This, however, is a challenging task due to the positive-definiteness property that K_M must satisfy. Consider, for instance, the popular exponential kernel. Its natural extension to the manifold setting is K_M(p, q) = exp(−c d_M(p, q)²), where d_M(p, q) is the geodesic distance between p ∈ M and q ∈ M. Unfortunately, this kernel cannot be guaranteed to be positive definite for a general nonlinear manifold M (Feragen et al. 2015; Jayasumana et al. 2015).
Alternatively, we could try to compute an explicit form of the kernel K_M from J^{1/2}(·) by employing the reproducing property (Wahba 1990; Fasshauer and Ye 2013)

⟨f, K_M(p, ·)⟩_{W^2(M)} = f(p), for all f ∈ W^2(M) and p ∈ M,

where ⟨·, ·⟩_{W^2(M)} is the inner product that induces the norm J^{1/2}(·). However, closed-form solutions to equation (16) are not available in our setting. Approximate solutions could be computed by Finite Elements Analysis (FEA) (Quarteroni 2009), but we would still face the challenge of storing the dense object K_M(·, ·). As described in Section 5, we instead leverage FEA to directly discretize the function β̂_F : M → R in equation (15).
This highlights that the choice between the two modeling approaches is not arbitrary. Arguably, for functional estimates supported on Euclidean spaces, explicitly defining a reproducing kernel is likely the preferred choice, while for general non-Euclidean domains, where defining a reproducing kernel is not trivial, the differential penalization approach is preferable.

Theory
Define the diagonal matrix of reproducing kernels K(p, q) with entries K_11(p_1, q_1) = K_{R^3}(p_1, q_1) and K_22(p_2, q_2) = K_M(p_2, q_2), for p_j, q_j ∈ D_j, j ∈ {1, 2}. Let L_K : H → H be the associated integral operator and, analogously to the univariate functional setting, define the sandwich operator T = L_K^{1/2} L_C L_K^{1/2}. We make the following assumptions, which are analogous to Assumptions 3.1-3.3.
Assumption 4.1. The constant κ_2², defined as κ_2² = ess sup ∥L_K^{1/2} X∥²_H, is finite.
Assumption 4.2. There exists a smooth function β_0 ∈ H satisfying the population equation L_C β_0 = µ_2 − µ_1.
Assumption 4.3. The penalty coefficient λ := λ_1 = λ_2 and the effective dimension D(·) of T satisfy the analog of the condition in Assumption 3.3.
The following theorem, which is an extension of Theorem 3.1, provides an upper bound for the out-of-sample risk.
Theorem 4.1. Under Assumptions 4.1-4.3, if λ ≍ n^{−1/(1+θ)}, the estimator β̂ = (β̂_G, β̂_F) in equation (15) satisfies E*[(⟨X*, β̂⟩_H − ⟨X*, β_0⟩_H)²] = O_P(n^{−1/(1+θ)}), where X* is a copy of X that is independent of the training data and E* is the expectation taken over X*.

Discretization
Consider a triangle mesh, denoted by M_T, formed by the union of a finite set of triangles T. These triangles share a common set of s vertices, denoted ξ_1, ..., ξ_s. Let M_T be an approximate representation of the manifold M. We then introduce the linear finite element space W_T, consisting of globally continuous functions over M_T that are affine within each triangle τ in T. The space W_T is spanned by the Finite Element (FE) basis ψ_1, ..., ψ_s, where ψ_l(ξ_j) = 1 if l = j, and ψ_l(ξ_j) = 0 otherwise. In Figure 2, we show one element of this basis. Moreover, define ψ as the vector-valued function ψ(·) = (ψ_1(·), ..., ψ_s(·))^T. Our goal is to find an approximate solution β̂_{F,T} of the form given in equation (18). Let now M and S be the sparse mass and stiffness s × s matrices defined as (M)_{jj′} = ∫_{M_T} ψ_j ψ_{j′} and (S)_{jj′} = ∫_{M_T} ∇_{M_T} ψ_j · ∇_{M_T} ψ_{j′}, where ∇_{M_T} is the gradient operator on the mesh M_T. For β_{F,T} of the form given in equation (18), the penalty term J(·) can be approximated as (c_F)^T D_{M_T} c_F, with D_{M_T} = S M^{−1} S + εM (Lila et al. 2016). Further, following an approach often adopted in FEA (Fried and Malkus 1975; Hinton et al. 1976), we replace the dense matrix M^{−1} with the sparse matrix M̃^{−1}, where M̃ is the diagonal matrix such that M̃_{jj} = Σ_{j′} M_{jj′}. This results in the sparse penalty matrix D̃_{M_T} = S M̃^{−1} S + εM. In practice, each functional observation x_i is also of the form given in equation (18). Therefore, denoting by X the n × s matrix whose rows consist of the basis coefficients of the x_i, the terms {⟨x_i, β_F⟩_{L²(M)}} can be approximated by the entries of the vector X M c_F.
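The assembly and mass-lumping steps can be sketched as follows. This is our own toy illustration on a flat two-triangle mesh, not the paper's code; on a curved surface mesh the same per-triangle formulas are applied within each triangle's plane.

```python
import numpy as np
from scipy import sparse

def assemble(vertices, triangles):
    """Assemble the linear-FE mass matrix M and stiffness matrix S on a
    (here flat, 2D) triangle mesh, by summing standard per-triangle
    contributions."""
    s = len(vertices)
    M = sparse.lil_matrix((s, s))
    S = sparse.lil_matrix((s, s))
    for tri in triangles:
        p = vertices[tri]                               # 3 x 2 corner array
        B = np.column_stack((p[1] - p[0], p[2] - p[0]))
        area = 0.5 * abs(np.linalg.det(B))
        # gradients of the three barycentric (hat) basis functions
        G = np.linalg.inv(B).T @ np.array([[-1.0, 1.0, 0.0],
                                           [-1.0, 0.0, 1.0]])
        M[np.ix_(tri, tri)] += (area / 12.0) * (np.ones((3, 3)) + np.eye(3))
        S[np.ix_(tri, tri)] += area * (G.T @ G)
    return M.tocsr(), S.tocsr()

verts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
tris = np.array([[0, 1, 2], [0, 2, 3]])
M, S = assemble(verts, tris)

# Mass lumping: replace M by the diagonal matrix of its row sums, so that
# the penalty matrix D = S @ M_lumped^{-1} @ S + eps * M stays sparse.
eps = 1e-8
M_lumped_inv = sparse.diags(1.0 / np.asarray(M.sum(axis=1)).ravel())
D = S @ M_lumped_inv @ S + eps * M
```

Without lumping, M^{-1} is dense and so is the resulting penalty matrix; lumping keeps the sparsity pattern essentially that of S, which is what makes the large-s application tractable.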
We now turn our attention to the estimate β̂_G ∈ V(R³). Since for V(R³) we have an explicit form of the associated reproducing kernel K_{R³}, we employ the representer theorem (Wahba 1990; Yuan and Cai 2010) and take β̂_G to be a linear combination of kernel sections, where Σ is an n × n matrix whose entries are the corresponding kernel evaluations.
As a result, the coefficients of the approximate solution of the model in equation (15) are given by the minimizers (ĉ_G, ĉ_F) of the penalized criterion, where y = (y_1, ..., y_n)^T is the vector of auxiliary response variables. It is easy to check that this minimization problem can be equivalently written as an augmented quadratic least-squares problem, where 0 is the zero-vector of length 2s + n. Note that for n ≪ s, which is the setting of our application, the matrix A is sparse. Therefore, the minimization problem (21) can be efficiently solved by conjugate gradients or its variations, e.g., LSQR (Paige and Saunders 1982), without requiring the explicit computation of the high-dimensional normal matrix A^T A, a quantity related to the covariance structure of the functional predictors. An approximate solution to the univariate model in equation (10) follows as a special case of the multivariate case considered here.
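The role of LSQR can be sketched with a generic penalized least-squares analogue. This is our own simplified stand-in (identity penalty root, invented toy dimensions), not the paper's augmented system: the point is only that the solver touches A through matrix-vector products, so A^T A is never formed or stored.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

def penalized_ls(X, y, lam, L):
    """Solve min_c ||y - X c||^2 + lam * ||L c||^2 by stacking the penalty
    root L into an augmented least-squares problem and running LSQR, which
    only needs products A @ v and A.T @ u."""
    A = sparse.vstack([X, np.sqrt(lam) * L]).tocsr()
    b = np.concatenate([y, np.zeros(L.shape[0])])
    return lsqr(A, b, atol=1e-12, btol=1e-12, iter_lim=10_000)[0]

rng = np.random.default_rng(0)
n, s = 40, 500                       # n << s, as in the application
X = sparse.random(n, s, density=0.05, random_state=0, format="csr")
y = rng.normal(size=n)
L = sparse.eye(s)                    # identity penalty root for the demo
c = penalized_ls(X, y, lam=0.5, L=L)
```

With a sparse X and sparse L, the memory cost is that of the nonzero entries of A, rather than the s × s normal matrix.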

Data and preprocessing
We analyze a cohort of n = 484 subjects from the ADNI and PPMI studies. On the basis of the ATN classification scheme (Jack et al. 2016), each subject was assigned to one of two diagnostic categories: C, Control (n_1 = 100), and AD, Alzheimer's Disease (n_2 = 384). Here, we focus on data collected at the baseline visit, which includes, among other imaging modalities, structural T1-weighted MRI.
The T1-weighted images were preprocessed using FreeSurfer (Dale et al. 1999; Fischl et al. 1999). Specifically, white matter, grey matter, and cerebrospinal fluid were segmented and used to extract the outer and inner surfaces of the cerebral cortex. From these two surfaces, we generated a central surface interpolating the midpoints between the outer and inner layers, which offers an accurate representation of the two-dimensional anatomical structure of the cerebral cortex. This representation has the benefit of encoding a notion of distance between brain regions that is neurologically more relevant than the original volumetric representation. The cortical surface can moreover be coupled with one or more maps describing complementary structural or functional properties of the cortex, such as cortical thickness measurements (Fischl and Dale 2000), fMRI signals, or connectivity maps (Smith et al. 2013; Yeo et al. 2011). In this study, we focus on cortical thickness, which is estimated from the distances between the outer and inner surfaces of the cerebral cortex. Next, the n surfaces were registered and sub-sampled.
As a result of the preprocessing stage, we obtain n = 484 triangle meshes {M_i^T} consisting of s = 64K vertices in correspondence across subjects, along with a set of triangular faces defining how these vertices are connected to delineate the surfaces. By classical Generalized Procrustes Analysis (Dryden and Mardia 2016), translation, rigid rotation, and scale were removed from the data, while jointly estimating a template M_T, which is also a triangle mesh with s = 64K vertices in correspondence with those of the individual subjects. Each surface has moreover been coupled with cortical thickness measurements at the mesh vertices, which we model as a real piecewise linear continuous function z_i^T over the mesh M_i^T. The preprocessing stage results in a set of FoSs {(M_i^T, z_i^T), i = 1, ..., n}, which are discretized versions of the continuous idealized objects {(M_i, z_i), i = 1, ..., n} introduced in Section 2. To simplify the notation, we drop the superscript T. Moreover, we denote the diagnostic labels by {g_i ∈ {C, AD}, i = 1, ..., n}. Three examples of such FoSs, and associated diagnostic labels, are given in Figure 1A.
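The alignment step can be sketched with a minimal generalized Procrustes routine. This is our own illustration built on SciPy's `orthogonal_procrustes`, not the pipeline actually used, which operates on 64K-vertex meshes and estimates the template jointly with the alignment.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def gpa(shapes, n_iter=10):
    """Minimal generalized Procrustes sketch: centre each vertex
    configuration, rescale to unit norm, then alternately rotate every
    shape onto the current template and update the template as the mean."""
    aligned = []
    for X in shapes:
        X = X - X.mean(axis=0)                 # remove translation
        aligned.append(X / np.linalg.norm(X))  # remove scale
    template = aligned[0].copy()
    for _ in range(n_iter):
        for i, X in enumerate(aligned):
            R, _ = orthogonal_procrustes(X, template)
            aligned[i] = X @ R                 # remove rigid rotation
        template = np.mean(aligned, axis=0)
    return aligned, template
```

After convergence, `template` plays the role of the estimated mean configuration, and the aligned shapes differ from each other only by non-rigid variation.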
Here, we are interested in using the proposed models to identify subjects with AD from the extracted FoSs. The interpretability of these methods is an important feature. Indeed, while it is crucial to build models with good classification accuracy, it is equally important to describe the estimated relationship between the predictors and the outcome variable, in order to inform subsequent studies and generate data-driven hypotheses about the pathophysiology of AD.

Linear functional representations
For each FoS, we compute a function v_i ∈ V(R³) such that φ_{v_i}(M) closely approximates M_i, where closeness is measured as the sum of Euclidean distances between the corresponding vertices of φ_{v_i}(M) and M_i. As noted in Appendix A.1, alternative definitions of surface similarity could also be used.
We can then transport the function z_i : M_i → R onto the template, defining a continuous piecewise linear function x_i = z_i ∘ φ_{v_i}. This leads to the bivariate functional representation (v_i, x_i), a linear representation of the FoS (M_i, z_i). Further details on the computation of v_i can be found in Appendix A.1.

Discriminant analysis
Our aim is to estimate directions in the parametrizing geometric and thickness spaces that are most predictive of AD. To this end, we apply the model introduced in Section 4.1 to the training data {g_i, v_i − v̄, x_i − x̄}, where v̄ and x̄ are the sample means of {v_i} and {x_i}, and a larger score describes a configuration that is more strongly associated with AD.
Crucially, these linear trajectories on the parametrizing space can be mapped back to the original space of FoSs by using equation (2), defining curved trajectories in that space. We fit the proposed model for different choices of the parameters λ_1 and λ_2. Recall that λ_1 controls the regularity of the geometric discriminant direction and λ_2 that of the thickness discriminant direction, with high values virtually constraining the solution to be the zero function. The final choice of λ_1 and λ_2 is the result of a compromise between classification accuracy on the test set and the consistency of the estimated discriminant directions with the neurodegenerative nature of the disease (see also Discussion). The test Area under the ROC Curve (AUC) of the selected model is 0.7006.

Results
On the left hand side of Figure 3, we show the estimated most discriminant geometric and thickness directions, i.e., β̂_G : R³ → R³ and β̂_F : M → R. These have been estimated by applying the model in equation (15) to the linear representations {(v_i − v̄, x_i − x̄)}. The colormap describing β̂_F illustrates what types of variations, with respect to the population average cortical thickness, are most predictive of AD. Specifically, a thinner cerebral cortex in the blue areas (i.e., lateral temporal, entorhinal, inferior parietal, precuneus, and posterior cingulate cortices) is associated with AD. These results are consistent with the typical thickness signature of AD observed to date (see, e.g., Bondareff et al. 1989; Dickerson et al. 2009; Sabuncu et al. 2011). The geometric component β̂_G is instead a vector field in R³. This is a linear representation of the morphological variations, with respect to the population average cortex geometry, that are associated with AD. While a full understanding of its meaning is only possible by mapping β̂_G back to the space of FoSs, i.e., by examining φ_{v̄+c_1 β̂_G}(M) for different choices of c_1 ∈ R, the magnitude of the vector field β̂_G, at any fixed point, offers a rough indication of the cortical regions whose morphological variations are most relevant to the classification problem.
On the right hand side of Figure 3, we show the FoSs associated with the linear representations β̂_G and β̂_F. These describe the most predictive patterns of AD in terms of the original neurobiological objects. We have circled a specific area of the brain to ease comparison and highlight the morphological patterns that the model deems relevant to the classification problem.

Comparison against alternative approaches
In this section, we compare the test AUC of our proposed classification method with alternative models and evaluate different representation models for FoSs. In addition to the functional linear discriminant model (FLDA) that we propose, we also consider the following alternatives: (i) FPCA+LDA: the geometry-aware FPCA model proposed in Lila et al. (2016), followed by multivariate LDA (Hastie et al. 2009) on the PC scores; (ii) Lasso: a logistic regression model with lasso regularization (Tibshirani 1996); (iii) Ridge: a logistic regression model with ℓ₂ regularization (Hoerl and Kennard 1970); (iv) FQDA: the approximate functional quadratic discriminant model defined in Section 3.3; (v) RF: a random forest model (Breiman 2001); (vi) SVM: a support vector machine with a squared exponential kernel (Cortes and Vapnik 1995); (vii) NN: a multilayer feedforward neural network (Hastie et al. 2009).

Table 2: The test AUC of the classification methods applied to the data of our final application. Four different representation models have been considered: (i) the registered thickness map x_i : M → R without geometric information; (ii) the parametrization h_i : M → R⁴, where the first three components are the surface coordinates, and the last component is thickness; (iii) the registered thickness map x_i : M → R and the first 200 eigenvalues of the Laplace-Beltrami operator computed on the surface M_i; and (iv) the proposed representation model (x_i, v_i). The symbol '−' indicates that although the method could be adapted to accommodate the specific FoS representation model, its implementation is beyond the scope of this paper and is left to future work. For each representation model, the top-performing method is highlighted.
Furthermore, besides the proposed representation model (v_i, x_i) for FoSs, we also consider the following representations: (i) Thickness: spatially normalized thickness maps x_i : M → R without geometric information; (ii) Thickness & Displacement: the parametrizations h_i : M → R⁴, where the first three components are the surface coordinates, and the last component is the (spatially normalized) thickness map; (iii) Thickness & Shape spectrum: the spatially normalized thickness maps x_i : M → R and the first 200 eigenvalues of the Laplace-Beltrami operator computed on the surface M_i, i.e., a spectral representation of shape (Reuter et al. 2006).
To evaluate the listed methods and representation models, we split the dataset into three sets, namely the training set, validation set, and test set, comprising 50%, 20%, and 30% of the data, respectively. While a Monte Carlo evaluation of these methods would be desirable, it is computationally prohibitive, so we defer that analysis to the simulation setting in Appendix B. However, we use the exact same data split for all methods. The models are trained on the training set, hyperparameters are chosen to maximize the AUC on the validation set, and the selected model is tested on the test set, resulting in the AUC scores presented in Table 2. Note that, in contrast to the results presented in Section 6.4 and Figure 3, all hyperparameters of the proposed methods have been chosen to maximize the AUC on the validation set, rather than striking a balance between classification accuracy and the consistency of the estimated discriminant directions with the neurodegenerative nature of the disease. Hence, the test AUC value of the proposed method differs from that in Section 6.4.
For standard multivariate models, we use the values of x_i at the vertices of the template mesh (64K values) and the RKHS coefficients of the estimated v_i (192K coefficients) to construct the data matrix. We have also implemented a variation of the functional linear discriminant model introduced in Section 4, for multivariate functions whose components share a non-linear domain M, in order to accommodate the representation h_i : M → R⁴. The results are shown in Table 2, from which we can make several observations. Firstly, if the goal is to maximize prediction accuracy, then the best-performing model is a lasso-penalized generalized linear model applied to thickness maps and the first 200 eigenvalues of the Laplace-Beltrami operator of the surfaces. However, as mentioned in the introduction, this "lossy" shape representation cannot be mapped back to the original space of neurobiological objects, leading to a less interpretable model. Additionally, our results show that in the context of our application, the representation (v_i, x_i) performs better across all methods than the representation h_i : M → R⁴. The latter appears to be more susceptible to overfitting, resulting in inferior performance even when compared to models that use thickness only. Finally, classification using thickness alone produces satisfactory results. The top-performing models are the proposed FLDA and FQDA, and the ridge logistic regression model. One possible explanation for this is that the registered thickness maps may include some geometric information due to misregistration. While incorporating geometric information into the model may lead to only minor improvements in classification performance, as shown in Figure 4, the estimated discriminant direction can be significantly different between the two models. Although the ground truth is unknown, the estimated discriminant direction when geometric information is included is more consistent with the neurodegenerative nature of the disease, as explained in the next section.

Figure 4: On the left side, we show the discriminant direction derived from applying a ridge logistic regression model to the thickness maps. In the center, we show the discriminant direction resulting from fitting the proposed model in equation (10) to the thickness maps. Although it does not account for subject-specific geometric variations, this model enforces smoothness. On the right side, we have the cortical thickness discriminant direction obtained by fitting the model in equation (15), which explicitly accounts for inter-subject geometric differences. The results of the logistic regression are more difficult to interpret due to the high spatial variability. The model in equation (10) provides more interpretable results thanks to its smoothness penalty, but suggests that a thicker cortex in the red areas is indicative of AD, which is not physiologically plausible. When we explicitly model geometric differences, this evidence seems to disappear. This suggests that there is a non-negligible dependence structure between the predictors modeling geometry and those modeling thickness. Differences that seemed to be related to cortical thickness in the model without the geometric component are now captured by the term that models cortical geometric variations. Furthermore, when we model inter-subject geometric differences, the entorhinal cortex atrophy in the medial temporal lobe is identified as the strongest predictor of AD. This is consistent with pathological findings and staging of early AD (Braak et al. 2006).

Discussion
The results in Figure 3 identify the typical AD thickness signature. Several studies that focus on identifying AD-vulnerable areas include the regions found in our analysis (see, e.g., Bondareff et al. 1989; Dickerson et al. 2009; Sabuncu et al. 2011). However, there is some variability in the estimated regions. For instance, Sabuncu et al. (2011) used a dynamic model and found the strongest changes in the inferior parietal regions and the posterior cingulate. It should be noted that these studies typically consist of massive univariate analyses between the cortical thickness at each vertex, or each parcel, and the diagnostic label. They therefore take a feature-centric perspective on the problem. It is not clear how these findings would generalize to out-of-sample data (Li and Tong 2020).
To demonstrate the importance of modeling cortical geometry, we compare our results to those obtained by fitting a ridge logistic regression model and the proposed model in equation (10), i.e., by discarding inter-subject geometric differences. We compare these estimates in Figure 4. What we observe is that ridge logistic regression yields estimated discriminant directions that are more difficult to interpret, due to the high spatial variability. Except for the entorhinal cortex, the functional model in equation (10) is able to capture the main areas where cortical thinning is associated with AD.
However, this model also suggests that a thicker cortex in certain regions (dark red) is associated with AD, contradicting the neurodegenerative nature of the disease. Interestingly, introducing the geometric component in the model reduces such effects. This may also be caused by the geometric component now capturing systematic misregistration. In order to verify this hypothesis, further validation of the estimated geometric component is required in controlled settings where registration is more reliable, e.g., in the longitudinal setting.

Conclusions
We introduce a framework for the discriminant analysis of functional data supported on random manifold domains, i.e., FoSs. To this aim, we adopt linear representations of these objects that are bivariate functional data belonging to linear spaces. We then define a functional linear classification model on the parametrizing space. Thanks to a penalized least-squares formulation, the proposed model is able to estimate the most discriminant direction in the data without the explicit computation of the covariance function of the predictors or low-rank approximations thereof. This allows us to reduce the memory requirements by five orders of magnitude and ultimately run our model on a standard workstation. The complexity of the solution is controlled by means of differential penalties that are aware of the geometry of the domain where the functional data are supported.
We apply the proposed model to the analysis of modern multi-modal neuroimaging data. Specifically, we estimate interpretable discriminant directions that are able to leverage both geometric and thickness features of the cerebral cortex to identify subjects with AD. Our results are consistent with those in the neuroscience literature.
The proposed model can be applied to several imaging settings that lead to FoS representations, such as musculoskeletal imaging (Gee et al. 2018) or cardiac imaging (Biffi et al. 2018). It is also important to highlight that the proposed model is not a mere generalization of existing models for functional data supported on one-dimensional domains to multidimensional domains. We believe that its application to one-dimensional functional data, where the bivariate representation is given by the registered functions and associated registration maps, leads to a novel classification approach in this simplified setting.

A.2 Computation
In practice, each surface M_i has a computational representation M_i^T that is a triangle mesh with s vertices ξ_1^i, ..., ξ_s^i in correspondence across the n subjects. We use these vertices to perform Procrustes analysis (Dryden and Mardia 2016) and remove translation, size, and rigid rotations from the surfaces M_i^T. In addition, Procrustes analysis yields a template M_T with vertices ξ_1, ..., ξ_s. The representation functions {v_i : v_i ∈ V(R³)}, associated with the surfaces {M_i^T}, are then computed by solving a minimization problem in which the least-squares term ensures that the deformed template φ_{v_i}(M_T) is a close approximation of M_i^T. The term ∥v∥²_{V(R³)} is a regularizing term that encourages the solution to achieve such a close approximation with a 'minimal' deformation. The constant λ is selected by inspecting the solutions on a small subset of the full cohort. To perform the actual computations, we use the MATLAB implementation fshapetk (Charlier et al. 2015; Charlier et al. 2017). Note that if the vertices were not in correspondence, that is, if the surfaces had not been registered beforehand, the proposed framework would still be applicable by replacing the least-squares term in equation (23) with a more general shape similarity measure D(·, ·). An example of such a similarity measure is found in Vaillant and Glaunès (2005) and Vaillant et al. (2007), where the authors use the concept of currents, from geometric measure theory, to represent surfaces.

A.3 Selecting an appropriate kernel
The main requirement for choosing the kernel K_{R³} is that the associated space V(R³) is an admissible space (Younes 2019). Therefore, there must exist a positive constant M such that, for all v ∈ V(R³), the inequality ∥v∥_{1,∞} ≤ M ∥v∥_{V(R³)} holds, where ∥·∥_{1,∞} is the canonical norm of the space C¹(R³). This condition guarantees that φ_v is diffeomorphic for any v ∈ V(R³). Nonetheless, within the set of admissible spaces, the specific choice of the kernel can significantly influence the quality of the estimated representations. We have found the approach proposed in Bruveris et al. (2012) to be effective. This uses a mixture of isotropic Gaussian kernels with different variances, which intuitively allows for a multi-scale representation of the surfaces. In our application, we use a mixture of six Gaussian kernels with variance parameters set to (σ²_1, ..., σ²_6) = (64, 16, 4, 1, 0.25, 0.01). Our choice was informed by visual inspection of the differences between the estimated φ_{v_i}(M_T) and M_i^T.
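The scalar part of such a multi-scale kernel can be sketched as follows. This is our own illustration (the exact bandwidth parametrization used by fshapetk may differ); the relevant property is that a sum of positive-definite Gaussian kernels is itself positive definite, so admissibility is preserved.

```python
import numpy as np

# Variance parameters from the application: coarse scales capture global
# deformation, fine scales capture local surface detail.
VARIANCES = (64.0, 16.0, 4.0, 1.0, 0.25, 0.01)

def multiscale_gauss_kernel(p, q, variances=VARIANCES):
    """Scalar multi-scale kernel at two points of R^3: an (unweighted)
    sum of isotropic Gaussians with the given variances."""
    d2 = np.sum((np.asarray(p, float) - np.asarray(q, float)) ** 2)
    return float(sum(np.exp(-d2 / (2.0 * s2)) for s2 in variances))
```

At coincident points the kernel evaluates to the number of mixture components, and it decays monotonically with the Euclidean distance between p and q.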

A.4 Template estimation
Algorithm 1: Algorithm for template estimation.
Data: the surfaces M_1, ..., M_n and an initial guess for the template M{1} = M_1.
Result: M̄ = M{N_iter}.
for iter = 1, ..., N_iter do
    for i = 1, ..., n do
        v_i{iter}

In this section, we introduce an algorithm designed to estimate a template leveraging the (formal) Riemannian structure of the manifold of diffeomorphisms adopted to model random manifold domains. One approach is to define the template so that the average of the linear representations {v_i}, located on the tangent space at the identity map, is zero. The details of this iterative centroid approach are outlined in Algorithm 1. For an overview of alternative approaches see Cury et al. (2014).
Despite leveraging GPU acceleration for computing RKHS norms and associated gradients, the process of computing v_i for each subject still takes about 40 minutes in our application. This makes the process of estimating the template computationally prohibitive, given the necessity of multiple iterations. Therefore, we have chosen to use a fixed template in our final application.

B Simulations
In this section, we conduct simulations to assess the finite sample classification performance of the proposed model when compared to the models introduced in Section 6.5. Here we focus on the functional univariate setting described in Section 3.
We use a triangle mesh M_T with 642 nodes that is an approximation of a brainstem. On this triangulated surface, we generate the orthonormal functions {v_l : l = 1, 2, ..., 40} consisting of 40 eigenfunctions of the Laplace-Beltrami operator computed on M_T. Then, we generate two sets of smooth functional data supported on M_T, with identical within-group covariance structures, where u_ij and w_ij are zero-mean independent random variables that represent the scores and are distributed according to a normal distribution with variance σ²_j decreasing in j. The function μ is the groups' difference, chosen to be a fixed linear combination of the eigenfunctions {v_l : l = 1, 2, ..., 40}, and its magnitude is controlled by a parameter α > 0 defining the 'difficulty' of the classification problem, that is, the signal-to-noise ratio. We then identify three regimes, with α ∈ {0.2, 0.4, 0.6}. Given a new function x*, the aim is to recover the group this observation belongs to.
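The homogeneous-covariance generating mechanism can be sketched as follows. This is a schematic of our own: a random orthonormal surrogate basis stands in for the Laplace-Beltrami eigenfunctions, and the variance decay σ_j = 1/(1+j) is an arbitrary illustrative choice.

```python
import numpy as np

def simulate_groups(n, basis, alpha, mu_coef, seed=0):
    """Both groups draw independent scores with variances decreasing in j
    on a common orthonormal basis; the group difference is alpha * mu,
    with mu a fixed linear combination of the basis functions.
    `basis` is an (L, s) matrix of basis functions at the mesh vertices."""
    rng = np.random.default_rng(seed)
    L, s = basis.shape
    sigma = 1.0 / (1.0 + np.arange(L))           # decreasing score SDs
    mu = alpha * (mu_coef @ basis)               # group difference
    u = rng.normal(size=(n, L)) * sigma          # group-1 scores
    w = rng.normal(size=(n, L)) * sigma          # group-2 scores
    return u @ basis, mu + w @ basis             # (n, s) function values

# orthonormal surrogate basis (in the paper: Laplace-Beltrami eigenfunctions)
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(642, 40)))
basis = Q.T
x1, x2 = simulate_groups(256, basis, alpha=0.4, mu_coef=rng.normal(size=40))
```

Swapping the score variances of the second group for an increasing sequence yields the heterogeneous-covariance setting considered later in this section.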
We compare the proposed FLDA method against the models introduced in Section 6.5, e.g., FPCA followed by LDA, lasso and ridge logistic regression, random forests, support vector machines, and fully-connected feed-forward neural networks. For standard multivariate models, we use the values of x_i at the vertices of M_T to construct the data matrix. We evaluate the performance of each method for different sample sizes of the training data: n = 128, 256, 512, and 1024. For every n, besides the training set, we generate a validation set of size n and a test set including 20K samples, and repeat the experiment 50 times. To select the hyperparameters of the models, we employ a validation set approach. We summarize the classification performances on the test set in Figure 5.
Figure 5 shows that by increasing α, or the sample size n, the performances of all models tend to improve. This is expected, given that a larger α makes the classification task easier. In addition, a larger sample size allows for a more accurate estimation of the unknown model parameters. In the setting of equal within-class covariance structures, it is well known that the best classifier is linear, so it is not surprising that some of the nonlinear models, e.g., RF, SVM, and NN, show worse performances. In particular, FLDA appears to perform better than the other methods. For large sample sizes, the influence of the regularization terms becomes negligible and the difference in performance starts vanishing.

Next, we explore the performance of the different methods in the setting where the two groups have different within-group covariance structures. Specifically, we generate the data as in equation (24), with u_ij zero-mean independent random variables distributed according to a normal with variance σ²_j, decreasing in j, and w_ij zero-mean independent random variables distributed according to a normal with variance σ²_{40−j}, and therefore increasing in j. The results of the simulations are shown in Figure 6. In this setting, it is well known that the best classifier is not linear. Hence, it is not surprising that the linear methods tend to perform worse than the nonlinear ones and that this difference does not vanish by simply increasing the sample size. Moreover, a larger α leads to better performance across all the tested methods. Specifically, the proposed FQDA model outperforms all the linear models, but other non-linear approaches, such as SVM, perform even better. Importantly, the proposed FLDA model performs best among the linear models even when it is misspecified.
Classification performance is not the only relevant metric. In the application motivating this work, we are particularly interested in accurately estimating the classification rule, which for linear models is fully described by a parameter β. Therefore, in the setting of homogeneous covariances, we compare the performance of the different models with respect to the metric ∥β̂ − β_0∥²_{L²(M)}, with β̂ an appropriately normalized version of the estimate of the true functional parameter β_0. Note that the estimates from a linear discriminant analysis and a logistic regression model can be compared, as they assume the same model but estimate their parameters differently (Hastie et al. 2009). For the standard multivariate methods, β̂ is constructed by interpolating a piecewise linear function to its estimated discrete counterpart. The results, shown in Figure 7, indicate that the proposed FLDA model not only yields more accurate predictions but also more accurate estimates of the underlying functional parameters.

Figure 1 :
Figure 1: Panel A: FoSs of three subjects in the training sample, where g_i ∈ {'C', 'AD'} denotes the disease state of the ith individual (C: Control, AD: Alzheimer's Disease), M_i is a two-dimensional manifold encoding the geometry of the cerebral cortex, and z_i : M_i → R is a real function, supported on M_i, describing cortical thickness (in mm). Panel B: Linear representation (v_i, x_i) of each FoS (M_i, z_i) shown in Panel A. Here v_i : R³ → R³ is a vector-valued function encoding the geometry of the ith individual. This is depicted as a collection of 3D vectors {v_i(p_j)} for a dense set of points {p_j} ⊂ R³. For clarity, the function v_i is displayed only on half of its domain R³. The function x_i : M → R describes the spatially normalized thickness map of the ith individual on the fixed template M. Panel C: FoS (φ_{v_i}(M), x_i ∘ φ_{v_i}^{−1}) parametrized by the associated functions (v_i, x_i) in Panel B. This is a close approximation of the FoS (M_i, z_i) in Panel A.

Figure 2 :
Figure 2: On the left hand side, we show an element of the FE basis {ψ_l : M_T → R, l = 1, ..., s}. This is a scalar affine function within each triangle of the mesh M_T that takes value 1 on a fixed vertex and value 0 on every other vertex. On the right hand side, we show an element of the basis spanning V(R³): a smooth vector-valued function from R³ to R³. The training data {g_i, v_i − v̄, x_i − x̄}, where v̄ and x̄ are the sample means of {v_i} and {x_i}, are a subset of the entire dataset containing 50% of the observations. From the fitting process, we derive the estimates β̂_G : R³ → R³ and β̂_F : M → R. Given a new subject with predictors (v*, x*), these estimates can be used in conjunction with the classification rule ⟨v* − v̄, β̂_G⟩ + ⟨x* − x̄, β̂_F⟩ > c_th to predict the diagnostic label of the new subject. The cut-off level c_th can be chosen by computing sensitivity and specificity on a test set, for different values of c_th, and by selecting the desired level and type of accuracy.

Figure 3 :
Figure 3: On the left hand side, we show the most discriminant geometric and thickness directions as estimated from the linear representations {(v_i − v̄, x_i − x̄)}. These are a vector field β̂_G : R³ → R³, representing the most predictive geometric pattern of AD, and a function β̂_F : M → R, representing the most predictive cortical thickness pattern of AD. For a new FoS, with linear representation (v*, x*), we compute the score ⟨v* − v̄, β̂_G⟩ + ⟨x* − x̄, β̂_F⟩ and predict whether the subject has AD by comparing the score value with a predetermined threshold c_th. On the right hand side, we depict the process of mapping the estimates β̂_G and β̂_F back to the space of FoSs. On the same space, we also pictorially map the classification rule adopted. In the β̂_F figure, the blue regions represent the areas of the cortical surface where a thinner cortex, relative to the population average, is indicative of AD. These are mostly localized in the lateral temporal, entorhinal, inferior parietal, precuneus, and posterior cingulate cortices. The red arrows in the β̂_G figure represent the regions where differences in the morphological configuration of the cerebral cortex, compared to the population average, are most predictive of AD. The specific types of morphological changes can be inspected by comparing the surfaces φ_{v̄−c₁β̂_G}(M) and φ_{v̄+c₁β̂_G}(M) in the right hand side diagram.

Figure 5 :
Figure 5: Results of the simulation study to assess the performance of our proposed method, under the assumption of homogeneous covariances, for various sample sizes (n = 128, 256, 512, 1024) and signal-to-noise ratios (α = 0.2, 0.4, 0.6), where α reflects the strength of the discriminant signal. Prediction accuracy is measured using AUC and the simulations were repeated 50 times for each setting.

Figure 6 :
Figure 6: Results of the simulation study for heterogeneous covariance structures across different sample sizes (n = 128, 256, 512, 1024) and signal-to-noise ratios (α = 0.2, 0.4, 0.6), where α reflects the strength of the discriminant signal. The prediction accuracy was evaluated through AUC and the simulations were repeated 50 times for each setting.

Figure 7 :
Figure 7: Results of the simulation study to compare the performance of the different linear methods considered, using homogeneous covariances, for various sample sizes (n = 128, 256, 512, 1024) and signal-to-noise ratios (α = 0.2, 0.4, 0.6). Here, we measure the performance using the estimation error ∥β̂ − β_0∥²_{L²(M)}, with β̂ an appropriately normalized version of the estimate of the true functional parameter β_0.

Table 1 :
Table summarizing estimates and associated function spaces, norms and kernels.