Machine learning-based quantification for disease uncertainty increases the statistical power of genetic association studies

Abstract
Motivation: Accommodating increasingly large samples is key to identifying associations of genetic variants with Alzheimer’s disease (AD) in genome-wide association studies (GWAS). Accordingly, we aimed to develop a method that incorporates patients with mild cognitive impairment and unknown cognitive status in GWAS using a machine learning-based AD prediction model.
Results: Simulation analyses showed that the weighting imputed phenotypes method increased the statistical power compared to ordinary logistic regression using only AD cases and controls. Applied to real-world data, the penalized logistic method had the highest AUC (0.96) for AD prediction, and the weighting imputed phenotypes method performed well in terms of power. We identified an association ($P < 5.0 \times 10^{-8}$) of AD with several variants in the APOE region and rs143625563 in LMX1A. Our method, which allows the inclusion of individuals with mild cognitive impairment, improves the statistical power of GWAS for AD. We discovered a novel association with LMX1A.
Availability and implementation: Simulation codes can be accessed at https://github.com/Junkkkk/wGEE_GWAS.


Introduction
Alzheimer's disease (AD), the most common cause of dementia, is a multifactorial condition influenced by innate and lifestyle risk factors (2022). The disease is highly heritable, with an estimated 58%-79% ($h^2$) of the liability explained by genetic factors (Gatz et al. 2006), most notably the APOE genotype and more than 75 other loci identified by genome-wide association studies (GWAS) conducted in European, African, and Asian ancestry cohorts (Kang et al. 2021, Kunkle et al. 2021, Bellenguez et al. 2022, Sherva et al. 2022).
Despite the success of AD GWAS conducted in large cohorts of AD cases and controls of European ancestry, the challenge of assembling sufficiently large, well-characterized cohorts has been a major limitation of GWAS in other populations, which have traditionally been more reluctant to participate in genetic research and are less likely to be evaluated by memory disorder diagnosis experts. To alleviate this problem, an extended definition of cases has been utilized to increase the sample size. For instance, some studies have considered both patients with AD and those with mild cognitive impairment (MCI) as cases (Driscoll et al. 2014, Hong et al. 2020). Other studies have included data from the UK Biobank and used an AD "proxy" phenotype based on parental AD status, as well as other convenience sample cohorts in which AD status is determined through self-reporting. Two studies in which proxy AD cases accounted for more than one-half of those considered to have AD identified genome-wide significant (GWS) associations with several loci that were not previously reported (Wightman et al. 2021, Bellenguez et al. 2022).
Even though utilizing the extended definition of disease status increases the statistical power, several problems still exist. First, most patients with cognitive impairment are in the MCI stage, of whom only 10%-15% progress to dementia annually (Plassman et al. 2008). Second, patients with MCI or potential patients with AD assigned as cases are substantially more etiologically heterogeneous than clinic-based AD cases (Escott-Price and Hardy 2022). The loss of power caused by subject misclassification or genetic heterogeneity can be alleviated by using statistical classification methods. For instance, MCI is regarded as an intermediate stage between being cognitively normal (CN) and having AD, and some patients will eventually progress to AD. An AD prediction model with good performance would allow patients with MCI to be incorporated into GWAS on AD. Several approaches for imputing phenotypes have been proposed. Hormozdiari et al. (2016) suggested imputing phenotypes based on a multivariate normal assumption. In addition, a deep-learning-based prediction model has been proposed to impute phenotypes using fundus images (Alipanahi et al. 2021). Both studies confirmed that imputing phenotypes could increase the statistical power of GWAS. However, neither study explicitly considered the imputation accuracy of the phenotypes in GWAS, which differentiates our method. In particular, when imputing inter-phenotypes, the accuracy of the imputed phenotypes differs by subject, and such differences must be considered in GWAS.
Here, we present a new statistical method that incorporates imputation accuracy for MCI and phenotype-unknown subjects. We conducted a GWAS for AD involving individuals with clinically diagnosed MCI and unknown AD status. Our method improves the statistical power of GWAS for AD, and we discovered a novel significant association with rs143625563 in LMX1A.

Method overview
In this study, we assumed that the disease status of subjects with MCI is unknown. To impute the disease status of subjects with missing phenotypes, we developed an AD prediction model using the AD and CN groups. Subsequently, we classified the subjects with missing phenotypes into either the AD or CN group utilizing the developed model. During this process, we calculated weights reflecting the accuracy of the imputed phenotypes for subjects with missing phenotypes. Finally, we conducted a case-control GWAS that included the group with missing phenotypes, using the weighting imputed phenotypes (WIP) method based on weighted generalized estimating equations (wGEE).
To demonstrate the superiority of our method, we considered three statistical models: ordinary logistic regression (LR), GEE, and wGEE. LR is a type of generalized linear model (GLM) and is commonly used in GWAS to analyze the association between genetic variants and disease outcomes. GEE is an extension of GLM that allows modeling of both continuous and categorical outcomes, and it estimates population-averaged effects while accounting for within-group correlations. However, when the response is binary and the subjects are independent, GEE is equivalent to LR. wGEE assigns weights to subjects based on their importance in the regression analysis and can provide an unbiased estimate when the weights are appropriately assigned. For longitudinal data, the weights are usually calculated by estimating the inverse probability of a subject dropping out at the observed time (Fitzmaurice et al. 1995, Preisser et al. 2002). Here, we derived the weights based on the accuracy of the imputed phenotypes.

Notations and disease model
We developed a method based on a dichotomous disease phenotype; however, it can be extended to polytomous or continuous phenotypes. We assumed that there were $n_a$ cases and $n_c$ controls and that the disease status of $n_m$ subjects was missing. The disease status of subject $i$ is denoted by $y_i$, which was coded as 1 and 0 for cases and controls, respectively. For simplicity, we assumed that subjects were ordered by their disease status: the first $n_a$ subjects ($i = 1, \ldots, n_a$) are affected, the following $n_c$ subjects ($i = n_a + 1, \ldots, n_a + n_c$) are unaffected, and the last $n_m$ subjects ($i = n_a + n_c + 1, \ldots, n_a + n_c + n_m$) have missing phenotypes. The probability of subject $i$ being affected is denoted as $p_i$, where $p_i$ was 1 and 0 for the cases and controls, respectively. If the disease status was unknown, $p_i$ was estimated, and $y_i$ for those subjects was coded according to the class (affected or unaffected) into which the prediction model classified them.

Based on these definitions, we considered LR with cases and controls as the ordinary disease model, which is defined as follows:

$$\mathrm{logit}(\mu_i) = Z_i \alpha + G_i \beta,$$

where $\mu_i = E(y_i \mid Z_i, G_i)$, $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_q]$ is a vector of the regression coefficients of the covariates $Z_i$ including an intercept, and $\beta$ is a vector of the regression coefficients for SNP $G_i$.

Statistical methods
Let $w_i$ be the weight for $y_i$, calculated from $p_i$: $w_i$ is 1 for the cases and controls, whereas it ranges from 0.5 to 1 for subjects with missing phenotypes.
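The coding of imputed labels and weights can be illustrated with a minimal Python sketch. The explicit rule is an assumption: the text only states that subjects are classified as affected or unaffected and that their weights lie between 0.5 and 1, which is consistent with a 0.5 threshold on $p_i$ and a weight equal to the predicted probability of the assigned class. The function names are hypothetical.

```python
from typing import List, Tuple


def impute_label_and_weight(p_hat: float) -> Tuple[int, float]:
    """Return (imputed label, weight) for a subject with a missing phenotype.

    Assumed rule (not stated explicitly in the text): the label is 1 when
    p_hat >= 0.5, and the weight is the predicted probability of the
    assigned class, i.e. max(p_hat, 1 - p_hat), which lies in [0.5, 1].
    """
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    label = 1 if p_hat >= 0.5 else 0
    weight = max(p_hat, 1.0 - p_hat)
    return label, weight


def assemble(known_labels: List[int], p_hats: List[float]) -> Tuple[List[int], List[float]]:
    """Stack observed cases/controls (weight 1) with imputed subjects."""
    labels = list(known_labels)
    weights = [1.0] * len(known_labels)
    for p_hat in p_hats:
        label, weight = impute_label_and_weight(p_hat)
        labels.append(label)
        weights.append(weight)
    return labels, weights


if __name__ == "__main__":
    # One case, one control, and two subjects with estimated probabilities.
    print(assemble([1, 0], [0.8, 0.3]))
```

Under this rule, a subject whose estimated probability is close to 0.5 contributes about half as much to the association test as a clinically diagnosed subject.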
Then, we fitted a wGEE to calculate the quasi-score $U(\theta)$ for the parameter $\theta = (\alpha^T, \beta^T)^T$ by incorporating both cases/controls and subjects with missing phenotypes, using a working covariance matrix and a working correlation matrix $R(\eta)$. In this study, an identity matrix was utilized for $R(\eta)$. Maximum likelihood estimates for $\alpha$ and $\beta$ were calculated by solving $U(\hat{\theta}) = U(\hat{\alpha}, \hat{\beta}) = 0$, and Wald statistics were used for both parameters.
Alternatively, we used score statistics, considering $\alpha$ as a nuisance parameter. With $\hat{\alpha}_0$ defined as the solution of $U_\alpha(\theta)|_{\alpha = \hat{\alpha}_0, \beta = 0} = 0$, the quasi-score of $\beta$ was evaluated at $(\hat{\alpha}_0, 0)$, and the generalized score test statistic $T$ for testing $H_0: \beta = 0$ was constructed from it.
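With independent subjects, a binary response, and an identity working correlation, a weighted estimating equation of this kind reduces to the weighted logistic score equation $\sum_i w_i x_i (y_i - \mu_i) = 0$. The following is a minimal sketch of solving it by Newton-Raphson for an intercept and a single SNP; it is an illustration under those simplifying assumptions, not the authors' implementation, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logistic(g: Sequence[float], y: Sequence[int], w: Sequence[float],
                          n_iter: int = 50, tol: float = 1e-10) -> Tuple[float, float]:
    """Solve sum_i w_i * x_i * (y_i - mu_i) = 0 with x_i = (1, g_i).

    Returns (intercept, SNP coefficient). Newton-Raphson on a 2x2 system.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for gi, yi, wi in zip(g, y, w):
            mu = sigmoid(a + b * gi)
            r = wi * (yi - mu)            # weighted residual
            v = wi * mu * (1.0 - mu)      # weighted variance function
            u0 += r
            u1 += r * gi
            h00 += v
            h01 += v * gi
            h11 += v * gi * gi
        det = h00 * h11 - h01 * h01
        # Newton step: (da, db) = H^{-1} U for the symmetric 2x2 matrix H.
        da = (h11 * u0 - h01 * u1) / det
        db = (h00 * u1 - h01 * u0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


if __name__ == "__main__":
    g = [0, 0, 0, 0, 1, 1, 1, 1]
    y = [0, 0, 0, 1, 0, 1, 1, 1]
    print(fit_weighted_logistic(g, y, [1.0] * 8))
    # Down-weighting the discordant carrier changes the estimate.
    print(fit_weighted_logistic(g, y, [1, 1, 1, 1, 0.5, 1, 1, 1]))
```

Setting all weights to 1 recovers ordinary LR, which matches the statement that GEE and LR coincide for independent binary responses.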
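To make the score-test idea concrete, the sketch below computes a weighted score statistic with an empirical (robust) variance for the simplest case, in which the only nuisance parameter is an intercept. This is an illustrative simplification: the analysis described here also adjusts for covariates, and the exact form of $T$ used by the authors is not reproduced. Function names are hypothetical.

```python
import math
from typing import Sequence


def weighted_score_test(g: Sequence[float], y: Sequence[int], w: Sequence[float]) -> float:
    """Score statistic for H0: beta = 0 in an intercept-plus-SNP weighted logistic model.

    Under H0 the weighted intercept-only fit gives mu0 = sum(w*y) / sum(w).
    The score for beta is U = sum_i w_i * (g_i - g_bar) * (y_i - mu0), where
    g_bar is the weighted mean genotype; its variance is estimated empirically.
    """
    sw = sum(w)
    mu0 = sum(wi * yi for wi, yi in zip(w, y)) / sw
    g_bar = sum(wi * gi for wi, gi in zip(w, g)) / sw
    psi = [wi * (gi - g_bar) * (yi - mu0) for gi, yi, wi in zip(g, y, w)]
    u = sum(psi)
    v = sum(p * p for p in psi)
    if v == 0.0:
        raise ValueError("score variance is zero (constant genotype or phenotype)")
    return u * u / v


def chi2_sf_1df(t: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))


if __name__ == "__main__":
    t = weighted_score_test([0, 0, 1, 1], [0, 0, 1, 1], [1.0, 1.0, 1.0, 1.0])
    print(t, chi2_sf_1df(t))
```

A practical advantage of the score form is that the null model is fitted once and reused for every SNP, whereas the Wald form requires a separate fit per variant.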

Estimating the prediction model and affection probability
To estimate the probability of disease status, the following two-step approach was applied. First, prediction models were built with the predictors ($X_i$) of subjects whose disease status was known, using LR, support vector machines (SVMs), random forest (RF), and gradient boosting (GB); the approach can be extended to other machine learning or deep learning algorithms. Predictors can encompass any variables that mediate the SNP-disease relationship, but SNPs should not be directly included as predictors, to prevent potential false-positive issues that may arise from reusing the genetic data during association testing. Next, we selected the model with the highest area under the curve (AUC) under k-fold cross-validation (CV), and the probability of being affected was calculated. For instance, in LR, if the prediction model was $\mathrm{logit}(p_i) = X_i \gamma$ and we let the estimated parameter be $\hat{\gamma}$, then the prediction model was applied to subjects whose disease status was unknown, and the estimated probability was derived as $\hat{p}_i = \exp(X_i \hat{\gamma}) / \{1 + \exp(X_i \hat{\gamma})\}$.
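The two steps can be sketched in a few lines of dependency-free Python: an AUC computed from pairwise comparisons, selection of the model with the highest mean cross-validated AUC, and the logistic predicted probability for a subject with unknown status. The helper names and the example AUC values are hypothetical; this is not the authors' code.

```python
import math
from typing import Dict, List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best_model(cv_auc: Dict[str, List[float]]) -> str:
    """Pick the model name with the highest mean AUC across CV folds."""
    return max(cv_auc, key=lambda name: sum(cv_auc[name]) / len(cv_auc[name]))


def predicted_probability(x: Sequence[float], gamma_hat: Sequence[float]) -> float:
    """Logistic predicted probability exp(x.g) / (1 + exp(x.g))."""
    eta = sum(xj * cj for xj, cj in zip(x, gamma_hat))
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
    # Hypothetical fold-wise AUCs for two candidate models.
    print(select_best_model({"penalized_lr": [0.96, 0.97], "rf": [0.94, 0.95]}))
    print(predicted_probability([1.0, 2.0], [0.0, 0.0]))
```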

Estimating bias of SNP effects due to misclassification
The prediction model described above should accurately predict the disease status of subjects with unknown disease status. However, the incorporation of subjects with unknown disease status into statistical analyses can induce a bias in the coefficient estimates, which may be affected by the accuracy of our prediction model. To quantify the amount of bias, we assumed that the main risk factor, $G$, affected the disease status, $Y$, of subjects with unknown disease status, which was predicted with the variable $X$. Under this relationship, $G$ can affect $Y$ indirectly through the mediator $X$ or directly through an alternative pathway (Fig. 1). In this context, predictors can be considered as mediators, and the effects of SNPs on the disease can be categorized into direct and indirect effects. We denote the direct regression coefficient between $G$ and $Y$ as $\beta_{GY}$, the SNP effects from $G$ to $X$ as $\beta_{GX}$, and the effects of mediator $X$ on $Y$ as $\beta_{XY}$. If $X$ is continuous and there is no interaction between $X$ and $G$, the total SNP effect $\beta$ between $G$ and $Y$ becomes $\beta = \beta_{GY} + \beta_{GX} \beta_{XY}$. Figure 1 provides a general summary of the relationship between $G$, $X$, and $Y$.
Here, we considered the LR of $Y$ on $G$, and $X$ was not considered as a covariate. $Y$ for subjects with unknown disease status was predicted using $X$ and was included in the LR. LR coefficients with and without subjects with unknown disease status are denoted by $\beta^*$ and $\beta$, respectively. Then, $\beta^*$ and $\beta$ can be considered the biased and true regression estimates of $G$ for $Y$, respectively. The prediction model built using $X$ without $G$ cannot account for the direct effect ($\beta_{GY}$) of $G$, and its accuracy is proportional to the relative proportion of the indirect effect. If we denote the relative proportion of the direct effect by $\delta^*$, then $\beta^* - \beta$ is expected to be proportional to $\delta^*$. We let $\gamma_0$ and $\gamma_1$ be the probabilities that true controls and true cases, respectively, are misclassified. $\gamma_0$ and $\gamma_1$ can be parameterized with the accuracy of the prediction model (see Supplementary Text S1). Based on these definitions, $\beta^*$ was approximately quantified with a scale factor $B$ (Neuhaus 1999), and the true regression coefficient $\beta$ was then approximated from $\beta^*$ and $B$.
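To give a feel for how response misclassification attenuates a logistic coefficient, the sketch below evaluates a first-order (small-effect) approximation. If every subject's label is flipped with probabilities $\gamma_0$ (controls) and $\gamma_1$ (cases), independently of the SNP, then the observed prevalence is $p^* = \gamma_0 + (1 - \gamma_0 - \gamma_1) p$ and the fitted slope is roughly $(1 - \gamma_0 - \gamma_1)\, p(1-p) / \{p^*(1-p^*)\}$ times the true slope. This is only in the spirit of the scale-factor argument; it is not the paper's $B$, which also depends on the proportion of imputed subjects and on the share of direct effects. Function names are hypothetical.

```python
def misclassified_prevalence(p: float, gamma0: float, gamma1: float) -> float:
    """Prevalence of the observed (error-prone) label.

    gamma0: probability that a true control is labelled as a case.
    gamma1: probability that a true case is labelled as a control.
    """
    return gamma0 * (1.0 - p) + (1.0 - gamma1) * p


def attenuation_factor(p: float, gamma0: float, gamma1: float) -> float:
    """First-order ratio of the fitted to the true logistic slope near beta = 0.

    Assumes nondifferential misclassification applied to all subjects.
    """
    p_star = misclassified_prevalence(p, gamma0, gamma1)
    return (1.0 - gamma0 - gamma1) * p * (1.0 - p) / (p_star * (1.0 - p_star))


if __name__ == "__main__":
    # Balanced sample with 10% misclassification in each direction.
    print(attenuation_factor(0.5, 0.1, 0.1))
```

Because the factor is below 1 whenever misclassification is present, dividing a fitted coefficient by it moves the estimate away from zero, which is the direction of the correction described above.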

Simulation studies
We generated 10 000 replicates, and the empirical type-1 error and power were estimated from these replicates. For each replicate, we generated phenotypes and genotypes for 50 000 participants. Subsequently, 5000 cases and 5000 controls were randomly selected. Among the remaining subjects, $n_m/2$ cases and $n_m/2$ controls were selected from the case and control groups, respectively, and their disease status was assumed to be unknown. We denote the probability of a person being affected by $p_i$, from which the disease status $y_i$ of subject $i$ was determined. If we denote the SNP, sex, and 55 MRI traits of subject $i$ by $G_i$, $D_i$, and $X_{i1}, \ldots, X_{i55}$, respectively, then $p_i$ was calculated from these variables. Here, we assumed that $G_i$ followed a binomial distribution with two trials. For $\alpha$ and $\gamma_j$, we utilized LR estimates from our real-data analyses. $D_i$ was randomly assigned as male or female. We assumed that 55 MRI traits were available, and $X'_{ij}$ for the $j$-th region of interest (ROI) was generated from normal distributions. The first $k$ MRI traits were assumed to be affected by the SNP. This simulation setting indicates that SNPs can affect the disease status directly and indirectly through the first $k$ MRI traits. A further parameter, referred to here as the indirect-effect weight, was utilized to parameterize the weight of the indirect effect of the SNP on disease status, and it was set to 0, 0.3, 0.5, 0.8, and 1. We considered $k = 10, 30$. $\varepsilon_i$ indicates the unobserved environmental effect and was generated from $N(0, 1)$. $\beta_D$ and $\beta_I$ indicate the parameters for direct and indirect effects, respectively, and they were obtained from two heritability measures, $h_d^2$ and $h_i^2$, respectively. $(h_d^2, h_i^2)$ was set to (0, 0) and (0.0001, 0.001); the results from the former and latter were used to estimate the empirical type-1 error and empirical power, respectively. Accordingly, $(\beta_D, \beta_I)$ became (0, 0) and (0.049, 0.088), respectively.
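A loose sketch of how one subject might be generated under this design follows. Nearly every specific here is an assumption made for illustration: the name `omega` for the indirect-effect weight, the way it splits the SNP effect into direct (`1 - omega`) and indirect (`omega`) parts, the minor allele frequency, the intercept, the common MRI coefficient, and the effect sizes, which are only of the same order as those reported.

```python
import math
import random
from typing import List, Tuple


def simulate_subject(rng: random.Random,
                     maf: float = 0.2,        # hypothetical allele frequency
                     beta_d: float = 0.05,    # illustrative direct effect
                     beta_i: float = 0.08,    # illustrative indirect effect
                     omega: float = 0.5,      # hypothetical name: indirect-effect weight
                     k: int = 10,             # number of SNP-affected MRI traits
                     n_traits: int = 55,
                     alpha: float = -2.0,     # hypothetical intercept
                     gamma: float = 0.05      # hypothetical common MRI coefficient
                     ) -> Tuple[int, List[float], int]:
    """Draw (genotype, MRI traits, disease status) for one subject."""
    # Genotype: number of minor alleles, Binomial(2, maf).
    g = sum(rng.random() < maf for _ in range(2))
    # MRI traits: standard normal baseline; the first k are shifted by the SNP.
    x = []
    for j in range(n_traits):
        shift = omega * beta_i * g if j < k else 0.0
        x.append(rng.gauss(0.0, 1.0) + shift)
    # Unobserved environmental effect.
    eps = rng.gauss(0.0, 1.0)
    eta = alpha + (1.0 - omega) * beta_d * g + gamma * sum(x) + eps
    p = 1.0 / (1.0 + math.exp(-eta))
    y = 1 if rng.random() < p else 0
    return g, x, y


if __name__ == "__main__":
    rng = random.Random(0)
    cohort = [simulate_subject(rng) for _ in range(1000)]
    print(sum(y for _, _, y in cohort), "cases out of", len(cohort))
```

With `omega = 0` the SNP acts only directly, and with `omega = 1` only through the first `k` traits, which mirrors the two extremes discussed in the text.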

Subject classification and characteristics
A total of 5193 participants were recruited by the Gwangju Alzheimer's and Related Dementia (GARD) study at Chosun University in Gwangju, South Korea. The clinical diagnosis of AD was made according to the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) criteria (Park et al., 2019). The sample consisted of 1241 patients with AD, 1256 with MCI, 2382 CN subjects, and 314 subjects who were not classified because of missing information (i.e. unknown) (Table 1). All study volunteers or authorized guardians of cognitively impaired individuals provided written informed consent before participation.

Statistical methods for AD prediction model with GARD cohort and evaluation
The AD prediction models were built with information from 369 subjects with AD and 2267 CN subjects, whose characteristics are shown in Supplementary Text S2 and Supplementary Table S1. These models included 55 MRI traits (31 cortical thickness-related traits and 24 subcortical volume measures for brain ROIs), five SNSB cognitive test scores (one test each for measures of attention, language, visuospatial function, memory, and frontal/executive function), log-transformed intracranial volume, and sex. The prediction model was generated with four different algorithms: penalized LR, GB, RF, and SVM.
To evaluate the performance of the different models, 5-fold nested CV was applied to calculate the AUCs. Outer CV was used to estimate the test AUCs, and inner CV was used to optimize the hyperparameters of our prediction model. Details of the hyperparameter optimization are given in Supplementary Text S3. Finally, the model with the best AUC was fitted to all CN subjects and subjects with AD. The overall scheme for building and evaluating the models is shown in Supplementary Fig. S1. All analyses were conducted using the Scikit-learn (v.1.1.1) library in Python (v.3.8).
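The structure of nested CV can be shown with a small index-splitting sketch: each outer fold is held out for testing, and the remaining subjects are split again into inner folds for hyperparameter tuning. The helper names are hypothetical, the number of inner folds is an assumption (the text specifies only 5-fold nested CV), and the folds here are contiguous for readability, whereas in practice the indices would normally be shuffled and stratified by diagnosis.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv_splits(n: int, k_outer: int = 5, k_inner: int = 5
                     ) -> Iterator[Tuple[List[int], List[int], List[Tuple[List[int], List[int]]]]]:
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (fit, validation) index lists drawn only from
    outer_train, so the outer test fold never influences tuning.
    """
    for test_idx in kfold_indices(n, k_outer):
        held_out = set(test_idx)
        train = [i for i in range(n) if i not in held_out]
        inner = []
        for val_pos in kfold_indices(len(train), k_inner):
            val = [train[p] for p in val_pos]
            val_set = set(val)
            fit = [i for i in train if i not in val_set]
            inner.append((fit, val))
        yield train, test_idx, inner


if __name__ == "__main__":
    for train, test, inner in nested_cv_splits(23):
        print(len(train), len(test), [len(v) for _, v in inner])
```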

Genome-wide association studies with GARD cohort
A GWAS for AD was conducted using three different models. Ordinary LR models were applied to a sample including only AD cases and CN controls. The IP and WIP methods were employed to test association with AD in enlarged samples that also included MCI and AD status unknown subjects, using GEE and wGEE, respectively. In these analyses, the weights for AD cases and CN subjects were 1, whereas the estimated AD probabilities were used for MCI and status unknown subjects. The overall scheme is summarized in Supplementary Fig. S2 (see Supplementary Text S4 for details of the genotyping, quality control, and imputation procedures). Age, sex, and the first three PCs were included as covariates in all models. The GWS level was set at $5 \times 10^{-8}$. LR and IP analyses were performed using PLINK (v.1.90beta) (Purcell et al. 2007), and the WIP method was conducted in Python (v.3.8) using the statsmodels library (v.0.13.2). We used LocusZoom (Pruim et al. 2010) to generate regional plots and R software (v.3.7) (R Development Core Team, Vienna, Austria) to generate QQ and Manhattan plots.
To estimate the bias introduced by the misclassification probabilities and correct the odds ratios (ORs) for SNPs in the GWAS, the true ratio of cases among MCI/status unknown subjects ($q$) was assumed to be 0.15 based on the conversion rate from MCI to dementia per year (Petersen et al. 2001). Furthermore, the prediction performance for MCI/status unknown subjects ($p_1$, $p_0$) was adjusted using the sensitivities and specificities of the simulation results to calculate the disease status misclassification probabilities. The proportion of direct effects ($\delta^*$) was estimated by dividing the SNP heritability adjusted for MRI traits by the SNP heritability without adjustment.

Type-1 error assessment and statistical power comparison with simulated data
We compared the empirical type-1 errors and power estimates at the nominal significance level $\alpha$, for $\alpha = 0.05$, 0.01, and 0.001. Type-1 error and power were defined as the proportion of replicates whose P-values were smaller than the threshold. As shown in Table 2, under $(\beta_D, \beta_I) = (0, 0)$, the type-1 errors of all methods were not inflated at the nominal level. Moreover, the simulation results with SNP effect sizes $(\beta_D, \beta_I) = (0, 0)$ were not affected by the various values of $n_m$ and the indirect-effect weight, which indicated that our simulation settings were valid. For the power comparison, we assumed $h_i^2 = 0.001$ and $h_d^2 = 0.0001$, and in this case, the SNP effect sizes became $(\beta_D, \beta_I) = (0.049, 0.081)$.
Table 3 shows that the power estimates of all methods increased with the indirect-effect weight for a fixed $k$ ($k = 10$). The indirect effect $\beta_I$ was set to be larger than the direct effect $\beta_D$, so as the weight increased, the indirect effect of the SNPs on the disease increased. Assuming that all SNPs directly affected the disease (weight of 0), the logistic method was statistically more powerful than the IP and WIP methods including subjects with missing AD status. Furthermore, when additional $n_m$ subjects with missing AD status were utilized, the power of the IP and WIP methods worsened. However, as the weight increased, the IP and WIP methods using subjects with unknown AD status outperformed the logistic method. When the weight was 0.3, the power of the IP and WIP methods was comparable to that of the logistic method, and when it was at least 0.5, the IP and WIP methods performed better than the logistic model. Concerning the change in $n_m$, for a fixed weight (0.5, 0.8, or 1), the IP and WIP methods became more powerful as $n_m$ increased, and these positive effects of the sample size became larger as the weight increased. In addition, the WIP method had slightly higher power than the IP method, which does not consider the weights, for all values of the weight and $n_m$.
The results for $k = 30$ are listed in Supplementary Table S2. They were consistent with those for $k = 10$, but the IP and WIP methods tended to have better power than LR as long as the indirect-effect weight was at least 0.3.
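The empirical type-1 error and power are the same computation applied to replicates generated under the null and under the alternative, respectively. A minimal sketch, with a hypothetical function name and made-up P-values:

```python
from typing import Sequence


def empirical_rejection_rate(p_values: Sequence[float], alpha: float) -> float:
    """Proportion of replicates whose P-value falls below the nominal level alpha.

    With replicates simulated under H0 this estimates the type-1 error;
    with replicates simulated under an alternative it estimates power.
    """
    if not p_values:
        raise ValueError("at least one replicate is required")
    return sum(p < alpha for p in p_values) / len(p_values)


if __name__ == "__main__":
    # Made-up P-values from four replicates, for illustration only.
    p_values = [0.001, 0.02, 0.2, 0.04]
    for alpha in (0.05, 0.01, 0.001):
        print(alpha, empirical_rejection_rate(p_values, alpha))
```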

Estimation of SNP coefficients and bias correction with simulated data
SNP regression coefficients were estimated using ordinary LR, IP, and WIP ($\hat{\beta}_{LR}$, $\hat{\beta}_{IP}$, and $\hat{\beta}_{WIP}$, respectively) and compared across methods for effect sizes $(\beta_D, \beta_I) = (0.049, 0.081)$ (Supplementary Table S3). The coefficients for all methods tended to increase with the indirect-effect weight, for the same reasons as power. The coefficients for the IP and WIP methods ($\hat{\beta}_{IP}$, $\hat{\beta}_{WIP}$) had a downward bias compared to those of the logistic method. The underestimation of SNP coefficients worsened with lower values of the weight and higher values of $n_m$. However, for a weight of 1 (no direct effects), $\hat{\beta}_{IP}$ and $\hat{\beta}_{WIP}$ were approximately the same as $\hat{\beta}_{LR}$, regardless of the sample size ($n_m$).
We also calculated the bias estimate ($\hat{B}$) to correct the coefficient estimates for the IP and WIP methods. For each simulation, we obtained the misclassification probabilities for the subjects with missing diagnosis status using the prediction errors and true simulation values, and the proportion of direct effects for various values of $k$ and the indirect-effect weight. Finally, we derived $\hat{B}$ and used it to adjust $\hat{\beta}_{IP}$ and $\hat{\beta}_{WIP}$ to obtain the bias-corrected estimates. After bias correction, the adjusted estimates for the IP and WIP methods were larger than the nonadjusted estimates in all situations and close to $\hat{\beta}_{LR}$. However, for weights of at most 0.3, the adjusted estimates showed a downward bias when the proportion of subjects with missing diagnosis status increased. Supplementary Table S4 shows the results of the SNP estimates and bias correction for $k = 30$.

Application of the AD prediction model to the GARD cohort
Figure 2A shows the prediction results obtained from models including cognitive and MRI measures among the CN subjects and patients with AD. The results showed that the penalized LR had the best AUC and balanced accuracy (AUC = 0.968, balanced accuracy = 0.910). The prediction model built with penalized LR was applied to 1570 MCI and AD status unknown subjects, among whom 388 subjects were classified as AD (MCI 311, unknown 77) and 1182 subjects were classified as CN (MCI 945, unknown 237). The distributions of the probability of AD in status unknown subjects are shown in Fig. 2B. The MCI and status unknown groups were on average more similar to the CN group, with mean AD probabilities of 0.41 ± 0.16 and 0.39 ± 0.20, respectively.

SNP AD heritability accounted for by MRI and cognitive traits
The AD heritability accounted for by SNPs was estimated to be 67% (P < .001) after adjusting for sex, age, and PCs. However, heritability decreased to 36% when MRI traits and cognitive test scores were included in the model (P < .001). These results indicate that MRI traits and cognitive test scores are associated with a substantial number of disease susceptibility loci (methods are described in Supplementary Text S5).
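As a rough worked example, if the proportion of direct effects is taken as the ratio of the adjusted to the unadjusted SNP heritability, the two estimates above would give the value below. Whether exactly these two numbers entered the bias correction is an assumption, since the adjusted estimate here also conditions on cognitive test scores.

```latex
\hat{\delta}^{*} \approx \frac{h^{2}_{\mathrm{SNP,\,adjusted}}}{h^{2}_{\mathrm{SNP,\,unadjusted}}}
  = \frac{0.36}{0.67} \approx 0.54
```

Under this reading, slightly more than half of the SNP effect on AD would act through pathways not captured by the MRI and cognitive predictors.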

Evaluating prediction results with genetics
To evaluate the performance of the AD prediction model applied to MCI and status unknown subjects, CN subjects, AD cases, and MCI/status unknown subjects who were predicted as CN (Pred-CN) or AD (Pred-AD) were compared using the best linear unbiased predictions (BLUPs) for hippocampal volume. Details of calculating BLUPs are given in Supplementary Text S6. Figure 2C shows that the mean BLUPs of Pred-CN and Pred-AD subjects differed significantly (P < .001). Furthermore, the distributions of BLUPs for Pred-CN and CN subjects were not significantly different (P = .22). The mean of BLUPs was higher for Pred-AD subjects than for AD cases (P < .001).
In the analysis including the clinically diagnosed AD cases and controls only, we identified the GWS association of AD with the APOE SNP encoding ε4 (rs429358, $P = 9.4 \times 10^{-41}$). The significance of this finding increased in the analyses including the MCI/status unknown subjects classified as AD or CN using the IP ($P = 4.0 \times 10^{-45}$) and WIP ($P = 7.4 \times 10^{-49}$) methods. Similar patterns of GWS association were observed for SNPs in other genes in the APOE region. In addition, we identified a novel association with rs143625563, located in LMX1A (LIM homeobox transcription factor 1 alpha), in the analyses including the additional subjects who were classified using the IP ($P = 1.4 \times 10^{-8}$) and WIP ($P = 5.3 \times 10^{-8}$) methods, respectively (Supplementary Fig. S5A). This association was several orders of magnitude less significant in the analyses without the newly classified subjects ($P = 6.9 \times 10^{-6}$). Notably, rs3829687, located in the previously established AD gene ABCA7, showed a mild association in the analysis of the clinically diagnosed AD cases and controls ($P = 6.4 \times 10^{-6}$) and in the analysis with the newly classified subjects using the IP method ($P = 3.5 \times 10^{-7}$), but was marginal ($P = 8.9 \times 10^{-8}$) using the WIP method (Supplementary Fig. S5B).
For GWS SNPs, the ORs estimated by our methods using MCI/status unknown subjects were lower than those of LR (Table 3). We calculated the adjusted ORs ($\mathrm{OR}_{adj}$) for the IP and WIP methods by estimating the bias under several assumptions, and the adjusted estimates were close to those of LR.

Discussion
The sample size is a major contributor to the power of GWAS for complex diseases. We developed a method using a WIP approach that boosts power by allowing the incorporation of subjects with intermediate or missing phenotypes, who are assigned a probability of disease status based on disease-related endophenotypes. We applied this method to a GWAS for AD in which subjects with MCI or an unknown AD status were assigned an AD probability based on a set of brain MRI and cognitive test parameters. Evaluation of the performance of the WIP method through simulation studies showed that it controlled the type-1 error and had superior statistical power compared to LR, which is commonly used in GWAS. The accuracy of the AD probability calculation when applied to subjects with MCI and an unknown AD status, as well as the increase in power, are exemplified by the increased significance of the associations we observed for established AD loci, including several genes in the APOE region and ABCA7.
Utilization of this method for GWAS of complex diseases requires careful investigation of the endophenotypes that might be included in models for predicting disease status among those with intermediate or unknown disease status. Studies have reported moderate to high SNP heritability for brain structure (26%-88%) (Matura et al. 2014, Hibar et al. 2015, van der Lee et al. 2019) and cognitive function (20%-46%) (Davies et al. 2016), and their strong associations with AD (Raji et al. 2009, Suzuki et al. 2019). We estimated the SNP heritability of AD in the GARD study dataset to be 0.67. The reduction in the heritability estimate to 0.36 after adjusting for MRI traits and cognitive test scores suggests that the effects of SNPs on AD risk are mediated by brain structure and function to some extent. To validate our prediction model, we showed that the distribution of PRS for MCI/status unknown subjects who had a high probability of AD was significantly different from the PRS distribution for subjects who were predicted to likely be CN.
We identified a GWS association with SNP rs143625563 in a novel gene, LMX1A, and a highly suggestive association with a variant in an established AD gene, ABCA7. These associations were not evident in a previous study using clinically diagnosed AD cases and controls in the same dataset (Kang et al. 2021). LMX1A is known to be a key transcription factor associated with dopamine (DA) neurogenesis in the midbrain (Andersson et al. 2006, Yan et al. 2011, Doucet-Beaupré et al. 2015, Rolstad et al. 2015) and is linked to neurodevelopmental and neurodegenerative DA-related diseases such as Parkinson's disease (Cai et al. 2009, Laguna et al. 2015). Several studies have reported that DA is involved in AD pathogenesis, and DA dysfunction is associated with amyloid deposition (Perez et al. 2005, Himeno et al. 2011). Reduced DA expression is correlated with atrophy of the prefrontal cortex and hippocampus (Kumar and Patel 2007). One recent study of working memory training in amnestic and non-amnestic patients with MCI found that the non-amnestic group had larger gains than the amnestic MCI group, especially in APOE ε4 noncarriers who had the LMX1A rs4657412 AA genotype (Hernes et al. 2021). This variant is located approximately 42 kb from rs143625563 but was not highly correlated with our finding (r = 0.05). AD has been associated with other ABCA7 variants in non-Hispanic White (Jansen et al. 2019, Kunkle et al. 2019) and other cohorts (Wang et al. 2022). Four previously reported variants are located approximately 10, 0.7, 3, and 7 kb from rs3829687, respectively, and are correlated with our finding (r = 0.34, 0.97, not in our discovery dataset, and 0.25, respectively).
The proposed method for assigning the probability of disease based on endophenotypic measures has several limitations. First, although it performed well in terms of statistical power, the estimates of SNP coefficients showed a slight downward bias compared with those from the base method. This phenomenon resulted from the misclassification of subjects with an unknown disease status, and if the prediction model is not accurate, then the underestimation can be substantial. We quantified the bias in terms of prediction errors, the number of subjects with unknown disease status, and the ratio of direct effects between SNPs and diseases. Such quantification was validated through simulation studies and real-world data, and it should be noted that the result of the hypothesis testing was not affected. However, to adjust the estimates, unknown parameters that cannot be easily estimated must be specified. Second, more sophisticated methods, such as deep learning, can be utilized to improve the prediction accuracy of disease status for missing phenotypes. Some deep-learning algorithms with convolutional neural networks have been reported to achieve better than 95% accuracy using raw MRI data (Basaia et al. 2019, Zhang et al. 2021). However, compared to traditional machine-learning models, as the architecture of deep neural networks becomes deeper to improve the accuracy, the model becomes miscalibrated (Guo et al. 2017). Therefore, in our framework, we need to further consider how to calibrate the model because the probabilities are utilized as weights for the GWAS. Third, our findings should be replicated in independent datasets. Further experimental and clinical studies with additional data are necessary to demonstrate the effects of rs3829687 and rs143625563 on AD.
In conclusion, we showed that the statistical power of GWAS can be improved by utilizing a prediction model to quantify AD risk for subjects with MCI and an unknown AD status. Our results illustrate the practical importance of the proposed method and may improve our understanding of the genetic basis and pathogenesis of AD.
The sets of subject indices for affected subjects, unaffected subjects, and subjects with missing phenotypes are denoted by $N_a$, $N_c$, and $N_m$, respectively, and the total set of subject indices is represented by $N$.

Table 1. Descriptive statistics. For GWAS, subjects consist of 1241 AD subjects, 2382 CN subjects, and 1570 Other subjects. The 1570 Other subjects consist of 1256 MCI subjects and 314 unknown subjects. "Unknown" indicates subjects with missing diagnosis. MMSE, Mini-Mental State Examination.