BLINK: A Package for Next Level of Genome Wide Association Studies with Both Individuals and Markers in Millions

Big data, accumulated from biomedical and agronomic studies, provides the potential to identify genes controlling complex human diseases and agriculturally important traits through genome-wide association studies (GWAS). However, big data also leads to extreme computational challenges, especially when sophisticated statistical models are employed to simultaneously reduce false positives and false negatives. The newly developed Fixed and random model Circulating Probability Unification (FarmCPU) method uses a bin method under the assumption that Quantitative Trait Nucleotides (QTNs) are evenly distributed throughout the genome. The estimated QTNs are used to separate a mixed linear model into a computationally efficient fixed effect model (FEM) and a computationally expensive random effect model (REM), which are then used iteratively. To completely eliminate the computationally expensive REM, we replaced REM with FEM by using Bayesian information criteria. To eliminate the requirement that QTNs be evenly distributed throughout the genome, we replaced the bin method with linkage disequilibrium information. The new method is called Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK). Both real and simulated data analyses demonstrated that BLINK improves statistical power compared to FarmCPU, in addition to a remarkable improvement in computing time. Now, a dataset with half million markers and one million individuals can be analyzed within five hours, compared with one week using FarmCPU.


INTRODUCTION 1
Biomedical innovations have outpaced computing innovations since the completion 2 of the human genome project 1,2 . Genome-wide association studies (GWAS) have 3 identified many genetic loci presumed to control some human diseases and agriculturally 4 important traits 3-5 . However, a substantial proportion of these discoveries were false 5 positives, attributed to a failure to consider population structure and cryptic relationships 6 among individuals in the analyses 6-8 . Incorporating population structure and cryptic 7 relationships as covariates dramatically reduces false positives, but also causes false 8 negatives and computational burdens 9-11 . 9 Population structure is typically incorporated as a fixed effect in the General Linear 10 Model (GLM), which is computationally efficient. Initially, population structure was derived 11 as the proportions of individuals belonging sub-populations 12,13 . Several alternatives for 12 defining population structure, such as principal component analysis 14,15 , were developed 13 to further improve computational efficiency. Cryptic relationships can be incorporated in 14 two ways. One way is to include all genetic markers as random effects. Some of these 15 markers will capture the effects of Quantitative Trait Nucleotides (QTNs) through Linkage 16 Disequilibrium (LD) [16][17][18] . The other way is to first derive kinship among individuals, using 17 all genetic markers. The kinship is subsequently used to define the variance structure of 18 individual effects as random effects. In the latter, both population structure and kinship 19 can be incorporated into a fixed effect and random effect Mixed Linear Model (MLM) 9 . 20 However, the computation of the MLM is intensive. Thus, multiple methods have been 21 developed to reduce the computing times of MLM. 22 The first milestone that reduced this computational burden was the development of 23 the Efficient Mixed Model Association (EMMA) 19 . Prior to EMMA methods, Maximum 24 Likelihood (ML) or REstricted Maximum Likelihood (REML) functioned as the genetic 25 variance and the residual variance. With ML or REML, optimization must be performed in 26 two dimensions using methods such as Expectation and Maximization (EM). By using 27 EMMA, ML or REML functions only as the ratio between genetic variance and residual 28 variance. By reducing the optimization from two dimensions (genetic variance and 29 residual variance) to one dimension (genetic-to-residual variance ratio), computing speed 30 dramatically improves. 31 optimization without first calculating kinship. As a result, computing time is linear to subset 23 size and independent of sample size. 24 The fifth milestone was GRAMMAR-Gamma 24 , a method that splits the association 25 analysis into two steps. The first step uses MLM to derive the residual. The second step 26 tests the residual as transformed traits in a fixed effect model and applies a correction 27 factor to test statistical values. The computing complexity of the second step is completely 28 linear to both number of individuals and markers. 29 With the exception of CMLM, the primary aim of the above milestones was to improve 30 computing speed. The statistical power of each of these milestones remains similar to the 31 conventional MLM 9 because the same or similar kinship is used regardless of the traits 1 being analyzed. CMLM, on the other hand, represented the first adjustment of kinship to 2 improve statistical power 21 . In CMLM, genetic effects of individuals in the conventional 3 MLM are replaced by the genetic effects of their corresponding kinship groups. That is, 4 kinship among individuals is replaced by kinship among groups. Furthermore, the 5 adjustment on kinship is optimized for the particular traits being studied. For example, the 6 kinship with the best ML or REML is used for testing markers one at a time. Other 7 optimizations were also developed to define minimum and maximum group kinship, in 8 addition to average kinship 25 . 9 The second adjustment employs kinship that is not only specific for traits, but also 10 specific for testing markers 26,27 . Kinship is built by using only the markers that are 11 associated with a trait. However, multiple associated markers can be genetically linked. 12 To remove this redundancy, a bin procedure was developed within the Settlement of MLM 13 Under Progressively Exclusive Relationship (SUPER) method to ensure that, at most, 14 only one associated marker is selected from each bin 27 . Furthermore, the kinship changes 15 according to testing markers to eliminate the confounding between kinship and testing 16 markers. The trait-associated markers are excluded from the kinship calculation if they 17 are also associated with the testing markers. This association is determined by linkage 18 disequilibrium (LD) in SUPER 27 , or when the associated markers are on the same 19 fragment (within 1Mb) as the testing markers 26 . 20 The third adjustment, known as the multi-locus mixed-model (MLMM) approach, 21 applies weight to kinship 28 . In addition to random individual effects, this adjustment also 22 fits multiple associated markers in the MLM to split the variance explained by kinship in a 23 stepwise regression fashion. The forward stepwise regression stops when the kinship 24 explains no variance. The multiple markers are then tested simultaneously without kinship. 25 Recently, a fourth adjustment, called the Fixed and random model Circulating 26 Probability Unification (FarmCPU) 29 was developed to completely eliminate the kinship in 27 the model for testing markers. In FarmCPU, the marker-testing model becomes a Fixed 28 Effect Model (FEM). To control spurious association for testing a marker, associated 29 markers are fitted as covariates. To avoid the over-fitting problem, firstly, the whole 30 genome is equally divided into certain number of bins, and only one significant maker with 31 smallest P value is selected to be the candidates of covariates (pseudo QTNs). Secondly, 1 these pseudo QTNs in the FEM are determined by a Random Effect Model (REM). The 2 pseudo QTNs, from various bin with different width, are firstly ranked by P value, then, 3 the best combinations between different bin and the number of possible QTNs are 4 determined by REM. The two types of models (FEM and REM) are performed iteratively 5 until no change occurs in the selection of pseudo QTNs. 6 Despite these valuable advancements, more innovative computing tools and analysis 7 methods are needed. For example, although FarmCPU boosts statistical power in GWAS, 8 its REM process remains computationally demanding. Additionally, the bin approach from 9 SUPER requires that all QTNs be evenly distributed throughout the genome, which is 10 hardly true. However, only one QTN can be selected as a covariate even if multiple QTNs 11 are located in the same bin, which limits statistical power. Thus, a critical need still exists 12 for a method that can increase both computing efficiency and statistical power. 13 14

IDEA 15
Herein, we present a new statistical method that was inspired by this critical need and 16 builds upon our previous method, FarmCPU. In the new method, we use 1) Bayesian 17 Information Criteria (BIC) in a FEM to replace REML in the REM, and 2) linkage 18 disequilibrium information to replace the bin method. As a result, we have completely 19 eliminated the computationally expensive REM and the requirement that QTNs be evenly 20 distributed throughout the genome (Fig.  1). 21 Our proposed method requires performing two FEMs in an iterative fashion. The first 22 FEM tests genetic markers (SNPs), one at a time, and uses a set of associated markers 23 as covariates. For the convenience of illustration, these associated markers are also 24 named pseudo QTNs. The model can be written as: 25 where y i is the observation of the i th individual;; S * i1 , S * i2 ,…, S * ik are the genotypes of k 27 pseudo QTNs, initiated as an empty set;; b 1 , b 2 , …, b k are the corresponding effects of the 28 pseudo QTNs;; S ij is the genotype of the i th individual and j th testing marker;; d j is the 29 corresponding effect of the j th genetic marker;; and e i is the residual having a distribution 1 with a mean of zero and variance of . 2 Marker genotypes and P values from equation (1) are used to update the pseudo 3 QTNs. Markers are filtered based on their P values by using a multiple test corrected 4 threshold. The most significant marker is retained, while the markers in LD with the most 5 significant markers are removed. Then, markers in LD with the second significant marker 6 are removed, and so on, until no markers are in LD with each other. The first k of t 7 remaining pseudo QTNs are fit into the second FEM, which is a reduced model that 8 excludes the testing marker (S ij d j ) compared with equation (1). The second FEM model 9 can be written as: 10 Bayesian Information Criteria (BIC) is examined with varied numbers of pseudo QTNs 12 (k = 1 to t) until the optimum k is identified. We named the new method Bayesian-13 information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK). The BLINK 14 procedure is further detailed in the online methods section. 15 16

RESULTS 17
We implemented the BLINK algorithm in two packages, one was written in R and the 18 other in C. The R package was designed for the fact what more audiences are familiar 19 with R than C. The C package was designed for computational efficiency. We named the 20 two packages as BLINK-C and BLINK-R, respectively. As most of the analyses were 21 conducted by BLINK-C, we simplified BLINK-C as BLINK. Both of these packages were 22 compared with other two packages that are complementary for computing speed and 23 statistical power. One is PLINK 30 and the other is FarmCPU 29 . The package of PLINK 24 was written in C and implemented the GLM method that has the minimum theoretical 25 computing time complexity. The package of FarmCPU was written in R and implemented 26 the FarmCPU algorithm that is superior to GLM in respect of statistical power. Statistical 27 power was examined based on false positives, true positives, and statistical power at 28 different levels of false discovery rate (FDR) and Type I error. To retain the real population 29 2 e σ structure (Fig.  S1), we used phenotypes simulated from real genotypes that covered a 1 wide range of species including human, one crop (maize), one livestock (pig), and two 2 model species (Arabidopsis and mouse). Association study on real phenotype was 3 conducted on flowering time in maize. Enrichment on a different study was performed to 4 compare BLINK and FarmCPU. Finally, real genotype and phenotype data were also 5 duplicated to synthetically create big data to examine observed computing time of BLINK-6 C, BLINK-R, PLINK and FarmCPU. 7 8 Receiver operating characteristic 9 Using real genotypes from all five species, we simulated the QTNs controlling the 10 phenotypes in two scenarios. The first scenario was a situation that rarely, if ever, 11 occurs in practice-all QTNs were randomly located on the chromosomes without 12 clusters. We call it "synthetic" scenario. The second scenario was a situation closer to 13 reality-QTNs were clustered on chromosomes. Every two QTNs were restricted to next 14 each other within 100 Kb. We call it "real" scenario. For each scenario, we examined 15 statistical power under different levels of FDR and Type I error. FDR was defined as the 16 proportion of false positives among the total number of positives identified. Type I error 17 was derived from the empirical null P-value distribution of all non-QTN-bins. The 18 relationship between statistical power and FDR or Type I error is described by the 19 receiver operating characteristic (ROC) curves (Fig  2  and  S3). The method with a larger 20 area under curve (AUC) is preferred over the method with a smaller AUC. BLINK had a 21 larger AUC than FarmCPU and PLINK for both power versus FDR and power versus 22 Type I error;; PLINK had a smaller AUC than FarmCPU and BLINK for both 23 comparisons. This situation held true across all five species. 24 The model selection criteria (BIC) was compared with AIC (Akaike Information 25 Criterion) 31 and EBIC (Extended BIC) 32 in all the five species examined. BIC 26 overperformed other two model selection criteria (Fig S4). The determination of two 27 markers in LD was based on the absolute values of their Person correlation coefficient. 28 The default vales (0.7) was based on the comparisons of statistical power under different 29 FDR (Fig  S5). 30 1

Associations and enrichment on real phenotype 2
We conducted GWAS on flowering time in maize using the three methods (BLINK, 3 FarmCPU, and PLINK). PLINK exhibited strongly inflated P values (Fig.  3) Notably, BLINK identified more associated SNPs than FarmCPU. FarmCPU 10 identified 14 SNPs that passed the Bonferroni multiple test threshold (α = 0.01). In 11 contrast, BLINK not only revealed 9 of these 14 SNPs, but also identified an additional 40 12 loci that passed the Bonferroni threshold. The significant SNPs identified by BLINK 13 included the SNPs that are 2Kb from ZmCCT, 441kb from ZCN8, and 568 kb from Vgt1-14 three genes that have been previously cloned (Fig.  3). Compared to ZmCCT, which has 15 a SNP at the same distance in both BLINK and FarmCPU, the distance between 16 significant SNPs and the other two candidate genes are much closer in BLINK than in 17 FarmCPU. Plus, BLINK identified three significant SNPs close to NAM QTL. BLINK and 18 FarmCPU had much better control on inflated P values across the genome than PLINK. 19 At the same level of control on inflated P values across the genome, BLINK identified 20 more genetic loci associated flowering time.  (Table  1).

3
Observed computing time 4 The two BLINK software packages (C and R versions) gave the identical P values (Fig.  5   S6). We compared their computing times with PLINK 30 and FarmCPU 29 for analyzing big 6 Among the four packages compared above, BLINK-C is the only software package 18 that can fully use modern computer architecture with multiple cores for parallelization. We 19 further examined the efficiency of BLINK-C on multiple-core computer systems. We tested 20 BLINK-C on computers with core numbers ranging from 2 to 12 under different operating 21 systems, including Linux and Mac. Results showed that the total computing time 22 decreased linearly with number of cores (Fig.  5). For the dataset with about one million 23 individuals and one-half million SNPs, a Mac Pro with 12 cores completed the analysis in 24 just 30 minutes instead of 5 hours with a single core. 25 26

DISCUSSION 1
Inspired by the critical need for computational efficiency and statistical power in big data 2 analysis and by the most recently developed GWAS method, FarmCPU, we developed a 3 faster and more powerful method. By substituting REML in FarmCPU's REM with BIC in 4 a FEM and by replacing the bin approach with LD, we achieved optimization in one 5 dimension (number of pseudo QTNs) instead of two dimensions (number of pseudo QTNs 6 and bin size). The optimization of the genetic-to-residual variance ratio was also 7 eliminated by substituting REM with FEM, which directly solves residual variance without 8 iterations. These improvements not only reduced computing time, but also simultaneously 9 reduced false positives and false negatives. 10 11

Substitution of REML with BIC 12
In both FarmCPU and BLINK models, markers are tested one at a time, with pseudo 13 QTNs added as covariates to control false positives and reduce false negatives. can be selected if multiple pseudo QTNs are close enough to fall into the same bin. In 2 practice, real QTNs are often clustered, rather than evenly distributed;; thus, BLINK is 3 more robust than other methods. 4 5

Limitations 6
At the first iteration, the number of SNPs examined for LD is even more critical relative to 7 both computing time and pseudo QTN selection. The P values may be severely inflated 8 due to lack of control on false positives. Thus, the P threshold is much tighter at the first 9 iteration compared to the rest of the iterations. For example, in the first iteration, the 10 current default setting in BLINK © and BLINK ® only allows examination of SNPs with P 11 values of 0.01 after a multiple test correction. For the remaining iterations, this P-value 12 threshold is substantially released, 100% after a multiple test correction. When P values 13 are extremely inflated at the first iteration, a stricter limit is set on number of markers 14 (default is 1,000) examined for LD. 15 The selection of pseudo QTNs is significantly influenced by the threshold setting that Although BLINK has the same computing time complexity as PLINK and FarmCPU, 29 BLINK-C was not only faster than FarmCPU, but also faster than the most recently 30 released beta testing version of PLINK. BLINK-C can analyze an extremely big dataset-31 one million individuals and one-half million markers-in 5 hours with a single core, or in 1 30 minutes with 12 cores. The software package BLINK-C has been released to the public 2 as an executable program. The executable program of BLINK, help documents, 3 demonstration data, and tutorials are freely available at http://zzlab.net/blink. The second FEM is employed to optimize the selection of pseudo QTNs. 23 Mathematically, the second FEM can be written as follow: 24 (2) 25 Equations (1) and (2) differ in two ways. First, the testing marker term in the first FEM is 26 removed from the second FEM;; therefore, no testing marker P values are output in (2). 27 Second, the number of covariate pseudo QTNs are varied in the second FEM to select 28 the optimum set of the first k out of t pseudo QTNs. The optimization is performed using 1 Bayesian Information Criteria (BIC), which is twice the negative log likelihood plus the 2 penalty on number of parameters, as follows: 3 where LL is Log Likelihood, k is the number of pseudo QTNs, Ln is natural Log, and n 5 is the number of individuals. There are t available pseudo QTNs that are sorted with the 6 most significant at beginning and the lest significant at end. The first k pseudo QTNs are 7 selected for examination with k varied from 1 to t. 8 All markers in equation (1) are candidates for pseudo QTNs in equation (2). These 9 markers are filtered with two criteria: P value and LD. All markers are sorted first and then 10 filtered out if their P values are larger than a threshold. Among the m SNPs kept, if their 11 correlation (r) with the first SNP (S 1 * ) is larger than a threshold, they are removed. This 12 process is repeated to select S 2 * , S 3 * ,…, until the last SNP, S t * , is selected (Fig  1). 13 As the t remaining markers are sorted and not in LD each other, the first set of k 14 markers is more critical than the second set of k markers. We fit the first k markers in 15 equation (2) and vary k until all possibilities are examined. The set of k markers with the 16 best BIC is used as the set of pseudo QTNs in equation (1). This process is iterated until 17 the pseudo QTNs remain the same. We named this alternative solution as Bayesian-18 information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) method. 19 20

Genotype and phenotype Data 21
We used the exact same datasets we used in our previous publication for the FarmCPU 22  Table  S1. The principal components were calculated 26 by PLINK using all the SNPs. 27 In maize genotype dataset, there are a total of 2,279 inbred lines, each line with 28 681,258 SNPs. The real phenotype of all the samples in this dataset is flowering time that 29 was measured as days to silk (URL: http://www.panzea.org/#!genotypes/cctl). 1 The human dataset, "East Asian lung cancer dataset", ID # phs000716.v1.p1, was 2 obtained from dbGaP 5 . The name of this dataset is "East Asian lung cancer dataset" (ID 3 # phs000716.v1.p1). Respecting the privacy and intentions of research participants, the 4 dataset is only available under the permission of NIH (National Institutes of Health) and 5 Intramural NCI (National Cancer Institute). There are a total of 8,807 individuals was 6 involved into computing testing, each of them with 629,968 SNPs (URL: 7 http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000716.v1.p1). 8 For the datasets of Arabidopsis thaliana, it contains a larger sample with 1,179 9 individuals that were genotyped with 214,545 SNPs (URL: 10 http://archive.gramene.org/db/diversity/diversity_view). 11 12

Synthetic data 13
The human dataset was synthetically duplicated to evaluate computing efficiency on 14 large-scale datasets. The human dataset contained about one-half million (629,968) 15 SNPs and 8,807 individuals. These individuals were duplicated 2, 5, 10, 20, 50, and 100 16 times. The largest synthetic dataset contained nearly one million (880,700) individuals. 17 The number of SNPs remained the same, at approximately one-half million.

Simulated phenotypes 20
The real genotypes of the five species were used to simulate phenotypes to examine 21 statistical power under different levels of Type I error and FDR. The simulated phenotypes 22 had heritability of 75%, controlled by a variable number of QTNs that were sampled from 23 all real SNPs. Two scenarios were applied to the sampling of SNPs, without and with 24 restriction. The restriction was that QTN must be within distance of 300 Kb with another 25 QTN. The QTNs had effects that followed a normal distribution. These QTNs were 26 summed together as the total additive genetic effect for each individual according to its 27 real genotypes. Variance of additive genetic effect was calculated across all individuals. 28 A normally distributed residual effect was assigned to each individual. The variance of the 29 residual effect was assigned accordingly so that the proportion of additive genetic 1 variance equaled heritability. Genomes were divided into 10KB-sized bins. Bins were 2 classified as QTN bins if they contained a QTN, otherwise as non-QTN bins. The P value 3 of a bin was represented by the most significant single nucleotide polymorphism (SNP). 4 Statistical power was defined as the proportion of QTNs detected at different levels of 5 false discovery rate (FDR) and Type I error derived from the empirical null distribution of 6 non-QTN bins. 7 8 Power, Type I error and FDR 9 Number of false and true positives were counted based on bins described in our previous 10 study 29 . The size of bins was varied starting at a single base pair to one mega base pairs. 11 We reported the results from using a 10 KB bin size. The P value of a bin was represented 12 by the most significant SNP. A bin was considered a QTN-bin if it contained at least one 13 QTN, otherwise, a non-QTN-bin. A non-QTN-bin with a P value that passed a threshold and editing the manuscript. 7 8

Ethics Statement 9
Any opinions, findings, conclusions, or recommendations expressed in this publication 10 are those of the authors and do not necessarily reflect the views of the funding agencies. 11 All datasets analyzed herein were published previously. This study did not involve 12 samples from humans or animals. 13 14

Competing Interests Statement 15
The authors declare that they do not have competing financial interests. 16 , as illustrated by comparing the true QTNs (red triangles) positioned along the horizontal axis in (c). Our alternative method is to sort the M SNPs first and filter them out if their P values are larger than a threshold (α). Among the m SNPs kept, additional SNPs are removed if their correlation (r) with the first SNP (S 1 * ) is larger than a threshold (β). This process is repeated to select S 2 * , S 3 * ,…, until the last SNP S t * is selected (d). As the t remaining SNPs are sorted, we fit the first k of them in a Fixed Effect Model (FEM) (e) and examine the corresponding twice negative log likelihood (-2LL) and Bayesian Information Criteria (BIC) (f). As more SNPs are fitted, -2LL continually improves (blue line), while BIC reverses (red line) because BIC applies a penalty with increasing numbers of SNPs. The k SNPs that give the best BIC are used as pseudo QTNs and fitted as covariates in another FEM to test all SNPs, one (s i ) at a time, as described by the conceptual model (g). This process (d to g) is iterated until the pseudo QTNs remain the same. We named this alternative solution the Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) method.