Abstract

Summary: The software tool P2BAT provides a massively parallel and user-friendly implementation of the PBAT analysis tools for family-based association tests (FBATs) in large-scale studies, including genome-wide association studies with several thousand subjects. Built on the original PBAT implementation of the Lange–Van Steen algorithm to bypass the multiple testing problem in family-based association studies, P2BAT integrates all PBAT analysis tools for binary and complex traits into R and makes them accessible through a user-friendly GUI. The genome-wide analysis tools are fully automated and can be run massively parallel directly through the GUI. P2BAT is fully documented and contains graphical output tools for time-to-onset analysis. P2BAT can also test for gene–environment and gene–drug interactions.

Availability: The P2BAT package is available as the R package ‘pbatR’, which can be downloaded from Author Webpage. The PBAT software is available at Author Webpage.

Contact: thoffman@hsph.harvard.edu

1 INTRODUCTION

The era of genome-wide association studies has finally begun (Herbert et al., 2006; Kachergus et al., 2005; Klein et al., 2005), offering a unique chance to identify genes for complex traits through an unbiased search at a genome-wide level. The initial fear was that the new wealth of genomic data could not be translated into increased statistical power to detect new genes, but would instead be diluted by the multiple comparisons problem. This concern about the major statistical roadblock in such studies now seems to be fading as new methodology emerges. For studies of unrelated individuals, several statistical approaches have been suggested (Hirschhorn and Daly, 2005; Thomas et al., 2004; Roeder et al., 2005; Verzilli et al., 2006). For genome-wide association studies in family-based designs, Van Steen et al. (2005) proposed a novel testing strategy that bypasses the multiple testing problem within one study and thereby reduces the impact of study heterogeneity. The approach has been applied successfully to a 100 K-scan in a family sample of the Framingham Heart Study (Herbert et al., 2006; Lange et al., 2003; Laird and Lange, 2006), which is, to date, the only successful genome-wide association study to reveal a novel, replicable candidate gene for obesity.

However, so far, no software implementation for genome-wide association studies exists that can analyze the vast amount of information produced by such studies in a user-friendly way and that runs massively parallel on clusters, reducing the analysis time to a couple of minutes. With P2BAT, we have developed such a software tool. P2BAT implements all the analysis features of PBAT in R (R Development Core Team, 2005) and makes them accessible through a user-friendly GUI. Further, without requiring any additional effort from the user, P2BAT runs the analysis massively parallel with as many parallel jobs as specified (Fig. 1). The parallelization in P2BAT is achieved by running multiple instances of the original PBAT program, using the queuing system of a cluster. The process is fully automated and monitored by P2BAT. The package P2BAT is available as the R package ‘pbatR’ in conjunction with the software PBAT. The two software packages can be downloaded from Author Webpage and from Author Webpage, respectively. Detailed instructions are available on the webpage Author Webpage.

Fig. 1

The P2BAT graphical interface.


2 USAGE/DATA FORMAT

P2BAT can be run in both a command line version and a graphical interface version. In both cases, the data must be in the format of a pedigree file and a phenotype file. The first line of the pedigree file contains the names of the markers. Each subsequent line corresponds to one individual and lists the pedigree id, subject id, father id, mother id, gender, affection status and each pair of marker alleles, all separated by spaces. Missing data in this file are encoded with a ‘0’. Except for the marker names, the pedigree file may contain only numeric values. The first line of the phenotype file lists the names of all the traits in the phenotype file. Each subsequent line corresponds to one individual and lists the pedigree id, subject id and the values of each trait, all separated by spaces. In contrast to the pedigree file, a hyphen ‘-’ must be used here to indicate missing data.
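To make the two file layouts concrete, the following is a minimal Python sketch of a reader for them. It is an illustration only and is not part of pbatR or PBAT; the function names `parse_ped` and `parse_phe`, the marker names, the trait names and all data values are hypothetical. The numeric codes used for gender and affection status are placeholders, since the text above does not specify them.

```python
def parse_ped(text):
    """Parse pedigree-file text: a header of marker names, then one row per subject."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    markers, records = rows[0], rows[1:]
    expected = 6 + 2 * len(markers)  # six fixed columns plus two alleles per marker
    subjects = []
    for fields in records:
        if len(fields) != expected:
            raise ValueError(f"expected {expected} fields, got {len(fields)}")
        pid, sid, father, mother, sex, affection = fields[:6]
        genotypes = {}
        for i, marker in enumerate(markers):
            pair = (fields[6 + 2 * i], fields[7 + 2 * i])
            # In the pedigree file, '0' marks a missing allele.
            genotypes[marker] = None if "0" in pair else pair
        subjects.append({"pid": pid, "id": sid, "father": father, "mother": mother,
                         "sex": sex, "affection": affection, "genotypes": genotypes})
    return markers, subjects


def parse_phe(text):
    """Parse phenotype-file text: a header of trait names, then one row per subject."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    traits, records = rows[0], rows[1:]
    phenotypes = {}
    for fields in records:
        if len(fields) != 2 + len(traits):
            raise ValueError(f"expected {2 + len(traits)} fields, got {len(fields)}")
        # In the phenotype file, '-' marks a missing trait value.
        values = [None if v == "-" else float(v) for v in fields[2:]]
        phenotypes[(fields[0], fields[1])] = dict(zip(traits, values))
    return traits, phenotypes


# Hypothetical example: one trio, two markers, two traits.
PED_TEXT = """m1 m2
1 1 0 0 1 1 1 2 2 2
1 2 0 0 2 1 1 1 0 0
1 3 1 2 1 2 1 2 2 2
"""
PHE_TEXT = """bmi age
1 1 27.5 54
1 2 - 51
1 3 24.1 -
"""

markers, subjects = parse_ped(PED_TEXT)
traits, phenotypes = parse_phe(PHE_TEXT)
```

In this example the second subject has a missing genotype at `m2` (alleles `0 0`) and a missing `bmi` value (`-`), which shows the two different missing-data codes side by side.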

2.1 Graphical user interface

The main window of the analysis portion of the graphical interface is started with the command pbat() and is shown in Figure 1. Phenotypes, covariates, SNPs/haplotype blocks, stratification variables and other options can be selected from lists within the interface. For instance, for testing single traits one would select ‘gee’ for FBAT-GEE, for testing multiple traits simultaneously one would select ‘pc’ for FBAT-PC, and for testing time-to-onset traits one would select ‘logrank’ for FBAT-LOGRANK.

It is easy to take advantage of PBAT's parallel implementation. To use multiple processors or multiple cores, one can choose the ‘multiple’ option and specify, for instance, the number of cores on a single-processor machine. To spread PBAT out over a cluster, one can use the ‘cluster’ option and specify the number of nodes as the number of jobs. If a cluster refresh time of ‘0’ is specified, the jobs are submitted and pbatR does not wait for the output; any other value gives the number of seconds to wait between checks of whether the processes are done. When ‘0’ is specified, additional commands can be used from the command line to paste the output together at a later time.

The power and sample size interface, partially shown in Figure 1, is started with the command pbat.power(). Options to calculate power are available for both binary and continuous traits, in both family-based and population-based studies. Additionally, options to calculate sample size are available for the population-based studies.

2.2 Command line interface

For additional control, or as an alternative interface, one can also use the command line. For the analysis portion, the data are read in with read.ped and read.phe, either partially (the default for the GUI) or completely. In the former case, only the marker names are loaded, or no names at all for datasets with millions of SNPs; in the latter case, the entire dataset is loaded into objects that extend a data frame. P2BAT is then run with the command pbat.m. The default options and values are identical to the ones shown in the graphical interface in Figure 1. An intuitive formula notation is used to specify the model for the association testing. For instance, suppose that we have phenotypes p1 and p2; covariates c1 to c3; and SNPs m1 to m6. The formula for a single phenotype, a single covariate, and three SNPs is given by  
p1 ~ c1 | m1 | m2 | m3.
If instead we wanted to test for an association of the phenotype p1 with one of the two haplotype blocks (m1, m2, m3 and m4, m5, m6) in the presence of the gene–environment interaction with variable c2 and the covariate c1, we would specify  
p1 ~ c1 + mi(c2) | m1 + m2 + m3 | m4 + m5 + m6,
where mi(.) denotes the interaction term. Now, if we wanted to do a multivariate analysis with FBAT-PC, testing our phenotypes p1 and p2 simultaneously, and including the third covariate c3 to second order, we would have  
p1 + p2 ~ c1 + mi(c2) + c3^2 | m1 + m2 + m3 | m4 + m5 + m6.
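The structure of this notation, with R's `~` separating the phenotypes from the covariates, `|` separating the marker groups and `^` giving the order of a covariate, can be spelled out in a short Python sketch. This is only an illustration of how a formula decomposes; it is not pbatR's own parser, the function name `parse_formula` is hypothetical, and the ‘&’ form used for time-to-onset traits is not handled.

```python
import re


def parse_formula(formula):
    """Split a formula such as 'p1 ~ c1 + mi(c2) | m1 + m2 | m3' into its parts."""
    lhs, sep, rhs = formula.partition("~")
    if not sep:
        raise ValueError("formula must contain '~'")
    sections = [s.strip() for s in rhs.split("|")]

    # Left of '~': one or more phenotypes joined by '+'.
    phenotypes = [t.strip() for t in lhs.split("+") if t.strip()]

    # First section right of '~': covariates, mi(.) interactions, '^' orders.
    covariates, interactions, orders = [], [], {}
    for term in (t.strip() for t in sections[0].split("+")):
        if not term:
            continue
        match = re.fullmatch(r"mi\((\w+)\)", term)
        if match:
            interactions.append(match.group(1))
            continue
        name, _, power = term.partition("^")
        name = name.strip()
        covariates.append(name)
        if power:
            orders[name] = int(power)

    # Remaining sections: one marker group (single SNP or haplotype block) each.
    marker_groups = [[m.strip() for m in s.split("+") if m.strip()]
                     for s in sections[1:]]
    return {"phenotypes": phenotypes, "covariates": covariates,
            "interactions": interactions, "orders": orders,
            "marker_groups": marker_groups}


single = parse_formula("p1 ~ c1 | m1 | m2 | m3")
multi = parse_formula("p1 + p2 ~ c1 + mi(c2) + c3^2 | m1 + m2 + m3 | m4 + m5 + m6")
```

For the first formula every marker forms its own group, so three single-SNP tests are described; for the second, the two groups of three markers correspond to the two haplotype blocks.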

Finally, if we wanted to do a time-to-onset analysis with FBAT-LOGRANK (Lange et al., 2004b; Jiang et al., 2006) on all the SNPs, we would have ‘time & censor ~ c1’. Further examples are available in the documentation. The call returns an object that works with the standard generic R functions, such as summary and plot (plots are available only for time-to-onset analyses). The time-to-onset plots follow the algorithm developed by Jiang et al. (2006); an example is shown in Figure 2. To configure multiple jobs from the command line, use pbat.setmode. Lastly, the power and sample size commands can also be used from the command line, with commands such as pbat.binaryFamily.

Fig. 2

The time-to-onset graph (Jiang et al., 2006) can be saved in the various graphical formats supported by R.


3 RESULTS

To assess the performance of P2BAT, we re-ran the analysis of the 100 K-scan in the Framingham Heart Study (Herbert et al., 2006; Laird and Lange, 2006), using the entire data set with 1400 probands. We analyzed BMI measurements at the six exams of the study as longitudinal data in the FBAT-PC approach (Lange et al., 2004a). Running the analysis in parallel on a cluster with 50 dual-nodes (Xeon™ 3.2 GHz), P2BAT used 70 MB of memory (per node) and took 41 min to complete the analysis. The aggregated results from the program runs are shown in Figure 3. Since P2BAT is able to divide the analysis into as many parallel jobs as there are SNPs, the analysis could have been split up into 100 000 parallel jobs. Given an estimated ∼8 million common SNPs (Carlson, 2006) and constantly growing cluster sizes, even an analysis of all common SNPs, should the technology become available, will not face running-time issues.
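The idea of dividing a scan into parallel jobs can be sketched in a few lines of Python. The text does not describe how P2BAT assigns SNPs to jobs, so the even, contiguous split below is an assumption made for illustration, and the function name `split_into_jobs` is hypothetical.

```python
def split_into_jobs(n_markers, n_jobs):
    """Return half-open (start, stop) index ranges, one per job, covering all markers."""
    if n_markers < 1 or n_jobs < 1:
        raise ValueError("n_markers and n_jobs must be positive")
    n_jobs = min(n_jobs, n_markers)  # never more jobs than markers
    base, extra = divmod(n_markers, n_jobs)
    ranges, start = [], 0
    for job in range(n_jobs):
        # The first `extra` jobs take one additional marker each.
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


small = split_into_jobs(10, 3)
scan = split_into_jobs(100_000, 50)
finest = split_into_jobs(100_000, 100_000)
```

With 100 000 markers and 50 jobs, each job receives 2000 markers; at the finest split, each job receives a single marker, which is the limiting case mentioned above.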

Fig. 3

P2BAT-analysis results from a 100 K-scan in the Framingham Heart Study: the top 10 SNPs based on the conditional power estimates. After adjusting for selecting 10 comparisons/SNPs, the P-values for SNP SNP_A – 1669246 and SNP_A – ???????? achieve genome-wide significance. SNP SNP_A – 1669246 was previously identified in Herbert et al. (2006).


4 DISCUSSION/CONCLUSION

In the search for genes for complex diseases, genome-wide association studies are increasingly replacing standard linkage studies. For complex diseases with their numerous disease-related phenotypes, the analysis of such studies is cumbersome, error-prone and computationally intensive. In order to translate the wealth of information into the successful identification of novel genes (Herbert et al., 2006), powerful and user-friendly analysis tools are needed. With P2BAT, we have developed such a tool based on the original PBAT program. P2BAT is a software package for the analysis of family-based association studies that is embedded in the R environment, provides a user-friendly GUI and allows genome-wide association studies to be analyzed massively parallel on a cluster, reducing the analysis time for 100 000 SNPs and more to a couple of minutes.

The authors thank the participants of the FHS for their contribution and the NHLBI-FHS investigators for providing DNA samples and phenotypic data for our analysis. The authors would also like to thank the reviewers for their suggestions. Funding was provided in part by grant MH17119. Funding to pay the Open Access publication charges for this article was provided by the Department of Biostatistics, Harvard School of Public Health.

Conflict of Interest: none declared.

REFERENCES

Carlson,C.S. (2006) Agnosticism and equity in genome-wide association studies. Nat. Genet., 38, 605-606.

Herbert,A. et al. (2006) A common genetic variant is associated with adult and childhood obesity. Science, 312, 279-283.

Hirschhorn,J.N. and Daly,M.J. (2005) Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet., 6, 95-108.

Jiang,H. et al. (2006) Family-based association test for time-to-onset data with time-dependent differences between the hazard functions. Genet. Epidemiol., 30, 124-132.

Kachergus,J. et al. (2005) Identification of a novel LRRK2 mutation linked to autosomal dominant parkinsonism: evidence of a common founder across European populations. Am. J. Hum. Genet., 76, 672-680.

Klein,R.J. et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science, 308, 385-389.

Laird,N.M. and Lange,C. (2006) Family-based designs in the age of large-scale gene-association studies. Nat. Rev. Genet., 7, 385-394.

Lange,C. et al. (2003) Using the noninformative families in family-based association tests: a powerful new testing strategy. Am. J. Hum. Genet., 73, 801-811.

Lange,C. et al. (2004a) A family-based association test for repeatedly measured quantitative traits adjusting for unknown environmental and/or polygenic effects. Stat. Appl. Genet. Mol. Biol., 3, Article17.

Lange,C. et al. (2004b) Family-based association tests for survival and times-to-onset analysis. Stat. Med., 23, 179-189.

Roeder,K. et al. (2005) Analysis of single-locus tests to detect gene/disease associations. Genet. Epidemiol., 28, 207-219.

R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Thomas,D. et al. (2004) Two-Stage sampling designs for gene association studies. Genet. Epidemiol., 27, 401-414.

Van Steen,K. et al. (2005) Genomic screening and replication using the same data set in family-based association testing. Nat. Genet., 37, 683-691.

Verzilli,C.J. et al. (2006) Bayesian graphical models for genomewide association studies. Am. J. Hum. Genet., 79, 100-112.

Author notes

Associate Editor: Keith A Crandall

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.