Thomas Hoffmann, Christoph Lange, P2BAT: a massive parallel implementation of PBAT for genome-wide association studies in R, Bioinformatics, Volume 22, Issue 24, 15 December 2006, Pages 3103–3105, https://doi.org/10.1093/bioinformatics/btl507
Abstract
Summary: The software tool P2BAT provides a massively parallel and user-friendly implementation of the PBAT-analysis tools for family-based association tests (FBATs) in large-scale studies, including genome-wide association studies with several thousand subjects. Built on the original PBAT-implementation of the Lange–Van Steen algorithm to bypass the multiple testing problem in family-based association studies, P2BAT integrates all PBAT-analysis tools for binary and complex traits into R and makes them accessible through a user-friendly GUI. The genome-wide analysis tools are fully automated and can be run massively parallel directly through the GUI. P2BAT is fully documented and contains graphical output tools for time-to-onset analysis. P2BAT also features the ability to test for gene and environment/drug interaction.
Availability: The P2BAT package is available as the R package ‘pbatR’ which can be downloaded from Author Webpage. The PBAT-software is available at Author Webpage.
Contact: thoffman@hsph.harvard.edu
1 INTRODUCTION
The era of genome-wide association studies has finally started (Herbert et al., 2006; Kachergus et al., 2005; Klein et al., 2005), offering a unique chance to identify genes for complex traits through an unbiased search at a genome-wide level. The initial fear was that the new wealth of genomic data could not be translated into increased statistical power to detect new genes, but would be diluted by the multiple comparisons problem. This concern about the major statistical roadblock in such studies now seems to be fading as new methodology emerges. For studies of unrelated individuals, several statistical approaches have been suggested (Hirschhorn and Daly, 2005; Thomas et al., 2004; Roeder et al., 2005; Verzilli et al., 2006). For genome-wide association studies in family-based designs, Van Steen et al. (2005) proposed a novel testing strategy that bypasses the multiple testing problem within one study and thereby reduces the impact of study heterogeneity. The approach has successfully been applied to a 100 K-scan in a family-sample of the Framingham Heart Study (Herbert et al., 2006; Lange et al., 2003; Laird and Lange, 2006), which is, to date, the only successful genome-wide association study to reveal a novel, replicable candidate gene for obesity.
However, so far, no software implementation for genome-wide association studies exists that can analyze the vast amount of information produced by such studies in a user-friendly way and that runs massively parallel on clusters, minimizing the analysis time to a couple of minutes. With P2BAT, we have developed such a software tool. P2BAT implements all the analysis features of PBAT in R (R Development Core Team, 2005) and makes them accessible through a user-friendly GUI. Further, without requiring any additional effort by the user, P2BAT allows one to run the analysis massively parallel with as many parallel jobs as specified (Fig. 1). The parallelization in P2BAT is achieved by running multiple instances of the original PBAT program, using the queuing system of a cluster. The process is fully automated and monitored by P2BAT. The package P2BAT is available as the R package ‘pbatR’ in conjunction with the software PBAT. The two software packages can be downloaded from Author Webpage and from Author Webpage, respectively. Detailed instructions are available on the webpage Author Webpage.
2 USAGE/DATA FORMAT
P2BAT can be run in both a command line version and a graphical interface version. In both cases, the data must be supplied as a pedigree file and a phenotype file. The first line of the pedigree file contains the names of the markers. Each subsequent line corresponds to one individual and lists the pedigree id, subject id, father id, mother id, gender, affection status and each pair of marker alleles, all separated by spaces. Missing data here is encoded with a ‘0’. Except for the marker names, the pedigree file may not contain any non-numeric characters. The first line of the phenotype file lists the names of all the traits in the phenotype file. Each subsequent line corresponds to one individual and lists the pedigree id, subject id and the values of each trait, all separated by spaces. In contrast to the pedigree file, a hyphen ‘-’ must be used here to indicate missing data.
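The two file layouts described above can be illustrated with a minimal Python sketch. It is not part of P2BAT or pbatR; the function names, the sample marker and trait names, and all data values are hypothetical, and treating a genotype as missing when either allele is ‘0’ is an assumption made here for illustration.

```python
# Hypothetical sample data in the layouts described in the text.
PED_TEXT = """\
m1 m2
1 1 0 0 1 1 1 2 2 2
1 2 0 0 2 1 1 1 0 0
1 3 1 2 1 2 1 2 2 2
"""

PHE_TEXT = """\
bmi age
1 1 24.5 51
1 2 - 49
1 3 27.1 22
"""


def parse_pedigree(text):
    """Parse pedigree text: a header of marker names, then one line per subject."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    markers, subjects = rows[0], []
    for fields in rows[1:]:
        alleles = fields[6:]  # everything after the six fixed columns
        if len(alleles) != 2 * len(markers):
            raise ValueError(f"expected {2 * len(markers)} alleles, got {len(alleles)}")
        genotypes = {}
        for i, marker in enumerate(markers):
            pair = (alleles[2 * i], alleles[2 * i + 1])
            # Assumption: a genotype counts as missing if either allele is '0'.
            genotypes[marker] = None if "0" in pair else pair
        subjects.append({
            "pedigree": fields[0], "subject": fields[1],
            "father": fields[2], "mother": fields[3],
            "gender": fields[4], "affection": fields[5],
            "genotypes": genotypes,
        })
    return markers, subjects


def parse_phenotypes(text):
    """Parse phenotype text: a header of trait names, then one line per subject."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    traits, table = rows[0], {}
    for fields in rows[1:]:
        values = fields[2:]  # everything after pedigree id and subject id
        if len(values) != len(traits):
            raise ValueError(f"expected {len(traits)} trait values, got {len(values)}")
        # A lone hyphen marks a missing trait value; negative numbers still parse.
        table[(fields[0], fields[1])] = {
            trait: None if value == "-" else float(value)
            for trait, value in zip(traits, values)
        }
    return traits, table


markers, subjects = parse_pedigree(PED_TEXT)
traits, phenotypes = parse_phenotypes(PHE_TEXT)
print(subjects[1]["genotypes"])
print(phenotypes[("1", "2")])
```

Note the two different missing-data codes: ‘0’ in the pedigree file and ‘-’ in the phenotype file.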
2.1 Graphical user interface
The main window of the analysis portion of the graphical interface is started with the command pbat() and is shown in Figure 1. Phenotypes, covariates, SNPs/haplotype blocks, stratification variables and other options can be selected from lists within the interface. For instance, for testing single traits one would select ‘gee’ for FBAT-GEE, for testing multiple traits simultaneously one would select ‘pc’ for FBAT-PC, and for testing time-to-onset traits one would select ‘logrank’ for FBAT-LOGRANK.
It is easy to take advantage of PBAT's parallel implementation. To use multiple processors or multiple cores on a single machine, one can choose the ‘multiple’ option and specify the number of jobs, for instance the number of cores on a single-processor machine. To spread PBAT out over a cluster, one can use the ‘cluster’ option and specify the number of nodes as the number of jobs. The cluster refresh time specifies the number of seconds to wait before checking whether the processes are done. If a refresh time of ‘0’ is specified, the jobs will be submitted and pbatR will not wait for the output; in that case, additional command line commands can be used to paste the output together at a later time.
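The general idea of dividing markers among jobs and polling for completion can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not describe how PBAT actually partitions the SNPs, so the contiguous, near-equal split is an assumption, and the names `split_markers` and `wait_for_outputs` are hypothetical rather than pbatR functions.

```python
import time


def split_markers(n_markers, n_jobs):
    """Return half-open (start, stop) index ranges, one per non-empty job."""
    if n_markers < 0 or n_jobs < 1:
        raise ValueError("need n_markers >= 0 and n_jobs >= 1")
    if n_markers == 0:
        return []
    n_jobs = min(n_jobs, n_markers)  # never create more jobs than markers
    base, extra = divmod(n_markers, n_jobs)
    ranges, start = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)  # first `extra` jobs get one more
        ranges.append((start, start + size))
        start += size
    return ranges


def wait_for_outputs(all_done, refresh_seconds, sleep=time.sleep):
    """Poll all_done() every refresh_seconds; with 0, return without waiting.

    Returns True if the outputs are known to be complete, False if the caller
    chose not to wait and must collect the results later.
    """
    if refresh_seconds == 0:
        return False
    while not all_done():
        sleep(refresh_seconds)
    return True


chunks = split_markers(100_000, 100)
print(len(chunks), chunks[0], chunks[-1])
```

With a refresh time of 0 the sketch returns immediately, mirroring the submit-and-collect-later behaviour described above.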
The power and sample size interface, partially shown in Figure 1, is started with the command pbat.power(). Options to calculate power are available for both binary and continuous traits, in both family-based and population-based studies. Additionally, options to calculate sample size are available for the population-based studies.
2.2 Command line interface
Finally, if we wanted to do a time-to-onset analysis with FBAT-LOGRANK (Lange et al., 2004b; Jiang et al., 2006) on all the SNPs, we would have ‘time & censor ~ c1’. Further examples are available in the documentation. This operation returns an object that works with the standard generic R functions, such as summary and plot (only time-to-onset analyses have plots). The time-to-onset plots, shown in Figure 2, follow the algorithm developed by Jiang et al. (2006). To configure multiple jobs from the command line, use pbat.setmode. Lastly, the power and sample size commands can also be used from the command line, with commands such as pbat.binaryFamily.
Fig. 2. The time-to-onset graph (Jiang et al., 2006) can be saved in the various graphical formats supported by R.
3 RESULTS
To assess the performance of P2BAT, we re-ran the analysis of the 100 K-scan in the Framingham Heart Study (Herbert et al., 2006; Laird and Lange, 2006), using the entire data set with 1400 probands. We analyzed BMI-measurements at the six exams of the study as longitudinal data in the FBAT-PC approach (Lange et al., 2004a). Running the analysis in parallel on a cluster with 50 dual-nodes (Xeon™ 3.2 GHz), P2BAT used 70 MB of memory (per node) and took 41 min to complete the analysis. The aggregated results from the program runs are shown in Figure 3. Since P2BAT is able to divide the analysis into as many parallel jobs as there are SNPs, the analysis could have been split up into 100 000 parallel jobs. Assuming that there are ∼8 million common SNPs (Carlson, 2006) and given constantly growing cluster sizes, even an analysis of all common SNPs, should the technology become available, will not face running time issues.
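The scaling argument above can be made concrete with a rough back-of-the-envelope calculation in Python. It rests on assumptions that the source does not state: that running time scales linearly with the number of SNPs and inversely with the number of jobs, and that the 50 dual-nodes ran 100 jobs in total. The function name `estimated_minutes` is hypothetical.

```python
def estimated_minutes(n_snps, n_jobs, ref_snps=100_000, ref_jobs=100, ref_minutes=41.0):
    """Extrapolate wall-clock minutes from the reported run.

    Assumptions (not from the source): perfectly linear scaling in the number
    of SNPs and in the number of jobs, and ref_jobs=100 for the 50 dual-nodes.
    """
    if n_snps < 0 or n_jobs < 1:
        raise ValueError("need n_snps >= 0 and n_jobs >= 1")
    return ref_minutes * (n_snps / ref_snps) * (ref_jobs / n_jobs)


# Projected time for ~8 million SNPs at several hypothetical cluster sizes.
for jobs in (100, 1_000, 8_000):
    print(jobs, "jobs:", round(estimated_minutes(8_000_000, jobs), 1), "min")
```

Under these idealized assumptions, the projected time for 8 million SNPs stays in the range of the reported run only if the number of parallel jobs grows along with the number of SNPs; real runs would also incur queuing and I/O overhead that this estimate ignores.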
Fig. 3. P2BAT-analysis results from a 100 K-scan in the Framingham Heart Study: the top 10 SNPs based on the conditional power estimates. After adjusting for the 10 selected comparisons/SNPs, the P-values for SNP_A-1669246 and SNP_A-???????? achieve genome-wide significance. SNP_A-1669246 was previously identified in Herbert et al. (2006).
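The two-stage logic behind these results (rank SNPs by conditional power, then test only the top 10 and adjust for 10 comparisons) can be sketched in Python as follows. This is a simplified illustration, not PBAT's implementation: the estimation of conditional power and the FBAT statistic themselves are not shown, the Bonferroni adjustment is an assumption (the source only says the P-values are adjusted for 10 comparisons), and all SNP names and numbers are hypothetical.

```python
def screen_and_test(records, top_k=10, alpha=0.05):
    """Select the top_k SNPs by conditional power and test only those.

    records: iterable of (snp_id, conditional_power, fbat_p_value).
    In a real analysis the P-value would be computed only for the selected
    SNPs; it is supplied for every record here to keep the sketch short.
    """
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    selected = ranked[:top_k]
    threshold = alpha / top_k  # Bonferroni adjustment for top_k comparisons (assumed)
    return [(snp, power, p, p < threshold) for snp, power, p in selected]


# Hypothetical input: snp_b has a tiny P-value but low estimated power,
# so it is never tested when only the top 2 SNPs are carried forward.
records = [
    ("snp_a", 0.90, 0.001),
    ("snp_b", 0.10, 0.00001),
    ("snp_c", 0.80, 0.04),
    ("snp_d", 0.70, 0.004),
]
results = screen_and_test(records, top_k=2)
for snp, power, p, significant in results:
    print(snp, power, p, significant)
```

Because only the selected SNPs are tested, the adjustment is for 10 comparisons rather than for all 100 000 SNPs in the scan.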
4 DISCUSSION/CONCLUSION
In the search for genes for complex diseases, genome-wide association studies are increasingly replacing standard linkage studies. For complex diseases with their numerous disease-related phenotypes, the analysis of such studies is cumbersome, error-prone and computationally intensive. In order to translate the wealth of information into the successful identification of novel genes (Herbert et al., 2006), powerful and user-friendly analysis tools are needed. With P2BAT, we have developed such a tool based on the original PBAT program. P2BAT is a software package for the analysis of family-based association studies that is embedded into the R environment, contains a user-friendly GUI and allows one to run the analysis of genome-wide association studies massively parallel on a cluster, reducing the analysis time of 100 000 SNPs and more to a couple of minutes.
The authors thank the participants of the FHS for their contribution and the NHLBI-FHS investigators for providing DNA samples and phenotypic data for our analysis. The authors would also like to thank the reviewers for their suggestions. Funding was provided in part by grant MH17119. Funding to pay the Open Access publication charges for this article was provided by the Department of Biostatistics, Harvard School of Public Health.
Conflict of Interest: none declared.
Author notes
Associate Editor: Keith A Crandall



