ONETOOL for the analysis of family-based big data

Song, Yeunjoo E; Lee, Sungyoung; Park, Kyungtaek; Elston, Robert C; Yang, Hyeon-Jong; Won, Sungho

doi:10.1093/bioinformatics/bty180

Abstract

Motivation

Despite the need for separate tools to analyze family-based data, there are only a handful of tools optimized for family-based big data compared to the number of tools available for analyzing population-based data.

Results

ONETOOL implements the properties of well-known existing family data analysis tools and recently developed methods in a computationally efficient manner, and so is suitable for analyzing the vast amount of variant data available from sequencing family members, providing a rich choice of analysis methods for big data on families.

Availability and implementation

ONETOOL is freely available from http://healthstat.snu.ac.kr/software/onetool/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The importance of family-based designs has been repeatedly stressed for analyses with sequence data because of the genetic homogeneity between family members (Bailey-Wilson and Wilson, 2011; Wijsman, 2012). Family study designs provide not only the enrichment of genetic loci containing rare variants, but also methods to control for genetic heterogeneity and population stratification.

Family-based data have different properties from population-based data owing to the genetic relatedness among family members and Mendelian transmission. These well-known features have allowed family-based designs to play a key role in the history of genetic analysis, but they also limit the use of many available tools designed for the analysis of population-based data. Despite the need for separate tools to analyze family-based data, only a handful of tools are available for such data, especially for the big data from sequencing, which are now the most readily available form of data for genetic and genomic analyses.

As with population-based data analysis tools, the most common family-based big data analysis tools aim to filter, rank, quality-control (QC) or annotate variants (Supplementary Table 1). These tools are optimized for the vast amount of variant data from sequencing, but they lack the essential analyses needed to test and validly infer the relationship between traits of interest and common and/or rare variants. Users therefore have to run several separate tools. Moreover, although a handful of family-based imputation tools exist, there is a clear lack of family-based association analysis tools that can take dosage files directly as input.

Many well-known family data analysis tools are available from the era of linkage analysis and genome-wide association studies (GWAS). Among these, S.A.G.E. 6.4 (http://darwin.cwru.edu/sage/) and Merlin (Abecasis et al., 2002) are still widely used. PLINK (Purcell et al., 2007) is one of the most widely used tools for GWAS. Because it is part of many standard pipelines, the PLINK input file formats have become the standard formats for many sequence data analysis tools. However, since it is designed mainly for population-based case-control data, its analysis options are very limited for family data.

All three tools have their pros and cons. In this work, we introduce a novel comprehensive tool that combines the good features of these existing tools with many newly developed family-based association analysis methods and a novel feature for analyzing dosage data. It provides, in a computationally efficient manner, a rich choice of analysis options for the vast amount of variant data coming from the sequencing of families. This convenient, time-saving design enables a researcher to perform many family-based genetic and genomic analyses with one tool, i.e. ONETOOL, instead of hopping among many different tools to complete a family data analysis project.

2 Features

ONETOOL provides four main analyses: informatics and quality control (InfoQC), trait analysis, linkage analysis and association analysis with both genotype data and dosage data (Table 1; see Supplementary Table 2 for details).

Table 1.

Summary of available family-based analyses in ONETOOL

Main	Sub-category	Detail
InfoQC analysis	Variant information	F_ST, Ts/Tv ratio, MAF, HWE, PCA
	Sample information	Het, Het/Hom
	Pedigree information	Description and summary, plot, relative pairs
	Error detection	Mendelian error
	Relatedness matrix	Kinship, IBS, GRM
Trait analysis	Familial aggregation	Correlation
	Heritability	Based on Kinship, IBS, GRM
	Segregation analysis	Mode of inheritance
Linkage analysis	Model-based	Two-point, utilizing segregation analysis
	Model-free	Multipoint, modeling LD
Association analysis	Single variant	Score test, TDT/SDT, MQLS, FQLS, EQLS, GEMMA
	Gene-based	Collapsing, PEDCMC, FAMVT, FARVAT, FBSKAT, PEDGENE, RVTDT
	Genotype probability & dosage data	Score test, EQLS, GEMMA, CMC, PEDCMC, FAMVT, FARVAT, PEDGENE

2.1 InfoQC, trait and linkage analysis

Family data require additional error checking and filtering that take the family structure into account, so that the family structures are maintained. ONETOOL provides proper methods to deal with this complexity of family data, together with the downstream analyses, in one integrated tool. Moreover, ONETOOL’s options for variant-wise InfoQC and filtering are similar to those in PLINK, but they are implemented in a computationally optimized way that provides more speed and efficiency. It also visualizes family data, using the R package kinship2 to generate a pedigree plot (Sinnwell et al., 2014).
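
As a minimal illustration of the kind of family-aware error check described above, the following Python sketch flags a Mendelian inconsistency in a parent-offspring trio at one biallelic autosomal variant. It is not ONETOOL's implementation: the function names and the 0/1/2 genotype coding (count of alternate alleles) are assumptions made for this example.

```python
from itertools import product


def transmittable_alleles(genotype):
    """Return the alleles (0 = reference, 1 = alternate) a parent can pass on.

    `genotype` is the alternate-allele count (0, 1 or 2); this coding is an
    assumption of the sketch, not a format defined by ONETOOL.
    """
    if genotype not in (0, 1, 2):
        raise ValueError(f"genotype must be 0, 1 or 2, got {genotype!r}")
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_error(father, mother, child):
    """Return True if the child's genotype cannot arise from the parents."""
    if child not in (0, 1, 2):
        raise ValueError(f"genotype must be 0, 1 or 2, got {child!r}")
    # Every child genotype obtainable from one paternal and one maternal allele.
    possible = {
        paternal + maternal
        for paternal, maternal in product(
            transmittable_alleles(father), transmittable_alleles(mother)
        )
    }
    return child not in possible


if __name__ == "__main__":
    # Both parents homozygous reference, child heterozygous: inconsistent.
    print(is_mendelian_error(0, 0, 1))
    # Two heterozygous parents can produce any genotype: consistent.
    print(is_mendelian_error(1, 1, 2))
```

A real pipeline would also have to handle missing genotypes, sex chromosomes and larger pedigrees, which this sketch deliberately leaves out.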

As shown in Supplementary Table 1, few tools are available for trait analysis, and few are optimized to work with current pipelines for family-based big data. ONETOOL fills this gap by integrating tools for familial aggregation or correlation, narrow-sense heritability estimation and segregation analysis.
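
Heritability estimation from pedigrees relies on a relatedness matrix such as the kinship matrix listed in Table 1. The sketch below shows the standard recursive computation of kinship coefficients from a pedigree. It only illustrates the concept and does not reproduce ONETOOL's code; the input layout (tuples of individual, father, mother, with parents listed before their children) is an assumption of this example.

```python
def kinship_matrix(pedigree):
    """Compute pairwise kinship coefficients for one pedigree.

    `pedigree` is a list of (individual, father, mother) tuples. Founders have
    father = mother = None, and every parent must be listed before any of
    their children (both are assumptions of this sketch).
    Returns (index, phi), where index maps an individual to its row in phi.
    """
    index = {person: row for row, (person, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    phi = [[0.0] * n for _ in range(n)]

    for j, (person, father, mother) in enumerate(pedigree):
        if father is None and mother is None:
            # Founder: assumed non-inbred and unrelated to earlier individuals.
            phi[j][j] = 0.5
            continue
        if father is None or mother is None:
            raise ValueError(f"{person}: this sketch needs both parents or neither")
        f, m = index[father], index[mother]
        if f >= j or m >= j:
            raise ValueError(f"{person}: parents must be listed before children")
        for i in range(j):
            # Kinship with an earlier individual averages over the two parents.
            phi[i][j] = phi[j][i] = 0.5 * (phi[i][f] + phi[i][m])
        # Self-kinship grows with the kinship between the parents (inbreeding).
        phi[j][j] = 0.5 * (1.0 + phi[f][m])
    return index, phi


if __name__ == "__main__":
    nuclear_family = [
        ("father", None, None),
        ("mother", None, None),
        ("child1", "father", "mother"),
        ("child2", "father", "mother"),
    ]
    index, phi = kinship_matrix(nuclear_family)
    # Kinship between the two full siblings.
    print(phi[index["child1"]][index["child2"]])
```

Twice the kinship coefficient gives the expected proportion of the genome shared identical by descent, which is the quantity that kinship-based variance component models of heritability typically use.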

With ONETOOL, both types of linkage analyses, model-based linkage and model-free linkage accounting for linkage disequilibrium, can be done directly with the current genomic data files.

2.2 Association analysis

The family-based association analysis that performs best in terms of both power and type 1 error depends on the type of trait data (binary or continuous), family data (random or ascertained, trio or general) and variant data (common or rare) at hand. A complex disease data analysis project often involves not just one phenotype but a set of multiple phenotypes with different characteristics. It also involves analyzing a set of different types of genetic data.

By combining many different types of association methods developed for specific cases into an integrated tool with a common interface, ONETOOL enables more seamless and harmonized family-based association analyses. In Supplementary Tables 3 and 4, we summarize when each family-based association test available in ONETOOL is appropriate.
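
As one concrete instance of a method suited to a specific design, the transmission disequilibrium test (TDT) listed in Table 1 applies to trios with an affected offspring. The sketch below computes the classical TDT statistic from the numbers of times heterozygous parents transmitted or did not transmit a given allele. It is a textbook formulation under stated assumptions, not ONETOOL's implementation, and the function and argument names are hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """Classical TDT for one biallelic variant.

    `transmitted` / `untransmitted` count how often heterozygous parents did
    or did not pass the allele of interest to an affected child.
    Returns (chi-square statistic, asymptotic p-value with 1 degree of freedom).
    """
    informative = transmitted + untransmitted
    if transmitted < 0 or untransmitted < 0 or informative == 0:
        raise ValueError("need non-negative counts and at least one transmission")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    statistic, p_value = tdt(transmitted=60, untransmitted=40)
    print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

The asymptotic p-value is only reliable when the number of informative transmissions is reasonably large; with few transmissions an exact binomial test would be the usual alternative.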

2.3 Imputation and dosage data

ONETOOL provides an option to impute missing genotypes. Missing genotypes for typed variants are imputed as expected genotypes based on the familial relationships. If phenotypes are available for subjects with missing genotypes, including genotypes imputed from family members’ genotypes can improve statistical power (see Supplementary Material for details).
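
The exact imputation procedure is given in the Supplementary Material and is not reproduced here. Purely to illustrate what an expected genotype given relatives can look like, the sketch below computes the expected alternate-allele count of an offspring from the parental genotypes under Mendelian transmission. The function name, the 0/1/2 coding and the fallback to a population allele frequency for an untyped parent (which assumes random mating) are assumptions of this example.

```python
def expected_offspring_genotype(father, mother, alt_allele_freq):
    """Expected alternate-allele count (0 to 2) of a child.

    `father` and `mother` are alternate-allele counts (0, 1 or 2), or None if
    the parent is untyped. An untyped parent is assumed to transmit the
    alternate allele with probability `alt_allele_freq`.
    """
    if not 0.0 <= alt_allele_freq <= 1.0:
        raise ValueError("alt_allele_freq must lie in [0, 1]")

    def expected_transmission(genotype):
        if genotype is None:
            return alt_allele_freq
        if genotype not in (0, 1, 2):
            raise ValueError(f"genotype must be 0, 1, 2 or None, got {genotype!r}")
        # A parent transmits each of its two alleles with probability 1/2.
        return genotype / 2.0

    return expected_transmission(father) + expected_transmission(mother)


if __name__ == "__main__":
    # Homozygous reference father, homozygous alternate mother.
    print(expected_offspring_genotype(0, 2, alt_allele_freq=0.3))
    # Father untyped, mother homozygous alternate.
    print(expected_offspring_genotype(None, 2, alt_allele_freq=0.3))
```

The result is a fractional genotype, which is why downstream tests must accept non-integer values.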

ONETOOL also enables family-based association analysis with dosage data and genotype probabilities. See Supplementary Table 5 for the supported dosage and genotype probability file formats from several popular imputation tools.
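
For readers unfamiliar with the two data types, the sketch below shows the usual relationship between them: a dosage is the expected alternate-allele count implied by the three genotype probabilities. The function name, argument order and tolerance are assumptions of this example; the actual file layouts differ between imputation tools and are not modeled here.

```python
def dosage_from_probabilities(p_hom_ref, p_het, p_hom_alt, tolerance=1e-3):
    """Convert genotype probabilities into an alternate-allele dosage.

    The three probabilities must be non-negative and sum to 1 within
    `tolerance` (imputation output is often rounded to a few decimals).
    """
    probabilities = (p_hom_ref, p_het, p_hom_alt)
    if any(p < 0.0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > tolerance:
        raise ValueError("probabilities must sum to 1")
    # Expected allele count: 0 * P(hom ref) + 1 * P(het) + 2 * P(hom alt).
    return p_het + 2.0 * p_hom_alt


if __name__ == "__main__":
    print(dosage_from_probabilities(0.1, 0.3, 0.6))
```

A dosage keeps the uncertainty of the imputed genotype instead of forcing a hard call, which is the reason association tests that accept dosages directly are useful.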

3 Discussion

ONETOOL enables a researcher to perform many family-based genetic and genomic analyses in a computationally efficient manner. It offers convenience and time savings with a rich choice of analysis options, both existing and novel. ONETOOL supports various types of input data files, including dosage and genotype probability files from several imputation tools. Using two different family datasets, we show in Table 2 the performance of ONETOOL and the time saved by running several analyses at once compared with running each component separately (see Supplementary Material for details).

Table 2.

Efficiency of the integrated analyses in ONETOOL

Analyses	Run type	Data1	Data2
InfoQC+Trait	separate run	2.21s	55.83s
InfoQC+Trait	ONETOOL	0.74s	44.09s
InfoQC+Trait+single-variant	separate run	2.41s	58.74s
InfoQC+Trait+single-variant	ONETOOL	0.83s	54.09s
InfoQC+Trait+gene-based	separate run	2.47s	59.20s
InfoQC+Trait+gene-based	ONETOOL	0.85s	54.76s
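
To make the comparison in Table 2 easier to read, the short sketch below derives the time saved and the speed-up ratio for each row. The run times are transcribed from Table 2; the variable and function names are introduced only for this example.

```python
# Run times in seconds from Table 2: (separate run, integrated ONETOOL run).
TIMES = {
    "InfoQC+Trait": {"Data1": (2.21, 0.74), "Data2": (55.83, 44.09)},
    "InfoQC+Trait+single-variant": {"Data1": (2.41, 0.83), "Data2": (58.74, 54.09)},
    "InfoQC+Trait+gene-based": {"Data1": (2.47, 0.85), "Data2": (59.20, 54.76)},
}


def summarize(times):
    """Return (analysis, dataset, seconds saved, speed-up ratio) per entry."""
    rows = []
    for analysis, datasets in times.items():
        for dataset, (separate, integrated) in datasets.items():
            rows.append(
                (analysis, dataset, separate - integrated, separate / integrated)
            )
    return rows


if __name__ == "__main__":
    for analysis, dataset, saved, ratio in summarize(TIMES):
        print(f"{analysis:<28} {dataset}: saved {saved:5.2f} s, ratio {ratio:4.2f}")
```

On these figures the relative gain is larger for Data1 than for Data2, while the absolute time saved is larger for Data2.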

Funding

This work was supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (Grant no. HC15C1302); and the Bio-Synergy Research Project (NRF-2017M3A9C4065964) of the Ministry of Science, ICT and Future Planning through the National Research Foundation.

Conflict of Interest: none declared.

References

Abecasis G.R. et al. (2002) Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.

Bailey-Wilson J.E., Wilson A.F. (2011) Linkage analysis in the next generation sequencing era. Hum. Hered., 72, 228–236.

Purcell S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.

Sinnwell J.P. et al. (2014) The kinship2 R package for pedigree data. Hum. Hered., 78, 91–93.

Wijsman E.M. (2012) The role of large pedigrees in an era of high-throughput sequencing. Hum. Genet., 131, 1555–1563.

Author notes

The authors wish it to be known that, in their opinion, Yeunjoo E. Song and Sungyoung Lee should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
