Abstract

Motivation

Despite the need for separate tools to analyze family-based data, there are only a handful of tools optimized for family-based big data compared to the number of tools available for analyzing population-based data.

Results

ONETOOL implements the properties of well-known existing family data analysis tools and recently developed methods in a computationally efficient manner, and so is suitable for analyzing the vast amount of variant data available from sequencing family members, providing a rich choice of analysis methods for big data on families.

Availability and implementation

ONETOOL is freely available from http://healthstat.snu.ac.kr/software/onetool/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The importance of family-based designs has been repeatedly stressed for analyses with sequence data because of the genetic homogeneity between family members (Bailey-Wilson and Wilson, 2011; Wijsman, 2012). Family study designs provide not only the enrichment of genetic loci containing rare variants, but also methods to control for genetic heterogeneity and population stratification.

Family-based data have different properties from population-based data owing to the genetic relatedness among family members and Mendelian transmission. These well-known features have allowed family-based designs to play a key role in the history of genetic analysis, but they also limit the use of many available tools designed for the analysis of population-based data. Despite the need for separate tools to analyze family-based data, there is only a handful of tools available for family-based data, especially the big data from sequencing which comprise the readily available form of data for genetic and genomic analyses these days.

As for the population-based data analysis tools, the most common family-based big data analysis tools aim to filter/rank/QC/annotate variants (Supplementary Table 1). These tools are optimized to be used with the vast amount of variant data from sequencing but lack the choices of essential analyses needed to test and infer the valid results regarding the relation between traits of interest against the common and/or rare variants. Therefore, users need to use the separate tools individually. And, though there exists a handful of family-based imputation tools, there is a clear lack of family-based association analysis tools that can analyze the dosage files directly as an input.

There are many very well-known family data analysis tools available from the era of linkage analysis and genome-wide association studies (GWAS). Among these, S.A.G.E. 6.4 (http://darwin.cwru.edu/sage/) and Merlin (Abecasis et al., 2002) are still used a lot by many researchers. PLINK (Purcell et al., 2007) is one of most popularly used tool for GWAS. It is a part of many standard pipeline tools, therefore, the PLINK input file formats are the standard format for many sequence data analysis tools. However, since it is designed mainly for population-based case-control data, the analysis options are very limited for use with family data.

All three tools have their pros and cons. In this work, we introduce a novel comprehensive tool that combines the good features of these existing tools and many newly developed family-based association analysis methods along with a novel feature to analyze the dosage data, providing in a computationally efficient manner a rich choice of analysis options to use for the vast amount of variant data coming from the sequencing of families. This provides a convenience and time-saver that enables a researcher to perform many of the family-based genetic and genomic analyses using one tool, i.e. ONETOOL, instead of hopping among many different tools to accomplish a family data analysis project.

2 Features

ONETOOL provides four main analyses: informatics and quality control (InfoQC), trait analysis, linkage analysis and association analysis with both genotype data and dosage data (in Table 1 and also see the Supplementary Table 2 for details).

Table 1.

Summary of available family-based analyses in ONETOOL

MainSub-categoryDetail
InfoQC analysisVariant informationFST, Ts/Tv ratio, MAF, HWE, PCA
Sample informationHet, Het/Hom
Pedigree informationDescription and summary, plot, relative pairs
Error detectionMendelian error
Relatedness matrixKinship, IBS, GRM
Trait analysisFamilial aggregationCorrelation
HeritabilityBased on Kinship, IBS, GRM
Segregation analysisMode of inheritance
Linkage analysisModel-basedTwo-point, utilizing segregation analysis
Model-freeMultipoint, modeling LD
Association analysisSingle variantScore test, TDT/SDT, MQLS, FQLS, EQLS, GEMMA
Gene-basedCollapsing, PEDCMC, FAMVT, FARVAT, FBSKAT, PEDGENE, RVTDT
Genotype probability & dosage dataScoretest, EQLS, GEMMA, CMC, PEDCMC, FAMVT, FARVAT, PEDGENE
MainSub-categoryDetail
InfoQC analysisVariant informationFST, Ts/Tv ratio, MAF, HWE, PCA
Sample informationHet, Het/Hom
Pedigree informationDescription and summary, plot, relative pairs
Error detectionMendelian error
Relatedness matrixKinship, IBS, GRM
Trait analysisFamilial aggregationCorrelation
HeritabilityBased on Kinship, IBS, GRM
Segregation analysisMode of inheritance
Linkage analysisModel-basedTwo-point, utilizing segregation analysis
Model-freeMultipoint, modeling LD
Association analysisSingle variantScore test, TDT/SDT, MQLS, FQLS, EQLS, GEMMA
Gene-basedCollapsing, PEDCMC, FAMVT, FARVAT, FBSKAT, PEDGENE, RVTDT
Genotype probability & dosage dataScoretest, EQLS, GEMMA, CMC, PEDCMC, FAMVT, FARVAT, PEDGENE
Table 1.

Summary of available family-based analyses in ONETOOL

MainSub-categoryDetail
InfoQC analysisVariant informationFST, Ts/Tv ratio, MAF, HWE, PCA
Sample informationHet, Het/Hom
Pedigree informationDescription and summary, plot, relative pairs
Error detectionMendelian error
Relatedness matrixKinship, IBS, GRM
Trait analysisFamilial aggregationCorrelation
HeritabilityBased on Kinship, IBS, GRM
Segregation analysisMode of inheritance
Linkage analysisModel-basedTwo-point, utilizing segregation analysis
Model-freeMultipoint, modeling LD
Association analysisSingle variantScore test, TDT/SDT, MQLS, FQLS, EQLS, GEMMA
Gene-basedCollapsing, PEDCMC, FAMVT, FARVAT, FBSKAT, PEDGENE, RVTDT
Genotype probability & dosage dataScoretest, EQLS, GEMMA, CMC, PEDCMC, FAMVT, FARVAT, PEDGENE
MainSub-categoryDetail
InfoQC analysisVariant informationFST, Ts/Tv ratio, MAF, HWE, PCA
Sample informationHet, Het/Hom
Pedigree informationDescription and summary, plot, relative pairs
Error detectionMendelian error
Relatedness matrixKinship, IBS, GRM
Trait analysisFamilial aggregationCorrelation
HeritabilityBased on Kinship, IBS, GRM
Segregation analysisMode of inheritance
Linkage analysisModel-basedTwo-point, utilizing segregation analysis
Model-freeMultipoint, modeling LD
Association analysisSingle variantScore test, TDT/SDT, MQLS, FQLS, EQLS, GEMMA
Gene-basedCollapsing, PEDCMC, FAMVT, FARVAT, FBSKAT, PEDGENE, RVTDT
Genotype probability & dosage dataScoretest, EQLS, GEMMA, CMC, PEDCMC, FAMVT, FARVAT, PEDGENE

2.1 InfoQC, trait and linkage analysis

Family data requires additional error checking and filtering that also consider the family structure, so that the family structures are maintained. ONETOOL provides the proper methods to deal this complexity of family data and the downstream analyses as an integrated tool. Moreover, ONETOOL’s options for the variant-wise InfoQC and filtering are similar with those in Plink, but they are implemented in a computationally optimized way providing more speed and efficiency. It also provides visualization of family data as done utilizing the R package kinship2 to generate a plot (Sinnwell et al., 2014).

As shown in Supplementary Table 1, not many tools are available for trait analysis nor are optimized to work with the current pipeline of family big data. ONETOOL fulfills this gap by integrating the tools for familial aggregation or correlation, narrow-sense heritability estimation and segregation analysis.

With ONETOOL, both types of linkage analyses, model-based linkage and model-free linkage accounting for linkage disequilibrium, can be done directly with the current genomic data files.

2.2 Association analysis

Depending on the types of trait data (binary or continuous), family data (random or ascertained, trio or general) and variant data (common or rare) in hand, the different family-based association analyses provide the best estimates in terms of both power and type 1 error. Many times, a complex disease data analysis project involves not just a phenotype but a set of multiple phenotypes with different characteristics. It also involves analyzing a set of different types of genetic data.

By combining many different types of association methods developed for specific cases into an integrated tool with a common interface, ONETOOL enables more seemingly harmonized family-based association analyses. In Supplementary Tables 3 and 4, we summarized the proper timing to use for the family-based association test available in ONETOOL.

2.3 Imputation and dosage data

ONETOOL provides an option to impute the missing genotypes. Expected missing genotypes for typed variants are imputed based on the familial relationship, and if phenotypes of any subjects with missing genotypes are available, genotypes imputed with family members’ genotypes can improve statistical power (see Supplementary Material for details).

ONETOOL also enables the family-based association analysis with dosage data and genotype probability. See the Supplementary Table 5 for the supported dosage and genotype probability file formats from several popular imputation tools.

3 Discussion

ONETOOL enables a researcher to perform many of the family-based genetic and genomic analyses in a computationally efficient manner. It provides a convenience and time-saver with a rich choice of analysis options available, both existing and novel. ONETOOL supports various types of data input files includes the dosage and genotype probability files from several imputation tools. Using two different family datasets, we show, in Table 2, the performance of ONETOOL and the time savings by running several analyses at once compare to the separate run for each component (see Supplementary Material for details).

Table 2.

Efficiency of the integrated analyses in ONETOOL

AnalysesRun typeData1Data2
InfoQC+Traitseparate run2.21s55.83s
ONETOOL0.74s44.09s
InfoQC+Trait+single-variantseparate run2.41s58.74s
ONETOOL0.83s54.09s
InfoQC+Trait+gene-basedseparate run2.47s59.20s
ONETOOL0.85s54.76s
AnalysesRun typeData1Data2
InfoQC+Traitseparate run2.21s55.83s
ONETOOL0.74s44.09s
InfoQC+Trait+single-variantseparate run2.41s58.74s
ONETOOL0.83s54.09s
InfoQC+Trait+gene-basedseparate run2.47s59.20s
ONETOOL0.85s54.76s
Table 2.

Efficiency of the integrated analyses in ONETOOL

AnalysesRun typeData1Data2
InfoQC+Traitseparate run2.21s55.83s
ONETOOL0.74s44.09s
InfoQC+Trait+single-variantseparate run2.41s58.74s
ONETOOL0.83s54.09s
InfoQC+Trait+gene-basedseparate run2.47s59.20s
ONETOOL0.85s54.76s
AnalysesRun typeData1Data2
InfoQC+Traitseparate run2.21s55.83s
ONETOOL0.74s44.09s
InfoQC+Trait+single-variantseparate run2.41s58.74s
ONETOOL0.83s54.09s
InfoQC+Trait+gene-basedseparate run2.47s59.20s
ONETOOL0.85s54.76s

Funding

This work was supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (Grant no. HC15C1302); and the Bio-Synergy Research Project (NRF-2017M3A9C4065964) of the Ministry of Science, ICT and Future Planning through the National Research Foundation.

Conflict of Interest: none declared.

References

Abecasis
 
G.R.
 et al.  (
2002
)
Merlin-rapid analysis of dense genetic maps using sparse gene flow tree
.
Nat. Genet
.,
30
,
97
101
.

Bailey-Wilson
 
J.E.
,
Wilson
A.F.
(
2011
)
Linkage analysis in the next generation sequencing era
.
Hum. Hered
.,
72
,
228
236
.

Purcell
 
S.
 et al.  (
2007
)
PLINK: a tool set for whole-genome association and population-based linkage analyses
.
Am. J. Hum. Genet
.,
81
,
559
575
.

Sinnwell
 
J.P.
 et al.  (
2014
)
The kinship2 R package for pedigree data
.
Hum. Hered
.,
78
,
91
93
.

Wijsman
 
E.M.
(
2012
)
The role of large pedigrees in an era of high-throughput sequencing
.
Hum. Genet
.,
131
,
1555
1563
.

Author notes

The authors wish it to be known that, in their opinion, Yeunjoo E. Song and Sungyoung Lee authors should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Associate Editor: Oliver Stegle
Oliver Stegle
Associate Editor
Search for other works by this author on:

Supplementary data