- Split View
-
Views
-
Cite
Cite
Péter Kómár, Deniz Kural, geck: trio-based comparative benchmarking of variant calls, Bioinformatics, Volume 34, Issue 20, October 2018, Pages 3488–3495, https://doi.org/10.1093/bioinformatics/bty415
- Share Icon Share
Abstract
Classical methods of comparing the accuracies of variant calling pipelines are based on truth sets of variants whose genotypes are previously determined with high confidence. An alternative way of performing benchmarking is based on Mendelian constraints between related individuals. Statistical analysis of Mendelian violations can provide truth set-independent benchmarking information, and enable benchmarking less-studied variants and diverse populations.
We introduce a statistical mixture model for comparing two variant calling pipelines from genotype data they produce after running on individual members of a trio. We determine the accuracy of our model by comparing the precision and recall of GATK Unified Genotyper and Haplotype Caller on the high-confidence SNPs of the NIST Ashkenazim trio and the two independent Platinum Genome trios. We show that our method is able to estimate differential precision and recall between the two pipelines with uncertainty.
The Python library geck, and usage examples are available at the following URL: https://github.com/sbg/geck, under the GNU General Public License v3.
Supplementary data are available at Bioinformatics online.
1 Introduction
Next-generation sequencing (NGS) has enabled entire human genomes to be sequenced in a single experiment and has, since its development, played a crucial role in many areas including population genetics (Auton et al., 2015; Mallick et al., 2016), disease genetics (Bamshad et al., 2011) and the study of human mutations (Veltman and Brunner, 2012). Many variant callers have been developed to call genetic variants from NGS data (Sandmann et al., 2017), employing multiple steps of data processing and statistical inference (Li, 2011). Due to their wide impact, benchmarking variant calling pipelines is of utmost importance (Li, 2014; Shringarpure et al., 2015).
A widely-used method of measuring the accuracy of variant calling pipelines is tallying mis-called variants on samples that have been characterized previously (Cornish and Guda, 2015; Hwang et al., 2015; Olson et al., 2015). High-confidence variant sets, or ‘truth sets’, are compiled by teams of experts, such as the Genome in a Bottle (GIAB) Consortium, by integrating variant calls from several sequencing and data processing technologies. Due to careful curation, these datasets reach very high fidelity and provide a benchmarking reference for SNPs, indels (Boutros et al., 2014; Talwalkar et al., 2014; Zook et al., 2014) and structural variants (Fang et al., 2016; Human Genome Structural Variant Consortium, 2017; Nutsua et al., 2015; Parikh et al., 2016). In all cases, high-coverage sequencing data from a handful of (often related) individuals are required to produce a truth set (Eberle et al., 2017; Zook et al., 2016). Such truth sets are expensive to create, specific to the population and biased toward the methods that produced them.
Sequencing data from related individuals provide an alternative way to detect errors made by a variant calling pipeline. Since de-novo mutation rates [ per locus (Veltman and Brunner, 2012)] are several orders of magnitude smaller than calling and genotyping error rates (often ), one can use the variant calls inconsistent with the rules of Mendelian inheritance to assess genotyping errors. While not all errors result in a Mendelian violation, which limits the sensitivity of this strategy, as noted by Li (2014), additional assumptions about the error probabilities permit the estimation of genotyping error rates from parent-offspring dyads (Haaland and Skaug, 2013), siblings (Johnson and Haydon, 2007; Korostishevsky et al., 2009; Wang, 2004), trios (Douglas et al., 2002; Jostins, 2011; Saunders et al., 2007; Sobel et al., 2002) and larger pedigrees (Eberle et al., 2017). If sequencing reads from many related individuals are available, recombination events can be inferred and used to estimate locus-specific genotyping errors (Chen et al., 2013; Kojima et al., 2013; Markus et al., 2011; Peng et al., 2013). All these approaches have the advantage of not relying on a truth set, thereby eliminating the bias towards established variant callers.
When modeling genotyping calling errors in related individuals, one is tempted to parametrize the genotyping error profile with a couple of parameters and estimate their values independently for different variant callers. This method has two significant shortcomings by neglecting the following correlations: First, errors made by different variant calling pipelines are often correlated due to identical assumptions (e.g. well-mapped reads) which, if violated in the input data, make mistakes of the pipelines coincide at particular loci. Second, mapping and calling errors depend on sequence context, which is similar among closely related individuals and induces correlations between genotyping errors in samples of different family members. To the extent of our knowledge, there exists no published method that addresses both correlations.
Here, we present a statistical model that addresses both shortcomings for the case of a family trio. First, we aggregate the number of different genotype trios for the two pipelines jointly, and model this data. This way, we take correlations between two variant callers explicitly into account. Second, our model consists of a statistical mixture of different genotyping error profiles. Although each error profile assumes independence between family members, enabling an efficient mathematical solution, their mixture captures the correlations.
We validate our model in a series of experiments on real trios. First, we run two variant caller pipelines on publicly available NGS data from trios, and let our model estimate the precision and recall of the two pipelines. Then we compare these estimated performance metrics with their true values. The truth is directly calculated by comparing the calls on the children with their correct genotypes from the truth datasets. We demonstrate that our model is able to detect differential performance (absolute difference of precision or recall) between two pipelines as small as . With this level of accuracy, our model enables informed development of pipelines targeting less-studied variants and populations for which no curated truth set is available.
The rest of the paper is divided into three parts: In Section 2, we introduce our notation, describe the structure and assumptions of our model and show how one can use it to infer the number of genotyping errors and stratify them by type. In Section 3, we describe how we validate our method on publicly available trio data. Finally, in Section 4, we present the results of the validation experiment.
2 Model
2.1 Definitions and notation
Trio genotypes. Let denote the set of individual, bi-allelic, unphased genotypes: homozygous reference (00), heterozygous (01) and homozygous alternate (11). Let be an ordered trio of individual genotypes, representing the genotypes of variants in the genomes of father (g1), mother (g2) and child (g3). Let T denote the set of mathematically possible genotype trios, i.e. , and let denote its subset that conforms with Mendelian inheritance, i.e. or , where and are the first and second symbols (0 or 1) of g3. One can show that and by enumeration.
Observed genotypes. Two variant calling pipelines run independently on samples from the three family members produce six separate sets of variant calls. We require that variant representations are harmonized, after which we can aggregate the counts of different genotype trio pairs of one class of variants (e.g. SNPs, short indels, structural variants) called by the two pipelines across all loci. This yields the joint counts of genotype trios , where is the number of variants whose trio genotypes are called by pipeline 1, and by pipeline 2. An example is shown in Figure 1.
Correct genotypes. Explicit information about the correct trio genotypes is missing from the observed counts N. Let denote the (hidden) complete distribution of called and correct genotype trios, where is the number of variants with called genotype trios and correct genotype trio g, such that . In the next section, we explicitly model the genotyping errors in order to infer the hidden components in , examples of which are shown in Supplementary Figures S3–S5. Uncovering this data, , from the observed mixture, N, is our main goal.
Genotyping can be understood as a classification task with one negative (00) and two positive classes (01 and 11). For person p, we define the true positive, false negative, false positive and ‘mixed-up positive’ counts for pipeline 1 as , , and , respectively, where denotes summation over the third index. For pipeline 2, similar formulas apply where the summations are prescribed for the second index. We define precision and recall with the usual formulas: and , for each person and each pipeline.
2.2 Generative model
We introduce a generative model in a form of a mixture of categorical distributions over pairs, where each mixture component is labeled by a correct trio genotype , and an error category m (defined in this section). The graphical representation of this model is shown in Figure 2. We define the parameters below.
2.3 Limiting complexity
The model described so far has 480 algebraically independent parameters (Supplementary Material A.1.1), which is comparable to 27 × 27 = 729, the number of entries in the input data N. To improve the stability of our estimates, we lower the complexity of the model in two ways: limiting the range of error rates, and using parameter sharing.
Limiting error rates. We limit the error rates by requiring that . This guarantees that the largest fraction of the calls (j, k) are indeed correct (), separately for each error category.
Parameter sharing. We partition t into three disjunct subsets , where and , and require, whenever both are in t1, that and . We choose this partitioning expecting that variants that are present in none (t0) or all (t2) of the genomes will be subject to different error rates than variants for which at least one family member is heterozygous (t1). These constraints allow us to introduce the notation and for all where s = 0, 1, 2, and reduce the number of algebraically independent parameters to 96. (Supplementary Material A.1.2)
2.4 Inference
Gibbs sampling. The conditional posterior of [Equation (7)] and [Equation (10)] are products of multinomial and Dirichlet distributions, respectively, both of which can be efficiently sampled. This enables Gibbs sampling, where we start with an initial guess . In Step 1, we sample the independent multinomials of Equation (7) and obtain the τth sample of the hidden components, . Then, in Step 2, we sample the independent Dirichlet variables of Equation (10) and obtain the next sample of the model parameters, . We make sure that by first sampling each slice from an unconstrained Dirichlet distribution, followed by lowering all components that violate this criteria to the level of , and re-normalizing. Repeating Steps 1 and 2 yields approximate samples from the joint posterior . With appropriate ‘burn-in’ (τ0) and ‘thinning’ (), this procedure provides estimates of the hidden components in the form of a list of samples . Details of choosing τ0 and are given in Supplementary Material B.
Benchmarking metrics. From the samples , the τth estimate of any metric μ can be directly calculated, , enabling accurate estimation of its posterior.
3 Materials and methods
To assess the accuracy of our model, we carried out the validation experiment shown on Figure 3. First, we selected trios for which not only the NGS reads but also curated truth-sets of the children are published, and used two variant caller pipelines to produce variant calls from the read data. Second, we simultaneously used our model to estimate benchmarking metrics for both pipelines on all three members of the trio, and used the correct genotypes from the child’s truth set to calculate the true values of the same benchmarking metrics. Finally, we compared the estimated metrics with the true metrics to calculate the accuracy of our model.
3.1 Data
Read data. We used 50x, 148-bp-long, paired-end, Illumina reads from three trios: the GIAB Ashkenazim trio (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/; HG003, HG004, HG002), published by the GIAB Consortium (Zook et al., 2016) and 101-bp-long, paired-end, Illumina reads from the two unrelated trios from Platinum Genome (ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/; NA12889, NA12890, NA12877) and (NA12891, NA12892, NA12878), published by Illumina, Inc. (Eberle et al., 2017). From these alignment files, we extracted and sorted the reads using samtools (https://github.com/samtools/samtools) and dropped duplicates and unpaired reads, using a custom script. Command lines are listed in Supplementary Material D.
Truth set for child. Curated, high-confidence variants are publicly available for all three children, HG002 (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/ (v3.3.2)), NA12877 and NA12878 (ftp://platgene_ro@ussd-ftp.illumina.com/2016-1.0/hg19/). We used bcftools (http://samtools.github.io/bcftools/) to select bi-allelic SNPs on chromosomes 1–22 ( million). Merging and comparing with the variants produced by the variant caller pipelines allows direct, ‘truth set-based’, benchmarking of the pipelines. (See Supplementary Material D for command lines.) We restricted the scope of our validation experiment to SNPs to avoid having to reconcile different representations of identical variants, which often arise for indels. Supplementary Material C.4 and C.5 contain validation results where we further excluded SNPs near indels, and where we included only SNPs outside of high-confidence regions, respectively.
Pipelines. We used two variant caller pipelines, both of which use BWA-MEM 0.7.13 (Li and Durbin, 2009) for alignment, GATK tools (McKenna et al., 2010) for local realignement and base quality recalibration and take advantage of known SNPs and indels (HapMap, dbSNP, 1000GP, Omni) according to GATK Best Practices recommendations (DePristo et al., 2011; Van der Auwera et al., 2013). In the variant calling step, one pipeline used GATK HaplotypeCaller v3.5, and the other GATK UnifiedGenotyper v2.7. Command lines are listed in Supplementary Material D. (Additionally, Supplementary Material C.6 shows results of comparing pipelines with different aligners).
Aggregating genotype calls. For each trio, we merged the SNPs called by the two pipelines using bcftools, and filtered out the variants that fell outside the high-confidence regions or become multi-allelic. We aggregated the remaining SNPs ( million) by counting the occurrences of each combinations, where the trio genotype is called by Haplotype Caller and by Unified Genotyper. During aggregation, we recorded missing variants as genotype 00 and disregarded phasing information, i.e. regarded both heterozygous genotypes, and , as 01. This yielded the observed counts of joint trio calls .
Sub-sampling. To test the robustness of our model, we repeated the validation experiments for subsets of SNPs of different sizes. We sub-sampled a pre-defined number of variants from N by choosing a random subset of SNPs with uniform probability.
3.2 Running the estimator
Imputing ‘uncalled variants’. Neither Haplotype Caller nor Unified Genotyper reports homozygous reference genotypes (i.e. 00) when run on a single sample. Due to this limitation, the number of potential variants that do not get called 01 or 11 by either pipeline in any family member, i.e. , is technically zero. With this ‘bulk’ of the 00 genotypes missing, we would not find the correct and error rates. To fix this, we used an estimate of the total number of polymorphic loci: the total number of variants reported in dbSNP (https://www.ncbi.nlm.nih.gov/projects/SNP/) .
Details of Gibbs sampling. For each set of input data N, we perform Gibbs sampling as described in Section 2.4. We set the parameters of the Dirichlet priors to , and , if i = j = k and 1 otherwise . The skewed choice for represents our expectation that the error rates are small . We draw the initial values of from their prior, and let the Gibbs sampler run for 50 000 iterations, allowing it to reach equilibrium. We proceeded by running 100 000 iterations, while recording samples with period, to decrease the correlation between consecutive samples. This procedure resulted in 100 samples, , drawn from its posterior . With our scipy-based Python implementation, this procedure takes 4.5 h, using 1 CPU on the AWS instance m1.small.
3.3 Metrics
For each trio, we run the trio-based benchmarking, which produces samples . From this, we directly calculate the samples of the confusion matrix for the child [from Equation (1)]. Using the formulas for precision and recall, we obtain their samples and , for both pipelines. We estimate each metric with the average of its samples.
To investigate how sensitive our model is to the total number of variants, we repeatedly sub-sample the observed counts N, and run the trio-based benchmarking all over again. We do this for 8 different number of variants (103—), repeating each sub-sampling and estimation 10 times for every trio.
4 Results
The simplest method of analyzing trio results from the two pipelines is counting Mendelian violations. This, alongside with the true genotyping errors in the child, is shown in Table 1. Comparing the two columns shows that the difference in the number of Mendelian violations correctly indicates the difference in the actual number of genotyping error between the two pipelines.
. | Genotyping errors in child . | Mendelian violations in trio . | ||
---|---|---|---|---|
HC . | UG . | HC . | UG . | |
HG002a | 7496 | 20 142 | 3946 | 7601 |
NA12877b | 36 759 | 39 780 | 57 179 | 59 038 |
NA12878b | 34 436 | 37 224 | 11 875 | 14 286 |
. | Genotyping errors in child . | Mendelian violations in trio . | ||
---|---|---|---|---|
HC . | UG . | HC . | UG . | |
HG002a | 7496 | 20 142 | 3946 | 7601 |
NA12877b | 36 759 | 39 780 | 57 179 | 59 038 |
NA12878b | 34 436 | 37 224 | 11 875 | 14 286 |
Note: Counts are shown for Haplotype Caller and Unified Genotyper.
With aGIAB (HG002), and bPlatinum Genomes truth VCF and BED files.
. | Genotyping errors in child . | Mendelian violations in trio . | ||
---|---|---|---|---|
HC . | UG . | HC . | UG . | |
HG002a | 7496 | 20 142 | 3946 | 7601 |
NA12877b | 36 759 | 39 780 | 57 179 | 59 038 |
NA12878b | 34 436 | 37 224 | 11 875 | 14 286 |
. | Genotyping errors in child . | Mendelian violations in trio . | ||
---|---|---|---|---|
HC . | UG . | HC . | UG . | |
HG002a | 7496 | 20 142 | 3946 | 7601 |
NA12877b | 36 759 | 39 780 | 57 179 | 59 038 |
NA12878b | 34 436 | 37 224 | 11 875 | 14 286 |
Note: Counts are shown for Haplotype Caller and Unified Genotyper.
With aGIAB (HG002), and bPlatinum Genomes truth VCF and BED files.
To see how accurately our trio-based benchmarking method can compare the two pipelines, we compare estimates of precision and recall with their true values for the children in Table 2. Although the estimated values of the metrics significantly differ from the truth, both the signs and magnitudes of differential metrics are accurately estimated. This is also visible from the smaller estimation uncertainty σtrio, reported by the model.
. | . | Precision . | Recall . | ||||
---|---|---|---|---|---|---|---|
HC . | UG . | . | HC . | UG . | . | ||
HG002 | Trutha | 0.9980 | 0.9942 | 3.81 | 0.9994 | 0.9988 | 0.56 |
trio | 0.9554 | 0.9529 | 2.50 | 0.9988 | 0.9985 | 0.33 | |
σtrio | ± 0.0076 | ± 0.0076 | ± 0.05 | ± 0.0006 | ± 0.0006 | ± 0.05 | |
NA12877 | Truthb | 0.9986 | 0.9963 | 2.23 | 0.9902 | 0.9912 | –1.02 |
trio | 0.9525 | 0.9499 | 2.63 | 0.9834 | 0.9839 | –0.47 | |
σtrio | ± 0.0006 | ± 0.0006 | ± 0.02 | ± 0.0001 | ± 0.0001 | ± 0.02 | |
NA12878 | Truthb | 0.9986 | 0.9965 | 2.15 | 0.9910 | 0.9920 | –1.04 |
trio | 0.9945 | 0.9917 | 2.79 | 0.9964 | 0.9969 | –0.46 | |
σtrio | ± 0.0008 | ± 0.0008 | ± 0.02 | ± 0.0003 | ± 0.0003 | ± 0.02 |
. | . | Precision . | Recall . | ||||
---|---|---|---|---|---|---|---|
HC . | UG . | . | HC . | UG . | . | ||
HG002 | Trutha | 0.9980 | 0.9942 | 3.81 | 0.9994 | 0.9988 | 0.56 |
trio | 0.9554 | 0.9529 | 2.50 | 0.9988 | 0.9985 | 0.33 | |
σtrio | ± 0.0076 | ± 0.0076 | ± 0.05 | ± 0.0006 | ± 0.0006 | ± 0.05 | |
NA12877 | Truthb | 0.9986 | 0.9963 | 2.23 | 0.9902 | 0.9912 | –1.02 |
trio | 0.9525 | 0.9499 | 2.63 | 0.9834 | 0.9839 | –0.47 | |
σtrio | ± 0.0006 | ± 0.0006 | ± 0.02 | ± 0.0001 | ± 0.0001 | ± 0.02 | |
NA12878 | Truthb | 0.9986 | 0.9965 | 2.15 | 0.9910 | 0.9920 | –1.04 |
trio | 0.9945 | 0.9917 | 2.79 | 0.9964 | 0.9969 | –0.46 | |
σtrio | ± 0.0008 | ± 0.0008 | ± 0.02 | ± 0.0003 | ± 0.0003 | ± 0.02 |
Note: Δ denotes the difference between HC and UG values. σtrio is self-reported uncertainty of the model, i.e. SD of the Gibbs samples.
With aGIAB (HG002), and bPlatinum Genomes truth VCF and BED files.
. | . | Precision . | Recall . | ||||
---|---|---|---|---|---|---|---|
HC . | UG . | . | HC . | UG . | . | ||
HG002 | Trutha | 0.9980 | 0.9942 | 3.81 | 0.9994 | 0.9988 | 0.56 |
trio | 0.9554 | 0.9529 | 2.50 | 0.9988 | 0.9985 | 0.33 | |
σtrio | ± 0.0076 | ± 0.0076 | ± 0.05 | ± 0.0006 | ± 0.0006 | ± 0.05 | |
NA12877 | Truthb | 0.9986 | 0.9963 | 2.23 | 0.9902 | 0.9912 | –1.02 |
trio | 0.9525 | 0.9499 | 2.63 | 0.9834 | 0.9839 | –0.47 | |
σtrio | ± 0.0006 | ± 0.0006 | ± 0.02 | ± 0.0001 | ± 0.0001 | ± 0.02 | |
NA12878 | Truthb | 0.9986 | 0.9965 | 2.15 | 0.9910 | 0.9920 | –1.04 |
trio | 0.9945 | 0.9917 | 2.79 | 0.9964 | 0.9969 | –0.46 | |
σtrio | ± 0.0008 | ± 0.0008 | ± 0.02 | ± 0.0003 | ± 0.0003 | ± 0.02 |
. | . | Precision . | Recall . | ||||
---|---|---|---|---|---|---|---|
HC . | UG . | . | HC . | UG . | . | ||
HG002 | Trutha | 0.9980 | 0.9942 | 3.81 | 0.9994 | 0.9988 | 0.56 |
trio | 0.9554 | 0.9529 | 2.50 | 0.9988 | 0.9985 | 0.33 | |
σtrio | ± 0.0076 | ± 0.0076 | ± 0.05 | ± 0.0006 | ± 0.0006 | ± 0.05 | |
NA12877 | Truthb | 0.9986 | 0.9963 | 2.23 | 0.9902 | 0.9912 | –1.02 |
trio | 0.9525 | 0.9499 | 2.63 | 0.9834 | 0.9839 | –0.47 | |
σtrio | ± 0.0006 | ± 0.0006 | ± 0.02 | ± 0.0001 | ± 0.0001 | ± 0.02 | |
NA12878 | Truthb | 0.9986 | 0.9965 | 2.15 | 0.9910 | 0.9920 | –1.04 |
trio | 0.9945 | 0.9917 | 2.79 | 0.9964 | 0.9969 | –0.46 | |
σtrio | ± 0.0008 | ± 0.0008 | ± 0.02 | ± 0.0003 | ± 0.0003 | ± 0.02 |
Note: Δ denotes the difference between HC and UG values. σtrio is self-reported uncertainty of the model, i.e. SD of the Gibbs samples.
With aGIAB (HG002), and bPlatinum Genomes truth VCF and BED files.
In Figure 4, we plot the deviation of the trio-based estimate from the true value as a function of number of variants. While the estimation error of the metrics themselves hovers around 0.1–0.01 even when large number of variants are considered, the model’s error on the differential metrics is around for large number of variants, and stay below even for a mere 1000 variants. Recall values (both absolute and differential) are estimated more accurately then precision.
To understand the reason why our model is much better at estimating differential metrics than the metrics themselves, we plot the entries of the genotype confusion matrix (n) for NA12878 in Figure 5. Our model accurately estimates the number of loci where either one of the pipelines makes an incorrect call, but not both. Since these entries of the confusion matrix contribute the most to differential performance, it is no surprise that the difference in precision and recall are accurately estimated, even if their actual values are not. We can intuitively understand this disparity in the following way: We expect the two pipelines to have similar performance. Only the 27 diagonal elements of N convey information about this common performance. On the other hand, all off-diagonal elements tell us something valuable about the differential performance. Since there are more of the latter, they paint a more detailed picture and allow us to estimate the differences more accurately.
5 Discussion
Aggregating observed Mendelian violations across many loci in related individuals provides an approximate notion of the accuracy of variant calling pipelines. When comparing two pipelines, we expect the better one to produce fewer Mendelian inconsistencies, and indeed, as shown in Table 1, this is the case: Haplotype Caller produces fewer Mendelian violations in all three trios, and makes fewer genotyping errors for all three children.
In the light of our results in Table 2, two caveats of this naive method are immediately visible: First, the numbers of Mendelian violations do not accurately reflect how much better is the better pipeline. Second, the bare counts of Mendelian violations do not even hint that Unified Genotyper actually has a higher recall on the samples NA12877 and NA12878, something our trio-based benchmarking method estimates correctly (Table 2). This exemplifies that estimating precise quantitative differences between variant caller pipelines, in the absence of a reliable truth set, is a non-trivial task. As we demonstrated, our method can do this with high accuracy.
Previous trio-based benchmarking approaches range from fitting a calibration curve between observed number of Mendelian violations and true number of errors (Hao et al., 2004), to modeling haplotypes and recombination events during meiosis (Kojima et al., 2013; Markus et al., 2011; Peng et al., 2013). All previous approaches focus on a set of variants produced by a single pipeline, disregarding all information about the correlation between pipelines. We showed that by analyzing the joint counts (N) of the called genotype trios from two pipelines one can estimate differential performance with high accuracy (with uncertainty of ). In fact, exactly because of the information contained in the correlations, the differential metrics can be estimated with more than an order of magnitude lower uncertainty than the metrics themselves. As a consequence, our uncertainty is an order of magnitude lower than it was previously achieved with dyads (Haaland and Skaug, 2013), and comparable to the uncertainties obtained from pure simulations for trios (Hao et al., 2004).
Although our current method is limited to using bi-allelic variants from trios, the mathematical framework enables straightforward generalization. Data from multi-allelic variants and larger pedigrees can be incorporated by expanding the sets I and T. The demand of computational resources, which increases exponentially with the sizes of I and T, can be curbed by truncating the model to limit the number of genotyping errors per locus, which enables implementation with sparse arrays.
One limitation of our method is the need for harmonized variant representation in the input. While this problem is solved for trios by Toptaş et al. (2018), there is no easily-deployable method to faithfully match six samples with different variant representation. Another limitation, stemming from the limitation of the input data, is that some of the Mendelian-consistent trios are in fact results of double (or triple) errors. Our method relies on explicit assumptions about error profiles, its result is accurate only if these assumptions hold.
Our method estimates aggregate metrics. The estimated complete distribution and the person-specific confusion matrices contain a wealth of additional information that can be directly used to improve the performance of variant calling. Bayesian model averaging [e.g. Fragoso and Neto (2015)] can be used to reconcile genotypes at loci where the pipelines do not agree. By counting how often one pipeline is right when the other is wrong for each called trio combination , one can estimate the posterior probability of the correct genotype, . E.g. from the confusion matrix n shown on Figure 5, we can calculate the probability of HC being correct when HC and UG call 01 and 00, respectively: . One can imagine extending this scheme to a council of pipelines where, instead of giving equal say to each pipeline, a weighted voting is carried out to determine the genotype of each variant.
Acknowledgements
Discussions with Yilong Li, Lizao Li, Maxime Huvet, Amit Jain, Milos Popovic, Aqeel Ahmed, Maria Suciu, Sun-Gou Ji, Ivan Johnson, Kaushik Ghose, İlker Gündoğdu were very valuable at multiple stages of this work. The authors are grateful to them for their insights and suggestions.
Funding
This work was supported in part by UK Department of Health grant SBRI Genomics Competition: Enabling Technologies for Genomic Sequence Data Analysis and Interpretation administered by Genomics England.
Conflict of Interest: none declared.
References