Clair3-trio: high-performance Nanopore long-read variant calling in family trios with trio-to-trio deep neural networks

Abstract
Accurate identification of genetic variants from family child–mother–father trio sequencing data is important in genomics. However, state-of-the-art approaches treat variant calling from trios as three independent tasks, which limits their calling accuracy for Nanopore long-read sequencing data. For better trio variant calling, we introduce Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio’s predicted variants within a single model to improve variant calling. We also present MCVLoss, a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of Mendelian inheritance. Clair3-Trio showed comprehensive improvement in experiments. It predicted far fewer Mendelian inheritance violations than current state-of-the-art methods. We also demonstrated that our Trio-to-Trio model is more accurate than competing architectures. Clair3-Trio is accessible as a free, open-source project at https://github.com/HKU-BAL/Clair3-Trio.


Introduction
Accurate identification of genetic variants in family trios in the human genome is an important task in genomics, which provides insight into precision medicine and phenotype understanding [1]. The human genome follows the Mendelian inheritance [2], with half of the child's genome in family trios inherited from each parent. Calling genetic variants in trios provides a more comprehensive understanding of the inheritance pattern of genetic variants in families [3].
Several state-of-the-art deep learning-based methods are available for calling small variants from Oxford Nanopore Technologies (ONT) data. They are based on two main designs: pileup and full-alignment. Clairvoyante [4], Clair [5] and Nanocaller [6] use a pileup-based design, which summarizes the read alignments into features and counts that are then fed into a variant-calling network. PEPPER-Margin-DeepVariant (PEPPER) [7], on the other hand, applies a haplotype-aware variant calling pipeline and uses full alignment-based input to call variants via neural networks. Clair3 [8] combines the two major designs in a cascaded workflow that uses pileup for the best speed and full-alignment for the best accuracy when calling variants from ONT data. Other variant-calling methods, including Medaka [9] and Longshot [10], are also available for ONT data. However, all the state-of-the-art methods call variants in each individual of a trio independently and fail to leverage Mendelian inheritance in the family for better variant-calling accuracy for ONT data.
For calling variants with the genetic information shared in family trios, two pilot studies based on DeepVariant [11] have been developed. dv-trio [12] provides a processing pipeline to call variants using DeepVariant, together with GATK [13] and FamSeq [14], to reduce the number of Mendelian inheritance violations in its variant calling. DeepTrio [15] extends DeepVariant's single sample input to accept the input of three samples in its deep neural networks to call candidate sites identified by heuristic checking. Current trio variant callers do not include Mendelian inheritance violation factors in their model architecture designs or decisions. Furthermore, all these methods are designed for Illumina and PacBio HiFi data and cannot call variants from ONT data. Therefore, there is currently no trio information-aware caller available for calling variants from ONT data.
Generally, two research gaps remain for calling variants from trios for ONT data: (i) how to train the model to learn from both the information of each individual and the information preserved in the family trio and (ii) how to train the model to make predictions that follow Mendelian inheritance, a basic feature of family trios. These two questions have not been studied for ONT data and remain unsolved in the community.
To fill the two main research gaps and improve variant calling from trios' ONT data, we propose a new model: Clair3-Trio. Clair3-Trio is the first variant caller tailored for family trio ONT data, with a Trio-to-Trio deep neural network model design that allows it to input the trio's sequencing information and output all of the trio's predicted variants. Using the Trio-to-Trio model, Clair3-Trio can efficiently call variants based on individual and family trio information. We also designed a loss function, MCVLoss (Mendelian Inheritance Constraint Violation Loss), to make the model explicitly encode the priors of Mendelian inheritance in trios to improve its variant calling (described in the Methods section). Based on our experiment on the Genome in a Bottle (GIAB) HG002 trio data [16], Clair3-Trio showed comprehensive improvement compared with state-of-the-art methods. It showed an increment of over +10% in the F1-score of the child and +5% in the F1-score of the parents compared with Clair3 and PEPPER when tested at 10× ONT data. In addition, it showed an order of magnitude fewer Mendelian inheritance violations than other methods. All code and experimental settings for Clair3-Trio are publicly available at https://github.com/HKU-BAL/Clair3-Trio.

Family trio variant calling with Clair3-trio
Clair3-Trio consists of two main modules (Figure 1A): (i) data preprocessing, which uses the Clair3 pileup model together with the WhatsHap [17] phase and haplotag submodules to phase the data of each individual in a family; and (ii) model calling, which calls family trio variants with the Clair3-Trio model. The inputs for Clair3-Trio are three alignment files from a family trio: child, mother and father. The workflow and model are discussed in the following.

Data preprocessing
For data preprocessing, we first use the Clair3 pileup model to efficiently find all genetic variants that can be easily predicted with high confidence. We then use WhatsHap to phase the variants and haplotag the reads, based on the called heterozygous single nucleotide polymorphisms (SNPs), to obtain phased alignments for the Clair3-Trio model. With the haplotagged alignments of all individuals available, we use a simple heuristic approach to identify candidate positions that might harbor a genetic variant in the family, as follows: (i) the Clair3 pileup model examines all positions with a supporting alternative allele frequency exceeding 0.08 and outputs variant and non-variant calls with confidence scores for each individual [8]. (ii) Next, all pileup variant calls and 20% of the low-quality pileup reference calls are collected from each individual as the individual's potential variant candidate sites. (iii) Then, we take the union of the potential variant sites of all individuals in the family as the trio's variant candidates. Thus, any variant identified in one sample is treated as a candidate for all samples in the Clair3-Trio model.
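The candidate selection heuristic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PileupCall` record, the function names and the interpretation of "20% of low-quality reference calls" as the lowest-scoring 20% are hypothetical and are not taken from the Clair3-Trio code base.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

Site = Tuple[str, int]  # (contig, 1-based position)

MIN_ALT_AF = 0.08  # alternative allele frequency threshold quoted in the text


class PileupCall(NamedTuple):
    """Hypothetical record for one pileup call of one individual."""
    site: Site
    is_variant: bool  # True for a variant call, False for a reference call
    quality: float    # confidence score reported by the pileup model


def exceeds_af_threshold(alt_count: int, depth: int) -> bool:
    """Step (i): a position is examined only if its alt allele frequency exceeds 0.08."""
    return depth > 0 and alt_count / depth > MIN_ALT_AF


def individual_candidates(calls: List[PileupCall],
                          low_qual_ref_fraction: float = 0.20) -> Set[Site]:
    """Step (ii): all variant calls plus a fraction of the reference calls.

    Assumption: the retained reference calls are the lowest-quality ones.
    """
    variants = {c.site for c in calls if c.is_variant}
    ref_calls = sorted((c for c in calls if not c.is_variant),
                       key=lambda c: c.quality)
    n_keep = int(len(ref_calls) * low_qual_ref_fraction)
    low_quality_refs = {c.site for c in ref_calls[:n_keep]}
    return variants | low_quality_refs


def trio_candidates(per_sample_calls: Dict[str, List[PileupCall]]) -> List[Site]:
    """Step (iii): union of the candidate sites of every family member."""
    merged: Set[Site] = set()
    for calls in per_sample_calls.values():
        merged |= individual_candidates(calls)
    return sorted(merged)
```

Because the union is taken over all members, a site that looks like a confident reference call in the child is still examined by the trio model if either parent carries a candidate there.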

Clair3-trio model: a trio-to-trio deep neural network model
The Clair3-Trio model is a Trio-to-Trio model that can input all alignments from the family trio and output all variants from the same trio. The inputs for the Clair3-Trio model are generated by merging phased full alignments from trios. For each individual, the full-alignment information is converted into eight different feature channels, as previously discussed for Clair3 [8]. For each channel, we aggregate the same channel from each individual in the same family order as the input of the Clair3-Trio model (Figure 1B).
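A rough sketch of this channel-wise aggregation is given below. It assumes that each channel is a reads-by-positions matrix and that the three members are concatenated along the read axis; both the data layout and the names (`merge_trio_channels`, `FAMILY_ORDER`, the channel keys) are illustrative assumptions rather than the actual Clair3-Trio tensor format.

```python
from typing import Dict, List

Matrix = List[List[int]]          # rows = reads, columns = positions
SampleTensor = Dict[str, Matrix]  # channel name -> matrix

# Fixed family order so that every channel is stacked consistently.
FAMILY_ORDER = ("child", "parent_1", "parent_2")


def merge_trio_channels(trio: Dict[str, SampleTensor]) -> SampleTensor:
    """Concatenate each feature channel across the trio in family order."""
    channel_names = list(trio[FAMILY_ORDER[0]].keys())
    merged: SampleTensor = {}
    for name in channel_names:
        rows: Matrix = []
        for member in FAMILY_ORDER:
            # Assumption: members are stacked along the read (row) axis.
            rows.extend(trio[member][name])
        merged[name] = rows
    return merged
```

The merged input keeps the same eight channels as a single-sample input, so the convolutional layers see the child and both parents side by side within every channel.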
The neural network of the Clair3-Trio model consists of multiple layers: convolutional layers (Conv), residual convolutional layers (ResBlock), pyramid pooling layers and dense layers (Figure 1B). Clair3-Trio uses independent dense layers to predict each individual's genotype, zygosity and two insertion or deletion (INDEL) lengths in the last layer. All outputs from the model are then combined and converted to variant records for each individual.

Training a Clair3-trio model
To train a Clair3-Trio model on family trio data, we applied (i) a label cleaning module (Representation Unification) to clean the training data and (ii) a trio data filtering module (MCV filtering) to further filter Mendelian violation sites in the training data. Both modules were adopted based on experiments. We use the Representation Unification module from Clair3 to unify the true variant labels with the alignment information in the training data. Because Representation Unification may introduce Mendelian conflicts during unification, we added MCV filtering to discard the few candidate sites (0.05% of candidate sites) in the training data that violated Mendelian inheritance constraints. After cleaning the data, we performed random downsampling to improve the model's generalization across different levels of data coverage. We downsampled the data to coverages of 10×, 30×, 60× and 80× for all samples, kept the child data at high coverage and downsampled only the parent samples for low coverage. After downsampling, we kept 30% of the data of each coverage combination to balance speed and performance, leading to 33 353 000 candidates (from the GIAB HG002 family) in our training dataset. With the training dataset available, Clair3-Trio was trained in a two-step procedure. First, we trained an initial Clair3-Trio model with the focal loss function, and then we fine-tuned the initial model with the MCVLoss function added as an additional task. We also tried other training techniques, but they failed to improve Clair3-Trio. This is elaborated in Supplementary Notes (see Supplementary Data available online at http://bib.oxfordjournals.org/).
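The MCV filtering step can be illustrated with a minimal Mendelian consistency check on unphased diploid genotypes. The function names and the tuple layout of a training site are hypothetical; the rule itself is the standard one for autosomes, where one child allele must come from each parent.

```python
from typing import List, Tuple

Genotype = Tuple[int, int]  # unphased diploid genotype, e.g. (0, 1) for 0/1
TrioSite = Tuple[str, Genotype, Genotype, Genotype]  # (site id, child, parent_1, parent_2)


def is_mendelian_consistent(child: Genotype,
                            parent_1: Genotype,
                            parent_2: Genotype) -> bool:
    """True if one child allele can come from each parent (autosomal sites)."""
    a, b = child
    return (a in parent_1 and b in parent_2) or (a in parent_2 and b in parent_1)


def filter_mcv_sites(sites: List[TrioSite]) -> List[TrioSite]:
    """Drop training sites whose truth labels violate Mendelian inheritance."""
    return [s for s in sites if is_mendelian_consistent(s[1], s[2], s[3])]
```

For example, a child labelled 0/0 with parents 0/0 and 1/1 is removed, because the second parent can only transmit allele 1.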

Differences between Clair3-trio and the Clair3 full-alignment model
Our approach differs from Clair3 mainly in the following ways:

Modeling Mendelian inheritance with MCVLoss in deep neural networks
The Trio-to-Trio model can predict the trio's variants with trio's information, but how to explicitly add the Mendelian inheritance information to the model remains an open question. In the following subsections, we discuss the MCVLoss function, which is designed to control the Mendelian inheritance violation rate in the model. We briefly describe the original loss function in Clair3 and then introduce the MCVLoss function.

Loss function for a single sample
First, we detail the original loss function for an individual, inherited from Clair3, to better illustrate the basic components of the Clair3-Trio loss function. The output of Clair3 includes four tasks (genotype, zygosity and two INDEL lengths), as previously described in Clair3 [8]. The most important task in Clair3 is to predict the genotype, which is classified into one of 21 genotypes. If $X$ denotes the alignment from a single sample, the probability of each possible genotype from the 21 genotypes for each sample is

$$P_{gt} = F_{gt}(F(X)) \tag{1}$$

where $F_{gt}$ represents the Clair3 model's last layer (the 21-genotype output layer) and $F$ represents all the other Clair3 layers, other than the last $F_{gt}$ layer, as in Figure 1B. Based on the probability of the 21 genotypes, the loss function of Clair3 can be simplified as

$$L_{Clair3} = -\sum_{i=1}^{21} Y_{gt}^{(i)} \log P_{gt}^{(i)} + L_2 \tag{2}$$

where $Y_{gt}$ denotes the true 21-genotype label, $P_{gt}$ denotes the predicted probability of each 21-genotype label and $L_2$ denotes the L2 regularization terms of the model. We omit the zygosity and INDEL length terms in this simplified formula (their formulas are identical to that of the 21-genotype task). For applications, the complete loss functions, including 21 genotypes, zygosity, INDEL length 1 and INDEL length 2, are described in the Clair3 paper [8].

The output of Clair3-trio and the computation of the trio probability
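We extended the model output in Clair3 from the individual to compute trio genotypes in Clair3-Trio. With $X$ denoting the merged trio input, the probability of the trio members is represented as

$$P_{gt,c} = F_{gt,c}(F(X)), \quad P_{gt,p_1} = F_{gt,p_1}(F(X)), \quad P_{gt,p_2} = F_{gt,p_2}(F(X)) \tag{3}$$

where $F$ represents all layers of Clair3-Trio except for the last layer, and $F_{gt,c}$, $F_{gt,p_1}$ and $F_{gt,p_2}$ represent the last three fully connected layers, which compute the 21-genotype probabilities of the child, parent-1 and parent-2, respectively. Parent-1 can be the mother or father in the trio, and parent-2 is the remaining parent. The probability of each trio genotype $(g_c, g_{p_1}, g_{p_2})$ in the family is computed as

$$P_{trio}(g_c, g_{p_1}, g_{p_2}) = P_{gt,c}(g_c) \, P_{gt,p_1}(g_{p_1}) \, P_{gt,p_2}(g_{p_2}) \tag{4}$$

For each individual's probability, we simply have the property that

$$\sum_{g=1}^{21} P_{gt,s}(g) = 1, \quad s \in \{c, p_1, p_2\} \tag{5}$$

Combining formulas (4) and (5) for the trio genotype, we have a similar property for the trio's probability

$$\sum_{g_c} \sum_{g_{p_1}} \sum_{g_{p_2}} P_{trio}(g_c, g_{p_1}, g_{p_2}) = 1 \tag{6}$$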

The Mendelian constraint violation loss function: MCVLoss
MCVLoss is based on the idea of penalizing trio genotypes that violate Mendelian inheritance. For each trio genotype, we define a parameter $\beta$, representing the valid degree of the genotype

$$\beta(g_c, g_{p_1}, g_{p_2}) = \begin{cases} 1, & \text{if not MCV} \\ \mu, & \text{if MCV, and child has one allele mismatch} \\ \mu^2, & \text{if MCV, and child has two alleles mismatch} \end{cases} \tag{7}$$

where $\mu$ is the mutation rate per generation, set as 1e−8 by default [18]. Combining the probability of each trio genotype in the family and the corresponding valid degree, the predicted overall valid degree for the trio prediction becomes

$$V_{trio} = \sum_{g_c} \sum_{g_{p_1}} \sum_{g_{p_2}} \beta(g_c, g_{p_1}, g_{p_2}) \, P_{trio}(g_c, g_{p_1}, g_{p_2}) \tag{8}$$

Based on formulas (6)-(8), we know that $V_{trio} \in (0, 1)$. With all this information, the MCVLoss is defined as

$$L_{MCV} = -\alpha \log(V_{trio} + \epsilon) \tag{9}$$

where $\alpha$ controls the importance of the Mendelian inheritance penalty in the model, and $\epsilon$ is a small number (1e−9 by default) that caps the log function to avoid reaching infinity. $\alpha$ is set as 1 by default, which was decided experimentally.
With the MCVLoss available, the final Clair3-Trio loss function is

$$L_{Clair3\text{-}Trio} = \sum_{s \in \{c, p_1, p_2\}} \left( -\sum_{i=1}^{21} Y_{gt,s}^{(i)} \log P_{gt,s}^{(i)} \right) + L_{MCV} + L_2 \tag{10}$$

where $Y_{gt}$ denotes the true 21-genotype ground truth and $P_{gt}$ denotes the predicted probability of each 21-genotype label.
In this manner, MCVLoss introduces the Mendelian inheritance prior into model training. The detailed results of using MCVLoss are presented in the Results section.
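To make the definition concrete, the following pure-Python sketch computes an MCVLoss value from three 21-genotype probability vectors. It rests on several assumptions that are not confirmed by the text: the 21 genotypes are modeled as unordered pairs over the six symbols A, C, G, T, I and D; the trio probability is the product of the three individual probabilities; a valid trio has valid degree 1; and ε is added inside the logarithm. All function names are illustrative.

```python
import math
from itertools import combinations_with_replacement
from typing import List, Sequence, Tuple

ALLELES = "ACGTID"  # assumed allele symbols; I and D stand for insertion and deletion
GENOTYPES: List[Tuple[str, str]] = list(combinations_with_replacement(ALLELES, 2))  # 21 pairs

MU = 1e-8        # mutation rate per generation (default in the text)
ALPHA = 1.0      # weight of the Mendelian penalty (default in the text)
EPSILON = 1e-9   # keeps the logarithm finite (default in the text)


def child_allele_mismatches(child, parent_1, parent_2) -> int:
    """Smallest number of child alleles that cannot be explained by the parents."""
    a, b = child
    return min((a not in parent_1) + (b not in parent_2),
               (a not in parent_2) + (b not in parent_1))


def valid_degree(child, parent_1, parent_2, mu: float = MU) -> float:
    """Beta: 1 for a valid trio, mu for one mismatch, mu squared for two."""
    return mu ** child_allele_mismatches(child, parent_1, parent_2)


def mcv_loss(p_child: Sequence[float], p_parent_1: Sequence[float],
             p_parent_2: Sequence[float], alpha: float = ALPHA,
             epsilon: float = EPSILON, mu: float = MU) -> float:
    """-alpha * log(V_trio + epsilon), with V_trio summed over all 21^3 trio genotypes."""
    v_trio = 0.0
    for i, g_c in enumerate(GENOTYPES):
        for j, g_1 in enumerate(GENOTYPES):
            for k, g_2 in enumerate(GENOTYPES):
                p_trio = p_child[i] * p_parent_1[j] * p_parent_2[k]
                v_trio += valid_degree(g_c, g_1, g_2, mu) * p_trio
    return -alpha * math.log(v_trio + epsilon)


def one_hot(genotype: Tuple[str, str]) -> List[float]:
    """Probability vector that puts all mass on one genotype."""
    key = tuple(sorted(genotype, key=ALLELES.index))
    return [1.0 if g == key else 0.0 for g in GENOTYPES]
```

With these defaults, a confident and Mendelian-consistent prediction yields a loss close to zero, whereas confident predictions with one or two unexplained child alleles are penalized increasingly heavily. A training implementation would express the same sum with tensor operations so that gradients can flow through it.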

Benchmarking methods and metrics
We use Precision, Recall and F1-score metrics to evaluate the family trio variant-calling performance in different configurations. The Precision, Recall and F1-score are computed via hap.py (v0.3.12) [19]. We computed the number of Mendelian violation variants in trios using the following steps: (i) merging all trio variant results using BCFtools (v1.12) [20] with the flag '-f PASS -0 -m all' and (ii) computing the number of Mendelian violations via RTG tools (v3.12.1) [21]. We also computed the number of de novo variants in the model's prediction, where de novo variants [15] are defined as variants confidently genotyped as 0/1 in the child and as 0/0 or unknown in the parents. Note that Precision, Recall, F1-score and the number of de novo variants are restricted to the high-confidence regions, while the number of Mendelian violations is computed across all sites.
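For reference, the metric definitions and the de novo rule above can be written out as follows. This sketch only restates the standard formulas; it does not reproduce hap.py or RTG tools, and the helper names as well as the use of "./." for an unknown genotype are assumptions made for illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def is_de_novo(child_gt: str, parent_1_gt: str, parent_2_gt: str) -> bool:
    """Child genotyped 0/1 while each parent is 0/0 or unknown (written './.' here)."""
    parents_ok = all(gt in ("0/0", "./.") for gt in (parent_1_gt, parent_2_gt))
    return child_gt == "0/1" and parents_ok
```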

Assessing variant-calling accuracy in individuals
We compared the Clair3-Trio variant-calling performance against Clair3 and PEPPER at different coverages in individuals from the GIAB trio. The overall benchmark results are shown in the accompanying figure. Comparing the performance gain among members of the family trio, we found that the gain in the child (HG002) was much larger than that in the parents (HG003 and HG004). For INDEL, Clair3-Trio achieved a +5.62% increment in the F1-score in the child compared with Clair3 at 60×, while the improvement dropped to +2.38% in the parents (Supplementary Table 1, see Supplementary Data available online at http://bib.oxfordjournals.org/). The rationale is that, for calling variants, the family trio provides more information about the child, which shares two haplotypes with the parents, while each parent shares only one haplotype with the child.

Assessing variant-calling accuracy in a family trio
Comprehensively evaluating variants across all family members using metrics such as the number of Mendelian violations is important when calling variants in a family trio. In terms of Mendelian inheritance violations, Clair3-Trio produced an order of magnitude fewer violations than Clair3 and PEPPER at 10× to 30× coverage. We also benchmarked Clair3-Trio on the Chinese trio (HG005 trio) from GIAB. We obtained the ONT sequencing

Assessing the effect of varying parental coverage on variant-calling accuracy
When calling variants from trios, it is common for the parents to be sequenced at half the coverage of the child, or even lower, to manage sequencing costs [23]. To assess the effect of low parental coverage on variant calling, we set the child sample to a coverage of 60× and downsampled the sequencing data of the parents from 60× to test coverages of 10×, 20×, 30×, 40×, 50× and 60×. For the child sample, the overall performance of Clair3-Trio is similar to that of Clair3 when the parents have very low coverage (10×), indicating that 20× or more for the parents is required for trio calling to improve the variant calling for the child. When the parents have half the child's coverage (child 60×, parents 30×), Clair3-Trio achieved an overall F1-score of 96.92%, compared with 96.50% and 95.98% for Clair3 and PEPPER, respectively. Separating the results into SNP and INDEL, we found that Clair3-Trio outperformed the other tools when parents had coverage higher than 10× for SNP calling and coverage higher than 30× for INDEL calling. Furthermore, with Clair3-Trio, there was a large improvement in the performance on low-coverage parent data when higher coverage for the child was provided (Figure 4). Clair3-Trio achieved a +6.02% increment in the F1-score in HG003 (10× parent sample) compared with Clair3. Furthermore, when parents had half the child's coverage (60× for child and 30× for parents), Clair3-Trio had an F1-score of 96.54% for HG003, which is also higher than the 95.83% of Clair3 and the 95.07% of PEPPER.
We also tested Clair3-Trio in a scenario where only the child has lower coverage (10× for the child and 30× for the parents). The scenario with higher coverage for the parents is easier for trio calling than scenarios in which the parents have lower or equal coverage. The results are available in Supplementary Table 6 (see Supplementary Data available online at http://bib.oxfordjournals.org/). We found that Clair3-Trio still performed better in this scenario, with an F1-score of 93.69% for the 10× child, compared with 82.78% for Clair3 and 54.77% for PEPPER.
The improvement of Clair3-Trio on the trio data makes it useful for population genome projects in which better variant calling performance is expected for both parents and children.

Comparison of different architectures and model shape
We first categorized different methods based on their input and output information to generalize different methods for variant calling from family child-mother-father trio data. The One-to-One model inputs single sample information and outputs single sample variants. Clair3, PEPPER and Medaka are typical One-to-One models. The Trio-to-One model inputs data from three samples into the model and outputs single sample variants. For example, DeepTrio, which works with Illumina and PacBio HiFi data, is a typical Trio-to-One model. Finally, the Trio-to-Trio model inputs data from three samples into the model and outputs the three samples' variants simultaneously. In Clair3-Trio, we built the first Trio-to-Trio model. To compare the performance of different architectures, we ablated the input and output tensors of Clair3-Trio models accordingly to test three architectures: One-to-One, Trio-to-One and Trio-to-Trio models. The One-to-One model has single sample input and predicts single sample variants, as in Clair3 and PEPPER. The Trio-to-One model has information of three samples in its input but predicts single sample variants, as in DeepTrio. The Trio-to-Trio model is a native version of Clair3-Trio, which has three samples input and three samples output, but with MCVLoss and fine-tuning deactivated. We trained a single model for all architectures on chromosome 1 64× data from the GIAB HG002 trio and tested the performance on chromosome 20. For the Trio-to-One model, which is sample order specific, we trained two models separately to make predictions: a child model and a parent model. The benchmark results for the child as well as the number of Mendelian violations are in Figure 5.
We found that including trio information in the model efficiently improves the variant calling performance overall, especially in terms of Mendelian inheritance violations (Figure 5). Switching from One-to-One to Trio-to-One alone can boost the F1-score in the child by about +0.23%. The performance increment is consistent with the DeepTrio results [15]. The performance boost increased to +0.37% and +1.2% when the architecture was switched to Trio-to-Trio and Clair3-Trio (with MCVLoss and fine-tuning), respectively. For the child sample, the F1-score for Trio-to-Trio, One-to-One and Trio-to-One was 96.30, 95.93 and 96.16%, respectively. However, for the parent samples, Trio-to-Trio was only slightly better than One-to-One and Trio-to-One. We also found that the Trio-to-Trio architecture predicted far fewer Mendelian inheritance violation variants: 7872 in the Trio-to-Trio model, 29 753 in the One-to-One model and 20 016 in the Trio-to-One model.
To further explore the best architecture for the Trio-to-Trio model, we also evaluated the effect of using different model shapes. With three inputs and three outputs available, we developed multiple candidates for model shape, as illustrated in Figure 5: (i) Model-A, which inputs the information of all samples into Resblock and divides the last dense layer to give three outputs; (ii) Model-B, which inputs the information of all samples into Resblock and divides all dense layers; (iii) Model-C, which inputs single sample information into shared Resblock and divides the last dense layer to generate three outputs and (iv) Model-D, which shares multiple Resblock from a single input and divides the last dense layer to generate three outputs. We found that all four shapes achieved similar F1-scores in the child sample (96.30% for Model-A, 96.18% for Model-B, 96.26% for Model-C and 96.25% for Model-D), but Model-A predicted far fewer Mendelian violations than the other models (7872 compared with 11 278, 10 053 and 10 370, respectively, for the other shapes). For this reason, we selected Model-A as the best shape for the Trio-to-Trio architecture.

Finetuning with MCVLoss
The MCVLoss (Mendelian Inheritance Constraint Violation Loss) function is designed to improve variant calling in trios by leveraging the explicit encoding of the priors of Mendelian inheritance in trios. We found that MCVLoss can effectively reduce Mendelian violation predictions in variant calling. However, MCVLoss works best when combined with fine-tuning, in which we train a Clair3-Trio model in two steps: (i) training Clair3-Trio without MCVLoss with the default learning rate (1e−3 in our setting) and (ii) fine-tuning the trained Clair3-Trio model with MCVLoss with a lower learning rate (1e−5 in our setting). When using fine-tuning alone, the F1-score for HG002, HG003 and HG004 had a performance boost of +0.2% (Table 1). We obtained the best results when combining fine-tuning and MCVLoss, with the +0.2% F1-score increment and a reduction in Mendelian violations from 7872 to 4352.
We also evaluated the effect of using a different α rate in MCVLoss (Table 2). The α rate in MCVLoss controls the weighting in terms of loss function, as in formula (9). We observed that increasing the α rate efficiently decreases the number of Mendelian inheritance violations but slightly decreases the overall performance based on the F1-score. We found the α rate of 1 to be the best setting for MCVLoss, which balances the F1-score and the number of Mendelian inheritance violations metrics.

Computational efficiency of Clair3-trio
We inherited the highly optimized modules from Clair3 and created a Clair3-Trio workflow (Figure 1A) with parallel computing features in each component to enable efficient variant calling from trio data. We benchmarked the efficiency of Clair3-Trio against Clair3 and PEPPER on a machine with two 12-core Intel Xeon Silver 4116 processors. Clair3-Trio was computationally efficient for trio

Discussion
Clair3-Trio outperformed single sample callers, especially at lower coverages, making sequencing a family trio at a relatively lower coverage more favorable than sequencing only the child to a high coverage. As an example, in Supplementary Figure 6 (see Supplementary Data available online at http://bib.oxfordjournals.org/), two genotypes, 0/1 and 0/2, in the child had an equal number of supporting reads in 10× data. Clair3 and PEPPER failed to call the variant based only on the child's data. Clair3-Trio called the child genotype as 0/1 correctly with information from the parents at the same site. We found that most of the Mendelian violation cases from Clair3-Trio (68.6%) have the following genotypes for parent-1, parent-2 and child, respectively: (0/0, 1/1, 0/0), (0/0, 1/1, 1/1), (0/0, 0/0, 0/1) and (0/0, 0/0, 1/1) (Figure 3B). All these violations tend to arise when a single trio member is called with the wrong zygosity (heterozygous instead of homozygous, or vice versa) at a site. For example, in the case of the Mendelian violation (0/0, 1/1, 0/0), a switch between heterozygous and homozygous in any member's call turns the trio into a call that does not violate Mendelian inheritance. As all members have a chance of being miscalled, these cases remain a challenge even when trio data are available.
Clair3-Trio has high performance overall, but it predicts fewer de novo variants than Clair3 and PEPPER. The drop in true positive (TP) de novo variants is expected, as Clair3-Trio is designed to predict variants by leveraging information from family trios, which favors having fewer Mendelian violations in its prediction. For detecting de novo variants that do not follow Mendelian inheritance, One-to-One model-based methods such as Clair3 and PEPPER can be used to supplement Clair3-Trio.
There are some challenges and future work remaining for trio variant calling from ONT data. Experiments show that Clair3-Trio's improvement over state-of-the-art methods is large when the trio data have similar coverage among family members, but only marginal when family members have very different coverage (such as child coverage of 60× and parent coverage of 10×). These results leave room for further improvement in trio calling in diverse coverage applications. The current model is trained with multiple coverages downsampled from the full coverage, but only with the coverage of the child kept equal to or larger than that of the parents, and not with the Cartesian product of the downsampled coverages of the three samples. This is a practical decision that reduces the amount of training data and reflects that the coverage of the child in a trio is usually higher than that of the parents. However, it may also challenge Clair3-Trio when the coverage of the parents exceeds that of the child. An improved training scheme is needed to handle the large amount of training data that results when all coverage combinations are used. On the other hand, there is a research gap in applying variant calling to the human sex chromosomes. The current training and testing were constrained to the autosomes, which assumes that variants are diploid and that one allele is inherited from each parent. However, this assumption does not hold when calling variants in the child's Y chromosome, which is haploid and is inherited only from the father. Currently, there are no tools available for calling variants in the sex chromosome regions with the family information from ONT data. We need a new design for calling variants in the sex chromosome regions to fill this research gap.
In the future, we would like to design a heuristic approach to address this problem: if the child is female, use Clair3-Trio directly on the sex chromosomes; if the child is male, use Clair3-Trio to call variants in the pseudoautosomal regions (PAR1 and PAR2) of the sex chromosomes and build a tailored haplotype model to call variants in the remaining regions.
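The proposed heuristic amounts to a small dispatch rule, sketched below. The function name and return labels are hypothetical, the caller is assumed to already know whether a site lies in PAR1 or PAR2, and the haplotype model it refers to does not exist yet.

```python
def choose_sex_chromosome_caller(child_sex: str, in_par: bool) -> str:
    """Pick a calling strategy for a site on a sex chromosome.

    child_sex: "female" or "male".
    in_par: True if the site lies in a pseudoautosomal region (PAR1 or PAR2).
    Returns "trio" for the Clair3-Trio model or "haplotype" for a future
    tailored haplotype model.
    """
    if child_sex == "female":
        return "trio"
    if child_sex == "male":
        return "trio" if in_par else "haplotype"
    raise ValueError(f"unsupported child_sex: {child_sex!r}")
```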

Conclusion
In conclusion, we introduced Clair3-Trio, a high-performance Nanopore long-read variant caller in family trios with a Trio-to-Trio deep neural network. Clair3-Trio is the first family trio variant caller tailored for Nanopore long-read data with a Trio-to-Trio deep neural network model and MCVLoss. In our experiments, Clair3-Trio outperformed current state-of-the-art methods on trio variant calling in terms of F1-score and the number of Mendelian inheritance violations in all three samples from a trio. We also demonstrated that the architecture of the Trio-to-Trio model is much more accurate than the One-to-One and Trio-to-One models. The source code and the results of this study are publicly available on GitHub.

Authors' contributions
R.L. conceived the study. J.S. and R.L. designed the algorithms, implemented Clair3-Trio and wrote the paper. All authors evaluated the results and revised the manuscript.

Key Points
• Developed a Trio-to-Trio model to predict trio variants in ONT data.
• Introduced a novel loss function, MCVLoss, to model Mendelian inheritance in trio data.
• Demonstrated that the Clair3-Trio model trained on GIAB data improves variant calling in trio data.
• Demonstrated that Trio-to-Trio models can efficiently decrease Mendelian inheritance violations compared with One-to-One and Trio-to-One models.

Supplementary data
Supplementary data are available online at http://bib.oxfordjournals.org/.