ZygosityPredictor

Abstract
Summary: ZygosityPredictor provides functionality to evaluate how many copies of a gene are affected by mutations in next-generation sequencing data. In cancer samples, the tool processes both somatic and germline mutations. In particular, ZygosityPredictor computes the number of affected copies for single nucleotide variants and small insertions and deletions (indels). In addition, the tool integrates information at gene level by phasing several variants and applying subsequent logic to derive how strongly a gene is affected by mutations, and it provides a measure of confidence. This information is of particular interest in precision oncology, e.g., when assessing whether unmutated copies of tumor-suppressor genes remain.
Availability and implementation: ZygosityPredictor is implemented as an R package and is available via Bioconductor at https://bioconductor.org/packages/ZygosityPredictor. Detailed documentation, including an application to an example genome, is provided in the vignette.

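Let $ac_{\mathrm{tum}}$ denote the number of affected copies in the tumor ($ac^{\mathrm{som}}_{\mathrm{tum}}$ for somatic variants and $ac^{\mathrm{germ}}_{\mathrm{tum}}$ for germline variants), $c_{\mathrm{tum}}$ the copy number in the tumor of the genomic segment in which the variant is located, and $\mathit{VAF}^{\mathrm{som}}_{\mathrm{tum}}$ the allele frequency of the somatic variant in the tumor sample. Let $n_{\mathrm{tum}}$ and $n_{\mathrm{norm}}$ denote the numbers of tumor and normal cells in the sample, respectively. Then, for a 100 % pure tumor, the allele frequency of a somatic variant is defined as:

$$ \mathit{VAF}^{\mathrm{som}}_{\mathrm{tum}} = \frac{ac^{\mathrm{som}}_{\mathrm{tum}}}{c_{\mathrm{tum}}} \tag{1} $$

If the purity is not 100 %, the formula needs to be extended. Let $c_{\mathrm{norm}}$ denote the expected copy number in normal tissue of the segment in which the variant is located. This will be 2 for most genes located on autosomes and for genes on the X chromosome in female samples, and 1 for genes on the X and Y chromosomes in male samples (without the pseudoautosomal regions). Formula 1 can then be extended as follows:

$$ \mathit{VAF}^{\mathrm{som}}_{\mathrm{tum}} = \frac{ac^{\mathrm{som}}_{\mathrm{tum}} \cdot n_{\mathrm{tum}}}{c_{\mathrm{tum}} \cdot n_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot n_{\mathrm{norm}}} \tag{2} $$

The purity $p$ of a tumor sample is defined by:

$$ p = \frac{n_{\mathrm{tum}}}{n_{\mathrm{tum}} + n_{\mathrm{norm}}} \tag{3} $$

From this, we can deduce $n_{\mathrm{norm}} = n_{\mathrm{tum}} \cdot \frac{1-p}{p}$ and:

$$ \mathit{VAF}^{\mathrm{som}}_{\mathrm{tum}} = \frac{ac^{\mathrm{som}}_{\mathrm{tum}}}{c_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot \frac{1-p}{p}} \tag{4} $$

and thus

$$ ac^{\mathrm{som}}_{\mathrm{tum}} = \mathit{VAF}^{\mathrm{som}}_{\mathrm{tum}} \cdot \left( c_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot \frac{1-p}{p} \right) \tag{5} $$

In the tumor sample, the allele frequency of germline variants can be defined as:

$$ \mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} = \frac{ac^{\mathrm{germ}}_{\mathrm{tum}} \cdot n_{\mathrm{tum}} + ac^{\mathrm{germ}}_{\mathrm{norm}} \cdot n_{\mathrm{norm}}}{c_{\mathrm{tum}} \cdot n_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot n_{\mathrm{norm}}} \tag{6} $$

Of note, $ac^{\mathrm{germ}}_{\mathrm{tum}}$ and $ac^{\mathrm{germ}}_{\mathrm{norm}}$ denote the numbers of affected copies for the germline variant in tumor and normal tissue, respectively. Using $\mathit{VAF}^{\mathrm{germ}}_{\mathrm{norm}}$, the allele frequency of the germline variant in the normal control, we can substitute $ac^{\mathrm{germ}}_{\mathrm{norm}} = \mathit{VAF}^{\mathrm{germ}}_{\mathrm{norm}} \cdot c_{\mathrm{norm}}$ and obtain:

$$ \mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} = \frac{ac^{\mathrm{germ}}_{\mathrm{tum}} \cdot n_{\mathrm{tum}} + \mathit{VAF}^{\mathrm{germ}}_{\mathrm{norm}} \cdot c_{\mathrm{norm}} \cdot n_{\mathrm{norm}}}{c_{\mathrm{tum}} \cdot n_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot n_{\mathrm{norm}}} \tag{7} $$

Applying the same arithmetic as above, we get:

$$ \mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} = \frac{ac^{\mathrm{germ}}_{\mathrm{tum}} + \mathit{VAF}^{\mathrm{germ}}_{\mathrm{norm}} \cdot c_{\mathrm{norm}} \cdot \frac{1-p}{p}}{c_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot \frac{1-p}{p}} \tag{8} $$

and thus

$$ ac^{\mathrm{germ}}_{\mathrm{tum}} = \mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} \cdot \left( c_{\mathrm{tum}} + c_{\mathrm{norm}} \cdot \frac{1-p}{p} \right) - \mathit{VAF}^{\mathrm{germ}}_{\mathrm{norm}} \cdot c_{\mathrm{norm}} \cdot \frac{1-p}{p} \tag{9} $$

If $\mathit{VAF}^{\mathrm{germ}}_{\mathrm{norm}}$ is unknown, it can be assumed, for a normal chromosome set, to be 0.5 for heterozygous variants in most genes located on autosomes or on the X chromosome in female samples, and 1 for homozygous variants and for variants in genes on the X and Y chromosomes in male samples (without the pseudoautosomal regions).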
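The purity correction above can be illustrated with a short Python sketch. It is not the package's R implementation; the function names and default arguments (`c_norm=2.0`, `vaf_norm=0.5`) are hypothetical and chosen only for illustration. It uses the identity $\frac{1-p}{p} = \frac{1}{p} - 1$.

```python
def affected_copies_somatic(vaf_tum, c_tum, purity, c_norm=2.0):
    """Affected copies of a somatic variant, corrected for normal admixture.

    Hypothetical helper: ac = VAF * (c_tum + c_norm * (1/p - 1)).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    normal_term = c_norm * (1.0 / purity - 1.0)
    return vaf_tum * (c_tum + normal_term)


def affected_copies_germline(vaf_tum, c_tum, purity, c_norm=2.0, vaf_norm=0.5):
    """Affected copies of a germline variant in the tumor.

    Hypothetical helper: the somatic expression minus the contribution of
    variant-carrying copies in admixed normal cells.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    normal_term = c_norm * (1.0 / purity - 1.0)
    return vaf_tum * (c_tum + normal_term) - vaf_norm * normal_term


if __name__ == "__main__":
    # Somatic variant, VAF 0.4, diploid tumor segment, 80 % purity
    print(affected_copies_somatic(0.4, 2, 0.8))
    # Heterozygous germline variant, VAF 0.9 in the tumor, same segment
    print(affected_copies_germline(0.9, 2, 0.8))
```

With a purity of 1 the normal-tissue term vanishes and both functions reduce to the product of allele frequency and tumor copy number, as expected for a pure tumor.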

Allelic Imbalance Phasing (AIP)
If read-level phasing approaches fail, segments of allelic imbalance can be used to determine the constellation of two variants. In such segments, somatic copy number alterations (sCNAs) have taken place during tumor development, so that the number of copies of one allele differs from the number of copies of the other allele. In NGS data, this is reflected by allele frequencies that deviate depending on which allele a variant is located on. If, for example, a genomic segment was called with an allelic imbalance of 1:2, i.e., one allele is present once while the other allele was duplicated, we expect variants in this segment to have either a low or a high allele frequency, depending on whether they are on the minor or the major allele. In this work, we use genotype likelihoods to determine which of the two cases applies to a given variant. By using formula 10 from [1], the likelihood of a variant being located on the allele with the respective genotype can be defined.
Let $a$ denote the number of reads supporting the alternative allele at the position of a variant and $r$ the number of reads supporting the reference allele. The genotype is denoted by $g$, i.e., the number of copies of the allele that is currently checked. $\epsilon_j$ is the error probability of read $j$, defined by $(1 - p_{mq,j}) \cdot (1 - p_{bc,j}) = 1 - \epsilon_j$, where $p_{mq,j}$ is the probability of a mapping error, as given by the mapping quality, and $p_{bc,j}$ is the error probability of the base call. The genotype likelihood of the genotype $g$ is then defined by formula 10, which can be used to determine the genotype with the highest likelihood for a given position that may contain a mutation.

To determine the constellation of two variants $m_1$ and $m_2$ relative to each other, four genotype likelihoods are calculated: one for each variant under each of the two possible genotypes of the segment of allelic imbalance, denoted $gt_1$ and $gt_2$. From these, the likelihoods of the two possible constellations are calculated, and the constellation of higher likelihood is selected. For reasons of numerical stability, we use the log-likelihood ratio. Its absolute value is used as the confidence measure for AIP cases. Of note, the confidence measures of AIP and read-level phasing cannot be directly compared, nor, in our opinion, should they be. Both are annotated in separate columns of the output of ZygosityPredictor.
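The AIP decision can be sketched in Python as follows. The exact form of the genotype likelihood and of the constellation likelihoods is not reproduced in this text, so everything below is an assumption: the sketch uses a common per-read genotype likelihood with total copy number `m` (a symbol not defined in this text), and it combines the four likelihoods by summing over the two allele assignments that are compatible with each constellation. All function names, the labels `"same"` and `"diff"`, and the sign convention of the log-likelihood ratio are hypothetical.

```python
import math


def read_error(p_mq, p_bc):
    """Combined per-read error: 1 - eps = (1 - p_mq) * (1 - p_bc)."""
    return 1.0 - (1.0 - p_mq) * (1.0 - p_bc)


def genotype_log_likelihood(alt_errors, ref_errors, g, m):
    """Log-likelihood that g of m copies carry the alternative allele.

    ASSUMED form (not taken from the text). alt_errors and ref_errors hold
    the per-read error probabilities of the a alternative and r reference
    reads. Errors must lie strictly between 0 and 1 if g is 0 or m.
    """
    k = len(alt_errors) + len(ref_errors)
    log_l = -k * math.log(m)
    for eps in alt_errors:
        log_l += math.log(g * (1.0 - eps) + (m - g) * eps)
    for eps in ref_errors:
        log_l += math.log((m - g) * (1.0 - eps) + g * eps)
    return log_l


def log_add(x, y):
    """Return log(exp(x) + exp(y)) without leaving log space."""
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))


def aip_constellation(m1, m2, gt1, gt2, m):
    """Classify two variants as on the same or on different alleles.

    m1, m2: tuples (alt_errors, ref_errors). ASSUMPTION: each constellation
    likelihood is the sum over its two compatible allele assignments.
    """
    l11 = genotype_log_likelihood(m1[0], m1[1], gt1, m)
    l12 = genotype_log_likelihood(m1[0], m1[1], gt2, m)
    l21 = genotype_log_likelihood(m2[0], m2[1], gt1, m)
    l22 = genotype_log_likelihood(m2[0], m2[1], gt2, m)
    log_same = log_add(l11 + l21, l12 + l22)
    log_diff = log_add(l11 + l22, l12 + l21)
    llr = log_same - log_diff
    return ("same" if llr > 0 else "diff"), abs(llr)


if __name__ == "__main__":
    eps = read_error(0.0, 0.01)
    low = ([eps] * 10, [eps] * 20)   # 10 alternative, 20 reference reads
    high = ([eps] * 20, [eps] * 10)  # 20 alternative, 10 reference reads
    print(aip_constellation(low, high, gt1=1, gt2=2, m=3))
```

In a 1:2 segment, one variant with a low and one with a high allele frequency are assigned to different alleles, while two variants with similar allele frequencies are assigned to the same allele. Working in log space avoids underflow when many reads are multiplied.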
As we assume that our samples are not fully pure, i.e., the tumor cell content is lower than 100 %, we need to slightly adjust the computation of the genotype likelihood.

Somatic variants
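For somatic variants, the $r$ parameter needs to be reduced, as we expect reads supporting the reference base from admixed normal tissue. For germline variants, both $r$ and $a$ need to be adjusted, as normal cells contribute reads supporting both alleles. We define the variant allele frequency as follows:

$$ \mathit{VAF} = \frac{a}{a + r} \tag{14} $$

By combining formulae 14 and 1, i.e., by requiring that the adjusted read counts reflect a pure tumor, $\frac{a}{a + r'} = \frac{ac^{\mathrm{som}}_{\mathrm{tum}}}{c_{\mathrm{tum}}}$, we can conclude that the parameter $r$ needs to be adapted to $r'$ by the following formula:

$$ r' = a \cdot \left( \frac{c_{\mathrm{tum}}}{ac^{\mathrm{som}}_{\mathrm{tum}}} - 1 \right) \tag{15} $$

As we do not expect any reads supporting the alternative allele in normal tissue for somatic variants, formula 15 is sufficient to adjust the number of reads for somatic variants.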
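As a hedged illustration of the somatic adjustment, the following Python sketch computes an adjusted reference read count $r'$ such that the adjusted allele frequency $a / (a + r')$ equals the pure-tumor allele frequency (affected copies divided by tumor copy number). The function name, the default `c_norm=2.0`, and the clamping at zero are assumptions made for this example and are not taken from the package.

```python
def adjusted_reference_reads_somatic(a, r, c_tum, purity, c_norm=2.0):
    """Reference read count expected from tumor cells only (hypothetical helper).

    a, r: observed alternative and reference read counts.
    Returns a float; rounding to an integer read count is left to the caller.
    """
    if a <= 0:
        raise ValueError("at least one alternative read is required")
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    vaf = a / (a + r)
    # Affected copies in the tumor, corrected for normal admixture
    ac = vaf * (c_tum + c_norm * (1.0 / purity - 1.0))
    # Solve a / (a + r') = ac / c_tum for r'; clamp noise-induced negatives
    return max(0.0, a * (c_tum / ac - 1.0))


if __name__ == "__main__":
    # 40 alternative and 60 reference reads, diploid segment, 80 % purity
    print(adjusted_reference_reads_somatic(40, 60, 2, 0.8))
```

In this example, 20 of the 100 reads are expected to stem from normal cells and all of them support the reference, so the reference count drops from 60 to about 40, while the alternative count stays unchanged.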

Germline variants
For germline variants, however, where reference and alternative reads are expected in admixed normal tissue, we need to distinguish between $r_{\mathrm{tum}}$ and $r_{\mathrm{norm}}$ (and likewise $a_{\mathrm{tum}}$ and $a_{\mathrm{norm}}$), the numbers of reads supporting the reference (alternative) allele in tumor and in normal tissue, respectively. To derive the necessary adjustments of these four parameters, we make use of four definitions or formulae, which form a linear equation system:

$$ r = r_{\mathrm{tum}} + r_{\mathrm{norm}}, \quad \ldots $$

The linear equation system has the following solutions:

$$ \ldots = \frac{\ldots_{\mathrm{norm}} \cdot \left( \mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} - 1 \right) - \mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} \cdot \left( r'_{\mathrm{tum}} - r'_{\mathrm{norm}} \right)}{\mathit{VAF}^{\mathrm{germ}}_{\mathrm{tum}} - 1} \tag{17} $$

The formulae in (17) are the required adjustments of the numbers of reference and alternative reads for germline variants. For computation of the genotype likelihood, only $r'_{\mathrm{tum}}$ and $a'_{\mathrm{tum}}$ are needed. In the implementation in ZygosityPredictor, in order to follow a conservative strategy, the $r'_{\mathrm{tum}}$ reference-supporting and $a'_{\mathrm{tum}}$ alternative-supporting reads of lowest quality are selected.
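The idea of removing the normal-tissue contribution from germline read counts can be sketched in Python as follows. This is explicitly not the equation system or its solutions from this section, which are only partially preserved here. It is a simplified alternative that assumes reads are drawn in proportion to cell fraction times copy number, and that the normal-tissue allele frequency `vaf_norm` is known. All names and defaults are hypothetical.

```python
def adjusted_reads_germline(a, r, c_tum, purity, c_norm=2.0, vaf_norm=0.5):
    """Estimate tumor-only alternative and reference read counts.

    Simplified sketch under stated assumptions, not the published solution.
    Returns (a_tum, r_tum) as floats, clamped at zero.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    tumor_weight = purity * c_tum
    normal_weight = (1.0 - purity) * c_norm
    # Expected share of all reads at this position that stem from normal cells
    normal_fraction = normal_weight / (tumor_weight + normal_weight)
    normal_reads = (a + r) * normal_fraction
    a_norm = vaf_norm * normal_reads
    r_norm = (1.0 - vaf_norm) * normal_reads
    return max(0.0, a - a_norm), max(0.0, r - r_norm)


if __name__ == "__main__":
    # Heterozygous germline variant with loss of the wild-type allele in the
    # tumor: 90 alternative and 10 reference reads at 80 % purity
    print(adjusted_reads_germline(90, 10, 2, 0.8))
```

For this input, about 20 reads are attributed to normal cells (10 per allele), leaving roughly 80 alternative and 0 reference reads for the tumor, which is consistent with two affected copies in a diploid tumor segment.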