Supplementary Data: Ultra-Fast Local-Haplotype Variant Calling Using Paired-end DNA-Sequencing Data Reveals Somatic Mosaicism in Tumor and Normal Blood Samples



MATHEMATICAL DETAIL
We show how to derive part I in Equation (1) in the paper. Recall that the goal is to compute

$$ \Pr(\lambda_j = 1 \mid s^{obs}, m) \;\propto\; \underbrace{p(s^{obs} \mid \lambda_j = 1)}_{\text{likelihood}} \; \underbrace{\Pr(\lambda_j = 1)}_{\text{prior}} . $$

Expanding the first term I of Equation (1) and augmenting the model with indicators z_i ∈ {1, …, L}, where {z_i = j'} is defined as the event that read i is generated from a DNA segment possessing haplotype j', we sum over the possible values of z_i:

$$ p(s_i^{obs} \mid \lambda_j = 1, \lambda_{-j}) \;=\; \sum_{j'=1}^{L} p(s_i^{obs} \mid z_i = j', \lambda_j = 1, \lambda_{-j}) \, \Pr(z_i = j' \mid \lambda_j = 1, \lambda_{-j}). $$
Let e_ir be the error probability for the nucleotide called at base r on short read i. Typically e_ir is known from upstream analysis (e.g., in the form of a Phred quality score). Following the notation in Ji et al. [1], we assume that the observed base s^obs_ir follows a multinomial model,

$$ \Pr(s^{obs}_{ir} = b_{ir} \mid z_i = j, \lambda_j = 1, \lambda_{-j}) \;=\; (1 - e_{ir})^{I(b_{ir} = h_{jr})} \left( \frac{e_{ir}}{3} \right)^{I(b_{ir} \neq h_{jr})}, $$

where b_ir ∈ {A, C, G, T} is the observed genotype of SNV r on read i, and the event {b_ir ≠ h_jr} covers the three nucleotides that differ from h_jr, each receiving probability e_ir/3. We also assume that the observed bases are conditionally independent across the R loci given z_i, so that p(s_i^obs | z_i = j, λ) is the product of the per-base probabilities above. Lastly, we define a prior for z_i. Assume

$$ \Pr(z_i = j \mid \lambda) \;=\; c_{1i}(\lambda) \, I(\lambda_j = 1), $$

where c_1i(λ) is a normalizing constant. The indicator I(λ_j = 1) is needed since, if haplotype j is not present in the sample, the probability that read i is from j should be zero. Plugging into (2), we obtain the equation in the paper.

Additional Computational Detail
The goal is to calculate the probability in equation (6) below. Algorithm 1 in the main paper speeds up computation by taking advantage of the repeated items in the summation.
We state Algorithm 1 again. Denote the sets of indices of the configurations for the events {λ_j = 1 | s^obs} and {λ_j = 0 | s^obs} by V_j1 and V_j0, respectively.
The calculation of Pr(s^obs | C_l) for all l = 0, 1, …, 2^L − 1 in Algorithm 1 is carried out in two parts. In the first part, Pr(s_i^obs | λ_j = 1, λ_{−j}) is calculated for all i = 1, 2, …, N (term I in Equation (6)); in the second part, we calculate the prior term Pr(λ_j = 1, λ_{−j}) (term II in Equation (6)).
With the use of an appropriate data structure in C++, we can find the DNA sequences of the haplotypes, h_j = {h_jr : r = 1, 2, …, R} for j = 1, 2, …, L, with a single linear (O(n)) scan through all of the data. After that, for all reads {s_i^obs : i = 1, 2, …, N} and all haplotypes h_j, we create a matrix D that stores a triplet {w_i, a_ij, d_ij} in each cell (i, j). Recall that w_i is the number of missing bases in the i-th read, and a_ij and d_ij denote the numbers of bases of s_i^obs that agree and disagree with h_j, respectively. From Equation (5), the prior term is T_ij = Pr(z_i = j | λ) = c_1i(λ) I(λ_j = 1), where c_1i(λ) is a normalizing constant. Denoting the likelihood term by E_ij, we have from Equations (6), (7), (8)

$$ \Pr(s_i^{obs} \mid \lambda) \;=\; \sum_{j=1}^{L} T_{ij} \, E_{ij}. $$

Note that this calculation is for one particular configuration of λ.
We denote by e_ir the error probability for the base called at locus r on short read i. Currently, e_ir is taken to be a single constant e for all i = 1, 2, …, N and r = 1, 2, …, R. With a single e, once the matrix D has been generated, the error matrix E (given in Equation (8)) is easy to compute from D: each term E_ij follows directly from s_i^obs, w_i and h_j, and each term T_ij depends on the particular configuration of λ. Equation (9) is then evaluated by forming the element-wise (Hadamard) product F = T ∘ E and summing the elements of each row i of the resulting matrix F.
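The construction of D and the row-sum evaluation described above can be sketched as follows. This is an illustrative sketch, not LocHap's implementation: it assumes E_ij = (1 − e)^{a_ij} (e/3)^{d_ij}, which follows from the per-base multinomial model with a single e, and treats the w_i missing bases as a neutral factor.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

struct Cell { int w, a, d; };   // missing / agreeing / disagreeing counts

// Build the N x L matrix D of {w_i, a_ij, d_ij} triplets by comparing
// each read against each haplotype ('M' marks a missing base).
std::vector<std::vector<Cell>> buildD(const std::vector<std::string>& reads,
                                      const std::vector<std::string>& haps) {
    std::vector<std::vector<Cell>> D(reads.size(),
                                     std::vector<Cell>(haps.size()));
    for (size_t i = 0; i < reads.size(); ++i)
        for (size_t j = 0; j < haps.size(); ++j) {
            Cell c{0, 0, 0};
            for (size_t r = 0; r < reads[i].size(); ++r) {
                if (reads[i][r] == 'M') ++c.w;
                else if (reads[i][r] == haps[j][r]) ++c.a;
                else ++c.d;
            }
            D[i][j] = c;
        }
    return D;
}

// Likelihood term E_ij = (1 - e)^{a_ij} * (e/3)^{d_ij}; missing bases
// contribute a neutral factor (an assumption of this sketch).
double Eij(const Cell& c, double e) {
    return std::pow(1.0 - e, c.a) * std::pow(e / 3.0, c.d);
}

// Row sums of the Hadamard product F = T o E give Pr(s_i^obs | lambda).
std::vector<double> rowSums(const std::vector<std::vector<double>>& T,
                            const std::vector<std::vector<Cell>>& D, double e) {
    std::vector<double> p(D.size(), 0.0);
    for (size_t i = 0; i < D.size(); ++i)
        for (size_t j = 0; j < D[i].size(); ++j)
            p[i] += T[i][j] * Eij(D[i][j], e);
    return p;
}
```

Because D depends only on the data and the haplotypes, it is built once; only T changes across the 2^L configurations of λ, which is what makes the repeated evaluation cheap.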

The hcf FILE FORMAT
Each line in an hcf file contains information about one particular LHV segment. Like a vcf file, an hcf file is a tab-delimited text file. After the initial header fields, each line in the hcf file represents a local haplotype (which might not be a variant) and has seven column fields, which are explained next.
The seven fields are: chromosome name (CHROM), SNV positions on the chromosome (POS), nucleotides at those positions in the reference genome (REF), number of significant haplotypes (NumSig), called haplotypes (HAP-Call), all possible haplotypes (All-HAP), and data for the sample (DataForSample=sample-name). In the "HAP-Call" field, we include the posterior probability of each haplotype variant that is considered statistically significant. The "All-HAP" field contains the posterior probability and the corresponding posterior false discovery rate (FDR) for each possible haplotype. Haplotype variants in the "HAP-Call" field are generated from those in "All-HAP" using an FDR threshold of 0.01. The last field reports a few basic statistics about the input data: the total number of SNPs (nSNP), the total number of reads (nTot), the number of reads with at least one called base at one of the SNP positions (nACGT), the number of blank reads (nBlank), the number of discrepant reads (nDisc), the numbers of reads with no missing entries or with missing entries at one, two, or three positions (nM0, nM1, nM2, nM3, respectively), and the number of clusters directly observed from the data (nClus). The number of clusters is defined as the number of unique read groups with no missing SNP. For more details, refer to the Quick Manual (http://compgenome.org/lochap/code_release/QuickManual-LocHap-release-v1.0.pdf) of the LocHap software. Summary of data for a sample: at the end of each hcf file, a summary is given stating the total number of SNVs in the vcf file and the numbers of segments with zero significant haplotypes, one significant haplotype, two significant haplotypes, and so on.
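A minimal sketch of splitting one hcf record into its seven fields, assuming only the tab-delimited layout described above (the field values used in testing are hypothetical, not from a real hcf file):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split one tab-delimited hcf record into its seven column fields:
// CHROM, POS, REF, NumSig, HAP-Call, All-HAP, DataForSample.
std::vector<std::string> parseHcfLine(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string f;
    while (std::getline(ss, f, '\t'))
        fields.push_back(f);
    return fields;
}
```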

ANALYSIS DETAIL

Universal setup
We analyzed the vcf file containing all the variant calls to identify segments with multiple proximal SNVs. Using the APIs of SAMTools, we developed a computer program to extract the nucleotides of read bases aligned with the SNVs from indexed bam files. For all the Illumina data, reads with an alignment quality score below 30 were discarded. In addition, if a base had a Phred quality score below 30, the called base was ignored and a missing base M was recorded. Thus, we kept only short reads with high-quality alignments and bases with high-quality calling scores.
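The base-level quality filtering described above can be sketched as follows. This is a minimal sketch, not the actual extraction program; it assumes the standard Phred+33 (Sanger/Illumina 1.8+) encoding of the quality string.

```cpp
#include <cassert>
#include <string>

// Phred+33 encoding: quality = ASCII code - 33 (an assumption of this
// sketch; it is the standard Sanger/Illumina 1.8+ convention).
int phredScore(char q) { return static_cast<int>(q) - 33; }

// Mask every base whose Phred quality is below the cutoff (30 here)
// with 'M', mirroring the missing-base convention described above.
std::string maskLowQuality(const std::string& bases,
                           const std::string& quals, int cutoff = 30) {
    std::string out = bases;
    for (size_t r = 0; r < bases.size(); ++r)
        if (phredScore(quals[r]) < cutoff)
            out[r] = 'M';
    return out;
}
```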
HNC data

Exome sequencing data of 30 pairs of tumor and matched normal samples (60 samples in total) from patients with head and neck cancer [2] were downloaded from the Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra). We extracted FASTQ sequences from the SRA files and used BWA [3] to map short reads to the HG19 human reference genome [4]. Next, we used SAMTools [5] on each sample to generate a bam and a bai file, removing polymerase chain reaction (PCR) duplicates. We then used the RealignerTargetCreator and IndelRealigner modules of GATK (version 2.1.9) to refine alignments near all indels [6]. Finally, we called SNPs for all 60 samples with the UnifiedGenotyper module of GATK (version 2.1.9) using a minimum base quality threshold of 30 ("-mbq 30"). GATK caps the quality score of a base at its mapping quality, so this setting also forces GATK to ignore any reads mapped with an alignment quality below 30. All 60 samples were analyzed together in a single run of the UnifiedGenotyper, producing a single vcf file containing all the variants across all the samples.
CEU-TRIO data

The TRIO dataset was generated by whole-genome sequencing. A similar preprocessing scheme was used to generate the bam/bai files and a single vcf file.
Validation CGI data

The CGI sequencing platform employs high-density DNA nanoarrays populated with DNA nanoballs (DNBs); base identification is performed using a non-sequential, unchained read technology known as combinatorial probe-anchor ligation (cPAL). CGI sequencing technology, including the general library construction process, the ligation-based assay approach, and the methodology and performance of the platform, is described in [7]. More information on read mapping, variant calling, and read data formats can be found at http://www.completegenomics.com/documents/DataFileFormats_Standard_Pipeline_2.5.pdf. The data corresponding to a single genome are organized into three main directories:

• ASM - Assembly of the complete genome: variations called, coverage, and annotations.
• LIB - DNB structure for the library used in the sequencing assay.
• MAP - Reads, quality scores, and alignments to the reference genome.
The variation (var) file (http://www.completegenomics.com/documents/DataFileFormats_Standard_Pipeline_2.5.pdf) contains records for each position in the reference genome, describing whether the corresponding position was called in the CGI data and, if so, whether it was called as reference (its sequence is the same as the reference genome) or as variant. This is done independently for each of the two diploid alleles of the sequenced genome. The master variations file (masterVarBeta) is a simple, integrated report of the variant calls and annotation information produced by the CGI assembly process; its format is derived heavily from the variations file format. The LocHap pipeline requires bam and vcf files as inputs, so we generated the corresponding bam and vcf files from the CGI data using the open-source tool CGATools (http://cgatools.sourceforge.net and http://cgatools.sourceforge.net/docs/1.8.0/cgatools-user-guide.pdf) and SAMTools [5].

Model Checking
We performed basic model checking using graphical plots to examine the model estimates against the observed data. For example, for the HNC data, we present box plots in Fig. S1 comparing the number of inferred LHs against the number of unique read-groups mapped to all SNVs. Each read-group refers to a set of reads possessing a unique genotype and mapping to all the SNVs in the segment without missing bases; in Fig. 1 of the main article, for instance, the number of unique read-groups is three in both examples. Intuitively, the two numbers should be concordant: if the number of unique read-groups is higher, there should be more inferred LHs, since unique read-groups directly support the presence of the corresponding haplotypes. We see this concordance in the box plots.

Copy number variation (CNV) analysis
Double hits of a structural mutation such as a CNV together with a sequence mutation could falsely be explained by somatic mosaicism even if the cells in a sample are homogeneous (Fig. S2). For example, a copy number gain coupled with a somatic mutation on the additional copy could lead to more than two local haplotypes with short-read-based next-generation sequencing. Therefore, we further examined CNVs on the LHV segments generated by LocHap, using XHMM [8,9,10] for the HNC data and CNVnator [11,12] for the CEU TRIO data. In both examples shown in the paper, we did not find enrichment of CNVs in the regions where LHVs were present, suggesting that the haplotype variants were not associated with copy number variants. In the HNC data, we found essentially no (< 0.01%) CNVs in any of the LHVs.

Additional simulation results for testing sensitivity and specificity
Simulation setup: We used two different settings for the additional simulation study. In the first setting, we took a segment containing two SNVs with three different haplotypes, so it was a local haplotype variant (LHV) segment. In the second setting, we took a segment containing two SNVs with only two different haplotypes; therefore it was not an LHV segment. We ran 100 simulations for each setting. In each simulation, we generated a random number (between 160 and 180) of short reads mapping to at least one SNV.
We then used LocHap to call the number of haplotypes, with their corresponding genotypes, based on the observed short reads. In the first setting, we defined "sensitivity" as the ratio between the number of simulations in which we successfully called three significant haplotypes with the correct genotypes and the total number of simulations; in the second setting, we defined "specificity" as the ratio between the number of simulations in which we successfully called two significant haplotypes with the correct genotypes and the total number of simulations.

Sensitivity results: Simulated short reads were generated from three true haplotypes with proportions 0.5, 0.25 and 0.25. We assumed that at each SNV one could observe up to two different alleles. With probability (1 − e_b) we generated a genotype on the short read equal to the true genotype of either allele, and a different genotype with probability e_b. That is, e_b is the error probability for each SNV on the short read.
Recall that in the posterior inference, the inferred haplotypes are first sorted in decreasing order of posterior probability, and the FDR threshold is then used to select the reported haplotypes. Here we used two FDR thresholds, 0.01 and 0.001. The results for different settings of e_b are tabulated in Table S10. Sensitivity worsens with higher error rates e_b and improves with a less stringent FDR threshold.

Specificity results: We now assumed the segment has only two true haplotypes, with proportions 0.5 and 0.5. Again, for different settings of e_b, we report the results in Table S11. Specificity worsens with higher error rates e_b and improves with a more stringent FDR threshold.
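The read-generation step of the simulation can be sketched as follows. This is an illustrative sketch of the sampling scheme described above, not the actual simulation code; the haplotype-calling step itself is performed by LocHap and is omitted here.

```cpp
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Simulate one short read from a randomly chosen true haplotype.
// Haplotypes are drawn with the given proportions; each SNV base is
// flipped to a different nucleotide with error probability eb.
std::string simulateRead(const std::vector<std::string>& haps,
                         const std::vector<double>& props,
                         double eb, std::mt19937& rng) {
    std::discrete_distribution<int> pick(props.begin(), props.end());
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const std::string nucs = "ACGT";
    std::string read = haps[pick(rng)];
    for (char& b : read)
        if (u(rng) < eb) {                 // sequencing error at this SNV
            char nb;
            do { nb = nucs[rng() % 4]; } while (nb == b);
            b = nb;
        }
    return read;
}

// Sensitivity/specificity as defined above: the fraction of simulations
// in which the correct set of significant haplotypes was called.
double successRate(int nCorrect, int nSim) {
    return static_cast<double>(nCorrect) / nSim;
}
```

With eb = 0 every simulated read reproduces one of the true haplotypes exactly, so the proportions 0.5/0.25/0.25 govern the read-group frequencies.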

Filters
LocHap includes optional post-processing filters. Although the proposed statistical inference is principled and rigorous, noise and artifacts in NGS data can still affect the quality of the estimated LHVs. We therefore introduce additional filtering to remove dubious LHVs from the hcf files. We devise three types of multi-stage filtering schemes. First, we introduce some abbreviations.
• nHaplo is the number of inferred haplotypes within an LHV segment.
• nClus is the number of unique read-groups observed directly from the short read data within an LHV segment. For example, in Fig. 1 of the main article nClus is 3 for both examples.
• min-SNV-dist is the shortest distance allowed between two successive SNVs within one segment. This will be used to remove SNVs that are too close to each other, due to potential bias caused by artifacts reported in Pickrell et al. [13].
• short-RL is the short-read length from NGS data.
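The quantity nClus defined above can be computed directly from the short-read data. A minimal sketch, using 'M' for a missing base as elsewhere in this document:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// nClus: the number of unique read-groups observed directly in the data,
// i.e., unique genotype strings among reads that cover every SNV in the
// segment with no missing base ('M').
int countClusters(const std::vector<std::string>& reads) {
    std::set<std::string> groups;
    for (const auto& r : reads)
        if (r.find('M') == std::string::npos)   // complete reads only
            groups.insert(r);
    return static_cast<int>(groups.size());
}
```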
Three different types of filters are proposed according to their stringency levels: type I, type II, and type III. Among the three, type I is the most conservative and type III the least; a more conservative filter removes more estimated LHVs from an hcf file. The filters remove LHV segments from the hcf files based on the following criteria. Metaphorically, we consider the filtering a process of revealing a portrait of an elephant: without knowing which animal is in the portrait, the type I filter reveals only the tail, the type II filter reveals the legs and the tail, and the type III filter reveals the whole elephant.
• Common to all three types is a Fisher's exact test for strand bias [14,15]. An LHV segment is removed if its short reads show strand bias at a significance level of 0.05. Also, we require that at least one read is mapped to each strand.
• In addition, the following filtering steps, based on discussions in the literature [15,16,17], are applied in all three filter types. These filters aim at removing poor-quality short reads, not segments.
• If an indel is called along with an SNV on either end of a paired-end read, the read is removed. Typically, co-occurrence of different mutations on the same end of a read is caused by artifacts and noise in base calling.
• Any SNV is removed if it resides within 10% of the read length from the end of any short read.

Figure S2: Illustration of a hypothetical "double-hit" event. A copy number gain occurs, and on the new copy a point mutation replaces an "A" with a "T". When all three copies are sequenced, three different alleles bearing "A", "T" and "C" could potentially be observed. Without accounting for copy number gains, somatic mosaicism could be falsely inferred.

Table S11: Specificity analysis for simulated data from a segment with two SNVs and only two true haplotypes in the sample.
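Two of the filtering criteria listed under Filters lend themselves to short sketches: the two-sided Fisher's exact test for strand bias and the end-of-read check. The code below is a minimal sketch, not LocHap's implementation; the 2×2 table layout (e.g., forward/reverse read counts for two allele groups) and the handling of the 10% margin boundary are our assumptions.

```cpp
#include <cassert>
#include <algorithm>
#include <cmath>

// --- Strand-bias filter (common to all three filter types) ---

// Natural-log factorial via lgamma (precise enough for p-values).
static double logFact(int n) { return std::lgamma(n + 1.0); }

// Hypergeometric probability of the 2x2 table [[a, b], [c, d]]
// with fixed row and column margins.
static double hyperProb(int a, int b, int c, int d) {
    return std::exp(logFact(a + b) + logFact(c + d) + logFact(a + c) +
                    logFact(b + d) - logFact(a) - logFact(b) -
                    logFact(c) - logFact(d) - logFact(a + b + c + d));
}

// Two-sided Fisher's exact test: sum the probabilities of all tables
// with the observed margins that are no more likely than the observed
// table. Remove the segment if the returned p-value is below 0.05.
double fisherExact(int a, int b, int c, int d) {
    double pObs = hyperProb(a, b, c, d), p = 0.0;
    int r1 = a + b, c1 = a + c, n = a + b + c + d;
    int lo = std::max(0, c1 - (n - r1)), hi = std::min(r1, c1);
    for (int x = lo; x <= hi; ++x) {
        double px = hyperProb(x, r1 - x, c1 - x, n - r1 - c1 + x);
        if (px <= pObs + 1e-12) p += px;
    }
    return p;
}

// --- End-of-read filter ---

// An SNV call is discarded when it lies within 10% of the read length
// from either end of the short read (boundary handling is our choice).
bool nearReadEnd(int snvOffset, int readLen) {
    int margin = readLen / 10;
    return snvOffset < margin || snvOffset >= readLen - margin;
}
```

For the balanced table [[3,1],[1,3]], for example, the two-sided Fisher p-value is 34/70 ≈ 0.486, so such a segment would not be flagged for strand bias.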