PhaseME: Automatic rapid assessment of phasing quality and phasing improvement

Abstract Background The detection of which mutations are occurring on the same DNA molecule is essential to predict their consequences. This can be achieved by phasing the genomic variations. Nevertheless, state-of-the-art haplotype phasing is currently a black box in which the accuracy and quality of the reconstructed haplotypes are hard to assess. Findings Here we present PhaseME, a versatile method to provide insights into and improvement of sample phasing results based on linkage data. We showcase the performance and the importance of PhaseME by comparing phasing information obtained from Pacific Biosciences including both continuous long reads and high-quality consensus reads, Oxford Nanopore Technologies, 10x Genomics, and Illumina sequencing technologies. We found that 10x Genomics and Oxford Nanopore phasing can be significantly improved while retaining a high N50 and completeness of phase blocks. PhaseME generates reports and summary plots to provide insights into phasing performance and correctness. We observed unique phasing issues for each of the sequencing technologies, highlighting the necessity of quality assessments. PhaseME is able to decrease the Hamming error rate significantly by 22.4% on average across all 5 technologies. Additionally, a significant improvement is obtained in the reduction of long switch errors. Especially for high-quality consensus reads, the improvement is 54.6% in return for only a 5% decrease in phase block N50 length. Conclusions PhaseME is a universal method to assess the phasing quality and accuracy and improves the quality of phasing using linkage information. The package is freely available at https://github.com/smajidian/phaseme.

Are you submitting this manuscript to a special series or article collection? No

Experimental design and statistics Experimental design and statistics
Full details of the experimental design and statistical methods used should be given in the Methods section, as detailed in our Minimum Standards Reporting Checklist. Information essential to interpreting the data presented should be made available in the figure legends.
Have you included all the information requested in your manuscript?

Resources Resources
A description of all resources used, including antibodies, cell lines, animals and software tools, with enough information to allow them to be uniquely identified, should be included in the Methods section. Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.
Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?

Yes
Availability of data and materials Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.
Have you have met the above requirement as detailed in our Minimum Standards Reporting Checklist?
Yes to provide insights into phasing performance and correctness. We observed unique phasing issues for each of the sequencing technologies, highlighting the necessity of quality assessments. PhaseME is able to decrease the Hamming error rate significantly by 26.2% averaged across all four technologies. Additionally, a significant improvement is obtained in the number of long switch errors especially for HiFi 54.6% with only a 5% decrease in phase block N50.
Conclusions: PhaseME is a universal method to assess the phasing quality and accuracy and improves the quality of phasing using linkage information. The package is freely available at https://github.com/smajidian/phaseme.

Findings
Humans as well as many other organisms have a diploid genome, meaning that there are two homologous copies of every somatic chromosome inherited from mother and father. These copies include genomic variation including Single Nucleotide Variation (SNV) and Structural Variation (SV) [1]. Each variation represents a difference in a nucleotide(s) unique to each of the chromosome copies also called haplotypes [2]. Thus, haplotypes represent the individual copy of each genomic element and need to be studied independently to investigate the impact of variations.
Haplotype phase information is essential to understand where a mutation occurs, predict their interactions (i.e. if two SNVs are on the same DNA molecule) and their potential impact on genes and their expression and thus phenotypes. This is important for multiple applications and organisms. For humans, phasing plays an important role in e.g. Mendelian diseases [3], cancer genomics [4][5][6], neurological diseases [7], genetic research and other medical applications [8,9]. As an example, compound heterozygosity shows the importance of phase information for relating genotype to phenotype. Numerous examples of disorders influenced by compound heterozygosity are known [8]. For example, Thiopurine S-methyltransferase (TPMT) is a 27 kbp gene located on chromosome 6p22.3. It is translated into an enzyme catalyzing S-methylation of thiopurine drugs [10]. These drugs are used as chemotherapeutic and immunosuppressant agents in lymphoid malignancies, leukemia and inflammatory bowel disease. The enzyme activity is controlled by the genetic polymorphism of the gene. There are two SNPs (rs1800460 and rs1142345) where either are known to impact the activity of TPMT and cause missense changes [11]. Thus, it is important to know if both SNPs exist, and if so if they co-occur on the same haplotype (cis) leading to an inactivation of TPMT or not (trans). Patients with low TPMT activity are at higher risk of life-threatening severe myelosuppression and hematopoietic toxicity when treated with conventional doses of mercaptopurine (or azathioprine) [12][13].
Currently, we distinguish four different approaches to obtain phasing information [14]: 1. Wet-lab based phasing is based on mainly two different methods: encapsulation and 3D structure capture [15]. One method is to extract chromosomes when cells are in metaphase and then microdissect them into subsets [16].
2. Population-based phasing, which utilizes linkage information derived from hundreds to thousands of individuals [17]. However, this approach misses rare and de novo variants [18].
3. Parental methods [19] have the advantage to phase entire genomes but lack the ability to phase de novo mutations (i.e. important in many Mendelian diseases). Furthermore, it requires sequencing of the parents, which is more expensive and often not applicable.
4. Single individual haplotyping (SIH), or haplotype assembly, is the most comprehensive approach; since it includes de novo mutations and rare variations, but often suffers from fragmentation [1]. HapCut2 [20] and WhatsHap [21] are the two most commonly used algorithms for this approach capable of utilizing Illumina, PacBio, Oxford Nanopore, 10Xgenomics and HiC information. In short, these approaches rely on aligned reads to a reference genome from which the SNVs and their genotypes are inferred. Subsequently, the SIH methods cluster the reads along the heterozygous SNV into two groups corresponding to the two haplotypes. The resulting VCF file reports SNVs with their phasing information, which includes the assignment of each SNV to a phase block. It is important to note that the relationship between individual phase blocks remains undetermined [14].
Based on the above information SIH methods are important and necessary for a better understanding of the genome at hand. However, it remains tedious to impossible to assess the accuracy and even the performance of individual samples. Generally, all of them may suffer from inaccurate results, which includes errors in the grouping of reads leading to incorrectly assigned SNVs (flip errors) or inaccurately joined haplotypes (switch errors).
Currently phasing is often evaluated based solely on its phasing length e.g. N50. Phasing N50 is the minimum phase block length, where the sum of its phase blocks with all larger phase blocks span at least 50% of the total phase length. Similar to assembly methods however, N50 does not represent the accuracy or quality of a result. In addition, most phasing methods do not provide a quality score to assess the reliability of the phasing itself.
Making it near impossible for users to assess the quality/correctness of their results.
To solve these problems and limitations we developed PhaseME, a method to automatically estimate the quality of the SIH results and report multiple statistics to enable a deeper understanding for a broad range of users. This is done based on population data to assess the accuracy of common variations across the individual phasing information. Furthermore, PhaseME can detect phasing errors and highlight their locations. If desired, PhaseME continues to correct the detected phasing errors and generates a report of the improvement and impact of these changes. Thus, PhaseME is unique in its usability and application as it provides more insights into the phasing accuracy per sample. To the best of our knowledge, there exists only one other highly specialized tool, which requires population information together with Hi-C data to correct SIH phasing and does not provide insights in the phasing results [22]. In the following, we describe the features of PhaseME and its applications for different technologies based on SIH phasing for HG002. We assess the performance of PhaseME based on parental phasing information from GIAB [23] across four different sequencing technologies . PhaseME Figure 1:. Summary of PhaseME. PhaseME consists of three steps: extracting population information, quality control, and phase error correction.
PhaseME is a novel method to assess the quality of phasing results and provide a quality control report. It also reduces phasing errors by exploiting population information. Figure 1 gives an overview of the three main steps of PhaseME. First, PhaseME requires the phased SNVs VCF file for an individual obtained from a SIH method, which is compared to precomputed linkage information that is available per ethnicity (see the Methods Section for details).
Second, PhaseME reports an in-depth quality assessment report of the phasing result to provide a detailed overview. PhaseME calculates the quality ratio across phase blocks based on the previously obtained linkage information. Here, for each phase block, we compute the ratio of SNVs with non-conflicting to conflicting linkage information and report the average per chromosome. Thus, 0 represents the lowest accuracy, while 1 indicates that everything is supported by the linkage information and more likely correct. PhaseME further reports the N50 of phase block length in kbp, the number of phased and non-phased heterozygous and homozygous variants, average phase block quality and phase rate (see the Quality control and Methods Section for details).
Third, PhaseME corrects the previously identified SNVs that are in conflict with the linkage information. We distinguish small (2-20bp) and large (21bp+) switch errors that represent a stretch of incorrectly phased SNV and thus PhaseME splits the existing phase block into two at the first conflicting SNV (see the section on error correction below for details).

Quality control
As highlighted above, it is essential to obtain insights into the phasing quality. PhaseME is designed with this as its main application and to provide an easy to understand and comprehensive quality report across the phasing results. We outline the provided summaries below based on SIH results for HG002 from GIAB [24] (see the Methods Section).
We To inspect the haplotype length, PhaseME reports: i) N50 of phase block length in kbp, which highlights the overall length of the phasing, ii) number of phased and non-phased heterozygous variants, which illustrates the completeness of the phasing, iv) the phase rate to indicate the proportion of phased regions for each chromosome, v) the average phase block quality to indicate the agreement with the linkage information and vi) the number of homozygous variants. The detailed definition of each criterion is provided in the Methods Section. Based on this report, one can determine the phasing quality for each chromosome and phase block of the sample at hand. Each of these statistics is automatically generated and provided by PhaseME in the QC report file.
As expected we observed a high N50 phase block length for 10Xgenomics (10.9Mb), but interestingly even higher for ONT (15.6Mbp) compared to PacBio CLR (369.3kbp) or PacBio HiFi (314.3kbp) (Figure 2A). One likely reason is the longer read lengths of the technologies compared to PacBio. Interestingly PhaseME highlights a higher number of heterozygous SNVs for 10Xgenomics (2,890,988) followed by ONT (2,807,291), whereas for PacBio we observed a lower number of heterozygous SNVs, CLR (2,520,418) and HiFi (2,418,009) (see Figure 2B). For 10Xgenomics the rate of non-phased heterozygous SNVs is also high in contrast to ONT ( Figure 2C). The results of PhaseME show that the average phase block quality was the highest for HiFi (0.994) followed by 10Xgenomics (0.985), ONT (0.983) and CLR (0.978). We observed a lower phase quality for chromosome 9 when using ONT and 10Xgenomics data (see Figure 2D). However, we did not observe that this affects the same regions.
It is interesting to note that the PacBio CLR data shows a lower N50 and a lower phase accuracy compared to ONT. However, this might be explained by the fact that CLR data is from 2015 and was generated with the RSII instrument for GIAB [23]. Another important observation is that the ONT data has the largest N50 phase length but is also quite accurate only 0.11 behind the HiFi reads which was the most accurate, but has an almost 60 times reduced N50. Another important metric is the phasing completeness, which represents the fraction of regions phased along the genome. Figure 2E shows the results based on the PhaseME report across the technologies. 10Xgenomics (0.947) and ONT (0.949) have the highest ratio of phased heterozygous SNV. This is followed by CLR (0.75) and HiFi (0.69) probably strongly related to the molecule/read size and their higher number of heterozygous SNV. 10Xgenomics (22.7%) showed by far the highest ratio of non-phased heterozygous SNV followed by ONT (1.1%), whereas for PacBio we observed a lower ratio CLR (0.34%) and HiFi (0.32%) (see Figure 2C).

Phasing error correction
Phasing errors lead to the misassignment of mutations to the wrong haplotype. PhaseME aims to correct larger switch errors as the linkage data does not provide sufficient resolution to correct single flip errors. To detect switch errors, PhaseME considers mismatches indicated by upstream SNVs to improve the signal. PhaseME requires a minimum of two mismatches and considers the ratio between matching and mismatching linkage information for the SNVs.
If this ratio (matches/mismatches) < 1 (by default) then PhaseME breaks the section in two separate phase blocks with the first conflicting SNV as the start of the new phase block.
We benchmark the error correction ability of PhaseME across different sequencing technologies (ONT, PacBio HiFi, PacBio CLR, 10Xgenomics) based on a male Ashkenazi proband, HG002 (NA24385) from the Genome in a Bottle (GIAB) repository (see the Methods Section), and compared the results to the parental phasing information based on Illumina SNVs [23]. Note that HG002 is not included in the population dataset (1000G data) that was used to obtain the linkage information.
We also consider the Hamming error rate [26] to evaluate the phasing, which shows the fraction of incorrectly phased over the total number of phased heterozygous variants (see the PhaseME overall reduced the phasing errors based on the linkage information. Figure 3 shows the results of improving the phasing and impact on phase block length for PhaseME. For the Hamming error rate, we observed on average a 26.2% reduction in errors across the technologies. For ONT, the reduction was the highest of 34.3% followed by 10Xgenomics (25.1%), PacBio HiFi (24.5%) and PacBio CLR (20.8%). Figure 3 shows the improvement of long switch errors (20 SNP+ Figure 3) which leads to a reduced N50 from 314.3kbp to 300.3kbp. Next, we evaluated PhaseME for short switches (2-20bp). Here, the linkage data does not provide the resolution to improve most of them. Thus, the number of short switches (with the length of <20) is decreased for CLR (22.4%), HiFi (18.6%) and 10Xgenomics (1.5%). However, our comparisons to the parents indicated that for ONT we actually introduced 6.6% of short switch errors. Thus, we provide parameters to adjust PhaseME.  To further ease the usage of PhaseME for non-expert users we implemented two modes. In general, we recommend following the instructions to obtain linkage information for each sample. Nevertheless, to ease the usage of PhaseME, we also investigated the usage of precomputed linkage maps. We computed the linkage map based on SNP for a given ethnicity  Table 1 and 2). Therefore, we recommend using the precomputed linkage maps only for non-expert users but suggest following our guidelines to obtain a sample specific linkage map once larger errors are initially detected. PhaseME can compute a rapid quality assessment which run time dependents on phase block size. For these data sets, PhaseME took between nine minutes (HiFi) to 135 minutes (10Xgenomics) to compute the QC results, which includes error detection (see Table   1). For the error correction steps, PhaseME is optimized to run multiple times with different parameters and only requires between four minutes (HiFi) up to seven minutes (10Xgenomics).
PhaseME represents a versatile and easy to use method to obtain insights into the phasing performance independent of the underlying sequencing technology. It allows non-expert users to gain valuable insights into the data set and the correctness of the phasing given a linkage map is available. Here we have shown the performance of PhaseME based on HG002 across four different sequencing technologies demonstrating a significant improvement over long switch errors. Smaller errors in phasing remains a challenge due to the lower resolution of linkage data. Consequently, we did not attempt to correct single SNV phasing errors. To enable utility to a broader range of users we have provided precomputed linkage maps that can be used to obtain an initial improvement and insight, but highly recommend users to compute the linkage map specific to their study. PhaseME is capable of being run on multiple organisms and functions irrespective of the phasing method.

PhaseME Prerequisites
PhaseME requires phased VCF files as input, which needs to follow the VCF standards 4.1 or newer [31]. For processing, PhaseME requires the tags (PS and GT) to identify the phase blocks and genotypes such as 0|1 or 1|0 to indicate the haplotypes per SNV. PhaseME is written in Python3 and requires the Numpy package.
To exploit population information and obtain the linkage information, we used Shapeit2 (version v2r900) [27] based on the 1000 Genome dataset [23]. Since the Shapeit2 package needs phased data per chromosome, we split the input VCF file into 22 VCF files corresponding to 22 chromosomes. We also removed non-genotyped variants from each VCF.
Using the '-check' subprogram of Shapeit2 with '--input-vcf' option, we report the missing variants in the reference panel which is then excluded using '--exclude-snp' option of Shapeit2.
Then we generate a haplotype graph which is a compact format of population information  were aligned using NGMLR [28] to the human reference genome (hg19). Subsequently, we identified SNV using the Clair2 package [29]. We used WhatsHap [21]  the three call sets of the son (being the one to be benchmarked), mother and father using bcftools merge [31]. Then by considering the heterozygous SNVs, we generate a phasing set for the son using in-house python code. Given the overlap between a het SNV of the son and to only the father, we report the phasing as 0|1. If it was overlapping exclusively with the mother, 1|0 is reported. The output VCF file is used for evaluation of the individual phasing method to calculate the Hamming error rate, the number of short and long switches using an in-house python code provided on our GitHub page.

Availability of supporting source code and requirements
Project name: PhaseME Project home page: https://github.com/smajidian/phaseme Operating system: Linux.
Programming language: Python Other requirements: Python 3.6 or higher License: The MIT License.

Availability of supporting data
The data set supporting the results of this article is available in the GigaDB repository, [last reference].

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
F.J.S. has participated in PacBio and Oxford Nanopore sponsored meetings over the past few years and has received travel reimbursement for presenting at these events. F.J.S. also received the SMRT PacBio sequencing grant in 2018.

Funding
This work was supported in part by the US National Institutes of Health (UM1 HG008898)

Authors' contributions
The project was conceived by F.J.S. Methods were designed by F.J.S and S.M. Algorithm code was implemented by S.M. All authors wrote the manuscript. F.J.S. supervised the project.