HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses

Abstract Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite promising results of available polishing tools, there is still room to improve the error correction performance to perform more accurate genome assembly. The errors, particularly those in coding regions, can hamper analysis such as linage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.


Viterbi equation for finding best graph-to-pHMM alignment
In the main text, we provide the equation representing the transition from a match state to a match state only.Below is the complete Viterbi equation set for the graph-to-pHMM alignment.
V ((i − 1) M , j − 6) + w 1 (a((i − 1) M , i M ) + e(i M , S j−2 S j−1 S j )) + w 2 (G(S j−2 S j−1 S j )), V ((i − 1) I , j − 6) + w 1 (a((i − 1) I , i M ) + e(i M , S j−2 S j−1 S j )) + w 2 (G(S j−2 S j−1 S j )), V ((i − 1) D , j − 6) + w 1 (a((i − 1) D , i M ) + e(i M , S j−2 S j−1 S j )) + w 2 (G(S j−2 S j−1 S j )); V ((i − 1) M , j − 6) + w 1 (a((i − 1) M , i M ) + e(i M , S j−3 S j−1 S j )) + w 2 (G(S j−3 S j−1 S j )), V ((i − 1) I , j − 6) + w 1 (a((i − 1) I , i M ) + e(i M , S j−3 S j−1 S j )) + w 2 (G(S j−3 S j−1 S j )), V ((i − 1) D , j − 6) + w 1 (a((i − 1) D , i M ) + e(i M , S j−3 S j−1 S j )) + w 2 (G(S j−3 S j−1 S j )), V ((i − 1) M , j − 4) + w 1 (a((i − 1) M , i M ) + e(i M , S j−3 S j−1 S j )) + w 2 (G(S j−3 S j−1 S j )), V ((i − 1) I , j − 4) + w 1 (a((i − 1) I , i M ) + e(i M , S j−3 S j−1 S j )) + w 2 (G(S j−3 S j−1 S j )), V ((i − 1) D , j − 4) + w 1 (a((i − 1) D , i M ) + e(i M , S j−3 S j−1 S j )) + w 2 (G(S j−3 S j−1 S j )), ......All other paths of three nodes (possible codons) ending with j in the graph ......All other paths of three nodes (possible codons) ending with j in the graph 2 Supplementary Tables 2.5 Results on three SARS-CoV-2 spike gene datasets  The comparison was conducted using real Nanopore Sequencing data of SARS-CoV-2, with a dataset of 191MB containing 48,905 reads.Among the five tools evaluated, HMMPolish ranked third in running time, outperformed by the learning-based tool Medaka and the consensus-based tool PBDAG-Con.The memory usage of the three alignment graph-based tools was higher than that of the other two tools.However, this can be optimized and mitigated by pruning low-weight edges in the graph.We will continue to optimize the software in our near future work.4 Supplementary Information for Experiments 4.1 Influence of the weights of the HMM score and path score An important factor to consider is the quality of the pHMM for graph searching, as using a low-quality pHMM can result in a high number of errors.For example, the pHMM may be trained using lowly-conserved sequences or a high taxonomy diversity, rather than being designed for the targeted viral protein.Another scenario is when the pHMM is highly conserved, but the training protein cluster shows low similarity with the target viral genes.

Supplementary Figures
To illustrate this issue, we present the results of using models from Pfam (accession number: PF00075) to polish simulated HIV-1 data in Figure .S2.This model represents the Ribonuclease H protein family but has diverse taxonomy compositions, with only around 1/3 of the training sequences coming from viruses.Under this setting, using HMM scores alone (w 2 = 0) results in sequences containing over 40 errors.However, when we combine the path score with the optimization function, the number of errors significantly decreases.Our preliminary experiments show that the best setting is w 1 = 0.9 and w 2 = 0.1, which is also the default weight value set for HMMPolish.Another important factor to consider is the coverage of reads, as the path score value is highly dependent on the edge weights, which quantify the reads coverage.In the recursive score equation, we deduct the backbone edge score when calculating the three-base path score to normalize it with the HMM score.However, when the reads coverage is exceptionally high, the path score may become dominant, and adjusting the w 1 and w 2 weights may not make any significant difference in the results.
To illustrate this issue, we present the results of the real Nanopore sequencing data of norovirus (dataset 2 in the experiment).Figure .S3 shows that, except for the obvious increase in the path score when w 2 increases from 0 to 0.1, the path score only has minor fluctuations with the increase of w 2 .

Metrics for the quality of HMMPolish output
HMMPolish has a "verbose" mode that provides additional information to assess the quality of the polished coding region.The two metrics provided are: 1.The ratio between the polished path weight and the consensus path weight: This metric provides a measure of confidence in the quality of the polished region.A higher ratio indicates a higher level of confidence.The path weights are quantified using the consensus path generated by DAG-Con.
2. The bitscore: This metric is obtained by aligning the pHMM to the polished sequence using HMMER (1) and provides a measure of the similarity between the polished sequence and the pHMM.Users can compare this score with the bitscore of a genome sequence on the same pHMM to evaluate the quality of the polished sequence.

Figure S1 :
Figure S1: Sequencing depths of the original three norovirus datasets.

Figure S2 :
Figure S2: Impact of changing the weight of path score (w 2 ) on the quality of the output in HIV-1 data.

Figure S3 :
Figure S3: Impact of changing the weight of path score (w 2 ) on the quality of the output in norovirus dataset 2

Table S2 :
Results of simulated H3N2 PacBio sequencing data in different depths.
* : Coverage of outputs on all segments.This number reflects the completeness of the results by each polisher.2.2 Experiments on simulated HIV-1 Nanopore Sequencing data 2.2.1 Profile HMMs used for HIV-1

Table S5 :
Results on six H1N1 PacBio Sequencing datasets

Table S6 :
Results on six H3N2 PacBio Sequencing datasets