WarpSTR: determining tandem repeat lengths using raw nanopore signals

Abstract Motivation Short tandem repeats (STRs) are regions of a genome containing many consecutive copies of the same short motif, possibly with small variations. Analysis of STRs has many clinical uses but is limited by technology mainly due to STRs surpassing the used read length. Nanopore sequencing, as one of long-read sequencing technologies, produces very long reads, thus offering more possibilities to study and analyze STRs. Basecalling of nanopore reads is however particularly unreliable in repeating regions, and therefore direct analysis from raw nanopore data is required. Results Here, we present WarpSTR, a novel method for characterizing both simple and complex tandem repeats directly from raw nanopore signals using a finite-state automaton and a search algorithm analogous to dynamic time warping. By applying this approach to determine the lengths of 241 STRs, we demonstrate that our approach decreases the mean absolute error of the STR length estimate compared to basecalling and STRique. Availability and implementation WarpSTR is freely available at https://github.com/fmfi-compbio/warpstr


Supplementary
: During signal polishing, the signal values are first aggregated based on the state that they were aligned to; the states are shown as colored bands.
Supplementary Figure S2: The transformation function for signal polishing is estimated as an interpolation of pairs of values (m j , e j ), where m j is the mean of signal values aligned to state j (the x-axis) and e j is the expected value for this state (the y-axis).
Supplementary Figure S3: Effects of signal polishing. The y-axis represents the signal level, the x-axis represents all states in the finite automaton, ordered according to the mean of the signals aligned to this state (m j , the blue line). Note that these values are significantly different from the Oxford Nanopore expected signal values e j for the sequence associated with the same state (the green line). The polishing algorithm creates a continuous transformation function that is then used to transform all signal levels in the original raw signal in order to shift the means aligned to the states closer to the ONT expected values. While the effect is subtle (the orange line), these changes may change the number of detected repeats-see Figure 5 in the main text.

S2 High similarity of event values
It is a common knowledge that homopolymeric sequences are impossible to characterize correctly using nanopore sequencing as the measured signal is just a long stretch of near-constant signal values with some length, but due to the variable strand velocity, it is not possible to infer the homopolymeric sequence length.
Very low variance signal values can also occur when some repeating pattern produces very similar signal values for each possible k-mer. This is as problematic for basecalling as for WarpSTR or any other tool. A good example is the pattern AAGAG, which repeats on the reverse strand in a genomic region given by coordinates chr10:8332996-8333065 on the hg38 reference genome (see Figure S5).
Supplementary Figure S5 clearly shows that the events of the repeating signal on the reverse strand are very similar to each other in their normalized expected signal values, while on the template strand, this does not hold, and the events are very well distinguishable apart from each other. This usually creates a situation where predictions on the high-similarity strand have much larger error (see Figure S6).
In WarpSTR, we warn users about this problem by computing the mean absolute difference of consecutive signal values. In this example, mean state difference was 1.727 and 0.312 for template and reverse strand, respectively.
Supplementary Figure S6 shows predictions split by strand, using NA12878 data, where the number of repeats as given by the golden set is 14, i.e. allele length is 70, for both alleles. The figure also shows how this high state similarity influences basecalling accuracy. On the left, the figure shows violin plots for allele lengths as given by WarpSTR and basecalling for the template strand, and for the reverse strand on the right. We can see that the problem of low variance signal is difficult to overcome also for basecallers.
Using this mean state difference as given by WarpSTR, we can prevent incorrect determination of repeat length due to the low variance in repeating pattern on one strand. However, it is also important to note, that in cases where both strands have very high similarity, this does not give a powerful message apart from the information, that the prediction will be of lower confidence for that locus.

S3 Details of running time measurements
As both tools have different configurations and initial steps, we have measured only the time of finding the STR signal part and determining the repeat count. For STRique, it was the running time used for calling the 'count' option of STRique tool. For WarpSTR, the running time was defined as the time required to extract the repeating signal, generate expected signals, construct the automaton, and run both alignment phases. As clustering is not implemented in STRique, WarpSTR clustering is not measured. Parameters were the same for WarpSTR and STRique as in other experiments.
For this experiment, we used a personal laptop with an AMD Ryzen 7 4800H processor and Ubuntu 20.04 as OS. First, we chose 10 random loci and obtained reads for them for NA12878 subject with 20-30 reads for locus. Then we used STRique and WarpSTR 5 times and measured the running time, which we averaged in the end. The experiment was run using 1, 2, 4 and 8 threads.

S4 Per locus predictions
WarpSTR also visualizes (if set in configuration file) the distribution of WarpSTR and basecalling predictions, with the final predicted alelle length noted under the plot.
Supplementary Figure S7 shows such results for HD locus of NA24385 subject. Gold standard shown two alleles, one with 84bp and the other with 105bp. WarpSTR achieved the same result, while basecalling shown a small error while it reported alelles with 83bp and 102bp.
Supplementary Figure S7: Predictions for a whole HD locus of NA24385 subject.

S5 Per unit predictions
In case of complex STRs, WarpSTR also predicts the number of repeat units per each repeating pattern. The number of repeats is read out from the state automaton transitions. These predictions are also S7 genotyped using the same method based on filtering and clustering, as for simple STRs. This is useful as typically only one repeating pattern is clinically relevant, or because of the case where predictions of one repeating pattern have more dispersion thus making predictions of the whole allele length more dispersed as well.
Supplementary Figure S8 shows such results for HD locus of NA24385 subject. Our input config sequence was (AGC)AACAGCCGCCAC(CGC), WarpSTR reports counts for each repeating state (denoted in parentheses). Gold standard gives two alleles, one with 17 AGC repeats and 7 CGC repeats, and the other with 24 AGC repeats and 7 CGC repeats. Using WarpSTR, we came to the same result.