Cheng Yee Tang, Rick Twee-Hee Ong, MIRUReader: MIRU-VNTR typing directly from long sequencing reads, Bioinformatics, Volume 36, Issue 5, March 2020, Pages 1625–1626, https://doi.org/10.1093/bioinformatics/btz771
Abstract
Mycobacterial interspersed repetitive unit-variable number tandem repeat (MIRU-VNTR) typing is widely used to genotype the Mycobacterium tuberculosis complex in epidemiological studies that track tuberculosis transmission. Recent long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies can produce reads long enough to cover the entire repeat region of each MIRU-VNTR locus, which was previously not possible using short reads from Illumina high-throughput sequencing technologies. We thus developed MIRUReader for MIRU-VNTR typing directly from long sequence reads.
Source code and documentation for the MIRUReader program are freely available at https://github.com/phglab/MIRUReader.
Supplementary data are available at Bioinformatics online.
1 Introduction
Mycobacterial interspersed repetitive unit-variable number tandem repeat (MIRU-VNTR) typing is a polymerase chain reaction (PCR)-based method widely used to genotype the Mycobacterium tuberculosis complex (MTBC), which causes tuberculosis (TB). MTBC isolates with identical MIRU-VNTR profiles can be clustered to identify epidemiologically linked cases (Supply et al., 2006). Although whole-genome sequencing (WGS) has been demonstrated to have higher resolution for cluster identification (Wyllie et al., 2018), a large database of distinct MTBC isolates with MIRU-VNTR genotypes has been collected worldwide over the past decade by TB researchers and national TB control programmes (Allix-Beguec et al., 2008; Lim et al., 2013). To facilitate comparison with and linking to these historical isolates (Chee et al., 2015), it is important that MIRU-VNTR genotypes can be determined from WGS data.
MIRU-profiler is a tool that performs MIRU-VNTR typing from complete genomes and draft assemblies (Rajwani et al., 2018). However, de novo genome assembly is usually computationally intensive and requires a significant amount of time. Sequencing platforms from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) can now generate long reads (1–100 kbp) (Ameur et al., 2019) that span the entire tandem repeat region of each MIRU-VNTR locus. We thus developed MIRUReader, a software tool that rapidly determines MIRU-VNTR genotypes either directly from long reads or from assembled genomes, while taking the high sequencing error rates into account.
2 Implementation and datasets used
MIRUReader is written in Python and accepts sequencing reads in FASTQ or FASTA format. The primersearch program from EMBOSS (Rice et al., 2000) is used to scan long reads for amplicons flanked by the PCR primers of each locus, allowing a maximum of 18% mismatch. For each MIRU-VNTR locus, zero, one or more amplicons may be identified. The repeat number corresponding to each amplicon is deduced from the amplicon length and an allele calling table (Weniger et al., 2010), mimicking the laboratory protocol for determining the 24-locus MIRU-VNTR profile. Amplicons longer than 1828 bp are excluded from analysis. This yields a set of repeat numbers for each MIRU-VNTR locus, and the mode is assigned as the repeat number for that locus. If multiple modes exist, the mismatches in the primer sequence alignments reported by primersearch are analyzed, and the modal repeat number with the lowest total number of mismatches is assigned to the locus. If the modal repeat numbers have equal total numbers of mismatches, the locus is assigned multiple repeat numbers. For example, locus MIRU2996 for sample MTB08 was assigned two repeat numbers (Supplementary Table S1). If no amplicon is detected, the locus is assigned ‘ND’. MIRUReader prints the 24-locus MIRU-VNTR pattern to the screen in a tab-delimited format that can be redirected to a text file and viewed in a spreadsheet program such as Excel.
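The calling rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the MIRUReader source code: the function names, the `TOY_TABLE` lengths and the locus name `locusA` are hypothetical, the nearest-length lookup is an assumed stand-in for the allele calling table of Weniger et al. (2010), and treating a locus whose amplicons all exceed the length cutoff as ‘ND’ is likewise an assumption.

```python
from collections import Counter, defaultdict

MAX_AMPLICON_LEN = 1828  # amplicons longer than this are excluded (from the text)

# Hypothetical allele calling table: locus -> {repeat number: expected amplicon length in bp}.
# These values are invented for illustration; they are NOT the published table.
TOY_TABLE = {
    "locusA": {0: 200, 1: 253, 2: 306, 3: 359},
}


def length_to_repeat(locus, length, table):
    """Return the repeat number whose expected amplicon length is closest (assumed lookup rule)."""
    expected = table[locus]
    return min(expected, key=lambda repeat: abs(expected[repeat] - length))


def call_locus(locus, amplicons, table):
    """Assign repeat number(s) to one locus.

    amplicons: list of (length_bp, primer_mismatches) tuples found for this locus.
    Returns a list of repeat numbers, or ["ND"] when no usable amplicon exists.
    """
    usable = [(length, mm) for length, mm in amplicons if length <= MAX_AMPLICON_LEN]
    if not usable:
        return ["ND"]

    counts = Counter()             # how many amplicons support each repeat number
    mismatches = defaultdict(int)  # total primer mismatches per repeat number
    for length, mm in usable:
        repeat = length_to_repeat(locus, length, table)
        counts[repeat] += 1
        mismatches[repeat] += mm

    top = max(counts.values())
    modes = [repeat for repeat, n in counts.items() if n == top]
    if len(modes) == 1:
        return modes

    # Several modes: keep the one(s) with the fewest total primer mismatches.
    fewest = min(mismatches[repeat] for repeat in modes)
    return sorted(repeat for repeat in modes if mismatches[repeat] == fewest)
```

For instance, `call_locus("locusA", [(255, 1), (250, 2), (310, 0)], TOY_TABLE)` yields a single repeat number because two of the three amplicons support it, whereas two amplicons supporting different repeat numbers with equal mismatch totals yield both numbers, mirroring the multiple-call case described for MIRU2996.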
We compared MIRU-profiler and MIRUReader across three datasets. The first dataset consists of 17 samples sequenced using the ONT MinION. Experimental MIRU-VNTR genotyping was performed using the Genoscreen MIRU-VNTR Quadraplex kit according to the manufacturer’s protocol. Two samples were excluded from analysis due to incomplete experimental MIRU-VNTR profiles. Raw fast5 files were demultiplexed and base-called using Albacore (v2.3.3). The sequence reads were then demultiplexed again and adapters were trimmed using Porechop (v0.2.3, available from https://github.com/rrwick/Porechop). The filtered ONT reads were de novo assembled using Canu (v1.7.1) (Koren et al., 2017). The raw draft assemblies were polished with the ONT reads using nanopolish (v0.10.2) (Loman et al., 2015) to improve consensus accuracy. The second dataset comprises six PacBio-sequenced samples and one ONT-sequenced sample, whose reads and genome assemblies were downloaded from the National Center for Biotechnology Information database. Experimental MIRU-VNTR profiles for these samples were obtained through literature review. The third dataset is the set of 17 genome assemblies presented in Table 1 of the MIRU-profiler manuscript (Rajwani et al., 2018).
3 Results and performance
For datasets 1 and 2, MIRUReader predicted 24-locus MIRU-VNTR profiles more accurately than MIRU-profiler did on assembled genomes. Using the experimental MIRU-VNTR results as the reference, 13 out of 15 (86.67%) samples in dataset 1 had their MIRU-VNTR profiles determined correctly by MIRUReader. MIRU-profiler correctly predicted only 5 (33.33%) based on polished genome assemblies, and none when uncorrected draft genome assemblies were used (Fig. 1; Supplementary Table S1). We observed a similar trend for dataset 2, in which 1 (14.29%) and 5 (71.43%) of the 7 samples had their MIRU-VNTR profiles determined accurately by MIRU-profiler and MIRUReader, respectively (Fig. 1; Supplementary Table S2). In dataset 3, MIRUReader obtained results identical to those of MIRU-profiler (default parameters) for 15 out of 17 (88.24%) samples, using the published genome assemblies as input. In the two discordant samples (NC_008769 and NC_012207), we were unable to reproduce the reported MIRU-profiler results at five loci with the default parameters, while MIRUReader obtained accurate genotypes (Supplementary Table S3). Using the raw PacBio sequence reads from two samples (CP019613 and CP019610), however, MIRUReader produced discordant results at three loci (424, 580 and 1644), which were accurately determined when the downloaded genome assemblies were used as input to MIRUReader.
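The sample-level accuracies quoted above can be reproduced with a short calculation. The sketch below assumes that a sample counts as correct only when its predicted profile matches the experimental profile at every locus, which is our reading of the text rather than a documented rule; the function names are hypothetical and the profile values in the usage note are invented.

```python
def count_correct_profiles(predicted, experimental):
    """Count samples whose predicted profile equals the experimental one at every locus.

    Both arguments map sample name -> tuple of per-locus repeat numbers.
    Whole-profile matching is an assumption about how correctness was scored.
    """
    return sum(1 for sample, truth in experimental.items()
               if predicted.get(sample) == truth)


def percent_correct(correct, total):
    """Accuracy as a percentage rounded to two decimals, as reported in the text."""
    return round(100.0 * correct / total, 2)


# Counts taken from the text: (correct samples, total samples).
reported_counts = {
    "dataset 1, MIRUReader": (13, 15),
    "dataset 1, MIRU-profiler (polished assemblies)": (5, 15),
    "dataset 2, MIRU-profiler": (1, 7),
    "dataset 2, MIRUReader": (5, 7),
    "dataset 3, concordant samples": (15, 17),
}

accuracies = {name: percent_correct(c, n) for name, (c, n) in reported_counts.items()}
```

As a usage example with invented three-locus profiles, `count_correct_profiles({"s1": (2, 4, 3), "s2": (2, 4, 1)}, {"s1": (2, 4, 3), "s2": (2, 4, 2)})` counts one correct sample, because `s2` differs at the last locus.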

Fig. 1. Accuracy of MIRU-profiler and MIRUReader in correctly predicting the 24-locus MIRU-VNTR genotypes using (a) ONT data from this study (n = 15); (b) publicly released PacBio/ONT reads and genome assemblies (n = 7)
MIRUReader is much faster than MIRU-profiler since it does not require the additional steps of de novo genome assembly and polishing (Supplementary Fig. S1). Based on dataset 1, the MIRU-VNTR profiles can be obtained in about an hour using MIRUReader with only one computing thread. In contrast, the shortest analysis time for the MIRU-profiler approach was 160 min using 10 computing threads.
Overall, MIRUReader is an accurate and rapid tool that can perform in silico typing of the standard 24-locus MIRU-VNTR genotypes for MTBC isolates directly from long sequencing reads.
Acknowledgements
The authors would like to thank the Central Tuberculosis Laboratory (CTBL) at the Singapore General Hospital for performing the ONT sequencing runs and both CTBL and STEP for providing the laboratory results of the 24-locus MIRU-VNTR of the samples sequenced.
Funding
This work was supported by the Singapore Infectious Diseases Initiative (SIDI/2014/003) and NUS Startup Grant awarded to RTHO.
Conflict of Interest: none declared.
References