rMETL: sensitive mobile element insertion detection with long read realignment

Jiang, Tao; Liu, Bo; Li, Junyi; Wang, Yadong

doi:10.1093/bioinformatics/btz106

Abstract

Summary

Mobile element insertion (MEI) is a major category of structural variations (SVs). The rapid development of long read sequencing technologies provides the opportunity to detect MEIs sensitively. However, the signals of MEI implied by noisy long reads are highly complex due to the repetitiveness of mobile elements as well as the high sequencing error rates. Herein, we propose the Realignment-based Mobile Element insertion detection Tool for Long read (rMETL). Benchmarking results on simulated and real datasets demonstrate that rMETL can handle the complex signals to discover MEIs sensitively. It is suited to producing high-quality MEI callsets in many genomics studies.

Availability and implementation

rMETL is available from https://github.com/hitbc/rMETL.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Mobile element insertions (MEIs) represent about 25% of structural variations (SVs) in human genomes (Gardner et al., 2017) and are mainly contributed by active transposons such as the Alu, L1 and SVA families (Stewart et al., 2011). Efforts have been made to detect MEIs with short reads (Sudmant et al., 2015); however, short-read-based approaches are limited in their ability to deal with repetitive mobile elements.

Long reads are promising for handling repeats and detecting SVs more sensitively (Sedlazeck et al., 2018a). However, because of the repetitiveness of mobile elements and the high sequencing error rates, the MEI signals implied by long reads are highly complex. State-of-the-art long-read-based SV detection tools use unified approaches to detect various kinds of SVs (Sedlazeck et al., 2018b). However, this ‘one-size-fits-all’ strategy does not fully consider the characteristics of MEIs, which may affect the detection.

Herein, we propose the Realignment-based Mobile Element insertion detection Tool for Long read (rMETL). rMETL takes advantage of a specifically designed chimeric read realignment approach to handle the complex MEI signals. This approach improves the ability to produce high-quality MEI callsets.

2 Materials and methods

Using aligned long reads (a sorted BAM file) as input, rMETL extracts and realigns chimerically aligned reads to discover MEIs (ranging from 50 bp to 1 million bp) in four steps:

1) rMETL extracts the chimerically aligned parts of the reads that have split alignments, large clippings and/or large indels;
2) rMETL clusters the chimerically aligned read parts according to pre-defined rules to infer a set of putative MEI sites as candidates;
3) rMETL realigns the clustered read parts to the consensus sequences of the Alu, L1 and SVA families with a well-tuned aligner;
4) rMETL investigates the realignment results to find the evidence needed to call MEIs and to filter out false positive candidates.
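
As a rough illustration of the extraction step, the sketch below scans the CIGAR string of a single alignment and collects large insertions and large soft clips as candidate read parts. It is not rMETL's implementation: the `Signature` type, the `extract_signatures` function and the 50 bp `min_len` threshold (chosen only to mirror the minimum MEI size mentioned above) are assumptions made for this example, and split alignments stored as supplementary records are not handled.

```python
import re
from typing import List, NamedTuple

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


class Signature(NamedTuple):
    chrom: str
    ref_pos: int   # 0-based reference coordinate where the signal is anchored
    kind: str      # "INS" for a large insertion, "CLIP" for a large soft clip
    sequence: str  # read bases that are not aligned to the local reference


def extract_signatures(chrom: str, ref_start: int, cigar: str,
                       read_seq: str, min_len: int = 50) -> List[Signature]:
    """Collect large insertions and soft clips from one alignment.

    min_len is a hypothetical threshold, not a documented rMETL default.
    """
    signatures = []
    ref_pos, read_pos = ref_start, 0
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op in ("I", "S") and length >= min_len:
            kind = "INS" if op == "I" else "CLIP"
            part = read_seq[read_pos:read_pos + length]
            signatures.append(Signature(chrom, ref_pos, kind, part))
        if op in ("M", "I", "S", "=", "X"):   # operations that consume read bases
            read_pos += length
        if op in ("M", "D", "N", "=", "X"):   # operations that consume reference bases
            ref_pos += length
    return signatures


# Toy read: 100 matched bases, a 60 bp insertion, 100 matched bases, a 55 bp clip.
toy_read = "A" * 100 + "C" * 60 + "G" * 100 + "T" * 55
for sig in extract_signatures("chr1", 1000, "100M60I100M55S", toy_read):
    print(sig.kind, sig.ref_pos, len(sig.sequence))
# prints "INS 1100 60" then "CLIP 1200 55"
```

The extracted `sequence` fields are the kind of read parts that a later step could realign against mobile element consensus sequences.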

Please also refer to Supplementary Figures S1 and S2 for schematic illustrations and Supplementary Notes for more detailed information on the implementation of rMETL.
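
The clustering step can be pictured as a simple positional grouping of signals. The following is a minimal sketch under stated assumptions, not the rule set used by rMETL (which is described in the Supplementary Notes): the function name `cluster_signatures`, the `max_gap` distance and the `min_support` cutoff are all hypothetical.

```python
from statistics import median_low
from typing import Iterable, List, Tuple


def cluster_signatures(signatures: Iterable[Tuple[str, int]],
                       max_gap: int = 50,
                       min_support: int = 5) -> List[Tuple[str, int, int]]:
    """Group (chrom, pos) signals into candidate sites.

    Neighbouring signals on the same chromosome are merged while the distance
    between consecutive positions stays within max_gap. Clusters with fewer
    than min_support signals are dropped. Both thresholds are illustrative.
    Returns (chrom, representative position, number of signals) per cluster.
    """
    clusters = []
    current = []

    def flush() -> None:
        # Emit the cluster collected so far if it has enough support.
        if len(current) >= min_support:
            positions = [pos for _, pos in current]
            clusters.append((current[0][0], median_low(positions), len(current)))
        current.clear()

    for chrom, pos in sorted(signatures):
        if current and (chrom != current[-1][0] or pos - current[-1][1] > max_gap):
            flush()
        current.append((chrom, pos))
    flush()
    return clusters


toy_signals = [("chr1", 1000), ("chr1", 1005), ("chr1", 1010), ("chr1", 1020),
               ("chr1", 1030), ("chr1", 5000), ("chr2", 100), ("chr2", 110)]
print(cluster_signatures(toy_signals))
# prints [('chr1', 1010, 5)]
```

A real implementation would also need to consider signal type and length; the single-linkage grouping here only conveys the idea of turning scattered read-level signals into putative sites.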

3 Results

We applied rMETL to simulated and real datasets to assess its performance. A state-of-the-art long-read-based SV calling approach, Sniffles (Sedlazeck et al., 2018b), was employed for comparison.

Four PacBio-like datasets (mean read length: 8000 bp, mean error rate: 15%) at four sequencing depths (5×, 10×, 20× and 50×) were simulated from an in silico haploid human genome having 20 000 MEIs (Section 2.1 of Supplementary Notes). For both rMETL and Sniffles, all the parameters were set to their default values except the numbers of supporting reads (−s parameters), which were set to 5 for the 5×, 10× and 20× datasets and to 10 for the 50× dataset, following previous work on the tradeoff between sensitivity and specificity (Sedlazeck et al., 2018b).

The sensitivities and accuracies of rMETL and Sniffles are given in Table 1 and Supplementary Table S1. Overall, rMETL achieves higher sensitivity than Sniffles, especially on the lower depth (5× and 10×) datasets. Moreover, both approaches have low false positive rates (0.01–0.23% for rMETL and 0.04–1.95% for Sniffles).

Table 1.

Sensitivities of rMETL and Sniffles on four simulated PacBio datasets (−s indicating the number of supporting reads parameter)

	5× (−s 5)	10× (−s 5)	20× (−s 5)	50× (−s 10)
rMETL	49.24%	78.64%	88.19%	90.14%
Sniffles	28.06%	68.93%	86.19%	89.43%
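
To make the gap in Table 1 concrete, the short calculation below converts the reported sensitivities into percentage-point differences and into approximate numbers of recovered MEIs. It assumes that sensitivity is the fraction of the 20 000 simulated MEIs that were recovered; the resulting counts are derived estimates for illustration, not figures reported by the authors.

```python
# Sensitivities (%) copied from Table 1.
sensitivity = {
    "rMETL":    {"5x": 49.24, "10x": 78.64, "20x": 88.19, "50x": 90.14},
    "Sniffles": {"5x": 28.06, "10x": 68.93, "20x": 86.19, "50x": 89.43},
}
N_SIMULATED = 20_000  # number of simulated MEIs stated in the text

# Approximate number of recovered MEIs per tool and depth (an assumption-based estimate).
approx_recovered = {
    tool: {depth: round(value / 100 * N_SIMULATED) for depth, value in row.items()}
    for tool, row in sensitivity.items()
}

# Sensitivity advantage of rMETL over Sniffles in percentage points.
gap_points = {
    depth: round(sensitivity["rMETL"][depth] - sensitivity["Sniffles"][depth], 2)
    for depth in sensitivity["rMETL"]
}

print(gap_points)
# prints {'5x': 21.18, '10x': 9.71, '20x': 2.0, '50x': 0.71}
```

The difference shrinks from roughly 21 percentage points at 5× to under 1 point at 50×, which is consistent with the statement that the advantage is largest on the lower depth datasets.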


Furthermore, we ran rMETL on a 50× simulated PacBio-like dataset from another in silico haploid human genome having 20 000 non-MEI insertions (Section S2.1 of Supplementary Notes). Only 366 (1.8%) of the 20 000 events were falsely called as MEIs, suggesting that rMETL is able to avoid most false positives caused by non-MEI insertions.

rMETL and Sniffles were further benchmarked with a 55× real PacBio dataset (Zook et al., 2014) and a 28× real ONT dataset (Jain et al., 2018). Their −s parameters were set to 10 (PacBio) and 5 (ONT), respectively, following the previous study (Sedlazeck et al., 2018b). A callset proposed by the 1000 Genomes Project (Sudmant et al., 2015), which was produced by multiple approaches, and four other callsets generated by state-of-the-art short-read-based MEI calling tools, i.e. MELT (Gardner et al., 2017), Tangram (Wu et al., 2014), Mobster (Thung et al., 2014) and Tea (Lee et al., 2012), were employed as pseudo ground truth. Each of them is termed an ‘SR-callset’.

rMETL called 4704 and 5439 MEIs, and Sniffles called 21 613 and 59 870 INS/DELs, on the PacBio and ONT datasets, respectively. The higher numbers of Sniffles calls are expected, since it detects all kinds of large insertions and deletions. We assessed the numbers of calls supported by the various SR-callsets and made two observations.

1) The callsets of rMETL covered 1589 (with the PacBio dataset) and 1588 (with the ONT dataset) of the 1628 MEIs that co-exist in at least two SR-callsets (Fig. 1 and Supplementary Table S2). Moreover, the upset plots (Supplementary Figs S3–S6) indicate that rMETL recovered 1696 (with the PacBio dataset) and 1699 (with the ONT dataset) of the 1764 MEIs in the 1000 Genomes Project callset. This suggests that in absolute terms rMETL has good sensitivity, considering that the MEIs called by multiple approaches can be more confidently regarded as true MEIs, and rMETL recovered most of them.

Fig. 1.

The numbers of long-read-based calls supported by various numbers of SR-callsets. Each bar indicates a specific callset produced by rMETL or Sniffles on the PacBio or ONT dataset, and its height indicates the number of calls in the callset supported by X (2–5) SR-callsets, i.e. the calls that also exist in at least X SR-callsets.

2) At the same levels of SR-callset support (i.e. supported by the same numbers of SR-callsets), rMETL always has more MEI calls than Sniffles (Fig. 1). This indicates that the sensitivity of rMETL is higher than that of Sniffles.
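
The support counts discussed in these two observations can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation script: the criterion for matching a long-read call to a short-read call is not stated in the text, so the 50 bp `tolerance` and all function names below are assumptions. The last lines reuse the counts quoted above to express them as percentages.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Call = Tuple[str, int]  # (chromosome, breakpoint position)


def n_supporting_callsets(call: Call, sr_callsets: Sequence[Set[Call]],
                          tolerance: int = 50) -> int:
    """Count the SR-callsets that contain a call near the given call.

    tolerance (bp) is a hypothetical matching window.
    """
    chrom, pos = call
    return sum(
        any(c == chrom and abs(p - pos) <= tolerance for c, p in callset)
        for callset in sr_callsets
    )


def calls_supported_by(calls: Iterable[Call], sr_callsets: Sequence[Set[Call]],
                       min_callsets: int, tolerance: int = 50) -> List[Call]:
    """Keep the calls that also exist in at least min_callsets SR-callsets."""
    return [call for call in calls
            if n_supporting_callsets(call, sr_callsets, tolerance) >= min_callsets]


def recall_percent(recovered: int, total: int) -> float:
    """Share of a reference set that was recovered, as a percentage."""
    return round(100.0 * recovered / total, 1)


# Toy data only; not taken from the benchmark.
long_read_calls = [("chr1", 1000), ("chr1", 9000), ("chr2", 500)]
sr_callsets = [{("chr1", 1010), ("chr2", 480)}, {("chr1", 990)}, {("chr2", 700)}]
print(calls_supported_by(long_read_calls, sr_callsets, min_callsets=2))
# prints [('chr1', 1000)]

# Counts quoted in the text, expressed as percentages.
print(recall_percent(1589, 1628), recall_percent(1588, 1628))  # prints 97.6 97.5
print(recall_percent(1696, 1764), recall_percent(1699, 1764))  # prints 96.1 96.3
```

Under these counts, rMETL recovers roughly 96–98% of the MEIs that are backed by multiple short-read callsets or by the 1000 Genomes Project callset.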

We find that the good sensitivity of rMETL derives from its realignment approach, which transforms ambiguous and chimeric read alignments into homogeneous alignments. This helps to find strong MEI evidence in complex signals, which remains non-trivial for unified SV detection approaches. An example is given in Supplementary Figure S7.

There are also MEIs called by rMETL that are not supported by any of the SR-callsets (i.e. 2412 and 3120 calls for the PacBio and ONT datasets, respectively). However, 77% (PacBio) and 79% (ONT) of such calls also exist in the callset of Sniffles (Supplementary Fig. S8), indicating that they could be plausible. We found that most of these unsupported calls also have strong evidence. That is, there are many chimeric read parts in the called MEI regions, and most of them can be confidently aligned to mobile elements (Supplementary Fig. S9).

The elapsed times, CPU times and memory footprints with 1, 2, 4, 8 and 16 CPU threads were assessed (Supplementary Table S3). In brief, with 8 CPU threads rMETL processed the PacBio and ONT datasets in 2.1 and 1.5 h, respectively (peak memory: 7.05 and 6.52 GB), which is about twice as fast as Sniffles.

The benchmarking results suggest that overall rMETL has good ability to detect MEIs. However, it has a few drawbacks. rMETL might fail when read parts are incorrectly realigned or when supporting reads are lacking. Addressing these cases is important future work for improving rMETL. Moreover, to some extent, rMETL relies on the consensus sequences of mobile elements. A more detailed discussion is given in the Supplementary Notes.

Funding

This work was partially supported by the National Key Research and Development Program of China (Nos: 2018YFC0910504, 2017YFC0907503 and 2017YFC1201201).

Conflict of Interest: none declared.

References

Gardner E.J. et al. (2017) The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology. Genome Res., 27, 1916–1929.

Jain M. et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol., 36, 338–345.

Lee E. et al. (2012) Landscape of somatic retrotransposition in human cancers. Science, 337, 967–971.

Sedlazeck F.J. et al. (2018a) Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet., 19, 329–346.

Sedlazeck F.J. et al. (2018b) Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods, 15, 461–468.

Stewart C. et al. (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet., 7, e1002236.

Sudmant P.H. et al. (2015) An integrated map of structural variation in 2 504 human genomes. Nature, 526, 75–81.

Thung D.T. et al. (2014) Mobster: accurate detection of mobile element insertions in next generation sequencing data. Genome Biol., 15, 488.

Wu J. et al. (2014) Tangram: a comprehensive toolbox for mobile element insertion detection. BMC Genomics, 15, 795.

Zook J.M. et al. (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol., 32, 246–251.

Author notes

The authors wish it to be known that, in their opinion, Tao Jiang and Bo Liu should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
