Abstract

Summary

Viral sequence data from clinical samples frequently contain contaminating human reads, which must be removed prior to sharing for legal and ethical reasons. To enable host read removal for SARS-CoV-2 sequencing data on low-specification laptops, we developed ReadItAndKeep, a fast, lightweight tool for Illumina and nanopore data that only keeps reads matching the SARS-CoV-2 genome. Peak RAM usage is typically below 10 MB, and runtime is less than 1 min. We show that by excluding the poly-A tail from the viral reference, ReadItAndKeep prevents bleed-through of human reads, whereas mapping to the human genome lets some reads escape. We believe our test approach (including all possible reads from the human genome, human samples from each of the 26 populations in the 1000 Genomes data and a diverse set of SARS-CoV-2 genomes) will also be useful for others.

Availability and implementation

ReadItAndKeep is implemented in C++, released under the MIT license, and available from https://github.com/GenomePathogenAnalysisService/read-it-and-keep.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Since experimental isolation of viral DNA from the host is imperfect, viral sequence data are frequently contaminated with host DNA sequence data. Removal of host sequence is a first step for many analyses, and where the host is human this is essential to safeguard patient anonymity. Typical approaches (Bush et al., 2020) either map reads directly to the host genome [e.g. using BWA MEM (Li, 2013) or Bowtie 2 (Langmead and Salzberg, 2012)] or use a metagenomics classifier [e.g. Kraken2 (Wood et al., 2019)] to assign each read to a species. However, in some circumstances (such as a global pandemic following a recent zoonosis) the viral genome is known and of limited diversity, opening up the possibility of positively identifying viral reads by mapping to a reference. In this article, we develop a simple tool that scans sequence data and retains only the reads that map to the viral genome. By rigorously testing both theoretically and with human data from diverse global populations predating the pandemic, we are able to give convincing evidence that mapping to a modified SARS-CoV-2 reference is sufficient to guarantee removal of human data. The tool, named ReadItAndKeep, is extremely fast and requires very little RAM: typically a few MB, compared with around 10 GB for methods based on mapping to the human genome. This allows read decontamination locally on a standard laptop before uploading to a shared or public server for analysis, or depositing in read archives.

2 Materials and methods

ReadItAndKeep is implemented in C++, using the API of minimap2 to match reads to a target genome. Hits from minimap2 are used without performing full alignment (equivalent to minimap2 default command line options, reporting approximate mappings).

A read is retained if it has a match that is at least 50 bp long or covers at least 50% of the length of the read (these are default values of user-specifiable parameters). In the case of paired reads, a pair is kept if either of the reads has a suitable match. ReadItAndKeep uses the minimap2 presets ‘short read’ or ‘ont’ for Illumina and Oxford Nanopore Technology (ONT) reads, respectively (the same as the command line options -x sr or -x map-ont). Retained reads are written to gzipped FASTQ file(s).
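The retention rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's C++ implementation: the function names are hypothetical, and it assumes that the "match length" is a single integer per minimap2 hit (the source does not say whether this is the number of matching bases or the span of the hit on the read).

```python
from typing import Iterable

MIN_MATCH_BP = 50          # default minimum match length stated in the text
MIN_MATCH_FRACTION = 0.5   # default minimum fraction of the read stated in the text


def read_passes(read_length: int, match_lengths: Iterable[int],
                min_bp: int = MIN_MATCH_BP,
                min_fraction: float = MIN_MATCH_FRACTION) -> bool:
    """Return True if any hit is >= min_bp long or >= min_fraction of the read.

    match_lengths holds one (assumed) match length per hit for this read.
    """
    if read_length <= 0:
        return False
    return any(
        m >= min_bp or m >= min_fraction * read_length
        for m in match_lengths
    )


def pair_passes(len1: int, matches1: Iterable[int],
                len2: int, matches2: Iterable[int]) -> bool:
    """A read pair is kept if either mate has a suitable match."""
    return read_passes(len1, matches1) or read_passes(len2, matches2)
```

For example, a 150 bp read with a single 40 bp match fails both thresholds, whereas a 60 bp read with a 30 bp match passes on the 50% criterion.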

We compared ReadItAndKeep with a standard approach of removing reads matching the human genome. We benchmarked against the tool Dehumanizer (https://github.com/SamStudio8/dehumanizer), which wraps mappy/minimap2, with its recommended reference comprising the human genome GRCh38 plus decoy and HLA sequences (details in Supplementary Text). We refer to this collection of references as the ‘human reference’ throughout.

3 Results

Complete benchmarking results are shown in Supplementary Table S1, summarized in Table 1 and described below.

Table 1.

Summary of testing ReadItAndKeep and Dehumanizer on Human and SARS-CoV-2 reads

| Dataset | Samples | Total reads | Percent reads retained, Dehumanizer | Percent reads retained, ReadItAndKeep | Mean run time, Dehumanizer | Mean run time, ReadItAndKeep | Peak RAM (MB), Dehumanizer | Peak RAM (MB), ReadItAndKeep |
|---|---|---|---|---|---|---|---|---|
| Human 75-mers | 1 | 3 182 668 318 | 0.763 | 0.0 | 5082 min | 83 min | 30 378 | 6 |
| Human 150-mers | 1 | 3 181 197 226 | 0.034 | 0.0 | 7332 min | 149 min | 30 376 | 6 |
| Human Illumina | 27 | 20 772 464 024 | 0.903 | 0.0 | 2860 min | 59 min | 11 831 | 8 |
| Human ONT | 1 | 15 666 887 | 10.283 | 0.0 | 9591 min | 103 min | 10 839 | 243 |
| SARS-CoV-2 75-mers | 1 | 29 796 | 100.0 | 100.0 | 48.0 s | 0.3 s | 11 305 | 7 |
| SARS-CoV-2 150-mers | 1 | 29 721 | 100.0 | 100.0 | 48.0 s | 0.3 s | 11 305 | 7 |
| SARS-CoV-2 Illumina | 246 | 610 451 014 | 99.994 | 99.894 | 102.4 s | 49.6 s | 11 330 | 9 |
| SARS-CoV-2 ONT | 189 | 30 422 462 | 100.0 | 99.992 | 52.1 s | 14.3 s | 8387 | 9 |

Note: Percent reads retained is calculated from summing across reads from all samples in the dataset. Mean run time is the mean wall clock time used across all samples in the dataset.
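The two summary statistics described in the note can be illustrated with a short Python sketch. The function names and the example numbers are hypothetical and are not taken from the benchmark; the sketch only shows that a percentage computed from summed read counts weights each sample by its size, unlike a mean of per-sample percentages.

```python
from typing import Sequence, Tuple


def pooled_percent_retained(samples: Sequence[Tuple[int, int]]) -> float:
    """Percent retained from summed counts; samples holds (reads_in, reads_kept)."""
    total_in = sum(n_in for n_in, _ in samples)
    total_kept = sum(n_kept for _, n_kept in samples)
    if total_in == 0:
        raise ValueError("no input reads")
    return 100.0 * total_kept / total_in


def mean_wall_clock(seconds: Sequence[float]) -> float:
    """Mean wall clock time across samples."""
    if not seconds:
        raise ValueError("no samples")
    return sum(seconds) / len(seconds)


# Hypothetical counts: a small sample keeping 1% and a large sample keeping 0%.
example = [(1000, 10), (9000, 0)]
pooled = pooled_percent_retained(example)   # 10 of 10 000 reads in total
per_sample_mean = sum(100.0 * k / n for n, k in example) / len(example)
```

With these made-up counts the pooled value is 0.1%, while the mean of the per-sample percentages would be 0.5%.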

Evaluation of human read removal: we first checked that ReadItAndKeep should in principle remove human reads, using all 75-mers and 150-mers of the human reference as ‘reads’, with the target genome SARS-CoV-2 MN908947.3. Only 90 469 (0.003%) 75-mers were retained, all of which matched the 33 bp poly-A tail of MN908947.3. Since this tail provides no useful information and is excluded by SARS-CoV-2 amplicon sequencing, we removed it from the viral genome for all further analysis. Using this trimmed sequence as the target, all tested k-mers were removed by ReadItAndKeep. Dehumanizer retained 0.76% of the 75 bp reads, and 0.03% of the 150 bp reads (Table 1).
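As a rough illustration of the exhaustive k-mer test, the Python sketch below enumerates every k-mer of a sequence and trims a trailing poly-A run. The helper names are hypothetical, and the 29 903 bp length of MN908947.3 is a figure not stated in the text; under that assumption, removing the 33 bp tail gives k-mer counts that agree with the SARS-CoV-2 rows of Table 1.

```python
from typing import Iterator


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_count(seq_len: int, k: int) -> int:
    """Number of k-mers in a linear sequence of length seq_len."""
    return max(seq_len - k + 1, 0)


def trim_poly_a(seq: str) -> str:
    """Remove a trailing run of A bases (a simplification of tail trimming)."""
    return seq.rstrip("Aa")


GENOME_LEN = 29903   # assumed length of MN908947.3, not given in the text
POLY_A_LEN = 33      # poly-A tail length given in the text

trimmed_len = GENOME_LEN - POLY_A_LEN
n_75 = kmer_count(trimmed_len, 75)
n_150 = kmer_count(trimmed_len, 150)
```

Under these assumptions `n_75` is 29 796 and `n_150` is 29 721. The same enumeration applied to each sequence of the human reference would produce the human k-mer ‘reads’.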

We then measured the success of human read removal on 27 Illumina runs from the expanded 1000 Genomes Project (Byrska-Bishop et al., 2021): the well-studied sample NA12878 plus one sample from each of the 26 populations, originating from Africa, Asia, Europe and the Americas (Supplementary Table S2). A high-depth run of ONT reads from NA12878 was also tested (Jain et al., 2018). Note that all of these samples were sequenced years before the SARS-CoV-2 virus jumped into humans, and so we assume that all reads in these datasets should be excluded. Across all these samples, ReadItAndKeep retained zero reads, but Dehumanizer kept 1.8% of Illumina and 10% of ONT reads (Table 1). Further investigation of the 10% showed that they were heavily enriched for very low-quality and repetitive reads, with multi-kb soft-clipped regions.

Quantification of SARS-CoV-2 read retention: we confirmed that all 75-mers and 150-mers from the SARS-CoV-2 reference genome were retained by Dehumanizer and ReadItAndKeep. Next, a set of genetically diverse samples was collated, comprising 246 Illumina and 189 ONT sequencing runs, chosen (see Supplementary Text) to maximize unique protein mutations and ensure a range of lineages as assigned by Pangolin (O’Toole et al., 2021). Dehumanizer retained >99.99% of reads, and ReadItAndKeep kept >99.99% of ONT reads and 99.89% of Illumina reads (Table 1). For diagnostic purposes, the reads excluded by ReadItAndKeep were then mapped to the SARS-CoV-2 genome using Bowtie 2 (Langmead and Salzberg, 2012) with the --very-sensitive-local option. The excluded reads were highly enriched for low quality, having either a very short match or a high error rate (see Supplementary Fig. S1, Supplementary Table S3). The greatest loss in mean per-base depth was 0.21% for ONT and 1.87% for Illumina (238/246 Illumina samples had mean loss <1%) (Supplementary Table S3). We conclude that the loss of this tiny volume of low-quality reads would not affect downstream analyses.
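One plausible way to compute the loss in mean per-base depth is sketched below in Python. The exact formula is not spelled out in the text, so this assumes the loss is the relative drop in mean depth between mapping all reads and mapping only the retained reads; the function name and the example depths are hypothetical.

```python
from typing import Sequence


def mean_depth_loss_percent(depth_all: Sequence[float],
                            depth_retained: Sequence[float]) -> float:
    """Relative loss (%) in mean per-base depth after read filtering.

    Both inputs hold one depth value per reference position.
    """
    if len(depth_all) != len(depth_retained) or not depth_all:
        raise ValueError("depth vectors must be non-empty and of equal length")
    mean_all = sum(depth_all) / len(depth_all)
    mean_retained = sum(depth_retained) / len(depth_retained)
    if mean_all == 0:
        return 0.0  # nothing mapped, so nothing could be lost
    return 100.0 * (mean_all - mean_retained) / mean_all


# Hypothetical depths at three positions, before and after filtering.
loss = mean_depth_loss_percent([100, 200, 300], [99, 198, 297])
```

Here the mean depth falls from 200 to 198, a loss of 1%.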

4 Discussion

There are broadly three options for decontaminating SARS-CoV-2 datasets: exclude reads mapping to human (as done by Dehumanizer), keep reads mapping to the virus (as done by ReadItAndKeep) or do both (first map to the virus, and then exclude any reads that also map to human, as is done by the COG consortium). We have shown that, by trimming the poly-A tail from the SARS-CoV-2 genome used by ReadItAndKeep, we completely remove spurious matches of human reads. Thus ReadItAndKeep offers an approach that is more reliable than just mapping to the human genome, and lighter weight (low RAM, fast) than either of the other two approaches.
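The three options can be contrasted with a small Python sketch. The enum and function names are hypothetical, and the two booleans stand in for the outcome of mapping a read to the viral and human references; this is a schematic of the decision logic, not of any tool's actual code.

```python
from enum import Enum


class Strategy(Enum):
    EXCLUDE_HUMAN = "exclude reads mapping to human"
    KEEP_VIRAL = "keep reads mapping to the virus"
    BOTH = "keep viral reads, then exclude those also mapping to human"


def keep_read(strategy: Strategy, maps_to_virus: bool, maps_to_human: bool) -> bool:
    """Return True if the read survives decontamination under the strategy."""
    if strategy is Strategy.EXCLUDE_HUMAN:
        return not maps_to_human
    if strategy is Strategy.KEEP_VIRAL:
        return maps_to_virus
    return maps_to_virus and not maps_to_human
```

A read that maps to neither reference, for instance a low-quality or repetitive human read, is kept under `EXCLUDE_HUMAN` but dropped under `KEEP_VIRAL` and `BOTH`, which matches the pattern of escaping reads reported in the Results.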

We also investigated using ReadItAndKeep for Influenza A and HIV-1 samples, which are known to be significantly more diverse than SARS-CoV-2. Although all human reads were removed, the method was not effective in retaining viral reads, in extreme cases rejecting more than half. Therefore, we only recommend ReadItAndKeep for viruses with low levels of diversity; our focus was SARS-CoV-2.

Finally, one challenge for implementing pathogen sequencing in healthcare systems is justifying what proportion of human reads must be removed to guarantee non-identifiability. By explicitly testing with all possible 75 and 150 bp reads in the (extended) human reference genome, and 27 human genome samples from different global ancestries, we were able to show ReadItAndKeep excluded every single human read. We hope the benchmarking approach itself will be of use, and that the speed and low resource requirements will make ReadItAndKeep of wide utility.

Author contributions

M.H. and Z.I. designed the study and wrote the article. M.H. developed the software and performed analyses. B.C. and J.S. added the Bioconda recipe and Docker container, respectively. P.W.F. collected the viral dataset.

Data availability

The data underlying this article are available in the article and in its online supplementary material.

Funding

M.H. was funded by the National Institute for Health and Care Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance [NIHR200915] (full statement in Supplementary Text).

Conflict of Interest: none declared.

References

Bush S.J. et al. (2020) Evaluation of methods for detecting human reads in microbial sequencing datasets. Microb. Genomics, 6, e000393.

Byrska-Bishop M. et al. (2021) High coverage whole genome sequencing of the expanded 1000 genomes project cohort including 602 trios. bioRxiv.

Jain M. et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol., 36, 338–345.

Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio].

O’Toole Á. et al. (2021) Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol., 7, veab064.

Wood D.E. et al. (2019) Improved metagenomic analysis with Kraken 2. Genome Biol., 20, 257.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Can Alkan