GenomeFLTR: filtering reads made easy

Abstract In the last decade, advances in sequencing technology have led to an exponential increase in genomic data. These new data have dramatically changed our understanding of the evolution and function of genes and genomes. Despite improvements in sequencing technologies, identifying contaminated reads remains a complex task for many research groups. Here, we introduce GenomeFLTR, a new web server to filter contaminated reads. Reads are compared against existing sequence databases from various representative organisms to detect potential contaminants. The main features implemented in GenomeFLTR are: (i) automated updating of the relevant databases; (ii) fast comparison of each read against the database; (iii) the ability to create user-specified databases; (iv) a user-friendly interactive dashboard to investigate the origin and frequency of the contaminations; (v) the generation of a contamination-free file. Availability: https://genomefltr.tau.ac.il/.


INTRODUCTION
Sequencing costs are constantly decreasing ( 1 ). Research groups are now able to generate large sequence datasets from various organisms, and hence the size of GenBank doub les e v ery fe w months ( 2 ). These data dri v e discov eries in ecology, evolution, molecular biology, and medicine (3)(4)(5). Detecting and filtering contaminant DNA is a main challenge when processing next-generation sequencing (NGS) data. Contaminant reads are defined as reads tha t origina ted from an organism different from the one that the r esear chers aimed at sequencing. Read contamination can have a significant effect on downstream analyses, such as false positi v e single-nucleotide polymorphisms (SNP) identification ( 6 ), incorrect labels on sequences in metagenomic studies ( 7 ) and inaccurate phylogenetic inference ( 8 ).
Pre vious studies hav e shown that some of the most used biological datasets contain a large proportion of contaminated sequences. For e xample, ov er two billion contaminated sequences were detected in RefSeq and over 14,000 putati v e contaminants were identified in the nonredundant (NR) database ( 9 ). Moreover, repeated elements characterizing human cells were found in a quarter of nonprimate genomes available in NCBI (the National Center f or Biotechnology Inf orma tion) ( 10 ). As these da tasets are often taken as 'ground truth', filtering contaminated sequences is of high importance prior to most bioinformatic analyses ( 11 , 12 ).
Previous contaminant-detection algorithms can be classified b y v arious criteria ( 13 ). One main criterion is the presence or absence of reference databases to detect contamina tions. Methodologies tha t do not rely on r efer ence databases, sear ch for r ead-specific featur es such as lowcomplexity and low-quality scores ( 14 ). When searching for bacterial contaminants, the following features were sear ched for: atypical GC content, pr esence of intron-less genes, and small scaffolds ( 15 , 16 ). Regarding algorithms that rely on reference databases for detecting contaminations, se v eral search for the presence of a few single-copy gene markers [e.g. ( 17 , 18 )]. These methods are aimed to detect the presence of additional copies of these markers indicating the presence of contaminations and possibly identify their sources. However, these gene markerbased methods aim at detecting contaminations and not at filtering a dataset from contaminants. Other algorithms, using r efer ence databases take a genome-wide approach to detect contaminations. With these algorithms, filtration may take place pre-or post-assembly ( 13 ). One advantage of post-assembly methods is that they can take synteny into account to detect contaminant sequences ( 12 ).
Howe v er, it is likely that the assembly itself can be improved by removing contaminated reads prior to the assembly. Contamination-detection algorithms are often tuned to specific taxa, e.g. the tool GUNC searches for lack of phylo genetic homo geneity acr oss pr okaryotic contigs ( 19 ). Se v eral pre-assemb l y methods rel y on splitting the reads into small fragments and finding similarities against specific datasets using BLAST ( 20 ). Great progress was achie v ed by the de v elopment of efficient algorithms for mapping short DNA segments to genomes, which allows classifying reads to taxonomic units (21)(22)(23)(24). Such fast approaches are a pr er equisite for the de v elopment of efficient w e b servers for the detection and filtering of contaminated reads and contigs.
The above algorithms, as well as additional tools developed by specific r esear ch groups ( 21 , 25 ), r equir e downloading the pipeline components (e.g. scripts, programs), downloading and maintaining databases, and may r equir e heavy computer clusters (i.e. multi-CPUs) and technological skills. Here we present GenomeFLTR, a w e b server that easily filters genomic reads. No technical skill, downloading, or computational power is needed. Raw r eads ar e uploaded to the server and contaminated reads are removed, based on similarity to databases that are periodically and automa tically upda ted. A user can also provide a tailored dataset to compare against. The contaminated reads are analyzed, e.g. the reads tax onom y distribution is provided. Our server provides a simple and interacti v e graphical user interface (GUI) that allows controlling the filtering process (Video 1, Supplementary data).

Input
The sole mandatory input for the GenomeFLTR w e b server is a file (or two files for pair ed r eads, see below), containing the reads to be filtered. Standard formats such as Fastq and Fasta are accepted. In addition, a user has to select a database against which the r eads ar e queried (e.g. to detect bacterial contaminants, a user can choose a bacterial database containing multiple genomes from a di v erse set of bacteria). A user may also input a custom database (see below). Finally, a user may specify an email to which the results link will be sent.

Database
The entire set of sequence data bases availa ble in Genome-FLTR is automa tically upda ted monthly from NCBI. These databases are processed for the Kraken search engine format ( 21 ). We also allow users to choose the database against which to compare their read da ta. Default da tabases are bacteria, human, fungi, protozoa, uni v ec (i.e. a dataset of vector sequences), plasmid, archaea, vir al, Kr aken standard (i.e. all complete bacterial, archeal, and viral genomes in Refseq), and custom. For the custom database, a user inserts the NCBI tax onom y identifiers of the species included in RefSeq (NCBI Reference Sequence Database) to compare against and may choose specific accession numbers of genomes from this species to analyze. If accession numbers are not provided, the first three genomes from RefSeq are downloaded for each species. To download the genomes for the custom database we use a script available at https: //github.com/kblin/ncbi-genome-download .

Search engine
Each read is first split into k -mers ( k -mers are substrings of the read with length k ; for example, 3-mer for the read: ' ATGG' will be: ' AT G' and 'T GG'). To maximize both speed and accuracy, we use the Kraken 2 search engine ( 21 ) to query each k -mer (with k = 35) against the selected database. A phylogenetic tree representing the evolutionary relationships within the taxon included in each Kraken database is used to classify hits to either species or ancestral nodes. If a k -mer only matches a single species, it will be assigned to it. If a k -mer matches multiple species, it will be assigned to the most recent common ancestral node of all these species. Note that different k -mers within the same read might be assigned to different nodes of the phylogenetic tree. The output of this step is a file containing, for each of the reads, a list of species or ancestral nodes and the number of k -mers matched to each node.

Read classification
The output of the previous step is further processed in order to classify each read to a specific node in the tree. To this end, for each read and for each node we define a readnode score, which is the number of k -mers mapped to this node divided by the total number of k -mers possible for that read ( lk + 1, where l is the read length). For each read, we identify the node that maximizes the read-node score and assign the read to this node. A tabular description ( Figure  1 A) of the number of contaminated reads from each node is provided as interacti v e visual output by GenomeFLTR as well as a pie chart indicating the percentage of contaminated reads (Figure 1 B).

Determining which reads to filter
We also define a read-contamination score, which is the sum, over all nodes of the tree, of the read-node score. This score quantifies the percent of k -mers that were mapped to the contamina ted da tabase out of the lk + 1 total kmers. The higher the read-contamination score, the more likely it is that the read is a contamination and hence should be filtered. A histogram illustrating the distribution of the r ead-contamination scor e is gi v en as an interacti v e graphical output by GenomeFLTR (Figure 1 C). The user specifies a threshold cutoff that determines which reads will be labeled as contamination and which will be retained in the 'clean' data. By default, this threshold is set to 0.5. This threshold can be set interacti v ely by clicking on the bars of the histogram. Reads with a score lower than the threshold (this threshold is marked by a red line in the graphical plot) are colored blue and will be retained, while reads colored orange will be filtered once the user presses the 'Get filter ed r esults' button. Of note, r eads tha t do not ma tch any of the genomes in the database are also part of the clean data.
It is possible that a user chooses to retain reads of specific species. For example, if a user sequenced a metagenomic sample containing multiple bacteria species, and would like to retain only a subset of those bacteria, e.g. bacteria that are known to exist in a specific niche. He can do so, by choosing specific species to retain / filter from the interacti v e tabular section of the GUI. The pie chart and the histogram are updated accordingly in real-time. We note that in this case, some blue reads (retained reads) could appear to the right of the red bar, which indicates the readcontamination score threshold.

Obtaining contamination-free data
Pressing the 'Get filtered results' button initiates the post process, which iterates over the reads and identifies the 'cleaned' from the contaminated ones. When the post process is finished, a link to download a compressed file (i.e. a '.gz' file) containing all the non-contaminated reads is provided on the screen and via email to the user.

Implementation
GenomeFLTR is implemented in Python using the Flask frame wor k. The source code is available at: https://github. com/michaelalb/GenomeFltr . The w e b server submits jobs that are processed on ProLiant XL170r Gen9, equipped with 128 GB RAM and 28 CPU cores per node. Background images were generated using Dall-E 2 ( 26 ).

P air ed-end files
Another feature implemented in the w e b server is the filtering of paired-end reads. Each end is first processed independently as described abov e. Ne xt, the node-score of the pair-end read is the maximum over the two ends. For example, if one end has a read-node score of 0.2 for species X, and the other end the r ead-node scor e is 0.75 to species Y, the result of the pair ed r ead is a read-node score of 0.75 to species Y. Based on the read scores, the paired reads are either filtered or not, thus, if one end within a pair is considered to be a contamination, the entire paired-end read is discarded.

Data structure
We transform the list of the reads (which contain millions of reads) into a matrix, in which rows are bins of the read-node score (101 bins: 0, 0.01, . . . , 1) and columns are nodes in the tree. Each cell denotes the number of reads for that bin and node. This data structure allows us to present interacti v e graphs in real-time.

Multiple filtration steps
It may be beneficial to filter reads first against the univec database and then to filter the obtained clean data against, e.g. bacterial contaminants. Iterati v e e xecutions allo w breaking do wn the filtration procedure, thus cleaning the data against combinations of pre-existing categories. Such an approach is demonstrated in the case study below.

CASE STUDY
We present the GenomeFLTR output by analyzing the transcriptome reads available under accession number SRR1300899. This pair ed-r ead da taset origina ted from a m yx ozoan parasite ( Kudoa iwatai ). Myxozoans are microscopic eukaryotic parasites of fish, with a large negati v e economic impact ( 27 ). Because of their small size and their presence within fish tissues, we expected to find fish reads as well as some bacterial reads and possibly a small number of human reads in these NGS data. This parasitic dataset was published before the fish host genome was available and thus these data were not filtered before their submission to public repositories ( 28 ). We analyzed a total of 50 million pair ed-end r eads (100 million reads in total) from these data in two steps. First, we excluded the fish reads by performing a custom filtering analysis in which we provided the taxonomic id (taxid 8175) of the host fish as input. The program automatically downloaded the corresponding Refseq genome GCF 900880675.1 for this analysis. GenomeFLTR inferred that ∼17.3% of the reads (8,641,393 paired reads) were of fish origin using a read similarity threshold above 0.75. We then downloaded the remaining uncontaminated reads and conducted a second filtering analyses against the Kraken standard database (again with a threshold of 0.75). The r emaining r ead data contained bacterial contaminations from various sources, for example Proteobacteria (39,129 paired reads) and Staphylococcus (49,243 paired reads). It also contained a number of reads from human origin (107,658 paired reads). Using the w e b server option to mark nodes that should not be filtered, we decided not to filter cellular organisms and root (taxid 1), which reflect reads that are potentially of eukaryotic origin and thus may be genuine m yx ozoan reads. In total, 1.22% of the r eads wer e filtered, genera ting contamina tion-free da ta tha t ar e r eady for further analyses.

DA T A A V AILABILITY
GenomeFLTR is freely available without registration or login r equir ements at https://genomefltr.tau.ac.il/ .

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Online.