copMEM2: robust and scalable maximum exact match finding

Abstract

Summary: Finding Maximum Exact Matches (MEMs), i.e. matches between two strings that cannot be extended further to the left or right, is a classic string problem with applications in genome-to-genome comparison. Existing tools rarely address explicitly the problem of MEM finding for a pair of very similar genomes, which may be computationally challenging. We present copMEM2, a multithreaded implementation of its predecessor. Together with a few optimizations, including a carefully built predecessor query data structure, sort procedure selection, and special handling of highly similar data, copMEM2 computes all MEMs of minimum length 50 between the human and mouse genomes in 59 s, using 10.40 GB of RAM and 12 threads, being at least a few times faster than its main contenders. On a pair of human genomes, hg18 and hg19, the respective results are 324 s and 16.57 GB.

Availability and implementation: copMEM2 is available at https://github.com/wbieniec/copmem2.

Basic characteristics of the datasets are presented in Table 1. The experiments comprise the following pairs of datasets (see the column 'Index'), where the first index denotes the reference and the second the query genome: 01-03, 01-04, 05-06, 01-02, 07-08, 09-10.

2 How to run copMEM2

Assuming that the input genomes are named hum.all.fa and panTro3.fa, where the former is the reference and the latter the query genome, an exemplary minimal command line may look like:

./copmem2 -o hp-100.txt hum.all.fa panTro3.fa > hp-100.log

where -o specifies the output file. With this command line, the minimum MEM length defaults to 100, although -l followed by a number may be added to change it. Currently, L = 50 is the minimum value handled by copMEM2.
Additionally, copMEM2 is capable of finding reverse-complement matches with the switch -r:

./copmem2 -r -o hp-300-r.txt -l 300 hum.all.fa panTro3.fa > hp-300-r.log

or both forward and reverse-complement matches with the switch -b:

./copmem2 -b -o hp-300-b.txt -l 300 hum.all.fa panTro3.fa > hp-300-b.log

Additional useful parameters:
• -mf: switches to the memory-frugal mode. It reduces the number of bits holding the hash to 28 and the sizes of the text buffers used for output generation.
• -t 4: runs the program in the multithreaded mode (here with 4 working threads). The passed value is the maximum number of threads the program may use while running, but the number of working threads is also limited by the number of sequences in the query dataset. A maximum of 64 threads can be set.
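For example, a run combining several of the switches above (the file names here are mere placeholders) might look like:

./copmem2 -t 4 -mf -l 100 -o out.txt ref.fa query.fa > run.log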

3 Configuration
To examine the properties of the software thoroughly, it is possible to set some parameters at runtime and the remaining ones before compilation. Changing the latter requires editing the source code (copmem2.cpp) and the makefile.

3.1 Tuning runtime parameters
The list of the remaining parameters is displayed after running ./copmem2 -h. These parameters may be used for diagnostics and for tuning the application's speed on some datasets, yet they are normally not needed.
Event logging level (these switches make it possible to mute all messages or to show all messages and the parameters of the processed files):
• -q: quiet mode, no screen output,
• -v: verbose mode, display all runtime details.
Output generation:
• -r: compute only reverse-complement matches,
• -b: compute both forward and reverse-complement matches,
• -ilb: ignore lowercase bases. Not used by default (i.e., lowercase bases are NOT ignored). When set, all lowercase symbols are treated as N.
Tuning parameters:
• -H n: changes the hash function. Five functions have been implemented: 1: maRushPrime1HashSimplified (default), 2: xxhash32, 3: xxhash64, 4: metroHash64, 5: cityHash64,
• -K n: changes the number of symbols used for the hash calculation to n. The default K is 44, but it also depends on L,
• -hash_bits n: manually sets the length of the hash in bits, which affects the size of the hash table. The implemented values are 28, 29 and 30,
• -k1 n: manually sets k1. Use together with the -k2 option,
• -k2 n: manually sets k2. Use together with the -k1 option; k1 and k2 should be coprime,
• -e: forces k2 = 1, which is similar to E-MEM,
• -fbr 1|2: manually forces long datatypes for processing a big reference R (1: big, 2: huge). It affects the sizes of the hash table and of the output structures (see Sec. 5.1),
• -lm n: the long MEM threshold, 4096 by default. Changing it may affect the performance when comparing similar datasets, e.g. hg18-hg19 (see Sec. 5.5),
• -multi1 n: manually sets a buffer size for the HT creation. Range: 64-512; should be a power of two. Newer CPUs may benefit from increasing this value,
• -multi2 n: manually sets another buffer size for the HT creation. Same notes as above,
• -multi n: manually sets a buffer size for processing the query. Same notes as above.

3.2 Compilation parameters
Some parameters cannot be variables, as this would adversely affect the performance of the program; for this reason, they cannot be set on the command line. Nevertheless, they can be changed at compilation. In CopMEM2.cpp one can find the following preprocessor definitions:
• #define radixsort 1: enables LSD radix sort. Setting it to 0 invokes std::sort instead, and -1 disables all sorting (which may produce incorrect results). For the results, see Sec. 4.
• #define stdfmt 0: enables the fmt library for number-to-string conversion. If you want std::string methods to be called instead, set this value to 1.
• #define predsv 1: turns on the fast search for the beginning of a sequence. If the value is set to 0, the std::lower_bound method is used instead.
• #define dumptimer 0: disables measuring and displaying the times used for sorting and for formatting the found MEMs before sorting. If the value is set to 1, these benchmarks are shown. Note that frequently switching the timers on and off slows down the entire process (see Sec. 4).
In the makefile there are some gcc compilation switches:
• -march=native: makes the program run faster on some machines, but makes the binary non-portable. In addition, it may clash with valgrind,
• -funroll-loops: a further speed optimization,
• -O3: yet another speed optimization,
• -pthread: enables multithreading.
Other parameters that may be interesting for benchmarking are set as constant identifiers, also in CopMEM2.cpp.
• constexpr int MAX_THREADS = 64: limits the number of used threads. You may change it.
• constexpr uint32_t MAX_MULTI = 512: limits the size of the array used for prefetching. On a better CPU, it may be increased.
• constexpr size_t DEF_MATCH_BLOCK_SIZE = 1 << 21: the maximum number of elements for partial sorting. See Sec. 5.4.
• constexpr size_t MAX_BLOCK_SIZE = 1ULL << 31: the maximum number of elements for full (emergency) sorting.
• constexpr size_t MATCHES_SORT_T1 = 1ULL << 10: the maximum number of elements in an array that is sorted with std::sort.
• size_t MATCHES_BUFFER_SIZE = 1ULL << 24: the maximum size of the buffer for formatting matches as a string before flushing them to disk in the query phase.

4 Benchmarking
All tests were performed in a Debian 11 environment. The time program was used to measure execution time and memory consumption. Listing 1 shows a script that tests E-MEM, bfMEM and copMEM2.
Listing 1: benchmarks.sh, an exemplary Linux script for testing the competing software (the body of the innermost loop, with the timed invocations of the tested programs, is elided in this excerpt):

#!/bin/bash
MemFill="./memoryfill 118G"
R="../datasets/hum.all.fa"
Q="../datasets/panTro3.fa"
TST="hp"
for pass in a b c
do
  for L in 200 100 80
  do
    for T in 8 4 1
    do
      echo "Test="$TST" Pass="$pass" L="$L
      # ... (timed runs of E-MEM, bfMEM and copMEM2 follow here)
    done
  done
done

The script is most conveniently run in the background with the nohup command:

$ nohup ./benchmarks.sh > benchmarks.log &

Explanations to the sample script:
• the script tests the hum and panTro3 datasets for the minimum MEM lengths L = 200, 100 and 80, using 1, 4 and 8 threads. Each test is performed 3 times (for the purposes of this article, we calculated the median of the three measurements);
• all programs are in the ../progs directory and the test files in the ../datasets directory. The result files are created in the current directory and deleted after the timing;
• before each test, the main memory is filled with the memoryfill program (attached to the project), which prevents caching of the input files in RAM;
• E-MEM, unlike the others, has no parameter specifying the output file; therefore, its command includes a redirection of the standard output to a file;
• processing with bfMEM requires two phases: the main program, bfMEM, searches for the MEMs, but sorts the matches in a manner inconsistent with the other programs, so it is necessary to use the supplementary program formatconvert, provided by its authors.

5 Technical description of copMEM2
This section presents copMEM2 technicalities. The processing involves several phases (cf. Fig. 1). Note that the general MEM-finding mechanism, as well as most of the data structures, are preserved from v1. The elements introduced or changed in copMEM2 are pointed out below.
• scanMultiFasta. The function scans the genome Q for the beginnings of the individual sequences, which are saved in a list (i.e., a C++ vector). This list is then split into nThreads lists (where nThreads is the specified number of threads) of approximately equal size. The elements of each list are grouped into blocks that will be loaded into memory at once. A block may contain one or more sequences, but its size cannot exceed the length of the longest sequence.
• readMultiFasta. The function loads the genome R into memory. It creates a list of sequences, analogously as for Q, but each item is augmented with extra data, the sequence name (i.e., the list contains 2-element tuples). Additionally, the function:
- removes the sequence names and whitespace from the array and aligns the data,
- converts all non-ACGT symbols into a special character and converts the symbols to uppercase.
• Creating a hash table of K-mers sampled from R with step k1. In copMEM2 this phase may be performed using multiple threads.
• Processing the query dataset and generating the results. This phase includes scheduling portions of data for the threads. Each thread reads a sequence, finds its matches (i.e., MEMs) and dumps the results, in sorted form, into a textual file. At the end, all these temporary files are merged into one.
Below, some details of key structures and algorithms are revealed.

5.1 Creation of a hash table
The concept of the hash table (HT) is the same as in copMEM: it contains the indices of R's suffixes sampled with step k1, grouped by the hash of the K-long prefix of each suffix. Physically, HT consists of two one-dimensional arrays:
• cum: the actual hash table, initialized with zeros. The elements of the array are unsigned 32-bit integers if the size of R is at most k1 · 4 GB, or 64-bit ones otherwise. The number of elements in cum is 2^29, since 29 is the default number of hash bits.
• sampledPos: an array of occurrences. The elements are 32-bit integers if R's size is k1 · 4 GB or less, and 64-bit ones otherwise. There are ⌊(N − K + 1)/k1⌋ + 2 elements in the array (the extra 2 are added for technical reasons).
When the hashes of all ⌊(N − K + 1)/k1⌋ K-mers are computed (in copMEM2 multiple threads can be used in this phase) and cum and sampledPos are filled, each pair cum[j], cum[j+1] determines the interval of K-mers whose hash value is j. More precisely, assuming that cum[j] ≠ cum[j+1], sampledPos[cum[j]] contains the index of the first K-mer whose hash value is j, while sampledPos[cum[j+1]] contains the index of the first K-mer whose hash value is larger than j. Note that there may be no sampled K-mers with hash value j, but then the corresponding interval is empty.
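To make this layout concrete, below is a minimal C++ sketch of the two arrays and of the bucket lookup; the struct and method names are ours, not copMEM2's, and only the 32-bit variant is shown.

#include <cstdint>
#include <utility>
#include <vector>

// Simplified sketch of the cum/sampledPos pair described above: cum[j]
// and cum[j+1] delimit the group of sampled positions whose K-mer
// hashes to the value j.
struct HashTable {
    std::vector<uint32_t> cum;         // bucket boundaries, 2^29 (+1) entries
    std::vector<uint32_t> sampledPos;  // sampled suffix positions of R

    // Returns the [begin, end) range of positions with hash value j;
    // the range is empty if no sampled K-mer has this hash value.
    std::pair<const uint32_t*, const uint32_t*> bucket(uint32_t j) const {
        return { sampledPos.data() + cum[j], sampledPos.data() + cum[j + 1] };
    }
};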
The following improvements were implemented in copMEM2.
• The number of hash bits HS can be set to 29 (by default) or 28 (-mf mode). In the latter case, the array cum gets halved.
• When the size of the genome R is between 2^32 and k1 · 2^32, we can still use 4-byte elements in the array sampledPos. In this case, the items are divided by k1 when written and multiplied by k1 when retrieved. Note that k1 is usually at least 4. Although the extra divisions and multiplications pose some overhead, this change seems beneficial overall. Table 2 gives a couple of examples of the HT memory usage in copMEM and copMEM2. Note that the values of k1 differ in the 5th and the 6th row; this is explained in Sec. 5.3.

Table 2: Illustration of the memory usage for the hash table, comprising the cum and sampledPos components, for copMEM and copMEM2 (default mode). The size of the reference sequence (R) is assumed to be 3 GB, 6 GB or 24 GB, respectively. 1 GB = 10^9 bytes.
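A minimal C++ sketch of the packing trick from the second bullet above, under our reading that the sampled positions are always multiples of k1 (the names are hypothetical):

#include <cstdint>

// Positions sampled from R with step k1 are multiples of k1, so they
// can be stored divided by k1, extending the reach of 32-bit elements
// from 2^32 to k1 * 2^32 bytes of reference.
struct PackedPositions {
    uint64_t k1;
    uint32_t pack(uint64_t pos) const    { return uint32_t(pos / k1); }
    uint64_t unpack(uint32_t code) const { return uint64_t(code) * k1; }
};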

5.2 Multithreading
In the compared programs, multithreaded processing has been implemented in different ways. E-MEM uses OpenMP; additionally, its first phase of work, building the hash table, is single-threaded. bfMEM and copMEM2 use the low-level thread support available in C++. Both programs parallelize the MEM lookup in a similar manner: the individual sequences of Q are distributed among the threads, so that each thread processes at least one whole sequence. Hence, in both programs the number of working threads cannot exceed the number of sequences in Q (so in the extreme case of a single sequence in Q, both bfMEM and copMEM2 work in the single-threaded mode, no matter the chosen switches). We noticed that bfMEM reports the number of threads inaccurately: for example, if it is run with a maximum of 6 threads, the foreground thread spawns 6 worker threads and waits for their completion, so the program actually runs on 7 threads. (Neither E-MEM nor copMEM2 behaves in this way.) copMEM2 uses threads as follows.
• Reading and analyzing input files: this phase is performed serially (except for a subphase of replacing symbols N and n, which may involve 2 threads, with a minor benefit in speed).
• Creation of the hash table: the array containing the genome R is split evenly between threads for HT building.
• MEM finding phase. After analyzing Q, this genome is divided at sequence boundaries between threads so that each of them gets approximately the same portion of data. Each thread opens its (temporary) output file in which it will save the results in the target format.
• Merging temporary files: the output files should be merged in the correct order. This is handled by the first thread just after it has finished processing its portion of data. Meanwhile, the other threads may still be finishing their processing, which allows appending files partially in parallel with the ongoing match search.
This approach is quite frugal when it comes to using disk space. Details are given in Sec. 6.
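The sketch below illustrates this scheme in C++ (a simplified round-robin assignment with hypothetical names; copMEM2 itself balances the threads by the total size of their assigned data):

#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Query sequences are divided at sequence boundaries among the threads,
// so the number of working threads never exceeds the number of sequences.
void processQuery(const std::vector<std::string>& sequences,
                  std::size_t maxThreads,
                  void (*findMems)(const std::string& seq, std::size_t tid)) {
    std::size_t nThreads = std::min(maxThreads, sequences.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nThreads; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < sequences.size(); i += nThreads)
                findMems(sequences[i], t);  // results go to thread t's temp file
        });
    for (auto& w : workers) w.join();
    // Afterwards the per-thread temporary files are appended in thread order.
}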

5.3 Increasing k1 when possible
In copMEM2, like in its predecessor, the sampling parameters k1 and k2 are relatively prime. In copMEM they are both close to √(L − K + 1); namely, k1 is set to be the largest integer such that k1 · (k1 − 1) ≤ L − K + 1, and k2 = k1 − 1. This selection scheme is refined in copMEM2: we start as above, but then try to increase k1 as much as possible, that is, so that the condition k1 · k2 ≤ L − K + 1 still holds. The value of k2 is unchanged.
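The following C++ sketch captures this selection as we read the description above; the gcd guard in the refinement step is our own assumption, added to keep k1 and k2 coprime:

#include <numeric>  // std::gcd

// Sketch of the k1/k2 selection: first the copMEM rule, then the
// copMEM2 refinement that pushes k1 up while k1 * k2 <= L - K + 1.
void selectSamplingSteps(int L, int K, int& k1, int& k2) {
    // copMEM: the largest k1 such that k1 * (k1 - 1) <= L - K + 1,
    // with k2 = k1 - 1 (consecutive integers are always coprime).
    k1 = 2;
    while ((k1 + 1) * k1 <= L - K + 1) ++k1;
    k2 = k1 - 1;
    // copMEM2 refinement: increase k1 while the condition still holds.
    while ((k1 + 1) * k2 <= L - K + 1 && std::gcd(k1 + 1, k2) == 1) ++k1;
}

For example, with L = 50 and K = 36 (so L − K + 1 = 15), the first step yields k1 = 4, k2 = 3, and the refinement raises k1 to 5, since 5 · 3 ≤ 15.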

5.4 Dealing with large match lists
How copMEM2 handles large match lists is described in the main paper (Sec. 2.1), so the reader is advised to consult it first. Here we only add an illustrative example.
In the unlucky case when the added element is too small (e.g., 800 in the example above, rather than 932), an emergency procedure is provided: the current thread must start from scratch, using a new value of MATCH_BLOCK, increased to 2^31. Note that this does not affect the processing of the remaining threads. Note also that if MATCH_BLOCK is (for safety) set to a huge value, the space requirements of the tool do not grow correspondingly; it is much more likely that the current (even if large and problematic) sequence from Q finishes earlier and the thread's match list is cleared.

5.5 Dealing with long matches
Another problem to solve is finding matches in very similar genomes, as in the case of the hg18-hg19 test, where a substantial number of very long MEMs are found. When scanning Q with some fixed increment, the same MEM is likely to be found multiple times, despite a changing anchor. The other contenders cannot handle such cases effectively: they search for the same match multiple times and delete the duplicates only before formatting the results, which explains their long processing times on the hg18-hg19 pair of datasets. A mechanism preventing this type of situation was designed in the (non-public) E-MEM2 software.
We (essentially) implemented the idea from E-MEM2, which is to discard a MEM (of length at least L) if it is fully contained in a MEM found at the previously sampled position in Q; in this way the number of accesses to R is hugely reduced in highly similar genomes, which avoids a great number of possibly costly match extensions.
More precisely, if a match is found whose length from the seed position to its right boundary is at least LONG_MEM (4096 by default), the subsequent matches are handled in a special way. Let us denote the mentioned long match by LM. For the current match, we check whether its right boundary is still within the span of LM and whether the difference between the positions of the seed in Q and of its 'anchor' reference in R is the same as the corresponding difference for LM. Such a case means that at the current position we have found exactly the same match as before (which is going to be of length at least LONG_MEM), so we can skip its (costly) extension.
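In code, the check could look roughly as follows (a C++ sketch with our own, hypothetical names):

#include <cstdint>

// Hypothetical illustration of the long-match skip described above.
struct LongMatch {
    int64_t qEnd;  // right boundary of LM in Q
    int64_t diag;  // difference posR - posQ ("diagonal") of LM
};

// True if a seed at (posQ, posR), whose match ends at rightBoundaryQ in Q,
// lies on the same diagonal as LM and still ends within LM's span; such a
// seed would rediscover the same long match, so its extension is skipped.
bool canSkipExtension(const LongMatch& lm, int64_t posQ, int64_t posR,
                      int64_t rightBoundaryQ) {
    return posR - posQ == lm.diag && rightBoundaryQ <= lm.qEnd;
}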

5.6 Minor improvements
Using the fmt library

In copMEM2 we use the open-source formatting library fmt for creating the formatted match lines (i.e., for converting unsigned integers to strings). To our knowledge, it provides the fastest portable alternative to C stdio and C++ I/O streams. fmt is available at https://fmt.dev.
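For illustration, a match line could be appended to a text buffer as follows (a sketch against fmt's public API; the tab-separated three-column layout is only an example, not necessarily copMEM2's exact output format):

#include <cstdint>
#include <iterator>
#include <fmt/format.h>

// Append one match line to a preallocated text buffer; fmt::format_to
// avoids the overhead of C++ streams when converting integers to text.
void appendMatch(fmt::memory_buffer& buf,
                 uint64_t posR, uint64_t posQ, uint64_t len) {
    fmt::format_to(std::back_inserter(buf), "{}\t{}\t{}\n", posR, posQ, len);
}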

Variable prefix length (K)
In copMEM, K had a fixed value (44 by default; the user could choose K from {36, 44, 56}). copMEM2 enables setting K manually, from 32 to 96, as a multiple of 4. If K is not set by a parameter, the program assigns K = 56 for L ≥ 200, K = 44 for L ∈ {80, …, 199}, and K = 36 for smaller L.
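This default selection amounts to a one-liner (a C++ sketch with a hypothetical function name):

// Default seed length K as a function of the minimum MEM length L
// (default mode; the -mf mode uses 44 for L >= 200 and 36 otherwise).
int defaultK(int L) { return L >= 200 ? 56 : (L >= 80 ? 44 : 36); }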

Memory frugal mode
In the introduced memory-frugal mode (-mf), the number of slots in the hash table is halved and the size of the buffer holding the text-formatted output before it is dumped to file is reduced from 2^24 to 2^22 bytes. Additionally, the seed size K is set slightly differently than in the default mode: to 44 if L ≥ 200 and to 36 otherwise.

6 Additional results
Table 3 shows the progress of the copMEM development (hp300 denotes human vs chimpanzee with L = 300, and tatd50 common wheat vs durum wheat with L = 50; serial and 12-thread processing; times in seconds, memory (RAM) usage in GB, where G = 10^9). The basic copMEM2 architecture is compared to the old version of our software (copMEM) and then augmented with optimizations in match sorting (related to output generation), sequence search, and printing of the textual messages sent to the output. Note that all these changes improve the performance substantially only when the number of MEMs is large, e.g., for small values of L in our experiments. The first two sets of columns, "copMEM" and "copMEM2, basic", demonstrate the impact of the general changes in the (single-threaded) architecture of copMEM2. The large speedup, exceeding a factor of 2 in the hp50 experiment, is obtained mostly thanks to sorting the matches in blocks rather than as a whole; this more than offsets the extra I/O operations needed to merge the output on disk. Also, the match items to be sorted are stored more compactly in copMEM2, which helps both speed and memory usage. Changing the sorting algorithm affects the overall performance only moderately, by up to about 8%. On the other hand, the static predecessor query solution brings a bigger gain, up to about 16%, although at the cost of a slight increase in memory usage. The improvement from using the fmt library is rather small and inconsistent.
More detailed information on testing the sorting process itself is presented in Table 4. The hm50 and hp50 cases were selected because they generate the longest output lists (requiring 25 GB and 78 GB of disk space, respectively).
The compile options presented in Sec. 3 allow switching between the C++ std and radix sorting modes. It is also possible to enable sort-time logging; unfortunately, this mode incurs some overhead due to the extra instructions resuming and pausing the timers. The presented study shows that the combined radix sort algorithm is almost twice as fast as std::sort.
In all the original papers presenting the MEM-finding tools used in our comparison, the RAM consumption of a running process was measured, but not the auxiliary disk space. To our knowledge, E-MEM and copMEM are the only programs that do not use more disk space than needed to store their output files. However, E-MEM uses a less compact output format (runs of space characters instead of tabs), which sometimes translates to close to 3 times larger output files. bfMEM and copMEM2 (the latter only in the multithreaded mode) produce temporary files during their work. Table 5 presents the results of the disk occupancy measurements for the tested programs.
Both copMEM and copMEM2 run with 1 thread use only as much disk space as needed to write the output file. With two or more threads, copMEM2 needs additional space, which results from the fact that the final output file is created by joining the temporary files generated by the individual threads, and extra disk space is needed during this copying. The extra disk space can roughly be estimated as |output|/nThreads (e.g., about 8% of the output size for 12 threads), although in the unlikely worst case the overhead may be as large as |output|.
Measurements for the hm80, hm50, hp80 and hp50 test cases suggest that bfMEM needs around 2 · |output| of disk space. The corresponding disk space for copMEM2 was around 1.5 · |output| in the worst case.