AliSim-HPC: parallel sequence simulator for phylogenetics

Abstract Motivation Sequence simulation plays a vital role in phylogenetics with many applications, such as evaluating phylogenetic methods, testing hypotheses, and generating training data for machine-learning applications. We recently introduced a new simulator for multiple sequence alignments called AliSim, which outperformed existing tools. However, with the increasing demands of simulating large data sets, AliSim is still slow due to its sequential implementation; for example, to simulate millions of sequence alignments, AliSim took several days or weeks. Parallelization has been used for many phylogenetic inference methods but not yet for sequence simulation. Results This paper introduces AliSim-HPC, which, for the first time, employs high-performance computing for phylogenetic simulations. AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and message passing interface (MPI) libraries, respectively. AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large gap-free alignments (30 000 sequences of one million sites) from over one day to 11 min using 256 CPU cores from a cluster with six computing nodes, a 153-fold speedup. While the OpenMP version can only simulate gap-free alignments, the MPI version supports insertion–deletion models like the sequential AliSim. Availability and implementation AliSim-HPC is open-source and available as part of the new IQ-TREE version v2.2.3 at https://github.com/iqtree/iqtree2/releases with a user manual at http://www.iqtree.org/doc/AliSim.


Introduction
Phylogenetic inference is an important problem in bioinformatics, which aims to reconstruct a phylogenetic tree that describes the evolutionary relationship among a set of organisms (Felsenstein 2004, Lemey et al. 2009). Typical phylogenetic inference methods require a multiple sequence alignment (MSA) containing DNA or amino-acid sequences as input and return a phylogenetic tree and a substitution model as output (Fig. 1A). In a phylogenetic tree, tips (leaves) represent the organisms in the MSA, whereas internal nodes denote their extinct common ancestors. A substitution model is typically a Markov process that describes the rates of change between nucleotides (for DNA sequences) or amino acids (for protein sequences).
Sequence simulation is the inverse problem of phylogenetic inference: we want to simulate MSAs from a phylogenetic tree and a substitution model (Fig. 1B). Simulated data has many applications in phylogenetics, such as evaluating phylogenetic methods (Garland et al. 1993, Kuhner and Felsenstein 1994, Tateno et al. 1994, Huelsenbeck 1995), testing hypotheses (Goldman 1993, Adell and Dopazo 1994, Schöniger and von Haeseler 1999), and, more recently, generating data for training new machine-learning methods (Abadi et al. 2020, Leuchtenberger et al. 2020, Ling et al. 2020, Suvorov et al. 2020).
Many applications require simulations of a vast number of MSAs. To do so, users can manually parallelize the work by independently launching several AliSim jobs on a high-end server or a cluster (e.g. with sbatch), where each job simulates a subset of the MSAs. However, the RAM requirement of this manual approach grows linearly with the number of jobs, which may limit the number of jobs that can run in parallel. Moreover, this approach does not work well in the particular scenario of simulating a single extremely large MSA, because AliSim and other existing simulation software are implemented sequentially. For example, AliSim takes hours to simulate an MSA with millions of sequences or sites (Ly-Trong et al. 2022). Our main aim here is to solve both problems, i.e. not only to speed up the simulation of single large MSAs but also to reduce the memory footprint. Parallelization has been widely used for phylogenetic inference methods (Altekar et al. 2004, Bouckaert et al. 2014, Kozlov et al. 2015, 2019, Morel et al. 2019, Minh et al. 2020) but has not yet been employed in simulation software.
In this paper, we introduce AliSim-HPC, a high-performance computing version of AliSim. AliSim-HPC parallelizes the simulations at both multi-core and multi-CPU levels using OpenMP (Chapman et al. 2007) and the message passing interface (MPI) (Gropp et al. 1998), respectively. We first present two multi-threading algorithms to parallelize the simulation of a single (large) gap-free alignment with the OpenMP library. Next, we utilize the MPI library to parallelize the simulation of many alignments across distributed CPUs. We can thus deploy AliSim-HPC, which combines OpenMP and MPI, on a high-performance computing cluster with many nodes. We note that the proposed algorithms are generally applicable to shared- and distributed-memory paradigms. We chose OpenMP and MPI only because these two libraries have already been used in IQ-TREE. While the OpenMP algorithms can only simulate gap-free alignments, the MPI version fully supports insertion-deletion models (Indels) like the original AliSim.
AliSim-HPC shows excellent scaling behavior: it reduces the simulation time of 100 large alignments (30 000 sequences of one million sites) without gaps from over one day to 11 min by using 256 CPU cores (153-fold speedup). AliSim-HPC is flexible: it can run on a personal computer with multithreading, as well as on a distributed-memory cluster with many CPUs and multiple cores per CPU.
Our contributions are fourfold. First, this is the first time that high-performance computing techniques are applied to phylogenetic sequence simulators. Second, we provide AliSim-HPC as an extension of IQ-TREE (Nguyen et al. 2015, Minh et al. 2020), an open-source and widely used phylogenetic software, thus maximizing its usage and benefit to the user community. Third, we demonstrate that AliSim-HPC can efficiently simulate large genomic data sets, thus facilitating large-scale benchmarking of phylogenetic methods and providing training data for machine learning-based applications. Fourth, we provide practical recommendations on the choice of the number of threads per process and of the multi-threading algorithm for simulating large genomic data sets.

The sequential AliSim algorithm
Here, we briefly summarize the published AliSim algorithm (Ly-Trong et al. 2022). Assume that we want to simulate an alignment with N sequences, each of which contains L sites, from a phylogenetic tree T and a substitution model M. Let $S_j$ denote the sequence at node j of tree T. AliSim first generates a sequence with L sites at the root of the tree based on the state frequencies of model M (Fig. 2). Then, AliSim traverses tree T in a preorder manner to simulate a new sequence at each node based on the sequence of its parent node and the substitution model M (Ly-Trong et al. 2022). At tips, AliSim writes the simulated sequences to an MSA file. To generate many alignments, AliSim repeats this process sequentially. We present the sequential AliSim in Algorithm 1.
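The traversal described above can be illustrated with a minimal Python sketch. It is not the AliSim implementation: the toy tree `TREE`, the helper `evolve` (which simply redraws a site with probability equal to the branch length), and the uniform state frequencies are all assumptions made for illustration, and the "MSA file" is a dictionary.

```python
import random

STATES = "ACGT"

# Hypothetical toy tree: internal node -> list of (child, branch length).
# Node 4 is the root; nodes 0, 1, and 2 are tips.
TREE = {4: [(3, 0.1), (2, 0.2)], 3: [(0, 0.1), (1, 0.3)]}


def evolve(parent_seq, branch_length, rng):
    """Toy stand-in for a substitution model (an assumption, not GTR etc.):
    each site is redrawn uniformly with probability min(branch_length, 1)."""
    p_change = min(branch_length, 1.0)
    return "".join(
        rng.choice(STATES) if rng.random() < p_change else state
        for state in parent_seq
    )


def sequential_sim(tree, root, length, rng):
    """Simulate sequences along the tree in preorder; return the tip sequences."""
    alignment = {}

    def visit(node, seq):
        children = tree.get(node, [])
        if not children:
            alignment[node] = seq  # tip: AliSim would write this sequence to the MSA file
            return
        for child, branch_length in children:
            # each child sequence is derived from its parent's sequence
            visit(child, evolve(seq, branch_length, rng))

    # root sequence drawn from (here: uniform) state frequencies
    root_seq = "".join(rng.choice(STATES) for _ in range(length))
    visit(root, root_seq)
    return alignment
```

Calling `sequential_sim(TREE, 4, 50, random.Random(1))` returns one sequence of 50 sites for each of the three tips; passing the same seed again reproduces the same alignment.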
Thanks to a memory-saving technique (Ly-Trong et al. 2022), the memory complexity of the sequential AliSim algorithm is $O(N) + O(D \cdot L)$, where D is the depth of tree T. The first term, $O(N)$, and the second term, $O(D \cdot L)$, represent the memory needed to store the tree structure and the simulated sequences, respectively.

AliSim-OpenMP
We now introduce two different algorithms to parallelize AliSim using OpenMP, both of which assume no insertions or deletions (Indels). Without Indels, the sites in the MSA evolve independently, which naturally allows a parallel scheme for simulating an alignment with OpenMP (Fig. 3): each thread independently simulates a contiguous block of the MSA, and all blocks have roughly the same length. Given K threads, each thread i simply executes SequentialAliSim(T, M, $L/K$, $F_i$) to generate a temporary file $F_i$ containing N sequences of length $L/K$. Next, we need to combine the individual $F_i$ files into a single alignment file F. AliSim invokes another parallel section, where each thread reads a subset of roughly $N/K$ sequences across all temporary files, concatenates the subsequences into the full sequences of length L, and then writes the full concatenated sequences to file F in a critical OpenMP section because file writing operations are not thread-safe. This algorithm is called "AliSim-OpenMP using external memory" because it creates temporary files to store intermediate alignments; it is outlined in Algorithm 2.
Figure 3. Illustration of the AliSim-OpenMP algorithm using external memory to simulate an alignment without gaps with three threads. These threads execute the sequential AliSim algorithm independently to generate three temporary files $F_1$, $F_2$, and $F_3$; each file contains seven subsequences of length $L/3$. Then, thread 1 reconstructs two sequences $S_0$ and $S_1$ by concatenating their subsequences from all three temporary files. At the same time, thread 2 reconstructs $S_2$ and $S_3$, while thread 3 reconstructs $S_4$, $S_5$, and $S_6$. The concatenated sequences are written one by one to the output alignment.
Algorithm 2 has the same memory complexity of $O(N) + O(D \cdot L)$ as the sequential AliSim algorithm, where D is the depth of tree T. However, it consumes double the amount of external memory to store temporary files.
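The two phases of the external-memory scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: `simulate_block` is a hypothetical placeholder for SequentialAliSim that writes one letter per thread instead of simulated sites, Python threads stand in for OpenMP threads, and the concatenation phase is done serially here, whereas the text parallelizes it over tips with a critical section for writing.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def block_lengths(L, K):
    """Split L sites into K contiguous blocks whose lengths differ by at most one."""
    base, extra = divmod(L, K)
    return [base + (1 if i < extra else 0) for i in range(K)]


def simulate_block(path, n_tips, length, thread_id):
    """Hypothetical stand-in for SequentialAliSim: one line per tip in a temporary file."""
    with open(path, "w") as fh:
        for _ in range(n_tips):
            # placeholder content: a letter identifying the producing thread
            fh.write(chr(ord("A") + thread_id) * length + "\n")


def openmp_em_sketch(n_tips, L, K, out_path):
    lengths = block_lengths(L, K)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"block_{i}.txt") for i in range(K)]
        # Phase 1: every worker simulates its own block, with no communication.
        with ThreadPoolExecutor(max_workers=K) as pool:
            list(pool.map(simulate_block, paths, [n_tips] * K, lengths, range(K)))
        # Phase 2: stitch the K subsequences of each tip into one full-length row.
        handles = [open(p) for p in paths]
        try:
            with open(out_path, "w") as out:
                for _ in range(n_tips):
                    row = "".join(h.readline().rstrip("\n") for h in handles)
                    out.write(row + "\n")
        finally:
            for h in handles:
                h.close()
```

With seven tips, ten sites, and three workers, every output row is the concatenation of a 4-site, a 3-site, and a 3-site block. The sketch also makes the cost of the second phase visible: every site is written once to a temporary file and then read and written again.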
Algorithm 2 contains two parallel sections. The first section is embarrassingly parallel without any inter-thread communication; thus, we expect this section to gain linear speedup. However, the second section can be the main bottleneck because of its many I/O operations. A quick solution to this problem would be to re-implement $F_i$ as internal memory storage. However, this requires additional memory of $O(N \cdot L)$, which by far exceeds $O(D \cdot L)$ and is therefore undesirable for large alignment simulations. Therefore, we designed another algorithm called "AliSim-OpenMP using internal memory" (Fig. 4) to avoid writing temporary files, as follows. We allocate $K-1$ threads, each of which simulates one of the $K-1$ blocks of the MSA by calling a modified version of SequentialAliSim(T, M, $L/(K-1)$, $F_i$), where $F_i$ is now redesigned as a buffer in the internal memory. The last thread K is dedicated solely to writing the buffers into the output file F.
To reduce the RAM consumption, each thread does not store all N subsequences in its buffer $F_i$, but only a fraction $N \cdot k$ of the subsequences, where k is a parameter between 0 and 1. Whenever a "simulating" thread i (from 1 to $K-1$) has simulated a subsequence $S_j^i$ at tip j, that thread computes the absolute position of $S_j^i$ in the output file F, then stores $S_j^i$ and its position in $F_i$ if $F_i$ has some free memory. Otherwise, thread i needs to wait until space in $F_i$ becomes available. The I/O thread K continuously monitors the buffers. Whenever there is a subsequence $S_j^i$ stored in a buffer $F_i$, it jumps to the precomputed position of $S_j^i$ in file F, writes $S_j^i$, and then frees the corresponding memory in $F_i$. The I/O thread then checks the next buffer in a round-robin fashion (i.e. it moves from $F_i$ to $F_{i \bmod (K-1) + 1}$, as in Algorithm 3). This ensures a relative balance in memory availability among the buffers. This algorithm is outlined in Algorithm 3. The memory complexity of this algorithm is $O(N) + O(D \cdot L + N \cdot k \cdot L)$. A small k increases the waiting time of the "simulating" threads, thus potentially increasing the runtime, whereas a large k increases the RAM consumption. To balance the trade-off between runtime and RAM consumption, we set the default k to $\min\left(\frac{(K-1) \cdot 2}{N}, 1\right)$ because with more threads (higher K), each thread simulates shorter subsequences and thus produces them faster than with fewer threads; therefore, a larger buffer size is needed.
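As a small worked example of the two formulas in this paragraph, the following Python sketch computes the default buffer fraction and the round-robin index update. The function names are hypothetical and chosen only for illustration.

```python
def default_buffer_fraction(n_tips, n_threads):
    """Default k = min(2 * (K - 1) / N, 1), as stated in the text."""
    return min(2 * (n_threads - 1) / n_tips, 1.0)


def next_buffer(i, n_threads):
    """Round-robin update over the K - 1 simulating threads (1-based indices)."""
    return i % (n_threads - 1) + 1


# For the long alignment (N = 30 000 tips) and K = 8 threads, each buffer holds
# N * k = 2 * (K - 1) = 14 subsequences at a time.
k = default_buffer_fraction(30_000, 8)
slots_per_buffer = round(30_000 * k)
```

With K = 4 threads, `next_buffer` cycles through the three simulating threads in the order 1, 2, 3, 1, and so on. When N is smaller than 2(K − 1), the fraction is capped at 1, so each buffer can hold all N subsequences.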
The two AliSim-OpenMP algorithms introduced above have their own advantages and disadvantages depending on the simulating conditions, but they complement each other.
The design of AliSim-OpenMP is based on the assumption that different sites in the alignment evolve independently. However, this assumption does not hold for some advanced models, such as Indels. In the following, we introduce AliSim-MPI, which can simulate many alignments with Indels.

AliSim-MPI
A practical demand is to simulate many alignments. Naively, users can manually run several AliSim jobs on a cluster, each simulating one alignment. To make this more convenient, we developed AliSim-MPI, which automatically distributes this task on a cluster within a single run. Specifically, to simulate H alignments using P MPI processes, AliSim-MPI simulates roughly $\lceil H/P \rceil$ alignments per process. These processes perform simulations independently and write separate alignment files. No communication is needed between the processes. The memory complexity is proportional to the number of processes: $O(P \cdot N) + O(P \cdot D \cdot L)$. Unlike AliSim-OpenMP, this algorithm fully supports Indels, like the sequential version of AliSim.
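One way to split H alignments among P processes so that no process receives more than ⌈H/P⌉ of them is a strided assignment, sketched below in Python. The text does not state which assignment AliSim-MPI uses, so the striding, the 0-based `rank`, and the function name are assumptions for illustration only.

```python
import math


def alignments_for_rank(H, P, rank):
    """Return the 1-based alignment indices handled by process `rank` (0-based).

    Strided assignment (an assumption): rank r takes alignments r+1, r+1+P, ...
    """
    return list(range(rank + 1, H + 1, P))


# Example from the benchmark: H = 100 alignments on P = 32 processes.
shares = [alignments_for_rank(100, 32, r) for r in range(32)]
largest_share = max(len(s) for s in shares)  # equals ceil(100 / 32)
```

Because each process works only on its own indices and writes its own files, no communication between processes is needed, which matches the description above.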

AliSim-HPC for high-performance computing systems
We now combine AliSim-OpenMP and AliSim-MPI to enable simulations on a large cluster with P processes, each having K threads (i.e. a total of $P \cdot K$ threads run in parallel). Note that AliSim-HPC does not support Indels if K > 1. The AliSim-HPC algorithm is outlined in Algorithm 4. We hereafter refer to the two variants of AliSim-HPC that integrate AliSim-OpenMP using external memory and internal memory as AliSim-HPC-EM and AliSim-HPC-IM, respectively.

Random generator initialization
Reproducibility is strongly desirable in any software that involves random number generation. To ensure this, AliSim-HPC employs the Scalable Parallel Random Number Generators Library (SPRNG) (Mascagni and Srinivasan 2000) and allows users to specify a random number generator seed r. It then computes a unique seed number for each thread of each process as $r + p \cdot 1000 + k$, where p and k denote the process and thread IDs, respectively. If r is not provided, it is set to the current microsecond of the CPU.
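The seed formula can be checked with a short Python sketch. The function name and the 0-based numbering of process and thread IDs are assumptions; the text only gives the formula itself.

```python
def thread_seed(r, p, k):
    """Seed for thread k of process p, given the user seed r: r + p * 1000 + k."""
    return r + p * 1000 + k


# All (process, thread) pairs of the largest benchmark grid (32 x 32) get distinct seeds.
seeds = {thread_seed(12345, p, k) for p in range(32) for k in range(32)}
```

The seeds are distinct as long as thread IDs stay below 1000, since two different (p, k) pairs can then never yield the same value of p · 1000 + k.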

Benchmark setup
We evaluated the performance of the two variants of AliSim-HPC (using external and internal memory) compared with the sequential AliSim on the Gadi supercomputer (https://nci.org.au/our-systems/hpc-systems), a cluster of 3200 nodes with a total of 155 000 CPU cores, 567 TB of RAM, and 640 GPUs. We also benchmarked the manual approach, where users submit several sequential AliSim jobs. We employed up to six computing nodes, each of which has 2 × 24-core Intel

Algorithm 2. AliSimOpenMP_EM(T, M, L, F, K)
Input: a phylogenetic tree T with N tips; a substitution model M; a sequence length L; a number of threads K.
Output: a gap-free alignment file F containing N sequences of length L.
1. For each thread i from 1 to K: call SequentialAliSim(T, M, $L/K$, $F_i$) to generate a temporary alignment file $F_i$.
2. For each thread i from 1 to K, for each tip index j from $(i-1) \cdot N/K + 1$ to $i \cdot N/K$:
2.1. Read the subsequences $S_j^i$ from all temporary files $F_i$, where $i \in \{1, 2, \ldots, K\}$.
2.2. Concatenate these subsequences $S_j^i$ into the full-length sequence $S_j$.
2.3. Write $S_j$ to file F in a critical OpenMP section.
We measured the strong scaling behavior and total RAM consumption when simulating 100 large alignments without Indels. Due to computational resource constraints, we set the maximum number of processes, number of threads per process, and number of jobs at 32. By varying the number of processes P and the number of threads per process K at 1, 2, 4, 8, 16, and 32, we formed a total of 33 combinations, such that the total number of CPU cores (the number of processes times the number of threads) was up to 256. Similarly, we varied the number of jobs at 1, 2, 4, 8, 16, and 32. We simulated two types of large alignments: 1M sequences of 30K sites, which we call a "deep alignment"; and 30K sequences of 1M sites, which we call a "long alignment". The input phylogenetic trees were randomly drawn under the Yule-Harding model (Yule 1925, Harding 1971) with exponentially distributed branch lengths with a mean of 0.1. As the model of evolution, we applied the general time reversible (GTR) model (Tavaré 1986) with an invariant site proportion of 0.2 and discrete Gamma rate heterogeneity (Gu et al. 1995) with a Gamma shape of 0.5.
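The count of 33 combinations follows directly from the 256-core cap, as the following short Python enumeration shows (variable names are illustrative only).

```python
levels = [1, 2, 4, 8, 16, 32]

# All (processes, threads per process) pairs whose product stays within 256 cores.
combos = [(p, k) for p in levels for k in levels if p * k <= 256]

# The pairs that the cap removes from the full 6 x 6 grid.
excluded = [(p, k) for p in levels for k in levels if p * k > 256]
```

Of the 36 possible pairs, only 16 × 32, 32 × 16, and 32 × 32 exceed 256 cores, which leaves 33.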

Two AliSim-OpenMP algorithms complemented each other
We first benchmark the pure multi-threading algorithms (AliSim-OpenMP) without multi-processing. Figure 5 shows the performance of the two AliSim-OpenMP algorithms using internal (IM) and external memory (EM). For long (gap-free) alignment simulations, when the number of threads is 8 or more, AliSim-OpenMP-IM obtained 5.9- to 9.2-fold speedups (compared with the sequential AliSim), while the speedups for AliSim-OpenMP-EM were slightly lower at 5.2- to 8.8-fold (Fig. 5A). In contrast, for deep (gap-free) alignment simulations, AliSim-OpenMP-EM obtained higher speedups (from 1.6- to 8.7-fold) than the IM variant (from 1.1- to 2.8-fold) (Fig. 5C). This is because the IM algorithm maintains an internal memory space shared between threads (Fig. 4), which may cause a bottleneck when many threads are waiting to write to this shared memory; e.g. the speedup for 8 and 16 threads is worse than for four threads (Fig. 5C). Therefore, the two versions complemented each other.
Regarding the RAM consumption, our AliSim-OpenMP algorithms consumed approximately 0.3 GB and 1.4 GB RAM to simulate long and deep alignments, respectively (Fig. 5B and D). Because of the large number of nodes N in the phylogenetic tree, simulating deep alignments consumed more RAM than simulating long alignments (see Section 2). Moreover, with the default setting of k (see Section 2), the number of threads had little effect on the memory consumption.

AliSim-MPI obtained high parallel efficiency
Next, we benchmark the pure multi-processing AliSim-MPI version, where each process is single-threaded. AliSim-MPI and the manual approach performed equally well, achieving almost linear speedup with high parallel efficiency (Fig. 5A and C). With 32 CPU processes or jobs, they achieved roughly 24× and 26× speedups (73% and 81% parallel efficiencies) for long- and deep-alignment simulations, respectively.
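Parallel efficiency is the speedup divided by the number of cores used. The following Python sketch relates the rounded speedups and efficiencies quoted above; the function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / cores


def speedup_from_efficiency(efficiency, cores):
    """Inverse relation: the speedup implied by a given efficiency."""
    return efficiency * cores


# 73% and 81% efficiency on 32 processes correspond to speedups of
# about 23.4 and 25.9, i.e. "roughly 24x and 26x".
long_speedup = speedup_from_efficiency(0.73, 32)
deep_speedup = speedup_from_efficiency(0.81, 32)
```

An efficiency of 1.0 would mean perfectly linear scaling, i.e. a 32-fold speedup on 32 processes.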
The RAM consumption grew, as expected, proportionally with the increasing number of processes (Fig. 5B and D).AliSim-MPI required 0.3-13 GB, and 1.3-47 GB RAM to simulate long and deep alignments, respectively, which were 0.2-4 GB more RAM than the manual approach.

AliSim-HPC achieved excellent strong scaling behavior
We now benchmark AliSim-HPC, which combines the benefits of AliSim-OpenMP (low RAM consumption) and AliSim-MPI (excellent speedups). Figure 6 illustrates the performance of the two variants AliSim-HPC-EM and AliSim-HPC-IM, using external and internal memory, respectively. For long-alignment simulations, while both variants achieved excellent strong scaling when increasing the total number of CPU cores ($P \cdot K$), AliSim-HPC-IM (Fig. 6B) obtained higher speedups than AliSim-HPC-EM (Fig. 6A). For example, AliSim-HPC-IM reached a 153-fold speedup using 32 processes × 8 threads (Fig. 6B), but AliSim-HPC-EM only reached a 91-fold speedup. In fact, AliSim-HPC-EM achieved a 182-fold speedup in step 1 (of Algorithm 2) compared with the sequential AliSim algorithm. However, AliSim-HPC-EM required an additional phase (step 2 in Algorithm 2) to concatenate temporary files, which took approximately as long as step 1, thus reducing the overall performance of that algorithm.
For simulating deep alignments, AliSim-HPC-EM (Fig. 6C) often outperformed AliSim-HPC-IM (Fig. 6D). AliSim-HPC-EM obtained a 93-fold speedup compared with only a 35-fold speedup of AliSim-HPC-IM for 32 processes × 8 threads. Interestingly, however, AliSim-HPC-IM with four threads per process performed better than the EM variant, obtaining a 73-fold speedup for 32 processes × 4 threads compared with a 58-fold speedup for AliSim-HPC-EM. Unfortunately, for this setting, we could not run our tests with a higher number of processes due to excessive memory requirements.
In summary, both versions of AliSim-HPC achieved excellent scaling behavior. The best setting of AliSim-HPC-EM and AliSim-HPC-IM reduced the wall-clock time from over 1 day to 11 and 21 min for simulating long and deep gap-free alignments, respectively.

The RAM consumption of AliSim-HPC increased with the number of processes
Figure 7 shows the memory footprint of AliSim-HPC-EM and AliSim-HPC-IM. As expected, the RAM consumption of the two variants increased with the number of processes P.
The memory footprints of the two variants of AliSim-HPC were almost identical. When increasing the number of processes from 1 to 32, they consumed 0.3-13.4 GB (Fig. 7A and B) and 1.3-47.4 GB RAM (Fig. 7C and D) to simulate long and deep alignments, respectively. With the same number of processes, the two variants of AliSim-HPC consumed the same amount of RAM as AliSim-MPI.
Algorithm 3. AliSimOpenMP_IM(T, M, L, F, K)
Input: a phylogenetic tree T with N tips; a substitution model M; a sequence length L; a number of threads K.
Output: a gap-free alignment file F containing N sequences of length L.
For each thread i from 1 to $K-1$: call the modified SequentialAliSim(T, M, $L/(K-1)$, $F_i$) to simulate subsequences, and store those subsequences and their absolute positions in the alignment file F into buffer $F_i$.
For thread K:
1. Initialize $i = 1$.
2. While any of the previous threads is not finished:
   Find a subsequence $S_j^i$ in $F_i$.
   Jump to the precomputed position in file F, then write $S_j^i$.
   Free the corresponding memory in $F_i$.
   Set $i = i \bmod (K-1) + 1$.
3. When all previous threads are finished: write all $F_i$ to file F for all i from 1 to $K-1$.
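The producer and consumer roles in Algorithm 3 can be mimicked with bounded queues in Python. This is a sketch under several assumptions rather than the OpenMP implementation: Python threads and `queue.Queue` stand in for OpenMP threads and the buffers $F_i$, a `bytearray` stands in for the output file F, the simulated content is a placeholder letter per thread, and `capacity` plays the role of $N \cdot k$.

```python
import queue
import threading


def openmp_im_sketch(n_tips, L, K, capacity=2):
    """K - 1 simulating threads fill bounded buffers; one I/O thread drains them."""
    n_sim = K - 1
    base, extra = divmod(L, n_sim)
    lengths = [base + (1 if i < extra else 0) for i in range(n_sim)]
    starts = [sum(lengths[:i]) for i in range(n_sim)]
    out = bytearray(n_tips * L)  # stands in for the output file F
    buffers = [queue.Queue(maxsize=capacity) for _ in range(n_sim)]
    done = [threading.Event() for _ in range(n_sim)]

    def simulate(i):
        for j in range(n_tips):
            sub = bytes([ord("A") + i]) * lengths[i]  # placeholder subsequence
            pos = j * L + starts[i]                   # absolute position in F
            buffers[i].put((pos, sub))                # blocks while buffer i is full
        done[i].set()

    def write_out():
        i = 0
        # Keep going until every simulating thread is finished and drained.
        while not all(d.is_set() and b.empty() for d, b in zip(done, buffers)):
            try:
                pos, sub = buffers[i].get_nowait()
                out[pos:pos + len(sub)] = sub         # "jump" to pos and write
            except queue.Empty:
                pass
            i = (i + 1) % n_sim                       # round-robin over buffers

    threads = [threading.Thread(target=simulate, args=(i,)) for i in range(n_sim)]
    threads.append(threading.Thread(target=write_out))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(out)
```

With seven tips, ten sites, and K = 3, the two simulating threads each contribute five sites per row, and the writer places every piece at its precomputed offset regardless of arrival order. The sketch folds step 3 of Algorithm 3 into the same loop, and the single writer is the point where contention appears when subsequences are short and numerous.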
Algorithm 4. AliSimHPC(T, M, L, H, P, K)
Input: a phylogenetic tree T with N tips; a substitution model M; a sequence length L; a number of alignments H; a number of processes P; a number of threads per process K.
Output: H alignment files $F_z$, where $z \in \{1, 2, \ldots, H\}$; each alignment file contains N sequences of length L.
For each process w from 1 to P:
   For z from 1 to H:
      Call AliSimOpenMP_EM(T, M, L, $F_z$, K) or AliSimOpenMP_IM(T, M, L, $F_z$, K).
This paper introduces AliSim-HPC, a high-performance sequence simulator for phylogenetics. We present two multithreading algorithms to simulate a single large gap-free alignment with OpenMP, and an embarrassingly parallel scheme to simulate many alignments (with/without gaps) with MPI on a distributed-memory system. In the future, we would also like to extend AliSim-HPC to employ Single Instruction Multiple Data (SIMD) (Cardoso et al. 2017) and GPU-based parallelization.
With an appropriate setting of the number of threads per process, AliSim-HPC is highly efficient because it involves minimal inter-thread and no inter-process communications. Strong scaling experiments also showed that AliSim-HPC is scalable: we obtained a 153-fold speedup when employing 256 CPU cores (32 MPI processes, each with eight threads) to simulate 100 large alignments without Indels; further speedup using more cores is achievable since it is not yet saturated (Fig. 6).
The pure MPI implementation allows users to simulate many MSAs within a single run, which is more convenient than the manual approach of launching several AliSim jobs. The pure OpenMP implementation, in turn, allows us to save memory, which is useful for machines with limited RAM. AliSim-HPC combines OpenMP with MPI, giving the further advantage of optimizing both runtime and memory usage. To illustrate this, we ran the manual approach and different variants of AliSim-HPC on a machine with 32 cores available. Table 1 shows the runtimes and peak memory usage of these approaches when simulating 100 long alignments. The manual approach and AliSim-MPI are equally fast, but the manual approach is more memory-efficient, requiring 8.5 GB instead of 12.6 GB (the increase in RAM usage is purely due to the MPI library). AliSim-HPC offers many options to combine processes and threads per process. Among these options, AliSim-HPC using four processes with eight threads each is arguably the best option: it only increases the runtime from 71 to 74 min while significantly reducing the RAM usage from 8.5 GB to 1.6 GB. AliSim-OpenMP (1 process × 32 threads) further reduces the memory usage to just 0.3 GB but with the trade-off of increasing the runtime to 182 min. It would be interesting to design a mechanism that automatically determines the best number of processes and threads to balance the trade-off between time and memory. That depends on many factors, including the size of the MSA (e.g. long versus deep) and the machine/cluster configuration, and is thus beyond the scope of our study.
The performance of the two AliSim-HPC variants greatly depends on the sequence length. Based on the experimental results, we recommend applying AliSim-HPC-IM for simulating long (gap-free) alignments. In contrast, to simulate short and moderate sequences (e.g. 30K sites) without gaps, AliSim-HPC-EM is preferable. However, determining whether an alignment is long or short is subjective, and the performance of our algorithms also relies on the hardware (e.g. processor, SSD/HDD storage). Therefore, our future work also includes designing a mechanism to automatically select the best algorithm on the fly.
The OpenMP version has the limitation of not being able to simulate insertions and deletions. To alleviate this limitation, an alternative parallel strategy is to perform a level-order traversal of the tree so that sequences at same-depth nodes can be simulated simultaneously; each thread simulates a full-length sequence at a node. Besides, to simulate many alignments where the number of alignments is much greater than the number of CPU cores, we can also extend our AliSim-OpenMP algorithms so that each thread can simulate entire alignments independently to avoid the writing bottleneck.
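The level-order idea can be illustrated by grouping the nodes of a tree by depth; all nodes in one group depend only on sequences from the previous group and could, in principle, be handed to different threads. The Python sketch below uses a hypothetical toy tree and is not part of AliSim-HPC, which does not implement this strategy.

```python
def nodes_by_depth(tree, root):
    """Group nodes by depth with a breadth-first (level-order) traversal."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        # the children of the current level form the next level
        frontier = [child for node in frontier for child in tree.get(node, [])]
    return levels


# Hypothetical toy tree: internal node -> children; node 4 is the root.
toy_tree = {4: [3, 2], 3: [0, 1]}
levels = nodes_by_depth(toy_tree, 4)  # [[4], [3, 2], [0, 1]]
```

In this example, nodes 3 and 2 could be simulated concurrently once the root sequence exists, followed by tips 0 and 1. The available parallelism at each step is bounded by the number of nodes at that depth.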
Finally, the I/O operations are currently the bottleneck of our AliSim-OpenMP algorithms, which explains their far-from-perfect speedups (Fig. 5). Future work would employ parallel I/O techniques that will remove this bottleneck and make AliSim-HPC even more efficient for much larger-scale simulations.

Figure 1. Phylogenetic inference (A) to infer a phylogenetic tree and a substitution model from an input multiple sequence alignment; and sequence simulation (B) to generate multiple sequence alignment(s) from a phylogenetic tree and a substitution model.

Figure 2. Illustration of the sequential AliSim algorithm to simulate a single alignment along an example tree. It starts by generating a sequence $S_7$ at the root node 7 and subsequently performs a preorder traversal visiting nodes 8, 9, 10, 0, 1, 2, 11, 3, 4, 12, 5, and 6. AliSim simulates $S_8$ based on $S_7$ and the substitution model M, and so on. At tip nodes 0 to 6, it writes $S_0$ to $S_6$ to an output alignment.

Figure 4. Illustration of the AliSim-OpenMP algorithm using internal memory to simulate an alignment without gaps with three threads. Threads 1 and 2 execute a modified version of the sequential AliSim algorithm independently to simulate subsequences of $S_0$ to $S_6$ of length $L/2$ and store these subsequences in the corresponding buffers $F_1$ and $F_2$. Thread 3 continuously accesses buffers $F_1$ and $F_2$ to write subsequences into their precomputed positions in the output alignment F and then frees the corresponding memory in the buffers. Thread 3 repeats that process until threads 1 and 2 are finished and all subsequences have been written to the output alignment.

Figure 5. Strong scaling and peak RAM consumption of the two AliSim-OpenMP algorithms, AliSim-MPI, and the manual approach (running multiple jobs) for long-alignment [30K sequences × 1M sites (A, B)] and deep-alignment [1M sequences × 30K sites (C, D)] simulations without Indels. In deep-alignment simulations, AliSim-OpenMP-IM using 32 threads per process took an excessively long runtime; thus, we skipped that test to save computational resources.

Figure 6. Strong scaling of AliSim-HPC-EM and AliSim-HPC-IM in long-alignment (A, B) and deep-alignment (C, D) simulations without Indels. In deep-alignment simulations, the curve of AliSim-HPC-IM using 32 threads per process is missing since these tests took an excessively long runtime; thus, we skipped them to save computational resources.

Table 1. Runtimes and memory consumption of the manual approach and different AliSim-HPC settings when simulating 100 long (30K sequences × 1M sites) gap-free alignments on a computer with 32 cores available.