Accelerated FDPS: Algorithms to use accelerators with FDPS

We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator; however, the low communication bandwidth between CPUs and accelerators limited the achievable performance. The new algorithms described in this paper, indirect addressing of particles and reuse of interaction lists, greatly reduce the amount of data transferred between CPUs and accelerators, so that FDPS can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.


Introduction
In this paper we describe new algorithms implemented in FDPS (Framework for Developing Particle Simulators: Iwasawa et al. 2016; Namekata et al. 2018) to make efficient use of accelerators such as GPGPUs (general-purpose computing on graphics processing units). FDPS is designed to make it easy for researchers to develop their own programs for particle-based simulations. Developing an efficient parallel program for particle-based simulations requires a very large amount of work, comparable to the effort of a large team of people over many years. This is of course true not only for particle-based simulations, but for any large-scale parallel application in computational science. The main cause of this problem is that modern high-performance computing (HPC) platforms have become very complex, requiring a lot of effort to develop complex programs that make efficient use of such platforms.
Typical modern HPC systems are actually a cluster of computing nodes connected through a network, each with typically one or two processor chips. The largest systems at present consist of around 10^5 nodes, and we will see even larger systems soon. This extremely large number of nodes has made the design of the inter-node network very difficult, and the design of parallel algorithms has also become much harder. The calculation times of the nodes must be accurately balanced. The time necessary for communication must be small enough that the use of large systems is meaningful. The communication bandwidth between nodes is much lower than the main memory bandwidth, which itself is very small compared to the calculation speed of CPUs. Thus, it is crucial to avoid communications as much as possible. The calculation time can show an increase, instead of decreasing as it should, when we use a large number of nodes, unless we are careful to achieve good load balance between nodes and to minimize communication.
In addition, the programming environments available on present-day parallel systems are not easy to use. What is most widely used is MPI (Message-Passing Interface), which requires one to write explicitly how each node communicates with others in the system. Just to write and debug such a program is difficult, and it has become nearly impossible for any single person or even for a small group of people to develop large-scale simulation programs which run efficiently on modern HPC systems.
Moreover, this extremely large number of nodes is just one of the many difficulties of using modern HPC systems, since even within one node there are many levels of parallelism to be taken care of by the programmer. To make matters even more complicated, these multiple levels of parallelism are interwoven with multiple levels of memory hierarchy with varying bandwidth and latency. For example, the supercomputer Fugaku, which is under development in Japan at the time of writing, will have 48 CPU cores in one chip. These 48 cores are divided into four groups, each with 12 cores. Cores in one group share one level-2 cache memory. The cache memories in different groups communicate with each other through a cache coherency protocol. Thus, the access of one core to data that happen to be in its own level-2 cache is fast, but access to data in the cache of another group can be very slow. Also, access to the main memory is much slower, and access to the local level-1 cache is much faster. Thus, we need to take into account the number of cores and the size and speed of the caches at each level to achieve acceptable performance. To make matters even worse, many modern microprocessors have level-3 and even level-4 caches.
As a result of these difficulties, only a small number of researchers (or groups of researchers) can develop their own simulation programs. In the case of cosmological and galactic N-body and smoothed particle hydrodynamics (SPH) simulations, Gadget (Springel 2005) and pkdgrav (Stadel 2001) are the most widely used. For star cluster simulations, NBODY6++ and NBODY6++GPU (Nitadori & Aarseth 2012) are effectively the standard. For planetary ring dynamics, REBOUND (Rein & Liu 2012) has been available. There has been no public code for simulations of the planetary formation process until recently.
This situation is clearly unhealthy. In many cases the physics that needs to be modeled is quite simple: particles interact through gravity, and with some other interactions such as physical collisions. Even so, almost all researchers are now forced to use existing programs developed by someone else, simply because HPC platforms have become too difficult to use. To add new functionality to existing programs can be very difficult. In order to make it possible for researchers to develop their own parallel code for particle-based simulations, we have developed FDPS (Iwasawa et al. 2016).
The basic idea of FDPS is to separate the code for parallelization and that for interaction calculation and numerical integration. FDPS provides the library functions necessary for parallelization, and using them researchers write programs very similar to what they would write for a single CPU. Parallelization on multiple nodes and on multiple cores in a single node are taken care of by FDPS.
FDPS provides three sets of functions. One is for domain decomposition. Given the data of particles in each node, FDPS performs the decomposition of the computational domain. The decomposed domains are assigned to MPI processes. The second set of functions is to let MPI processes exchange particles. Each particle should be sent to the appropriate MPI process. The third set of functions perform the interaction calculation. FDPS uses a parallel version of the Barnes-Hut algorithm for both long-range interactions such as gravitational interactions and short-range interactions such as intermolecular force or fluid interaction. The application program supplies the function to perform interaction calculations for two groups of particles (one group exerting forces on the other), and FDPS calculates the interaction using that function.
FDPS offers very good performance on large-scale parallel systems consisting of "homogeneous" multi-core processors, such as the K computer and Cray systems based on x86 processors. On the other hand, the architecture of large-scale HPC systems is moving from homogeneous multi-core processors to accelerator-based systems and heterogeneous multi-core processors.
GPGPUs are the most widely used accelerators, and are available on many large-scale systems. They offer price-performance ratios and performance-per-watt numbers significantly better than those of homogeneous systems, primarily by integrating a large number of relatively simple processors on a single accelerator chip. On the other hand, accelerator-based systems have two problems. One is that, for many applications, the communication bandwidth between CPUs and accelerators becomes the bottleneck. The second is that because CPUs and accelerators have separate memory spaces, the programming is complicated and we cannot use existing programs.
Though in general it is difficult to use accelerators, for particle-based simulations the efficient use of accelerators is not so difficult, and that is why the GRAPE families of accelerators specialized for gravitational N-body simulations have been successful (Makino et al. 2003). GPGPUs are also widely used both for collisional (Gaburov et al. 2009) and collisionless (Bédorf et al. 2012) gravitational N-body simulations. Thus, it is clearly desirable for FDPS to support accelerator-based architectures.
Though gravitational N-body simulation programs have achieved very good performance on large clusters of GPGPUs, achieving high efficiency for particle systems with short-range interactions is difficult. For example, there exist many high-performance implementations of SPH algorithms on a single GPGPU, or on a relatively small number of multiple GPGPUs (around six), but there are not many high-performance SPH programs for large-scale parallel GPGPU systems. Practically all the efficient GPGPU implementations of the SPH algorithm use GPGPUs to run the entire simulation code, in order to eliminate the communication overhead between GPGPU and CPU. The calculation cost of particle-particle interactions dominates the total calculation cost of SPH simulations. Thus, as far as the calculation cost is concerned, it is sufficient to let GPGPUs evaluate the interactions, and let CPUs perform the rest of the calculation. However, because of the relatively low communication bandwidth between CPUs and GPGPUs we need to avoid data transfer between them, and if we let GPGPUs do all the calculations it is clear that we can minimize the communication.
On the other hand, it is more difficult to develop programs for GPGPUs than for CPUs, and to develop MPI parallel programs for multiple GPGPUs is clearly even more difficult. To make such an MPI parallel program for GPGPUs run on a large cluster is close to impossible.
In order to add support for GPGPUs and other accelerators to FDPS, we decided to take a different approach. We kept the simple model in which accelerators do the interaction calculation only, and CPUs do all the rest. However, we try to minimize the communication between CPUs and accelerators as much as possible, without making the calculation on the accelerator side very complicated.
In this paper we describe our strategy of using accelerators, how application programmers can use FDPS to efficiently use accelerators, and the performance achieved. The paper is organized as follows: In section 2 we present an overview of FDPS. In section 3 we discuss the traditional approach of using accelerators for interaction calculation, and its limitations. In section 4 we summarize our new approach. In section 5 we show how users can use new APIs of FDPS to make use of accelerators. In section 6 we give the results of performance measurement for GPGPU-based systems, along with a performance prediction for hypothetical systems. Section 7 provides a summary and discussion.

Overview of FDPS
The basic idea (or the ultimate goal) of FDPS is to make it possible for researchers to develop their own high-performance, highly parallel particle-based simulation programs without spending too much time writing, debugging, and performance tuning the code. In order to achieve this goal, we have designed FDPS so that it provides all necessary functions for efficient parallel programming of particle-based simulations. FDPS uses MPI for inter-node parallelization and OpenMP for intra-node parallelization. In order to reduce the communication between computing nodes, the computational domain is divided using the recursive multisection algorithm (Makino 2004), but with weights for particles to achieve optimal load balancing (Ishiyama et al. 2009). The number of subdomains is equal to the number of MPI processes, and one subdomain is assigned to one MPI process.
Initially, particles are distributed to MPI processes in an arbitrary way. It is not necessary that the initial distribution is based on spatial decomposition, and it is even possible that initially just one process has all the particles, if it has sufficient memory. After the spatial coordinates of subdomains are determined, for each particle, the MPI process to which it belongs is determined, and it is sent to that process; this can all be achieved just by calling FDPS library functions. In order for the FDPS functions to get information on particles and copy or move them, they need to know the data structure of the particles. This is made possible by making FDPS "template based," so that at compile time the FDPS library functions know the data structure of the particles.
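To illustrate the template-based design, a minimal sketch of a user-defined particle type and a library routine templated on it is given below. The type name FPGrav, its fields, and the accessor are hypothetical stand-ins for an application's particle class, not FDPS's actual required interface.

```cpp
#include <cassert>

// Hypothetical full-particle type; field names and getPos() are
// illustrative, not FDPS's actual required method set.
struct Vec3 { double x, y, z; };

struct FPGrav {
    long long id;     // particle identifier
    double    mass;   // gravitational mass
    Vec3      pos;    // position
    Vec3      vel;    // velocity
    Vec3      acc;    // acceleration written back after the force calculation
    Vec3 getPos() const { return pos; }  // accessor a library could call
};

// A library function templated on the particle type can operate on
// user-defined particles without knowing their full layout; at compile
// time the template is instantiated with the concrete type.
template <class Tp>
double totalMass(const Tp* p, int n) {
    double m = 0.0;
    for (int i = 0; i < n; i++) m += p[i].mass;
    return m;
}
```

This is the sense in which FDPS is "template based": the library functions are compiled together with the user's particle class, so no runtime introspection of the particle layout is needed.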
After the particles are moved to their new locations, the interaction calculation is done through the parallel Barnes-Hut algorithm based on the local essential tree (Makino 2004). In this method, each MPI process first constructs a tree structure from its local particles (local tree). Then, it sends to all other MPI processes the information necessary for that MPI process to evaluate the interaction with its particles. This necessary information is called the local essential tree (LET).
After one process has received all the LETs from all other nodes, it constructs the global tree by merging the LETs. In FDPS, merging is not actually done but LETs are first reduced to arrays of particles and superparticles (hereafter called "SPJ"), and a new tree is constructed from the combined list of all particles. Here, an SPJ represents a node of the Barnes-Hut tree.
Finally, the interaction calculation is done by traversing the tree for each particle. Using Barnes' vectorization algorithm (Barnes 1990), we traverse the tree for a group of local particles and create the "interaction list" for that group. Then, FDPS calculates the interaction exerted by particles and superparticles in this interaction list on particles in the group, by calling the user-supplied interaction function.
In the case of long-range interactions we use the standard Barnes-Hut scheme for the treewalk. In the case of short-range interactions such as the SPH interaction between particles, we still use a treewalk, but with a cell-opening criterion different from the standard opening-angle criterion.
Thus, users of FDPS can use the functions for domain decomposition, particle migration, and interaction calculation by passing their own particle data class and interaction calculation function to FDPS at compile time. The interaction calculation function should be designed as receiving two arrays of particles, one exerting the "force" on the other.
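A user-supplied interaction function of this two-array form might look like the following sketch for softened gravity. The type names EPI, EPJ, and Force, the softening value, and the function name are illustrative assumptions, not FDPS's exact signatures.

```cpp
#include <cassert>
#include <cmath>

// Illustrative particle types: epi receives the force, epj exerts it.
struct EPI   { double x, y, z; };
struct EPJ   { double x, y, z, mass; };
struct Force { double ax, ay, az, pot; };

const double EPS2 = 1.0e-6;  // gravitational softening squared (assumed value)

// Interaction function in the form FDPS expects: forces from the epj
// array on every particle in the epi array, accumulated into force[].
void calcGravity(const EPI* epi, int ni,
                 const EPJ* epj, int nj, Force* force) {
    for (int i = 0; i < ni; i++) {
        double ax = 0, ay = 0, az = 0, pot = 0;
        for (int j = 0; j < nj; j++) {
            double dx = epj[j].x - epi[i].x;
            double dy = epj[j].y - epi[i].y;
            double dz = epj[j].z - epi[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + EPS2;
            double rinv = 1.0 / std::sqrt(r2);
            double mrinv3 = epj[j].mass * rinv * rinv * rinv;
            ax += mrinv3 * dx;
            ay += mrinv3 * dy;
            az += mrinv3 * dz;
            pot -= epj[j].mass * rinv;  // softened potential
        }
        force[i].ax = ax; force[i].ay = ay;
        force[i].az = az; force[i].pot = pot;
    }
}
```

Because the function only sees two flat arrays, the same form works whether the "j" side holds real particles or Barnes-Hut superparticles.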

Traditional approach to using accelerators and its limitation
As we have already stated in the introduction, accelerators have been used for gravitational N-body simulations, both on single and parallel machines, with and without the Barnes-Hut treecode (Barnes & Hut 1986). In the case of the tree algorithm, the idea is to use Barnes' vectorization algorithm, which is what we defined as the interface between the user-defined interaction function and FDPS. Thus, in principle we can use accelerators just by replacing the user-defined interaction function with one that uses the accelerators.
In the case of GRAPE processors, that would be the only thing we need to do. At the same time, this would be the only thing we can do. On modern GPGPUs, however, we need to modify the interface and algorithm slightly. There are two main reasons for this modification. The first is that the software overhead of GPGPUs for data transfer and kernel startup is much larger than for GRAPE processors. The second is a difference in the architecture. GRAPE processors consist of a relatively small number of highly pipelined, application-specific pipeline processors for interaction calculation, with hardware support for the fast summation of results from multiple pipelines. On the other hand, GPGPUs consist of a very large number of programmable processors, with no hardware support for summation of the results obtained on multiple processors. Thus, to make efficient use of GPGPUs we need to calculate the interactions on a large number of particles in a single call to the GPGPU computing kernel. The vectorization algorithm has one adjustable parameter, n_grp, the number of particles which share one interaction list, and it is possible to make efficient use of GPGPUs by making n_grp large. However, using an excessively large n_grp causes an increase in the total calculation cost, and is thus not desirable. Hamada et al. (2009) introduced an efficient way to use GPGPUs that they called the "multiwalk" method. In their method, the CPU first constructs multiple interaction lists for multiple groups of particles, and then sends them to the GPGPU in a single kernel call. The GPGPU performs the calculations for multiple interaction lists in parallel, and returns all the results in a single data transfer. In this way we can tolerate the large overhead of invoking computing kernels on GPGPUs and the lack of support for fast summation.
Even though this multiwalk method is quite effective, there is still room for improvement, meaning that on modern accelerators the efficiency we can achieve with the multiwalk method is rather limited.
The biggest remaining inefficiency comes from the fact that with the multiwalk method we send interaction lists for each particle group. One interaction list is an array of physical quantities (at least positions and masses) of particles. Typically, the number of particles in an interaction list is ∼10 times more than the number of particles for which that interaction list is constructed, and thus the transfer time of the interaction list is around 10 times longer than that of the particles which receive the force. This means that we are sending the same particles (and superparticles) multiple times when we send multiple interaction lists.
In the next section we discuss how we can reduce the amount of communication and also further reduce the calculation cost for the parts other than the force calculation kernel.

New algorithms
As described in the previous section, to send all the particles in the interaction list to accelerators is inefficient because we send the same particles and SPJs multiple times. In subsection 4.1 we will describe new algorithms to overcome this inefficiency. In subsection 4.2 we will also describe new algorithms to further reduce the calculation cost for the parts other than the force calculation kernel. In subsection 4.3 we will describe the actual procedures with and without the new algorithms.

Indirect addressing of particles
When we use the interaction list method on systems with accelerators, in the simplest implementation, for each group of particles and its interaction list, we send the physical quantities necessary for the interaction calculation, such as positions and masses in the case of gravitational force calculation. Roughly speaking, the number of particles in the interaction list is around ten times larger than that in one group. Thus, we are sending around 10n particles, where n is the number of particles per MPI process, at each timestep. Since there are only n local particles and the number of particles and tree nodes in an LET is generally much smaller than n, this means that we are sending the same data many times, and that we should be able to reduce the communication by sending particle and tree node data only once. Some GRAPE processors, including GRAPE-2A, MDGRAPE-x, and GRAPE-8, have hardware support for this indirect addressing (Makino & Daisaka 2012).
In the case of programmable accelerators, this indirect addressing can be achieved by first sending arrays of particles and tree nodes, and then sending the interaction list (here the indices of particles and tree nodes indicating their location in the arrays). The user-defined interaction calculation function should be modified so that it uses indirect addressing to access particles. Examples of such code are included in the current FDPS distribution (version 4.0 and later), and we plan to develop template routines which can be used to generate code on multiple platforms from user-supplied code for the interaction calculation.
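The kernel-side change can be sketched as follows: the full particle array is transferred once, and each interaction list is then just an array of 32-bit indices into it. The 1-D toy force and all names are illustrative, not the FDPS sample code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Body { double x, mass; };  // toy 1-D particle already resident on the accelerator

// Interaction through indirect addressing: the kernel receives only the
// index list and looks the particle data up in the array sent earlier.
double accel1D(const Body* all, const std::vector<uint32_t>& list, double xi) {
    double a = 0.0;
    for (uint32_t j : list) {            // indirect access: all[j]
        double dx = all[j].x - xi;
        double r2 = dx * dx + 1e-4;      // softened toy "gravity"
        a += all[j].mass * dx / (r2 * std::sqrt(r2));
    }
    return a;
}
```

The particle array is sent once per timestep; only the small index lists travel with each group, which is the source of the communication saving discussed next.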
The interaction list is usually an array of 32-bit integers (four bytes), and one item of particle data is at least 16 bytes (when positions and masses are all in single precision numbers), but can be much larger in the case of SPH and other methods. Thus, with this method we can reduce the communication cost by a large factor.
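The size of that factor can be estimated from the numbers quoted above: a 4-byte index per list entry versus at least 16 bytes of particle data sent once, with each particle appearing in roughly ten interaction lists (the redundancy factor assumed here, following the estimate in the previous section).

```cpp
#include <cassert>

// Ratio of traffic without indirect addressing (full particle data per
// list entry) to traffic with it (data once, plus an index per entry).
double trafficRatio(double bytesPerParticle, double bytesPerIndex,
                    double redundancy) {
    double direct   = redundancy * bytesPerParticle;
    double indirect = bytesPerParticle + redundancy * bytesPerIndex;
    return direct / indirect;
}
```

With 16-byte particles and a redundancy of 10 the ratio is about 2.9, and it grows with the particle size, so SPH particles (which carry many more quantities) benefit considerably more.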
One limitation of this indirect addressing method is that all the particles in one process should fit in the memory of the accelerator, and most accelerators have relatively small memories. In such cases we can still use this method, by dividing the particles into blocks small enough to fit in the accelerator memory. For each block, we construct a "global" tree structure similar to the one we would construct for all particles in the process, and interaction lists for all groups in the block.

Reuse of interaction lists
For both long-range and short-range interactions, FDPS constructs the interaction lists for groups of particles. It is possible to keep using the same interaction lists for multiple timesteps if particles do not move large distances in a single timestep. In the case of SPH or molecular dynamics simulations, it is guaranteed that particles move only a small fraction of the interparticle distance in a single timestep, since the size of the timestep is limited by the stability condition. Thus, in such cases we can safely use the interaction lists for several timesteps.
Even in the case of gravitational many-body simulations, there are cases where the change of the relative distance between particles in a single timestep is small. For example, in simulations of planetary formation processes or planetary rings, the random velocities of particles are very small, and thus, even though particles move large distances, there is no need to reconstruct the tree structure at each timestep because the changes in the relative positions of particles are small.
In the case of galaxy formation simulations using the N-body+SPH technique, generally the timestep for the SPH part is much smaller than that for the gravity part, and thus we should be able to use the same tree structure and interaction lists for multiple SPH steps.
If we use this algorithm (hereafter we call it the reuse algorithm), the procedures for the interaction calculation differ between steps with and without tree construction. The procedure for a tree construction step is:
(1) Construct the local tree.
(2) Construct the LETs for all other processes. These LETs should be lists of indices of particles and tree nodes, so that they can be used later.
(3) Exchange LETs. Here, the physical information for tree nodes and particles should be exchanged.
(4) Construct the global tree.
(5) Construct the interaction lists.
(6) Perform the interaction calculation for each group using the constructed lists.
The procedure for a reusing step is:
(1) Update the physical information of the local tree.
(2) Exchange the physical information of the LETs, using the index lists and destinations saved at the construction step.
(3) Update the physical information of the global tree.
(4) Perform the interaction calculation for each group using the constructed lists.
In many cases we can keep using the same interaction list for around 10 timesteps. In the case of planetary ring simulation, using the same list for a much larger number of timesteps is possible, because the stepsize of planetary ring simulation using the "soft-sphere" method (Iwasawa et al. 2018) is limited by the hardness of the "soft" particles and is thus much smaller than the usual timescale determined by the local velocity dispersion and interparticle distance.
With this reuse algorithm, we can reduce the cost of the following steps: (a) tree construction, (b) LET construction, and (c) interaction list construction. The calculation costs of steps (a) and (c) are O(N) and O(N log N), respectively. Thus they are rather large for simulations with a large number of particles. Moreover, by reducing the cost of step (c), we can make the group size n_grp small, which results in a decrease of the calculation cost due to the use of the interaction list. Thus, the overall improvement in efficiency is quite significant.
The construction and maintenance of interaction lists and other necessary data structures are all done within FDPS. Therefore, user-developed application programs can use this reuse algorithm just by calling the FDPS interaction calculation function with one additional argument indicating reuse/construction. The necessary change for an application program is very small.
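One way an application might decide when the saved lists are no longer safe is to track displacements since the last construction and rebuild once the largest one exceeds a chosen fraction of the interparticle distance. The threshold and the helper below are assumptions for illustration; the text's examples simply rebuild at a fixed interval.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Decide whether the trees and interaction lists must be reconstructed:
// reuse is safe only while particles have moved a small fraction of the
// typical interparticle distance since the last construction step.
bool needReconstruction(const std::vector<double>& dispSinceBuild,
                        double interparticleDist, double safetyFraction) {
    double maxDisp = 0.0;
    for (double d : dispSinceBuild)
        maxDisp = std::max(maxDisp, std::fabs(d));
    return maxDisp > safetyFraction * interparticleDist;
}
```

In SPH or molecular dynamics the stability condition on the timestep guarantees the displacement per step is small, so a check like this would typically allow around ten reuse steps between reconstructions.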

Procedures with or without the new algorithms
In this section we describe the actual procedures of simulations using FDPS with or without the new algorithms. Before describing the procedures, let us introduce the four particle data types FDPS uses: FP (Full Particle), EPI (Essential Particle I), EPJ (Essential Particle J), and FORCE. FP is the data structure containing all the information on a particle, EPI (EPJ) holds the minimal data of particles that receive (exert) the force, and the FORCE type stores the calculated interaction. FDPS uses these three additional data types to minimize memory access during the interaction calculation. We first describe the procedure for the calculation without the reuse algorithm and then describe that for the reuse algorithm.
At the beginning of one timestep, the computational domains assigned to MPI processes are determined and all processes exchange particles so that all particles belong to their appropriate domains. Then, the coordinates of the root cell of the tree are determined using the positions of all particles. After the determination of the root cell, each MPI process constructs its local tree. The local tree construction consists of the following four steps:
(1) Generate Morton keys for all particles.
(2) Sort key-index pairs in Morton order by radix sort.
(3) Reorder the FPs in Morton order referencing the key-index pairs, and copy the particle data from the FPs to the EPIs and EPJs.
(4) For each level of the tree, from top to bottom, allocate tree cells and link their child cells. In each level we use binary search to find the cell boundaries.
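Steps 1 and 2 can be sketched as follows. The bit-interleaving uses the standard 21-bit-per-axis Morton spread; std::sort stands in for the radix sort FDPS uses, and the integerized coordinates are assumed to have been scaled to the root cell beforehand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Step 1: interleave the low 21 bits of x, y, z into a 63-bit Morton key.
uint64_t mortonKey(uint32_t x, uint32_t y, uint32_t z) {
    auto spread = [](uint64_t v) {  // insert two zero bits after each bit
        v &= 0x1fffff;
        v = (v | v << 32) & 0x1f00000000ffffULL;
        v = (v | v << 16) & 0x1f0000ff0000ffULL;
        v = (v | v <<  8) & 0x100f00f00f00f00fULL;
        v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
        v = (v | v <<  2) & 0x1249249249249249ULL;
        return v;
    };
    return spread(x) | spread(y) << 1 | spread(z) << 2;
}

// Step 2: sort key-index pairs so that step 3 can reorder the FPs.
// (std::sort here; FDPS uses a radix sort for the same purpose.)
std::vector<int> mortonOrder(const std::vector<uint64_t>& keys) {
    std::vector<int> idx(keys.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = (int)i;
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return keys[a] < keys[b]; });
    return idx;
}
```

Particles adjacent in this order are spatially close, which is what makes the level-by-level cell construction of step 4 a sequence of binary searches over a sorted key array.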
In the case of the reusing step, these steps are skipped. After the construction of the local tree, multipole moments of all local tree cells are calculated, from the bottom to the top of the tree. The calculation of the multipole moments is performed even at the reusing step, because the physical quantities of particles are updated at each timestep.
After the calculation of the multipole moments of the local tree, each MPI process constructs LETs and sends them to other MPI processes. When the reusing algorithm is used, at the tree construction step each MPI process saves the LETs and their destination processes.
After the exchange of LETs, each MPI process constructs the global tree from received LETs and its local tree. The procedure is almost the same as that for the local tree construction.
After the construction of the global tree, each MPI process calculates the multipole moments of all cells of the global tree. The procedure is the same as that for the local tree.
After the calculation of the moments of the global tree, each MPI process constructs the interaction lists and uses them to perform the force calculation. If we do not use the multiwalk method, each MPI process makes the interaction list for one particle group at a time, and the user-defined force kernel calculates the forces from the EPJs and SPJs in the interaction list on the EPIs in the particle group.
When we use the multiwalk method, each MPI process makes multiple interaction lists for multiple particle groups. When the indirect addressing method is not used, each MPI process gives multiple groups and multiple interaction lists to the interaction kernel on the accelerator. Thus we can summarize the force calculation procedure without the indirect addressing method for multiple particle groups as follows:
(1) Construct the interaction lists for multiple particle groups.
(2) Copy the EPIs and the interaction lists to the send buffer for the accelerator. Here, the interaction lists consist of EPJs and SPJs.
(3) Send the particle groups and their interaction lists to the accelerator.
(4) Let the accelerator calculate the interactions on the particle groups sent at step 3.
(5) Receive the results calculated at step 4, copy them back to the FPs, integrate the orbits of the FPs, and copy the data from the FPs to the EPIs and EPJs.
To calculate the forces on all particles, the above steps are repeated until all particle groups are processed. Note that the construction of the interaction lists (step 1), sending the data to the accelerator (step 3), the actual calculation (step 4), and receiving the calculated results (step 5) can all be overlapped. On the other hand, when the indirect addressing method is used, each MPI process sends the data of all cells of the global tree to the accelerator at the beginning of the interaction calculation, before the construction of the interaction lists. After that, the accelerator receives the data of the particle groups and their interaction lists, but here the interaction lists contain the indices of the EPJs and SPJs and not their physical quantities. Thus, the calculation procedure with the indirect addressing method is the same as that without it, except that all the global tree data are sent at the beginning of the calculation and the interaction lists sent during the calculation contain only indices of tree cells and EPJs.
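The batching behind the multiwalk procedure can be sketched as follows: several groups ("walks"), each with its own interaction list, are handed to a single kernel call, so the per-call overhead is paid once for n_walk lists. An ordinary function with toy 1-D gravity stands in for the accelerator kernel; on a GPGPU the walks would run in parallel, and all names here are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Walk1D {
    std::vector<double> xi;      // receivers of the force (the EPIs of one group)
    std::vector<double> xj, mj;  // its interaction list (EPJ/SPJ positions, masses)
};

// One "kernel call" (step 4) computes the accelerations for every walk in
// the batch, and the whole result comes back in a single transfer (step 5).
std::vector<std::vector<double>> multiwalkKernel(const std::vector<Walk1D>& walks) {
    std::vector<std::vector<double>> acc(walks.size());
    for (size_t w = 0; w < walks.size(); w++) {  // parallel on a real GPGPU
        const Walk1D& wk = walks[w];
        acc[w].assign(wk.xi.size(), 0.0);
        for (size_t i = 0; i < wk.xi.size(); i++) {
            for (size_t j = 0; j < wk.xj.size(); j++) {
                double dx = wk.xj[j] - wk.xi[i];
                double r2 = dx * dx + 1e-4;      // softened toy "gravity"
                acc[w][i] += wk.mj[j] * dx / (r2 * std::sqrt(r2));
            }
        }
    }
    return acc;
}
```

With indirect addressing, the xj and mj arrays of every walk would be replaced by index lists into one shared particle array sent beforehand.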
We can use the reusing method both with and without the indirect addressing method. For the construction step, the procedures are the same. For the reusing steps, we can skip the construction of the interaction lists (step 1). When we use the indirect addressing method, we can also skip sending the lists, since the lists of indices are unchanged while they are reused.

APIs for using accelerators
In this section we describe the APIs (application programming interfaces) of FDPS for using accelerators and how to use them by showing sample code developed for NVIDIA GPGPUs. Part of the user kernel is written in CUDA.
FDPS has high-level APIs to perform all the procedures for interaction calculation in a single API call. For the multiwalk method, FDPS provides calcForceAllAndWriteBackMultiWalk or calcForceAllAndWriteBackMultiWalkIndex. The difference between these two functions is that the former does not use the indirect addressing method. These two API functions can be used to replace calcForceAllAndWriteBack, which is another top-level function provided by the FDPS distribution, version 1.0 or later. A user must provide two force kernels: the "dispatch" and "retrieve" kernels. The "dispatch" kernel is used to send EPIs, EPJs, and SPJs to accelerators and call the force kernel. The "retrieve" kernel is used to collect FORCEs from accelerators. The reason FDPS needs two kernels is to allow overlap of the calculation on the CPU with the force calculation on the accelerator, as described in the previous section.
The reusing method can be used with all three top-level API calls described above. The only thing users need to do to enable the reusing method is to pass an appropriate FDPS-provided enumerated type value to these functions. The values FDPS provides are MAKE_LIST, MAKE_LIST_FOR_REUSE, and REUSE_LIST. At the construction step the application program should pass MAKE_LIST_FOR_REUSE to the top-level API function so that FDPS constructs the trees and the interaction lists and saves them. At the reusing step, the application program should pass REUSE_LIST so that FDPS skips the construction of the trees and reuses the interaction lists constructed at the last construction step. In the case of MAKE_LIST, FDPS also constructs the trees and the interaction lists but does not save them, so the reusing method cannot be used.

Figure 1 shows an example of how to use the reusing method. Here, the trees and the interaction lists are constructed once every eight steps. While the same list is being reused, particles must remain in the same MPI process as at the time of list construction. Thus, exchangeParticle should be called only just before the tree construction step.

Figure 2 shows an example of the dispatch kernel without the indirect addressing method. FDPS gives the dispatch kernel the arrays of pointers to the EPIs, EPJs, and SPJs as arguments of the kernel (lines 3, 5, and 7). Each pointer points to the address of the first element of the arrays of EPIs, EPJs, and SPJs for one group and its interaction list. The sizes of these arrays are given by n_epi (line 4), n_epj (line 6), and n_spj (line 8). FDPS also passes "tag" (the first argument) and "n_walk" (the second argument). The argument "tag" is used to specify individual accelerators if multiple accelerators are available. However, in the current version of FDPS, "tag" is disabled and FDPS always passes 0. The argument "n_walk" is the number of particle groups and interaction lists.
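The step-selection logic of figure 1 can be sketched as follows. The enumerator names follow the text; the eight-step interval and the helper function are illustrative stand-ins for the application code that would call exchangeParticle and the top-level force routine.

```cpp
#include <cassert>

// Enumerated values FDPS provides for the top-level interaction APIs
// (names from the text; the numeric values here are arbitrary).
enum InteractionListMode { MAKE_LIST, MAKE_LIST_FOR_REUSE, REUSE_LIST };

// Choose the mode to pass at a given step: rebuild the trees and the
// interaction lists once every reuseInterval steps, reuse them otherwise.
// exchangeParticle would run only on the construction steps, since
// particles must stay in the same MPI process while a list is reused.
InteractionListMode modeForStep(long step, long reuseInterval) {
    return (step % reuseInterval == 0) ? MAKE_LIST_FOR_REUSE : REUSE_LIST;
}
```

With an interval of eight, step 0 and every eighth step thereafter are construction steps and all other steps reuse the saved lists, matching the cadence used in figure 1.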
To overlap the actual force calculation on the GPGPU with the data transfer between the GPGPU and the CPU we use CUDA streams; a stream is a sequence of operations executed in order on the GPGPU. In this example we use N_STREAM CUDA streams. In this paper we used eight streams, because the performance of our simulations does not improve with more. Each stream handles n_walk/N_STREAM interaction lists. The particle data, EPIs (lines 28-33), EPJs (lines 36-41), and SPJs (line 48), are copied to the send buffers for the GPGPU. Here we use the same buffer for EPJs and SPJs because the EPJ and SPJ types are identical. In lines 55 and 56, the EPIs and EPJs are sent to the GPGPU, and the force kernel is called in line 60. Figure 3 shows an example of the dispatch kernel with the indirect addressing method. This kernel is almost the same as the one without indirect addressing, with two differences. First, at the beginning of a timestep, all data of the global tree (EPJs and SPJs) are sent to the GPGPU (lines 15 and 16). Whether the application program sends EPJs and SPJs to the GPGPU is controlled by the 13th argument, "send_flag": if "send_flag" is true, all EPJs and SPJs are sent. Second, the indices of the EPJs and SPJs are sent (lines 48-58 and 67-69) instead of their physical quantities. Here, the user-defined global variable CONSTRUCTION_STEP specifies whether the current step is a construction or a reusing step. At a construction step CONSTRUCTION_STEP is set to unity, and the user program sends the interaction lists to the GPGPU, where they are saved. At a reusing step, the user program does not send the lists and instead reuses the interaction lists saved on the GPGPU. Figure 4 shows an example of the retrieve kernel. The same retrieve kernel can be used with and without the indirect addressing method.
In line 12, the GPGPU sends the interaction results to the receive buffer of the host (D2H copy). To let the CPU wait until all functions in the same stream on the GPGPU are completed, cudaStreamSynchronize is called in line 13. Finally, the interaction results are copied to FORCEs. Note that this retrieve kernel could be optimized by launching all D2H copies at once and then processing stream 1 as soon as it finishes, while the D2H copies for the other streams are still in flight. However, in most particle simulations the execution time for the force calculation is much longer than that for the D2H copies, so this optimization would not change the performance.

Publications of the Astronomical Society of Japan (2020), Vol. 72, No. 1 (https://academic.oup.com/pasj/article-abstract/72/1/13/5728486)

Performance
In this section we present the measured performance and the performance model of a simple N-body simulation code implemented using FDPS on CPU-only and CPU + GPGPU systems.

Measured performance
To measure the performance of FDPS, we performed simple gravitational N-body simulations both with and without the accelerator. The initial model is a cold uniform sphere, which collapses in a self-similar way, so we can use the reusing method. The number of particles (per process) n is 2^22 (4M). The opening criterion of the tree, θ, is between 0.25 and 1.0; the maximum number of particles in a leaf cell is 16; and the maximum number of particles in an EPI group, n_grp, is between 64 and 16384. We performed these simulations with the three methods listed in table 1. For the reusing method, the number of reusing steps between construction steps, n_reuse, is between 2 and 16. For all simulations we used a monopole-only kernel. We used an NVIDIA TITAN V as the accelerator; its peak performance is 13.8 Tflops in single precision. The host CPU is a single Intel Xeon E5-2670 v3 with a peak speed of 883 Gflops in single precision. The GPGPU is connected to the host CPU through a PCI Express bus for the data transfer between the host and the GPGPU. All particle data are stored in double precision; the force calculation on the GPGPU and the data transfer between the CPU and the GPGPU are performed in single precision. Figure 5 shows the average elapsed time per step for methods C1, G1, G2, and G3 with reuse intervals n_reuse of 2, 4, and 16, plotted against the average number of particles that share one interaction list, n_i. We also plot the elapsed time of method G3 for the reusing step alone (i.e., the time corresponding to n_reuse = ∞). The opening angle is θ = 0.5. In the case of method G3, the elapsed time becomes smaller as the reuse interval n_reuse increases, approaching the time of the reusing step. The minimum time of method G3 at the reusing step is ten times smaller than that of method C1 and four times smaller than that of method G1.
The performance of method G3 with n reuse of 16 is 3.7 Tflops (27% of the theoretical peak).
We can also see that the optimal value of n_i becomes smaller as n_reuse increases. When we do not use the reusing method, the tree construction and traversal are done at every step. Their costs are therefore large, and to reduce them we should increase n_i. To do so, we need a large n_crit, the maximum number of particles in a particle group. However, if we make n_i too large, the calculation cost increases (Makino 1991); thus there is an optimal n_i. When we use the reusing method, the relative cost of tree traversal becomes smaller, so the optimal n_i becomes smaller and the calculation cost also decreases. We give a more detailed analysis in subsection 6.2. Table 2 shows the breakdown of the calculation time for the different methods in the case of θ = 0.5. For the runs with C1 and G1, we show the breakdown at the optimal values of n_i, which are 230 and 1500, respectively. For the runs with G2 and G3 at the reusing step, we show the breakdowns for n_i = 230. The calculation time for G3 at the reusing step is four times smaller than for G2. Thus, if n_reuse ≳ 4, the contribution of the construction step to the total calculation time is small. Figure 6 shows the average elapsed time at the optimal n_i plotted against θ for methods C1, G1, G2, and G3 with n_reuse of 2, 4, and 16. We also plot the elapsed time for the reusing step of method G3. The slope for method C1 is steeper than for the other methods, because for method C1 the time for the force kernel dominates the total time and depends strongly on θ.

Performance model on a single node
In the following we present the performance model of the N-body simulation with FDPS with the monopole approximation on a single node, with and without an accelerator. The total execution times per step on a single node for the construction step, T_step,const, and for the reusing step, T_step,reuse, are given by

T_step,const ∼ T_root + T_const,lt + T_mom,lt + T_const,gt + T_mom,gt + T_force,const, (1)
T_step,reuse ∼ T_mom,lt + T_reuse,reorder,gt + T_mom,gt + T_force,reuse, (2)

where T_root, T_const,lt, T_mom,lt, T_const,gt, T_mom,gt, T_force,const, T_reuse,reorder,gt, and T_force,reuse are the times for the determination of the root cell of the tree, the construction of the local tree, the calculation of the multipole moments of the cells of the local tree, the construction of the global tree, the calculation of the multipole moments of the cells of the global tree, the force calculation for the construction step, the reordering of the particles for the global tree, and the force calculation for the reusing step, respectively. The force calculation times T_force,const and T_force,reuse include the times for the construction of the interaction lists, the actual calculation of the interactions, copying the interaction results from FORCEs to FPs, the integration of the orbits of FPs, and copying the particle data from FPs to EPIs and EPJs. In the reusing step the tree is reused, and therefore T_root, T_const,lt, and T_const,gt do not appear in T_step,reuse. On the other hand, T_reuse,reorder,gt appears in T_step,reuse, because in the reusing step the particles are sorted in the Morton order of the local tree and must be reordered into the Morton order of the global tree. In the following sub-subsections we discuss the elapsed times of each component in equations (1) and (2).

Model of T_root
At the beginning of the construction step, the coordinates of the root cell of the tree are determined. To do so, the CPU cores read the data of all particles (FPs) from the main memory, and on modern machines the effective main-memory bandwidth determines the calculation time; the actual computation in determining the root cell coordinates is much faster than the main-memory access. Thus T_root is given by

T_root ∼ α_root n b_FP / B_host,

where n is the number of local particles, α_root is the coefficient for the memory access speed, b_FP is the FP data size, and B_host is the effective bandwidth of the main memory of the host as measured by the STREAM benchmark. Note that we used the "copy" kernel of the STREAM benchmark; in other words, B_host indicates the mean bandwidth of reading and writing. On our system, the reading bandwidth is slightly higher than the writing bandwidth, so for a reading-dominated procedure α would be smaller than unity. For our N-body code, we found α_root ∼ 0.7. All the coefficients appearing in our performance model, such as α_root and b_FP, and their typical values are listed in table 3.

Model of T_const,lt
The time for the construction of the local tree is given by

T_const,lt ∼ T_key,lt + T_sort,lt + T_reorder,lt + T_link,lt,

where T_key,lt, T_sort,lt, T_reorder,lt, and T_link,lt are the elapsed times for the calculation of the Morton keys; the sorting of key-index pairs; the reordering of FPs, EPIs, and EPJs using the sorted key-index pairs; and the linking of tree cells. The time for key construction is determined by the time to read FPs and write key-index pairs. Thus T_key,lt is given by

T_key,lt ∼ α_key,lt n (b_FP + b_key) / B_host,

where b_key is the data size of the key-index pair. For sorting, we use the radix sort algorithm (Knuth 1997) with a chunk size of 8 bits for the 64-bit keys, so the basic procedure of the radix sort is applied eight times. For each chunk, the data are read twice and written once, so the total number of memory accesses is 24. Therefore, T_sort,lt is given by

T_sort,lt ∼ 24 α_sort,lt n b_key / B_host.

For reordering, the key-index pair and the FP are read once, and the FP, EPI, and EPJ are written once, so

T_reorder,lt ∼ α_reorder,lt n (b_key + 2 b_FP + b_EPI + b_EPJ) / B_host,

where b_EPI and b_EPJ are the data sizes of the EPI and EPJ, respectively. In the linking part, we generate the tree structure from the sorted keys. To do so, for each cell of the tree at each level, we determine the location of the first particle by binary search. The cost is therefore proportional to n_cell log_2(n_cell), where n_cell is the total number of tree cells; for the usual Barnes-Hut tree, n_cell ∼ n/4. Thus T_link,lt is given by

T_link,lt ∼ α_link,lt (n/4) log_2(n/4) b_key / B_host.

The reason why α_link,lt is much larger than unity is that in the binary search the address to access depends on the data just read.

Model of T_mom,lt
The time for the calculation of the multipole moments of the local tree is determined by the time to read the EPJs, and therefore T_mom,lt is given by

T_mom,lt ∼ α_mom,lt n b_EPJ / B_host.

Model of T_const,gt
In the current implementation of FDPS, the global tree is constructed even if MPI is not used. The procedures for the construction of the global tree are essentially the same as those for the local tree, except for the reordering of particles. In the reordering, the EPJs and SPJs in all LETs are first written to separate arrays, and the array indices for the EPJs and SPJs are also saved so that the particles can be reordered efficiently at the reusing step. Thus the reordering time is given by

T_const,reorder,gt ∼ (n + n_LET) α_const,reorder,gt (2 b_EPJ + b_index) / B_host,

where n_LET is the number of LETs and b_index is the size of one array index for EPJ and SPJ in bytes. Note that for the center-of-mass approximation used here, the SPJ type is the same as the EPJ type, and thus the size of an SPJ is equal to b_EPJ. From table 3 we can see that α_const,reorder,gt is larger than α_reorder,lt, because for each node of a LET we need to determine whether it is an EPJ or an SPJ.

Model of T_reuse,reorder,gt
At the reusing step, we also reorder the particles into the Morton order of the global tree, using the array indices for EPJ and SPJ constructed at the construction step. Thus T_reuse,reorder,gt is dominated by the times to read the indices for EPJ and SPJ once and to write the EPJs and SPJs once, and is therefore given by

T_reuse,reorder,gt ∼ (n + n_LET) α_reuse,reorder,gt (b_index + b_EPJ) / B_host.

In table 3 we can see that α_reuse,reorder,gt is much smaller than α_const,reorder,gt, because we use the array indices for EPJ and SPJ saved at the construction step to reorder the particles.

Model of T_mom,gt
The procedure for the calculation of the multipole moments of the cells of the global tree is almost the same as that for the local tree. Thus T_mom,gt is given by

T_mom,gt ∼ α_mom,gt (n + n_LET) b_EPJ / B_host.

Models of T_force,const and T_force,reuse
The elapsed times for the force calculation at the construction step, T_force,const, and at the reusing step, T_force,reuse, are given by

T_force,const ∼ max(T_cp,all, T_H2D,all) + T_const,list + max(T_cp,EPI + T_cp,list, T_H2D,EPI + T_H2D,list, T_kernel) + T_D2H,FORCE + T_wb+int+cp, (17)
T_force,reuse ∼ max(T_cp,all, T_H2D,all) + max(T_cp,EPI, T_H2D,EPI, T_kernel) + T_D2H,FORCE + T_wb+int+cp, (18)

where T_cp,all, T_H2D,all, T_kernel, T_H2D,EPI, T_H2D,list, T_D2H,FORCE, T_const,list, T_cp,EPI, T_cp,list, and T_wb+int+cp are the times for copying EPJs and SPJs to the send buffer, sending EPJs and SPJs from the host to the GPGPU, the force kernel on the GPGPU, sending EPIs to the GPGPU, sending the interaction lists to the GPGPU, receiving FORCEs from the GPGPU, constructing the interaction lists, copying EPIs to the send buffer, copying the interaction lists to the send buffer, and the combined time for copying the FORCE data to FPs, integrating the orbits of FPs, and copying the FP data to EPIs and EPJs, respectively. The components in equations (17) and (18) are each modeled as a coefficient times a data volume divided by the relevant bandwidth, where b_EPI(J),buf, b_ID, b_FORCE, and b_FORCE,buf are the data sizes of an EPI (EPJ) in the send buffer, of an index of an EPJ (or SPJ) in the interaction list, of a FORCE, and of a FORCE in the receive buffer; B_transfer and B_GPU are the effective bandwidths of the data bus between the host and the GPGPU and of the main memory of the GPGPU; F_GPU is the peak floating-point speed of the GPGPU; n_op is the number of floating-point operations per interaction; and n_list is the average size of the interaction list. The elapsed times are determined by the bandwidth of the main memory of the host or of the data bus between the host and the GPGPU, except for the force-calculation time T_kernel, whose model is more complicated; we describe its construction in appendix 1.
In table 3 we can see that α_const,list and α_GPU,transfer are much larger than unity, because random access of the main memory is slow for both the host and the GPGPU.

Model of T_step,const and T_step,reuse
In the previous sub-subsections we showed the performance models of each step of the interaction calculation. By using these models, the total execution time is expressed by equation (31) for the construction step and by equation (32) for the reusing step.
Thus, the mean execution time per step, T_step,single, obtained by substituting the efficiency parameters α and the size parameters b with the measured values listed in table 3, is given by

T_step,single ∼ [T_step,const + (n_reuse − 1) T_step,reuse] / n_reuse.

To check whether the model we constructed is reasonable, in figure 7 we compare the times predicted by equations (31) and (32) with the measured times on two systems, one consisting of an Intel Xeon E5-2670 v3 and an NVIDIA TITAN V (left panel) and one consisting of an Intel Core i9-9820X and an NVIDIA RTX 2080 (right panel). The predicted times agree very well with the measured times on both systems.
In the following, we analyze the performance of the N-body simulations on hypothetical systems with various B_host, B_transfer, F_GPU, and B_GPU by using the performance model. Unless otherwise noted, we assume a hypothetical system with B_host = 100 GB s^-1, B_transfer = 10 GB s^-1, B_GPU = 500 GB s^-1, and F_GPU = 10 Tflops as a reference system, which can be regarded as a modern HPC system with a high-end accelerator. Figure 8 shows the calculation time per timestep on the reference system for 4M particles and θ = 0.5 for n_reuse = 1, 4, and 16. The difference in performance between n_reuse = 4 and n_reuse = 16 is relatively small. In the rest of this section we use n_reuse = 16 and θ = 0.5.
We consider four different types of hypothetical systems: GPU2X, GPU4X, LINK4X, and LINK0.5X. Their parameters are listed in table 4. Figure 9 shows the calculation times per timestep for the four hypothetical systems. We can see that increasing the bandwidth between CPU and accelerator (LINK4X) has a relatively small effect on the performance. On the other hand, increasing the overall accelerator performance has a fairly large impact. Figure 10 shows the relative execution time of hypothetical systems in the two-dimensional plane of B host and B transfer . The contours indicate the execution time relative to the reference system. In other words, systems on the contour with a value of X are X times slower than the reference system.
We can see that an increase of B_host or B_transfer, or even both, would not give a significant performance improvement, while an increase of the accelerator performance gives a significant speedup. Even when both B_host and B_transfer are reduced by a factor of ten, the overall speed is reduced by only a factor of four. Similarly, if the speed of the accelerator is improved by a factor of ten, the overall speedup is four. We can therefore conclude that the current implementation of FDPS can provide good performance not only on the current generation of GPGPUs but also on future generations of GPGPUs or other accelerator-based systems, which will have relatively poor data-transfer bandwidth compared to their calculation performance.

Performance model on multiple nodes
In this section we discuss the performance model of the parallelized N-body simulation with method G3. Here we assume that the network is the same as that of the K computer; detailed communication algorithms are given in Iwasawa et al. (2016). The time per step is given by

T_step,para ∼ T_step,single + T_exch + T_dc / n_dc + T_LET,const + T_LET,exch,

where T_exch, T_dc, T_LET,const, and T_LET,exch are the times for the exchange of particles, the domain decomposition, the construction of the LETs, and the exchange of the LETs, and n_dc is the number of steps for which the same domain decomposition is used. We consider the case in which particles do not move much in a single step, and thus ignore T_exch. The time for the domain decomposition is given by

T_dc ∼ T_dc,collect + T_dc,sort,

where T_dc,collect and T_dc,sort are the times to collect sample particles on the root processes and to sort the particles on the root processes. According to Iwasawa et al. (2016), T_dc,collect and T_dc,sort are given by

T_dc,collect ∼ n_p^1/6 τ_gather,startup + α_gather n_smp n_p^2/3 b_pos / B_inj, (36)
T_dc,sort ∼ α_dc,sort n_smp n_p^2/3 log(n_smp^3 n_p^5/3) b_pos / B_host, (37)

where n_p is the number of processes, n_smp is the number of sample particles used to determine the domains, b_pos is the data size of the position of a particle, B_inj is the effective injection bandwidth, τ_gather,startup is the startup time for MPI_Gather, and α_gather is the efficiency parameter for communicating data with MPI_Gather. We describe how to measure τ_gather,startup and α_gather in appendix 2. The coefficients for equation (37) are listed in table 3. Since a quick sort is used here, α_dc,sort is much larger than unity.
In the original implementation of FDPS, MPI_Alltoallv was used for the exchange of the LETs, and it was the main performance bottleneck for large n_p (Iwasawa et al. 2016). We therefore recently developed a new algorithm for the exchange of the LETs that avoids MPI_Alltoallv (Iwasawa et al. 2018). The new algorithm is as follows: (1) Each process sends the multipole moment of the top-level cell of its local tree to all processes using MPI_Allgather.
(2) Each process calculates the distances from all other domains.
(3) If the distance between processes i and j is large enough that the domain of process i can be regarded as a single cell from process j, process j already has the necessary information from step (1). If not, process i constructs a LET for process j and sends it to process j.
With this new algorithm, the times for the LET exchange are expressed as

T_LET,allgather ∼ n_p^1/4 τ_allgather,startup + α_allgather n_p b_EPJ / B_inj, (40)
T_LET,p2p ∼ n_p,close (τ_p2p,startup + α_p2p b_LET / B_inj), (41)

where T_LET,allgather and T_LET,p2p are the times for the exchange of the LETs using MPI_Allgather and MPI_Isend/Irecv, n_p,close is the number of processes with which LETs are exchanged using MPI_Isend/Irecv, b_LET is the average data size of a LET sent to a close process, τ_allgather,startup and τ_p2p,startup are the startup times for MPI_Allgather and MPI_Isend/Irecv, and α_allgather and α_p2p are the efficiency parameters for the LET exchange with MPI_Allgather and MPI_Isend/Irecv, respectively. The parameters for equations (40) and (41) are listed in table 3. Figure 11 shows the weak-scaling performance for N-body simulations with the number of particles per process n = 2^20. Here we assume n_reuse = n_dc = 16, n_smp = 30, n_i = 250, and θ = 0.5. We can see that T_step,para is almost constant for n_p ≲ 10^5. For n_p ≳ 10^5, T_step,para increases slightly because T_LET,allgather becomes large. Roughly speaking, when n is much larger than n_p the parallel efficiency is high. Figure 12 shows the strong-scaling performance for a total number of particles N = 2^30 (left panel) and N = 2^40 (right panel). In the case of N = 2^40, T_step,para scales well up to n_p = 10^6. On the other hand, in the case of N = 2^30, for n_p ≳ 3000 the slope of T_step,para becomes shallower because T_LET,p2p becomes relatively large, and for n_p ≳ 50000, T_step,para increases because T_LET,allgather becomes relatively large. We can also see that T_step,single increases linearly with n_p for n_p ≳ 10^5. This is because T_step,single depends on n_LET, which is proportional to n_p for large n_p. Thus, to improve the scalability further we need to reduce T_LET,allgather and T_LET,p2p. We discuss ideas for reducing them in subsections 7.1 and 7.2.

Further reduction of communication
For simulations on multiple nodes, the communication of the LETs with neighboring nodes (T_LET,p2p) becomes a bottleneck for very large n_p. Thus, it is important to reduce the data size of this communication.
An obvious way to reduce the amount of data transfer is to apply some data compression techniques. For example, the physical coordinates of neighboring particles are similar, and thus there is clear room for compression. However, for many particle-based simulations, compression in the time domain might be more effective. In the time domain we can make use of the fact that the trajectories of most particles are smooth, so we can construct fitting polynomials from several previous timesteps. When we send the data of one particle at a new timestep, we send only the difference between the prediction from the fitting polynomial and the actual value. Since this difference is many orders of magnitude smaller than the absolute value of the coordinate itself, we should be able to use a much shorter word format for the same accuracy. We probably need some clever way to use a variable-length word format. We can apply compression in the time domain not only for coordinates but for any physical quantities of particles, including the calculated interactions such as gravitational potential and acceleration. We can also apply the same compression technique to communication between the CPU and the accelerator.

Tree of domains
As we have seen in figure 12, for large n_p the total elapsed time increases linearly with n_p, because the elapsed times for the exchange of the LETs and the construction of the global tree are proportional to n_p when n_p is very large. To remove this linear dependency on n_p we can introduce a tree structure over the computational domains (a tree of domains; Iwasawa et al. 2018). By using the tree of domains and exchanging locally combined multipole moments between distant groups of domains, the amount of communication and the size of the global tree can be made to grow much more slowly than linearly with n_p.

Further improvement in single-node performance
Considering the trends in HPC systems, the overall performance of the accelerator (F_GPU and B_GPU) increases faster than the bandwidths of the host main memory (B_host) and of the data bus between the host and the accelerator (B_transfer). Therefore, in the near future the main performance bottlenecks could be B_host and B_transfer. The amounts of data copied in the host main memory and transferred between the host and the accelerator for the reusing step are summarized in tables 5 and 6; they are about 15n and 3n, respectively. One reason for the large amount of data access is that there are four different particle data types (FP, EPI, EPJ, and FORCE) and data are copied between them. If we used only one particle data type, we could avoid the data copying in procedures (e) and (g) in table 5 and procedure (B) in table 6, reducing the amount of data copied in the main memory and transferred between the host and the accelerator by 40% and 33%, respectively. Another way to improve the performance is to implement all procedures on the accelerator, because the bandwidth of the device memory (B_GPU) is much higher than B_host and B_transfer. In that case, the performance would be determined by the overall performance of the accelerator.

Summary
In this paper we have described the algorithms we implemented in FDPS to allow efficient and easy use of accelerators. Our algorithm is based on Barnes' vectorization, which has been used both on general-purpose computers (and thus in previous versions of FDPS) and on GRAPE special-purpose processors and GPGPUs. However, we have minimized the amount of communication between the CPU and the accelerator with an indirect addressing method, and we have further reduced the amount of calculation on the CPU side by reusing the interaction lists. The performance improvement over the simple method based on Barnes' vectorization on the CPU can be as large as a factor of ten on the current generation of accelerator hardware. We can also expect a fairly large performance improvement on future hardware, even though the relative communication performance is expected to degrade.
The version of FDPS described in this paper is available at https://github.com/FDPS/FDPS .

For a large message size, T_allgather should be determined by the injection bandwidth B_inj. Thus the elapsed time for MPI_Allgather is given by

T_allgather ∼ n_p^1/4 τ_allgather,startup + α_allgather b n_p / B_inj. (A5)

Figure 15 shows the measured and predicted times for MPI_Allgather against the message size b. Here we assume B_inj = 4.8 GB s^-1 from the results of a point-to-point communication test. The parameters in equation (A5) are listed in table 3. Our model agrees reasonably well with the measured data. Note the discontinuity between b = 100 and b = 1000; we think it is caused by a change of algorithm inside the MPI library.