SWIFT: A modern highly-parallel gravity and smoothed particle hydrodynamics solver for astrophysical and cosmological applications

Numerical simulations have become one of the key tools used by theorists in all the fields of astrophysics and cosmology. The development of modern tools that target the largest existing computing systems and exploit state-of-the-art numerical methods and algorithms is thus crucial. In this paper, we introduce the fully open-source highly-parallel, versatile, and modular coupled hydrodynamics, gravity, cosmology, and galaxy-formation code SWIFT. The software package exploits hybrid shared- and distributed-memory task-based parallelism, asynchronous communications, and domain-decomposition algorithms based on balancing the workload, rather than the data, to efficiently exploit modern high-performance computing cluster architectures. Gravity is solved for using a fast-multipole-method, optionally coupled to a particle mesh solver in Fourier space to handle periodic volumes. For gas evolution, multiple modern flavours of Smoothed Particle Hydrodynamics are implemented. SWIFT also evolves neutrinos using a state-of-the-art particle-based method. Two complementary networks of sub-grid models for galaxy formation as well as extensions to simulate planetary physics are also released as part of the code. An extensive set of output options, including snapshots, light-cones, power spectra, and a coupling to structure finders are also included. We describe the overall code architecture, summarise the consistency and accuracy tests that were performed, and demonstrate the excellent weak-scaling performance of the code using a representative cosmological hydrodynamical problem with $\approx$$300$ billion particles. The code is released to the community alongside extensive documentation for both users and developers, a large selection of example test problems, and a suite of tools to aid in the analysis of large simulations run with SWIFT.


INTRODUCTION
Over the last four decades, numerical simulations have imposed themselves as the key tool of theoretical astrophysics.By allowing the study of the highly non-linear regime of a model, or by allowing in-silico experiments of objects inaccessible to laboratories, simulations are essential to the interpretation of data in the era of precision astrophysics and cosmology.This is particularly true in the field of galaxy evolution and non-linear structure formation, where the requirements of modern surveys are such that only large dedicated campaigns of numerical simulations can reach the necessary precision and accuracy targets.Hence, it is no surprise that this field has seen a recent explosion in numerical tools, models, analysis methods and predictions (for reviews, see Somerville & Davé 2015;Naab & Ostriker 2017;Vogelsberger et al. 2020;Angulo & Hahn 2022;Crain & van de Voort 2023).
★ E-mail: mschaller@lorentz.leidenuniv.nlMeeting this growing demand and complexity of numerical simulations requires increasingly efficient and robust tools to perform such calculations.For instance, these softwares involve more and more coupled differential equations to approximate, themselves coupled to increasingly complex networks of sub-grid models.At the same time, the evolution of computer architectures towards massively parallel systems further complicates the software development task.The details of the machine used, as well as an intimate knowledge of parallelisation libraries, are often required to achieve anywhere near optimal on these the systems.This, however, often puts an additional burden on scientists attempting to make small alterations to the models they run and is often a barrier to the wider adoption of software packages.Nevertheless, the significant ecological impact of large astrophysical simulations (Stevens et al. 2020; Portegies Zwart 2020) make it imperative to address these technical challenges.
Jointly, all these needs and sometimes orthogonal requirements make constructing such numerical software packages a daunting task.For these reasons, developing numerical software packages that are both efficient and sufficiently flexible has now become a task under-taken by large teams of contributors with mixed expertise, such as our own.This, in turn, implies that better code development practices need to be adopted to allow for collaborative work on large code bases.
Despite all this, the community has seen the arrival of a number of simulation software packages that rise to these challenges, many of which have also been released publicly.This recent trend, guided by open-science principles, is an important development allowing more scientists to run their own simulations, adapt them to their needs, and modify the code base to solve new problems.The public release of software is also an important step towards the reproducibility of results.Whilst some packages only offer the core solver freely to the community, some other collaborations have made the choice to fully release all their developments; we follow this latter choice here.This is an essential step that allows for more comparisons between models (as well as between models and data) to be performed and to help understand the advantages and shortcomings of the various methods used.The characterisation and inclusion of uncertainty on model predictions, especially in the field of non-linear structure formation, is now becoming common practice (for examples targeted to the needs of large cosmology surveys see Heitmann et al. 2008;Schneider et al. 2016;Grove et al. 2022).
In this paper, we introduce the fully open-source code Swift 1 designed to solve the coupled equations of gravity and hydrodynamics together with multiple networks of extensions specific to various sub-fields of astrophysics.The primary applications of the code are the evolution of cosmic large-scale structure, cluster and galaxy formation, and planetary physics.A selection of results obtained with the code is displayed in Fig. 1.
Swift was designed to be able to run the largest numerical problems of interest to the large-scale structure, cosmology & galaxy formation communities by exploiting modern algorithms and parallelisation techniques to make efficient use of both existing and the latest CPU architectures.The scalability of the code was the core goal, alongside the flexibility to easily alter the physics modules.Our effort is, of course, not unique and there is now a variety of codes exploiting many different numerical algorithms and targeted at different problems in the ever-growing field of structure formation and galaxy evolution.Examples in regular use by the community include Art (Kravtsov et al. 1997), Falcon (Dehnen 2000), Flash (Fryxell et al. 2000), Ramses (Teyssier 2002), Gadget-2 (Springel 2005), Arepo (Springel 2010b), Greem (Ishiyama et al. 2012), Pluto (Mignone et al. 2012), CubeP 3 M (Harnois-Déraps et al. 2013), 2HOT (Warren 2013), Enzo (Bryan et al. 2014), Nyx (Almgren et al. 2013), Changa (Menon et al. 2015), Gevolution (Adamek et al. 2016), HACC (Habib et al. 2016), Gasoline-2 (Wadsley et al. 2017), Pkdgrav-3 (Potter et al. 2017), Phantom (Price et al. 2018), Athena++ (Stone et al. 2020), Abacus (Garrison et al. 2021), andGadget-4 (Springel et al. 2021) as well as many extensions and variations based on these solvers.They exploit a wide variety of numerical methods and are designed to target a broad range of astrophysics, galaxy formation, and cosmology problems.
Besides exploiting modern parallelisation concepts, Swift makes use of state-of-the-art implementations of the key numerical methods.The gravity solver relies on the algorithmically-ideal fast-multipole method (see e.g.Greengard & Rokhlin 1987;Cheng et al. 1999;Dehnen 2014) and is optionally coupled to a particle-mesh method using the Fourier-space representation of the gravity equations to 1 SPH With Inter-dependent Fine-grained Tasking model periodic boundary conditions (See Springel et al. (2021) for a detailed discussion of the advantages of this coupling over a pure tree approach).The hydrodynamics solver is based on the Smoothed Particle Hydrodynamics (SPH) method (see e.g.Price 2012; Springel 2010a) with multiple flavours from the literature implemented as well as our own version (Sphenix; Borrow et al. 2022).The code is also being extended towards other unstructured hydrodynamics methods (such as moving mesh (see e.g.Springel 2010b; Vandenbroucke & De Rĳcke 2016), renormalised mesh-free techniques or SPH-ALE (see e.g.Hopkins 2015)), which will be released in the future.For cosmological applications, Swift was extended to use the particlebased "delta-f" method of Elbers et al. (2021) to evolve massive neutrinos, allowing us to explore variations of the ΛCDM model.On top of these core components, the software package was extended to provide models for galaxy formation.We make two such models available: one based on that used for the Eagle project (Schaye et al. 2015;Crain et al. 2015) and a second one based on the Gear code (Revaz & Jablonka 2018;Hausammann 2021).These were designed to target very different scales and resolution ranges-massive galaxies and their large-scale environment for Eagle, and dwarf galaxies for Gear-and are hence highly complementary.The Eagle model is additionally and optionally extended with the implementation of jet feedback from active galactic nuclei by Huško et al. (2022).
Although Swift was originally developed for large-scale structure cosmology and galaxy formation applications, it quickly became clear that the benefits of the improved parallelisation of the coupled gravity-hydrodynamics solver could also be extended to other areas in astrophysics.In particular, the code has been extended to support planetary simulations by adding equations of state for the relevant materials.These extensions have been designed by expanding the existing SPH schemes to allow for multiple materials to interact, hence opening the window to simulate the collisions and interactions of planets and other bodies made of various layers of different materials.
Another, and to our knowledge unique, feature of Swift is the extent of the material distributed as part of the public release2 .We do not only distribute the core gravity and hydrodynamics solver but also offer the multiple modules for galaxy formation mentioned and other applications above, as well as the different flavours of SPH, the full treatment of cosmological neutrinos, and more than 100 ready-to-run example problems.All these elements are documented in detail, including developer instructions for extending the code.We emphasise too that the code is in active development and we expect future releases to further extend the physics modules presented here.This paper is arranged as follows.In Section 2 we present the overall Swift code design philosophy and core principles.The equations of SPH that the code solves are summarised in Section 3. In Section 4 and 5, we introduce the equations for gravity, neutrinos, and the cosmology framework used by the code.Sections 6 and 7 are dedicated to the input & output strategy and cosmological structure finding respectively.In Section 8, we present some extensions including galaxy formation (sub-grid) models and planetary physics models.We complete the code presentation in Section 9 with some implementation details and performance results.Finally, some conclusions are given and future plans are presented in Section 10.A selection of simulation results obtained with the Swift code, illustrating the huge range of problems that have already been targeted and the flexibility of the solver.The panels show: (a) a projection of the large-scale distribution of dark matter from a 10 Mpc/ℎ slice of the (500 Mpc/ℎ) 3 benchmark simulation of Schneider et al. (2016, § 5.5); (b) the temperature of the gas weighted by its velocity dispersion in a zoom-in simulation of a galaxy cluster using the Swift-Eagle galaxy formation model ( § 8.1) extracted from the runs of Altamura et al. (2023); (c) an idealised isolated galaxy from the Agora-suite (Kim et al. 2016) simulated using the Gear model ( § 8.2) rendered using pNbody (Revaz 2013); and (d) a snapshot extracted from a Moon-forming giant impact simulation of Kegerreis et al. (2022) using the planetary physics extension of the code ( § 8.5) and rendered using the Houdini software.

CODE DESIGN AND IMPLEMENTATION CHOICES
We begin by laying out the core design principles of Swift, in particular its strategy for making efficient use of massively parallel (hybrid shared and distributed memory) high-performance computing systems.

The case for a hydrodynamics-first approach
Astrophysical codes solve complex networks of coupled differential equations, often acting on a large dynamic range of temporal and spatial scales.Over time these pieces of software frequently evolve from their original baseline, through the addition of increasingly complex equations and physical processes, some of them treated as "sub-grid" models.This process is often repeated multiple times with each new iteration of the code, leading to multiple layers of additions on top of one another.In many cases these layers do not use the most appropriate algorithms or parallelisation strategies, but rather rely on the decisions made for the previous layers' implementations.
A particularly relevant example of this issue is the generalised use of a tree-code infrastructure (e.g.Barnes & Hut 1986), originally designed to solve the equations of gravity, to also perform a neighbour-finding search for SPH (see e.g.Monaghan 1992; Price 2012, for a review).Similarly, this gas neighbour-finding code is then sometimes reused to find neighbours of star particles (for feedback or enrichment), although the two species are clustered very differently.These kinds of infrastructure re-use are ubiquitous in contemporary simulation codes (e.g.Hernquist & Katz 1989;Couchman et al. 1995;Davé et al. 1997;Springel et al. 2001;Wadsley et al. 2004;Springel 2005Springel , 2010b;;Hubber et al. 2011;Wadsley et al. 2017;Price et al. 2018;Springel et al. 2021).Although appealing for its reduced complexity, and successful in the past, this approach can in some cases result in noticeable sub-optimal computational efficiency, in particular for modern computing hardware.The data structure itself (a nested set of grids) is not the culprit here, the way it is traversed is the limitation.For example, tree walks typically involve frequent jumps in memory moving up and down the tree, a pattern that is not ideal for modern CPUs or GPUs.Such a pattern is particularly sub-optimal to make efficient use of the hierarchy of memory caches as most of the data read will be discarded.Instead, modern hardware prefers to access memory linearly and predictably, which also allows for a more efficient utilisation of the memory bandwidth and caches, but also enables vector instructions (SIMD).To exploit vector instructions, we need all the elements of the vector (e.g.particles) to follow the same branching path.Thus, if an independent tree-walk has to be performed for each particle, and there is no obvious way to meaningfully group the particles into batches that will follow the same path in the tree, then it will seriously hinder our ability to use such vector instructions in our algorithms.Such an approach would hence, from the outset, forfeit 7/8 th3 of the available computing performance of a modern system.The loss of performance due to a tree-walk's inability to make use of the various cache levels is more difficult to quantify.However, the recent trend in computing hardware to add more layers of caches is a clear sign that their use ought to be maximised in order to extract performance out of the computing units.To back up this intuition, we performed a detailed analysis of the large cosmological simulations from the Eagle project (Schaye et al. 2015), based on a heavily modified version of the Gadget-3 code.It showed that the majority (> 65%) of the computing time was spent in the neighbour-finding operations (both for gas and stars) performed via a tree walk.
All these considerations suggest that a simulation code designed with a hydrodynamics-first approach could achieve substantial performance gains.In SPH-like methods, the neighbourhood is defined entirely by demanding a certain number  ngb ∼ 50-500 of particles around the particle of interest from which to compute physical quantities and their derivatives.Similarly, many sub-grid implementations (See e.g.§ 8.1, § 8.2, and § 8.3) rely on the same neighbourhoods for most of their calculations.Hence, grouping particles in cells that contain a number of particles ≳  ngb will naturally construct neighbourhoods of the required size.This will lead to the construction of a Cartesian grid with cells whose size is similar to the size of the search radius of the particles.The neighbour-finding algorithm can then be greatly simplified.Each particle only needs to search for particles in the cell where it lies and any of the directly adjacent cells (Fig. 2).To ensure this property is always fulfilled, we force the cell sizes to not be smaller than the search radii of the particles in a given region.If the condition is violated, this triggers a reconstruction of the grid.This so-called Verlet-list method (Verlet 1967) is the standard way neighbour-finding is performed in molecular dynamics simulations.The Verlet-list method.By constructing a mesh structure with cell sizes matching the search radius  of particles, the neighbour-finding strategy is entirely set by the geometry of the cells and the list of potential candidates is thus exactly known.The particle in black only has potential neighbours in the cell where it resides or any of the 8 (26 in 3D) directly neighbouring cells (in grey).The smoothly varying nature of SPH leads to particles having similar  in nearby regions, with this scale only varying slowly over the whole simulated domain.
Once the cell structure has been constructed, all the required information is known.There is no need for any speculative tree-walk and the number of operations, as well as the iteration through memory, are easily predictable.
In the case of SPH for astrophysics, the picture is slightly more complex as the density of particles and hence the size of their neighbourhoods can vary by orders of magnitude.The method can nevertheless be adapted by employing a series of nested grids (Fig. 3).Instead of constructing a single grid with a fixed cell size, we recursively divide them, which leads to a structure similar to the ones employed by adaptive-mesh-refinement codes (See § 9.1).As we split the cells into eight children, this entire structure can also be interpreted as an oct-tree.We emphasise, however, that we do not walk up and down the tree to identify neighbours; this is a key difference with respect to other packages.
With the cells constructed, the entire SPH neighbour-related workload can then be decomposed into two sets of operations (or two sets of tasks): the interactions between all particles within a single cell and the interactions between all particles in directly adjacent cells.Each of these operations involves ∼  2 ngb particle operations.For typical scenarios, that is an amount of work that can easily be assigned to one single compute core with the required data fitting nicely in the associated memory cache.Furthermore, since the operations are straightforward (no tree-walk), one can make full use of vector instructions to parallelise the work at the lowest level.
This approach, borrowed from molecular dynamics, was adapted for multi-resolution SPH and evaluated by Gonnet (2015) and Schaller et al. (2016).It forms the basis of the Swift code described here.We emphasise that such an approach is not restricted to pure SPH methods; other mesh-free schemes, such as the arbitrary Lagrangian-Eulerian (ALE) renormalised mesh-free schemes (Vila 1999;Gaburov & Nitadori 2011;Hopkins 2015;Alonso Asensio et al. 2023), finite volume particle methods (e.g.Hietel et al. 2001Hietel et al. , 2005;;Ivanova et al. 2013), or moving mesh (Springel 2010b; Van-Figure 3.An example of interactions between regions of different densities, i.e. particles with different search radii.Particle  will interact with the particles on the left and above using the smaller cells.It will interact with the particles on the right using the larger cell.The particle  will only interact using the cells at the coarser level.Thanks to the nested grids, interactions happen at different levels in the hierarchy depending on the local search radius.Once the grid is constructed, all the possible interactions at the different levels are known without the need of a speculative tree-walk. denbroucke & De Rĳcke 2016) also naturally fit within this paradigm as they also rely on the concepts of neighbourhoods and localised interactions.
As it turns out, the series of nested grids constructed to accommodate the distribution of particles also forms the perfect structure on which to attach a gravity solver.We argued against such re-use at the start of our presentation; the situation here is, however, slightly different.Unlike what is done for the hydrodynamics, the gravity algorithm we use requires a tree-walk and some amount of pointerchasing (jumps in memory) is thus unavoidable.We eliminated the tree-walk for the identification of SPH neighbourhoods, which was our original goal.We can now use a much more classic structure and algorithm for the gravity part of the Swift solver.Viewing the grid cells as tree nodes and leaves, we implement a Fast-Multipole-Method (FMM; see Greengard & Rokhlin 1987;Cheng et al. 1999;Dehnen 2002Dehnen , 2014;;Springel et al. 2021) algorithm to compute the gravitational interactions between particles.Here again, the work can be decomposed into interactions between particles in the same cell (tree-leaf), particles in neighbouring cells, or in distant cells.Once the tree is constructed, all the information is available and no new decision making is in principle necessary.The geometry of the tree and the choice of opening angle entirely characterises all the operations that will need to be performed.All the arithmetic operations can then be streamlined with the particles treated in batches based on the tree-leaves they belong to.

Parallelisation strategy: Task-based parallelism
All modern computer architectures exploit multiple levels of parallelism.The trend over the last decade has been to increase the number of computing units (CPUs, GPUs, or other accelerators) in a single system rather than to speed up the calculations performed by each individual unit.Scientific codes that target modern high-performance computing systems must thus embrace and exploit this massive parallelism from the outset to get the most out of the underlying hardware.
As discussed in the previous section, the construction of a cellbased decomposition of the computational volume leads to natural units of work to be accomplished by the various compute cores.In principle, no ordering of these operations is required: as long as all the internal (self i.e. particle-particle interactions of particles within a single cell) and external (pair i.e. particle-particle interactions of particles residing in two different cells) interactions of these cells have been performed, all particles will have iterated over all their neighbours.One can therefore list all these cell-based units of work or tasks and use a piece of software that simply lets the different compute threads on a node fetch a task, execute it, and indicate its successful completion.Such tasks can e.g.take all the particles in a cell and compute the  2 cell SPH (or gravity) interactions between them; or take all the particles and drift them (i.e.integrate their positions) forward.This constitutes a very basic form of task-based parallelism.In astrophysics, the ChanGa code (Menon et al. 2015) uses a similar parallel framework.
Compared to the traditional "branch-and-bound" approach in which all operations are carried out in a pre-specified order and where all compute units perform the same operation concurrently, as used by most other astrophysics simulation codes, this task-based approach has two major performance advantages.Firstly, it dynamically balances the work load over the available compute cores.In most simulations, the distribution of computational work over the simulation domain is highly inhomogeneous, with a small part of the volume typically dominating the total cost.Decomposing this work a priori (i.e.statically) is a very challenging problem, and practical solutions inevitably lead to substantial work imbalance.By not preassigning regions to a specific computing unit, the task scheduler can instead naturally and dynamically assign fewer cells to an individual computing unit if they turn out to have a high computational cost, and vice versa.
The second advantage of the task-based approach is that it naturally allows the gravity and hydrodynamics computations to be performed at the same time without the need for a global synchronisation point between the two that typically leads to (sometimes substantial) idle time.The list of tasks simply contains both kinds of calculations and the threads can pick any of them; there is no need for the code to wait for all the gravity operations to be done before the SPH calculations can begin, or vice versa (Fig. 4).
This tasking approach forms the basis of Swift.In its form discussed above, however, it is too simple for the complex physics entering actual simulations.Most SPH implementations require multiple loops over the particles in their neighbourhoods.Sub-grid models often require that some hydrodynamic quantities be computed before they can themselves operate.One could first construct a list of all tasks related to the first loop and then distribute the threads on it.A second list could then be constructed of all the tasks related to the second loop and the process repeated.This would, however, re-introduce global synchronisation points between the individual lists, leading to undesirable idle time.Instead, we construct a single list but introduce so-called dependencies between operations acting on a given cell (and hence its particles).For instance, all the first loop tasks have to be performed on a given cell before the tasks associated with the second loop can be performed.This transforms the list of tasks into an orientated graph with connections indicating the localised ordering of the physical operations to perform.This graph can now include all the operations, even the ones not requiring neighbour loops (e.g.time integration).Different cells can thus naturally progress in a given time step at different rates, leading to no global barriers between each loop (Fig. 5).When a task has completed, it reports this to all other tasks that depend on it.Once all dependencies for a task are satisfied (i.e.all the other tasks that must have run before it in the graph have completed), it is allowed to run; it is placed in a queue from where it can be fetched by available compute threads.
In addition to this mechanism, the task scheduling engine in the Swift code also uses the notion of conflicts (Fig. 4) to prevent two threads from working on the same cell at the same time.This eliminates the need to replicate data in different caches, which is detrimental to performance.More crucially, it also ensures that all work Once the particles have been drifted to the current point in time, the first loop over neighbours can be run.The so-called "ghost" task serves mainly to reduce the number of dependencies between successive loops over the neighbours.Once the second loop has run, the time integration ( § 2.4) can be performed.In parallel to the SPH operations, the gravity tasks (condensed into a single one here for clarity) can be run as they act on different subsets of the data.To prevent different threads from over-writing each others' data, the various SPH loop tasks (1 self and 26 pairs) are prevented from running concurrently via our conflict mechanism.Additional loops over neighbours, used for instance in more advanced SPH implementations, in sub-grid models or for radiative transfer, can be added by repeating the same pattern.They can also be placed after the time integration tasks if they correspond to terms entering the equations in an operator splitting way.
Figure 5.The execution of various tasks using 8 threads over the course of one time-step, extracted from a cosmological hydrodynamical simulation with 2 × 128 3 particles using only gravity and hydrodynamics on a shared-memory system.The different rows correspond to the different threads on the compute node.The work each thread performs is coloured to correspond to the task type it executes.Yellow, for instance, corresponds to a self-task performing gravity operation on a cell, whereas navy blue corresponds to a pair-task performing a 3 rd SPH loop over two cells.Note that some tasks displayed in the legend do not actually run in this example.For instance, no MPI-related send or recv tasks are executed here.We show them in the legend for consistency with Fig. 9.The long bands are actually a series of the same task acting on different cells one after the others.There are for instance 512 yellow tasks.As desired, the threads display essentially no idle time (white gaps) between operations and all end their work at very nearly the same time.In other words, the load balancing is near-perfect with no parallel performance loss.The small gap at the start corresponds to cost of deciding what tasks to activate for this step.Bands of a given colour can have different lengths, indicating that tasks can correspond to very different workloads depending on how many particles are present in the cell(s) on which they act.At a given point in time, different threads often process different task types, and hence solve a different set of equations.This is different from the traditional branch-and-bound parallelism approach where all threads perform the same action and have to wait until they have all completed it before moving to the next piece of physics.
performed inside a single task is intrinsically thread-safe without the need to use atomic operations.Because the code executed by a thread inside a task is guaranteed to run on a private piece of data, developers modifying the physics kernels need not worry about all the usual complexities related to parallel programming.This reduces the difficulty barrier inherent to programming on modern architectures and allows astrophysicists to easily modify and adapt the physics model in Swift to their needs.To our knowledge, the combination of dependency and conflict management in the tasking engine is a unique feature of Swift4 .For a detailed description, we refer the reader to Gonnet et al. (2016), where a stand-alone problem-agnostic version of this task scheduling engine is introduced.
One additional advantage of this conflict mechanism is the opportunity to symmetrize the operations.As no other compute thread is allowed to access the data within a cell, we can update both particles that take part in an interaction simultaneously, effectively halving the number of interactions to compute.This is typically not possible in a classic tree-walk scenario as each particle would need to independently search for its neighbours.The same optimisation can be applied to the gravity interactions involving direct interactions of particles, usually between two tree leaves.
Last but not least, the thread-safe nature of the work performed by the tasks, combined with the small memory footprint of the data they act on, leads to them being naturally cache efficient but also prime candidates for SIMD optimization.The gravity calculations are simple enough that modern compilers are able to automatically generate vector instructions and thus parallelise the loops over pairs of particles.For instance, on the realistic gravity-only test problem of §4.6 we obtain speed-ups of 1.96x, 2.5x, and 3.14x on the entire calculation when switching on AVX, AVX2, and AVX512 autovectorization on top of regular optimization levels.This could also be the case for simple versions of the SPH loops (see discussion by Willis et al. 2018).The cut-off radius beyond which no interactions take place does, however, allow for additional optimizations.Borrowing, once more, from molecular dynamics, we implement sorted interactions and pseudo-Verlet lists (Gonnet 2013).Instead of considering all particles in neighbouring cells as potential candidates for interactions, we first sort them along the axis linking the cells' centres.By walking along this axis, we drastically reduce the number of checks on particles that are within neighbouring cells but outside each other's interaction range, especially in the cases where the cells only share an edge or a corner (Fig. 6).This way of iterating through the particle pairs is much more complex and compilers are currently unable to recognize the pattern and generate appropriate vector instructions.We therefore implemented SIMD code directly in Swift, for some of the flavours of SPH, following the method of Willis et al. (2018).This approach does, however, break down when more complex physics (such as galaxy formation models, see §8) are solved, as too many variables enter the equations.
Despite the advantages outlined above, one possible drawback to the task-based approach, as implemented in Swift, is the lack of determinism.The ordering in which the tasks are run will be different between different runs, even on the same hardware and with the exact same executable.This can (and does) lead to small differences in the rounding and truncation of floating point numbers throughout the code, which, in turn will lead to slightly different results each time.This is, of course, not an issue on its own as every single one of these results was obtained using the same combination of operations and within the same set of floating point rules.As an example, the study by Borrow et al. (2023a) shows that the level of randomness created by the code is consistent with other studies varying random seeds to generate different galaxy populations.The same differences between runs can also arise in pure MPI codes or when using other threading approaches such as OpenMP as neither of these guarantee the order of operations (at least in their default operating modes).Our approach merely exacerbates these differences.In practice, we find that the main drawback is the difficulty this intrinsic randomness can generate when debugging specific math-operation related problems.We note that nothing prevents us from altering the task scheduling engine to force a specific order.This would come at a performance cost, but could be implemented in a future iteration of the code to help with the aforementioned debugging scenario.Here the particles in the left cell receive contributions from the particles in the right cell.In the first phase, all particles are projected onto the axis linking the two cells (grey line) and sorted based on their projected coordinates.In the interaction phase, the particles iterate along this axis to identify candidates.For instance, the particle  (in black) will identify plausible neighbours (in light and dark grey) on this axis up to a distance   (indicated by the black ruler).These candidates are then tested for 3D distance to verify whether they are genuine neighbours (i.e.within the dotted circle and highlighted in dark grey here) or not.With this technique, the number of false-positives (light grey) is greatly reduced compared to the total number of possible candidates in the right-hand cell (here, 3 vs.11).The advantage is even greater when considering the next particle (from right to left) on the axis.Particle  knows that it will at most have to iterate on the axis up to the end of the ruler set by particle , i.e. its list of candidates is at most as large as 's for the same value of .Moving from particle to particle in the left-hand cell, we can also stop the whole operation as soon as the distance on the axis does not reach at least the first particle in the right-hand cell.Because particles move only by small amounts between steps, the sorted list can be re-used multiple times provided a sufficient buffer is added to the length of the black ruler.Finally, the process is reversed to update the particles on the RHS with contributions from particles in the left cell.In 3D, even larger gains are achieved when the two cells share only an edge or just a corner.

Beyond single-node systems
So far, we have described the parallelisation strategy within single shared-memory compute nodes.To tackle actual high-performance computing (HPC) systems and run state-of-the-art calculations, mechanisms must be added to extend the computational domain to more than one node.The classic way to achieve this is to decompose the physical volume simulated into a set of discrete chunks and assign one to each compute node or even each compute thread.Communications, typically using an MPI implementation, must then be added to exchange information between these domains, or to perform reduction operations over all domains.
Swift exploits a variation of this approach, with two key guiding principles: first, MPI communication is only used between different compute nodes, rather than between individual cores of the same node (who use the previously-described tasking mechanism to share work and data between each other).Second, we base the MPI domain decomposition on the same top-level grid structure as used for the neighbour finding, and aim to achieve a balanced distribution of work, rather than data, between nodes.
The base grid constructed for neighbour finding ( § 2.1) is split into regions that get assigned to individual compute nodes.The algorithm used to decide how to split the domain will be described in § 9.3; we focus here on how the exchange of data is integrated into the task-based framework of Swift.
As the domain decomposition assigns entire cells to compute nodes, none of the tasks acting on a single cell require any changes; all their work is, by definition, purely local.We only need to consider To allow for the interaction to happen, we create a set of proxy cells on the first node and create communication tasks (arrows) that import the relevant particles (in grey) from the second node.We also create a dependency between the communication and the pair task to ensure the data have arrived before the pair interaction can start.The pair task can then update the particles entirely locally, i.e. by exploiting exactly the same piece of code as for pairs that do not cross domain boundaries.A similar proxy exists on the other node to import particles in the opposite direction in order to process the pair also on that node and update its local particles.
operations involving pairs of particles, and hence pairs of cells, such as SPH loops, gravitational force calculation by direct summation (see § 4.3), or sub-grid physics calculations (see § 8).
Consider a particle needing information from a neighbour residing on another node to update its own fields.There are generally two possible approaches here.The first one is to send the particle over the network to the other node, perform a neighbour finding operation there, update the particle, and send the particle back to its original node.This may need to be repeated multiple times if the particle has neighbours on many different nodes.The second approach instead consists of importing all foreign neighbours to the node and then only updating the particles of interest local to the node once the foreign neighbour particle data is present.We use this second approach in Swift and construct a set of proxy cells to temporarily host the foreign particles needed for the interactions.The advantage of this approach is that it requires only a single communication, since no results have to be reported back to the node hosting the neighbour particle.Also, since we constructed the grid cells in such a way that we know a priori which particles can potentially be neighbours, and since we attach the communications to the cells directly, we also know which particles to communicate.We do not need to add any walk through a tree to identify which cells to communicate.
As Swift exploits threads within nodes and only uses MPI domains and communications between nodes, we actually construct relatively large domains when compared to other MPI-only software packages that must treat each core as a separate domain.This implies that each node's own particle (or cell) volume is typically much larger than any layer of proxy cells surrounding it.In typical applications, the memory overhead for import buffers of foreign particles is therefore relatively small.Furthermore, the trend of the last decade in computing hardware is to have an ever larger number of cores and memory on each node, which will increase the volume-to-surface ratio of each domain yet further.Note, however, that some of these trends are not followed by a proportional raise in memory bandwidth and some architectures also display complex NUMA designs.On such systems it may be beneficial to use a few MPI domains per node rather than a single one.7.Each compute node has a task to drift its own local cell.The foreign node (here below the thick black line) then executes a send operation.On the local node, a receive task is run to get the data before unlocking the dependency (solid arrow) and letting the scheduler eventually run the pair - interaction task.The communication itself (dotted arrow) implicitly acts as a dependency between the nodes.The converse set of tasks exists on the other compute node to allow the pair - to also be run on that node.
Once the proxy cells have been constructed, we create communication tasks to import their particles (see Fig. 7).When the import is done, the work within the pair task itself is identical to a purely local pair.Once again, users developing physics modules need therefore not be concerned with the complexities of parallel computing when writing their code.
The particles need to be communicated prior to the start of the pair interactions.After all, the correct up-to-data particle data needs to be present before the computation of the interactions for them to be correct.The commonly adopted strategy is to communicate all particles from each boundary region on all nodes to their corresponding proxy regions before the start of the calculations.This can be somewhat inefficient, for two reasons.Firstly, it typically saturates the communication network and the memory bandwidth of the system, leading to poor performance, especially on smaller, mid-range, computing facilities where the commuication hardware is less powerful than in big national centres.Secondly, no other operations are performed by the code during this phase, even though particles far from any domain boundaries require no foreign neighbours at all and could therefore, in principle, have their interactions computed in the meantime.The traditional branch-and-bound approach prevents this, but Swift treats the communications themselves as tasks that can naturally be executed concurrently with other types of calculation (see above).
At a technical level, we achieve this concurrency by exploiting the concept of non-blocking communications offered by the MPI standard5 .This allows one compute node to mark some data to be sent and then return to process other work.The data are silently transferred in the background.On the receiving end, the same can be done and a receive operation can be posted before the execution returns to the main code.One can then probe the status of the communication itself, i.e. use the facilities offered by the MPI standard to know whether the data have arrived or are still in transit.By using such a probe, we can construct send and receive communication tasks that can then be inserted in the task graph where needed and behaving like any of the other (computing) tasks.Once the data have arrived on the receiving side, the receive task can simply unlock its dependencies and the work (pair tasks) that required the foreign data can now be executed (Fig. 8).By adding the communications in the tasking system, we essentially allow computational work to take place at  5 but now split across 4 nodes, each using 8 threads, i.e. a combination of distributed and shared parallelism.This is the hybrid mode in which Swift is run for large calculations that do not fit on a single node.Each panel corresponds to a different compute node.Within each panel the different rows correspond to the different threads on the compute node.The work each thread performs is coloured to correspond to the task type it executes using the same scheme as on Fig. 5.The vertical dashed line on the right of each panel indicates the end of the time-step, which is determined by the point where the last compute node finishes.As can be seen, the node-to-node balance is not perfect; some nodes complete their work slightly earlier.This is due to the MPI library requiring some time to process messages in an unpredictable way, which the domain decomposition algorithm ( §9.3) can thus not compensate for.This leads to small gaps in the execution (white gaps in the coloured bands).All required communication for the tasks occurs within this same figure, and overlaps (asynchronously) with work that only has local or already satisfied dependencies.All the exchanges happen whilst other tasks are running.The communications are overlapping with actual work.Note also that with less work per node overall compared to the shared-memory case, shown in Fig. 5, it is easier to see here that a given point in time different threads often process different task types, and hence solve a different set of equations.
the same time as communications.Note that the communication operations can be performed by any of the running threads.We do not reserve one thread for communications.The tasks not requiring foreign data can run as normal while the data for other pairs is being exchanged, eliminating the performance loss incurred from waiting for all exchanges to complete in the traditional approach.The large volume-to-surface ratio of our domains (see above) implies that there are typically many more tasks that require no foreign data than ones that do.There is, hence, almost always enough work to perform during the communication time and overheads.
An example of task execution over multiple nodes is displayed on Fig. 9.This is running the same simulation as was shown on Fig. 5 but exploiting 4 nodes each using 8 threads.We show here the full hybrid distributed and shared memory capability of Swift.
Here again, tasks of different kind are executed simultaneously by different threads.No large data exchange operation is performed at the start of the step; the threads immediately start working on tasks involving purely local data whilst the data is being transferred.The work and communication are thus effectively overlapping.The four nodes complete their work at almost the same time and so do the threads within each node, hence showing near perfect utilisation of the system and thus the ability to scale well.
The ability of Swift to perform computations concurrently with MPI communications reduces idle time, but the actual situation is somewhat more complex.In reality, the MPI library as well as the lower software layers interacting with the communication hardware also need to use CPU cycles to process the messages and perform the required copies in memory, so that a complete overlap of communications and computations is not feasible.This is often referred to as the MPI progression problem.Such wasted time can for instance be seen as blank gaps between tasks on Fig. 9.The extra cost incurred can vary dramatically between different implementations of the MPI protocol and depending on the exact hardware used.A similar bottleneck can occur when certain sub-grid models requiring many neighbour loops are used (e.g.Chaikin et al. 2023).These may generate many back-and-forth communications with only little work to be done concurrently.
We remark, however, that whilst the communications taking place during a time-step are all formally asynchronous, we still have a synchronisation point at the end of a step where all the compute nodes have to wait.This is necessary as we need all nodes to agree what the next time-step size is for instance.This can be detrimental in the cases where the time-step hierarchies become very deep (see below) and when only a handful of particles require updates every step.A strategy akin to the one used by the Dispatch code (Ramsey et al. 2018), where regions can evolve at independent rates, would remove this last barrier.In practice, thanks to our domain decomposition aiming to balance the work not the data (see §9.3), this barrier is typically not a bottleneck for steps with a lot of work as the nodes all take a similar amount of time to reach this end-of-step barrier.

Local time-step optimisations
In most astrophysical simulations, not only do the length-scales of interest span several orders of magnitude, but so too do the timescales.It would therefore, typically, be prohibitively expensive to update all particles at every step; localised time-step sizes or even per-particle time-steps are essential.For a system governed by a Hamiltonian, it is possible to rewrite the classic leapfrog algorithm and consider sub-cycles where only a fraction of the particles receive acceleration updates (a kick operation) whilst all other particles are only moved (drifted) to the current point in time (Duncan et al. 1998;Springel 2005).Swift exploits this mechanism by first creating long time-steps for the long-range gravity interaction ( § 4.5), where all the particles are updated, and then creating a hierarchy of smaller steps using powers-of-two subdivisions, where only the short-range gravity and hydrodynamic forces are updated (Hernquist & Katz 1989).This hierarchy is implemented by mapping the physical time from start to end of a simulation to the range of values representable by an integer.A jump of one thus represents the minimum timestep size reachable by a particle (e.g.( end −  begin )/2 32 for a 32-bit integer.).Each actual time-step size is then a power-of-two multiple of this base quantum of time, hence ensuring exactly the hierarchy of time-steps we expected.Using a 64-bit integer, we get a maximal possible number of steps in a run of 2 64 ≈ 10 19 , much more than will be necessary.
In real applications, this hierarchy can be more than 10 levels deep, meaning that the longest time-step sizes can be >1000× larger than the base time-step length (see e.g.Borrow et al. 2018).
The speed gains obtained by updating only a small fraction of the particles are immense.However, at the level of code implementation and parallelisation, this concept creates complicated challenges.Firstly, it requires added logic everywhere to decide which particle and hence which cell needs updating.This can be detrimental on some architectures (e.g.GPUs or SIMD vector units) where more streamlined operations are required.Secondly, and most importantly, it leads to global simulation steps where less computing time is spent moving the system forward than is spent in overheads.This challenge cannot simply be overcome by making the software more parallel; there will be steps where there are fewer particles to update than there are CPU threads running.As small steps (i.e.steps with a low number of particles to update) are orders of magnitude more frequent than the base step, they can actually dominate the overall simulation run time.It is hence of paramount importance to minimize all possible overheads.
One of the key overheads is the time spent communicating data across the network.The domain decomposition algorithm used in Swift (see § 9.3) attempts to minimize this by not placing frequently active particles (or their cells) close to domain boundaries.If this is achieved, then entire steps can be performed without a single message being exchanged.The other main overhead is the drift operation.In the classic sub-cycling leapfrog (e.g.Quinn et al. 1997;Springel 2005), only the active particles are kicked, but all particles are drifted, since they could potentially be neighbours of the active ones.Whilst the drift is easily scalable, as it is a pure per-particle operation, it would nevertheless be wasteful to move all particles for only the handful of them that are eventually found in the neighbourhood of the few active particles.In Swift, as is also done in some other modern codes, we alleviate this by first identifying the regions of the domain that contain active particles and all their neighbours.We then activate the drift task for these cells and only them.We thus do not drift all the particles just the required ones, which is, to our knowledge, not an approach that is discussed in the literature by other authors.This additional bit of logic to determine the regions of interest is similar to a single shallow tree-walk from the root of the tree down to the level where particles will be active.The benefit of this reduced drift operation is demonstrated by Borrow et al. (2018).We note that Swift can nevertheless be run in a more standard "drift-everything" mode to allow for comparisons.

Language, implementation choices, and statistics
The design described above is, in principle, agnostic of the programming language used and of the precise libraries exploited6 to implement the physics or parallelism approach.It was decided early on to write the code in the C language (specifically using the GNU99 dialect) for its ease of use, wide range of available libraries, speed of compilation, and access to the low level threads, vector units, and memory management of the systems.
The task engine exploited by Swift is available as a stand-alone tool, QuickSched (Gonnet et al. 2016), and makes use of the standard POSIX threads available in all UNIX-based systems.The advantage of using our own library over other existing alternative (e.g.Cilk (Blumofe et al. 1995), TBB (Reinders 2007), SMPSs (Perez et al. 2008), StarPU (Augonnet et al. 2011), or the now standard OpenMP tasks) is that it is tailored to our specific needs and can be adapted to precisely match the code's structure.We also require the use of task conflicts (see § 2.2) and the ability to interface with MPI calls (see § 2.3), two requirements not fulfilled by other alternatives when the project was started.
By relying on simple and widely available tools, Swift can be (and has been) run on a large variety of systems ranging from standard x86 CPUs, ARM-based computers, BlueGene architecture, and IBM Power microprocessors.
The entirety of the source code release here comprises more than 150 000 lines of code and 90 000 lines of comments.These large numbers are on the one hand due to the high verbosity of the C language and on the other hand due to the extent of the material released and the modular nature of the code.The majority of these lines are contained in the code extensions and i/o routines.Additionally, about 30 000 lines of python scripts are provided to generate and analyse examples.The basic Cocomo model (Boehm 2000) applied to our code base returns an estimate of 61 person-years for the development of the package.
Swift was also designed, from the beginning, with a focus on an open and well-documented architecture both for ease of use within the development team but also for the community at large.For that reason, we include fifteen thousand lines of narrative and theory documentation7 , a user onboarding guide, and large open-source, well-documented, and well-tested analysis tools8 .

SMOOTHED PARTICLE HYDRODYNAMICS SOLVER
Having discussed the mechanism used by Swift to perform loops over neighbouring particles, we now turn to the specific forms of the equations for hydrodynamics evolved in the code.
Smoothed particle hydrodynamics (SPH; Lucy 1977;Gingold & Monaghan 1977) has been prized for its adaptivity, simplicity, and Lagrangian nature.This makes it a natural fit for simulations of galaxy formation, with these simulations needing to capture huge dynamic ranges in density (over 4 orders of magnitude even for previousgeneration simulations), and where the coupling to gravity solvers is crucial.Future releases of Swift will also offer more modern hydrodynamics solver options (see §10.2).
Swift implements a number of SPH solvers, all within the same neighbour-finding and time-stepping framework.These solvers range from a basic re-implementation of equations from Monaghan (1992) in §3.1 & §3.2, to newer models including complex switches for artificial conductivity and viscosity.We introduce our default scheme Sphenix in §3.3 and present our implementation of a time-step limiter and of particle splitting in §3.4 and §3.5 respectively.For completeness, we give the equations for the additional flavours of SPH available in Swift in Appendix A. Note also that in this section, we limit ourselves to the equations of hydrodynamics in a non-expanding frame.Information on comoving time integration is presented later in § 5.4.As comparing hydrodynamic models is complex, and often a significant level of investigation is required even for a single test problem (e.g.Agertz et al. 2007;Braspenning et al. 2023), we do not directly compare the implemented models in Swift here.We limit our presentation to the classic "nIFTy cluster" problem (Sembolini et al. 2016, §3.6), which is directly relevant to galaxy formation and cosmology applications.For our fiducial scheme, Sphenix, the results of many of the standard hydrodynamics tests were presented by Borrow et al. (2022).The initial conditions and parameters for these tests, and many others, are distributed as part of Swift and can be run with all the schemes introduced below.

A brief introduction to SPH
SPH is frequently presented from two lenses: the first, a series of equations of motion derived from a Lagrangian with the constraint that the particles must obey the laws of thermodynamics (see e.g.Nelson & Papaloizou 1994;Monaghan & Price 2001;Springel & Hernquist 2002;Price 2012;Hopkins 2013); or a coarse-grained, interpolated, version of the Euler equations (as in Monaghan 1992).
As the implemented methods in Swift originate from numerous sources, there are SPH models originally derived from, and interpreted through, both of these lenses.Here, we place all of the equations of motion into a unified framework for easy comparison.
SPH, fundamentally, begins with the kernel9 .This kernel, which must be normalised, must have a central gradient of zero, and must be isotropic, is usually truncated at a compact support radius .We describe the kernel as a function of radius  and smoothing length ℎ, though all kernels implemented in Swift are primarily functions of the ratio between radius and smoothing length /ℎ to ensure that the function remains scale-free.The kernel function where here  d is the number of spatial dimensions and (/ℎ) is a dimensionless function that describes the form of the kernel.Throughout, Swift uses the Dehnen & Aly (2012) formalism, where the smoothing length of a particle is independent of the kernel used, with the smoothing length given by ℎ = √ 2 ln 2 , with  the full-width half maximum of a Gaussian.The cut-off radius  =  K ℎ is given through a kernel-dependent  K .We implement the kernels from that same paper, notably the Wendland (1995) C2, C4, and C6 kernels, as well as the Cubic, Quartic, and Quintic splines (Monaghan & Lattanzio 1985) using their normalisation coefficients.Generally, we recommend that production simulations are performed with the Wendland-C2 or Quartic spline kernels for efficiency and accuracy reasons.

Constructing the number density & smoothing length
The kernel can allow us to construct smoothed, volume-dependent quantities from particle-carried quantities.Particle-carried quantities are intrinsic to individual mass elements (e.g.mass, thermal energy, and so on), whereas smoothed quantities (here denoted with a hat) are created from particle-carried quantities convolved with the kernel across the smoothing scale (e.g.mass density, thermal energy density, and so on).
The most basic smoothed quantity is referred to as the particle number density, for a sum runs over neighbouring particles .This is effectively a partition of unity across the particle position domain when re-scaled such that n(ℎ) for all positions r and constant smoothing scale 10 , assuming that the smoothing length ℎ is chosen to be large enough compared to the inter-particle separation.Given a disordered particle arrangement (i.e.any arrangement with non-uniform particle spacing in all dimensions), it is possible to invert eq. 3 with a fixed value of  to calculate the expected smoothing length given a measured number density from the current particle arrangement.In principle, this is possible for all values of , but in practice there is a (kernel dependent, see Dehnen & Aly 2012) lower limit on  which gives acceptable sampling of the particle distribution (typically  > 1.2).Higher values of  give a smoother field, and can provide more accurate gradient estimates, but lead to an increase in computational cost.For some kernels, high values of  can also lead to occurrences of the pairing instability (Price 2012;Dehnen & Aly 2012).
Given a computation of n at the position of a particle r  , for a given smoothing length ℎ  , an expected particle number density can be computed from eq. 3.In addition, we compute the derivative where here    ≡ |r  − r  |, and    ≡  (   , ℎ  ), with ∇  implying a spatial derivative with respect to r  .This gradient is used, along with the difference between the expected density and measured density, within a Newton-Raphson scheme to ensure that the smoothing length ℎ  corresponds to eq. 3 to within a relative factor of 10 −4 by default.
We calculate the mass density of the system in a similar fashion, with this forming our fundamental interpolant: where here   is the particle mass.We choose to use the particle number density in the smoothing length calculation, rather than mass, to ensure adequate sampling for cases where particle masses may be very different, which was common in prior galaxy formation models due to stellar enrichment sub-grid implementations.
Swift calculates (for most implemented flavours of SPH) the pressure of particles based upon their smoothed density and their internal energy per unit mass , or adiabat , with where  is the ratio of specific heats.

Creating general smoothed quantities
Beyond calculating the density, any quantity can be convolved with the kernel to calculate a smoothed quantity.For a general particlecarried quantity , with spatial derivatives provide basic estimates of smoothed quantities.Better estimators exist, and are used in specialised cases (see e.g.Price 2012), but in all other cases when we refer to a smoothed quantity these are the interpolants we rely on.

SPH equations of motion
Following Hopkins (2013), we write equations of motion for SPH in terms of two variables describing a volume element for conserving neighbour number ( x in their formalism, here we use ) and a volume element for the thermodynamical system ( in their formalism, here we use ).We then can write the conservative equations of motion for SPH as derived from a Lagrangian as follows: where here the factors    are given by The second equation of motion, i.e. the one evolving the thermodynamic variable ( or ) depends on the exact flavour of SPH, as described below.

Basic SPH flavours
Swift includes two so-called traditional SPH solvers, named Minimal (based on Price (2012)) and Gadget2 (based on Springel ( 2005)), which are Density-Energy and Density-Entropy-based solvers respectively.This means that they use the particle mass as the variable  in eq. 10 and evolve the internal energy  or, respectively the adiabat  (eq.6), as thermodynamic variable.These two solvers use a basic prescription for artificial viscosity that is not explicitly time-varying.They are included in the code mainly for comparison to existing literature and to serve as basis for new developments.
These two solvers share the same equation of motion for velocity and internal energy, d  d but as they each track different thermodynamic variables (, internal energy per unit mass for Minimal, and entropy/adiabat  for Gad-get2).In this latter flavour, the equation for the adiabat is absent as d/d = 0 in the absence of additional source terms.In the equations above we also defined, v   ≡ v  − v  , and which is known as the 'f-factor' or 'h-factor' to account for non-uniform smoothing lengths.
In addition to these conservative equations, the two basic SPH solvers include a simple viscosity prescription, implemented as an additional equation of motion for velocity and internal energy (entropy).The artificial viscosity implementation corresponds to the equations 101, 103, and 104 of Price ( 2012), with   = 0 and  = 3.We solve the following equations of motion where the interaction-dependent factor These rely on the signal velocity between all particles, which is also used in the time-step calculation, and is defined in these models as where the constant  = 3.Finally, the viscosity is modulated using the Balsara (1989) switch, which removes viscosity in shear flows.The switch is applied to the viscosity constants  V,  is as follows: where here  V = 0.8 is a fixed constant,  s, is the gas sound speed, and  = 0.0001 is a small dimensionless constant preventing divisions by zero.

The Sphenix flavour of SPH
The Sphenix flavour of SPH is the default flavour in Swift, and was described in detail by Borrow et al. (2022).Sphenix inherits from the Density-Energy formulation of SPH, uses similar discontinuity treatments and limiters as the Anarchy scheme use in the Eagle cosmological simulations (see Schaller et al. 2015;Schaye et al. 2015, and Appendix A2), and uses a novel limiter for feedback events.Sphenix was designed with galaxy formation applications in mind.As the scheme uses the Density-Energy equation of motion and not a pressure-smoothed implementation ( §A1), it must use a comparatively higher amount of conduction at contact discontinuities to avoid spurious pressure forces (e.g.Agertz et al. 2007;Price 2008Price , 2012)).As such, removing the additional conduction in scenarios where it is not warranted (in particular strong shocks) becomes crucial for accurate modelling and to not dissipate energy where not desired.
As such, the major equations of motion are the same as described above in the tradition SPH case, with the dissipationless component being identical to eq. 13.The artificial viscosity term, however, is more complex.We no longer use a constant  V in eq. 17.We follow the framework of Morris & Monaghan (1997) and turn it into a timeevolving particle-carried quantity.This scalar parameter is integrated forward in time using with   =  K ℎ  the kernel cut-off radius, and where which ensures that  V, decays away from shocks.In these expressions, ℓ = 0.05 is the viscosity decay length, and  V,max = 2.0 is the maximal value of the artificial viscosity parameter.The   term is a shock indicator (see Cullen & Dehnen 2010) which we use here to rapidly increase the viscosity in their vicinity.For this detector, we calculate the time differential of the velocity divergence using the value from the previous time-step, Additionally, If  V,loc, >  V, (), then  V, ( + Δ) is set to  V,loc, to ensure a rapid increase in viscosity when a shock front approaches.The value of the parameter entering the usual viscosity term (eq.17) is then which exploits the Balsara (1989) switch so that we can rapidly shut down viscosity in shear flows.Note that, by construction, these terms ensure that the interaction remains fully symmetric.
In Sphenix, we also implement a thermal conduction (also known as artificial diffusion) model following Price (2008), by adding an additional equation of motion for internal energy where here the new dimensionless parameter for the artificial conduction strength is constructed using a pressure weighting of the contribution of both interacting particles: with the  c, evolved on a particle-by-particle basis with a similar time dependency to the artificial viscosity parameter.The artificial conduction uses the Laplacian of internal energy as a source term, in an effort to remove nonlinear gradients of internal energy over the kernel width, with where here  c = 1 is a dimensionless parameter, and  c,,min = 0 is the minimal value of the artificial conduction coefficient.The artificial conduction parameter is bounded by a maximal value of  c,,min = 2 in all cases.The value of  c is high compared to other schemes to ensure the conduction parameter can vary on short timescales.Note that the velocity entering the last term of eq.29 is not the signal velocity but we instead follow Price et al. (2018) and write This is a combination of the signal velocities used by Price et al. (2018) for the cases with and without gravity.As the thermal conduction term (eq.27) is manifestly symmetric, no equation of motion for velocity is required to ensure energy conservation.Finally, we ensure that the conduction is limited in regions undergoing strong shocks, limiting  c by applying with  c,max = 1 a constant, and Note the explicit appearance of the viscosity parameters  V, in these expressions.More information on the motivation behind the limiter, and its implementation, are presented by Borrow et al. (2022).

Time-step limiter
For all these schemes, a necessary condition to ensure energy conservation, especially when additional source terms such as stellar feedback are in use, is to impose some form of limit between the time-step size of neighbouring particles.This allows for information to be correctly propagated between particles (see Durier & Dalla Vecchia 2012).In Swift, we use three different mechanisms to achieve the desired outcome; these are all called "time-step limiters" in different parts of the literature.We describe them here briefly.
The first limit we impose is to limit the time-step of active particles.When a particle computes the size of its next time-step, typically using the CFL condition, it also additionally considers the timestep size of all the particles it interacted within the loop computing accelerations.We then demand that the particle of interest's time-step size is not larger than a factor Δ of the minimum of all the neighbours' values.We typically use Δ = 4 which fits naturally within the binary structure of the time-steps in the code.This first mechanism is always activated in Swift and does not require any additional loops or tasks; it is, however, not sufficient to ensure energy conservation in all cases.
The time-step limiter proposed by Saitoh & Makino (2009) is also implemented in Swift and is a recommended option for all simulations not using a fixed time-step size for all particles.This extends the simple mechanism described above by also considering inactive particles and waking them up if one of their active neighbours uses a much smaller time-step size.This is implemented by means of an additional loop over the neighbours at the end of the regular sequence (Fig. 4).Once an active particle has computed its timestep length for the next step, we perform an additional loop over its neighbours and activate any particles whose time-step length differs by more than a factor Δ (usually also set to 4).As shown by Saitoh & Makino (2009), this is necessary to conserve energy and hence yield the correct solution even in purely hydrodynamics problems such as a Sedov-Taylor blast wave.The additional loop over the neighbours is implemented by duplicating the already existing tasks and changing the content of the particle interactions to activate the requested neighbours.
The third mechanism we implement is a synchronisation step to change the time-step of particles that have been directly affected by external source terms, typically feedback events.Durier & Dalla Vecchia (2012) showed that the Saitoh & Makino (2009) mechanism was not sufficient in scenarios where particles receive energy in the middle of their regular time-step.When particles are affected by feedback (see § 8.1, 8.2, and 8.3), we flag them for synchronisation.A final pass over the particles, implemented as a task acting on any cell which was drifted to the current time, takes these flagged particles, interrupts their current step to terminate it at the current time and forces them back onto the timeline ( § 2.4) at the current step.They then recompute their time-step and get integrated forward in time as if they were on a short time-step all along.This guarantees a correct propagation of energy and hence an efficient implementation of feedback.The use of this mechanism is always recommended in simulations with external source terms.

Particle splitting
In some scenarios, particles can see their mass increase by large amounts.This is particularly the case in galaxy formation simulations, where some processes such as enrichment from stellar evolution (see § 8.1.3)can increase some particle masses by large, sometimes unwanted, factors.To mitigate this problem, the Swift code can optionally be run with a mechanism to split particles that reach a specific mass.We note that this is a mere mitigation tool and should not be confused for a more comprehensive multi-resolution algorithm where particle would adapt their masses dynamically in different regions of the simulation volume and/or based on refinement criteria.
When a particle reaches a user-defined mass  thresh , we split the particle into two equal mass particles.The two particles are exact copies of each other but they are displaced in a random direction by a distance 0.2ℎ.All the relevant particle-carried properties are also halved in this process.One of the two particles then receives a new unique identifier 11 .To keep track of the particles' history, we record the number of splits a particle has undergone over its lifetime and the ID of the original progenitor of the particle present in the initial conditions.Combined with a binary tree of all the splits, also stored in the particle, this leads to fully traceable, unique, identifier for every particle in the simulation volume.

The nIFTy cluster
In Fig. 10, we demonstrate the performance of a selection of the hydrodynamics solvers within Swift on the (non-radiative) nIFTy cluster (Sembolini et al. 2016)   The gas density profile of the nIFTy cluster when simulated with five models within Swift (thick solid lines of various colours), and three external codes (dashed thin lines), shown at redshift  = 0. Middle panel: Gas entropy profile of the cluster (as extracted from the temperature and electron density profiles).Bottom panel: Gas temperature profile of the cluster with the same models.used to perform this test are available for download as part of the Swift package in hdf5 format.All necessary data, like the parameter file required to run the test, is also provided in the repository as a ready-to-go example.
In the figure, we demonstrate the performance of five models from Swift (Density-Energy ( §3.2) in green, Anarchy-PU ( §A2) in blue, Sphenix ( §3.3) in orange, Phantom ( §A3) in purple, and Gasoline-2 ( §A4) in red) 12 .All simulations use the same Wendland-C2 kernel and  = 1.2.For comparison purposes, we display the results on this problem from the Gadget-2 flavour of Anarchy (based upon Pressure-Entropy; G2-anarchy in dashed blue), the Arepo code and moving mesh-based solver (dashed purple), and a more standard SPH flavour implemented in Gadget-3 (G3-music).These additional curves were extracted from the original Sembolini et al. (2016) work.
Outside of radius  > 0.5 Mpc, all models show very similar behaviour.Internally to this radius, however, two classes of hydrodynamics model are revealed: those that form a flat entropy profile (i.e. the entropy tends towards very low values within the centre, driven by high densities and low temperatures), or a declining entropy profile (entropy flattens to a level of  B   −2/3 e ≈ 10 2.5 cm 2 keV, driven by a low central density and high temperature).There has been much debate over the specific reasons for this difference between solvers.
Here, we see that we form a flat profile with the Gasoline-2-like (GDF) and Density-Energy models within Swift, and the G3-music code.These models have relatively low levels of diffusion or conduction (or none at all, in the case of Density-Energy and G3-music).For instance, within our Gasoline-2-like implementation, we choose the standard value of the conduction parameter  = 0.03, consistent with the original implementation.Using a similar model Wadsley et al. (2008) demonstrated that the formation of flat or declining entropy profiles was sensitive to the exact choice of this parameter (only forming flat profiles for 0.1 <  < 1.0), and it is likely that this is the case within our Swift implementation too, though any such tuning and parameter exploration is out of the scope of this technical paper.

GRAVITY SOLVER
We now turn our attention towards the equations solved in Swift to account for self-gravity (see Dehnen & Read 2011;Angulo & Hahn 2022, for reviews).We start by introducing the gravity softening kernels ( §4.1), then move on to summarise the Fast-Multipole-Method at the core of the algorithm ( §4.2), and describe how it is implemented in our task-based framework ( §4.3).We then present our choice of opening angle ( §4.4) and the coupling of the method to a traditional Particle-Mesh algorithm ( §4.5).We finish by showing a selection of test results ( §4.6) before discussing how massive neutrinos are treated ( §4.7).

Gravitational softening
To avoid artificial two-body relaxation and avoid singularities when particles get too close, the Dirac -distribution of the density field corresponding to each particle is convolved with a softening kernel of a given fixed, but possibly time-varying, scale-length .Beyond , a purely Newtonian regime is recovered.
Instead of the commonly used spline kernel of Monaghan & Lattanzio (1985) we use a C2 kernel (Wendland 1995), which leads to an expression for the force that is cheaper to compute whilst yielding a very similar overall shape.We modify the density field generated by a point-like particle δ(r) = (|r|) =  (|r|, 3 Plummer ), where within Swift rather than using their original codes, and all use the same neighbour-finding and time-step limiting procedures.
with  = /, and  Plummer is a free parameter linked to the resolution of the simulation (e.g.Power et al. 2003;Ludlow et al. 2019).The potential (, ) corresponding to this density distribution reads with  () ≡ −3 7 + 15 6 − 28 5 + 21 4 − 7 2 + 3.These choices lead to a potential at |x| = 0 that is equal to the central potential of a Plummer (1911) sphere (i.e.( = 0) = 1/ Plummer ) 13 .From this expression the softened gravitational force can be easily obtained: with () ≡  ′ ()/ = −21 5 + 90 4 − 140 3 + 84 2 − 14.This last expression has the advantage of not containing any divisions or branching (besides the always necessary check for  < ), making it faster to evaluate than the softened force derived from the Monaghan & Lattanzio (1985) spline kernel14 .It is hence well suited to target modern hardware, for instance to exploit SIMD instructions.In particular, the use of a C2 kernel here allows most of the commonly used compilers to automatically generate vectorised code, which is not the case when using a spline-based kernel with branches.On the realistic scenario used as a convergence test of §4.6, we get a speed-up of 2.5x when using AVX2 vectorisation over the regularly optimised code15 .The same code using a spline kernel forfeits that speed-up and is even slightly slower due to the extra operations even in the non-vectorised case.
The softened density profile, with its corresponding potential and resulting forces 16 are shown in Fig. 11.For comparison purposes, we also implemented the more traditional spline-kernel softening in Swift.For a recent discussion of the impact of different softening kernel shapes see section 8 of Hopkins et al. (2023).

Evaluating the forces using the Fast Multipole Method
The algorithmically challenging aspect of the -body problem is to generate the potential and associated forces received by each particle in the system from every other particle in the system.Mathematically, this means evaluating efficiently for large numbers of particles  (with  N the gravitational constant).In the case of collisionless dynamics, the particles are a mere Monte-Carlo sampling of the underlying coarse-grained phasespace distribution (e.g.We use distances in units of the kernel cut-off  to normalise the figures.A Plummer-equivalent sphere is shown for comparison.The spline kernel of Monaghan & Lattanzio (1985) is depicted for comparison but note that it has not been normalised to match the Plummer-sphere potential at  = 0 (as is done in simulations) but rather normalised to the Newtonian potential at  =  to better highlight the differences in shapes.
adapted specifically for gravity solvers by Dehnen (2000Dehnen ( , 2002) ) (see also Warren & Salmon (1995) for related ideas).The FMM works by expanding the potential in a Taylor series around both x  and x  and grouping similar terms arising from nearby particles to compute long-distance interactions between well-separated groups only once.
In other words, we consider groups of particles with a large enough separation that the forces between them can be approximated well enough by just the forces between their centres of mass.Higher-order expressions, as used in Swift and other FMM codes, then not only approximate these groups as interacting point masses, but also take into account their shape, i.e. use the next order terms such as inertia tensors and beyond.A more rigorous derivation is given below.The convergence of FMM and its applicability to a large range of gravity problems have been explored extensively (see e.g.Dehnen 2002Dehnen , 2014;;Potter et al. 2017;Garrison et al. 2021;Springel et al. 2021).For comparison, a Barnes & Hut (1986) tree-code, used in other modern codes such as 2Hot (Warren 2013) and Gadget-4 (Springel et al. 2021, in its default operating mode), only expands the potential around the sources x  .The formal complexity of such a method is O ( log ).

Double expansion of the potential
In this section, we use the compact multi-index notation of Dehnen (2014) (repeated in appendix B for completeness) to simplify expressions and ease comparisons with other published work.In what follows k, m, and n denote the multi-indices and r, R, x, y, and z are vectors, whilst  and  denote particle indices.Note that no assumptions are made on the specific functional form of the potential .For a single pair of particles  and  located in respective cells  and  with centres of mass z  and z  , as shown in Fig. 12, the potential generated by  at the location of  can be written as where the Taylor expansion of  around R ≡ z  − z  was used on the third line, r  ≡ x  − z  , r  ≡ x  − z  is defined throughout, and m ≡ k − n is defined for the last line.Expanding the series only up to order , we get with the approximation converging towards the correct value provided |R| > |r  + r  | as  → ∞.If we now consider all the particles within  and combine their contributions to the potential at location x  in cell , we get This last equation forms the basis of the FMM.The algorithm decomposes eq.36 into three separated sums, evaluated at different stages.

The FMM algorithm
As a first step, multipoles are constructed from the innermost sum.
For each cell, we compute up to order  all the necessary multi-poles (i.e.all terms M whose norm of the multi-index m ⩽ ) where we re-used the tensors X m (r  ) ≡ 1 m! r m  to simplify the notation.This is the first kernel of the method, commonly labelled as P2M (particle to multipole).In a second step, we compute the second kernel, M2L (multipole to local expansion), which corresponds to the interaction of a cell with another one: where D n+m (R) ≡ ∇ n+m (R) is an order  +  derivative of the potential.This is the computationally expensive step of the FMM algorithm, as the number of operations in a naive implementation using Cartesian coordinates scales as O (  6 ).More advanced techniques (e.g.Dehnen 2014) can bring the cost down to O (  3 ), albeit at a considerable algebraic cost.In the case of collisionless dynamics, accuracy down to machine precision for the forces is not required, and low values of  are thus sufficient, which maintains a reasonable computational cost for the M2L kernel (even in the Cartesian form).
Finally, the potential is propagated from the local expansion centre back to the particles (L2P kernel) using This expression is purely local, and can be efficiently implemented in a loop that updates all the particles in cell .
In summary, the potential generated by a cell  on the particles in cell  is obtained by the successive application of the P2M, M2L and L2P kernels.The P2M and L2P kernels need only be applied once per particle, whilst one M2L calculation must be performed for each pair of cells.
The forces applied to the particles are obtained by the same procedure, now using an extra order in the Taylor expansion.For instance, for the acceleration along the  axis, we have: Higher-order terms, such as tidal tensors, can be constructed using the same logic.Note that only the last step in the process, the L2P kernel, needs to be modified for the accelerations or tidal tensors.
The first two steps of the FMM, and in particular the expensive M2L phase, remain identical.
In practice, the multipoles can be constructed recursively from the leaves of the tree to the root, and the local expansions from the root to the leaves by shifting the M and F tensors and adding their contributions to their parent or child cell's tensors respectively.This can be done during the tree construction phase, for instance.Similarly, the local expansion tensors (F) can be propagated downwards using the opposite expressions.
While constructing the multipoles M, we also collect the centre of mass velocity of the particles in the cells.This allows us to drift the multipoles forward in time.This is only first-order accurate, but is sufficient in most circumstances, especially since once the particles have moved too much a full reconstruction of the tree (and hence of the multipoles) is triggered.Here, we follow the same logic as employed in many codes (e.g.Gadget Springel 2005) and force a tree reconstruction once a fixed cumulative fraction (typically 1%) of the particles have received an update to their forces.
One final useful expression that enters some of the interactions between tree-leaves is the P2M kernel.This directly applies the potential due to a multipole expansion in cell B to a particle in cell A without using the expansion of the potential F at the centre of mass of cell A. This kernel is obtained by setting r  to zero in eq.37, re-defining R ≡ x a − z B , and constructing the same M and D tensors as for the other kernels: The P2M kernel acts identically to traditional Barnes & Hut (1986) tree-codes, which use solely that kernel to obtain the forces from the multipoles (or often just monopoles, i.e. setting  = 0 throughout) to the particles.With all the kernels defined, we can construct a tree walk by recursively applying the M2L operation in a similar fashion to the double tree-walk introduced by Dehnen (2000).

Implementation choices
All the kernels (eqs.40-45) are rather straightforward to evaluate as they are only made of additions and multiplications (provided D can be evaluated quickly), which are extremely efficient instructions on modern architectures.However, the fully expanded sums can lead to rather large, and prone to typos, expressions.To avoid any mishaps, we use a python script to generate the C code in which all the sums are unrolled, ensuring they are correct by construction.This script is distributed as part of the code repository.In Swift, FMM kernels are implemented up to order  = 5, more than accurate enough for our purposes (see § 4.6), but this could be extended to higher order easily.At order  = 5, this implies storing 56 numbers per cell for each M and F plus three numbers for the location of the centre of mass.Our default choice is to use multipoles up to order  = 4; higher or lower implementations can be chosen at compile time.For leaf-cells with large numbers of particles, as in Swift, this is a small memory overhead.One further small improvement consists in choosing z  to be the centre of mass of cell  rather than its geometrical centre.The first order multipoles (M 100 , M 010 , M 001 ) then vanish by construction.This allows us to simplify some of the expressions and helps reduce, albeit by a small fraction, the memory footprint of the tree structure.

The tree walk and task-parallel implementation
The three main kernels of the FMM methods (eq.40, 41, and 42) are evaluated in different sections of the code.The construction of the multipoles is done during the tree building phase.This is performed outside of the task-based section of the code.As there is no need to handle dependencies or conflicts during the construction, we use a simple parallelisation over the threads for this phase.As is done in other codes, this is achieved by recursively accumulating information from the tree leaves to the root level.
Once the tree and associated multipoles have been constructed, the remaining work to be performed is laid out.In a similar fashion to the hydrodynamics case ( § 2.2), all the calculations (M2L kernels and direct leaf-leaf interactions) can, in principle, be listed.The only difference lies in the definition of which cells need to interact using which kernel.This is based on the distance between the cells and information gathered from the multipoles (see § 4.4 for the exact expression).In the case of a calculation using multiple nodes, the multipole information of neighbouring cells located on another node is exchanged after the tree construction (see §9.2).Whilst in the SPH (1) one self task computing the gravity kernels within the cell itself, (2) eight pair tasks computing the kernels for each pair of the red-green pairs of cells (the arrows), and (3) a single long-range task computing the M2L kernel contribution of all the blue cells to the red cell.In a realistic example, there will be many more blue cells beyond what is depicted here, but all their contributions to the cell of interest's potential will be handled by a single task looping over all of them.The green cells are too close, based on the criterion of § 4.4 to use a multipole-multipole (M2L) interaction; their interactions with the red cell are hence treated as individual tasks as they contain a substantial amount of calculation to perform.In some cases, the distance criterion may be such that cells slightly further away also need to be treated by the pair tasks rather than just the directly neighbouring layer.This depends on the exact particle configuration and on the user's opening angle choices.
case, the cells were constructed such that only direct neighbours need to be considered, one may, here, need to consider longer-range pairs of cells.
In practice, we start from the top-level grid of cells and identify all the pairs of cells that cannot interact via the M2L kernel.We then construct a pair task for each of them.Each cell also gets a self task which will take care of all the operations inside itself.Finally, for each cell, we create a long-range task, which will take care of all the interactions involving this cell and any cell far enough that the M2L kernel can be directly used.This third task is generally very cheap to evaluate as it involves only the evaluation of eq.41.This is illustrated on Fig. 13 for a simple case.
In most cases, the number of operations to perform within a single self or pair task is large.These cells are also very likely to be split into smaller cells in the tree.The tasks will hence attempt to recurse down the tree and perform the operations at the level that is most suitable.To this end, they use a double tree-walk logic akin to the one introduced by Dehnen (2002).At each level, we verify whether the children cells are far enough from each other based on the opening angle criterion ( § 4.4).If that is the case, then the M2L kernel is used.
If not, then we move further down the tree and follow the same logic at the next level.The algorithm terminates when reaching a leaf cell.At this point, we either apply the M2P kernel, if allowed by the criterion, or default to the basic direct summation (P2P kernel) calculation.
Finally, the L2P kernel is applied on a cell-by-cell basis from the root to the leaves of the tree using a per-cell task.These tasks are only allowed to run once all of the self, pair, and long-range gravity tasks described above have run on the cell of interest.This is achieved using the dependency mechanism of the task scheduling library.
As the gravity calculation updates different particle fields (or even different particles) from the SPH tasks, we do not impose any dependency between the gravity and hydrodynamics operations.Both sets of tasks can run at the same time on the same cells and particles.This differs from other codes where an ordering is imposed.Our choice allows for better load-balancing since we do not need to wait for all the gravity operations (say) to complete before the hydrodynamics ones.

The multipole acceptance criterion
The main remaining question is to decide when two cells are far enough from each others that the truncated Taylor expansion used as approximation for the potential (eq.37) is accurate enough.The criterion used to make that decision is called the multipole acceptance criterion (MAC).
We know that eq.37 converges towards the correct answer as  increases provided that 1 > |r  + r  |/|R|.This is hence the most basic (and always necessary) MAC that can be designed.If this ratio is lower, the accuracy (at a fixed expansion order) is improved and it is hence common practice to define a critical opening angle  cr and allow the use of the multipole approximation between two cells of size  A and  B if This lets users have a second handle on the accuracy on the gravity calculation besides the much more involved change in the expansion order  of the FMM method.Typical values for the opening angle are in the range [0.3, 0.7], with the cost of the simulation growing as  cr decreases.Note that this MAC reduces to the original Barnes & Hut (1986) criterion when individual particles are considered (i.e. A = 0).This method has the drawback of using a uniform criterion across the entire simulation volume and time evolution, which means that the chosen value of  cr could be too small in some regions (leading to too many operations for the expected accuracy) and too large in some other ones (leading to a lower level of accuracy than expected).Swift instead uses a more adaptive criterion to decide when the multipole approximation can be used.This is based on the error analysis of FMM by Dehnen (2014) and is summarised below for completeness 17 .The key idea is to exploit the additional information about the distribution of particles that is encoded in the higher-order multipole terms.
We start by defining the scalar quantity  A,n , the power of the 17 See also Springel et al. (2001) for similar ideas in the regular tree case, based on the detailed error analysis of the tree code by Salmon & Warren (1994).
multipole of order  of the particles in cell , via where the sum runs over all multipole terms of order  in the cell 18 .This quantity is a simple upper bound for the amplitude of the multipole (M ,m <  A, |m| /|m|!) and can hence be used to estimate the importance of the terms of a given order in the Taylor series of the potential.Following Dehnen (2014) we then consider a sink cell  and a source cell  (Fig. 12) for which we evaluate at order  the scalar with  B ≡ M B, (0,0,0) , the sum of the mass of the particles in cell .
Note that since  B,n ⩽  B   B , we have  BA,p ⩽ (( A +  B )/|R|)  , where the right-hand side is the expression used in the basic opening angle condition (eq.46).We finally scale the  BA,p 's by the relative size of the two cells to define the error estimator ẼBA,p : As shown by Dehnen (2014), these quantities are excellent estimators of the error made in computing the accelerations between two cells using the M2L and M2P kernels at a given order.We can hence use this property to design a new MAC by demanding that the estimated acceleration error is no larger than a certain fraction of the smallest acceleration in the sink cell .This means we can use the FMM approximation to obtain the accelerations in cell  due to the particles in cell  if where a  is the acceleration of the particles in cell  and  FMM is a tolerance parameter.Since this is self-referencing (i.e.we need the accelerations to decide how to compute the accelerations), we need to use a an estimator of |a  |.In Swift, we follow the strategy commonly used in other software packages and use the acceleration of the previous time-step 19 .The minimal norm of the acceleration in a given cell can be computed at the same time as the P2M kernels which are obtained in the tree construction phase.The second condition in eq.50 is necessary to ensure the convergence of the Taylor expansion.
One important difference between this criterion and the purely geometric one (eq.46) is that it is not symmetric in  ↔  (i.e. AB,p ≠  BA,p ).This implies that there are cases where a multipole in cell  can be used to compute the field tensors in cell  but the multipole in  cannot be used to compute the F values of cell  and vice versa.This affects the tree walk by breaking the symmetry and potentially leading to cells of different sizes interacting.That is handled smoothly by the tasking mechanism which naturally adapts to the amount of work required.Note that an alternative approach would be to force the symmetry by allowing the multipoles to interact at a given level only if the criterion is satisfied in both directions.We additionally remark that this breaking of the symmetry formally leads to a breaking of the momentum-conserving property of the FMM method.We, however, do not regard this as an important issue as the momentum conservation is already broken by the use of perparticle time-step sizes.

Coupling the FMM to a mesh for periodic long-range forces
To account for periodic boundary conditions in the gravity solver, the two main techniques present in the literature are: (1) apply an Ewald (1921)-type correction to every interaction (e.g.Hernquist & Katz 1989;Klessen 1997;Springel et al. 2001;Springel 2005;Hubber et al. 2011;Potter et al. 2017;Garrison et al. 2021;Springel et al. 2021); and (2) split the potential in two (or more) components with one of them solved for in Fourier space and thus accounting for the periodicity (e.g.Xu 1995;Bagla 2002;Springel 2005;Habib et al. 2016;Springel et al. 2021).We implement the latter of these two options in Swift and follow the same formalism as presented by Bagla & Ray (2003), adapted for FMM.
We start by truncating the potential and forces computed via the FMM using a smooth function that drops quickly to zero at some scale  s set by the size of the gravity mesh.The Newtonian potential in eq.36 is effectively replaced by where the subscript  indicates that this is the short-range part of the potential.As (,  s ) rapidly drops to negligible values, the potential and forces need only be computed via the tree walk for distances up to  cut =  s ; interactions at larger distances are considered to contribute exactly zero to the potential.Following Springel (2005), we use  = 4.5 as our default20 .This maximal distance for tree interaction means that the long-range task (the one taking care of all the blue cells in Fig. 13) only needs to iterate over the cells up to a distance   .This reduces further the amount of work to be performed for the long-range operations by the tree.
The long-range part of the potential ( l () = 1  × erf 1 2   s ) is solved using a traditional particle-mesh (PM, see Hockney & Eastwood 1988) method.We assign all the particles onto a regular grid of  3 mesh cells using a cloud-in-cell (CIC) algorithm.The mesh also sets the cut-off size  s ≡ / mesh , where  is a dimensionless orderunity factor and  is the size-length of the simulation volume.We use  = 1.25 as our default parameter value.In a second phase, we apply a Fourier transform to this density field using the Fast-Fourier-Transform (FFT) algorithm implemented in the fftw library (Frigo & Johnson 2005).
With the potential in Fourier space, Poisson's equation is solved by multiplying each cell's value by the transform of the long-range potential φl We then deconvolve the CIC kernel twice (once for the assignment, once for the potential interpolation) and apply an inverse (fast) Fourier transform to recover the potential in real space on the mesh.Finally, the particles' individual potential and forces are obtained by interpolating from the mesh using the CIC method.
The functional form of eq.51 might, at first, appear sub-optimal.The error function is notoriously expensive to evaluate numerically.In our formulation, we must evaluate it for every pair of interactions (P2P or M2L) at every step.On the other hand, eq.52 needs to be evaluated only  3 mesh times at every global step (see below).Typically,  mesh ∼  1/3 but each of the  particles will perform many P2P kernel calls every single step.Using a simpler form for  in real space with a more expensive one to evaluate correction in -space may hence seem like an improvement.We experimented with sigmoid-like options such as but found little benefit overall.The solution we adopted instead is to stick with eq.51 and use an approximation to erfc sufficient for our needs.Specifically, we used eq.7.1.26 of Abramowitz & Stegun (1965).Over the range of interest, ( ⩽ 4.5 s ), this approximation has a relative error of less than 10 −4 and the error tends to 0 as  → 0. An alternative would be to store exact values in a table and interpolate between entries, but that approach has the disadvantage of requiring non-local memory accesses to this table shared between threads.Comparing simulations run with an exact erfc to simulations using the approximation above, we find no differences in the results.Time integration of the forces arising from the long-range gravitational potential is performed using a long time step and the symplectic algorithm for sub-cycling of Duncan et al. (1998).We split the Hamiltonian in long and short timescales, corresponding to the long-and short-range gravity forces.The short-range Hamiltonian also contains the hydrodynamics forces.The time-steps then follow a sequence of kick & drift operators for the short-range forces embedded in-between two long-range kick operators (See also Quinn et al. 1997;Springel 2005;Springel et al. 2021).
As the mesh forces involve all particles and require all compute cores to perform the FFT together, we decided to implement the PM calculation (i.e. the CIC density interpolation, the calculation of the potential via Fourier space, and the interpolation of the accelerations back onto the particles) outside of the tasking system.In large calculations, the PM steps are rare (i.e. the long-range, global, time-step size is long compared to the smallest individual particle short-range time-step sizes).These steps are also where all particles will have to update their short-range forces, which will trigger a full tree rebuild.Having the PM calculation then perform a global operation outside of the tasking framework whilst locking all the threads is hence not an issue.To speed up operations, the PM calculation also uses parallel operations.The assignment of the particles onto the density grid is performed using a simple threading mechanism on each compute node.The Fourier transforms themselves are then performed using the MPI + threads version of the fftw library.All nodes and cores participate in the calculation.Once the potential grid has been obtained, the assignment of accelerations to the particles is done using the same basic per-node threading mechanism used for the construction of the density.

Convergence tests
The fast multipole method has been thoroughly tested both in the context of collisional dynamics and for collisionless applications (see e.g.Dehnen 2014;Springel et al. 2021).Many tests of simple scenarios, including cells with uniform particle distributions or isolated halos with different profiles can be found in the literature.As the behaviour of the method is well established and since our implementation does not differ from other reference codes besides the parallelisation aspects, we do not repeat such a detailed study here.We report having successfully tested the FMM implementation in Swift on a wide range of cases, most of which are distributed as part of the examples in the code.We thus verified that the code converges towards the correct solution and presents the correct behaviour when the free parameters (e.g. the MAC or the gravity mesh parameters) are varied.We report here on one such experiment with potential relevance to end users.
Our test setup is a snapshot from a cosmological simulation of the Eagle (Schaye et al. 2015) suite.We take the  = 0.1 snapshot from their (25 Mpc) 3 volume.This setup comprises 2 × 376 3 ≈ 10 7 particles with a very high degree of clustering and is hence directly relevant to all galaxy formation applications of the code.The combination of haloes and voids present in the test allows us to test Swift's accuracy in a variety of regimes.We randomly select 1 percent of the particles for which the exact forces are computed using a direct summation algorithm.An Ewald (1921) correction is applied to take into account the periodicity of the volume.We then run Swift and compute the forces via the FMM-PM code described above.We finally compute the relative force error for our sample of particles and evaluate the 99 th percentile (  99 ) of the error distribution.We chose to show the 99 th percentile error over lower ones as it provides better guidance for users for their accuracy requirements by taking into account outliers.We show this error percentile as a function of the opening angle parameters in Fig. 14 for the case where periodic boundary conditions have been switched off.In this test, only the FMM part of the code is thus exercised.The left panel corresponds to the case of a purely geometric MAC (eq.46) and the right panel to the case of the adaptive MAC (eq.50).On both panels, we show different orders of the method using different line colours.The dotted line is used to indicate the 1%-error level.We find that, as expected, the forces converge towards the correct, direct-summation-based, solution when the accuracy parameters are tightened.Similarly, when using the geometric MAC the relationship between  99 and  cr is found to be a power law whose slope steepens for higher values of  as predicted by theoretical arguments (e.g.Dehnen 2014;Springel et al. 2021).These expectations are displayed on the figure using thin dash-dotted lines.In the geometric case, the expected behaviour is recovered.The deviation from a power law at  cr < 0.3 for  = 5 is taking place in the regime where the results start to be affected by single precision floating-point truncation.We have verified that when switching to double precision the power-law behaviour continues for smaller values of  cr , demonstrating that our implementation of the FMM algorithm matches theoretical expectations.In practice, this truncation error takes place much below the regime used in production runs.In the adaptive MAC case, the theoretical expectation is for the scheme to converge as  99 ∝  FMM for all orders .This is shown as a thin black dash-dotted line on the figure.The current Swift implementation converges at a rate below these theoretical predictions.Our recommended default value for the adaptive MAC parameter is shown as a green arrow on the right panel.Using our default setup where we construct multipoles to fourth order, 99 percent of the particles have a relative error of less than 5 × 10 −3 for their force calculation.For comparison with the often used in the literature 90 th percentile of the error (e.g.Springel et al. 2021), we additionally show it using a dashed line on the right panel for our default 4 th -order FMM setup.
We repeat the same exercise but with periodic boundaries switched on and display the results in Fig. 15.The FMM part of the algorithm is unchanged, we only additionally add the PM part using a grid of 512 3 cells and a smoothing factor of  smooth = 1.25 (our default value).In this case, the force error reaches a plateau for low values of the opening angle  cr or adaptive MAC parameter  FMM .This is where the algorithm reaches the accuracy limit of the PM part of the Figure 14.Accuracy of the gravity calculation (solid lines) for the two multipole acceptance criteria (MAC) on a low-redshift ( = 0.1) 2 × 376 3 particles, 25 Mpc cosmological hydrodynamical simulation extracted from the Eagle suite.For 1 in every 100 particles, we calculated the exact forces using direct summation for comparison with the FMM-obtained prediction.We switch off periodic boundary conditions, and hence the gravity mesh, for this test.The 99 th percentile of the relative force error distribution is plotted against the geometric MAC, the classic tree opening angle, on the left, and against the adaptive MAC parameter on the right.Various multipole calculation orders  are shown using different colours.Theoretical predictions for the convergence rates (  99 ∝   for the geometric and  99 ∝  FMM for the adaptive case at all orders) are shown using thin dot-dashed lines in the background (only one line for the adaptive case as the predictions is independant of ).The horizontal dotted line indicates where 99 percent of the particles achieve a relative accuracy of better than 1 percent, a commonly adopted accuracy target.Our default MAC choice, indicated by an arrow on the right panel, corresponds to a 99 th percentile of the relative error of 5 × 10 −3 for our standard setup using the 4 th order FMM implementation.We additionally show the 90 th percentile of the error (  90 ) for the order four adaptive MAC case using a dashed line.The Swift implementation converges at a lower rate than theoretical expectations in the adaptive case.In the geometric case, the deviation from the theoretically expected power-law behaviour for  cr < 0.3 and  = 5 is due to truncation errors in single precision.
Figure 15.The same as Fig. 14, but now considering periodic boundary conditions.A gravity mesh of size  mesh = 512 with  smooth = 1.25 was used.The 99 th percentile of the relative error rapidly reaches a plateau set by the accuracy of the force calculations computed by the PM part of the algorithm.The dashed line on the right panel corresponds to the order four scheme but using  smooth = 3, illustrating the effect of the mesh parameters on the calculation's accuracy.For our default setup (green arrow), the scheme reaches a relative force accuracy of better than 6 × 10 −3 for 99 percent of the particles, a level only reached with very small opening angle values in the geometric case.
method.This is illustrated on the right panel by the dashed line which corresponds to the same run but with  smooth = 3.In our default setup (4 th order FMM,  FMM = 10 −3 ,  smooth = 1.25) indicated by the green arrow, 99 percent of the particles have a relative force accuracy of better than 6 × 10 −3 .

Treatment of massive neutrinos
Accurately modelling neutrinos is of great interest for large-scale structure simulations, due to their outsized effect on matter clustering (see Lesgourgues & Pastor 2006 for a review).We implemented two schemes for the treatment of neutrino effects in Swift: one based on the linear response method (Ali-Haïmoud & Bird 2013) and another based on the   method (Elbers et al. 2021).In terms of the total matter power spectrum they produce, the two schemes are in good agreement.
The linear response method is a grid-based approach that accounts for the presence of neutrino perturbations by applying a linear correction factor in Fourier space to the long-range gravitational potential: where φ,cb is the long-range gravitational potential computed from the cold dark matter and baryon particles ( § 4.5).The correction factor depends on the ratio of linear theory transfer functions () for neutrinos and cold dark matter plus baryons, as well as their relative mass fractions (  ).
The second scheme, based on the   method, actively solves for the neutrino perturbations.It is a hybrid approach that combines a particle-based Monte Carlo sampling of the neutrino phase-space distribution with an analytical background solution.The aim is to solve for the nonlinear gravitational evolution of the neutrinos, while suppressing the shot noise that plagues traditional particle implementations.In this method, the nonlinear phase-space density  of neutrinos is decomposed as where f ( , ) = (1 + exp( / B   )) −1 is the background Fermi-Dirac distribution (expressed in terms of the neutrino temperature   ) and   is a possibly non-linear perturbation.In contrast to traditional, pure particle, implementations, only   is estimated from the particles hence reducing the shot noise.To achieve this decomposition, the contribution of neutrino particles to the mass density is statistically weighted.The weight of particle  is given by where   is the phase-space density at its location.Weights express the deviation from the background, they can be positive or negative, and are ideally small.The reduction in shot noise is proportional to  2 for the neutrino power spectrum.The weights must be updated on the fly, which involves a single loop over neutrino particles.We make use of the fact that f depends only on the current particle momentum, while the value of   is conserved.To avoid storing   , Swift uses the particle ID as a deterministic pseudo-random seed to sample the initial Fermi-Dirac momentum.The value of   is then recomputed when needed.As a result, the memory footprint of neutrinos is identical to that of cold dark matter particles.The neutrino particles then enter the gravity calculation identically to all the other species but see their mass multiplied by their weight.
The possibility of negatively weighted particles requires some attention.In exceptional circumstances, which nevertheless occur for simulations involving billions of particles and thousands of steps, the centre of mass of a group of neutrinos can lie far beyond the geometric perimeter of the particles.Since Swift uses a multipole expansion around the centre of mass, this possibility causes a breakdown of the multipole expansion in eq.38, when truncated at finite .Although the multipole expansion could, in principle, be performed around another point (Elbers et al. 2021), we instead additionally implemented a version of the   method that only applies the weights in the long-range PM gravity calculation.This choice ensures that the spurious back-reaction of neutrino shot noise, which is most prominent on large scales and therefore feeds through the long-range force, is eliminated, while the possibility of neutrinos affecting smaller scales through short-range forces is not excluded.An added benefit is that PM steps are rare for large calculations, such that the computational overhead of the   step is minimal.
In addition, the   weights are always used to reduce the noise in on-the-fly power spectra and are provided in snapshots for use in post processing.
A final point concerns the relativistic nature of neutrino particles at high redshift.To ensure that neutrino velocities do not exceed the speed of light and to recover the correct free streaming lengths, we apply the relativistic correction factor / √︁  2 + (/) 2 to neutrino drifts, where  is the internal velocity variable described in Section 5.3 and  is the scale factor.Relativistic corrections to the acceleration can be neglected in the time frame typical for cosmological simulations (Elbers 2022).

Background evolution
In Swift we assume a standard FLRW metric for the evolution of the background density of the Universe and use the Friedmann equations to describe the evolution of the scale-factor ().We scale  such that its present-day value is  0 ≡ ( =  now ) = 1.We also define redshift  ≡ 1/ − 1 and the Hubble parameter with its present-day value denoted as  0 ≡  ( =  now ).Following usual conventions we write  0 = 100ℎ km • s −1 • Mpc −1 and use ℎ as the input parameter for the Hubble constant.
To allow for general expansion histories we use the full Friedmann equations and write where we followed Linder & Jenkins (2003) to parameterise the evolution of the dark-energy equation of state 21 as: The cosmological model is hence fully defined by specifying the (62) For a general set of cosmological parameters, this integral can only be evaluated numerically, which is too slow to be evaluated accurately during a run.At the start of the simulation we tabulate this integral for 10 4 values of  age equally spaced between log( start ) and log( end ).
The values are obtained via adaptive quadrature using the 61-points Gauss-Konrod rule implemented in the gsl library (Gough 2009) with a relative error limit of  = 10 −10 .The value for a specific  (over the course of a simulation run) is then obtained by linear interpolation of the table.

Addition of neutrinos
Massive neutrinos behave like radiation at early times, but become non-relativistic around  −1 ≈ 1890(  /1 eV).This changes the Hubble rate  () and therefore most integrated quantities described in the previous section.We optionally include this effect by specifying the number of massive neutrino species   and their non-zero neutrino masses  , in eV ( , ≠ 0,  = 1, . . .,   ).Multiple species with the same mass can be included efficiently by specifying mass degeneracies   .In addition, the present-day neutrino temperature  ,0 must also be set 22 as well as an effective number of ultra-relativistic (massless) species  ur .Together with the presentday CMB temperature  CMB,0 , these parameters are used to compute the photon density Ω  , the ultra-relativistic species density Ω ur , and the massive neutrino density Ω  (), replacing the total radiation density parameter Ω r .In our conventions, the massive neutrino contribution at  = 1 is not included in the present-day matter density Ω m = Ω cdm + Ω b .The radiation term appearing in eq.59 is simply replaced by In this expression, the constant Ω  describes the CMB density and is given by while the ultra-relativistic neutrino density is given by Note that we assume instantaneous decoupling for the ultrarelativistic species.The time-dependent massive neutrino density parameter is (Zennaro et al. 2017): where the function F is given by the momentum integral (67) 22 To match the neutrino density from an accurate calculation of decoupling (Mangano et al. 2005), one can use the value  ,0 / CMB,0 = 0.71599 (Lesgourgues & Tram 2011).
As Ω  () is needed to compute other cosmological integrals, this function should be calculated with sufficient accuracy.At the start of the simulation, values of eq.66 are tabulated on a piece-wise linear grid of 2 × 3 × 10 4 values of  spaced between log( ,begin ), log( ,mid ), and log( ,end ) = log(1) = 0.The value of  ,begin is automatically chosen such that the neutrinos are still relativistic at the start of the table.The value of log( ,mid ) is chosen just before the start of the simulation.The integrals F () are evaluated using the 61-points Gauss-Konrod rule implemented in the gsl library with a relative error limit of  = 10 −13 .Tabulated values are then linearly interpolated whenever  () is computed.
Besides affecting the background evolution, neutrinos also play a role at the perturbation level.These effects can be included in Swift using the linear response method of Ali-Haïmoud & Bird (2013) or the particle-based   method of Elbers et al. (2021), as described in § 4.7.

Choice of co-moving coordinates
Note that, unlike many other solvers, we do not express quantities with "little h" (ℎ) included23 ; for instance units of length are expressed in units of Mpc and not Mpc/ℎ.As a consequence, the time integration operators (see below) also include an ℎ-factor via the explicit appearance of the Hubble constant.
In physical coordinates, the Lagrangian for a particle  in an energybased flavour of SPH with gravity reads Introducing the comoving positions r ′ such that r = ()r ′ , we get where the comoving internal energy  ′ =  3(−1) is chosen such that the equation of state for the gas and thermodynamic relations between quantities have the same form (i.e. are scale-factor free) in the primed frame as well.Together with the definition of comoving densities  ′ ≡  3 () , this implies for the pressure, entropy, and sound-speed respectively.Following Peebles (1980) (chapter 7), we introduce the gauge transformation L → L +   Ψ with Ψ ≡ 1 2  r 2  and obtain and call  ′ the peculiar potential.Finally, we introduce the velocities used internally by the code: allowing us to simplify the first term in the Lagrangian.Note that these velocities do not have a direct physical interpretation.We caution that they are not the peculiar velocities (v p ≡  r ′ = 1  v ′ ), nor the Hubble flow (v H ≡ r ′ ), nor the total velocities (v tot ≡ v p + v H = r ′ + 1  v ′ ) and also differ from the convention used in outputs produced by Gadget (Springel 2005;Springel et al. 2021) and other related simulation codes (v out,Gadget = √  r ′ )24 .

SPH equations
Using the SPH definition of density, ρ′ Price (2012) and apply the Euler-Lagrange equations to write with These correspond to the equations of motion for density-entropy SPH (e.g.eq.14 of Hopkins 2013) with cosmological and gravitational terms.Similarly, the equation of motion describing the evolution of  ′ is expressed as: In all these cases, the scale-factors appearing in the equations are later absorbed in the time-integration operators such that the RHS of the equations of motions is identical for the primed quantities to the ones obtained in the non-cosmological case for the physical quantities.Additional terms in the SPH equations of motion (e.g.viscosity switches) often rely on the velocity divergence and curl.
We do not give a full derivation here but the co-moving version of all these terms can easily be constructed following the same procedure we employed here.

Time-integration operators
For the choice of cosmological coordinates made in Swift, the normal kick and drift operators get modified to account for the expansion of the Universe.The rest of the leapfrog algorithm is identical to the non-comoving case.The derivation of these operators from the system's Lagrangian is given in appendix A of Quinn et al. (1997) for the collisionless case.We do not repeat that derivation here but, for completeness, give the expressions we use as well as the ones used for the hydrodynamics.The drift operator gets modified such that Δ for a time-step running from a scale-factor   to  +1 becomes with  () given by eq.60 and the  −2 chosen to absorb the one appearing in eq.73.Similarly, the time-step-entering kick operator for collisionless acceleration reads . (77) However, for the case of gas dynamics, given our choice of coordinates, the kick operator has a second variant that reads Accelerations arising from hydrodynamic forces (1 st and 2 nd term in eq.74) are integrated forward in time using Δ kick,h , whilst the accelerations given by the gravity forces (3 rd term in eq.74) use Δ kick,g .The internal energy (eq.75) is integrated forward in time using Δ kick,u = Δ drift .Following the same method as for the age of the Universe ( §5.1), these three non-trivial integrals are evaluated numerically at the start of the simulation for a series 10 4 values of  placed at regular intervals between log  begin and log ( end ).The values for a specific pair of scale-factors   and  +1 are then obtained by interpolating that table linearly.

Validation
To assess the level of accuracy of Swift, it is important to compare results with other codes.This lets us assess the level of systematic differences and uncertainties left in the code.This is especially important for the studies of non-linear structure formation, as there is no possibility to use an exact solution to compare against.One such benchmark was proposed by Schneider et al. (2016) in the context of the preparation for the Euclid survey.Their goal was to assess whether cosmological codes can converge towards the same solution, within the targeted 1 percent accuracy of the survey.They focused on the matter density power spectrum as their observable and used three different -body codes for their study.Importantly, their work utilised three codes using three different algorithms to solve for the gravity forces: Ramses (Teyssier 2002, multi-grid technique), Pkdgrav3 (Potter et al. 2017, FMM tree algorithm), and Gadget-3 (Springel 2005, tree-PM technique).The setup evolves a cosmological simulation in a (500 Mpc/ℎ) 3 volume from  = 49 to  = 0, assuming a ΛCDM cosmology, sampled using 2048 3 particles.The setup only considers gravitational interactions and comoving time integration.The same setup was later adopted by Garrison et al. (2019) to compare their Abacus code and by Springel et al. (2021) for the Gadget-4 code 25 .It is a testimony to the advances of the field in general and to the increase in available computing power that a run akin to the then-record-breaking Millennium simulation (Springel et al. 2005b) is nowadays used as a mere benchmarking exercise.2016).The simulation evolves 2048 3 dark matter particles in a (500 Mpc/ℎ) 3 volume run from  = 49 to  = 0 assuming a ΛCDM cosmology.All power spectra were measured using the same tool (see text).The dark-and light-shaded regions correspond to ±0.25% and ±1% level agreement between codes.The fundamental mode (left) and the Nyquist frequency (right) are indicated using vertical dashed lines.Over the range of interest for modern cosmological applications, all codes agree to within 1%.
We ran Swift on the same initial conditions and analysed the results as described below.The exact configuration used for the Swift run is released as part of the code package, namely: a 2048 3 gravity mesh for the PM code, the adaptive MAC with  FMM = 10 −3 , and a Plummer-equivalent softening length  = 10/ℎ kpc.The top-left panel of Fig. 1 shows the projection of the matter density field in a 10 Mpc/ℎ slice rendered using the SWIFTsimIO tool (Borrow & Borrisov 2020).To ease the comparison to published results, and eliminate any possible discrepancy coming from binning choices or exact definitions, we used the power-spectrum measurement tool embedded in the Gadget-4 code on our output to allow for a direct comparison with the data presented by Springel et al. (2021) (who had also reanalysed the other runs with their tool).We show our results alongside the published measurements from other codes in Fig. 16, each presented as ratios to the Swift prediction.The shaded regions correspond to ±0.25 percent and ±1 percent differences with respect to our results.Over the range of wavelengths of interest for this problem, the Swift results are in excellent agreement with the other codes.This agreement extends from the linear regime to the non-linear regime ( ≳ 0.1Mpc/h).This confirms Swift's ability to make solid predictions for modern cosmological applications.
Note also that a similar exercise was independently presented by Grove et al. (2022) in the context of the DESI survey code comparison effort, for which Swift, Abacus, and Gadget-2 were compared.Comparing outputs at  = 1 and  = 2, they obtained results in excellent agreement with the ones presented here.

INPUT & OUTPUT STRATEGY
We now turn our attention towards the input and output strategy used by the Swift code.

Initial Conditions
To ease the use of the code and given the large number of legacy initial conditions (ICs) existing in the community using this format, we adopt the same file format for input as the "mode 3" option of the Gadget-2 code (Springel 2005), i.e. the mode based on the hdf5 library (The HDF Group 2022).Swift is fully compatible with any valid Gadget-2 set of initial conditions, but we also provide additional optional features.Firstly, we allow for different units to be used internally and in the ICs.Swift would then perform a conversion upon start-up to the internal units.This can be convenient when a certain set of ICs uses a range of values problematic when represented in single-precision.Secondly, for cosmological runs, Swift can also apply the necessary ℎ-factor and -factor corrections (see § 5.3) to convert to the system of co-moving coordinates adopted internally.A departure from the strict Gadget-2 format is that Swift only allows for the data to be distributed over a single file; we do, however, provide scripts to transform such distributed input files to our format.Some tools also exist to directly generate Swift ICs with all the optional features added.The SWIFTsimIO26 python package (Borrow & Borrisov 2020) can be used to generate simple setups.The SEAGen27 (Kegerreis et al. 2019) and WoMa28 (Ruiz-Bonilla et al. 2021) packages are designed to generate spherical or spinning planetary bodies in equilibrium for collision problems (See sec.8.5).For cosmological simulations, the public version of the state-of-the-art ICs code MonofonIC (Michaux et al. 2021;Hahn et al. 2021) has been extended to be able to produce files that are directly compatible with the format expected by Swift.In particular, information about the adopted cosmological parameters, phases, and all the information required to re-generate the ICs are added to the files, read by Swift, and propagated to the snapshots.This allows for runs to be reproduced based solely on the information given in the Swift outputs.

Snapshots
For the same convenience reasons as for the ICs, we also adopt an output file format designed as a fully-compatible extension to the Gadget-2 (Springel 2005) "mode 3" format based on the hdf5 library (The HDF Group 2022).We extend the format by creating new particle groups for the species not existing in the original Gadget-2 code.We also add to the snapshots a full copy of the parameters used to perform the simulation, information about the version of the code, details of the cosmological models, and information about the ICs.Another noteworthy extension is the extensive use of units metadata in the snapshots.We attach full units information to every field in the snapshots.That information includes human-friendly and machine-readable conversion factors to the cgs system, as well as the conversion factor needed to move between the co-moving and physical frame (See sec.5.3).These metadata can be read by python packages such as SWIFTsimIO (Borrow & Borrisov 2020) to then propagate this information through the simulation analysis.This mechanism is based on the unyt (Goldbaum et al. 2018) library.The particles are stored in the snapshots in order of the domain cells they belong to (See § 9.1).Efficiently retrieving the particles located in a small sub-region of the computational domain is hence possible; for instance extracting the particles in the region around a single halo only.In large simulations, this is much more efficient than reading all the randomly ordered particles and then masking out the ones that do not fall in the region of interest.Metadata to ease such reading patterns are added to the snapshots.That information is picked up by tools such as SWIFTsimIO to aid analysis of these massive simulations.The commonly used visualisation package yt29 (Turk et al. 2011) has also been extended to directly read in Swift snapshots, including the relevant meta-data.
The snapshots can either be written into one single file, with all nodes writing collectively to the same dataset in parallel, or by splitting the data such that each node writes a file with its local subset of particles.That second option is preferable when using file systems that are not able to handle parallel writes to a single file efficiently.When writing such a distributed snapshot, an additional meta-snapshot is written; it contains all the information of a regular single-file snapshot, but uses hdf5's virtual dataset infrastructure to present the data distributed over many files as a single contiguous array.The links between files are handled in the background by the library.These meta-snapshots can then be read as if they were standard snapshots, for instance via tools like Gadgetviewer30 .Swift can also optionally apply lossless compression to the snapshots (via hdf5's own gzip filter) as well as a per-field lossy compression where the number of bits in the mantissa of the numbers can be reduced to save disk space.This option is particularly interesting when considering particle fields where the 23 bits of relative precision (i.e.≈ 7 decimal digits) of a standard float type are more than sufficient for standard analysis31 .Similar filters can be applied to double-precision variables.Finally, Swift implements an option to down-sample the particles of a given type in the snapshots by writing only a fraction of the particles chosen at random.
As an example of i/o performance in a realistic scenario, the snapshots for the recent flagship Flamingo run (Schaye et al. 2023) were written in 200 seconds.They contain 2.65 × 10 11 particles of different types spread over 960 files totalling 39 terabytes of data.This corresponds to a writing speed of 200 GB/s.As this test only used 65% of the systems' nodes, this compares favourably to the raw capability (350 GB/s) of the full cluster.Compressing the data using both lossy and lossless filters reduces the snapshot size to 11 terabytes but the writing time increases to 1260 seconds.This corresponds to a sustained writing speed of 9 GB/s; the difference is due to the compression algorithm embedded within the hdf5 library.Additionally, by making use of the library's parallel writing capability, we can repeat the uncompressed test but with all nodes writing to a single file.In this configuration, we require 463 seconds, effectively achieving a sustained parallel writing speed of 86 GB/s.
Snapshots can be written at regular intervals in time or change in scale-factor.Alternatively, the user can provide a list of outputs in order to specify output times more precisely.This list can be accompanied by a list of fields (or of entire particle types) the user does not want to be written to a snapshot.This allows for the production of reduced snapshots at high-frequency; for instance to finely track black holes.Any of the structure finders ( § 7) can be run prior to the data being written to disk to include halo membership information of the particles in the outputs.

Check-pointing mechanism
When running simulations at large computing centres, limits on the length of a given compute job are often imposed.Many simulations will need to run for longer than these limits and a mechanism to cleanly stop and resume a simulation is thus needed.This checkpointing mechanism can also be used to store backups of the simulation's progress in case one needs to recover from a software or hardware failure.Such a mechanism is different from the writing of science-ready snapshots as all the information currently in the memory needs to be saved; not just the interesting fields carried by the particles.These outputs are thus typically much larger than the snapshots and are of the same size as the memory used for the run.
In Swift, we choose to write one file per MPI rank.No preprocessing of any kind is done during writing.Each of the code's modules writes its current memory state one after the other.This includes the raw particle arrays, the cells, the tasks, and the content of the extensions (see § 8) among many other objects.At the start each module's writing job we include a small header with some information about the size of the data written.This allows us to verify that the data was read in properly when resuming a simulation.As these are simple, unformatted, large, and distributed writing operations, the code typically achieves close to the maximal writing speed of the system.For the same Flamingo run mentioned above, the whole procedure took 260 s for 64 TB of data in 960 files.This corresponds to a raw writing speed of 250 GB/s.As the check-pointing is fast, it is convenient to write files at regular intervals (e.g.every few hours) to serve as a backup.
When restarting a simulation from a check-point file, the opposite operation is performed.Each rank reads one file and restores the content of the memory.At this point, the simulation is in exactly the same state as it was when the files were written.The regular operations can thus resume as if no stoppage and restarting operation had ever occurred.
As is the case in many software packages, our implementation is augmented with a few practical options such as the ability to stop an on-going run or to ask the simulation to run for a set wall-clock time before writing a check-point file and stopping itself.

Line-Of-Sight outputs
In addition to full-box snapshots, Swift can also produce so-called line-of-sight outputs.Randomly-positioned rays (typically perpendicular to a face) are cast through the simulation volume and all gas particles whose volumes are crossed by the infinitely thin rays are stored in a list.We then write all the properties of these particles for each ray to a snapshot with a format similar to the one described above but much reduced in volume.These outputs can then be used to produce spectra via tools such as SpecWizard (Schaye et al. 2003;Tepper-García et al. 2011).Thanks to their small data footprints, these line-of-sight snapshots are typically produced at high time frequencies over the course of a run.This type of output is particularly interesting for simulations of the IGM and Lyman- forest (See § 8.4).

Lightcone outputs
To bring the cosmological simulation outputs closer to observation mock catalogs, Swift implements two separate mechanisms to record information as particles cross the past light cone of a selection of observers placed in the simulation box.The first mechanism writes the particles to disk as they reach a distance from the observer corresponding to the light-travel distance of the look-back time to the outputs.The second mechanism accumulates particle information in redshift shells onto pixels to directly construct maps as the simulation runs.See the Appendix of Schaye et al. (2023) for a detailed use case of both these mechanisms.

Particle data
The position of each observer, the redshift range over which lightcone particle output will be generated, and the opening angle of the cone are specified at run time.At each time-step we compute the earliest and latest times that any particles could be integrated forward to and the corresponding co-moving distances.This defines a shell around each observer in which particles might cross the past light cone as a result of drift operations carried out during this time-step.An additional boundary layer is added to the inside of the shell to account for particles that move during the time-step and assuming that they have sub-luminal speeds.
For simulations employing periodic boundary conditions, we must additionally output any periodic copy of a particle which crosses the observer's light cone.We therefore generate a list of all periodic copies of the simulation volume that overlap the shell around the observer.Then, whenever a particle is moved, we check every periodic copy for a possible overlap with any of the shells.If so, the particle's position is interpolated to the exact redshift at which it crossed the lightcone and the particle is added to a writing buffer.When the buffer reaches a pre-defined size, we write out the particles including all their properties to disk.
To optimise the whole process, we take advantage of the way that Swift internally arranges the particles in a cubic grid of cells ( § 9.1).We can use this structure to identify which tree cells overlap with the current lightcone shells.This allows us to reduce the number of periodic replications to check for every particle.Only the particles in the cells previously identified need to undergo this process.
In most cases, the raw data generated by the particle lightcone requires some post-processing; for instance to reorganise the particles inside the files in terms of angular coordinates on the sky and redshift.

HEALPix maps
Light-cone particle outputs as well as the internal memory requirement rapidly grow in size as the upper redshift limit is increased, especially if many box replications occur, and can become impractical to store.Swift therefore also contains a scheme to store spherical maps of arbitrary quantities on the light cone with user specified opening angle, angular resolution, and redshift bins.
To this end, the observer's past light cone is split into a set of concentric spherical shells in co-moving distance.For each shell we create one full-sky HEALPix (Górski et al. 2005) map for each quantity to be recorded.Whenever a particle is found to have entered one of these shells, we accumulate the particles' contributions to the HEALPix maps for that shell.Typical examples are the construction of mass or luminosity maps.Particles can also, optionally, be smoothed onto the maps using an SPH kernel.
As the maps do not overlap in redshift, it is not necessary to store all of the shells simultaneously in memory.Each map is only allocated and initialised when the simulation first reaches the time corresponding to the outer edge of the shell.It is then written to disk and its memory freed once all the particles have been integrated to times past that corresponding to the light travel time to the inner edge of the shell.In practice, the code will hence only need to have a maximum of two maps in memory at any point in time.

On-the-fly Power Spectra
Finally, Swift can compute a variety of auto-and cross-power spectra at user-specified intervals.These include the mass density in different particle species (and combinations thereof) as well as the electron pressure.For the neutrino density, we also implement the option to randomly select one half of the particles only or the other.This helps reduce the shot-noise by computing a cross-spectrum between the two halves.
The calculation is performed on a regular grid (usually of size 256 3 and hence allowing for the Fourier transform to be performed on a single node).Foldings (Jenkins et al. 1998) are used to extend the range probed to smaller scales with a typical folding factor of 4 between iterations.Different window functions from nearestgrid-point, to CIC, to triangular-shaped-clouds can be used and are compensated for self-consistently (see e.g.Colombi et al. 2009).This could easily be extended to higher-order schemes and to more particle properties.

Continuous non-blocking adaptive output strategy
In Swift we also include a novel output strategy called the Continuous Simulation Data Stream (CSDS), described by Hausammann et al. (2022).The key principles are summarised here (for related ideas, see Faber et al. 2010;Rein & Tamayo 2017).
In classic output strategies ( § 6.2), the simulation is stopped at fixed time intervals and the current state of the system is written to disk, similar to the frames of a movie.This is an expensive operation where all the compute nodes suddenly stop processing the physics and instead put an enormous stress on the communication network and file-system.During these operations, the state of the system is not advanced, leading to an overall loss in performance as the whole simulation has to wait until the i/o operations have completed.Furthermore, in simulations with deep time-step hierarchies, only few particles are active on most steps, with most particles just drifting forward.In a cosmological context, a large fraction of the particles have fairly simple trajectories, barely departing from 1 stor 2 nd -order perturbation theory tracks.Only the small fraction of particles deep inside haloes follow complex trajectories.For the first group of particles, simulations typically have more snapshots than necessary to trace them, whilst for the second group, even one thousand snapshots (say) over a Hubble time may not be sufficient to accurately re-create their trajectory.It is hence natural to consider a more adaptive approach.
The CSDS departs from the snapshot idea by instead creating a database of updates.At the start of a simulation an entry is written for each particle.We then start the simulation and progress the particles along.In its simplest form, the CSDS then adds an entry for a particle to the database every few (∼ 10) particle updates.As the writing is done on a particle-by-particle basis, it can easily be embedded in the tasking system.Writing is no longer a global operation where the whole simulation stops; rather updates are made continuously.By writing an update every few particle steps, the trajectory of each particle is, by construction, well-sampled, irrespective of whether it is in a very active region (e.g.haloes) or not (e.g. in voids).With this mechanism, particles outside of structures can have as little as two entries (start time and end time of the simulation) whilst some particles will have thousands of entries.Since the time-step size of a particle is designed to correctly evolve a particle, relying on this information to decide when to write a database entry guarantees that the particles' evolution can later be faithfully recreated.Each entry for a particle contains a pointer to the previous entry such that particles can easily be followed in time.
An improved version of this approach would be to write a database entry every time a particle field has changed by some pre-defined fraction .This is an important philosophical change; instead of creating frames at fixed intervals, we can demand that the evolution of any quantity be reconstructed to some accuracy from the output and get the CSDS to create the individual particle entries at the required times.The somewhat arbitrary choice of time interval between snapshot is hence replaced by an objective accuracy threshold.
This database of particle updates allows for many new simulation analysis options.The trajectory and evolution of any particle can be reconstructed to the desired accuracy; that is we have all the information for a high time-resolution tracking of all the objects in a run.The first use is to produce classic snapshots at any position in time.We simply interpolate all the particle entries to that fixed time.But, one can also demand to construct slices in space-time, i.e. a lightcone from the output.New possibilities arising from this new output format will undoubtedly appear in the future.Tools to perform the basic operations described here are part of the CSDS package linked to Swift.The tools, and most of the analysis performed thus far, are currently focused on dark-matter simulations, but we expect to extend this to more complex scenarios in the future.

Friends-Of-Friends group finder
The classic algorithm to identify structures in simulations is Friends-Of-Friends (FOF, see e.g.Davis et al. 1985).Particles are linked together if they are within a fixed distance (linking length) of each other.Chains of links form groups, which in a cosmological context are identified as haloes.For a linking length of 0.2 of the mean inter-particle separation, the haloes found are close (by mass) to the virialised structures identified by more sophisticated methods.The FOF method falls into the wider class of Union-Find algorithms (Galler & Fisher 1964) and very efficient implementations have been proposed over the last decade for a variety of computing architectures (e.g.Creasey 2018).
The implementation in Swift is fully described by Willis et al. (2020).In brief, the algorithm operates on a list of disjoint sets.The Union operation merges two sets and the Find operation identifies the set a given element resides in.Initially, each set contains a single element (one particle), which plays the role of the set identifier.The algorithm then searches for any two pairs of particles within range of each other.When such a pair is identified, the Find operation is used to identify which set they belong to.The Union operation is then performed to merge the sets if the particles do not already belong to the same one.To speed-up the pair-finding process, we use the same base principles as the ones discussed in § 2.More precisely, by using the linking length as the search radius, we can construct a series of nested grids down to that scale.The search for links between particles can then be split between interactions within cells and between pairs of neighbouring cells.The tasking infrastructure can then be used to distribute the work over the various threads and nodes.When running a simulation over multiple compute nodes, the group search is first performed locally, then fragments of groups are merged together across domains in a second phase.This is however very different from other particle-particle interactions like the ones used for e.g.hydrodynamics, where the interactions are performed simultaneously, i.e. strictly within a single phase.Additional optimisations are described by Willis et al. (2020), alongside scaling results demonstrating excellent strong and weak scaling of the implementation.
Structures identified via Swift's internal FOF can either be used to seed black holes (see § 8.1.4)or be written as a halo or group catalogue output.Additionally, the FOF code can be run as standalone software to post-process an existing snapshot and produce the corresponding group catalogue.

Coupling to VELOCIraptor
Many algorithms have been proposed to identify bound structures and sub-structures inside FOF objects (for a review, see Knebe et al. 2013).Many of them can be run on simulation snapshots in a postprocessing phase.However, that is often inefficient as it involves substantial i/o work.In some cases, it can also be beneficial to have access to some of the (sub-)halo membership information of a particle inside the simulation itself.For these reasons, the Swift code contains an interface to couple with the VELOCIraptor code (Elahi et al. 2011(Elahi et al. , 2019)).VELOCIraptor uses phase-space information to identify structures using a 6D FOF algorithm.An initial 3D FOF is performed to identify haloes, however, this process may artificially join haloes together via a single particle, which is known as a particle bridge.These haloes are split apart by running a 6D FOF to identify particle bridges based upon their velocity dispersion.Large mergers are then identified in an iterative search for dense phase-space cores.Gravitationally unbound particles can optionally be removed from the identified structures.Such a substructure algorithm has the advantage over pure configuration-space algorithms of being able to identify sub-haloes deep within a host halo, where the density (or potential) contrasts relative to the background are small.
Over the course of a Swift run, the VELOCIraptor code can be invoked to identify haloes and sub-haloes.To this end, the public version of the structure finder was modified to be used as a library.At user-specified intervals (typically at the same time as snapshots), Swift will create a copy of the particle information and format it to be passed to VELOCIraptor.This process leads to some duplication of data but the overheads are small as only a small subset of the full particle-carried information is required to perform the phase-space finding.This is particularly the case for simulations which employ a full galaxy-formation model, where particles carry many additional tracers irrelevant to this process.
When the structure identification is completed, the list of structures and the particle membership information is passed back from the library to Swift.This information can then either be added to snapshots or be acted upon if any of the sub-grid models so require.
As an example, we ran Swift with VELOCIraptor halo finding on the benchmark simulation of Schneider et al. (2016) introduced in § 5.5.The resulting halo mass function is shown on Fig. 17 alongside the reference fitting function of Tinker et al. (2010) for the same cosmology.Our results are in excellent agreement with the predictions from the literature.

EXTENSIONS
Besides the coupled hydrodynamics and gravity solver, the Swift code also contains a series of extensions.These include complete galaxy formation models, AGN models, multi-material planetary models, and a series of external potentials.These features are briefly summarised over the next pages.

The Swift-Eagle galaxy formation model
An implementation of an evolution of the sub-grid models used for the Eagle project (Schaye et al. 2015;Crain et al. 2015) is part of the Swift code.The model is broadly similar to the original Gadget-based implementation but was improved in several areas.Some of these changes also arose from the change of SPH flavour from a pressure-based formulation (see Schaller et al. 2015, for the version used in Eagle) to the Sphenix energy-based flavour tailored specifically for galaxy formation simulations ( § 3.3).We summarise here the main components of the model.All the parameters presented below have values that can be adjusted for specific simulation campaigns and are stored in parameter files that Swift reads in upon startup.The example parameter files provided in the Swift repository contain the parameter values for this model that were obtained via the calibration procedure of Borrow et al. (2024).

Radiative cooling and heating
The radiative cooling and heating rates are pre-calculated on an element-by-element basis given the element abundance of each particle.The gas mass fractions of H, He, C, N, O, Ne, Mg, Si, and Fe are explicitly tracked in the code and directly affected by metal enrichment, while the abundance of S and Ca is assumed to scale with the abundance of Si using solar abundance ratios.Swift can use the tabulated cooling rates from Wiersma et al. (2009a) (W09) for optically thin gas from the original Eagle runs, as well as the various public tables from Ploeckinger & Schaye (2020) (PS20).Compared to W09, the PS20 tables are computed with a more recent version of Cloudy: c07 (Ferland et al. 1998) in W09 and c17 (Ferland et al. 2017) in PS20, use an updated version of the UV and X-ray background (Haardt & Madau (2001) in W09 and a background based on Faucher-Giguère (2020) in PS20) and include physical processes relevant for optically thick gas, such as cosmic rays, dust, molecules, self shielding, and an interstellar radiation field.

Entropy floor and star formation
In typical Eagle-like simulations, the resolution of the model is not sufficient to resolve the cold dense phase of the ISM, its fragmentation, and the star formation that ensues.We hence implement an entropy floor following Schaye & Dalla Vecchia (2008), which is typically set with a normalisation of 8000 K at a density of  H = 0.1 cm −3 with a slope expressed by the equation of state for pressure as  ∝  4/3 .The star formation model uses the pressure-law model of Schaye & Dalla Vecchia (2008) which relates the star formation rates to the surface density of gas.Particles are made eligible for star formation based on two different models.The first one follows Eagle and uses a metallicity-dependent density threshold based on the results of Schaye (2004).The second model exploits the Ploeckinger & Schaye (2020) tables.By assuming pressure equilibrium, we find the density and temperatures on the thermal equilibrium curve for the particles limited by the entropy floor.A combination of density and temperature threshold is then used with these sub-grid quantities (typically  H > 10 cm −3 and  < 1000 K).In practice, both models lead to broadly similar results.
Once a gas particle has passed the threshold for star formation, we compute its star formation rate based on two different models.We either assume a Schmidt (1959) law with a fixed efficiency per freefall time, or use the pressure-law of Schaye & Dalla Vecchia (2008), which is designed to reproduce the Kennicutt (1998) relation.Based on the particle masses and computed star formation rate, random numbers are then drawn to decide whether the particles will indeed be converted into a star particle or not.The star particles formed in this manner inherit the metal content and unique ID of their parent gas particle.

Stellar enrichment & feedback
Stellar enrichment is implemented for the SNIa, core-collapse, and AGB channels using the age-and metal-dependant yields compilation of Wiersma et al. (2009b).The light emitted by the stars in various filters, based on the model of Trayford et al. (2015), is written to the snapshots.Stellar feedback is implemented using a stochastic thermal form (Dalla Vecchia & Schaye 2012) with various options to choose which neighbour in a star particle's kernel to heat (Chaikin et al. 2022).The energy per supernova injection can either be kept fixed or be modulated by the local metallicity or density (Crain et al. 2015).Additionally, Swift includes the modified version of the stochastic kinetic feedback model of Chaikin et al. (2023) that was used in the Flamingo simulations (Schaye et al. 2023;Kugel et al. 2023).The SNe can either inject their energy after a fixed delay or can stochastically sample the stars' lifetimes.The energy injection from SNIa is done by heating all the particles in the stars' SPH kernel during each enrichment step.

Black holes & AGN feedback
Black hole (BH) particles are created by converting the densest gas particle in FOF-identified haloes (see § 7.1) that do not yet contain a BH and are above a user-defined mass threshold.BHs grow by accreting mass from their neighbourhood, using a Bondi (1952) model, possibly augmented by density-dependent boosting terms (Booth & Schaye 2009) or angular-momentum terms (Rosas-Guevara et al. 2015).BH particles can swallow neighbouring gas particles when they have accreted enough mass or can "nibble" small amounts of mass from them (see Bahé et al. 2022).Feedback from AGN is implemented using a stochastic thermal heating mechanism where energy is first stored into a reservoir until a pre-defined number of particles can be heated to a set temperature (Booth & Schaye 2009).Finally, the various modes of repositioning BHs presented in Bahé et al. (2022) are available as part of the Eagle model in Swift.

Results
The model and the calibration of its free parameters are fully described by Borrow et al. (2024), alongside a comprehensive set of results.For completeness, we show here the  = 0 galaxy stellar mass function measured in 50 kpc spherical apertures (see appendix of de Graaff et al. 2022) from a (25 Mpc) 3 simulation with 2 × 376 3 particles in Fig. 18.The baryon particle mass in this simulation is  gas = 1.81 × 10 6 M ⊙ , the resolution of the Eagle simulations and the resolution at which the model was calibrated.For comparison, we show the Driver et al. (2022) estimates of the mass function obtained from the GAMA survey.Over the range where the masses are resolved and the galaxies are not too rare to feature in such a small volume, the Swift-Eagle model produces is in good agreement with the data.That same model was used by Altamura et al. (2023) for their studies of groups and clusters; a map of the gas temperature weighted by its velocity dispersion extracted from one of their simulated clusters is displayed on panel (b) of Fig. 1.
We note that the exact parameters and initial conditions for this simulation are provided as part of the code release.

Gear-like galaxy formation model
The Gear physical model implemented in Swift is based on the model initially implemented in the Gear code (Revaz & Jablonka 2012;Revaz et al. 2016;Revaz & Jablonka 2018), a fully parallel chemo-dynamical Tree/SPH code based on Gadget-2 (Springel 2005).While Gear can be used to simulate Milky Way-like galaxies (Kim et al. 2016;Roca-Fàbrega et al. 2021) its physical model has been mainly calibrated to reproduce Local Group dwarf galaxies (Revaz & Jablonka 2018;Harvey et al. 2018;Hausammann et al. 2019;Sanati et al. 2020) and ultra-faint dwarfs (Sanati et al. 2023).We review hereafter the main features of the model; more details about the Swift implementation can be found in Hausammann (2021).An example galaxy from the Agora-suite (Kim et al. 2016) run using Swift-Gear is displayed in panel (c) of Fig. 1.

Gas radiative cooling and heating
Radiative gas cooling and heating is computed using the Grackle library (Smith et al. 2017).In addition to primordial gas cooling, it includes metal-lines cooling, obtained by interpolating tables, and scaled according to the gas metallicity.Grackle also includes UVbackground radiation heating based on the prediction from Haardt & Madau (2012).Hydrogen self-shielding against the ionising radiation is incorporated.Two shielding options can be used: (1) the UVbackground heating for gas densities above  H = 0.007 cm −3 (Aubert & Teyssier 2010), and (2) the semi-analytic prescriptions of Rahmati et al. (2013) directly included in the Grackle cooling tables.

Pressure floor
To prevent gas from artificially fragmenting at high density and low temperature, i.e. when the Jeans length is not resolved (Truelove et al. 1997;Bate & Burkert 1997;Owen & Villumsen 1997), the gas' normal adiabatic equation of state is supplemented by a non-thermal pressure term.This additional term, interpreted as the non-thermal pressure of the unresolved ISM turbulence, artificially increases the Jeans length to make it comparable to the gas resolution (Robertson & Kravtsov 2008;Schaye & Dalla Vecchia 2008).The Gear model uses the following pressure floor, a modified version of the formulation proposed by Hopkins et al. (2011): where  is the universal gravitational constant and,  the adiabatic index of the gas fixed to 5/3.ℎ, , and  are respectively the SPH smoothing length, density, and velocity dispersion of the gas particle.
The parameter  Jeans (usually set to 10) is the ratio between the SPH mass resolution and the Jeans mass.

Star formation and pressure floor
Star formation is modelled using a modified version of the stochastic prescription proposed by Katz (1992) and Katz et al. (1996) that reproduces the Schmidt (1959) law.In the Gear model star formation proceeds only in dense and cold gas phases where the physics is unresolved, i.e.where the artificial Jeans pressure dominates.Inverting eq.79, the temperature and resolution-dependent density threshold that delimits the resolved and unresolved gas phases is defined: Above this limit, the gas particles are eligible to form stars.It is possible to supplement this threshold with a constant density threshold, which prevents the stars from forming in cold and low-density gas regions, or by a temperature threshold, which prevents stars from forming in hot phases.Finally, only particles with a negative divergence of the velocity are eligible to form stars. Once a particle of mass  g is eligible, it will have a probability  ★ to form a stellar particle of mass  ★ during a time interval Δ (Springel & Hernquist 2003): where  ★ is a free parameter and  g the local free fall time.Each gas particle can form a maximal number  ★ of stellar particles over the whole simulation. ★ is a free parameter set by default to 4. The Gear model can use a critical metallicity [Fe/H] c parameter to differentiate stellar populations.Below [Fe/H] c , a stellar particle will represent a Pop III (metal-free) population and above the critical metallicity, it will be considered a Pop II star.Both populations are characterised by different initial mass functions (IMF), stellar yields, stellar lifetimes, and energies of supernova explosions.All this information is provided to Swift by tables that can be generated by the PyChem32 utility.

Stellar feedback, chemical evolution and metal mixing
At each time step following the creation of a stellar particle, the IMF and stellar lifetimes-dependent number of exploding supernova (core collapse and Type Ia) is computed.This number that can be less than one and is turned into an integer number using a stochastic procedure called the random discrete IMF sampling (RIMFS) scheme in which the IMF is considered as a probability distribution (Revaz et al. 2016).Once a supernova explodes, its energy and synthesised elements are injected into the surrounding gas particles using weights provided by the SPH kernel.A parameter  SN may be used to decide the effective energy that will impact the ISM, implicitly assuming that the remainder will be radiated away.
To avoid instantaneous radiation of the injected energy, the delayed cooling method, which consists in disabling gas cooling for a short period of time of about 5 Myr (Stinson et al. 2006), is used.
The released chemical elements are further mixed in the ISM using either the smooth metallicity scheme (Okamoto et al. 2005;Tornatore et al. 2007;Wiersma et al. 2009b) or explicitly solving a diffusion equation using the method proposed by Greif et al. (2009).

Spin-driven AGN jet feedback
This model for AGN feedback is fully described by Huško et al. (2022) and Huško et al. (2024).We summarise here its main features.This sub-grid model only contains a prescription for AGN and can be used in combination with the Eagle-like model described above for the rest of the galaxy formation processes.
In this model for AGN feedback, additional sub-grid physics related to accretion disks is included, allowing the evolution of spin (angular momentum) for each black hole in the simulation.This in turn means that one can use the spin-dependent radiative efficiency, instead of using a constant value (e.g. 10 percent) for the thermal feedback channel employed in the fiducial model.More significantly, tracking black hole spins also allows for the inclusion of an additional mode of AGN feedback in the form of kinetic jets.The hydrodynamic aspects of the jets and their interaction with the CGM were tested by Huško & Lacey (2023).These jets are included in a self-consistent way by using realistic jet efficiencies (that depend strongly on spin), and by accounting for the jet-induced spindown of black holes.In the standard version of the model, at high accretion rates it is assumed that thermal feedback corresponds to radiation from sub-grid thin, radiatively-efficient accretion discs (Shakura & Sunyaev 1973).At low accretion rates, jets are launched from unresolved, thick, advection-dominated accretion disks (Narayan & Yi 1994).In more complicated flavours of the model, jets are also launched at high accretion rates and radiation (thermal feedback) at low accretion rates, as well as strong jets and thermal feedback from slim discs at super-Eddington accretion rates -all of which is motivated by either observational findings or simulations.
These modifications to the AGN feedback may lead to more realistic populations of galaxies, although they probably have a stronger impact on the properties of the CGM/ICM.Although the model comes with the price of a more complicated feedback prescription (which involves some number of free parameters), it also opens an avenue for further observational comparisons between simulations and observations.The model yields predictions such as the spin-mass relation for black holes or the AGN radio luminosity function.These relations can be used to constrain or discriminate between versions of the model.

Quick-Lyman-alpha implementation
Besides galaxy formation models, another popular application of cosmological hydrodynamical simulations is the study of the intergalactic medium (IGM) via the Lyman- forest.So-called "Quick-Lyman-alpha" codes have been developed (e.g.Viel et al. 2004;Regan et al. 2007) to simulate the relevant physics.As the focus of such simulations is largely on the low-density regions of the cosmic web, a very simplified network of sub-grid model can be employed.In particular, for basic applications at least, the chemistry and cooling can be limited to only take into account the primordial elements.Similarly, any high-density gas can be turned into dark matter particles as soon as the gas reaches a certain over-density (typically Δ = 1000).In such a case, no computing time is wasted on evolving the interior of haloes, which allows for a much shallower time-step hierarchy than in a full galaxy formation model and thus much shorter run times.
We implement such a model in Swift.The "star formation" is designed as described above: any gas particle reaching an over-density larger than a certain threshold is turned into a dark matter particle.The cooling makes use of the table interpolation originally designed for the Swift-Eagle model ( § 8.1).Either the W09 or the P20 tables can be used.Of particular interest for Quick-Lyman-alpha applications, these are based on two different models of the evolution of the UV background: Haardt & Madau (2001) and Faucher-Giguère (2020) respectively.A simulation using the W09 tables would be similar to the ones performed by Garzilli et al. (2019).

Material extensions and planetary applications
Swift also includes features that can be used to model systems with more complicated and/or multiple equations of state (EoS), and to better deal with density discontinuities.They are organised under a nominal 'planetary' label, given their initial application to giant impacts (Kegerreis et al. 2019).These extensions can be applied either onto a 'Minimal'-like solver, with the inclusion of the Balsara (1995) viscosity switch, or in combination with the other, more sophisticated SPH modifications described below.

Equations of state
Many applications of SPH involve materials for which an ideal gas is not appropriate, and may also require multiple different materials.Included in Swift are a wide variety of EoS, which use either direct formulae (e.g.Tillotson 1962) or interpolation of tabulated data (e.g.Stewart et al. 2020;Chabrier & Debras 2021) to compute the required thermodynamic variables.Each individual SPH particle is assigned a material ID that determines the EoS it will use.By default, no special treatment is applied when particles of different EoS are neighbours: the smoothed densities are estimated as before, and the pressure, sound speed, and other thermodynamic variables are then computed by each particle using its own EoS.
Currently implemented are EoS for several types of rocks, metals, ices, and gases.Custom user-provided EoS can also be used.Some materials can, for example, yield much more dramatic changes in the pressure for moderate changes in density than an ideal gas, and can also account for multiple phase states.In practice, in spite of the comparative complexity of some of these EoS, invoking them does not have a significant effect on the simulation run speed, because they are called only by individual particles instead of scaling over multiple neighbours.
Some input EoS may include a tension regime, where the pressure is negative for a cold, low-density material.This is usually undesired behaviour in a typical SPH simulation and/or implies an unphysical representation of the material in this state as a fluid, and can lead to particles accelerating towards each other and overlapping in space.As such, by default, a minimum pressure of zero for these EoS is applied.

Special treatment for initial conditions
Prior to running a simulation, it is a common practice to first perform a 'settling' run to relax the initial configuration of particles.This is particularly pertinent to planetary and similar applications, where the attempted placement of particles to model a spherical or spinning body will inevitably lead to imperfect initial SPH densities (Kegerreis et al. 2019;Ruiz-Bonilla et al. 2021).If the applied EoS includes specific entropies, then Swift can explicitly enforce the settling to be adiabatic, which may be a convenient way to maintain an entropy profile while the particles relax towards equilibrium.

Improvements for mixing and discontinuities
Standard SPH formulations assume a continuous density field, so can struggle to model contact discontinuities and to resolve mixing across them (e.g.Price 2008).However, density discontinuities appear frequently in nature.For example, in a planetary context, sharp density jumps might appear both between a core and mantle of different materials, and at the outer vacuum boundary.Smoothing particles' densities over these desired discontinuities can lead to large, spurious pressure jumps, especially with complex EoS.
We have developed two approaches to alleviate these issues in Swift, briefly summarised here, in addition to the significant benefits of using more SPH particles for higher resolutions than were previously feasible.First, a simple statistic can be used to identify particles near to material and/or density discontinuities and to modify their estimated densities to mitigate the artificial forces and suppressed mixing (Ruiz-Bonilla et al. 2022).This method is most effective when combined with the geometric density-average force (GDF) equations of motion (Wadsley et al. 2017).
Second, a more advanced scheme in which density discontinuities are addressed by directly reducing the effects of established sources of SPH error (Sandnes et al. 2024).This combines a range of novel methods with recent SPH developments, such as gradient estimates based on linear-order reproducing kernels (Frontiere et al. 2017).The treatment of mixing in simulations with either one or multiple equations of state is significantly improved both in standard hydrodynamics tests such as Kelvin-Helmholtz instabilities and in planetary applications (Sandnes et al. 2024).
Each of these modifications may be switched on and off in Swift in isolation.Further improvements are also in active developmentincluding the implementation of additional features such as material strength models.

External potentials
Several external potentials intended for use in idealised simulations are implemented in Swift.The simplest external potentials include an unsoftened point mass, a softened point mass (i.e. a Plummer (1911) sphere), an isothermal sphere, a Navarro et al. (1997) (NFW) halo, and a constant gravitational field.
Besides these traditional options, Swift includes two Hernquist (1990) profiles that are matched to a NFW potential.The matching can be performed in one of two ways: (1) we demand that the mass within  200,cr is  200,cr33 for the Hernquist (1990) profile, i.e.
In order to reduce errors in the integration of orbits, each of the spherically-symmetric potentials optionally imposes a minimum time-step to each particle (see e.g.Nobels et al. 2022).We compute the distance from the centre  of each particle and the corresponding circular velocity  circ ().We then impose a minimum time-step of Δ pot =  pot   circ ( ) , where  pot is a free parameter typically defaulting to  pot = 0.01 (i.e. 100 time-steps per orbit).

IMPLEMENTATION DETAILS & PARALLELISATION
In this Section, we present some of the important implementation details, especially surrounding the multi-node parallelism, and discuss the results of a scaling test on a realistic problem testing the entirety of the code modules.

Details of the cells & tasking system
The basic decomposition of the computational domain in meaningfully-sized cells was introduced in § 2.1.We present some more technical details here.
In all the calculations we perform, we start by laying a Cartesian grid on top of the domain.This defines the most basic level in the cell hierarchy and is referred to as the top-level grid34 .The size of this grid varies from about 8 cells on a side for small simple test runs to 64 elements for large calculations.In most cases, there will be many thousands or millions of particles per cell.We then use a standard oct-tree construction method to recursively split the cells into 8 children cells until we reach a number of particles per cell smaller than a set limit, typically 400.This leads to a relatively shallow tree when compared to other codes which create tree nodes (cells) down to a single particle, and implies a much smaller memory footprint for the tree itself than for other codes.As discussed in § 2.1, Swift can perform interactions between cells of different size.
Once the tree has been fully constructed, we sort the particles into their cells.By using a depth-first ordering, we can guarantee that the particles occupy a contiguous section of memory for all the cells in the tree and at any level.This greatly helps streamline operations on single or pairs of cells as all the particles will simply be located between two known addresses in memory; no speculative walk will be necessary to find all the particles we need for a set of interactions.This sorting of particles can be relatively expensive on the very first step as we inherit whatever order the particles were listed in the initial conditions.However, in the subsequent constructions, this will be much cheaper because the particles only move by small amounts with respect to their cells in between constructions.This is also thanks to the relatively shallow tree we build, which permits for comparatively large cell sizes.For this reason, we use a parallel merge sort here to sort the particles in their cells as it is an efficient way to sort almost-sorted lists, which is the case in all but the first step.Recall also that we do not need to sort the particles very finely, thanks to the high number of them we accept in tree leaves.Whilst this operation is technically a sort, we refer to it as binning of the particles in what follows to avoid confusion with the sorting of particles on the interaction axis used by the pseudo-Verlet algorithm.
With the tree constructed and the particles all in their cell hierarchies, we have all the information required to decide which cells will need to interact for SPH (based on the cells' maximum smoothing lengths) and for gravity (based on the multipoles).All the quantities required for this decision making were gathered while binning the particles.We start by constructing the tasks on the top-level grid only, as described in § 2.2 and § 4.3 for SPH and gravity respectively.In most non-trivial cases, however, this will lead to tasks with very large numbers of particles and hence a large amount of work to perform.If there are only a few expensive tasks, then the scheduler will not be able to load-balance the work optimally as its options are limited.We ideally want significantly more tasks to be enqueued and waiting for execution than there are compute cores.It is hence key to fine-grain the problem further.To achieve this, we attempt to split the tasks into smaller units.For instance, a task acting on a single cell might be split into eight tasks, each acting on its eight children cells independently.For some tasks, in particular when there are no particle-particle interactions involved, this is trivially done (e.g.time integration or for a cooling sub-grid model) but other tasks may lead to more complex scenarios.An SPH task for instance cannot be split into smaller tasks if the smoothing length of the particles is larger than the size of the children cells.In most non-pathological cases, however, the tasks can be moved down the tree by several levels, thus multiplying their overall number many times over and ultimately satisfying our request to have many more tasks than computing units.In cases where more than one loop over the neighbours are needed, only the tasks corresponding to the first loop are moved down the tree levels by assessing whether refinement criteria are met.The tasks corresponding to the subsequent interaction loops however are created by duplicating the already existing tasks of the first loop.As an example, the SPH force loop is created by copying all the tasks needed for the density loop and relabelling them.Similarly, all the sub-grid feedback or black hole-related loops are created in this fashion.This approach has the advantage of keeping the task-creation code as simple as possible.While duplicating the loops, we also set dependencies between tasks to impose the logical order of operations between them (see Fig. 4).
With the tasks created, the code is ready to perform many timesteps.That is, we can re-use the infrastructure created above until the geometrical conditions are violated by particle movement.For SPH, these conditions would be too large a change in smoothing length or a particle moving too far out of its cell meaning that the assumption that all the neighbours are in the same cell or any directly adjacent one is broken.For gravity, this would be too large a particle movement, leading to it being impossible to recompute multipoles without changing the cell geometry.Our shallow tree with large leaves has the advantage of remaining valid for many steps.We also note that other criteria (such as a global mesh gravity step or a certain number of particle updates leading to a tree rebuild) do, in practice, trigger a tree and tasks construction more often than these.
At the start of each step, we perform a quick tree walk starting, in parallel, in each of the many top-level cells.In this walk, we simply identify which cells contain active particles (i.e.particles which need to be integrated forward in time on this step) and activate the corresponding tasks.This operation is very rapid (much less than 1 percent of the total runtime in production runs) and can easily be parallelised given the large number of cells present in a run.Once all the tasks have been activated, they are handed over to the QuickSched engine which will launch them when ready.
As described by Gonnet et al. (2016), the tasks whose dependencies are all satisfied (i.e. for which all the tasks taking place earlier in the graph have already run) are placed in queues.We typically use one of these queues per thread and assign the tasks to the queues (and hence threads) either randomly or based on their physical location in the compute domain.The threads then run through their queues and attempt to fetch a task.When doing so they have to verify that the tasks they get are not conflicting with another, already-running operation.To this end, a mechanism of per-cell locks and semaphores is used.If a thread cannot acquire the lock on a given cell, it abandons this task and attempts to fetch the next one in the queue.If it can acquire a task, it will run the physics operations and upon completion will unlock all the dependencies associated with this task, hence enabling the next tasks to be placed in the queues.We highlight once more that the physics operations themselves are taking place inside a single thread and that no other thread can access the same data at the same time.This places the physics and maths operations taking place in a very safe space, allowing users with only limited programming experience to easily modify or extend the physics contained inside the tasks.No intimate knowledge of parallel programming or even of task-based parallelism is needed to alter the content of a task.If a thread reaches the end of its queue, it starts again from the beginning until there are no more tasks it can process.When that happens, the thread will attempt to steal work from the other threads' queues, a unique feature, at the time this project started, of the QuickSched library.Once all tasks in all queues have been processed, the timestep has been completed and the threads are paused until the start of the next step.

Multi-node strategy
The top-level grid described in the previous section serves as the base decomposition unit of the simulated domain.When decomposing the problem into multiple domains, which would be necessary to run a simulation over multiple compute nodes, we assign a certain number of these cells to each of them.The tree construction algorithm is then run in parallel in each domain for each cell.The only addition is the possible exchange of particles which have left their domain entirely.They are sent to their new region and placed in the appropriate cells.
With the tree fully constructed, we send the sections of the trees (the cell geometry information and multipoles, not the particles) that border a domain to the nodes on the other side of the divide.Each compute node has henceforth full knowledge of its own trees and of any of the directly adjacent ones.With that information in hand, each node will be able to construct all of its tasks, as described above.It will do so for all the purely local cells as well as for the pair tasks operating on one local cell and one foreign cell.The compute node on the other side of the divide will create the exact same task as it bases its decision-making on exactly the same information.The only remaining operation is the creation of send and receive tasks for each task pair overlapping with a domain edge.By adding the appropriate dependencies, we create a task graph similar to the one depicted in Fig. 8.
With this logic, any task spanning a pair of cells that belong to the same partition needs only to be evaluated on that rank/partition, whilst tasks spanning more than one partition need to be evaluated on both ranks/partitions.This is done in the shallow tree walk that performs the task activation at the start of a step.A minor optimisation can be used in the cases where only one of the two cells in a pair task contains active particles.In that situation, we can skip the sending and receiving of data to the node hosting the inactive cell since it will not be using it for any local updates.
All the tasks are put in queues in exactly the same way as in the single-node case.The only difference applies to the communication tasks.These are treated slightly differently.As soon as their dependencies are satisfied, the data is sent asynchronously.Similarly, as soon as the receiving node is ready, it will post a call to an asynchronous receive operation.Note that these communication tasks are treated like any other task; in particular, any of the threads can act on them and thus perform the inter-node communications.We then use the conflict mechanism of the queues to ask the MPI communication library whether the data has effectively been sent or received, respectively.Once that has happened, we simply unlock the corresponding tasks' dependencies and the received data can safely be used from that point onward.This allows us to effectively hide all the communications in the background and perform local work while the data move.We also note that once the data have arrived, nothing distinguishes them from data that were always on that node.This means that the physics operations in tasks can be agnostic of which data they work on.There is no need for special treatment when dealing with remote data; once more helping developers of physics modules to focus on the equations they implement rather than on the technicalities of distributed parallelism.

Domain decomposition
When running a large simulation over MPI using many ranks, an important question is how to share the workload across all the ranks and their host compute nodes.This is important, beyond the obvious reasons like limited memory and CPU cores per node, as the progression of a simulation with synchronisation points is determined by the slowest part.
The simulation workload consists of not just particles and their memory, but also the associated computation, which can vary depending on the types of particles, the current state and environment of the particles, as well as the costs of inter-node communication.All these elements play their part.
A representation of the workload and communication can be constructed by considering the hyper-graph of all top-level cells, where graph vertices represent cells and the edges represent the connections to the nearest neighbours (so each vertex has up to 26 edges).In this graph the vertices represent the computation done by the cell's tasks and the edges represent only the computation done in pair-interaction tasks.This follows since pair interactions are the only ones that could involve non-local data, so the computation in tasks spanning an edge should be related to the communication needed.Now, any partition of this graph represents a partition of the computation and communication, i.e. the graph nodes belonging to each partition will belong to an MPI rank, and the data belonging to each cell resides on the rank to which it was assigned.Such a decomposition is shown in Fig. 19 for a simple toy example.
The weighting of the vertices and edges now needs to reflect the actual work and time expected to be used for communication.Initially, the only knowledge we have of the necessary weights is the association of particles and cells, so we only have vertex weights.However, when a simulation is running, every task is timed to CPU tick accuracy and thus has a direct wall-clock measurement to reflect the computation.This will never be perfect, as other effects like interruptions from other processes will add time, but should be good enough.Note that it also naturally accounts for unknowns, like CPU speed and compiler optimisations, that a non-timed system would need to know about for all the different task types.So, once all the tasks of a simulation have run, we then know how long they take and can then use these real-world weights in the graph.
Decomposing such graphs is a standard problem in computer science and multiple packages exist in the literature.We chose to use Metis and ParMetis (Karypis & Kumar 1998).
Using this simple weights scheme is sufficient, as shown in the next section.Note also that we are not demanding a perfect partition of the graph.In typical simulations, the workload evolves with time (which task times naturally take into account), and it is hence counterproductive to spend a large amount of time identifying the perfect partition.We prefer to use a partition that is good enough but quick to obtain.For realistic simulations, we find that we can maintain the imbalance between compute domains to less than 10 percent (see also Schaller et al. 2016, and Fig. 20 below).We caution that this approach does not explicitly consider any geometric constraints, nor does it attempt to distribute the data uniformly.The only criterion is the relative computational cost of each domain, for which the task The representation of the top-level cells as a graph to be split over domains.The cells of the grid (on the left) correspond to the vertices of the graph (on the right), while the tasks spanning two cells constitute its edges (dashed and dotted lines).For simplicity, we consider here a 4 × 4 non-periodic grid in 2D and only show the pair tasks for cells that share an edge.Each vertex and graph edge has a weight associated with it, shown here as the numbers on each vertex and edge.The weights correspond to the cost of the task execution.If a pair operation is taking place over the network (shown here using dashed lines), its cost will be increased since communications will have to take place and the task will be executed on both of the involved ranks.The domain decomposition algorithm splits the graph so that the work (vertices and edges) is as evenly distributed as possible among all computing ranks (the four colours), minimising the total cost by creating as few communications as possible.In the case shown here, this corresponds to the domain decomposition presented on the left.Note in particular that the number of cells assigned to each domain may not necessarily be the same.
decomposition provides a convenient model.We are therefore partitioning the computation, as opposed to just the data.There could, in principle, be cases where the work-based decomposition leads to problematic data distributions leading to the code running out of memory on a given compute node.We have so far never encountered such a situation in practice.
In addition to this default mechanism, Swift also offers other domain decomposition algorithms.The first one just attempts to split the data evenly between the compute nodes, so maintains the initial state.This is similar to what other simulation packages do, though here it is based on the top-level cells.This is also used as a backup mechanism in case the work-based decomposition leads to too much data imbalance.Finally, a mode where the top-level grid is simply split into regular chunks is also implemented.This is never recommended but the code will default to this if the Metis library is not available.

Scaling results & code performance
The scaling performance of the Swift code on various test problems has been reported in different publications thus far.We give a quick overview here and complement it with a test exploiting the full cosmological simulation engine in a realistic scenario.
In their original Swift feasibility study, Schaller et al. (2016) analysed the original SPH-only code's performance on cosmological test boxes.They reported a strong-scaling efficiency of 60 percent when scaling a problem from 512 cores to 131 072 cores of a BlueGene system.This demonstrated the viability of the task-based approach combined with a graph-based domain decomposition mechanism and set the foundation for the current version of the code.
In their analysis, Borrow et al. (2018) took low-redshift cosmological simulations from the Eagle suite and ran strong-and weakscaling tests of the code.They focused on the scaling of the SPH operations by running only the hydrodynamics tasks.However, by using late-time cosmological boxes, they analysed the performance The "mesh gravity" category corresponds to all the operations performed by the PM-part of the algorithm.The loss of performance is dominated by the lack of scalability of some operations within the tree construction (yellow) as well as by the accumulation of residual imbalance between nodes (purple).The domain decomposition itself (green) only requires a negligible amount of time.
of the code with a realistic density (and hence time-step) distribution.They demonstrated the importance of running the drift operation only on the region of the volumes that directly contribute to the calculation.
Finally, Rogers et al. (2022) analysed the performance of Swift in the context of future exa-scale developments with engineering-type SPH applications in mind.To this end, they ran a fixed time-step, fairly uniform, test volume with more than 5.5 × 10 11 gas particles and demonstrated excellent weak-scaling performance up to the size of their test cluster (≈ 50 000 cores).
To complement these earlier tests, we present here a scaling test exploiting all the main physics modules, including a galaxy formation model.To be as representative as possible, we use a  = 1 setup such that the density structure and hence time-step hierarchy is well developed.We use a 400 3 Mpc 3 volume with 720 3 baryon, 720 3 dark matter, and 144 3 neutrino particles extracted from the Flamingo (Schaye et al. 2023) suite and run it for 1024 time-steps.The sub-grid model is broadly similar to the one described in § 8.1 but with parameters calibrated to match observational datasets at a lower resolution than Eagle did (for details, see Kugel et al. 2023).We use this volume as a base unit and run it on a single node (128 cores) of the cosma-8 system 35 .We use 4 MPI ranks per node even when running on a single node to include the MPI overheads also in the smallest run.The 4 MPI ranks are distributed over the various NUMA regions of the node.We then scale up the problem by replicating the box periodically along the three axes and increasing the number of nodes proportionally.We also use 8 top-level cells per unit volume and an FFT gravity mesh of size 512 3 .Both are scaled up when increasing the problem size.We increase the problem size by a factor 7 3 = 343, which corresponds to the largest setup we can fit on the system.The results of this test are shown in Fig. 20, where we plot the time to solution in units of the time taken on one node.Perfect weak-scaling hence corresponds to horizontal lines.When the problem size is increased by a factor 343, the performance loss is only 15 percent.We also decompose the time spent in the main code sections.The tasks (i.e.physics operations, blue line) dominate the run time and display an excellent scaling performance.Decomposing the task work into the gravity and SPH parts, we see that gravity is the dominant component, validating the hydrodynamics-first approach of the overall code design.All other operations, including all of the sub-grid model tasks, are a negligible contribution to the total.The loss of performance when scaling up comes from the tree construction (orange) and from the overall imbalance between the different nodes (purple) due to an imperfect domain decomposition leading to slightly non-uniform work-load between the nodes despite the problem being theoretically identical.As discussed in § 9.3, we can maintain the node-to-node imbalance below 10 percent.We also report that the time spent deciding how to distribute the domains and performing the corresponding exchange of particles (green line) is a negligible fraction of the total runtime.
Finally, we note that the right-most points in Fig. 20 correspond to a test as large as the largest cosmological hydrodynamical simulation (by particle number) ever run to  = 0 (the flagship 2 × 5040 3 Flamingo volume of Schaye et al. 2023), demonstrating Swift's capability to tackle the largest problems of interest to the community.
We started the presentation of the design decisions that lead to the architecture of Swift in § 2 by a brief discussion of the performance of the previous generation of cosmological hydrodynamical simulations and in particular of the Eagle suite.To demonstrate improvements we could have repeated the flagship simulation of Schaye et al. (2015) with Swift using our updated SPH implementation and the Eaglelike model of § 8.1.Even with Swift's enhanced performance, this would still be a large commitment of resources for a benchmarking exercise, so we decided to instead compare the time taken by the codes on a smaller simulation volume using the same model.The (25 Mpc) 3 volume run with 2 × 376 3 particles presented in § 8.1.5took 159 hours using 28 compute cores of the cosma-7 system 36 ; 35 The cosma-8 system is run by DiRAC (www.dirac.ac.uk) and hosted by the University of Durham, UK.The system is made of 360 compute nodes with 1 TB RAM and dual 64-core AMD EPYC 7H12 at 2.6 GHz (4 NUMA regions / CPU) with AVX2 vector capability.The interconnect is Mellanox HDR, 200GBit/s, with a non-blocking fat-tree topology.The machine has a theoretical 1.9 PF peak performance and achieved 1.3 PF on the standard HPL benchmark. 36The cosma-7 system is run by DiRAC (www.dirac.ac.uk) and hosted by the University of Durham, UK.The system is made of 448 compute this corresponds to a total of 4452 CPU core hours.The Gadgetbased run, using the same initial conditions, from the original Eagle suite took 32900 CPU core hours, meaning that our software is > 7× faster on that problem.Recall however, that the flavours of SPH and the implementation of the sub-grid models are different from the original Eagle code making a more detailed comparison difficult.
We also note that this Swift-based Eagle-like run only required 92 GB of memory meaning that it would easily fit in the memory of a single compute node of most modern facilities.By contrast, the Gadget-based Eagle run required 345 GB of memory; a factor of nearly 4x more.

Random number generator
Many extensions of the base solvers, in particular sub-grid models for galaxy formation, make use of (pseudo-)random numbers in their algorithms.Examples of this are stochastic star formation models or feedback processes (see § 8.1.2and § 8.1.3for such models in Swift).Simulation packages can generate random numbers in various ways, often based on direct calls to a generator such as the base one part of UNIX or the more advanced ones in GSL (Gough 2009).To speed things up or to make the sequence independent of the number of MPI nodes, these calls can then be bundled into tables and regenerated every so often.The particles and physics modules then access these tables to retrieve a random number.This approach can lead to different issues of reproducibility between runs if the particles or modules are not calling the generator in the same order.These issues can arise due to task ordering choices 37 .Additionally, when bundling random numbers in small tables, great care has to be taken to make sure the indexing mechanism is sufficiently uniform so as to not bias the results 38 .
In Swift, despite the intrinsic lack of ordering of the operations due to the tasking, we decided to avoid these pitfalls by viewing the generation of random numbers as a hashing of four unique quantities which are then used to construct the mantissa of a number in the interval [0, 1).We combine the ID of the particle (64-bit), the current location on the integer timeline (64-bit), a unique identifier for this random process (64-bit), and a general seed (16-bit).By doing so, we always get the same random number for a given particle at the same point in simulation time.Since each process also gets a unique identifier, we can draw uncorrelated numbers between modules for the same particle in the same step.Finally, the global seed can be altered if one wanted to actually change the whole sequence to study the effect of a particular set of randoms (see Borrow et al. 2023b, for an example using Swift and the Eagle-like model).The combined 144 bits thus generated are passed through a succession of XOR and random generator seed evolution functions to create a final source of entropy.We use this source as a seed for our last UNIX random nodes with 512 GB RAM and dual 14-core Intel Xeon Gold 5120 CPU at 2.2 GHz (1 NUMA region / CPU) with AVX512 vector capability.The interconnect is Mellanox EDR, 100GBit/s, using a fat tree topology with a 2:1 blocking configuration. 37Note that in MPI codes, the same order-of-operations-issue can also occur if rounding choices change the time-step size of a particle, thus altering the sequence of numbers.The ordering of operations is not guaranteed for reduction operations, or in the directly Swift-relevant case, for asynchronous communications in a multi-threaded environment, unless the developers implemented explicit mechanisms to force this (often slower) behaviour. 38A common mistake is to index the tables based on particle IDs when these IDs themselves encode some information (e.g.only even numbers for gas, or a position in the ICs).number call, erand48(), whose output bits are interpreted as the mantissa of our result.
We have thoroughly verified that this entire mechanism generates perfectly uniform numbers.We also verified that there is no correlation between calls using the same particle and time-step but varying the identifier of the random process.

Summary
In this paper, we have presented the algorithms and numerical methods exploited in the open-source astrophysical code, Swift.We have presented various test problems performed with the code, as well as demonstrated its scaling capability to reach the largest problems targeted by the community.In addition, we described the sub-grid models and other features made available alongside the code, and the various output strategies allowing the users to make the most efficient use of their simulations.
The core design strategy of the Swift code was to focus on a hydrodynamics-first approach, with a gravity solver added on top.In tandem with this, the parallelisation strategy departs from traditional methods by exploiting a task-based parallelism method with dependencies and conflicts.This allows for the efficient load-balancing of problems by letting the runtime scheduler dynamically shift work between the different compute units.This approach, coupled to a domain decomposition method focusing on distributing work and not data, is specifically designed to adhere to the best practices for efficient use of modern hardware.
Various modern flavours of Smoothed Particle Hydrodynamics (SPH) are implemented, alongside two sets of flexible sub-grid models for galaxy formation, a modern way of evolving cosmological neutrinos, and extensions to handle planetary simulations.These additional components are presented and released publicly along with the base code.
Besides testing and benchmarking (in simulations using more than 2 × 10 12 particles), the Swift software package has already been exploited to perform extremely challenging scientific calculations.These include the very large dark-matter-only "zoom-in" (> 10 11 particles in the high resolution region) of the Sibelius project (McAlpine et al. 2022), the large cosmological hydrodynamics runs (up to 2 × 5040 3 particles) of the Flamingo project (Schaye et al. 2023), and the highest ever resolution Moon-formation simulations (Kegerreis et al. 2022).We envision that the public release of the code and its future developments will lead to more projects adopting it as their backbone solver for the most difficult and largest numerical astrophysics and cosmology problems.

Future developments
The Swift code is in constant development and we expect it to evolve considerably in the future.This paper describes the first full public release of the software and we expect improvements to the numerical aspects to be made, new models to be added, as well as new computer architectures to be targeted in the future.
One of the current grand challenges in high-performance computing is the jump towards so-called exa-scale systems.It is widely believed that such computing power can only be reached via the use of accelerators such as GPUs.This is a challenge for methods such as SPH and generally for algorithms including deep time-step hierarchies due to the low arithmetic intensity of these methods and the use of largely irregular memory access patterns.In the context of Swift, exploiting efficiently both CPUs and GPUs via a unified tasking approach is an additional challenge.Some avenues and possible solutions are discussed by Bower et al. (2022), where some early work porting specific computationally-intensive tasks to GPUs is also described.
In terms of physics models, we expect the public code to be soon expanded to include the self-interacting dark matter model of Correa et al. (2022).This will expand the range of cosmological models that can be probed with the Swift package.Work on other extensions beyond vanilla ΛCDM will likely follow.Similarly, additional subgrid models for galaxy formation and cosmological applications are in the process of being included in the main code base and will be released in the future.
The code is also being expanded to include material strength models, as well as further new equations of state, for planetary and other applications.
The various hydrodynamics solvers in the code are currently all variations of SPH.This family of methods is known to have some limitations in the rate of convergence towards analytic solutions in certain scenarios.In future releases of the Swift package, we thus intend to supplement this with additional SPH variations (e.g.Rosswog 2020), renormalised mesh-free methods (e.g.Vila 1999;Hopkins 2015;Alonso Asensio et al. 2023), and a moving mesh implementation akin to Vandenbroucke & De Rĳcke (2016).These methods all use unstructured particles with neighbourhoods as their base algorithmic tool, which makes them very suitable to fit within the framework currently existing in the Swift code.Developments on top of the SPH flavours to include magneto-hydrodynamics terms are also under way both using a direct induction formulation (e.g.Price et al. 2018) and a vector-potential formulation (e.g.Stasyszyn & Elstner 2015).
The code is also being expanded to include radiative transfer modules, starting with the SPH-based formalism of Chan et al. (2021) based on the M1-closure method and a coupling to the CHIMES non-equilibrium thermo-chemical solver (Richings et al. 2014a,b).Developments to include sub-cycling steps, in an even deeper hierarchy than in the gravity+hydro case (Duncan et al. 1998), for the exchange of photons are also on-going, which coupled to the taskbased approach embraced by Swift should lead to significant gains over more classic methods (Ivkovic 2023).
Finally, an improved domain decomposition strategy for the special case of zoom-in simulations with high-resolution regions small compared to the parent box but too large to find in a single node's memory will be introduced by Roper et al. (2024) (See also Chapter 2 of Roper (2023) for a preliminary discussion).
By publicly releasing the code and its extensions to the community, we also hope to encourage external contributors to share their models built on top of the version described here to other researchers by themselves making their work public.

Figure 1 .
Figure 1.A selection of simulation results obtained with the Swift code, illustrating the huge range of problems that have already been targeted and the flexibility of the solver.The panels show: (a) a projection of the large-scale distribution of dark matter from a 10 Mpc/ℎ slice of the (500 Mpc/ℎ) 3 benchmark simulation of Schneider et al. (2016, § 5.5); (b) the temperature of the gas weighted by its velocity dispersion in a zoom-in simulation of a galaxy cluster using the Swift-Eagle galaxy formation model ( § 8.1) extracted from the runs of Altamura et al. (2023); (c) an idealised isolated galaxy from the Agora-suite (Kim et al. 2016) simulated using the Gear model ( § 8.2) rendered using pNbody (Revaz 2013); and (d) a snapshot extracted from a Moon-forming giant impact simulation of Kegerreis et al. (2022) using the planetary physics extension of the code ( § 8.5) and rendered using the Houdini software.

Figure 2 .
Figure2.The Verlet-list method.By constructing a mesh structure with cell sizes matching the search radius  of particles, the neighbour-finding strategy is entirely set by the geometry of the cells and the list of potential candidates is thus exactly known.The particle in black only has potential neighbours in the cell where it resides or any of the 8 (26 in 3D) directly neighbouring cells (in grey).The smoothly varying nature of SPH leads to particles having similar  in nearby regions, with this scale only varying slowly over the whole simulated domain.

Figure 4 .
Figure4.A simplified graph of the tasks acting on a given cell for SPH and gravity during one time step in Swift.Dependencies are depicted as arrows and conflicts by dotted lines.Once the particles have been drifted to the current point in time, the first loop over neighbours can be run.The so-called "ghost" task serves mainly to reduce the number of dependencies between successive loops over the neighbours.Once the second loop has run, the time integration ( § 2.4) can be performed.In parallel to the SPH operations, the gravity tasks (condensed into a single one here for clarity) can be run as they act on different subsets of the data.To prevent different threads from over-writing each others' data, the various SPH loop tasks (1 self and 26 pairs) are prevented from running concurrently via our conflict mechanism.Additional loops over neighbours, used for instance in more advanced SPH implementations, in sub-grid models or for radiative transfer, can be added by repeating the same pattern.They can also be placed after the time integration tasks if they correspond to terms entering the equations in an operator splitting way.

Figure 6 .
Figure6.Pseudo-Verlet list optimisation for the interactions between all particles within a pair of neighbouring cells.Here the particles in the left cell receive contributions from the particles in the right cell.In the first phase, all particles are projected onto the axis linking the two cells (grey line) and sorted based on their projected coordinates.In the interaction phase, the particles iterate along this axis to identify candidates.For instance, the particle  (in black) will identify plausible neighbours (in light and dark grey) on this axis up to a distance   (indicated by the black ruler).These candidates are then tested for 3D distance to verify whether they are genuine neighbours (i.e.within the dotted circle and highlighted in dark grey here) or not.With this technique, the number of false-positives (light grey) is greatly reduced compared to the total number of possible candidates in the right-hand cell (here, 3 vs.11).The advantage is even greater when considering the next particle (from right to left) on the axis.Particle  knows that it will at most have to iterate on the axis up to the end of the ruler set by particle , i.e. its list of candidates is at most as large as 's for the same value of .Moving from particle to particle in the left-hand cell, we can also stop the whole operation as soon as the distance on the axis does not reach at least the first particle in the right-hand cell.Because particles move only by small amounts between steps, the sorted list can be re-used multiple times provided a sufficient buffer is added to the length of the black ruler.Finally, the process is reversed to update the particles on the RHS with contributions from particles in the left cell.In 3D, even larger gains are achieved when the two cells share only an edge or just a corner.

Figure 7 .
Figure7.A pair interaction taking place over a domain boundary.The cell pair interaction in grey involves cells residing on either side of the domain boundary (thick black line), on two separate nodes.To allow for the interaction to happen, we create a set of proxy cells on the first node and create communication tasks (arrows) that import the relevant particles (in grey) from the second node.We also create a dependency between the communication and the pair task to ensure the data have arrived before the pair interaction can start.The pair task can then update the particles entirely locally, i.e. by exploiting exactly the same piece of code as for pairs that do not cross domain boundaries.A similar proxy exists on the other node to import particles in the opposite direction in order to process the pair also on that node and update its local particles.

Figure 8 .
Figure 8. Extra communication tasks.The pair - task (SPH or gravity) corresponds to the grey pair in Fig.7.Each compute node has a task to drift its own local cell.The foreign node (here below the thick black line) then executes a send operation.On the local node, a receive task is run to get the data before unlocking the dependency (solid arrow) and letting the scheduler eventually run the pair - interaction task.The communication itself (dotted arrow) implicitly acts as a dependency between the nodes.The converse set of tasks exists on the other compute node to allow the pair - to also be run on that node.

Figure 9 .
Figure9.The same physics problem (2 × 128 3 particles cosmological simulation) as displayed on Fig.5but now split across 4 nodes, each using 8 threads, i.e. a combination of distributed and shared parallelism.This is the hybrid mode in which Swift is run for large calculations that do not fit on a single node.Each panel corresponds to a different compute node.Within each panel the different rows correspond to the different threads on the compute node.The work each thread performs is coloured to correspond to the task type it executes using the same scheme as on Fig.5.The vertical dashed line on the right of each panel indicates the end of the time-step, which is determined by the point where the last compute node finishes.As can be seen, the node-to-node balance is not perfect; some nodes complete their work slightly earlier.This is due to the MPI library requiring some time to process messages in an unpredictable way, which the domain decomposition algorithm ( §9.3) can thus not compensate for.This leads to small gaps in the execution (white gaps in the coloured bands).All required communication for the tasks occurs within this same figure, and overlaps (asynchronously) with work that only has local or already satisfied dependencies.All the exchanges happen whilst other tasks are running.The communications are overlapping with actual work.Note also that with less work per node overall compared to the shared-memory case, shown in Fig.5, it is easier to see here that a given point in time different threads often process different task types, and hence solve a different set of equations.
benchmark.The initial conditions 11 Depending on how the IDs are distributed in the initial conditions, we either generate a new random ID or append one to the maximal ID already present in the simulation.

Figure 10 .
Figure10.Top panel: The gas density profile of the nIFTy cluster when simulated with five models within Swift (thick solid lines of various colours), and three external codes (dashed thin lines), shown at redshift  = 0. Middle panel: Gas entropy profile of the cluster (as extracted from the temperature and electron density profiles).Bottom panel: Gas temperature profile of the cluster with the same models.

Figure 11 .
Figure11.The density, potential, force, and force ratio to the Newtonian case generated by a point unit mass in our softened gravitational scheme.We use distances in units of the kernel cut-off  to normalise the figures.A Plummer-equivalent sphere is shown for comparison.The spline kernel ofMonaghan & Lattanzio (1985) is depicted for comparison but note that it has not been normalised to match the Plummer-sphere potential at  = 0 (as is done in simulations) but rather normalised to the Newtonian potential at  =  to better highlight the differences in shapes.

Figure 12 .
Figure 12.The basics of the Fast Multipole Method: The potential generated by a particle at position x  on a particle at location x  is replaced by a double Taylor expansion of the potential around the distance vector R linking the two centres of mass (z  and z  ) of cell  and .The expansion converges towards the exact expression provided |R| > |r  + r  |.In contrast, in a traditional Barnes & Hut (1986) tree-code, all the particles in the cell  receive direct contributions from z  without involving the centre of expansion z  in .

Figure 13 .
Figure 13.The basic decomposition of the FMM tree-walk into tasks for a set of particles in their cells, shown in 2D for clarity.The operations involving the red cell are as follows:(1) one self task computing the gravity kernels within the cell itself, (2) eight pair tasks computing the kernels for each pair of the red-green pairs of cells (the arrows), and (3) a single long-range task computing the M2L kernel contribution of all the blue cells to the red cell.In a realistic example, there will be many more blue cells beyond what is depicted here, but all their contributions to the cell of interest's potential will be handled by a single task looping over all of them.The green cells are too close, based on the criterion of § 4.4 to use a multipole-multipole (M2L) interaction; their interactions with the red cell are hence treated as individual tasks as they contain a substantial amount of calculation to perform.In some cases, the distance criterion may be such that cells slightly further away also need to be treated by the pair tasks rather than just the directly neighbouring layer.This depends on the exact particle configuration and on the user's opening angle choices.

Figure 16 .
Figure 16.Comparison of the matter power-spectra as a function of scale for four different -body codes (see text) relative to the Swift prediction on the test problem introduced by Schneider et al. (2016).The simulation evolves 2048 3 dark matter particles in a (500 Mpc/ℎ) 3 volume run from  = 49 to  = 0 assuming a ΛCDM cosmology.All power spectra were measured using the same tool (see text).The dark-and light-shaded regions correspond to ±0.25% and ±1% level agreement between codes.The fundamental mode

Figure 17 .
Figure 17.The halo mass function, computed using VELOCIraptor as the structure finder, extracted from the benchmark cosmological simulation of Schneider et al. (2016) run with Swift (See § 5.5) and compared with the fitting function of Tinker et al. (2010).The shaded region depicts the 1 −  Poisson errors on the counts, while the arrow indicates the mass corresponding to 100 particles.

Figure 18 .
Figure 18.The galaxy stellar mass function, computed using VELOCIraptor as the structure finder and measured in 50 kpc spherical apertures, extracted from a (25 Mpc) 3 volume run with Swift-Eagle model and compared to the Driver et al. (2022) data inferred from the GAMA survey.The shaded region on the simulation corresponds to Poisson error counts in each 0.2 dex mass bin.

Figure
Figure19.The representation of the top-level cells as a graph to be split over domains.The cells of the grid (on the left) correspond to the vertices of the graph (on the right), while the tasks spanning two cells constitute its edges (dashed and dotted lines).For simplicity, we consider here a 4 × 4 non-periodic grid in 2D and only show the pair tasks for cells that share an edge.Each vertex and graph edge has a weight associated with it, shown here as the numbers on each vertex and edge.The weights correspond to the cost of the task execution.If a pair operation is taking place over the network (shown here using dashed lines), its cost will be increased since communications will have to take place and the task will be executed on both of the involved ranks.The domain decomposition algorithm splits the graph so that the work (vertices and edges) is as evenly distributed as possible among all computing ranks (the four colours), minimising the total cost by creating as few communications as possible.In the case shown here, this corresponds to the domain decomposition presented on the left.Note in particular that the number of cells assigned to each domain may not necessarily be the same.

Figure 20 .
Figure20.Weak-scaling performance of the Swift code on a representative cosmological simulation test problem.We use a 400 3 Mpc 3 volume extracted from the Flamingo series with 720 3 baryon, 720 3 dark matter, and 144 3 neutrino particles at  = 1.That base unit is then replicated periodically in all three directions; the top-level grid, as well as the gravity mesh, are also scaled alongside the replications.The number of compute nodes is grown proportionally, starting from a single node (128 cores) for the base volume.The top axis indicates the total number of particles used in each of the tests.When scaling the problem by a factor 7 3 = 343, the total runtime (black line) increases by only 15 percent, as shown on the top panel (note the linear y-axis).The bottom panel shows the breakdown of the total time in different categories (note the log y-axis).The time spent in the tasks (aka.actually solving physics equations, blue line) is remarkably constant as the problem size increases.The task time can be further subdivided in gravity (the FMM part) and SPH operations (dotted and dashed lines); all other tasks, including the sub-grid operations, correspond to a negligible fraction of the runtime.The "mesh gravity" category corresponds to all the operations performed by the PM-part of the algorithm.The loss of performance is dominated by the lack of scalability of some operations within the tree construction (yellow) as well as by the accumulation of residual imbalance between nodes (purple).The domain decomposition itself (green) only requires a negligible amount of time.
Wright 2006 m , Ω r , Ω k , Ω Λ , ℎ,  0 , and   as well as the starting redshift (or scale-factor of the simulation)  start and final time  end .At any scale-factor  age , the time  age since the Big Bang (age of the Universe) is computed as (e.g.Wright 2006): 21Note that w () ≡∫  0 1+ ( ′ )1+ ′ d ′ , which leads to the analytic expression we use.dimensionless