Fermionic quantum turbulence: Pushing the limits of high-performance computing

Abstract Ultracold atoms provide a platform for analog quantum computer capable of simulating the quantum turbulence that underlies puzzling phenomena like pulsar glitches in rapidly spinning neutron stars. Unlike other platforms like liquid helium, ultracold atoms have a viable theoretical framework for dynamics, but simulations push the edge of current classical computers. We present the largest simulations of fermionic quantum turbulence to date and explain the computing technology needed, especially improvements in the Eigenvalue soLvers for Petaflop Applications library that enable us to diagonalize matrices of record size (millions by millions). We quantify how dissipation and thermalization proceed in fermionic quantum turbulence by using the internal structure of vortices as a new probe of the local effective temperature. All simulation data and source codes are made available to facilitate rapid scientific progress in the field of ultracold Fermi gases.


INTRODUCTION
Computation is regarded as the third pillar of physical science, complementing theoretical and experimental physics.Each pillar has its unique methodology: theoretical physics relies on mathematical analysis, measurements are the central interest of experimental physics, and numerical modeling is the heart of computational physics.Many recent breakthroughs, like observing the Higgs boson (1, 2) or detecting gravitational waves (3), would not have been possible without advanced numerical analysis capabilities that adapt algorithmic breakthroughs to evolving hardware.Here we demonstrate the synergy between theory and computation: advances in linear algebra libraries enable Europe's fastest supercomputer (lumi) to diagonalize matrices of record size, allowing us to simulate turbulent dynamics in quantum systems (superfluids).We use these simulations to investigate how vortices dissipate energy, driving quantum turbulence in neutron stars and ultracold-atom experiments.As Moore's law bottoms out, using high-performance computing (hpc) effectively becomes a significant challenge.Current hpc systems consist of thousands of interconnected nodes, each comprising dozens of computing cores or multiple hardware accelerators.Specifically, accelerators like graphics processing units (gpus) account for most of the computing power on modern platforms.Leadership supercomputers can compute from 10 17 floating point operations per second (flops) for preexascale systems, to 10 18 flops (=1 Eflops or exa-flops) for exascale systems.According to Top 500 list (June 2023) the top three supercomputers are: Frontier (Oak Ridge National Laboratory, usa) with 1.19 Eflops, Supercomputer Fugaku (riken Center for Computational Science, Japan) with 0.44 Eflops, and lumi (Euro hpc/csc, Finland) with 0.31 Eflops.Here we use lumi (Fig. 1), the fastest European system, to demonstrate some of its capabilities to advance computational physics.
While this computational potential is enormous, using these hpc capabilities requires a highly-tuned software stack capable of dealing with massive parallel and heterogeneous architectures, and core scientific libraries are constantly being adjusted to maximize performance on new hardware.These include Fast Fourier Transforms, linear algebra routines, libraries for matrix decomposition, random number generators, and solvers for algebraic and differential equations.These core libraries form the building blocks for the efficient domain-specific scientific packages that enable us to make physics breakthroughs in the domain of quantum mechanics.
Simulating quantum dynamics is one of the hardest challenges for classical computers due to the exponentially large size of a manybody wavefunction.Even storing the wavefunction for a modest nucleus like tin with ∼ 100 nucleons would require more bytes than there are atoms in the visible universe.The techniques of density functional theory (dft) (4)(5)(6) and its time-dependent extension time-dependent density functional theory (tddft) (7,8) have rev-olutionized our ability to study quantum dynamics by replacing the need to store the many-body wavefunction with an energy functional of a handful of densities.Despite needing to approximate the form of the functional, tddft has become one of the most successful methods for simulating dynamics in quantum systems.Its popularity is due to both its accuracy and versatility, providing access to static properties at zero and finite temperatures (6,9), and real-time dynamics (8) for highly complex problems consisting of thousands of particles.For these reasons, dft is presently the tool of choice for solid-state physics and quantum chemistry (10)(11)(12), and is used extensively in nuclear physics and astrophysics (13)(14)(15).
To properly account for the Pauli exclusion principle, the slda variant of dft is an orbital-based theory in the spirit of Kohn and Sham (5).When applied to a normal state, this requires one orbital per particle.Thus, to describe N p particles on a spatial grid r requires storing N p functions.Extending this to superfluids requires at least twice as many functions, which we call quasiparticle wavefunctions (qpwfs) φ n (r, t) = [u n (r, t), v n (r, t)] T , since we must represent both particles and holes.(Including both Hartree and Fock terms in nuclei requires four times as many functions (see e.g. ( 35)).)The main cost increase, however, is that superfluids allow fractional occupation of the orbitals, and to obtain convergence, one needs significantly more qpwfs than there are particles in the system.As we shall estimate below, if we represent our system on a three-dimensional spatial grid containing N 3 points, then we need on the order of N qpwf ≈ 0.5N 3 qpwfs leading to a memory cost of ≈ 0.5N 6 complex numbers per state.For a grid with N = 100 points in each direction, a single state thus requires ≈ 15 TB.We consider the simplest type of fermionic superfluid comprising equal populations of two species.(Think spin-up and spin-down, but in most cold-atom experiments these are different hyperfine states.)In these systems, the attractive interaction between these two states can be tuned at will using magnetic fields and broad Feshbach resonances to realize fermionic superfluids throughout what is called the Bardeen Cooper Schrieffer (bcs)-Bose-Einstein condensate (bec) crossover (36).For weak inter-particle attractions, the superfluid has a bcs-like structure with large Cooper pairs providing long-range order and the associated superfluid flow while slightly modifying the dominant structure of the Fermi sea.As one increases the interaction strength, the size of these Cooper pairs gets smaller and smaller until they are best described as tightly bound dimers consisting of fermions of each type.In this bec limit, the dimers behave as bosons, and the system is well described as a bec with the Gross-Pitaevskii equation (gpe) Eq. ( 3) where m B = 2m is now the mass of the dimer, and the dimer density n B = n/2 is half the total fermion density n.In the middle is the so-called unitary Fermi gas (ufg) where the dimers are on the threshold of being bound in the vacuum.Interestingly, there is no phase-transition separating the strongly and weakly interacting systems, hence the term crossover.
When these systems are sufficiently dilute, all the parameters of the short-range interaction can be become irrelevant except for the s-wave scattering length a which describes the size of the bound dimers on the bec side when a > 0. At unitarity a → ∞, and it becomes negative on the bcs side where the dimers are unbound.It is thus convenient to parameterize the equation of state in terms of the dimensionless parameter k F a, where k F = (3π 2 n) 1/3 is the so-called Fermi momentum.For reference, the total density n, Fermi energy ε F (n), and energy density E FG (n) for this two-component free-Fermi gas are At zero temperature (T = 0), for example, the equation of state throughout the crossover can be expressed as The dimensionless function ξ(k F a) characterizes the strength of the interactions, and a major challenge for both theory and experiment has been to determine the universal value at unitarity known as the Bertsch parameter ξ(∞) ≈ 0.37 (28), which combines both experiment (37,38) and qmc (27,39) results.(Note: computing the value of ξ(∞) was a race between classical and analog quantum computing using cold atoms.Establishing the value was one of the first demonstrated successes of the quantum approach.)Dynamics in these systems has direct application to ultra-cold atom experiments, but also indirectly to nuclear physics.In particular, the neutron-neutron scattering length is accidentally large, making the ufg an excellent proxy for the dilute neutron matter expected to occur in the crust of neutron stars (40)(41)(42).
Simulating quantum turbulence is one of the most complex problem in quantum mechanics.The phenomenon has been studied intensively using both isotopes of superfluid helium -bosonic 4 He, and fermionic 3 He (see (43,44) for reviews) -and more recently in ultracold atomic gases (see (45)(46)(47)(48)(49)(50) and (51,52) for reviews).The cold-atom platform has several advantages over liquid helium (23), including access to compressible and hypersonic regimes, superfluid mixtures for studying entrainment (53), and accurate microscopic theories in the form of the time-dependent superfluid local density approximation (tdslda) for fermions, and the gpe for bosons (see e.g. ( 54)).We demonstrate here that hpc techniques and timedependent dft frameworks have reached a level of maturity that allows for microscopic simulations of complex phenomena in systems consisting of tens of thousands of superfluid fermions.

RESULTS
Here we consider quantum turbulence in an ultra-cold atomic gas of fermions for two cases within bec-bcs crossover: 1.The strongly coupled ufg with k F a → ∞.We compare this to a bosonic (gpe) theory for dimers tuned to the ufg equation of state; 2. A weakly coupled superfluid in the bcs regime with k F a = −1.8.Through this comparison, we demonstrate the importance of energy dissipation from heating, and its effect on the structure of quantum vortex cores in fermionic systems.
The quantum simulations we perform require two stages of computation: matrix diagonalization to obtain the initial state for evolution, and solving system of millions of coupled partial differential equations (pdes) to perform the real-time evolution.Here, we will describe them in the context of an hpc implementation on the lumi supercomputer, and benchmark their efficiency on this platform.Before describing the technical aspects of the computation, we introduce a theoretical framework to show the source of the challenges we face.

Theoretical Framework: TDDFT for Bosons
We start with the simpler problem of describing a bosonic superfluid gas.If the gas is sufficiently dilute, then the superfluid state can be well described as a bec where all the bosons (dimers in our case) occupy the same condensate wavefunction ψ B (r, t) normalized to the total boson number density n B (r, t) = |ψ B (r, t)| 2 .This evolves under a non-linear Schrödinger equation called the gpe (see (55,56)), where m B is the boson mass, η is a small phase factor to model dissipation that we tune to better match the natural dissipation in the fermionic simulations (57), and interactions enter through the derivative of the equation of state E (n B ) that characterizes the energy density as a function of the boson density n B .This derivative is an effective mean-field chemical potential µ = E ′ (n B ) which repels or attracts bosons depending on the sign and strength of the interaction.
One can include an external potential V ext (r, t) in the single-particle Hamiltonian ĥB (r, t), but we do not include one in our simulations here.Although expressed as a wavefunction, this is equivalent to an orbital-free dft (58) of the Hohenberg-Kohn type (4), and Eq. ( 3) follows from a principle of stationary action for a generalized Schrödinger field ψ B with the right-hand-side of Eq. ( 3) minimizing the energy functional For modest system sizes, Eq. ( 3) can be efficiently solved on small computers, with hpc resources being required only for large simulation volumes (see e.g. ( 59)).These superfluids demonstrate a wide array of interesting properties, including dissipationless flow past obstacles (with η = 0), and quantized vortices that mediate the energy cascades associated with quantum turbulence in spite of the lack of dissipation (44,54).
Although the theory for fermions is much more complicated as we shall describe below, there is a limit which can be well described by a modified version of the gpe.This is the so-called bec limit where two fermionic species have sufficiently strong attraction that they form a gas of tightly bound dimers.These dimers are bosonic in nature, and can be described by a modified gpe like Eq. ( 3) with m B = 2m, total density n = 2n B = 2 ⃓ ⃓ ψ ⃓ ⃓ 2 (i.e.ψ describes the dimers), total currents j = 2 Im(ψ∇ψ * ), and a properly tuned equation of state E .(See (58) for details: our modified gpe with E (n) = E k F a=∞ (n) is what they call the effective Thomas-Fermi etf model.)To date, the majority of results for quantum turbulence in ultra-cold atomic gases have been simulated using the gpe (54).

Theoretical Framework: TDDFT for Fermions
Unlike bosons, fermions cannot occupy the same state due to the Pauli exclusion principle, and a density functional of the Hohenberg-Kohn type would be highly non-local.Instead, one uses an orbital dft of the Kohn-Sham type (5) where the functional is expressed in terms of a Slater determinant of N qpwf single-particle orbitals φ n (r, t) = [u n (r, t), v n (r, t)] T , which, as described above, must include at least two components to describe particle-hole excitations in a bcs type superfluid.Each of these states evolves under a singleparticle Hamiltonian of the form: )︄ = (︄ ĥ(r, t) ∆(r, t) ∆ * (r, t) − ĥ * (r, t) where ĥ(r, t) has a form similar to ĥB (r, t) above with secondderivatives in space, and ∆(r, t) is a complex-valued function describing the superfluid correlations.Together they form the quasiparticle Hamiltonian H(r, t).The function ∆ plays the role of the order parameter, in analogy to the ψ B function in the gpe.Unlike in the gpe, however, it no longer carries information about the density of the system.The key to a local dft like the slda is that ĥ(r, t) and ∆(r, t) depend only on a handful of local densities: the particle density n(r, t), the kinetic density τ(r, t), the current density j(r, t), and the anomalous density ν(r, t), each of which is computed from the orbitals {φ n (r, t)} via a reduction.The precise form of equations of motion Eq. ( 5) will be discussed in the Methods and Materials section below, but follows from minimizing an energy functional of the form A key property of this system of pdes Eq. ( 5) is that the quasiparticle Hamiltonian H is the same for all quasiparticle wavefunctions.This means that at each step of evolution, one needs to communicate only the handful of local densities ∼50 MB rather than the complete state.Furthermore, the single-particle Hamiltonian is unitary, ensuring that the states remain orthonormal throughout the evolution.Thus, each node can independently perform the evolution of its quasiparticle wavefunctions using local hardware acceleration to compute the derivatives, with minimal communication that requires only an efficient message-passing interface (mpi) reduction.This is how we solve the second challenge, but to initialize this evolution, we must first obtain a good initial state.This requires an orthonormal set of quasiparticle wavefunctions {φ n (r)} which solve the self-consistent set of equations minimizing E slda [{φ n }]: While the matrix on the left-hand-side can be described efficiently in terms of the densities, the components are formally functions of the orbitals ĥ(φ 1 , φ 2 , . . . ) and ∆(φ 1 , φ 2 , . . .).This Hermitian eigenvalue problem must be solved self-consistently, which we do iteratively through a series of diagonalizations.In the numerical implementation, the functions u n (r) and v n (r) are represented as vectors whose length depends strongly on the geometry and size of the problem.While some initial states can be computed efficiently -e.g.systems with high degrees of symmetry such as homogeneous matter -the maximum problem size is generally limited by the technical capabilities of the eigensolver libraries on the chosen hpc systems.

Numerical setup and implementation
To study turbulence, we simulate a periodic volume in space, and use a spectral representation for the quasiparticle wavefunctions on an equally-spaced N x × N y × N z Cartesian grid.In this basis, each quasiparticle wavefunction φ n (r) is represented as a complex vector with 2N x N y N z components.
To put the size of our problem in perspective, we focus on a cubic box with N = N x,y,z = 100 grid points in each direction.We define our length scale in terms of the grid-spacing ∆x = ∆y = ∆z = 1 so that V = L 3 = N 3 and set h = m = 1 so that momenta p = hk is equivalent to the wave-vector.To compute the kinetic energy in ĥ, we use the fast Fourier transform (fft) φ )︂ where This replaces a matrix multiplication with two ffts and a single intermediate vector multiplication by k 2 /2 which is diagonal in momentum space.The remaining calculations are local in position space, simply multiplying φ n by various functions of the densities.These operations can be computed independently and locally on the computation nodes, each of which stores a small fraction of the total set of quasiparticle wavefunctions φ n .The need for hpc comes from the large number of these required to adequately represent the problem.To estimate this, note that, in our units, the maximum momentum represented is k max = π, hence the maximum kinetic energy represented is . This provides a natural cutoff scale E c ≲ E max and we must keep those quasiparticles with energy E n < E c .For large E n , the energy is dominated by the kinetic energy, and we can estimate the number of such states by Cost (time-to-solution × node-count) of the parallel evolution of all quasiparticle wave-functions for a unit time interval (expressed in dimensionless units), which corresponds to 286 integration time steps, as measured on the lumi system.The number of nodes was adjusted to fit the problem in the memory available on the nodes, and ranges from 32 in the smallest case to 800 for the largest.The dashed line shows the expected ideal cost scaling.
considering the volume 4  3 πk 3 max of the sphere E < E c in momentum space in terms of the volume occupied by each quantum state (∆k) 3 : For each state we have two pdes (c.f.Eq. ( 5)).This demonstrates the cost of explicitly including the Pauli principle: instead of solving one pde as for bosons, we need to solve many pdes.For the case we consider, N = 100, we will be solving in parallel the corresponding million of pdes, all of them coupled to each other!

Performance of the Time Evolution Algorithm
The time integration is done with 5 th -order multistep Adams-Bashforth-Moulton (abm) predictor-corrector method (60).The method requires the evaluation of ĥ and ∆ twice per time step (predictor and corrector).The cost per time step thus scales as N 3 × N 3 log N 3 where the first factor accounts for the number of evolved states and the second one for the complexity of the fft that we use to compute the kinetic term.(We use the hipfft library hipfft.readthedocs.io.)The stability of the method has been studied in (61,62).
The open-source w-slda Toolkit (63) provides a parallel implementation of this time integrator, and is designed to simulate fermionic superfluids with the tdslda on modern gpu-accelerated systems.In Fig. 2, we demonstrate the measured cost C (defined as time-to-solution × node-count) obtained on the lumi system.Our parallel implementation of Eq. ( 5) exhibits the expected scaling up to the maximum problem size of N = 100 corresponding to one million pdes.The main limitation is imposed by the memory requirements: the abm method, while very accurate, require about 10 copies of the state to operate in case of our implementation.In our largest case (N = 100) the total memory requirement is about 164 TB.

Performance of the Matrix Diagonalization Algorithm
Generating the initial state is even more costly, as it requires finding a self-consistent solution to Eq. ( 7) for the complete set of N qpwf wavefunctions {φ n }.This is done iteratively via a sequence of diagonalizations of an M × M Hermitian matrix where M = 2N 3 .We reduce the cost of iteratively solving Eq. ( 7) by using a multi-grid approach.We fix the domain size and solve the problem on consecutively larger lattices with N = 60, 80, and 100 points in each direction, corresponding to decreasing lattice spacings ∆x = 1.67, 1.25, and 1.0, respectively.At each step, we interpolate the converged solution from coarser to finer latices, providing a good initial state to accelerate the iterative algorithm.In this way, we only need a few iterations to converge on the target N = 100 lattice.Even with this tremendous simplification, in the final stage of iterations requires a few dense diagonalization of two million by two million matrices.
To do this, we use the publicly available elpa library (64), which was designed to efficiently solve dense symmetric or Hermitian standard or generalized eigenvalue problems (evps), especially with scalability to large core and/or gpu counts in mind.The elpa library was first released in 2010 and has been ported and optimized for all major hpc architectures.The elpa library is used in most software packages for electronic-structure theory (See e.g.elpa.mpcdf.mpg.de/ELPA_USED.) and clearly outperforms implementations such as scalapack (65).Recently, in addition to the accelerated version for nvidia gpus (66,67), a port to the amd mi250 gpu architecture has been publicly released.Here we show the first results obtained with this port for amd gpus.
For the standard evp, the elpa library provides two solvers.The first is a 1-stage solver with three steps: i) transforming the dense matrix into a tridiagonal form, ii) diagonalizing this tridiagonal form, and iii) transforming the eigenvectors back to their original representation.Alternatively, a 2-stage solver introduces two additional steps: first transforming the dense matrix in a banded matrix, and then transforming the banded matrix into a tridiagonal form; the subsequent back transformation of the eigenvectors also requires Fig. 4. Scaling of the elpa 1-stage solver with matrix size.Shown is the cost for solution of different eigenvalue problems for different matrix sizes with the elpa 1-stage solver on the lumi gpu system.The different red symbols indicate the costs (see also Fig. 3) for a fixed matrix size but different node counts.The solid blue line represents the fastest time-to-solution (albeit with the highest costs) achieved.Note, that the blue line was obtained with measurements on different number of gpu nodes.Further note that the blue line has been obtained by taking the arithmetic mean of the costs of at least two experiments of a specific matrix size and node count.The dashed black line shows a power-law fit to the costs for the best run time.The dashed blue line represents the lowest cost (albeit with higher time-to-solution) achieved.
To initialize the target N = 100 quantum turbulence problem, we must be able to efficiently diagonalize matrices where the size is of orders of millions by millions.With access to the lumi supercomputer, we first ensured that the new elpa amd gpu version works as expected on large node counts with matrices of this size.In Fig. 3 we compare the run time of the amd gpu version of the elpa library with the run time on nvidia a100 gpus using the raven system www.mpcdf.mpg.de/services/supercomputing/raven of the mpcdf.We see that in a direct node-per-node comparison, the solutions of the eigenvalue problems on 4 mi250x gpus are in general twice as fast as on 4 nvidia a100 gpus, which is in line with the expectations.Due to limited resources, especially limitations in the maximum job run-time, we could not perform a strong-scaling analysis of the eigenvalue problem for each matrix size -especially above a linear dimension of one million.Instead, we had to rather focus on a limited number of experiments and run each eigenvalue solution for a specific matrix size on specific node counts.
This set of runs is shown in Fig. 4 which shows the costs C to solve a real double-precision dense eigenvalue problem for different matrix sizes.We include the power-law fit of the C = M b to the data, representing the scaling behaviour for the best time-to-solution b ≈ 2.5.Since this power-law exponents b is still below the theoretical value of 3 (scaling of eigenvalue algorithms is O(x 3 )) limit, we have not yet reached the complexity scaling limit for matrix size M, at least on this hpc system (hardware, compilers, etc.).
Nevertheless, we have successfully solved dense eigenvalue problems for real, double-precision evps with linear matrix sizes up to 3.2 million, which is already substantially larger than the problem size required for the target quantum turbulence simulations discussed below.To our knowledge, this is the largest dense eigenvalue problem ever solved with a direct solver.While the benchmark of the elpa library was executed for symmetric matrices (double precision), in the production computation, we were working with Hermitian matrices (double complex precision).The scaling properties in the Hermitian mode are similar, with the complex case taking about twice as long.

Results for Quantum Turbulence
We have presented the capabilities of leadership supercomputers like lumi to deal with dense matrices and solve nonlinear pdes.
We now combine all these elements together to generate large-scale simulations of turbulent dynamics in ultra-cold Fermi gases.
We start our calculations by preparing the initial state at zero temperature (T = 0) consisting of a regular lattice of imprinted vortices in all three directions, see Fig. 5a.The lattice consists of alternately arranged vortices and anti-vortices and the resulting state has zero total angular momentum.The generation of the initial state amounts to solving the static problem (7) with the additional constraint imposed on the phase θ(r) of the order parameter ⃓ ⃓ e iθ(r) .The phase provides a superfluid velocity field v s (r) ∝ ∇θ(r) consistent with the vortex/anti-vortex lattice with a slight long-wavelength perturbation that destabilizes the vortex lattice leading to a turbulent tangle of vortices as seen in subsequent frames of Fig. 5.
We study the strongly-interacting ufg (k F a = ∞) with both the full fermionic tdslda dft, and a simple modified gpe-like theory Eq. ( 3) for dimers (m B = 2m) tuned to the ufg equation of state (58).To mimic the natural dissipation of the tdslda, we add some artificial damping to the gpe Eq. ( 3), considering two values: η = 0.01 and η = 0.08 .The lower value was found to give reasonable qualitative agreement with ufg simulations of rotating quantum turbulence (57), while the larger value better matches the flow energy decay seen in the corresponding fermionic simulation.We also study a less strongly interacting system in the bcs regime at an experimentally accessible value of k F a = −1.8.The number of quasiparticle states extracted from the initial state preparation was N qpwf = 582 898 for the ufg and N qpwf = 675 460 for the bcs regime.
Table I shows some of the characteristic properties of these initial states.There are four length-scales of interest.From smallest to largest, these are: the Fermi scale l F = k −1 F is set by the density and is the smallest resolvable scale in the problem; the bcs coherence length ξ = k F /π∆ describes the size of the cores of the quantum vortices; the mean inter-vortex distance l describes the vortex density n] −1/3 : inverse of Fermi momentum.L(0): total initial length of vortices.l(0): initial mean inter-vortex spacing.All lengths are in units of l F .The scales are set such ratio l/ξ ≈ 5.5 is fixed across the runs.and is the scale at which we initially inject energy for the turbulence; and the size of the simulation volume L box , which is the largest scale in the problem.Fig. 5 shows the evolution of the ufg in our largest tdslda simulation on lumi.The initial perturbed vortex lattice (Fig. 5a) is unstable, and rapidly forms a vortex tangle (Fig. 5b) by tε F ≈ 50.The subsequently decay of this tangle (Figs.5c to 5e) transfers hydrodynamic energy from the initial scale of the vortex lattice to other length scales.The bending, crossing, and reconnection of vortices seen in Figs.5b to 5d are the primary mechanisms for quantum turbulence.Through these mechanisms, hydrodynamic energy can flow from large to small scales, resulting in the emergence of an effective viscosity (69,70) even though this is a superfluid.In compressible fluids, part of this energy is converted into sound (71,72) and further into internal excitations, making this cascade irreversible.There is also some weak wave turbulence (73,74) in the phonons (sound waves), but this is a small effect, and not visible in these plots.
To quantify these, we introduce flow and condensation energies.The former is just the kinetic energy associated with the flow, while the latter estimates the energy in the condensate (Cooper pairs) which uses a simple formula derived in the bcs limit ( 33): We compare the evolution of E flow with the total length L(t) of the vortices in Figs.6a and 6b.When the vortex core is small (we shall call these tight vortices) as in e.g.liquid helium, the flow energy associated with turbulence is dominated by vortices and expected to be proportional to the vortex length.Our case qualitatively differs from that of liquid helium because we have compression modes (phonons or sound), and the vortex core size is comparable to the inter-particle separation.Nevertheless, we still see quite a strong correlation between these.The evolution of the total length L(t) during the free decay is one of the main probes to distinguish the type of turbulence (75).In an incompressible fluid, a random tangle of tight vortices with no largescale structures in the velocity field is expected to develop what is called Vinen or ultraquantum turbulence where the total vortex length decays as L ∝ t −1 .On the other hand, if the vortices create largescale structures -i.e.bundles of coherent vortices -then one expects the fluid to develop eddies and dynamics that produce Kolomogorov or quasiclassical turbulence (76, 77) with a characteristic decay of L ∝ t −3/2 .In our simulations (Fig. 6b), we see Kolomogorov-like decay in the gpe when the vortex density is high 50 ≲ tε F ≲ 300, but the tdslda never develops this behaviour, suggesting a fundamental difference in the decay mechanism between bosonic and fermionic simulations.In this regard, the tdslda appears to more closely match the decay predicted by Vinen turbulence, but we suspect this is coincidental rather than causal since, as we show below, there are additional dissipation mechanisms present in this case.Note that in this work, we do a comparative study between two methods (gpe vs tdslda) since we have access to data for identical setups.However, the lack of statistics does not allow us to make statements about the precise value of the decay exponent.
For compressible turbulence, one can use a Helmholtz decomposition to split E flow into divergence-free (incompressible/rotational/vortices) and curl-free (compress- ible/irrotational/phonons) parts E flow = E vortices + E phonons .
(See e.g.(78).)The total vortex length L should be most strongly correlated with the E vortices contribution, but now vortex reconnections can produce phonons (71,72), further reducing E vortices .Thus, we expect the decrease in L(t) to be more pronounced than in E flow (t).
To check if this expectation is seen in Fig. 6, we note that, by coincidence, both tdslda simulations have lost about 98 % of their total flow energy at tε F ≈ 1000, so we use this as a fiducial.(The phenomenological dissipation η = 0.08 was chosen for one gpe to match these results; the other gpe simulation with η = 0.01 has a 94 % loss in E flow at this time.)Here, we find that the gpe demonstrates the expected trend for both cases, with a slightly greater drop in L. The tdslda simulations, however, have a quantitatively different behaviour with a smaller decrease in L of 94 % in the ufg, and 80 % in the bcs regime.
This result might seem to be counterintuitive: in the bcs regime we still have many vortices at the end, but they do not generate much flow.We interpret this as a strong indication that another mechanism (absent in the gpe) is responsible for reducing E flow in the tdslda.Noting that the simple correlation L ∼ E vortices holds only if we do not consider corrections from the internal vortex core structure, we hypothesize that thermalization plays a significant role.To demonstrate this, in Fig. 7 we consider a vortex solution for the bcs case, as a function of temperature T, which allows us to manipulate the size and structure of the vortex core.The results are obtained by solving static Eq. ( 7) with the constraint that we have a single and straight vortex line.Far from the core, the density n(r) has the same behavior, but clearly, the density inside the vortex core is sensitive to the temperature (Fig. 7a).The order parameter distribution ∆(r) also indicates that vortices get bigger with the increase of T (Fig. 7b).Accordingly, the velocity v(r) = j(r)/n(r), which quantifies the flow energy, is suppressed by the thermal effects (Fig. 7c and inset Fig. 7d).This shows that the structure of the vortex core, which is sensitive to T, can affect E flow .Specifically, in the inset Fig. 7d, we see that finite temperature significantly reduces both the flow and condensation energy.
This suggests an explanation for the breakdown of the correlation between E flow (t) and L(t): the tdslda admits an additional dissipation mechanism whereby flow energy is "thermalized", altering the flow structure of the vortices.To explicitly demonstrate that the vortices in the time-dependent runs get hotter, we added to Fig. 7 (thin gray lines) cross-sections through five randomly selected vortices from the bcs runs at tε F = 1000.As expected for a non-equilibrium state, the profiles of individual vortices have some variability, but all characteristics -the density profile, the order parameter profile, and the velocity field -are consistent with the static solutions obtained for a temperature T/T c ≈ 0.6.
The temperature dependence of the vortex-core density n core allows us to use fermionic vortices as a local thermometer.Using static slda simulations of a single vortex, we calibrate n core (T), (the curve is provided in the Supplementary Material) and then use the density along the vortex lines to demonstrate the thermal evolution of the turbulence in the tdslda simulations.These results are presented in Fig. 8.We observe that the effective temperature of vortex lines is higher in regions of higher curvature, especially in regions where reconnections occur.This is reminiscent to the heating of wire which is sharply bent back and forth.Heating of the vortices represents an additional dissipative mechanism missing in gpe-like models.
Heating depletes the condensate, which eventually vanishes at the critical temperature T c at the superfluid to normal phase transition.This heating is indicated by the loss of condensation energy as seen in Fig. 6c.In the ufg, the condensate is depleted only at the early stages tε F ≲ 200, after which it remains relatively constant.In contrast, in the bcs regime, the condensate continues to gradually deplete.In the case of the bcs vortex solutions, the depletion of the condensate by about 20 % is for T ≈ 0.6T c (Fig. 7d).This value approximately corresponds to the estimated effective temperature of vortices from the cross-sections.It is another signature pointing to the conversion of the flow energy into internal excitations, which effectively heat the system.

DISCUSSION
In the ufg, the turbulent dynamics resolved by the tdslda demonstrate qualitative differences when compared with the simplified approach based on a modified gpe.For instance, the gpe does not properly account for physics in the cores of vortices: Where fermionic vortices have a finite density in the core, vortices in the gpe are empty, causing them to move at different speeds.This can be somewhat compensated for by averaging procedures ( 58), but feeding this back into the evolution has proved tricky (see (79) for one approach).It is also insufficient to model the dissipative effects with a single phenomenological parameter η.Here we have adjusted η ≈ 0.08 to reasonably match the decay pattern for E flow seen in the tdslda, but this differs from the value η ≈ 0.01 that best fits rotating  turbulence (57).Nevertheless, the gpe provides a fast way of gaining some insight into the qualitative effects of superfluid dynamics.
The tdslda is a parameter-free, self-consistent microscopic theory that naturally captures these effects.Thus, although more costly computationally, it provides deeper insight into superfluids dynamics than any other currently available techniques.Here we have demonstrated that the tdslda breaks the natural correlation between the total flow energy E flow and the total vortex length L. By studying the structure of vortex cores at various temperatures T, we provide evidence that this is due to additional energy dissipation and thermalization mechanisms.Directly comparing the tdslda with a gpe-like theory, we can distinguish this mechanism from dissipation due to vortex bending, crossing, and Kelvin modes, for example, which exist in both theories.The importance of the fermionic nature of the superfluid is further supported by the result that deviations are stronger in the bcs regime than in the bec regime where gpe-like theories should be accurate.Some caveats are in order, and results at finite T in the slda   and (c) velocity v(r) = j(r)/n(r) for a single straight vortex line at various temperatures in the bcs regime (k F a = −1.8).The thin gray lines show the corresponding profiles of five randomly selected vortices from the full tdslda taken at time tε F = 1000.The temperature is normalized by the critical temperature of superfluid-normal phase transition T c within the static model: i.e. for T = T c one would have ∆(r) = 0.In the lower panel (c) we include the asymptotic form v(r) = 1/2r as a thin dashdotted line.Inset (d): temperature dependence of E flow and E cond energies for a single vortex.Together, these suggest an effective temperature of T ≈ 0.6T c at time tε F = 1000.must be treated with some caution.In principle, the functional (precisely functions A, B, and C, in Eq. ( 15)) should now depend on the dimensionless parameter k B T/ε F (n), but the form of these has not yet been fit due to lack of reliable benchmarks.Furthermore, it is an open question about how thermalization in strictly T = 0 dynamical simulations (Fig. 6) should be related to the explicit thermal distribution f T (E) used in static calculations (Fig. 7).The agreement between these strongly suggests that thermalization is the correct explanation for the break in correlation between E flow (t) and L(t), but additional analysis is needed to conclusively rule out geometric effects, to identify the importance of vortices to the thermalization process (80), etc.This type of progress requires the development of all exascale hpc technology components: computing hardware, optimized scientific libraries for fft (hipfft) and matrix diagonalization (elpa), and advanced high-level scientific application software (the tdslda via the w-slda toolkit).We have demonstrated that, with current technologies, we can diagonalize matrices of order a million by a million, and use tddft to model fermionic quantum dynamics with 10 4 -10 5 particles.This is rapidly approaching the scale of typical ultracold atom experiments (10 5 -10 6 particles), allowing us to directly benchmark the tdslda against experiments.The new generation of dft packages optimized for hpc are extremely flexible, enabling researchers to access not only superfluidity as studied here, but many other fields, including superconductivity, nuclear physics (15,81), nuclear astrophysics, and new opportunities for emerging fields like atomtronics and other quantum technologies (82).Simulations like these provide access to complex but important microscopic physical processes, and thus provide benchmarks to tune more economical phenomenological models like (58,79), which may ultimately be scaled up to address problems of importance to fundamental physics in nuclear astrophysics.(E.g., the ufg we study here is a good model for the neutron superfluid in the crust of neutron stars, which is likely responsible for pulsar glitches.) To maximize the scientific impact of these publicly-funded resources, we have open-sourced our codes (63), and release the raw data generated by these simulations (83) so that others can reproduce our analysis, and use these results for further benchmarking other models, thereby advancing the pace of science.We hope that other groups will follow our example, making their codes and data available for the community.This step is needed in order to maximize the amount of knowledge extracted from data obtained by expensive hpcs systems, and share research opportunities with groups that do not have direct access to them.

MATERIALS AND METHODS LUMI specification
lumi is one of the three Euro hpc pre-exascale supercomputers, along with Leonardo and Marenostrum 5.It is the only major European hpc system equipped with amd gpu technology.The numerical results described in this work were collected from the lumi pilot access phase.The lumi system has a peak performance of nearly 500 Pflops, most of which is delivered by 2560 gpu nodes.Each of these nodes includes a single amd epyc cpu with 64 cores and four amd mi250x gpus.The amd mi250x has a peak performance of 53 Tflops in double-precision arithmetic.The amd mi250x gpu package consists of two independent devices, called graphics compute dies (gcds).Each of these gcds has 110 compute units, and 64 GB of high-bandwidth memory (hbm2) which can be accessed at a peak rate of 1.6 TB/s.The two gcds in the mi250x package are connected by an in-package communication interface with a peak bidirectional bandwidth of up to 400 GB/s.Devices on different packages are linked with either a single or double communication link with a peak bidirectional bandwidth of 100 GB/s and 200 GB/s, respectively.Each gpu package is directly connected to the Slingshot network providing up to 2×50 GB/s peak bandwidth.The amd mi250 gpu family is based on the 2 n d generation amd cdna ("Compute DNA") architecture which uses amd rocm (Radeon Open Compute) development stack.rocm is an open-source collection of drivers, development tools and application programming interfaces (apis) for gpu programming from the low-level kernel to end-user libraries.Device kernels are programmable with the hip gpu programming language extension.The hip extension also provides a runtime platform, numerical libraries (including the hipfft), and porting tools.

Density functional theory for superfluid fermions
We consider here a system with equal densities of two types of fermions interacting with short-range interactions.In secondquantized notation, where na = â † â and nb = b † b are the number operators for the two species, expressed in terms of annihilation â/ b and creation â † / b † operators that satisfy anti-commutation relations.We consider the limit where V(r) is short-ranged, so that the interaction can be completely described by a single parameter, the s-wave scattering length a.
In general, the state of such a system with N particles must be described by a many-body wavefunction ψ(r 1 , r 2 , . . ., r N ) which requires an exponential amount of information, but dft allows us to reduce this description to an effective theory for the total density n(r) = ⟨ na (r)⟩ + ⟨ nb (r)⟩ or states which can be expressed as a Slater determinant -an anti-symmetrized product of single-particle states.Treated as a variational problem of finding the best single-Slaterdeterminant state (with some care required to express the zero-range limit), one derives a bcs-like ansatz in what is commonly referred to as Hartee-Fock-Bogoliubov (hfb) theory or Bogoliubov-de Gennes (bdg) theory for superconductivity.(See e.g.(55,56,84).)These theories qualitatively capture properties of the system Eq.( 11), but quantitatively fail in regions of interest like the ufg.The slda has a similar mathematical form, but is constructed from a different philosophy -that of density functional theory (dft).
The tddft equations follow from a condition of stationary action where |Ω(t)⟩ is quasiparticle vacuum at time t and E(t) is the total energy.The key to all dft approaches is the existence theorem due to Hohenberg and Kohn (4), with extension to the time-dependent cases by Runge and Gross (7), that for any given system, E(t) can be expressed as where E int [n] is a universal functional and V ext (r, t) is the external potential (which we set to zero here).Unfortunately, no prescription for finding E int [n] is known, and it is likely extremely complicated and non-local, even for non-interacting fermions.Instead, Kohn and Sham (5) derived an equivalent but alternate formulation in terms of an energy functional of a Slater determinant |Ω(t)⟩ of single-particle orbitals that allows for an exact local formulation for non-interacting fermions.By including both particles and holes, and with an appropriate regularization procedure, this was generalized for superfluids (19)(20)(21)85) in a form called the slda that we now describe.Unlike the bdg equations, we can now tune the parameters of the theory to match experiment and ab-initio qmc calculations.We lose any notion of a variational bound, but obtain instead a theory accurate to the few-percent level for a wide range of systems (21)(22)(23).
The energy function for the slda can be expressed as an integral of four local "densities" -each of which is a function of the orbitals in the Slater-determinant state |Ω(t)⟩: At T = 0, this system has only two length-scales: the inverse Fermi momentum k −1 F and the s-wave scattering length a.The slda functional is thus constrained by dimensional arguments (see Eq. ( 1)) to have the following form (in units where h = m = k B = 1): where A, B, and C are dimensionless universal functions.The first term defines the kinetic energy, the second term (missing in bdg theory) describes the Hartree energy, the third term accounts for the energy gain due to pairing correlations, and the last term is required to restore Galilean covariance.By appropriately choosing the universal functions, one can describe the entire bec-bcs crossover, including the weakly-interaction bcs limit k F a → 0 − , and the ufg |k F a| → ∞.The latter is especially simple because A, B, and C, are just numbers.Precisely, the functions are constructed in such way as to ensure correct reproductions of selected properties of uniform Fermi gas at a given value of the interactions parameter k F a.These properties are: equations of state ξ(k F a) = E /E FG (n), strength of the pairing correlations ∆(k F a)/ε F (n), and the quasiparticle effective mass m * (k F a) = m/A(k F a).All of these quantities are accessible from qmc calculations.For more details related to the construction of the functional, see (19-23, 27, 35, 85) and the source code for the reference implementation.In the calculations presented here, we have assumed that the effective mass m * = m, so A(k F a) = 1.This is a physically reasonable approximation that simplifies the functional slightly since the last term of Eq. ( 15) vanishes.
The local densities entering the functional are: After varying the functional, we will obtain a matrix equation Eq. ( 5) (derived below) with eigenvalues E n and two-component eigenstates In terms of these, the densities are: where we have suppressed the space-time arguments on the components u n and v n to save space, and we have introduced the thermal distribution function (Fermi distribution) which allows us to approximate finite temperatures T. We used the finite-temperature variant only for understanding the structure of the vortex core as presented in Fig. 7; the time-dependent runs were executed at T = 0. Note that all states up to a specified energy cutoff E c = h2 k 2 c /2m contribute to the densities: This is the main difference between the dft for superfluid systems and the original formulation of Kohn and Sham which only keeps states up to the Fermi surface.For superfluids, the single-particle state |Ω(t)⟩ has a coherent phase and hence does not have a well-defined particle number.Instead, the bcs particle-hole correlations (Cooper pairs) allow fractional occupation of states with larger energy.The cutoff E c is essential for a local formulation with short-range interactions because the kinetic τ(E c ) and anomalous ν(E c ) densities are linearly divergent in such a state While both τ and ν diverge linearly ∝ k c , the combination h2 τ/2m − ∆ † ν is finite where ∆ = g c ν is the finite pairing gap.To improve convergence (19), we choose a scale k 0 ≈ k F and use so that the functional is specified by the density dependence of the finite combination A/g c + Λ c .This finite quantity is proportional to the inverse scattering length 1/a in the standard variational formulation of bdg theory, but has additional dependence on k F a in the slda.
The regularization scheme is constructed to assure independence of the observables with respect to the energy cutoff E c (assuming it is large enough to encapsulate meaningful states) when considering static problems.In the context of time-dependent problems, another factor needs to be considered: the total energy must be conserved.Formally, the tdslda equations ( 5) are conservative only in E c → ∞ limit (86).For this reason, in the computation, we use the cutoff scale E c = k 2 max /2 defined through the maximum value of momentum k max = π resolved by our spatial grid.In our simulations, the relative change in of the total energy ⃓ ⃓ E(t) − E(0) ⃓ ⃓ /E(0) does not exceed a fraction of a percent.See (19-23, 27, 35, 85) and the source code for further details.
The equations of motion that emerge from the stationarity condition Eq. ( 12) have the forms Eqs. ( 5) and ( 7) discussed in the results above.The single particle Hamiltonian ĥ and pairing potential ∆ are defined through functional derivatives of the energy density ĥ = −∇ δE δτ (18b) They depend on densities, which in turn depend on the quasiparticle orbitals [u n (r), v n (r)] T .The single particle Hamiltonian ĥ includes a shift by the value of the chemical potential µ.This controls the particle number in the static solution for the initial state Eq.(7).
It is formally irrelevant for the time-evolution Eq. ( 5), but helps improve numerical convergence by minimizing the evolution of the global phase.The regularized coupling function g c defines the order parameter ∆, ensuring it remains finite.

Initial state preparation and vortex lines detection
To generate the quantum turbulence, we start with a regular alternating array of interleaved vortices and anti-vortices in all three directions.(See Fig. 5a.)Using the Biot-Savart law, including image vortices from the periodic box, we obtain the phase profile θ 0 (r) for this periodic array with a superfluid velocity field v s (r) ∝ ∇θ 0 (r) consistent with the vortex/anti-vortex lattice.On this periodic phase profile, we add a few small low-frequency Fourier components θ(r) = θ 0 (r) + a 0 ∑ Nv n=0 c n cos(k n • r), where N v is the number of vortices in each direction.The coefficients and frequencies are c n = (−1) n /(2n + 1) 2 , and k i n = (2n + 1)(2π)/L i box .The magnitude of the perturbation a 0 is adjusted so that the additional large-scale flow increases E flow by 5% compared with the vortex lattice (as computed within the gpe).This phase profile is then held fixed in r) , and the iterative solution for the magnitude of the gap |∆|, density, etc. is found using Eq. (7)as described above, which requires the improved elpa diagonalization routines.
To compute the total length of L, we need first to identify the position of vortices from numerical data.For this, we search in ∆(r) fields for points around which its phase winds.This also implies that, at this point, the order parameter vanishes.We interpolate grid data using discrete variable representation (dvr) to identify such points with subgrid resolution (87).To connect points into lines we use pseudovorticity ω(r) = ∇ × j(r), which should point along the line.A detailed description of the algorithm is given in (57).

Fig. 1 .
Fig. 1. lumi, Euro hpc's pre-exascale system with amd mi250x gpu-accelerated nodes.Each gpu node consists of the four amd mi250x gpus, each of which has two gcds being individual hip-programmable devices.Photo copyright Fade Creative.

Fig. 2 .
Fig. 2. Cost scaling.Cost (time-to-solution × node-count) of the parallel evolution of all quasiparticle wave-functions for a unit time interval (expressed in dimensionless units), which corresponds to 286 integration time steps, as measured on the lumi system.The number of nodes was adjusted to fit the problem in the memory available on the nodes, and ranges from 32 in the smallest case to 800 for the largest.The dashed line shows the expected ideal cost scaling.

Fig. 3 .
Fig. 3.Strong scaling of the elpa 1-stage solver.Shown is the time-to-solution of the elpa 1-stage solver for real, double-precision computations in a strong-scaling setup for different matrix sizes.The solid lines show the results on lumi with 4 amd mi250x gpus per node.The dashed lines show the results obtained on the raven hpc system of the mpcdf, with 4 nvidia a100 gpus per node.

Fig. 5 .
Fig. 5. Time evolution of the vortex tangle.Selected frames of the time evolution of vortex tangle in the strongly interacting ufg (k F a → ∞) superfluid gas.The lines indicate position of the vortex cores, and isosurfaces are used to visualize their sizes.For the simulation we used a periodic lattice with N 3 = 100 3 points.The full movie is provided in Supplementary Material.

Fig. 6 .
Fig. 6.Energy flow.Time evolution of: (a) the flow energy E flow , (b) total vortex length L, and (c) the condensation energy E cond .All quantities are normalized with respect to their initial values.The solid lines are from the tdslda calculations on lumi for the ufg (dark blue) and bcs (light orange) regimes.The dashed-dotted lines correspond to the simplified gpe model with two different phenomenological dissipation coefficients η ∈ {0.01, 0.08}.In panel (b), the thin gray lines show the expected slope for L(t) expected for Vinen (dotted) or Kolmogorov (dasheddotted) turbulence.The arrows indicate where the analyzed quantity drops by 94 % and 98 %.

Fig. 7 .
Fig. 7. Temperature dependence of the vortex core in the bcs regime.Radial dependence of the: (a) density n(r), (b) order parameter ∆(r), and (c) velocity v(r) = j(r)/n(r) for a single straight vortex line at various temperatures in the bcs regime (k F a = −1.8).The thin gray lines show the corresponding profiles of five randomly selected vortices from the full tdslda taken at time tε F = 1000.The temperature is normalized by the critical temperature of superfluid-normal phase transition T c within the static model: i.e. for T = T c one would have ∆(r) = 0.In the lower panel (c) we include the asymptotic form v(r) = 1/2r as a thin dashdotted line.Inset (d): temperature dependence of E flow and E cond energies for a single vortex.Together, these suggest an effective temperature of T ≈ 0.6T c at time tε F = 1000.

Fig. 8 .
Fig. 8. Effective temperature of vortices Vortex lines with the color indicating their local effective temperature.The temperature is extracted from the relation between density inside the vortex and the system's temperature for static solutions, see Fig 7(a).This snapshot is for ufg simulation at tε F = 300, where the reconnection event occurs and represents the same snapshot as Fig. 5d.The inset shows the histogram of temperatures along the vortex lines.The full movie is provided in the Supplementary Material.The highest temperatures are generally correlated with events like vortex crossings and reconnections at sites of high curvature, but this energy dissipates, resulting in a uniform effective temperature at later stages of evolution.