Implementation of a parallel high-order WENO-type Euler equation solver using a CUDA PTX paradigm

This study proposes the optimization of low-level assembly code to reconstruct the flux for a splitting-flux Harten–Lax–van Leer (SHLL) scheme on high-end graphics processing units. The proposed solver is implemented using the weighted essentially non-oscillatory (WENO) reconstruction method to simulate compressible gas flows governed by the unsteady Euler equations. Instructions in the low-level assembly code, i.e. the parallel thread execution and instruction set architecture in the compute unified device architecture (CUDA), are used to optimize the CUDA kernel for the flux reconstruction method. The flux reconstruction method is a fifth-order one that is used to process the high-resolution intercell flux to achieve a highly localized scheme, such as the high-order implementation of the SHLL scheme. Benchmark test cases including the shock-tube and four-shock problems are demonstrated and compared. The results show that the reconstruction method is computationally very intensive and achieves excellent performance of up to 5183 GFLOP/s, ∼66% of the peak performance of the NVIDIA V100, using the low-level CUDA assembly code. The computational efficiency is twice that reported in previous studies. The CUDA assembly code reduces the number of calculations by 26.7% and increases the bandwidth by 37.5%. The optimal kernel reaches up to 990 GB/s in bandwidth. The overall efficiency of bandwidth and computational performance achieves 127% of the performance predicted by the HBM2-memory roofline model estimated by the Empirical Roofline Tool.

algorithm designs that are indeed important for applications on GPUs. Therefore, the scheme's performance is not well optimized with respect to the GPU hardware.
The WENO method is a high-resolution method that eliminates spurious oscillations close to locations where steep gradients of the flow properties exist. The method of Liu et al. [6] significantly increases computing efficiency because it uses a free-adaptation stencil to construct the high-order numerical approximation. The smoothness indicator proposed by Jiang and Shu is the most common weighting method [7]. This method determines the nonlinear weights for low-degree reconstructed polynomials to construct a fifth-order WENO-JS scheme [7], even on coarse uniform grids. Other WENO variants include the WENO-M [8], the WENO-Z [9] and the hybrid WENO method [10]. The WENO-M scheme proposes a new nonlinear weighting that achieves optimal convergence. The WENO-Z scheme instead uses a different smoothness indicator to reduce the number of calculation operations while maintaining accuracy.
A well-known evaluation model, named the roofline model, that predicts the maximum attainable performance of a kernel was proposed by Williams et al. [11]. The operational intensity (= FLOPs/byte) is the main factor in a typical roofline model. The model shows that performance bottlenecks are caused by the computational intensity and the bandwidth of the on-chip/off-chip memory.
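In its simplest form, the roofline model bounds the attainable performance by

    P_attainable = min(P_peak, I × B),

where P_peak is the peak floating-point rate, I is the operational intensity in FLOPs/byte and B is the memory bandwidth.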
Since the WENO method uses a single computing path [6], it is intrinsically highly computationally intensive. Wermelinger et al. [5] measured an operational intensity of 6.0 FLOPs/byte and used the roofline model to predict the maximum performance of the WENO-JS method. A divergent computing path in a numerical algorithm is a disadvantage for GPU computation. A typical example with a divergent computing path is the second-order SHLL method that uses the total variation diminishing (TVD hereafter) method [12-14] for better prediction of shocks on coarse grids. The TVD method features a divergent computing path because it includes logical operations, which is why the WENO method achieves better performance than the TVD methods: it features high operational intensity and a single computing path. Based on these two features, the WENO-based CUDA kernel can be well optimized and enhanced by using low-level assembly code.
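For illustration, a classic minmod limiter of the following form shows the kind of sign tests that create a divergent path on a GPU; the simplified TVD limiter of [12] used later in this paper may differ in detail.

    // Classic minmod slope limiter (illustrative; the paper's simplified TVD
    // limiter may differ). The sign tests below are the logical operations that
    // break the single computing path on a GPU.
    __device__ double minmod(double a, double b)
    {
        if (a * b <= 0.0) return 0.0;            // opposite signs: zero slope
        return (fabs(a) < fabs(b)) ? a : b;      // pick the smaller magnitude
    }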
The goal of optimizing the WENO method on a GPU is to achieve the maximum performance predicted by the roofline model. There are only a few studies seeking to optimize the corresponding kernel. Esfahanian et al. [15] described a computing flowchart within the WENO kernel and its memory arrangement on a GPU. Wermelinger et al. [5] measured the WENO kernel's floating-point performance to be 10-25% of the peak single-precision performance for an NVIDIA K20X. Diaz et al. [16] developed a 3D compressible flow solver using the WENO-Z method in double precision and performed a roofline model analysis for a WENO-Z kernel on an NVIDIA K80. The operational intensities for the WENO-JS method and the WENO-Z method are measured to be 6.0 [5] and 5.2 [16] FLOPs/byte, respectively.
Diaz et al. [16] used the empirical roofline benchmark suite to obtain a maximum performance of 1230 GFLOP/s and a memory bandwidth of 133 GB/s for a single GPU of an NVIDIA K80. They measured that the WENO-Z kernel operated at 365 GFLOP/s (∼30% of the peak performance of an NVIDIA K80 and ∼53% of the roofline-predicted performance). The roofline-predicted performance for the kernel is ∼691 GFLOP/s. Therefore, the computational efficiency of the WENO kernel on a GPU can certainly be improved, up to ∼56% of the peak performance of an NVIDIA K80 (the roofline bound).
In this study, we propose to apply the WENO-JS method to reconstruct the SHLL fluxes, yielding the WENO-SHLL method, which has fifth-order spatial accuracy at the intercell point on a Cartesian grid. The reconstructed flux terms for the SHLL method provide better accuracy and computing efficiency on a GPU, and the flux reconstruction eliminates spurious oscillations near steep gradients and improves the numerical accuracy in the spatial domain. A WENO-JS kernel with a highly optimized device function is developed in CUDA; the code-optimized kernel inserts low-level assembly code to increase the use of the L1/L2 cache, which gives greater performance on a GPU. The roofline model is used to predict the maximum attainable performance in terms of the kernel's computing intensity.
The remainder of this paper is organized as follows. The numerical methods, including the SHLL scheme and the WENO method, are briefly presented in Section 2. In Section 3, a parallel computation algorithm with optimized kernels that are enhanced by the assembly code is described. The benchmarking results of the developed GPU CFD code and the performance results for the roofline model are reported in detail in Section 4. Conclusions and future work are summarized in Section 5.

GOVERNING EQUATIONS AND NUMERICAL METHODS
The 2D Euler equation in Cartesian coordinates is described for convenience and is expressed in a conservative form, neglecting the source term, as follows:

    ∂U/∂t + ∂F/∂x + ∂G/∂y = 0,

where the vector U includes the conserved quantities and the vectors F and G are the respective fluxes for mass, momentum and energy in the x and y directions. These vectors are defined as

    U = (ρ, ρu, ρv, E)ᵀ,  F = (ρu, ρu² + p, ρuv, u(E + p))ᵀ,  G = (ρv, ρuv, ρv² + p, v(E + p))ᵀ,

where ρ is the gas density, and u and v are the velocities in the x and y directions, respectively. The pressure p is given by the ideal gas law (p = ρRT), where R is the gas constant and T is the gas temperature. The total energy E is given by

    E = ρ[C_V T + (u² + v²)/2],

where C_V is the specific heat at constant volume.

The Godunov-type finite-volume method
The finite-volume method with a TVD flux limiter constructs an approximate solution with second-order accuracy in the spatial domain using a uniform grid. Godunov's theorem states that a linear combination of the time-step solution preserves monotonicity and satisfies the conditions for a TVD scheme, as described in [12]. The Godunov method for time integration is a first-order scheme that approximates the cell-averaged solution and is described by the following equation in conservation form for the 1D case:

    U_i^{n+1} = U_i^n − (Δt/Δx)(F_{i+1/2} − F_{i−1/2}),

where U_i^n and F_* are the conserved quantities and the intercell fluxes at time level n, respectively, and Δt and Δx are the time step and the grid spacing, respectively. These are chosen to satisfy the Courant-Friedrichs-Lewy (CFL) condition. Figure 1 shows a typical control volume in the space-time domain covering the cell region between cells i and i + 1. The variable U_i^{n+1} is the solution at the n + 1 time level. The intercell flux F_{i+1/2} is defined by Eq. (5) and is the sum of the splitting fluxes from two neighboring cells:

    F_{i+1/2} = F_L^+ + F_R^−,    (5)

where U_L and U_R are located at U_i and U_{i+1}, respectively, and the flux F_{i−1/2} is defined similarly to F_{i+1/2}. Equation (5) formalizes the relationship between the intercell flux and the splitting fluxes on a uniform grid. The solution U_i^{n+1} has a less detailed structure, and a high-order numerical solution requires an additional spatial gradient to reconstruct the solution.
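As a minimal sketch of how this update maps to one GPU thread per cell (array layout and names here are illustrative assumptions, not the paper's code), the conservative update can be written as:

    // Sketch: first-order conservative update U_i^{n+1} = U_i^n - (dt/dx)(F_{i+1/2} - F_{i-1/2})
    // for a 1D grid with 4 conserved variables per cell; fluxes are stored at the
    // ncell+1 cell faces. Names and layout are illustrative assumptions.
    __global__ void godunov_update(const double* __restrict__ U_old,
                                   const double* __restrict__ F_face,
                                   double* __restrict__ U_new,
                                   int ncell, double dt_over_dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ncell) return;
        for (int k = 0; k < 4; ++k) {              // rho, rho*u, rho*v, E
            double fL = F_face[4 * i + k];         // flux at face i - 1/2
            double fR = F_face[4 * (i + 1) + k];   // flux at face i + 1/2
            U_new[4 * i + k] = U_old[4 * i + k] - dt_over_dx * (fR - fL);
        }
    }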
The second-order Godunov-type method [12] involves more accurate fluxes at the intercell level. The fluxes are then written with gradient terms that act as the flux limiters used to construct the second-order TVD scheme (Eq. (6)). Substituting Eq. (6) into Eq. (5) gives the weighted-average form of the second-order intercell flux F_{i+1/2} in terms of F_L^+ and F_R^−.

The SHLL method
The SHLL method [2] is used to split the mathematical vectors of the HLL flux expressions [17]. Figure 1 shows that the cell interface at i + 1/2 spatially separates the cells into a right-hand side (RHS) cell and a left-hand side (LHS) cell along the spatial axis. The volume region is limited to the time-space domain [0, T] × [X_L, X_R]. The LHS and RHS components of the flux across a cell interface are recast as in Eq. (8) [18], where the superscripts + and −, respectively, represent the splitting fluxes that move forward and backward across the interface. The variables U_* are the vectors of conserved quantities, F_* are the fluxes of the conserved properties and S_* are the wave speeds. The subscripts L and R, respectively, indicate the left-side and right-side cell regions. Furthermore, S_R^A ≈ V_L + a_L and S_L^A ≈ V_R + a_R are approximations of the correct wave speeds, where V_* is the local velocity and a_* is the local speed of sound. The SHLL method in Eq. (8) is a highly local scheme that is very computationally efficient on a GPU [2]. The splitting fluxes reduce the data requirement, and the cell-interface flux computation in Eq. (8) can easily be performed on two threads simultaneously. After substitution and addition of the dissipation coefficient D for the resulting interface fluxes, the complete splitting fluxes for the SHLL method take the form of Eq. (9), where χ_L and χ_R are the critical speeds. In an Euler gas flow, the critical speed is the Mach number; in a shallow-water flow, it is the Froude number. The critical speed is limited to the range [−1, 1]. If both critical speeds are equal to 1, the intercell flux is F_L and vice versa; if both speeds are equal to −1, the intercell flux is F_R. To allow easy adjustment of the flux approximation, the new splitting fluxes with two coefficients are written in the form of Eq. (10), where D_1 and D_2 are the artificial dissipation coefficients that are used to adjust the numerical dissipation. Coefficient D_2 is an artificial diffusion term. If the critical speed is 0 in Eq. (10) and Eq. (10) is substituted into Eq. (5), the intercell flux approaches the Rusanov flux [19] with diffusion coefficient D_2.
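Because Eqs. (8)-(10) are not reproduced above, the following device-function sketch only illustrates the locality of the splitting: each cell builds its forward and backward flux parts from its own state, with the critical speed clamped to [−1, 1]. The particular coefficients below are an assumed form chosen so that a zero critical speed recovers a Rusanov-type flux; they are not necessarily the paper's exact expressions.

    // Illustrative per-cell splitting of the flux into forward (Fplus) and backward
    // (Fminus) parts. chi is the local critical speed (Mach number), a is the local
    // speed of sound, D1 and D2 are the artificial dissipation coefficients. The
    // exact SHLL coefficients are assumptions here, not the paper's Eqs. (8)-(10).
    __device__ void shll_split(const double F[4], const double U[4],
                               double a, double chi, double D1, double D2,
                               double Fplus[4], double Fminus[4])
    {
        chi = fmin(1.0, fmax(-1.0, chi));                // limit critical speed to [-1, 1]
        double wp   = 0.5 * (1.0 + D1 * chi);            // forward weight
        double wm   = 0.5 * (1.0 - D1 * chi);            // backward weight
        double diff = 0.5 * D2 * a * (1.0 - chi * chi);  // artificial diffusion term
        for (int k = 0; k < 4; ++k) {
            Fplus[k]  = wp * F[k] + diff * U[k];
            Fminus[k] = wm * F[k] - diff * U[k];
        }
    }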
The above is summarized as follows. The first-order SHLL fluxes form a highly local scheme that includes artificial dissipation coefficients, and the intercell flux computation is separated and calculated by two threads. The second-order SHLL scheme includes a simplified TVD flux limiter [12], and the corresponding CUDA kernel includes some logical operations for the flux limiter, which inhibits single-path computation on a GPU. A highly local splitting-flux scheme can produce spurious oscillations near steep gradients, so the WENO flux reconstruction method is used for a high-order approximation of the splitting fluxes. The WENO method eliminates the high-order numerical oscillation of the SHLL scheme.

The WENO-JS reconstruction of the splitting fluxes
The SHLL scheme that uses the WENO-JS flux reconstruction method (WENO-SHLL hereafter) constitutes a high-order numerical stencil with free adaptability and high computational intensity. The WENO-SHLL method uses the fifth-order WENO-JS reconstruction of Shu [7] to reconstruct a high-order spatial solution on a Cartesian grid. Unlike the SHLL method, the WENO-JS method reconstructs the flux terms for the SHLL fluxes used in Eq. (5). The spatial discretization follows the nonlinear hyperbolic conservation law, and the semi-discretization scheme is written as

    dŪ_i/dt = −(1/Δx)(F_{i+1/2} − F_{i−1/2}),    (11)

where Ū_i is the averaged quantity over a cell. The WENO-JS method reconstructs the high-order splitting fluxes F⁺ and F⁻, respectively, from the series {F⁺} and {F⁻} at the points {x_{i−2}, x_{i−1}, x_i, x_{i+1}, x_{i+2}}, as shown in Fig. 2.
The WENO-JS method constructs the fifth-order automatic adaptive stencil using low-order interpolations at the point x_{i+1/2}. The high-order splitting flux F⁺_{i+1/2} is expressed as

    F⁺_{i+1/2} = Σ_{j=0..2} w_j S_j(x_{i+1/2}),

where w_j are the nonlinear weights based on the smoothness indicators [6] and S_j(x_{i+1/2}) are the third-order polynomial interpolations of the splitting fluxes at the point x_{i+1/2}, as shown in Fig. 2 and described below. The WENO-JS weights w_j are expressed as

    w_j = α_j / (α_0 + α_1 + α_2),  α_j = γ_j / (ε + β_j)²,

where the parameter ε = 10⁻⁶ is a small constant that ensures a positive value for w_j. The parameters γ_j are the linear weights for S_j(x_{i+1/2}) that are used to construct the fifth-order interpolation of the splitting fluxes at x_{i+1/2}. Shu [7] gives the linear weights as

    γ_0 = 1/10,  γ_1 = 6/10,  γ_2 = 3/10,

which are always positive and sum to 1. The corresponding smoothness indicators β_j are given by Jiang and Shu [7]; each β_j is written as a quadratic form of the cell-averaged fluxes f_j:

    β_0 = (13/12)(f_{i−2} − 2f_{i−1} + f_i)² + (1/4)(f_{i−2} − 4f_{i−1} + 3f_i)²,
    β_1 = (13/12)(f_{i−1} − 2f_i + f_{i+1})² + (1/4)(f_{i−1} − f_{i+1})²,
    β_2 = (13/12)(f_i − 2f_{i+1} + f_{i+2})² + (1/4)(3f_i − 4f_{i+1} + f_{i+2})².

The three terms S_j(x_{i+1/2}) are the Newton formulations of the polynomial interpolations at x_{i+1/2}. On a uniform grid, the low-order interpolations of the splitting fluxes at x_{i+1/2} are expressed as

    S_0(x_{i+1/2}) = (2f_{i−2} − 7f_{i−1} + 11f_i)/6,
    S_1(x_{i+1/2}) = (−f_{i−1} + 5f_i + 2f_{i+1})/6,
    S_2(x_{i+1/2}) = (2f_i + 5f_{i+1} − f_{i+2})/6.

The reconstruction of F⁻_{i−1/2} is mirror-symmetric with respect to x_i in this procedure. The splitting-flux method can exhibit numerical oscillation because of the steep gradients associated with a discontinuous flux solution; the WENO method prevents these oscillations in the cell-averaged solution. The explicit third-order low-storage Runge-Kutta method [20] integrates the spatial information in the RHS of Eq. (11) and is a TVD scheme with respect to the positive coefficients in the integration, as demonstrated by Gottlieb and Shu [20]. The method uses three stages to advance the solution to time step n + 1, in which Ū^n is the cell-averaged solution, L(Ū^n) denotes the spatial operator at the spatial discretization stage and Δt is the time-step size, as determined by the CFL condition with Courant number C_CFL. For this study, the time-step size is selected from the CFL condition in terms of the wave speeds |S^n_i|, |S^n_j| and |S^n_k| in the x, y and z directions, with the Courant number C_CFL < 1.0 to ensure numerical stability. To ensure stability for all test cases, a Courant number of 0.4 is used throughout the study, unless otherwise specified. Note that details of the numerical stability analysis are given in [18].
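As a sketch of how each Runge-Kutta stage can be applied on the device, a single blend kernel of the following form suffices; the (c0, c1, c2) coefficients listed in the comment are the standard three-stage TVD values and are an assumption here, since the paper's low-storage variant [20] may arrange the stages differently.

    // One Runge-Kutta stage as an array blend: u = c0*u_n + c1*u_stage + c2*dt*L(u_stage).
    // The standard TVD-RK3 stages use (c0, c1, c2) = (1, 0, 1), (3/4, 1/4, 1/4) and
    // (1/3, 2/3, 2/3). This arrangement is an assumption, not the paper's exact code.
    __global__ void rk_blend(double* __restrict__ u,
                             const double* __restrict__ u_n,
                             const double* __restrict__ u_stage,
                             const double* __restrict__ rhs,   // spatial operator L(u_stage)
                             double c0, double c1, double c2, double dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) u[i] = c0 * u_n[i] + c1 * u_stage[i] + c2 * dt * rhs[i];
    }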
Some notes about the WENO-SHLL method follow. The WENO method gives a high-order smooth solution for the SHLL splitting fluxes in the spatial derivative and prevents numerical oscillation near steep gradients. The free-adaptation stencil allows a single-path computation. The high computational intensity of the WENO method allows the maximum performance of a GPU to be approached, which makes it well suited for GPU computation. It can therefore be regarded as a high-order implementation of the SHLL scheme.

APPLICATION TO GPU COMPUTATION AND OPTIMIZATION
Implementation of the WENO-SHLL scheme on a single GPU

This section details the CUDA parallel thread execution (PTX) and instruction set architecture, which accelerate the CUDA kernel execution for the WENO-JS method. The GPU architecture consists of thousands of synchronously computing cores that perform computations with low data dependence, such as vector operations. A highly efficient CUDA kernel follows the single-instruction, multiple-threads programming model. The algorithm for the WENO-JS method uses thread-level parallelism and includes low-level floating-point operation instructions. A fused multiply-add (FMA hereafter) instruction, defined in the CUDA PTX, finishes the two floating-point operations of a multiply-add in one cycle. For example, the formula A = BX + C requires two floating-point operations, a multiplication and an addition, whereas an FMA instruction completes both in a single cycle. In many cases, the compiler does not insert FMA instructions optimally to accelerate such calculations, so the optimized code in this study uses FMA instructions manually to accelerate the computation on a GPU. The optimized kernel code can readily be used in a hybrid parallel computing model featuring MPI and CUDA on a multi-GPU system if needed. Details of the memory arrangement for the flow properties on the host and GPU devices are described in our previous work [21].
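For example, a double-precision FMA can be forced with an inline PTX instruction as in the sketch below; the wrapper name is illustrative, not from the paper.

    // Force a fused multiply-add a = b*x + c with one inline PTX instruction,
    // rather than relying on the compiler to fuse the multiply and the add.
    __device__ __forceinline__ double fma_ptx(double b, double x, double c)
    {
        double a;
        asm("fma.rn.f64 %0, %1, %2, %3;" : "=d"(a) : "d"(b), "d"(x), "d"(c));
        return a;
    }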
Figure 3 shows the flowchart of the WENO-SHLL method implemented on a GPU, which includes the main CUDA kernels for the computationally intensive work. The host part and the device part are handled by the CPU and GPU, respectively. The host part controls the computing loop and initializes/ends the simulation on the CPU and GPU. The device part executes a sequential flow within a series of CUDA kernels. These kernels include the flux calculation, the WENO-JS flux reconstruction and the calculation of the flow-property states. The cubic-spline immersed boundary method [21], implemented as a separate kernel, is activated if the geometry of the solid boundary is not aligned with the coordinate directions of the Cartesian grid. The third-order low-storage Runge-Kutta method adds a loop for time integration in the device part, which is generally three times slower than the Euler method but has much higher accuracy. The CFL condition is the last kernel in the computing loop and is used to adjust the size of the time step during the runtime. More details about these kernels are described next. The following algorithm is the pseudo code for the WENO flux reconstruction used to approach the fifth-order splitting flux.

Algorithm: Processing of the WENO-JS kernel in one dimension on GPU
The CUDA thread ID i is mapped to the ith cell for a 1D problem. In the for loop, the pseudo code moves the cell's splitting fluxes from the global memory to the fastest cache (register memory/L1 cache). The device function WENO_kernel is the optimized WENO kernel, and the high-order fluxes are calculated simultaneously on parallel threads. The WENO kernel can also be used for the WENO-Z or WENO-M methods. The data points for the series {F⁺_n} are shown in Fig. 2. The WENO-JS reconstruction approaches the high-order splitting flux at the point n + 1/2. Figure 4 shows the source code of the WENO-JS kernel as a device function for the previous algorithm. The function power2 is a squaring function that returns the square of the input value. The arithmetic precision of the WENO-JS kernel is kept as double precision because the WENO method is very sensitive to arithmetic precision.
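Since Fig. 4 itself is not reproduced in this text, the following is a minimal sketch of a fifth-order WENO-JS reconstruction device function in double precision using the standard Jiang-Shu coefficients; the function name and signature are illustrative assumptions rather than the paper's exact listing.

    // Minimal sketch (not the paper's Fig. 4 code): fifth-order WENO-JS
    // reconstruction of the forward split flux at x_{i+1/2} from the five cell
    // values f_{i-2} .. f_{i+2}. Standard Jiang-Shu weights; names are illustrative.
    __device__ __forceinline__ double power2(double x) { return x * x; }

    __device__ double weno5_js(double fm2, double fm1, double f0,
                               double fp1, double fp2, double eps)
    {
        // Smoothness indicators (Jiang & Shu)
        double b0 = 13.0 / 12.0 * power2(fm2 - 2.0 * fm1 + f0)
                  + 0.25 * power2(fm2 - 4.0 * fm1 + 3.0 * f0);
        double b1 = 13.0 / 12.0 * power2(fm1 - 2.0 * f0 + fp1)
                  + 0.25 * power2(fm1 - fp1);
        double b2 = 13.0 / 12.0 * power2(f0 - 2.0 * fp1 + fp2)
                  + 0.25 * power2(3.0 * f0 - 4.0 * fp1 + fp2);

        // Nonlinear weights from the linear weights 1/10, 6/10, 3/10
        double a0 = 0.1 / power2(eps + b0);
        double a1 = 0.6 / power2(eps + b1);
        double a2 = 0.3 / power2(eps + b2);
        double inv_sum = 1.0 / (a0 + a1 + a2);

        // Third-order candidate reconstructions at x_{i+1/2}
        double s0 = (2.0 * fm2 - 7.0 * fm1 + 11.0 * f0) / 6.0;
        double s1 = (-fm1 + 5.0 * f0 + 2.0 * fp1) / 6.0;
        double s2 = (2.0 * f0 + 5.0 * fp1 - fp2) / 6.0;

        return inv_sum * (a0 * s0 + a1 * s1 + a2 * s2);
    }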

Optimization of the CUDA kernels with PTX assembly codes
In this study, we optimize the CUDA kernel by adding PTX assembly instructions to the kernel code during code compilation. The PTX assembly instructions execute the data-parallel computing model on the NVIDIA GPU device. Because of the computational complexity, the compiler conservatively generates correct PTX/cubin files for the device kernel, which makes it difficult to achieve the best possible performance on the GPU.
Figure 5 shows two ways of compiling a CUDA code to the binary file. The general process of compiling CUDA code is shown in Fig. 5a: the NVCC compiler separates the CUDA code into two parts, the host code and the device code, and then compiles them to generate the binary file. Figure 5b shows the modified compilation process in which we manually add the low-level assembly instructions to the computationally intensive kernel. The PTX assembly instructions are therefore added manually to the WENO-JS CUDA kernel to optimize the execution of the kernel on the GPU. The manual PTX assembly conversion optimizes the use of floating-point calculations and of the register memory/L1 cache, but it makes the source code relatively difficult to read and debug. Therefore, there is no need to convert all kernels manually to PTX assembly; only the computationally intensive kernels are converted.
An example of manually translating CUDA code into PTX assembly code is presented next. For example, the device function foo calculates the equation A_1 = −U_1² + U_2 on a CUDA thread; the CUDA code is listed in Code 1.

Figure 6 shows the shock-tube (Sod's problem) simulation results with different combinations of the artificial dissipation coefficients (D_1, D_2). The traveling contact wave is next to the shock wave and the sonic rarefaction wave is on the left side. The results show that the WENO-SHLL method with (D_1, D_2) = (0.5, 1.0) performs best at the most numerically challenging contact discontinuity among all schemes, considering both accuracy and the suppression of spurious oscillation.

2D Riemann problem: Euler four-shock problem
In this section, the 2D Riemann problem [23, 24] with the propagation and complex interaction of four shocks is used to demonstrate and compare the numerical resolution of the different schemes. The computational domain is the square [0, 1] × [0, 1]. The initial conditions are prescribed on the four quadrants divided by the lines x = 0.8 and y = 0.8, with Neumann boundary conditions imposed at the outer boundaries. A uniform Cartesian mesh of 400 × 400 cells with a fixed time step of 0.0002 is used, and the final simulation time for the data presentation is set to 0.8. The four shocks interact and a narrow jet with a complex fluid structure is created near the center of the domain.
Figure 7 shows snapshots of the simulated density contours at time 0.8 using the SHLL-TVD scheme and the WENO-SHLL method with four combinations of the dissipation coefficients (D_1, D_2). The time integration uses the low-storage explicit Runge-Kutta method for all cases, except for the case in the lower-left corner, which uses the Euler method for comparison. The general features of the shocks and the complex fluid structure are captured by all the simulations. However, some details of the flow structure near the narrow jet are distinctly different between the SHLL-TVD scheme and the WENO-SHLL schemes. One typical example of the high-resolution simulation results using the WENO-SHLL scheme in Fig. 7 uses (D_1, D_2) = (1.0, 1.0). Changing the coefficient D_1 from 1.0 to 0.5 and the artificial dissipation coefficient D_2 from 1.0 to 0.8 provides better numerical resolution, with many finer flow structures inside the recirculation vortices near the end of the narrow jet. Therefore, the use of WENO-JS flux reconstruction with lower artificial dissipation coefficients for the splitting flux generally provides more accurate results and increases the resolution of the fluid structure.

Computational performance
As mentioned earlier, the baseline kernel for the WENO-JS method is written in CUDA code and the optimized kernel is enhanced using CUDA PTX assembly. The benchmark problem for the performance analysis is the Euler four-shock problem described previously. The mesh resolution and the time integration are 9 million cells (3000 × 3000 cells) and 100 time steps with a fixed time step, respectively. The baseline/optimized kernels are invoked 400 times in total over the 100 time steps because each grid cell has four splitting fluxes. The high-order intercell flux transferring kernel described by Eq. (5) sends the flux information to the neighboring cells; this kernel requires ∼40% of the computing time. The minmod limiter has the longest computing time per call because the minmod limiter function in the presented code calculates four flux gradients, one for each of the four directions of a cell, in a single call. Table 1 summarizes the times required by the main CUDA kernels in the computing flowchart, including the SHLL-TVD method and the WENO-SHLL methods. The fifth-order SHLL method with WENO-JS flux reconstruction is found to have the highest computing efficiency in the current study. The averaged computing times for the minmod limiter and the optimized WENO-JS method are ∼2.92 and ∼0.85 ms, respectively; therefore, the total computing time of the WENO-JS kernel for the four splitting fluxes of a single cell is ∼3.4 ms. The computing time of the WENO-JS kernel is larger than that of the minmod limiter, but it is more accurate and provides higher numerical resolution in the complex fluid structure region.
Table 2 summarizes the results for the floating-point computation and the memory bandwidth. The NVIDIA visual profiler is used to measure performance data for the NVIDIA V100 and K20c. The NVIDIA K20c uses a GPU core with the Kepler architecture and is a benchmark GPU device [4, 16]. The results show that the optimized kernel reaches ∼50% of the peak performance of the NVIDIA K20c; the computational efficiency is twice that reported in the previous studies [4, 16]. For the NVIDIA V100, the averaged computing time for each call in the x and y directions is 0.848 and 0.855 ms, respectively. The total number of floating-point operations is 5.9994 × 10⁹ for the baseline kernel and 4.3965 × 10⁹ for the optimized kernel, so the optimized kernel consumes only 73.3% of the floating-point operations of the baseline kernel. For the NVIDIA V100, the optimized kernel reaches 5183.9 and 5137.4 GFLOP/s in the x and y directions, respectively, whereas the baseline kernel reaches 4475.5 and 4413.2 GFLOP/s. The optimized kernel performs much better because it has a greater memory bandwidth and requires fewer floating-point operations.
The peak performance of the NVIDIA V100 is 7800 GFLOP/s based on the manufacturer's specifications. The optimized kernel achieves ∼66% of this figure, whereas the baseline kernel achieves only ∼57%, so the optimized kernel increases the utilization of the computational performance by ∼9%. For the NVIDIA K20c, the optimized kernel's utilization increases by ∼3%. Compared with the baseline kernel, the optimized kernel uses 37% less computing time on the NVIDIA V100 and K20c. The averaged bandwidths for the baseline and the optimized kernels are ∼720 and ∼990 GB/s, respectively, so the optimized kernel's memory bandwidth increases by ∼37.5%. The bandwidth results for the optimized kernel (∼990 GB/s) include load/store rates in the x and y directions of 836.5/159.3 and 833.5/158.8 GB/s, respectively. The result is higher than the global memory's peak bandwidth of 900 GB/s because the measured bandwidth includes contributions from both the global memory and the L2 cache. The higher bandwidth increases the number of instructions executed per cycle, and the PTX assembly code can reuse the cached data in the registers.

Roofline model analysis
Figure 8 shows the NVIDIA V100's roofline graph, a performance model that is used to determine the attainable maximum performance of arithmetic operations. Arithmetic intensity is the ratio of floating-point operations to memory traffic (in bytes). The Empirical Roofline Tool [25] (ERT hereafter) is an open-source code for evaluating and drawing the roofline model for many kinds of processing units. The maximum performance for the NVIDIA V100 is 7749.1 GFLOP/s, as determined by ERT. The short-dashed line in Fig. 8 gives the upper limit of the attainable performance for different arithmetic intensities based on the global memory, while the long-dashed line shows the attainable performance based on the L2 cache memory bandwidth. The bandwidths of the global memory and the L2 cache are 783.2 and 3352.2 GB/s, respectively. The roofline model therefore predicts that, if a kernel has an arithmetic intensity of 1.0 and uses the global memory for the computation, its attainable performance on the NVIDIA V100 is 783.2 GFLOP/s; if the kernel reuses the cached data in the L2 cache throughout its runtime, the predicted performance can be as high as 3352 GFLOP/s. Figure 8 also shows the measured performance of four kernels: the SHLL splitting-flux kernel, the flux-gradient kernel of the minmod limiter and the baseline/optimized WENO-JS kernels. The flux-gradient kernel has a lower arithmetic intensity and thus lower computational performance, as expected. Compared with the flux-gradient kernel, the two WENO-JS kernels (baseline and optimized) show a much higher arithmetic intensity and much better measured performance, as summarized in Table 3.
Table 3 shows that the measured arithmetic intensities for the WENO-JS kernels are 6.2 for the baseline kernel and 5.2 for the optimized kernel. These values are close to those measured by Wermelinger et al. [5] and Diaz et al. [16], who reported arithmetic intensities of 6.0 and 5.2 for the baseline WENO-JS method on the NVIDIA K20X and the NVIDIA K80, respectively. The attainable performance predicted from the NVIDIA V100 global memory is between 4072 and 4855 GFLOP/s for arithmetic intensities between 5.2 and 6.2. The results of this study show that the optimized WENO-JS kernel achieves a performance of 5183 GFLOP/s at an operational intensity of 5.2, i.e. a computing efficiency of 127% of the predicted performance. The performance of the baseline kernel is 4478 GFLOP/s at an arithmetic intensity of 6.2, which is 92.5% of the predicted performance of 4855 GFLOP/s. In addition, Table 3 summarizes the results for the baseline/optimized kernels executed on the NVIDIA K20c. The computing efficiency of the optimized kernel is also better than that of the baseline kernel on the NVIDIA K20c. However, the increase in performance is smaller than on the NVIDIA V100 because a cache-miss problem on the NVIDIA K20c reduces the computing efficiency and affects the total bandwidth as a result of the small L2 cache.

CONCLUSIONS
This study uses the WENO-SHLL method with lower artificial dissipation coefficients to construct a high-order WENO-type flux-splitting scheme and demonstrates the low-level instruction implementation of an optimized CUDA kernel enhanced by PTX assembly. The GPU-based Euler equation solver is benchmarked for the 1D Sod shock-tube problem by comparing the results with the analytical solution. The artificial coefficients for the SHLL scheme are adjusted and compared for a 2D four-shock problem with a very complex fluid structure. The results show that a set of lower artificial dissipation coefficients for the WENO-SHLL scheme provides better numerical resolution than the other sets of artificial dissipation coefficients; however, it requires 17% more computing time than the SHLL-TVD kernel.
The benchmarking results of the CUDA kernels show that the WENO-JS kernel enhanced by the PTX assembly uses 26.7% fewer floating-point operations and increases the memory bandwidth by 37.5%. The roofline model analyses show that the optimized kernel has a high arithmetic intensity of 5.2 and that its measured performance achieves 66% of the peak performance of the NVIDIA V100, which is twice the computing efficiency of the studies reported previously. The overall efficiency of the bandwidth and the computational performance achieves 127% of the performance predicted by the HBM2-memory roofline model.

Figure 1 Illustration of the HLL fluxes in the control volume inside two cells.

Figure 3 Flowchart for the simulation program for parallel computation using CUDA.

Algorithm: Processing of the WENO-JS kernel in one dimension on GPU (listing).
INPUT {F⁺_{n−2}, …, F⁺_{n+2}}: 5 fluxes from the neighboring cells. OUTPUT WENO F⁺_{n+1/2}: WENO-JS reconstructed flux to the right cell.
1: Set the mapping of the CUDA threads to the N cells in one dimension
2: Set the index i = {3, 4, 5, …, N − 2} for the cells and the thread IDs
3: Set the parameter ε = 10⁻⁶
4: For loop for the 2D Euler equation do
5: Make copy of the local fluxes {F⁺_n} from the global memory to registers

Figure 4 CUDA kernel (device function) for the WENO-JS method. The input variables are five fluxes from the neighboring cells and epsilon is a small value.

Code 1: The CUDA code for the computation of the formula A_1 = −U_1² + U_2.

    __device__ __forceinline__ double foo(double A1, double U1, double U2)
    {
        A1 = (U1) * (-U1) + (U2);
        return A1;
    }

The CUDA device function foo has three input variables. CUDA provides a special statement, named asm, that allows a PTX assembly code to be inserted in the device function, so the enhanced code with the manual translation to PTX assembly could become the listing sketched below.
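A minimal sketch of such a translation, assuming the PTX neg.f64 and fma.rn.f64 instructions are used (not necessarily the paper's exact listing):

    __device__ __forceinline__ double foo(double A1, double U1, double U2)
    {
        double negU1;
        // neg.f64 negates U1; fma.rn.f64 then computes (-U1)*U1 + U2 in one fused step
        asm("neg.f64 %0, %1;" : "=d"(negU1) : "d"(U1));
        asm("fma.rn.f64 %0, %1, %2, %3;" : "=d"(A1) : "d"(negU1), "d"(U1), "d"(U2));
        return A1;
    }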

Figure 5 CUDA compilation trajectory describing the compiling procedure of a CUDA C/C++ file. (a) The general CUDA compilation trajectory. (b) The CUDA compilation trajectory with the manual PTX assembly code.

Figure 6 The shock-tube simulation: Sod's problem. (a) The density profiles for the SHLL-TVD (minmod) method and the present method (WENO-SHLL scheme) with different dissipation parameters. (b) The velocity profiles.

Figure 8 The roofline graph for the NVIDIA Tesla V100 generated by ERT. The HBM2 memory is a high-speed memory interface that achieves high bandwidth on the NVIDIA V100. The L2 cache has lower latency and much higher bandwidth (up to 3.3 TB/s) than the global (HBM2) memory.

Table 1 Computing times of the 2D Euler equation solver, including the SHLL-TVD and WENO-SHLL (CUDA and CUDA-PTX) methods.

Table 3 Results of the roofline model analysis of the WENO kernel with CUDA PTX code. The roofline efficiency is based on the predicted performance from the roofline model with HBM2 memory.