GENE Code Performance Overview¶

Caveats¶

The following is based on the big-8 test case, with some modifcations. I believe what it shows is essentially typical behavior, but YMMV.
I’m myself rather new to GENE, so I may have missed some things or gotten something wrong.
I had some issue with the profiling of the CPU-only version, so the profile is not as detailed as it could be, but the important things are there.

Big picture¶

Most of the profiling I’m showing in the following is from a benchmark run on Summit that just used 4 CPU cores. The code is MPI parallelized. The profile on all procs should essentially looks the same, with maybe some minor imbalance at times due to physical boundary conditions.

At the top level, you see two phases, initialization and timeloop. Since the initialization happens only once, its performance can be mostly ignored. The timeloop in this example is very short (only two time steps), but in real life wll be composed of a very large number of steps.

There can also be a phase 0, the autotuning. It is turned off in this run. That phase will determine parameters to get optimal performance, by

evaluating different options for domain decomposition across the given number of MPI processes
find an optimal parameter for cache blocking
evaluate variants of various kernels which do the same operation but are implemented in some variations

As a general overview, the GENE code integrates 6-dimensional distribution functions in time, together with supporting lower-dimensional quantities like the electromagnetic field. Distribution functions are discretized in space (x, y, z), velocity space (v, w) and by species, The code often considers the distribution function arrays in a reshaped fashion, where it becomes f(i, j, klmn), ie. many i-j (x-y) slices, where klmn collapses the remaining 4 dims (k, l, m, n). This is what the cache blocking is based on. If nblocks == 1, all nk * nl * nm * nn slices are processed together. However, if nblocks > 1, nk * nl * nm * nn is subdivided into as many equal-sized pieces (“blocks”). Then, all operations are performed for just the first block, next all operations are performed on the second block, etc. In order to have a large amount of parallelism available for offloading onto the GPU, it is helpful to have large blocks, so in my work I have been using nblocks = 1, which for the test I’ve been using gives me a per-MPI process index space of 120 x 24 x 4 x 32 x 8 x 2 complex quanities, which is giving me pretty good results in terms of GPU performance already, but production runs are going to have yet larger problem sizes.

Timeloop¶

Zooming into the timeloop, once can see the general structure of time integration.

As stated above, in this benchmark case, the timeloop is only run for two steps, which show up as eRK_standard. Each timestep is performed by a 4th order Runge-Kutta (RK) method. which is comprised of 4 essentially identical stages. At each stage, the r.h.s. is calculated and then used to update some temporary fields, which eventually are combined into the new solution at time n+1.

In a real run, additionally, diagnostics will happen every so many time steps – this is ignored here.

RK stage¶

Each RK stage consists of some prep work (calcaux) and then the computation of the r.h.s. (which is also a 6-d array, like the distribution functions) (‘’CalFRhs0’‘). At the end of the stage there are some updates of the temporary arrays, which is missing a marker in this profile, it’s in the white space after CalFRhs0.

Computation of the r.h.s.¶

In the profile above, one can also see the steps involved in calculating the r.h.s. – essentially, terms from the physical equations are added up one after the other. These involve z and v derivatives (dzv_ak), hyperdiffusion (hyp_zv), various terms involving x and y derivatives (dfdxy (doesn’t show up separately), dchidxy, dgdxy), and then finally the nonlinear terms (add_nl).

One can see that while CalcFRhs0 if taken inclusively is where the majority of time is spent, it calls a number of other functions each of which take somewhat comparable amounts of time, so there is no one function that dominates the overall behavior.

The call chain doesn’t end here, functions like dchidxy are often further subdivided into a few actual computational kernels (typically I’d say about two or so).

The actual kernels are for the most part stencil computations – to show an example of a pretty large one:

do concurrent(n=bl(6):bu(6),&
  m=bl(5):bu(5), l=bl(4):bu(4), &
  k=bl(3):bu(3), j=bl(2):bu(2) )
b(i1:i2,j,k,l,m,n) = &
     sten(i1:i2, 1,k,l,m,n)*a(i1:i2,j,k  ,l-2,m,n) + &
     sten(i1:i2, 2,k,l,m,n)*a(i1:i2,j,k-1,l-1,m,n) + &
     sten(i1:i2, 3,k,l,m,n)*a(i1:i2,j,k  ,l-1,m,n) + &
     sten(i1:i2, 4,k,l,m,n)*a(i1:i2,j,k+1,l-1,m,n) + &
     sten(i1:i2, 5,k,l,m,n)*a(i1:i2,j,k-2,l  ,m,n) + &
     sten(i1:i2, 6,k,l,m,n)*a(i1:i2,j,k-1,l  ,m,n) + &
     sten(i1:i2, 7,k,l,m,n)*a(i1:i2,j,k  ,l  ,m,n) + &
     sten(i1:i2, 8,k,l,m,n)*a(i1:i2,j,k+1,l  ,m,n) + &
     sten(i1:i2, 9,k,l,m,n)*a(i1:i2,j,k+2,l  ,m,n) + &
     sten(i1:i2,10,k,l,m,n)*a(i1:i2,j,k-1,l+1,m,n) + &
     sten(i1:i2,11,k,l,m,n)*a(i1:i2,j,k  ,l+1,m,n) + &
     sten(i1:i2,12,k,l,m,n)*a(i1:i2,j,k+1,l+1,m,n) + &
     sten(i1:i2,13,k,l,m,n)*a(i1:i2,j,k  ,l+2,m,n)
 end do

The nonlinearity calculation is a bit different in that it involves FFTs into real space, then a number of stencil computations, then FFTs back. It consists of more kernels (maybe ~8) than the remaining terms, and the FFTs, so that’s why it shows up as slower.

The nonlinearity calculation as-is in Fortran is not well suited for offloading because it currently is written as slice-by-slice processing, rather than whole blocks, so the available parallelism is too small. (I have rewritten it by block for the CUDA work, but that’s in C++ – it wouldn’t be difficult to do this in Fortran, too, though.)

All the other stencil computations I think are decent candidates for offloading, they’re probably memory bound but can still can get pretty nice speed-ups.

There are some additional operations not covered in the above, for example the field solve (sparse linear algebra), and gyro-averaging (also linear algebra), but the vast majority of time is spent in various forms of stencil computations.

Current GPU work¶

So in order to make my point that off-loading works, let me show the current state of the GPU port using CUDA:

This profile also shows a single RK stage, ie., calcaux, bar_emf, and CalFRhs0. (bar_emf didn’t show up above since it’s pretty fast compared to its surrounding calcaux and bar_emf).

CalFRhs0 has been the focus of my work, and it’s vastly faster. It’s still comprised of the same underlying terms (too small to be labeled) and then the nonlinarity part. The nonlinearity part now takes a larger fraction of the r.h.s. calculation, but essentially, the picture is similar.

What’s different is that the intial part of the RK stage now takes a relatively larger fraction of time – the reason for this, I believe, is that the gyroaveraging, while on the GPU, calls way too many small kernels, so some refactoring into bigger kernels is needed there.

In terms of just looking at CalcFRhs0 itself, the timing went from 1.19 s (1 CPU core) to 27 ms (1 V100 GPU). This is of course an unfair comparison, but extrapolating given that Summit has 7 CPU cores / GPU, it’s a speed-up of 6.3x, which is at least a good start.