Nonlinearity

Performance

Starting out from the FFTs, I worked outward until I got all of the nonlinearity calculation ported, ie. calc_arakawa_nonlinearity_df.

I ran the big-8 case (IIRC) on a single node on Summit, and I used only 4 MPI procs, so either 4 GPUs or 4 CPU cores (baseline). Obviously, that’s an unfair comparison, so I’ll divide the CPU timing by 7 in the following to account for that one nodes has 42 cores / 6 GPUs.

The CPU code, measured with score-p/scalasca took 222s/7 = 32s. The GPU version took 25s. At least that’s faster, but obviously not by much. In my implementation, I explicitly do host -> device copies and v.v. At the point where most of the code is ported, these will move out (go away?), so to look at the potential of the GPU port, I tried to figure out how much time is spent in the actual kernel computations. Unfortunately, for whatever reason, scalasca didn’t want to give me that information. So I used nvprof and I got the following:

0.15%  104.39ms       168  621.38us  407.68us  847.55us  void composite_2way_fft_c2r<unsigned int=72, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=9, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
0.12%  85.837ms       168  510.93us  332.09us  819.77us  _Z13kernel_assignIN6thrust7complexIdEEZN71_INTERNAL_49_tmpxft_0000fc59_00000000_6_cuda_nonlinear_cpp1_ii_35ab1dc933launch_transpose_and_extend_cmplxIS2_EEv9ArrayViewIT_Li3EN5Space6DeviceEES9_EUliiiE_EvS9_T0_
0.11%  74.173ms        84  883.01us  879.58us  885.43us  void kernel_assign<double, Minus<Multiply<ArrayView<double, int=3, Space::Device>, ArrayView<double, int=3, Space::Device>>, Multiply<ArrayView<double, int=3, Space::Device>, ArrayView<double, int=3, Space::Device>>>>(ArrayView<double, int=3, Space::Device>, double)
0.07%  52.319ms         1  52.319ms  52.319ms  52.319ms  ij_deriv_kernel(int, int, int, double2 const *, int, double const *, double2 const *, double2*)
0.06%  44.891ms        42  1.0688ms  1.0666ms  1.0711ms  void kernel_assign<double, Plus<Plus<Plus<Plus<Plus<MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, Plus<Multiply<UnaryMinus<ArrayView<double, int=3, Space::Device>>, ArrayView<double, int=3, Space::Device>>, Multiply<ArrayView<double, int=3, Space::Device>, ArrayView<double, int=3, Space::Device>>>>>(ArrayView<double, int=3, Space::Device>, double)
0.05%  32.995ms        84  392.80us  391.52us  396.32us  void composite_2way_fft_r2c<unsigned int=72, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=9, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
0.03%  20.699ms        84  246.42us  245.92us  247.10us  void kernel_assign<thrust::complex<double>, MultiplyScalar<double, ArrayView<thrust::complex<double>, int=3, Space::Device>>>(ArrayView<double, int=3, Space::Device>, thrust::complex<double>)
0.02%  14.250ms        42  339.28us  338.78us  339.87us  _Z13kernel_assignIN6thrust7complexIdEEZN12NonLinearity9transposeI14MultiplyScalarId4PlusI9ArrayViewIS2_Li3EN5Space6DeviceEESA_EEEEvSA_RK10expressionIT_EEUliiiE_EvS7_ISE_Li3ES9_ET0_
0.02%  14.153ms        42  336.99us  336.54us  337.60us  _Z13kernel_assignIN6thrust7complexIdEEZN9Prefactor13multiply_withE9ArrayViewIS2_Li3EN5Space6DeviceEES7_EUliiiE_EvS4_IT_Li3ES6_ET0_
0.01%  9.8096ms        42  233.56us  232.99us  234.30us  _Z13kernel_assignIN6thrust7complexIdEEZN12NonLinearity7y_derivE9ArrayViewIS2_Li3EN5Space6DeviceEEEUliiiE_EvS4_IT_Li3ES6_ET0_

This adds up to about 450 ms total. This number needs to be multiplied by 4 GPUs, so that’s 1.8s for the nonlinearity calculation, which still compares to the estimated 32s on 7 CPU cores, so a speed-up of about 18x, which I think sounds very promising.

Changes to existing code

Getting good performance on the GPU requires lots of parallelism. The current r.h.s. computation is using tiling that’s optimized for CPUs and their caches. There is an outermost loop over blocks, where the granularity can be easily changed by the nblocks parameter. However, at the next level down, each blocks is processed xy-slice by slice, e.g., FFTs are executed for one xy slice at a time. That is too little work to effectively occupy a GPU, beside the fact that overheads for kernel launches etc also likely become significant at that point. So while I started at the single FFT level, I ended up moving the implementation up until I had enough work, which is at the calc_arakawa_nonlinearity_df level. The way it’s currently hooked in is not particularly pretty and still needs work, but to give some idea, here’s the relevant diff:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
diff --git a/src/df_arakawa_nonlinear_term.F90 b/src/df_arakawa_nonlinear_term.F90
index 0c9fd61b2..f2b5ab587 100644
--- a/src/df_arakawa_nonlinear_term.F90
+++ b/src/df_arakawa_nonlinear_term.F90
@@ -25,6 +25,25 @@ module df_arakawa_nonlinear_term_mod
      procedure :: getType => getThisType
   end type df_arakawa_nonlinear_term_t
 
+#ifdef USE_GPU
+  interface
+     
+     subroutine c_nonlinearity_1(c_nonlin, gy_chi, gy_chi_shape, g2d, g2d_shape, &
+          vexy, vexy_shape, dgdxy, dgdxy_shape, &
+          localrhs, localrhs_shape, first, ve_max) bind(C)
+       import
+       type(C_PTR), value :: c_nonlin
+       complex(C_DOUBLE_COMPLEX), dimension(*) :: gy_chi, g2d, vexy, dgdxy
+       complex(C_DOUBLE_COMPLEX), dimension(*) :: localrhs
+       integer(C_INT), dimension(*) :: gy_chi_shape, g2d_shape, vexy_shape, dgdxy_shape
+       integer(C_INT), dimension(*) :: localrhs_shape
+       logical(C_BOOL), value :: first
+       real(C_DOUBLE), dimension(2) :: ve_max
+     end subroutine c_nonlinearity_1
+
+  end interface
+#endif
+  
 contains
 
   function getThisType(this)
@@ -34,10 +53,29 @@ contains
     getThisType = "df_arakawa_nonlinear_term_t"
   end function getThisType
 
+#ifdef USE_GPU
+  subroutine calc_arakawa_nonlinearity_df_c(this, gy_chi, g_block, vexy, dgdxy, localrhs, first)
+    class(df_arakawa_nonlinear_term_t) :: this
+    complex, dimension(gdisc%li1:gdisc%li2, gdisc%lj1:gdisc%lj2, this%lbg0), intent(in) :: gy_chi
+    complex, dimension(gdisc%lbi:gdisc%ubi, gdisc%lj1:gdisc%lj2, this%lbg0), intent(in) :: g_block
+    complex, dimension(gdisc%li1:gdisc%li2, gdisc%lj1:gdisc%lj2, 2, this%lbg0), intent(in) :: vexy, dgdxy
+    complex, dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0), intent(inout) :: localrhs
+    logical, intent(in) :: first
+
+    complex, dimension(gdisc%li1:gdisc%li2, gdisc%lj1:gdisc%lj2,this%lbg0):: g2d
+
+    g2d = g_block(gdisc%li1:gdisc%li2,:,1:this%lbg0)
+    call c_nonlinearity_1(this%nonlin, gy_chi, shape(gy_chi), g2d, shape(g2d), &
+         vexy, shape(vexy), dgdxy, shape(dgdxy), localrhs, shape(localrhs), &
+         logical(first, C_BOOL), ve_max)
+    
+  end subroutine calc_arakawa_nonlinearity_df_c
+#endif
+  
   !>Computes the nonlinearity for the (x) nonlocal version
   !!\todo Get rid of transposition using striding? make some arrays for arakawa allocatable
   !!\todo Check whether the nonlinearity prefactor is important for the CFL criterion
-  Subroutine calc_arakawa_nonlinearity_df(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
+  Subroutine calc_arakawa_nonlinearity_df_orig(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
     class(df_arakawa_nonlinear_term_t) :: this
     Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:*),Intent(inout) :: gy_chi
     complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:*),intent(in) :: g_block
@@ -113,8 +151,36 @@ contains
     ! multiply with the prefactor and write to localrhs
     call this%prefactor%multiply_with(nonlin,localrhs,this%lbg0,lb1,lb2)
 
-  End Subroutine calc_arakawa_nonlinearity_df
+  End Subroutine calc_arakawa_nonlinearity_df_orig
 
+  subroutine calc_arakawa_nonlinearity_df(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
+#ifdef USE_GPU
+    use x_derivatives, only: rad_bc_type
+#endif
+
+    class(df_arakawa_nonlinear_term_t) :: this
+    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: gy_chi
+    complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:this%lbg0),intent(in) :: g_block
+    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,2,1:this%lbg0),Intent(inout) :: vexy, dgdxy
+    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: localrhs
+    Logical, Intent(in):: first
+    Integer, Intent(in):: lb1,lb2
+
+#ifdef USE_GPU
+    if (n_procs_y == 1 .and. n_procs_x == 1 .and. rad_bc_type == 1 .and. &
+         gdisc%yx_order .eqv. .false.) then
+       call calc_arakawa_nonlinearity_df_c(this, gy_chi, g_block, vexy, dgdxy, localrhs, first)
+    else
+       call calc_arakawa_nonlinearity_df_orig(this, gy_chi, g_block, vexy, dgdxy, localrhs, &
+            first, lb1, lb2);
+    endif
+#else
+    call calc_arakawa_nonlinearity_df_orig(this, gy_chi, g_block, vexy, dgdxy, localrhs, &
+         first, lb1, lb2);
+#endif
+       
+  end subroutine calc_arakawa_nonlinearity_df
+  
   !> Go to real space
   !! This routine is equal to the one of df_nonlinear_term, with the
   !! small exception, that here it is clear, that we do not have
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
diff --git a/src/df_nonlinear_term.F90 b/src/df_nonlinear_term.F90
index 18510cd3e..6cb236d99 100644
--- a/src/df_nonlinear_term.F90
+++ b/src/df_nonlinear_term.F90
@@ -25,6 +25,9 @@ module df_nonlinear_term_mod
 
   type, public, extends(nonlinear_term_t) :: df_nonlinear_term_t
      class(df_nonlinear_prefactor_t),allocatable,public :: prefactor
+#ifdef USE_GPU
+     type(C_PTR) :: nonlin
+#endif
    contains
      procedure :: initialize => initialize_nonlinearity
      procedure :: finalize => finalize_nonlinearity
@@ -42,6 +45,30 @@ module df_nonlinear_term_mod
 
   private
 
+  interface
+
+     type(C_PTR) function nonlinearity_create(nx0, nky0, ny0, howmany, nib, &
+          radscheme, kjmin, pnl_1d, pnl_1d_shape, coeff, coeff_shape) bind(C)
+       import
+       integer(C_INT), value :: nx0, nky0, ny0, howmany, nib, radscheme
+       real(C_DOUBLE), value :: kjmin
+       real(C_DOUBLE), dimension(*) :: pnl_1d, coeff
+       integer(C_INT), dimension(1) :: pnl_1d_shape, coeff_shape
+     end function nonlinearity_create
+     
+     subroutine nonlinearity_destroy(nonlin) bind(C)
+       import
+       type(C_PTR), value :: nonlin
+     end subroutine nonlinearity_destroy
+
+     subroutine cu_transpose_and_extend_cmplx(in, in_shape, out, out_shape) bind(C)
+       import
+       complex(C_DOUBLE_COMPLEX) :: in, out
+       integer(C_INT), dimension(2) :: in_shape, out_shape
+     end subroutine cu_transpose_and_extend_cmplx
+
+  end interface
+  
 contains
 
   function getThisType(this)
@@ -106,14 +133,31 @@ contains
   end function mem_est_nonlinearity
 
   subroutine initialize_nonlinearity(this)
+    use x_derivatives, only: radscheme, coeff
+    use coordinates, only: gcoor
     class(df_nonlinear_term_t) :: this
 
+#ifdef USE_GPU
+    integer :: ub, li
+#endif
+
     if (this%init_status==1) return
     !write(*,"(3A,I5)") "Initializing ",this%getType()," with blocksize = ",this%lbg0
     call initialize_fourier(gdisc%li0da,gdisc%ly0da)
 
     call this%prefactor%initialize()
 
+#ifdef USE_GPU
+    if (n_procs_y == 1) then
+       ub = gdisc%li0/n_procs_y
+       li = gpdisc%pi1 + my_pey*ub !lower index for relevant pnl_1d part
+       this%nonlin = nonlinearity_create(gdisc%li2 - gdisc%li1 + 1, &
+            gdisc%lj2 - gdisc%lj1 + 1, gdisc%ly0da, this%lbg0, gdisc%nib, radscheme, &
+            gcoor%kjmin, this%prefactor%pnl_1d(li:li+ub-1), shape(this%prefactor%pnl_1d(li:li+ub-1)), &
+            coeff, shape(coeff))
+    endif
+#endif
+
     this%init_status = 1
   end subroutine initialize_nonlinearity
 
@@ -127,6 +171,12 @@ contains
 
     call finalize_fourier
 
+#ifdef USE_GPU
+    if (n_procs_y == 1) then
+       call nonlinearity_destroy(this%nonlin);
+    endif
+#endif
+
     this%init_status = 0
   end subroutine finalize_nonlinearity
 
@@ -191,10 +241,10 @@ contains
   !!\todo Check whether the nonlinearity prefactor is important for the CFL criterion
   Subroutine calc_nonlinearity_df(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
     class(df_nonlinear_term_t) :: this
-    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:*),Intent(inout) :: gy_chi
-    complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:*),intent(in) :: g_block
-    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,2,1:*),Intent(inout) :: vexy, dgdxy
-    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:*),Intent(inout) :: localrhs
+    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: gy_chi
+    complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:this%lbg0),intent(in) :: g_block
+    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,2,this%lbg0),Intent(inout) :: vexy, dgdxy
+    Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: localrhs
     Logical, Intent(in) :: first
     Integer, Intent(in) :: lb1,lb2
 

In the end, I think it might be possible to just have another subclass of nonlinear_term_t that forwards to the C/C++ side.