Nonlinearity¶
Performance¶
Starting out from the FFTs, I worked outward until I got all of the
nonlinearity calculation ported, ie. calc_arakawa_nonlinearity_df.
I ran the big-8 case (IIRC) on a single node on Summit, and I used
only 4 MPI procs, so either 4 GPUs or 4 CPU cores
(baseline). Obviously, that’s an unfair comparison, so I’ll divide the
CPU timing by 7 in the following to account for that one nodes has 42
cores / 6 GPUs.
The CPU code, measured with score-p/scalasca took 222s/7 = 32s. The GPU version took 25s. At least that’s faster, but obviously not by much. In my implementation, I explicitly do host -> device copies and v.v. At the point where most of the code is ported, these will move out (go away?), so to look at the potential of the GPU port, I tried to figure out how much time is spent in the actual kernel computations. Unfortunately, for whatever reason, scalasca didn’t want to give me that information. So I used nvprof and I got the following:
0.15% 104.39ms 168 621.38us 407.68us 847.55us void composite_2way_fft_c2r<unsigned int=72, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=9, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
0.12% 85.837ms 168 510.93us 332.09us 819.77us _Z13kernel_assignIN6thrust7complexIdEEZN71_INTERNAL_49_tmpxft_0000fc59_00000000_6_cuda_nonlinear_cpp1_ii_35ab1dc933launch_transpose_and_extend_cmplxIS2_EEv9ArrayViewIT_Li3EN5Space6DeviceEES9_EUliiiE_EvS9_T0_
0.11% 74.173ms 84 883.01us 879.58us 885.43us void kernel_assign<double, Minus<Multiply<ArrayView<double, int=3, Space::Device>, ArrayView<double, int=3, Space::Device>>, Multiply<ArrayView<double, int=3, Space::Device>, ArrayView<double, int=3, Space::Device>>>>(ArrayView<double, int=3, Space::Device>, double)
0.07% 52.319ms 1 52.319ms 52.319ms 52.319ms ij_deriv_kernel(int, int, int, double2 const *, int, double const *, double2 const *, double2*)
0.06% 44.891ms 42 1.0688ms 1.0666ms 1.0711ms void kernel_assign<double, Plus<Plus<Plus<Plus<Plus<MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, MultiplyScalar<double, ArrayView<double, int=3, Space::Device>>>, Plus<Multiply<UnaryMinus<ArrayView<double, int=3, Space::Device>>, ArrayView<double, int=3, Space::Device>>, Multiply<ArrayView<double, int=3, Space::Device>, ArrayView<double, int=3, Space::Device>>>>>(ArrayView<double, int=3, Space::Device>, double)
0.05% 32.995ms 84 392.80us 391.52us 396.32us void composite_2way_fft_r2c<unsigned int=72, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=9, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
0.03% 20.699ms 84 246.42us 245.92us 247.10us void kernel_assign<thrust::complex<double>, MultiplyScalar<double, ArrayView<thrust::complex<double>, int=3, Space::Device>>>(ArrayView<double, int=3, Space::Device>, thrust::complex<double>)
0.02% 14.250ms 42 339.28us 338.78us 339.87us _Z13kernel_assignIN6thrust7complexIdEEZN12NonLinearity9transposeI14MultiplyScalarId4PlusI9ArrayViewIS2_Li3EN5Space6DeviceEESA_EEEEvSA_RK10expressionIT_EEUliiiE_EvS7_ISE_Li3ES9_ET0_
0.02% 14.153ms 42 336.99us 336.54us 337.60us _Z13kernel_assignIN6thrust7complexIdEEZN9Prefactor13multiply_withE9ArrayViewIS2_Li3EN5Space6DeviceEES7_EUliiiE_EvS4_IT_Li3ES6_ET0_
0.01% 9.8096ms 42 233.56us 232.99us 234.30us _Z13kernel_assignIN6thrust7complexIdEEZN12NonLinearity7y_derivE9ArrayViewIS2_Li3EN5Space6DeviceEEEUliiiE_EvS4_IT_Li3ES6_ET0_
This adds up to about 450 ms total. This number needs to be multiplied by 4 GPUs, so that’s 1.8s for the nonlinearity calculation, which still compares to the estimated 32s on 7 CPU cores, so a speed-up of about 18x, which I think sounds very promising.
Changes to existing code¶
Getting good performance on the GPU requires lots of parallelism. The
current r.h.s. computation is using tiling that’s optimized for CPUs
and their caches. There is an outermost loop over blocks, where the
granularity can be easily changed by the nblocks parameter. However,
at the next level down, each blocks is processed xy-slice by slice,
e.g., FFTs are executed for one xy slice at a time. That is too little
work to effectively occupy a GPU, beside the fact that overheads for
kernel launches etc also likely become significant at that point. So
while I started at the single FFT level, I ended up moving the
implementation up until I had enough work, which is at the
calc_arakawa_nonlinearity_df level. The way it’s currently hooked
in is not particularly pretty and still needs work, but to give some
idea, here’s the relevant diff:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 | diff --git a/src/df_arakawa_nonlinear_term.F90 b/src/df_arakawa_nonlinear_term.F90
index 0c9fd61b2..f2b5ab587 100644
--- a/src/df_arakawa_nonlinear_term.F90
+++ b/src/df_arakawa_nonlinear_term.F90
@@ -25,6 +25,25 @@ module df_arakawa_nonlinear_term_mod
procedure :: getType => getThisType
end type df_arakawa_nonlinear_term_t
+#ifdef USE_GPU
+ interface
+
+ subroutine c_nonlinearity_1(c_nonlin, gy_chi, gy_chi_shape, g2d, g2d_shape, &
+ vexy, vexy_shape, dgdxy, dgdxy_shape, &
+ localrhs, localrhs_shape, first, ve_max) bind(C)
+ import
+ type(C_PTR), value :: c_nonlin
+ complex(C_DOUBLE_COMPLEX), dimension(*) :: gy_chi, g2d, vexy, dgdxy
+ complex(C_DOUBLE_COMPLEX), dimension(*) :: localrhs
+ integer(C_INT), dimension(*) :: gy_chi_shape, g2d_shape, vexy_shape, dgdxy_shape
+ integer(C_INT), dimension(*) :: localrhs_shape
+ logical(C_BOOL), value :: first
+ real(C_DOUBLE), dimension(2) :: ve_max
+ end subroutine c_nonlinearity_1
+
+ end interface
+#endif
+
contains
function getThisType(this)
@@ -34,10 +53,29 @@ contains
getThisType = "df_arakawa_nonlinear_term_t"
end function getThisType
+#ifdef USE_GPU
+ subroutine calc_arakawa_nonlinearity_df_c(this, gy_chi, g_block, vexy, dgdxy, localrhs, first)
+ class(df_arakawa_nonlinear_term_t) :: this
+ complex, dimension(gdisc%li1:gdisc%li2, gdisc%lj1:gdisc%lj2, this%lbg0), intent(in) :: gy_chi
+ complex, dimension(gdisc%lbi:gdisc%ubi, gdisc%lj1:gdisc%lj2, this%lbg0), intent(in) :: g_block
+ complex, dimension(gdisc%li1:gdisc%li2, gdisc%lj1:gdisc%lj2, 2, this%lbg0), intent(in) :: vexy, dgdxy
+ complex, dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0), intent(inout) :: localrhs
+ logical, intent(in) :: first
+
+ complex, dimension(gdisc%li1:gdisc%li2, gdisc%lj1:gdisc%lj2,this%lbg0):: g2d
+
+ g2d = g_block(gdisc%li1:gdisc%li2,:,1:this%lbg0)
+ call c_nonlinearity_1(this%nonlin, gy_chi, shape(gy_chi), g2d, shape(g2d), &
+ vexy, shape(vexy), dgdxy, shape(dgdxy), localrhs, shape(localrhs), &
+ logical(first, C_BOOL), ve_max)
+
+ end subroutine calc_arakawa_nonlinearity_df_c
+#endif
+
!>Computes the nonlinearity for the (x) nonlocal version
!!\todo Get rid of transposition using striding? make some arrays for arakawa allocatable
!!\todo Check whether the nonlinearity prefactor is important for the CFL criterion
- Subroutine calc_arakawa_nonlinearity_df(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
+ Subroutine calc_arakawa_nonlinearity_df_orig(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
class(df_arakawa_nonlinear_term_t) :: this
Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:*),Intent(inout) :: gy_chi
complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:*),intent(in) :: g_block
@@ -113,8 +151,36 @@ contains
! multiply with the prefactor and write to localrhs
call this%prefactor%multiply_with(nonlin,localrhs,this%lbg0,lb1,lb2)
- End Subroutine calc_arakawa_nonlinearity_df
+ End Subroutine calc_arakawa_nonlinearity_df_orig
+ subroutine calc_arakawa_nonlinearity_df(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
+#ifdef USE_GPU
+ use x_derivatives, only: rad_bc_type
+#endif
+
+ class(df_arakawa_nonlinear_term_t) :: this
+ Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: gy_chi
+ complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:this%lbg0),intent(in) :: g_block
+ Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,2,1:this%lbg0),Intent(inout) :: vexy, dgdxy
+ Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: localrhs
+ Logical, Intent(in):: first
+ Integer, Intent(in):: lb1,lb2
+
+#ifdef USE_GPU
+ if (n_procs_y == 1 .and. n_procs_x == 1 .and. rad_bc_type == 1 .and. &
+ gdisc%yx_order .eqv. .false.) then
+ call calc_arakawa_nonlinearity_df_c(this, gy_chi, g_block, vexy, dgdxy, localrhs, first)
+ else
+ call calc_arakawa_nonlinearity_df_orig(this, gy_chi, g_block, vexy, dgdxy, localrhs, &
+ first, lb1, lb2);
+ endif
+#else
+ call calc_arakawa_nonlinearity_df_orig(this, gy_chi, g_block, vexy, dgdxy, localrhs, &
+ first, lb1, lb2);
+#endif
+
+ end subroutine calc_arakawa_nonlinearity_df
+
!> Go to real space
!! This routine is equal to the one of df_nonlinear_term, with the
!! small exception, that here it is clear, that we do not have
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 | diff --git a/src/df_nonlinear_term.F90 b/src/df_nonlinear_term.F90
index 18510cd3e..6cb236d99 100644
--- a/src/df_nonlinear_term.F90
+++ b/src/df_nonlinear_term.F90
@@ -25,6 +25,9 @@ module df_nonlinear_term_mod
type, public, extends(nonlinear_term_t) :: df_nonlinear_term_t
class(df_nonlinear_prefactor_t),allocatable,public :: prefactor
+#ifdef USE_GPU
+ type(C_PTR) :: nonlin
+#endif
contains
procedure :: initialize => initialize_nonlinearity
procedure :: finalize => finalize_nonlinearity
@@ -42,6 +45,30 @@ module df_nonlinear_term_mod
private
+ interface
+
+ type(C_PTR) function nonlinearity_create(nx0, nky0, ny0, howmany, nib, &
+ radscheme, kjmin, pnl_1d, pnl_1d_shape, coeff, coeff_shape) bind(C)
+ import
+ integer(C_INT), value :: nx0, nky0, ny0, howmany, nib, radscheme
+ real(C_DOUBLE), value :: kjmin
+ real(C_DOUBLE), dimension(*) :: pnl_1d, coeff
+ integer(C_INT), dimension(1) :: pnl_1d_shape, coeff_shape
+ end function nonlinearity_create
+
+ subroutine nonlinearity_destroy(nonlin) bind(C)
+ import
+ type(C_PTR), value :: nonlin
+ end subroutine nonlinearity_destroy
+
+ subroutine cu_transpose_and_extend_cmplx(in, in_shape, out, out_shape) bind(C)
+ import
+ complex(C_DOUBLE_COMPLEX) :: in, out
+ integer(C_INT), dimension(2) :: in_shape, out_shape
+ end subroutine cu_transpose_and_extend_cmplx
+
+ end interface
+
contains
function getThisType(this)
@@ -106,14 +133,31 @@ contains
end function mem_est_nonlinearity
subroutine initialize_nonlinearity(this)
+ use x_derivatives, only: radscheme, coeff
+ use coordinates, only: gcoor
class(df_nonlinear_term_t) :: this
+#ifdef USE_GPU
+ integer :: ub, li
+#endif
+
if (this%init_status==1) return
!write(*,"(3A,I5)") "Initializing ",this%getType()," with blocksize = ",this%lbg0
call initialize_fourier(gdisc%li0da,gdisc%ly0da)
call this%prefactor%initialize()
+#ifdef USE_GPU
+ if (n_procs_y == 1) then
+ ub = gdisc%li0/n_procs_y
+ li = gpdisc%pi1 + my_pey*ub !lower index for relevant pnl_1d part
+ this%nonlin = nonlinearity_create(gdisc%li2 - gdisc%li1 + 1, &
+ gdisc%lj2 - gdisc%lj1 + 1, gdisc%ly0da, this%lbg0, gdisc%nib, radscheme, &
+ gcoor%kjmin, this%prefactor%pnl_1d(li:li+ub-1), shape(this%prefactor%pnl_1d(li:li+ub-1)), &
+ coeff, shape(coeff))
+ endif
+#endif
+
this%init_status = 1
end subroutine initialize_nonlinearity
@@ -127,6 +171,12 @@ contains
call finalize_fourier
+#ifdef USE_GPU
+ if (n_procs_y == 1) then
+ call nonlinearity_destroy(this%nonlin);
+ endif
+#endif
+
this%init_status = 0
end subroutine finalize_nonlinearity
@@ -191,10 +241,10 @@ contains
!!\todo Check whether the nonlinearity prefactor is important for the CFL criterion
Subroutine calc_nonlinearity_df(this,gy_chi,g_block,vexy,dgdxy,localrhs,first,lb1,lb2)
class(df_nonlinear_term_t) :: this
- Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:*),Intent(inout) :: gy_chi
- complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:*),intent(in) :: g_block
- Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,2,1:*),Intent(inout) :: vexy, dgdxy
- Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:*),Intent(inout) :: localrhs
+ Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: gy_chi
+ complex, dimension(gdisc%lbi:gdisc%ubi,gdisc%lj1:gdisc%lj2,1:this%lbg0),intent(in) :: g_block
+ Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,2,this%lbg0),Intent(inout) :: vexy, dgdxy
+ Complex, Dimension(gdisc%li1:gdisc%li2,gdisc%lj1:gdisc%lj2,1:this%lbg0),Intent(inout) :: localrhs
Logical, Intent(in) :: first
Integer, Intent(in) :: lb1,lb2
|
In the end, I think it might be possible to just have another subclass of nonlinear_term_t that forwards to the C/C++ side.