GENE CUDA Notes

Let's start with some general comments and caveats:

While I started out hand-implementing CUDA kernels one at a time, I got pretty carried away later, so be aware that some of the stuff below is pretty out there. This is really meant just as a basis for discussion.
This work currently only supports the case where the whole x-y plane is available on a single process. That’s not a fundamental limitation, but other cases will require communicating data that lives in device memory, so CUDA-aware MPI would probably be helpful there.
The test case I’ve been running is arbitrarily chosen; it’s basically big-8. YMMV.
While my work-in-progress implementation is working, it’s ugly, missing lots of pieces, etc., and generally very far from being in mergeable shape.