Matrix Product Toolkit | Installing / CUDA

The toolkit can make use of CUDA, which can give a speedup of up to 100x compared with a single CPU core. This requires an NVIDIA GPU with compute capability >= 3.5 (i.e. Kepler GK110 or newer) that has good double-precision (FP64) performance. Note that most recent NVIDIA consumer-grade GPUs have very slow double precision and will be slower than the CPU. Exceptions are the old GTX TITAN, TITAN Black and TITAN Z range (same chipset as the Kepler K20), and the newer GP100 (same chipset as the Pascal P100, but a very expensive card!). There are no GPUs in the Maxwell range with good FP64 performance; even the high-end cards are very poor. Beware the Maxwell-based GTX Titan X and the Pascal-based Titan X / Titan Xp, which use different chipsets from the older Titan cards and have very poor FP64 performance.

Known good GPU chipsets for FP64 CUDA performance:

Kepler: K20, K20X, K40, K80 (server range), GeForce Titan, Titan Black, Titan Z (consumer range)
Pascal: P100 (server), Quadro GP100 (workstation)
Volta: V100 (server), Titan V (consumer)

As far as is known, all other GPUs perform poorly with the toolkit.
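
As a rough aid, the list above can be turned into a small shell check on the device name. This is only a sketch: the function name is_known_good_fp64 is hypothetical, and the name patterns are assumptions about how the driver reports these cards (for example "Tesla K40m" or "Quadro GP100"), so check them against the output on your own machine.

```shell
# Hypothetical helper: succeeds if the GPU name matches one of the
# known-good FP64 chipsets listed above. The name patterns are assumptions
# about the strings reported by the driver.
is_known_good_fp64() {
    # Upper-case the name so that matching is case-insensitive.
    gpu_name=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$gpu_name" in
        *"TITAN X"*)                               return 1 ;;  # Titan X / Xp: poor FP64
        *"TESLA K20"*|*"TESLA K40"*|*"TESLA K80"*) return 0 ;;  # Kepler, server range
        *"GTX TITAN"|*"TITAN BLACK"*|*"TITAN Z"*)  return 0 ;;  # Kepler, consumer range
        *"TESLA P100"*|*"QUADRO GP100"*)           return 0 ;;  # Pascal
        *"TESLA V100"*|*"TITAN V")                 return 0 ;;  # Volta
        *)                                         return 1 ;;  # everything else
    esac
}

# Example use (needs an NVIDIA driver installed):
#   nvidia-smi --query-gpu=name --format=csv,noheader | while read -r n; do
#       is_known_good_fp64 "$n" && echo "$n: good FP64" || echo "$n: poor FP64"
#   done
```

The Titan X / Xp case is tested first so that those cards are rejected before any of the broader Titan patterns can match.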

To compile the toolkit with the CUDA backend, you need

CUDA toolkit (version?)
cuBLAS library
MAGMA (maybe; or cuSOLVER)
CUB

CUB needs to be downloaded from GitHub. It is a source-only (header) library, so there is nothing to compile; just make sure that the compiler include path contains the root directory of the CUB installation.
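
A quick way to confirm that a directory is the right one to pass with -I is to look for the umbrella header cub/cub.cuh beneath it. The sketch below assumes that layout; the function name check_cub_root is hypothetical.

```shell
# Hypothetical helper: succeeds if DIR looks like the root of a CUB checkout,
# i.e. it contains the umbrella header that code includes as <cub/cub.cuh>.
check_cub_root() {
    if [ -f "$1/cub/cub.cuh" ]; then
        return 0
    fi
    # Report the problem on stderr so the function can be used in scripts.
    echo "warning: $1/cub/cub.cuh not found; -I$1 is probably wrong" >&2
    return 1
}

# Example use (the path is a placeholder):
#   check_cub_root "$HOME/git/cub" && echo "CUB include path looks fine"
```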

You will almost certainly need to set the NVFLAGS environment variable to give the nvcc compiler some sensible options. nvcc uses a back-end compiler that can be either g++ or clang. When using g++ as the back-end, CUDA toolkit version 8 supports only gcc versions up to 5. CUDA toolkit version 9 supports gcc-6.
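
The gcc constraint can be written down as a small lookup. This is a sketch covering only the two toolkit versions mentioned above; the function names are hypothetical, and the g++-N executable naming is an assumption based on the Debian/Ubuntu convention.

```shell
# Hypothetical helper: print the newest gcc major version that nvcc accepts
# as a back-end for a given CUDA toolkit major version. Only the two cases
# stated in the text are covered; anything else is reported as unknown.
max_gcc_for_cuda() {
    case "$1" in
        8) echo 5 ;;
        9) echo 6 ;;
        *) echo "unknown CUDA toolkit version: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the matching -ccbin option for NVFLAGS.
# Assumes versioned compilers are installed as g++-5, g++-6, and so on.
ccbin_for_cuda() {
    gcc_major=$(max_gcc_for_cuda "$1") || return 1
    echo "-ccbin g++-$gcc_major"
}
```

For example, ccbin_for_cuda 8 prints "-ccbin g++-5", which matches the setting used in the Ubuntu example.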

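On Ubuntu 17.04 with CUDA toolkit 8, I use the settings below. Set -arch to match the compute capability of your GPU; note that sm_70 (Volta) is only accepted by CUDA toolkit 9 or newer.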

export CXX=g++-5
export CC=gcc-5
export NVFLAGS="-ccbin g++-5 -arch=sm_70 -std=c++11 -I/home/ian/git/cub"
../mptoolkit/configure

Note that the CUB library also uses Boost, so if Boost is installed in a non-standard location you might need to add its include directory to NVFLAGS.
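
To keep the pieces of NVFLAGS straight, they can be assembled by a small function that appends the Boost include directory only when one is given. This is a sketch; make_nvflags is a hypothetical name, and the paths in the example are placeholders.

```shell
# Hypothetical helper: assemble an NVFLAGS string.
#   $1  back-end compiler for -ccbin (e.g. g++-5)
#   $2  value for -arch (e.g. sm_35)
#   $3  C++ standard for -std (e.g. c++11)
#   $4  root directory of the CUB sources
#   $5  optional Boost include directory
make_nvflags() {
    flags="-ccbin $1 -arch=$2 -std=$3 -I$4"
    # Only add the Boost include path when one was supplied.
    if [ -n "${5:-}" ]; then
        flags="$flags -I$5"
    fi
    echo "$flags"
}

# Example use (paths are placeholders):
#   export NVFLAGS="$(make_nvflags g++-5 sm_35 c++11 "$HOME/git/cub" /opt/boost/include)"
```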

CUDA 8 doesn't support C++14, so the CUDA components that are compiled with nvcc contain some workarounds. CUDA 9 does support C++14, so these workarounds can be removed once CUDA 9 becomes widespread.
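
The choice of -std can be made from the installed toolkit version. The sketch below assumes that "nvcc --version" prints a line containing "release X.Y", as in "Cuda compilation tools, release 8.0, V8.0.61"; both function names are hypothetical.

```shell
# Hypothetical helper: read the output of "nvcc --version" on stdin and print
# the toolkit major version, taken from the "release X.Y" field.
cuda_major_from_nvcc() {
    sed -n 's/.*release \([0-9][0-9]*\)\.[0-9][0-9]*.*/\1/p' | head -n 1
}

# Hypothetical helper: choose the -std option following the text above:
# C++14 from CUDA 9 onwards, C++11 for CUDA 8 and older.
std_flag_for_cuda() {
    if [ "$1" -ge 9 ]; then
        echo "-std=c++14"
    else
        echo "-std=c++11"
    fi
}

# Example use (needs nvcc on the PATH):
#   std_flag_for_cuda "$(nvcc --version | cuda_major_from_nvcc)"
```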

On the getafix cluster, CUDA 10 is installed. Boost is available as a module but is also installed in /usr/include, which is a bit awkward because we need to override the standard locations. I use:

module add gnu gnutools mkl boost compilers/gnu/7.2.0 cuda
export LDFLAGS="-L$BOOSTROOT/lib"
export NVFLAGS="-ccbin g++ -arch=sm_35 -std=c++14 -I/home/uqimccul/git/cub"
configure --with-cuda --with-boost=$BOOSTROOT

Some benchmarks for CUDA and CPU performance are on the Installing.Benchmarks page.

For debugging I use:

export CXXFLAGS="-Wall -Wextra -Wno-unused-value -Wno-unused-parameter -Wno-implicit-fallthrough"

Note on OpenCL

It would be possible, with some effort, to write a backend for OpenCL, which would then support GPUs made by AMD. However, in benchmarks OpenCL matrix-matrix multiply is 3x-5x slower than the CUDA equivalent. AMD appears to have abandoned high-performance computing, and its recent Vega architecture has very poor FP64 performance; the only GPUs worth considering are the older FirePro range (W8100 or W9100) and their server equivalents. In the medium to long term, the best hope for AMD is probably some kind of CUDA compatibility layer.