CUBLAS on Bender

An example of CUBLAS is available in Bender at the following path:

/usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS

First, copy the whole samples directory to your home directory; the individual samples depend on common files elsewhere in that tree.

cp -r /usr/local/cuda/samples ~/CUDAsamples

Next, go to the simpleCUBLAS example:

cd ~/CUDAsamples/7_CUDALibraries/simpleCUBLAS

You can compile and run the example with:

make
./simpleCUBLAS

Using CUBLAS

The syntax for using CUBLAS is shown in simpleCUBLAS.cpp.

To use CUBLAS, you first need to include its header: #include <cublas_v2.h>
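When building your own program against CUBLAS, you also need to link the library. A minimal invocation (the file name here is hypothetical; this assumes nvcc is on your PATH on Bender):

  nvcc my_cublas_prog.cu -lcublas -o my_cublas_prog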

CUBLAS requires a status variable and a handle variable; creating the handle initializes the CUBLAS context, and all subsequent CUBLAS calls go through it. Essentially, CUBLAS calls are kernel calls.

  cublasStatus_t status;
  cublasHandle_t handle;
  /* Initialize CUBLAS */
  status = cublasCreate(&handle);
  /* Destroy CUBLAS */
  status = cublasDestroy(handle);
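Every CUBLAS call returns a status code, and it is good practice to check it. A minimal sketch of checking the result of cublasCreate (the error message text here is our own):

  status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
      fprintf(stderr, "!!!! CUBLAS initialization error\n");
      return EXIT_FAILURE;
  }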

Move Matrices to the Device

Next, you need to move your matrices to the device side using the CUBLAS APIs. From the CUBLAS documentation (https://docs.nvidia.com/cuda/pdf/CUBLAS_Library.pdf):

cublasStatus_t
cublasSetVector(int n, int elemSize,
                const void *x, int incx, void *y, int incy)

This function copies n elements from a vector x in host memory space to a vector y in GPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The storage spacing between consecutive elements is given by incx for the source vector x and by incy for the destination vector y.

Assume n2 is the number of elements in each matrix; h_A, h_B, and h_C are the matrices on the host side, and d_A, d_B, and d_C are the corresponding matrices on the device side. We move the data as follows:

  // copy matrices from the host to the device
  status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
  status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
  status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
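Note that d_A, d_B, and d_C must already point to allocated device memory before these copies. A minimal sketch of the allocation, assuming float matrices of n2 elements each:

  float *d_A = NULL, *d_B = NULL, *d_C = NULL;
  cudaMalloc((void **)&d_A, n2 * sizeof(float));
  cudaMalloc((void **)&d_B, n2 * sizeof(float));
  cudaMalloc((void **)&d_C, n2 * sizeof(float));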

For the general matrix case, there is also cublasSetMatrix, which is defined in the API as:

cublasStatus_t
cublasSetMatrix(int rows, int cols, int elemSize,
                const void *A, int lda, void *B, int ldb)

This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space. It is assumed that each element requires storage of elemSize bytes and that both matrices are stored in column-major format, with the leading dimension of the source matrix A and destination matrix B given in lda and ldb, respectively. The leading dimension indicates the number of rows of the allocated matrix, even if only a submatrix of it is being used.

  // copy matrices from the host to the device
  stat = cublasSetMatrix(m, k, sizeof(*a), h_a, m, d_a, m); // a -> d_a
  stat = cublasSetMatrix(k, n, sizeof(*b), h_b, k, d_b, k); // b -> d_b
  stat = cublasSetMatrix(m, n, sizeof(*c), h_c, m, d_c, m); // c -> d_c
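Here the leading dimensions passed for each source and destination (m, k, and m) equal the number of rows of the corresponding matrix, since each call copies a full matrix rather than a submatrix.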

Perform Matrix Multiply

The API call to perform single-precision matrix multiplication is cublasSgemm, defined as follows:

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc)

This function performs the matrix-matrix multiplication C = alpha * op(A) * op(B) + beta * C, where alpha and beta are scalars, and A, B, and C are matrices stored in column-major format with dimensions m x k, k x n, and m x n, respectively.

    /* Performs operation using cublas */
    status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);
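CUBLAS_OP_N selects the non-transposed operation, i.e. op(X) = X. The scalars alpha and beta are passed by pointer; for a plain product C = A * B they can be declared on the host as follows (this assumes the default CUBLAS_POINTER_MODE_HOST):

  float alpha = 1.0f; /* scale the product A * B */
  float beta = 0.0f;  /* discard the old contents of C */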

The result can be read back to the host with cublasGetVector:

  /* Read the result back */
  status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
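Putting the pieces together, here is a minimal end-to-end sketch for square N x N matrices. It mirrors the structure of simpleCUBLAS.cpp but is not a copy of it; most error checking is omitted for brevity:

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  #define N 4

  int main(void) {
      int i, n2 = N * N;
      float h_A[N * N], h_B[N * N], h_C[N * N];
      float *d_A, *d_B, *d_C;
      float alpha = 1.0f, beta = 0.0f;
      cublasHandle_t handle;

      /* Fill the host matrices with some data */
      for (i = 0; i < n2; i++) {
          h_A[i] = 1.0f;
          h_B[i] = 2.0f;
          h_C[i] = 0.0f;
      }

      /* Initialize CUBLAS */
      if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
          return EXIT_FAILURE;

      /* Allocate device memory */
      cudaMalloc((void **)&d_A, n2 * sizeof(float));
      cudaMalloc((void **)&d_B, n2 * sizeof(float));
      cudaMalloc((void **)&d_C, n2 * sizeof(float));

      /* Copy matrices from the host to the device */
      cublasSetVector(n2, sizeof(float), h_A, 1, d_A, 1);
      cublasSetVector(n2, sizeof(float), h_B, 1, d_B, 1);
      cublasSetVector(n2, sizeof(float), h_C, 1, d_C, 1);

      /* C = alpha * A * B + beta * C */
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, d_A, N, d_B, N, &beta, d_C, N);

      /* Read the result back */
      cublasGetVector(n2, sizeof(float), d_C, 1, h_C, 1);
      printf("C[0] = %f\n", h_C[0]); /* expect N * 1 * 2 = 8 */

      /* Clean up */
      cudaFree(d_A);
      cudaFree(d_B);
      cudaFree(d_C);
      cublasDestroy(handle);
      return EXIT_SUCCESS;
  }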