CUBLAS on Bender
An example of CUBLAS is available in Bender at the following path:
/usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS
First, copy the whole samples tree to your home directory; the individual samples depend on shared files elsewhere in the tree.
cp -r /usr/local/cuda/samples ~/CUDAsamples
Next, go to the simpleCUBLAS example:
cd ~/CUDAsamples/7_CUDALibraries/simpleCUBLAS
You can compile and run the example with:
make
./simpleCUBLAS
Using CUBLAS
The syntax for using CUBLAS is shown in simpleCUBLAS.cpp.
Setup CUBLAS
To use CUBLAS, you first need to include its header: #include <cublas_v2.h>
CUBLAS requires a status variable and a handle variable; creating the handle establishes the CUBLAS context. Under the hood, CUBLAS routines are kernel calls on the device.
cublasStatus_t status;
cublasHandle_t handle;

/* Initialize CUBLAS */
status = cublasCreate(&handle);

/* Destroy CUBLAS */
status = cublasDestroy(handle);
Move matrix to device
Next you need to move your matrix to the device side, using CUBLAS APIs. From the CUBLAS documentation (https://docs.nvidia.com/cuda/pdf/CUBLAS_Library.pdf):
cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy)
This function copies n elements from a vector x in host memory space to a vector y in GPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The storage spacing between consecutive elements is given by incx for the source vector x and by incy for the destination vector y.
Assume n2 is the total number of elements in a matrix, h_A, h_B, and h_C are the matrices on the host side, and d_A, d_B, and d_C are the matrices on the device side. We move the data as follows:
/* copy matrices from the host to the device */
status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
For generic matrix cases, we also have cublasSetMatrix
, which is defined in the API as:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space. It is assumed that each element requires storage of elemSize bytes and that both matrices are stored in column-major format, with the leading dimension of the source matrix A and destination matrix B given in lda and ldb, respectively. The leading dimension indicates the number of rows of the allocated matrix, even if only a submatrix of it is being used.
// copy matrices from the host to the device
stat = cublasSetMatrix(m, k, sizeof(*h_a), h_a, m, d_a, m); // a -> d_a
stat = cublasSetMatrix(k, n, sizeof(*h_b), h_b, k, d_b, k); // b -> d_b
stat = cublasSetMatrix(m, n, sizeof(*h_c), h_c, m, d_c, m); // c -> d_c
Perform Matrix Multiply
The API call that performs matrix multiplication is cublasSgemm, defined as follows:
cublasStatus_t cublasSgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc)
This function performs the matrix-matrix multiplication C = alpha * op(A) * op(B) + beta * C, where alpha and beta are scalars, and A, B and C are matrices stored in column-major format with dimensions m x k, k x n and m x n, respectively.
/* Perform the operation using CUBLAS */
status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);
Results can be read by getting the vector from CUBLAS:
/* Read the result back */
status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);