cuBLASLt notes
Starting with CUDA 10.1, NVIDIA introduced a new library called cuBLASLt as an alternative to the cuBLAS function cublasGemmEx (and related gemm routines). The reasons for using cuBLASLt include, but are not limited to:
- To be able to specify the CUDA stream, and cuBLAS workspace memory, on a per-function-call basis.
- To make use of certain NVIDIA-provided epilogues fused into the matrix multiplication kernel.
- To make use of integer tensor cores on sufficiently modern GPUs.
On the other hand, there's one big reason for not using cuBLASLt: it is significantly more complicated to use than cublasGemmEx.
To start, a cublasLtHandle_t is required - this can come from cublasLtCreate, or an existing cublasHandle_t can be cast to cublasLtHandle_t. However, lots of setup needs to be performed before cublasLtMatmul can be used to actually compute a matrix multiplication. The first piece of setup is to initialise a cublasLtMatmulDesc_t object describing some of the attributes of the matrix multiplication. Such an object is created by cublasLtMatmulDescCreate followed by zero or more calls to cublasLtMatmulDescSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATMUL_DESC_COMPUTE_TYPE | cublasLtMatmulDescCreate parameter |
| CUBLASLT_MATMUL_DESC_SCALE_TYPE | cublasLtMatmulDescCreate parameter |
| CUBLASLT_MATMUL_DESC_POINTER_MODE | CUBLASLT_POINTER_MODE_HOST |
| CUBLASLT_MATMUL_DESC_TRANSA | CUBLAS_OP_N |
| CUBLASLT_MATMUL_DESC_TRANSB | CUBLAS_OP_N |
| CUBLASLT_MATMUL_DESC_TRANSC | CUBLAS_OP_N |
| CUBLASLT_MATMUL_DESC_EPILOGUE | CUBLASLT_EPILOGUE_DEFAULT (none) |
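As a minimal sketch of this first step (assuming CUDA 11 or later, where cublasLtMatmulDescCreate takes both a compute type and a scale type), the following creates a descriptor for an FP32 multiplication and fuses a ReLU epilogue onto it:

```c
#include <cublasLt.h>

// Minimal sketch; error checking omitted. The handle and descriptor
// created here are reused by the later snippets.
cublasLtHandle_t ltHandle;
cublasLtCreate(&ltHandle);

cublasLtMatmulDesc_t matmulDesc;
cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

// One of the NVIDIA-provided epilogues mentioned earlier: fuse a ReLU
// onto the end of the multiplication.
cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                               &epilogue, sizeof(epilogue));
```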
Next up, a cublasLtMatrixLayout_t object needs to be initialised for each of the three matrix shapes involved in the matrix multiplication. Such an object is created by cublasLtMatrixLayoutCreate followed by zero or more calls to cublasLtMatrixLayoutSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATRIX_LAYOUT_TYPE | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_ROWS | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_COLS | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_LD | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_ORDER | CUBLASLT_ORDER_COL |
| CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT | 1 (not batched) |
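Continuing the sketch for D = A @ B with A being m x k, B being k x n, and C/D being m x n, all densely packed in column-major order (so the defaults in the table above suffice):

```c
// m, n, and k are assumed to be declared as int already. For a dense
// column-major matrix, the leading dimension is the number of rows.
cublasLtMatrixLayout_t aLayout, bLayout, cLayout;
cublasLtMatrixLayoutCreate(&aLayout, CUDA_R_32F, m, k, m);
cublasLtMatrixLayoutCreate(&bLayout, CUDA_R_32F, k, n, k);
cublasLtMatrixLayoutCreate(&cLayout, CUDA_R_32F, m, n, m);
```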
The third thing which cublasLtMatmul requires is a cublasLtMatmulAlgo_t, but such a thing isn't created directly. Instead, a cublasLtMatmulPreference_t object is initialised, and cublasLtMatmulAlgoGetHeuristic is used to take all of the previously created objects and spit out a list of potential cublasLtMatmulAlgo_t objects. A cublasLtMatmulPreference_t object is created by cublasLtMatmulPreferenceCreate followed by zero or more calls to cublasLtMatmulPreferenceSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES | 0 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES | 256 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES | 256 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES | 256 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES | 256 |
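The workspace limit is the attribute most commonly in need of changing, as the default of 0 rules out any algorithm which requires scratch memory. Continuing the sketch:

```c
// Allow up to 4 MiB of workspace (an arbitrary choice for this sketch);
// the memory itself is passed later, to cublasLtMatmul.
cublasLtMatmulPreference_t preference;
cublasLtMatmulPreferenceCreate(&preference);
size_t workspaceSize = 4 * 1024 * 1024;
cublasLtMatmulPreferenceSetAttribute(preference,
    CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
    &workspaceSize, sizeof(workspaceSize));
```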
With all these objects in hand, cublasLtMatmulAlgoGetHeuristic can be called to populate an array of cublasLtMatmulHeuristicResult_t objects. Once populated, the algo field is a ready-to-use cublasLtMatmulAlgo_t. This field is relatively opaque, but some information about it can be obtained if desired. This information comes in three places: firstly there are other fields on cublasLtMatmulHeuristicResult_t (namely wavesCount and workspaceSize), secondly read-only attributes can be queried using cublasLtMatmulAlgoCapGetAttribute (e.g. CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS and CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES through CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES), and thirdly read-write attributes can be queried using cublasLtMatmulAlgoConfigGetAttribute.
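In sketch form, the heuristic query looks like:

```c
// Request up to 8 candidate algorithms; they come back ordered by
// estimated performance, best first.
cublasLtMatmulHeuristicResult_t heuristics[8];
int algoCount = 0;
cublasLtMatmulAlgoGetHeuristic(ltHandle, matmulDesc, aLayout, bLayout,
                               cLayout, cLayout /* D */, preference,
                               8, heuristics, &algoCount);
// algoCount == 0 means no algorithm supports the requested configuration.
```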
With a cublasLtMatmulAlgo_t object chosen (typically the front of the array passed to cublasLtMatmulAlgoGetHeuristic), cublasLtMatmul can finally be used. This function computes D = alpha * (A @ B) + beta * C, which when C == D is equivalent to cublasGemmEx.
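Concluding the sketch (dA, dB, and dC are assumed device buffers, and dWorkspace an allocation of at least workspaceSize bytes):

```c
// C and D alias each other here, matching cublasGemmEx semantics; the
// final argument is the CUDA stream (0 being the default stream).
float alpha = 1.0f, beta = 0.0f;
cublasLtMatmul(ltHandle, matmulDesc, &alpha, dA, aLayout, dB, bLayout,
               &beta, dC, cLayout, dC, cLayout, &heuristics[0].algo,
               dWorkspace, workspaceSize, 0);
```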
That concludes basic usage notes, but if CUBLAS_COMPUTE_32I (or CUBLAS_COMPUTE_32I_PEDANTIC) is being used, then there's another whole chapter of usage notes. This chapter starts by noting that the list of supported configurations for integer matrix multiplication is (at least currently) very limited:
- CUDA_R_32I destination, computation using legacy CUDA cores:
  - scaleType must be CUDA_R_32I, but only the values 0 and 1 are supported
  - A matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
  - B matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
  - C matrix must be CUDA_R_32I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
  - CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
- CUDA_R_8I destination, computation using legacy CUDA cores:
  - scaleType must be CUDA_R_32F
  - A matrix must be CUDA_R_8I with CUBLAS_OP_T
  - B matrix must be CUDA_R_8I with CUBLAS_OP_N
  - C matrix must be CUDA_R_8I
  - CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
- CUDA_R_32I destination, computation using integer tensor cores:
  - scaleType must be CUDA_R_32I, but only the values 0 and 1 are supported
  - A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
  - B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
  - C matrix must be CUDA_R_32I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
  - CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
- CUDA_R_8I destination, computation using integer tensor cores:
  - scaleType must be CUDA_R_32F
  - A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
  - B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
  - C matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
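In sketch form, the descriptor setup for the third configuration above (CUDA_R_32I destination on integer tensor cores) would be:

```c
// int8 inputs with int32 accumulation; B must be marked as transposed.
cublasLtMatmulDesc_t igemmDesc;
cublasLtMatmulDescCreate(&igemmDesc, CUBLAS_COMPUTE_32I, CUDA_R_32I);
cublasOperation_t opT = CUBLAS_OP_T;
cublasLtMatmulDescSetAttribute(igemmDesc, CUBLASLT_MATMUL_DESC_TRANSB,
                               &opT, sizeof(opT));
// The A/B/C layouts must additionally have CUBLASLT_MATRIX_LAYOUT_ORDER
// set to the CUBLASLT_ORDER_COL32 / CUBLASLT_ORDER_COL4_4R2_8C values
// listed above.
```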
Of particular note are the strange CUBLASLT_MATRIX_LAYOUT_ORDER values required for using integer tensor cores. In C/C++ terminology, CUBLASLT_ORDER_COL can be thought of as a two-dimensional array indexed as [I][J], with the [J] part packed densely, and the leading dimension specifying how to advance by one in the [I] part. In the same terminology, CUBLASLT_ORDER_COL32 can be thought of as a three-dimensional array indexed as [I/32][J][I%32], with the [J][I%32] part packed densely, and the leading dimension specifying how to advance by one in the [I/32] part.
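Expressed as address arithmetic, the description above amounts to the following (i indexes the [I] part, i.e. columns; j indexes the [J] part, i.e. rows; ld is measured in elements):

```c
#include <stddef.h>

size_t offset_col(size_t i, size_t j, size_t ld) {
  return i * ld + j;                         // CUBLASLT_ORDER_COL
}
size_t offset_col32(size_t i, size_t j, size_t ld) {
  return (i / 32) * ld + j * 32 + (i % 32);  // CUBLASLT_ORDER_COL32
}
```

For a fully dense matrix with r rows, ld would be r for CUBLASLT_ORDER_COL and 32 * r for CUBLASLT_ORDER_COL32.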
The CUBLASLT_ORDER_COL4_4R2_8C and CUBLASLT_ORDER_COL32_2R_4R4 layouts are even more exotic. Rather than trying to explain their layout, it is best to consider them to be completely opaque layouts. Thankfully, the cublasLtMatrixTransform function is provided to convert between layouts, so a matrix can be constructed using a known simple layout (such as CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW) and then converted to CUBLASLT_ORDER_COL4_4R2_8C or CUBLASLT_ORDER_COL32_2R_4R4 using cublasLtMatrixTransform. To use cublasLtMatrixTransform, a cublasLtMatrixTransformDesc_t object is required. Such an object is created by cublasLtMatrixTransformDescCreate followed by zero or more calls to cublasLtMatrixTransformDescSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE | cublasLtMatrixTransformDescCreate parameter |
| CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE | CUBLASLT_POINTER_MODE_HOST |
| CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA | CUBLAS_OP_N |
| CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB | CUBLAS_OP_N |
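As a sketch modelled on NVIDIA's LtIgemmTensor sample (ltHandle, dB, dBTransformed, and stream are assumed to already exist), converting an int8 k x n matrix B into the transposed CUBLASLT_ORDER_COL4_4R2_8C form expected by the tensor-core configurations might look like:

```c
cublasLtOrder_t col4Order = CUBLASLT_ORDER_COL4_4R2_8C;
cublasOperation_t opT = CUBLAS_OP_T;
// Leading dimension rounding as used in NVIDIA's sample code.
int ldbTransformed = 32 * (((n + 7) / 8) * 8);

cublasLtMatrixLayout_t bDesc, bTransformedDesc;
cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_8I, k, n, k);
cublasLtMatrixLayoutCreate(&bTransformedDesc, CUDA_R_8I, n, k, ldbTransformed);
cublasLtMatrixLayoutSetAttribute(bTransformedDesc,
    CUBLASLT_MATRIX_LAYOUT_ORDER, &col4Order, sizeof(col4Order));

cublasLtMatrixTransformDesc_t transformDesc;
cublasLtMatrixTransformDescCreate(&transformDesc, CUDA_R_32F);
// The matmul expects B in transposed form, so fold the transpose into
// the layout conversion.
cublasLtMatrixTransformDescSetAttribute(transformDesc,
    CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA, &opT, sizeof(opT));

float one = 1.0f, zero = 0.0f;
cublasLtMatrixTransform(ltHandle, transformDesc, &one, dB, bDesc,
                        &zero, NULL, NULL, dBTransformed, bTransformedDesc,
                        stream);
```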
cublasLtMatrixTransform can also be used to convert in or out of CUBLASLT_ORDER_COL32, but said conversions obviously come with a time cost, so it is better to keep matrices in the CUBLASLT_ORDER_COL32 format and perform all operations on them in that layout. This might mean rewriting a bunch of CUDA kernels to understand the layout, if said kernels care about the two-dimensional structure of the matrix.