cuBLASLt notes
Starting with CUDA 10.1, NVIDIA introduced a new library called cuBLASLt as an alternative to the cuBLAS function cublasGemmEx (and related gemm routines). The reasons for using cuBLASLt include, but are not limited to:
- To be able to specify the CUDA stream, and cuBLAS workspace memory, on a per-function-call basis.
- To make use of certain NVIDIA-provided epilogues fused into the matrix multiplication kernel.
- To make use of integer tensor cores on sufficiently modern GPUs.
On the other hand, there's one big reason for not using cuBLASLt: it is significantly more complicated to use than cublasGemmEx.
To start, a cublasLtHandle_t is required - this can come from cublasLtCreate, or an existing cublasHandle_t can be cast to cublasLtHandle_t. However, lots of setup needs to be performed before cublasLtMatmul can be used to actually compute a matrix multiplication. The first piece of setup is to initialise a cublasLtMatmulDesc_t object describing some of the attributes of the matrix multiplication. Such an object is created by cublasLtMatmulDescCreate followed by zero or more calls to cublasLtMatmulDescSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATMUL_DESC_COMPUTE_TYPE | cublasLtMatmulDescCreate parameter |
| CUBLASLT_MATMUL_DESC_SCALE_TYPE | cublasLtMatmulDescCreate parameter |
| CUBLASLT_MATMUL_DESC_POINTER_MODE | CUBLASLT_POINTER_MODE_HOST |
| CUBLASLT_MATMUL_DESC_TRANSA | CUBLAS_OP_N |
| CUBLASLT_MATMUL_DESC_TRANSB | CUBLAS_OP_N |
| CUBLASLT_MATMUL_DESC_TRANSC | CUBLAS_OP_N |
| CUBLASLT_MATMUL_DESC_EPILOGUE | CUBLASLT_EPILOGUE_DEFAULT (none) |
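As a minimal sketch of this first step (assuming CUDA 11 or later, where cublasLtMatmulDescCreate takes both a compute type and a scale type), the following creates a descriptor for an FP32 multiplication and fuses a ReLU epilogue onto it:

```c
#include <cublasLt.h>

// Minimal sketch; error checking omitted. The handle and descriptor
// created here are reused by the later snippets.
cublasLtHandle_t ltHandle;
cublasLtCreate(&ltHandle);

cublasLtMatmulDesc_t matmulDesc;
cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

// One of the NVIDIA-provided epilogues mentioned earlier: fuse a ReLU
// onto the end of the multiplication.
cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                               &epilogue, sizeof(epilogue));
```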
Next up, a cublasLtMatrixLayout_t object needs to be initialised for each of the three matrix shapes involved in the matrix multiplication. Such an object is created by cublasLtMatrixLayoutCreate followed by zero or more calls to cublasLtMatrixLayoutSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATRIX_LAYOUT_TYPE | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_ROWS | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_COLS | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_LD | cublasLtMatrixLayoutCreate parameter |
| CUBLASLT_MATRIX_LAYOUT_ORDER | CUBLASLT_ORDER_COL |
| CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT | 1 (not batched) |
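Continuing the sketch for D = A @ B with A being m x k, B being k x n, and C/D being m x n, all densely packed in column-major order (so the defaults in the table above suffice):

```c
// m, n, and k are assumed to be declared as int already. For a dense
// column-major matrix, the leading dimension is the number of rows.
cublasLtMatrixLayout_t aLayout, bLayout, cLayout;
cublasLtMatrixLayoutCreate(&aLayout, CUDA_R_32F, m, k, m);
cublasLtMatrixLayoutCreate(&bLayout, CUDA_R_32F, k, n, k);
cublasLtMatrixLayoutCreate(&cLayout, CUDA_R_32F, m, n, m);
```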
The third thing which cublasLtMatmul requires is a cublasLtMatmulAlgo_t, but such a thing isn't created directly. Instead, a cublasLtMatmulPreference_t object is initialised, and cublasLtMatmulAlgoGetHeuristic is used to take all of the previously created objects and spit out a list of potential cublasLtMatmulAlgo_t objects. A cublasLtMatmulPreference_t object is created by cublasLtMatmulPreferenceCreate followed by zero or more calls to cublasLtMatmulPreferenceSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES | 0 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES | 256 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES | 256 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES | 256 |
| CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES | 256 |
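The workspace limit is the attribute most commonly in need of changing, as the default of 0 rules out any algorithm which requires scratch memory. Continuing the sketch:

```c
// Allow up to 4 MiB of workspace (an arbitrary choice for this sketch);
// the memory itself is passed later, to cublasLtMatmul.
cublasLtMatmulPreference_t preference;
cublasLtMatmulPreferenceCreate(&preference);
size_t workspaceSize = 4 * 1024 * 1024;
cublasLtMatmulPreferenceSetAttribute(preference,
    CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
    &workspaceSize, sizeof(workspaceSize));
```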
With all these objects in hand, cublasLtMatmulAlgoGetHeuristic can be called to populate an array of cublasLtMatmulHeuristicResult_t objects. Once populated, the algo field is a ready-to-use cublasLtMatmulAlgo_t. This field is relatively opaque, but some information about it can be obtained if desired. This information comes in three places: firstly there are other fields on cublasLtMatmulHeuristicResult_t (namely wavesCount and workspaceSize), secondly read-only attributes can be queried using cublasLtMatmulAlgoCapGetAttribute (e.g. CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS and CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES through CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES), and thirdly read-write attributes can be queried using cublasLtMatmulAlgoConfigGetAttribute.
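In sketch form, the heuristic query looks like:

```c
// Request up to 8 candidate algorithms; they come back ordered by
// estimated performance, best first.
cublasLtMatmulHeuristicResult_t heuristics[8];
int algoCount = 0;
cublasLtMatmulAlgoGetHeuristic(ltHandle, matmulDesc, aLayout, bLayout,
                               cLayout, cLayout /* D */, preference,
                               8, heuristics, &algoCount);
// algoCount == 0 means no algorithm supports the requested configuration.
```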
With a cublasLtMatmulAlgo_t object chosen (typically the front of the array passed to cublasLtMatmulAlgoGetHeuristic), cublasLtMatmul can finally be used. This function computes D = alpha * (A @ B) + beta * C, which when C == D is equivalent to cublasGemmEx.
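Concluding the sketch (dA, dB, and dC are assumed device buffers, and dWorkspace an allocation of at least workspaceSize bytes):

```c
// C and D alias each other here, matching cublasGemmEx semantics; the
// final argument is the CUDA stream (0 being the default stream).
float alpha = 1.0f, beta = 0.0f;
cublasLtMatmul(ltHandle, matmulDesc, &alpha, dA, aLayout, dB, bLayout,
               &beta, dC, cLayout, dC, cLayout, &heuristics[0].algo,
               dWorkspace, workspaceSize, 0);
```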
That concludes basic usage notes, but if CUBLAS_COMPUTE_32I (or CUBLAS_COMPUTE_32I_PEDANTIC) is being used, then there's another whole chapter of usage notes. This chapter starts by noting that the list of supported configurations for integer matrix multiplication is (at least currently) very limited:
- CUDA_R_32I destination, computation using legacy CUDA cores:
  - scaleType must be CUDA_R_32I, but only the values 0 and 1 are supported
  - A matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
  - B matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
  - C matrix must be CUDA_R_32I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
  - CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
- CUDA_R_8I destination, computation using legacy CUDA cores:
  - scaleType must be CUDA_R_32F
  - A matrix must be CUDA_R_8I with CUBLAS_OP_T
  - B matrix must be CUDA_R_8I with CUBLAS_OP_N
  - C matrix must be CUDA_R_8I
  - CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
- CUDA_R_32I destination, computation using integer tensor cores:
  - scaleType must be CUDA_R_32I, but only the values 0 and 1 are supported
  - A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
  - B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
  - C matrix must be CUDA_R_32I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
  - CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
- CUDA_R_8I destination, computation using integer tensor cores:
  - scaleType must be CUDA_R_32F
  - A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
  - B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
  - C matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
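In sketch form, the descriptor setup for the third configuration above (CUDA_R_32I destination on integer tensor cores) would be:

```c
// int8 inputs with int32 accumulation; B must be marked as transposed.
cublasLtMatmulDesc_t igemmDesc;
cublasLtMatmulDescCreate(&igemmDesc, CUBLAS_COMPUTE_32I, CUDA_R_32I);
cublasOperation_t opT = CUBLAS_OP_T;
cublasLtMatmulDescSetAttribute(igemmDesc, CUBLASLT_MATMUL_DESC_TRANSB,
                               &opT, sizeof(opT));
// The A/B/C layouts must additionally have CUBLASLT_MATRIX_LAYOUT_ORDER
// set to the CUBLASLT_ORDER_COL32 / CUBLASLT_ORDER_COL4_4R2_8C values
// listed above.
```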
Of particular note are the strange CUBLASLT_MATRIX_LAYOUT_ORDER values required for using integer tensor cores. In C/C++ terminology, CUBLASLT_ORDER_COL can be thought of as a two-dimensional array indexed as [I][J], with the [J] part packed densely, and the leading dimension specifying how to advance by one in the [I] part. In the same terminology, CUBLASLT_ORDER_COL32 can be thought of as a three-dimensional array indexed as [I/32][J][I%32], with the [J][I%32] part packed densely, and the leading dimension specifying how to advance by one in the [I/32] part.
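Expressed as address arithmetic, the description above amounts to the following (i indexes the [I] part, i.e. columns; j indexes the [J] part, i.e. rows; ld is measured in elements):

```c
#include <stddef.h>

size_t offset_col(size_t i, size_t j, size_t ld) {
  return i * ld + j;                         // CUBLASLT_ORDER_COL
}
size_t offset_col32(size_t i, size_t j, size_t ld) {
  return (i / 32) * ld + j * 32 + (i % 32);  // CUBLASLT_ORDER_COL32
}
```

For a fully dense matrix with r rows, ld would be r for CUBLASLT_ORDER_COL and 32 * r for CUBLASLT_ORDER_COL32.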
The CUBLASLT_ORDER_COL4_4R2_8C and CUBLASLT_ORDER_COL32_2R_4R4 layouts are even more exotic. Rather than trying to explain their layout, it is best to consider them to be completely opaque layouts. Thankfully, the cublasLtMatrixTransform function is provided to convert between layouts, so a matrix can be constructed using a known simple layout (such as CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW) and then converted to CUBLASLT_ORDER_COL4_4R2_8C or CUBLASLT_ORDER_COL32_2R_4R4 using cublasLtMatrixTransform. To use cublasLtMatrixTransform, a cublasLtMatrixTransformDesc_t object is required. Such an object is created by cublasLtMatrixTransformDescCreate followed by zero or more calls to cublasLtMatrixTransformDescSetAttribute. Some of the common attributes on this object are:
| Attribute | Default Value |
|---|---|
| CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE | cublasLtMatrixTransformDescCreate parameter |
| CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE | CUBLASLT_POINTER_MODE_HOST |
| CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA | CUBLAS_OP_N |
| CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB | CUBLAS_OP_N |
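As a sketch modelled on NVIDIA's LtIgemmTensor sample (ltHandle, dB, dBTransformed, and stream are assumed to already exist), converting an int8 k x n matrix B into the transposed CUBLASLT_ORDER_COL4_4R2_8C form expected by the tensor-core configurations might look like:

```c
cublasLtOrder_t col4Order = CUBLASLT_ORDER_COL4_4R2_8C;
cublasOperation_t opT = CUBLAS_OP_T;
// Leading dimension rounding as used in NVIDIA's sample code.
int ldbTransformed = 32 * (((n + 7) / 8) * 8);

cublasLtMatrixLayout_t bDesc, bTransformedDesc;
cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_8I, k, n, k);
cublasLtMatrixLayoutCreate(&bTransformedDesc, CUDA_R_8I, n, k, ldbTransformed);
cublasLtMatrixLayoutSetAttribute(bTransformedDesc,
    CUBLASLT_MATRIX_LAYOUT_ORDER, &col4Order, sizeof(col4Order));

cublasLtMatrixTransformDesc_t transformDesc;
cublasLtMatrixTransformDescCreate(&transformDesc, CUDA_R_32F);
// The matmul expects B in transposed form, so fold the transpose into
// the layout conversion.
cublasLtMatrixTransformDescSetAttribute(transformDesc,
    CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA, &opT, sizeof(opT));

float one = 1.0f, zero = 0.0f;
cublasLtMatrixTransform(ltHandle, transformDesc, &one, dB, bDesc,
                        &zero, NULL, NULL, dBTransformed, bTransformedDesc,
                        stream);
```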
cublasLtMatrixTransform can also be used to convert in or out of CUBLASLT_ORDER_COL32, but said conversions obviously come with a time cost, so it is better to keep matrices in the CUBLASLT_ORDER_COL32 format and perform all operations on them in that layout. This might mean rewriting a bunch of CUDA kernels to understand the layout, if said kernels care about the two-dimensional structure of the matrix.