cuBLASLt notes
Starting with CUDA 10.1, nVidia introduced a new library called cuBLASLt, as an alternative to the cuBLAS function `cublasGemmEx` (and related gemm routines). The reasons for using cuBLASLt include, but are not limited to:
- To be able to specify the CUDA stream, and cuBLAS workspace memory, on a per-function-call basis.
- To make use of certain nVidia-provided epilogues fused into the matrix multiplication kernel.
- To make use of integer tensor cores on sufficiently modern GPUs.
On the other hand, there's one big reason for not using cuBLASLt: it is significantly more complicated to use than `cublasGemmEx`.
To start, a `cublasLtHandle_t` is required - this can come from `cublasLtCreate`, or an existing `cublasHandle_t` can be cast to `cublasLtHandle_t`. However, lots of setup needs to be performed before `cublasLtMatmul` can be used to actually compute a matrix multiplication. The first piece of setup is to initialise a `cublasLtMatmulDesc_t` object describing some of the attributes of the matrix multiplication. Such an object is created by `cublasLtMatmulDescCreate` followed by zero or more calls to `cublasLtMatmulDescSetAttribute`. Some of the common attributes on this object are:
Attribute | Default Value |
---|---|
CUBLASLT_MATMUL_DESC_COMPUTE_TYPE | cublasLtMatmulDescCreate parameter |
CUBLASLT_MATMUL_DESC_SCALE_TYPE | cublasLtMatmulDescCreate parameter |
CUBLASLT_MATMUL_DESC_POINTER_MODE | CUBLASLT_POINTER_MODE_HOST |
CUBLASLT_MATMUL_DESC_TRANSA | CUBLAS_OP_N |
CUBLASLT_MATMUL_DESC_TRANSB | CUBLAS_OP_N |
CUBLASLT_MATMUL_DESC_TRANSC | CUBLAS_OP_N |
CUBLASLT_MATMUL_DESC_EPILOGUE | CUBLASLT_EPILOGUE_DEFAULT (none) |
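As a sketch of the shape of this API (error checking elided; the transposed A and the fused ReLU epilogue here are illustrative choices, not requirements):

```cpp
#include <cublasLt.h>

// Sketch: build a descriptor for D = alpha * (A^T @ B) + beta * C in FP32,
// with a fused ReLU epilogue applied to D.
cublasLtMatmulDesc_t makeDesc() {
    cublasLtMatmulDesc_t desc;
    cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t transa = CUBLAS_OP_T;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_TRANSA,
                                   &transa, sizeof(transa));
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
    return desc;
}
```

Note that every `SetAttribute` call passes the attribute value by pointer along with its size, as the API is type-erased.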
Next up, a `cublasLtMatrixLayout_t` object needs to be initialised for each of the three matrix shapes involved in the matrix multiplication. Such an object is created by `cublasLtMatrixLayoutCreate` followed by zero or more calls to `cublasLtMatrixLayoutSetAttribute`. Some of the common attributes on this object are:
Attribute | Default Value |
---|---|
CUBLASLT_MATRIX_LAYOUT_TYPE | cublasLtMatrixLayoutCreate parameter |
CUBLASLT_MATRIX_LAYOUT_ROWS | cublasLtMatrixLayoutCreate parameter |
CUBLASLT_MATRIX_LAYOUT_COLS | cublasLtMatrixLayoutCreate parameter |
CUBLASLT_MATRIX_LAYOUT_LD | cublasLtMatrixLayoutCreate parameter |
CUBLASLT_MATRIX_LAYOUT_ORDER | CUBLASLT_ORDER_COL |
CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT | 1 (not batched) |
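A minimal sketch of layout creation, assuming dense column-major FP32 matrices (so the leading dimension is just the row count):

```cpp
#include <cublasLt.h>
#include <cstdint>

// Sketch: column-major FP32 layouts for an (m x k) @ (k x n) = (m x n)
// multiplication with no transposes. For dense column-major storage,
// ld (the leading dimension) equals the number of rows.
void makeLayouts(uint64_t m, uint64_t n, uint64_t k,
                 cublasLtMatrixLayout_t* a, cublasLtMatrixLayout_t* b,
                 cublasLtMatrixLayout_t* c) {
    cublasLtMatrixLayoutCreate(a, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(b, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(c, CUDA_R_32F, m, n, m);
}
```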
The third thing which `cublasLtMatmul` requires is a `cublasLtMatmulAlgo_t`, but such a thing isn't created directly. Instead, a `cublasLtMatmulPreference_t` object is initialised, and `cublasLtMatmulAlgoGetHeuristic` is used to take all of the previously created objects and spit out a list of potential `cublasLtMatmulAlgo_t` objects. A `cublasLtMatmulPreference_t` object is created by `cublasLtMatmulPreferenceCreate` followed by zero or more calls to `cublasLtMatmulPreferenceSetAttribute`. Some of the common attributes on this object are:
Attribute | Default Value |
---|---|
CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES | 0 |
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES | 256 |
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES | 256 |
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES | 256 |
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES | 256 |
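A sketch of preference creation; the 4 MiB figure is an arbitrary choice for illustration, but note that the default of 0 rules out any algorithm which needs workspace:

```cpp
#include <cublasLt.h>
#include <cstddef>

// Sketch: a preference object allowing cuBLASLt to use up to
// workspaceBytes of caller-provided workspace memory.
cublasLtMatmulPreference_t makePreference(size_t workspaceBytes = 4 << 20) {
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &workspaceBytes, sizeof(workspaceBytes));
    return pref;
}
```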
With all these objects in hand, `cublasLtMatmulAlgoGetHeuristic` can be called to populate an array of `cublasLtMatmulHeuristicResult_t` objects. Once populated, the `algo` field of each result is a ready-to-use `cublasLtMatmulAlgo_t`. This field is relatively opaque, but some information about it can be obtained if desired. This information comes from three places: firstly, there are other fields on `cublasLtMatmulHeuristicResult_t` (namely `wavesCount` and `workspaceSize`); secondly, read-only attributes can be queried using `cublasLtMatmulAlgoCapGetAttribute` (e.g. `CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS` and `CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES` through `CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES`); and thirdly, read-write attributes can be queried using `cublasLtMatmulAlgoConfigGetAttribute`.
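The heuristic query can be sketched as follows (error checking elided; the request for 8 candidates is an arbitrary choice):

```cpp
#include <cublasLt.h>

// Sketch: ask for up to 8 candidate algorithms, ordered best-first, and
// return the top result. The desc/layout/preference objects come from
// the earlier setup steps.
cublasLtMatmulHeuristicResult_t pickAlgo(
        cublasLtHandle_t handle, cublasLtMatmulDesc_t desc,
        cublasLtMatrixLayout_t a, cublasLtMatrixLayout_t b,
        cublasLtMatrixLayout_t c, cublasLtMatmulPreference_t pref) {
    cublasLtMatmulHeuristicResult_t results[8];
    int found = 0;
    // C and D share a layout here, mirroring cublasGemmEx-style usage.
    cublasLtMatmulAlgoGetHeuristic(handle, desc, a, b, c, c, pref,
                                   8, results, &found);
    // found == 0 means no algorithm supports this configuration, and
    // results[0] would then be invalid.
    return results[0];
}
```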
With a `cublasLtMatmulAlgo_t` object chosen (typically the front of the array passed to `cublasLtMatmulAlgoGetHeuristic`), `cublasLtMatmul` can finally be used. This function computes `D = alpha * (A @ B) + beta * C`, which when `C == D` is equivalent to `cublasGemmEx`.
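The final call might look like the following sketch, with C and D aliased to match the `cublasGemmEx` behaviour (device pointers, workspace, and stream are assumed to be supplied by the caller):

```cpp
#include <cublasLt.h>
#include <cstddef>

// Sketch: run D = alpha * (A @ B) + beta * C with C and D aliased,
// i.e. the classic C = alpha * (A @ B) + beta * C of cublasGemmEx.
void runMatmul(cublasLtHandle_t handle, cublasLtMatmulDesc_t desc,
               const float* dA, cublasLtMatrixLayout_t la,
               const float* dB, cublasLtMatrixLayout_t lb,
               float* dC, cublasLtMatrixLayout_t lc,
               const cublasLtMatmulAlgo_t* algo,
               void* workspace, size_t workspaceBytes,
               cudaStream_t stream) {
    float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, desc, &alpha, dA, la, dB, lb, &beta,
                   dC, lc,   // C input (ignored here since beta == 0)
                   dC, lc,   // D output, aliasing C
                   algo, workspace, workspaceBytes, stream);
}
```

The stream and workspace being per-call arguments here is exactly the first advantage listed at the top of these notes.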
That concludes basic usage notes, but if `CUBLAS_COMPUTE_32I` (or `CUBLAS_COMPUTE_32I_PEDANTIC`) is being used, then there's another whole chapter of usage notes. This chapter starts by noting that the list of supported configurations for integer matrix multiplication is (at least currently) very limited:
- `CUDA_R_32I` destination, computation using legacy CUDA cores:
  - scaleType must be `CUDA_R_32I`, but only `0` or `1` supported
  - A matrix must be `CUDA_R_8I` with either `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`
  - B matrix must be `CUDA_R_8I` with either `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`
  - C matrix must be `CUDA_R_32I` with either `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`
  - `CUBLASLT_MATMUL_DESC_EPILOGUE` must be `CUBLASLT_EPILOGUE_DEFAULT`
- `CUDA_R_8I` destination, computation using legacy CUDA cores:
  - scaleType must be `CUDA_R_32F`
  - A matrix must be `CUDA_R_8I` with `CUBLAS_OP_T`
  - B matrix must be `CUDA_R_8I` with `CUBLAS_OP_N`
  - C matrix must be `CUDA_R_8I`
  - `CUBLASLT_MATMUL_DESC_EPILOGUE` must be `CUBLASLT_EPILOGUE_DEFAULT`
- `CUDA_R_32I` destination, computation using integer tensor cores:
  - scaleType must be `CUDA_R_32I`, but only `0` or `1` supported
  - A matrix must be `CUDA_R_8I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
  - B matrix must be `CUDA_R_8I` with `CUBLAS_OP_T` and `CUBLASLT_ORDER_COL4_4R2_8C` (Turing or Ampere) or `CUBLASLT_ORDER_COL32_2R_4R4` (Ampere)
  - C matrix must be `CUDA_R_32I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
  - `CUBLASLT_MATMUL_DESC_EPILOGUE` must be `CUBLASLT_EPILOGUE_DEFAULT`
- `CUDA_R_8I` destination, computation using integer tensor cores:
  - scaleType must be `CUDA_R_32F`
  - A matrix must be `CUDA_R_8I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
  - B matrix must be `CUDA_R_8I` with `CUBLAS_OP_T` and `CUBLASLT_ORDER_COL4_4R2_8C` (Turing or Ampere) or `CUBLASLT_ORDER_COL32_2R_4R4` (Ampere)
  - C matrix must be `CUDA_R_8I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
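The third configuration can be sketched as follows. This is not the only valid arrangement; the leading dimensions follow NVIDIA's LtIgemmTensor sample code (`32 * rows` for `CUBLASLT_ORDER_COL32`, and `32 * n` rounded up to a multiple of 8 for `CUBLASLT_ORDER_COL4_4R2_8C`):

```cpp
#include <cublasLt.h>
#include <cstdint>

// Sketch: descriptor and layouts for the CUDA_R_32I-destination,
// integer-tensor-core configuration (Turing-style B layout).
// A is m x k in COL32, B is n x k (transposed) in COL4_4R2_8C,
// C is m x n in COL32.
void makeInt8TensorCoreObjects(uint64_t m, uint64_t n, uint64_t k,
                               cublasLtMatmulDesc_t* desc,
                               cublasLtMatrixLayout_t* la,
                               cublasLtMatrixLayout_t* lb,
                               cublasLtMatrixLayout_t* lc) {
    cublasLtMatmulDescCreate(desc, CUBLAS_COMPUTE_32I, CUDA_R_32I);
    cublasOperation_t opT = CUBLAS_OP_T;  // B must be transposed
    cublasLtMatmulDescSetAttribute(*desc, CUBLASLT_MATMUL_DESC_TRANSB,
                                   &opT, sizeof(opT));

    cublasLtOrder_t col32 = CUBLASLT_ORDER_COL32;
    cublasLtOrder_t col4 = CUBLASLT_ORDER_COL4_4R2_8C;
    cublasLtMatrixLayoutCreate(la, CUDA_R_8I, m, k, 32 * m);
    cublasLtMatrixLayoutSetAttribute(*la, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                     &col32, sizeof(col32));
    cublasLtMatrixLayoutCreate(lb, CUDA_R_8I, n, k, 32 * ((n + 7) / 8 * 8));
    cublasLtMatrixLayoutSetAttribute(*lb, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                     &col4, sizeof(col4));
    cublasLtMatrixLayoutCreate(lc, CUDA_R_32I, m, n, 32 * m);
    cublasLtMatrixLayoutSetAttribute(*lc, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                     &col32, sizeof(col32));
}
```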
Of particular note are the strange `CUBLASLT_MATRIX_LAYOUT_ORDER` values required for using integer tensor cores. In C/C++ terminology, `CUBLASLT_ORDER_COL` can be thought of as a two-dimensional array indexed as `[I][J]`, with the `[J]` part packed densely, and the leading dimension specifying how to advance by one in the `[I]` part. In the same terminology, `CUBLASLT_ORDER_COL32` can be thought of as a three-dimensional array indexed as `[I/32][J][I%32]`, with the `[J][I%32]` part packed densely, and the leading dimension specifying how to advance by one in the `[I/32]` part.
The `CUBLASLT_ORDER_COL4_4R2_8C` and `CUBLASLT_ORDER_COL32_2R_4R4` layouts are even more exotic. Rather than trying to explain their layout, it is best to consider them to be completely opaque. Thankfully, the `cublasLtMatrixTransform` function is provided to convert between layouts, so a matrix can be constructed using a known simple layout (such as `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`) and then converted to `CUBLASLT_ORDER_COL4_4R2_8C` or `CUBLASLT_ORDER_COL32_2R_4R4` using `cublasLtMatrixTransform`. To use `cublasLtMatrixTransform`, a `cublasLtMatrixTransformDesc_t` object is required. Such an object is created by `cublasLtMatrixTransformDescCreate` followed by zero or more calls to `cublasLtMatrixTransformDescSetAttribute`. Some of the common attributes on this object are:
Attribute | Default Value |
---|---|
CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE | cublasLtMatrixTransformDescCreate parameter |
CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE | CUBLASLT_POINTER_MODE_HOST |
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA | CUBLAS_OP_N |
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB | CUBLAS_OP_N |
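A sketch of such a conversion, here repacking an int8 matrix from plain column-major into `CUBLASLT_ORDER_COL32` (the same pattern applies to the more exotic orders, with their appropriate leading dimensions):

```cpp
#include <cublasLt.h>
#include <cstdint>

// Sketch: repack an int8 m x k matrix from plain column-major into
// CUBLASLT_ORDER_COL32. dSrc and dDst are device pointers; dDst needs
// room for 32 * m elements per 32-column block.
void toCol32(cublasLtHandle_t handle, uint64_t m, uint64_t k,
             const int8_t* dSrc, int8_t* dDst, cudaStream_t stream) {
    cublasLtMatrixTransformDesc_t tdesc;
    cublasLtMatrixTransformDescCreate(&tdesc, CUDA_R_32F);
    cublasLtMatrixLayout_t src, dst;
    cublasLtMatrixLayoutCreate(&src, CUDA_R_8I, m, k, m);
    cublasLtMatrixLayoutCreate(&dst, CUDA_R_8I, m, k, 32 * m);
    cublasLtOrder_t col32 = CUBLASLT_ORDER_COL32;
    cublasLtMatrixLayoutSetAttribute(dst, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                     &col32, sizeof(col32));
    // Out = alpha * A + beta * B with alpha = 1, beta = 0 (B unused),
    // i.e. a pure layout conversion.
    float alpha = 1.0f, beta = 0.0f;
    cublasLtMatrixTransform(handle, tdesc, &alpha, dSrc, src,
                            &beta, nullptr, nullptr, dDst, dst, stream);
    cublasLtMatrixLayoutDestroy(src);
    cublasLtMatrixLayoutDestroy(dst);
    cublasLtMatrixTransformDescDestroy(tdesc);
}
```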
`cublasLtMatrixTransform` can also be used to convert in or out of `CUBLASLT_ORDER_COL32`, but such conversions obviously come with a time cost, so it is better to keep matrices in the `CUBLASLT_ORDER_COL32` format and perform all operations on them in that layout. This might mean rewriting a bunch of CUDA kernels to understand the layout, if said kernels care about the two-dimensional structure of the matrix.