AVX-512 notes
I have access to some Intel Skylake-X CPUs and some Intel Cascade Lake CPUs, both of which support AVX-512 instructions. AVX-512 has lots of subsets; both of these CPUs support the F, CD, VL, DQ, and BW subsets. Additionally, Cascade Lake supports the VNNI subset. A small part of this post is about VNNI, but other than that, everything is applicable to both Skylake-X and Cascade Lake.
There are new 512-bit-wide vector registers zmm0 through zmm31, which extend ymm0 through ymm15 both in their width and their number. The low parts of these new registers are available as ymm16 through ymm31, or xmm16 through xmm31, should you have code which would benefit from more registers rather than wider registers. There are also new 64-bit-wide "mask" registers k0 through k7.
Starting with "mask" registers:
- Movement between "mask" registers and GPRs (or memory) is done with kmov[bwdq].
- Movement between "mask" registers and vector registers is done with vpmovm2[bwdq] (expand each bit of a mask register to an element of a vector register), vpmov[bwdq]2m (create a mask from the sign bits of vector elements), or vpbroadcastm(b2q|w2d) (zero-extend a mask and broadcast it to all elements of a vector).
- A limited range of arithmetic on "mask" registers is done with kadd[bwdq] / kand[bwdq] / kandn[bwdq] / knot[bwdq] / kor[bwdq] / kshiftl[bwdq] / kshiftr[bwdq] / kunpck(bw|wd|dq) / kxnor[bwdq] / kxor[bwdq], all of which are three-operand (except knot), and none of which affect flags. ktest[bwdq] and kortest[bwdq] compare two "mask" registers and write to flags.
- Most vector instructions can optionally take a "mask" register (but not k0), and use it to control which vector lanes are active for the instruction. Inactive lanes can either produce a zero output, or preserve the value of the destination register. In terms of syntax, this is done by putting {kN} after the destination register (preserving), or {kN}{z} (zeroing), for example vmulps zmm0 {k1}{z}, zmm1, zmm2.
- Vector comparison instructions output to a "mask" register rather than a vector register (though vpmovm2[bwdq] can be used to expand that to a vector). The new vp(test|testn)m[bwdq] instructions similarly compare two vectors and output to a "mask" register.
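To make the difference between preserving and zeroing masks concrete, here is a minimal scalar sketch in Python. It models lanes as list elements and the "mask" register as an integer; the helper name apply_mask is made up for illustration, and this is a model of the behaviour described above rather than anything that executes AVX-512.

```python
def apply_mask(dest, result, mask, zeroing=False):
    """Scalar model of per-lane masking for a single vector instruction.

    dest    -- lane values held by the destination register beforehand
    result  -- lane values the unmasked instruction would have produced
    mask    -- integer; bit i set means lane i is active
    zeroing -- False models {kN} (preserve), True models {kN}{z} (zero)
    """
    out = []
    for i, (old, new) in enumerate(zip(dest, result)):
        if (mask >> i) & 1:
            out.append(new)   # active lane: take the computed value
        elif zeroing:
            out.append(0.0)   # inactive lane, {z}: produce zero
        else:
            out.append(old)   # inactive lane, no {z}: keep old value
    return out


# Model of: vmulps zmm0 {k1}, zmm1, zmm2   and   vmulps zmm0 {k1}{z}, zmm1, zmm2
zmm0 = [9.0] * 16
zmm1 = [float(i) for i in range(16)]
zmm2 = [2.0] * 16
product = [a * b for a, b in zip(zmm1, zmm2)]
k1 = 0b0101  # only lanes 0 and 2 are active

merged = apply_mask(zmm0, product, k1)                # inactive lanes stay 9.0
zeroed = apply_mask(zmm0, product, k1, zeroing=True)  # inactive lanes become 0.0
```

In the preserving form the destination is effectively an extra input to the instruction, which is worth keeping in mind when reasoning about dependency chains.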
Most vector instructions which allow a memory operand and have a lane width of 32-bits or 64-bits now support the memory operand being an embedded broadcast of a 32-bit or 64-bit value. In terms of syntax, this is done by putting {1toN} after the memory operand, for example vpaddd zmm0, zmm0, dword ptr [rax] {1to16}. Some instructions gain optional modifiers for controlling the rounding mode, or for suppressing exceptions. As a quirk of the instruction encoding, all three pieces of functionality (broadcasting, rounding mode control, and suppressing exceptions) are enabled/disabled by the same bit, which might cause surprises.
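As a rough illustration of what the {1to16} example computes, the following Python sketch models the sixteen 32-bit lanes as a list and the memory operand as a single integer. The function name is hypothetical; it only mirrors the semantics described above.

```python
def vpaddd_broadcast(zmm, mem_dword):
    """Scalar model of: vpaddd zmm0, zmm0, dword ptr [rax] {1to16}.

    zmm       -- list of sixteen 32-bit lane values
    mem_dword -- the single 32-bit value loaded from memory
    """
    assert len(zmm) == 16
    # The one loaded dword is added to every lane, wrapping at 32 bits.
    return [(lane + mem_dword) & 0xFFFFFFFF for lane in zmm]
```

Only four bytes are read from memory, rather than the 64 bytes a full-width memory operand would read.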
Assorted new floating-point instructions:
- vrangeps does a min or max (optionally ignoring sign bits), and then optionally replaces the sign bit of the result with 0 or 1 or the sign bit of the first operand. One potential use-case is clamping values to be between -t and +t in a single operation.
- vreduceps is a combined vroundps and vsubps. Optionally it can also scale by 2^M for 0 <= M <= 15 on input and by a matching 2^-M on output.
- vrndscaleps is like vroundps, with the extra trick of optionally scaling by 2^M for 0 <= M <= 15 on input and by a matching 2^-M on output.
- vrcp14ps and vrsqrt14ps are variants of vrcpps (x^-1) and vrsqrtps (x^-0.5) with more precision (14 bits rather than 11).
- vfixupimmps and vfpclassps help with handling edge-cases around zeros / NaNs / infinities / denormals.
- vgetexpps and vgetmantps and vscalefps are also new.
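The scaling trick in vrndscaleps and vreduceps is perhaps easiest to see in scalar form. The sketch below is a Python model of a single lane, assuming finite inputs and using Python's double-precision floats rather than single precision; the function names and the rounder parameter (standing in for the rounding mode selected by the immediate) are illustrative only.

```python
import math


def rndscale_lane(x, m=0, rounder=round):
    """Round x to a multiple of 2**-m: scale up, round to integer, scale down.

    rounder defaults to Python's round(), which rounds ties to even;
    math.floor, math.ceil or math.trunc model the other rounding modes.
    """
    if not 0 <= m <= 15:
        raise ValueError("M must be in 0..15")
    return math.ldexp(rounder(math.ldexp(x, m)), -m)


def reduce_lane(x, m=0, rounder=round):
    """What is left of x after subtracting its rounded value."""
    return x - rndscale_lane(x, m, rounder)
```

For example, rndscale_lane(1.3, 2) gives 1.25 (the nearest multiple of 1/4), and reduce_lane(1.3, 2) gives the remaining 0.05 or so, which is the kind of split commonly used for range reduction before a polynomial approximation.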
Assorted new integer instructions:
- vpcmp[bwdq] generalises vpcmp(eq|gt)[bwdq], and is also available for unsigned integers as vpcmpu[bwdq].
- vp(min|max)[su]q and vpabsq and vpmullq are q versions of instructions previously only present for [bwd].
- vp(rol|ror)[dq] and vp(rol|ror)v[dq] are rotate instructions.
- vplzcnt[dq] are vectorised lzcnt instructions (similar to bsr).
- vdbpsadbw is an extension of vpsadbw.
- vpconflict[dq] perform pair-wise equality comparisons of source elements, outputting bitmasks in every lane.
- vpternlog[dq] subsumes all boolean functions of up to three inputs (though vp(and|andn|or|xor)[dq] should still be used where appropriate, due to their shorter encoding and lack of dependency on the output register).
- vpmov(|s|us)q[dwb] and vpmov(|s|us)d[wb] and vpmov(|s|us)wb provide down-conversion (in truncating, signed saturating, and unsigned saturating varieties) combined with packing. For example, vpmovsdb xmm0, zmm0 converts 16 int32 values into 16 int8 values (via saturation), and packs the results into the low 128 bits of the destination.
- On Cascade Lake, VNNI adds vpdpbusd as a fusion of vpmaddubsw + vpmaddwd + vpaddd. vpdpbusds is similar, but with saturation. vpdpwssd and vpdpwssds fuse just vpmaddwd + vpaddd. These instructions have a latency of 5 cycles, versus vpaddd's 1 cycle, so more accumulation registers are required in tight loops.
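For vpternlog[dq], the boolean function is chosen by an 8-bit immediate that acts as a truth table: for each bit position, the three input bits form a 3-bit index (destination operand as the high bit, then the two sources) and the result is that bit of the immediate. The Python sketch below derives the immediate for a given function and models the instruction on plain integers; both helper names are made up for illustration.

```python
def ternlog_imm8(f):
    """Build the 8-bit truth table for a boolean function f(a, b, c)."""
    imm = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if f(a, b, c) & 1:
                    imm |= 1 << ((a << 2) | (b << 1) | c)
    return imm


def vpternlog(a, b, c, imm8, width=32):
    """Scalar model of one lane: look up every bit position in the table."""
    out = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm8 >> idx) & 1) << i
    return out


XOR3 = ternlog_imm8(lambda a, b, c: a ^ b ^ c)                # 0x96
SELECT = ternlog_imm8(lambda a, b, c: (a & b) | (~a & c))     # 0xCA
MAJORITY = ternlog_imm8(lambda a, b, c: (a & b) | (b & c) | (a & c))  # 0xE8
```

An equivalent shortcut is to evaluate the desired expression on the constants 0xF0, 0xCC, and 0xAA in place of the three inputs, and keep the low 8 bits of the result.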
There are new instructions for converting between unsigned integers and floating-point values, in the form of v(cvt|cvtt)[ps][sd]2u(dq|si|qq) and vcvtu(dq|si|qq)2[ps][sd]. Also new are packed conversions between int64 and floating-point values, in the form of v(cvt|cvtt)p[sd]2qq and vcvtqq2p[sd].
Assorted new permutation and shuffling and blending instructions:
- vblendm(ps|pd) and vpblendm[bwdq] are per-lane blends, controlled by a "mask" register.
- vcompress(ps|pd) and vpcompress[dq] provide cross-lane packing, controlled by a "mask" register (can have a memory destination). This is like pext, but operating on lanes rather than bits.
- vexpand(ps|pd) and vpexpand[dq] provide cross-lane un-packing, controlled by a "mask" register (can have a memory source). This is like pdep, but operating on lanes rather than bits.
- vextract[fi](32x8|64x4) and vinsert[fi](32x8|64x4) manipulate the 256-bit halves of a 512-bit register.
- vshuf[fi](32x4|64x2) shuffle at 128-bit granularity from two sources. These instructions are to the 128-bit lanes of a 512-bit register as shufps is to the 32-bit lanes of a 128-bit register.
- vperm(ps|pd) and vperm[wdq] permute from one source, using indices from another source.
- vperm[ti]2(ps|pd) and vperm[ti]2[wdq] permute from two sources, using indices from another source. The i variant has the indices register as the destination. The t variant has a source register as the destination.
- valign[dq] concatenate two 512-bit registers and extract a contiguous 512-bit slice.
- vscatter[dq](ps|pd) and vpscatter[dq][dq] provide scattered stores.
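The compress and expand instructions are inverses of each other in much the same way that pext and pdep are. A scalar Python sketch of the register-to-register forms, with lanes as list elements and the "mask" register as an integer, might look like the following; the function names are illustrative, and passing dest models the preserving form while omitting it models the zeroing form.

```python
def compress(src, mask, dest=None):
    """Pack the active lanes of src into the low lanes of the output."""
    out = list(dest) if dest is not None else [0] * len(src)
    packed = [x for i, x in enumerate(src) if (mask >> i) & 1]
    out[:len(packed)] = packed  # remaining lanes keep dest's values or zero
    return out


def expand(src, mask, dest=None):
    """Spread the low lanes of src out to the active lanes of the output."""
    out = list(dest) if dest is not None else [0] * len(src)
    j = 0  # next unread low lane of src
    for i in range(len(src)):
        if (mask >> i) & 1:
            out[i] = src[j]
            j += 1
    return out


lanes = [10, 11, 12, 13, 14, 15, 16, 17]
k = 0b10100110                 # lanes 1, 2, 5 and 7 are active
packed = compress(lanes, k)    # active lanes moved to the front
spread = expand(packed, k)     # and put back where they came from
```

Expanding a compressed vector with the same mask restores the active lanes to their original positions, with the inactive lanes zeroed or preserved depending on the masking form.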