AVX-512 notes

I have access to some Intel Skylake-X CPUs and some Intel Cascade Lake CPUs, both of which support AVX-512 instructions. AVX-512 has lots of subsets; both of these CPUs support the F, CD, VL, DQ, and BW subsets. Additionally, Cascade Lake supports the VNNI subset. A small part of this post is about VNNI, but other than that, everything is applicable to both Skylake-X and Cascade Lake.

There are new 512-bit-wide vector registers zmm0 through zmm31, which extend ymm0 through ymm15 both in their width and their number. The low parts of these new registers are available as ymm16 through ymm31, or xmm16 through xmm31, should you have code which would benefit from more registers rather than wider registers. There are also new 64-bit-wide "mask" registers k0 through k7.

Starting with "mask" registers:

Movement between "mask" registers and GPRs (or memory) is done with kmov[bwdq].
Movement between "mask" registers and vector registers is done with vpmovm2[bwdq] (expand each bit of a mask register to an element of a vector register), vpmov[bwdq]2m (create a mask from the sign bits of vector elements), or vpbroadcastm(b2q|w2d) (zero-extend a mask and broadcast it to all elements of a vector).
A limited range of arithmetic on "mask" registers is done with kadd[bwdq] / kand[bwdq] / kandn[bwdq] / knot[bwdq] / kor[bwdq] / kshiftl[bwdq] / kshiftr[bwdq] / kunpck(bw|wd|dq) / kxnor[bwdq] / kxor[bwdq], all of which are three-operand (except knot), and none of which affect flags.
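ktest[bwdq] and kortest[bwdq] combine two "mask" registers and write only to flags: kortest sets ZF if the bitwise OR is all zeros and CF if it is all ones, whereas ktest sets ZF if the bitwise AND is all zeros and CF if the bitwise ANDN is all zeros.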
Most vector instructions can optionally take a "mask" register (but not k0), and use it to control which vector lanes are active for the instruction. Inactive lanes can either produce a zero output, or preserve the value of the destination register. In terms of syntax, this is done by putting {kN} after the destination register (preserving), or {kN}{z} (zeroing), for example vmulps zmm0 {k1}{z}, zmm1, zmm2.
Vector comparison instructions output to a "mask" register rather than a vector register (though vpmovm2[bwdq] can be used to expand that to a vector). The new vp(test|testn)m[bwdq] instructions similarly compare two vectors and output to a "mask" register.
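
To make the preserving versus zeroing distinction concrete, here is a rough scalar model in Python. It is only a sketch of the semantics described above, not real intrinsics, and the names compare_to_mask and masked_op are made up for illustration. Lanes are list elements, and bit i of the mask controls lane i.

```python
import operator

def compare_to_mask(pred, a, b):
    """Model of a vector compare: bit i of the result is pred(a[i], b[i])."""
    k = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if pred(x, y):
            k |= 1 << i
    return k

def masked_op(op, dst, a, b, k, zeroing=False):
    """Model of a masked lane-wise instruction such as vmulps dst {k}, a, b."""
    out = []
    for i, (d, x, y) in enumerate(zip(dst, a, b)):
        if (k >> i) & 1:
            out.append(op(x, y))  # active lane: compute as usual
        elif zeroing:
            out.append(0)         # {kN}{z}: inactive lanes become zero
        else:
            out.append(d)         # {kN}: inactive lanes keep the old value
    return out

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 10.0, 10.0, 10.0]
threshold = [2.5, 2.5, 2.5, 2.5]
old = [9.0, 9.0, 9.0, 9.0]

k1 = compare_to_mask(operator.gt, a, threshold)            # lanes 2 and 3 active
merged = masked_op(operator.mul, old, a, b, k1)            # preserving form
zeroed = masked_op(operator.mul, old, a, b, k1, zeroing=True)
print(bin(k1), merged, zeroed)
```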

Most vector instructions which allow a memory operand and have a lane width of 32 bits or 64 bits now support the memory operand being an embedded broadcast of a single 32-bit or 64-bit value. In terms of syntax, this is done by putting {1toN} after the memory operand, for example vpaddd zmm0, zmm0, dword ptr [rax] {1to16}. Some instructions gain optional modifiers for controlling the rounding mode, or for suppressing exceptions. As a quirk of the instruction encoding, all three pieces of functionality (broadcasting, rounding mode control, and suppressing exceptions) are enabled/disabled by the same bit: with a memory operand the bit means broadcast, and with only register operands it means rounding control or exception suppression, so the rounding and exception modifiers cannot be combined with a memory operand.
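
As a small illustration of the embedded broadcast, the following Python sketch models vpaddd zmm0, zmm0, dword ptr [rax] {1to16}. It is a scalar model rather than real code for the instruction, and the function name and the use of a bytes object as "memory" are assumptions made for the example. The point is that only four bytes are read, and that single value is added to every lane.

```python
import struct

def vpaddd_bcst(zmm, mem, offset=0):
    """Model of vpaddd zmm, zmm, dword ptr [mem + offset] {1to16}."""
    # A single little-endian 32-bit load, regardless of the vector width.
    (scalar,) = struct.unpack_from("<I", mem, offset)
    # Lane-wise 32-bit addition with wraparound.
    return [(lane + scalar) & 0xFFFFFFFF for lane in zmm]

memory = struct.pack("<I", 5)
result = vpaddd_bcst(list(range(16)), memory)
print(result)
```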

Assorted new floating-point instructions:

vrangeps does a min or max (optionally ignoring sign bits), and then optionally replaces the sign bit of the result with 0 or 1 or the sign bit of the first operand. One potential use-case is clamping values to be between -t and +t in a single operation.
vreduceps is a combined vroundps and vsubps. Optionally it can also scale by 2^M for 0 <= M <= 15 on input and by a matching 2^-M on output.
vrndscaleps is like vroundps, with the extra trick of optionally scaling by 2^M for 0 <= M <= 15 on input and by a matching 2^-M on output.
vrcp14ps and vrsqrt14ps are variants of vrcpps (x^-1) and vrsqrtps (x^-0.5) with more precision (14 bits rather than 11).
vfixupimmps and vfpclassps help with handling edge-cases around zeros / NaNs / infinities / denormals.
vgetexpps and vgetmantps and vscalefps are also new.
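
A per-lane Python sketch of the vrangeps clamping trick and of the vrndscaleps / vreduceps relationship may help. This is a model under simplifying assumptions: it uses Python doubles rather than single-precision floats, only models the default round-to-nearest-even mode, and ignores NaNs, infinities, and denormals. The function names are made up for illustration.

```python
import math

def vrange_clamp(x, t):
    """vrangeps in 'min of absolute values, sign of first operand' mode.

    This clamps x to the interval [-|t|, +|t|] in one step.
    """
    return math.copysign(min(abs(x), abs(t)), x)

def rndscale(x, m):
    """vrndscaleps: scale by 2^m, round to an integer, scale back by 2^-m."""
    return round(x * 2.0 ** m) * 2.0 ** -m

def vreduce(x, m):
    """vreduceps: what is left of x after subtracting the rounded value."""
    return x - rndscale(x, m)

print(vrange_clamp(-5.0, 2.0))  # clamped to -2.0
print(rndscale(3.14159, 4))     # nearest multiple of 1/16
print(vreduce(3.14159, 4))      # small remainder
```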

Assorted new integer instructions:

vpcmp[bwdq] generalises vpcmp(eq|gt)[bwdq], and is also available for unsigned integers as vpcmpu[bwdq].
vp(min|max)[su]q and vpabsq and vpmullq are q versions of instructions previously only present for [bwd].
vp(rol|ror)[dq] and vp(rol|ror)v[dq] are rotate instructions.
vplzcnt[dq] are vectorised lzcnt instructions (similar to bsr).
vdbpsadbw is an extension of vpsadbw.
vpconflict[dq] perform pair-wise equality comparisons of source elements, outputting in every lane a bitmask of which lower-indexed lanes hold the same value as that lane.
vpternlog[dq] subsumes all boolean functions of up to three inputs (though vp(and|andn|or|xor)[dq] should still be used where appropriate, due to their shorter encoding and lack of dependency on the output register).
vpmov(|s|us)q[dwb] and vpmov(|s|us)d[wb] and vpmov(|s|us)wb provide down-conversion (in truncating, signed saturating, and unsigned saturating varieties) combined with packing. For example, vpmovsdb xmm0, zmm0 converts 16 int32 values into 16 int8 values (via saturation), and packs the results into the low 128 bits of the destination.
On Cascade Lake, VNNI adds vpdpbusd as a fusion of vpmaddubsw+vpmaddwd+vpaddd. vpdpbusds is similar, but with saturation. vpdpwssd and vpdpwssds fuse just vpmaddwd+vpaddd. These instructions have a latency of 5 cycles, versus vpaddd's 1 cycle, so more accumulation registers are required in tight loops.
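
Two of the instructions above are easy to model in scalar code. The Python sketch below is a model for checking understanding, not intrinsics, and all function names are made up for illustration. For vpternlog, the three input bits at each position (destination, then the two sources) form an index into the 8-bit immediate. A handy way to derive the immediate is to evaluate the desired boolean function on the constants 0xF0, 0xCC, and 0xAA. For vpdpbusd, the model handles a single 32-bit lane and assumes unsigned bytes come from the first source and signed bytes from the second.

```python
def ternlog(imm8, a, b, c, width=32):
    """Bitwise model of vpternlog: a is the destination, b and c the sources."""
    out = 0
    for bit in range(width):
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        out |= ((imm8 >> idx) & 1) << bit
    return out

def ternlog_imm8(f):
    """Derive the immediate for a three-input boolean function f(a, b, c)."""
    return f(0xF0, 0xCC, 0xAA) & 0xFF

def to_int32(x):
    """Wrap an arbitrary Python int to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def dpbusd_lane(acc, u8s, s8s, saturate=False):
    """One lane of vpdpbusd (or vpdpbusds when saturate=True)."""
    total = acc + sum(u * s for u, s in zip(u8s, s8s))
    if saturate:
        return max(-(1 << 31), min((1 << 31) - 1, total))
    return to_int32(total)

xor3 = ternlog_imm8(lambda a, b, c: a ^ b ^ c)                # three-way xor
select = ternlog_imm8(lambda a, b, c: (a & b) | (~a & c))     # a ? b : c
print(hex(xor3), hex(select))
print(dpbusd_lane(10, [1, 2, 3, 4], [1, -1, 1, -1]))
```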

There are new instructions for converting between unsigned integers and floating-point values, in the form of v(cvt|cvtt)[ps][sd]2u(dq|si|qq) and vcvtu(dq|si|qq)2[ps][sd]. Also new are packed conversions between int64 and floating-point values, in the form of v(cvt|cvtt)p[sd]2qq and vcvtqq2p[sd].

Assorted new permutation and shuffling and blending instructions:

vblendm(ps|pd) and vpblendm[bwdq] are per-lane blends, controlled by a "mask" register.
vcompress(ps|pd) and vpcompress[dq] provide cross-lane packing, controlled by a "mask" register (can have a memory destination). This is like pext, but operating on lanes rather than bits.
vexpand(ps|pd) and vpexpand[dq] provide cross-lane un-packing, controlled by a "mask" register (can have a memory source). This is like pdep, but operating on lanes rather than bits.
vextract[fi](32x8|64x4) and vinsert[fi](32x8|64x4) manipulate the 256-bit halves of a 512-bit register.
vshuf[fi](32x4|64x2) shuffle at 128-bit granularity from two sources. These instructions are to the 128-bit lanes of a 512-bit register as shufps is to the 32-bit lanes of a 128-bit register.
vperm(ps|pd) and vperm[wdq] permute from one source, using indices from another source.
vperm[ti]2(ps|pd) and vperm[ti]2[wdq] permute from two sources, using indices from another source. The i variant has the indices register as the destination. The t variant has a source register as the destination.
valign[dq] concatenate two 512-bit registers and extract a contiguous 512-bit slice.
vscatter[dq](ps|pd) and vpscatter[dq][dq] provide scattered stores.
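
Several of these are easier to follow with a scalar model. The Python sketch below treats a register as a list of lanes with lane 0 first. It illustrates the semantics described above rather than showing real intrinsics, and the function names are made up for illustration. For compress and expand, passing dst models the preserving form and omitting it models the zeroing form. permute2 covers both vperm[ti]2 variants, which differ only in which operand gets overwritten.

```python
def compress(src, k, dst=None):
    """vcompress / vpcompress: pack the active lanes of src into the low lanes."""
    packed = [x for i, x in enumerate(src) if (k >> i) & 1]
    if dst is not None:
        rest = dst[len(packed):]                 # preserve the remaining lanes
    else:
        rest = [0] * (len(src) - len(packed))    # zero the remaining lanes
    return packed + rest

def expand(src, k, dst=None):
    """vexpand / vpexpand: spread the low lanes of src out to the active lanes."""
    out, nxt = [], 0
    for i in range(len(src)):
        if (k >> i) & 1:
            out.append(src[nxt])                 # next consecutive source lane
            nxt += 1
        else:
            out.append(dst[i] if dst is not None else 0)
    return out

def valign(a, b, imm):
    """valign: concatenate a (high) with b (low), then take a slice of n lanes."""
    n = len(a)
    start = imm % n
    return (b + a)[start:start + n]

def permute2(tbl0, tbl1, idx):
    """vperm[ti]2: each index selects from the 2n lanes of tbl0 followed by tbl1."""
    both = tbl0 + tbl1
    return [both[i % len(both)] for i in idx]

src = [10, 11, 12, 13, 14, 15, 16, 17]
k = 0b10100110                                   # lanes 1, 2, 5, 7 active
packed = compress(src, k)
print(packed)
print(expand(packed, k))
print(valign(list(range(100, 108)), list(range(8)), 3))
```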