Contrasting Intel AMX and Apple AMX
Intel has an x64 instruction set extension called AMX. Meanwhile, Apple has a totally different aarch64 instruction set extension that is also called AMX.
Register file
Intel AMX: 8 tmm registers, each a 16 by 16 matrix of 32-bit elements (technically, each can be configured to be any size, square or rectangular, between 1 by 1 and 16 by 16, though element size is fixed at 32 bits regardless). Total architectural state 8 kilobytes.
Apple AMX: Total architectural state 5 kilobytes, broken down as:
- 8 x registers, each a 64-byte (row) vector
- 8 y registers, each a 64-byte (row or column) vector
- Then z, which can be variously viewed as any of:
  - 1 register, a 64 by 64 matrix of 8-bit elements
  - 1 register, a 32 by 32 matrix of 32-bit elements
  - 2 registers, each a 32 by 32 matrix of 16-bit elements
  - 4 registers, each a 16 by 16 matrix of 32-bit elements
  - 8 registers, each an 8 by 8 matrix of 64-bit elements
  - 64 registers, each a 64-byte row vector
The vectors have 8/16/32/64-bit elements, like regular SIMD registers. Note that Intel AMX does not need vector registers, as Intel already has AVX512 with 64-byte vectors (32 of which are in the AVX512 architectural register file).
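The sizes quoted above can be checked with a little arithmetic. The following is only a sketch of that bookkeeping: the names are made up for illustration, and it counts just the register contents described in this section (any extra configuration state is ignored).

```python
# Byte counts implied by the register descriptions above.
INTEL_TMM_BYTES = 16 * 16 * 4            # 16 by 16 matrix of 32-bit elements
INTEL_TOTAL_BYTES = 8 * INTEL_TMM_BYTES  # 8 tmm registers

# (register count, rows, columns, element bits) for each view of Apple's z.
Z_VIEWS = [
    (1, 64, 64, 8),
    (1, 32, 32, 32),
    (2, 32, 32, 16),
    (4, 16, 16, 32),
    (8, 8, 8, 64),
    (64, 1, 64, 8),  # 64 row vectors of 64 bytes each
]


def view_bytes(count, rows, cols, bits):
    """Total bytes covered by `count` registers of rows x cols elements."""
    return count * rows * cols * bits // 8


Z_BYTES = view_bytes(*Z_VIEWS[0])
APPLE_TOTAL_BYTES = 8 * 64 + 8 * 64 + Z_BYTES  # x + y + z

print(INTEL_TOTAL_BYTES, APPLE_TOTAL_BYTES)
```

Every view of z covers the same 4 kilobytes, which is why they are alternative views of one piece of state rather than separate register files.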
Data types
Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.
Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction), though bf16 support is added on M2 and later.
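To make the f16 / bf16 distinction concrete, here is a small sketch that splits a value into sign, exponent and fraction fields for each format. The function names are illustrative only. The bf16 conversion here simply truncates an f32 to its top 16 bits, which is an assumption made for simplicity; real hardware may round instead.

```python
import struct


def f16_fields(value):
    """(sign, 5-bit exponent, 10-bit fraction) of value as IEEE 754 half."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def bf16_fields(value):
    """(sign, 8-bit exponent, 7-bit fraction) of value as bf16.

    Assumption: bf16 is taken as the top 16 bits of the f32 encoding
    (truncation, no rounding).
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    bits = bits32 >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F


for v in (1.0, -2.5):
    print(v, f16_fields(v), bf16_fields(v))
```

The trade-off is visible in the field widths: bf16 keeps the exponent range of f32 but has three fewer fraction bits than f16, whereas f16 tops out at a largest finite value of 65504.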
Computational operations
Intel AMX: Matrix multiplication of any two tmm registers, accumulating onto a third tmm register. For the multiplication, matrix elements are themselves (very small) vectors, combined via dot product. This is the only operation. Viewed differently, this is doing 16×64 by 64×16 (int8) or 16×32 by 32×16 (bf16) matmul, then adding onto a 16×16 matrix.
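The two ways of viewing the Intel operation can be sketched in a few lines. This is a plain-Python model under stated assumptions, not the instruction's exact definition: the helper names are invented, tiles are tiny rather than 16 by 16, and integer overflow behaviour is ignored.

```python
def tile_dot_accumulate(c, a, b):
    """c[m][n] += sum over k of dot(a[m][k], b[k][n]).

    Elements of a and b are small vectors (e.g. four int8 values packed
    into one 32-bit element); elements of c are scalars.
    """
    for m in range(len(a)):
        for n in range(len(b[0])):
            for k in range(len(b)):
                c[m][n] += sum(p * q for p, q in zip(a[m][k], b[k][n]))
    return c


def flatten_a(a):
    """View a's vector elements as extra columns."""
    return [[v for vec in row for v in vec] for row in a]


def flatten_b(b):
    """View b's vector elements as extra rows."""
    width = len(b[0][0])
    return [[b[k][n][i] for n in range(len(b[0]))]
            for k in range(len(b)) for i in range(width)]


def matmul(p, q):
    """Ordinary scalar matrix multiplication."""
    return [[sum(p[m][j] * q[j][n] for j in range(len(q)))
             for n in range(len(q[0]))] for m in range(len(p))]


a = [[(1, 2, 3, 4), (5, 6, 7, 8)],
     [(1, 0, 0, 0), (0, 0, 0, 1)]]
b = [[(1, 1, 1, 1), (1, 0, 0, 0)],
     [(2, 2, 2, 2), (0, 0, 0, 1)]]

by_dot_product = tile_dot_accumulate([[0, 0], [0, 0]], a, b)
by_flat_matmul = matmul(flatten_a(a), flatten_b(b))
print(by_dot_product == by_flat_matmul)
```

Here a 2×2 tile of 4-element vectors behaves like a 2×8 by 8×2 scalar matmul, which is the small-scale analogue of the 16×64 by 64×16 view described above.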
Apple AMX: Outer product of any x register with any y register (viewed as a column vector), accumulating onto any (matrix view) z register. For the multiplication, x / y elements are scalars (depending on the data type, this might be viewed as doing 16×1 by 1×16 matmul then adding onto a 16×16 matrix). Alternatively, pointwise product of any x register with any y register (viewed as a row vector), accumulating onto any (vector view) z register. Many more operations as well, though the general theme is {outer or pointwise} {multiplication or addition or subtraction}, possibly followed by right-shift, possibly followed by integer saturation. The most exotic exceptions to the theme are min / max / popcnt.
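The outer and pointwise forms can be modelled as follows. This is a minimal sketch with invented names: the choice of z[j][i] for the product of x[i] and y[j] is an assumption about layout, and the shift-then-saturate helper only illustrates the general post-processing theme rather than any particular instruction.

```python
def outer_accumulate(z, x, y):
    """z[j][i] += x[i] * y[j], with y treated as a column vector."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            z[j][i] += xi * yj
    return z


def pointwise_accumulate(z_row, x, y):
    """z_row[i] += x[i] * y[i], with y treated as a row vector."""
    for i, (xi, yi) in enumerate(zip(x, y)):
        z_row[i] += xi * yi
    return z_row


def shift_saturate(value, shift, bits):
    """Arithmetic right shift, then clamp to a signed `bits`-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value >> shift))


z = outer_accumulate([[0, 0, 0], [0, 0, 0]], [1, 2, 3], [10, 20])
row = pointwise_accumulate([1, 1, 1], [1, 2, 3], [4, 5, 6])
print(z, row, shift_saturate(70000, 1, 16))
```

Repeating the outer product over successive columns of one matrix and rows of another, accumulating onto the same z, builds up a full matrix multiplication one rank-1 update at a time.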
Memory operations
Intel AMX: Matrix load or store (up to 1 kilobyte), configurable with a base address (register + immediate offset) and a row stride (register or zero, optionally shifted left by 1-3 bits).
Apple AMX: Vector load or store (64 bytes), configurable with a base address (register). Also load or store pair (128 bytes), though the two registers must be consecutive, the row stride is fixed at 64 bytes, and the base address must be 128-byte aligned. Loads or stores with y effectively give a free vector transpose, as y registers can be viewed as column vectors.
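The addressing behind these loads can be sketched against a flat byte buffer. The function names and the use of a Python bytes object as "memory" are illustrative assumptions; only the base / stride / alignment rules come from the descriptions above.

```python
def load_tile(memory, base, stride, rows=16, row_bytes=64):
    """Intel-style matrix load: `rows` rows, each `stride` bytes apart.

    A stride of zero makes every row read the same bytes.
    """
    return [bytes(memory[base + r * stride: base + r * stride + row_bytes])
            for r in range(rows)]


def load_vector(memory, base, size=64):
    """Apple-style vector load: one contiguous 64-byte row."""
    return bytes(memory[base: base + size])


def load_pair(memory, base, size=64):
    """Apple-style pair load: two consecutive rows from an aligned base."""
    if base % (2 * size) != 0:
        raise ValueError("pair loads need a base aligned to twice the vector size")
    return load_vector(memory, base, size), load_vector(memory, base + size, size)


memory = bytes(range(256)) * 8  # 2048 bytes of sample data

tile = load_tile(memory, base=4, stride=128, rows=3, row_bytes=4)
same = load_tile(memory, base=4, stride=0, rows=3, row_bytes=4)
first, second = load_pair(memory, base=128)
print([list(r) for r in tile], first[0], second[0])
```

The Intel form can pull a tile straight out of a larger row-major matrix by setting the stride to that matrix's row pitch, whereas the Apple form needs one load (or one pair load) per row or pair of rows.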
Masking modes
Intel AMX: Each tmm register can be configured to any size, square or rectangular, between 1 by 1 and 16 by 16. This is (mostly) equivalent to saying that a tmm register is always 16 by 16, but has an associated mask on each dimension to only enable some number of leading rows and columns. These per-register masks are architectural state.
Apple AMX: Per-dimension masking is available on a per-instruction basis (though notably not for loads / stores). Available masking modes are: all rows (columns), even/odd rows (columns) only, first N rows (columns) only, last N rows (columns) only, row (column) N only.
Note that both of these approaches are different to Intel's AVX512 approach, which is a separate register file containing 8 mask registers (k0 through k7) and every operation optionally taking a mask register as an input.
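The masking modes listed for Apple AMX, and the leading-rows / leading-columns masks implied by Intel's tile configuration, can both be modelled as boolean lane masks. The mode names below ("all", "even", "odd", "first", "last", "only") are hypothetical labels chosen for this sketch, not real encodings.

```python
def lane_mask(mode, size, n=0):
    """Boolean enable mask over `size` rows or columns."""
    if mode == "all":
        return [True] * size
    if mode == "even":
        return [i % 2 == 0 for i in range(size)]
    if mode == "odd":
        return [i % 2 == 1 for i in range(size)]
    if mode == "first":
        return [i < n for i in range(size)]
    if mode == "last":
        return [i >= size - n for i in range(size)]
    if mode == "only":
        return [i == n for i in range(size)]
    raise ValueError("unknown mode: " + mode)


def masked_outer_accumulate(z, x, y, x_mask, y_mask):
    """Outer product accumulate, touching only enabled rows and columns."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            if y_mask[j] and x_mask[i]:
                z[j][i] += xi * yj
    return z


z = [[0] * 4 for _ in range(4)]
masked_outer_accumulate(z, [1, 2, 3, 4], [1, 10, 100, 1000],
                        x_mask=lane_mask("first", 4, 2),
                        y_mask=lane_mask("odd", 4))
print(z)
```

In this model, an Intel tmm register configured as 3 rows by 5 columns corresponds to always applying lane_mask("first", 16, 3) to rows and lane_mask("first", 16, 5) to columns, with the mask attached to the register rather than to the instruction.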
Other
Apple AMX contains a very interesting instruction called genlut. In the forward direction ("lookup"), it is somewhere between AVX512's vpshufb and vgatherps. In the backward direction ("generate") it is something like an inverse vpshufb, performing arbitrary 2/3/4/5-bit quantisation. When used in both directions, it can be useful for piecewise linear interpolation, or as an alternative to AVX512's vfpclassps / vfixupimmps.
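As a rough illustration of why a generate-then-lookup pair suits piecewise linear interpolation, here is a loose model in plain Python. It is emphatically not genlut's actual semantics: the function names, the "largest boundary not exceeding the value" rule, and the table layout are all assumptions made for the sake of the example.

```python
import bisect


def generate(values, boundaries):
    """Quantise each value to the index of the segment containing it.

    Assumption: the index is that of the largest boundary <= value,
    clamped to the valid range. Four boundaries give a 2-bit index.
    """
    last = len(boundaries) - 1
    return [max(0, min(last, bisect.bisect_right(boundaries, v) - 1))
            for v in values]


def lookup(indices, table):
    """Gather one table entry per index."""
    return [table[i] for i in indices]


def piecewise_linear(values, boundaries, slopes, intercepts):
    """Evaluate slope * v + intercept using each value's own segment."""
    idx = generate(values, boundaries)
    return [m * v + c for v, m, c in
            zip(values, lookup(idx, slopes), lookup(idx, intercepts))]


boundaries = [0.0, 1.0, 2.0, 3.0]
slopes = [1.0, 2.0, 0.0, -1.0]
intercepts = [0.0, -1.0, 3.0, 6.0]

print(piecewise_linear([0.5, 1.5, 2.5, 3.5], boundaries, slopes, intercepts))
```

The generate step turns each input into a small index, and two lookups with that index fetch the per-segment slope and intercept, leaving a single multiply-add to finish the interpolation.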