Galois field instructions on 2021 CPUs
CPUs like to operate on data types such as uint8_t
, uint16_t
, uint32_t
, and uint64_t
. These data types can be viewed as integers modulo 28, 216, 232, and 264 respectively. Addition and multiplication can be defined on these integers modulo 2N in a way which is familiar to most people; for example, in uint8_t
math, we have 199 + 201 == 144
and 16 * 20 == 64
. uint8_t
math can be viewed as infinite-precision integer math combined with the reduction rule of 256 == 0
; for example in infinite-precision integer math we have 199 + 201 == 400 == 256 + 144
and 16 * 20 == 320 == 256 + 64
.
Reduction rules of the form 2N == 0
(for N > 1
) are CPU-friendly, but not particularly math-friendly. On the other hand, reduction rules of the form P == 0
for some prime number P
are math-friendly, but not CPU-friendly. With infinite-precision integer math combined with the reduction rule of say 251 == 0
, we have 199 + 201 == 149
and 16 * 20 == 69
. As 251
is prime, this reduction rule is math-friendly. By that, I mean that for any x
other than zero, there is some y
such that x * y == 1
. For example, taking x
of 16, we have 16 * 204 == 1
. This property is not true for the 256 == 0
reduction rule; with that rule, there is no y
such that 16 * y == 1
. Where it exists, this y
can be called inv(x)
or 1/x
.
If we want to keep the CPU-friendly uint8_t
, uint16_t
, uint32_t
, and uint64_t
data types, and also keep the math-friendly property of having inv(x)
exist for all non-zero x
, then we are actually in luck, albeit we have to use some exotic definitions of addition and multiplication. The exotic definitions come from polynomial math rather than integer math. A string of bits b0
, b1
, ..., bn-1
can be viewed as the polynomial b0 + b1 * x1 + ... + bn-1 * xn-1
, at which point addition or multiplication of two strings of bits can be done by converting both strings to polynomials, doing addition or multiplication of those two polynomials, and then converting the resultant polynomial back to a bit string. That final back-conversion step requires that every bi
in the resultant polynomial is either 0
or 1
, which we can ensure by applying the reduction rule 2 == 0
to every bi
. In the case of multiplication, the resultant bit string (or, equivalently, polynomial) is often going to be twice as long as its inputs. This should not come as a surprise, as the same thing happens with infinite-precision integer math. Continuing the trend of non-surprising things, just as reduction rules can be used to bring infinite-precision integer math down to fixed bit lengths, reduction rules can also be used to bring infinite-precision polynomial math down to fixed bit lengths. As yet another non-surprise, some polynomial reduction rules are math-friendly, and others are not. Some math-friendly polynomial reduction rules are:
x8 == x4 + x3 + x + 1
x16 == x5 + x3 + x + 1
x32 == x7 + x3 + x2 + 1
x64 == x4 + x3 + x + 1
(These reduction rules come from Table of Low-Weight Binary Irreducible Polynomials, and in some sense are the simplest reduction rules for each xi
on the left hand side of the ==
.)
By choosing one of these reduction rules, we can define exotic addition and multiplication for any of uint8_t
, uint16_t
, uint32_t
, and uint64_t
. For example, exotic addition or multiplication of two uint8_t
values can be done by converting both values to polynomials, doing infinite-precision polynomial addition or multiplication, applying the reduction rule x8 == x4 + x3 + x + 1
to the polynomial (to get rid of x8
, x9
, and all other higher terms), and then converting the polynomial back to a string of 8 bits (remembering the 2 == 0
reduction applied to each bi
in the polynomial). It turns out that this exotic addition doesn't depend on the choice of polynomial reduction rule, and that the 2 == 0
reduction means that this exotic addition is actually just bitwise XOR of the original values. Unfortunately, exotic multiplication does depend on the choice of polynomial reduction rule, and for any math-friendly choice, exotic multiplication doesn't turn out to be the same as an everyday operation.
With one of these polynomial reduction rules in hand, we can define exotic addition and multiplication for each of the uint8_t
, uint16_t
, uint32_t
, and uint64_t
data types. Furthermore, assuming that the polynomial reduction rule is math-friendly, this exotic multiplication has an inv(x)
which exists for all non-zero x
. In other words, we have what mathematicians call a field. As the number of elements (28, 216, 232, and 264, respectively) is finite in every case, they are furthermore called Galois fields. This is shortened to GF, giving GF(28), GF(216), GF(232), and GF(264).
With that long introduction done, the question becomes: how can addition, multiplication, and inv
of GF(28) / GF(216) / GF(232) / GF(264) values be done efficiently on contemporary CPUs? Addition is just XOR, so the interesting questions are around multiplication and inv
.
For GF(28), inv
could be implemented as a uint8_t[256]
lookup table. Multiplication could be implemented as a uint8_t[256][256]
lookup table. This latter table is starting to get a little large, but there's a time/space tradeoff possible: analogues to log
and exp
exist in GF(2N), and both of log
and exp
can be implemented as uint8_t[256]
lookup tables. Multiplication of two values thus becomes two log
lookups, an integer addition, and an exp
lookup.
For GF(216), the same broad strokes apply, with each of inv
and log
and exp
implementable as uint16_t[65536]
lookup tables.
For GF(232) and GF(264), the lookup tables would become too big, so alternative approaches are required. Recall that multiplication consists of two major substeps: infinite-precision polynomial multiplication, and then applying a reduction rule. For the infinite-precision polynomial multiplication part, some uncommon CPU instructions come to our aid:
- On x86 / amd64,
pclmulqdq
takes two 64-bit polynomials and returns the 128-bit polynomial product. As a fun extra twist, the inputs are actually 128-bit registers, and an immediate operand can be used to select between the low/high 64-bit halves of each input. - On ARMv8-A / AArch64,
pmul
takes two 8-bit polynomials, computes the 16-bit polynomial product, and then returns the low 8 terms of that product. As a fun extra twist, this is done in SIMD fashion, either 8-wide or 16-wide. - On ARMv8-A / AArch64,
pmull
takes two 8-bit polynomials or two 64-bit polynomials and returns the 16-bit or 128-bit polynomial product. The variant with a 16-bit result is done as 8-side SIMD. The variant with a 128-bit result is equivalent topclmulqdq
(selecting the low halves of each input). It is notable thatpmull
followed byeor
is one of the few fused instruction patterns on the Apple M1. - On ARMv8-A / AArch64,
pmull2
is likepmull
, but selecting the high halves of each input. The variant with a 128-bit result is equivalent topclmulqdq
(selecting the high halves of each input).
The pclmulqdq
instruction is available as the _mm_clmulepi64_si128
intrinsic:
#include <stdint.h>
#include <wmmintrin.h>
__m128i poly_mul(uint64_t a, uint64_t b) {
return _mm_clmulepi64_si128(_mm_cvtsi64_si128(a),
_mm_cvtsi64_si128(b), 0);
}
To complete the GF(264) multiplication, this infinite-precision polynomial multiplication needs to be followed by reduction back down to 64 bits using a suitable reduction rule. With the x64 == x4 + x3 + x + 1
rule, the right hand side of the rule can be represented as the bit string (1 << 4) + (1 << 3) + (1 << 1) + (1 << 0)
, or 0x1b
for short. This rule can be applied in full by using pclmulqdq
twice:
uint64_t poly_reduce(__m128i x, uint64_t r = 0x1b) {
__m128i xr = _mm_cvtsi64_si128(r);
__m128i x2 = _mm_clmulepi64_si128(x, xr, 1);
x ^= x2;
x ^= _mm_clmulepi64_si128(x2, xr, 1);
return _mm_cvtsi128_si64(x);
}
The two pieces combine to give multiplication in GF(264):
uint64_t gf_mul(uint64_t a, uint64_t b, uint64_t r = 0x1b) {
return poly_reduce(poly_mul(a, b), r);
}
If performing a dot product rather than just one multiplication, then the reduction step can be performed just once rather than after every multiplication:
uint64_t gf_dot(const uint64_t* as, const uint64_t* bs,
uint32_t n, uint64_t r = 0x1b) {
__m128i x = _mm_setzero_si128();
for (uint32_t i = 0; i < n; ++i) {
x ^= poly_mul(as[i], bs[i]);
}
return poly_reduce(x, r);
}
Thanks to some neat mathematical properties, the inv
operation can be implemented in terms of multiplication:
uint64_t gf_inv(uint64_t x, uint64_t r = 0x1b) {
uint64_t y = x = gf_mul(x, x, r);
for (int i = 2; i < 64; ++i) {
x = gf_mul(x, x, r);
y = gf_mul(x, y, r);
}
return y;
}
For GF(232), things are actually slightly harder, as shifts by 32 bits are required in the reduction step rather than shifts by 64 bits, and pclmulqdq
/ pmull2
can only do a "free" shift by 64 bits, so explicit shift/shuffle instructions are required. One interesting option is to use the x86 crc32
instruction, as the reduction x32 == x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 + x18 + x14 + x13 + x11 + x10 + x9 + x8 + x6 + 1
is performed within this instruction, but this forces you into a particular reduction rule, and also subjects you to the other steps performed within crc32
, namely some shifts and bit reflections.
Going back down to GF(28), very recent Intel chips have added three very relevant instructions under the name of GFNI (Galois Field New Instructions). The first of these is gf2p8mulb
, which takes two uint8_t
values, performs infinite-precision polynomial multiplication, and then does reduction using the x8 == x4 + x3 + x + 1
rule to return a uint8_t
value. This is available at any SIMD width, i.e. 16 bytes at a time in xmm
registers, 32 bytes at a time in ymm
registers (AVX), or 64 bytes at a time in zmm
registers (AVX512). The other two instructions are gf2p8affineqb
and gf2p8affineinvqb
, which both follow the same sequence of steps:
- Take some
uint8_t
value as input. - Optionally perform the
inv
operation on theuint8_t
(with reference tox8 == x4 + x3 + x + 1
).gf2p8affineinvqb
performs this step, whereasgf2p8affineqb
skips it. - Form a new
uint8_t
where every bit of the outputuint8_t
is the horizontal XOR of any subset of bits from the inputuint8_t
. - XOR the
uint8_t
with an eight-bit immediate.
Step 3 requires a uint64_t
operand, as there are eight output bits, and each of those requires an eight-bit mask to specify the input subset. Both of gf2p8affineqb
and gf2p8affineinvqb
are available at any SIMD width, i.e. 16/32/64 bytes at a time. The second input to the instruction is another vector of the same width; it specifies the operands for step 3, but as this operand is uint64_t
rather than uint8_t
, the same operand is used for each group of eight bytes. If just the inv
from step 2 is desired, then step 3 can be disabled by specifying that output bit i is formed from just input bit i, and step 4 can be disabled by specifying a constant of zero.
(As an aside, step 3 on its own allows for all sorts of bit tricks. Amongst other things, the bits within an uint8_t
can be permuted or shifted in any fashion, including reversal or rotation.)
One final case to consider is multiplication by a constant in GF(28). That is, multiplying a
with c
for lots of different values of a
, where both a
and c
are uint8_t
. This extends to doing dot products where one of the vectors is constant, or doing matrix/vector multiplication where the matrix is constant. One option would be to replace c
with a uint8_t[256]
table, at which point multiplication against c
is just a table lookup. A related observation is that gf_mul(a, c) == gf_mul(a & 0xf0, c) ^ gf_mul(a & 0x0f, c)
, and both of the multiplications on the right only have 16 possible inputs for the a & 0xf0
and a & 0x0f
terms. This means that they can both be implemented with a uint8_t[16]
table, bringing the memory requirement down from uint8_t[256]
to uint8_t[16+16]
. At this smaller size, the uint8_t[16]
tables can exist in SIMD registers rather than in memory, and table lookup becomes a pshufb
(amd64) or tbl
(AArch64) instruction rather than a memory load. Even better, multiple such table lookups can be performed in parallel for any SIMD width. This is the approach used by Intel's ISA-L EC routines.