An alternative exposition of crc32_4k_pclmulqdq
A recent post presented a faster way of computing CRC32C on x86. Part of that was crc32_4k_pclmulqdq, based on Intel's whitepaper and Chromium's implementation thereof. The Intel whitepaper is the traditional exposition for this crc32_4k_pclmulqdq function, but I don't like the exposition therein, as I think it spends too much time on some obvious things, and not nearly enough time on some non-obvious things. I present an alternative exposition here.
We begin by thinking about CPU-native data types. The types u32 and i32 should be familiar to most programmers, being 32-bit unsigned integers and 32-bit two's complement signed integers respectively. Two-operand addition can be defined for both of them: u32 + u32 ↦ u32 and i32 + i32 ↦ i32. The neat thing is that the hardware circuit for both types of addition is the same, and is what we call the CPU add instruction. On 64-bit CPUs, these 32-bit types live in the least significant bits of 64-bit registers. I'll use u32_z and i32_z for such things, where the _z suffix denotes zero-extension of the 32-bit value to the 64-bit storage. Note that value-preserving conversion from u32_z to u64 is a no-op, whereas value-preserving conversion from i32_z to i64 requires sign extension (movsx instruction on x86, sxtw instruction on ARM).
Some less common CPU-native data types are p32 and r32: polynomials over Z2 of degree at most 31 (p32), and bit-reversed versions thereof (r32). As with u and i types, these p and r types exist at a variety of bit-widths, and (at times) live in the least significant bits of wider registers. The rbit instruction on ARM is a value-preserving conversion p32 ⇆ r32 (x86 has no such thing).
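To make the r types concrete, here is a tiny illustrative sketch in plain Python (rev32 is a hypothetical helper, not anything from the instruction sets), showing that r32 is just p32 with the bit order reversed, and that xor implements polynomial + in either representation:

def rev32(v):
    return int('{:032b}'.format(v)[::-1], 2)

a_p, b_p = 0b1011, 0b0110          # x^3 + x + 1 and x^2 + x, as p32
a_r, b_r = rev32(a_p), rev32(b_p)  # the same two polynomials, as r32
assert rev32(a_p ^ b_p) == a_r ^ b_r  # xor is polynomial + in both representations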
The ARM crc32cb instruction, and the matching overload of the x86 crc32 instruction, have type signature r32, r8 ↦ r32_z and implementation:

def crc32cb(acc, val):
    val = val * x^32
    acc = acc * x^8 + val
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1
    q, r = floor_divide(acc, P)
    return r
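As a cross-check, crc32cb can be modelled with plain Python integers in the r (bit-reversed) representation, where a right-shift by one bit is a multiplication by x. The constant 0x82F63B78 is the bit-reversal of P with its x^32 term dropped; the helper names here are mine, not anything from the whitepaper:

POLY_R = 0x82F63B78  # rev32 of P, x^32 term dropped

def crc32cb_model(acc, val):  # r32, r8 -> r32
    acc ^= val                # add val * x^24 (the byte lands in the low 8 bits of the r32)
    for _ in range(8):        # multiply by x^8, reducing mod P one bit at a time
        acc = (acc >> 1) ^ (POLY_R if acc & 1 else 0)
    return acc

# Sanity check: the standard CRC-32C of "123456789" (the initial value and final xor of
# 0xFFFFFFFF are part of the CRC-32C convention, not of the instruction itself).
acc = 0xFFFFFFFF
for b in b"123456789":
    acc = crc32cb_model(acc, b)
assert acc ^ 0xFFFFFFFF == 0xE3069283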
To avoid some future repetition, we can pull out the tail of crc32cb and call it mod_P (with type signature rN ↦ r32 for any N):

def crc32cb(acc, val):
    val = val * x^32
    acc = acc * x^8 + val
    return mod_P(acc)

def mod_P(acc):
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1
    q, r = floor_divide(acc, P)
    return r
The ARM crc32ch instruction, and the matching overload of the x86 crc32 instruction, have type signature r32, r16 ↦ r32_z and implementation:

def crc32ch(acc, val):
    q, r = floor_divide(val, x^8)
    acc = crc32cb(acc, q)
    acc = crc32cb(acc, r)
    return acc

A totally equivalent implementation is the following, which is identical to the crc32cb implementation except for one number:

def crc32ch(acc, val):
    val = val * x^32
    acc = acc * x^16 + val # The x^8 on this line has changed to x^16
    return mod_P(acc)
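The equivalence of these two crc32ch implementations can be checked numerically with the same bit-reversed integer model as above (a sketch; note that the quotient of the floor_divide by x^8 is the low byte of the r16 value, and the remainder is the high byte):

import random

POLY_R = 0x82F63B78  # rev32 of P, x^32 term dropped

def step(acc, val, nbits):  # mod_P(acc * x^nbits + val * x^32), val of width nbits
    acc ^= val
    for _ in range(nbits):
        acc = (acc >> 1) ^ (POLY_R if acc & 1 else 0)
    return acc

def crc32ch_via_bytes(acc, val):   # first implementation: two crc32cb calls
    return step(step(acc, val & 0xFF, 8), val >> 8, 8)

def crc32ch_direct(acc, val):      # second implementation: acc * x^16 + val * x^32
    return step(acc, val, 16)

for _ in range(1000):
    a, v = random.getrandbits(32), random.getrandbits(16)
    assert crc32ch_via_bytes(a, v) == crc32ch_direct(a, v)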
The crc32cw (r32, r32 ↦ r32_z) and crc32cx (r32, r64 ↦ r32_z) instructions can be similarly implemented either as two calls to a lower-width implementation, or as crc32cb with a different scaling factor when merging acc with val. In all cases, the equivalence of the two implementations arises due to the nature of mod_P: so long as mod_P is done once at the end, it can be done any number of times at any point in the preceding computation. It is useful to do so regularly in order to bound the size of acc, but it doesn't need to be done. In the extreme, if acc could grow without bound, a valid multi-byte implementation (r32, r8[] ↦ r32_z) would be:

def crc32c_bytes(acc, bytes):
    for val in bytes:
        acc = acc * x^8 + val * x^32
    return mod_P(acc)
Something similar operating 64 bits at a time rather than 8 bits at a time (r32, r64[] ↦ r32_z) would be:

def crc32c_qwords(acc, qwords):
    for val in qwords:
        acc = acc * x^64 + val * x^32
    return mod_P(acc)
At this width, the math can be fiddled around with slightly to do * x^32 once at the end of the loop rather than in every loop iteration:

def crc32c_qwords(acc, qwords):
    acc = acc * x^32 + qwords[0]
    for i in range(1, len(qwords)):
        acc = acc * x^64 + qwords[i]
    acc = acc * x^32
    return mod_P(acc)
In preparation for SIMD, we can go wider still, and operate on 128 bits at a time (r32, r128[] ↦ r32_z):

def crc32c_xwords(acc, xwords):
    acc = acc * x^96 + xwords[0]
    for i in range(1, len(xwords)):
        acc = acc * x^128 + xwords[i]
    acc = acc * x^32
    return mod_P(acc)
Furthermore, multiple accumulators could be used (this being a standard latency-hiding trick):
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = acc0 * x^256 + xwords[i + 0]
        acc1 = acc1 * x^256 + xwords[i + 1]
    acc = acc0 * x^128 + acc1
    acc = acc * x^32
    return mod_P(acc)
This is starting to look very similar to crc32_4k_pclmulqdq, albeit we've got a way to go still: having even more accumulators, implementing polynomial + and *, preventing the accumulators from getting too big, and implementing the final mod_P.

To prevent the accumulators from getting too big, we can insert some extra mod_P calls (remembering that these can be freely inserted at any point in the computation):
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mod_P(acc0 * x^256)
        acc1 = mod_P(acc1 * x^256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = acc0 * x^128 + acc1
    acc = acc * x^32
    return mod_P(acc)
This solves the accumulator problem, albeit at the cost of introducing the problem of implementing the extra mod_P calls. That problem is slightly fiddly, so we'll come back to it later, and instead address polynomial + and * next.
For small N, every CPU implements Z2 polynomial + with signature pN_z + pN_z ↦ pN_z — it is the xor instruction on x86, and the eor instruction on ARM. Every CPU also implements Z2 polynomial + with signature rN_z + rN_z ↦ rN_z — it is also the xor instruction (in the same way that the add instruction is both u32 and i32). Modern CPUs implement Z2 polynomial * under the name of pclmulqdq (x86) or pmull / pmull2 (ARM). The kernel of this instruction is normally seen as p64 * p64 ↦ p127_z, though it can equally be seen as r64 * r64 ↦ r127_z. Value-preserving conversion from p127_z to p128 is a no-op (like going from u32_z to u64), whereas value-preserving conversion from r127_z to r128 requires a left-shift by one (like going from i32_z to i64 requires a sign extension). When operating on r128, a right-shift by one is multiplication by x^1. Doing both shifts in sequence results in no shift at all, leaving just a multiplication by x^1. For N ≤ 64 and M ≤ 64, the kernel can alternatively be seen as pN_z * pM_z ↦ p(N+M-1)_z or as rN_z * rM_z ↦ r(N+M-1)_z. As the output of the kernel can have up to 127 bits, it goes into a 128-bit SIMD register. This causes the two inputs to the kernel to also come from 128-bit SIMD registers, and thus causes the type signature of the overall instruction to be p128, p128 ↦ p127_z or r128, r128 ↦ r127_z. To get the inputs down from p128 to p64 (or from r128 to r64), a floor_divide by x^64 is done, and then either the quotient or the remainder is used as the 64-bit input to the kernel. On x86, an immediate argument to pclmulqdq controls whether to take the quotient or the remainder of each input. On ARM, there is slightly less flexibility: either both inputs have their remainder taken or both have their quotient taken. In practice, the decreased flexibility is rarely a problem.
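The claim that the same 64 x 64 ↦ 127-bit kernel serves both the p view and the r view can be checked numerically. In this sketch, clmul and rev are stand-in helpers of my own; the final assert is exactly the "left-shift by one" conversion described above:

import random

def clmul(a, b):   # carry-less multiplication of two bit-vectors
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    return r

def rev(v, n):     # reverse the low n bits of v
    return int(format(v, '0{}b'.format(n))[::-1], 2)

for _ in range(100):
    a, b = random.getrandbits(64), random.getrandbits(64)
    p = clmul(a, b)
    assert p.bit_length() <= 127                         # p64 * p64 -> p127_z
    assert clmul(rev(a, 64), rev(b, 64)) == rev(p, 127)  # r64 * r64 -> r127_z
    # Read as a 128-bit r value, the raw result still needs a left-shift by one:
    assert rev(p, 128) == rev(p, 127) << 1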
With this knowledge of how to do polynomial + and *, we can return to the question of how to do acc0 = mod_P(acc0 * x^256). The type of acc0 is going to be r128 in order to match the SIMD register width, and a floor_divide by x^64 can split it into two r64 pieces:

q, r = floor_divide(acc0, x^64)
assert acc0 == q * x^64 + r
We can then do some arbitrary-looking algebraic manipulations of the expression we want:
mod_P(acc0 * x^256)
== mod_P((q * x^64 + r) * x^256)
== mod_P(q * x^(256+64) + r * x^256)
== mod_P(q * x^(256+64-33) * x^33 + r * x^(256-33) * x^33)
The next trick is where the magic happens: because of the mod_P call at the end of crc32c_xwords, we're free to insert or remove mod_P calls anywhere else. We previously used this freedom to insert some extra mod_P calls, and we're going to use it again now to remove those mod_P calls and insert new ones elsewhere:

mod_P(q * x^(256+64-33) * x^33 + r * x^(256-33) * x^33)
-> q * mod_P(x^(256+64-33)) * x^33 + r * mod_P(x^(256-33)) * x^33
Note that these two expressions give different values (hence -> rather than ==), but this difference evaporates away after the mod_P call at the end of crc32c_xwords. This new expression is very convenient to compute: q has type r64_z, and mod_P(x^(256+64-33)) is a constant with type r32_z, so their product has type r95_z. A value-preserving conversion from r95_z to r128 involves a left-shift by 33 bits, and then * x^33 is a right-shift by 33 bits. These two shifts cancel each other out, meaning that we effectively get the * x^33 for free. In other words, a single CPU instruction can obtain q from acc0 and compute q * mod_P(x^(256+64-33)) * x^33. Similarly, a single CPU instruction can obtain r from acc0 and compute r * mod_P(x^(256-33)) * x^33. Adding the two terms is another CPU instruction - pxor on x86, or eor on ARM (see also eor3 on ARM, and also note that the M1 can fuse pmull with eor, albeit the fused operation still takes two retire slots). We can wrap this expression up in a function, call it mul_xN_mod_P, and make use of it in crc32c_xwords:

def mul_xN_mod_P(acc, N): # r128, int ↦ r128
    q, r = floor_divide(acc, x^64)
    return q * mod_P(x^(N+64-33)) * x^33 + r * mod_P(x^(N-33)) * x^33

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mul_xN_mod_P(acc0, 256)
        acc1 = mul_xN_mod_P(acc1, 256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = acc0 * x^128 + acc1
    acc = acc * x^32
    return mod_P(acc)
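It's worth checking the algebra behind mul_xN_mod_P numerically. The sketch below works in the plain (non-reversed) polynomial domain, with Python integers as bit-vectors and hypothetical clmul / mod_P helpers; the claim being checked is that the result is congruent to acc * x^N modulo P:

import random

P = 0x11EDC6F41   # x^32 + x^28 + ... + 1

def clmul(a, b):
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    return r

def mod_P(a):
    while a.bit_length() >= 33:
        a ^= P << (a.bit_length() - 33)
    return a

def mul_xN_mod_P(acc, n):   # p-domain model of the function above
    q, r = acc >> 64, acc & ((1 << 64) - 1)
    return (clmul(q, mod_P(1 << (n + 64 - 33))) << 33) ^ (clmul(r, mod_P(1 << (n - 33))) << 33)

for _ in range(100):
    acc = random.getrandbits(128)
    assert mod_P(mul_xN_mod_P(acc, 256)) == mod_P(acc << 256)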
This mul_xN_mod_P is also useful for doing acc = acc0 * x^128 + acc1 at the end:

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mul_xN_mod_P(acc0, 256)
        acc1 = mul_xN_mod_P(acc1, 256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = mul_xN_mod_P(acc0, 128) + acc1
    return mod_P(acc * x^32)
At this point, only three problems remain: having even more accumulators, precomputing the various mod_P(x^M) constants, and computing the final mod_P(acc * x^32).
The optimal number of accumulators depends on the throughput and latency of the CPU's polynomial + and * operations. + is typically latency 1 and throughput better than or equal to the throughput of *. On recent Intel chips, * (i.e. pclmulqdq) has a latency of between 5 and 7 cycles, and a throughput of 1 per cycle. On the M1, * (i.e. pmull / pmull2) has a latency of 3 cycles, and a throughput of 4 per cycle (these numbers are incredible compared to Intel's). We need two * and one + for each mul_xN_mod_P call. This makes 4 (or possibly 5) accumulators optimal on Intel, and 12 accumulators optimal on M1. A version with four accumulators looks like:
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    acc2 = xwords[2]
    acc3 = xwords[3]
    for i in range(4, len(xwords), 4):
        acc0 = mul_xN_mod_P(acc0, 128*4)
        acc1 = mul_xN_mod_P(acc1, 128*4)
        acc2 = mul_xN_mod_P(acc2, 128*4)
        acc3 = mul_xN_mod_P(acc3, 128*4)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
        acc2 += xwords[i + 2]
        acc3 += xwords[i + 3]
    acc0 = mul_xN_mod_P(acc0, 128) + acc1
    acc2 = mul_xN_mod_P(acc2, 128) + acc3
    acc = mul_xN_mod_P(acc0, 128*2) + acc2
    return mod_P(acc * x^32)
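As a sanity check on the whole structure, the four-accumulator routine can be modelled end to end in the plain polynomial domain and compared against the direct definition. Everything below (clmul, mod_P, the test harness) is an illustrative sketch of mine rather than part of the real SIMD implementation:

import random

P = 0x11EDC6F41

def clmul(a, b):
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    return r

def mod_P(a):
    while a.bit_length() >= 33:
        a ^= P << (a.bit_length() - 33)
    return a

def mul_xN_mod_P(acc, n):
    q, r = acc >> 64, acc & ((1 << 64) - 1)
    return (clmul(q, mod_P(1 << (n + 64 - 33))) << 33) ^ (clmul(r, mod_P(1 << (n - 33))) << 33)

def crc32c_xwords(acc, xwords):        # the four-accumulator version above
    acc0 = xwords[0] ^ (acc << 96)
    acc1, acc2, acc3 = xwords[1], xwords[2], xwords[3]
    for i in range(4, len(xwords), 4):
        acc0 = mul_xN_mod_P(acc0, 512) ^ xwords[i + 0]
        acc1 = mul_xN_mod_P(acc1, 512) ^ xwords[i + 1]
        acc2 = mul_xN_mod_P(acc2, 512) ^ xwords[i + 2]
        acc3 = mul_xN_mod_P(acc3, 512) ^ xwords[i + 3]
    acc0 = mul_xN_mod_P(acc0, 128) ^ acc1
    acc2 = mul_xN_mod_P(acc2, 128) ^ acc3
    acc = mul_xN_mod_P(acc0, 256) ^ acc2
    return mod_P(acc << 32)

def crc32c_xwords_ref(acc, xwords):    # direct definition: one big polynomial, one mod_P
    acc = (acc << 96) ^ xwords[0]
    for w in xwords[1:]:
        acc = (acc << 128) ^ w
    return mod_P(acc << 32)

for _ in range(20):
    acc = random.getrandbits(32)
    xwords = [random.getrandbits(128) for _ in range(16)]
    assert crc32c_xwords(acc, xwords) == crc32c_xwords_ref(acc, xwords)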
If we have a hardware crc32c instruction, then the remaining two problems are easy. mod_P(acc * x^32) is just crc32c(0, acc). In our case, acc is r128, and the widest hardware crc32c instruction only takes r64, but following the pattern from the start of this post, a wide CRC can be constructed from two narrow CRCs:

def crc32c_128(acc, val): # r32, r128 ↦ r32_z
    q, r = floor_divide(val, x^64)
    acc = crc32c_64(acc, q)
    acc = crc32c_64(acc, r)
    return acc
Meanwhile, mod_P(x^M) is just acc = x^(M % 32) followed by M//32 iterations of acc = mod_P(acc * x^32), and the latter is just a CRC instruction. In concrete C code, this is just:
uint32_t magic(uint64_t M) { // int ↦ r32_z
    uint32_t acc = ((uint32_t)0x80000000) >> (M & 31); // x^(M % 32) in r32 form
    for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0); // acc = mod_P(acc * x^32)
    return acc;
}
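The same constants can also be computed directly in the polynomial domain, which makes for a useful cross-check on magic (a sketch; rev32 and the reduction are spelled out here rather than taken from anywhere):

P = 0x11EDC6F41

def rev32(v):
    return int('{:032b}'.format(v)[::-1], 2)

def mod_P(a):
    while a.bit_length() >= 33:
        a ^= P << (a.bit_length() - 33)
    return a

def magic_ref(M):       # the r32 form of mod_P(x^M)
    return rev32(mod_P(1 << M))

# x^0 and x^31 need no reduction, matching magic(0) and magic(31):
assert magic_ref(0) == 0x80000000
assert magic_ref(31) == 0x00000001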
For bonus points, note that magic can be implemented to run in O(log(M)) time rather than O(M) time:
uint32_t magic(uint64_t M) {
    uint64_t stack = ~(uint64_t)1;            // sentinel, then the (inverted) bits peeled off M
    for (; M > 191; M = (M >> 1) - 16) {      // halve M; the -16 accounts for the x^32 that each
        stack = (stack << 1) + (M & 1);       // later squaring step folds in via _mm_crc32_u64
    }
    stack = ~stack;
    uint32_t acc = ((uint32_t)0x80000000) >> (M & 31); // x^(M % 32) in r32 form
    for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0); // acc = mod_P(acc * x^32)
    for (;;) {
        uint32_t low = stack & 1;
        if (!(stack >>= 1)) return acc;       // hit the sentinel: done
        __m128i x = _mm_cvtsi32_si128(acc);
        uint64_t y = _mm_cvtsi128_si64(_mm_clmulepi64_si128(x, x, 0)); // square acc
        acc = _mm_crc32_u64(0, y << low);     // mod_P(acc^2 * x^32) or mod_P(acc^2 * x^33), per the peeled bit
    }
}
For more bonus points, the ARM translation of this is:
uint32_t magic(uint64_t M) {
    uint64_t stack = ~(uint64_t)1;
    for (; M > 191; M = (M >> 1) - 16) {
        stack = (stack << 1) + (M & 1);
    }
    stack = ~stack;
    uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
    for (M >>= 5; M; --M) acc = __crc32cw(acc, 0);
    for (;;) {
        uint32_t low = stack & 1;
        if (!(stack >>= 1)) return acc;
        // Squaring is additive over Z2, so squaring each byte lane with an 8-bit
        // widening polynomial multiply squares the whole 32-bit value.
        poly8x8_t x = vreinterpret_p8_u64(vmov_n_u64(acc));
        uint64_t y = vgetq_lane_u64(vreinterpretq_u64_p16(vmull_p8(x, x)), 0);
        acc = __crc32cd(0, y << low);
    }
}
If we do not have a hardware crc32c instruction, or we want to change P to something which lacks a hardware instruction, then things are harder, but not insurmountable. All we have to do is build the equivalent of the 32-bit CRC instruction:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    return mod_P(acc)

Expanding the mod_P gets us to:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1
    q, r = floor_divide(acc, P)
    return r
We can compute q by Barrett reduction, and then use that to compute r:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1
    q = (acc * (x^63 // P)) // x^63 # Anything ≥ 63 works
    r = acc - q * P
    return r
As we're operating in Z2, the - in r = acc - q * P can be replaced by +. And as deg(r) < 32 while every term of acc has degree ≥ 32 (acc is a multiple of x^32), the acc in r = acc - q * P contributes nothing below degree 32, so r can instead be obtained by term-slicing q * P. After doing both these things, we're left with:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1
    q = (acc * (x^63 // P)) // x^63 # Anything ≥ 63 works
    return (q * P)[0:32]
As concrete C code, this is:
uint32_t crc32cw(uint32_t acc, uint32_t val) {
    __m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi32_si128(acc ^ val);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}
Where the magic numbers in k come from:
def floor_divide(S, P):
    Q = 0
    while S:
        d = S.bit_length() - P.bit_length()
        if d < 0:
            break
        Q ^= 1 << d
        S ^= P << d
    return Q, S

def rev32(F):
    G = 0
    for _ in range(32):
        G = (G << 1) + (F & 1)
        F >>= 1
    assert F == 0
    return G
P = 0x11EDC6F41
Q, _ = floor_divide(1 << 63, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)
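As a final sanity check on the Barrett step, the pseudocode form of crc32cw can be compared against a bit-at-a-time reference, reusing the floor_divide and rev32 helpers above plus a hypothetical clmul; the p-domain result is bit-reversed before comparing against the r-domain reference:

import random

def clmul(a, b):
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    return r

def crc32cw_barrett(acc, val):   # p-domain model of the pseudocode above
    a = (acc ^ val) << 32                                  # (acc + val) * x^32
    q, _ = floor_divide(clmul(a, floor_divide(1 << 63, P)[0]), 1 << 63)
    return clmul(q, P) & 0xFFFFFFFF                        # term-slice the low 32 coefficients

def crc32cw_ref(acc, val):       # r-domain reference, same behaviour as the hardware instruction
    acc ^= val
    for _ in range(32):
        acc = (acc >> 1) ^ (0x82F63B78 if acc & 1 else 0)  # 0x82F63B78 = rev32 of P without x^32
    return acc

for _ in range(1000):
    a, v = random.getrandbits(32), random.getrandbits(32)
    assert rev32(crc32cw_barrett(rev32(a), rev32(v))) == crc32cw_ref(a, v)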
We can then build a 64-bit CRC function as two calls to crc32cw, or directly:

uint32_t crc32cx(uint32_t acc, uint64_t val) {
    __m128i k = _mm_setr_epi32(0xdea713f1, 0x4869ec38, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi64_si128(acc ^ val);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}
Where the magic numbers in k come from:
P = 0x11EDC6F41
Q, _ = floor_divide(1 << 95, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)
For bonus points, the zlib CRC equivalent (P = 0x104C11DB7) is the following:

uint32_t crc32x(uint32_t acc, uint64_t val) {
    __m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
    __m128i a = _mm_cvtsi64_si128(acc ^ val);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}
This can be plugged into Chromium's implementation to give a more succinct and slightly faster tail; everything under "Fold 128-bits to 64-bits" and "Barret reduce to 32-bits" and "Return the crc32" can be replaced by:
__m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
__m128i b = _mm_clmulepi64_si128(x1, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
c = _mm_blend_epi16(c, _mm_setzero_si128(), 0xcf); // or: c = _mm_and_si128(c, _mm_setr_epi32(0, 0, ~0, 0));
__m128i a = _mm_xor_si128(c, x1);
b = _mm_clmulepi64_si128(a, k, 0x01);
c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);