Last time, we had a need to compute mod_P(A * x32) where A was a polynomial with degree at most 31, and mod_P took the remainder after dividing by a polynomial P with degree 32. The traditional Barrett reduction approach involved finding the quotient and then calculating the remainder from that:

Q = ((A * x32) * (x63 // P)) // x63
R =  (A * x32) - Q * P
return R

Via a combination of the * x32, and the degree of P being 32, and the context being Z2 polynomials, this simplified to:

Q = ((A * x32) * (x63 // P)) // x63
return (Q * P)[0:32]

The resultant C code (for bit-reversed polynomials) was very short:

uint32_t mul_x32_mod_P(uint32_t A) {
    __m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi32_si128(A);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}

Though our particular case permitted this simplification, it is not permitted in general. In Faster Remainder by Direct Computation (D. Lemire, O. Kaser, and N. Kurz, 2018), a different approach was presented: the remainder can be computed directly, rather than computing the quotient and getting the remainder from that. Said paper is about integers, and its main Theorem 1 is effectively:

Provided that d in [1, 2N) and n in [0, 2N) and F ≥ N + log2(d), then:
n % d equals (((n * ceil(2F / d)) % 2F) * d) // 2F.

Translated from the language of integers to the language of polynomials, this theorem would be:

Provided that S, P, T are polynomials satisfying deg((S * P) // T) ≤ 0, then:
S % P equals (((S * (T // P)) % T) * P) // T.

This lets us turn % P into multiplication by T // P followed by % T followed by * P followed by // T. The % T and // T steps are trivial if T is chosen to be xn (with n ≥ deg(S) + deg(P) to make the precondition hold), and T // P can be precomputed if P is fixed.

It turns out that the translated theorem is true, and that the proof isn't too hard. The proof builds upon the Barrett reduction property, which is:

Provided that deg(S // T) ≤ 0, S // P equals (S * (T // P)) // T.

We start with the complicated side of our (supposed) equality:

(((S * (T // P)) % T) * P) // T

We're going to be overwhelmed by parentheses if we're not careful, so I'm going to say that a sequence of mixed * and % and // operations always happens left-to-right, thus allowing some of the parentheses to be dropped:

S * (T // P) % T * P // T

From here, the % can be expanded: (F % G equals F - (F // G * G))

((S * (T // P)) - (S * (T // P) // T * T)) * P // T

Then push the - to the top level:

(S * (T // P) * P // T) - (S * (T // P) // T * T * P // T)

Making use of our deg((S * P) // T) ≤ 0 precondition, by the Barrett reduction property, (S * P) // P equals (S * P) * (T // P) // T:

(S * P // P) - (S * (T // P) // T * T * P // T)

We can then simplify away * G // G for any G:

S - (S * (T // P) // T * P)

Our deg((S * P) // T) ≤ 0 precondition implies deg(S // T) ≤ 0, so by the Barrett reduction property, S // P equals S * (T // P) // T:

S - (S // P * P)

Which is the definition of:

S % P

Thus our supposed equality is an actual equality.