x86 macro-op fusion notes
For very many years now, x86 and x64 processors have been able to take an arithmetic instruction immediately followed by a conditional jump instruction, and execute the two instructions as if they were a single instruction (which is called "macro-op fusion"). When writing very hot loops in assembly code by hand, this can effectively give you an extra instruction for free.
In Intel chips, fusion first appeared in the original Core microarchitecture, launched in 2006. An improved version of fusion made its debut in the Nehalem microarchitecture (Nhlm) a few years later in 2008. The Sandy Bridge microarchitecture (SnBr) released in 2011 contained yet more improvements to fusion. Things remained fairly stable through until Cannon Lake (CnLk), but then Ice Lake in 2019 removed support for a few cases of fusion (INC
and DEC
instructions, and instructions with memory operands).
In AMD chips, fusion first appeared in the Bulldozer microarchitecture (Bdzr), launched in 2011. It supported fusing TEST
and CMP
instructions with all flag-based conditional jumps. Things remained fairly stable through until Zen 3 in 2020, which added support for a number of other instructions.
The table below summarises which instructions can fuse with which jumps on which microarchitectures. Combinations which are supported, but which are pointless in practice, are parenthesised.
TEST | AND | OR XOR | CMP | ADD SUB | INC DEC | |
---|---|---|---|---|---|---|
J[N]O | (Core+) | (SnBr+) | - | - | - | - |
J[N]S | Core+ | SnBr+ | - | - | - | - |
J[N]P JP[EO] | Core+ | SnBr+ | - | - | - | - |
J[N]B J[N]AE J[N]C | (Core+) | (SnBr+) | - | Core+ | SnBr+ | - |
J[N]BE J[N]A | (Core+) | (SnBr+) | - | Core+ | SnBr+ | - |
J[N]Z J[N]E | Core+ | SnBr+ | - | Core+ | SnBr+ | SnBr-CnLk |
J[N]L J[N]GE | (Core+) | (SnBr+) | - | Nhlm+ | SnBr+ | SnBr-CnLk |
J[N]G J[N]LE | Core+ | SnBr+ | - | Nhlm+ | SnBr+ | SnBr-CnLk |
TEST | AND | OR XOR | CMP | ADD SUB | INC DEC | |
J[N]O | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | Zen3+ |
J[N]S | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |
J[N]P JP[EO] | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |
J[N]B J[N]AE J[N]C | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | (Zen3+) |
J[N]BE J[N]A | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | (Zen3+) |
J[N]Z J[N]E | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |
J[N]L J[N]GE | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | Zen3+ |
J[N]G J[N]LE | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |
ADD
.
Jumps which depend only upon the result bits (for
TEST
, the result is what AND
would give, and for CMP
, the result is what SUB
would give):
Description | Flags | Fusion (SnBr) | |
---|---|---|---|
JS | Jump if sign | SF = 1 | TEST AND |
JNS | Jump if not sign | SF = 0 | TEST AND |
JP JPE | Jump if parity Jump if parity even | PF = 1 | TEST AND |
JNP JPO | Jump if not parity Jump if parity odd | PF = 0 | TEST AND |
JZ JE | Jump if zero Jump if equal | ZF = 1 | TEST AND INC DEC CMP ADD SUB |
JNZ JNE | Jump if not zero Jump if not equal | ZF = 0 | TEST AND INC DEC CMP ADD SUB |
SF
is a copy of the most significant bit of the result (which for conceptually signed results, indicates a result less than zero), PF
is set based on the population count of the low 8 bits of the result (PF = 1
implies that said population count was even), and ZF
is set if all bits of the result are zero (which for CMP
and SUB
implies that the two operands were equal).
Jumps which conceptually operate on unsigned operands:
Description | Flags | Fusion (SnBr) | |
---|---|---|---|
JB JNAE JC | Jump if below Jump if not above or equal Jump if carry | CF = 1 | CMP ADD SUB |
JNB JAE JNC | Jump if not below Jump if above or equal Jump if not carry | CF = 0 | CMP ADD SUB |
JBE JNA | Jump if below or equal Jump if not above | CF = 1 or ZF = 1 | CMP ADD SUB |
JNBE JA | Jump if not below or equal Jump if above | CF = 0 and ZF = 0 | CMP ADD SUB |
CF
is the (N+1)th bit. Fusion with TEST
and AND
is technically possible, but CF
is always zero following these instructions, making it pointless. The "above" and "below" descriptions are from the perspective of CMP
or SUB
with unsigned operands. To extend the description to ADD
, consider the instruction to be taking two N-bit unsigned operands, zero-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "above" and "below" are whether this (N+1)-bit signed result is above or below zero.
Jumps which conceptually operate on signed operands:
Description | Flags | Fusion (SnBr) | |
---|---|---|---|
JO | Jump if overflow | OF = 1 | - |
JNO | Jump if not overflow | OF = 0 | - |
JL JNGE | Jump if less Jump if not greater or equal | SF ≠ OF | INC DEC CMP ADD SUB |
JNL JGE | Jump if not less Jump if greater or equal | SF = OF | INC DEC CMP ADD SUB |
JNG JLE | Jump if not greater Jump if less or equal | SF ≠ OF or ZF = 1 | TEST AND INC DEC CMP ADD SUB |
JG JNLE | Jump if greater Jump if not less or equal | SF = OF and ZF = 0 | TEST AND INC DEC CMP ADD SUB |
1
for INC
and DEC
), sign extend both to (N+1)-bits, compute an (N+1)-bit signed result, then OF
is set to the exclusive-or of the two most signicant bits of those (N+1). This means that SF ≠ OF
corresponds to the most significant bit of this signed (N+1)-bit result (in the same way that CF
corresponds to the most significant bit of the (N+1)-bit result when the operands are zero-extended). In other words, OF
is set if sign-extending the N-bit result to (N+1)-bits gives a different result to sign-extending the operands to (N+1)-bits and then performing (N+1)-bit arithmetic. Fusion with TEST
and AND
is technically possible for all these jumps, but OF
is always zero following TEST
or AND
, making many combinations pointless. The "greater" and "less" descriptions are from the perspective of CMP
or SUB
with signed operands. To extend the description to ADD
, consider the instruction to be taking two N-bit signed operands, sign-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "greater" and "less" are whether this (N+1)-bit signed result is greater or less than zero. The INC
and DEC
instructions are equivalent to ADD
and SUB
with the 2nd operand being 1
(except for INC
and DEC
not touching CF
).
In the context of optimising hot loops, it is common to be iterating over the half-open range
[0, N)
, by some step S
. Given all the references above to comparing to zero, it is often beneficial to transform this range to [-N, 0)
, as then the loop-control instructions can be ADD reg, S
followed by a conditional jump, rather than ADD reg, S
followed by CMP reg, N
followed by a conditional jump. In the case of fusing ADD reg, S
with a conditional jump, if N
is known to be non-zero, then the conditional jump can be JA
or JAE
(or synonyms thereof), relying on CF = 1
occuring when the counter transitions from negative to non-negative. If N
might zero, and a single iteration of the loop is desired (or harmless) even when N
is zero, then the conditional jump cannot be JA
or JAE
, but instead can be JL
(or a synonym thereof), relying on sign extension of both operands to (N+1) bits and then jumping if the (N+1)-bit result is less than zero.