What do compilers do with the CPython interpreter main loop?
Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: it is hard to compile it perfectly. The difficulty of compilation scales with the number of opcodes which the interpreter has, which in the case of CPython is more than 100, but we can actually get a feel for how well a compiler is doing by looking at just one opcode.
For this exercise I'll look at CPython 3.6's LOAD_FAST opcode, which in C is:
TARGET(LOAD_FAST) {
    PyObject *value = GETLOCAL(oparg);
    if (value == NULL) {
        format_exc_check_arg(PyExc_UnboundLocalError,
                             UNBOUNDLOCAL_ERROR_MSG,
                             PyTuple_GetItem(co->co_varnames, oparg));
        goto error;
    }
    Py_INCREF(value);
    PUSH(value);
    FAST_DISPATCH();
}
This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:
TARGET(LOAD_FAST) {
    PyObject *value = fastlocals[oparg];
    if (value == NULL) {
        // ... error handling ...
    }
    value->ob_refcnt++;
    *stack_pointer++ = value;
    if (_Py_TracingPossible) {
        // ... slow path ...
    }
    f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
    uint16_t word = *next_instr;
    opcode = word & 255;
    oparg = word >> 8;
    next_instr++;
    goto *opcode_targets[opcode];
}
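As an aside, the opcode is easy to observe from Python itself via the dis module; disassembling a function that reads a local shows a LOAD_FAST being emitted:

```python
import dis

def f(x):
    # The body compiles to a LOAD_FAST for the local x, then a return.
    return x

dis.dis(f)
```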
With this code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:
; PyObject *value = fastlocals[oparg]
48 8B 44 24 38 mov rax, [rsp+38h]
49 63 F5 movsxd rsi, r13d
48 8B 04 F0 mov rax, [rax+rsi*8]
; if (value == NULL)
48 85 C0 test rax, rax
0F 84 86 3A 00 00 jz loc_545A2B
; value->ob_refcnt++
; *stack_pointer++ = value
4C 89 F2 mov rdx, r14
48 83 00 01 add qword ptr [rax], 1
49 83 C6 08 add r14, 8
48 89 02 mov [rdx], rax
; if (_Py_TracingPossible)
8B 05 97 AA 3B 00 mov eax, cs:_Py_TracingPossible
85 C0 test eax, eax
0F 85 A3 CD FF FF jnz loc_53ED64
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
; next_instr++
48 89 EA mov rdx, rbp
48 2B 14 24 sub rdx, [rsp]
48 83 C5 02 add rbp, 2
48 D1 FA sar rdx, 1
01 D2 add edx, edx
89 53 78 mov [rbx+78h], edx
; word = *next_instr
0F B7 55 FE movzx edx, word ptr [rbp-2]
; opcode = word & 255
44 0F B6 C2 movzx r8d, dl
; oparg = word >> 8
0F B6 F6 movzx esi, dh
; goto *opcode_targets[opcode]
49 63 D0 movsxd rdx, r8d
41 89 F5 mov r13d, esi
48 8B 14 D5 20 51 60 00 mov rdx, ds:opcode_targets_11490[rdx*8]
FF E2 jmp rdx
From this, we can infer that gcc made the following choices:
- fastlocals is stored on the stack at [rsp+38h].
- oparg is stored in the register r13d.
- stack_pointer is stored in the register r14.
- next_instr is stored in the register rbp.
- first_instr is stored on the stack at [rsp].
- f is stored in the register rbx.
We can also spot a number of sad things:
- A memory load could be avoided if gcc knew that [rsp+38h] was rbx+178h (fastlocals is an array member of f).
- movsxd rsi, r13d could be avoided if gcc knew that r13d (oparg) was never negative (oparg is stupidly declared as int rather than unsigned int, and proving that it can never be negative requires a bit of effort and language-lawyering [1]).
- mov rdx, r14 feels silly; gcc should transpose add r14, 8 and mov [rdx], rax, and then this instruction wouldn't be required (as a register-register move, it doesn't consume any execution resources, but it still consumes icache space and decoding resources).
- sar rdx, 1 and add edx, edx could be avoided if gcc knew that first_instr and next_instr were always even. The preceding sub could also be avoided if next_instr was maintained as an offset from first_instr rather than an absolute pointer.
- movsxd rdx, r8d could be avoided if gcc realized that r8d (opcode) was never negative (which it should be able to spot fairly easily: it is the result of a movzx instruction, and thus can only be in the range 0 through 255).
- mov r13d, esi could be avoided by doing movzx esi, dh directly into r13d instead of into esi (except that movzx r13d, dh cannot be expressed in machine code [2], so the compiler would have to swap things around and put oparg in say ebp and next_instr in say r13).
- mov rdx, ds:opcode_targets_11490[rdx*8] and jmp rdx could be merged into the single instruction jmp ds:opcode_targets_11490[rdx*8] (thus saving an instruction decode and a byte or two of icache).
It feels like there are quite a few things which gcc could improve upon. Despite that, gcc might still be doing better than other compilers; to find out, we can look at what Clang on OSX x64 does:
; PyObject *value = fastlocals[oparg]
49 63 F7 movsxd rsi, r15d
49 8B 84 F6 78 01 00 00 mov rax, [r14+rsi*8+178h]
; if (value == NULL)
48 85 C0 test rax, rax
0F 84 51 41 00 00 jz loc_F7B6D
; value->ob_refcnt++
48 FF 00 inc qword ptr [rax]
; *stack_pointer++ = value
49 89 45 00 mov [r13+0], rax
49 83 C5 08 add r13, 8
; if (_Py_TracingPossible)
44 8B 3D D2 D5 1A 00 mov r15d, cs:__Py_TracingPossible
45 85 FF test r15d, r15d
0F 85 B7 BB FF FF jnz loc_EF5EE
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B 85 48 FE FF FF mov rax, [rbp-1B8h]
48 2B 85 28 FE FF FF sub rax, [rbp-1D8h]
48 D1 F8 sar rax, 1
48 98 cdqe
48 01 C0 add rax, rax
41 89 46 78 mov [r14+78h], eax
; word = *next_instr
48 8B 95 48 FE FF FF mov rdx, [rbp-1B8h]
0F B7 02 movzx eax, word ptr [rdx]
; opcode = word & 255
0F B6 D8 movzx ebx, al
; oparg = word >> 8
0F B6 C4 movzx eax, ah
41 89 C7 mov r15d, eax
; next_instr++
48 83 C2 02 add rdx, 2
48 89 95 48 FE FF FF mov [rbp+var_1B8], rdx
; goto *opcode_targets[opcode]
48 63 C3 movsxd rax, ebx
48 8D 0D 87 01 13 00 lea rcx, _opcode_targets_11343
48 8B 04 C1 mov rax, [rcx+rax*8]
FF E0 jmp rax
From this, we can infer that clang made the following choices:
- oparg is stored in the register r15d.
- stack_pointer is stored in the register r13.
- next_instr is stored on the stack at [rbp-1B8h].
- first_instr is stored on the stack at [rbp-1D8h].
- f is stored in the register r14.
Again, we can critique this assembly:
- movsxd rsi, r15d could be avoided if clang knew that oparg was never negative.
- Clang does realise that fastlocals can be expressed as r14+178h, which is good.
- inc qword ptr [rax] versus add qword ptr [rax], 1 is debatable, though I tend to have a slight preference for add (inc might suffer a partial flags stall, but on the flip side is a slightly shorter encoding).
- The choice of r13 for stack_pointer is a shame, as it requires [r13+0] rather than just [r13]. It would have been better to put say f in r13 and stack_pointer in r14.
- Clang gets points for putting mov [r13+0], rax and add r13, 8 in the "right" order, thus avoiding the extra mov performed by gcc.
- next_instr should really have been assigned a register rather than a stack slot.
- cdqe, add rax, rax, mov [r14+78h], eax is a disappointing instruction sequence: given the 32-bit store at the end, Clang should have spotted that add rax, rax could be turned into add eax, eax, and thus that cdqe could be dropped entirely.
- The memory load mov rdx, [rbp-1B8h] could be avoided by instead doing mov rdx, rax immediately after the preceding mov rax, [rbp-1B8h].
- The mov in movzx eax, ah, mov r15d, eax could be eliminated had Clang chosen a more suitable register for oparg [2].
- movsxd rax, ebx could be avoided if Clang realized that ebx (opcode) was never negative (which it should be able to spot fairly easily: it is the result of a movzx instruction, and thus can only be in the range 0 through 255).
- Clang has generated position-independent code, thus requiring lea rcx, _opcode_targets_11343 rather than fusing the address of the jump table into the next memory load.
- mov rax, [rcx+rax*8] and jmp rax could be fused into a single instruction.
Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.
Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed goto statements, and instead has to fall back to a switch statement. Bearing that in mind, the assembly is:
; PyObject *value = fastlocals[oparg]
48 63 C3 movsxd rax, ebx
49 8B 8C C7 78 01 00 00 mov rcx, [r15+rax*8+178h]
; if (value == NULL)
48 85 C9 test rcx, rcx
0F 84 5C 68 04 00 jz loc_1E0791BF
; value->ob_refcnt++
48 FF 01 inc qword ptr [rcx]
; *stack_pointer++ = value
49 89 0C 24 mov [r12], rcx
49 83 C4 08 add r12, 8
4C 89 64 24 48 mov [rsp+48h], r12
; end of switch case
E9 77 FF FF FF jmp loc_1E0328EF
loc_1E0328EF:
; if (_Py_TracingPossible)
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B C2 mov rax, rdx
49 2B C6 sub rax, r14
48 D1 F8 sar rax, 1
03 C0 add eax, eax
83 3D 8F 2F 33 00 00 cmp cs:_Py_TracingPossible, 0
41 89 47 78 mov [r15+78h], eax
0F 85 DB 3D 04 00 jnz loc_1E0766E6
; word = *next_instr
0F B7 1A movzx ebx, word ptr [rdx]
; opcode = word & 255
44 0F B6 EB movzx r13d, bl
; oparg = word >> 8
C1 EB 08 shr ebx, 8
; next_instr++
48 83 C2 02 add rdx, 2
48 89 54 24 40 mov [rsp+40h], rdx
; spill oparg
89 5C 24 70 mov dword ptr [rsp+70h], ebx
; align instruction stream
0F 1F 40 00 nop dword ptr [rax+00h]
db 66h, 66h, 66h
66 0F 1F 84 00 00 00 00 nop word ptr [rax+rax+00000000h]
; check opcode in valid range
41 8D 45 FF lea eax, [r13-1]
3D 9D 00 00 00 cmp eax, 9Dh
0F 87 48 6A 04 00 ja loc_1E079387
; goto *opcode_targets[opcode] (actually jump to switch case)
48 63 C8 movsxd rcx, eax
41 8B 84 88 CC 67 03 00 mov eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0 add rax, r8
FF E0 jmp rax
For MSVC, we can infer:
- oparg is stored in the register ebx, and also on the stack at [rsp+70h].
- stack_pointer is stored in the register r12, and also on the stack at [rsp+48h].
- next_instr is stored in the register rdx, and also on the stack at [rsp+40h].
- first_instr is stored in the register r14.
- f is stored in the register r15.
As per the established pattern, the critique on MSVC's code is:
- movsxd rax, ebx is, yet again, the fault of oparg being signed.
- Like clang, MSVC realised that fastlocals can be expressed as r15+178h, which is good.
- Like clang, MSVC opted for inc over add, which is debatable.
- Like clang, MSVC put mov [r12], rcx and add r12, 8 in the right order.
- MSVC's choice of register for stack_pointer is good, as it avoids +0 in memory operands.
- MSVC's spilling of registers to the stack is kind of sad.
- jmp loc_1E0328EF is suffered because MSVC lacks computed goto statements.
- f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr)) and if (_Py_TracingPossible) are interleaved slightly, which is cute.
- Like gcc, MSVC avoids the trap of cdqe and add rax, rax, instead opting for add eax, eax. Alas, like gcc, it still doesn't realise that neither the sar rax, 1 nor the add eax, eax are necessary.
- shr ebx, 8 is done instead of movzx ebx, bh - I'd have a slight preference for the latter, but there isn't much in it. At least the choice of registers avoids the gratuitous mov done by gcc and clang.
- The aligning of the instruction stream is - as far as I can tell - completely redundant, as nothing jumps to the aligned instruction.
- Rather than having a jump table for opcodes 0 through 255 inclusive, MSVC has a jump table for opcodes 1 through 158 inclusive. This saves a bit of virtual address space, but comes at the cost of lea eax, [r13-1], cmp eax, 9Dh, ja loc_1E079387.
- movsxd rcx, eax could be avoided if MSVC realized that eax (opcode - 1) was never negative (which it should be able to spot fairly easily: by means of cmp eax, 9Dh, ja loc_1E079387 it has already verified that eax is between 0 and 157 inclusive).
- MSVC's jump table contains 32-bit offsets rather than 64-bit absolute pointers. This saves on address space and dcache and relocations, but comes at the cost of the add rax, r8 instruction to turn an offset into an absolute pointer.
MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed goto statements is certainly painful though, and accounts for four entries in the critique list.
Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for LOAD_FAST:
; PyObject *value = fastlocals[oparg]
48 8B 94 CB 00 01 00 00 mov rdx, [rbx+rcx*8+100h]
; word = *next_instr
41 0F B7 04 2E movzx eax, word ptr [r14+rbp]
; if (value == NULL)
48 85 D2 test rdx, rdx
0F 84 F7 12 00 00 jz loc_1E00696D
; *stack_pointer++ = value
49 89 14 24 mov [r12], rdx
49 83 C4 08 add r12, 8
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
89 2B mov [rbx], ebp
; value->ob_refcnt++
48 83 02 01 add qword ptr [rdx], 1
; if (_Py_TracingPossible)
41 F6 47 F8 01 test byte ptr [r15-8], 1
0F 85 8F BA FF FF jnz loc_1E00111E
; oparg = word >> 8
0F B6 CC movzx ecx, ah
; opcode = word & 255
0F B6 C0 movzx eax, al
; next_instr++
83 C5 02 add ebp, 2
; goto *opcode_targets[opcode]
41 FF 24 C7 jmp qword ptr [r15+rax*8]
My assembly makes the following register assignment choices:
- oparg is stored in the register ecx (at least on Windows; on other platforms it would be in edi - in both cases this happens to coincide with the register which the normal calling convention uses for the first argument).
- stack_pointer is stored in the register r12.
- The offset of next_instr from first_instr is stored in the register ebp.
- first_instr is stored in the register r14.
- f+78h is stored in the register rbx (in particular, this means that accesses to f->f_lasti don't need a displacement, and some fields toward the end of PyFrameObject can be accessed with a one-byte displacement rather than a four-byte one).
- The address of the jump table is stored in the register r15.
The combination of storing next_instr as an offset and keeping f biased by offsetof(PyFrameObject, f_lasti) means that f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr)) is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping f biased has no downside, and has the occasional other upside (some fields toward the end of PyFrameObject can be accessed with a one-byte displacement rather than a four-byte one). Storing next_instr as an offset has the minor downside of making the *next_instr memory operand slightly more complex ([r14+rbp] rather than [rbp]), but this is a very low cost, and the offset approach also makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on next_instr++. Keeping the jump table address in r15 is expensive (as POSIX x64 only has six non-volatile registers, and this burns one of those six for a runtime constant), but makes opcode dispatch cheap (which is important, given that dispatch is replicated into all 100+ opcodes), and has some upsides (e.g. rip-relative lea instructions can instead be r15-relative, and thus be executed on a wider range of ports). I also change _Py_TracingPossible from being a 32-bit variable to being a 1-bit variable, and put this variable just before the jump table (so that it can be addressed with a one-byte offset from r15). The other notable thing to point out is pulling word = *next_instr up towards the start of the instruction stream - I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.
That is one opcode - LOAD_FAST - considered in detail. Only 100+ other opcodes to go...
[1] There are two kinds of assignment to oparg: one kind we've already seen, namely oparg = word >> 8, which fairly obviously can't make oparg negative. The other kind is in the EXTENDED_ARG opcode, which does oparg |= oldoparg << 8: we have to appeal to language lawyering to claim that oldoparg being non-negative implies that oldoparg << 8 is non-negative (signed overflow is undefined and all that). It is then one simple step to claim that oparg being non-negative and oldoparg << 8 being non-negative implies that oparg | (oldoparg << 8) is non-negative.
[2] The ah/bh/ch/dh registers can only be accessed if a REX prefix is not used. The r8 through r15 registers can only be accessed if a REX prefix is used. QED.
[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...