Micro-optimisations can speed up CPython

Last time, I bemoaned what compilers did to some of the CPython interpreter main loop. Following those remarks, there are three obvious courses of action:

  1. Make targeted improvements to the compilers.
  2. Write the interpreter main loop directly in assembly.
  3. Tweak the C source code to make it more amenable to good compilation.

Option three is the easiest to explore, so let's start with a random benchmark to use as a baseline:

Python 3.6.0+ (default, Mar  7 2017, 00:04:40) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from performance.benchmarks.bm_nbody import bench_nbody
>>> bench_nbody(10, 'sun', 100000)
9.907924063038081

The following patch is a little long, but each individual change is relatively boring, and all the changes are motivated by what we saw in gcc's assembly:

diff --git a/Python/ceval.c b/Python/ceval.c
index d5172b9..79ccf2a 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -729,7 +729,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
     int opcode;        /* Current opcode */
     int oparg;         /* Current opcode argument, if any */
     enum why_code why; /* Reason for block stack unwind */
-    PyObject **fastlocals, **freevars;
+    PyObject **freevars;
     PyObject *retval = NULL;            /* Return value */
     PyThreadState *tstate = PyThreadState_GET();
     PyCodeObject *co;
@@ -865,7 +865,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 /* Code access macros */
 
 /* The integer overflow is checked by an assertion below. */
-#define INSTR_OFFSET()  (sizeof(_Py_CODEUNIT) * (int)(next_instr - first_instr))
+#define INSTR_OFFSET()  ((char*)next_instr - (char*)first_instr)
 #define NEXTOPARG()  do { \
         _Py_CODEUNIT word = *next_instr; \
         opcode = _Py_OPCODE(word); \
@@ -959,7 +959,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 
 /* Local variable macros */
 
-#define GETLOCAL(i)     (fastlocals[i])
+#define GETLOCAL(i)     (f->f_localsplus[i])
 
 /* The SETLOCAL() macro must not DECREF the local variable in-place and
    then store the new value; it must copy the old value to a temporary
@@ -1045,7 +1045,6 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
     co = f->f_code;
     names = co->co_names;
     consts = co->co_consts;
-    fastlocals = f->f_localsplus;
     freevars = f->f_localsplus + co->co_nlocals;
     assert(PyBytes_Check(co->co_code));
     assert(PyBytes_GET_SIZE(co->co_code) <= INT_MAX);
@@ -1228,7 +1227,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             FAST_DISPATCH();
 
         TARGET(LOAD_FAST) {
-            PyObject *value = GETLOCAL(oparg);
+            PyObject *value = GETLOCAL((unsigned)oparg);
             if (value == NULL) {
                 format_exc_check_arg(PyExc_UnboundLocalError,
                                      UNBOUNDLOCAL_ERROR_MSG,
@@ -1242,7 +1241,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 
         PREDICTED(LOAD_CONST);
         TARGET(LOAD_CONST) {
-            PyObject *value = GETITEM(consts, oparg);
+            PyObject *value = GETITEM(consts, (unsigned)oparg);
             Py_INCREF(value);
             PUSH(value);
             FAST_DISPATCH();
@@ -1251,7 +1250,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         PREDICTED(STORE_FAST);
         TARGET(STORE_FAST) {
             PyObject *value = POP();
-            SETLOCAL(oparg, value);
+            SETLOCAL((unsigned)oparg, value);
             FAST_DISPATCH();
         }
 
@@ -1526,7 +1525,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 
         TARGET(LIST_APPEND) {
             PyObject *v = POP();
-            PyObject *list = PEEK(oparg);
+            PyObject *list = PEEK((size_t)(unsigned)oparg);
             int err;
             err = PyList_Append(list, v);
             Py_DECREF(v);
@@ -1731,7 +1730,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             _Py_IDENTIFIER(__annotations__);
             PyObject *ann_dict;
             PyObject *ann = POP();
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             int err;
             if (f->f_locals == NULL) {
                 PyErr_Format(PyExc_SystemError,
@@ -2155,7 +2154,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(STORE_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *v = POP();
             PyObject *ns = f->f_locals;
             int err;
@@ -2176,7 +2175,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(DELETE_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *ns = f->f_locals;
             int err;
             if (ns == NULL) {
@@ -2198,7 +2197,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         TARGET(UNPACK_SEQUENCE) {
             PyObject *seq = POP(), *item, **items;
             if (PyTuple_CheckExact(seq) &&
-                PyTuple_GET_SIZE(seq) == oparg) {
+                PyTuple_GET_SIZE(seq) == (Py_ssize_t)(size_t)(unsigned)oparg) {
                 items = ((PyTupleObject *)seq)->ob_item;
                 while (oparg--) {
                     item = items[oparg];
@@ -2206,7 +2205,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                     PUSH(item);
                 }
             } else if (PyList_CheckExact(seq) &&
-                       PyList_GET_SIZE(seq) == oparg) {
+                       PyList_GET_SIZE(seq) == (Py_ssize_t)(size_t)(unsigned)oparg) {
                 items = ((PyListObject *)seq)->ob_item;
                 while (oparg--) {
                     item = items[oparg];
@@ -2215,7 +2214,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 }
             } else if (unpack_iterable(seq, oparg, -1,
                                        stack_pointer + oparg)) {
-                STACKADJ(oparg);
+                STACKADJ((unsigned)oparg);
             } else {
                 /* unpack_iterable() raised an exception */
                 Py_DECREF(seq);
@@ -2241,7 +2240,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(STORE_ATTR) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *owner = TOP();
             PyObject *v = SECOND();
             int err;
@@ -2255,7 +2254,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(DELETE_ATTR) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *owner = POP();
             int err;
             err = PyObject_SetAttr(owner, name, (PyObject *)NULL);
@@ -2266,7 +2265,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(STORE_GLOBAL) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *v = POP();
             int err;
             err = PyDict_SetItem(f->f_globals, name, v);
@@ -2277,7 +2276,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(DELETE_GLOBAL) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             int err;
             err = PyDict_DelItem(f->f_globals, name);
             if (err != 0) {
@@ -2289,7 +2288,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(LOAD_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *locals = f->f_locals;
             PyObject *v;
             if (locals == NULL) {
@@ -2340,7 +2339,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(LOAD_GLOBAL) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *v;
             if (PyDict_CheckExact(f->f_globals)
                 && PyDict_CheckExact(f->f_builtins))
@@ -2385,9 +2384,9 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(DELETE_FAST) {
-            PyObject *v = GETLOCAL(oparg);
+            PyObject *v = GETLOCAL((unsigned)oparg);
             if (v != NULL) {
-                SETLOCAL(oparg, NULL);
+                SETLOCAL((unsigned)oparg, NULL);
                 DISPATCH();
             }
             format_exc_check_arg(
@@ -2488,7 +2487,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(BUILD_TUPLE) {
-            PyObject *tup = PyTuple_New(oparg);
+            PyObject *tup = PyTuple_New((unsigned)oparg);
             if (tup == NULL)
                 goto error;
             while (--oparg >= 0) {
@@ -2500,7 +2499,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(BUILD_LIST) {
-            PyObject *list =  PyList_New(oparg);
+            PyObject *list =  PyList_New((unsigned)oparg);
             if (list == NULL)
                 goto error;
             while (--oparg >= 0) {
@@ -2571,7 +2570,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                     err = PySet_Add(set, item);
                 Py_DECREF(item);
             }
-            STACKADJ(-oparg);
+            STACKADJ(-(size_t)(unsigned)oparg);
             if (err != 0) {
                 Py_DECREF(set);
                 goto error;
@@ -2601,7 +2600,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 
         TARGET(BUILD_MAP) {
             Py_ssize_t i;
-            PyObject *map = _PyDict_NewPresized((Py_ssize_t)oparg);
+            PyObject *map = _PyDict_NewPresized((size_t)(unsigned)oparg);
             if (map == NULL)
                 goto error;
             for (i = oparg; i > 0; i--) {
@@ -2684,12 +2683,12 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             PyObject *map;
             PyObject *keys = TOP();
             if (!PyTuple_CheckExact(keys) ||
-                PyTuple_GET_SIZE(keys) != (Py_ssize_t)oparg) {
+                PyTuple_GET_SIZE(keys) != (Py_ssize_t)(size_t)(unsigned)oparg) {
                 PyErr_SetString(PyExc_SystemError,
                                 "bad BUILD_CONST_KEY_MAP keys argument");
                 goto error;
             }
-            map = _PyDict_NewPresized((Py_ssize_t)oparg);
+            map = _PyDict_NewPresized((size_t)(unsigned)oparg);
             if (map == NULL) {
                 goto error;
             }
@@ -2746,7 +2745,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             for (i = oparg; i > 0; i--) {
                 PyObject *arg = PEEK(i);
                 if (_PyDict_MergeEx(sum, arg, 2) < 0) {
-                    PyObject *func = PEEK(2 + oparg);
+                    PyObject *func = PEEK(2 + (unsigned)oparg);
                     if (PyErr_ExceptionMatches(PyExc_AttributeError)) {
                         PyErr_Format(PyExc_TypeError,
                                 "%.200s%.200s argument after ** "
@@ -2810,7 +2809,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(LOAD_ATTR) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *owner = TOP();
             PyObject *res = PyObject_GetAttr(owner, name);
             Py_DECREF(owner);
@@ -2835,7 +2834,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(IMPORT_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *fromlist = POP();
             PyObject *level = TOP();
             PyObject *res;
@@ -2869,7 +2868,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(IMPORT_FROM) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *from = TOP();
             PyObject *res;
             res = import_from(from, name);
@@ -2880,7 +2879,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(JUMP_FORWARD) {
-            JUMPBY(oparg);
+            JUMPBY((unsigned)oparg);
             FAST_DISPATCH();
         }
 
@@ -2894,7 +2893,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             }
             if (cond == Py_False) {
                 Py_DECREF(cond);
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
                 FAST_DISPATCH();
             }
             err = PyObject_IsTrue(cond);
@@ -2902,7 +2901,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             if (err > 0)
                 err = 0;
             else if (err == 0)
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
             else
                 goto error;
             DISPATCH();
@@ -2918,14 +2917,14 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             }
             if (cond == Py_True) {
                 Py_DECREF(cond);
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
                 FAST_DISPATCH();
             }
             err = PyObject_IsTrue(cond);
             Py_DECREF(cond);
             if (err > 0) {
                 err = 0;
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
             }
             else if (err == 0)
                 ;
@@ -2943,7 +2942,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 FAST_DISPATCH();
             }
             if (cond == Py_False) {
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
                 FAST_DISPATCH();
             }
             err = PyObject_IsTrue(cond);
@@ -2953,7 +2952,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 err = 0;
             }
             else if (err == 0)
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
             else
                 goto error;
             DISPATCH();
@@ -2968,13 +2967,13 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 FAST_DISPATCH();
             }
             if (cond == Py_True) {
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
                 FAST_DISPATCH();
             }
             err = PyObject_IsTrue(cond);
             if (err > 0) {
                 err = 0;
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
             }
             else if (err == 0) {
                 STACKADJ(-1);
@@ -2987,7 +2986,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 
         PREDICTED(JUMP_ABSOLUTE);
         TARGET(JUMP_ABSOLUTE) {
-            JUMPTO(oparg);
+            JUMPTO((unsigned)oparg);
 #if FAST_LOOPS
             /* Enabling this path speeds-up all while and for-loops by bypassing
                the per-loop checks for signals.  By default, this should be turned-off
@@ -3065,7 +3064,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             /* iterator ended normally */
             STACKADJ(-1);
             Py_DECREF(iter);
-            JUMPBY(oparg);
+            JUMPBY((unsigned)oparg);
             PREDICT(POP_BLOCK);
             DISPATCH();
         }
@@ -3076,7 +3075,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }
 
         TARGET(CONTINUE_LOOP) {
-            retval = PyLong_FromLong(oparg);
+            retval = PyLong_FromLong((unsigned)oparg);
             if (retval == NULL)
                 goto error;
             why = WHY_CONTINUE;
@@ -3755,7 +3754,7 @@ format_missing(const char *kind, PyCodeObject *co, PyObject *names)
 
 static void
 missing_arguments(PyCodeObject *co, Py_ssize_t missing, Py_ssize_t defcount,
-                  PyObject **fastlocals)
+                  PyFrameObject *f)
 {
     Py_ssize_t i, j = 0;
     Py_ssize_t start, end;
@@ -3793,7 +3792,7 @@ missing_arguments(PyCodeObject *co, Py_ssize_t missing, Py_ssize_t defcount,
 
 static void
 too_many_positional(PyCodeObject *co, Py_ssize_t given, Py_ssize_t defcount,
-                    PyObject **fastlocals)
+                    PyFrameObject *f)
 {
     int plural;
     Py_ssize_t kwonly_given = 0;
@@ -3863,7 +3862,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
     PyCodeObject* co = (PyCodeObject*)_co;
     PyFrameObject *f;
     PyObject *retval = NULL;
-    PyObject **fastlocals, **freevars;
+    PyObject **freevars;
     PyThreadState *tstate;
     PyObject *x, *u;
     const Py_ssize_t total_args = co->co_argcount + co->co_kwonlyargcount;
@@ -3883,7 +3882,6 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
     if (f == NULL) {
         return NULL;
     }
-    fastlocals = f->f_localsplus;
     freevars = f->f_localsplus + co->co_nlocals;
 
     /* Create a dictionary for keyword parameters (**kwags) */
@@ -3990,7 +3988,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
 
     /* Check the number of positional arguments */
     if (argcount > co->co_argcount && !(co->co_flags & CO_VARARGS)) {
-        too_many_positional(co, argcount, defcount, fastlocals);
+        too_many_positional(co, argcount, defcount, f);
         goto fail;
     }
 
@@ -4004,7 +4002,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
             }
         }
         if (missing) {
-            missing_arguments(co, missing, defcount, fastlocals);
+            missing_arguments(co, missing, defcount, f);
             goto fail;
         }
         if (n > m)
@@ -4039,7 +4037,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
             missing++;
         }
         if (missing) {
-            missing_arguments(co, missing, -1, fastlocals);
+            missing_arguments(co, missing, -1, f);
             goto fail;
         }
     }
@@ -4845,7 +4843,6 @@ _PyFunction_FastCall(PyCodeObject *co, PyObject **args, Py_ssize_t nargs,
 {
     PyFrameObject *f;
     PyThreadState *tstate = PyThreadState_GET();
-    PyObject **fastlocals;
     Py_ssize_t i;
     PyObject *result;
 
@@ -4861,11 +4858,9 @@ _PyFunction_FastCall(PyCodeObject *co, PyObject **args, Py_ssize_t nargs,
         return NULL;
     }
 
-    fastlocals = f->f_localsplus;
-
     for (i = 0; i < nargs; i++) {
         Py_INCREF(*args);
-        fastlocals[i] = *args++;
+        f->f_localsplus[(size_t)i] = *args++;
     }
     result = PyEval_EvalFrameEx(f,0);
 
@@ -5335,9 +5330,8 @@ unicode_concatenate(PyObject *v, PyObject *w,
         switch (opcode) {
         case STORE_FAST:
         {
-            PyObject **fastlocals = f->f_localsplus;
-            if (GETLOCAL(oparg) == v)
-                SETLOCAL(oparg, NULL);
+            if (GETLOCAL((unsigned)oparg) == v)
+                SETLOCAL((unsigned)oparg, NULL);
             break;
         }
         case STORE_DEREF:
@@ -5352,7 +5346,7 @@ unicode_concatenate(PyObject *v, PyObject *w,
         case STORE_NAME:
         {
             PyObject *names = f->f_code->co_names;
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *locals = f->f_locals;
             if (PyDict_CheckExact(locals) &&
                 PyDict_GetItem(locals, name) == v) {

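The most frequent change above is casting oparg to unsigned before using it as an index. oparg is declared as int, and the interpreter never lets it go negative, but the compiler can't see that, so indexing with the signed int forces a movsxd sign extension on x64; casting to unsigned lets the compiler reuse the zero extension which 32-bit operations perform implicitly. As a minimal standalone sketch (my own illustration, not part of the patch - the function names are made up):

PyObject *load_signed(PyObject **locals, int oparg) {
    return locals[oparg];            /* movsxd rax, esi ; mov rax, [rdi+rax*8] */
}

PyObject *load_unsigned(PyObject **locals, int oparg) {
    return locals[(unsigned)oparg];  /* mov eax, esi ; mov rax, [rdi+rax*8] */
}

The other changes follow similar logic: the new INSTR_OFFSET() computes the byte difference directly rather than scaling a truncated element difference back up, and dropping the fastlocals variable in favour of using f->f_localsplus directly relieves register pressure in a function which is chronically short of registers.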
With all of these changes applied, we get a 1.3% speedup:

Python 3.6.0+ (default, Mar  7 2017, 00:06:13) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from performance.benchmarks.bm_nbody import bench_nbody
>>> bench_nbody(10, 'sun', 100000)
9.777307317999657

A 1.3% speedup is simultaneously not very much, and also a surprisingly large amount for eliminating just a few instructions here and there. Of course, your mileage may vary, and this is just one randomly chosen benchmark.

What do compilers do with the CPython interpreter main loop?

Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: it is hard to compile perfectly. The difficulty scales with the number of opcodes the interpreter has - more than 100 in CPython's case - but we can get a feel for how well a compiler is doing by looking at just one opcode.

For this exercise I'll look at CPython 3.6's LOAD_FAST opcode, which in C is:

TARGET(LOAD_FAST) {
  PyObject *value = GETLOCAL(oparg);
  if (value == NULL) {
    format_exc_check_arg(PyExc_UnboundLocalError,
                         UNBOUNDLOCAL_ERROR_MSG,
                         PyTuple_GetItem(co->co_varnames, oparg));
    goto error;
  }
  Py_INCREF(value);
  PUSH(value);
  FAST_DISPATCH();
}

This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:

TARGET(LOAD_FAST) {
  PyObject *value = fastlocals[oparg];
  if (value == NULL) {
    // ... error handling ...
  }
  value->ob_refcnt++;
  *stack_pointer++ = value;
  if (_Py_TracingPossible) {
    // ... slow path ...
  }
  f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
  uint16_t word = *next_instr;
  opcode = word & 255;
  oparg = word >> 8;
  next_instr++;
  goto *opcode_targets[opcode];
}
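The goto *opcode_targets[opcode] at the end is gcc's "labels as values" extension (computed goto, also supported by clang): each opcode ends by jumping directly to the handler for the next opcode, rather than going back around a central switch. As a toy illustration of the style (my own sketch, not CPython code):

#include <stdint.h>
#include <stdio.h>

static int run(const uint16_t *code) {
    static void *targets[] = { &&op_push, &&op_add, &&op_halt };
    int stack[64], *sp = stack;
    uint16_t word; int opcode, oparg;
#define DISPATCH() word = *code++; opcode = word & 255; \
                   oparg = word >> 8; goto *targets[opcode]
    DISPATCH();
op_push: *sp++ = oparg;           DISPATCH();  /* push immediate     */
op_add:  sp[-2] += sp[-1]; sp--;  DISPATCH();  /* pop two, push sum  */
op_halt: return sp[-1];
#undef DISPATCH
}

int main(void) {
    /* push 2; push 3; add; halt */
    const uint16_t prog[] = { 0x0200, 0x0300, 0x0001, 0x0002 };
    printf("%d\n", run(prog));  /* prints 5 */
    return 0;
}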

With the expanded code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:

; PyObject *value = fastlocals[oparg]
48 8B 44 24 38           mov     rax, [rsp+38h]
49 63 F5                 movsxd  rsi, r13d
48 8B 04 F0              mov     rax, [rax+rsi*8]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 86 3A 00 00        jz      loc_545A2B
; value->ob_refcnt++
; *stack_pointer++ = value
4C 89 F2                 mov     rdx, r14
48 83 00 01              add     qword ptr [rax], 1
49 83 C6 08              add     r14, 8
48 89 02                 mov     [rdx], rax
; if (_Py_TracingPossible)
8B 05 97 AA 3B 00        mov     eax, cs:_Py_TracingPossible
85 C0                    test    eax, eax
0F 85 A3 CD FF FF        jnz     loc_53ED64
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
; next_instr++
48 89 EA                 mov     rdx, rbp
48 2B 14 24              sub     rdx, [rsp]
48 83 C5 02              add     rbp, 2
48 D1 FA                 sar     rdx, 1
01 D2                    add     edx, edx
89 53 78                 mov     [rbx+78h], edx
; word = *next_instr
0F B7 55 FE              movzx   edx, word ptr [rbp-2]
; opcode = word & 255
44 0F B6 C2              movzx   r8d, dl
; oparg = word >> 8
0F B6 F6                 movzx   esi, dh
; goto *opcode_targets[opcode]
49 63 D0                 movsxd  rdx, r8d
41 89 F5                 mov     r13d, esi
48 8B 14 D5 20 51 60 00  mov     rdx, ds:opcode_targets_11490[rdx*8]
FF E2                    jmp     rdx

From this, we can infer that gcc made the following choices:

We can also spot a number of sad things:

It feels like there are quite a few things which gcc could improve upon. Despite that, gcc might still be doing better than other compilers - the only way to find out is to look at a few of them. With that in mind, we can look at what Clang on OSX x64 does:

; PyObject *value = fastlocals[oparg]
49 63 F7                 movsxd  rsi, r15d
49 8B 84 F6 78 01 00 00  mov     rax, [r14+rsi*8+178h]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 51 41 00 00        jz      loc_F7B6D
; value->ob_refcnt++
48 FF 00                 inc     qword ptr [rax]
; *stack_pointer++ = value
49 89 45 00              mov     [r13+0], rax
49 83 C5 08              add     r13, 8
; if (_Py_TracingPossible)
44 8B 3D D2 D5 1A 00     mov     r15d, cs:__Py_TracingPossible
45 85 FF                 test    r15d, r15d
0F 85 B7 BB FF FF        jnz     loc_EF5EE
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B 85 48 FE FF FF     mov     rax, [rbp-1B8h]
48 2B 85 28 FE FF FF     sub     rax, [rbp-1D8h]
48 D1 F8                 sar     rax, 1
48 98                    cdqe
48 01 C0                 add     rax, rax
41 89 46 78              mov     [r14+78h], eax
; word = *next_instr
48 8B 95 48 FE FF FF     mov     rdx, [rbp-1B8h]
0F B7 02                 movzx   eax, word ptr [rdx]
; opcode = word & 255
0F B6 D8                 movzx   ebx, al
; oparg = word >> 8
0F B6 C4                 movzx   eax, ah
41 89 C7                 mov     r15d, eax
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 95 48 FE FF FF     mov     [rbp+var_1B8], rdx
; goto *opcode_targets[opcode]
48 63 C3                 movsxd  rax, ebx
48 8D 0D 87 01 13 00     lea     rcx, _opcode_targets_11343
48 8B 04 C1              mov     rax, [rcx+rax*8]
FF E0                    jmp     rax

From this, we can infer that clang made the following choices:

Again, we can critique this assembly:

Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.

Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed goto statements, and instead has to fall back to a switch statement. Bearing that in mind, the assembly is:

; PyObject *value = fastlocals[oparg]
48 63 C3                  movsxd  rax, ebx
49 8B 8C C7 78 01 00 00   mov     rcx, [r15+rax*8+178h]
; if (value == NULL)
48 85 C9                  test    rcx, rcx
0F 84 5C 68 04 00         jz      loc_1E0791BF
; value->ob_refcnt++
48 FF 01                  inc     qword ptr [rcx]
; *stack_pointer++ = value
49 89 0C 24               mov     [r12], rcx
49 83 C4 08               add     r12, 8
4C 89 64 24 48            mov     [rsp+48h], r12
; end of switch case
E9 77 FF FF FF            jmp     loc_1E0328EF

loc_1E0328EF:
; if (_Py_TracingPossible)
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B C2                 mov     rax, rdx
49 2B C6                 sub     rax, r14
48 D1 F8                 sar     rax, 1
03 C0                    add     eax, eax
83 3D 8F 2F 33 00 00     cmp     cs:_Py_TracingPossible, 0
41 89 47 78              mov     [r15+78h], eax
0F 85 DB 3D 04 00        jnz     loc_1E0766E6
; word = *next_instr
0F B7 1A                 movzx   ebx, word ptr [rdx]
; opcode = word & 255
44 0F B6 EB              movzx   r13d, bl
; oparg = word >> 8
C1 EB 08                 shr     ebx, 8
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 54 24 40           mov     [rsp+40h], rdx
; spill oparg
89 5C 24 70              mov     dword ptr [rsp+70h], ebx
; align instruction stream
0F 1F 40 00              nop     dword ptr [rax+00h]
                         db      66h, 66h, 66h
66 0F 1F 84 00 00 00 00  nop     word ptr [rax+rax+00000000h]
; check opcode in valid range
41 8D 45 FF              lea     eax, [r13-1]
3D 9D 00 00 00           cmp     eax, 9Dh
0F 87 48 6A 04 00        ja      loc_1E079387
; goto *opcode_targets[opcode] (actually jump to switch case)
48 63 C8                 movsxd  rcx, eax
41 8B 84 88 CC 67 03 00  mov     eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0                 add     rax, r8
FF E0                    jmp     rax

For MSVC, we can infer:

As per the established pattern, the critique on MSVC's code is:

MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed goto statements is certainly painful though, and accounts for four entries in the critique list.

Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for LOAD_FAST:

; PyObject *value = fastlocals[oparg]
48 8B 94 CB 00 01 00 00  mov     rdx, [rbx+rcx*8+100h]
; word = *next_instr
41 0F B7 04 2E           movzx   eax, word ptr [r14+rbp]
; if (value == NULL)
48 85 D2                 test    rdx, rdx
0F 84 F7 12 00 00        jz      loc_1E00696D
; *stack_pointer++ = value
49 89 14 24              mov     [r12], rdx
49 83 C4 08              add     r12, 8
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
89 2B                    mov     [rbx], ebp
; value->ob_refcnt++
48 83 02 01              add     qword ptr [rdx], 1
; if (_Py_TracingPossible)
41 F6 47 F8 01           test    byte ptr [r15-8], 1
0F 85 8F BA FF FF        jnz     loc_1E00111E
; oparg = word >> 8
0F B6 CC                 movzx   ecx, ah
; opcode = word & 255
0F B6 C0                 movzx   eax, al
; next_instr++
83 C5 02                 add     ebp, 2
; goto *opcode_targets[opcode]
41 FF 24 C7              jmp     qword ptr [r15+rax*8]

My assembly makes the following register assignment choices:

The combination of storing next_instr as an offset and keeping f biased by offsetof(PyFrameObject, f_lasti) means that f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr)) is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping f biased has no downside, and has the occasional other upside (some fields toward the end of PyFrameObject can be accessed with a one-byte displacement rather than a four-byte displacement). Storing next_instr as an offset has the minor downside of making the *next_instr memory operand slightly more complex ([r14+rbp] rather than [rbp]), but this is a very low cost, and the offset approach also makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on next_instr++. Keeping the jump table address in r15 is expensive (as POSIX x64 only has six non-volatile registers, and this burns one of those six for a runtime constant), but makes opcode dispatch cheap (which is important, given that dispatch is replicated into all 100+ opcodes), and has some upsides (e.g. rip-relative lea instructions can instead be r15-relative, and thus be executed on a wider range of ports). I also change _Py_TracingPossible from being a 32-bit variable to being a 1-bit variable, and put this variable just before the jump table (so that it can be addressed with a one-byte offset from r15). The other notable thing to point out is pulling word = *next_instr up towards the start of the instruction stream - I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.
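To restate the biasing trick in C terms (a sketch which assumes CPython's Python.h and frameobject.h for PyFrameObject; the payoff is only really visible in hand-written assembly):

#include <Python.h>
#include <frameobject.h>
#include <stddef.h>

#define BIAS offsetof(PyFrameObject, f_lasti)

/* With a register holding (char*)f + BIAS, the hot store becomes
   "mov [rbx], ebp" - two bytes, no displacement. */
static void store_lasti(char *f_biased, int lasti) {
    *(int *)f_biased = lasti;
}

/* Every other field is still a compile-time-constant offset away: */
static PyObject **frame_localsplus(char *f_biased) {
    return (PyObject **)(f_biased + (offsetof(PyFrameObject, f_localsplus) - BIAS));
}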

That is one opcode - LOAD_FAST - considered in detail. Only 100+ other opcodes to go...

[1] There are two kinds of assignment to oparg: one kind we've already seen, namely oparg = word >> 8, which fairly obviously can't make oparg negative. The other kind is in the EXTENDED_ARG opcode, which does oparg |= oldoparg << 8;: we have to appeal to language lawyering to claim that oldoparg being non-negative implies that oldoparg << 8 is non-negative (signed overflow is undefined and all that). Then it is one simple step to claim that oparg being non-negative and oldoparg << 8 being non-negative implies oparg | (oldoparg << 8) is non-negative.

[2] The ah/bh/ch/dh registers can only be accessed if a REX prefix is not used. The r8 through r15 registers can only be accessed if a REX prefix is used. QED.

[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...

Estimating π with LuaJIT

There are various ways of estimating π, though one cheap and cheerful way is to estimate the area of a circle: imagine a grid of 1x1 squares, draw a perfect circle of radius r, and estimate the area of the circle as being the number of squares whose midpoint is within the circle. We can implement this in a few lines of Lua:

r = 1000
for x = -r, r do
  for y = -r, r do
    if x^2 + y^2 <= r^2 then
      n = (n or 0) + 1
    end
  end
end
print(n / r^2)

With a small amount of thought, the inner for and if can be replaced with an analytical solution: for a given x, the integers y satisfying x^2 + y^2 <= r^2 are exactly those with |y| <= sqrt(r^2 - x^2), and there are math.floor(math.sqrt(r^2 - x^2)) * 2 + 1 of them:

r = 1000
for x = -r, r do
  n = (n or 0) + math.floor(math.sqrt(r^2 - x^2)) * 2 + 1
end
print(n / r^2)

At this point, we can bump r up to 10000000, run the above through LuaJIT, and get an estimate of 3.1415926535059 in less than a tenth of a second. Not a bad estimate for a few lines of code and a tiny amount of compute time.

The estimate itself is nice, but I find it more interesting to look at what LuaJIT is doing behind the scenes with this code. To begin with, we can inspect the bytecode - if source code is an array of characters, then bytecode is an array of instructions (plus an array of constants, plus a few other bits and bobs), and the first thing LuaJIT does with any source code is turn it into bytecode. We can see the bytecode by using the -bl argument:

$ luajit -bl pi.lua 
-- BYTECODE -- pi.lua:0-6
0001    KNUM     0   0      ; 10000000
0002    GSET     0   0      ; "r"
0003    GGET     0   0      ; "r"
0004    UNM      0   0
0005    GGET     1   0      ; "r"
0006    KSHORT   2   1
0007    FORI     0 => 0029
0008 => GGET     4   1      ; "n"
0009    IST          4
0010    JMP      5 => 0012
0011    KSHORT   4   0
0012 => GGET     5   2      ; "math"
0013    TGETS    5   5   3  ; "floor"
0014    GGET     6   2      ; "math"
0015    TGETS    6   6   4  ; "sqrt"
0016    GGET     7   0      ; "r"
0017    KSHORT   8   2
0018    POW      7   7   8
0019    KSHORT   8   2
0020    POW      8   3   8
0021    SUBVV    7   7   8
0022    CALL     6   0   2
0023    CALLM    5   2   0
0024    MULVN    5   5   1  ; 2
0025    ADDVV    4   4   5
0026    ADDVN    4   4   2  ; 1
0027    GSET     4   1      ; "n"
0028    FORL     0 => 0008
0029 => GGET     0   5      ; "print"
0030    GGET     1   1      ; "n"
0031    GGET     2   0      ; "r"
0032    KSHORT   3   2
0033    POW      2   2   3
0034    DIVVV    1   1   2
0035    CALL     0   1   2
0036    RET0     0   1

Having said bytecode was an array of instructions, the first column in the above (containing 0001 through 0036) is the array index. There is actually also an instruction at array index 0 which -bl doesn't show us, which in this case is 0000 FUNCV 9. The next column is either blank or => - blank for most instructions, and => for any instruction which is the target of a jump. The next column (KNUM, GSET, ..., RET0) contains the instruction name. The next few columns contain numbers, the meaning of which depends upon the instruction name. Finally, some instructions have a comment printed after the ;.
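For reference, each bytecode instruction is a 32-bit word: an 8-bit opcode, an 8-bit operand A, and then either one 16-bit operand D or two 8-bit operands C and B. A C sketch of the decoding, paraphrasing LuaJIT's lj_bc.h:

#include <stdint.h>

typedef uint32_t BCIns;                 /* one bytecode instruction    */

#define bc_op(i)  ((i) & 0xff)          /* opcode (lowest byte)        */
#define bc_a(i)   (((i) >> 8) & 0xff)   /* operand A                   */
#define bc_c(i)   (((i) >> 16) & 0xff)  /* operand C                   */
#define bc_b(i)   ((i) >> 24)           /* operand B (highest byte)    */
#define bc_d(i)   ((i) >> 16)           /* operand D (B and C fused)   */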

We can run through the various different kinds of instruction in the above (in order of first appearance):

Phew. That gets us through the bytecode listing. There's a lot of it, some of it is executed just once, and other bits of it are executed many many times. The LuaJIT interpreter doesn't care though; it'll happily churn through bytecode all day long. At some point though, the JIT part of LuaJIT comes into play. To explore this part, we'll switch from -bl (which dumps bytecode instead of running it) to -jdump=bitmsrx (which runs bytecode and also dumps various pieces of information as JIT compilation happens and JIT-compiled code executes):

$ luajit -jdump=bitmsrx pi.lua

The first thing to be output by -jdump=bitmsrx is the following:

---- TRACE 1 start pi.lua:2
0008  GGET     4   1      ; "n"
0009  IST          4
0010  JMP      5 => 0012
0012  GGET     5   2      ; "math"
0013  TGETS    5   5   3  ; "floor"
0014  GGET     6   2      ; "math"
0015  TGETS    6   6   4  ; "sqrt"
0016  GGET     7   0      ; "r"
0017  KSHORT   8   2
0018  POW      7   7   8
0019  KSHORT   8   2
0020  POW      8   3   8
0021  SUBVV    7   7   8
0022  CALL     6   0   2
0000  . FUNCC               ; math.sqrt
0023  CALLM    5   2   0
0000  . FUNCC               ; math.floor
0024  MULVN    5   5   1  ; 2
0025  ADDVV    4   4   5
0026  ADDVN    4   4   2  ; 1
0027  GSET     4   1      ; "n"
0028  FORL     0 => 0008

This tells us that trace #1 starts at line 2 of pi.lua, and is followed by the bytecode instructions which comprise the trace. This should look similar to the -bl dump from earlier, albeit with a few differences. We start at instruction index 0008 - the JIT compiler doesn't compile everything, only the things it thinks are worth compiling, and in this case it thinks that the loop body (which starts at instruction 0008) is worth compiling. The column which contained either blank or => is gone - traces are strictly linear sequences with no branches, and hence no jump targets within them, and hence no need for => jump target indicators. Instruction 0011 isn't listed in the trace, as it isn't part of the trace - in the particular flow of execution recorded by the JIT compiler, the 0 branch of (n or 0) wasn't taken. The other major difference happens at function calls: when a call happens, tracing follows the call, and starts including bytecode instructions from the called function in the trace. The function known as math.sqrt consists of the single bytecode instruction 0000 FUNCC, the effect of which is three-fold:

  1. Start a C function (c.f. the FUNCV instruction we saw earlier for starting a vararg Lua function).
  2. Go off and run the C code associated with the function.
  3. Return to the calling function (c.f. the RET0 instruction we saw earlier for returning from a Lua function).

Like math.sqrt, math.floor also consists of just a single bytecode instruction. In both cases, the bytecode instruction is included in the trace; the . marker between the array index and the instruction name denotes a call frame level (. . denotes two levels of call frame, etc.).

Actually, -jdump=bitmsrx is telling us a lie: math.sqrt does consist of just a single bytecode instruction, and that instruction does do steps 1 and 3 from the above list, but its step 2 is "Go off and run the assembly code for math.sqrt". This super-specialised bytecode instruction is only used by math.sqrt, and doesn't have a name per se, so reporting its name as FUNCC is perhaps not the worst lie in the world. Similarly, math.floor consists of a single super-specialised bytecode instruction (not all standard library functions follow this pattern - some are just plain old C functions - but most of the math library happens to be implemented in assembly rather than C).

We talk about bytecode instructions being included in a trace, but the bytecode isn't actually retained in the trace. Instead, as each bytecode instruction is executed, some so-called IR instructions are appended to the trace. After bytecode recording has finished for the trace, we get the next chunk of output from -jdump=bitmsrx, which is the full list of IR instructions:

---- TRACE 1 IR
....              SNAP   #0   [ ---- ]
0001 rbx   >  int SLOAD  #2    CRI
0002       >  int LE     0001  +2147483646
0003 rbp   >  int SLOAD  #1    CI
0004 rax      fun SLOAD  #0    R
0005 rax      tab FLOAD  0004  func.env
0006          int FLOAD  0005  tab.hmask
0007       >  int EQ     0006  +63 
0008 r14      p32 FLOAD  0005  tab.node
0009 r12   >  p32 HREFK  0008  "n"  @19
0010       >  num HLOAD  0009
0011       >  p32 HREFK  0008  "math" @54
0012 r15   >  tab HLOAD  0011
0013          int FLOAD  0012  tab.hmask
0014       >  int EQ     0013  +31 
0015 r13      p32 FLOAD  0012  tab.node
0016       >  p32 HREFK  0015  "floor" @14
0017       >  fun HLOAD  0016
0018       >  p32 HREFK  0015  "sqrt" @15
0019       >  fun HLOAD  0018
0020       >  p32 HREFK  0008  "r"  @12
0021 xmm0  >  num HLOAD  0020
0022 [8]      num MUL    0021  0021
0023 xmm1     num CONV   0003  num.int
0024 xmm1     num MUL    0023  0023
0025 xmm0     num SUB    0022  0024
0026       >  fun EQ     0019  math.sqrt
0027 xmm0     num FPMATH 0025  sqrt
0028       >  fun EQ     0017  math.floor
0029 xmm7     num FPMATH 0027  floor
0030 xmm7     num ADD    0029  0029
0031 xmm7     num ADD    0030  0010
0032 xmm7   + num ADD    0031  +1  
0033          num HSTORE 0009  0032
0034 rbp    + int ADD    0003  +1  
....              SNAP   #1   [ ---- ]
0035       >  int LE     0034  0001
....              SNAP   #2   [ ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037 xmm5     num CONV   0034  num.int
0038 xmm5     num MUL    0037  0037
0039 xmm6     num SUB    0022  0038
0040 xmm0     num FPMATH 0039  sqrt
0041 xmm6     num FPMATH 0040  floor
0042 xmm6     num ADD    0041  0041
0043 xmm7     num ADD    0042  0032
0044 xmm7   + num ADD    0043  +1  
0045          num HSTORE 0009  0044
0046 rbp    + int ADD    0034  +1  
....              SNAP   #3   [ ---- ]
0047       >  int LE     0046  0001
0048 rbp      int PHI    0034  0046
0049 xmm7     num PHI    0032  0044

Ignoring the .... SNAP lines for a moment, the first column (containing 0001 through 0049) is the index into the IR array. The next column contains either a register name (like rbx or xmm7) or a stack offset (like [8]) or is blank. This column isn't populated as the IR is created - instead it is populated as the IR is turned into machine code: if the result of the instruction gets assigned to a register or a stack slot, then that register or stack slot is recorded (some instructions don't have results, or don't have their result materialised, and so this column remains blank for them). The next column is either > or blank: the > symbol indicates that the instruction is a so-called guard instruction: these instructions need to be executed even if their result is otherwise unused, and these instructions are also able to "fail" (failure, if/when it happens, causes execution to leave the JIT-compiled code and return to the interpreter). The next column is either + or blank: the + symbol indicates that the instruction is used later on in a PHI instruction - as with the register/stack column, this column isn't populated as the IR is recorded, and instead it is populated by the LOOP optimisation pass (as it is this optimisation pass which emits PHI instructions). The next column contains the type of the IR instruction, which in our case is one of:

The next columns contain the instruction name, and then the two (or occasionally one or three) operands to the instruction. We can run through the IR instructions in order of first appearance:

The effectiveness of the LOOP optimisation pass is really quite impressive: instructions 0001 through 0022 disappear completely, as do 0026 and 0028. The bytecode performs seven table lookups per iteration (n, math, floor, math, sqrt, r, n). Common-subexpression-elimination and load-forwarding during the bytecode to IR conversion causes the IR before the LOOP instruction to contain just five HREFK instructions (i.e. n and math are looked up once each rather than twice each). Despite the table store to n within the loop, these five HREFK instructions are all determined to be loop-invariant (good alias analysis is to thank here). The HLOAD instructions for math, floor, sqrt, and r are also determined to be loop-invariant. The HLOAD for n isn't loop-invariant, but forwarding saves us, so there is no HLOAD for n after the LOOP instruction (that is, the reason for the HLOAD not being loop-invariant is the HSTORE within the loop body, but LuaJIT can forward the stored value rather than having to reload it). The HSTORE instruction for n is still done after the LOOP instruction: stores to local variables are deferred to trace exits, but stores to tables (including stores to global variables) are not deferred.
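In C-like terms, the net effect of the LOOP pass is roughly the following (load_global_num and store_global_num are hypothetical helpers standing in for the guarded HREFK/HLOAD/HSTORE sequences; this is my paraphrase, not LuaJIT output):

#include <math.h>

double load_global_num(const char *name);                 /* hypothetical */
void   store_global_num(const char *name, double value);  /* hypothetical */

static void traced_loop(int x, int limit) {
    double n  = load_global_num("n");   /* hoisted: looked up once     */
    double r  = load_global_num("r");   /* hoisted                     */
    double r2 = r * r;                  /* hoisted: IR 0022            */
    for (;;) {
        double y = floor(sqrt(r2 - (double)x * (double)x));
        n = n + y * 2 + 1;
        store_global_num("n", n);       /* HSTORE stays in the loop    */
        if (++x > limit)                /* guard: exits via a snapshot */
            break;
    }
}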

On that note, we can begin to consider the .... SNAP lines in the IR dump. Each of these lines corresponds to a so-called snapshot. A snapshot is used to transition from JIT-compiled code back to the interpreter, and consists of:

Sadly, for snapshots, -jdump doesn't report the bytecode instruction at which the interpreter will resume. It does report the deferred stores though: reading from left to right, each token between [ and ] represents a slot, with the leftmost token being slot #0. For example, [ ---- 0034 0001 ---- 0034 ] means that when exiting with this snapshot:

Call frames are denoted with | symbols in snapshots, but we'll gloss over this as it doesn't occur in our example. If an IR instruction can fail, then when it fails, the nearest preceding snapshot is used to return to the interpreter. In our example, this means instructions 0001 through 0034 use snapshot #0, 0035 uses #1, 0036 through 0046 use #2, and 0047 through 0049 use #3. It is worth dwelling on snapshots for a moment, and viewing them as a form of transactional rollback. For example, (in our case) if instruction 0001 fails, then snapshot #0 is used. If instruction 0028 fails, then snapshot #0 is still used, despite various table lookups, some arithmetic, and a call to math.sqrt all happening between instructions 0001 and 0028. This means that if instruction 0028 fails, then after restoring the interpreter state, the interpreter will repeat the lookups, the arithmetic, and the math.sqrt call (presumably it wouldn't repeat the math.floor call, as a failure of instruction 0028 would mean that math.floor no longer corresponded to the builtin floor function).

With that, we can move on to the next chunk of output from -jdump=bitmsrx, which is the generated assembly (though LuaJIT actually generates machine code directly rather than generating assembly, so what is shown is really the output of -jdump's own bundled disassembler):

---- TRACE 1 mcode 507
f125fdfa  mov dword [0x00043410], 0x1
f125fe05  movsd xmm7, [rdx+0x8]
f125fe0a  cvttsd2si ebx, xmm7
f125fe0e  xorps xmm6, xmm6
f125fe11  cvtsi2sd xmm6, ebx
f125fe15  ucomisd xmm7, xmm6
f125fe19  jnz 0xf1250010  ->0
f125fe1f  jpe 0xf1250010  ->0
f125fe25  cmp ebx, 0x7ffffffe
f125fe2b  jg 0xf1250010 ->0
f125fe31  movsd xmm7, [rdx]
f125fe35  cvttsd2si ebp, xmm7
f125fe39  xorps xmm6, xmm6
f125fe3c  cvtsi2sd xmm6, ebp
f125fe40  ucomisd xmm7, xmm6
f125fe44  jnz 0xf1250010  ->0
f125fe4a  jpe 0xf1250010  ->0
f125fe50  mov eax, [rdx-0x8]
f125fe53  mov eax, [rax+0x8]
f125fe56  cmp dword [rax+0x1c], +0x3f
f125fe5a  jnz 0xf1250010  ->0
f125fe60  mov r14d, [rax+0x14]
f125fe64  mov rdi, 0xfffffffb00054868
f125fe6e  cmp rdi, [r14+0x1d0]
f125fe75  jnz 0xf1250010  ->0
f125fe7b  lea r12d, [r14+0x1c8]
f125fe82  cmp dword [r12+0x4], 0xfffeffff
f125fe8b  jnb 0xf1250010  ->0
f125fe91  mov rdi, 0xfffffffb00048d48
f125fe9b  cmp rdi, [r14+0x518]
f125fea2  jnz 0xf1250010    ->0
f125fea8  cmp dword [r14+0x514], -0x0c
f125feb0  jnz 0xf1250010  ->0
f125feb6  mov r15d, [r14+0x510]
f125febd  cmp dword [r15+0x1c], +0x1f
f125fec2  jnz 0xf1250010  ->0
f125fec8  mov r13d, [r15+0x14]
f125fecc  mov rdi, 0xfffffffb00049150
f125fed6  cmp rdi, [r13+0x158]
f125fedd  jnz 0xf1250010  ->0
f125fee3  cmp dword [r13+0x154], -0x09
f125feeb  jnz 0xf1250010  ->0
f125fef1  mov rdi, 0xfffffffb000491e0
f125fefb  cmp rdi, [r13+0x170]
f125ff02  jnz 0xf1250010  ->0
f125ff08  cmp dword [r13+0x16c], -0x09
f125ff10  jnz 0xf1250010  ->0
f125ff16  mov rdi, 0xfffffffb0004ebc8
f125ff20  cmp rdi, [r14+0x128]
f125ff27  jnz 0xf1250010  ->0
f125ff2d  cmp dword [r14+0x124], 0xfffeffff
f125ff38  jnb 0xf1250010  ->0
f125ff3e  movsd xmm0, [r14+0x120]
f125ff47  mulsd xmm0, xmm0
f125ff4b  movsd [rsp+0x8], xmm0
f125ff51  xorps xmm1, xmm1
f125ff54  cvtsi2sd xmm1, ebp
f125ff58  mulsd xmm1, xmm1
f125ff5c  subsd xmm0, xmm1
f125ff60  cmp dword [r13+0x168], 0x000491b8
f125ff6b  jnz 0xf1250010  ->0
f125ff71  sqrtsd xmm0, xmm0
f125ff75  cmp dword [r13+0x150], 0x00049128
f125ff80  jnz 0xf1250010  ->0
f125ff86  roundsd xmm7, xmm0, 0x09
f125ff8c  addsd xmm7, xmm7
f125ff90  addsd xmm7, [r12]
f125ff96  addsd xmm7, [0x00064bc8]
f125ff9f  movsd [r12], xmm7
f125ffa5  add ebp, +0x01
f125ffa8  cmp ebp, ebx
f125ffaa  jg 0xf1250014 ->1
->LOOP:
f125ffb0  movsd xmm0, [rsp+0x8]
f125ffb6  xorps xmm5, xmm5
f125ffb9  cvtsi2sd xmm5, ebp
f125ffbd  mulsd xmm5, xmm5
f125ffc1  movaps xmm6, xmm0
f125ffc4  subsd xmm6, xmm5
f125ffc8  sqrtsd xmm0, xmm6
f125ffcc  roundsd xmm6, xmm0, 0x09
f125ffd2  addsd xmm6, xmm6
f125ffd6  addsd xmm7, xmm6
f125ffda  addsd xmm7, [0x00064bc8]
f125ffe3  movsd [r12], xmm7
f125ffe9  add ebp, +0x01
f125ffec  cmp ebp, ebx
f125ffee  jle 0xf125ffb0  ->LOOP
f125fff0  jmp 0xf125001c  ->3

The first column gives the memory address of the instruction, and the remainder of the line gives the instruction in Intel-ish syntax (which happens to be a syntax I'm fond of; AT&T syntax needs to die). I'm not going to explain the semantics of each individual instruction (there are multi-thousand page Intel manuals for that), but there are a number of interesting things to point out:

The next piece of output from -jdump=bitmsrx tells us that recording (and IR optimisation and machine code generation) of the trace has finished, and that the trace is a looping trace:

---- TRACE 1 stop -> loop

The final piece of output from -jdump=bitmsrx tells us that execution of the trace finished and snapshot #3 was used to restore the interpreter state (noting that -jdump=bitmsrx doesn't tell us when execution of a trace starts):

---- TRACE 1 exit 3

Finally, we get our estimate of π courtesy of the interpreter executing (the bytecode corresponding to) print(n / r^2):

3.1415926535059

Alternatives to short unconditional jumps on x86

When writing x86 assembly code, short unconditional jumps are sometimes needed. For example, an if statement might become:

  <condition>
  jz false_branch
  <true_branch>
  jmp end
false_branch:
  <false_branch>
end:

The jmp end instruction can also be expressed as jmp $+n, where n is the length (in bytes) of the machine code for <false_branch>. When n is small and positive, this is a short unconditional jump. These jumps tend to look ugly, and it can be entertaining (and sometimes beneficial) to consider ways of avoiding them. For example, jmp $+1 encodes as EB 01 ?? where ?? is the one byte to be jumped over. If burning a register is an option, then mov al, imm8 (encoded as B0 ??) might be an alternative (that is, the byte being jumped over becomes the imm8 value). If burning a register isn't an option, but burning flags is an option, then test al, imm8 (encoded as A8 ??) might be an alternative. If not even flags can be burnt, then nop [eax+imm8] (encoded as 0F 1F 40 ??) might be an alternative.

For jmp $+4, similar patterns can be used: mov eax, imm32 (B8 ?? ?? ?? ??), test eax, imm32 (A9 ?? ?? ?? ??), and nop [eax+imm32] (0F 1F 80 ?? ?? ?? ??) are all options. For jmp $+3 or jmp $+2, one easy option is to take a jmp $+4 pattern and replace the first one or two ??s with 00 (or any other value).
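Pulling the one-to-four-byte cases together, a toy emitter might look like this (my own sketch; it uses the flags-burning test variants, so don't use it anywhere flags are live across the skip):

#include <stddef.h>
#include <stdint.h>

/* Emit a prefix at p such that the next n bytes (1 <= n <= 4) of the
   instruction stream are consumed as immediate data rather than being
   executed. Returns the number of prefix bytes written. */
static size_t emit_skip(uint8_t *p, int n) {
    if (n == 1) {
        p[0] = 0xA8;                /* test al, imm8   */
        return 1;
    }
    p[0] = 0xA9;                    /* test eax, imm32 */
    for (int i = 0; i < 4 - n; i++)
        p[1 + i] = 0x00;            /* pad the unused immediate bytes */
    return (size_t)(5 - n);
}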

For jmp $+5, slightly more effort is required. On x86_64, we could use mov rax, imm64 for jmp $+8 and then only use five of the eight immediate bytes, but this feels slightly wasteful (and isn't an option for non-64-bit code). One option to make up five bytes is to combine a 32-bit immediate value with a ModRM byte or a SIB byte. For example, a nop instruction with an arbitrary SIB byte and 32-bit immediate looks like 0F 1F 84 ?? ?? ?? ?? ??. At the cost of burning a register, a shorter option is lea eax, [?] (8D 84 ?? ?? ?? ?? ??). With some knowledge of what we're jumping over, we can get shorter still - for example, jumping over a five-byte call rel32 instruction (E8 ?? ?? ?? ??) can be done with sub eax, imm32 (81 E8 ?? ?? ?? ??), albeit at the cost of burning both eax and flags.

If this topic tickles your fancy, some terms to google are:

cfa == rsp on x86_64

The DWARF standard, in the area of stack unwinding, uses the term CFA quite a lot. It contains the following passage explaining what the CFA is:

The call frame is identified by an address on the stack. We refer to this address as the Canonical Frame Address or CFA. Typically, the CFA is defined to be the value of the stack pointer at the call site in the previous frame (which may be different from its value on entry to the current frame).

The word "typically" in the above leaves the definition of the CFA slighty up in the air. A subsequent example from the document seems to defer the definition to the platform's ABI committee: (emphasis mine)

The following example uses a hypothetical RISC machine in the style of the Motorola 88000.

...

  • There are 8 4-byte registers:
    • ...
    • R7 stack pointer.
  • The stack grows in the negative direction.
  • The architectural ABI committee specifies that the stack pointer (R7) is the same as the CFA.

We now move from the DWARF standard to the x86_64 ABI, which contains the following squirreled away in the function reference:

_Unwind_GetCFA

uint64 _Unwind_GetCFA(struct _Unwind_Context *context);

This function returns the 64-bit Canonical Frame Address which is defined as the value of %rsp at the call site in the previous frame.

With this, we can happily say that on x86_64, cfa == rsp (at the call site in the previous frame). This statement has a few consequences:

  1. The unwind rule for cfa specifies how to restore rsp. Combined with the stack growing downwards, this is why the cfa unwind rule is typically cfa = rsp + N or cfa = rbp + N.
  2. Unwind rules can be specified for rsp, but they have no effect - the implicit rule rsp = cfa is always used for restoring rsp.
  3. If you want to specify an exotic unwind rule for rsp, then you have to instead use an exotic rule to restore cfa. In turn, this might make it difficult to specify unwind rules for other registers (as said rules are typically specified relative to cfa) - one workaround is to use two call frames: one to restore rsp and one to restore everything else.
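To make points 1 and 2 concrete, here is a toy unwind step in C, assuming the common rules cfa = rsp + 16, rip = [cfa - 8], and rbp = [cfa - 16] (the offsets are hypothetical, not taken from any real FDE):

#include <stdint.h>

typedef struct { uint64_t rip, rsp, rbp; } regs_t;

static void unwind_step(regs_t *r) {
    uint64_t cfa = r->rsp + 16;           /* rule: cfa = rsp + 16   */
    r->rip = *(uint64_t *)(cfa - 8);      /* rule: rip = [cfa - 8]  */
    r->rbp = *(uint64_t *)(cfa - 16);     /* rule: rbp = [cfa - 16] */
    r->rsp = cfa;                         /* implicit: rsp = cfa    */
}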

One observation related to the workaround in point 3 is that the unwind rule for rip doesn't have to tell the truth. The unwind rule for rip is typically rip = [cfa - 8] because this matches the x86_64 ret semantics of rsp += 8; rip = [rsp - 8], but even if your function returns via ret, you aren't obliged to specify rip = [cfa - 8] as the unwind rule for rip: you can specify whatever small white lie you need to get the correct unwind behaviour.
