Estimating π with LuaJIT
There are various ways of estimating π, though one cheap and cheerful way is to estimate the area of a circle: imagine a grid of 1x1 squares, draw a perfect circle of radius r, and estimate the area of the circle as being the number of squares whose midpoint is within the circle. We can implement this in a few lines of Lua:
r = 1000
for x = -r, r do
  for y = -r, r do
    if x^2 + y^2 <= r^2 then
      n = (n or 0) + 1
    end
  end
end
print(n / r^2)
With a small amount of thought, the inner for and if can be replaced with an analytical solution: for a given x, the integers y satisfying x^2 + y^2 <= r^2 run from -floor(sqrt(r^2 - x^2)) to floor(sqrt(r^2 - x^2)), which is floor(sqrt(r^2 - x^2)) * 2 + 1 values:
r = 1000
for x = -r, r do
  n = (n or 0) + math.floor(math.sqrt(r^2 - x^2)) * 2 + 1
end
print(n / r^2)
At this point, we can bump r up to 10000000, run the above through LuaJIT, and get an estimate of 3.1415926535059 in less than a tenth of a second. Not a bad estimate for a few lines of code and a tiny amount of compute time.
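For the curious, here is a small sketch (nothing new beyond the loop above) which prints the estimate and its absolute error for a few radii, showing the approximation tighten as the grid gets finer:

-- print the estimate and its absolute error for a range of radii
for _, r in ipairs({100, 10000, 1000000}) do
  local n = 0
  for x = -r, r do
    n = n + math.floor(math.sqrt(r^2 - x^2)) * 2 + 1
  end
  print(r, n / r^2, math.abs(n / r^2 - math.pi))
end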
The estimate itself is nice, but I find it more interesting to look at what LuaJIT is doing behind the scenes with this code. To begin with, we can inspect the bytecode - if source code is an array of characters, then bytecode is an array of instructions (plus an array of constants, plus a few other bits and bobs), and the first thing LuaJIT does with any source code is turn it into bytecode. We can see the bytecode by using the -bl
argument:
$ luajit -bl pi.lua
-- BYTECODE -- pi.lua:0-6
0001 KNUM 0 0 ; 10000000
0002 GSET 0 0 ; "r"
0003 GGET 0 0 ; "r"
0004 UNM 0 0
0005 GGET 1 0 ; "r"
0006 KSHORT 2 1
0007 FORI 0 => 0029
0008 => GGET 4 1 ; "n"
0009 IST 4
0010 JMP 5 => 0012
0011 KSHORT 4 0
0012 => GGET 5 2 ; "math"
0013 TGETS 5 5 3 ; "floor"
0014 GGET 6 2 ; "math"
0015 TGETS 6 6 4 ; "sqrt"
0016 GGET 7 0 ; "r"
0017 KSHORT 8 2
0018 POW 7 7 8
0019 KSHORT 8 2
0020 POW 8 3 8
0021 SUBVV 7 7 8
0022 CALL 6 0 2
0023 CALLM 5 2 0
0024 MULVN 5 5 1 ; 2
0025 ADDVV 4 4 5
0026 ADDVN 4 4 2 ; 1
0027 GSET 4 1 ; "n"
0028 FORL 0 => 0008
0029 => GGET 0 5 ; "print"
0030 GGET 1 1 ; "n"
0031 GGET 2 0 ; "r"
0032 KSHORT 3 2
0033 POW 2 2 3
0034 DIVVV 1 1 2
0035 CALL 0 1 2
0036 RET0 0 1
Having said bytecode was an array of instructions, the first column in the above (containing 0001 through 0036) is the array index. There is actually also an instruction at array index 0 which -bl doesn't show us, which in this case is 0000 FUNCV 9. The next column contains either a blank or => - it is blank for most instructions, and is => for any instruction which is the target of a jump. The next column (KNUM, GSET, ..., RET0) contains the instruction name. The next few columns contain numbers, the meaning of which depends upon the instruction name. Finally, some instructions have a comment printed after the ;.
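If you want to poke at these fields yourself, LuaJIT ships a semi-documented jit.util library which can hand back the raw 32-bit instruction words (it's what the bundled jit/bc.lua dumper is built on). The sketch below assumes the usual LuaJIT 2.x encoding - an 8-bit opcode in the lowest byte, an 8-bit A operand next, and the rest either one 16-bit D operand or 8-bit C and B operands - and that jit.util.funcbc returns nil once the pc runs off the end:

local bit = require("bit")
local jutil = require("jit.util")

local function dump_raw(f)              -- f must be a Lua function, e.g. from loadfile
  local pc = 1
  while true do
    local ins = jutil.funcbc(f, pc)     -- raw 32-bit instruction word (assumed; nil past the end)
    if not ins then break end
    local op = bit.band(ins, 0xff)                -- opcode number
    local a  = bit.band(bit.rshift(ins, 8), 0xff) -- A operand
    local d  = bit.rshift(ins, 16)                -- D operand (or B and C combined)
    print(string.format("%04d op=%-3d A=%-3d D=%d", pc, op, a, d))
    pc = pc + 1
  end
end

dump_raw(assert(loadfile("pi.lua")))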
We can run through the various different kinds of instruction in the above (in order of first appearance):
- FUNCV - Special instruction which marks the start of a vararg Lua function (Lua turns files into functions by wrapping them in function(...) ... end, which is a vararg function with no fixed parameters). The sole number is the number of local variable slots required by the function (I'm calling them local variable slots, but they can also hold invisible variables and temporary values - I'll refrain from calling them registers, even though that is a common term for them).
- KNUM - Load a numeric constant from the array of constants into the array of local variables. The first number is the local variable index, the second number is the constant index. The comment is the value of the constant.
- GSET - Store a value into the table of globals. The first number is the index of the local variable to store, the second number is the index of the string constant containing the name of the global (numeric constants and string constants are in conceptually separate arrays, so when KNUM talks about constant #0 and GSET talks about constant #0, they are not talking about the same constant). The comment is the name of the global.
- GGET - Like GSET, but for loading out of the global table rather than storing into it. The first number is the index of the local variable to load into, the second number is the index of the string constant containing the name of the global. The comment is the name of the global.
- UNM - Perform unary negation (aka. unary minus). The second number is the index of some local variable v, and the first number is the local variable index at which to store -v.
- KSHORT - Like KNUM, but only suitable for small integer constants, and avoids an array lookup. The first number is a local variable index, and the second number is the numeric value to be stored in that local.
- FORI - Start of a numeric for loop. The second number (in this case => 0029) is the instruction to jump to if the loop body doesn't execute even once. The first number refers to four consecutive local variables (that is, a value of i refers to i, i+1, i+2, and i+3). The first three local variables respectively initially contain the loop start value (-r), the loop end value (r), and the loop step size (1). As loop iterations are performed, the first local variable is incremented by the loop step value. At the start of every loop iteration, the first local variable is copied to the fourth local variable (Lua allows the programmer to assign to the loop iteration variable during a loop, but doing so doesn't affect the loop, so the first local variable is the real loop iteration variable, and the fourth local variable is a copy which can be assigned-to without affecting the loop; see the short example after this list).
- IST - Test whether a local variable is truthy (nil and false are falsey, all other values are truthy). The sole number is the index of the local variable to test. If the local variable is truthy, the next instruction is executed, otherwise the next instruction is skipped. The instruction is printed as IST 4, though it could equally be represented as IST - 4 - the 4 is really the second argument, and the first argument is ignored (this makes IST more like the ISTC instruction, which does use its first argument).
- JMP - Causes control flow to jump to the second number (in this case => 0012). The first number is the number of live local variables (or, equivalently, the index of the first dead local variable) - this number isn't used by LuaJIT's interpreter, but it is used as a hint by the JIT compiler.
- TGETS - Perform a table lookup using a string key. The second number is the index of some local variable v which should contain a table (or something with an __index metamethod), the third number is the index of some string constant s, and the first number is the index of the local variable into which v[s] should be stored. The comment is the string constant.
- POW - Perform exponentiation (raising some value to the power of another). All three numbers are local variable indices, which if we label them a, b, and c (in that order), then the instruction does a = b ^ c (invoking metamethods if either b or c aren't numbers).
- SUBVV - Like POW, but for subtraction rather than exponentiation. That is, it does a = b - c, where a, b, and c are the three local variable indices given as numbers after the instruction name (the VV suffix indicates that b and c are local variable indices; there is also SUBVN where b and c are a local variable index and a numeric constant index respectively, and SUBNV where b and c are a numeric constant index and a local variable index respectively).
- CALL - Call a function (or something with a __call metamethod) passing a fixed number of arguments. The first number is the index of some local variable v containing the function to call. The second number (say b) is the number of return values plus one, and the third number (say c) is the number of arguments to pass plus one. The arguments are in the local variable slots immediately after v (that is, v+1, v+2, ..., v+c-1), unless you're using the GC64 mode of LuaJIT (in which case they are in v+2, v+3, ..., v+c). The return values replace v and the slots after it (that is, v, v+1, ..., v+b-2). In our case of CALL 6 0 2, there are -1 return value(s) and 1 argument(s); the -1 indicates infinity (that is, keep all the return values of the call, don't throw any away, and don't substitute nil for any missing return values).
- CALLM - Call a function (or something with a __call metamethod) passing a potentially infinite number of arguments. The first two numbers are the same as for CALL: the index of some local variable v containing the function to call, and the number of return values plus one. The third number is added to the number of values generated by the previous instruction to give the number of arguments (CALLM must immediately follow something which returns an "infinite" number of values). The location of arguments and return values is the same as for CALL.
- MULVN - Perform multiplication by a numeric constant. The first two numbers, say a and b, are local variable indices. The third number, say c, is a numeric constant index. The instruction does a = b * c. The comment is the value of the numeric constant c.
- ADDVV - Like SUBVV, but for addition rather than subtraction. That is, all three numbers are local variable indices, and it does a = b + c.
- ADDVN - Like ADDVV, but where the third number is a numeric constant index rather than a local variable index. The comment is the value of the numeric constant.
- FORL - End of a numeric for loop. The second number (in this case => 0008) is the instruction to jump to if the loop body is to be executed again. The first number is as for FORI.
- DIVVV - Like SUBVV, but for division rather than subtraction.
- RET0 - Return to the calling function, returning zero values. The two numbers are always 0 and 1 (to make RET0 look like other less-specialised RET instructions).
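As promised in the FORI entry, the business about the loop variable being a copy is easy to observe from plain Lua, without looking at any bytecode:

for i = 1, 3 do
  print(i)   -- prints 1, 2, 3
  i = 100    -- assigns to the visible copy; the hidden loop counter is unaffected
end

The assignment only touches the visible copy (the fourth slot); the hidden loop counter carries on regardless.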
Phew. That gets us through the bytecode listing. There's a lot of it, some of it is executed just once, and other bits of it are executed many many times. The LuaJIT interpreter doesn't care though; it'll happily churn through bytecode all day long. At some point though, the JIT part of LuaJIT comes into play. To explore this part, we'll switch from -bl (which dumps bytecode instead of running it) to -jdump=bitmsrx (which runs bytecode and also dumps various pieces of information as JIT compilation happens and JIT-compiled code executes):
$ luajit -jdump=bitmsrx pi.lua
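As an aside, -jdump=... is just shorthand for loading the bundled jit.dump module, so (assuming I'm remembering its on/off interface correctly) the same output can be switched on from inside a script rather than on the command line:

-- roughly equivalent to passing -jdump=bitmsrx on the command line;
-- a second argument naming an output file is (I believe) also accepted
local dump = require("jit.dump")
dump.on("bitmsrx")
-- ... run the hot code here ...
dump.off()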
The first thing to be output by -jdump=bitmsrx
is the following:
---- TRACE 1 start pi.lua:2
0008 GGET 4 1 ; "n"
0009 IST 4
0010 JMP 5 => 0012
0012 GGET 5 2 ; "math"
0013 TGETS 5 5 3 ; "floor"
0014 GGET 6 2 ; "math"
0015 TGETS 6 6 4 ; "sqrt"
0016 GGET 7 0 ; "r"
0017 KSHORT 8 2
0018 POW 7 7 8
0019 KSHORT 8 2
0020 POW 8 3 8
0021 SUBVV 7 7 8
0022 CALL 6 0 2
0000 . FUNCC ; math.sqrt
0023 CALLM 5 2 0
0000 . FUNCC ; math.floor
0024 MULVN 5 5 1 ; 2
0025 ADDVV 4 4 5
0026 ADDVN 4 4 2 ; 1
0027 GSET 4 1 ; "n"
0028 FORL 0 => 0008
This tells us that trace #1 starts at line 2 of pi.lua, and is followed by the bytecode instructions which comprise the trace. This should look similar to the -bl dump from earlier, albeit with a few differences. We start at instruction index 0008 - the JIT compiler doesn't compile everything, only the things it thinks are worth compiling, and in this case it thinks that the loop body (which starts at instruction 0008) is worth compiling. The column which contained either a blank or => is gone - traces are strictly linear sequences with no branches, and hence no jump targets within them, and hence no need for => jump target indicators. Instruction 0011 isn't listed in the trace, as it isn't part of the trace - in the particular flow of execution recorded by the JIT compiler, the 0 branch of (n or 0) wasn't taken. The other major difference happens at function calls: when a call happens, tracing follows the call, and starts including bytecode instructions from the called function in the trace. The function known as math.sqrt consists of the single bytecode instruction 0000 FUNCC, the effect of which is three-fold:
- Start a C function (c.f. the FUNCV instruction we saw earlier for starting a vararg Lua function).
- Go off and run the C code associated with the function.
- Return to the calling function (c.f. the RET0 instruction we saw earlier for returning from a Lua function).
Like math.sqrt, math.floor also consists of just a single bytecode instruction. In both cases, the bytecode instruction is included in the trace; the . marker between the array index and the instruction name denotes a call frame level (. . denotes two levels of call frame, etc.).
Actually, -jdump=bitmsrx is telling us a lie: math.sqrt does consist of just a single bytecode instruction, and that instruction does do steps 1 and 3 from the above list, but its step 2 is "Go off and run the assembly code for math.sqrt". This super-specialised bytecode instruction is only used by math.sqrt, and doesn't have a name per se, so reporting its name as FUNCC is perhaps not the worst lie in the world. Similarly, math.floor consists of a single super-specialised bytecode instruction (not all standard library functions follow this pattern - some are just plain old C functions - but most of the math library happens to be implemented in assembly rather than C).
We talk about bytecode instructions being included in a trace, but the bytecode isn't actually retained in the trace. Instead, as each bytecode instruction is executed, some so-called IR instructions are appended to the trace. After bytecode recording has finished for the trace, we get the next chunk of output from -jdump=bitmsrx, which is the full list of IR instructions:
---- TRACE 1 IR
.... SNAP #0 [ ---- ]
0001 rbx > int SLOAD #2 CRI
0002 > int LE 0001 +2147483646
0003 rbp > int SLOAD #1 CI
0004 rax fun SLOAD #0 R
0005 rax tab FLOAD 0004 func.env
0006 int FLOAD 0005 tab.hmask
0007 > int EQ 0006 +63
0008 r14 p32 FLOAD 0005 tab.node
0009 r12 > p32 HREFK 0008 "n" @19
0010 > num HLOAD 0009
0011 > p32 HREFK 0008 "math" @54
0012 r15 > tab HLOAD 0011
0013 int FLOAD 0012 tab.hmask
0014 > int EQ 0013 +31
0015 r13 p32 FLOAD 0012 tab.node
0016 > p32 HREFK 0015 "floor" @14
0017 > fun HLOAD 0016
0018 > p32 HREFK 0015 "sqrt" @15
0019 > fun HLOAD 0018
0020 > p32 HREFK 0008 "r" @12
0021 xmm0 > num HLOAD 0020
0022 [8] num MUL 0021 0021
0023 xmm1 num CONV 0003 num.int
0024 xmm1 num MUL 0023 0023
0025 xmm0 num SUB 0022 0024
0026 > fun EQ 0019 math.sqrt
0027 xmm0 num FPMATH 0025 sqrt
0028 > fun EQ 0017 math.floor
0029 xmm7 num FPMATH 0027 floor
0030 xmm7 num ADD 0029 0029
0031 xmm7 num ADD 0030 0010
0032 xmm7 + num ADD 0031 +1
0033 num HSTORE 0009 0032
0034 rbp + int ADD 0003 +1
.... SNAP #1 [ ---- ]
0035 > int LE 0034 0001
.... SNAP #2 [ ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037 xmm5 num CONV 0034 num.int
0038 xmm5 num MUL 0037 0037
0039 xmm6 num SUB 0022 0038
0040 xmm0 num FPMATH 0039 sqrt
0041 xmm6 num FPMATH 0040 floor
0042 xmm6 num ADD 0041 0041
0043 xmm7 num ADD 0042 0032
0044 xmm7 + num ADD 0043 +1
0045 num HSTORE 0009 0044
0046 rbp + int ADD 0034 +1
.... SNAP #3 [ ---- ]
0047 > int LE 0046 0001
0048 rbp int PHI 0034 0046
0049 xmm7 num PHI 0032 0044
Ignoring the .... SNAP lines for a moment, the first column (containing 0001 through 0049) is the index into the IR array. The next column contains either a register name (like rbx or xmm7) or a stack offset (like [8]) or is blank. This column isn't populated as the IR is created - instead it is populated as the IR is turned into machine code: if the result of the instruction gets assigned to a register or a stack slot, then that register or stack slot is recorded (some instructions don't have results, or don't have their result materialised, and so this column remains blank for them). The next column contains either > or a blank: the > symbol indicates that the instruction is a so-called guard instruction: these instructions need to be executed even if their result is otherwise unused, and these instructions are also able to "fail" (failure, if/when it happens, causes execution to leave the JIT-compiled code and return to the interpreter). The next column contains either + or a blank: the + symbol indicates that the instruction is used later on in a PHI instruction - as with the register/stack column, this column isn't populated as the IR is recorded, and instead it is populated by the LOOP optimisation pass (as it is this optimisation pass which emits PHI instructions). The next column contains the type of the IR instruction, which in our case is one of:
- int - 32-bit signed integer (arising either from a 32-bit field which the IR accesses, or from a numeric local variable which LuaJIT determined would be better manipulated as an integer than as a double-precision float).
- fun - Pointer to a GC-owned function (e.g. the pointer address you see when calling tostring on a function value).
- tab - Pointer to a GC-owned table (e.g. the pointer address you see when calling tostring on a table value).
- p32 - Arbitrary 32-bit pointer (on 32-bit systems, this is a bog standard pointer; on 64-bit systems it is a pointer to something in the low 4GB of address space).
The next columns contain the instruction name, and then the two (or occasionally one or three) operands to the instruction. We can run through the IR instructions in order of first appearance:
- SLOAD - Load a value from a slot. Most of the time, slots correspond to local variables, though not all of the time, and the numbering doesn't match up. Each Lua call frame uses a consecutive range of slots: the first few slots are used to store metadata about the call frame, and the remaining slots are used to store local variables. Bytecode instructions cannot refer to call frame metadata, so their local variable indices use 0 to refer to the first local variable slot. IR instructions can refer to call frame metadata, so slot #0 refers to the first piece of frame metadata. For now, we'll say that there is only ever one slot of metadata (though continuation frames and GC64 mode both double the amount of metadata), meaning that slot #1 refers to the first local variable slot, slot #2 refers to the next local variable slot, and so on. The first operand is the slot number to load, and the second operand is a bitmask of the following:
  - P - Coalesce with parent trace (our example doesn't have parent traces, so let's gloss over this).
  - F - Load the "ft/sz" part of a frame metadata slot (our example doesn't use this, so let's gloss over it).
  - T - The generated machine code needs to check that the value in the slot is of the expected type.
  - C - Convert a value of type number (i.e. double) to an int as part of loading it.
  - R - Read-only slot; no need to perform a deferred store on trace exit (we'll cover deferred stores later).
  - I - Inherited by exits/side traces (our example doesn't have side traces, so let's gloss over this).
- LE - Test whether the first operand is less than or equal to the second operand. The instruction must have the same type as both of its operands, and the type determines the kind of less-than-or-equal comparison to perform. The instruction fails if the first operand is not less than or equal to the second. The operands are either IR instructions (expressed as four digits, e.g. 0001) or constants (expressed with a leading + or -, e.g. +2147483646).
- FLOAD - Load a well-known field of a well-known structure type. The first operand is an IR instruction which gives a pointer to a well-known structure type, the second operand names the field and the structure. The type of the first operand must match the structure named in the second operand, and the type of the instruction must match the field named in the second operand. In our case, we see the following for the second operand:
  - func.env - The pointer to the environment table (aka. global table) of the given function object.
  - tab.hmask - The size of the hash part of the given table object (actually the mask, but size and mask are trivially related).
  - tab.node - The pointer to the hash part of the given table object.
- EQ - Like LE, but for testing equality rather than less-than-or-equal. Additionally, constant operands can be pointers, and these pointers can be expressed in a variety of means. In our case, math.sqrt and math.floor both refer to pointers (the IR really does contain pointer values, and it is merely the pretty-printing done by -jdump=bitmsrx which gives friendly names to those pointer values).
- HREFK - Look up a constant key in the hash part of a table. The first operand must refer to an FLOAD of tab.node, the second operand is the constant key, and the third operand is the index within the table at which we expect to find the key (unlike other JIT compilers, LuaJIT doesn't do shape or hidden-type analysis, instead it has hash table index hints for constant keys). The instruction fails if the key is not present (or is present, but isn't at the expected index). The instruction type is always p32 (or p64 in GC64 mode), as the instruction gives a pointer into the hash table.
- HLOAD - Load a value from the hash part of a table. The sole operand must refer to an HREF, HREFK, or NEWREF instruction (one of these three instructions performs the key lookup, and then HLOAD performs the value load). The type of the instruction specifies the type of value we expect to load, and the instruction fails if the type of the value in the table is not this.
- MUL - Multiply two values. Like LE, the instruction must have the same type as both of its operands, and the type determines the kind of multiplication to perform. Again, like LE, the operands are either IR instructions (expressed as four digits, e.g. 0001) or constants (expressed with a leading + or -, e.g. +2147483646). Of note is that the POW instructions in the bytecode have been turned into MUL instructions in the IR (as v ^ 2 is the same as v * v, and the latter is much cheaper to compute). Also of note is that the MULVN instruction in the bytecode didn't turn into a MUL instruction in the IR (as v * 2 is the same as v + v, and the latter is slightly cheaper to compute).
- CONV - Convert a value from one type to another. In our case, from an int (int32_t) to a num (double). This occurs because LuaJIT narrowed the loop iteration variable from a double to an int, and needs to convert it back again before performing arithmetic which might overflow.
- SUB - Like MUL, but performs subtraction rather than multiplication.
- FPMATH - Perform a well-known unary floating-point function. The first operand specifies the input to the function, and the second operand names the function. The first operand and the instruction itself must both have type num.
- ADD - Like MUL, but performs addition rather than multiplication.
- HSTORE - Store a value into the hash part of a table. The first operand must refer to an HREF, HREFK, or NEWREF instruction. The second operand specifies the value to be stored, and the type of the instruction must have the same type as the second operand.
- LOOP - Marker instruction inserted by the LOOP optimisation pass. The instructions prior to the LOOP instruction correspond to one iteration of the loop body. The instructions after the LOOP instruction also correspond to one iteration of the loop body. The difference is that every instruction after the LOOP instruction can assume that all of the instructions before the LOOP instruction have already successfully executed. Consequently, there are generally far fewer instructions after LOOP than before it. Loop-invariant expressions (like math.floor, math.sqrt, and r^2 in our case) drop out of this optimisation: the expression is computed for the first time before the LOOP instruction, and then after the LOOP instruction the previously computed value can be re-used. Loop-invariant guard instructions (like EQ and LE) drop out similarly: provided the loop body doesn't affect the guard condition, the guard only needs to be checked before the LOOP instruction, and can be omitted afterwards.
- PHI - Pseudo-instruction inserted by the LOOP optimisation pass. Wherever the first operand appears after the LOOP instruction, it actually refers to a phi of the first and second operands. See Wikipedia's article on SSA for details of phi functions, but in short, the two operands of the PHI instruction need to be assigned to the same register in order to make the loop behave properly.
The effectiveness of the LOOP optimisation pass is really quite impressive: instructions 0001 through 0022 disappear completely, as do 0026 and 0028. The bytecode performs seven table lookups per iteration (n, math, floor, math, sqrt, r, n). Common-subexpression-elimination and load-forwarding during the bytecode to IR conversion causes the IR before the LOOP instruction to contain just five HREFK instructions (i.e. n and math are looked up once each rather than twice each). Despite the table store to n within the loop, these five HREFK instructions are all determined to be loop-invariant (good alias analysis is to thank here). The HLOAD instructions for math, floor, sqrt, and r are also determined to be loop-invariant. The HLOAD for n isn't loop-invariant, but forwarding saves us, so there is no HLOAD for n after the LOOP instruction (that is, the reason for the HLOAD not being loop-invariant is the HSTORE within the loop body, but LuaJIT can forward the stored value rather than having to reload it). The HSTORE instruction for n is still done after the LOOP instruction: stores to local variables are deferred to trace exits, but stores to tables (including stores to global variables) are not deferred.
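One way to internalise what the LOOP pass (together with forwarding) has bought us is to imagine the equivalent source-level rewrite. This isn't what LuaJIT actually does - it operates on the IR, not on the source - but the steady-state loop ends up doing roughly the work of this hand-hoisted sketch:

local floor, sqrt = math.floor, math.sqrt   -- table lookups done (and guarded) once
local r2 = r * r                            -- loop-invariant arithmetic done once
local acc = n or 0                          -- the forwarded value of the global n
for x = -r, r do
  acc = acc + floor(sqrt(r2 - x * x)) * 2 + 1
  n = acc                                   -- the HSTORE to the global still happens every iteration
end

The store to the global n is the one piece of per-iteration bookkeeping which can't be removed; stores to local variable slots, by contrast, are deferred until the trace exits.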
On that note, we can begin to consider the .... SNAP
lines in the IR dump. Each of these lines corresponds to a so-called snapshot. A snapshot is used to transition from JIT-compiled code back to the interpreter, and consists of:
- The bytecode instruction at which the interpreter should resume.
- Deferred slot-stores which need to be performed (the IR has an SLOAD instruction, but no corresponding SSTORE instruction, and instead all slot-stores are deferred until trace exit).
- The slot index of the innermost call frame metadata (in our case this is always zero, but it can be non-zero when tracing follows a CALL instruction into a function and the called function contains some IR which can fail).
Sadly, for snapshots, -jdump
doesn't report the bytecode instruction at which the interpreter will resume. It does report the deferred stores though: reading from left to right, each token between [
and ]
represents a slot, with the leftmost token being slot #0. For example, [ ---- 0034 0001 ---- 0034 ]
means that when exiting with this snapshot:
- Slot #0 (call frame metadata) has not been modified.
- Slot #1 (local variable 0) should be overwritten with the result of IR instruction 0034.
- Slot #2 (local variable 1) should be overwritten with the result of IR instruction 0001.
- Slot #3 (local variable 2) has not been modified.
- Slot #4 (local variable 3) should be overwritten with the result of IR instruction 0034.
Call frames are denoted with | symbols in snapshots, but we'll gloss over this as it doesn't occur in our example. If an IR instruction can fail, then when it fails, the nearest preceding snapshot is used to return to the interpreter. In our example, this means instructions 0001 through 0034 use snapshot #0, 0035 uses #1, 0036 through 0046 use #2, and 0047 through 0049 use #3. It is worth dwelling on snapshots for a moment, and viewing them as a form of transactional rollback. For example, (in our case) if instruction 0001 fails, then snapshot #0 is used. If instruction 0028 fails, then snapshot #0 is still used, despite various table lookups, some arithmetic, and a call to math.sqrt all happening between instructions 0001 and 0028. This means that if instruction 0028 fails, then after restoring the interpreter state, the interpreter will repeat the lookups, the arithmetic, and the math.sqrt call (presumably it wouldn't repeat the math.floor call, as a failure of instruction 0028 would mean that math.floor no longer corresponded to the builtin floor function).
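It's also worth spelling out why the fun EQ guards exist at all: globals and table fields are mutable, so nothing stops a program from replacing math.floor or math.sqrt at runtime, and the compiled trace is only correct while they are still the builtins it specialised on. A contrived (hypothetical) illustration in plain Lua:

-- if a program does something like this, the trace's fun EQ guard on
-- math.floor will fail the next time it is checked, and execution falls
-- back to the interpreter, which then calls the replacement function
math.floor = function(v) return v end   -- hypothetical replacement
print(math.floor(3.7))                  -- prints 3.7, not 3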
With that, we can move on to the next chunk of output from -jdump=bitmsrx, which is the generated assembly (though LuaJIT actually generates machine code directly rather than generating assembly, so what is shown is really the output of -jdump's own bundled disassembler):
---- TRACE 1 mcode 507
f125fdfa mov dword [0x00043410], 0x1
f125fe05 movsd xmm7, [rdx+0x8]
f125fe0a cvttsd2si ebx, xmm7
f125fe0e xorps xmm6, xmm6
f125fe11 cvtsi2sd xmm6, ebx
f125fe15 ucomisd xmm7, xmm6
f125fe19 jnz 0xf1250010 ->0
f125fe1f jpe 0xf1250010 ->0
f125fe25 cmp ebx, 0x7ffffffe
f125fe2b jg 0xf1250010 ->0
f125fe31 movsd xmm7, [rdx]
f125fe35 cvttsd2si ebp, xmm7
f125fe39 xorps xmm6, xmm6
f125fe3c cvtsi2sd xmm6, ebp
f125fe40 ucomisd xmm7, xmm6
f125fe44 jnz 0xf1250010 ->0
f125fe4a jpe 0xf1250010 ->0
f125fe50 mov eax, [rdx-0x8]
f125fe53 mov eax, [rax+0x8]
f125fe56 cmp dword [rax+0x1c], +0x3f
f125fe5a jnz 0xf1250010 ->0
f125fe60 mov r14d, [rax+0x14]
f125fe64 mov rdi, 0xfffffffb00054868
f125fe6e cmp rdi, [r14+0x1d0]
f125fe75 jnz 0xf1250010 ->0
f125fe7b lea r12d, [r14+0x1c8]
f125fe82 cmp dword [r12+0x4], 0xfffeffff
f125fe8b jnb 0xf1250010 ->0
f125fe91 mov rdi, 0xfffffffb00048d48
f125fe9b cmp rdi, [r14+0x518]
f125fea2 jnz 0xf1250010 ->0
f125fea8 cmp dword [r14+0x514], -0x0c
f125feb0 jnz 0xf1250010 ->0
f125feb6 mov r15d, [r14+0x510]
f125febd cmp dword [r15+0x1c], +0x1f
f125fec2 jnz 0xf1250010 ->0
f125fec8 mov r13d, [r15+0x14]
f125fecc mov rdi, 0xfffffffb00049150
f125fed6 cmp rdi, [r13+0x158]
f125fedd jnz 0xf1250010 ->0
f125fee3 cmp dword [r13+0x154], -0x09
f125feeb jnz 0xf1250010 ->0
f125fef1 mov rdi, 0xfffffffb000491e0
f125fefb cmp rdi, [r13+0x170]
f125ff02 jnz 0xf1250010 ->0
f125ff08 cmp dword [r13+0x16c], -0x09
f125ff10 jnz 0xf1250010 ->0
f125ff16 mov rdi, 0xfffffffb0004ebc8
f125ff20 cmp rdi, [r14+0x128]
f125ff27 jnz 0xf1250010 ->0
f125ff2d cmp dword [r14+0x124], 0xfffeffff
f125ff38 jnb 0xf1250010 ->0
f125ff3e movsd xmm0, [r14+0x120]
f125ff47 mulsd xmm0, xmm0
f125ff4b movsd [rsp+0x8], xmm0
f125ff51 xorps xmm1, xmm1
f125ff54 cvtsi2sd xmm1, ebp
f125ff58 mulsd xmm1, xmm1
f125ff5c subsd xmm0, xmm1
f125ff60 cmp dword [r13+0x168], 0x000491b8
f125ff6b jnz 0xf1250010 ->0
f125ff71 sqrtsd xmm0, xmm0
f125ff75 cmp dword [r13+0x150], 0x00049128
f125ff80 jnz 0xf1250010 ->0
f125ff86 roundsd xmm7, xmm0, 0x09
f125ff8c addsd xmm7, xmm7
f125ff90 addsd xmm7, [r12]
f125ff96 addsd xmm7, [0x00064bc8]
f125ff9f movsd [r12], xmm7
f125ffa5 add ebp, +0x01
f125ffa8 cmp ebp, ebx
f125ffaa jg 0xf1250014 ->1
->LOOP:
f125ffb0 movsd xmm0, [rsp+0x8]
f125ffb6 xorps xmm5, xmm5
f125ffb9 cvtsi2sd xmm5, ebp
f125ffbd mulsd xmm5, xmm5
f125ffc1 movaps xmm6, xmm0
f125ffc4 subsd xmm6, xmm5
f125ffc8 sqrtsd xmm0, xmm6
f125ffcc roundsd xmm6, xmm0, 0x09
f125ffd2 addsd xmm6, xmm6
f125ffd6 addsd xmm7, xmm6
f125ffda addsd xmm7, [0x00064bc8]
f125ffe3 movsd [r12], xmm7
f125ffe9 add ebp, +0x01
f125ffec cmp ebp, ebx
f125ffee jle 0xf125ffb0 ->LOOP
f125fff0 jmp 0xf125001c ->3
The first column gives the memory address of the instruction, and the remainder of the line gives the instruction in Intel-ish syntax (which happens to be a syntax I'm fond of; AT&T syntax needs to die). I'm not going to explain the semantics of each individual instruction (there are multi-thousand page Intel manuals for that), but there are a number of interesting things to point out:
- The first instruction, mov dword [0x00043410], 0x1, is writing its own trace number to the global_State::vmstate field.
- The next instruction, movsd xmm7, [rdx+0x8], is loading slot #2 into xmm7 (on trace entry, rdx points to the first local variable slot, and slots are 8 bytes wide). The next six instructions (cvttsd2si, xorps, cvtsi2sd, ucomisd, jnz, jpe) convert xmm7 from a double to an int (in ebx), and then verify that the conversion was lossless. Note the jpe, which checks xmm7 wasn't NaN, which is really a check that the local variable in slot #2 was of type number (LuaJIT represents all values of all other types in a way which looks like NaN). In aggregate, these seven instructions correspond to the single IR instruction 0001 rbx > int SLOAD #2 CRI (the IR dump mentions rbx, but in this case it means the low 32-bits of rbx, i.e. ebx).
- Jumps which correspond to failed instructions, like jnz 0xf1250010 ->0, list two things: the absolute address of a so-called exit stub (f1250010 in this case), and the snapshot number which the stub corresponds to (0 in this case). An exit stub is a small piece of code which effectively does push <snapshot-number>; jmp lj_vm_exit_handler. In turn, lj_vm_exit_handler is a piece of assembly code which captures the contents of every register, calls lj_trace_exit, and then resumes the interpreter. In turn, lj_trace_exit is a piece of C code which takes the snapshot number and the current trace number and the captured register state, and uses all of this to perform the deferred slot stores specified by the snapshot (along with setting the interpreter's instruction pointer to the resumption point specified in the snapshot, and a bunch of other things, but those are details for another time).
- mov eax, [rdx-0x8] is performing 0004 rax fun SLOAD #0 R - this SLOAD has no type conversion, and also cannot fail, so it turns into just one assembly instruction. The next instruction, mov eax, [rax+0x8], is performing 0005 rax tab FLOAD 0004 func.env (i.e. the env field is at offset 8 of the func struct).
- cmp dword [rax+0x1c], +0x3f corresponds to one and a half IR instructions: it is all of 0006 int FLOAD 0005 tab.hmask, and also the first half of 0007 > int EQ 0006 +63 (0x3f is 63 in hex). We say that the memory load has been fused into the comparison operand.
- The four-instruction sequence mov rdi, 0xfffffffb00054868, cmp rdi, [r14+0x1d0], jnz 0xf1250010 ->0, lea r12d, [r14+0x1c8] corresponds to the single IR instruction 0009 r12 > p32 HREFK 0008 "n" @19: each key-value pair in a hash table takes up 24 bytes, and 19 times 24 is 0x1c8 (plus 8 is 0x1d0; the value is at 0 mod 24, the key is at 8 mod 24, and a chain pointer is at 16 mod 24). The constant 0xfffffffb00054868 is really two 32-bit constants in one: fffffffb is the type code for strings, and 00054868 is the interned address of the string "n" (LuaJIT interns all strings, so string equality is pointer equality).
- The two-instruction sequence cmp dword [r12+0x4], 0xfffeffff, jnb 0xf1250010 ->0 is the type-test from 0010 > num HLOAD 0009, that is, it is checking that the type of the value in the key-value pair is a number. Note that the "actual" load of the value from the key-value pair is done much later as part of addsd xmm7, [r12].
- jg 0xf1250014 ->1 is the first instruction to refer to exit stub 1, from which we can infer that each exit stub is a mere four bytes long (as ->0 was at 0xf1250010). How you fit push <snapshot-number>; jmp lj_vm_exit_handler into four bytes is left as an exercise to the reader (look at asm_exitstub_gen in lj_asm_x86.h to determine whether your answer is correct).
- The lookup and call to math.sqrt, which is Lord-knows how many CPU instructions when executed by the interpreter, becomes the single instruction sqrtsd xmm0, xmm6.
- The lookup and call to math.floor has become roundsd xmm6, xmm0, 0x09. This instruction is only available on CPUs which support SSE 4.1 - had the example been run on a machine which didn't support SSE 4.1, then LuaJIT would have emitted something else (namely, a call to lj_vm_floor_sse, which is an assembly function with a slightly special calling convention).
- addsd xmm7, [0x00064bc8] corresponds to either 0032 xmm7 + num ADD 0031 +1 or 0044 xmm7 + num ADD 0043 +1: there is no encoding of addsd which accepts an immediate, so LuaJIT has put (double)1.0 at memory address 0x00064bc8.
The next piece of output from -jdump=bitmsrx
tells us that recording (and IR optimisation and machine code generation) of the trace has finished, and that the trace is a looping trace:
---- TRACE 1 stop -> loop
The final piece of output from -jdump=bitmsrx
tells us that execution of the trace finished and snapshot #3 was used to restore the interpreter state (noting that -jdump=bitmsrx
doesn't tell us when execution of a trace starts):
---- TRACE 1 exit 3
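As an aside, the exit line above comes from the x flag; if you'd rather collect statistics than read log lines, the dump machinery sits on top of the semi-documented jit.attach callback interface. Assuming the texit event passes the trace number and exit number as the first two callback arguments (as jit/dump.lua's own handler appears to expect), a bare-bones exit counter looks something like this:

-- count how often each (trace, exit) pair is taken; the callback argument
-- layout here is an assumption based on jit/dump.lua, not documented API
local exit_counts = {}
jit.attach(function(tr, ex)
  local key = tr .. ":" .. ex
  exit_counts[key] = (exit_counts[key] or 0) + 1
end, "texit")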
Finally, we get our estimate of π courtesy of the interpreter executing (the bytecode corresponding to) print(n / r^2)
:
3.1415926535059