Linux/ELF .eh_frame from the bottom up

Problem statement

Given the current register state for a thread, and read-only access to memory, what would the register state hypothetically become if the current function was to immediately return and execution was to resume in its caller?

Notably, the register state includes (amongst other things) an instruction pointer and a stack pointer, so this operation can be iterated to generate a backtrace. With some extensions to support calling destructors and catching exceptions, this operation can also be the basis of throwing exceptions.

This operation is typically called unwinding (as in "unwinding a call frame" or "unwinding the stack").

If certain registers are always undefined immediately after a function return (because they are volatile in the ABI), then we needn't care about their state.

Portability

The size and contents of the register state depends on the architecture being used. We'll paper over this slightly by referring to registers by ordinal rather than by name, and rely on DWARF register number mappings to go between ordinal and name, for example:

Ordinal | x86 name | x86_64 name | aarch64 name
0       | eax      | rax         | X0
1       | ecx      | rdx         | X1
2       | edx      | rcx         | X2
3       | ebx      | rbx         | X3
4       | esp      | rsi         | X4
5       | ebp      | rdi         | X5
6       | esi      | rbp         | X6
7       | edi      | rsp         | X7
8       |          | r8          | X8
9       |          | r9          | X9
10      |          | r10         | X10
11      |          | r11         | X11
12      |          | r12         | X12
13      |          | r13         | X13
14      |          | r14         | X14
15      |          | r15         | X15
16      |          |             | X16
30      |          |             | X30 (LR)
31      |          |             | SP

Because everyone loves special cases, the instruction pointer register (e.g. eip / rip / PC) isn't given an ordinal. On aarch64, it doesn't need one: the return address is always in some register (usually X30, aka. LR), and so we can use the return address register to represent the instruction pointer. In contrast, x86 and x86_64 put the return address on the stack rather than in a register. For these, we'll invent a fake register (typically ordinal 8 on x86, ordinal 16 on x86_64), pretend that it contains the return address, and use it to represent the instruction pointer.

Floating-point and vector registers are generally volatile in Linux ABIs, so we needn't worry about them too much.

Start small: unwinding just the stack pointer

In most cases, the hypothetical stack pointer after function return can be easily calculated by adding some value to some register. This can be expressed as two DWARF-style ULEB128 values: one ULEB128 to specify which register to start with (often the current stack pointer), then one ULEB128 to specify the amount to add. As foreshadowing, we'll call this DW_CFA_def_cfa.
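Both operands use ULEB128, DWARF's little-endian base-128 varint: seven payload bits per byte, least significant group first, with the high bit set on every byte except the last. Its signed sibling SLEB128 turns up shortly (for data_align). A minimal decoder sketch (the function names are mine):

#include <stdint.h>

/* Decode one ULEB128 value, advancing *p past it. */
uint64_t DecodeUleb128(const uint8_t** p) {
  uint64_t result = 0;
  unsigned shift = 0;
  uint8_t byte;
  do {
    byte = *(*p)++;
    result |= (uint64_t)(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  return result;
}

/* Decode one SLEB128 value; as above, but sign-extended. */
int64_t DecodeSleb128(const uint8_t** p) {
  uint64_t result = 0;
  unsigned shift = 0;
  uint8_t byte;
  do {
    byte = *(*p)++;
    result |= (uint64_t)(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  if (shift < 64 && (byte & 0x40))
    result |= ~(uint64_t)0 << shift; /* sign extend from final group */
  return (int64_t)result;
}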

Unwinding other registers

In most cases, if a function modifies a non-volatile register, it'll save it to somewhere on the stack before modifying it. Upon function return, it'll re-load the value from said stack slot. The locations of these stack slots can be expressed as being relative to the hypothetical stack pointer after function return. This can again be expressed as two DWARF-style ULEB128 values: one ULEB to specify which register we're talking about, then one ULEB to specify the offset of its stack slot. There's only one problem: because the offset is against the hypothetical stack pointer after function return, the required offset is going to be negative, which a ULEB can't represent. To get around this, we'll scale the 2nd ULEB by a so-called data_align value (typically -4 or -8), which also has the convenient side-effect of making the ULEB smaller. As foreshadowing, we'll call this DW_CFA_offset_extended. When the 1st ULEB is less than 64, we might also call this DW_CFA_offset.

More ways to unwind other registers

So far we've described forming the address of a stack slot, and then loading from it. There's a natural variation that skips the 2nd part: form the address of a stack slot, then set the value of a register to be that address.

There's another common variant: unwind register X by taking the value from register Y (when X equals Y, this means that said register was unchanged).

For when these common variants aren't sufficient, we'll later describe a little expression language and associated bytecode VM capable of describing all sorts of complicated mechanisms.

This leads to the following definitions to describe how to unwind registers:

enum StackPointerUnwindMechanism {
  Undefined,
  RegOffset(unsigned RegOrdinal, unsigned Offset),
  ExprResult(uint8_t DwOpBytecode[unsigned Len]),
};

enum OtherRegUnwindMechanism {
  Undefined,
  LoadFromStackSlot(int StackSlotOffset),
  AddressOfStackSlot(int StackSlotOffset),
  CopyFromRegister(unsigned RegOrdinal),
  LoadFromExprResult(uint8_t DwOpBytecode[unsigned Len]),
  ExprResult(uint8_t DwOpBytecode[unsigned Len]),
};

struct UnwindActions {
  StackPointerUnwindMechanism sp;
  OtherRegUnwindMechanism reg[NUM_REG];
};

With the associated unwind logic being:

struct RegState {
  uintptr_t value[NUM_REG];
};

RegState UnwindOneFrame(RegState* input, UnwindActions* actions) {
  RegState output;
  uintptr_t NewSP;
  switch (actions->sp) {
  case RegOffset(unsigned RegOrdinal, unsigned Offset):
    NewSP = input->value[RegOrdinal] + Offset;
    break;
  case ExprResult(uint8_t DwOpBytecode[unsigned Len]):
    NewSP = EvalExpr(DwOpBytecode, 0, input);
    break;
  }
  output.value[SP] = NewSP;
  for (unsigned i = 0; i < NUM_REG; ++i) {
    uintptr_t NewVal;
    switch (actions->reg[i]) {
    case LoadFromStackSlot(int StackSlotOffset):
      NewVal = *(uintptr_t*)(NewSP + StackSlotOffset);
      break;
    case AddressOfStackSlot(int StackSlotOffset):
      NewVal = NewSP + StackSlotOffset;
      break;
    case CopyFromRegister(unsigned RegOrdinal):
      NewVal = input->value[RegOrdinal];
      break;
    case LoadFromExprResult(uint8_t DwOpBytecode[unsigned Len]):
      NewVal = *(uintptr_t*)EvalExpr(DwOpBytecode, NewSP, input);
      break;
    case ExprResult(uint8_t DwOpBytecode[unsigned Len]):
      NewVal = EvalExpr(DwOpBytecode, NewSP, input);
      break;
    default:
      continue;
    }
    output.value[i] = NewVal;
  }
  return output;
}

A little bytecode VM is defined to populate an instance of UnwindActions:

Opcode     | Name and operands                                                          | Semantics assuming UnwindActions* ua
0x0c       | DW_CFA_def_cfa(uleb128 Reg, uleb128 Off)                                   | ua->sp = RegOffset(Reg, Off)
0x12       | DW_CFA_def_cfa_sf(uleb128 Reg, sleb128 Off)                                | ua->sp = RegOffset(Reg, Off * data_align)
0x0f       | DW_CFA_def_cfa_expression(uleb128 Len, uint8_t DwOpBytecode[Len])          | ua->sp = ExprResult(DwOpBytecode)
0x80 + Reg | DW_CFA_offset(uleb128 Off)                                                 | ua->reg[Reg] = LoadFromStackSlot(Off * data_align)
0x05       | DW_CFA_offset_extended(uleb128 Reg, uleb128 Off)                           | ua->reg[Reg] = LoadFromStackSlot(Off * data_align)
0x11       | DW_CFA_offset_extended_sf(uleb128 Reg, sleb128 Off)                        | ua->reg[Reg] = LoadFromStackSlot(Off * data_align)
0x2f       | DW_CFA_GNU_negative_offset_extended(uleb128 Reg, uleb128 Off)              | ua->reg[Reg] = LoadFromStackSlot(-(Off * data_align))
0x14       | DW_CFA_val_offset(uleb128 Reg, uleb128 Off)                                | ua->reg[Reg] = AddressOfStackSlot(Off * data_align)
0x15       | DW_CFA_val_offset_sf(uleb128 Reg, sleb128 Off)                             | ua->reg[Reg] = AddressOfStackSlot(Off * data_align)
0xc0 + Reg | DW_CFA_restore †                                                           | ua->reg[Reg] = CopyFromRegister(Reg)
0x06       | DW_CFA_restore_extended(uleb128 Reg) †                                     | ua->reg[Reg] = CopyFromRegister(Reg)
0x08       | DW_CFA_same_value(uleb128 Reg)                                             | ua->reg[Reg] = CopyFromRegister(Reg)
0x09       | DW_CFA_register(uleb128 Reg, uleb128 Src)                                  | ua->reg[Reg] = CopyFromRegister(Src)
0x07       | DW_CFA_undefined(uleb128 Reg)                                              | ua->reg[Reg] = Undefined
0x10       | DW_CFA_expression(uleb128 Reg, uleb128 Len, uint8_t DwOpBytecode[Len])     | ua->reg[Reg] = LoadFromExprResult(DwOpBytecode)
0x16       | DW_CFA_val_expression(uleb128 Reg, uleb128 Len, uint8_t DwOpBytecode[Len]) | ua->reg[Reg] = ExprResult(DwOpBytecode)
0x00       | DW_CFA_nop                                                                 | No-op

† This differs from the usual DWARF semantics. To avoid confusion, use DW_CFA_same_value instead of DW_CFA_restore.

If the bytecode does not initialise sp, its value is Undefined. If the bytecode does not initialise reg[i], its value is CopyFromRegister(i).
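As a sketch of how a few of these opcodes might be interpreted, in the same pseudo-C style as UnwindActions above, and using the DecodeUleb128 / DecodeSleb128 helpers from earlier (most opcodes and all error handling elided):

void RunCfaBytecode(const uint8_t* bpc, const uint8_t* end,
                    int64_t data_align, UnwindActions* ua) {
  while (bpc < end) {
    uint8_t op = *bpc++;
    if ((op & 0xc0) == 0x80) {          /* DW_CFA_offset */
      unsigned reg = op & 0x3f;
      int64_t off = (int64_t)DecodeUleb128(&bpc) * data_align;
      ua->reg[reg] = LoadFromStackSlot(off);
    } else switch (op) {
    case 0x0c: {                        /* DW_CFA_def_cfa */
      unsigned reg = (unsigned)DecodeUleb128(&bpc);
      uint64_t off = DecodeUleb128(&bpc);
      ua->sp = RegOffset(reg, off);
      break; }
    case 0x07:                          /* DW_CFA_undefined */
      ua->reg[DecodeUleb128(&bpc)] = Undefined;
      break;
    case 0x00:                          /* DW_CFA_nop */
      break;
    /* ... remaining opcodes elided ... */
    }
  }
}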

Putting it together in a CIE

The aforementioned bytecode gets wrapped up in something called a CIE, which also includes a bunch of other fields:

struct cie {
  uint32_t length;           // In bytes, of all subsequent fields
  int32_t  zero;             // Must be zero
  uint8_t  version;          // Typically 1 or 3, sometimes 4
  char     aug_string[];     // NUL terminated
  if (aug_string[0] == 'e' && aug_string[1] == 'h') {
    void* eh_ptr;            // Only used by very old g++
  }
  if (version >= 4) {
    uint8_t addr_size;       // Must be sizeof(void*)
    uint8_t segment_size;    // Must be zero
  }
  uleb128 code_align;        // Typically 1 (even on aarch64)
  sleb128 data_align;        // Typically -4 or -8
  if (version == 1) {
    uint8_t return_address_ordinal;
  } else {
    uleb128 return_address_ordinal;
  }
  uint8_t aug_operands[];    // Relates to aug_string
  uint8_t dw_cfa_bytecode[];
  uint8_t zero_padding[];    // To give 4 or 8 byte alignment
};

The aug_string field is a NUL-terminated ASCII string. Conceptually it is a bitmask of flags, except that each flag is represented by one or two characters rather than by a single bit. If the flag has associated operand(s), they are read out of the aug_operands field, in the same order as characters appear in aug_string. The recognised flags are:

Char(s) | Operand(s)                                    | Notes
"eh"    | void* eh_ptr                                  | Operand not part of aug_operands
"z"     | uleb128 length                                | In bytes, of subsequent aug_operands
"R"     | uint8_t fde_ptr_encoding                      | Foreshadowing
"P"     | uint8_t ptr_encoding, uint8_t personality_ptr[] | Foreshadowing
"L"     | uint8_t lsda_ptr_encoding                     | Foreshadowing
"S"     | None                                          | Signal handler frame
"B"     | None                                          | aarch64 ptrauth uses B key
"\0"    | None                                          | End of aug_string

"eh", if it appears, must be first. In practice, it is never present, except in code compiled by very old versions of g++. "z", if it appears, must be next, and in practice is always present. The remaining characters (except NUL) can appear in any order, though if "R" is present, and "S" and/or "B" are also present, then "R" must appear before "S" and before "B" (to work around bugs in get_cie_encoding, as compared to the correct extract_cie_info).

The whole struct cie must have a length which is a multiple of sizeof(uintptr_t) bytes. The zero_padding field at the end ensures this. Helpfully, DW_CFA_nop is zero, so zero_padding can be seen as an extension of the dw_cfa_bytecode field. The length of the whole struct cie is 4 + length. To get the length of the combined dw_cfa_bytecode / zero_padding, the lengths of all the other fields need to be subtracted off. This is straightforward (albeit fiddly) for most fields, with the only hard case being aug_operands: its length depends on the contents of aug_string, and there may be characters in aug_string which we do not recognise. This is where "z" comes in: once we've decoded its operand, it tells us the remaining length of aug_operands.
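A sketch of that computation for the common case (assumes a version 1 CIE without the "eh" flag, and reuses the DecodeUleb128 / DecodeSleb128 helpers from earlier):

#include <stdint.h>
#include <string.h>

/* Find the bounds of dw_cfa_bytecode + zero_padding within a CIE. */
void CieBytecodeBounds(const uint8_t* cie,
                       const uint8_t** out_start, const uint8_t** out_end) {
  uint32_t length;
  memcpy(&length, cie, 4);
  const uint8_t* p = cie + 9;          /* skip length, zero, version */
  const char* aug = (const char*)p;
  p += strlen(aug) + 1;                /* skip aug_string, incl. NUL */
  DecodeUleb128(&p);                   /* skip code_align */
  DecodeSleb128(&p);                   /* skip data_align */
  p += 1;                              /* return_address_ordinal (version 1) */
  if (aug[0] == 'z') {
    uint64_t aug_len = DecodeUleb128(&p);
    p += aug_len;                      /* skip aug_operands wholesale */
  }
  *out_start = p;
  *out_end = cie + 4 + length;
}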

From one location to many

Obtaining the UnwindActions applicable to one instruction pointer value is all well and good, but this needs to scale to every (*) possible instruction pointer value across an entire program. One relevant observation is that the UnwindActions applicable to an instruction pointer value X are usually very similar (or even identical) to those applicable to an instruction pointer value of X+1.

(*) Assuming -fasynchronous-unwind-tables. Less coverage is required if only -fnon-call-exceptions is used, and less still is required if neither is used. Note that -fasynchronous-unwind-tables is enabled by default on some architectures.

With this observation in mind, we can borrow a trick from video compression: use a keyframe to describe the UnwindActions applicable at the start of every function, and then use delta frames to describe the differences as the instruction pointer progresses through the function.

This motivates a bunch of new VM opcodes, along with two pieces of VM state: set target to the instruction pointer that you want UnwindActions for, set fpc to the instruction pointer of the start of the function, and execute VM instructions either until you run out of VM instructions, or until fpc >= target:

Opcode     | Name and operands                 | Semantics
0x40 + Off | DW_CFA_advance_loc                | fpc += Off * code_align; if (fpc >= target) break;
0x02       | DW_CFA_advance_loc1(uint8_t Off)  | fpc += Off * code_align; if (fpc >= target) break;
0x03       | DW_CFA_advance_loc2(uint16_t Off) | fpc += Off * code_align; if (fpc >= target) break;
0x04       | DW_CFA_advance_loc4(uint32_t Off) | fpc += Off * code_align; if (fpc >= target) break;
0x01       | DW_CFA_set_loc(uint8_t Ptr[])     | fpc = DecodePtr(fde_ptr_encoding from CIE, Ptr); if (fpc >= target) break;
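A sketch of how the advance opcodes slot into an interpreter loop (a helper function of my own devising; DW_CFA_advance_loc4 and DW_CFA_set_loc would follow the same shape):

#include <stdint.h>
#include <string.h>

/* Handle one advance opcode, if op is one. Returns nonzero once fpc
   has reached target, i.e. ua already describes the target address. */
int HandleAdvance(uint8_t op, const uint8_t** bpc, uint64_t code_align,
                  uintptr_t target, uintptr_t* fpc) {
  if ((op & 0xc0) == 0x40) {           /* DW_CFA_advance_loc */
    *fpc += (uintptr_t)(op & 0x3f) * code_align;
  } else if (op == 0x02) {             /* DW_CFA_advance_loc1 */
    *fpc += (uintptr_t)**bpc * code_align;
    *bpc += 1;
  } else if (op == 0x03) {             /* DW_CFA_advance_loc2 */
    uint16_t off;
    memcpy(&off, *bpc, 2);
    *bpc += 2;
    *fpc += (uintptr_t)off * code_align;
  } else {
    return 0;                          /* not an advance opcode */
  }
  return *fpc >= target;
}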

It also motivates two pieces of VM state called spReg and spOff, along with revisions to DW_CFA_def_cfa / DW_CFA_def_cfa_sf, and new opcodes which perform only one half of DW_CFA_def_cfa / DW_CFA_def_cfa_sf:

Opcode | Name and operands                            | Semantics assuming UnwindActions* ua
0x0c   | DW_CFA_def_cfa(uleb128 Reg, uleb128 Off)     | ua->sp = RegOffset(spReg = Reg, spOff = Off)
0x12   | DW_CFA_def_cfa_sf(uleb128 Reg, sleb128 Off)  | ua->sp = RegOffset(spReg = Reg, spOff = Off * data_align)
0x0d   | DW_CFA_def_cfa_register(uleb128 Reg)         | spReg = Reg; ua->sp = RegOffset(spReg, spOff)
0x0e   | DW_CFA_def_cfa_offset(uleb128 Off)           | spOff = Off; if (ua->sp is RegOffset) ua->sp = RegOffset(spReg, spOff)
0x13   | DW_CFA_def_cfa_offset_sf(sleb128 Off)        | spOff = Off * data_align; if (ua->sp is RegOffset) ua->sp = RegOffset(spReg, spOff)

Finally, it motivates adding a Stack<UnwindActions> to the VM, along with some opcodes to manipulate it:

Opcode | Name and operands      | Semantics assuming UnwindActions* ua
0x0a   | DW_CFA_remember_state  | Push(*ua) (note that ua not modified)
0x0b   | DW_CFA_restore_state   | *ua = Pop()

From one function to many

We could happily have one CIE per function, but there would be quite a lot of repetition between CIEs. To resolve this, we introduce the concept of an FDE, which is essentially a stripped down CIE:

struct fde {
  uint32_t length;            // In bytes, of all subsequent fields
  int32_t  cie_offset;        // In bytes, from this field, to a CIE
  uint8_t  func_start[];      // Using fde_ptr_encoding from CIE
  uint8_t  func_length[];     // Using (fde_ptr_encoding & 0xF)
  uint8_t  aug_operands[];    // Relates to aug_string from CIE
  uint8_t  dw_cfa_bytecode[]; // Appended to that of CIE
  uint8_t  zero_padding[];    // To give 4 or 8 byte alignment
};

The cie_offset field points to a CIE, and the FDE inherits everything from the pointed-to CIE. For inheritance of aug_string, each flag therein sometimes has operand(s) in the CIE's aug_operands, sometimes in the FDE's aug_operands, and sometimes in both. The FDE aug_operands has:

Char(s) | Operand(s)         | Notes
"eh"    | None               |
"z"     | uleb128 length     | In bytes, of subsequent aug_operands
"R"     | None               |
"P"     | None               |
"L"     | uint8_t lsda_ptr[] | Encoded using lsda_ptr_encoding from CIE
"S"     | None               |
"B"     | None               |
"\0"    | None               | End of aug_string

The dw_cfa_bytecode field is inherited by concatenation: executing the bytecode for an FDE involves first executing the bytecode of the associated CIE. All VM state is carried over from the end of the CIE bytecode to the start of the FDE bytecode, though if using DW_CFA_remember_state / DW_CFA_restore_state then the stack is emptied as part of switching from the CIE to the FDE.

That leaves func_start, which is a variable length pointer, and func_length, which is a variable length integer. If aug_string does not contain "R", then these fields are void* and uintptr_t respectively. If aug_string does contain "R", then the operand of "R" (fde_ptr_encoding) describes the size and interpretation of func_start, meanwhile fde_ptr_encoding & 0xF describes the size and interpretation of func_length. That's the segue to talking about pointer encodings.

Pointer encodings

Depending on the context, it can be desirable to encode a pointer in different ways. For example, sometimes a uintptr_t is desirable, whereas other times an int32_t offset from the start of the current function is desirable. To allow flexibility, various places allow a uint8_t to describe the pointer encoding being used. These places are:

Encoding field                        | Associated pointer field
cie::fde_ptr_encoding                 | fde::func_start
cie::fde_ptr_encoding & 0xF           | fde::func_length
cie::fde_ptr_encoding                 | DW_CFA_set_loc operand
cie::aug_string "P" ptr_encoding      | cie::aug_string "P" personality_ptr
cie::aug_string "L" lsda_ptr_encoding | fde::aug_string "L" lsda_ptr
eh_frame_hdr::eh_frame_ptr_enc        | eh_frame_hdr::eh_frame_ptr †
eh_frame_hdr::fde_count_enc           | eh_frame_hdr::fde_count †
eh_frame_hdr::table_enc               | eh_frame_hdr::sorted_table †
DW_OP_GNU_encoded_addr PtrEncoding    | DW_OP_GNU_encoded_addr Ptr ‡

† Foreshadowing.
‡ More foreshadowing.

There are two special cases for this uint8_t:

Encoding | Meaning
0xff     | Encoded value is absent/empty, decode as NULL.
0x50     | Encode a uintptr_t, but precede it with padding to align it.

Once the two special cases are excluded, the remaining cases split the byte into a four bit field, a three bit field, and a one bit field. The low four bits denote the data type:

Encoding & 0xF | Data type  | Size in bytes
0x0            | uintptr_t  | Typically either 4 or 8
0x1            | uleb128 ‡  | Varies (≥ 1)
0x2            | uint16_t   | 2
0x3            | uint32_t   | 4
0x4            | uint64_t   | 8
0x9            | sleb128 ‡  | Varies (≥ 1)
0xA            | int16_t    | 2
0xB            | int32_t    | 4
0xC            | int64_t    | 8

‡ Cannot be used for fde_ptr_encoding.

The next three bits denote a base value to be added to non-NULL values:

Encoding & 0x70 | Base value
0x00            | NULL
0x10 (pcrel)    | Address of first byte of encoded pointer
0x20 (textrel)  | Start of .text (in theory), but usually NULL in practice
0x30 (datarel)  | Start of .got (x86) or NULL (most other architectures) †
0x40 (funcrel) ‡ | fde::func_start (after decoding)

† Except in eh_frame_hdr::table_enc, where it means start of .eh_frame_hdr.
‡ Cannot be used for fde_ptr_encoding, or in eh_frame_hdr.

The final top bit controls an optional dereference:

Encoding & 0x80 | Semantics
0x00            | Treat value as void*
0x80            | Treat value as void**, dereference it to get void*

If an integer is desired rather than a pointer (as is the case for func_length and fde_count_enc), then the void* is reinterpreted as uintptr_t.
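Putting the pieces together, a DecodePtr sketch (the 0x50 aligned case and the funcrel base are omitted; text_base / data_base are assumed supplied by the caller, and DecodeUleb128 / DecodeSleb128 are from earlier):

#include <stdint.h>
#include <string.h>

uintptr_t DecodePtr(uint8_t enc, const uint8_t** p,
                    uintptr_t text_base, uintptr_t data_base) {
  if (enc == 0xff) return 0;                /* absent, decode as NULL */
  uintptr_t base = 0, value = 0;
  switch (enc & 0x70) {                     /* base value */
  case 0x10: base = (uintptr_t)*p; break;   /* pcrel */
  case 0x20: base = text_base; break;       /* textrel */
  case 0x30: base = data_base; break;       /* datarel */
  default: break;                           /* NULL base */
  }
  switch (enc & 0x0f) {                     /* data type */
  case 0x0: { uintptr_t v; memcpy(&v, *p, sizeof v); *p += sizeof v; value = v; break; }
  case 0x1: value = (uintptr_t)DecodeUleb128(p); break;
  case 0x2: { uint16_t v; memcpy(&v, *p, 2); *p += 2; value = v; break; }
  case 0x3: { uint32_t v; memcpy(&v, *p, 4); *p += 4; value = v; break; }
  case 0x4: { uint64_t v; memcpy(&v, *p, 8); *p += 8; value = (uintptr_t)v; break; }
  case 0x9: value = (uintptr_t)(intptr_t)DecodeSleb128(p); break;
  case 0xa: { int16_t v; memcpy(&v, *p, 2); *p += 2; value = (uintptr_t)(intptr_t)v; break; }
  case 0xb: { int32_t v; memcpy(&v, *p, 4); *p += 4; value = (uintptr_t)(intptr_t)v; break; }
  case 0xc: { int64_t v; memcpy(&v, *p, 8); *p += 8; value = (uintptr_t)v; break; }
  }
  if (value != 0) value += base;            /* base only added to non-NULL */
  if (enc & 0x80) value = *(uintptr_t*)value; /* optional dereference */
  return value;
}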

Finding all the FDEs

The various CIE and FDE structures are concatenated together, and then placed in the ELF .eh_frame section.
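Distinguishing the two record types when scanning such a section is easy: the 32 bits after the length field are zero for a CIE and nonzero (a cie_offset) for an FDE. A scanning sketch (the 0xffffffff 64-bit length extension is not handled):

#include <stdint.h>
#include <string.h>

struct fde;  /* as defined earlier */

void VisitFdes(const uint8_t* p, const uint8_t* end,
               void (*visit)(const struct fde*)) {
  while (p + 8 <= end) {
    uint32_t length, second;
    memcpy(&length, p, 4);
    if (length == 0) break;       /* zero terminator, if present */
    memcpy(&second, p + 4, 4);
    if (second != 0)              /* zero here would mean CIE */
      visit((const struct fde*)p);
    p += 4 + length;
  }
}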

Traditionally, the compiler would insert a call to __register_frame somewhere during startup, passing along the address of the .eh_frame section. If non-NULL values were desired for textrel / datarel pointer encodings, it would instead insert a call to __register_frame_info_bases. At shutdown, a matching call to __deregister_frame or __deregister_frame_info_bases would be made. If there are multiple .eh_frame sections, and the linker didn't merge them, then __register_frame_table / __register_frame_info_table_bases can be used instead, which take a pointer to a NULL-terminated list of pointers to .eh_frame sections.

A more modern approach is to add an ELF .eh_frame_hdr section, and use either _dl_find_object or dl_iterate_phdr to find the appropriate .eh_frame_hdr section. Once found, the section contains a sorted table of FDE pointers:

struct eh_frame_hdr {
  uint8_t version;          // Must be 1
  uint8_t eh_frame_ptr_enc; // Typically 0x1B (pcrel int32_t)
  uint8_t fde_count_enc;    // Typically 0x03 (uint32_t)
  uint8_t table_enc;        // Must be 0x3B (datarel int32_t)
  uint8_t eh_frame_ptr[];   // Encoded with eh_frame_ptr_enc
  uint8_t fde_count[];      // Encoded with fde_count_enc
  struct {
    int32_t func_start;     // In bytes, relative to eh_frame_hdr
    int32_t fde_offset;     // In bytes, relative to eh_frame_hdr
  } sorted_table[fde_count];
};

The combined size of the eh_frame_ptr and fde_count fields must be a multiple of four (which happens naturally if eh_frame_ptr_enc and fde_count_enc both use 4 or 8 byte types). The eh_frame_ptr field contains a pointer to the .eh_frame section, which is used as fallback if the .eh_frame_hdr section is unparseable for some reason. The table_enc field contains the encoding used by the contents of sorted_table, albeit datarel means relative to the start of .eh_frame_hdr, and the only supported encoding is datarel int32_t. The table is sorted in ascending func_start order, and the func_start value therein overrides the func_start from the referenced FDE (though in practice they should be identical).
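A binary-search sketch over that table (names are mine; the caller still needs to check pc against the FDE's func_length):

#include <stdint.h>

struct eh_frame_hdr_entry { int32_t func_start, fde_offset; };

/* Find the FDE whose entry has the largest func_start <= pc. */
const void* FindFde(const uint8_t* hdr, /* start of .eh_frame_hdr */
                    const struct eh_frame_hdr_entry* table,
                    uint32_t fde_count, uintptr_t pc) {
  int32_t rel = (int32_t)(pc - (uintptr_t)hdr); /* pc, datarel encoded */
  uint32_t lo = 0, hi = fde_count;
  while (hi - lo > 1) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (table[mid].func_start <= rel) lo = mid; else hi = mid;
  }
  if (fde_count == 0 || table[lo].func_start > rel) return 0;
  return hdr + table[lo].fde_offset;
}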

Personality pointers and LSDA pointers

To support calling destructors during unwinding, and to support catching exceptions (and thereby stopping unwinding), a CIE can specify a pointer to a personality function. Said function contains the destructing and/or catching logic, and will get called as part of unwinding (unless only generating a backtrace). I won't get into the specifics of the interface, though the personality function can query various parts of the unwind state.

Very old versions of g++ use eh_ptr instead of personality functions and LSDA pointers. Unless you are implementing __frame_state_for, you shouldn't need to worry about eh_ptr.

Expression bytecode

The UnwindOneFrame function from earlier included calls to EvalExpr to handle complex cases. The outline of EvalExpr is:

uintptr_t EvalExpr(uint8_t DwOpBytecode[unsigned Len],
                   uintptr_t StackInitial,
                   RegState* original) {
  const uint8_t* bpc = DwOpBytecode;
  Stack<uintptr_t> stk;
  stk.Push(StackInitial);
  while (bpc < DwOpBytecode + Len) {
    uint8_t opcode = *bpc++;
    /* Decode opcode-specific operand(s) from bpc */
    ...
    /* Perform opcode */
    ...
  }
  return stk.Pop();
}

With the various supported opcodes being:

Opcode     | Name and operands                                          | Semantics (assuming stack for Push, Pop, At)
0x03       | DW_OP_addr(uintptr_t Lit)                                  | Push(Lit)
0x08       | DW_OP_const1u(uint8_t Lit)                                 | Push(Lit)
0x09       | DW_OP_const1s(int8_t Lit)                                  | Push(Lit)
0x0A       | DW_OP_const2u(uint16_t Lit)                                | Push(Lit)
0x0B       | DW_OP_const2s(int16_t Lit)                                 | Push(Lit)
0x0C       | DW_OP_const4u(uint32_t Lit)                                | Push(Lit)
0x0D       | DW_OP_const4s(int32_t Lit)                                 | Push(Lit)
0x0E       | DW_OP_const8u(uint64_t Lit)                                | Push(Lit)
0x0F       | DW_OP_const8s(int64_t Lit)                                 | Push(Lit)
0x10       | DW_OP_constu(uleb128 Lit)                                  | Push(Lit)
0x11       | DW_OP_consts(sleb128 Lit)                                  | Push(Lit)
0x12       | DW_OP_dup                                                  | Push(At(-1)) (duplicate top element)
0x14       | DW_OP_over                                                 | Push(At(-2)) (duplicate penultimate element)
0x15       | DW_OP_pick(uint8_t Idx)                                    | Push(At(-1-Idx))
0x13       | DW_OP_drop                                                 | Pop()
0x16       | DW_OP_swap                                                 | a = Pop(); b = Pop(); Push(a); Push(b)
0x17       | DW_OP_rot                                                  | a = Pop(); b = Pop(); c = Pop(); Push(a); Push(c); Push(b)
0x06       | DW_OP_deref                                                | Push(*(uintptr_t*)Pop())
0x94 0x01  | DW_OP_deref_size                                           | Push(*(uint8_t*)Pop())
0x94 0x02  | DW_OP_deref_size                                           | Push(*(uint16_t*)Pop())
0x94 0x04  | DW_OP_deref_size                                           | Push(*(uint32_t*)Pop())
0x94 0x08  | DW_OP_deref_size                                           | Push(*(uint64_t*)Pop())
0x19       | DW_OP_abs                                                  | a = Pop(); Push((intptr_t)a < 0 ? -a : a)
0x1f       | DW_OP_neg                                                  | a = Pop(); Push(-a)
0x20       | DW_OP_not                                                  | a = Pop(); Push(~a)
0x23       | DW_OP_plus_uconst(uleb128 Lit)                             | a = Pop(); Push(a + Lit)
0x1a       | DW_OP_and                                                  | rhs = Pop(); lhs = Pop(); Push(lhs & rhs)
0x1b       | DW_OP_div                                                  | rhs = Pop(); lhs = Pop(); Push((intptr_t)lhs / (intptr_t)rhs)
0x1c       | DW_OP_minus                                                | rhs = Pop(); lhs = Pop(); Push(lhs - rhs)
0x1d       | DW_OP_mod                                                  | rhs = Pop(); lhs = Pop(); Push(lhs % rhs)
0x1e       | DW_OP_mul                                                  | rhs = Pop(); lhs = Pop(); Push(lhs * rhs)
0x21       | DW_OP_or                                                   | rhs = Pop(); lhs = Pop(); Push(lhs | rhs)
0x22       | DW_OP_plus                                                 | rhs = Pop(); lhs = Pop(); Push(lhs + rhs)
0x24       | DW_OP_shl                                                  | rhs = Pop(); lhs = Pop(); Push(lhs << rhs)
0x25       | DW_OP_shr                                                  | rhs = Pop(); lhs = Pop(); Push(lhs >> rhs)
0x26       | DW_OP_shra                                                 | rhs = Pop(); lhs = Pop(); Push((intptr_t)lhs >> rhs)
0x27       | DW_OP_xor                                                  | rhs = Pop(); lhs = Pop(); Push(lhs ^ rhs)
0x29       | DW_OP_eq                                                   | rhs = Pop(); lhs = Pop(); Push(lhs == rhs)
0x2e       | DW_OP_ne                                                   | rhs = Pop(); lhs = Pop(); Push(lhs != rhs)
0x2a       | DW_OP_ge                                                   | rhs = Pop(); lhs = Pop(); Push((intptr_t)lhs >= (intptr_t)rhs)
0x2b       | DW_OP_gt                                                   | rhs = Pop(); lhs = Pop(); Push((intptr_t)lhs > (intptr_t)rhs)
0x2c       | DW_OP_le                                                   | rhs = Pop(); lhs = Pop(); Push((intptr_t)lhs <= (intptr_t)rhs)
0x2d       | DW_OP_lt                                                   | rhs = Pop(); lhs = Pop(); Push((intptr_t)lhs < (intptr_t)rhs)
0x2f       | DW_OP_skip(int16_t Delta)                                  | bpc += Delta
0x28       | DW_OP_bra(int16_t Delta)                                   | if (Pop() != 0) bpc += Delta
0x30 + Lit | DW_OP_lit                                                  | Push(Lit)
0x50 + Reg | DW_OP_reg                                                  | Push(original->value[Reg])
0x70 + Reg | DW_OP_breg(sleb128 Lit)                                    | Push(original->value[Reg] + Lit)
0x90       | DW_OP_regx(uleb128 Reg)                                    | Push(original->value[Reg])
0x92       | DW_OP_bregx(uleb128 Reg, sleb128 Lit)                      | Push(original->value[Reg] + Lit)
0x96       | DW_OP_nop                                                  | No-op
0xf1       | DW_OP_GNU_encoded_addr(uint8_t PtrEncoding, uint8_t Ptr[]) | Push(DecodePtr(PtrEncoding, Ptr))

The gcc implementation of EvalExpr assumes that the stack never holds more than 64 elements. Bad things will happen if the stack exceeds this size.

By far the most common usage of this expression language is to allow a single FDE to succinctly describe an arbitrary number of PLT entries. The serialised bytecode in this case will be something like:

0x92 0x07 0x08  DW_OP_bregx(7 /* rsp */, 8)
0x90 0x10       DW_OP_regx(16 /* rip */)
0x08 0x0f       DW_OP_const1u(15)
0x1a            DW_OP_and
0x08 0x0b       DW_OP_const1u(11)
0x2a            DW_OP_ge
0x08 0x03       DW_OP_const1u(3)
0x24            DW_OP_shl
0x22            DW_OP_plus

Which encodes the expression:

rsp + 8 + ((((rip & 15) >= 11) ? 1 : 0) << 3) 

In other words, each 16-byte PLT entry is split into two pieces: the 1st being 11 bytes long, and the 2nd being 5 bytes long. If in the 1st piece, the expression gives rsp + 8, whereas in the 2nd piece it gives rsp + 16.
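As a sanity check, the same computation as plain C (the function name is mine; the 11/5 split corresponds to the push partway through each PLT entry, which moves rsp by 8):

#include <stdint.h>

/* CFA for a 16-byte x86_64 PLT entry: rsp + 8 within the first 11
   bytes of the entry, rsp + 16 within the final 5 bytes. */
uintptr_t PltCfa(uintptr_t rsp, uintptr_t rip) {
  return rsp + 8 + (uintptr_t)(((rip & 15) >= 11) << 3);
}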

References

The DWARF Standard is documented, though .eh_frame diverges from it in a few areas. It also requires architecture-specific extensions (e.g. x86, x86_64, aarch64). The LSB has a pointer encoding specification, though it is incomplete. The only true reference is the gcc source code.

See also

For a completely different take on the same problem, consider ARM64 unwinding on Windows. An .xdata record therein is roughly equivalent to an FDE, and it has a bytecode interpreter similar in purpose (but very different in style) to the DW_CFA_ bytecode.

Some examples of complete .eh_frame sections are available in LuaJIT.

__VA_OPT__ Minutiae

Let's say we're writing C code, and would like to define a macro for printf-with-newline. We might start with:

#define printfln(fstr, ...) \
  printf(fstr "\n", __VA_ARGS__)

So far so good; we can write printfln("1+1 is %d", 1+1) and it'll print 1+1 is 2 followed by a newline. However, simpler cases such as printfln("Hello") result in a syntax error, as this macro expands to printf("Hello" "\n",), which is invalid due to the trailing comma.

The non-standard solution

One conventional solution is to rely on a non-standard language extension whereby inserting ## between , and __VA_ARGS__ has a special effect:

#define printfln(fstr, ...) \
  printf(fstr "\n", ## __VA_ARGS__)

With this, both printfln("1+1 is %d", 1+1) and printfln("Hello") work fine. However, combining the two into something like printfln("Wrote %d chars", printfln("Hello")) results in a mysterious error. To figure out why, we need to look into exactly what special effect this non-standard language extension has. The GNU C Preprocessor documentation on Variadic Macros says:

The ## token paste operator has a special meaning when placed between a comma and a variable argument. If you write [...] and the variable argument is left out [...], then the comma before the ## will be deleted. This does not happen if you pass an empty argument, nor does it happen if the token preceding ## is anything other than a comma.

Meanwhile, the GCC documentation on Variadic Macros says:

If the variable arguments are omitted or empty, the ## operator causes the preprocessor to remove the comma before it. If you do provide some variable arguments in your macro invocation, GNU CPP does not complain about the paste operation and instead places the variable arguments after the comma. Just like any other pasted macro argument, these arguments are not macro expanded.

The last sentence of this 2nd piece of documentation tells us what is going on: macro arguments are usually expanded before being substituted, but this doesn't happen to macro arguments used as operands to # or ##. In either case, after substitution has been performed, the resultant tokens are rescanned for more expansions, albeit with the original macro deactivated during this rescan. The only major observable difference between expansion-before-substitution and expansion-during-rescan is that the original macro is active in the former but deactivated in the latter. Hence printfln("Wrote %d chars", printfln("Hello")) expands to printf("Wrote %d chars" "\n", printfln("Hello")) without expanding the inner printfln, which the compiler then tries to parse as a call to the non-existent function printfln.

The two pieces of documentation are contradictory as to what happens if the variable arguments are present but empty (as in printfln("Hello",)). In practice the 1st piece of documentation is correct; the comma is kept if the variable arguments are present but empty.

The nor does it happen if the token preceding ## is anything other than a comma phrase from the 1st piece of documentation is interesting, as it turns out this is a slightly dynamic property: comma deletion can happen for things like , ## x ## __VA_ARGS__ provided that x expands to nothing. Clang will also delete the comma in , x ## __VA_ARGS__ when x expands to nothing. Things also seem to get funky if more pasting immediately follows , ## __VA_ARGS__. As examples:

#define F(x, ...) as  | F()            | F(1)  | F(,) | F(1,) | F(1,2)
,##__VA_ARGS__        | empty          | empty | ,    | ,     | ,2
,##__VA_ARGS__ x      | empty          | 1     | ,    | ,1    | ,2 1
,##__VA_ARGS__##x     | ,              | error | ,    | error | error † or ,21 ‡
,##x##__VA_ARGS__     | empty          | error | ,    | error | error
,x##__VA_ARGS__       | , † or empty ‡ | ,1    | ,    | ,1    | ,12

† According to gcc 13.2.
‡ According to clang 17.0.1.

In the cases where gcc and clang differ, who is to say which is correct? After all, there is no standard documenting the desired behaviour for non-standard language extensions.

Enter __VA_OPT__

Rather than standardise this mess, the C and C++ languages adopted a different solution, namely __VA_OPT__. With __VA_OPT__, our motivating example looks like:

#define printfln(fstr, ...) \
  printf(fstr "\n" __VA_OPT__(,) __VA_ARGS__)

In this, __VA_OPT__(,) expands to nothing if the variable arguments are absent or empty or become empty after macro expansion, whereas it expands to , otherwise. There's no token pasting going on, so printfln("Wrote %d chars", printfln("Hello")) now works. The other behavioural difference is that the odd-looking printfln("Hello",) now expands to the valid printf("Hello" "\n"), as does printfln("Hello", EMPTY) in the context of #define EMPTY /*nothing*/.

The standard didn't stop there though; there can be more than just a , within __VA_OPT__. Any parenthesis-balanced token sequence is allowed, and it can even access the arguments of the macro invocation, so for example it is valid to have:

#define M(x, ...) (0 __VA_OPT__(-(x)) )

Then M(1) expands to (0) and M(1,2) expands to (0 - (1)).

What about whitespace?

Compilers seem to disagree on the behaviour of whitespace just after __VA_OPT__( or just before the matching ). Consider:

#define TILDAS(...) ~__VA_OPT__( ~ )~
#define S2(x) #x
#define S1(x) S2(x)
const char* s = S1(TILDAS());
const char* sa = S1(TILDAS(a));

The observed results are:

   | gcc 13.2 | clang 17.0.1
s  | "~~"     | "~ ~"
sa | "~~~"    | "~ ~ ~"

One interpretation is that gcc is correct, based on this paragraph from the standard: (N3096 6.10.4.1¶7)

The preprocessing token sequence for [...] a va-opt-replacement [...] consists of the results of the expansion of the contained pp-tokens as the replacement list of the current function-like macro before removal of placemarker tokens, rescanning, and further replacement.

Combined with an earlier paragraph about replacement lists: (N3096 6.10.4¶7)

[...] Any white-space characters preceding or following the replacement list of preprocessing tokens are not considered part of the replacement list for either form of macro.

What does __VA_OPT__() expand to?

The short answer should be nothing, based on the rules for expanding it:

  1. If the variable arguments are absent or empty or become empty after macro expansion, the expansion of __VA_OPT__() is a single placemarker token.
  2. Otherwise, if used as an operand of ##, the expansion of __VA_OPT__() is a single placemarker token (because an empty expansion becomes a placemarker in this context).
  3. Otherwise, the expansion of __VA_OPT__() is empty (though a single placemarker token would be equally valid, as it'll evaporate away in due course).

#__VA_OPT__() therefore becomes an obfuscated way of writing "". Meanwhile, __VA_OPT__() as an operand of ## can be used to force the other operand to be in a ## context. For example, the undesirable behaviour of , ## __VA_ARGS__ can be reproduced via:

#define printfln(fstr, ...) \
  printf(fstr "\n" __VA_OPT__(,) __VA_OPT__() ## __VA_ARGS__)

If __VA_OPT__() expands to nothing, regardless of whether the variable arguments are absent or empty or become empty after macro expansion, then it might seem rational to optimise it away. There's a corner case though: determining whether the variable arguments become empty after macro expansion requires macro expanding them, and macro expansion of the (non-standard) __COUNTER__ macro has visible side-effects. Consider:

#define EMPTY_VA_OPT(...) __VA_OPT__()
int x = __COUNTER__;
EMPTY_VA_OPT(__COUNTER__)
int y = __COUNTER__;
return y - x;

For the above, gcc 13.2 returns 2, whereas clang 17.0.1 returns 1, indicating that clang optimised away the __VA_OPT__().

Token pasting and __VA_OPT__

The standard is careful to say that the expansion of __VA_OPT__ can contain placemarker tokens: (N3096 6.10.4.1¶7, emphasis mine)

The preprocessing token sequence for [...] a va-opt-replacement [...] consists of the results of the expansion of the contained pp-tokens as the replacement list of the current function-like macro before removal of placemarker tokens, rescanning, and further replacement.

These placemarker tokens were originally a fiction dreamt up by the standard to provide a succinct description of the semantics of ##.

Before __VA_OPT__, it was possible to ignore this fiction, and implement the preprocessing algorithm with careful rules around ## evaluation rather than producing and later removing placemarker tokens. It becomes harder to maintain this fiction with __VA_OPT__. Consider:

#define G(x, y, z, ...) 1 ## __VA_OPT__(x y ## y z) ## 5

Then G(,,) expands to the single token 15, while G(2,3,4,-) expands to the three tokens 12 33 45. If y is empty, then the inner y ## y should produce a placemarker, and then things get interesting:

Expansion of | Result         | Notes
G(2,,4,-)    | 12 45          | The __VA_OPT__ expands to 2 4.
G( ,,4,-)    | 1 45           | The placemarker inhibits merging to 145.
G(2,, ,-)    | 12 5           | The placemarker inhibits merging to 125.
G( ,, ,-)    | 1 5 † or 15 ‡  | The correct result is the merged 15.

† According to gcc 13.2.
‡ According to clang 17.0.1.

The other fun observation is that macro arguments within __VA_OPT__ get macro expanded, even if the __VA_OPT__ itself is an operand of ##. Consider:

#define H1(...) x ##            __VA_ARGS__
#define H2(...) x ## __VA_OPT__(__VA_ARGS__)

Then H1(__LINE__) expands to x__LINE__, whereas H2(__LINE__) expands to something like x3.

Higher quality random floats

Much has been written about how to generate random 64-bit integers; people might argue over which particular (CS)PRNG is best suited for a given task, but there is agreement on the general shape of the solution: keep some state around, and then extract 64 bits of randomness at a time, where each bit is equally likely to be either zero or one. This gives nice uniformly-distributed integers over the range [0, 2^64).

There is less agreement on how to generate random floating-point values. It is tempting to aim for what looks like a simple transformation of an integer generator: uniformly-distributed floats over the range [0, 1). There are various ways of doing such a transformation; one exposition does it as:

double pcg64_random_double(pcg64_t *rng) {
    return (double)(pcg64_random(rng) >> 11) * 0x1.0p-53;
}

Whereas LuaJIT does it as:

uint64_t lj_prng_u64d(PRNGState *rs) {
    uint64_t z, r = 0;
    TW223_STEP(rs, z, r)
    /* Returns a double bit pattern in the range 1.0 <= d < 2.0. */
    return (r & 0x000fffffffffffffull) | 0x3ff0000000000000ull;
}
/* Then to give d in [0, 1) range: */
U64double u;
double d;
u.u64 = lj_prng_u64d(rs);
d = u.d - 1.0;

And a Go version by Lemire does it as:

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // %2**53
    return float64(x) / float64(0x1fffffffffffff)
}

The [0, 1) output range is less useful than it might initially appear. It is often stated that [0, 1) can be transformed to [A, B) by multiplying by B-A then adding A, but this is only true for some values of A and B; for other values, a number just less than 1 can be transformed and end up equal to B due to floating-point rounding, i.e. [0, 1) is transformed to [A, B]. The possibility of a zero output is also unhelpful if the result is going to be log-transformed (for example as part of a Box-Muller transform), as log(0) explodes.
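For a concrete instance of that rounding, take A = 1000 and B = 1001: transforming the largest double below 1 lands within half a ULP of B, and so rounds up to exactly B (a small demonstration, assuming round-to-nearest):

#include <math.h>
#include <stdio.h>

int main(void) {
  double x = nextafter(1.0, 0.0);  /* largest double < 1 */
  double A = 1000.0, B = 1001.0;
  double r = x * (B - A) + A;      /* exactly 1001 - 2^-53 pre-rounding */
  printf("%d\n", r == B);          /* prints 1: [0, 1) became [A, B] */
  return 0;
}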

An output range of [0, 1] would be better behaved under most additive and multiplicative transforms. After such a transform, a range like [A, B] can be easily turned into [A, B) via rejection sampling, should the half-open range be desired. It will also turn out to be very easy (on either theoretical or practical grounds) to exclude 0 as a possible output, giving the log-friendly output range (0, 1].

Before trying to generate a random float in the range [0, 1], we should instead consider the easier problem of generating a random float in the range [0.5, 1]. There are 2^52+1 different IEEE-754 doubles in this range. The "rounding basin" of some double d in that range can be defined as the range of infinite-precision real numbers in the range [0.5, 1] that would round to d (assuming round-to-nearest). Taking ε=2^-54, most of these doubles have a basin of d-ε through d+ε. The exceptions are the extremes of the range; 0.5 has a basin of 0.5 through 0.5+ε, and 1 has a basin of 1-ε through 1. In other words, there are 2^52-1 values in the range whose basin is 2ε wide, and 2 values in the range whose basin is only ε wide. For a uniform distribution, the probability of seeing d should equal the width of the rounding basin of d. Given all this, there is a fairly simple monotonic transform from the uniform integer range [0, 2^53) to the uniform float range [0.5, 1]:

Integer(s)       | Resultant floating-point double (ε = 2^-54)
0                | 0.5
1, 2             | 0.5 + 2ε
3, 4             | 0.5 + 4ε
5, 6             | 0.5 + 6ε
…                | …
2^53-7, 2^53-6   | 1 - 6ε
2^53-5, 2^53-4   | 1 - 4ε
2^53-3, 2^53-2   | 1 - 2ε
2^53-1           | 1

This transform can be expressed in code like so:

double rand_between_half_and_one() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    x = ((x + 1) >> 1) + (1022ull << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

The range [0.25, 0.5] looks a lot like the range [0.5, 1], except that ε reduces to 2^-55. That is, there is a fairly simple monotonic transform from the uniform integer range [0, 2^53) to the uniform float range [0.25, 0.5]:

Integer(s)       | Resultant floating-point double (ε = 2^-55)
0                | 0.25
1, 2             | 0.25 + 2ε
3, 4             | 0.25 + 4ε
5, 6             | 0.25 + 6ε
…                | …
2^53-7, 2^53-6   | 0.5 - 6ε
2^53-5, 2^53-4   | 0.5 - 4ε
2^53-3, 2^53-2   | 0.5 - 2ε
2^53-1           | 0.5

Or in code:

double rand_between_quarter_and_half() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    x = ((x + 1) >> 1) + (1021ull << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

In turn, the range [0.125, 0.25] looks a lot like [0.25, 0.5], and [0.0625, 0.125] looks a lot like [0.125, 0.25], and so on. To generate a float in [0, 1], we need to choose one of these ranges, and then choose a value within that range. For a uniform distribution, the probability of a range should be proportional to the width of the range, so we have:

Output range     | Width | Probability
[0.5, 1]         | 2^-1  | 50%
[0.25, 0.5]      | 2^-2  | 25%
[0.125, 0.25]    | 2^-3  | 12.5%
[0.0625, 0.125]  | 2^-4  | 6.25%

As the width of the range approaches zero, so does the probability of the range. Once the width is less than 2^-75, the probabilities are so small as to be effectively impossible for most practical purposes. By making them actually impossible, we gain a nice guarantee of algorithm termination, and avoid having to think about denormals, and get the log-friendly output range (0, 1]. One way of implementing this in code is:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    uint64_t e = 1022;
    do {
      if (rand_u64() & 1) break; /* 1-bit uniform integer */
      e -= 1;
    } while (e > 1022-75);
    x = ((x + 1) >> 1) + (e << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

The above throws away lots of random bits, whereas it would be more economical to remember and use them later:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64(); /* 53-bit and 11-bit uniform integers */
    uint64_t bits = x & 0x7ff, nbits = 11;
    uint64_t e = 1022;
    do {
      if (bits & 1) break; /* 1-bit uniform integer */
      bits >>= 1;
      if (--nbits == 0) bits = rand_u64(), nbits = 64;
      e -= 1;
    } while (e > 1022-75);
    x = (((x >> 11) + 1) >> 1) + (e << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

Finally, the loop can be eliminated using bit-counting intrinsics:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64();
    uint64_t e = __builtin_ctzll(x) - 11ull;
    if ((int64_t)e >= 0) e = __builtin_ctzll(rand_u64());
    x = (((x >> 11) + 1) >> 1) - ((e - 1011ull) << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

Or in Go rather than C:

// toFloat64 -> (0,1]
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    e := bits.TrailingZeros64(x) - 11
    if e >= 0 {
        e = bits.TrailingZeros64(splitmix64(seed))
    }
    x = (((x >> 11) + 1) >> 1) - ((uint64(int64(e)) - 1011) << 52)
    return math.Float64frombits(x)
}

With that exposition done, I can now present my criteria for assessing the quality of functions that claim to generate uniformly-distributed IEEE-754 double-precision floating-point values over (0, 1] or [0, 1] or [0, 1):

Criteria            | Most functions              | Presented here   | Ideal
Probability of zero | 2^-52 or 2^-53              | 0                | 2^-1075
Smallest non-zero   | 2^-52 or 2^-53              | 2^-76            | 2^-1074
Largest non-seeable | 1 - 2^-53 or 0.5 - 2^-54    | 2^-76 - 2^-129   | -0
Distinct values     | 2^52 or 2^53                | 2^58.2           | 2^62 ± 1

If you've reached this far, then you might be interested in similar commentary from Taylor R. Campbell or Allen B. Downey or @moon_chilled.

Windows Arm64EC ABI Notes

The basic premise of Microsoft's Arm64EC is that a single virtual address space can contain a mixture of ARM64 code and X64 code; the ARM64 code executes natively, whereas the X64 code is transparently converted to ARM64 code by a combination of JIT and AOT compilation, and ARM64 ⇄ X64 transitions can happen at any function call/return boundary.

There are some good MSDN pages on the topic.

The first function to highlight is RtlIsEcCode; to tell apart ARM64 code and X64 code, the system maintains a bitmap with one bit per 4K page, specifying whether that page contains ARM64 code (bit set) or X64 code (bit clear). This bitmap can be queried using RtlIsEcCode. The loader sets bits in this bitmap when loading DLLs, as does VirtualAlloc2 when allocating executable memory with MEM_EXTENDED_PARAMETER_EC_CODE in MemExtendedParameterAttributeFlags.

After the X64 emulator executes a call or ret or jmp that crosses a page boundary, it needs to determine whether the new rip points to ARM64 code or X64 code. If !RtlIsEcCode(rip), then the X64 emulator can happily continue emulating X64 code. However, if RtlIsEcCode(rip), then a transition out of the emulator needs to be performed. This can be a call-like transition (X64 calling ARM64), or a return-like transition (X64 returning to ARM64). The transition type is determined by looking at the four bytes before rip; if they contain the encoding of blr x16 (0xd63f0200), then a return-like transition is performed by setting pc to rip. Otherwise, a call-like transition is performed by setting x9 to rip and setting pc to rip's entry thunk. To find the entry thunk, the four bytes before rip are used; the low two bits need to be 0b01, and after masking off said bits, the 32 bits are sign extended and then added to rip (the resultant value must be different to rip, i.e. a function cannot be its own entry thunk).
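In C, that entry thunk lookup might look like the following sketch (the function name is mine):

#include <stdint.h>
#include <string.h>

/* Given rip pointing at ARM64EC code, find its entry thunk, or
   return 0 if the preceding four bytes don't describe one. */
uintptr_t GetEntryThunk(uintptr_t rip) {
  int32_t w;
  memcpy(&w, (const void*)(rip - 4), 4);
  if ((w & 3) != 1) return 0;        /* low two bits must be 0b01 */
  uintptr_t thunk = rip + (intptr_t)(w & ~3);
  return thunk == rip ? 0 : thunk;   /* cannot be its own entry thunk */
}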

Before transferring control to the entry thunk, the X64 emulator performs a little manipulation:

ldr lr, [sp], #8 // pop return address (skipping sp alignment check)
mov x4, sp

After this, it ensures sp is 16-byte aligned:

if (unlikely(sp & 8)) {
  str lr, [sp, -#8]! // push return address again (again skip check)
  adr lr, x64_ret_stub // an X64 funclet containing just "ret"
}

In other words, the stack will look like whichever of the following diagrams gives rise to an aligned sp:

◀- lower addresses    x4                 higher addresses -▶
                      |
                      ▼
           ... retaddr home0 home1 home2 home3 arg4 arg5 ...
                      ▲
                      |
                      sp                        lr = retaddr
◀- lower addresses    x4                 higher addresses -▶
                      |
                      ▼
           ... retaddr home0 home1 home2 home3 arg4 arg5 ...
              ▲
              |
              sp                           lr = x64_ret_stub

The 32 bytes of X64 home space begin at x4, and any arguments passed on the stack begin at x4+#32. The entry thunk is free to use the X64 home space if it wants, and some of the MSDN documentation suggests using it as a place to save q6 and q7, but this is problematic, as to later restore from this space, we'd need to save x4 (not to mention that no unwind codes can load from x4). In practice, only 24 of the 32 bytes are easily usable; sp+#8 through sp+#32 will always coincide with some 24 bytes of the home space.

The entry thunk can either be a copy of the original function that speaks the X64 ABI, or it can copy the arguments from their X64 ABI locations to their Arm64EC ABI locations and then make a call to the original function (helpfully provided in x9) and then transfer the results to their X64 ABI locations. If the original function is vararg, then the Arm64EC ABI dictates that x4 should contain a pointer to the stack arguments (so add #32) and that x5 should contain the size of the stack arguments (which is not generally known; MSVC-generated thunks populate #0 for x5 in this case).

In either case, the entry thunk needs to ensure that X64 ABI non-volatile registers are preserved (which translates to ARM64 ABI non-volatile registers, plus q6 through q15), and then needs to return to X64 once it is done. That is achieved by means of a tailcall to __os_arm64x_dispatch_ret, which resumes X64 execution at lr.

If the original function modified arguments in-place and then tailcalled something else (an adjustor function), then the entry thunk for it can instead be an adjustor thunk: modify the arguments in-place (in their X64 ABI locations), put the tailcall target in x9, and then tailcall __os_arm64x_x64_jump. If x9 points to ARM64 code, then __os_arm64x_x64_jump will tailcall x9's entry thunk (which will then consume arguments from their X64 locations), whereas if x9 points to X64 code, then it'll act like __os_arm64x_dispatch_call_no_redirect (which will again consume arguments from their X64 locations).

The other side of things is when native ARM64 code wants to make an indirect function call. The whole premise of Arm64EC is that function pointers at ABI boundaries can point to either ARM64 functions or X64 functions, and a-priori the caller doesn't know which it has. The recommendation is to put the call target in x11 and then call __os_arm64x_dispatch_icall, which will leave x11 as-is if it is ARM64 code, or copy x10 to x11 if not.

Note that __os_arm64x_dispatch_icall is not a function per se, but instead a (per-module) global variable containing a function pointer. The distinction is important, as it affects how to call it; it cannot be a bl target, instead it has to be loaded into a register (e.g. by adrp and ldr) and then be used as a blr target. The linker and loader conspire to put LdrpValidateEcCallTarget in this variable, the pseudocode for which is:

LdrpValidateEcCallTarget:  // aka. __os_arm64x_dispatch_icall
  // Specially written to only mutate x16/x17
  // (and x9/x11 where indicated)
  if (RtlIsEcCode(x11)) {
    // x11 unchanged
    // x9 unspecified, in practice unchanged from input
  } else {
    x9 = ResolveFastForwardSequences(x11);
    if (RtlIsEcCode(x9)) {
      x11 = x9
      // x9 unspecified, in practice same as exit x11
    } else {
      x11 = x10
      // x9 specified as X64 pointer
    }
  }
  // x0-x8,x10,x15 specified as unchanged from input
  // q0-q7 specified as unchanged from input
  // x12 unspecified, in practice unchanged from input
  return;

ResolveFastForwardSequences(uintptr_t p):
  // jmp qword [rip+imm32]
  if (bytes_match(p, "ff 25 ?? ?? ?? ??")) {
    p += 6;
    int32_t imm32 = ((int32_t*)p)[-1];
    p = *(uintptr_t*)(p + imm32);
    return ResolveFastForwardSequences(p);
  }
  if (p & 15) {
    return p;
  }
  // mov rax, rsp; mov [rax+32], rbx; push rbp; pop rbp; jmp rel32
  // mov rdi, rdi; push rbp; mov rbp, rsp; pop rbp; nop; jmp rel32
  if (bytes_match(p, "48 8b c4 48 89 58 20 55 5d e9 ?? ?? ?? ??")
  ||  bytes_match(p, "48 8b ff 55 48 8b ec 5d 90 e9 ?? ?? ?? ??")) {
    p += 14;
    int32_t rel32 = ((int32_t*)p)[-1];
    p += rel32;
    return ResolveFastForwardSequences(p);
  }
  // mov r10, rcx; mov eax, imm32; test byte [0x7ffe0308], 0x01
  // jne rip+3; syscall; ret; jne_target: int 0x2e; ret
  if (bytes_match(p   , "4c 8b d1 b8 ?? ?0 00 00 f6 04 25 08")
  &&  bytes_match(p+12, "03 fe 7f 01 75 03 0f 05 c3 cd 2e c3")) {
    uint32_t imm32 = *(uint32_t*)(p + 4);
    return SyscallFunctionTable[imm32];
  }
  return p;

The related __os_arm64x_dispatch_icall_cfg is also a (per-module) global variable containing a function pointer; courtesy of the linker and loader it'll end up containing either LdrpValidateEcCallTarget or LdrpValidateEcCallTargetCfg or LdrpValidateEcCallTargetCfgES, the latter two of which perform a control-flow-guard (CFG) test against x11 and then tailcall LdrpValidateEcCallTarget.

If LdrpValidateEcCallTarget found X64 code, then x9 will end up containing the X64 pointer, and x11 will end up containing whatever was in x10. It is expected that the caller put the address of an exit thunk in x10, so that blr x11 will either perform the ARM64 call or call the exit thunk for an X64 call.

The exit thunk should set up a call frame, copy arguments from their Arm64EC ABI locations to their X64 ABI locations, perform the X64 call, transfer the result back to its Arm64EC location, then tear down its call frame and return. To perform the X64 call, it should put the X64 pointer in x9 (if not already in x9), and then call __os_arm64x_dispatch_call_no_redirect. If making a vararg call, then this is where x5 comes in to play, as the exit thunk needs to copy the stack-based arguments to a new location at the bottom of the stack.

The call to __os_arm64x_dispatch_call_no_redirect must be done as blr x16; the ARM64 CPU will set lr to the next ARM64 instruction, the X64 emulator will then push lr on to the stack as the return address, and when the emulated X64 code eventually tries to resume execution at this address, the emulator will recognise the blr x16 preceding this address and perform a return-like transition.

The (per-module) global variables __os_arm64x_dispatch_call_no_redirect and __os_arm64x_dispatch_ret and __os_arm64x_x64_jump end up pointing to functions exported from xtajit.dll. Again the linker and loader conspire to populate these variables (this is not the usual GOT / IAT machinery, though the end effect is similar). All three functions are entry points to the X64 emulator, and follow a similar pattern:

                       | __os_arm64x_dispatch_call_no_redirect   | __os_arm64x_dispatch_ret              | __os_arm64x_x64_jump
Intended usage         | Call X64 code (usually from exit thunk) | Return to X64 code after entry thunk  | Tailcall ARM64 or X64 code in adjustor thunk
Architecture check     | None (target assumed to be X64)         | None (target assumed to be X64)       | If RtlIsEcCode(x9) then tailcall x9's entry thunk
Stack manipulation     | Push lr (so that X64 can return)        | None (was done pre-entry-thunk)       | Push lr (revert pre-entry-thunk manipulation)
Passthrough registers (extra non-volatiles) | x0 through x3, q0 through q15 | x8, q0 through q3, q6 through q15 | x0 through x3, q0 through q15 (plus x4 through x9 if tailcalling ARM64)
X64 execution target   | x9                                      | lr                                    | x9

The observant reader will notice that __os_arm64x_x64_jump isn't quite sufficient for its intended usage. If it was sufficient, then its pseudocode would be something like:

__os_arm64x_x64_jump:     // hoped-for implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    x9 = ResolveFastForwardSequences(x9)
    if (RtlIsEcCode(x9)) {
      x11 = GetEntryThunk(x9)
      br x11
    } else {
      revert the pre-entry-thunk stack manipulation
      br_to_X64_mode x9
    }
  }

Unfortunately, at least in the version of Windows 11 that I'm using right now, its pseudocode is more like:

__os_arm64x_x64_jump:      // actual implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    str lr, [sp, #-8]! // push lr
    br_to_X64_mode x9
  }

In other words, it is deficient in two ways:

  1. Not checking for fast-forward sequences. This is unfortunate from a performance perspective, but not otherwise harmful.
  2. Assuming that the pre-entry-thunk stack manipulation was just popping into lr. This is true most of the time, but not all the time, and all sorts of mayhem can arise if this assumption is wrong.

To address point 2, the tail of __os_arm64x_x64_jump wants to be more like the following:

if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  br_to_X64_mode x9
}

A more complex, but ultimately more useful, variant of the above is:

if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  str x9, [sp, #-8]! // push x9
  adr lr, x64_ret_stub
  br_to_X64_mode lr
}

The adr in the above can be dropped, as if sp != x4, it means that the pre-entry-thunk manipulation will already have put x64_ret_stub in lr. At this point, the two branches differ only in swapping x9/lr, and so the swap can be pulled out:

if (unlikely(sp != x4)) {
  swap(x9, lr)
  mov x4, sp
}
str lr, [sp, #-8]! // push lr
br_to_X64_mode x9

Following this line of thinking, we can sidestep this particular deficiency of __os_arm64x_x64_jump by inserting the following immediately before the br to __os_arm64x_x64_jump:

cmp sp, x4          // sp == x4?
mov x4, x9
csel x9, x9, lr, eq // If not, swap
csel lr, lr, x4, eq // x9 and lr.
mov x4, sp

Alternatively, we could provide our own version of __os_arm64x_x64_jump, built using __os_arm64x_check_icall and __os_arm64x_dispatch_call_no_redirect, which fixes both deficiencies:

my_arm64x_x64_jump:
  adrp x17, __os_arm64x_check_icall
  ldr x17, [x17, __os_arm64x_check_icall]
  mov x11, x9 // __os_arm64x_check_icall main input
  adr x10, after_blr // __os_arm64x_check_icall thunk input
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  blr x17 // __os_arm64x_check_icall
after_blr:
  ldrsw x16, [x11, #-4] // load entry thunk offset
  adrp x17, __os_arm64x_dispatch_call_no_redirect
  cmp x10, x11 // is x64 target?
  ldr x17, [x17, __os_arm64x_dispatch_call_no_redirect]
  csel x10, x9, x11, eq // x10 will become x9 later
  sub x11, x11, #1 // bias for 0b01 in entry thunk offset
  sub x9, sp, x4
  add x11, x11, x16 // entry thunk address
  ldp fp, lr, [sp], #16
  csel x11, x17, x11, eq
  ccmn x9, #16, #4, eq
  csel x9, x10, lr, eq
  csel lr, lr, x10, eq
  br x11 // entry thunk or __os_arm64x_dispatch_call_no_redirect

Windows ARM64 Frame Unwind Code Details

This post is intended to supplement MSDN's ARM64 exception handling information. To start, here's a fleshed-out version of a table listing the available unwind codes:

Name          | Encoding         | Prolog Instruction                     | Epilog Instruction                     | Unwind Effect
alloc_s       | 000iiiii         | sub sp, sp, i*16                       | add sp, sp, i*16                       | Emulate epilog instruction
alloc_m       | 11000iiiiiiiiiii | sub sp, sp, i*16                       | add sp, sp, i*16                       | Emulate epilog instruction
alloc_l       | 11100000 i:u24   | sub sp, sp, i*16                       | add sp, sp, i*16                       | Emulate epilog instruction
add_fp        | 11100010 i:u8    | add fp, sp, i*8 (NB: not 16)           | sub sp, fp, i*8 (NB: not 16)           | Emulate epilog instruction
set_fp        | 11100001         | mov fp, sp                             | mov sp, fp                             | Emulate epilog instruction
pac_sign_lr   | 11111100         | pacibsp (sign lr using sp)             | autibsp (authenticate lr using sp)     | Emulate autibsp or xpaclri
save_fplr     | 01iiiiii         | stp fp, lr, [sp+i*8]                   | ldp fp, lr, [sp+i*8]                   | Emulate epilog instruction
save_lrpair   | 1101011nnniiiii  | stp x(19+2*n), lr, [sp+i*8]            | ldp x(19+2*n), lr, [sp+i*8]            | Emulate epilog instruction
save_fplr_x   | 10iiiiii         | stp fp, lr, [sp-(i+1)*8]!              | ldp fp, lr, [sp], (i+1)*8              | Emulate epilog instruction
save_regp     | 110010nnnniiiiii | stp x(19+n), x(20+n), [sp+i*8]         | ldp x(19+n), x(20+n), [sp+i*8]         | Emulate epilog instruction
save_regp_x   | 110011nnnniiiiii | stp x(19+n), x(20+n), [sp-(i+1)*8]!    | ldp x(19+n), x(20+n), [sp], (i+1)*8    | Emulate epilog instruction
save_r19r20_x | 001iiiii         | stp x19, x20, [sp-i*8]! (NB: not i+1)  | ldp x19, x20, [sp], i*8 (NB: not i+1)  | Emulate epilog instruction
save_next     | 11100110         | stp similar to previous non-lr stp     | ldp similar to next non-lr ldp         | Extend next effect by two
save_reg      | 110100nnnniiiiii | str x(19+n), [sp+i*8]                  | ldr x(19+n), [sp+i*8]                  | Emulate epilog instruction
save_reg_x    | 1101010nnnniiiii | str x(19+n), [sp-(i+1)*8]!             | ldr x(19+n), [sp], (i+1)*8             | Emulate epilog instruction
save_fregp    | 1101100nnniiiiii | stp d(8+n), d(9+n), [sp+i*8]           | ldp d(8+n), d(9+n), [sp+i*8]           | Emulate epilog instruction
save_fregp_x  | 1101101nnniiiiii | stp d(8+n), d(9+n), [sp-(i+1)*8]!      | ldp d(8+n), d(9+n), [sp], (i+1)*8      | Emulate epilog instruction
save_freg     | 1101110nnniiiiii | str d(8+n), [sp+i*8]                   | ldr d(8+n), [sp+i*8]                   | Emulate epilog instruction
save_freg_x   | 11011110nnniiiii | str d(8+n), [sp-(i+1)*8]!              | ldr d(8+n), [sp], (i+1)*8              | Emulate epilog instruction
save_any_reg  | 11100111 x:u16 * | One str or stp to stack *              | One ldr or ldp from stack *            | Varies *
custom        | 11101xxx         | No instruction                         | No instruction                         | Bespoke
reserved      | 11110xxx         | No instruction                         | No instruction                         | Unwind fails
reserved      | 11111000 x:u8    | Any one instruction                    | Any one instruction                    | No effect (yet)
reserved      | 11111001 x:u16   | Any one instruction                    | Any one instruction                    | No effect (yet)
reserved      | 11111010 x:u24   | Any one instruction                    | Any one instruction                    | No effect (yet)
reserved      | 11111011 x:u32   | Any one instruction                    | Any one instruction                    | No effect (yet)
reserved      | 11111101         | Any one instruction                    | Any one instruction                    | No effect (yet)
reserved      | 11111110         | Any one instruction                    | Any one instruction                    | No effect (yet)
reserved      | 11111111         | Any one instruction                    | Any one instruction                    | No effect (yet)
nop           | 11100011         | Any one instruction                    | Any one instruction                    | No effect
end_c         | 11100101         | No instruction, start of prolog        | Any one instruction, then end of epilog | No effect
end           | 11100100         | No instruction, start of prolog        | ret (or tailcall b), then end of epilog | Unwind complete

(*) save_any_reg was added for Arm64EC. The 16-bit payload is made from a number of fields, packed as rpxnnnnnmmiiiiii. This gives rise to many different variants:

Name         | Encoding                   | Prolog Instruction                 | Epilog Instruction                 | Unwind Effect
-------------|----------------------------|------------------------------------|------------------------------------|----------------------------
save_any_reg | 11100111 000nnnnn 00iiiiii | str x(0+n), [sp+i*8]               | ldr x(0+n), [sp+i*8]               | Emulate epilog instruction
save_any_reg | 11100111 001nnnnn 00iiiiii | str x(0+n), [sp-(i+1)*16]!         | ldr x(0+n), [sp], (i+1)*16         | Emulate epilog instruction
save_any_reg | 11100111 010nnnnn 00iiiiii | stp x(0+n), x(1+n), [sp+i*16]      | ldp x(0+n), x(1+n), [sp+i*16]      | Emulate epilog instruction
save_any_reg | 11100111 011nnnnn 00iiiiii | stp x(0+n), x(1+n), [sp-(i+1)*16]! | ldp x(0+n), x(1+n), [sp], (i+1)*16 | Emulate epilog instruction
save_any_reg | 11100111 000nnnnn 01iiiiii | str d(0+n), [sp+i*8]               | ldr d(0+n), [sp+i*8]               | Emulate epilog instruction
save_any_reg | 11100111 001nnnnn 01iiiiii | str d(0+n), [sp-(i+1)*16]!         | ldr d(0+n), [sp], (i+1)*16         | Emulate epilog instruction
save_any_reg | 11100111 010nnnnn 01iiiiii | stp d(0+n), d(1+n), [sp+i*16]      | ldp d(0+n), d(1+n), [sp+i*16]      | Emulate epilog instruction
save_any_reg | 11100111 011nnnnn 01iiiiii | stp d(0+n), d(1+n), [sp-(i+1)*16]! | ldp d(0+n), d(1+n), [sp], (i+1)*16 | Emulate epilog instruction
save_any_reg | 11100111 000nnnnn 10iiiiii | str q(0+n), [sp+i*16] (NB: not 8)  | ldr q(0+n), [sp+i*16] (NB: not 8)  | Emulate epilog instruction
save_any_reg | 11100111 001nnnnn 10iiiiii | str q(0+n), [sp-(i+1)*16]!         | ldr q(0+n), [sp], (i+1)*16         | Emulate epilog instruction
save_any_reg | 11100111 010nnnnn 10iiiiii | stp q(0+n), q(1+n), [sp+i*16]      | ldp q(0+n), q(1+n), [sp+i*16]      | Emulate epilog instruction
save_any_reg | 11100111 011nnnnn 10iiiiii | stp q(0+n), q(1+n), [sp-(i+1)*16]! | ldp q(0+n), q(1+n), [sp], (i+1)*16 | Emulate epilog instruction
reserved     | 11100111 1xxxxxxx xxxxxxxx | Any one instruction                | Any one instruction                | Unwind fails
reserved     | 11100111 xxxxxxxx 11xxxxxx | Any one instruction                | Any one instruction                | Unwind fails

(†) The save_next unwind code requires further explanation. It causes the next unwind code to load four registers rather than two. It can be stacked; a sequence of N save_next unwind codes causes the next unwind code thereafter to load (N+1)*2 registers. Said next unwind code must be one of save_regp / save_regp_x / save_r19r20_x / save_fregp / save_fregp_x (notably, this excludes pair instructions with lr in their name). For example, save_next followed by save_regp with n = 0 restores x19 through x22 (two adjacent pairs at consecutive 16-byte offsets), and two save_next codes followed by the same save_regp restore x19 through x24.

The MSDN documentation suggests that a sufficiently long sequence of save_next codes can overflow from x registers to d registers, but best not to try this; stop at x31 or d15.

(‡) In a prolog, end_c corresponds to no instruction, and also causes subsequent codes to correspond to no instruction. In an epilog, end_c corresponds to ret (or b, or any other one instruction that has no unwind effect), and causes subsequent codes to correspond to no instruction. In either case, subsequent codes (up to the first end) are still executed during unwind.
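
One practical consequence of the Encoding column is that the byte length of every unwind code is determined by its first byte alone, which an unwinder needs in order to step through (and skip) codes, as the prolog and epilog rules below require. A minimal C sketch, with a helper name of my own invention:

#include <stdint.h>

/* Size in bytes of the unwind code whose first byte is b, derived from the
   Encoding column above. (0xDF is left unassigned by the table, but every
   110xxxxx form it does assign is two bytes.) */
static unsigned unwind_code_size(uint8_t b) {
  if (b < 0xC0) return 1; /* alloc_s, save_r19r20_x, save_fplr, save_fplr_x */
  if (b < 0xE0) return 2; /* alloc_m and the save_reg* / save_freg* family */
  switch (b) {
  case 0xE0: return 4;    /* alloc_l: i:u24 payload */
  case 0xE2: return 2;    /* add_fp: i:u8 payload */
  case 0xE7: return 3;    /* save_any_reg: x:u16 payload */
  case 0xF8: return 2;    /* reserved: x:u8 payload */
  case 0xF9: return 3;    /* reserved: x:u16 payload */
  case 0xFA: return 4;    /* reserved: x:u24 payload */
  case 0xFB: return 5;    /* reserved: x:u32 payload */
  default:   return 1;    /* all remaining codes are a single byte */
  }
}

With this, skipping N codes is a simple loop over first bytes.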

Moving on, each .xdata record describes a contiguous region of machine instructions. There is typically a 1:1 correspondence between regions and functions, though this needn't be true. A region contains:

  1. Up to 2^18 4-byte instructions, i.e. up to 1MiB of machine code (the header's length field is 18 bits wide, counted in instruction units).
  2. Up to 31 epilogs, or up to 65535 if the extended form of the header is used.
  3. Up to 31 32-bit words of unwind codes (124 bytes' worth), or up to 255 words if the extended form of the header is used.

If any of the above limits would be exceeded, then the region needs to be artificially split into multiple smaller regions, such that no limit is exceeded. The MSDN documentation suggests that split boundaries should be chosen so as not to fall in the middle of an epilog (this suggestion would not be required if end_c were treated as "No instruction, end of epilog", but alas it is treated as "Any one instruction, then end of epilog").
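
For concreteness, here's a C sketch of decoding the start of an .xdata record, following MSDN's bit layout; the struct and function names are my own, and the limits in the list above fall out of the field widths:

#include <stdint.h>

/* The first word(s) of an .xdata record; names are mine, layout is MSDN's. */
typedef struct {
  uint32_t function_length; /* in 4-byte instruction units */
  uint32_t version;         /* currently always 0 */
  uint32_t x;               /* 1 if an exception handler is specified */
  uint32_t e;               /* 1 if the header describes a single epilog */
  uint32_t epilog_count;    /* or, if e is set, the epilog's unwind code index */
  uint32_t code_words;      /* number of 32-bit words of unwind codes */
} xdata_header;

static xdata_header parse_xdata_header(const uint32_t *xdata) {
  uint32_t w = xdata[0];
  xdata_header h;
  h.function_length = w & 0x3FFFF;      /* bits 0-17  */
  h.version         = (w >> 18) & 3;    /* bits 18-19 */
  h.x               = (w >> 20) & 1;    /* bit 20     */
  h.e               = (w >> 21) & 1;    /* bit 21     */
  h.epilog_count    = (w >> 22) & 0x1F; /* bits 22-26 */
  h.code_words      = (w >> 27) & 0x1F; /* bits 27-31 */
  if (h.epilog_count == 0 && h.code_words == 0) {
    /* Both fields zero: an extension word follows with wider fields. */
    uint32_t ext = xdata[1];
    h.epilog_count = ext & 0xFFFF;       /* bits 0-15  */
    h.code_words   = (ext >> 16) & 0xFF; /* bits 16-23 */
  }
  return h;
}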

The length of the prolog is not explicitly given; it is calculated by analysing the unwind codes strictly preceding the first end or end_c and counting how many of them correspond to one instruction. Said unwind codes, once reversed, should correspond 1:1 with instructions in the prolog (nop codes can be used to make things line up, if there are instructions in the prolog which do not need an unwind effect). If unwinding from within the prolog, we calculate how many instructions we are away from the end of the prolog, and skip that many unwind codes from the start of the list. To indicate a lack of prolog, the first unwind code should be end_c (or end if no unwind effects are required by the main body of the region). Note that unwind effects after end_c are still executed if unwinding from within the prolog (though there's a Wine bug here).
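
As a C sketch of that calculation (reusing unwind_code_size from earlier, and relying on the table above for which codes correspond to no instruction):

#include <stddef.h>
#include <stdint.h>

/* Count how many prolog instructions the unwind codes describe: walk the
   codes strictly preceding the first end/end_c, counting each one except
   custom (11101xxx) and payload-less reserved (11110xxx), whose prolog
   column reads "No instruction". */
static unsigned prolog_instruction_count(const uint8_t *codes, size_t len) {
  unsigned count = 0;
  for (size_t pos = 0; pos < len; pos += unwind_code_size(codes[pos])) {
    uint8_t b = codes[pos];
    if (b == 0xE4 || b == 0xE5) break;    /* end / end_c */
    if (b >= 0xE8 && b <= 0xF7) continue; /* corresponds to no instruction */
    count += 1;                           /* everything else: one instruction */
  }
  return count;
}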

The length of each epilog is not explicitly given; it is calculated by analysing the unwind codes starting at the epilog-specific offset and continuing up to and including the first end or end_c thereafter. Said unwind codes should correspond 1:1 with instructions in the epilog (nop codes can be used to make things line up, if there are instructions in the epilog which do not need an unwind effect). If unwinding from within an epilog, we calculate how many instructions we are away from the start of the epilog, and skip that many unwind codes from the start of the list. Note that unwind effects after end_c are still executed if unwinding from within an epilog (though again there's a Wine bug here).

If the E bit is set in the .xdata header, then the end of the epilog is equal to the end of the region (the start of the epilog is found by calculating the length of the epilog and subtracting this from the end). If the E bit is not set, then the start of each epilog is explicitly specified.
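
When the E bit is not set, each epilog is described by a 32-bit scope word: per MSDN, bits 0-17 hold the epilog's start offset (in 4-byte instruction units from the start of the region) and bits 22-31 hold the byte index of its first unwind code. A sketch of decoding one (names are my own):

#include <stdint.h>

/* One epilog scope word (present only when the E bit is clear). */
typedef struct {
  uint32_t start_offset; /* bits 0-17: from region start, in 4-byte units */
  uint32_t start_index;  /* bits 22-31: byte index into the unwind codes */
} epilog_scope;

static epilog_scope parse_epilog_scope(uint32_t w) {
  epilog_scope s;
  s.start_offset = w & 0x3FFFF;
  s.start_index  = (w >> 22) & 0x3FF;
  return s;
}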

If unwinding from a PC not within a prolog or epilog, unwind codes are executed until the first end is reached. Note that these codes are shared with those describing the prolog; this is not normally a problem, but if it is, the prolog can be split off to a separate region consisting solely of prolog, leaving no prolog in the original region.

If the X bit is set in the .xdata header, and PC is not within a prolog or epilog, then the specified function is called during exception handler search and during exception unwinding. See __C_specific_handler for the prototype of this function, and see LuaJIT for a concrete example.
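
For reference, that prototype is the standard Windows language-specific handler signature; the sketch below uses the <windows.h> types rather than being copied from any header:

#include <windows.h>

/* Invoked during both exception handler search and exception unwinding;
   the ExceptionFlags in ExceptionRecord distinguish the two phases. */
EXCEPTION_DISPOSITION __C_specific_handler(
    EXCEPTION_RECORD *ExceptionRecord,      /* what is being thrown */
    void *EstablisherFrame,                 /* frame being searched/unwound */
    CONTEXT *ContextRecord,                 /* register state */
    DISPATCHER_CONTEXT *DispatcherContext); /* unwind metadata for the frame */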
