Windows Arm64EC ABI Notes
The basic premise of Microsoft's Arm64EC is that a single virtual address space can contain a mixture of ARM64 code and X64 code; the ARM64 code executes natively, whereas the X64 code is transparently converted to ARM64 code by a combination of JIT and AOT compilation, and ARM64 ⇄ X64 transitions can happen at any function call/return boundary.
There are some good MSDN pages on the topic:
- Understanding Arm64EC ABI and assembly code
- Overview of Arm64EC ABI conventions
- The Old New Thing ... arm64 ... part 25: the Arm64EC ABI
The first function to highlight is RtlIsEcCode; to tell apart ARM64 code and X64 code, the system maintains a bitmap with one bit per 4K page, specifying whether that page contains ARM64 code (bit set) or X64 code (bit clear). This bitmap can be queried using RtlIsEcCode. The loader sets bits in this bitmap when loading DLLs, as does VirtualAlloc2 when allocating executable memory with MEM_EXTENDED_PARAMETER_EC_CODE in MemExtendedParameterAttributeFlags.
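As a concrete (if contrived) illustration of these APIs, the following sketch allocates one executable page, marks it as EC code, and then queries the bitmap. It is only a sketch: it assumes an SDK recent enough to define MemExtendedParameterAttributeFlags and MEM_EXTENDED_PARAMETER_EC_CODE, and it resolves VirtualAlloc2 and RtlIsEcCode dynamically (from kernelbase.dll and ntdll.dll respectively) rather than relying on import libraries.
// Sketch, not from the ABI documentation: mark a freshly allocated executable
// page as EC code and confirm that its bit in the EC bitmap is now set.
#include <windows.h>
#include <stdio.h>

typedef PVOID (WINAPI *VirtualAlloc2_t)(HANDLE, PVOID, SIZE_T, ULONG, ULONG,
                                        MEM_EXTENDED_PARAMETER*, ULONG);
typedef BOOLEAN (NTAPI *RtlIsEcCode_t)(ULONG64);

int main(void) {
    VirtualAlloc2_t pVirtualAlloc2 = (VirtualAlloc2_t)GetProcAddress(
        GetModuleHandleW(L"kernelbase.dll"), "VirtualAlloc2");
    RtlIsEcCode_t pRtlIsEcCode = (RtlIsEcCode_t)GetProcAddress(
        GetModuleHandleW(L"ntdll.dll"), "RtlIsEcCode");
    if (!pVirtualAlloc2 || !pRtlIsEcCode) return 1;

    MEM_EXTENDED_PARAMETER param = {0};
    param.Type = MemExtendedParameterAttributeFlags;
    param.ULong64 = MEM_EXTENDED_PARAMETER_EC_CODE;

    // Protection choice is illustrative; only the EC_CODE attribute matters here.
    void* page = pVirtualAlloc2(GetCurrentProcess(), NULL, 4096,
                                MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE,
                                &param, 1);
    if (!page) return 1;
    printf("RtlIsEcCode(%p) = %d\n", page, (int)pRtlIsEcCode((ULONG64)page));
    return 0;
}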
After the X64 emulator executes a call or ret or jmp that crosses a page boundary, it needs to determine whether the new rip points to ARM64 code or X64 code. If !RtlIsEcCode(rip), then the X64 emulator can happily continue emulating X64 code. However, if RtlIsEcCode(rip), then a transition out of the emulator needs to be performed. This can be a call-like transition (X64 calling ARM64), or a return-like transition (X64 returning to ARM64). The transition type is determined by looking at the four bytes before rip; if they contain the encoding of blr x16 (0xd63f0200), then a return-like transition is performed by setting pc to rip. Otherwise, a call-like transition is performed by setting x9 to rip and setting pc to rip's entry thunk. To find the entry thunk, the four bytes before rip are used; the low two bits need to be 0b01, and after masking off said bits, the 32 bits are sign extended and then added to rip (the resultant value must be different to rip, i.e. a function cannot be its own entry thunk).
Before transferring control to the entry thunk, the X64 emulator performs a little manipulation:
ldr lr, [sp], #8 // pop return address (skipping sp alignment check)
mov x4, sp
After this, it ensures sp is 16-byte aligned:
if (unlikely(sp & 8)) {
  str lr, [sp, #-8]!   // push return address again (again skipping the check)
  adr lr, x64_ret_stub // an X64 funclet containing just "ret"
}
In other words, the stack will look like whichever of the following diagrams gives rise to an aligned sp:
◀- lower addresses                                  higher addresses -▶
               x4
               |
               ▼
 ... retaddr   home0 home1 home2 home3 arg4 arg5 ...
               ▲
               |
               sp                      lr = retaddr

◀- lower addresses                                  higher addresses -▶
               x4
               |
               ▼
 ... retaddr   home0 home1 home2 home3 arg4 arg5 ...
     ▲
     |
     sp                                lr = x64_ret_stub
The 32 bytes of X64 home space begin at x4, and any arguments passed on the stack begin at x4+#32. The entry thunk is free to use the X64 home space if it wants, and some of the MSDN documentation suggests using it as a place to save q6 and q7, but this is problematic, as to later restore from this space, we'd need to have saved x4 (not to mention that no unwind codes can load from x4). In practice, only 24 of the 32 bytes are easily usable: depending on which of the two diagrams applies, the home space occupies either sp through sp+#32 or sp+#8 through sp+#40, so only sp+#8 through sp+#32 is guaranteed to coincide with the home space.
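The arithmetic behind that last claim can be spelled out with a toy calculation (nothing here comes from the ABI itself; it merely re-derives the 24-byte figure):
// Toy calculation backing the "24 usable bytes" claim: the home space is
// always [x4, x4+32), while sp ends up at either x4 (aligned case) or x4-8
// (re-push case), so only [sp+8, sp+32) lies within the home space in both.
#include <stdio.h>

int main(void) {
    for (int repushed = 0; repushed <= 1; repushed++) {
        int sp_rel_x4 = repushed ? -8 : 0;        // sp - x4 in each diagram
        int home_lo = 0  - sp_rel_x4;             // x4      == sp + home_lo
        int home_hi = 32 - sp_rel_x4;             // x4 + 32 == sp + home_hi
        printf("%s case: home space = [sp+%d, sp+%d)\n",
               repushed ? "re-push" : "aligned", home_lo, home_hi);
    }
    return 0;  // intersection of [sp+0,sp+32) and [sp+8,sp+40) is [sp+8,sp+32)
}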
The entry thunk can either be a copy of the original function that speaks the X64 ABI, or it can copy the arguments from their X64 ABI locations to their Arm64EC ABI locations, call the original function (helpfully provided in x9), and then transfer the results back to their X64 ABI locations. If the original function is vararg, then the Arm64EC ABI dictates that x4 should contain a pointer to the stack arguments (so add #32) and that x5 should contain the size of the stack arguments (which is not generally known; MSVC-generated thunks populate x5 with #0 in this case).
In either case, the entry thunk needs to ensure that X64 ABI non-volatile registers are preserved (which translates to ARM64 ABI non-volatile registers, plus q6 through q15), and then needs to return to X64 once it is done. That is achieved by means of a tailcall to __os_arm64x_dispatch_ret, which resumes X64 execution at lr.
If the original function modified arguments in-place and then tailcalled something else (an adjustor function), then the entry thunk for it can instead be an adjustor thunk: modify the arguments in-place (in their X64 ABI locations), put the tailcall target in x9, and then tailcall __os_arm64x_x64_jump. If x9 points to ARM64 code, then __os_arm64x_x64_jump will tailcall x9's entry thunk (which will then consume arguments from their X64 locations), whereas if x9 points to X64 code, then it'll act like __os_arm64x_dispatch_call_no_redirect (which will again consume arguments from their X64 locations).
The other side of things is when native ARM64 code wants to make an indirect function call. The whole premise of Arm64EC is that function pointers at ABI boundaries can point to either ARM64 functions or X64 functions, and a priori the caller doesn't know which it has. The recommendation is to put the call target in x11 and then call __os_arm64x_dispatch_icall, which will leave x11 as-is if it is ARM64 code, or copy x10 to x11 if not.
Note that __os_arm64x_dispatch_icall is not a function per se, but instead a (per-module) global variable containing a function pointer. The distinction is important, as it affects how to call it; it cannot be a bl target, instead it has to be loaded into a register (e.g. by adrp and ldr) and then used as a blr target. The linker and loader conspire to put LdrpValidateEcCallTarget in this variable, the pseudocode for which is:
LdrpValidateEcCallTarget: // aka. __os_arm64x_dispatch_icall
  // Specially written to only mutate x16/x17
  // (and x9/x11 where indicated)
  if (RtlIsEcCode(x11)) {
    // x11 unchanged
    // x9 unspecified, in practice unchanged from input
  } else {
    x9 = ResolveFastForwardSequences(x11);
    if (RtlIsEcCode(x9)) {
      x11 = x9;
      // x9 unspecified, in practice same as exit x11
    } else {
      x11 = x10;
      // x9 specified as X64 pointer
    }
  }
  // x0-x8,x10,x15 specified as unchanged from input
  // q0-q7 specified as unchanged from input
  // x12 unspecified, in practice unchanged from input
  return;
ResolveFastForwardSequences(uintptr_t p):
  // jmp qword [rip+imm32]
  if (bytes_match(p, "ff 25 ?? ?? ?? ??")) {
    p += 6;
    int32_t imm32 = ((int32_t*)p)[-1];
    p = *(uintptr_t*)(p + imm32);
    return ResolveFastForwardSequences(p);
  }
  if (p & 15) {
    return p;
  }
  // mov rax, rsp; mov [rax+32], rbx; push rbp; pop rbp; jmp rel32
  // mov rdi, rdi; push rbp; mov rbp, rsp; pop rbp; nop; jmp rel32
  if (bytes_match(p, "48 8b c4 48 89 58 20 55 5d e9 ?? ?? ?? ??")
  ||  bytes_match(p, "48 8b ff 55 48 8b ec 5d 90 e9 ?? ?? ?? ??")) {
    p += 14;
    int32_t rel32 = ((int32_t*)p)[-1];
    p += rel32;
    return ResolveFastForwardSequences(p);
  }
  // mov r10, rcx; mov eax, imm32; test byte [0x7ffe0308], 0x01
  // jne rip+3; syscall; ret; jne_target: int 0x2e; ret
  if (bytes_match(p   , "4c 8b d1 b8 ?? ?0 00 00 f6 04 25 08")
  &&  bytes_match(p+12, "03 fe 7f 01 75 03 0f 05 c3 cd 2e c3")) {
    uint32_t imm32 = *(uint32_t*)(p + 4);
    return SyscallFunctionTable[imm32];
  }
  return p;
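The pseudocode above relies on a bytes_match helper; a plausible C implementation, written for this note rather than taken from the loader, treats the pattern as space-separated lowercase hex byte pairs in which ? makes the corresponding nibble a wildcard:
// Possible implementation of the bytes_match helper used above; "??" matches
// any byte, and a single '?' wildcards just that nibble. Illustration only.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

bool bytes_match(uintptr_t p, const char* pattern) {
    const uint8_t* code = (const uint8_t*)p;
    while (*pattern) {
        if (*pattern == ' ') { ++pattern; continue; }  // skip separators
        uint8_t byte = *code++;
        if (pattern[0] != '?' && hex_digit(pattern[0]) != (byte >> 4)) return false;
        if (pattern[1] != '?' && hex_digit(pattern[1]) != (byte & 15)) return false;
        pattern += 2;                                  // consume the hex pair
    }
    return true;  // every pattern byte matched
}

int main(void) {
    // jmp qword [rip+0x12345678], as in the first fast-forward sequence
    uint8_t jmp_indirect[] = { 0xff, 0x25, 0x78, 0x56, 0x34, 0x12 };
    printf("%d\n", bytes_match((uintptr_t)jmp_indirect, "ff 25 ?? ?? ?? ??"));
    return 0;
}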
The related __os_arm64x_dispatch_icall_cfg is also a (per-module) global variable containing a function pointer; courtesy of the linker and loader it'll end up containing either LdrpValidateEcCallTarget or LdrpValidateEcCallTargetCfg or LdrpValidateEcCallTargetCfgES, the latter two of which perform a control-flow-guard (CFG) test against x11 and then tailcall LdrpValidateEcCallTarget.
If LdrpValidateEcCallTarget found X64 code, then x9 will end up containing the X64 pointer, and x11 will end up containing whatever was in x10. It is expected that the caller put the address of an exit thunk in x10, so that blr x11 will either perform the ARM64 call or call the exit thunk for an X64 call.
The exit thunk should set up a call frame, copy arguments from their Arm64EC ABI locations to their X64 ABI locations, perform the X64 call, transfer the result back to its Arm64EC location, then tear down its call frame and return. To perform the X64 call, it should put the X64 pointer in x9 (if not already in x9), and then call __os_arm64x_dispatch_call_no_redirect. If making a vararg call, then this is where x5 comes into play, as the exit thunk needs to copy the stack-based arguments to a new location at the bottom of the stack.
The call to __os_arm64x_dispatch_call_no_redirect must be done as blr x16; the ARM64 CPU will set lr to the next ARM64 instruction, the X64 emulator will then push lr on to the stack as the return address, and when the emulated X64 code eventually tries to resume execution at this address, the emulator will recognise the blr x16 preceding this address and perform a return-like transition.
The (per-module) global variables __os_arm64x_dispatch_call_no_redirect and __os_arm64x_dispatch_ret and __os_arm64x_x64_jump end up pointing to functions exported from xtajit.dll. Again the linker and loader conspire to populate these variables (this is not the usual GOT / IAT machinery, though the end effect is similar). All three functions are entry points to the X64 emulator, and follow a similar pattern:
| | __os_arm64x_dispatch_call_no_redirect | __os_arm64x_dispatch_ret | __os_arm64x_x64_jump |
|---|---|---|---|
| Intended usage | Call X64 code (usually from an exit thunk) | Return to X64 code after an entry thunk | Tailcall ARM64 or X64 code in an adjustor thunk |
| Architecture check | None (target assumed to be X64) | None (target assumed to be X64) | If RtlIsEcCode(x9), then tailcall x9's entry thunk |
| Stack manipulation | Push lr (so that X64 can return) | None (was done pre-entry-thunk) | Push lr (revert pre-entry-thunk manipulation) |
| Passthrough registers (extra non-volatiles) | x0 through x3, q0 through q15 | x8, q0 through q3, q6 through q15 | x0 through x3, q0 through q15 (plus x4 through x9 if tailcalling ARM64) |
| X64 execution target | x9 | lr | x9 |
The observant reader will notice that __os_arm64x_x64_jump isn't quite sufficient for its intended usage. If it was sufficient, then its pseudocode would be something like:
__os_arm64x_x64_jump: // hoped-for implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    x9 = ResolveFastForwardSequences(x9)
    if (RtlIsEcCode(x9)) {
      x11 = GetEntryThunk(x9)
      br x11
    } else {
      revert the pre-entry-thunk stack manipulation
      br_to_X64_mode x9
    }
  }
Unfortunately, at least in the version of Windows 11 that I'm using right now, its pseudocode is more like:
__os_arm64x_x64_jump: // actual implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    str lr, [sp, #-8]! // push lr
    br_to_X64_mode x9
  }
In other words, it is deficient in two ways:
- Not checking for fast-forward sequences. This is unfortunate from a performance perspective, but not otherwise harmful.
- Assuming that the pre-entry-thunk stack manipulation was just popping into lr. This is true most of the time, but not all the time, and all sorts of mayhem can arise if this assumption is wrong.
To address point 2, the tail of __os_arm64x_x64_jump wants to be more like the following:
if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  br_to_X64_mode x9
}
A more complex, but ultimately more useful, variant of the above is:
if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  str x9, [sp, #-8]! // push x9
  adr lr, x64_ret_stub
  br_to_X64_mode lr
}
The adr in the above can be dropped, because if sp != x4, the pre-entry-thunk manipulation will already have put x64_ret_stub in lr. At this point, the two branches differ only in swapping x9/lr, and so the swap can be pulled out:
if (unlikely(sp != x4)) {
  swap(x9, lr)
  mov x4, sp
}
str lr, [sp, #-8]! // push lr
br_to_X64_mode x9
Following this line of thinking, we can sidestep this particular deficiency of __os_arm64x_x64_jump by inserting the following immediately before the br to __os_arm64x_x64_jump:
cmp sp, x4 // sp == x4?
mov x4, x9
csel x9, x9, lr, eq // If not, swap
csel lr, lr, x4, eq // x9 and lr.
mov x4, sp
Alternatively, we could provide our own version of __os_arm64x_x64_jump, built using __os_arm64x_check_icall and __os_arm64x_dispatch_call_no_redirect, which fixes both deficiencies:
my_arm64x_x64_jump:
adrp x17, __os_arm64x_check_icall
ldr x17, [x17, __os_arm64x_check_icall]
mov x11, x9 // __os_arm64x_check_icall main input
adr x10, after_blr // __os_arm64x_check_icall thunk input
stp fp, lr, [sp, #-16]!
mov fp, sp
blr x17 // __os_arm64x_check_icall
after_blr:
ldrsw x16, [x11, #-4] // load entry thunk offset
adrp x17, __os_arm64x_dispatch_call_no_redirect
cmp x10, x11 // is x64 target?
ldr x17, [x17, __os_arm64x_dispatch_call_no_redirect]
csel x10, x9, x11, eq // x10 will become x9 later
sub x11, x11, #1 // bias for 0b01 in entry thunk offset
sub x9, sp, x4
add x11, x11, x16 // entry thunk address
ldp fp, lr, [sp], #16
csel x11, x17, x11, eq
ccmn x9, #16, #4, eq
csel x9, x10, lr, eq
csel lr, lr, x10, eq
br x11 // entry thunk or __os_arm64x_dispatch_call_no_redirect