Windows Arm64EC ABI Notes
The basic premise of Microsoft's Arm64EC is that a single virtual address space can contain a mixture of ARM64 code and X64 code; the ARM64 code executes natively, whereas the X64 code is transparently converted to ARM64 code by a combination of JIT and AOT compilation, and ARM64 ⇄ X64 transitions can happen at any function call/return boundary.
There are some good MSDN pages on the topic:
- Understanding Arm64EC ABI and assembly code
- Overview of Arm64EC ABI conventions
- The Old New Thing ... arm64 ... part 25: the Arm64EC ABI
The first function to highlight is RtlIsEcCode; to tell apart ARM64 code and X64 code, the system maintains a bitmap with one bit per 4K page, specifying whether that page contains ARM64 code (bit set) or X64 code (bit clear). This bitmap can be queried using RtlIsEcCode. The loader sets bits in this bitmap when loading DLLs, as does VirtualAlloc2 when allocating executable memory with MEM_EXTENDED_PARAMETER_EC_CODE in MemExtendedParameterAttributeFlags.
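As a concrete (if contrived) illustration of these APIs, the following sketch allocates one executable page, marks it as EC code, and then queries the bitmap. It is only a sketch: it assumes an SDK recent enough to define MemExtendedParameterAttributeFlags and MEM_EXTENDED_PARAMETER_EC_CODE, and it resolves VirtualAlloc2 and RtlIsEcCode dynamically (from kernelbase.dll and ntdll.dll respectively) rather than relying on import libraries.
// Sketch, not from the ABI documentation: mark a freshly allocated executable
// page as EC code and confirm that its bit in the EC bitmap is now set.
#include <windows.h>
#include <stdio.h>

typedef PVOID (WINAPI *VirtualAlloc2_t)(HANDLE, PVOID, SIZE_T, ULONG, ULONG,
                                        MEM_EXTENDED_PARAMETER*, ULONG);
typedef BOOLEAN (NTAPI *RtlIsEcCode_t)(ULONG64);

int main(void) {
    VirtualAlloc2_t pVirtualAlloc2 = (VirtualAlloc2_t)GetProcAddress(
        GetModuleHandleW(L"kernelbase.dll"), "VirtualAlloc2");
    RtlIsEcCode_t pRtlIsEcCode = (RtlIsEcCode_t)GetProcAddress(
        GetModuleHandleW(L"ntdll.dll"), "RtlIsEcCode");
    if (!pVirtualAlloc2 || !pRtlIsEcCode) return 1;

    MEM_EXTENDED_PARAMETER param = {0};
    param.Type = MemExtendedParameterAttributeFlags;
    param.ULong64 = MEM_EXTENDED_PARAMETER_EC_CODE;

    // Protection choice is illustrative; only the EC_CODE attribute matters here.
    void* page = pVirtualAlloc2(GetCurrentProcess(), NULL, 4096,
                                MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE,
                                &param, 1);
    if (!page) return 1;
    printf("RtlIsEcCode(%p) = %d\n", page, (int)pRtlIsEcCode((ULONG64)page));
    return 0;
}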
After the X64 emulator executes a call or ret or jmp that crosses a page boundary, it needs to determine whether the new rip points to ARM64 code or X64 code. If !RtlIsEcCode(rip), then the X64 emulator can happily continue emulating X64 code. However, if RtlIsEcCode(rip), then a transition out of the emulator needs to be performed. This can be a call-like transition (X64 calling ARM64), or a return-like transition (X64 returning to ARM64). The transition type is determined by looking at the four bytes before rip; if they contain the encoding of blr x16 (0xd63f0200), then a return-like transition is performed by setting pc to rip. Otherwise, a call-like transition is performed by setting x9 to rip and setting pc to rip's entry thunk. To find the entry thunk, the four bytes before rip are used; the low two bits need to be 0b01, and after masking off said bits, the 32 bits are sign extended and then added to rip (the resultant value must be different to rip, i.e. a function cannot be its own entry thunk).
Before transferring control to the entry thunk, the X64 emulator performs a little manipulation:
ldr lr, [sp], #8 // pop return address (skipping sp alignment check)
mov x4, sp
After this, it ensures sp is 16-byte aligned:
if (unlikely(sp & 8)) {
  str lr, [sp, #-8]!   // push return address again (again skipping the check)
  adr lr, x64_ret_stub // an X64 funclet containing just "ret"
}
In other words, the stack will look like whichever of the following diagrams gives rise to an aligned sp:
◀- lower addresses                                  higher addresses -▶
               x4
               |
               ▼
 ... retaddr   home0 home1 home2 home3 arg4 arg5 ...
               ▲
               |
               sp                      lr = retaddr

◀- lower addresses                                  higher addresses -▶
               x4
               |
               ▼
 ... retaddr   home0 home1 home2 home3 arg4 arg5 ...
     ▲
     |
     sp                                lr = x64_ret_stub
The 32 bytes of X64 home space begin at x4, and any arguments passed on the stack begin at x4+#32. The entry thunk is free to use the X64 home space if it wants, and some of the MSDN documentation suggests using it as a place to save q6 and q7, but this is problematic, as to later restore from this space, we'd need to have saved x4 (not to mention that no unwind codes can load from x4). In practice, only 24 of the 32 bytes are easily usable: depending on which of the two diagrams applies, the home space occupies either sp through sp+#32 or sp+#8 through sp+#40, so only sp+#8 through sp+#32 is guaranteed to coincide with the home space.
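The arithmetic behind that last claim can be spelled out with a toy calculation (nothing here comes from the ABI itself; it merely re-derives the 24-byte figure):
// Toy calculation backing the "24 usable bytes" claim: the home space is
// always [x4, x4+32), while sp ends up at either x4 (aligned case) or x4-8
// (re-push case), so only [sp+8, sp+32) lies within the home space in both.
#include <stdio.h>

int main(void) {
    for (int repushed = 0; repushed <= 1; repushed++) {
        int sp_rel_x4 = repushed ? -8 : 0;        // sp - x4 in each diagram
        int home_lo = 0  - sp_rel_x4;             // x4      == sp + home_lo
        int home_hi = 32 - sp_rel_x4;             // x4 + 32 == sp + home_hi
        printf("%s case: home space = [sp+%d, sp+%d)\n",
               repushed ? "re-push" : "aligned", home_lo, home_hi);
    }
    return 0;  // intersection of [sp+0,sp+32) and [sp+8,sp+40) is [sp+8,sp+32)
}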
The entry thunk can either be a copy of the original function that speaks the X64 ABI, or it can copy the arguments from their X64 ABI locations to their Arm64EC ABI locations, call the original function (helpfully provided in x9), and then transfer the results back to their X64 ABI locations. If the original function is vararg, then the Arm64EC ABI dictates that x4 should contain a pointer to the stack arguments (so add #32) and that x5 should contain the size of the stack arguments (which is not generally known; MSVC-generated thunks populate x5 with #0 in this case).
In either case, the entry thunk needs to ensure that X64 ABI non-volatile registers are preserved (which translates to ARM64 ABI non-volatile registers, plus q6 through q15), and then needs to return to X64 once it is done. That is achieved by means of a tailcall to __os_arm64x_dispatch_ret, which resumes X64 execution at lr.
If the original function modified arguments in-place and then tailcalled something else (an adjustor function), then the entry thunk for it can instead be an adjustor thunk: modify the arguments in-place (in their X64 ABI locations), put the tailcall target in x9, and then tailcall __os_arm64x_x64_jump. If x9 points to ARM64 code, then __os_arm64x_x64_jump will tailcall x9's entry thunk (which will then consume arguments from their X64 locations), whereas if x9 points to X64 code, then it'll act like __os_arm64x_dispatch_call_no_redirect (which will again consume arguments from their X64 locations).
The other side of things is when native ARM64 code wants to make an indirect function call. The whole premise of Arm64EC is that function pointers at ABI boundaries can point to either ARM64 functions or X64 functions, and a priori the caller doesn't know which it has. The recommendation is to put the call target in x11 and then call __os_arm64x_dispatch_icall, which will leave x11 as-is if it is ARM64 code, or copy x10 to x11 if not.
Note that __os_arm64x_dispatch_icall is not a function per se, but instead a (per-module) global variable containing a function pointer. The distinction is important, as it affects how to call it; it cannot be a bl target, instead it has to be loaded into a register (e.g. by adrp and ldr) and then used as a blr target. The linker and loader conspire to put LdrpValidateEcCallTarget in this variable, the pseudocode for which is:
LdrpValidateEcCallTarget: // aka. __os_arm64x_dispatch_icall
  // Specially written to only mutate x16/x17
  // (and x9/x11 where indicated)
  if (RtlIsEcCode(x11)) {
    // x11 unchanged
    // x9 unspecified, in practice unchanged from input
  } else {
    x9 = ResolveFastForwardSequences(x11);
    if (RtlIsEcCode(x9)) {
      x11 = x9;
      // x9 unspecified, in practice same as exit x11
    } else {
      x11 = x10;
      // x9 specified as X64 pointer
    }
  }
  // x0-x8,x10,x15 specified as unchanged from input
  // q0-q7 specified as unchanged from input
  // x12 unspecified, in practice unchanged from input
  return;
ResolveFastForwardSequences(uintptr_t p):
  // jmp qword [rip+imm32]
  if (bytes_match(p, "ff 25 ?? ?? ?? ??")) {
    p += 6;
    int32_t imm32 = ((int32_t*)p)[-1];
    p = *(uintptr_t*)(p + imm32);
    return ResolveFastForwardSequences(p);
  }
  if (p & 15) {
    return p;
  }
  // mov rax, rsp; mov [rax+32], rbx; push rbp; pop rbp; jmp rel32
  // mov rdi, rdi; push rbp; mov rbp, rsp; pop rbp; nop; jmp rel32
  if (bytes_match(p, "48 8b c4 48 89 58 20 55 5d e9 ?? ?? ?? ??")
  ||  bytes_match(p, "48 8b ff 55 48 8b ec 5d 90 e9 ?? ?? ?? ??")) {
    p += 14;
    int32_t rel32 = ((int32_t*)p)[-1];
    p += rel32;
    return ResolveFastForwardSequences(p);
  }
  // mov r10, rcx; mov eax, imm32; test byte [0x7ffe0308], 0x01
  // jne rip+3; syscall; ret; jne_target: int 0x2e; ret
  if (bytes_match(p   , "4c 8b d1 b8 ?? ?0 00 00 f6 04 25 08")
  &&  bytes_match(p+12, "03 fe 7f 01 75 03 0f 05 c3 cd 2e c3")) {
    uint32_t imm32 = *(uint32_t*)(p + 4);
    return SyscallFunctionTable[imm32];
  }
  return p;
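The pseudocode above relies on a bytes_match helper; a plausible C implementation, written for this note rather than taken from the loader, treats the pattern as space-separated lowercase hex byte pairs in which ? makes the corresponding nibble a wildcard:
// Possible implementation of the bytes_match helper used above; "??" matches
// any byte, and a single '?' wildcards just that nibble. Illustration only.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

bool bytes_match(uintptr_t p, const char* pattern) {
    const uint8_t* code = (const uint8_t*)p;
    while (*pattern) {
        if (*pattern == ' ') { ++pattern; continue; }  // skip separators
        uint8_t byte = *code++;
        if (pattern[0] != '?' && hex_digit(pattern[0]) != (byte >> 4)) return false;
        if (pattern[1] != '?' && hex_digit(pattern[1]) != (byte & 15)) return false;
        pattern += 2;                                  // consume the hex pair
    }
    return true;  // every pattern byte matched
}

int main(void) {
    // jmp qword [rip+0x12345678], as in the first fast-forward sequence
    uint8_t jmp_indirect[] = { 0xff, 0x25, 0x78, 0x56, 0x34, 0x12 };
    printf("%d\n", bytes_match((uintptr_t)jmp_indirect, "ff 25 ?? ?? ?? ??"));
    return 0;
}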
The related __os_arm64x_dispatch_icall_cfg is also a (per-module) global variable containing a function pointer; courtesy of the linker and loader it'll end up containing either LdrpValidateEcCallTarget or LdrpValidateEcCallTargetCfg or LdrpValidateEcCallTargetCfgES, the latter two of which perform a control-flow-guard (CFG) test against x11 and then tailcall LdrpValidateEcCallTarget.
If LdrpValidateEcCallTarget found X64 code, then x9 will end up containing the X64 pointer, and x11 will end up containing whatever was in x10. It is expected that the caller put the address of an exit thunk in x10, so that blr x11 will either perform the ARM64 call or call the exit thunk for an X64 call.
The exit thunk should set up a call frame, copy arguments from their Arm64EC ABI locations to their X64 ABI locations, perform the X64 call, transfer the result back to its Arm64EC location, then tear down its call frame and return. To perform the X64 call, it should put the X64 pointer in x9 (if not already in x9), and then call __os_arm64x_dispatch_call_no_redirect. If making a vararg call, then this is where x5 comes into play, as the exit thunk needs to copy the stack-based arguments to a new location at the bottom of the stack.
The call to __os_arm64x_dispatch_call_no_redirect must be done as blr x16; the ARM64 CPU will set lr to the next ARM64 instruction, the X64 emulator will then push lr on to the stack as the return address, and when the emulated X64 code eventually tries to resume execution at this address, the emulator will recognise the blr x16 preceding this address and perform a return-like transition.
The (per-module) global variables __os_arm64x_dispatch_call_no_redirect and __os_arm64x_dispatch_ret and __os_arm64x_x64_jump end up pointing to functions exported from xtajit.dll. Again the linker and loader conspire to populate these variables (this is not the usual GOT / IAT machinery, though the end effect is similar). All three functions are entry points to the X64 emulator, and follow a similar pattern:
| | __os_arm64x_dispatch_call_no_redirect | __os_arm64x_dispatch_ret | __os_arm64x_x64_jump |
|---|---|---|---|
| Intended usage | Call X64 code (usually from an exit thunk) | Return to X64 code after an entry thunk | Tailcall ARM64 or X64 code in an adjustor thunk |
| Architecture check | None (target assumed to be X64) | None (target assumed to be X64) | If RtlIsEcCode(x9), then tailcall x9's entry thunk |
| Stack manipulation | Push lr (so that X64 can return) | None (was done pre-entry-thunk) | Push lr (revert pre-entry-thunk manipulation) |
| Passthrough registers (extra non-volatiles) | x0 through x3, q0 through q15 | x8, q0 through q3, q6 through q15 | x0 through x3, q0 through q15 (plus x4 through x9 if tailcalling ARM64) |
| X64 execution target | x9 | lr | x9 |
The observant reader will notice that __os_arm64x_x64_jump isn't quite sufficient for its intended usage. If it was sufficient, then its pseudocode would be something like:
__os_arm64x_x64_jump: // hoped-for implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    x9 = ResolveFastForwardSequences(x9)
    if (RtlIsEcCode(x9)) {
      x11 = GetEntryThunk(x9)
      br x11
    } else {
      revert the pre-entry-thunk stack manipulation
      br_to_X64_mode x9
    }
  }
Unfortunately, at least in the version of Windows 11 that I'm using right now, its pseudocode is more like:
__os_arm64x_x64_jump: // actual implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    str lr, [sp, #-8]! // push lr
    br_to_X64_mode x9
  }
In other words, it is deficient in two ways:
- Not checking for fast-forward sequences. This is unfortunate from a performance perspective, but not otherwise harmful.
- Assuming that the pre-entry-thunk stack manipulation was just popping into lr. This is true most of the time, but not all the time, and all sorts of mayhem can arise if this assumption is wrong.
To address point 2, the tail of __os_arm64x_x64_jump wants to be more like the following:
if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  br_to_X64_mode x9
}
A more complex, but ultimately more useful, variant of the above is:
if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  str x9, [sp, #-8]! // push x9
  adr lr, x64_ret_stub
  br_to_X64_mode lr
}
The adr in the above can be dropped, because if sp != x4, the pre-entry-thunk manipulation will already have put x64_ret_stub in lr. At this point, the two branches differ only in swapping x9/lr, and so the swap can be pulled out:
if (unlikely(sp != x4)) {
  swap(x9, lr)
  mov x4, sp
}
str lr, [sp, #-8]! // push lr
br_to_X64_mode x9
Following this line of thinking, we can sidestep this particular deficiency of __os_arm64x_x64_jump by inserting the following immediately before the br to __os_arm64x_x64_jump:
cmp sp, x4 // sp == x4?
mov x4, x9
csel x9, x9, lr, eq // If not, swap
csel lr, lr, x4, eq // x9 and lr.
mov x4, sp
Alternatively, we could provide our own version of __os_arm64x_x64_jump, built using __os_arm64x_check_icall and __os_arm64x_dispatch_call_no_redirect, which fixes both deficiencies:
my_arm64x_x64_jump:
adrp x17, __os_arm64x_check_icall
ldr x17, [x17, __os_arm64x_check_icall]
mov x11, x9 // __os_arm64x_check_icall main input
adr x10, after_blr // __os_arm64x_check_icall thunk input
stp fp, lr, [sp, #-16]!
mov fp, sp
blr x17 // __os_arm64x_check_icall
after_blr:
ldrsw x16, [x11, #-4] // load entry thunk offset
adrp x17, __os_arm64x_dispatch_call_no_redirect
cmp x10, x11 // is x64 target?
ldr x17, [x17, __os_arm64x_dispatch_call_no_redirect]
csel x10, x9, x11, eq // x10 will become x9 later
sub x11, x11, #1 // bias for 0b01 in entry thunk offset
sub x9, sp, x4
add x11, x11, x16 // entry thunk address
ldp fp, lr, [sp], #16
csel x11, x17, x11, eq
ccmn x9, #16, #4, eq
csel x9, x10, lr, eq
csel lr, lr, x10, eq
br x11 // entry thunk or __os_arm64x_dispatch_call_no_redirect