Higher quality random floats

Much has been written about how to generate random 64-bit integers; people might argue over which particular (CS)PRNG is best suited for a given task, but there is agreement on the general shape of the solution: keep some state around, and then extract 64 bits of randomness at a time, where each bit is equally likely to be either zero or one. This gives nice uniformly-distributed integers over the range [0, 2^64).

There is less agreement on how to generate random floating-point values. It is tempting to aim for what looks like a simple transformation of an integer generator: uniformly-distributed floats over the range [0, 1). There are various ways of doing such a transformation; one exposition does it as:

double pcg64_random_double(pcg64_t *rng) {
    return (double)(pcg64_random(rng) >> 11) * 0x1.0p-53;
}

Whereas LuaJIT does it as:

uint64_t lj_prng_u64d(PRNGState *rs) {
    uint64_t z, r = 0;
    TW223_STEP(rs, z, r)
    /* Returns a double bit pattern in the range 1.0 <= d < 2.0. */
    return (r & 0x000fffffffffffffull) | 0x3ff0000000000000ull;
}
/* Then to give d in [0, 1) range: */
U64double u;
double d;
u.u64 = lj_prng_u64d(rs);
d = u.d - 1.0;

And a Go version by Lemire does it as:

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // %2**53
    return float64(x) / float64(0x1fffffffffffff)
}

The [0, 1) output range is less useful than it might initially appear. It is often stated that [0, 1) can be transformed to [A, B) by multiplying by B-A then adding A, but this is only true for some values of A and B; for other values, a number just less than 1 can be transformed and end up equal to B due to floating-point rounding, i.e. [0, 1) is transformed to [A, B]. The possibility of a zero output is also unhelpful if the result is going to be log-transformed (for example as part of a Box-Muller transform), as log(0) explodes.
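
To make the rounding hazard concrete, here is a small self-contained check (the bounds A=99 and B=100 are arbitrary examples, not taken from any particular library): the largest double below 1, scaled into the intended range [99, 100), lands exactly on 100.

#include <stdio.h>

int main(void) {
    double u = 0x1.fffffffffffffp-1; /* largest double less than 1 */
    double A = 99.0, B = 100.0;
    double d = u * (B - A) + A;      /* intended to lie in [A, B) */
    printf("%d\n", d == B);          /* prints 1: rounding lands exactly on B */
    return 0;
}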

An output range of [0, 1] would be better behaved under most additive and multiplicative transforms. After such a transform, a range like [A, B] can be easily turned into [A, B) via rejection sampling, should the half-open range be desired. It will also turn out to be very easy (on either theoretical or practical grounds) to exclude 0 as a possible output, giving the log-friendly output range (0, 1].
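
As a minimal sketch of that rejection step, assuming a hypothetical rand_in_A_B() that returns uniform doubles over the closed range [A, B]:

double rand_in_A_B(double A, double B); /* hypothetical: uniform over [A, B] */

double rand_in_A_B_halfopen(double A, double B) {
    double d;
    do {
        d = rand_in_A_B(A, B);
    } while (d == B); /* reject the (rare) exact-B outputs */
    return d;         /* uniform over [A, B) */
}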

Before trying to generate a random float in the range [0, 1], we should instead consider the easier problem of generating a random float in the range [0.5, 1]. There are 2^52+1 different IEEE-754 doubles in this range. The "rounding basin" of some double d in that range can be defined as the range of infinite-precision real numbers in the range [0.5, 1] that would round to d (assuming round-to-nearest). Taking ε=2^-54, most of these doubles have a basin of d-ε through d+ε. The exceptions are the extremes of the range; 0.5 has a basin of 0.5 through 0.5+ε, and 1 has a basin of 1-ε through 1. In other words, there are 2^52-1 values in the range whose basin is 2ε wide, and 2 values in the range whose basin is only ε wide. For a uniform distribution, the probability of seeing d should equal the width of the rounding basin of d. Given all this, there is a fairly simple monotonic transform from the uniform integer range [0, 2^53) to the uniform float range [0.5, 1]:

Integer(s)     | Resultant floating-point double (ε=2^-54)
0              | 0.5
1, 2           | 0.5 + 2ε
3, 4           | 0.5 + 4ε
5, 6           | 0.5 + 6ε
⋯              | ⋯
2^53-7, 2^53-6 | 1 - 6ε
2^53-5, 2^53-4 | 1 - 4ε
2^53-3, 2^53-2 | 1 - 2ε
2^53-1         | 1

This transform can be expressed in code like so:

double rand_between_half_and_one() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    x = ((x + 1) >> 1) + (1022ull << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

The range [0.25, 0.5] looks a lot like the range [0.5, 1], except that ε reduces to 2^-55. That is, there is a fairly simple monotonic transform from the uniform integer range [0, 2^53) to the uniform float range [0.25, 0.5]:

Integer(s)     | Resultant floating-point double (ε=2^-55)
0              | 0.25
1, 2           | 0.25 + 2ε
3, 4           | 0.25 + 4ε
5, 6           | 0.25 + 6ε
⋯              | ⋯
2^53-7, 2^53-6 | 0.5 - 6ε
2^53-5, 2^53-4 | 0.5 - 4ε
2^53-3, 2^53-2 | 0.5 - 2ε
2^53-1         | 0.5

Or in code:

double rand_between_quarter_and_half() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    x = ((x + 1) >> 1) + (1021ull << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

In turn, the range [0.125, 0.25] looks a lot like [0.25, 0.5], and [0.0625, 0.125] looks a lot like [0.125, 0.25], and so on. To generate a float in [0, 1], we need to choose one of these ranges, and then choose a value within that range. For a uniform distribution, the probability of a range should be proportional to the width of the range, so we have:

Output range    | Width | Probability
[0.5, 1]        | 2^-1  | 50%
[0.25, 0.5]     | 2^-2  | 25%
[0.125, 0.25]   | 2^-3  | 12.5%
[0.0625, 0.125] | 2^-4  | 6.25%

As the width of the range approaches zero, so does the probability of the range. Once the width is less than 2^-75, the probabilities are so small as to be effectively impossible for most practical purposes. By making them actually impossible, we gain a nice guarantee of algorithm termination, and avoid having to think about denormals, and get the log-friendly output range (0, 1]. One way of implementing this in code is:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    uint64_t e = 1022;
    do {
      if (rand_u64() & 1) break; /* 1-bit uniform integer */
      e -= 1;
    } while (e > 1022-75);
    x = ((x + 1) >> 1) + (e << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

The above throws away lots of random bits, whereas it would be more economical to remember and use them later:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64(); /* 53-bit and 11-bit uniform integers */
    uint64_t bits = x & 0x7ff, nbits = 11;
    uint64_t e = 1022;
    do {
      if (bits & 1) break; /* 1-bit uniform integer */
      bits >>= 1;
      if (--nbits == 0) bits = rand_u64(), nbits = 64;
      e -= 1;
    } while (e > 1022-75);
    x = (((x >> 11) + 1) >> 1) + (e << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

Finally, the loop can be eliminated using bit-counting intrinsics:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64();
    uint64_t e = __builtin_ctzll(x) - 11ull;
    if ((int64_t)e >= 0) e = __builtin_ctzll(rand_u64());
    x = (((x >> 11) + 1) >> 1) - ((e - 1011ull) << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

Or in Go rather than C:

// toFloat64 -> (0,1]
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    e := bits.TrailingZeros64(x) - 11
    if e >= 0 {
        e = bits.TrailingZeros64(splitmix64(seed))
    }
    x = (((x >> 11) + 1) >> 1) - ((uint64(int64(e)) - 1011) << 52)
    return math.Float64frombits(x)
}

With that exposition done, I can now present my criteria for assessing the quality of functions that claim to generate uniformly-distributed IEEE-754 double-precision floating-point values over (0, 1] or [0, 1] or [0, 1):

Criteria            | Most functions           | Presented here | Ideal
Probability of zero | 2^-52 or 2^-53           | 0              | 2^-1075
Smallest non-zero   | 2^-52 or 2^-53           | 2^-76          | 2^-1074
Largest non-seeable | 1 - 2^-53 or 0.5 - 2^-54 | 2^-76 - 2^-129 | -0
Distinct values     | 2^52 or 2^53             | 2^58.2         | 2^62 ± 1

If you've reached this far, then you might be interested in similar commentary from Taylor R. Campbell or Allen B. Downey or @moon_chilled.

Windows Arm64EC ABI Notes

The basic premise of Microsoft's Arm64EC is that a single virtual address space can contain a mixture of ARM64 code and X64 code; the ARM64 code executes natively, whereas the X64 code is transparently converted to ARM64 code by a combination of JIT and AOT compilation, and ARM64 ⇄ X64 transitions can happen at any function call/return boundary.

There are some good MSDN pages on the topic:

The first function to highlight is RtlIsEcCode; to tell apart ARM64 code and X64 code, the system maintains a bitmap with one bit per 4K page, specifying whether that page contains ARM64 code (bit set) or X64 code (bit clear). This bitmap can be queried using RtlIsEcCode. The loader sets bits in this bitmap when loading DLLs, as does VirtualAlloc2 when allocating executable memory with MEM_EXTENDED_PARAMETER_EC_CODE in MemExtendedParameterAttributeFlags.
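
As a rough sketch of the VirtualAlloc2 side of this (the page protection and the lack of error handling are illustrative rather than prescriptive):

#include <windows.h>

/* Sketch: allocate memory whose pages are flagged as ARM64 (EC) code in
   the bitmap, so that RtlIsEcCode subsequently reports them as such. */
void* alloc_arm64ec_code(SIZE_T size) {
    MEM_EXTENDED_PARAMETER param = {0};
    param.Type = MemExtendedParameterAttributeFlags;
    param.ULong64 = MEM_EXTENDED_PARAMETER_EC_CODE;
    return VirtualAlloc2(GetCurrentProcess(), NULL, size,
                         MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READ,
                         &param, 1);
}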

After the X64 emulator executes a call or ret or jmp that crosses a page boundary, it needs to determine whether the new rip points to ARM64 code or X64 code. If !RtlIsEcCode(rip), then the X64 emulator can happily continue emulating X64 code. However, if RtlIsEcCode(rip), then a transition out of the emulator needs to be performed. This can be a call-like transition (X64 calling ARM64), or a return-like transition (X64 returning to ARM64). The transition type is determined by looking at the four bytes before rip; if they contain the encoding of blr x16 (0xd63f0200), then a return-like transition is performed by setting pc to rip. Otherwise, a call-like transition is performed by setting x9 to rip and setting pc to rip's entry thunk. To find the entry thunk, the four bytes before rip are used; the low two bits need to be 0b01, and after masking off said bits, the 32 bits are sign extended and then added to rip (the resultant value must be different to rip, i.e. a function cannot be its own entry thunk).
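
Expressed as a C sketch (this just mirrors the description above; it is not a system API):

#include <stdint.h>
#include <string.h>

/* Sketch of the entry thunk lookup: read the four bytes before rip, check
   for 0b01 in the low two bits, mask those bits off, sign extend, and add
   to rip. */
uint64_t find_entry_thunk(uint64_t rip) {
    int32_t offset;
    memcpy(&offset, (const void*)(rip - 4), sizeof(offset));
    if ((offset & 3) != 1)
        return 0; /* not a valid entry thunk offset */
    uint64_t thunk = rip + (int64_t)(offset & ~3);
    return thunk == rip ? 0 : thunk; /* a function cannot be its own thunk */
}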

Before transferring control to the entry thunk, the X64 emulator performs a little manipulation:

ldr lr, [sp], #8 // pop return address (skipping sp alignment check)
mov x4, sp

After this, it ensures sp is 16-byte aligned:

if (unlikely(sp & 8)) {
  str lr, [sp, -#8]! // push return address again (again skip check)
  adr lr, x64_ret_stub // an X64 funclet containing just "ret"
}

In other words, the stack will look like whichever of the following diagrams gives rise to an aligned sp:

◀- lower addresses    x4                 higher addresses -▶
                      |
                      ▼
           ... retaddr home0 home1 home2 home3 arg4 arg5 ...
                      ▲
                      |
                      sp                        lr = retaddr
◀- lower addresses    x4                 higher addresses -▶
                      |
                      ▼
           ... retaddr home0 home1 home2 home3 arg4 arg5 ...
              ▲
              |
              sp                           lr = x64_ret_stub

The 32 bytes of X64 home space begin at x4, and any arguments passed on the stack begin at x4+#32. The entry thunk is free to use the X64 home space if it wants, and some of the MSDN documentation suggests using it as a place to save q6 and q7, but this is problematic, as to later restore from this space, we'd need to save x4 (not to mention that no unwind codes can load from x4). In practice, only 24 of the 32 bytes are easily usable; sp+#8 through sp+#32 will always coincide with some 24 bytes of the home space.

The entry thunk can either be a copy of the original function that speaks the X64 ABI, or it can copy the arguments from their X64 ABI locations to their Arm64EC ABI locations and then make a call to the original function (helpfully provided in x9) and then transfer the results to their X64 ABI locations. If the original function is vararg, then the Arm64EC ABI dictates that x4 should contain a pointer to the stack arguments (so add #32) and that x5 should contain the size of the stack arguments (which is not generally known; MSVC-generated thunks populate #0 for x5 in this case).

In either case, the entry thunk needs to ensure that X64 ABI non-volatile registers are preserved (which translates to ARM64 ABI non-volatile registers, plus q6 through q15), and then needs to return to X64 once it is done. That is achieved by means of a tailcall to __os_arm64x_dispatch_ret, which resumes X64 execution at lr.

If the original function modified arguments in-place and then tailcalled something else (an adjustor function), then the entry thunk for it can instead be an adjustor thunk: modify the arguments in-place (in their X64 ABI locations), put the tailcall target in x9, and then tailcall __os_arm64x_x64_jump. If x9 points to ARM64 code, then __os_arm64x_x64_jump will tailcall x9's entry thunk (which will then consume arguments from their X64 locations), whereas if x9 points to X64 code, then it'll act like __os_arm64x_dispatch_call_no_redirect (which will again consume arguments from their X64 locations).

The other side of things is when native ARM64 code wants to make an indirect function call. The whole premise of Arm64EC is that function pointers at ABI boundaries can point to either ARM64 functions or X64 functions, and a-priori the caller doesn't know which it has. The recommendation is to put the call target in x11 and then call __os_arm64x_dispatch_icall, which will leave x11 as-is if it is ARM64 code, or copy x10 to x11 if not.

Note that __os_arm64x_dispatch_icall is not a function per se, but instead a (per-module) global variable containing a function pointer. The distinction is important, as it affects how to call it; it cannot be a bl target, instead it has to be loaded into a register (e.g. by adrp and ldr) and then be used as a blr target. The linker and loader conspire to put LdrpValidateEcCallTarget in this variable, the pseudocode for which is:

LdrpValidateEcCallTarget:  // aka. __os_arm64x_dispatch_icall
  // Specially written to only mutate x16/x17
  // (and x9/x11 where indicated)
  if (RtlIsEcCode(x11)) {
    // x11 unchanged
    // x9 unspecified, in practice unchanged from input
  } else {
    x9 = ResolveFastForwardSequences(x11);
    if (RtlIsEcCode(x9)) {
      x11 = x9
      // x9 unspecified, in practice same as exit x11
    } else {
      x11 = x10
      // x9 specified as X64 pointer
    }
  }
  // x0-x8,x10,x15 specified as unchanged from input
  // q0-q7 specified as unchanged from input
  // x12 unspecified, in practice unchanged from input
  return;

ResolveFastForwardSequences(uintptr_t p):
  // jmp qword [rip+imm32]
  if (bytes_match(p, "ff 25 ?? ?? ?? ??")) {
    p += 6;
    int32_t imm32 = ((int32_t*)p)[-1];
    p = *(uintptr_t*)(p + imm32);
    return ResolveFastForwardSequences(p);
  }
  if (p & 15) {
    return p;
  }
  // mov rax, rsp; mov [rax+32], rbx; push rbp; pop rbp; jmp rel32
  // mov rdi, rdi; push rbp; mov rbp, rsp; pop rbp; nop; jmp rel32
  if (bytes_match(p, "48 8b c4 48 89 58 20 55 5d e9 ?? ?? ?? ??")
  ||  bytes_match(p, "48 8b ff 55 48 8b ec 5d 90 e9 ?? ?? ?? ??")) {
    p += 14;
    int32_t rel32 = ((int32_t*)p)[-1];
    p += rel32;
    return ResolveFastForwardSequences(p);
  }
  // mov r10, rcx; mov eax, imm32; test byte [0x7ffe0308], 0x01
  // jne rip+3; syscall; ret; jne_target: int 0x2e; ret
  if (bytes_match(p   , "4c 8b d1 b8 ?? ?0 00 00 f6 04 25 08")
  &&  bytes_match(p+12, "03 fe 7f 01 75 03 0f 05 c3 cd 2e c3")) {
    uint32_t imm32 = *(uint32_t*)(p + 4);
    return SyscallFunctionTable[imm32];
  }
  return p;

The related __os_arm64x_dispatch_icall_cfg is also a (per-module) global variable containing a function pointer; courtesy of the linker and loader it'll end up containing either LdrpValidateEcCallTarget or LdrpValidateEcCallTargetCfg or LdrpValidateEcCallTargetCfgES, the latter two of which perform a control-flow-guard (CFG) test against x11 and then tailcall LdrpValidateEcCallTarget.

If LdrpValidateEcCallTarget found X64 code, then x9 will end up containing the X64 pointer, and x11 will end up containing whatever was in x10. It is expected that the caller put the address of an exit thunk in x10, so that blr x11 will either perform the ARM64 call or call the exit thunk for an X64 call.

The exit thunk should set up a call frame, copy arguments from their Arm64EC ABI locations to their X64 ABI locations, perform the X64 call, transfer the result back to its Arm64EC location, then tear down its call frame and return. To perform the X64 call, it should put the X64 pointer in x9 (if not already in x9), and then call __os_arm64x_dispatch_call_no_redirect. If making a vararg call, then this is where x5 comes in to play, as the exit thunk needs to copy the stack-based arguments to a new location at the bottom of the stack.

The call to __os_arm64x_dispatch_call_no_redirect must be done as blr x16; the ARM64 CPU will set lr to the next ARM64 instruction, the X64 emulator will then push lr on to the stack as the return address, and when the emulated X64 code eventually tries to resume execution at this address, the emulator will recognise the blr x16 preceding this address and perform a return-like transition.

The (per-module) global variables __os_arm64x_dispatch_call_no_redirect and __os_arm64x_dispatch_ret and __os_arm64x_x64_jump end up pointing to functions exported from xtajit.dll. Again the linker and loader conspire to populate these variables (this is not the usual GOT / IAT machinery, though the end effect is similar). All three functions are entry points to the X64 emulator, and follow a similar pattern:

                                            | __os_arm64x_dispatch_call_no_redirect   | __os_arm64x_dispatch_ret             | __os_arm64x_x64_jump
Intended usage                              | Call X64 code (usually from exit thunk) | Return to X64 code after entry thunk | Tailcall ARM64 or X64 code in adjustor thunk
Architecture check                          | None (target assumed to be X64)         | None (target assumed to be X64)      | If RtlIsEcCode(x9) then tailcall x9's entry thunk
Stack manipulation                          | Push lr (so that X64 can return)        | None (was done pre-entry-thunk)      | Push lr (revert pre-entry-thunk manipulation)
Passthrough registers (extra non-volatiles) | x0 through x3, q0 through q15           | x8, q0 through q3, q6 through q15    | x0 through x3, q0 through q15 (plus x4 through x9 if tailcalling ARM64)
X64 execution target                        | x9                                      | lr                                   | x9

The observant reader will notice that __os_arm64x_x64_jump isn't quite sufficient for its intended usage. If it was sufficient, then its pseudocode would be something like:

__os_arm64x_x64_jump:     // hoped-for implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    x9 = ResolveFastForwardSequences(x9)
    if (RtlIsEcCode(x9)) {
      x11 = GetEntryThunk(x9)
      br x11
    } else {
      revert the pre-entry-thunk stack manipulation
      br_to_X64_mode x9
    }
  }

Unfortunately, at least in the version of Windows 11 that I'm using right now, its pseudocode is more like:

__os_arm64x_x64_jump:      // actual implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    str lr, [sp, #-8]! // push lr
    br_to_X64_mode x9
  }

In other words, it is deficient in two ways:

  1. Not checking for fast-forward sequences. This is unfortunate from a performance perspective, but not otherwise harmful.
  2. Assuming that the pre-entry-thunk stack manipulation was just popping into lr. This is true most of the time, but not all the time, and all sorts of mayhem can arise if this assumption is wrong.

To address point 2, the tail of __os_arm64x_x64_jump wants to be more like the following:

if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  br_to_X64_mode x9
}

A more complex, but ultimately more useful, variant of the above is:

if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  str x9, [sp, #-8]! // push x9
  adr lr, x64_ret_stub
  br_to_X64_mode lr
}

The adr in the above can be dropped, because if sp != x4, the pre-entry-thunk manipulation will already have put x64_ret_stub in lr. At this point, the two branches differ only in swapping x9/lr, and so the swap can be pulled out:

if (unlikely(sp != x4)) {
  swap(x9, lr)
  mov x4, sp
}
str lr, [sp, #-8]! // push lr
br_to_X64_mode x9

Following this line of thinking, we can sidestep this particular deficiency of __os_arm64x_x64_jump by inserting the following immediately before the br to __os_arm64x_x64_jump:

cmp sp, x4          // sp == x4?
mov x4, x9
csel x9, x9, lr, eq // If not, swap
csel lr, lr, x4, eq // x9 and lr.
mov x4, sp

Alternatively, we could provide our own version of __os_arm64x_x64_jump, built using __os_arm64x_check_icall and __os_arm64x_dispatch_call_no_redirect, which fixes both deficiencies:

my_arm64x_x64_jump:
  adrp x17, __os_arm64x_check_icall
  ldr x17, [x17, __os_arm64x_check_icall]
  mov x11, x9 // __os_arm64x_check_icall main input
  adr x10, after_blr // __os_arm64x_check_icall thunk input
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  blr x17 // __os_arm64x_check_icall
after_blr:
  ldrsw x16, [x11, #-4] // load entry thunk offset
  adrp x17, __os_arm64x_dispatch_call_no_redirect
  cmp x10, x11 // is x64 target?
  ldr x17, [x17, __os_arm64x_dispatch_call_no_redirect]
  csel x10, x9, x11, eq // x10 will become x9 later
  sub x11, x11, #1 // bias for 0b01 in entry thunk offset
  sub x9, sp, x4
  add x11, x11, x16 // entry thunk address
  ldp fp, lr, [sp], #16
  csel x11, x17, x11, eq
  ccmn x9, #16, #4, eq
  csel x9, x10, lr, eq
  csel lr, lr, x10, eq
  br x11 // entry thunk or __os_arm64x_dispatch_call_no_redirect

Windows ARM64 Frame Unwind Code Details

This post is intended to supplement MSDN's ARM64 exception handling information. To start, here's a fleshed-out version of a table listing the available unwind codes:

Name          | Encoding         | Prolog Instruction                     | Epilog Instruction                          | Unwind Effect
alloc_s       | 000iiiii         | sub sp, sp, i*16                       | add sp, sp, i*16                            | Emulate epilog instruction
alloc_m       | 11000iiiiiiiiiii | sub sp, sp, i*16                       | add sp, sp, i*16                            | Emulate epilog instruction
alloc_l       | 11100000 i:u24   | sub sp, sp, i*16                       | add sp, sp, i*16                            | Emulate epilog instruction
add_fp        | 11100010 i:u8    | add fp, sp, i*8 (NB: not 16)           | sub sp, fp, i*8 (NB: not 16)                | Emulate epilog instruction
set_fp        | 11100001         | mov fp, sp                             | mov sp, fp                                  | Emulate epilog instruction
pac_sign_lr   | 11111100         | pacibsp (sign lr using sp)             | autibsp (authenticate lr using sp)          | Emulate autibsp or xpaclri
save_fplr     | 01iiiiii         | stp fp, lr, [sp+i*8]                   | ldp fp, lr, [sp+i*8]                        | Emulate epilog instruction
save_lrpair   | 1101011nnniiiiii | stp x(19+2*n), lr, [sp+i*8]            | ldp x(19+2*n), lr, [sp+i*8]                 | Emulate epilog instruction
save_fplr_x   | 10iiiiii         | stp fp, lr, [sp-(i+1)*8]!              | ldp fp, lr, [sp], (i+1)*8                   | Emulate epilog instruction
save_regp     | 110010nnnniiiiii | stp x(19+n), x(20+n), [sp+i*8]         | ldp x(19+n), x(20+n), [sp+i*8]              | Emulate epilog instruction
save_regp_x   | 110011nnnniiiiii | stp x(19+n), x(20+n), [sp-(i+1)*8]!    | ldp x(19+n), x(20+n), [sp], (i+1)*8         | Emulate epilog instruction
save_r19r20_x | 001iiiii         | stp x19, x20, [sp-i*8]! (NB: not i+1)  | ldp x19, x20, [sp], i*8 (NB: not i+1)       | Emulate epilog instruction
save_next     | 11100110         | stp similar to previous non-lr stp (†) | ldp similar to next non-lr ldp (†)          | Extend next effect by two
save_reg      | 110100nnnniiiiii | str x(19+n), [sp+i*8]                  | ldr x(19+n), [sp+i*8]                       | Emulate epilog instruction
save_reg_x    | 1101010nnnniiiii | str x(19+n), [sp-(i+1)*8]!             | ldr x(19+n), [sp], (i+1)*8                  | Emulate epilog instruction
save_fregp    | 1101100nnniiiiii | stp d(8+n), d(9+n), [sp+i*8]           | ldp d(8+n), d(9+n), [sp+i*8]                | Emulate epilog instruction
save_fregp_x  | 1101101nnniiiiii | stp d(8+n), d(9+n), [sp-(i+1)*8]!      | ldp d(8+n), d(9+n), [sp], (i+1)*8           | Emulate epilog instruction
save_freg     | 1101110nnniiiiii | str d(8+n), [sp+i*8]                   | ldr d(8+n), [sp+i*8]                        | Emulate epilog instruction
save_freg_x   | 11011110nnniiiii | str d(8+n), [sp-(i+1)*8]!              | ldr d(8+n), [sp], (i+1)*8                   | Emulate epilog instruction
save_any_reg  | 11100111 x:u16 * | One str or stp to stack *              | One ldr or ldp from stack *                 | Varies *
custom        | 11101xxx         | No instruction                         | No instruction                              | Bespoke
reserved      | 11110xxx         | No instruction                         | No instruction                              | Unwind fails
reserved      | 11111000 x:u8    | Any one instruction                    | Any one instruction                         | No effect (yet)
reserved      | 11111001 x:u16   | Any one instruction                    | Any one instruction                         | No effect (yet)
reserved      | 11111010 x:u24   | Any one instruction                    | Any one instruction                         | No effect (yet)
reserved      | 11111011 x:u32   | Any one instruction                    | Any one instruction                         | No effect (yet)
reserved      | 11111101         | Any one instruction                    | Any one instruction                         | No effect (yet)
reserved      | 11111110         | Any one instruction                    | Any one instruction                         | No effect (yet)
reserved      | 11111111         | Any one instruction                    | Any one instruction                         | No effect (yet)
nop           | 11100011         | Any one instruction                    | Any one instruction                         | No effect
end_c         | 11100101         | No instruction, start of prolog (‡)    | Any one instruction, then end of epilog (‡) | No effect
end           | 11100100         | No instruction, start of prolog        | ret (or tailcall b), then end of epilog     | Unwind complete

(*) save_any_reg was added for Arm64EC. The 16-bit payload is made from a number of fields, packed as rpxnnnnnmmiiiiii. This gives rise to many different variants:

Name         | Encoding                   | Prolog Instruction                 | Epilog Instruction                  | Unwind Effect
save_any_reg | 11100111 000nnnnn 00iiiiii | str x(0+n), [sp+i*8]               | ldr x(0+n), [sp+i*8]                | Emulate epilog instruction
save_any_reg | 11100111 001nnnnn 00iiiiii | str x(0+n), [sp-(i+1)*16]!         | ldr x(0+n), [sp], (i+1)*16          | Emulate epilog instruction
save_any_reg | 11100111 010nnnnn 00iiiiii | stp x(0+n), x(1+n), [sp+i*16]      | ldp x(0+n), x(1+n), [sp+i*16]       | Emulate epilog instruction
save_any_reg | 11100111 011nnnnn 00iiiiii | stp x(0+n), x(1+n), [sp-(i+1)*16]! | ldp x(0+n), x(1+n), [sp], (i+1)*16  | Emulate epilog instruction
save_any_reg | 11100111 000nnnnn 01iiiiii | str d(0+n), [sp+i*8]               | ldr d(0+n), [sp+i*8]                | Emulate epilog instruction
save_any_reg | 11100111 001nnnnn 01iiiiii | str d(0+n), [sp-(i+1)*16]!         | ldr d(0+n), [sp], (i+1)*16          | Emulate epilog instruction
save_any_reg | 11100111 010nnnnn 01iiiiii | stp d(0+n), d(1+n), [sp+i*16]      | ldp d(0+n), d(1+n), [sp+i*16]       | Emulate epilog instruction
save_any_reg | 11100111 011nnnnn 01iiiiii | stp d(0+n), d(1+n), [sp-(i+1)*16]! | ldp d(0+n), d(1+n), [sp], (i+1)*16  | Emulate epilog instruction
save_any_reg | 11100111 000nnnnn 10iiiiii | str q(0+n), [sp+i*16] (NB: not 8)  | ldr q(0+n), [sp+i*16] (NB: not 8)   | Emulate epilog instruction
save_any_reg | 11100111 001nnnnn 10iiiiii | str q(0+n), [sp-(i+1)*16]!         | ldr q(0+n), [sp], (i+1)*16          | Emulate epilog instruction
save_any_reg | 11100111 010nnnnn 10iiiiii | stp q(0+n), q(1+n), [sp+i*16]      | ldp q(0+n), q(1+n), [sp+i*16]       | Emulate epilog instruction
save_any_reg | 11100111 011nnnnn 10iiiiii | stp q(0+n), q(1+n), [sp-(i+1)*16]! | ldp q(0+n), q(1+n), [sp], (i+1)*16  | Emulate epilog instruction
reserved     | 11100111 1xxxxxxx xxxxxxxx | Any one instruction                | Any one instruction                 | Unwind fails
reserved     | 11100111 xxxxxxxx 11xxxxxx | Any one instruction                | Any one instruction                 | Unwind fails

(†) The save_next unwind code requires further explanation. It causes the next unwind code to load four registers rather than two. It can be stacked; a sequence of N save_next unwind codes causes the next unwind code thereafter to load (N+1)*2 registers. Said next unwind code must be one of save_regp / save_regp_x / save_r19r20_x / save_fregp / save_fregp_x (notably, this excludes pair instructions with lr in their name). As examples:

The MSDN documentation suggests that a sufficiently long sequence of save_next codes can overflow from x registers to d registers, but best not to try this; stop at x31 or d15.

(‡) In a prolog, end_c corresponds to no instruction, and also causes subsequent codes to correspond to no instruction. In an epilog, end_c corresponds to ret (or b, or any other one instruction that has no unwind effect), and causes subsequent codes to correspond to no instruction. In either case, subsequent codes (up to the first end) are still executed during unwind.

Moving on, each .xdata record describes a contiguous region of machine instructions. There is typically a 1:1 correspondence between regions and functions, though this needn't be true. A region contains:

If any of the above limits would be exceeded, then the region needs to be artificially split into multiple smaller regions, such that no limit is exceeded. The MSDN documentation suggests that split boundaries should be chosen so as to not be in the middle of an epilog (this suggestion would not be required if end_c was treated as "No instruction, end of epilog", but alas it is treated as "Any one instruction, then end of epilog").

The length of the prolog is not explicitly given; it is calculated by analysing the unwind codes strictly preceding the first end or end_c and counting how many of them correspond to one instruction. Said unwind codes, once reversed, should correspond 1:1 with instructions in the prolog (nop codes can be used to make things line up, if there are instructions in the prolog which do not need an unwind effect). If unwinding from within the prolog, we calculate how many instructions we are away from the end of the prolog, and skip that many unwind codes from the start of the list. To indicate a lack of prolog, the first unwind code should be end_c (or end if no unwind effects are required by the main body of the region). Note that unwind effects after end_c are still executed if unwinding from within the prolog (though there's a Wine bug here).

The length of each epilog is not explicitly given; it is calculated by analysing the unwind codes starting at the epilog-specific offset and continuing up to and including the first end or end_c thereafter. Said unwind codes should correspond 1:1 with instructions in the epilog (nop codes can be used to make things line up, if there are instructions in the epilog which do not need an unwind effect). If unwinding from within an epilog, we calculate how many instructions we are away from the start of the epilog, and skip that many unwind codes from the start of the list. Note that unwind effects after end_c are still executed if unwinding from within an epilog (though again Wine bug).

If the E bit is set in the .xdata header, then the end of the epilog is equal to the end of the region (the start of the epilog is found by calculating the length of the epilog and subtracting this from the end). If the E bit is not set, then the start of each epilog is explicitly specified.

If unwinding from a PC not within a prolog or epilog, unwind codes are executed until the first end is reached. Note that these codes are shared with those describing the prolog; this is not normally a problem, but if it is, the prolog can be split off to a separate region consisting solely of prolog, leaving no prolog in the original region.

If the X bit is set in the .xdata header, and PC is not within a prolog or epilog, then the specified function is called during exception handler search and during exception unwinding. See __C_specific_handler for the prototype of this function, and see LuaJIT for a concrete example.
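
For orientation, a do-nothing handler with the same prototype as __C_specific_handler would look something like the following sketch (the body is purely illustrative):

#include <windows.h>

EXCEPTION_DISPOSITION my_handler(EXCEPTION_RECORD *ExceptionRecord,
                                 void *EstablisherFrame,
                                 CONTEXT *ContextRecord,
                                 DISPATCHER_CONTEXT *DispatcherContext) {
    /* During the search phase this can claim the exception; during the
       unwind phase (indicated by flags in ExceptionRecord->ExceptionFlags)
       it can run cleanup code. This skeleton declines to do either. */
    (void)ExceptionRecord; (void)EstablisherFrame;
    (void)ContextRecord; (void)DispatcherContext;
    return ExceptionContinueSearch;
}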

A whirlwind tour of AArch64 vector instructions (NEON)

32 vector registers, each 128 bits wide. Also a control register (FPCR) and a status register (FPSR), though scalar comparison instructions use PSTATE. Each vector register can contain:

128 bits might seem small compared to AVX-512, but for most vector instructions, the M1's Firestorm cores can issue four instances of the instruction per cycle, which gets you to a similar place. There are also AMX units on the M1.

A lane is 1/8/16/32/64/128 bits wide. The common lane types are:

     | 1-bit | 8-bit     | 16-bit     | 32-bit     | 64-bit     | 128-bit
uint | bit   | uint8 [q] | uint16 [q] | uint32 [q] | uint64 [q] |
sint |       | sint8 [q] | sint16 [q] | sint32 [q] | sint64 [q] |
fp   |       |           | fp16 [a]   | fp32       | fp64       |
bf   |       |           | bf16 [b]   |            |            |
poly | bit   | poly8     | poly16     |            | poly64 [c] | poly128 [c]
[a] Requires fp16 extension, and improved further by fp16fml extension.
[b] Requires bf16 extension (present on Apple M2, but not on M1).
[c] Requires crypto extension.
[q] Often with the choice of truncate on overflow or saturate on overflow.

The syntactic names for the 32 registers vary based on the lane width and also how much of the register is being used:

Lane width | Low scalar | Low 64 bits    | All 128 bits
1 bit      | N/A        | V0.8B … V31.8B | V0.16B … V31.16B
8 bits     | B0 … B31   | V0.8B … V31.8B | V0.16B … V31.16B
16 bits    | H0 … H31   | V0.4H … V31.4H | V0.8H … V31.8H
32 bits    | S0 … S31   | V0.2S … V31.2S | V0.4S … V31.4S
64 bits    | D0 … D31   | V0.1D … V31.1D | V0.2D … V31.2D
128 bits   | Q0 … Q31   | N/A            | V0.1Q … V31.1Q

In some contexts, individual lanes can be referenced. The syntax for this appends [0] … [15] to one of the above, for example V7.4S[3] denotes the most significant 32-bit lane of V7. Writes to a single lane with this syntax generally preserve other lanes, whereas in all other cases, writes to the low bits of a register generally zero the remaining bits (though see FPCR.NEP).

Vector instructions have up to six input registers, and up to one output register (with the exception of loads, which can have up to four output registers). In most cases, all the registers involved in a single instruction have the same lane width, though there are exceptions:

In most cases, operations are lane-wise: lane i of the output register is formed by combining lane i of each input register, though there are exceptions:

GPR to / from vector

Instruction | Direction     | Behaviour
FMOV        | GPR to vector | Truncate to lane width, then zero extend
DUP         | GPR to vector | Truncate then replicate to all lanes
INS         | GPR to vector | Truncate then insert to arbitrary lane
FMOV        | Vector to GPR | Take low lane, zero extend to GPR width
UMOV        | Vector to GPR | Arbitrary lane, zero extend to GPR width
SMOV        | Vector to GPR | Arbitrary lane, sign extend to GPR width

A number of data type conversion instructions can also have a GPR as source or destination (see "Data type conversion, integer to float" and "Data type conversion, float to integer").

Load / store

A scalar load moves 8/16/32/64 bits from memory to (part of) a vector register, whereas a vector load moves 64/128 bits. This can be repeated up to four times, reading from consecutive memory locations and writing to distinct vector registers (which must be consecutively numbered, except for LDP / LDNP).

                                   | ×1                     | ×2                     | ×3          | ×4
Scalar (low lane, zero others)     | LDR, LDUR              | LDP, LDNP              |             |
Scalar (any lane, preserve others) | LD1 (SS 1R)            | LD2 (SS 2R)            | LD3 (SS 3R) | LD4 (SS 4R)
Scalar (replicate to all lanes)    | LD1R                   | LD2R                   | LD3R        | LD4R
Vector                             | LD1 (MS 1R), LDR, LDUR | LD1 (MS 2R), LDP, LDNP | LD1 (MS 3R) | LD1 (MS 4R)
Vector, transposed                 |                        | LD2 (MS 2R)            | LD3 (MS 3R) | LD4 (MS 4R)

With the exception of scalar replicating to all lanes, all load instructions have a corresponding store instruction performing the inverse - replace LD with ST.

The SS or MS suffix denotes what the ARM reference manual calls "single structure" or "multiple structures", and is followed by the number of destination registers (1, 2, 3, or 4). The operand syntax relates to this suffix.

The vector transposed loads perform one of the following transposes, where M denotes the number of destination registers (2, 3, or 4):

                | 8-bit lanes | 16-bit lanes | 32-bit lanes | 64-bit lanes
64-bit vectors  | 8×M ↦ M×8   | 4×M ↦ M×4    | 2×M ↦ M×2    |
128-bit vectors | 16×M ↦ M×16 | 8×M ↦ M×8    | 4×M ↦ M×4    | 2×M ↦ M×2

The addressing modes are all over the place:

       | Base     | Offset                          | Writeback Mode
LDR    | Xn or SP | signed imm9                     | Pre- or post-index
LDUR   | Xn or SP | signed imm9                     | No writeback
LDR    | Xn or SP | unsigned imm12, scaled          | No writeback
LDR    | Xn or SP | Xm or Wm, optional extend/scale | No writeback
LDR    | PC       | signed imm19, times 4           | No writeback
LDP    | Xn or SP | signed imm7, scaled             | Pre- or post-index
LDP    | Xn or SP | signed imm7, scaled             | No writeback
LDNP   | Xn or SP | signed imm7, scaled             | Pre- or post-index
LDNP   | Xn or SP | signed imm7, scaled             | No writeback
Others | Xn or SP | unsigned imm1, scaled (or Xm)   | Post-index

Data movement

Moving lanes around:

          | Source lane | Destination lane       | Other lanes
FMOV      | Low lane    | Low lane               | Zeroed
MOV (DUP) | Arbitrary   | Low lane               | Zeroed
MOV (INS) | Arbitrary   | Arbitrary              | Preserved
DUP       | Arbitrary   | All lanes (replicated) | N/A
MOV (ORR) | All lanes   | All lanes (1:1)        | N/A

Various reversals:

              | in bytes | in u16s | in u32s | in u64s | in u128s
Reverse bits  | RBIT     |         |         |         |
Reverse bytes | no-op    | REV16   | REV32   | REV64   | TBL
Reverse u16s  |          | no-op   | REV32   | REV64   | TBL
Reverse u32s  |          |         | no-op   | REV64   | TBL
Reverse u64s  |          |         |         | no-op   | EXT

Various ways of creating one vector from two:

           | First step                    | Second step
TRN1, TRN2 | Discard odd or even lanes     | Interleave lanes
ZIP1, ZIP2 | Discard high half or low half | Interleave lanes
UZP1, UZP2 | Concatenate vectors           | Discard odd or even lanes
EXT        | Concatenate vectors           | Take a contiguous 16-byte span

Note that EXT with the same source vector for both operands gives a rotation by a whole number of bytes. Note that some math instructions come with a free UZP1 and UZP2 (annotated as [p]).

The TBL, TBX family concatenate 1-4 vectors from consecutively numbered registers to form a table T of 16/32/48/64 bytes, then another byte vector serves as indices into said table:

     Per-byte behaviour
TBL  D[i] = (Y[i] < len(T)) ? T[Y[i]] : 0
TBX  D[i] = (Y[i] < len(T)) ? T[Y[i]] : D[i]

Note that x86 pshufb does D[i] = (Y[i] < 128) ? X[Y[i] & 15] : 0, which is similar.
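
As a short intrinsics example of TBL with a single table register (a sketch assuming <arm_neon.h>), here is the byte reversal promised by the TBL entry in the reversal table above:

#include <arm_neon.h>

/* Sketch: reverse all 16 bytes of a vector using TBL; out-of-range
   indices would produce 0, but all indices here are in range. */
uint8x16_t reverse_all_bytes(uint8x16_t x) {
    static const uint8_t idx[16] = {15, 14, 13, 12, 11, 10, 9, 8,
                                    7, 6, 5, 4, 3, 2, 1, 0};
    return vqtbl1q_u8(x, vld1q_u8(idx));
}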

Immediates

One flavour of FMOV loads constants of the form ±(1.0 + m/16) × 2^e (where 0 ≤ m ≤ 15 and −3 ≤ e ≤ 4) into the low fp16/fp32/fp64 lane, and either zeros the other lanes or replicates the constant to the other lanes.

One flavour of MOVI loads a constant into the low 64-bit lane, each byte of which is independently either 0 or 255, and either zeros the high u64 lane or replicates the constant there.

The remaining flavours of MOVI and MVNI load a constant into the low 8/16/32-bit lane, one byte of which is an arbitrary value, bytes to the left either all 0 or all 255, and bytes to the right either all 0 or all 255, and then replicate this constant to all lanes.

The bitwise BIC and ORR instructions support constants of the form: 16/32-bit lane, one byte of which is an arbitrary value, other bytes all 0, replicated to all lanes.

Various comparison instructions support comparison against constant zero:

        | Signed     | Unsigned     | Floating
X == 0  | CMEQ X, #0 | CMEQ X, #0   | FCMEQ X, #0.0
X <= 0  | CMLE X, #0 | CMEQ X, #0   | FCMLE X, #0.0
X < 0   | CMLT X, #0 | always false | FCMLT X, #0.0
X > 0   | CMGT X, #0 | CMTST X, X   | FCMGT X, #0.0
X >= 0  | CMGE X, #0 | always true  | FCMGE X, #0.0
X <=> 0 | N/A        | N/A          | FCMP X, #0.0

Shifts

Note >>R is used to denote a rounding right shift: if the most significant bit shifted out was a 1, then 1 is added to the result. If shifting right by N, this is equivalent to adding 1 << (N - 1) to the input before shifting.

In-lane shifts by immediate:

              Per-lane behaviour
SHL           D = X << N
SQSHL, UQSHL  D = sat(X << N)
SQSHLU        D = sat(X << N) (X signed, D unsigned)
SLI           D = (X << N) | bzhi(D, N) (bzhi clears all but the low N bits)
SSHR, USHR    D = X >> N
SRSHR, URSHR  D = X >>R N
SSRA, USRA    D += X >> N
SRSRA, URSRA  D += X >>R N
SRI           D = (X >> N) | bzlo(D, N) (bzlo clears all but the high N bits)

In-lane variable shifts, where the shift amount is a signed value in the low 8 bits of each lane in the 2nd operand:

               | Per-lane behaviour (Y > 0) | Per-lane behaviour (Y < 0)
SSHL, USHL     | D = X << Y                 | D = X >> -Y
SRSHL, URSHL   | D = X << Y                 | D = X >>R -Y
SQSHL, UQSHL   | D = sat(X << Y)            | D = X >> -Y
SQRSHL, UQRSHL | D = sat(X << Y)            | D = X >>R -Y

Widening shifts by immediate (from 8-bit lanes to 16-bit, 16-bit lanes to 32-bit, or 32-bit lanes to 64-bit), where X is sign-extended or zero-extended before use, and D lanes are twice the width of X lanes:

                              Per-lane behaviour (twice-width D)
SSHLL, SSHLL2, USHLL, USHLL2  D = X << N (where 0 ≤ N < bitwidth(X))
SHLL, SHLL2                   D = X << bitwidth(X)

Narrowing shifts by immediate (from 64-bit lanes to 32-bit, 32-bit lanes to 16-bit, or 16-bit lanes to 8-bit), where D lanes are half the width of X lanes. In all cases, 1 ≤ N ≤ bitwidth(D):

                     Per-lane behaviour (half-width D)
XTN, XTN2            D = truncate(X)
SHRN, SHRN2          D = truncate(X >> N)
RSHRN, RSHRN2        D = truncate(X >>R N)
SQXTN, SQXTN2        D = sat(X) (signed)
UQXTN, UQXTN2        D = sat(X) (unsigned)
SQXTUN, SQXTUN2      D = sat(X) (X signed, D unsigned)
SQSHRN, SQSHRN2      D = sat(X >> N) (signed)
UQSHRN, UQSHRN2      D = sat(X >> N) (unsigned)
SQSHRUN, SQSHRUN2    D = sat(X >> N) (X signed, D unsigned)
SQRSHRN, SQRSHRN2    D = sat(X >>R N) (signed)
UQRSHRN, UQRSHRN2    D = sat(X >>R N) (unsigned)
SQRSHRUN, SQRSHRUN2  D = sat(X >>R N) (X signed, D unsigned)

There is no narrowing shift from 8-bit lanes to something narrower. This is a notable difference to x86, where pmovmskb can pull 1 bit out of every 8-bit lane. That said, SHRN from 16-bit lanes to 8-bit lanes with N set to 4 does something interesting: it pulls 4 bits out of every 8-bit lane, taking the high 4 bits of even lanes and the low 4 bits of odd lanes. Alternating between high/low halves is weird, but innocuous if every 8-bit lane starts out containing either 0 or 255.
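
A sketch of that trick using intrinsics (assuming <arm_neon.h>), starting from a per-byte mask in which every byte is either 0 or 255:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: narrow a 16-byte 0x00/0xFF mask to a 64-bit value carrying one
   nibble per input byte; each output byte combines the high nibble of an
   even input byte with the low nibble of the following odd input byte. */
uint64_t mask_to_nibbles(uint8x16_t bytemask) {
    uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(bytemask), 4);
    return vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
}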

Subject to the sha3 extension, two instructions provide rotates in 64-bit lanes:

      Per-lane behaviour (64-bit lanes only)
RAX1  D = X xor rotate_left(Y, 1)
XAR   D = rotate_right(X xor Y, N)

Note that rotate by immediate (without xor, for any lane width, and without needing sha3) can be constructed from SHL followed by USRA.
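
As a sketch of that construction (assuming <arm_neon.h>), a rotate left by 7 within each 32-bit lane:

#include <arm_neon.h>

/* Sketch: SHL puts the low bits in place, then USRA accumulates the high
   bits, giving a rotate without needing the sha3 extension. */
uint32x4_t rotl7_u32(uint32x4_t x) {
    uint32x4_t d = vshlq_n_u32(x, 7);   /* D = X << 7 */
    return vsraq_n_u32(d, x, 25);       /* D += X >> 25 */
}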

Bitwise

Assorted instructions operating bitwise:

         | Description        | Per-bit behaviour
AND      | And                | D = X and Y
BCAX [s] | Clear and xor      | D = X xor (Y and not Z)
BIC      | Clear              | D = X and not Y
BIC      | Clear (immediate)  | D = D and not imm [i]
BIF      | Insert if false    | D = Y ? D : X
BIT      | Insert if true     | D = Y ? X : D
BSL      | Select             | D = D ? X : Y
EOR      | Xor                | D = X xor Y
EOR3 [s] | Xor (three-way)    | D = X xor Y xor Z
NOT      | Not                | D = not X
ORN      | Or not             | D = X or not Y
ORR      | Or                 | D = X or Y
ORR      | Or (immediate)     | D = D or imm [i]
[s] Requires sha3 extension.
[i] Immediate is a 16-bit or 32-bit constant where one byte is an arbitrary imm8 and other bytes are zero, broadcast to all 16-bit or 32-bit lanes.

Bit counting instructions:

    | Counts                         | Possible lane widths
CLS | Leading sign bits              | 8-bit, 16-bit, 32-bit
CLZ | Leading zero bits (i.e. lzcnt) | 8-bit, 16-bit, 32-bit
CNT | Non-zero bits (i.e. popcnt)    | 8-bit [a]
[a] Other lane widths can be achieved by a follow-up UADDLP or ADDV.
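
As a sketch of footnote [a] (assuming <arm_neon.h>), a per-32-bit-lane population count built from CNT plus two pairwise widening adds:

#include <arm_neon.h>

uint32x4_t popcount_per_u32(uint32x4_t x) {
    uint8x16_t bytes = vcntq_u8(vreinterpretq_u8_u32(x)); /* CNT: per-byte popcount */
    uint16x8_t pairs = vpaddlq_u8(bytes);                 /* UADDLP: u8 pairs -> u16 */
    return vpaddlq_u16(pairs);                            /* UADDLP: u16 pairs -> u32 */
}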

There are no horizontal bitwise instructions in the traditional sense, though various horizontal reductions can be constructed from other instructions:

Integer math

Assorted instructions operating lanewise, with 8/16/32/64-bit lanes:

                                Per-lane behaviour
ABS                             D = abs(X)
SQABS                           D = sat(abs(X))
NEG                             D = -X
SQNEG                           D = sat(-X)
ADD, SUB                        D = X ± Y
ADDP                            D = A + B [p]
ADDV                            D[0] = X[0] + X[1] + ⋯ + X[n-1] [v]
SQADD, UQADD, SUQADD, USQADD    D = sat(X + Y)
SQSUB, UQSUB                    D = sat(X - Y)
SABD, UABD [d]                  D = abs(X - Y)
SABA, UABA [d]                  D += abs(X - Y)
SHADD, SHSUB, UHADD, UHSUB [d]  D = (X ± Y) >> 1
SRHADD, URHADD [d]              D = (X + Y) >>R 1
MUL [b] [d]                     D = X * Y
MLA, MLS [b] [d]                D ±= X * Y
SQDMULH [a] [b] [d]             D = sat((2 * X * Y) >> bitwidth(D))
SQRDMULH [a] [b] [d]            D = sat((2 * X * Y) >>R bitwidth(D))
SQRDMLAH, SQRDMLSH [a] [b] [d]  D = sat(D ± ((2 * X * Y) >>R bitwidth(D)))
SMIN, UMIN [d]                  D = min(X, Y)
SMINP, UMINP [d]                D = min(A, B) [p]
SMINV, UMINV [d]                D[0] = min(X[0], X[1], …, X[n-1]) [v]
SMAX, UMAX [d]                  D = max(X, Y)
SMAXP, UMAXP [d]                D = max(A, B) [p]
SMAXV, UMAXV [d]                D[0] = max(X[0], X[1], …, X[n-1]) [v]
CMEQ [z]                        D = (X == Y) ? ones_mask : 0
CMGE, CMHS [z]                  D = (X >= Y) ? ones_mask : 0
CMGT, CMHI [z]                  D = (X > Y) ? ones_mask : 0
CMTST                           D = (X & Y) ? ones_mask : 0
[a] Not available for 8-bit lanes. Not available as unsigned.
[b] When using 16/32-bit lanes, can broadcast a single lane of Y to all lanes of Y.
[d] Not available for 64-bit lanes.
[p] A[i] is concat(X, Y)[2*i+0], B[i] is concat(X, Y)[2*i+1], i.e. adjacent pairs.
[v] Low lane of D gets sum/min/max of all lanes of X, rest of D cleared.
[z] Operands can be registers or constant zero (at least logically).

Assorted instructions operating lanewise, with 8/16/32-bit lanes for X and Y, and D lanes twice as wide (X/Y sign-extended or zero-extended before use):

                                  Per-lane behaviour (twice-width D)
SXTL, SXTL2, UXTL, UXTL2          D = X (i.e. just sign/zero extend)
SABDL, SABDL2, UABDL, UABDL2      D = abs(X - Y)
SABAL, SABAL2, UABAL, UABAL2      D += abs(X - Y)
SADDL, SADDL2, UADDL, UADDL2      D = X + Y
SSUBL, SSUBL2, USUBL, USUBL2      D = X - Y
SADDLP, UADDLP                    D = A + B [p]
SADALP, UADALP                    D += A + B [p]
SADDLV, UADDLV                    D[0] = X[0] + X[1] + ⋯ + X[n-1] [v]
SMULL, SMULL2, UMULL, UMULL2 [b]  D = X * Y
SMLAL, SMLAL2, UMLAL, UMLAL2 [b]  D += X * Y
SMLSL, SMLSL2, UMLSL, UMLSL2 [b]  D -= X * Y
SQDMULL, SQDMULL2 [a] [b]         D = sat(2 * X * Y)
SQDMLAL, SQDMLAL2 [a] [b]         D = sat(D + sat(2 * X * Y))
SQDMLSL, SQDMLSL2 [a] [b]         D = sat(D - sat(2 * X * Y))
[a] Not available for 8-bit lanes of X/Y. Not available as unsigned.
[b] When using 16/32-bit lanes of Y, can broadcast a single lane of Y to all lanes.
[p] A[i] is X[2*i+0], B[i] is X[2*i+1], i.e. adjacent pairs.
[v] Low lane of D gets sum of all lanes of X, rest of D cleared.

A few instructions operating lanewise, with 16/32/64-bit lanes for D and X, and Y lanes half as wide (Y sign-extended or zero-extended before use):

                              Per-lane behaviour (half-width Y)
SADDW, SADDW2, UADDW, UADDW2  D = X + Y
SSUBW, SSUBW2, USUBW, USUBW2  D = X - Y

A few instructions operating lanewise, with 16/32/64-bit lanes for X and Y, and D lanes half as wide:

                                  Per-lane behaviour (half-width D)
ADDHN, ADDHN2, SUBHN, SUBHN2      D = (X ± Y) >> bitwidth(D)
RADDHN, RADDHN2, RSUBHN, RSUBHN2  D = (X ± Y) >>R bitwidth(D)

Dense linear algebra instructions:

          | Behaviour               | D type                 | X type   | Y type
SDOT [b]  | D[i] += dot(X[i], Y[i]) | s32[4] or u32[4]       | s8[4][4] | s8[4][4]
UDOT [b]  | D[i] += dot(X[i], Y[i]) | s32[4] or u32[4]       | u8[4][4] | u8[4][4]
USDOT [b] | D[i] += dot(X[i], Y[i]) | s32[4] or u32[4]       | u8[4][4] | s8[4][4]
SUDOT [b] | D[i] += dot(X[i], Y[i]) | s32[4] or u32[4]       | s8[4][4] | u8[4][4]
SMMLA     | D += X @ Y^T            | s32[2][2] or u32[2][2] | s8[2][8] | s8[2][8]
UMMLA     | D += X @ Y^T            | s32[2][2] or u32[2][2] | u8[2][8] | u8[2][8]
USMMLA    | D += X @ Y^T            | s32[2][2] or u32[2][2] | u8[2][8] | s8[2][8]
[b] Can broadcast a 32-bit lane of Y to all 32-bit lanes.

Two oddball instructions operate on 32-bit unsigned lanes containing fixed-point numbers with 32 fractional bits (i.e. range is 0 through 1-ε):

         Per-lane behaviour
URECPE   D = sat(0.5 * X^-1) (approximate, using just top 9 bits)
URSQRTE  D = sat(0.5 * X^-0.5) (approximate, using just top 9 bits)

Float math

A broad range of floating-point math instructions are available, operating on fp32 or fp64 lanes (or fp16 subject to the fp16 extension), in either vector form or scalar form:

                    Per-lane behaviour
FABS                D = abs(X)
FNEG                D = -X
FADD, FSUB          D = X ± Y
FADDP               D = A + B [p]
FABD                D = abs(X - Y)
FMUL [b]            D = X * Y
FMULX [b]           D = X * Y (except that ±0 times ±infinity is ±2.0)
FNMUL [s]           D = -(X * Y)
FMADD, FMSUB [s]    D = Z ± X * Y
FNMADD, FNMSUB [s]  D = -(Z ± X * Y)
FMLA, FMLS [b]      D ±= X * Y
FDIV                D = X / Y
FSQRT               D = X^0.5
FRECPX [s]          D = X^-1 (crude approximate [a], using no fractional bits)
FRECPE              D = X^-1 (approximate, using just 8 fractional bits)
FRECPS              D = 2.0 - X * Y [c]
FRSQRTE             D = X^-0.5 (approximate, using just 8 fractional bits)
FRSQRTS             D = 1.5 - 0.5 * X * Y [d]
FMIN, FMINNM        D = min(X, Y) [m]
FMINP, FMINNMP      D = min(A, B) [m] [p]
FMINV, FMINNMV      D[0] = min(X[0], X[1], …, X[n-1]) [m] [v]
FMAX, FMAXNM        D = max(X, Y) [m]
FMAXP, FMAXNMP      D = max(A, B) [m] [p]
FMAXV, FMAXNMV      D[0] = max(X[0], X[1], …, X[n-1]) [m] [v]
FCMEQ [z]           D = (X == Y) ? ones_mask : 0
FCMGE [z]           D = (X >= Y) ? ones_mask : 0
FCMGT [z]           D = (X > Y) ? ones_mask : 0
FACGE               D = (abs(X) >= abs(Y)) ? ones_mask : 0
FACGT               D = (abs(X) > abs(Y)) ? ones_mask : 0
[a] Clears fraction bits, then adds one to exponent if zero, then bitwise inverse of exponent bits. Can be used with FMULX as part of vector normalisation.
[b] Can broadcast a single lane of Y to all lanes of Y.
[c] Useful as part of Newton-Raphson step where successive approximations to a^-1 are computed as x[n+1] = x[n] * (2.0 - a * x[n]). See FRECPE.
[d] Useful as part of Newton-Raphson step where successive approximations to a^-0.5 are computed as x[n+1] = x[n] * (1.5 - 0.5 * a * x[n] * x[n]). See FRSQRTE.
[m] Note that min/max are not quite equivalent to comparison followed by selection, due to signed zeros and NaNs. The NM variants return the non-NaN operand if exactly one operand is NaN.
[p] A[i] is concat(X, Y)[2*i+0], B[i] is concat(X, Y)[2*i+1], i.e. adjacent pairs.
[s] Scalar form only, no vector form.
[v] Low lane of D gets min/max of all lanes of X, rest of D cleared.
[z] Operands can be registers or constant zero (at least logically).

Various per-lane rounding instructions with floating-point inputs and outputs (see "Data type conversion, float to integer" for integer outputs):

         | Rounding                        | Range [r]      | Exceptions
FRINT32X | Mode from FPCR                  | -2^31 ⋯ 2^31-1 | Inexact, InvalidOp
FRINT32Z | Toward zero (truncate)          | -2^31 ⋯ 2^31-1 | Inexact, InvalidOp
FRINT64X | Mode from FPCR                  | -2^63 ⋯ 2^63-1 | Inexact, InvalidOp
FRINT64Z | Toward zero (truncate)          | -2^63 ⋯ 2^63-1 | Inexact, InvalidOp
FRINTA   | To nearest, ties away from zero | Unbounded      |
FRINTI   | Mode from FPCR                  | Unbounded      |
FRINTM   | Toward minus infinity (floor)   | Unbounded      |
FRINTN   | To nearest, ties toward even    | Unbounded      |
FRINTP   | Toward positive infinity (ceil) | Unbounded      |
FRINTX   | Mode from FPCR                  | Unbounded      | Inexact
FRINTZ   | Toward zero (truncate)          | Unbounded      |
[r] Out of range results (in either direction) replaced by -2^31 or -2^63.

Mixed-width operations and dense linear algebra:

                   | Behaviour                   | D type     | X/Y type
FMLAL, FMLSL [b]   | D[i] ±= X[i+0] * Y[i+0]     | fp32[4]    | fp16[8]
FMLAL2, FMLSL2 [b] | D[i] ±= X[i+4] * Y[i+4]     | fp32[4]    | fp16[8]
BFMLALB [b]        | D[i] += X[2*i+0] * Y[2*i+0] | fp32[4]    | bf16[8]
BFMLALT [b]        | D[i] += X[2*i+1] * Y[2*i+1] | fp32[4]    | bf16[8]
BFDOT [b]          | D[i] += dot(X[i], Y[i])     | fp32[4]    | bf16[4][2]
BFMMLA             | D += X @ Y^T                | fp32[2][2] | bf16[2][4]
[b] Can broadcast a single lane of Y to all lanes of Y (for BFDOT, a lane is 32 bits).

Float comparisons involving PSTATE

The FCMP, FCMPE family perform a three-way comparison of the low fp16/fp32/fp64 lane of two operands, writing the result to PSTATE:

Following FCMP X, Y, the meaning of condition codes is:

EQ | X[0] == Y[0]                 | NE | !(X[0] == Y[0])
LS | X[0] <= Y[0]                 | HI | !(X[0] <= Y[0])
LO | X[0] < Y[0]                  | HS | !(X[0] < Y[0])
MI | X[0] < Y[0]                  | PL | !(X[0] < Y[0])
CC | X[0] < Y[0]                  | CS | !(X[0] < Y[0])
GT | X[0] > Y[0]                  | LE | !(X[0] > Y[0])
GE | X[0] >= Y[0]                 | LT | !(X[0] >= Y[0])
VS | is_nan(X[0]) or is_nan(Y[0]) | VC | !is_nan(X[0]) and !is_nan(Y[0])

The FCCMP, FCCMPE family perform a conditional three-way comparison of the low fp16/fp32/fp64 lane of two operands: some condition is evaluated against the contents of PSTATE; if true, the instruction behaves like FCMP/FCMPE; if false, a four-bit immediate is written to the relevant PSTATE bits.

The FCSEL instruction uses PSTATE to conditionally select between two scalar fp16/fp32/fp64 operands: D[0] = cond ? X[0] : Y[0] (other lanes of D cleared).

Data type conversion, float to float

Vector form, FPCR rounding:

          | to bf16         | to fp16       | to fp32       | to fp64
From bf16 | no change       | via fp32      | SSHLL, SSHLL2 | via fp32
From fp16 | via fp32        | no change     | FCVTL, FCVTL2 | via fp32
From fp32 | BFCVTN, BFCVTN2 | FCVTN, FCVTN2 | no change     | FCVTL, FCVTL2
From fp64 | via fp32 [x]    | via fp32 [x]  | FCVTN, FCVTN2 | no change
[x] Using FCVTXN or FCVTXN2, which employ round-to-odd rounding mode.

For scalar conversions, FCVT can convert from any of fp16/fp32/fp64 to any other of fp16/fp32/fp64.

Data type conversion, integer to float

Vector form, FPCR rounding, free division by a power of two afterwards:

                | to fp16        | to fp32        | to fp64
From s16 or u16 | SCVTF or UCVTF | via s32 or u32 | via s64 or u64
From s32 or u32 | via fp32       | SCVTF or UCVTF | via s64 or u64
From s64 or u64 | via fp64       | via fp64       | SCVTF or UCVTF

SCVTF and UCVTF can also take a GPR as input (32-bit or 64-bit, signed or unsigned), and convert that to any of fp16/fp32/fp64, again with a free division by a power of two afterwards.

Data type conversion, float to integer

A family of conversion instructions exist, available in two forms:

                   | Rounding                        | Overflow
FCVTAS, FCVTAU     | To nearest, ties away from zero | Saturate
FCVTMS, FCVTMU     | Toward minus infinity (floor)   | Saturate
FCVTNS, FCVTNU     | To nearest, ties toward even    | Saturate
FCVTPS, FCVTPU     | Toward positive infinity (ceil) | Saturate
FCVTZS, FCVTZU [f] | Toward zero (truncate)          | Saturate
FJCVTZS [j]        | Toward zero (truncate)          | Modulo 2^32
[f] Free multiplication by a power of two possible before the conversion.
[j] Only exists in fp64 to s32 GPR form. Also sets PSTATE.

Complex float math

A pair of floating point lanes can represent a complex floating point number, where the low scalar lane contains the real part of the complex number and the high scalar lane contains the imaginary part of the complex number. A 128-bit register can then contain 4 fp16 complex lanes, or 2 fp32 complex lanes, or a single fp64 complex lane. A few instructions exist for manipulating these:

               | Real part of result    | Imaginary part of result
FCADD #90      | Re(D) = Re(X) - Im(Y)  | Im(D) = Im(X) + Re(Y)
FCADD #270     | Re(D) = Re(X) + Im(Y)  | Im(D) = Im(X) - Re(Y)
FCMLA #0 [b]   | Re(D) += Re(X) * Re(Y) | Im(D) += Re(X) * Im(Y)
FCMLA #90 [b]  | Re(D) -= Im(X) * Im(Y) | Im(D) += Im(X) * Re(Y)
FCMLA #180 [b] | Re(D) -= Re(X) * Re(Y) | Im(D) -= Re(X) * Im(Y)
FCMLA #270 [b] | Re(D) += Im(X) * Im(Y) | Im(D) -= Im(X) * Re(Y)
[b] Can broadcast a complex lane (i.e. 2 scalars) of Y to all complex lanes of Y.

Polynomial math

The PMUL, PMULL, and PMULL2 instructions all perform D = X * Y, where all lanes of D/X/Y contain ℤ2 polynomials. This is alternatively known as carryless multiplication (pclmulqdq on x86).

                  | D lanes                                      | X/Y lanes
PMUL              | 8-bit poly (high 7 bits of result discarded) | 8-bit poly
PMULL, PMULL2     | 16-bit poly (top bit always clear)           | 8-bit poly
PMULL, PMULL2 [c] | 128-bit poly (top bit always clear)          | 64-bit poly
[c] Requires crypto extension.

For ℤ2 polynomial addition/subtraction, see EOR or EOR3. Polynomial division and remainder against a constant Y can be performed via multiplication.

Cryptography

Some instructions are provided to accelerate AES encryption. A single round of AES encryption consists of AddRoundKey (just xor), then SubBytes and ShiftRows (in either order), then optionally MixColumns (performed for every round except the last). The provided instructions are:

        Steps
AESE    AddRoundKey then ShiftRows and SubBytes
AESMC   MixColumns
AESD    Inverse AddRoundKey then inverse ShiftRows and inverse SubBytes
AESIMC  Inverse MixColumns

Note that x86 AES instructions are slightly different, for example aesenc there does ShiftRows and SubBytes, then MixColumns, then AddRoundKey.
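
To illustrate the difference (a sketch assuming <arm_neon.h> and the crypto extension), an x86-style aesenc round can be assembled from AESE and AESMC by exploiting the fact that AddRoundKey is just an xor: feed AESE an all-zero key and xor the real round key in afterwards.

#include <arm_neon.h>

uint8x16_t aesenc_like(uint8x16_t state, uint8x16_t roundkey) {
    uint8x16_t t = vaeseq_u8(state, vdupq_n_u8(0)); /* AddRoundKey(0), ShiftRows, SubBytes */
    t = vaesmcq_u8(t);                              /* MixColumns */
    return veorq_u8(t, roundkey);                   /* xor the round key last, as aesenc does */
}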

Some instructions are provided to accelerate SHA-1 hashes: SHA1C, SHA1H, SHA1M, SHA1P, SHA1SU0, SHA1SU1.

Some instructions are provided to accelerate SHA-2 hashes: SHA256H, SHA256H2, SHA256SU0, SHA256SU1 for SHA-256, and SHA512H, SHA512H2, SHA512SU0, SHA512SU1 for SHA-512.

Some instructions are provided to accelerate SHA-3 hashes: EOR3, RAX1, XAR, BCAX. See "Shifts" or "Bitwise" for descriptions.

Some instructions are provided to accelerate SM3 hashes: SM3SS1, SM3TT1A, SM3TT1B, SM3TT2A, SM3TT2B, SM3PARTW1, SM3PARTW2.

Some instructions are provided to accelerate SM4 encryption: SM4E, SM4EKEY.

The many ways of converting FP32 to FP16

An IEEE 754 single-precision 32-bit binary float (henceforth FP32) can store:

An IEEE 754 half-precision 16-bit binary float (henceforth FP16) can store:

Converting from FP32 to FP16 thus involves getting rid of m11 through m23, and dealing with the reduced range of e. If e starts in the range -14 ≤ e ≤ 15 and m11 through m23 are all 0, then this conversion is trivial. If any of m11 through m23 are 1 then a rounding decision has to be made, typical options for which include:

If e starts in the range e < -14 (i.e. underflow), then the resultant FP16 has to end up with e = -14, as it cannot go any smaller. Some number of mi will be discarded, and so a rounding decision again has to be made. A fun bonus decision crops up when rounding toward ±∞ and the starting FP32 is denormal: should denormals be treated as zero? (for other rounding modes, denormal inputs will naturally round to zero) Regardless of rounding mode, if the resultant FP16 is denormal, then there's the decision of whether to flush to zero.

If e starts in the range 15 < e (i.e. overflow), then there's no particularly great FP16 to convert to, with the options being:

Choosing between infinity and 65504 is yet again a rounding decision, with the unintuitive quirk that 65520 is considered to be nearer to infinity than to 65504.

If the starting FP32 is NaN, then there's a choice about what to do with the sign bit (either preserve it, or force it to a particular value), and again a choice about what to do with the payload bits (either preserve some of them, or force them to a particular canonical pattern).

All said, there are clearly lots of decision points in the conversion process, so it is no surprise that different implementations make different choices. We'll look at a bunch of software and hardware implementations, and the choices that they make (as they don't always advertise their decisions).

The summary of implementations and most important choices is:

Implementation       | Rounding                             | NaNs
numpy                | Toward nearest, ties toward even     | Preserve (1)
CPython              | Toward nearest, ties toward even (†) | ±Canonical
James Tursa (Matlab) | Toward nearest, ties away from zero  | -Canonical
Fabian "ryg" Giesen  | Toward nearest, details vary         | ±Canonical
Maratyszcza          | Toward nearest, ties toward even     | ±Canonical
VCVTPS2PH (x86)      | Configurable                         | Preserve (2)
FCVT / FCVTN (ARM)   | Configurable                         | Configurable
(†) Except that overflows throw an exception
(1) Top 10 bits of payload preserved, then LSB set to 1 if required
(2) Top bit of payload set to 1, then next 9 bits preserved

Jumping into the details, we start with numpy, which consists of very well-commented bit manipulations:

uint16_t numpy_floatbits_to_halfbits(uint32_t f) {
  uint16_t h_sgn = (uint16_t)((f & 0x80000000u) >> 16);
  uint32_t f_exp = f & 0x7f800000u;
  uint32_t f_sig = f & 0x007fffffu;

  // Exponent overflow/NaN converts to signed inf/NaN
  if (f_exp >= 0x47800000u) {
    if ((f_exp == 0x7f800000u) && (f_sig != 0)) {
      // NaN - propagate the flag in the significand...
      uint16_t ret = (uint16_t)(0x7c00u + (f_sig >> 13));
      ret += (ret == 0x7c00u); // ...but make sure it stays a NaN
      return h_sgn + ret;
    } else {
      // (overflow to) signed inf
      return (uint16_t)(h_sgn + 0x7c00u);
    }
  }

  // Exponent underflow converts to a subnormal half or signed zero
  if (f_exp <= 0x38000000u) {
    // Signed zeros, subnormal floats, and floats with small
    // exponents all convert to signed zero half-floats.
    if (f_exp < 0x33000000u) {
      return h_sgn;
    }
    // Make the subnormal significand
    f_exp >>= 23;
    f_sig += 0x00800000u;
    f_sig >>= (113 - f_exp);
    // Handle rounding by adding 1 to the bit beyond half precision
    //
    // If the last bit in the half significand is 0 (already even),
    // and the remaining bit pattern is 1000...0, then we do not add
    // one to the bit after the half significand. However, the
    // (113 - f_exp) shift can lose up to 11 bits, so the || checks
    // them in the original. In all other cases, we can just add one.
    if (((f_sig & 0x3fffu) != 0x1000u) || (f & 0x07ffu)) {
      f_sig += 0x1000u;
    }
    uint16_t h_sig = (uint16_t)(f_sig >> 13);
    // If the rounding causes a bit to spill into h_exp, it will
    // increment h_exp from zero to one and h_sig will be zero.
    // This is the correct result.
    return (uint16_t)(h_sgn + h_sig);
  }

  // Regular case with no overflow or underflow
  uint16_t h_exp = (uint16_t)((f_exp - 0x38000000u) >> 13);
  // Handle rounding by adding 1 to the bit beyond half precision
  //
  // If the last bit in the half significand is 0 (already even), and
  // the remaining bit pattern is 1000...0, then we do not add one to
  // the bit after the half significand. In all other cases, we do.
  if ((f_sig & 0x3fffu) != 0x1000u) {
      f_sig += 0x1000u;
  }
  uint16_t h_sig = (uint16_t)(f_sig >> 13);
  // If the rounding causes a bit to spill into h_exp, it will
  // increment h_exp by one and h_sig will be zero. This is the
  // correct result. h_exp may increment to 15, at greatest, in
  // which case the result overflows to a signed inf.
  return (uint16_t)(h_sgn + h_exp + h_sig);
}
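
As a quick illustration of footnote (1), here are a couple of checks of my own (assuming the function above is in scope): a NaN whose top 10 payload bits are all zero gets its low bit forced to 1 so that it doesn't collapse into infinity, while an all-ones NaN passes straight through:

#include <assert.h>
int main(void) {
  assert(numpy_floatbits_to_halfbits(0x7f800001u) == 0x7c01u); // payload would vanish, so LSB set
  assert(numpy_floatbits_to_halfbits(0xffffffffu) == 0xffffu); // top 10 payload bits preserved
  return 0;
}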

CPython takes a very different approach, involving copysign and frexp:

int PyFloat_Pack2(double x) {
  uint16_t sign, bits;
  int e;

  if (x == 0.0) {
    sign = (copysign(1.0, x) == -1.0);
    e = 0;
    bits = 0;
  } else if (isinf(x)) {
    sign = (x < 0.0);
    e = 0x1f;
    bits = 0;
  } else if (isnan(x)) {
    // There are 2046 distinct half-precision NaNs (1022 signaling
    // and 1024 quiet), but there are only two quiet NaNs that don't
    // arise by quieting a signaling NaN; we get those by setting
    // the topmost bit of the fraction field and clearing all other
    // fraction bits. We choose the one with the appropriate sign.
    sign = (copysign(1.0, x) == -1.0);
    e = 0x1f;
    bits = 0x200;
  } else {
    sign = (x < 0.0);
    if (sign) {
      x = -x;
    }

    double f = frexp(x, &e);
    // Normalize f to be in the range [1.0, 2.0)
    f *= 2.0;
    e--;

    if (e >= 16) {
      goto Overflow;
    } else if (e < -25) {
      // |x| < 2**-25. Underflow to zero.
      f = 0.0;
      e = 0;
    } else if (e < -14) {
      // |x| < 2**-14. Gradual underflow
      f = ldexp(f, 14 + e);
      e = 0;
    } else /* if (!(e == 0 && f == 0.0)) */ {
      e += 15;
      f -= 1.0; // Get rid of leading 1
    }

    f *= 1024.0; // 2**10
    // Round to even
    bits = (uint16_t)f; // Note the truncation
    assert(bits < 1024);
    assert(e < 31);
    if ((f - bits > 0.5) || ((f - bits == 0.5) && (bits % 2 == 1))) {
      if (++bits == 1024) {
        // The carry propagated out of a string of 10 1 bits.
        bits = 0;
        if (++e == 31) {
          goto Overflow;
        }
      }
    }
  }

  return (sign << 15) | (e << 10) | bits;
Overflow:
  PyErr_SetString(PyExc_OverflowError,
                  "float too large to pack with e format");
  return -1;
}

The "Round to even" part of this CPython code was explicitly put in to match the numpy behaviour, however it diverges in its treatment of finite inputs that can't plausibly be represented by a finite FP16 - numpy will round them to infinity, whereas CPython will throw. This can be seen in the following example:

import numpy
import struct

def np_f2h(x):
  return int(numpy.array(x, 'u4').view('f4').astype('f2').view('u2'))

def py_f2h(x):
  f, = struct.unpack('f', struct.pack('I', x))
  return struct.unpack('H', struct.pack('e', f))[0]

print(hex(np_f2h(0x49800000))) # 0x7c00
print(hex(py_f2h(0x49800000))) # OverflowError

The other difference is in treatment of NaN values, where numpy tries to preserve them as much as possible, whereas CPython only preserves the sign bit and otherwise turns all NaN values into a canonical quiet NaN:

print(hex(np_f2h(0xffffffff))) # 0xffff
print(hex(py_f2h(0xffffffff))) # 0xfe00

A different lineage of code comes from James Tursa. It was originally written for Matlab and requires a MathWorks account to download, but the interesting part found its way to GitHub regardless, and is transcribed with slight modifications here:

uint16_t tursa_floatbits_to_halfbits(uint32_t x) {
  uint32_t xs = x & 0x80000000u; // Pick off sign bit
  uint32_t xe = x & 0x7f800000u; // Pick off exponent bits
  uint32_t xm = x & 0x007fffffu; // Pick off mantissa bits
  uint16_t hs = (uint16_t)(xs >> 16); // Sign bit
  if (xe == 0) { // Denormal will underflow, return a signed zero
    return hs;
  }
  if (xe == 0x7f800000u) { // Inf or NaN
    if (xm == 0) { // If mantissa is zero ...
      return (uint16_t)(hs | 0x7c00u); // Signed Inf
    } else {
      return (uint16_t)0xfe00u; // NaN, only 1st mantissa bit set
    }
  }
  // Normalized number
  int hes = ((int)(xe >> 23)) - 127 + 15; // Exponent unbias the single, then bias the halfp
  if (hes >= 0x1f) { // Overflow
    return (uint16_t)(hs | 0x7c00u); // Signed Inf
  } else if (hes <= 0) { // Underflow
    uint16_t hm;
    if ((14 - hes) > 24) { // Mantissa shifted all the way off & no rounding possibility
      hm = (uint16_t)0u; // Set mantissa to zero
    } else {
      xm |= 0x00800000u; // Add the hidden leading bit
      hm = (uint16_t)(xm >> (14 - hes)); // Mantissa
      if ((xm >> (13 - hes)) & 1u) { // Check for rounding
        hm += (uint16_t)1u; // Round, might overflow into exp bit, but this is OK
      }
    }
    return (hs | hm); // Combine sign bit and mantissa bits, biased exponent is zero
  } else {
    uint16_t he = (uint16_t)(hes << 10); // Exponent
    uint16_t hm = (uint16_t)(xm >> 13); // Mantissa
    if (xm & 0x1000u) { // Check for rounding
      return (hs | he | hm) + (uint16_t)1u; // Round, might overflow to inf, this is OK
    } else {
      return (hs | he | hm); // No rounding
    }
  }
}

Though it shares a lot in spirit with the numpy code, this code has different rounding behaviour from what we've seen before: ties are rounded away from zero rather than toward even. It also has yet another take on NaNs: all input NaNs get turned into a single output NaN, with not even the sign bit preserved.
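
The divergence is easy to demonstrate on an exact tie. The FP32 value 0x3f801000 (one plus 2**-11) has its 13 discarded mantissa bits equal to exactly 1000000000000 and its retained bits all zero, so a small driver of my own (assuming both functions above are compiled together) shows the two implementations disagreeing:

#include <stdio.h>
int main(void) {
  printf("%04x\n", numpy_floatbits_to_halfbits(0x3f801000u)); // 3c00: tie goes to even
  printf("%04x\n", tursa_floatbits_to_halfbits(0x3f801000u)); // 3c01: tie goes away from zero
  return 0;
}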

The code from James Tursa was adopted into ISPC, meaning that ISPC had the same rounding and NaN behaviours. Then along came Fabian "ryg" Giesen, deploying his usual optimisation skills while also fixing the rounding behaviour (back to ties toward even) and the NaN behaviour (preserving the sign bit). Several variants were written along that journey (and then adopted back by ISPC), but I'll choose this one to showcase:

uint16_t float_to_half_fast3_rtne(uint32_t x) {
  uint32_t x_sgn = x & 0x80000000u;
  x ^= x_sgn;

  uint16_t o;
  if (x >= 0x47800000u) { // result is Inf or NaN
    o = (x > 0x7f800000u) ? 0x7e00  // NaN->qNaN
                          : 0x7c00; // and Inf->Inf
  } else { // (De)normalized number or zero
    if (x < 0x38800000u) { // resulting FP16 is subnormal or zero
      // use a magic value to align our 10 mantissa bits at
      // the bottom of the float. as long as FP addition
      // is round-to-nearest-even this just works.
      union { uint32_t u; float f; } f, denorm_magic;
      f.u = x;
      denorm_magic.u = ((127 - 14) + (23 - 10)) << 23;

      f.f += denorm_magic.f;

      // and one integer subtract of the bias later, we have our
      // final float!
      o = f.u - denorm_magic.u;
    } else {
      uint32_t mant_odd = (x >> 13) & 1; // resulting mantissa is odd

      // update exponent, rounding bias part 1
      x += ((15 - 127) << 23) + 0xfff;
      // rounding bias part 2
      x += mant_odd;
      // take the bits!
      o = x >> 13;
    }
  }
  return (x_sgn >> 16) | o;
}

The interesting part of the above is how it handles the e < -14 case. Both the numpy code and the code from James Tursa handled this case with a variable distance bitshift and then some fiddly rounding logic, whereas the above makes the observation that there's existing hardware in a contemporary CPU for doing shifting and fiddly rounding: the floating point addition circuit. The first step of a floating point addition circuit is a variable distance bitshift of the smaller operand, with the aim of making its exponent equal to that of the larger operand. The above engineers the larger operand to have e = -1, which is what we want, as FP16 bottoms out at e = -14 and there are 13 (23 - 10) extra mantissa bits in FP32 compared to FP16. Any bits that get shifted out as part of exponent equalisation will contribute to rounding, which is exactly what we want.
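
As a concrete check of that denormal path (a worked example of my own, assuming the function above is in scope): 2**-20 has FP32 bits 0x35800000, and denorm_magic is 0.5 (i.e. e = -1). Their sum is 0.5 * (1 + 2**-19), whose bit pattern is 0x3f000010, and subtracting denorm_magic.u (0x3f000000) leaves 0x0010, which is the FP16 denormal 16 * 2**-24 = 2**-20:

#include <assert.h>
int main(void) {
  assert(float_to_half_fast3_rtne(0x35800000u) == 0x0010u);
  return 0;
}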

Taking the idea further, we have this gem from Maratyszcza:

uint16_t fp16_ieee_from_fp32_value(uint32_t x) {
  uint32_t x_sgn = x & 0x80000000u;
  uint32_t x_exp = x & 0x7f800000u;
  x_exp = (x_exp < 0x38800000u) ? 0x38800000u : x_exp; // max(e, -14)
  x_exp += 15u << 23; // e += 15
  x &= 0x7fffffffu; // Discard sign

  union { uint32_t u; float f; } f, magic;
  f.u = x;
  magic.u = x_exp;

  // If 15 < e then inf, otherwise e += 2
  f.f = (f.f * 0x1.0p+112f) * 0x1.0p-110f;

  // If we were in the x_exp >= 0x38800000u case:
  // Replace f's exponent with that of x_exp, and discard 13 bits of
  // f's significand, leaving the remaining bits in the low bits of
  // f, relying on FP addition being round-to-nearest-even. If f's
  // significand was previously `a.bcdefghijk`, then it'll become
  // `1.000000000000abcdefghijk`, from which `bcdefghijk` will become
  // the FP16 mantissa, and `a` will add onto the exponent. Note that
  // rounding might add one to all this.
  // If we were in the x_exp < 0x38800000u case:
  // Replace f's exponent with the minimum FP16 exponent, discarding
  // however many bits are required to make that happen, leaving
  // whatever is left in the low bits.
  f.f += magic.f;

  uint32_t h_exp = (f.u >> 13) & 0x7c00u; // low 5 bits of exponent
  uint32_t h_sig = f.u & 0x0fffu; // low 12 bits (10 are mantissa)
  h_sig = (x > 0x7f800000u) ? 0x0200u : h_sig; // any NaN -> qNaN
  return (x_sgn >> 16) + h_exp + h_sig;
}

The 15 < e case is elegantly folded into the infinity case by using the floating point multiply circuit and the constant 0x1.0p+112f. After that, the x_exp < 0x38800000u case plays out similarly to ryg's code: add a fixed constant to cause the result to appear in the low bits. The x_exp >= 0x38800000u case uses a dynamic magic number rather than a constant magic number, with the aim of shifting off exactly 13 bits, getting the rounding done by the floating point addition circuit, and again having the result appear in the low bits. The final trickery is in computing the FP16 exponent bits; starting with the 8 bits of FP32 exponent, the top 3 bits are discarded, and modulo-32 arithmetic is done on the low 5 bits. Modulo 32, the bias adjustment is +16 ((15-127) % 32), which gets done in two parts: a fixed +15, and then either +0 or +1 depending on whether the result is denormal or not.
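
A few sanity checks of my own (again assuming the function above is in scope) confirm the exponent assembly, the fold into infinity, and the NaN handling:

#include <assert.h>
int main(void) {
  assert(fp16_ieee_from_fp32_value(0x3f800000u) == 0x3c00u); // 1.0
  assert(fp16_ieee_from_fp32_value(0x477ff000u) == 0x7c00u); // 65520.0 rounds to infinity
  assert(fp16_ieee_from_fp32_value(0xffffffffu) == 0xfe00u); // negative NaN -> negative qNaN
  return 0;
}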

Sadly, at some point, new hardware makes software tricks obsolete. On x86, said new hardware was the F16C instruction set, which (amongst other things) adds a VCVTPS2PH instruction for converting (a vector of) FP32 to FP16. Four different rounding modes are available: round to nearest with ties toward even, round toward +∞, round toward -∞, round toward zero. Note that round to nearest with ties away from zero isn't an option. An immediate operand specifies the rounding mode, or it can come from MXCSR.RC. NaN signs are preserved, and NaN payloads are mostly preserved: the MSB of the input payload is forced to 1, and then the top 10 bits of the input payload become the output payload. When rounding toward +∞ or -∞, MXCSR.DAZ causes input denormals to be treated as zero. On the other hand, MXCSR.FTZ is ignored, so denormal results are never flushed to zero. Code to exercise this instruction looks incredibly boring:

uint16_t cvtps2ph(uint32_t x) {
    __m128 xmm = _mm_castsi128_ps(_mm_cvtsi32_si128(x));
    return _mm_extract_epi16(_mm_cvtps_ph(xmm, 0), 0);
}
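
The 0 immediate above selects round to nearest with ties toward even; the other modes can be named via the immediate as well, or _MM_FROUND_CUR_DIRECTION defers to MXCSR.RC. A variant sketch of my own (using the _MM_FROUND_* names from <immintrin.h>, and assuming the compiler is invoked with F16C support, e.g. -mf16c):

uint16_t cvtps2ph_rtz(uint32_t x) {
    __m128 xmm = _mm_castsi128_ps(_mm_cvtsi32_si128(x));
    return _mm_extract_epi16(_mm_cvtps_ph(xmm, _MM_FROUND_TO_ZERO), 0);
}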

On ARMv8, we have FCVT (scalar) and FCVTN / FCVTN2 (vector), which are capable of performing FP32 to FP16 conversion. FPCR controls the rounding mode, with the same options as on x86. FPCR also controls denormal flushing, of both inputs and outputs. Finally, FPCR.DN controls whether NaNs are canonicalised (without even the sign bit preserved), or whether they get the same preservation behaviour as on x86.
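
The software side is correspondingly boring; a minimal sketch using the NEON intrinsics (my own, assuming an AArch64 compiler with <arm_neon.h> and FP16 support; rounding, flushing, and NaN handling all follow whatever FPCR happens to contain):

#include <arm_neon.h>
#include <stdint.h>
#include <string.h>

uint16_t fcvtn(uint32_t x) {
  float f;
  memcpy(&f, &x, sizeof f);                      // reinterpret the FP32 bits
  float16x4_t h = vcvt_f16_f32(vdupq_n_f32(f));  // compiles to FCVTN
  return vget_lane_u16(vreinterpret_u16_f16(h), 0);
}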
