Optimising SDI CRC, part 2

Yesterday, we looked at optimising an SDI CRC computation. After a bunch of mathematical optimisations, we reached this point:

uint64_t pack60(const uint16_t* data) {
    return (((uint64_t)data[ 0]) <<  4) ^
           (((uint64_t)data[ 2]) << 14) ^
           (((uint64_t)data[ 4]) << 24) ^
           (((uint64_t)data[ 6]) << 34) ^
           (((uint64_t)data[ 8]) << 44) ^
           (((uint64_t)data[10]) << 54);
}

__m128i pack120(const uint16_t* data) {
    return _mm_set_epi64x(pack60(data + 12) >> 4, pack60(data));
}

__m128i xor_clmul(__m128i a, __m128i b) {
    return _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x00),
                         _mm_clmulepi64_si128(a, b, 0x11));
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        { // *= x^120 semi-mod P
            __m128i k = _mm_setr_epi32(
                0, 0x4b334000 /* x^120+64-1 mod P */,
                0, 0x96d30000 /* x^120-1    mod P */);
            c = xor_clmul(c, k);
            y = xor_clmul(y, k);
        }
        { // +=
            c = _mm_xor_si128(c, pack120(data + i));
            y = _mm_xor_si128(y, pack120(data + i + 1));
        }
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}

This ran at approximately 11 bits per cycle:

Version              | Broadwell | Skylake | Ice Lake
_mm_clmulepi64_si128 | 10.8      | 11.0    | 12.8

From here, the question is how best to implement pack120. Compared to yesterday's approach, there is a very different one based around _mm_shuffle_epi8:

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i d1 = _mm_loadu_si128((__m128i*)(data + i));
            __m128i d2 = _mm_loadu_si128((__m128i*)(data + i + 8));
            __m128i d3 = _mm_loadu_si128((__m128i*)(data + i + 16));
            __m128i k1 = _mm_setr_epi16(16, 16, 64, 64, 1, 1, 4, 4);
            __m128i k2 = _mm_setr_epi16(16,  0, 64,  0, 1, 0, 4, 0);
            __m128i k3 = _mm_setr_epi16( 0, 16,  0, 64, 0, 1, 0, 4);
            d1 = _mm_mullo_epi16(k1, d1);
            __m128i k4=_mm_setr_epi8( 0, 1, -1,  8,  9, -1, -1, -1,
                                      2, 3, -1, 10, 11, -1, -1, -1);
            __m128i k5=_mm_setr_epi8(-1, 4,  5, -1, 12, 13, -1, -1,
                                     -1, 6,  7, -1, 14, 15, -1, -1);
            d1 = _mm_shuffle_epi8(d1, k4) ^ _mm_shuffle_epi8(d1, k5);
            __m128i d1m; // Force a store to memory
            asm("vmovdqa %1, %0" : "=m"(d1m) : "x"(d1) : "memory");
            __m128i cd = _mm_packus_epi32(_mm_madd_epi16(d2, k2),
                                          _mm_madd_epi16(d3, k2));
            __m128i yd = _mm_packus_epi32(_mm_madd_epi16(d2, k3),
                                          _mm_madd_epi16(d3, k3));
            __m128i k6=_mm_setr_epi8(-1, -1, -1, -1, -1,  0,  1, -1,
                                      4,  5,  8,  9, -1, 12, 13, -1);
            __m128i k7=_mm_setr_epi8(-1, -1, -1, -1, -1, -1,  2,  3,
                                     -1,  6,  7, 10, 11, -1, 14, 15);
            cd = _mm_shuffle_epi8(cd, k6) ^ _mm_shuffle_epi8(cd, k7)
               ^ _mm_loadu_si64(&d1m);
            yd = _mm_shuffle_epi8(yd, k6) ^ _mm_shuffle_epi8(yd, k7)
               ^ _mm_loadu_si64(8 + (char*)&d1m);
            c = _mm_xor_si128(c, cd);
            y = _mm_xor_si128(y, yd);
        }
...
}

This is very impressive, getting us to the region of 20 bits per cycle:

Version          | Broadwell | Skylake | Ice Lake
_mm_shuffle_epi8 | 20.2      | 19.7    | 23.4

If targeting AVX2, some pairs of operations can be collapsed:

__m256i broadcast128(const uint16_t* data) {
    __m256i result;
    asm("vbroadcasti128 %1, %0" : "=x"(result) :
                                   "m"(*(const __m128i*)data));
    return result;
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i d1 = _mm_loadu_si128((__m128i*)(data + i));
            __m256i d2 = broadcast128(data + i + 8);
            __m256i d3 = broadcast128(data + i + 16);
            __m128i k1 = _mm_setr_epi16(
                16, 16, 64, 64, 1, 1, 4, 4);
            __m256i k23 = _mm256_setr_epi16(
                16,  0, 64,  0, 1, 0, 4, 0,
                 0, 16,  0, 64, 0, 1, 0, 4);
            d1 = _mm_mullo_epi16(k1, d1);
            __m128i k4 = _mm_setr_epi8(
                0,  1, -1,  8,  9, -1, -1, -1,
                2,  3, -1, 10, 11, -1, -1, -1);
            __m128i k5 = _mm_setr_epi8(
                -1, 4,  5, -1, 12, 13, -1, -1,
                -1, 6,  7, -1, 14, 15, -1, -1);
            d1 = _mm_shuffle_epi8(d1, k4) ^ _mm_shuffle_epi8(d1, k5);
            __m128i d1m; // Force a store to memory
            asm("vmovdqa %1, %0" : "=m"(d1m) : "x"(d1) : "memory");
            __m256i cdyd = _mm256_packus_epi32(
                _mm256_madd_epi16(d2, k23),
                _mm256_madd_epi16(d3, k23));
            __m256i k6 = _mm256_setr_epi8(
                -1, -1, -1, -1, -1,  0,  1, -1,
                 4,  5,  8,  9, -1, 12, 13, -1,
                -1, -1, -1, -1, -1,  0,  1, -1, 
                 4,  5,  8,  9, -1, 12, 13, -1);
            __m256i k7 = _mm256_setr_epi8(
                -1, -1, -1, -1, -1, -1,  2,  3,
                -1,  6,  7, 10, 11, -1, 14, 15,
                -1, -1, -1, -1, -1, -1,  2,  3,
                -1,  6,  7, 10, 11, -1, 14, 15);
            cdyd = _mm256_shuffle_epi8(cdyd, k6)
                 ^ _mm256_shuffle_epi8(cdyd, k7);
            __m256i m; // Force a store to memory
            asm("vmovdqa %1, %0" : "=m"(m) : "x"(cdyd) : "memory");
            __m128i cd = _mm256_castsi256_si128(cdyd)
                       ^ _mm_loadu_si64(&d1m);
            __m128i yd = _mm_loadu_si128(1 + (__m128i*)&m)
                       ^ _mm_loadu_si64(8 + (char*)&d1m);
            c = _mm_xor_si128(c, cd);
            y = _mm_xor_si128(y, yd);
        }
...
}

This gains us a few bits per cycle on all three processors:

Version | Broadwell | Skylake | Ice Lake
AVX2    | 22.3      | 23.2    | 25.4

If targeting AVX512, the xor3 trick from yesterday can be applied on top:

__m128i xor3(__m128i a, __m128i b, __m128i c) {
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
    for (size_t i = 0; i < n; i += 24) {
        // *= x^120 semi-mod P
        // +=
        __m128i k = _mm_setr_epi32(
            0, 0x4b334000 /* x^120+64-1 mod P */,
            0, 0x96d30000 /* x^120-1    mod P */);
        __m128i d1 = _mm_loadu_si128((__m128i*)(data + i));
        __m256i d2 = broadcast128(data + i + 8);
        __m256i d3 = broadcast128(data + i + 16);
        __m128i k1 = _mm_setr_epi16(
            16, 16, 64, 64, 1, 1, 4, 4);
        __m256i k23 = _mm256_setr_epi16(
            16,  0, 64,  0, 1, 0, 4, 0,
             0, 16,  0, 64, 0, 1, 0, 4);
        d1 = _mm_mullo_epi16(k1, d1);
        __m128i k4 = _mm_setr_epi8(
            0,  1, -1,  8,  9, -1, -1, -1,
            2,  3, -1, 10, 11, -1, -1, -1);
        __m128i k5 = _mm_setr_epi8(
            -1, 4,  5, -1, 12, 13, -1, -1,
            -1, 6,  7, -1, 14, 15, -1, -1);
        d1 = _mm_shuffle_epi8(d1, k4) ^ _mm_shuffle_epi8(d1, k5);
        __m128i d1m; // Force a store to memory
        asm("vmovdqa %1, %0" : "=m"(d1m) : "x"(d1) : "memory");
        __m256i cdyd = _mm256_packus_epi32(
            _mm256_madd_epi16(d2, k23),
            _mm256_madd_epi16(d3, k23));
        __m256i k6 = _mm256_setr_epi8(
            -1, -1, -1, -1, -1,  0,  1, -1,
             4,  5,  8,  9, -1, 12, 13, -1,
            -1, -1, -1, -1, -1,  0,  1, -1, 
             4,  5,  8,  9, -1, 12, 13, -1);
        __m256i k7 = _mm256_setr_epi8(
            -1, -1, -1, -1, -1, -1,  2,  3,
            -1,  6,  7, 10, 11, -1, 14, 15,
            -1, -1, -1, -1, -1, -1,  2,  3,
            -1,  6,  7, 10, 11, -1, 14, 15);
        cdyd = _mm256_shuffle_epi8(cdyd, k6)
             ^ _mm256_shuffle_epi8(cdyd, k7);
        __m256i m; // Force a store to memory
        asm("vmovdqa %1, %0" : "=m"(m) : "x"(cdyd) : "memory");
        __m128i cd = _mm256_castsi256_si128(cdyd)
                   ^ _mm_loadu_si64(&d1m);
        __m128i yd = _mm_loadu_si128(1 + (__m128i*)&m)
                   ^ _mm_loadu_si64(8 + (char*)&d1m);
        c = xor3(_mm_clmulepi64_si128(c, k, 0x00),
                 _mm_clmulepi64_si128(c, k, 0x11), cd);
        y = xor3(_mm_clmulepi64_si128(y, k, 0x00),
                 _mm_clmulepi64_si128(y, k, 0x11), yd);
    }
...
}

On applicable processors, this gains another few bits per cycle:

Version | Broadwell | Skylake | Ice Lake
AVX512  | N/A       | 25.8    | 28.9

Optimising SDI CRC

An SDI data stream consists of 10-bit data. In hardware, this is not a problem. In contemporary software, 10-bit values are slightly awkward, and are generally represented in the low 10 bits of a uint16_t. SDI has two separate streams of 10-bit data (called c and y), which software might choose to represent in an interleaved fashion. Finally, each stream has an 18-bit CRC computed over it (polynomial x^18 + x^5 + x^4 + x^0), where the 18-bit value might be represented in the low 18 bits of a uint32_t. Given all this, the CRCs could be computed using the following code:

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    uint32_t c = crcs[0];
    uint32_t y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c ^= data[i];
        y ^= data[i+1];
        for (int k = 0; k < 10; k++) {
            c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
            y = y & 1 ? (y >> 1) ^ 0x23000 : y >> 1;
        }
    }
    crcs[0] = c;
    crcs[1] = y;
}

The above code is short, but it isn't fast: on my three test systems, it manages approximately 0.5 bits per cycle. Expressed differently, that is 40 cycles per iteration of the outer loop (which is ingesting 20 bits per iteration).

Version        | Broadwell | Skylake | Ice Lake
Starting point | 0.536     | 0.528   | 0.477
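
As an aside, the 0x23000 constant in the bit-at-a-time loop is just P in reversed bit order: the loop keeps the CRC reflected, so the 18 coefficients of x^0 through x^17 get bit-reversed within 18 bits, and the x^18 term is left implicit. A minimal sketch confirming this (not needed for anything that follows):

#include <stdint.h>
#include <stdio.h>

int main() {
    // Low 18 coefficients of P = x^18 + x^5 + x^4 + x^0; x^18 is implicit.
    uint32_t poly = (1u << 5) | (1u << 4) | 1u;
    uint32_t reflected = 0;
    for (int bit = 0; bit < 18; bit++) {
        if (poly & (1u << bit)) reflected |= 1u << (17 - bit);
    }
    printf("0x%x\n", reflected); // prints 0x23000
    return 0;
}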

At the cost of a 2KiB table, this can be improved to approximately 2.1 bits per cycle, which is better, but still not great:

uint16_t table[1u << 10];

void make_table() {
    for (uint32_t idx = 0; idx < (1u << 10); idx++) {
        uint32_t c = idx;
        for (int k = 0; k < 10; k++) {
            c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
        }
        table[idx] = (c >> 2);
    }
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    uint32_t c = crcs[0];
    uint32_t y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c ^= data[i];
        y ^= data[i+1];
        c = (table[c & 0x3ff] << 2) ^ (c >> 10);
        y = (table[y & 0x3ff] << 2) ^ (y >> 10);
    }
    crcs[0] = c;
    crcs[1] = y;
}

Version    | Broadwell | Skylake | Ice Lake
2KiB table | 2.10      | 2.15    | 2.05

To do better, some mathematical tricks need to be pulled. The polynomial ring ℤ2[x] is relevant, along with an isomorphism between bit strings and said polynomials. The original code can be re-expressed using some notation borrowed from polynomials:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0];
    polynomial y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c += data[i]   * x^8;
        y += data[i+1] * x^8;
        for (int k = 0; k < 10; k++) {
            c = (c * x^1) mod P; // P is x^18 + x^5 + x^4 + x^0
            y = (y * x^1) mod P;
        }
    }
    crcs[0] = c;
    crcs[1] = y;
}

The first mathematical trick is eliminating the inner loop:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0];
    polynomial y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c = (c * x^10 + data[i]   * x^18) mod P;
        y = (y * x^10 + data[i+1] * x^18) mod P;
    }
    crcs[0] = c;
    crcs[1] = y;
}

The next mathematical trick is pulling * x^18 and mod P out of the loop:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-18;
    polynomial y = crcs[1] * x^-18;
    for (size_t i = 0; i < n; i += 2) {
        c = c * x^10 + data[i];
        y = y * x^10 + data[i+1];
    }
    crcs[0] = (c * x^18) mod P;
    crcs[1] = (y * x^18) mod P;
}

At this point, a step back towards practical implementation is required, which takes the form of unrolling the loop 12 times:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-18;
    polynomial y = crcs[1] * x^-18;
    for (size_t i = 0; i < n; i += 24) {
        c = c * x^120 + pack120(data + i);
        y = y * x^120 + pack120(data + i + 1);
    }
    crcs[0] = (c * x^18) mod P;
    crcs[1] = (y * x^18) mod P;
}

polynomial pack120(const polynomial* data) { return pack60(data) * x^60 + pack60(data + 12); }

polynomial pack60(const polynomial* data) { polynomial result = 0; for (size_t i = 0; i < 12; i += 2) { result = result * x^10 + data[i]; } return result; }

The * x^60 in the middle of pack120 is slightly awkward, and would be more convenient as * x^64. This can be achieved by shuffling a few things around:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-14;
    polynomial y = crcs[1] * x^-14;
    for (size_t i = 0; i < n; i += 24) {
        c = c * x^120 + pack120(data + i);
        y = y * x^120 + pack120(data + i + 1);
    }
    crcs[0] = (c * x^14) mod P;
    crcs[1] = (y * x^14) mod P;
}

polynomial pack120(const polynomial* data) { return pack60(data) * x^64 + pack60(data + 12) * x^4; }

polynomial pack60(const polynomial* data) { polynomial result = 0; for (size_t i = 0; i < 12; i += 2) { result = result * x^10 + data[i]; } return result; }

To continue back towards a practical implementation, the size of each polynomial variable needs to be bounded. For that, I'll introduce the semi-mod operator, where for a polynomial F, (F * x^n) semi-mod P is defined as (F mod x^64) * (x^n mod P) + (F div x^64) * (x^(n+64) mod P). This is then tactically deployed in a few places:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = (crcs[0] * x^-14) semi-mod P;
    polynomial y = (crcs[1] * x^-14) semi-mod P;
    for (size_t i = 0; i < n; i += 24) {
        c = (c * x^120) semi-mod P;
        y = (y * x^120) semi-mod P;
        c = c + pack120(data + i);
        y = y + pack120(data + i + 1);
    }
    c = (c * x^14) semi-mod P;
    y = (y * x^14) semi-mod P;
    crcs[0] = c mod P;
    crcs[1] = y mod P;
}

polynomial pack120(const polynomial* data) { return pack60(data) * x^64 + pack60(data + 12) * x^4; }

polynomial pack60(const polynomial* data) { polynomial result = 0; for (size_t i = 0; i < 12; i += 2) { result = result * x^10 + data[i]; } return result; }

The next transformation is unrolling pack60:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = (crcs[0] * x^-14) semi-mod P;
    polynomial y = (crcs[1] * x^-14) semi-mod P;
    for (size_t i = 0; i < n; i += 24) {
        c = (c * x^120) semi-mod P;
        y = (y * x^120) semi-mod P;
        c = c + pack120(data + i);
        y = y + pack120(data + i + 1);
    }
    c = (c * x^14) semi-mod P;
    y = (y * x^14) semi-mod P;
    crcs[0] = c mod P;
    crcs[1] = y mod P;
}

polynomial pack120(const polynomial* data) { return pack60(data) * x^64 + pack60(data + 12) * x^4; }

polynomial pack60(const polynomial* data) { return data[ 0] * x^50 + data[ 2] * x^40 + data[ 4] * x^30 + data[ 6] * x^20 + data[ 8] * x^10 + data[10]; }

Moving back toward the practical realm, pack60 can return a 64-bit value. This value will be a polynomial in reversed bit order, meaning that the MSB represents x^0 and the LSB represents x^63. pack60 thus becomes:

uint64_t pack60(const uint16_t* data) {
    return (((uint64_t)data[ 0]) <<  4) ^
           (((uint64_t)data[ 2]) << 14) ^
           (((uint64_t)data[ 4]) << 24) ^
           (((uint64_t)data[ 6]) << 34) ^
           (((uint64_t)data[ 8]) << 44) ^
           (((uint64_t)data[10]) << 54);
}

Next up is pack120, which can return a 128-bit value. Again, this value will be a polynomial in reversed bit order, so the MSB represents x^0 and the LSB represents x^127. pack120 thus becomes:

__m128i pack120(const uint16_t* data) {
    return _mm_set_epi64x(pack60(data + 12) >> 4, pack60(data));
}

The c and y polynomials within the main function can also be 128-bit values, in the same reversed bit order. At this point, _mm_clmulepi64_si128 needs to be introduced, which can perform multiplication of two 64-bit polynomials, either in regular bit order or in reversed bit order. For polynomials F and G in regular bit order, it performs F * G with the result in regular order. For polynomials F and G in reversed bit order, it performs F * G * x^1 with the result in reversed order. Accordingly, _mm_clmulepi64_si128 can be used to implement the semi-mod operator. It can also be used to implement mod P. It is neither short nor readable, but the resultant code is:

__m128i xor_clmul(__m128i a, __m128i b) {
    return _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x00),
                         _mm_clmulepi64_si128(a, b, 0x11));
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        { // *= x^120 semi-mod P
            __m128i k = _mm_setr_epi32(
                0, 0x4b334000 /* x^120+64-1 mod P */,
                0, 0x96d30000 /* x^120-1    mod P */);
            c = xor_clmul(c, k);
            y = xor_clmul(y, k);
        }
        { // +=
            c = _mm_xor_si128(c, pack120(data + i));
            y = _mm_xor_si128(y, pack120(data + i + 1));
        }
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}

This version comes in at approximately 11 bits per cycle:

Version              | Broadwell | Skylake | Ice Lake
_mm_clmulepi64_si128 | 10.8      | 11.0    | 12.8
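
As an aside on the x^1 factor mentioned above, a minimal standalone sketch (the values are arbitrary illustrations, not CRC constants) shows where the extra x^1 comes from when both inputs are in reversed bit order:

#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

int main() {
    // In reversed bit order over 64 bits, bit 63 is x^0, so bit 60 is x^3
    // and bit 58 is x^5.
    __m128i f = _mm_set_epi64x(0, 1ull << 60); // x^3
    __m128i g = _mm_set_epi64x(0, 1ull << 58); // x^5
    __m128i p = _mm_clmulepi64_si128(f, g, 0x00);
    // The carry-less product has bit 118 set, i.e. bit 54 of the high half.
    // Read in reversed bit order over 128 bits (bit 127 is x^0), bit 118 is
    // x^9 = x^3 * x^5 * x^1, hence the extra factor of x^1.
    uint64_t hi = (uint64_t)_mm_extract_epi64(p, 1);
    printf("%d\n", hi == (1ull << 54)); // prints 1
    return 0;
}

(Build with -mpclmul -msse4.1, or just -march=native.)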

This is good, but we can do better still. At this point, all the mathematical tricks have been pulled, so the next avenue of exploration is vectorisation. In particular, there are four pack60 calls per iteration, and it should be possible to perform two calls at once by operating at 128-bit vector width:

__m128i pmovzxwq(const uint16_t* data) {
    __m128i out;
    asm("vpmovzxwq %1, %0" : "=x"(out) :
                              "m"(*(const uint32_t*)data));
    return out;
}

__m128i pack60x2(const uint16_t* data) {
    __m128i result;
    result  = _mm_slli_epi64(pmovzxwq(data     ),  4);
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 14);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 24);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 34);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 44);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 54);
    return result;
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i lo = pack60x2(data + i);
            __m128i hi = _mm_srli_epi64(pack60x2(data + i + 12), 4);
            c = _mm_xor_si128(c, _mm_unpacklo_epi64(lo, hi));
            y = _mm_xor_si128(y, _mm_unpackhi_epi64(lo, hi));
        }
...
}

This gives an extra 0.7 bits per cycle on Broadwell. On Skylake and Ice Lake, it improves things by more than 2 bits per cycle:

Version  | Broadwell | Skylake | Ice Lake
pack60x2 | 11.5      | 13.2    | 15.2

From here, a few shift instructions can be eliminated by having two variants of pack60x2:

__m128i pack60x2_4(const uint16_t* data) {
    __m128i result;
    result  = _mm_slli_epi64(pmovzxwq(data     ),  4);
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 14);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 24);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 34);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 44);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 54);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i result;
    result  =                pmovzxwq(data     );
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 10);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 20);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 30);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 40);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 50);
    return result;
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i lo = pack60x2_4(data + i);
            __m128i hi = pack60x2_0(data + i + 12);
            c = _mm_xor_si128(c, _mm_unpacklo_epi64(lo, hi));
            y = _mm_xor_si128(y, _mm_unpackhi_epi64(lo, hi));
        }
...
}

This makes basically no difference on Skylake, but gives a minor improvement on both Broadwell and Ice Lake:

Version                   | Broadwell | Skylake | Ice Lake
pack60x2_4 and pack60x2_0 | 12.6      | 13.2    | 15.9

At this point it is worth reviewing exactly which micro-ops are being executed per iteration of the loop (which is now ingesting 240 bits per iteration):

Instruction                  | Count | BDW μops | SKL μops | ICL μops
_mm_clmulepi64_si128 x, x, i | 4     | p0       | p5       | p5
pmovzxwq x, m                | 12    | p5 + p23 | p5 + p23 | p15 + p23
_mm_slli_epi64 x, i          | 11    | p0       | p01      | p01
_mm_xor_si128 x, x           | 14    | p015     | p015     | p015
_mm_unpack(lo|hi)_epi64 x, x | 2     | p5       | p5       | p15

On Skylake, we're bound by port 5 (which is why eliminating the shifts didn't help): 240 bits require 18 μops executed by port 5, and 240 bits divided by 18 cycles is 13.3 bits per cycle, which is pretty much the measured 13.2 bits per cycle. Things are slightly less neat on Broadwell and Ice Lake, as we're hitting suboptimal dispatch of μops to execution ports, causing us to be bound on a port even though clairvoyant dispatch could do slightly better. Broadwell could be described as being bound on port 0 (due to some xor μops being dispatched to port 0, even though it would be better to dispatch them all to port 1), and Ice Lake could be described as being bound on a mix of port 1 and port 5 (due to some xor μops not being dispatched to port 0).

To get past this bound, some work needs to be removed from port 5. Each pair of pmovzxwq and _mm_slli_epi64 causes a memory μop on port 2 or 3, a shuffle μop on port 5 (or port 1 on Ice Lake), and an arithmetic μop on port 0 (or port 1 on Skylake / Ice Lake). The shuffle μop can be removed by using a different pair of instructions: broadcastss (one memory μop) followed by _mm_madd_epi16 against a constant (one arithmetic μop). This different pair can only be used when the shift amount, modulo 32, is between 0 and 14 (inclusive), as the constant values in question are int16_t values. Half of the cases satisfy this constraint on the shift amount, and can thus be replaced:

__m128i broadcastss(const uint16_t* data) {
    __m128i result;
    asm("vbroadcastss %1, %0" : "=x"(result) :
                                 "m"(*(const uint32_t*)data));
    return result;
}

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result;
    result  = _mm_madd_epi16(broadcastss(data     ), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift14);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  4), 24);
    result ^= _mm_madd_epi16(broadcastss(data +  6), shift34);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift44);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 54);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result;
    result  =                   pmovzxwq(data     );
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift10);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  4), 20);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  6), 30);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift40);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 50);
    return result;
}

This gives a nice improvement across the board, getting us to at least 18 bits per cycle on Skylake and Ice Lake:

Version                      | Broadwell | Skylake | Ice Lake
broadcastss + _mm_madd_epi16 | 13.5      | 18.0    | 18.8

A few more pairs of pmovzxwq and _mm_slli_epi64 can be replaced with broadcastss and _mm_madd_epi16, albeit at the cost of one extra arithmetic μop for every two removed shuffle μops:

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result, hi;
    result  = _mm_madd_epi16(broadcastss(data     ), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift14);
    hi      = _mm_madd_epi16(broadcastss(data +  4), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  6), shift34);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift44);
    hi     ^= _mm_madd_epi16(broadcastss(data + 10), shift34);
    result ^= _mm_slli_epi64(hi, 20);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result, hi;
    result  =                   pmovzxwq(data     );
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift10);
    hi      =                   pmovzxwq(data +  4);
    hi     ^= _mm_madd_epi16(broadcastss(data +  6), shift10);
    result ^= _mm_slli_epi64(hi, 20);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift40);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 50);
    return result;
}

This is a minor regression on Broadwell (because at this point we're bound by port 0, and that is where the extra arithmetic μop goes), and a small improvement on Skylake and Ice Lake (where the bound is shuffle rather than arithmetic):

Version                           | Broadwell | Skylake | Ice Lake
More broadcastss + _mm_madd_epi16 | 13.2      | 18.8    | 19.7

To break through the barrier of 20 bits per cycle, AVX512's ternary bitwise instruction can be used to merge together pairs of xor instructions:

__m128i xor3(__m128i a, __m128i b, __m128i c) {
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result, hi;
    result = xor3(_mm_madd_epi16(broadcastss(data     ), shift4),
                  _mm_madd_epi16(broadcastss(data +  2), shift14),
                  _mm_madd_epi16(broadcastss(data +  6), shift34));
    hi     =      _mm_madd_epi16(broadcastss(data +  4), shift4)
           ^      _mm_madd_epi16(broadcastss(data + 10), shift34);
    result = xor3(result,
                  _mm_madd_epi16(broadcastss(data +  8), shift44),
                  _mm_slli_epi64(hi, 20));
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result, hi;
    hi     =                        pmovzxwq(data +  4)
           ^      _mm_madd_epi16(broadcastss(data +  6), shift10);
    result = xor3(                  pmovzxwq(data     ),
                  _mm_madd_epi16(broadcastss(data +  2), shift10),
                  _mm_slli_epi64(hi, 20));
    result = xor3(result,
                  _mm_madd_epi16(broadcastss(data +  8), shift40),
                  _mm_slli_epi64(   pmovzxwq(data + 10), 50));
    return result;
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        // *= x^120 semi-mod P
        // +=
        __m128i lo = pack60x2_4(data + i);
        __m128i hi = pack60x2_0(data + i + 12); 
        __m128i k = _mm_setr_epi32(
            0, 0x4b334000 /* x^120+64-1 mod P */,
            0, 0x96d30000 /* x^120-1    mod P */);
        c = xor3(_mm_clmulepi64_si128(c, k, 0x00),
                 _mm_clmulepi64_si128(c, k, 0x11),
                 _mm_unpacklo_epi64(lo, hi));
        y = xor3(_mm_clmulepi64_si128(y, k, 0x00),
                 _mm_clmulepi64_si128(y, k, 0x11),
                 _mm_unpackhi_epi64(lo, hi));
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}

AVX512 isn't available on Broadwell, nor on the client variant of Skylake, but where it is available it pushes us well above 20 bits per cycle:

Version                | Broadwell | Skylake | Ice Lake
_mm_ternarylogic_epi64 | N/A       | 21.9    | 23.5

To recap the journey, when measuring bits per cycle, we've got a 25x overall improvement on Broadwell, more than 40x overall improvement on Skylake, and nearly a 50x overall improvement on Ice Lake:

Version                           | Broadwell | Skylake | Ice Lake
Starting point                    | 0.536     | 0.528   | 0.477
2KiB table                        | 2.10      | 2.15    | 2.05
_mm_clmulepi64_si128              | 10.8      | 11.0    | 12.8
pack60x2                          | 11.5      | 13.2    | 15.2
pack60x2_4 and pack60x2_0         | 12.6      | 13.2    | 15.9
broadcastss + _mm_madd_epi16      | 13.5      | 18.0    | 18.8
More broadcastss + _mm_madd_epi16 | 13.2      | 18.8    | 19.7
_mm_ternarylogic_epi64            | N/A       | 21.9    | 23.5

Contrasting Intel AMX and Apple AMX

Intel has an x64 instruction set extension called AMX, while Apple has a totally different aarch64 instruction set extension that is also called AMX.

Register file

Intel AMX: 8 tmm registers, each a 16 by 16 matrix of 32-bit elements (technically, each can be configured to be any size - square or rectangular - between 1 by 1 and 16 by 16, though element size is fixed at 32 bits regardless). Total architectural state 8 kilobytes.

Apple AMX: Total architectural state 5 kilobytes, broken down as 8 x registers of 64 bytes each, 8 y registers of 64 bytes each, and 64 z registers of 64 bytes each.

The vectors have 8/16/32/64-bit elements, like regular SIMD registers. Note that Intel AMX does not need vector registers, as Intel already has AVX512 with 64-byte vectors (32 of which are in the AVX512 architectural register file).

Data types

Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.

Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction), though bf16 support was added on M2 and later.

Computational operations

Intel AMX: Matrix multiplication of any two tmm registers, accumulating onto a third tmm register. For the multiplication, matrix elements are themselves (very small) vectors, combined via dot product. This is the only operation. Viewed differently, this is doing 16×64 by 64×16 (int8) or 16×32 by 32×16 (bf16) matmul, then adding onto a 16×16 matrix.

Apple AMX: Outer product of any x register with any y register (viewed as a column vector), accumulating onto any (matrix view) z register. For the multiplication, x / y elements are scalars (depending on the data type, this might be viewed as doing 16×1 by 1×16 matmul then adding onto a 16×16 matrix). Alternatively, pointwise product of any x register with any y register (viewed as a row vector), accumulating onto any (vector view) z register. Many more operations as well, though the general theme is {outer or pointwise} {multiplication or addition or subtraction}, possibly followed by right-shift, possibly followed by integer saturation. The most exotic exceptions to the theme are min / max / popcnt.
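
To make the Intel side concrete, here is a minimal sketch of one such accumulation, C (16×16, int32) += A (16×64, int8) * B (64×16, int8), using the AMX intrinsics. The struct mirrors the tile configuration format, the tile numbers and the function name are arbitrary choices for this sketch, and it assumes a compiler given -mamx-tile -mamx-int8 plus an OS that has already granted AMX permission (on Linux, via arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)):

#include <stdint.h>
#include <string.h>
#include <immintrin.h>

typedef struct {
    uint8_t  palette_id;   // palette 1 is the only one currently defined
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];    // bytes per row, per tile register
    uint8_t  rows[16];     // rows, per tile register
} tile_config;

// C[16][16] (int32) += A[16][64] (int8) * B_packed (int8).
// B_packed is the 64x16 B matrix re-laid-out as 16 rows of 64 bytes, where
// row k holds, for each of the 16 columns, the four consecutive-in-K values
// B[4k..4k+3][column] - the layout the dot-product instruction expects.
void amx_matmul_16x64x16(int32_t C[16][16],
                         const int8_t A[16][64],
                         const int8_t B_packed[16][64]) {
    _Alignas(64) tile_config cfg;
    memset(&cfg, 0, sizeof(cfg));
    cfg.palette_id = 1;
    for (int t = 0; t < 3; t++) { cfg.colsb[t] = 64; cfg.rows[t] = 16; }
    _tile_loadconfig(&cfg);

    _tile_loadd(0, C, 64);        // tmm0 = accumulator (16 rows of 64 bytes)
    _tile_loadd(1, A, 64);        // tmm1 = A
    _tile_loadd(2, B_packed, 64); // tmm2 = B (packed)
    _tile_dpbssd(0, 1, 2);        // tmm0 += tmm1 * tmm2 (4-way int8 dot products)
    _tile_stored(0, C, 64);
    _tile_release();
}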

Memory operations

Intel AMX: Matrix load or store (up to 1 kilobyte), configurable with a base address (register + immediate offset) and a row stride (register or zero, optionally shifted left by 1-3 bits).

Apple AMX: Vector load or store (64 bytes), configurable with a base address (register). Also load or store pair (128 bytes), though the two registers must be consecutive, and the row stride is fixed at 64 bytes, and the base address must be 128-byte aligned. Loads or stores with y effectively give a free vector transpose, as y registers can be viewed as column vectors.

Masking modes

Intel AMX: Each tmm register can be configured to any size - square or rectangular - between 1 by 1 and 16 by 16. This is (mostly) equivalent to saying that a tmm register is always 16 by 16, but has an associated mask on each dimension to only enable some number of leading rows and columns. These per-register masks are architectural state.

Apple AMX: Per-dimension masking is available on a per-instruction basis (though notably not for loads / stores). Available masking modes are: all rows (columns), even/odd rows (columns) only, first N rows (columns) only, last N rows (columns) only, row (column) N only.

Note that both of these approaches are different to Intel's AVX512 approach, which is a separate register file containing 8 mask registers (k0 through k7) and every operation optionally taking a mask register as an input.

Other

Apple AMX contains a very interesting instruction called genlut. In the forward direction ("lookup"), it is somewhere between AVX512's vpshufb and vgatherps. In the backward direction ("generate") it is something like an inverse vpshufb, performing arbitrary 2/3/4/5-bit quantisation. When used in both directions, it can be useful for piecewise linear interpolation, or as an alternative to AVX512's vfpclassps / vfixupimmps.

Computing parity using CRC

It is a party trick rather than anything practically useful, but the following four functions all compute the same thing:

uint8_t parity_b(const uint64_t* data, size_t n) {
    uint8_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < 64; ++j) {
            if (data[i] & (1ull << j)) acc ^= 1;
        }
    }
    return acc;
}
uint8_t parity_p(const uint64_t* data, size_t n) {
    uint8_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += _mm_popcnt_u64(data[i]);
    }
    return acc & 1;
}
uint8_t parity_x(const uint64_t* data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc ^= data[i];
    }
    return _mm_popcnt_u64(acc) & 1;
}
uint8_t parity_c(const uint64_t* data, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = _mm_crc32_u64(acc, data[i]);
    }
    return _mm_popcnt_u64(acc) & 1;
}

Going from parity_b to parity_p is an easy step: replace the inner loop with a popcnt. Going from parity_p to parity_x is also an easy step: lift the popcnt out of the loop. Hence it is not too surprising that parity_b and parity_p and parity_x all compute the same thing. The parity_c function is like parity_x, except replacing ^= with crc, and the surprising thing is that parity_c computes the same thing as all the other functions. To understand why, we need to look at the polynomial used by the crc instruction (the CRC32-C polynomial):

x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 + x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1

This polynomial can be expressed as the product of two separate polynomials:

(x + 1)(x^31 + x^30 + x^29 + x^28 + x^26 + x^24 + x^23 + x^21 + x^20 + x^18 + x^13 + x^10 + x^8 + x^5 + x^4 + x^3 + x^2 + x^1 + 1)
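
As a quick check of this factorisation, carry-less multiplying the two factors reproduces the CRC32-C polynomial. In the sketch below, each constant encodes a polynomial with bit i holding the coefficient of x^i:

#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

int main() {
    __m128i a = _mm_set_epi64x(0, 0x3);        // x + 1
    __m128i b = _mm_set_epi64x(0, 0xf5b4253f); // the degree-31 factor above
    __m128i p = _mm_clmulepi64_si128(a, b, 0x00);
    uint64_t prod = (uint64_t)_mm_cvtsi128_si64(p);
    printf("0x%llx\n", (unsigned long long)prod); // prints 0x11edc6f41
    return 0;
}

The printed value is the polynomial from above, with its x^32 term as the leading 1.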

Accordingly, CRC32-C isn't really a 32-bit checksum; it is really two separate things packed into 32 bits: a 1-bit checksum and a 31-bit checksum. The method of packing is slightly funky, and thus the method of unpacking is also slightly funky: from a mathematical point of view, to get the 1-bit checksum out of the 32-bit value, we need to view the 32-bit value as a ℤ2 polynomial and then compute the remainder modulo x + 1. This is equivalent to iteratively applying the reduction rule x = 1 until nothing more can be reduced. If we were to start with the 4-bit polynomial b3 * x^3 + b2 * x^2 + b1 * x + b0 and iteratively apply x = 1, we'd get b3 * 1*1*1 + b2 * 1*1 + b1 * 1 + b0, which is just b3 + b2 + b1 + b0, which is just another way of saying popcnt (mod 2). The same thing is true starting with a 32-bit polynomial: the remainder modulo x + 1 is just popcnt (mod 2). In other words, we can use popcnt to unpack the 1-bit checksum out of the 32-bit value, and a 1-bit checksum is just parity. Hence why parity_c computes the same thing as all the other functions.

This combination of a 1-bit checksum and a 31-bit checksum can be seen as a good thing for checksumming purposes, as the two parts protect against different problems. Notably, every bit-flip in the data being checksummed will invert the 1-bit checksum, and hence any odd number of bit-flips will cause the 1-bit checksum to mismatch. The (ahem) flip-side is that the 1-bit checksum will match for any even number of bit-flips, which is where the 31-bit checksum comes in: it needs to protect against as many of those even-numbered cases as possible.
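
For completeness, a small harness along these lines (assuming the four functions above are in scope, and compiling with -msse4.2 or -march=native) confirms that parity_c agrees with parity_x on arbitrary inputs:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    uint64_t data[64];
    for (int trial = 0; trial < 1000; trial++) {
        for (size_t i = 0; i < 64; i++) {
            data[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        }
        if (parity_x(data, 64) != parity_c(data, 64)) {
            printf("mismatch on trial %d\n", trial);
            return 1;
        }
    }
    printf("all trials agree\n");
    return 0;
}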

Polynomial remainders by direct computation

Last time, we had a need to compute mod_P(A * x^32) where A was a polynomial with degree at most 31, and mod_P took the remainder after dividing by a polynomial P with degree 32. The traditional Barrett reduction approach involved finding the quotient and then calculating the remainder from that:

Q = ((A * x^32) * (x^63 // P)) // x^63
R =  (A * x^32) - Q * P
return R

Via a combination of the * x^32, and the degree of P being 32, and the context being ℤ2 polynomials, this simplified to:

Q = ((A * x^32) * (x^63 // P)) // x^63
return (Q * P)[0:32]

The resultant C code (for bit-reversed polynomials) was very short:

uint32_t mul_x32_mod_P(uint32_t A) {
    __m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi32_si128(A);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}

Though our particular case permitted this simplification, it is not permitted in general. In Faster Remainder by Direct Computation (D. Lemire, O. Kaser, and N. Kurz, 2018), a different approach was presented: the remainder can be computed directly, rather than computing the quotient and getting the remainder from that. Said paper is about integers, and its main Theorem 1 is effectively:

Provided that d in [1, 2^N) and n in [0, 2^N) and F ≥ N + log2(d), then:
n % d equals (((n * ceil(2^F / d)) % 2^F) * d) // 2^F.
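
To make the integer version concrete, here is a minimal sketch with N = 32 and F = 64, so that F ≥ N + log2(d) holds for any 32-bit d and the % 2^F and // 2^F steps come for free; it relies on the unsigned 128-bit type that GCC and Clang provide, and the helper names are just for this sketch:

#include <stdint.h>

// c = ceil(2^64 / d), precomputed once per divisor d (1 < d < 2^32).
uint64_t precompute_c(uint32_t d) {
    return UINT64_MAX / d + 1;
}

// n % d, computed directly as (((n * c) % 2^64) * d) // 2^64.
uint32_t direct_mod(uint32_t n, uint64_t c, uint32_t d) {
    uint64_t lowbits = c * (uint64_t)n;                  // (n * ceil(2^F / d)) % 2^F
    return (uint32_t)(((__uint128_t)lowbits * d) >> 64); // * d, then // 2^F
}

For example, with d = 10, precompute_c gives 0x199999999999999a, and direct_mod(13, 0x199999999999999aull, 10) returns 3.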

Translated from the language of integers to the language of polynomials, this theorem would be:

Provided that S, P, T are polynomials satisfying deg((S * P) // T) ≤ 0, then:
S % P equals (((S * (T // P)) % T) * P) // T.

This lets us turn % P into multiplication by T // P followed by % T followed by * P followed by // T. The % T and // T steps are trivial if T is chosen to be x^n (with n ≥ deg(S) + deg(P) to make the precondition hold), and T // P can be precomputed if P is fixed.

It turns out that the translated theorem is true, and that the proof isn't too hard. The proof builds upon the Barrett reduction property, which is:

Provided that deg(S // T) ≤ 0, S // P equals (S * (T // P)) // T.

We start with the complicated side of our (supposed) equality:

(((S * (T // P)) % T) * P) // T

We're going to be overwhelmed by parentheses if we're not careful, so I'm going to say that a sequence of mixed * and % and // operations always happens left-to-right, thus allowing some of the parentheses to be dropped:

S * (T // P) % T * P // T

From here, the % can be expanded: (F % G equals F - (F // G * G))

((S * (T // P)) - (S * (T // P) // T * T)) * P // T

Then push the - to the top level:

(S * (T // P) * P // T) - (S * (T // P) // T * T * P // T)

Making use of our deg((S * P) // T) ≤ 0 precondition, by the Barrett reduction property, (S * P) // P equals (S * P) * (T // P) // T:

(S * P // P) - (S * (T // P) // T * T * P // T)

We can then simplify away * G // G for any G:

S - (S * (T // P) // T * P)

Our deg((S * P) // T) ≤ 0 precondition implies deg(S // T) ≤ 0, so by the Barrett reduction property, S // P equals S * (T // P) // T:

S - (S // P * P)

Which is the definition of:

S % P

Thus our supposed equality is an actual equality.
