What even is a pidfd anyway?

In recent versions of the Linux kernel, a pidfd is a special type of file that holds a reference to a process. Notably, a pidfd allows for certain process-related operations to be performed in a race-free manner, and it allows poll / select / epoll to be used to detect process termination.

Before you get too excited:

A pidfd does not let you hold a reference to an individual thread, only to a process (or in kernel terminology, a thread group leader).
A pidfd does not hold a reference to a pid number, nor does holding open a pidfd prevent the pid number of the underlying process from being reused.
A pidfd cannot circumvent the at-most-once semantics of retrieving the exit code / status of a process via wait / waitpid / waitid.
Closing a pidfd does not terminate the underlying process.

There are various ways of obtaining a pidfd:

Kernel version	glibc version	Function
5.2	2.2.5 / 2.31	`clone` with `CLONE_PIDFD` flag
5.3	N/A	`clone3` with `CLONE_PIDFD` flag
5.3 / 5.10	2.36	`pidfd_open`
5.4	2.39	`pidfd_spawn` / `pidfd_spawnp`
6.5	2.2.5 / N/A	`getsockopt` with `SO_PEERPIDFD` optname
6.5	2.2.5 / 2.39	`recvmsg` with `SCM_PIDFD` cmsg_type

Once you have a pidfd, there are a bunch of things you can do with it:

Kernel version	glibc version	Function
5.1	2.36	`pidfd_send_signal`
5.2 / 5.5	2.39	`pidfd_getpid`
5.3	2.2.5 / 2.3.2	`poll` / `select` / `epoll`
5.4	2.2.5 / 2.36	`waitid` with `P_PIDFD` mode
5.6	2.36	`pidfd_getfd`
5.8	2.14	`setns`
5.10 / 5.12	2.36	`process_madvise`
5.15	2.36	`process_mrelease`
6.9	2.2.5 / 2.28	`fstat` / `statx` for meaningful `stx_ino`

Some of the subsequent text refers to a process being alive or zombie or dead. These terms come from the usual lifecycle of a unix process: it is initially alive, then transitions to zombie when it terminates, and then transitions to dead once it is waited upon. As a quick summary of the states:

	Alive	Zombie	Dead
Can execute code and receive signals	✅	❌	❌
Has pid number	✅	✅	❌
Exit code / status retrievable	❌	✅	❌
pidfd polls as readable	❌	✅	✅
Cleaned up by kernel	❌	❌	✅

clone with CLONE_PIDFD flag

Available since: kernel 5.2, glibc 2.31 (or glibc 2.2.5 if you provide your own definition of CLONE_PIDFD; its value is 0x1000).

If the CLONE_PIDFD flag is specified, then clone returns a freshly allocated pidfd referring to the child (in addition to returning the pid number of the child). The O_CLOEXEC flag is automatically set on the returned pidfd. Note that if CLONE_PIDFD is specified, then CLONE_THREAD cannot be specified, nor can CLONE_DETACHED. Furthermore, if CLONE_PIDFD is specified, then CLONE_PARENT_SETTID cannot be specified (unless using clone3).

One of the arguments to clone is the signal number that the child will send to its parent when the child terminates. Setting this to anything other than SIGCHLD has several consequences:

Calls to wait do not recognise the child.
Calls to waitpid / waitid only recognise the child if the __WALL or __WCLONE option is passed (this is true even for P_PIDFD calls).
The child will always transition to the zombie state upon termination and stay there until waited upon, even if the parent's SIGCHLD handler is SIG_IGN or has SA_NOCLDWAIT.
When the child terminates, a signal other than SIGCHLD will be sent to the parent (or no signal will be sent if the termination signal is set to zero).

Note that if the child calls execve (or a similar exec function), then the termination signal number is reset to SIGCHLD, and the above points stop applying.

clone3 with CLONE_PIDFD flag

Available since: kernel 5.3, no glibc wrapper.

This function is just a more extensible version of clone; everything written above about clone applies equally to clone3.
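As a rough sketch of what this looks like in practice, the following uses clone3 through the raw syscall interface (since there is no glibc wrapper) to create a fork-like child and receive a pidfd for it. The helper names spawn_child_with_pidfd and clone3_pidfd_demo are made up for illustration, the fallback syscall number is an assumption for old headers, and struct clone_args is assumed to come from reasonably recent kernel headers.

```c
#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args, CLONE_PIDFD */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435     /* assumption: generic syscall number, for old headers */
#endif

/* Hypothetical helper: fork-like clone3 that also yields a pidfd.
 * In the parent, returns the child's pid and stores the pidfd in *pidfd_out;
 * returns -1 with errno set on failure. The child exits straight away. */
long spawn_child_with_pidfd(int exit_code, int *pidfd_out)
{
    struct clone_args args;
    memset(&args, 0, sizeof args);
    args.flags = CLONE_PIDFD;                     /* ask for a pidfd */
    args.pidfd = (uint64_t)(uintptr_t)pidfd_out;  /* where the kernel stores it */
    args.exit_signal = SIGCHLD;                   /* keep ordinary wait semantics */

    long pid = syscall(SYS_clone3, &args, sizeof args);
    if (pid == 0)
        _exit(exit_code);                         /* child */
    return pid;                                   /* parent: pid, or -1 on error */
}

/* Spawns a child that exits with code 7, reaps it, and returns the exit code
 * seen by the parent, or -1 if clone3 is unavailable or the wait fails. */
int clone3_pidfd_demo(void)
{
    int pidfd = -1, status = 0;
    long pid = spawn_child_with_pidfd(7, &pidfd);
    if (pid < 0)
        return -1;
    if (waitpid((pid_t)pid, &status, 0) != (pid_t)pid)
        status = -1;                              /* sentinel: wait failed */
    close(pidfd);
    return (status != -1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Because exit_signal is SIGCHLD here, plain waitpid recognises the child; picking a different termination signal would bring the consequences listed in the clone section into play.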

pidfd_open

Available since: kernel 5.3, glibc 2.36.

This function takes a pid number (in the pid namespace of the caller), and returns a freshly allocated pidfd referring to said process (or an error if said process does not exist). It is inherently racy, unless the pid number being passed is the result of getpid (i.e. creating a pidfd referring to your own process).

Since kernel 5.10, the PIDFD_NONBLOCK flag can be passed to pidfd_open, which affects subsequent waitid calls. No other flags are valid to pass. The O_CLOEXEC flag is automatically set on the returned pidfd.
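For illustration, here is a minimal sketch that calls pidfd_open through the raw syscall interface, so that it does not depend on glibc 2.36. The names sys_pidfd_open and has_cloexec are hypothetical, and the fallback syscall number is an assumption for old headers.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: generic syscall number, for old headers */
#endif

/* Raw-syscall stand-in for the glibc 2.36 wrapper.
 * Returns a pidfd, or -1 with errno set. */
int sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Returns 1 if fd has close-on-exec set, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int fl = fcntl(fd, F_GETFD);
    return fl < 0 ? -1 : (fl & FD_CLOEXEC) != 0;
}
```

Calling sys_pidfd_open(getpid(), 0) is the race-free case described above, and has_cloexec on the result should report 1.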

pidfd_spawn / pidfd_spawnp

Available since: kernel 5.4, glibc 2.39.

These functions are like posix_spawn / posix_spawnp, except that they have an int* output parameter for a freshly allocated pidfd instead of a pid_t* output parameter for a pid number. The O_CLOEXEC flag is automatically set on the returned pidfd.

In glibc 2.39, bug BZ#31695 causes these functions to leak a file descriptor in some error scenarios. This will hopefully be fixed in 2.40.

getsockopt with SO_PEERPIDFD optname

Available since: kernel 6.5, glibc 2.2.5 for getsockopt. The definition of SO_PEERPIDFD is not tied to a particular glibc version; its value is 77 should you need to provide your own definition of it.

SO_PEERPIDFD is the pidfd version of SO_PEERCRED. For a unix socket created via socketpair, SO_PEERPIDFD gives a pidfd referring to the process that called socketpair. For a connected unix stream socket, it gives a pidfd referring to the process that called connect (if called on the server end of the socket) or the process that called listen (if called on the client end of the socket). The O_CLOEXEC flag is automatically set on the returned pidfd.
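A small sketch of the getsockopt call, assuming the option value is an int that receives the new pidfd. The helper name peer_pidfd is made up, and the fallback definition uses the value 77 quoted above.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77   /* value quoted in the text above */
#endif

/* Hypothetical helper: returns a pidfd for the peer of a unix socket,
 * or -1 with errno set (as happens on kernels older than 6.5). */
int peer_pidfd(int sock)
{
    int pidfd = -1;
    socklen_t len = sizeof pidfd;
    if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) != 0)
        return -1;
    return pidfd;
}
```

With a socketpair, calling peer_pidfd on either end should yield a pidfd for the process that called socketpair, which the caller then owns and must close.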

recvmsg with SCM_PIDFD cmsg_type

Available since: kernel 6.5, glibc 2.39 (or glibc 2.2.5 if you provide your own definition of SCM_PIDFD; its value is 0x04).

SCM_PIDFD is the pidfd version of (the pid part of) SCM_CREDENTIALS. If the receiver sets SO_PASSPIDFD on a unix socket (cf. setting SO_PASSCRED), then it'll receive an SCM_PIDFD cmsg as part of receiving a message, with the associated cmsg data being a freshly allocated pidfd referring to the process of the sender of the message (or some other process if the sender has CAP_SYS_ADMIN and specifies a pid number other than itself as part of its SCM_CREDENTIALS). The O_CLOEXEC flag is automatically set on the pidfd.

pidfd_send_signal

Available since: kernel 5.1, glibc 2.36.

This function is similar to kill / rt_sigqueueinfo: it sends a signal to a process. It differs from these functions in that the destination is given as a pidfd rather than as a pid number.

This function also accepts the result of open("/proc/$pid") as an fd, though it is the only function to do so: open("/proc/$pid") does not give a pidfd, and no other functions accept the result of open("/proc/$pid") in place of a pidfd.
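To make this concrete, here is a sketch that forks a child, opens a pidfd for it, and kills it through the pidfd. The function name terminate_child_via_pidfd is hypothetical, both syscalls go through the raw interface with assumed fallback numbers, and SIGKILL is chosen only because it cannot be blocked or ignored, which keeps the demo from hanging.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434          /* assumption: generic syscall number */
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424   /* assumption: generic syscall number */
#endif

/* Forks a child that idles, kills it through a pidfd, and returns the signal
 * that terminated it, or -1 if either pidfd syscall was unavailable. */
int terminate_child_via_pidfd(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                /* child: idle until signalled */
    }

    /* Not racy here: we are the parent and have not yet reaped the child,
     * so its pid number cannot have been reused. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    int rc = -1;
    if (pidfd >= 0)
        rc = (int)syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
    if (rc != 0)
        kill(pid, SIGKILL);         /* fallback so the child never lingers */
    if (pidfd >= 0)
        close(pidfd);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (rc == 0 && WIFSIGNALED(status)) ? WTERMSIG(status) : -1;
}
```

Passing NULL for the siginfo argument asks the kernel to fill it in the same way kill would.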

pidfd_getpid

Available since: kernel 5.2, glibc 2.39.

This function is the inverse of pidfd_open: given a pidfd, it returns the pid number associated with the underlying process. This function requires that /proc be mounted, and returns the pid number in the pid namespace associated with the mounted /proc. Note that the pid number can be reused for a different process once the underlying process is dead.

Changed in kernel 5.5: if the process referenced by the pidfd is dead, this function returns -1 (prior to 5.5, it returned whatever pid number the process had prior to its death).

Note that this is not a direct system call; instead it opens /proc/self/fdinfo/$pidfd and parses the Pid: line therein.
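The approach described above can be approximated in a few lines. This is only a sketch of the idea, not glibc's actual implementation; pid_from_fdinfo and self_pid_roundtrip are hypothetical names, and the fallback syscall number is an assumption.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: generic syscall number, for old headers */
#endif

/* Reads the "Pid:" line of /proc/self/fdinfo/<pidfd>. Returns the pid number,
 * or -1 if the file cannot be read or has no Pid: line. */
long pid_from_fdinfo(int pidfd)
{
    char path[64], line[256];
    long pid = -1;

    snprintf(path, sizeof path, "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* Only a line starting with "Pid:" matches; "NSpid:" does not. */
        if (sscanf(line, "Pid: %ld", &pid) == 1)
            break;
    }
    fclose(f);
    return pid;
}

/* Opens a pidfd for the calling process and maps it back to a pid number.
 * Returns that pid, or -1 if pidfd_open or /proc is unavailable. */
long self_pid_roundtrip(void)
{
    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        return -1;
    long pid = pid_from_fdinfo(pidfd);
    close(pidfd);
    return pid;
}
```

When /proc is mounted for the caller's own pid namespace, self_pid_roundtrip should return the same number as getpid.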

poll / select / epoll

Available since: kernel 5.3, glibc 2.2.5 (poll / select) or glibc 2.3.2 (epoll).

These functions can be used to asynchronously monitor a pidfd. They will report the pidfd as readable iff the underlying process is a zombie or is dead. Note however that read on a pidfd always fails; to get the exit code / status of the process, use waitid (possibly with WNOHANG).
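As a sketch of the poll case: the helper below waits, with a timeout, for a pidfd to become readable. The names pidfd_wait_exit and poll_child_exit_demo are invented for this example, and pidfd_open is reached through the raw syscall interface with an assumed fallback number.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: generic syscall number, for old headers */
#endif

/* Returns 1 once the process behind pidfd is a zombie or dead, 0 if it is
 * still alive after timeout_ms, and -1 on error or an invalid fd. */
int pidfd_wait_exit(int pidfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0 || (pfd.revents & POLLNVAL))
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Forks a child that exits at once and polls its pidfd for termination.
 * Returns the pidfd_wait_exit result, or -1 if pidfd_open is unavailable. */
int poll_child_exit_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                         /* child */

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    int r = pidfd >= 0 ? pidfd_wait_exit(pidfd, 5000) : -1;
    if (pidfd >= 0)
        close(pidfd);
    waitpid(pid, NULL, 0);                /* reap: zombie -> dead */
    return r;
}
```

The same struct pollfd could sit alongside sockets and pipes in an existing event loop, which is the main attraction over a blocking wait.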

waitid with P_PIDFD mode

Available since: kernel 5.4, glibc 2.36 (or glibc 2.2.5 if you provide your own definition of P_PIDFD; its value is 3).

waitid(P_PIDFD, fd, infop, options) is identical to waitid(P_PID, pidfd_getpid(fd), infop, options), except for the following:

The embedded pidfd_getpid call is done atomically as part of waitid; there is no race condition.
The embedded pidfd_getpid call does not require /proc to be mounted.
If the pidfd was opened with the PIDFD_NONBLOCK flag, and options does not contain WNOHANG, and the process referenced by the pidfd is alive, then waitid will fail with EAGAIN rather than blocking. Note that if options does contain WNOHANG, then PIDFD_NONBLOCK has no effect: if the process referenced by the pidfd is alive, then waitid will succeed with result 0 rather than blocking.

In particular, note that:

Waiting on a zombie process will retrieve the exit code / status (in si_code / si_status), and transition the process from zombie to dead. The si_signo, si_errno, si_pid, and si_uid fields will also be set.
Waiting on a dead process will fail with ECHILD.

The above points are true for all waitid calls, including P_PIDFD calls. The first time a zombie is waited upon (by any kind of wait / waitpid / waitid call), the exit code / status is retrieved; subsequent attempts to wait upon it (again by any kind of wait / waitpid / waitid call) will fail.

When a process transitions from alive to zombie, if that process's parent's SIGCHLD handler is SIG_IGN or has SA_NOCLDWAIT, then the kernel does an automatic wait call on behalf of the parent and discards the result, thereby transitioning the child onward from zombie to dead. This causes all attempts to wait upon the child (including via P_PIDFD) to fail. The only exception to this is if the child was created with clone or clone3, and the termination signal was specified as something other than SIGCHLD, and the child has not called execve or similar: given this combination of circumstances, the automatic wait call will not recognise the child.

pidfd_getfd

Available since: kernel 5.6, glibc 2.36.

This function takes a pidfd, along with an fd number in the file table of the process referenced by the pidfd, creates a duplicate of that file descriptor in the file table of the calling process, and returns the new fd number. The effect is similar to what would happen if the referenced process used an SCM_RIGHTS message to send a file descriptor to the calling process. The O_CLOEXEC flag is automatically set on the new fd.

Calling this function incurs a PTRACE_MODE_ATTACH_REALCREDS security check.
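Here is a self-contained sketch of the call. To avoid needing a second cooperating process, it points the pidfd at the calling process itself, which I would expect to pass the security check; a realistic use would target a different process and need the corresponding ptrace permission. The names sys_pidfd_getfd and pidfd_getfd_demo are hypothetical, and the fallback syscall numbers are assumptions.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434    /* assumption: generic syscall number */
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438   /* assumption: generic syscall number */
#endif

/* Duplicates target_fd from the process behind pidfd into the caller.
 * The trailing flags argument must currently be zero. */
int sys_pidfd_getfd(int pidfd, int target_fd)
{
    return (int)syscall(SYS_pidfd_getfd, pidfd, target_fd, 0u);
}

/* Duplicates the read end of a pipe via a pidfd for the calling process, then
 * checks that a byte written to the pipe can be read from the duplicate.
 * Returns 1 on success, 0 on mismatch, -1 if the syscalls are unavailable. */
int pidfd_getfd_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    int result = -1;
    int self = (int)syscall(SYS_pidfd_open, getpid(), 0);
    int dup_fd = self >= 0 ? sys_pidfd_getfd(self, p[0]) : -1;
    if (dup_fd >= 0) {
        char c = 0;
        result = write(p[1], "x", 1) == 1 &&
                 read(dup_fd, &c, 1) == 1 && c == 'x';
        close(dup_fd);
    }
    if (self >= 0)
        close(self);
    close(p[0]);
    close(p[1]);
    return result;
}
```

The duplicate shares the open file description with the original, just as an fd received over SCM_RIGHTS would.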

setns

Available since: kernel 5.8, glibc 2.14.

Passing a pidfd to this function moves the caller into one or more of the namespaces that the process referenced by the pidfd is in. Note that this function can also be passed the result of open("/proc/$pid/ns/$name") as an fd.

process_madvise

Available since: kernel 5.10, glibc 2.36.

This function is similar to madvise, except that it operates on an arbitrary process (specified via a pidfd) rather than on the calling process.

Since 5.12, calling this function incurs PTRACE_MODE_READ_FSCREDS and CAP_SYS_NICE security checks. In 5.10 and 5.11, it incurred a PTRACE_MODE_ATTACH_FSCREDS security check.

process_mrelease

Available since: kernel 5.15, glibc 2.36.

This is a relatively niche function, which you are unlikely to ever need unless writing a userspace OOM killer. It can be called against a process which is no longer alive, but hasn't yet had its virtual memory released by the kernel, to cause the kernel to release said virtual memory faster.

fstat / statx for meaningful stx_ino

Available since: kernel 6.9, glibc 2.2.5 (fstat) or glibc 2.28 (statx).

It has always been possible to call fstat or statx on a pidfd, but prior to kernel 6.9, it was not useful to do so. Since 6.9, calling statx on a pidfd gives a meaningful stx_ino: the 64-bit inode number of a pidfd uniquely identifies a process, so two pidfds referencing the same process will have identical stx_ino values, while two pidfds referencing different processes will have different stx_ino values. The same is true for fstat, provided that st_ino is 64 bits wide. In other words, since 6.9, a process's inode number (as observed via a pidfd) is a unique 64-bit identifier for the process, which is never reused (until the system is restarted), and is unique even across different pid namespaces.
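A minimal sketch of using this to compare two pidfds, assuming a build in which st_ino is 64 bits wide. The helper names are made up, and on kernels older than 6.9 the comparison carries no useful information.

```c
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: generic syscall number, for old headers */
#endif

/* Raw-syscall stand-in for pidfd_open with no flags. */
int open_pidfd(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Returns 1 if both pidfds report the same inode number, 0 if they differ,
 * and -1 if fstat fails. Assumes kernel 6.9+ and a 64-bit st_ino; on older
 * kernels the answer does not tell you anything about process identity. */
int pidfd_same_process(int a, int b)
{
    struct stat sa, sb;
    if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
        return -1;
    return sa.st_ino == sb.st_ino;
}
```

Two pidfds obtained from separate open_pidfd(getpid()) calls should compare as the same process.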

It is likely that future kernel versions will add more things that can be done with (or to) a pidfd. As for the existing functionality, if you find yourself constrained by glibc version rather than kernel version, one option is to compile against a very recent glibc, then use polyfill-glibc to restore runtime compatibility with an older version of glibc.

In terms of future directions, some of the things that I'd like to see are:

The ability for a pidfd to obtain the exit code and status of dead processes, not just zombie processes (cf. GetExitCodeProcess in Windows).
The ability to mark a process as transitioning directly from alive to dead, without sitting in the zombie state until someone waits upon it. This would be similar to SA_NOCLDWAIT, but as a property of the child rather than a property of the parent. Combined with the previous point, the exit code and status would still be retrievable (by any holder of a relevant pidfd).
Subject to a flag, closing a pidfd could cause the underlying process (and possibly all of its transitive descendants) to be terminated by the kernel.
pidfd variants of process_vm_readv and process_vm_writev.