What even is a pidfd anyway?
In recent versions of the Linux kernel, a pidfd is a special type of file that holds a reference to a process. Notably, a pidfd allows for certain process-related operations to be performed in a race-free manner, and it allows poll / select / epoll to be used to detect process termination.
Before you get too excited:
- A pidfd does not let you hold a reference to an individual thread, only to a process (or in kernel terminology, a thread group leader).
- A pidfd does not hold a reference to a pid number, nor does holding open a pidfd prevent the pid number of the underlying process from being reused.
- A pidfd cannot circumvent the at-most-once semantics of retrieving the exit code / status of a process via wait / waitpid / waitid.
- Closing a pidfd does not terminate the underlying process.
There are various ways of obtaining a pidfd:
| Kernel version | glibc version | Function |
|---|---|---|
| 5.2 | 2.2.5 / 2.31 | clone with CLONE_PIDFD flag |
| 5.3 | N/A | clone3 with CLONE_PIDFD flag |
| 5.3 / 5.10 | 2.36 | pidfd_open |
| 5.4 | 2.39 | pidfd_spawn / pidfd_spawnp |
| 6.5 | 2.2.5 / N/A | getsockopt with SO_PEERPIDFD optname |
| 6.5 | 2.2.5 / 2.39 | recvmsg with SCM_PIDFD cmsg_type |
Once you have a pidfd, there are a bunch of things you can do with it:
| Kernel version | glibc version | Function |
|---|---|---|
| 5.1 | 2.36 | pidfd_send_signal |
| 5.2 / 5.5 | 2.39 | pidfd_getpid |
| 5.3 | 2.2.5 / 2.3.2 | poll / select / epoll |
| 5.4 | 2.2.5 / 2.36 | waitid with P_PIDFD mode |
| 5.6 | 2.36 | pidfd_getfd |
| 5.8 | 2.14 | setns |
| 5.10 / 5.12 | 2.36 | process_madvise |
| 5.15 | 2.36 | process_mrelease |
| 6.9 | 2.2.5 / 2.28 | fstat / statx for meaningful stx_ino |
Some of the subsequent text refers to a process being alive or zombie or dead. These terms come from the usual lifecycle of a unix process: it is initially alive, then transitions to zombie when it terminates, and then transitions to dead once it is waited upon. As a quick summary of the states:
| | Alive | Zombie | Dead |
|---|---|---|---|
| Can execute code and receive signals | ✅ | ❌ | ❌ |
| Has pid number | ✅ | ✅ | ❌ |
| Exit code / status retrievable | ❌ | ✅ | ❌ |
| pidfd polls as readable | ❌ | ✅ | ✅ |
| Cleaned up by kernel | ❌ | ❌ | ✅ |
clone with CLONE_PIDFD flag
Available since: kernel 5.2, glibc 2.31 (or glibc 2.2.5 if you provide your own definition of CLONE_PIDFD; its value is 0x1000).
If the CLONE_PIDFD flag is specified, then clone returns a freshly allocated pidfd referring to the child (in addition to returning the pid number of the child). The O_CLOEXEC flag is automatically set on the returned pidfd. Note that if CLONE_PIDFD is specified, then CLONE_THREAD cannot be specified, nor can CLONE_DETACHED. Furthermore, if CLONE_PIDFD is specified, then CLONE_PARENT_SETTID cannot be specified (unless using clone3).
One of the arguments to clone is the signal number that the child will send to its parent when the child terminates. Setting this to anything other than SIGCHLD has several consequences:
- Calls to wait do not recognise the child.
- Calls to waitpid / waitid only recognise the child if the __WALL or __WCLONE option is passed (this is true even for P_PIDFD calls).
- The child will always transition to the zombie state upon termination and stay there until waited upon, even if the parent's SIGCHLD handler is SIG_IGN or has SA_NOCLDWAIT.
- When the child terminates, a signal other than SIGCHLD will be sent to the parent (or no signal will be sent if the termination signal is set to zero).
Note that if the child calls execve (or a similar exec function), then the termination signal number is reset to SIGCHLD, and the above points stop applying.
clone3 with CLONE_PIDFD flag
Available since: kernel 5.3, no glibc wrapper.
This function is just a more extensible version of clone; everything written above about clone applies equally to clone3.
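Since there is no glibc wrapper, clone3 has to be invoked through syscall. The following is a minimal sketch of a fork-like call that also yields a pidfd. The names example_clone_args, example_fork_with_pidfd, and example_clone3_demo are hypothetical, the struct is a hand-written copy of the first 64 bytes of the kernel's struct clone_args, and the fallback syscall number 435 is an assumption that holds on common architectures.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435 /* assumption: value on common architectures */
#endif
#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x1000
#endif

/* Hypothetical copy of the original (64-byte) layout of struct clone_args. */
struct example_clone_args {
    uint64_t flags, pidfd, child_tid, parent_tid;
    uint64_t exit_signal, stack, stack_size, tls;
};

/* Behaves like fork(): returns 0 in the child, the child's pid in the
 * parent (with *pidfd_out filled in), or -1 with errno set on failure. */
static pid_t example_fork_with_pidfd(int *pidfd_out)
{
    struct example_clone_args args;
    memset(&args, 0, sizeof args);
    args.flags = CLONE_PIDFD;
    args.pidfd = (uint64_t)(uintptr_t)pidfd_out; /* kernel writes an int here */
    args.exit_signal = SIGCHLD;                  /* keep ordinary wait semantics */
    return (pid_t)syscall(SYS_clone3, &args, sizeof args);
}

/* Returns 1 if a close-on-exec pidfd was obtained, 0 if the pidfd looked
 * wrong, or -1 if clone3 is unavailable (old kernel, seccomp filter). */
static int example_clone3_demo(void)
{
    int pidfd = -1;
    pid_t child = example_fork_with_pidfd(&pidfd);
    if (child < 0) return -1;
    if (child == 0) _exit(0); /* child: bypassed glibc's fork logic, so just exit */
    int ok = pidfd >= 0 && (fcntl(pidfd, F_GETFD) & FD_CLOEXEC) != 0;
    waitpid(child, NULL, 0);  /* reap the child via its pid number */
    if (pidfd >= 0) close(pidfd);
    return ok;
}
```

Because the raw syscall bypasses whatever glibc does around fork, a child created this way is best limited to async-signal-safe calls followed by exec or _exit.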
pidfd_open
Available since: kernel 5.3, glibc 2.36.
This function takes a pid number (in the pid namespace of the caller), and returns a freshly allocated pidfd referring to said process (or an error if said process does not exist). It is inherently racy, unless the pid number being passed is the result of getpid (i.e. creating a pidfd referring to your own process).
Since kernel 5.10, the PIDFD_NONBLOCK flag can be passed to pidfd_open, which affects subsequent waitid calls. No other flags are valid to pass. The O_CLOEXEC flag is automatically set on the returned pidfd.
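If glibc is older than 2.36, the same call can be made through syscall. A minimal sketch, in which example_pidfd_open, example_pidfd_open_self, and example_is_cloexec are hypothetical helper names and the fallback syscall number 434 is an assumption that holds on common architectures:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: value on common architectures */
#endif

/* Raw-syscall stand-in for pidfd_open; returns a pidfd or -1 with errno set. */
static int example_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* The race-free case described above: a pidfd referring to the caller itself. */
static int example_pidfd_open_self(void)
{
    return example_pidfd_open(getpid(), 0);
}

/* Returns 1 if the close-on-exec flag is set on fd, 0 if not, -1 on error. */
static int example_is_cloexec(int fd)
{
    int fd_flags = fcntl(fd, F_GETFD);
    if (fd_flags < 0) return -1;
    return (fd_flags & FD_CLOEXEC) != 0;
}
```

On a kernel with pidfd support, example_is_cloexec(example_pidfd_open_self()) should report 1, matching the automatic O_CLOEXEC behaviour described above.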
pidfd_spawn / pidfd_spawnp
Available since: kernel 5.4, glibc 2.39.
These functions are like posix_spawn / posix_spawnp, except that they have an int* output parameter for a freshly allocated pidfd instead of a pid_t* output parameter for a pid number. The O_CLOEXEC flag is automatically set on the returned pidfd.
In glibc 2.39, bug BZ#31695 causes these functions to leak a file descriptor in some error scenarios. This will hopefully be fixed in 2.40.
getsockopt with SO_PEERPIDFD optname
Available since: kernel 6.5, glibc 2.2.5 for getsockopt. The definition of SO_PEERPIDFD is not tied to a particular glibc version; its value is 77 should you need to provide your own definition of it.
SO_PEERPIDFD is the pidfd version of SO_PEERCRED. For a unix socket created via socketpair, SO_PEERPIDFD gives a pidfd referring to the process that called socketpair, meanwhile for a connected unix stream socket, SO_PEERPIDFD gives a pidfd referring to the process that called connect (if called on the server end of the socket) or the process that called listen (if called on the client end of the socket). The O_CLOEXEC flag is automatically set on the returned pidfd.
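As a rough illustration of the socketpair case, the sketch below asks one end of a socket pair for its peer's pidfd. The helper names are hypothetical, and the fallback value 77 for SO_PEERPIDFD is the one quoted above (some less common architectures may use a different number).

```c
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* value quoted in the text; assumed for this arch */
#endif

/* Returns a pidfd for the peer of a unix socket, or -1 with errno set
 * (for example ENOPROTOOPT on kernels before 6.5). */
static int example_peer_pidfd(int sock)
{
    int pidfd = -1;
    socklen_t len = sizeof pidfd;
    if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) != 0)
        return -1;
    return pidfd;
}

/* Returns 1 if a peer pidfd was obtained from a fresh socketpair,
 * 0 if the kernel does not support it, -1 if socketpair itself failed. */
static int example_socketpair_peer_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    int pidfd = example_peer_pidfd(sv[0]); /* refers to the socketpair caller */
    int got_one = pidfd >= 0;
    if (pidfd >= 0) close(pidfd);
    close(sv[0]);
    close(sv[1]);
    return got_one;
}
```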
recvmsg with SCM_PIDFD cmsg_type
Available since: kernel 6.5, glibc 2.39 (or glibc 2.2.5 if you provide your own definition of SCM_PIDFD; its value is 0x04).
SCM_PIDFD is the pidfd version of (the pid part of) SCM_CREDENTIALS. If the receiver sets SO_PASSPIDFD on a unix socket (cf. setting SO_PASSCRED), then it'll receive an SCM_PIDFD cmsg as part of receiving a message, with the associated cmsg data being a freshly allocated pidfd referring to the process of the sender of the message (or some other process if the sender has CAP_SYS_ADMIN and specifies a pid number other than itself as part of its SCM_CREDENTIALS). The O_CLOEXEC flag is automatically set on the pidfd.
pidfd_send_signal
Available since: kernel 5.1, glibc 2.36.
This function is similar to kill / rt_sigqueueinfo: it sends a signal to a process. It differs from these functions in that the destination is given as a pidfd rather than as a pid number.
This function also accepts the result of open("/proc/$pid") as an fd, though it is the only function to do so: open("/proc/$pid") does not give a pidfd, and no other functions accept the result of open("/proc/$pid") in place of a pidfd.
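A small sketch of signalling through a pidfd rather than a pid number: the parent forks a child, opens a pidfd for it, and kills it with SIGKILL. The helper names are hypothetical, raw syscalls are used so that the sketch does not depend on glibc 2.36, and the fallback syscall numbers are assumptions that hold on common architectures.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424 /* assumption: value on common architectures */
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: value on common architectures */
#endif

/* Raw-syscall stand-in for pidfd_send_signal with no siginfo and no flags. */
static int example_pidfd_send_signal(int pidfd, int sig)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}

/* Forks a child that sleeps forever, kills it through a pidfd, and returns
 * the signal that terminated it. Returns -1 if the pidfd path could not be
 * used (in which case the child is cleaned up with plain kill instead). */
static int example_kill_child_via_pidfd(void)
{
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        for (;;) pause(); /* child: wait to be killed */
    }

    /* Opening by pid is safe here: the child is ours and not yet waited upon,
     * so its pid number cannot have been reused. */
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    int used_pidfd = pidfd >= 0 && example_pidfd_send_signal(pidfd, SIGKILL) == 0;
    if (!used_pidfd) kill(child, SIGKILL); /* fallback for old kernels */
    if (pidfd >= 0) close(pidfd);

    siginfo_t info = {0};
    if (waitid(P_PID, (id_t)child, &info, WEXITED) != 0) return -1;
    if (!used_pidfd || info.si_code != CLD_KILLED) return -1;
    return info.si_status; /* the terminating signal number */
}
```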
pidfd_getpid
Available since: kernel 5.2, glibc 2.39.
This function is the inverse of pidfd_open: given a pidfd, it returns the pid number associated with the underlying process. This function requires that /proc be mounted, and returns the pid number in the pid namespace associated with the mounted /proc. Note that the pid number can be reused for a different process once the underlying process is dead.
Changed in kernel 5.5: if the process referenced by the pidfd is dead, this function returns -1 (prior to 5.5, it returned whatever pid number the process had prior to its death).
Note that this is not a direct system call; instead it opens /proc/self/fdinfo/$pidfd and parses the Pid: line therein.
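The fdinfo approach can be approximated in a few lines. This is only a sketch of the idea, not glibc's actual implementation; the helper names are hypothetical, and the fallback syscall number for pidfd_open is an assumption that holds on common architectures.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: value on common architectures */
#endif

/* Parses the "Pid:" line of /proc/self/fdinfo/<pidfd>. Returns the pid
 * number found there, or -1 if /proc is unavailable, the fd is not a pidfd,
 * or the kernel reports -1 for a dead process. */
static pid_t example_pidfd_getpid(int pidfd)
{
    char path[64];
    char line[256];
    long pid = -1;

    snprintf(path, sizeof path, "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    while (fgets(line, sizeof line, f)) {
        /* Only a line starting with "Pid:" matches; "NSpid:" does not. */
        if (sscanf(line, "Pid: %ld", &pid) == 1) break;
    }
    fclose(f);
    return (pid_t)pid;
}

/* Opens a pidfd for the calling process and maps it back to a pid number. */
static pid_t example_self_pid_roundtrip(void)
{
    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0) return -1;
    pid_t pid = example_pidfd_getpid(pidfd);
    close(pidfd);
    return pid;
}
```

When /proc is mounted for the caller's own pid namespace, example_self_pid_roundtrip() should give the same value as getpid().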
poll / select / epoll
Available since: kernel 5.3, glibc 2.2.5 (poll / select) or glibc 2.3.2 (epoll).
These functions can be used to asynchronously monitor a pidfd. They will report the pidfd as readable iff the underlying process is a zombie or is dead. Note however that read on a pidfd always fails; to get the exit code / status of the process, use waitid (possibly with WNOHANG).
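To make the poll-then-waitid pattern concrete, here is a minimal sketch: the parent forks a child, polls the child's pidfd until it becomes readable, then collects the exit code with a non-blocking waitid. The function name is hypothetical; the fallback values for SYS_pidfd_open (an assumption for common architectures) and P_PIDFD (3, as quoted in this article) are only used when the headers lack them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: value on common architectures */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3
#endif

/* Forks a child that exits with exit_code, waits for it by polling a pidfd,
 * and returns the exit code retrieved via waitid. Returns -1 on any failure
 * (including kernels without pidfd support). */
static int example_poll_child_exit(int exit_code)
{
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) _exit(exit_code);

    int result = -1;
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd >= 0) {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        int n;
        do {
            n = poll(&pfd, 1, -1); /* blocks until the child is a zombie */
        } while (n < 0 && errno == EINTR);

        siginfo_t info = {0};
        if (n == 1 && (pfd.revents & POLLIN)
            && waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED | WNOHANG) == 0
            && info.si_pid == child && info.si_code == CLD_EXITED) {
            result = info.si_status; /* the exit code */
        }
        close(pidfd);
    }
    if (result < 0) waitpid(child, NULL, 0); /* fallback reap; harmless if already reaped */
    return result;
}
```

In a real event loop the pidfd would sit alongside sockets and timers in the same poll or epoll set, which is the main attraction of this approach.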
waitid with P_PIDFD mode
Available since: kernel 5.4, glibc 2.36 (or glibc 2.2.5 if you provide your own definition of P_PIDFD; its value is 3).
waitid(P_PIDFD, fd, infop, options) is identical to waitid(P_PID, pidfd_getpid(fd), infop, options), except for the following:
- The embedded pidfd_getpid call is done atomically as part of waitid; there is no race condition.
- The embedded pidfd_getpid call does not require /proc to be mounted.
- If the pidfd was opened with the PIDFD_NONBLOCK flag, and options does not contain WNOHANG, and the process referenced by the pidfd is alive, then waitid will fail with EAGAIN rather than blocking. Note that if options does contain WNOHANG, then PIDFD_NONBLOCK has no effect: if the process referenced by the pidfd is alive, then waitid will succeed with result 0 rather than blocking.
In particular, note that:
- Waiting on a zombie process will retrieve the exit code / status (in si_code / si_status), and transition the process from zombie to dead. The si_signo, si_errno, si_pid, and si_uid fields will also be set.
- Waiting on a dead process will fail with ECHILD.
The above points are true for all waitid calls, including P_PIDFD calls. The first time a zombie is waited upon (by any kind of wait / waitpid / waitid call), the exit code / status is retrieved, and subsequent attempts to wait upon it (again by any kind of wait / waitpid / waitid call) will fail.
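The at-most-once behaviour can be observed directly. In this sketch (hypothetical function name, same fallback assumptions for SYS_pidfd_open and P_PIDFD as elsewhere), the first waitid on a pidfd succeeds and the second reports an error, even though the pidfd is still open:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: value on common architectures */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3
#endif

/* Waits twice on the same child through a pidfd and returns the errno of the
 * second attempt. Returns -1 if the experiment could not be carried out
 * (no pidfd support, or the first wait failed). */
static int example_second_wait_errno(void)
{
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) _exit(7);

    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd < 0) {
        waitpid(child, NULL, 0); /* still reap the child */
        return -1;
    }

    siginfo_t info = {0};
    int second_errno = -1;
    if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) == 0) {
        /* The child is now dead; the open pidfd no longer has anything to give. */
        if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) == -1)
            second_errno = errno;
    } else {
        waitpid(child, NULL, 0); /* e.g. kernel before 5.4: reap by pid instead */
    }
    close(pidfd);
    return second_errno;
}
```

On a kernel with P_PIDFD support the returned value is expected to be ECHILD.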
When a process transitions from alive to zombie, if that process's parent's SIGCHLD handler is SIG_IGN or has SA_NOCLDWAIT, then the kernel does an automatic wait call on behalf of the parent and discards the result, thereby transitioning the child onward from zombie to dead. This causes all attempts to wait upon the child (including via P_PIDFD) to fail. The only exception to this is if the child was created with clone or clone3, and the termination signal was specified as something other than SIGCHLD, and the child has not called execve or similar: given this combination of circumstances, the automatic wait call will not recognise the child.
pidfd_getfd
Available since: kernel 5.6, glibc 2.36.
This function takes a pidfd, along with an fd number in the file table of the process referenced by the pidfd, creates a duplicate of that file descriptor in the file table of the calling process, and returns the new fd number. The effect is similar to what would happen if the referenced process used an SCM_RIGHTS message to send a file descriptor to the calling process. The O_CLOEXEC flag is automatically set on the new fd.
Calling this function incurs a PTRACE_MODE_ATTACH_REALCREDS security check.
setns
Available since: kernel 5.8, glibc 2.14.
Passing a pidfd to this function moves the caller into one or more of the namespaces that the process referenced by the pidfd is in. Note that this function can also be passed the result of open("/proc/$pid/ns/$name") as an fd.
process_madvise
Available since: kernel 5.10, glibc 2.36.
This function is similar to madvise, except that it operates on an arbitrary process (specified via a pidfd) rather than on the calling process.
Since 5.12, calling this function incurs PTRACE_MODE_READ_FSCREDS and CAP_SYS_NICE security checks. In 5.10 and 5.11, it incurred a PTRACE_MODE_ATTACH_FSCREDS security check.
process_mrelease
Available since: kernel 5.15, glibc 2.36.
This is a relatively niche function, which you are unlikely to ever need unless writing a userspace OOM killer. It can be called against a process which is no longer alive, but hasn't yet had its virtual memory released by the kernel, to cause the kernel to release said virtual memory faster.
fstat / statx for meaningful stx_ino
Available since: kernel 6.9, glibc 2.2.5 (fstat) or glibc 2.28 (statx).
It has always been possible to call fstat or statx on a pidfd, but prior to kernel 6.9, it was not useful to do so. Since 6.9, calling statx on a pidfd gives a meaningful stx_ino: the 64-bit inode number of a pidfd uniquely identifies a process, so two pidfds referencing the same process will have identical stx_ino values, while two pidfds referencing different processes will have different stx_ino values. The same is true for fstat, provided that st_ino is 64 bits wide. In other words, since 6.9, a process's inode number (as observed via a pidfd) is a unique 64-bit identifier for the process, which is never reused (until the system is restarted), and is unique even across different pid namespaces.
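A sketch of using the inode number as a process identity check. The helper names are hypothetical and the fallback syscall number is an assumption for common architectures. Note the caveat from the text: before kernel 6.9 the comparison is not meaningful and may report a match for any two pidfds.

```c
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: value on common architectures */
#endif

/* Returns 1 if both pidfds report the same device and inode number, 0 if
 * they differ, -1 if fstat fails. Only meaningful on kernel 6.9 or later,
 * and only when st_ino is 64 bits wide. */
static int example_same_process(int pidfd_a, int pidfd_b)
{
    struct stat a, b;
    if (fstat(pidfd_a, &a) != 0 || fstat(pidfd_b, &b) != 0) return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Opens two independent pidfds for the calling process and compares them.
 * Returns -1 if pidfds are unavailable. */
static int example_self_pidfds_match(void)
{
    int a = (int)syscall(SYS_pidfd_open, getpid(), 0);
    int b = (int)syscall(SYS_pidfd_open, getpid(), 0);
    int result = (a >= 0 && b >= 0) ? example_same_process(a, b) : -1;
    if (a >= 0) close(a);
    if (b >= 0) close(b);
    return result;
}
```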
It is likely that future kernel versions will add more things that can be done with (or to) a pidfd. As for the existing functionality, if you find yourself constrained by glibc version rather than kernel version, one option is to compile against a very recent glibc, then use polyfill-glibc to restore runtime compatibility with an older version of glibc.
In terms of future directions, some of the things that I'd like to see are:
- The ability for a pidfd to obtain the exit code and status of dead processes, not just zombie processes (cf. GetExitCodeProcess in Windows).
- The ability to mark a process as transitioning directly from alive to dead, without sitting in the zombie state until someone waits upon it. This would be similar to SA_NOCLDWAIT, but as a property of the child rather than a property of the parent. Combined with the previous point, the exit code and status would still be retrievable (by any holder of a relevant pidfd).
- Subject to a flag, closing a pidfd could cause the underlying process (and possibly all of its transitive descendants) to be terminated by the kernel.
- pidfd variants of process_vm_readv and process_vm_writev.