What even is a pidfd anyway?
In recent versions of the Linux kernel, a pidfd is a special type of file that holds a reference to a process. Notably, a pidfd allows for certain process-related operations to be performed in a race-free manner, and it allows `poll` / `select` / `epoll` to be used to detect process termination.
Before you get too excited:
- A pidfd does not let you hold a reference to an individual thread, only to a process (or in kernel terminology, a thread group leader).
- A pidfd does not hold a reference to a pid number, nor does holding open a pidfd prevent the pid number of the underlying process from being reused.
- A pidfd cannot circumvent the at-most-once semantics of retrieving the exit code / status of a process via `wait` / `waitpid` / `waitid`.
- Closing a pidfd does not terminate the underlying process.
There are various ways of obtaining a pidfd:
Kernel version | glibc version | Function |
---|---|---|
5.2 | 2.2.5 / 2.31 | clone with CLONE_PIDFD flag |
5.3 | N/A | clone3 with CLONE_PIDFD flag |
5.3 / 5.10 | 2.36 | pidfd_open |
5.4 | 2.39 | pidfd_spawn / pidfd_spawnp |
6.5 | 2.2.5 / N/A | getsockopt with SO_PEERPIDFD optname |
6.5 | 2.2.5 / 2.39 | recvmsg with SCM_PIDFD cmsg_type |
Once you have a pidfd, there are a bunch of things you can do with it:
Kernel version | glibc version | Function |
---|---|---|
5.1 | 2.36 | pidfd_send_signal |
5.2 / 5.5 | 2.39 | pidfd_getpid |
5.3 | 2.2.5 / 2.3.2 | poll / select / epoll |
5.4 | 2.2.5 / 2.36 | waitid with P_PIDFD mode |
5.6 | 2.36 | pidfd_getfd |
5.8 | 2.14 | setns |
5.10 / 5.12 | 2.36 | process_madvise |
5.15 | 2.36 | process_mrelease |
6.9 | 2.2.5 / 2.28 | fstat / statx for meaningful stx_ino |
Some of the subsequent text refers to a process being alive or zombie or dead. These terms come from the usual lifecycle of a unix process: it is initially alive, then transitions to zombie when it terminates, and then transitions to dead once it is waited upon. As a quick summary of the states:
Property | Alive | Zombie | Dead |
---|---|---|---|
Can execute code and receive signals | ✅ | ❌ | ❌ |
Has pid number | ✅ | ✅ | ❌ |
Exit code / status retrievable | ❌ | ✅ | ❌ |
pidfd polls as readable | ❌ | ✅ | ✅ |
Cleaned up by kernel | ❌ | ❌ | ✅ |
`clone` with `CLONE_PIDFD` flag
Available since: kernel 5.2, glibc 2.31 (or glibc 2.2.5 if you provide your own definition of `CLONE_PIDFD`; its value is `0x1000`).
If the `CLONE_PIDFD` flag is specified, then `clone` returns a freshly allocated pidfd referring to the child (in addition to returning the pid number of the child). The `O_CLOEXEC` flag is automatically set on the returned pidfd. Note that if `CLONE_PIDFD` is specified, then `CLONE_THREAD` cannot be specified, nor can `CLONE_DETACHED`. Furthermore, if `CLONE_PIDFD` is specified, then `CLONE_PARENT_SETTID` cannot be specified (unless using `clone3`).
One of the arguments to `clone` is the signal number that the child will send to its parent when the child terminates. Setting this to anything other than `SIGCHLD` has several consequences:
- Calls to `wait` do not recognise the child.
- Calls to `waitpid` / `waitid` only recognise the child if the `__WALL` or `__WCLONE` option is passed (this is true even for `P_PIDFD` calls).
- The child will always transition to the zombie state upon termination and stay there until waited upon, even if the parent's `SIGCHLD` handler is `SIG_IGN` or has `SA_NOCLDWAIT`.
- When the child terminates, a signal other than `SIGCHLD` will be sent to the parent (or no signal will be sent if the termination signal is set to zero).
Note that if the child calls `execve` (or a similar `exec` function), then the termination signal number is reset to `SIGCHLD`, and the above points stop applying.
`clone3` with `CLONE_PIDFD` flag
Available since: kernel 5.3, no glibc wrapper.
This function is just a more extensible version of `clone`; everything written above about `clone` applies equally to `clone3`.
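As a rough sketch of what this looks like in practice: with no glibc wrapper, `clone3` has to be invoked through `syscall`. The struct `clone3_args_v0` and the helpers `fork_with_pidfd` / `run_child` below are names made up for this example; the struct mirrors what I understand to be the original 64-byte layout of the kernel's `struct clone_args`, and the fallback syscall number 435 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435 /* assumption: number used on most architectures */
#endif
#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x1000
#endif

/* Hypothetical local copy of the first-version layout of the kernel's
   struct clone_args, so that no particular kernel header is required. */
struct clone3_args_v0 {
    uint64_t flags, pidfd, child_tid, parent_tid;
    uint64_t exit_signal, stack, stack_size, tls;
};

/* Behaves like fork(), but also stores a pidfd for the child in *pidfd_out.
   Returns the child's pid in the parent, 0 in the child, -1 on error. */
static pid_t fork_with_pidfd(int *pidfd_out)
{
    struct clone3_args_v0 args;
    memset(&args, 0, sizeof args);
    args.flags = CLONE_PIDFD;
    args.pidfd = (uint64_t)(uintptr_t)pidfd_out;
    args.exit_signal = SIGCHLD; /* keep the usual wait() behaviour */
    return (pid_t)syscall(SYS_clone3, &args, sizeof args);
}

/* Runs a child that exits with `code`. Returns the exit code seen by the
   parent, or -1 if clone3 is unavailable (e.g. fails with ENOSYS). */
static int run_child(int code)
{
    int pidfd = -1, status = 0;
    pid_t pid = fork_with_pidfd(&pidfd);
    if (pid == 0)
        _exit(code);
    if (pid < 0)
        return -1;
    close(pidfd); /* closing the pidfd does not affect the child */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Leaving `stack` as zero gives fork-like behaviour (the child runs on a copy of the parent's stack), which keeps the sketch short; a real program would typically do something more useful with the pidfd than immediately closing it.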
`pidfd_open`
Available since: kernel 5.3, glibc 2.36.
This function takes a pid number (in the pid namespace of the caller), and returns a freshly allocated pidfd referring to said process (or an error if said process does not exist). It is inherently racy, unless the pid number being passed is the result of `getpid` (i.e. creating a pidfd referring to your own process).
Since kernel 5.10, the `PIDFD_NONBLOCK` flag can be passed to `pidfd_open`, which affects subsequent `waitid` calls. No other flags are valid to pass. The `O_CLOEXEC` flag is automatically set on the returned pidfd.
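If your glibc predates 2.36, a thin wrapper over the raw system call is enough. The sketch below uses hypothetical helper names (`my_pidfd_open`, `fd_is_cloexec`), and the fallback syscall number 434 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: number used on most architectures */
#endif

/* Thin wrapper over the raw system call. Returns a pidfd, or -1 with errno
   set (ENOSYS on kernels older than 5.3, ESRCH if no such process). */
static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Returns 1 if the close-on-exec flag is set on fd, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
    int fl = fcntl(fd, F_GETFD);
    if (fl < 0)
        return -1;
    return (fl & FD_CLOEXEC) ? 1 : 0;
}
```

Calling `my_pidfd_open(getpid(), 0)` is the one race-free use with a bare pid number, and `fd_is_cloexec` on the result should report 1.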
`pidfd_spawn` / `pidfd_spawnp`
Available since: kernel 5.4, glibc 2.39.
These functions are like `posix_spawn` / `posix_spawnp`, except that they have an `int*` output parameter for a freshly allocated pidfd instead of a `pid_t*` output parameter for a pid number. The `O_CLOEXEC` flag is automatically set on the returned pidfd.
In glibc 2.39, bug BZ#31695 causes these functions to leak a file descriptor in some error scenarios. This will hopefully be fixed in 2.40.
`getsockopt` with `SO_PEERPIDFD` optname
Available since: kernel 6.5, glibc 2.2.5 for `getsockopt`. The definition of `SO_PEERPIDFD` is not tied to a particular glibc version; its value is `77` should you need to provide your own definition of it.
`SO_PEERPIDFD` is the pidfd version of `SO_PEERCRED`. For a unix socket created via `socketpair`, `SO_PEERPIDFD` gives a pidfd referring to the process that called `socketpair`, meanwhile for a connected unix stream socket, `SO_PEERPIDFD` gives a pidfd referring to the process that called `connect` (if called on the server end of the socket) or the process that called `listen` (if called on the client end of the socket). The `O_CLOEXEC` flag is automatically set on the returned pidfd.
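A minimal sketch of the `getsockopt` call, assuming the option value is an `int` that receives the new pidfd (in the same way that other fd-returning socket options work). The helper name `peer_pidfd` is made up for this example, and the fallback value 77 is the one quoted above.

```c
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* value quoted in the text above */
#endif

/* Returns a freshly allocated pidfd referring to the peer of a unix socket,
   or -1 with errno set (the call is expected to fail on kernels before 6.5,
   and on anything that is not a suitable unix socket). */
static int peer_pidfd(int sock)
{
    int pidfd = -1;
    socklen_t len = sizeof pidfd;
    if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) != 0)
        return -1;
    return pidfd;
}
```

On either end of a `socketpair`, the returned pidfd should refer to the process that created the pair; the caller owns the descriptor and is responsible for closing it.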
`recvmsg` with `SCM_PIDFD` cmsg_type
Available since: kernel 6.5, glibc 2.39 (or glibc 2.2.5 if you provide your own definition of `SCM_PIDFD`; its value is `0x04`).
`SCM_PIDFD` is the pidfd version of (the pid part of) `SCM_CREDENTIALS`. If the receiver sets `SO_PASSPIDFD` on a unix socket (cf. setting `SO_PASSCRED`), then it'll receive a `SCM_PIDFD` cmsg as part of receiving a message, with the associated cmsg data being a freshly allocated pidfd referring to the process of the sender of the message (or some other process if the sender has `CAP_SYS_ADMIN` and specifies a pid number other than itself as part of its `SCM_CREDENTIALS`). The `O_CLOEXEC` flag is automatically set on the pidfd.
`pidfd_send_signal`
Available since: kernel 5.1, glibc 2.36.
This function is similar to `kill` / `rt_sigqueueinfo`: it sends a signal to a process. It differs from these functions in that the destination is given as a pidfd rather than as a pid number.
This function also accepts the result of `open("/proc/$pid")` as an fd, though it is the only function to do so: `open("/proc/$pid")` does not give a pidfd, and no other functions accept the result of `open("/proc/$pid")` in place of a pidfd.
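Here is a small sketch of signalling a child through a pidfd, using raw system calls so that it does not depend on glibc 2.36. All helper names are hypothetical, and the fallback syscall numbers (434 and 424) are assumptions that hold on most architectures. Opening the pidfd by pid number is safe in this particular case, because the parent has not yet waited on its own live child, so the pid number cannot have been reused.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

static int my_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                unsigned int flags)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, info, flags);
}

/* Forks a child that sleeps forever, then kills it via a pidfd. Returns the
   signal that terminated the child, or -1 if pidfds are unavailable. */
static int kill_child_via_pidfd(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        for (;;)
            pause();

    int pidfd = my_pidfd_open(pid, 0);
    int sent = pidfd >= 0 &&
               my_pidfd_send_signal(pidfd, SIGKILL, NULL, 0) == 0;
    if (!sent)
        kill(pid, SIGKILL); /* fallback so the child never lingers */
    if (pidfd >= 0)
        close(pidfd);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !sent || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

Passing `NULL` for the `siginfo_t` argument makes the call behave like `kill`, which is all this sketch needs.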
`pidfd_getpid`
Available since: kernel 5.2, glibc 2.39.
This function is the inverse of `pidfd_open`: given a pidfd, it returns the pid number associated with the underlying process. This function requires that `/proc` be mounted, and returns the pid number in the pid namespace associated with the mounted `/proc`. Note that the pid number can be reused for a different process once the underlying process is dead.
Changed in kernel 5.5: if the process referenced by the pidfd is dead, this function returns -1 (prior to 5.5, it returned whatever pid number the process had prior to its death).
Note that this is not a direct system call; instead it opens `/proc/self/fdinfo/$pidfd` and parses the `Pid:` line therein.
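For older glibc versions, the same parsing can be approximated by hand. This is only a sketch of the approach described above, not glibc's actual implementation; the helper names are made up, and the fallback syscall number 434 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Returns the pid number behind a pidfd by parsing the "Pid:" line of
   /proc/self/fdinfo/<fd>, or -1 if /proc is not mounted, the fd is not a
   pidfd, or the kernel reports -1 for the process. */
static pid_t pidfd_to_pid(int pidfd)
{
    char path[64], line[256];
    long pid = -1;

    snprintf(path, sizeof path, "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* Only matches lines that start with "Pid:", so "NSpid:" and
           similar fields are skipped. */
        if (sscanf(line, "Pid: %ld", &pid) == 1)
            break;
    }
    fclose(f);
    return (pid_t)pid;
}
```

As with the real function, the result is just a number: nothing stops it from being reused once the underlying process is dead.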
`poll` / `select` / `epoll`
Available since: kernel 5.3, glibc 2.2.5 (`poll` / `select`) or glibc 2.3.2 (`epoll`).
These functions can be used to asynchronously monitor a pidfd. They will report the pidfd as readable iff the underlying process is a zombie or is dead. Note however that `read` on a pidfd always fails; to get the exit code / status of the process, use `waitid` (possibly with `WNOHANG`).
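A sketch of the `poll` case: the child below sleeps briefly and exits, and the parent blocks in `poll` until the pidfd becomes readable. The helper names are hypothetical, the five second timeout is arbitrary, and the fallback syscall number 434 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Returns 1 once the pidfd polls as readable (process is zombie or dead),
   0 if that did not happen within timeout_ms, -1 on error. */
static int pidfd_wait_readable(int pidfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int n;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR); /* e.g. interrupted by SIGCHLD */
    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Forks a short-lived child and polls its pidfd. Returns the result of
   pidfd_wait_readable, or -1 if pidfds are unavailable. */
static int poll_child_exit(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        usleep(100 * 1000); /* stay alive for a moment */
        _exit(0);
    }
    int pidfd = my_pidfd_open(pid, 0);
    int result = (pidfd >= 0) ? pidfd_wait_readable(pidfd, 5000) : -1;
    if (pidfd >= 0)
        close(pidfd);
    waitpid(pid, NULL, 0); /* reap the child: zombie -> dead */
    return result;
}
```

The same pidfd could equally be registered with `epoll` alongside sockets and timers, which is usually the point of having process termination show up as fd readiness.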
`waitid` with `P_PIDFD` mode
Available since: kernel 5.4, glibc 2.36 (or glibc 2.2.5 if you provide your own definition of `P_PIDFD`; its value is `3`).
`waitid(P_PIDFD, fd, infop, options)` is identical to `waitid(P_PID, pidfd_getpid(fd), infop, options)`, except for the following:
- The embedded `pidfd_getpid` call is done atomically as part of `waitid`; there is no race condition.
- The embedded `pidfd_getpid` call does not require `/proc` to be mounted.
- If the pidfd was opened with the `PIDFD_NONBLOCK` flag, and `options` does not contain `WNOHANG`, and the process referenced by the pidfd is alive, then `waitid` will fail with `EAGAIN` rather than blocking. Note that if `options` does contain `WNOHANG`, then `PIDFD_NONBLOCK` has no effect: if the process referenced by the pidfd is alive, then `waitid` will succeed with result 0 rather than blocking.
In particular, note that:
- Waiting on a zombie process will retrieve the exit code / status (in `si_code` / `si_status`), and transition the process from zombie to dead. The `si_signo`, `si_errno`, `si_pid`, and `si_uid` fields will also be set.
- Waiting on a dead process will fail with `ECHILD`.
The above points are true for all `waitid` calls, including `P_PIDFD` calls. The first time a zombie is waited upon (by any kind of `wait` / `waitpid` / `waitid` call), the exit code / status is retrieved, and subsequent attempts to wait upon it (again by any kind of `wait` / `waitpid` / `waitid` call) will fail.
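The at-most-once behaviour can be sketched as follows: wait on the same pidfd twice and record what each call does. The helper names are hypothetical, the fallback definitions (`P_PIDFD` as 3, syscall number 434) are only used if the headers lack them, and the sketch assumes the parent's `SIGCHLD` disposition is the default.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef P_PIDFD
#define P_PIDFD 3
#endif

static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Forks a child that exits with `code`, then waits on it twice via its
   pidfd. On success returns 0, storing si_status from the first wait in
   *first_status and the errno of the second wait (0 if it unexpectedly
   succeeded) in *second_errno. Returns -1 if pidfds or P_PIDFD are
   unavailable. */
static int wait_twice(int code, int *first_status, int *second_errno)
{
    siginfo_t info;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    int pidfd = my_pidfd_open(pid, 0);
    memset(&info, 0, sizeof info);
    if (pidfd < 0 ||
        waitid((idtype_t)P_PIDFD, (id_t)pidfd, &info, WEXITED) != 0) {
        waitpid(pid, NULL, 0); /* fall back to reaping by pid number */
        if (pidfd >= 0)
            close(pidfd);
        return -1;
    }
    *first_status = info.si_status; /* the zombie became dead here */

    /* Second wait: the pidfd is still open, but the process is dead. */
    int r = waitid((idtype_t)P_PIDFD, (id_t)pidfd, &info,
                   WEXITED | WNOHANG);
    *second_errno = (r == -1) ? errno : 0;
    close(pidfd);
    return 0;
}
```

With a child exiting with code 5, I would expect `*first_status` to be 5 and `*second_errno` to be `ECHILD`, matching the two bullet points above.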
When a process transitions from alive to zombie, if that process's parent's `SIGCHLD` handler is `SIG_IGN` or has `SA_NOCLDWAIT`, then the kernel does an automatic `wait` call on behalf of the parent and discards the result, thereby transitioning the child onward from zombie to dead. This causes all attempts to wait upon the child (including via `P_PIDFD`) to fail. The only exception to this is if the child was created with `clone` or `clone3`, and the termination signal was specified as something other than `SIGCHLD`, and the child has not called `execve` or similar: given this combination of circumstances, the automatic `wait` call will not recognise the child.
`pidfd_getfd`
Available since: kernel 5.6, glibc 2.36.
This function takes a pidfd, along with an fd number in the file table of the process referenced by the pidfd, creates a duplicate of that file descriptor in the file table of the calling process, and returns the new fd number. The effect is similar to what would happen if the referenced process used an `SCM_RIGHTS` message to send a file descriptor to the calling process. The `O_CLOEXEC` flag is automatically set on the new fd.
Calling this function incurs a `PTRACE_MODE_ATTACH_REALCREDS` security check.
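A minimal sketch using raw system calls, with hypothetical wrapper names and fallback syscall numbers (434 and 438) that are assumptions holding on most architectures. The easiest way to try it without a second process is to point it at a pidfd for your own process, in which case it should behave much like `dup`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Duplicates file descriptor `targetfd` of the process referenced by
   `pidfd` into the calling process. Returns the new fd, or -1 with errno
   set (ENOSYS before kernel 5.6, EPERM if the security check fails,
   EBADF if either descriptor is invalid). `flags` must currently be 0. */
static int my_pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, flags);
}
```

For example, given a pipe `p` and `self = my_pidfd_open(getpid(), 0)`, writing to `my_pidfd_getfd(self, p[1], 0)` should make the data readable from `p[0]`.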
`setns`
Available since: kernel 5.8, glibc 2.14.
Passing a pidfd to this function moves the caller into one or more of the namespaces that the process referenced by the pidfd is in. Note that this function can also be passed the result of `open("/proc/$pid/ns/$name")` as an fd.
`process_madvise`
Available since: kernel 5.10, glibc 2.36.
This function is similar to `madvise`, except that it operates on an arbitrary process (specified via a pidfd) rather than on the calling process.
Since 5.12, calling this function incurs `PTRACE_MODE_READ_FSCREDS` and `CAP_SYS_NICE` security checks. In 5.10 and 5.11, it incurred a `PTRACE_MODE_ATTACH_FSCREDS` security check.
`process_mrelease`
Available since: kernel 5.15, glibc 2.36.
This is a relatively niche function, which you are unlikely to ever need unless writing a userspace OOM killer. It can be called against a process which is no longer alive, but whose virtual memory has not yet been released by the kernel, to cause the kernel to release said virtual memory faster.
`fstat` / `statx` for meaningful `stx_ino`
Available since: kernel 6.9, glibc 2.2.5 (`fstat`) or glibc 2.28 (`statx`).
It has always been possible to call `fstat` or `statx` on a pidfd, but prior to kernel 6.9, it was not useful to do so. Since 6.9, calling `statx` on a pidfd gives a meaningful `stx_ino`: the 64-bit inode number of a pidfd uniquely identifies a process, so two pidfds referencing the same process will have identical `stx_ino` values, while two pidfds referencing different processes will have different `stx_ino` values. The same is true for `fstat`, provided that `st_ino` is 64 bits wide. In other words, since 6.9, a process's inode number (as observed via a pidfd) is a unique 64-bit identifier for the process, which is never reused (until the system is restarted), and is unique even across different pid namespaces.
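A sketch of comparing two pidfds by inode number. It uses `fstat` rather than `statx` to keep the dependencies small, so it assumes that `st_ino` is 64 bits wide (as on 64-bit Linux), and the answer is only meaningful on kernel 6.9 or later. The helper names are hypothetical, and the fallback syscall number 434 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <stdint.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Stores the inode number of a pidfd in *ino_out. Returns 0 on success,
   -1 on error. Assumes st_ino is 64 bits wide. */
static int pidfd_inode(int pidfd, uint64_t *ino_out)
{
    struct stat st;
    if (fstat(pidfd, &st) != 0)
        return -1;
    *ino_out = (uint64_t)st.st_ino;
    return 0;
}

/* Returns 1 if both pidfds have the same inode number, 0 if they differ,
   -1 on error. Only meaningful on kernel 6.9+; per the text above, the
   result was not useful on earlier kernels. */
static int pidfds_same_process(int a, int b)
{
    uint64_t ino_a, ino_b;
    if (pidfd_inode(a, &ino_a) != 0 || pidfd_inode(b, &ino_b) != 0)
        return -1;
    return ino_a == ino_b ? 1 : 0;
}
```

A program that needs to rely on this would want to check the kernel version first, since nothing in the `fstat` result itself says whether the inode number is meaningful.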
It is likely that future kernel versions will add more things that can be done with (or to) a pidfd. As for the existing functionality, if you find yourself constrained by glibc version rather than kernel version, one option is to compile against a very recent glibc, then use polyfill-glibc to restore runtime compatibility with an older version of glibc.
In terms of future directions, some of the things that I'd like to see are:
- The ability for a pidfd to obtain the exit code and status of dead processes, not just zombie processes (cf. `GetExitCodeProcess` in Windows).
- The ability to mark a process as transitioning directly from alive to dead, without sitting in the zombie state until someone waits upon it. This would be similar to `SA_NOCLDWAIT`, but as a property of the child rather than a property of the parent. Combined with the previous point, the exit code and status would still be retrievable (by any holder of a relevant pidfd).
- Subject to a flag, closing a pidfd could cause the underlying process (and possibly all of its transitive descendants) to be terminated by the kernel.
- pidfd variants of `process_vm_readv` and `process_vm_writev`.