Background
Last week I posted about a program that was behaving strangely because it wasn’t properly cleaning up its child processes after receiving the SIGTERM signal. The fix was to wait for the child processes to exit before exiting from the parent, but I never figured out what was happening to orphaned processes after the parent (PID 1 in its namespace) exited. Even after cleaning up the code and fixing the bug, I still didn’t know what was actually happening to the orphaned child process running in Docker after its parent died. Here is a braindump of how I figured this out.
Edit: I figured out why the parent wasn’t shutting down correctly
I was launching new Python processes with the multiprocessing module. As part of its setup, multiprocessing registers a cleanup function with atexit. That cleanup function calls .join() on every child process that is still running, which makes our parent process sleep indefinitely until the children are done running. See multiprocessing/util.py for the details.
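The real cleanup logic lives in multiprocessing/util.py, but conceptually it behaves something like this sketch (a paraphrase, not the actual library code):

import atexit
import multiprocessing

def _exit_function():
    # Rough sketch of what multiprocessing registers: before the interpreter
    # shuts down, join every child process that is still running.
    for child in multiprocessing.active_children():
        child.join()  # blocks forever if a child never exits

atexit.register(_exit_function)

Because sys.exit() only raises SystemExit, the interpreter still runs atexit handlers on the way out, and this one never returns if a child refuses to die.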
I’ll leave the rest of this post up just in case anyone finds the BPF programs I used interesting.
Setup
Here’s a minimal Dockerfile and the two Python programs I used to investigate this. The parent process handles SIGTERM, passes the signal along to its child, and then calls sys.exit(0). The child process also handles the signal, but doesn’t exit.
You can view all of the code needed to reproduce this problem here.
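If you don’t want to click through, the two programs boil down to something like the following sketch (condensed into one file for illustration; the real code in the repo is organized slightly differently):

import os
import signal
import sys
import time
from multiprocessing import Process

def child_main():
    # The "child" program: handle SIGTERM, but never actually exit.
    signal.signal(
        signal.SIGTERM,
        lambda signum, frame: print(
            f"Child {os.getpid()}[ppid={os.getppid()}] received SIGTERM..."
        ),
    )
    print(f"Hello from child {os.getpid()}")
    while True:
        time.sleep(0.5)

def handle_sigterm(signum, frame):
    print("Gracefully shutting down")
    print(f"Shutting down process {child.pid}")
    os.kill(child.pid, signal.SIGTERM)  # pass the signal along to the child
    sys.exit(0)                         # exit without waiting for the child

if __name__ == "__main__":
    print(f"Parent process started: {os.getpid()}")
    child = Process(target=child_main)
    child.start()
    signal.signal(signal.SIGTERM, handle_sigterm)
    while True:
        print("Parent still alive!")
        time.sleep(1)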
Initial Investigation
docker top is a quick way to see the PIDs of any processes running in Docker. It shows the host (non-namespaced) PID and PPID, which is useful because inside the container the first process spawned sees itself as PID 1. Let’s take a look at our containerized process:
# Terminal 1
$ docker run --rm -ti bad-parent
Parent process started: 1
Hello from child 6
Parent still alive!
Parent still alive!
Parent still alive!
...
# Terminal 2
$ docker top c44466ae62ec
UID PID PPID ... CMD
root 294147 296942 ... python parent.py
root 294200 294147 ... python parent.py
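As an aside, the NSpid line in /proc/&lt;pid&gt;/status maps a host PID to its PID in each nested namespace, which is a handy way to confirm that 294147 really is PID 1 inside the container. A quick Python sketch (the PID is just the one from this run):

def nspid(pid: int):
    # Read the NSpid line, which lists the process's PID in each PID
    # namespace it belongs to, outermost first.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                return [int(field) for field in line.split()[1:]]
    return []

print(nspid(294147))  # should print something like [294147, 1]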
Now let’s take a look at PID 296942, the parent process of our container, noting that names inside {} are threads:
$ pstree -ps -U 296942
systemd(1)───containerd(1171)───containerd-shim(296942)─┬─python(294147)───python(294200)
├─{containerd-shim}(296943)
├─{containerd-shim}(296944)
├─{containerd-shim}(296945)
└─...
To see what processes are receiving which signals, I used the signals.bt script introduced in BPF Performance Tools. I made a small modification to allow filtering based on the process name:
tracepoint:signal:signal_generate
{
    // Only count signals for the process name passed as the first
    // positional parameter; "0" (the default) matches every process.
    if (str($1) == "0" || args->comm == str($1)) {
        // @sig maps signal numbers to names (populated elsewhere in signals.bt)
        @[@sig[args->sig], args->pid, args->comm] = count();
    }
}
Next I found something surprising: after sending the Docker container a SIGTERM, the parent calls sys.exit(0) and stops executing the Python script, but the parent process actually stays alive!
# Terminal 1
$ docker kill --signal=SIGTERM c44466ae62ec
c44466ae62ec
# Running docker process
$ docker run --rm -ti bad-parent
Parent process started: 1
Hello from child 6
Parent still alive!
...
Gracefully shutting down
Shutting down process 6
Child 6[ppid=1] received SIGTERM...
And here’s the output of signals.bt for the containerd-shim process:
# Terminal 1
$ sudo signals.bt "containerd-shim"
Attaching 3 probes...
Counting signals. Hit Ctrl-C to end.
^C
@[SIGNAL, PID, COMM] = COUNT
@[SIGCHLD, 296943, containerd-shim]: 1
# Terminal 2
$ sudo signals.bt "python"
Attaching 3 probes...
Counting signals. Hit Ctrl-C to end.
^C
@[SIGNAL, PID, COMM] = COUNT
@[SIGTERM, 294147, python]: 1
@[SIGTERM, 294200, python]: 1
The one clue I see here is that a containerd-shim thread, 296943, received a SIGCHLD signal. My guess is that containerd-shim spawns a number of threads to handle signals, though it’s unclear whether it lets the kernel route signals to an arbitrary thread (the default behavior for a multithreaded process) or blocks signals in all but one thread.
To find out, we can take a look at the /proc/${PID}/status file, which contains a human-readable process status, including signal masks. Let’s look at it for the thread that received the SIGCHLD and one of its siblings:
$ cat /proc/296943/status
Name: containerd-shim
Umask: 0022
State: S (sleeping)
...
SigCgt: fffffffe7fc1feff
...
$ cat /proc/296945/status
Name: containerd-shim
Umask: 0022
State: S (sleeping)
...
SigCgt: fffffffe7fc1feff
...
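SigCgt is a 64-bit bitmask in which bit n-1 is set when the thread has a handler installed for signal n. A small Python sketch to decode the mask from the output above:

import signal

def decode_sigmask(hex_mask: str):
    # Return the names of all signals whose bits are set in a /proc status mask.
    mask = int(hex_mask, 16)
    names = []
    for signum in range(1, 65):
        if mask & (1 << (signum - 1)):
            try:
                names.append(signal.Signals(signum).name)
            except ValueError:
                names.append(f"signal {signum}")  # real-time signals have no enum name
    return names

print(decode_sigmask("fffffffe7fc1feff"))

Both threads report the identical mask, so the caught-signal set alone doesn’t pin signal handling to a single thread; the per-thread SigBlk (blocked) mask would be the place to look for that.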
Next, we can take a look at what is happening with our process, which called sys.exit(0) but is still somehow alive:
$ head -n 10 /proc/296945/status
Name: python
Umask: 0022
State: S (sleeping)
Tgid: 296945
Ngid: 0
Pid: 296945
It looks like the process just went to sleep. We do have a call to time.sleep() that should be firing every half second or so. Let’s see what syscalls, if any, our process is making, using syscount-bpfcc:
$ sudo syscount-bpfcc -p 296954
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[09:33:00]
SYSCALL COUNT
Detaching...
This output shows absolutely no syscalls during a 10-second period. So it’s pretty clear that something has taken control of our process and put it to sleep!
What is happening during container shutdown?
Let’s run syscount-bpfcc on the parent containerd-shim process before and during container shutdown to see what’s happening:
# Before shutdown
$ sudo syscount-bpfcc -p 696458
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[09:38:23]
SYSCALL COUNT
nanosleep 69
futex 46
read 16
epoll_pwait 16
epoll_wait 8
write 8
Detaching...
# During shutdown
$ sudo syscount-bpfcc -p 696458
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[09:39:42]
SYSCALL COUNT
nanosleep 386
futex 163
epoll_pwait 57
read 53
write 25
epoll_wait 23
fcntl 8
epoll_ctl 6
close 5
newfstatat 4
Looking at the four syscalls that only appear during shutdown (fcntl, epoll_ctl, close, and newfstatat), I don’t think any of them would be responsible for holding our process in a sleep state.
What is happening during container startup?
Containers are started with the clone system call. Let’s trace it to see what is starting our containers:
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_clone { printf("PID %d %s\n", pid, comm); }'
PID 1171 containerd
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438914 containerd-shim
PID 1438923 runc
PID 1438923 runc
PID 1438923 runc
PID 1438923 runc
PID 1438923 runc
PID 1438923 runc
PID 1438923 runc
PID 1438929 runc:[0:PARENT]
PID 1438931 runc:[1:CHILD]
PID 1438933 runc:[2:INIT]
PID 1438933 runc:[2:INIT]
PID 1438933 runc:[2:INIT]
PID 1438933 runc:[2:INIT]
At this point it looks like runc may be responsible for this weird behavior, and not containerd.
Running our container with just runc
To eliminate the possibility that containerd has anything to do with this weird behavior, let’s run our container with just runc and see what happens. I followed the instructions on the runc GitHub repository for running containers. Here’s what happened:
$ sudo runc create bad-parent
$ sudo runc start bad-parent
Parent process started: 1
Parent still alive!
Hello from child 7
Parent still alive!
...
$ sudo runc kill bad-parent
Gracefully shutting down 1
Shutting down process 7
Child 7[ppid=1] received SIGTERM...
$ sudo runc list
ID PID STATUS BUNDLE CREATED OWNER
bad-parent 1444776 running /home/petermalmgren/code/docker-orphan-child-test 2020-11-11T13:21:54.872566227Z root
$ sudo runc kill bad-parent
Gracefully shutting down 1
Shutting down process 7
Child 7[ppid=1] received SIGTERM...
$ sudo runc list
ID PID STATUS BUNDLE CREATED OWNER
bad-parent 0 stopped /home/petermalmgren/code/docker-orphan-child-test 2020-11-11T13:21:54.872566227Z root
In a separate terminal I ran:
$ sudo signals.bt "python"
Attaching 3 probes...
Counting signals. Hit Ctrl-C to end.
^C
@[SIGNAL, PID, COMM] = COUNT
@[SIGKILL, 1444823, python]: 1
@[SIGTERM, 1444776, python]: 1
@[SIGTERM, 1444823, python]: 1
Checking syscalls of the Python process
I forgot to do something obvious above, which was to check the syscalls of the Python process as it exits. Here’s what that looks like:
$ sudo syscount-bpfcc -p 1491744
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[13:01:31]
SYSCALL COUNT
write 9
select 8
wait4 2
getpid 2
kill 1
wait4 is what processes use to wait on their children. After some quick investigation, I realized that the multiprocessing module registers an atexit handler that waits indefinitely for children to finish. This is why our process shows up as sleeping.
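That also explains the fix from the previous post: if the parent’s SIGTERM handler actually waits for (and, if necessary, kills) its children before exiting, the atexit hook has nothing left to join. A rough sketch of that shape, with an arbitrary five-second grace period:

import signal
import sys
import time
from multiprocessing import Process, active_children

def handle_sigterm(signum, frame):
    print("Gracefully shutting down")
    for child in active_children():
        child.terminate()        # forward SIGTERM to every child
    for child in active_children():
        child.join(timeout=5)    # give it a grace period to exit cleanly
        if child.is_alive():
            child.kill()         # escalate to SIGKILL if it ignores SIGTERM
            child.join()
    sys.exit(0)                  # nothing left for multiprocessing's atexit hook to wait on

def child_main():
    while True:
        time.sleep(0.5)

if __name__ == "__main__":
    Process(target=child_main).start()
    signal.signal(signal.SIGTERM, handle_sigterm)
    while True:
        time.sleep(1)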
I went down the wrong rabbit hole
I probably should’ve checked the behavior of sys.exit(0) and multiprocessing first before assuming there was something wrong with runc or containerd!