Kubernetes: zero-downtime rolling updates

A rolling update only makes sure that Kubernetes stops pods in a rolling fashion - one by one, always ensuring that the minimum desired number of pods is running. That may seem like enough for zero-downtime deployments. But as usual, it is not that simple.

During a rolling update Kubernetes has to terminate the old versions of pods - after all, that's what you want. But that's a problem when your pod is terminated in the middle of processing an HTTP request. And that can happen when your application is not cooperating with Kubernetes. In this article I will look at why that happens and how to fix it.

The problem

The problem became apparent to us when we recently moved one of our most connected services to our Kubernetes cluster. The move was seamless, but after a while we started seeing seemingly random connection errors to this service across our platform.

After some investigation we realized that whenever we deployed a new update to the service, there was a chance that some of the other services would fail some requests during the deployment.

To prove that hypothesis I ran a synthetic test using wrk, triggering a rolling update at the same time:

wrk https://www.server.com/api &
sleep 1 && kubectl set image deployment/api api=driftrock/api:2

The result confirmed the problem:

Running 10s test @ https://www.server.com/api
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   290.50ms  176.36ms   1.19s    87.78%
    Req/Sec    19.39      9.31    49.00     47.06%
  368 requests in 10.10s, 319.48KB read
  Non-2xx or 3xx responses: 19
Requests/sec:     36.44
Transfer/sec:     31.63KB

Let's look at the issue in more detail.

Kubernetes pod termination process

Let's first look at how Kubernetes terminates pods. It is crucial for understanding how an application should handle the termination.

According to the documentation (https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods), it follows these steps (more or less):

  1. If the pod has network services, Kubernetes stops routing new connections to the terminating pod. Already established connections are left intact and open.
  2. Kubernetes sends the TERM signal to the root process of each container in the pod, assuming the containers will start stopping. The signal it sends cannot be configured.
  3. It waits for the time period specified in the pod's terminationGracePeriodSeconds (30 seconds by default). If the containers are still running at this point, it sends KILL, terminating the containers without giving them a chance for one more breath.
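Steps 2 and 3 can be imitated locally with plain processes. This is only a rough sketch: a background shell that ignores TERM stands in for an uncooperative container, and GRACE_SECONDS is a hypothetical stand-in for terminationGracePeriodSeconds, shortened so the script finishes quickly.

```shell
#!/bin/sh
# Local imitation of "TERM, wait for the grace period, then KILL".
GRACE_SECONDS=3   # assumption: stand-in for terminationGracePeriodSeconds

# Stand-in "container": ignores TERM and keeps running
sh -c 'trap "" TERM; while :; do sleep 1; done' &
PID=$!
sleep 1           # give it a moment to install the trap

# Step 2: ask politely
kill -TERM "$PID"

# Step 3: wait up to the grace period for the process to go away
WAITED=0
while kill -0 "$PID" 2>/dev/null && [ "$WAITED" -lt "$GRACE_SECONDS" ]; do
  sleep 1
  WAITED=$((WAITED + 1))
done

# Still alive after the grace period: no more patience
if kill -0 "$PID" 2>/dev/null; then
  kill -KILL "$PID"
  OUTCOME="killed after ${WAITED}s"
else
  OUTCOME="exited within ${WAITED}s"
fi
wait "$PID" 2>/dev/null || true
echo "$OUTCOME"
```

Because the stand-in process never reacts to TERM, it survives the whole grace period and is then killed, which is exactly the situation the rest of this article tries to avoid.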

That looks good. What is the issue?

It seems that Kubernetes does everything needed - it stops sending new connections and gives the application 30 seconds to finish the work it is currently processing.

However, there are two important catches. The first is HTTP keep-alive connections and the second is how applications react to the TERM signal.

Keep-alive connections

As we saw above, Kubernetes keeps any established network connections intact during the termination. That makes sense - Kubernetes gives the application 30 seconds to tell its clients to disconnect or to send the response.

But if the application doesn't do anything, it will happily keep receiving new requests over the open connections until the 30 seconds run out. So when a client sends a new request just a millisecond before Kubernetes terminates the container with the KILL signal, the client will experience a dropped connection instead of receiving a response.

In the HTTP world this affects not only browser clients, where a few dropped connections may not be disastrous. Most applications run behind a load balancer (LB), and in most cases the LB connects to the backend using keep-alive connections. So this also affects API calls, even if those calls are usually made as separate connections from the client to the public-facing side of the load balancer.

TERM signal mishandling

Now you may be thinking - wait a moment, why doesn't the application stop? Why does it keep receiving new requests after receiving the TERM signal, until the last CPU instruction?

The problem is that different applications in different Docker image setups handle the signal in different ways.

In this investigation I noticed 3 different cases of how applications handle the TERM signal:

  • it ignores it - for example Elixir's Phoenix framework doesn't have built-in graceful termination (see here)
  • it does something else - for example nginx terminates itself immediately (i.e. non-gracefully) (see here)
  • or it is run within a shell in the Docker container (e.g. /bin/sh -c "rails s"), so the application doesn't receive the signal at all.

In any of those cases the keep-alive problem described above takes effect. Let's look at how to avoid it and make the applications ready for Kubernetes rolling updates.
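For contrast, a process that cooperates with TERM finishes the work it has in flight, takes no new work, and then exits on its own. A minimal local sketch of that behaviour, where the worker function, the 2-second "job" and the log file are all made up for illustration:

```shell
#!/bin/sh
# Hypothetical worker: on TERM it finishes the job in progress,
# starts no new one, and exits by itself.
LOG="${TMPDIR:-/tmp}/graceful_worker.$$.log"

worker() {
  stopping=0
  trap 'stopping=1' TERM      # only set a flag; the loop decides when to stop
  n=0
  while [ "$stopping" -eq 0 ]; do
    n=$((n + 1))
    sleep 2                   # stands in for handling one request or job
    echo "job $n finished"
  done
  echo "drained, exiting"
}

worker > "$LOG" &
WORKER_PID=$!

sleep 1                       # the first job is now in flight
kill -TERM "$WORKER_PID"      # what Kubernetes does on termination
wait "$WORKER_PID"            # returns once the worker has drained
cat "$LOG"
```

The log should show the in-flight job completing before the worker exits, with no second job started.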

Zero-downtime ready pod™

From the above we can get a sense of what needs to be satisfied in order for pods to be truly ready for rolling updates on Kubernetes. Let me try to put it in points:

  • If the pod has a network service, it needs to gracefully drain connections after receiving the TERM signal
  • If your application has built-in handling of the TERM signal, make sure that it actually receives the signal when it runs in a Docker container
  • Set terminationGracePeriodSeconds according to the expected maximum time needed to finish any job currently being processed. This depends on your application's workloads.
  • Have at least 2 (but ideally 3) replicas running, and set up the deployment to keep at least 1 pod running during rolling updates.

When all conditions are met, Kubernetes will take care of the rest and make sure the rolling updates are actually zero-downtime - both for network applications and for long-running background workloads.
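The last two conditions are plain deployment settings. As a rough sketch, the script below writes a patch file with illustrative values: the 60-second grace period, the surge and unavailability limits and the file name are assumptions, and the deployment name api is borrowed from the test command earlier in the article.

```shell
#!/bin/sh
# Sketch only: the numbers are illustrative, pick them for your own workload.
PATCH_FILE="${PATCH_FILE:-rolling-update-patch.yaml}"

cat > "$PATCH_FILE" <<'EOF'
spec:
  replicas: 3                  # at least 2, ideally 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # with 3 replicas, at least 2 stay available
      maxSurge: 1              # allow one extra pod above the replica count
  template:
    spec:
      # longest time any in-flight job is expected to need
      terminationGracePeriodSeconds: 60
EOF

echo "Wrote $PATCH_FILE. One way to apply it:"
echo "  kubectl patch deployment api --patch \"\$(cat $PATCH_FILE)\""
```

The script only writes the file and prints the command, so it can be reviewed before anything touches the cluster.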

Let's now look at how to achieve these conditions.

Draining connections (using nginx reverse proxy)

Unfortunately, not all applications can be (easily) updated to gracefully drain connections. As mentioned above, the Phoenix framework doesn't have built-in support. To avoid dealing with such discrepancies across different technologies, we can create a simple nginx image that handles this universally.

Reading the nginx documentation on how to drain connections tells me:

nginx can be controlled with signals to the main process:
TERM, INT    fast shutdown
QUIT    graceful shutdown

Great! It has a signal for graceful shutdown. But wait... We need to send it the QUIT signal to do so, but Kubernetes will only send TERM, which nginx also handles, but with a different result - terminating immediately.

To overcome this I wrote a small shell script that wraps the nginx process and translates the TERM signal into a QUIT signal:

#!/bin/sh

# Set up a trap for the TERM (= 15) signal, and send QUIT to nginx instead
trap "echo SIGTERM trapped. Signalling nginx with QUIT.; kill -s QUIT \$(cat /var/run/nginx.pid)" 15

# Start nginx in the background, so that the shell can run the trap as soon
# as TERM arrives (with a foreground child it would only run the trap after
# the child has exited)
nginx "$@" &

CHILD_PID=$!
echo "Nginx started with PID $CHILD_PID"

# Wait for the child in a loop - the trapped signal makes `wait` return
# early, even though nginx is still running. So keep waiting until the
# process no longer exists
while kill -s 0 "$CHILD_PID" 2>/dev/null; do wait "$CHILD_PID"; done

echo "Process $CHILD_PID exited. Exiting too..."

Adding just a simple default.conf with the reverse proxy configuration, we can write the Dockerfile for our proxy image:

FROM nginx:mainline-alpine

COPY default.conf /etc/nginx/conf.d/
ADD trap_term_nginx.sh /usr/local/bin/

CMD [ "/usr/local/bin/trap_term_nginx.sh", "-g", "daemon off;" ]
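The default.conf itself is not shown in this article. As a hedged sketch of what a minimal reverse proxy configuration for this setup could look like (not the file shipped in our image), the script below writes one that listens on port 80 and forwards to an application on port 8080 in the same pod; the forwarded headers are an assumption.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical default.conf for the proxy container.
CONF_FILE="${CONF_FILE:-default.conf}"

# Quoted 'EOF' keeps nginx variables such as $host from being expanded by the shell
cat > "$CONF_FILE" <<'EOF'
server {
    listen 80;                              # the proxy's own port

    location / {
        proxy_pass http://127.0.0.1:8080;   # the app container in the same pod
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
EOF

echo "Wrote $CONF_FILE"
```

Containers in one pod share a network namespace, which is why the proxy can reach the application on 127.0.0.1.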

Sending the TERM signal to the new container will result in nginx quitting gracefully - i.e. it waits until active requests have been responded to and does not accept any new requests over the existing connections. Then it stops itself.

We open-sourced the image and it can be found on Docker Hub: https://hub.docker.com/r/driftrock/https-redirect-proxy/

Ensuring the process gets the signal

Another condition I described above is to make sure the process actually receives the signal when we know the application is ready to handle it.

For example, sidekiq already handles TERM the way we want. It may seem like there is nothing extra to do. Unfortunately, in a Docker environment such as Kubernetes, one has to be extra careful not to make an unintentional mistake.

The problem arises when the container is set up to run its command using a shell. For example:

command: ["/bin/sh", "-c", "rake db:migrate && sidekiq -t 30"]

In this case the root process of the container will be /bin/sh. As described above, Kubernetes sends the signal to it. What is not clear at first look, however, is that the shell does nothing useful with the signal while a child process is running in it: it does not forward the signal to the child, and as the container's root process with no TERM handler of its own it is not stopped by the signal either. As a result the signal never reaches our application - sidekiq in the example above.

There are two ways to fix this. The simple way is to instruct the shell to replace itself with the last command, using the exec command:

command: ["/bin/sh", "-c", "rake db:migrate && exec sidekiq -t 30"]
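The effect of exec can be checked locally. In this small sketch `true` and `sleep 3` are stand-ins for the migration and the application, and the script looks at which program ends up running under the PID that was started. Note that some shells already replace themselves with the last command of a -c string on their own, so the first line of output can differ between systems.

```shell
#!/bin/sh
# Without exec: the shell usually stays around as the parent of sleep
sh -c 'true && sleep 3' &
PLAIN_PID=$!

# With exec: the shell replaces itself, so sleep runs under the same PID
sh -c 'true && exec sleep 3' &
EXEC_PID=$!

sleep 1   # let both commands get going

# Name of the program currently running under each PID
PLAIN_COMM=$(ps -o comm= -p "$PLAIN_PID" | tr -d ' ')
EXEC_COMM=$(ps -o comm= -p "$EXEC_PID" | tr -d ' ')

echo "without exec, the started PID runs: $PLAIN_COMM"
echo "with exec, the started PID runs:    $EXEC_COMM"

# Clean up the stand-in processes
kill "$PLAIN_PID" "$EXEC_PID" 2>/dev/null
wait 2>/dev/null || true
```

With exec, a signal sent to the started PID goes straight to the application, because there is no shell left in between.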

But if you can, it is best to avoid using a shell wrapper at all. Run the command directly as the first process and use Kubernetes init containers for the commands you want to run before the application starts.
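As a sketch of the init container variant, the script below writes a pod template fragment in which the migration runs in an init container and sidekiq is started directly, without a shell. APP-NAME and APP-IMAGE are placeholders, and the init container name migrate and the file name are made up for illustration.

```shell
#!/bin/sh
# Sketch only: a hypothetical pod template fragment, to be merged into your own deployment.
FRAGMENT_FILE="${FRAGMENT_FILE:-pod-template-fragment.yaml}"

cat > "$FRAGMENT_FILE" <<'EOF'
spec:
  template:
    spec:
      initContainers:
      # runs to completion before the application container starts
      - name: migrate
        image: APP-IMAGE
        command: ["rake", "db:migrate"]
      containers:
      # no shell wrapper: sidekiq is the root process and receives TERM itself
      - name: APP-NAME
        image: APP-IMAGE
        command: ["sidekiq", "-t", "30"]
EOF

echo "Wrote $FRAGMENT_FILE"
```

One design difference from the shell version: the migration now runs once per pod start in its own container, so a failed migration keeps the application container from starting at all.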

Setting up the Zero-downtime ready pod™

Now that we have a proxy to handle keep-alive connections for HTTP applications, and we know how to make sure other applications receive the TERM signal and stop gracefully, we can configure our Zero-downtime ready pod™.

The first step is to set up the nginx proxy. Add it as another container to your pod. The proxy assumes your application is listening on port 8080, and it will itself listen on port 80. If your services are already configured for port 80, you don't have to do anything else, just add the container (side note: we find it useful to give the container the same name as the app container, with a -proxy suffix).

...
  containers:
  ...
  - name: APP-NAME-proxy
    image: driftrock/https-redirect-proxy
    ports:
    - containerPort: 80
  ...
...

The second step is to ensure the application is ready to receive TERM signals. As I described above, we can do:

...
  containers:
  - name: APP-NAME
    command: ['/bin/sh', '-c', 'rake db:migrate && exec puma -p 8080']
    ...
  ...
...

And that's all. Of course, depending on your specific details, you may need to do something slightly different here and there. I hope you get the idea, though.

Result

To test this we updated the pods with the new proxy image, then started the benchmark and made Kubernetes run the rolling update again:

Running 10s test @ https://www.server.com/api
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   249.36ms   95.76ms 842.78ms   90.19%
    Req/Sec    20.04      9.21    40.00     54.05%
  403 requests in 10.10s, 349.87KB read
Requests/sec:     39.91
Transfer/sec:     34.65KB

Now all benchmark requests finished successfully, even while pods were being terminated and spun up during the test rolling update.

Release the chaos monkey

After that I thought about how to stress test this. When a Kubernetes deployment conducts a rolling update, it just does plain pod termination in the background, the same as you can do with kubectl delete pod [POD_NAME]. In fact, the termination steps described above are taken from the documentation on pod termination, not on deployment rolling updates.

Given that, I was interested in whether the new setup would cope with simply killing the pods in a loop (just making sure there is at least one pod running all the time), giving them only a really short time to live. In theory it should work. A pod starts, receives perhaps one request and starts processing it. At the same time it receives the TERM signal as my chaos monkey tries to kill it. The request will be finished and no new ones will be routed to it.

Let's see what happens:

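wrk -d 120s https://www.server.com/api &

while true; do
  sleep 5
  # Names of pods that are not being deleted and whose containers are all ready
  READY_PODS=$(kubectl get pods -l app=api-server -o json | jq -r ".items | map(select(.metadata | has(\"deletionTimestamp\") | not)) | map(select(.status.containerStatuses | map(.ready) | all)) | .[].metadata.name")
  # Shuffle the names and spare the first one, so at least one ready pod keeps running
  # (the quotes keep one name per line, otherwise nothing would be selected)
  EXTRA_READY_PODS=$(echo "$READY_PODS" | ruby -e 'puts STDIN.readlines.shuffle[1..-1]' | tr '\n' ' ')
  # Left unquoted on purpose: each pod name becomes its own argument
  kubectl delete pods $EXTRA_READY_PODS
  kubectl get pods -l app=api-server
done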

This gives the result:

Running 120s test @ https://www.server.com/api
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   953.28ms   513.90ms 3938.00ms   90.19%
    Req/Sec    10.49      9.21    40.00     54.05%
  1261 requests in 120.209s, 21.022MB read
Requests/sec:     10.49
Transfer/sec:     175.48KB

This shows that the setup doesn't drop a single connection, even when pods are terminated as soon as they become ready and start receiving requests. SUCCESS!

Conclusion

So, in summary, we found a few simple principles that have to be met in order to make Kubernetes work for us and maintain zero-downtime deployments. The solution we found works even when put under stress.

Thanks for reading and let us know what you think. If you have another way to solve this issue we would love to know too!