Kubernetes: zero-downtime rolling updates
Rolling Update only ensures that Kubernetes stops pods in a rolling fashion - one by one, always keeping the minimum desired number of pods running. That may seem like enough for zero-downtime deployments. But as usual, it is not that simple.
During a rolling update Kubernetes has to terminate the old versions of pods - after all, that's what you want. But that's a problem when your pod is terminated in the middle of processing an HTTP request. And that can happen when your application doesn't cooperate with Kubernetes. In this article I will look at why that happens and how to fix it.
The problem became apparent to us when we recently moved one of our most heavily used services to our Kubernetes cluster. The move was seamless, but after a while we started seeing seemingly random connection errors to this service across our platform.
After some investigation we realized that whenever we deployed an update to the service, there was a chance that some of the other services would fail some requests during the deployment.
To prove that hypothesis I ran a synthetic test using `wrk` while triggering the rolling update at the same time:
```sh
wrk https://www.server.com/api & sleep 1 && kubectl set image deployment/api api=driftrock/api:2
```
The result confirmed the problem:
```
Running 10s test @ https://www.server.com/api
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   290.50ms  176.36ms   1.19s    87.78%
    Req/Sec    19.39      9.31     49.00     47.06%
  368 requests in 10.10s, 319.48KB read
  Non-2xx or 3xx responses: 19
Requests/sec:     36.44
Transfer/sec:     31.63KB
```
Let's look at the issue in more detail further in this article.
Kubernetes pod termination process
Let's first look at how Kubernetes terminates pods. It is crucial to understand how an application should handle the termination.
According to the documentation (https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods) it follows these steps (more or less):
- If the pod has network services, Kubernetes stops routing new connections to the terminating pod. Already established connections are left intact and open.
- Kubernetes sends a `TERM` signal to the root process of each container in the pod, assuming the containers will start stopping. The signal it sends cannot be configured.
- It waits for the time period specified in the pod's `terminationGracePeriodSeconds` (30 seconds by default). If the containers are still running at this point, it sends `KILL`, terminating the containers without giving them a chance for one more breath.
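The grace period is set per pod spec; here is a minimal sketch of where it lives (the names and values are illustrative, not from the article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # illustrative name
spec:
  # give containers up to 60s to finish in-flight work after TERM
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      image: driftrock/api:2
      ports:
        - containerPort: 8080
```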
That looks good. What is the issue?
It seems that Kubernetes does everything needed - it stops sending new connections and gives the application 30 seconds to finish the work it is currently processing.
However, there are two important catches. The first is HTTP keep-alive connections and the second is how applications handle the `TERM` signal.
As we saw above, Kubernetes keeps any established network connections intact during termination. That makes sense - Kubernetes gives the application 30 seconds to tell its clients to disconnect or to send the response.
But if the application doesn't do anything, it will happily keep receiving new requests over the open connection until the 30 seconds run out. So when a client sends a new request just a millisecond before Kubernetes terminates the container with the `KILL` signal, the client will experience a dropped connection instead of receiving a response.
In the HTTP world this affects not only browser clients, where a few dropped connections may not be disastrous. Most applications run behind a load balancer, and in most cases the LB connects to the backend using a keep-alive connection. This then also affects API calls, even if those calls are usually made as separate connections from the client to the public-facing side of the load balancer.
TERM signal mishandling
Now you may be thinking - wait a moment, why doesn't the application stop? Why does it keep receiving new requests after the `TERM` signal, until the last CPU instruction?
The problem is that different applications, within different Docker image setups, handle the signal in different ways.
In this investigation I noticed 3 different cases of how an application can handle the `TERM` signal:
- it ignores it - for example Elixir's Phoenix framework doesn't have built-in graceful termination (see here)
- it does something else - for example `nginx` terminates itself immediately (i.e. non-gracefully) (see here)
- or it is run within a shell in the Docker container (e.g. `/bin/sh -c "rails s"`), so the application doesn't receive the signal at all.
Any of these cases means that the keep-alive problem described above will take effect. Let's look at how to avoid it and make the applications ready for Kubernetes rolling updates.
Zero-downtime ready pod™
From the above we can get a sense of what needs to be satisfied for pods to be truly ready for rolling updates on Kubernetes. Let me try to put it into points:
- If the pod has a network service, it needs to gracefully drain connections after receiving the `TERM` signal.
- If your application has built-in handling of the `TERM` signal, make sure it actually receives the signal when running in a Docker container.
- Set `terminationGracePeriodSeconds` according to the expected maximum time to finish any job currently being processed. This depends on your application's workloads.
- Have at least 2 (but ideally 3) replicas running, and set up the deployment to keep at least 1 pod running during rolling updates.
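In Deployment terms, the last point maps to `replicas` and the rolling-update strategy; a sketch with illustrative values (not from the article):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # keep at least replicas-1 pods running during the update
      maxSurge: 1
  ...
```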
When all conditions are met, Kubernetes takes care of the rest and makes sure the rolling updates are actually zero-downtime - both for network applications and for long-running background workloads.
Let's now look at how to achieve these conditions.
Draining connections (using nginx reverse proxy)
Unfortunately not all applications can be (easily) updated to gracefully drain connections. As mentioned above, the Phoenix framework doesn't have built-in support. To avoid dealing with such discrepancies across different technologies, we can create a simple nginx image that handles this universally.
Reading the documentation on how to drain connections tells me that `nginx` can be controlled with signals to the main process:

```
TERM, INT    fast shutdown
QUIT         graceful shutdown
```

Great! It has a signal for graceful shutdown. But wait... We need to send it the `QUIT` signal to do so, while Kubernetes will only send `TERM` - which nginx also understands, but with a different result: terminating fast and non-gracefully.
To overcome this I wrote a small shell script that wraps the nginx process and translates the `TERM` signal into `QUIT`:
```sh
#!/bin/sh
# Set up a trap for the TERM (= 15) signal, and send QUIT to nginx instead
trap "echo SIGTERM trapped. Signalling nginx with QUIT.; kill -s QUIT \$(cat /var/run/nginx.pid)" 15

# Start the command in the background - so the shell stays in the foreground
# and receives signals (otherwise it would ignore them)
nginx "$@" &

CHILD_PID=$!
echo "Nginx started with PID $CHILD_PID"

# Wait for the child in a loop - apparently the QUIT signal to nginx causes `wait` to
# return even if the process is still running, so keep looping until the process exits
while kill -s 0 $CHILD_PID; do wait $CHILD_PID; done

echo "Process $CHILD_PID exited. Exiting too..."
```
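The trap-and-wait pattern can be sanity-checked locally with a dummy child in place of nginx. This is only a sketch of the same mechanism - the script path and the `sleep` child are stand-ins I made up, not part of the real image:

```shell
#!/bin/sh
# Write a tiny wrapper that mimics the nginx script: trap TERM, signal the
# child, then keep waiting until it exits (here the "child" is just sleep).
cat > /tmp/trap_demo.sh <<'EOF'
#!/bin/sh
trap 'echo "TERM trapped, signalling child"; kill $CHILD_PID' 15
sleep 30 &
CHILD_PID=$!
while kill -s 0 $CHILD_PID 2>/dev/null; do wait $CHILD_PID; done
echo "child exited"
EOF
chmod +x /tmp/trap_demo.sh

# Run the wrapper in the background, give it a moment, then send TERM
/tmp/trap_demo.sh > /tmp/trap_demo.out &
WRAPPER_PID=$!
sleep 1
kill -TERM $WRAPPER_PID
wait $WRAPPER_PID 2>/dev/null
cat /tmp/trap_demo.out
```

Sending `TERM` to the wrapper makes it signal the child and then exit cleanly once the child is gone - exactly what we want to happen with nginx.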
Adding a simple `default.conf` with the reverse proxy configuration, we can build the Dockerfile for our proxy image:
```dockerfile
FROM nginx:mainline-alpine
COPY default.conf /etc/nginx/conf.d/
ADD trap_term_nginx.sh /usr/local/bin/
CMD [ "/usr/local/bin/trap_term_nginx.sh", "-g", "daemon off;" ]
```
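The `default.conf` itself is not shown in the article; a minimal reverse-proxy configuration for it could look roughly like this (a sketch, assuming the app container listens on port 8080 as described later):

```nginx
server {
    listen 80;

    location / {
        # the app container in the same pod, reachable via localhost
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```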
Sending the `TERM` signal to the new container will now result in nginx quitting gracefully - i.e. it waits until active requests are answered while not accepting any new requests over the existing connections. Then it stops itself.
We open-sourced the image and it can be found on Docker Hub: https://hub.docker.com/r/driftrock/https-redirect-proxy/
Ensuring the process gets the signal
Another condition I described above is to make sure the process actually receives the signal when we know the application is ready to handle it.
Some applications handle this out of the box - `sidekiq`, for example, already handles `TERM` the way we want.
This may seem like there is nothing extra to do. Unfortunately, in a Docker environment such as Kubernetes, one has to be extra careful not to make an unintentional mistake.
The problem arises when the container is set up to run the command via a shell. For example:
```yaml
command: ["/bin/sh", "-c", "rake db:migrate && sidekiq -t 30"]
```
In this case the root process of the container will be `/bin/sh`, and as described above, Kubernetes sends the signal to it. What is not clear at first sight, however, is that a UNIX shell ignores signals while a child process is running in it. It neither forwards the signal to the child nor does anything else with it. The result is that the signal never reaches our `sidekiq` in the example above.
There are two ways to fix this. The simple way is to instruct the shell to replace itself with the last command using `exec`:
```yaml
command: ["/bin/sh", "-c", "rake db:migrate && exec sidekiq -t 30"]
```
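The effect of `exec` is easy to observe locally. In this small experiment of mine (not from the article), the process we hold a PID for - and would therefore signal - is the shell in the first case, and the command itself in the second:

```shell
# Without exec: the shell forks `sleep` as a child and keeps waiting,
# so the PID we hold (and would signal) belongs to the shell
/bin/sh -c 'sleep 10; echo done' &
NO_EXEC_PID=$!

# With exec: the command replaces the shell and inherits its PID
/bin/sh -c 'true; exec sleep 10' &
EXEC_PID=$!

sleep 1
NO_EXEC_COMM=$(ps -o comm= -p "$NO_EXEC_PID")
EXEC_COMM=$(ps -o comm= -p "$EXEC_PID")
echo "without exec: $NO_EXEC_COMM, with exec: $EXEC_COMM"

# clean up the background processes
kill "$NO_EXEC_PID" "$EXEC_PID" 2>/dev/null
```

In the `exec` case, `ps` reports the process as `sleep`, so a `TERM` from Kubernetes would land directly on the application.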
But if you can, it is best to avoid the shell wrapper altogether. Run the command directly as the first process and use Kubernetes init containers for the commands you want to run before the application starts.
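That variant could be sketched like this (the image name and the migration command are assumptions for illustration):

```yaml
...
spec:
  initContainers:
    # run one-off setup before the app container starts
    - name: migrations
      image: APP-IMAGE
      command: ['rake', 'db:migrate']
  containers:
    - name: APP-NAME
      image: APP-IMAGE
      # sidekiq runs as the container's root process and receives TERM directly
      command: ['sidekiq', '-t', '30']
...
```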
Setting up the Zero-downtime ready pod™
Now that we have a proxy to handle keep-alive connections for HTTP applications, and we know how to make sure other applications receive the `TERM` signal to stop gracefully, we can configure our Zero-downtime ready pod™.
First, set up the nginx proxy. Add it as another container to your pod. The proxy assumes your application is listening on port `8080`, and it will itself listen on port `80`. If your services are already configured for port `80`, you don't have to do anything else - just add the container (side note: we find it useful to name the container the same as the app container, with a `-proxy` suffix):
```yaml
...
containers:
  ...
  - name: APP-NAME-proxy
    image: driftrock/https-redirect-proxy
    ports:
      - containerPort: 80
  ...
...
```
The second thing to do is to ensure the application is ready to receive `TERM` signals. As described above, we can do:
```yaml
...
containers:
  - name: APP-NAME
    command: ['/bin/sh', '-c', 'rake db:migrate && exec puma -p 8080']
    ...
...
```
And that's all. Of course, given your specific setup, you may need to do something slightly different here and there, but I hope you get the idea by now.
To test this we updated the pods with the new proxy image, then started the benchmark and made Kubernetes run the rolling update again:
```
Running 10s test @ https://www.server.com/api
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   249.36ms   95.76ms 842.78ms   90.19%
    Req/Sec    20.04      9.21     40.00     54.05%
  403 requests in 10.10s, 349.87KB read
Requests/sec:     39.91
Transfer/sec:     34.65KB
```
Now all benchmark requests finished successfully, even while pods were being terminated and spun up during the rolling update.
Release the chaos monkey
After that I wondered how to stress test this. In fact, when a Kubernetes deployment conducts a rolling update, it does plain pod termination under the hood - the same as you can do with `kubectl delete pod [POD_NAME]`. Indeed, the termination steps described above are taken from the documentation section called "Pod termination process", not "Deployment rolling update".
Given that, I was interested in whether the new setup would cope with simply killing pods in a loop (while making sure there is at least one pod running at all times), giving them only a really short time to live.
In theory it should work: a pod starts, receives perhaps one request and starts processing it. At the same time it receives the `TERM` signal as my chaos monkey tries to kill it. The request gets finished, and no new ones are routed to the pod.
Let's see what happens:
```sh
wrk -t 120 https://www.server.com/api &
while true; do
  sleep 5
  READY_PODS=$(kubectl get pods -l app=api-server -o json | jq -r ".items | map(select(.metadata | has(\"deletionTimestamp\") | not)) | map(select(.status.containerStatuses | map(.ready) | all)) | .[].metadata.name")
  EXTRA_READY_PODS=$(echo $READY_PODS | ruby -e 'puts STDIN.readlines.shuffle[1..-1]' | tr '\n' ' ')
  /bin/sh -c "kubectl delete pods $EXTRA_READY_PODS"
  kubectl get pods -l app=api-server
done
```
```
Running 120s test @ https://www.server.com/api
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   953.28ms  513.90ms 3938.00ms  90.19%
    Req/Sec    10.49      9.21     40.00     54.05%
  1261 requests in 120.209s, 21.022MB read
Requests/sec:     10.49
Transfer/sec:    175.48KB
```
This shows that the setup doesn't drop a single connection, even when pods are terminated as soon as they become ready and start receiving requests. SUCCESS!
So, in summary, we found a few simple principles that have to be met to make Kubernetes work for us and maintain zero-downtime deployments. The solution we found holds up even under stress.
Thanks for reading and let us know what you think. If you have another way to solve this issue we would love to know too!