
Behind the scenes
by Martin Jungfer
Preview deployments are a bit tricky, especially if you don’t clean them up. I’ll tell you what preview deployments are and how we got them in shape.
Anyone who works with deployments of any kind knows that they're in need of constant updating – and if you update, you'll want to test as well (even if you don't want to, you have to anyway, but that's a different issue). When it comes to frontends, we have the added requirement that, every once in a while, multiple changes need to be tested at the same time, so we can't simply overwrite the existing test deployment. That's why we have preview deployments: a kind of frontend deployment intended to test a specific change, sometimes running in parallel with others. And since non-tech specialists should be able to have a look at these changes without having to get a degree first, they should be readily available to view in a browser.
We run our deployments on Kubernetes, or more specifically on Azure Kubernetes Service (AKS), which has its advantages but doesn't do everything for us. (We also run GKE, aka Google Kubernetes Engine, but not for this specific use case.) Cleaning up in particular isn't AKS's cup of tea either. That's not a problem for regular deployments, where newer versions overwrite older ones, but it certainly is for frontend previews, which shouldn't overwrite each other and whose number can therefore be quite unpredictable. You're supposed to stow them away neatly after use. But we couldn't be bothered when we were children, and since IT people are basically tall eight-year-olds, you shouldn't get your hopes up. As a result, we ended up with a ton of deployments, along with their related service and ingress resources, and nobody to do a regular clean-up. Could we maybe at least keep the pods – they're responsible for using CPU and memory, after all – at zero until they're needed?
Horizontal Pod Autoscaler to the rescue! If only it were that easy… it can't scale to zero – or rather, it can't scale a deployment back up from zero, because it needs at least one running pod whose CPU and RAM usage it can measure. So we had to keep every preview deployment constantly running with at least one pod, just in case someone wanted to look at it – and our frontend teams deploy very frequently. This costs a whole lot of resources that are then unavailable to other deployments and have to be accounted for in the form of more or larger nodes. However, those don't come for free. A different solution was needed, so we sought and found Knative.
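For illustration (all names below are made up), a typical HPA spec looks something like this – the sticking point is the minReplicas floor, which can't be set to 0 on a stock cluster without an alpha feature gate:

```yaml
# Hypothetical HPA for a preview frontend – placeholder names throughout.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-preview-pr-1234
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend-preview-pr-1234
  minReplicas: 1        # the HPA never goes below this – no scale-to-zero here
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```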
Red Hat says: kay-nay-tiv. We're using the Serving component; for eventing we rely on KEDA instead, but that is another story and shall be told another time. Right now, we're not using other Serving features such as traffic splitting, either.
Basically, a Knative Service looks a lot like your customary Deployment – Service – Ingress trinity, but it's built on Knative's own custom resource definitions and can scale up from zero as soon as requests start pouring in. This allows us to just go ahead and have every pull request build and roll out a preview deployment without allocating system resources to it – no matter whether anyone wants to check it out or not. Kourier, a dedicated Knative-specific ingress controller, helps route the requests to the desired instance without us having to configure anything manually.
This is what a regular deployment looks like:
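Sketched here with placeholder names, image and host rather than our actual manifests, but the shape is the same:

```yaml
# Classic trio: Deployment + Service + Ingress (placeholder names throughout).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-preview-pr-1234
spec:
  replicas: 1                       # at least one pod, whether anyone looks at it or not
  selector:
    matchLabels:
      app: frontend-preview-pr-1234
  template:
    metadata:
      labels:
        app: frontend-preview-pr-1234
    spec:
      containers:
        - name: frontend
          image: registry.example.com/frontend:pr-1234
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: frontend-preview-pr-1234
spec:
  selector:
    app: frontend-preview-pr-1234
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend-preview-pr-1234
spec:
  ingressClassName: nginx
  rules:
    - host: pr-1234.preview.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-preview-pr-1234
                port:
                  number: 80
```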
And here’s one created with Knative:
Of course, our feature teams aren’t keen on dealing with Knative or other infrastructure-related topics any more than absolutely necessary – and rightly so. We, Team Bender, are one of several platform engineering teams, responsible not only for Kubernetes, but also for making deployments as comfortable as possible for our feature teams. We therefore convened quite a while ago to create a collection of Helm charts to encapsulate all the disasterific details and make sure that every team only needs to have a simple values.yaml file containing the most important settings in their repository.
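Just to give you an idea, such a values.yaml boils down to something like the following – the keys here are invented for illustration and not our actual chart schema:

```yaml
# Hypothetical values.yaml a feature team might keep in its repository.
# The chart hides all the Knative and ingress details behind these few settings.
app:
  name: checkout-frontend
  image:
    repository: registry.example.com/checkout-frontend
  port: 8080
preview:
  enabled: true        # render pull-request builds as Knative Services
  maxScale: 3
resources:
  requests:
    cpu: 200m
    memory: 512Mi
```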
Providing Knative was essentially a two-step process: first, install and operate Knative Serving (together with Kourier) on our clusters; second, extend our Helm charts so that preview deployments are rendered as Knative Services instead of the usual Deployment, Service and Ingress.
The Kourier ingress controller in combination with some wildcard DNS helped us to make every preview deployment accessible through its pull request ID. So off we went.
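On the Knative side, that mostly boils down to telling Knative Serving to use a dedicated wildcard domain – sketched here with a placeholder domain; the pull request ID ends up in the service name and therefore in the generated hostname:

```yaml
# Knative Serving's config-domain ConfigMap decides which domain suffix
# generated routes get – here a dedicated preview domain (placeholder).
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  preview.example.com: ""   # empty value = use this domain for all services
```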
No problems, but challenges. Kourier, as an additional ingress controller, demanded some extra effort to set up. We usually use Nginx and would have preferred to keep it that way with Knative as well, because ingress rules are often tailored to a specific controller. Luckily, since our preview deployments all live on a wildcard subdomain that isn't used for anything else, we could easily route that traffic to Kourier without making Nginx feel like it ought to have any responsibility in this.
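In practice, the wildcard DNS record simply points at Kourier's load balancer while everything else keeps resolving to Nginx. Roughly like this – namespace and service name are those of a default Kourier install, so yours may differ:

```sh
# Look up the public IP of Kourier's LoadBalancer service
# (a default install puts it into the kourier-system namespace).
kubectl get svc kourier -n kourier-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Then a single wildcard DNS record routes only the preview traffic there:
#   *.preview.example.com  ->  <that IP>
# Everything else keeps pointing at the Nginx ingress controller.
```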
Unfortunately, the performance characteristics of Knative deployments differ slightly from those of our regular deployments, so we can only use them for load tests with a couple of caveats. We suspect the Kourier controller is responsible, as it does some buffering in places where Nginx doesn't. But this is little more than reading tea leaves at this point. As IT specialists, we naturally prefer coffee, but luckily we have little trouble sourcing suitable amounts of tea when the need arises.
Another small challenge was the fact that Knative doesn't support hostPath mounts. Datadog, the monitoring solution we use, needs them to mount its config into every pod, so we're currently living with a lack of metrics and alerts. Regular deployments don't use Knative, so it's not as bad as it sounds, but being able to spot possible issues already at the preview stage would have been nice. Though the details are still somewhat nebulous, we do have a basic idea of how to solve this.
The bigger challenge was the clean-up I already mentioned. Knative doesn’t do that on its own, either.
For every preview deployment, Knative creates multiple services so that older and newer versions can run in parallel. And because our motivation to clean up varies considerably, the frontend namespace ended up containing thousands of K8s resources. Kubernetes isn't particularly fond of that: it starts losing track of which service goes with which pods until you summon your inner Marie Kondō and do what must be done. We noticed this when the regular test frontend deployments stopped working as well, because Knative shares its namespace with them (not the brightest idea, as we soon found out). Apparently, keeping the pods at zero wasn't quite sufficient.
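Roughly speaking, every Knative Service fans out into a Configuration, a Route and one Revision per deployed version, and each Revision drags its own Deployment and backing Kubernetes Services along – which is how a single namespace ends up with thousands of objects. A quick way to make the fan-out visible (the namespace name is a placeholder):

```sh
# Count the Knative-level resources ...
kubectl get ksvc,configurations,revisions,routes -n frontend --no-headers | wc -l
# ... and the plain Kubernetes resources they generate.
kubectl get deployments,services -n frontend --no-headers | wc -l
```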
The solution? We wrote some shell scripts (a bit old-school, but it works) that are executed by regularly scheduled pipelines, so that all preview deployments are deleted once they’re older than 14 days or when the pull request that created them is completed. This keeps our namespaces neat and tidy, and Kubernetes has no more trouble doing its job.
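A stripped-down sketch of the age-based part – namespace, label selector and even the 14-day window are stand-ins here, and the "pull request completed" trigger lives in the pipeline itself:

```sh
#!/usr/bin/env bash
# Delete preview Knative Services older than MAX_AGE_DAYS (a sketch, not our actual script).
set -euo pipefail

NAMESPACE="frontend"            # placeholder namespace
SELECTOR="purpose=preview"      # placeholder label identifying preview deployments
MAX_AGE_DAYS=14
cutoff=$(date -d "-${MAX_AGE_DAYS} days" +%s)   # GNU date, as found on typical pipeline runners

kubectl get ksvc -n "$NAMESPACE" -l "$SELECTOR" -o json \
  | jq -r --argjson cutoff "$cutoff" '
      .items[]
      | select(.metadata.creationTimestamp | fromdateiso8601 < $cutoff)
      | .metadata.name' \
  | xargs -r -n1 kubectl delete ksvc -n "$NAMESPACE"
```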
A short look into the namespace (generously supported by kubectl and grep) shows that, as of now (1 June 2023, 12:52 PM), 43 preview deployments are dwelling in there, while the number of preview pods is three. 40 deployments are therefore running with zero pods – each of which would have had at least one in the era before Knative came along. Most of our cluster nodes are equipped with 16 CPU cores and 32 GiB of RAM, and all these preview pods that aren't running because they're not needed save us roughly half a node; in practical terms, it's more like an entire node. That doesn't sound like much, but it's a significant bit of saving in the end, and most likely just the tip of the iceberg. The number of pods that would have belonged to those preview deployments, and that we don't see anymore because we clean them up regularly, would most likely be quite a bit higher.
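As a rough back-of-the-envelope check: if each preview pod requests, say, 200m CPU and 512 MiB of RAM (an assumption – the actual requests vary per frontend), 40 idle pods would tie up about 8 CPU cores and 20 GiB of memory, which is indeed in the region of half of one of those 16-core/32-GiB nodes.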
To illustrate the point, here are two insightful graphs showing our savings based on the number of pods over time:
I know, there aren’t too many details in there, but at least I labelled the axes.
Of course, we’re nowhere near the end of our Knative implementation story – some of the aforementioned challenges are still ongoing, some things could be a bit more performant, and we might even find a better alternative to Knative tomorrow.
Have you ever had to deal with a use case like this one? How did you solve it, are you using Knative or something completely different? Ideas, remarks, questions, tomorrow’s lottery numbers – feel free to write a comment!