Behind the scenes

How Knative simplifies our frontend previews

Matthias Renaud
8.6.2023
Pictures: Christian Walker

Preview deployments are a bit tricky, especially if you don’t clean them up. I’ll tell you what preview deployments are and how we got them in shape.

Anyone who works with deployments of any kind knows that they’re in need of constant updating – and if you update, you’ll want to test as well (even if you don’t want to, you have to anyway, but that’s a different issue). When it comes to frontends, we have the added requirement that, every once in a while, multiple changes need to be tested at the same time, so we can’t simply overwrite the existing test deployment. That’s why we have preview deployments: a kind of frontend deployment intended to test one specific change, sometimes several of them running in parallel. And since non-tech specialists should be able to have a look at these changes without having to get a degree first, each preview should be readily viewable in a browser.

Deployments for children and other grown-ups

We run our deployments on Kubernetes, or more specifically on Azure Kubernetes Service (AKS), which has its advantages, but doesn’t do everything for us. We also run GKE aka Google Kubernetes Engine, but not for this specific use case. Cleaning up in particular isn’t AKS’s cup of tea either. This isn’t a problem for regular deployments (where newer versions overwrite older ones), but it certainly is for frontend previews that shouldn’t overwrite each other and the number of which can, as a result, be quite unpredictable. You’re supposed to stow them away neatly after use. But we couldn’t be bothered when we were children, and since IT people are basically tall eight-year-olds, you shouldn’t get your hopes up. As a result, we ended up with a ton of deployments and their related service and ingress resources along with them, and nobody to do a regular clean-up. Could we maybe at least keep the pods – they’re responsible for using CPU and memory after all – at zero until they’re needed?

Horizontal Pod Autoscaler to the rescue! If only it were that easy… it can’t scale to zero, or rather, it can’t scale a deployment up from zero, because it needs at least one running pod whose CPU and RAM usage it can measure. To make that concrete, here’s roughly the kind of HPA we would have needed – the sticking point is the minReplicas line (all names are placeholders, not our actual config):
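
    # A sketch, not our actual config – every name here is a placeholder.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: frontend-preview-1234
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: frontend-preview-1234
      minReplicas: 1   # can't be 0 without the alpha HPAScaleToZero feature gate,
                       # and even then only in combination with object/external metrics
      maxReplicas: 4
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80

So we had to keep all preview deployments constantly running with at least one pod, just in case someone wanted to look at them – and our frontend teams deploy very frequently. This costs a whole lot of resources that are then unavailable to other deployments and have to be accounted for in the form of more or larger nodes. However, these don’t come for free. A different solution was needed, so we sought and found Knative.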

So, what does… how do you pronounce that anyway?

Red Hat says: kay-nay-tiv. We’re using the Serving component. For Eventing we’re using KEDA, but that is another story and shall be told another time. Right now, we’re not using other features such as traffic splitting, either.

Basically, Knative Services look a lot like your customary Deployment – Service – Ingress trinity, but they’re built on their own custom resource definitions and can scale up from zero as soon as requests start pouring in. This allows us to just go ahead and have every pull request build and roll out a preview deployment without having to allocate system resources to it – no matter whether anyone wants to check it out or not. Kourier, a dedicated Knative-specific ingress controller, helps route requests to the desired instance without us having to set anything up manually.

This is roughly what a regular deployment looks like – a trimmed-down sketch with placeholder names, images and hosts rather than our actual manifests:
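
    # The classic trinity: Deployment, Service and Ingress.
    # All names, images and hosts are placeholders, not our real resources.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: frontend-preview-1234
    spec:
      replicas: 1                  # at least one pod is always running
      selector:
        matchLabels:
          app: frontend-preview-1234
      template:
        metadata:
          labels:
            app: frontend-preview-1234
        spec:
          containers:
            - name: frontend
              image: example.azurecr.io/frontend:pr-1234
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: frontend-preview-1234
    spec:
      selector:
        app: frontend-preview-1234
      ports:
        - port: 80
          targetPort: 8080
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: frontend-preview-1234
    spec:
      ingressClassName: nginx
      rules:
        - host: pr-1234.preview.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: frontend-preview-1234
                    port:
                      number: 80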

And here’s the same thing created with Knative – again just a sketch with the same placeholder names:
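
    # One Knative Service replaces the whole trinity – Knative creates the
    # underlying revisions, Kubernetes services and Kourier routes on its own.
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: frontend-preview-1234
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/min-scale: "0"   # scale to zero when idle (the default anyway)
            autoscaling.knative.dev/max-scale: "4"
        spec:
          containers:
            - name: frontend
              image: example.azurecr.io/frontend:pr-1234
              ports:
                - containerPort: 8080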

Service as a Service

Of course, our feature teams aren’t keen on dealing with Knative or other infrastructure-related topics any more than absolutely necessary – and rightly so. We, Team Bender, are one of several platform engineering teams, responsible not only for Kubernetes, but also for making deployments as comfortable as possible for our feature teams. We therefore convened quite a while ago to create a collection of Helm charts to encapsulate all the disasterific details and make sure that every team only needs to have a simple values.yaml file containing the most important settings in their repository.

Team Bender in action.

Providing Knative was essentially a two-step process:

  1. Installing Knative itself on our clusters: in order to do this, we had to insert the suitable manifests into our cluster setup pipeline which does all that for us and also allows us to deploy a new cluster mostly automatically and switch to it almost without anyone noticing. But that is another story and shall be told another time.
  2. Providing the aforementioned Helm charts: our regular deployment chart was already quite adequate as a starting point. Some copy and pasting and some tinkering here and there, and soon we had something that we could unleash upon our unsuspecting colleagues without a bad conscience – at least we think so, because we had to look up «bad conscience» in the dictionary.
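
Just to give you an idea of what that feels like for a feature team: the values.yaml they commit looks conceptually something like this. The keys below are illustrative, not our chart’s real schema.

    # Illustrative only – the real keys depend on our internal chart,
    # so treat every field below as a made-up example.
    preview:
      enabled: true
      image:
        repository: example.azurecr.io/frontend
        tag: pr-1234            # usually injected by the pipeline per pull request
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
      autoscaling:
        minScale: 0             # let Knative scale the preview down to zero
        maxScale: 4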

The Kourier ingress controller in combination with some wildcard DNS helped us make every preview deployment accessible through its pull request ID. On the Knative side, that essentially boils down to a single ConfigMap entry – the domain below is a stand-in for our real wildcard domain:
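
    # Knative reads its default domain from the config-domain ConfigMap in the
    # knative-serving namespace; "preview.example.com" is a placeholder for the
    # wildcard domain we point at Kourier's load balancer.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-domain
      namespace: knative-serving
    data:
      preview.example.com: ""   # empty value = use this domain for all Knative services

With Knative’s default domain template ({{.Name}}.{{.Namespace}}.{{.Domain}}), a service named after its pull request then automatically gets a URL of its own – no routing rules to maintain by hand. So off we went.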

Any problems?

No problems, but challenges. Kourier, as an additional ingress controller, demanded some additional effort to set up. We usually use Nginx and would have preferred to keep it that way with Knative as well because ingress rules are often tailored to a specific controller. Luckily, since our preview deployments are all leveraging a wildcard subdomain that’s not used for anything else, we could easily route the traffic to Kourier without making Nginx feel like it ought to have any responsibility in this.

Unfortunately, the performance characteristics of Knative deployments differ slightly from those of our regular deployments, so we can only use them for load tests with a couple of caveats. We suspect the Kourier controller is responsible, as it seems to do some buffering in places where Nginx doesn’t. But this is little more than reading tea leaves at this point. As IT specialists, we naturally prefer coffee, but luckily we have little trouble sourcing suitable amounts of tea when the need arises.

Another small challenge was the fact that Knative doesn’t support hostPath mounts. Datadog, the monitoring solution we use, needs them to mount its config into every pod, so we’re currently living with a lack of metrics and alerts for our previews. Regular deployments don’t use Knative, so it’s not as bad as it sounds, but being able to spot possible issues at this early stage would have been nice. Though the details are still somewhat nebulous, we do have a basic idea of how to solve this.

The bigger challenge was the clean-up I already mentioned. Knative doesn’t do that on its own, either.

Old preview deployments do not spark joy

For every preview deployment, Knative creates a new revision with its own set of Kubernetes services whenever it’s updated, so that older and newer versions can run in parallel. And because we’re going about cleaning up with varying levels of motivation, the frontend namespace ended up containing thousands of K8s resources. This is something that Kubernetes isn’t particularly fond of, because it loses track of which service goes with which pods until you summon your inner Marie Kondō and do what must be done. We noticed this when the regular test frontend deployments stopped working as well, because Knative shares its namespace with them (not the brightest idea, as we soon found out). Apparently, keeping the pods at zero wasn’t quite sufficient.

The solution? We wrote some shell scripts (a bit old-school, but it works) that are executed by regularly scheduled pipelines, so that all preview deployments are deleted once they’re older than 14 days or when the pull request that created them is completed. This keeps our namespaces neat and tidy, and Kubernetes has no more trouble doing its job.
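
Our clean-up genuinely lives in shell scripts run by those pipelines, so take the following purely as an illustration of the idea: roughly equivalent logic, packaged as a Kubernetes CronJob. The namespace, the image (which would need kubectl and jq on board) and the service account are assumptions, and the «pull request completed» check is missing because it has to talk to the repository and therefore stays in the pipeline.

    # Sketch only: our real clean-up runs as shell scripts in scheduled pipelines.
    # Namespace, image and service account below are made-up examples.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: preview-cleanup
      namespace: frontend-previews
    spec:
      schedule: "0 3 * * *"          # once a night
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: preview-cleanup
              restartPolicy: Never
              containers:
                - name: cleanup
                  image: example.azurecr.io/kubectl-jq:latest   # needs kubectl and jq
                  command:
                    - /bin/sh
                    - -c
                    - |
                      # delete every Knative service older than 14 days
                      kubectl get ksvc -n frontend-previews -o json \
                        | jq -r '.items[]
                            | select((.metadata.creationTimestamp | fromdateiso8601) < (now - 14*86400))
                            | .metadata.name' \
                        | xargs -r kubectl delete ksvc -n frontend-previews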

A little bit of statistics

A short look into the namespace (generously supported by kubectl and grep) shows that, as of now (1 June 2023, 12:52 PM), 43 preview deployments are dwelling in there, while the number of preview pods is three. 40 deployments are therefore running at zero pods, each of which would have needed at least one in the era before Knative came along. Most of our cluster nodes are equipped with 16 CPU cores and 32 GiB of RAM – all the preview pods that aren’t running because they’re not needed save us roughly half a node on paper, and in practical terms more like an entire one. That doesn’t sound like much, but it adds up, and it’s most likely just the tip of the iceberg: the pods belonging to preview deployments we don’t even see anymore, because we clean them up regularly, would push that figure quite a bit higher.

To illustrate the point, here are two insightful graphs showing our savings based on the number of pods over time:

without Knative.
with Knative.

I know, there aren’t too many details in there, but at least I labelled the axes.

Where do we go from here?

Of course, we’re nowhere near the end of our Knative implementation story – some of the aforementioned challenges are still ongoing, some things could be a bit more performant, and we might even find a better alternative to Knative tomorrow.

Have you ever had to deal with a use case like this one? How did you solve it, are you using Knative or something completely different? Ideas, remarks, questions, tomorrow’s lottery numbers – feel free to write a comment!
