I run a small GKE cluster to host a number of personal projects, including this blog. Since this cluster only runs personal projects, I want to keep it as small as possible to keep costs down, but still make sure everything I run has the resources it needs to be as performant as it needs to be.
In Kubernetes, resource limits and requests give the cluster the information it needs about each pod to assign pods to nodes while making sure those nodes don't get too crowded, or contend for resources. Nodes automatically register the amount of CPU and memory they have available for pods, and Kubernetes allocates pods to nodes until these resources are consumed. When all nodes are out of resources, no additional pods can be scheduled, and pods may be evicted to allow higher-priority pods to be scheduled.
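If you want to see this in practice, kubectl can show how much of each node's registered capacity has already been requested; this is just a quick way to eyeball it, assuming you have kubectl access to the cluster:
# Summarize the requests already scheduled onto each node, as a fraction of its allocatable capacity
kubectl describe nodes | grep -A 8 "Allocated resources"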
Many cloud platforms, however, support deploying the cluster autoscaler to automatically create new Kubernetes nodes when more resources are requested. Google Cloud supports this as an add-on for GKE clusters, and individual node pools can then be configured with autoscaling enabled or disabled, along with autoscaling limits for each node pool.
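For reference, turning autoscaling on for an existing node pool looks something like the following; the cluster name, node pool name, and node limits are placeholders for your own:
# Enable autoscaling for one node pool, capped between 0 and 3 nodes
gcloud container clusters update example-cluster \
  --enable-autoscaling --node-pool default-pool \
  --min-nodes 0 --max-nodes 3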
But while configuring autoscaling on GKE is simple and clean, the autoscaler isn't particularly eager to scale down. This is a common problem with cloud providers, not just the GKE cluster-autoscaler. For large companies, it sometimes makes sense to write your own code to scale down, as Sony Imageworks did when launching their hybrid-cloud renderfarm in GCP.
For the rest of us, it may be too costly or complicated to take over autoscaling from Kubernetes. Fortunately, there are a few ways to help the autoscaler along and cut costs.
Ask nicely
The cluster autoscaler on any GKE cluster has a number of configuration options available - some of which are available only via the API or gcloud CLI, or on alpha or beta API endpoints. The full set of options available on the CLI can be found in the gcloud SDK documentation for gcloud container clusters.
One relevant option, currently found only in the beta command, is the autoscaling profile:
gcloud beta container clusters update example-cluster \
--autoscaling-profile optimize-utilization
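To confirm which profile a cluster is currently using, you should be able to read the setting back from the beta API. The autoscaling.autoscalingProfile field path below is my assumption about how describe surfaces it, so adjust if your output differs:
gcloud beta container clusters describe example-cluster \
  --format="value(autoscaling.autoscalingProfile)"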
Setting the cluster's autoscaling-profile to optimize-utilization instead of the default value balanced will cause the autoscaler to prefer to scale down nodes quickly whenever possible. Google doesn't go into the specifics of the implementation of this profile, but does leave this note of advice:
This profile has been optimized for use with batch workloads that are not sensitive to start-up latency. We do not currently recommend using this profile with serving workloads.
Pod tuning with autoscaler events
cluster-autoscaler is a process like any other, and on many Kubernetes variants it runs on the cluster, possibly on the master node, as a Pod. That isn't the case for GKE, which hides the implementation details of the cluster master: the autoscaler's logs are inaccessible, and many of the options available upstream can't be configured.
Fortunately, GKE does expose a custom interface for understanding why the autoscaler is making decisions to scale up or down nodes: autoscaler events in Cloud Logging. These events are available for GKE versions starting at 1.15.4-gke.7, with the noScaleDown event type, the most recent at time of writing, being added in 1.16.8-gke.2. For cost savings, this last event is the most relevant, since it tells you why a node wasn't removed, so the rest of this post will assume you are using GKE 1.16.8-gke.2 or later.
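If you just want to pull up every noScaleDown decision for a cluster while you explore, a Cloud Logging filter along these lines should work; the cluster name is a placeholder, and the log name matches the autoscaler visibility log these events are written to:
resource.type="k8s_cluster"
resource.labels.cluster_name="example-cluster"
logName:"cluster-autoscaler-visibility"
jsonPayload.noDecisionStatus.noScaleDown:*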
The linked page gives a more thorough guide to viewing these events via Cloud Logging. For each log entry, the most relevant information is found in the field jsonPayload.noDecisionStatus.noScaleDown.nodes[].reason.messageId, which will be one of these enumerated items. Here are some common issues that exploring the autoscaler's log events revealed on my cluster.
kube-system
Though the docs are pretty clear, I was unaware that cluster-autoscaler will by default refuse to evict non-DaemonSet Pods in the kube-system namespace. When most of these pods are system daemons and they are few compared to your service pods, this is likely a small concern.
But on a default GKE cluster, especially a small one, there are a surprising number of non-evictable pods in the kube-system namespace, and each one can prevent scale-down of whichever node it is assigned to, no matter how low that node's utilization. A few such pods that are likely present on your cluster:
- kube-dns autoscaler
- calico-node-vertical-autoscaler
- calico-typha
- calico-typha-horizontal-autoscaler
- calico-typha-vertical-autoscaler
- metrics-server
For my cluster, on top of these, I had unwisely deployed even more pods to this system-critical namespace:
- Helm v2's Tiller pod
- NGINX ingress controller pods
- NGINX ingress default backend pod
- estafette's GKE preemptible node killer daemon
All together, these pods blocked downscaling a significant proportion of the time.
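To check which kube-system pods on your own cluster fall into this category - anything not managed by a DaemonSet - listing each pod alongside the kind of its controller makes them easy to spot:
# Pods whose CONTROLLER column is ReplicaSet, StatefulSet, etc. (rather than DaemonSet) can block scale-down
kubectl get pods -n kube-system \
  -o custom-columns='NAME:.metadata.name,CONTROLLER:.metadata.ownerReferences[0].kind'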
These events can be identified by the messageId "no.scale.down.node.pod.kube.system.unmovable" - to query only events like this, you could add the following log query line to your filter:
jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId = "no.scale.down.node.pod.kube.system.unmovable"
The majority of these pods can be considered low-priority control plane pods, which are safe to evict. The Cluster Autoscaler FAQ provides this advice on how to mark such pods as safe to evict:
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
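For example, to mark GKE's metrics-server as safe to disrupt, something like the following should work. The k8s-app=metrics-server label selector is my assumption about how the pod is labeled, so check the actual labels (kubectl get pods -n kube-system --show-labels) before relying on it:
kubectl create poddisruptionbudget metrics-server-pdb \
  --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1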
Knapsack packing
One last way to optimize autoscaling is to carefully set the resource requests on your pods to avoid individual pods with large requests.
It's important to remember that for the autoscaler to scale down a node, it must be able to schedule all pods on that node onto other nodes, without scaling any node groups up. A node with only one pod cannot be scaled down if that pod doesn't fit on any other node. This is true even if the cluster as a whole has enough resources to accommodate that one pod - Kubernetes does not try to plan a series of evictions that will more densely pack the nodes, and will simply skip scaling down.
The easiest way to avoid this is to run pods with requests small enough that they can be easily packed onto nodes. The relevant value is the ratio of a pod's request to the allocatable amount of that resource on a given node. If a single pod requires a large proportion of a node's resources, it will be harder to evict that pod and harder to scale down any node running that pod.
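To get a feel for that ratio on your own cluster, you can compare each node's allocatable resources against what individual pods are requesting; these are just quick inspection commands:
# Allocatable CPU and memory per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
# CPU and memory requests per pod (one value per container)
kubectl get pods --all-namespaces -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'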
It's impossible to give general guidelines on how to size pod requests, but I can tell you how to detect if this is causing the autoscaler to avoid scaling down - the relevant messageId is "no.scale.down.node.no.place.to.move.pods":
jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId = "no.scale.down.node.no.place.to.move.pods"
Lastly: cheat
This one is easy: just don't request resources!
Though not usually the right answer, sometimes Kubernetes is not cut out to manage resources for your pods. Maybe you're okay with heavily loading a node or manually assigning some Pods to it (making it more like a pet, less like cattle). Maybe you have database workloads and care more about managing IOPS than CPU and memory, or ML workloads where the only relevant resource is access to physical GPUs, and you want to share GPUs between pods (not yet supported on Kubernetes).
In this situation, you can do one of two things: drop the resources: block of the PodSpec altogether, or continue to take advantage of Pod resource limits, but without requests. Resource limits do not drive or affect autoscaling or pod assignment - they control cgroup CPU slice allocation, process memory limits for the OOM killer, and pod eviction priority when a node runs out of memory, but not autoscaling.
To do the latter, you have to be careful: a resources: spec without any requests: will default requests to match the limits, according to the Kubernetes specification. You must manually set requests: to a small value.
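One way to do that without hand-editing YAML is kubectl set resources; the deployment name and the exact numbers here are only illustrative placeholders:
# Keep limits for cgroup/OOM control, but make the scheduler-facing requests tiny
kubectl set resources deployment example-app \
  --limits=cpu=1,memory=512Mi --requests=cpu=10m,memory=32Mi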
With a bit of investigation and some small tweaks, the cluster autoscaler on GKE can be made to behave quite well. If you found this quick intro to some of the ways to control it helpful, please get in touch and let me know at justin@palpant.us or via LinkedIn or Twitter.