Tuning cluster-autoscaler on GKE

I run a small GKE cluster to host a number of personal projects, including this blog. Since this cluster only runs personal projects, I want to keep it as small as possible to keep costs down, but still make sure everything I run has the resources it needs to be as performant as it needs to be.

In Kubernetes, resource limits and requests give the cluster the information it needs about each pod to assign pods to nodes while making sure those nodes don't get too crowded, or contend for resources. Nodes automatically register the amount of CPU and memory they have available for pods, and Kubernetes allocates pods to nodes until these resources are consumed. When all nodes are out of resources, no additional pods can be scheduled, and pods may be evicted to allow higher-priority pods to be scheduled.

Many cloud platforms, however, support deploying the cluster autoscaler to automatically create new Kubernetes nodes when more resources are requested. Google Cloud supports this as an add-on for GKE clusters, and individual node pools can then be configured with autoscaling enabled or disabled, with autoscaling limits for each node pool.

A Kubernetes node pool with autoscaling enabled, allowing 0-8 nodes to be created by the cluster-autoscaler

But while configuring autoscaling on GKE is simple and clean, the autoscaler isn't particularly eager to scale down. This is a common problem with cloud providers, not just the GKE cluster-autoscaler. For large companies, it sometimes makes sense to write your own code to scale down, as Sony Imageworks did when launching their hybrid-cloud renderfarm in GCP:

Sony Imageworks learned that automatic autoscaling was too slow on GCP, and chose to manage it manually

For the rest of us, it may be too costly or complicated to take over autoscaling from Kubernetes. Fortunately, there are a few ways to help the autoscaler along and cut costs.

Ask nicely

The cluster autoscaler on any GKE cluster has a number of configuration options available - some of which are available only via the API or gcloud CLI, or on alpha or beta API endpoints. The full set of options available on the CLI can be found at the gcloud SDK documentation for gcloud container clusters.

One relevant option is found in the beta command: autoscaling profiles

gcloud beta container clusters update example-cluster \
--autoscaling-profile optimize-utilization

Setting the cluster's autoscaling-profile to optimize-utilization instead of the default value balanced will cause the autoscaler to prefer to scale down nodes quickly whenever possible. Google doesn't go into the specifics of the implementation of this profile, but does leave this note of advice:

This profile has been optimized for use with batch workloads that are not sensitive to start-up latency. We do not currently recommend using this profile with serving workloads.

Pod tuning with autoscaler events

cluster-autoscaler is a process like any other, and on many Kubernetes variants it runs on the cluster as a Pod, possibly on the master node. This is not the case for GKE, which hides the implementation details of the cluster master. That makes the autoscaler's logs inaccessible and leaves you unable to configure many of the options available upstream.

Fortunately, GKE does expose a custom interface for understanding why the autoscaler is making decisions to scale up or down nodes: autoscaler events in Cloud Logging. These events are available for GKE versions starting at 1.15.4-gke.7, with the noScaleDown event type, the most recent at time of writing, being added in 1.16.8-gke.2. For cost savings, this last event is the most relevant, since it tells you why a node wasn't removed, so the rest of this post will assume you are using GKE 1.16.8-gke.2 or later.

The linked page gives a good guide on how to view these events via Cloud Logging. For each log item, the most relevant information is found in the log field jsonPayload.noDecisionStatus.noScaleDown.nodes[].reason.messageId, which will be one of these enumerated items. Here are some common issues that exploring the autoscaler's log events revealed on my cluster.

kube-system

Though the docs are pretty clear, I was unaware that cluster-autoscaler will by default refuse to evict non-DaemonSet Pods in the kube-system namespace. When most of these pods are system daemons and they are few compared to your service pods, this is likely a small concern.

But on a default GKE cluster, especially a small one, there are a surprising number of non-evictable pods in the kube-system namespace, and each one can prevent scale down for a node to which it is assigned, no matter how low the utilization. A few such pods that are likely present on your clusters:

kube-dns-autoscaler
calico-node-vertical-autoscaler
calico-typha
calico-typha-horizontal-autoscaler
calico-typha-vertical-autoscaler
metrics-server

For my cluster, on top of these, I had unwisely deployed even more pods to this system-critical namespace:

Helm v2's Tiller pod
NGINX ingress controller pods
NGINX ingress default backend pod
estafette's GKE preemptible node killer daemon

All together, these pods blocked downscaling a significant proportion of the time.

These events can be identified by the messageId "no.scale.down.node.pod.kube.system.unmovable" - to query only events like this, you could add the following log query line to your filter:

jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId = "no.scale.down.node.pod.kube.system.unmovable"
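If you would rather run this query from a terminal than from the Logs Explorer, a small helper can assemble the full filter. This is only a sketch: the function name no_scale_down_filter and the project ID my-project are made up for illustration, and the resource type and log name are my assumption of what the autoscaler visibility log uses, so check them against what your own project shows before relying on them.

```shell
# Build a Cloud Logging filter for autoscaler noScaleDown events.
# Arguments: PROJECT_ID, CLUSTER_NAME, MESSAGE_ID.
# Assumption: the resource type and log name below match the autoscaler
# visibility log in your project; verify them in the Logs Explorer.
no_scale_down_filter() {
  project="$1"
  cluster="$2"
  message_id="$3"
  printf '%s AND %s AND %s AND %s\n' \
    'resource.type="k8s_cluster"' \
    "resource.labels.cluster_name=\"${cluster}\"" \
    "logName=\"projects/${project}/logs/container.googleapis.com%2Fcluster-autoscaler-visibility\"" \
    "jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId=\"${message_id}\""
}

# Example use (requires an authenticated gcloud; names are placeholders):
#   gcloud logging read \
#     "$(no_scale_down_filter my-project example-cluster no.scale.down.node.pod.kube.system.unmovable)" \
#     --freshness=1d --limit=20 --format=json
```

The same helper works for any other messageId; only the third argument changes.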

The majority of these pods can be considered low-priority control plane pods, which are safe to evict. The Cluster Autoscaler FAQ provides this advice on how to denote the pods as safe to evict:

kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
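To find which pods might need such a budget, it can help to list everything in kube-system that is not owned by a DaemonSet. The sketch below is one way to do that; the function name non_daemonset_pods is hypothetical, and it only looks at each pod's first owner reference, so treat the output as a starting list rather than the autoscaler's exact view. Pods that already have a PodDisruptionBudget will still show up.

```shell
# Read "NAME OWNER_KIND" lines on stdin and print the names of pods whose
# first owner is not a DaemonSet. kubectl prints "<none>" for pods that
# have no owner reference at all, so those are listed too.
non_daemonset_pods() {
  awk '$2 != "DaemonSet" { print $1 }'
}

# Example against a live cluster (requires kubectl access):
#   kubectl get pods --namespace=kube-system --no-headers \
#     -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
#     | non_daemonset_pods
```

Pods created by a Deployment will report ReplicaSet as their owner kind, which is expected.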

Knapsack packing

One last way to optimize autoscaling is to carefully set the resource requests on your pods to avoid individual pods with large requests.

It's important to remember that for the autoscaler to scale down a node, it must be able to schedule all pods on that node onto other nodes, without scaling any node groups up. A node with only one pod cannot be scaled down if that pod doesn't fit on any other node. This is true even if the cluster as a whole has enough resources to accommodate that one pod - Kubernetes does not try to plan a series of evictions that will more densely pack the nodes, and will simply skip scaling down.

The easiest way to avoid this is to run pods with requests small enough that they can be easily packed onto nodes. The relevant value is the ratio of a pod's request to the allocatable amount of that resource on a given node. If a single pod requires a large proportion of a node's resources, it will be harder to evict that pod and harder to scale down any node running that pod.
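As a rough illustration of that ratio for CPU, the sketch below converts Kubernetes CPU quantities to millicores and prints a request as a percentage of a node's allocatable CPU. The function names and the numbers in the example call are hypothetical; substitute the values from your own pods and nodes.

```shell
# Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to millicores.
to_millicores() {
  awk -v q="$1" 'BEGIN {
    if (q ~ /m$/) { sub(/m$/, "", q); print q + 0 }
    else          { print q * 1000 }
  }'
}

# Print a CPU request as a whole-number percentage of a node's
# allocatable CPU, rounded down.
request_pct() {
  req="$(to_millicores "$1")"
  alloc="$(to_millicores "$2")"
  awk -v r="$req" -v a="$alloc" 'BEGIN { printf "%d\n", r * 100 / a }'
}

# Hypothetical numbers: a 500m request on a node with 940m allocatable.
request_pct 500m 940m   # prints 53
```

The allocatable values for your nodes can be read with kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'.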

It's impossible to give general guidelines on how to size pod requests, but I can tell you how to detect if this is causing an autoscaler to avoid scaling down - the relevant messageId is "no.scale.down.node.no.place.to.move.pods":

jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId = "no.scale.down.node.no.place.to.move.pods"

Lastly: cheat

This one is easy: just don't request resources!

Man with eyes covered; photo by Taras Chernus, via Unsplash

Though not usually the right answer, sometimes Kubernetes is not cut out to manage resources for your pods. Maybe you're okay with heavily loading a node or manually assigning some Pods to it (making it more like a pet, less like cattle). Maybe you have database workloads and care more about managing IOPS than CPU and memory, or ML workloads where the only relevant resource is access to physical GPUs, and you want to share GPUs between pods (not yet supported on Kubernetes).

In this situation, you can do one of two things: drop the resources: block of the PodSpec altogether, or continue to take advantage of Pod resource limits, but without requests. Resource limits do not drive or affect autoscaling or pod assignment. They control cgroup CPU slice allocation, process memory limits for the OOM killer, and pod eviction priority when a node is OOM.

To do the latter, you have to be careful: a resource spec without any requests: will default to matching requests to limits according to the Kubernetes specification.

resources:
  limits:
    cpu: 2
    memory: 2Gi

A Pod with this resource spec will also request 2 CPUs and 2Gi of memory, and will not be scheduled to nodes with fewer available resources than that. It will also trigger the autoscaler to create new nodes if resources are unavailable.
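The defaulting rule is simple enough to model in a few lines. This is only an illustration of the behavior described above, not anything Kubernetes itself runs, and effective_request is a made-up name.

```shell
# Print the request the scheduler would see for one resource.
# $1 = the container's limit, $2 = its explicit request (may be empty).
effective_request() {
  limit="$1"
  request="$2"
  if [ -n "$request" ]; then
    printf '%s\n' "$request"
  else
    # No explicit request: it defaults to the limit.
    printf '%s\n' "$limit"
  fi
}

effective_request 2 ""    # limits only: prints 2
effective_request 2 10m   # limits plus a small request: prints 10m
```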

You must manually set requests: to a small value:

resources:
  limits:
    cpu: 2
    memory: 2Gi
  requests:
    cpu: 10m
    memory: 16Mi

A Pod with this resource spec will be prevented from consuming more than 2 CPUs or 2Gi of memory, but can be scheduled even on a very busy node because of the low resource requests.

With a bit of investigation and some small tweaks, the cluster autoscaler on GKE can be made to behave quite well. If you found this quick intro to some of the ways to control it useful, please get in touch and let me know at justin@palpant.us or via LinkedIn or Twitter.

Justin Palpant
