<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[a star shines upon the hour of our meeting]]></title><description><![CDATA[No Description]]></description><link>https://justin.palpant.us/</link><image><url>https://justin.palpant.us/favicon_128.png</url><title>a star shines upon the hour of our meeting</title><link>https://justin.palpant.us/</link></image><generator>Jamify 1.0</generator><lastBuildDate>Wed, 08 Apr 2026 00:02:58 GMT</lastBuildDate><atom:link href="https://justin.palpant.us/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Tuning cluster-autoscaler on GKE]]></title><description><![CDATA[GKE makes Kubernetes easy to manage on GCP, but its autoscaling node pools can be slow to scale down, leading to increased cluster costs. Learn a few tips to cut down on scale-down times and get the most for your money.]]></description><link>https://justin.palpant.us/tuning-cluster-autoscaler-on-gke/</link><guid isPermaLink="false">Ghost__Post__5eea4fbaf0526d0006a08374</guid><category><![CDATA[kubernetes]]></category><category><![CDATA[gke]]></category><category><![CDATA[cluster-autoscaler]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Justin Palpant]]></dc:creator><pubDate>Wed, 17 Jun 2020 22:21:58 GMT</pubDate><media:content url="https://justin.palpant.us/static/204935d7c297b4165f302ec588d15af5/1000px-Kubernetes-Engine-Logo.png" medium="image"/><content:encoded><![CDATA[<img src="https://justin.palpant.us/static/204935d7c297b4165f302ec588d15af5/1000px-Kubernetes-Engine-Logo.png" alt="Tuning cluster-autoscaler on GKE"/><p>I run a small GKE cluster to host <a href="https://gitlab.palpant.us/justin/palpantlab-infra?ref=ghost.justin.palpant.us#what-is-it">a number of personal projects</a>, 
including this blog. Since this cluster only runs personal projects, I want to keep it as small as possible to keep costs down, but still make sure everything I run has the resources it needs to perform well.</p><p>In Kubernetes, <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/?ref=ghost.justin.palpant.us">resource limits and requests</a> give the cluster the information it needs about each pod to assign pods to nodes while making sure those nodes don't get too crowded or contend for resources. Nodes automatically register the amount of CPU and memory they have available for pods, and Kubernetes allocates pods to nodes until these resources are consumed. When all nodes are out of resources, no additional pods can be scheduled, and pods <a href="https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/?ref=ghost.justin.palpant.us">may be evicted</a> to allow higher-priority pods to be scheduled.</p><p>Many cloud platforms, however, support deploying the <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler?ref=ghost.justin.palpant.us#cluster-autoscaler">cluster autoscaler</a> to automatically create new Kubernetes nodes when more resources are requested. Google Cloud supports this <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler?ref=ghost.justin.palpant.us">as an add-on for GKE clusters</a>, and individual node pools can then be configured with autoscaling enabled or disabled, with autoscaling limits for each node pool.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler?ref=ghost.justin.palpant.us#cluster-autoscaler"><div class="kg-bookmark-content"><div class="kg-bookmark-title">kubernetes/autoscaler</div><div class="kg-bookmark-description">Autoscaling components for Kubernetes. 
Contribute to kubernetes/autoscaler development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicons/favicon.svg" alt="Tuning cluster-autoscaler on GKE"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">kubernetes</span></img></div></div><div class="kg-bookmark-thumbnail"><img src="https://avatars1.githubusercontent.com/u/13629408?s=400&#x26;v=4" alt="Tuning cluster-autoscaler on GKE"/></div></a></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ghost.justin.palpant.us/content/images/2020/06/image.png" class="kg-image" alt="Tuning cluster-autoscaler on GKE" loading="lazy"><figcaption>A Kubernetes node pool with autoscaling enabled, allowing 0-8 nodes to be created by the cluster-autoscaler</figcaption></img></figure><p>But while configuring autoscaling on GKE is simple and clean, the autoscaler isn't particularly eager to scale down. This is a common problem with cloud providers, not just the GKE cluster-autoscaler. For large companies, it sometimes makes sense to write your own code to scale down, as Sony Imageworks did when launching their hybrid-cloud renderfarm in GCP:</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="480" height="270" src="https://www.youtube.com/embed/ODOJ3UbnV6Y?start=1842&#x26;feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""/><figcaption>Sony Imageworks learned that automatic autoscaling was too slow on GCP, and chose to manage it manually</figcaption></figure><p>For the rest of us, it may be too costly or complicated to take over autoscaling from Kubernetes. 
Fortunately, there are a few ways to help the autoscaler along and cut costs.</p><h2 id="ask-nicely">Ask nicely</h2><p>The cluster autoscaler on any GKE cluster has a number of configuration options - some available only via the API or <code class="language-text">gcloud</code> CLI, or on <code class="language-text">alpha</code> or <code class="language-text">beta</code> API endpoints. The full set of options available on the CLI can be found in the gcloud SDK documentation for <code><a href="https://cloud.google.com/sdk/gcloud/reference/container/clusters?ref=ghost.justin.palpant.us">gcloud container clusters</a></code>.</p><p>One relevant option is found in the beta command: <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler?ref=ghost.justin.palpant.us#autoscaling_profiles">autoscaling profiles</a>.</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">gcloud beta container clusters update example-cluster <span class="token punctuation">\</span>
--autoscaling-profile optimize-utilization</code></pre></div><p>Setting the cluster's autoscaling profile to <code class="language-text">optimize-utilization</code> instead of the default value <code class="language-text">balanced</code> will cause the autoscaler to prefer to scale down nodes quickly whenever possible. Google doesn't go into the specifics of how this profile is implemented, but does leave this note of advice:</p><blockquote>This profile has been optimized for use with batch workloads that are not sensitive to start-up latency. We do not currently recommend using this profile with serving workloads.</blockquote><h2 id="pod-tuning-with-autoscaler-events">Pod tuning with autoscaler events</h2><p>cluster-autoscaler is a process like any other, and on many Kubernetes variants, it runs on the cluster, possibly on the master node, as a Pod. This is not the case for GKE, which hides the implementation details of the cluster master: the autoscaler's logs are inaccessible, and many of the <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/config/autoscaling_options.go?ref=ghost.justin.palpant.us">options available upstream</a> cannot be configured.</p><p>Fortunately, GKE does expose a custom interface for understanding why the autoscaler is making decisions to scale nodes up or down: <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility?ref=ghost.justin.palpant.us#top_of_page">autoscaler events in Cloud Logging</a>. These events are available starting with GKE 1.15.4-gke.7; the noScaleDown event type, the most recent at the time of writing, was added in 1.16.8-gke.2. For cost savings, this last event is the most relevant, since it tells you why a node <em>wasn't</em> removed, so the rest of this post will assume you are using GKE 1.16.8-gke.2 or later.</p><p>The linked page gives a good guide on how to view these events via Cloud Logging. 
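</p><p>For example, a filter along the following lines surfaces only the autoscaler's noScaleDown decisions. This is a sketch, with <code class="language-text">PROJECT_ID</code> and <code class="language-text">example-cluster</code> standing in for your own project and cluster name:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">resource.type="k8s_cluster"
resource.labels.cluster_name="example-cluster"
logName="projects/PROJECT_ID/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
jsonPayload.noDecisionStatus.noScaleDown:*</code></pre></div><p>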
For each log item, the most relevant information is found in the log field <code class="language-text">jsonPayload.noDecisionStatus.noScaleDown.nodes[].reason.messageId</code>, which will be one of <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility?ref=ghost.justin.palpant.us#noscaledown-reasons">these enumerated items</a>. Here are some common issues that exploring the autoscaler's log events revealed on my cluster.</p><h3 id="kube-system">kube-system</h3><p>Though the docs are pretty clear, I was unaware that cluster-autoscaler will by default <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md?ref=ghost.justin.palpant.us#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods">refuse to evict non-DaemonSet Pods in the <code class="language-text">kube-system</code> namespace</a>. When most of these pods are system daemons and they are few compared to your service pods, this is likely a small concern.</p><p>But on a default GKE cluster, especially a small one, there are a surprising number of non-evictable pods in the <code class="language-text">kube-system</code> namespace, and each one can prevent scale-down for a node to which it is assigned, no matter how low the utilization. 
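</p><p>To see which pods these are on your own cluster (assuming you have <code class="language-text">kubectl</code> access), you can list the <code class="language-text">kube-system</code> pods together with the kind of controller that owns them - anything whose owner is not a DaemonSet is a candidate to block scale-down:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">kubectl get pods --namespace kube-system \
  --output custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'</code></pre></div><p>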
A few such pods that are likely present on your clusters:</p><ul><li>kube-dns autoscaler</li><li>calico-node-vertical-autoscaler</li><li>calico-typha</li><li>calico-typha-horizontal-autoscaler</li><li>calico-typha-vertical-autoscaler</li><li>metrics-server</li></ul><p>For my cluster, on top of these, I had unwisely deployed even more pods to this system-critical namespace:</p><ul><li>Helm v2's Tiller pod</li><li>NGINX ingress controller pods</li><li>NGINX ingress default backend pod</li><li><a href="https://github.com/estafette/estafette-gke-preemptible-killer?ref=ghost.justin.palpant.us">estafette's GKE preemptible node killer daemon</a></li></ul><p>Altogether, these pods blocked downscaling a significant proportion of the time.</p><p>These events can be identified by the <code class="language-text">messageId</code> <code class="language-text">"no.scale.down.node.pod.kube.system.unmovable"</code> - to query only events like this, you could add the following line to your log filter:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId = "no.scale.down.node.pod.kube.system.unmovable"</code></pre></div><p>The majority of these pods can be considered low-priority control plane pods, which are safe to evict. 
And the <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md?ref=ghost.justin.palpant.us#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods">Cluster Autoscaler FAQ</a> provides this advice on how to denote the pods as safe to evict:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">kubectl create poddisruptionbudget <span class="token operator">&#x3C;</span>pdb name<span class="token operator">></span> --namespace<span class="token operator">=</span>kube-system --selector <span class="token assign-left variable">app</span><span class="token operator">=</span><span class="token operator">&#x3C;</span>app name<span class="token operator">></span> --max-unavailable <span class="token number">1</span></code></pre></div><h3 id="knapsack-packing">Knapsack packing</h3><p>One last way to optimize autoscaling is to carefully set the resource requests on your pods to avoid individual pods with large requests.</p><p>It's important to remember that for the autoscaler to scale down a node, it must be able to schedule all pods on that node onto other nodes, without scaling any node groups up. A node with only one pod cannot be scaled down if that pod doesn't fit on any other node. This is true even if the cluster as a whole has enough resources to accommodate that one pod - Kubernetes does not try to plan a series of evictions that will more densely pack the nodes, and will simply skip scaling down.</p><p>The easiest way to avoid this is to run pods with requests small enough that they can be easily packed onto nodes. The relevant value is the ratio of a pod's request to the allocatable amount of that resource on a given node. 
If a single pod requires a large proportion of a node's resources, it will be harder to evict that pod and harder to scale down any node running that pod.</p><p>It's impossible to give general guidelines on how to size pod requests, but I <em>can</em> tell you how to detect if this is causing the autoscaler to avoid scaling down - the relevant <code class="language-text">messageId</code> is <code class="language-text">"no.scale.down.node.no.place.to.move.pods"</code>:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">jsonPayload.noDecisionStatus.noScaleDown.nodes.reason.messageId = "no.scale.down.node.no.place.to.move.pods"</code></pre></div><h2 id="lastly-cheat">Lastly: cheat</h2><p>This one is easy: just don't request resources!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1542185400-f1c993ecbea2?ixlib=rb-1.2.1&#x26;q=80&#x26;fm=jpg&#x26;crop=entropy&#x26;cs=tinysrgb&#x26;w=2000&#x26;fit=max&#x26;ixid=eyJhcHBfaWQiOjExNzczfQ" class="kg-image" alt="Tuning cluster-autoscaler on GKE" loading="lazy" width="5283" height="2972" srcset="https://images.unsplash.com/photo-1542185400-f1c993ecbea2?ixlib=rb-1.2.1&#x26;q=80&#x26;fm=jpg&#x26;crop=entropy&#x26;cs=tinysrgb&#x26;w=600&#x26;fit=max&#x26;ixid=eyJhcHBfaWQiOjExNzczfQ 600w, https://images.unsplash.com/photo-1542185400-f1c993ecbea2?ixlib=rb-1.2.1&#x26;q=80&#x26;fm=jpg&#x26;crop=entropy&#x26;cs=tinysrgb&#x26;w=1000&#x26;fit=max&#x26;ixid=eyJhcHBfaWQiOjExNzczfQ 1000w, 
https://images.unsplash.com/photo-1542185400-f1c993ecbea2?ixlib=rb-1.2.1&#x26;q=80&#x26;fm=jpg&#x26;crop=entropy&#x26;cs=tinysrgb&#x26;w=1600&#x26;fit=max&#x26;ixid=eyJhcHBfaWQiOjExNzczfQ 1600w, https://images.unsplash.com/photo-1542185400-f1c993ecbea2?ixlib=rb-1.2.1&#x26;q=80&#x26;fm=jpg&#x26;crop=entropy&#x26;cs=tinysrgb&#x26;w=2400&#x26;fit=max&#x26;ixid=eyJhcHBfaWQiOjExNzczfQ 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@chernus_tr?utm_source=ghost&#x26;utm_medium=referral&#x26;utm_campaign=api-credit">Taras Chernus</a> / <a href="https://unsplash.com/?utm_source=ghost&#x26;utm_medium=referral&#x26;utm_campaign=api-credit">Unsplash</a></figcaption></img></figure><p>Though not usually the right answer, sometimes Kubernetes is not cut out to manage resources for your pods. Maybe you're okay with heavily loading a node or manually assigning some Pods to it (making it more <a href="http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/?ref=ghost.justin.palpant.us">like a pet, less like cattle</a>). Maybe you have database workloads and care more about managing IOPS than CPU and memory, or ML workloads where the only relevant resource is access to physical GPUs, and you want to share GPUs between pods (not yet supported on Kubernetes).</p><p>In this situation, you can do one of two things: drop the <code class="language-text">resources:</code> block of the PodSpec altogether, or continue to take advantage of Pod resource <em>limits</em>, but without requests. 
Resource limits do not drive or affect autoscaling or pod assignment - they control cgroup CPU slice allocation, process memory limits for the OOM killer, and pod eviction priority when a node is out of memory.</p><p>To do the latter, you have to be careful: a <code class="language-text">resources</code> spec without any <code class="language-text">requests:</code> <a href="https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/?ref=ghost.justin.palpant.us#what-if-you-specify-a-container-s-limit-but-not-its-request">will default to matching requests to limits</a> according to the Kubernetes specification.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">resources</span><span class="token punctuation">:</span>
  <span class="token key atrule">limits</span><span class="token punctuation">:</span>
    <span class="token key atrule">cpu</span><span class="token punctuation">:</span> <span class="token number">2</span>
    <span class="token key atrule">memory</span><span class="token punctuation">:</span> 2Gi</code></pre></div><figcaption>A Pod with this resource spec will also request 2 CPUs and 2Gi of memory, and will not be scheduled to nodes with less available resources than that. It will also trigger the autoscaler to create new nodes if resources are unavailable.</figcaption></figure><p> You must manually set <code class="language-text">requests:</code> to a small value:</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">resources</span><span class="token punctuation">:</span>
  <span class="token key atrule">limits</span><span class="token punctuation">:</span>
    <span class="token key atrule">cpu</span><span class="token punctuation">:</span> <span class="token number">2</span>
    <span class="token key atrule">memory</span><span class="token punctuation">:</span> 2Gi
  <span class="token key atrule">requests</span><span class="token punctuation">:</span>
    <span class="token key atrule">cpu</span><span class="token punctuation">:</span> 10m
    <span class="token key atrule">memory</span><span class="token punctuation">:</span> 16Mi
</code></pre></div><figcaption>A Pod with this resource spec will be prevented from consuming more than 2 CPUs or 2Gi of memory, but can be scheduled even on a very busy node because of the low resource requests</figcaption></figure><hr/><p>With a bit of investigation and some small tweaks, the cluster autoscaler on GKE can be made to behave quite well. If you found this quick intro to some of the ways to control it useful, please get in touch and let me know at <a href="mailto:justin@palpant.us">justin@palpant.us</a> or via <a href="https://www.linkedin.com/in/jpalpant?ref=ghost.justin.palpant.us">LinkedIn</a> or <a href="https://twitter.com/justin_palpant?ref=ghost.justin.palpant.us">Twitter</a>.</p>]]></content:encoded></item><item><title><![CDATA[Simulating user traffic with Chrome and Golang]]></title><description><![CDATA[Load testing is a critical part of making sure your website or web application is robust to heavy traffic. I tried a few load testing tools and built one of my own with Go and Chrome, and learned some of the ins-and-outs of load testing a user-facing website and what you can learn from it.]]></description><link>https://justin.palpant.us/simulating-user-traffic-with-chrome-and-golang/</link><guid isPermaLink="false">Ghost__Post__5ebc69ab3c740f00065e6f9e</guid><category><![CDATA[infrastructure]]></category><category><![CDATA[testing]]></category><category><![CDATA[grafana]]></category><category><![CDATA[golang]]></category><category><![CDATA[chrome]]></category><dc:creator><![CDATA[Justin Palpant]]></dc:creator><pubDate>Thu, 04 Jun 2020 02:05:07 GMT</pubDate><media:content url="https://justin.palpant.us/static/66525d24a607c10dc25529a0038734ca/chrome-load-agent-blog-img.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://justin.palpant.us/static/66525d24a607c10dc25529a0038734ca/chrome-load-agent-blog-img.jpg" alt="Simulating user traffic with Chrome and Golang"/><p>For any website, app, or product you support, there are a few 
dimensions to providing a good experience to your users, like availability, because no one likes error messages, or latency, because interactions should be smooth and quick.</p><p>Load testing is a great way to expose bottlenecks, fragility, and performance issues in your application. By adding a large amount of traffic in a controlled manner, you can often spot these issues before your users do. And it never hurts to be prepared for what might happen if your blog goes viral!</p><p>My interest in automated load testing came about because I noticed that when running simple tests, the performance characteristics of my sites (like <a href="https://gitlab.palpant.us/justin/palpantlab-infra?ref=ghost.justin.palpant.us">my Gitlab instance</a>, or this blog) became totally unpredictable - different than what they were in the "steady-state" (with little to no traffic).</p><p>There are many ways to load test a website - so let's start with the most basic.</p><h2 id="making-some-http-requests">Making some HTTP requests</h2><p>If you want to make sure that the HTTP-serving components of your system are performant, making simple HTTP requests to a public website is easy, and can be done at high scale with minimal resources.</p><p>A consumer-grade laptop running a cURL script can easily make hundreds of requests per second. 
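</p><p>Such a script can be as simple as a shell one-liner - for example (against a placeholder URL), fanning 800 requests out across eight parallel <code class="language-text">curl</code> processes and printing the status code and total time of each:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">seq 1 800 | xargs -P 8 -n 1 -I{} \
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' https://example.com/</code></pre></div><p>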
<a href="https://httpd.apache.org/docs/2.4/programs/ab.html?ref=ghost.justin.palpant.us">ab</a>, a classic tool from Apache, and the more modern <a href="https://github.com/wg/wrk?ref=ghost.justin.palpant.us">wrk</a>, take the basic principle of cURL and provide configurable parallelism and (in the basic cases) high QPS, as well as statistic reporting, which can give you an idea of the range of latency and throughput characteristics of a server.</p><p>All of these tools can be run with almost no overhead, maxing out a basic webserver while consuming little to no CPU or memory on the load test machine.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ wrk -t <span class="token number">8</span> -c <span class="token number">100</span> -d 120s https://justin.palpant.us/folding-home-on-kubernetes/
Running 2m <span class="token builtin class-name">test</span> @ https://justin.palpant.us/folding-home-on-kubernetes/
  <span class="token number">8</span> threads and <span class="token number">100</span> connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   <span class="token number">257</span>.73ms  <span class="token number">312</span>.05ms   <span class="token number">1</span>.97s    <span class="token number">81.28</span>%
    Req/Sec    <span class="token number">50.72</span>     <span class="token number">25.21</span>   <span class="token number">151.00</span>     <span class="token number">66.31</span>%
  <span class="token number">46582</span> requests <span class="token keyword">in</span> <span class="token number">2</span>.00m, <span class="token number">4</span>.50GB <span class="token builtin class-name">read</span>
  Socket errors: connect <span class="token number">0</span>, <span class="token builtin class-name">read</span> <span class="token number">0</span>, <span class="token function">write</span> <span class="token number">0</span>, <span class="token function">timeout</span> <span class="token number">756</span>
Requests/sec:    <span class="token number">387.87</span>
Transfer/sec:     <span class="token number">38</span>.34MB</code></pre></div><figcaption><p><a href="https://github.com/wg/wrk?ref=ghost.justin.palpant.us"><span style="white-space: pre-wrap;">wrk</span></a><span style="white-space: pre-wrap;"> can easily make hundreds or thousands of HTTP requests per second from a laptop</span></p></figcaption></figure><p>However, these tools and tools like them have a weakness - for a webserver that serves a web application that users interact with via a browser, the load induced by a single HTTP request doesn't really mimic what would happen if a large number of users began to visit.</p><p>A user visiting a web page often has more side effects that cause load on your servers than one or even several HTTP requests.</p><ul><li>Additional dynamic HTTP calls based on executed Javascript</li><li>Server-side rendering of new components</li><li>Server-side caching based on request headers</li></ul><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/06/Screenshot-2020-06-03-16.26.53.png" width="659" height="661" loading="lazy" alt="Simulating user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/06/Screenshot-2020-06-03-16.26.53.png 600w, https://ghost.justin.palpant.us/content/images/2020/06/Screenshot-2020-06-03-16.26.53.png 659w"/></div><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/06/Screenshot-2020-06-03-16.27.04.png" width="1862" height="833" loading="lazy" alt="Simulating user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/06/Screenshot-2020-06-03-16.27.04.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/06/Screenshot-2020-06-03-16.27.04.png 1000w, 
https://ghost.justin.palpant.us/content/images/size/w1600/2020/06/Screenshot-2020-06-03-16.27.04.png 1600w, https://ghost.justin.palpant.us/content/images/2020/06/Screenshot-2020-06-03-16.27.04.png 1862w" sizes="(min-width: 720px) 720px"/></div></div></div><figcaption><p><span style="white-space: pre-wrap;">One HTTP request with cURL yields some HTML and no requests to metric-serving backends; but the browser executes JS and makes additional requests, placing load on the entire system</span></p></figcaption></figure><h2 id="simulate-the-user">Simulate the user</h2><p>To overcome these limitations, you can try to expose your servers to load that, from the server's perspective, appears to be regular user traffic. Can't find a few hundred or thousand people to reload your blog all day? <em>No problem!</em> Fortunately, there are a few ways to automate this.</p><h3 id="3rd-party-tools">3rd-party tools</h3><p>Services such as <a href="https://flood.io/?ref=ghost.justin.palpant.us">flood.io</a> can help with this - they provide a tool to simulate a huge number of users, geographically distributed, interacting with your website in simple ways. For more advanced cases, you can even provide Selenium scripts to execute complex sequences of interactions with a website. 
I use and will continue to use Flood for occasional <em>high-stress </em>testing.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ghost.justin.palpant.us/content/images/2020/05/Screenshot-2020-05-20-20.31.43.png" class="kg-image" alt="Simulating user traffic with Chrome and Golang" loading="lazy" width="2000" height="1096" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/Screenshot-2020-05-20-20.31.43.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/Screenshot-2020-05-20-20.31.43.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/Screenshot-2020-05-20-20.31.43.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/Screenshot-2020-05-20-20.31.43.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">flood.io load test simulating 250 constant visitors to this blog for 15 minutes</span></figcaption></img></figure><p>But Flood and similar tools have some limitations. Importantly, though they scale to large numbers, they are not meant to simulate sustained load - jobs typically run for minutes or hours, but not days or weeks. On top of that, the tools are expensive. Flood <a href="https://flood.io/pricing?ref=ghost.justin.palpant.us">offers 500 virtual user-hours per month for free</a>, and then $0.045/virtual user-hour thereafter. While this is great for a few bursts, simulating continuous load of only 10 users would cost upwards of $300/month.</p><h3 id="another-way">Another way</h3><p>I thought it would be interesting to try to build something that would provide a tunable way to load web pages and induce the corresponding stress on my monitoring stack (in the absence of a horde of developers constantly refreshing dashboards). 
My <a href="https://gitlab.palpant.us/justin/chromedp-load-agent?ref=ghost.justin.palpant.us#goals">goals were simple and specific</a>: simulate a full page load on an authenticated web page, in an indefinite loop, with some control over the parallelism on the client side.</p><p>For years, driving Chrome or other browsers via automation has been a staple of integration testing via frameworks like Selenium. Today, the <a href="https://chromedevtools.github.io/devtools-protocol/?ref=ghost.justin.palpant.us">Chrome DevTools Protocol</a>, an API maintained by Google, facilitates this. It allows programmatic control of a Chrome browser instance from another process.</p><p>Several libraries wrapping the DevTools Protocol have been made for different languages to allow fine-grained control: <a href="https://github.com/puppeteer/puppeteer?ref=ghost.justin.palpant.us">Puppeteer</a> for NodeJS (maintained by the Chrome development team), <a href="/p/b8890c39-cf93-4898-aea0-f31725f31f0e/github.com/chromedp/chromedp">github.com/chromedp/chromedp</a> for golang, <a href="https://docs.rs/headless_chrome/0.9.0/headless_chrome/?ref=ghost.justin.palpant.us">headless_chrome</a> for Rust. These libraries are at varying levels of maturity and feature-completeness, and if you are interested in building a new solution for driving Chrome, they are a great place to start!</p><p>On top of these libraries, a few enterprising developers have built load-testing tools, such as <a href="https://github.com/svenkatreddy/puppeteer-loadtest?ref=ghost.justin.palpant.us">puppeteer-loadtest</a> and the powerful <a href="https://github.com/thomasdondorf/puppeteer-cluster?ref=ghost.justin.palpant.us">puppeteer-cluster</a>. 
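</p><p>To give a flavor of what driving Chrome from Go looks like, here is a minimal, hypothetical sketch - not any real project's code - that uses chromedp to load a page over and over in a headless browser:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="go"><pre class="language-go"><code class="language-go">package main

import (
	"context"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start one headless browser and reuse it for every page load.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	for {
		// Bound each load so a hung page can't stall the loop.
		loadCtx, cancelLoad := context.WithTimeout(ctx, 30*time.Second)
		err := chromedp.Run(loadCtx,
			chromedp.Navigate("https://example.com/"),
			chromedp.WaitReady("body", chromedp.ByQuery),
		)
		cancelLoad()
		if err != nil {
			log.Printf("page load failed: %v", err)
		}
	}
}</code></pre></div><p>Each iteration exercises the full page-load path - HTML, scripts, and the requests those scripts trigger - which is exactly the traffic that plain HTTP benchmarks miss.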
</p><p>Given that I'm more familiar with golang than the other two languages, I thought to see if I could scratch out a simple binary to meet my goals from chromedp.</p><h2 id="gitlabpalpantusjustinchromedp-load-agent"><a href="https://gitlab.palpant.us/justin/chromedp-load-agent?ref=ghost.justin.palpant.us">gitlab.palpant.us/justin/chromedp-load-agent</a></h2><p>Courtesy of the power and ease of golang, chromedp, Docker, and Kubernetes, in just a handful of hours I made a tool that:</p><ul><li>Loads web pages continuously in a headless Chrome browser, with URLs specified via file or CLI argument</li><li>Simulates a complete page load, including awaiting <code class="language-text">load</code> and optionally <code class="language-text">networkIdle0</code> events (<a href="https://github.com/puppeteer/puppeteer/blob/master/docs/api.md?ref=ghost.justin.palpant.us#pagegotourl-options">meaning no network requests have been made for 500ms</a>), with configurable timeout</li><li>Supports arbitrary HTTP headers, TLS verification using default CAs, or skipping TLS verification (for unsigned HTTPS websites).</li><li>Configurable parallelism via a reusable pool of browser tabs</li><li>Can be (<a href="https://gitlab.palpant.us/justin/chromedp-load-agent/-/blob/master/deploy/kubectl-apply/gke/deployment.yaml?ref=ghost.justin.palpant.us">and is!</a>) run on Kubernetes, with <a href="https://hub.docker.com/r/jpalpant/chromedp-load-agent?ref=ghost.justin.palpant.us">jpalpant/chromedp-load-agent</a> published to DockerHub</li><li>Can take screenshots of the page to validate a successful page load</li></ul><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_dji_liazz_cluster_metrics_usage_orgid_1_refresh_30s.png" width="1920" height="1080" loading="lazy" alt="Simulating 
user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/https_gitlab_palpant_us_grafana_d_dji_liazz_cluster_metrics_usage_orgid_1_refresh_30s.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/https_gitlab_palpant_us_grafana_d_dji_liazz_cluster_metrics_usage_orgid_1_refresh_30s.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/https_gitlab_palpant_us_grafana_d_dji_liazz_cluster_metrics_usage_orgid_1_refresh_30s.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_dji_liazz_cluster_metrics_usage_orgid_1_refresh_30s.png 1920w" sizes="(min-width: 720px) 720px"/></div><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_hb7fse0zz_node_status_orgid_1_refresh_1m_from_now_1h_to_now.png" width="1920" height="1080" loading="lazy" alt="Simulating user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/https_gitlab_palpant_us_grafana_d_hb7fse0zz_node_status_orgid_1_refresh_1m_from_now_1h_to_now.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/https_gitlab_palpant_us_grafana_d_hb7fse0zz_node_status_orgid_1_refresh_1m_from_now_1h_to_now.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/https_gitlab_palpant_us_grafana_d_hb7fse0zz_node_status_orgid_1_refresh_1m_from_now_1h_to_now.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_hb7fse0zz_node_status_orgid_1_refresh_1m_from_now_1h_to_now.png 1920w" sizes="(min-width: 720px) 720px"/></div><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_i8erisazk_availability_orgid_1.png" width="1920" height="1080" loading="lazy" alt="Simulating user traffic with Chrome and Golang" 
srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/https_gitlab_palpant_us_grafana_d_i8erisazk_availability_orgid_1.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/https_gitlab_palpant_us_grafana_d_i8erisazk_availability_orgid_1.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/https_gitlab_palpant_us_grafana_d_i8erisazk_availability_orgid_1.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_i8erisazk_availability_orgid_1.png 1920w" sizes="(min-width: 720px) 720px"/></div></div><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_nginx_nginx_ingress_controller_orgid_1_refresh_10s.png" width="1920" height="1080" loading="lazy" alt="Simulating user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/https_gitlab_palpant_us_grafana_d_nginx_nginx_ingress_controller_orgid_1_refresh_10s.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/https_gitlab_palpant_us_grafana_d_nginx_nginx_ingress_controller_orgid_1_refresh_10s.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/https_gitlab_palpant_us_grafana_d_nginx_nginx_ingress_controller_orgid_1_refresh_10s.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_nginx_nginx_ingress_controller_orgid_1_refresh_10s.png 1920w" sizes="(min-width: 720px) 720px"/></div><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_qrxxp8pzz_storage_orgid_1.png" width="1920" height="1080" loading="lazy" alt="Simulating user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/https_gitlab_palpant_us_grafana_d_qrxxp8pzz_storage_orgid_1.png 600w, 
https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/https_gitlab_palpant_us_grafana_d_qrxxp8pzz_storage_orgid_1.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/https_gitlab_palpant_us_grafana_d_qrxxp8pzz_storage_orgid_1.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_qrxxp8pzz_storage_orgid_1.png 1920w" sizes="(min-width: 720px) 720px"/></div><div class="kg-gallery-image"><img src="https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_v3upgg3wz_foldingathome_orgid_1_refresh_30s.png" width="1920" height="1080" loading="lazy" alt="Simulating user traffic with Chrome and Golang" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/05/https_gitlab_palpant_us_grafana_d_v3upgg3wz_foldingathome_orgid_1_refresh_30s.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/05/https_gitlab_palpant_us_grafana_d_v3upgg3wz_foldingathome_orgid_1_refresh_30s.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/05/https_gitlab_palpant_us_grafana_d_v3upgg3wz_foldingathome_orgid_1_refresh_30s.png 1600w, https://ghost.justin.palpant.us/content/images/2020/05/https_gitlab_palpant_us_grafana_d_v3upgg3wz_foldingathome_orgid_1_refresh_30s.png 1920w" sizes="(min-width: 720px) 720px"/></div></div></div><figcaption><p><span style="white-space: pre-wrap;">Post-load screenshots from my Grafana instance showing all queries complete, taken with </span><code spellcheck="false" style="white-space: pre-wrap;"><span>chromedp-load-agent test</span></code><span style="white-space: pre-wrap;"> while visiting </span><a href="https://gitlab.palpant.us/justin/chromedp-load-agent/-/blob/master/samples/grafana.txt?ref=ghost.justin.palpant.us"><span style="white-space: pre-wrap;">these dashboards</span></a></p></figcaption></figure><h3 id="whats-left">What's left</h3><p>Right now, <a 
href="https://gitlab.palpant.us/justin/chromedp-load-agent?ref=ghost.justin.palpant.us">chromedp-load-agent</a> is untested, the code isn't very well organized (on account of being my first project using <a href="https://github.com/spf13/cobra?ref=ghost.justin.palpant.us">spf13/cobra</a>), it's expensive to run, and it's brittle. While functional, it has a lot of limitations as a long-running service. I'm not interested in making it into a library, but if I can, I'd love to improve other aspects:</p><ul><li>Health checks suitable for a long-running server process</li><li>Prometheus metrics for application statistics, like successful or failed page loads</li><li>More useful screenshot handling, like a web interface showing the most recent screenshot for each URL</li></ul><p>Beyond that, there's a lot of potential for a reusable library that automates this work in Golang, but I think that's not a direction I want to pursue right now.</p><p>But the important thing: it works. I set out to add consistent, configurable load to my monitoring system, and this is what QPS looks like now:</p>
<!--kg-card-begin: html-->
<iframe src="https://grafana.palpant.us/dashboard-solo/snapshot/3APZiXtLR2zaelyRSQM2vSApnY1OC2Ex?orgId=1&#x26;from=1589039659054&#x26;to=1589266799000&#x26;var-DS_PROMETHEUS=&#x26;var-namespace=All&#x26;var-controller_class=nginx&#x26;var-controller=All&#x26;var-ingress=gitlab-cloudnative-grafana&#x26;panelId=86" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
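<p>For scale, the steady rate in that snapshot lines up with a simple estimate: each browser tab completes one page load per load time, and each page load fans out into many HTTP requests. A back-of-envelope sketch (the tab count, requests per page, and load time here are illustrative numbers, not measurements from the agent):</p>
<!--kg-card-begin: markdown-->
```shell
# Rough request rate generated by the load agent:
#   QPS ~= tabs * requests_per_page / seconds_per_page_load
tabs=4; requests_per_page=50; load_seconds=2
awk -v t="$tabs" -v r="$requests_per_page" -v s="$load_seconds" \
    'BEGIN { printf "~%.0f requests/second\n", t * r / s }'
```
<!--kg-card-end: markdown-->
<p>With those example numbers, a single agent accounts for around 100 requests per second at the ingress.</p>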
<p>So, did it reveal anything interesting? Or was this exercise a waste of time?</p><h2 id="what-i-learned">What I learned</h2><h3 id="cloud-compute-costs">Cloud-compute costs</h3><p>The first thing that surprised me was how quickly the TCO of my small websites increased under constant load. I run most things on GCP via a GKE cluster, and of course my personal Grafana instance sees very little traffic day-to-day, but I wouldn't have guessed where - and <em>how much</em> - an increase in traffic would cost.</p><p>As with most cloud providers, Google charges for a wide variety of usage SKUs. Some, like CPU and memory, are obvious; others, like static IP addresses, load balancer rules, cluster management fees, and monthly storage fees, are less obvious, but still intuitive. However, the constant page loads suddenly revealed a variety of non-intuitive costs for services that were previously inexpensive. The main culprit turned out to be<em> (drumroll)...</em></p>
<!--kg-card-begin: html-->
<iframe src="https://giphy.com/embed/116seTvbXx07F6" style="width:100%" height="400" frameborder="0" class="giphy-embed" allowfullscreen=""/>
<!--kg-card-end: html-->
<p><a href="https://giphy.com/gifs/mardi-gras-116seTvbXx07F6?ref=ghost.justin.palpant.us"><strong><em>Log ingestion</em></strong></a></p><p>Believe it or not, GCP makes you pay through the nose for log ingestion once you pass the <a href="https://cloud.google.com/stackdriver/pricing?ref=ghost.justin.palpant.us#logging-costs">free usage threshold of 50GB</a>. With NGINX logs and traces from various services being emitted on every request, even my small cluster consumed that allotment and rapidly started accruing log charges at a rate of $0.50/GB (at, say, 2KB of logs per request, 100 QPS produces roughly 500GB of logs a month). Fortunately, Cloud Logging allows <a href="https://cloud.google.com/logging/docs/exclusions?ref=ghost.justin.palpant.us">flexible exclusion filters</a>, and I was able to bring that cost back under control.</p><p>Beyond logs, I also noticed a sharp spike in charges due to <strong>Network Egress</strong> (data leaving GCP, since the load agent downloads every page from outside the cluster) and <strong>GCS Class B requests</strong>. The latter was interesting to me, and difficult to resolve. I use <a href="https://github.com/thanos-io/thanos?ref=ghost.justin.palpant.us">Thanos</a> (a CNCF project) as <a href="https://thanos.palpant.us/bucket?ref=ghost.justin.palpant.us">part of my monitoring stack</a>, and Thanos serves metrics from Google Cloud Storage. Thanos Store and Thanos Query, the components responsible for handling requests for metrics, offer very little in the way of caching, so every page load required downloading a piece of a GCS object to eventually display to the visitor in a Grafana dashboard.</p><p>At a lower level, Thanos Store reports statistics on these operations - the relevant operation was the <code class="language-text">bucket get_range</code> request (from <a href="https://github.com/thanos-io/thanos/blob/master/pkg/objstore/gcs/gcs.go?ref=ghost.justin.palpant.us#L124">gcs.go</a>)</p>
<!--kg-card-begin: html-->
<iframe src="https://grafana.palpant.us/dashboard-solo/snapshot/ACfPj7wqmmmJdf0w8OrgZamD5ZbFDEwX?orgId=1&#x26;from=1589039659054&#x26;to=1589266799000&#x26;var-interval=10m&#x26;var-namespace=gitlab&#x26;var-labelselector=app&#x26;var-labelvalue=thanos-store&#x26;panelId=5" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
<p>That operation makes a <a href="https://cloud.google.com/storage/pricing?ref=ghost.justin.palpant.us#operations-by-class">storage.*.get</a> request via the GCS JSON API, which is categorized as a Class B operation for billing. GCP charges <a href="https://cloud.google.com/storage/pricing?ref=ghost.justin.palpant.us#operations-pricing">$0.004/10,000 operations</a> for this type of request against a Standard bucket. While that rate seems small, 100 QPS is roughly 260 million operations per month (100 × 86,400 × 30), which works out to about $104/month - a substantial amount, compared to other costs on this cluster. I'll write about how I dealt with this another time; in the meantime, watch <a href="https://www.youtube.com/watch?v=eyBbImSDOrI&#x26;feature=youtu.be&#x26;ref=ghost.justin.palpant.us">this amazing talk</a> by Tom Wilkie and read <a href="https://grafana.com/blog/2019/09/19/how-to-get-blazin-fast-promql/?ref=ghost.justin.palpant.us">the blog post</a>!</p><h3 id="cpu-bottlenecks-and-autoscaling">CPU bottlenecks and autoscaling</h3><p>To my surprise, memory usage for the pods I use stayed relatively stable when the load agent was enabled. However, several pods showed drastic spikes in CPU usage. Notable among these were Grafana, Gitlab's Webservice (formerly "Unicorn"), NGINX, and the <a href="https://github.com/GoogleCloudPlatform/cloudsql-proxy?ref=ghost.justin.palpant.us">CloudSQL proxy</a> I use to tunnel my GCP-managed databases for secure access from within the Kubernetes cluster.</p><p>Normally, this would be fine - my cluster runs with excess capacity, and CPU is <a href="https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits?ref=ghost.justin.palpant.us">a compressible resource</a>, which means (in part) that it can be over-consumed without Kubernetes needing to terminate any pods or increase cluster capacity. Instead, the pods are throttled - prevented from consuming more CPU than their limits by temporarily descheduling their processes.</p>
<!--kg-card-begin: html-->
<iframe src="https://grafana.palpant.us/dashboard-solo/snapshot/akxe78IMzfvI8VZVLSdNYzWnLAtDLHSH?orgId=0&#x26;from=1589039659054&#x26;to=1589266799000&#x26;var-Node=All&#x26;var-namespace=gitlab&#x26;var-pod=gitlab-cloudnative-grafana-744f47bf8b-ffmcc&#x26;var-pod=gitlab-cloudnative-grafana-744f47bf8b-jtpk9&#x26;var-pod=gitlab-cloudnative-grafana-744f47bf8b-zqmrp&#x26;var-cluster=gitlab-cloudnative-prometheus&#x26;panelId=49" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
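<p>Concretely, a pod's CPU limit is enforced through the kernel's CFS bandwidth controls: in every scheduling period (100ms by default), the container may consume its limit's worth of CPU time, after which its threads are paused until the next period begins. A sketch of the mapping (the 500m limit is an illustrative value, not one from my manifests):</p>
<!--kg-card-begin: markdown-->
```shell
# Kubernetes translates a CPU limit into a CFS quota:
#   cpu.cfs_period_us = 100000   (100ms, the kernel default)
#   cpu.cfs_quota_us  = limit_in_millicores * period_us / 1000
limit_millicores=500   # i.e. "cpu: 500m" in a pod spec
period_us=100000
quota_us=$(( limit_millicores * period_us / 1000 ))
echo "cpu.cfs_quota_us=${quota_us}"   # 50000us of CPU time per 100000us period
```
<!--kg-card-end: markdown-->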
<p>Unfortunately, CPU throttling on any process that serves user traffic can lead to poor, as well as inconsistent, performance - a process in the midst of a request could suddenly be put on pause, delaying those requests being served and increasing their latency by an unpredictable amount - like this:</p>
<!--kg-card-begin: html-->
<iframe src="https://grafana.palpant.us/dashboard-solo/snapshot/Y9Ioo622FBFt7P5339KIymdTbYe3JFD4?orgId=1&#x26;from=1589039659054&#x26;to=1589266799000&#x26;var-DS_PROMETHEUS=&#x26;var-namespace=All&#x26;var-controller_class=nginx&#x26;var-controller=All&#x26;var-ingress=gitlab-cloudnative-grafana&#x26;var-interval=5m&#x26;panelId=93" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
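<p>You don't have to infer throttling from latency graphs alone - the kernel counts it directly. From inside a container (the path below is the cgroup v1 layout; under cgroup v2 the same counters live in the unified <code class="language-text">cpu.stat</code> file):</p>
<!--kg-card-begin: markdown-->
```shell
# What fraction of CFS periods did this container exhaust its quota in?
# cpu.stat reports nr_periods (total) and nr_throttled (periods that were capped).
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2} END { if (p > 0) printf "throttled in %.1f%% of periods\n", 100 * t / p }' /sys/fs/cgroup/cpu/cpu.stat
```
<!--kg-card-end: markdown-->
<p>Anything consistently above a few percent on a latency-sensitive pod is a sign the CPU limit is too tight.</p>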
<p>There are a few ways to get around this. In some cases, the right choice is to simply increase the CPU allocation to your pods to prevent the bottleneck. If large changes in load are infrequent, you can pick an allocation that works for your expected load and leave it, updating it manually when need be.</p><p>Sometimes you can't predict what load you need to handle. For those cases, a <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/?ref=ghost.justin.palpant.us">HorizontalPodAutoscaler</a> can help. This Kubernetes primitive monitors the CPU or memory usage of all pods belonging to a Deployment and, if the usage exceeds a threshold, scales up the Deployment. If your Service is set up in the usual way, requests are automatically load balanced across the new and old Pods once all are available, reducing the CPU needed for each Pod. Scaling up or down is repeated until the CPU usage is within bounds, or the maximum or minimum number of Pods the HPA can use is reached.</p><p>For GitLab, which I deploy via the <a href="https://gitlab.com/gitlab-org/charts/gitlab?ref=ghost.justin.palpant.us">GitLab Cloudnative Helm Chart</a>, an HPA for the webservice deployment can be configured via the <a href="https://gitlab.palpant.us/justin/palpantlab-gitlab/-/blob/master/deploy/helm-upgrade/gke/gitlab-values.yaml?ref=ghost.justin.palpant.us#L149-150"><code class="language-text">gitlab.webservice.hpa</code></a> field, as described <a href="https://gitlab.com/gitlab-org/charts/gitlab/-/tree/master/doc/charts/gitlab/webservice?ref=ghost.justin.palpant.us#installation-command-line-options">in the docs</a>. 
NGINX ingress similarly offers easy HPA configuration, and an HPA for any deployment can be made with <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/?ref=ghost.justin.palpant.us#support-for-horizontal-pod-autoscaler-in-kubectl"><code class="language-text">kubectl autoscale</code></a> as well - for example, <code class="language-text">kubectl autoscale deployment webservice --cpu-percent=80 --min=1 --max=3</code> (the deployment name and bounds here are just illustrative).</p><hr/><p>If you are thinking about how to load test your website, application, or product, I hope some of this has been useful information! If you have any feedback or suggestions, or are interested in <a href="https://gitlab.palpant.us/justin/chromedp-load-agent?ref=ghost.justin.palpant.us">chromedp-load-agent</a> or similar tools, please <a href="https://justin.palpant.us/get-in-touch/?ref=ghost.justin.palpant.us">get in touch</a>! I'm always learning and looking for input, and happy to chat about infrastructure any time.</p>]]></content:encoded></item><item><title><![CDATA[Maximizing NVIDIA GPU performance on Linux]]></title><description><![CDATA[Have an NVIDIA GPU that you use for gaming, HPC, or machine learning, and want to get maximum performance out of it? 
Learn some tips and tricks to make sure your GPU isn't being silently throttled.]]></description><link>https://justin.palpant.us/monitor-and-maximize-nvidia-gpu-performance-on-linux/</link><guid isPermaLink="false">Ghost__Post__5e7a3d311273fa00083d6980</guid><category><![CDATA[gpu]]></category><category><![CDATA[kernel]]></category><category><![CDATA[linux]]></category><category><![CDATA[nvidia]]></category><category><![CDATA[systemd]]></category><category><![CDATA[gaming]]></category><dc:creator><![CDATA[Justin Palpant]]></dc:creator><pubDate>Sun, 26 Apr 2020 20:18:59 GMT</pubDate><media:content url="https://justin.palpant.us/static/73904e6a31ad065de466e442438de0ea/Screenshot-2020-04-26-15.27.21.png" medium="image"/><content:encoded><![CDATA[<img src="https://justin.palpant.us/static/73904e6a31ad065de466e442438de0ea/Screenshot-2020-04-26-15.27.21.png" alt="Maximizing NVIDIA GPU performance on Linux"/><p>I got an NVIDIA RTX 2080 Super a few months ago. It's a great piece of hardware and up for anything I can throw at it, which so far includes Metro Exodus, Half-Life: Alyx, Folding@Home, and more. But out of the box it was <em>15% less performant</em> than it is now, even while reporting maximum utilization. With a bit of debugging and a few small changes to the system, I've managed to reclaim that performance. Here's what I learned.</p><p><em>This post focuses on finding and addressing bottlenecks affecting GPU compute, but graphics processing can be slowed by many components: a slow CPU can prevent a GPU from running at maximum speed by failing to provide it with work quickly enough; a machine learning task that requires large amounts of data transfer may be limited elsewhere, such as GPU memory bandwidth, disk, or network activity. Rule these out first. 
A good rule of thumb is to check that GPU utilization is reported as nearly 100% while other components are not at their maximums.</em></p><h2 id="identifying-the-potential-for-more-performance">Identifying the potential for more performance</h2><p>I started investigating my GPU's performance after two observations: the first was that latency-sensitive VR games would sometimes stutter or jerk before becoming smooth again, with large spikes in frame latency (from sub-6ms times up to 15-18ms for brief fractions of a second); the second was that when running at maximum utilization, my GPU temperature was pinned at 86<strong>°</strong>C with the GPU fans running at full speed.</p>
<!--kg-card-begin: html-->
<iframe title="GPU temperature graph" src="https://grafana.palpant.us/dashboard-solo/snapshot/63ANDwSEORBvBhwauNFOhWHXAKoDoWKw?orgId=1&#x26;from=1584721377610&#x26;to=1584772261245&#x26;panelId=4" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
<p>Now, a bit of frame drop in a demanding game might be expected, new GPU or not. And it's hard to find good information about what qualifies as "high" temperatures for a GPU, and what the effects of running at high temperatures are. Still, 86<strong>°</strong>C is warm, and since my case is a <a href="https://www.fractal-design.com/products/cases/node/node-202/black/?ref=ghost.justin.palpant.us">Fractal Node 202</a>, an extremely compact mini-ITX that clocks in at 10.2L, cooling was at the top of my mind. I started to learn about what happens to a GPU as it reaches thermal maximums.</p><h3 id="sm-clock-throttling">SM Clock Throttling</h3><p>It turns out that to stay cool, an NVIDIA GPU reduces the clock frequency of its streaming multiprocessor (SM) units, which contain the CUDA cores - for tasks running on those cores, the decrease in performance is proportional to the decrease in frequency. The sign of a throttled GPU is an SM frequency that is uneven - full-power GPUs maintain a stable clock frequency. </p>
<!--kg-card-begin: html-->
<iframe title="GPU frequency graph" src="https://grafana.palpant.us/dashboard-solo/snapshot/vKRI9MBO6eKcVbMtab7ED8JeqWtL5Rb0?orgId=1&#x26;from=1584721377610&#x26;to=1584772261245&#x26;panelId=6" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
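<p>The same signal is visible without a dashboard. <code class="language-text">nvidia-smi</code> can emit the SM clock as bare CSV, and a small filter can flag samples that fall well below the boost clock (the 1815MHz reference is the RTX 2080 Super boost spec; the 95% cutoff is my own arbitrary threshold):</p>
<!--kg-card-begin: markdown-->
```shell
# Sample the SM clock once per second (Ctrl-C to stop) and flag dips;
# a full-power GPU should hold a steady frequency near its boost clock.
nvidia-smi --query-gpu=clocks.sm --format=csv,noheader,nounits -l 1 \
  | awk -v boost=1815 '{ printf "%d MHz (%.1f%% of boost)%s\n", $1, 100 * $1 / boost, ($1 < 0.95 * boost ? "  <- throttled?" : "") }'
```
<!--kg-card-end: markdown-->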
<p><em>Throttling confirmed!</em> The <code class="language-text">SM Clock</code> plot showed clear signs of throttling - oscillating constantly between 1770MHz and 1690MHz, and even dropping to 1650MHz for a sustained window. The reference RTX 2080 Super has a base clock of 1650MHz, with a boost clock of 1815MHz, so these would seem to be good speeds, but the instability in the frequency meant something was wrong.</p><p>On Windows, third-party programs like <a href="https://en.wikipedia.org/wiki/GPU-Z?ref=ghost.justin.palpant.us">GPU-Z</a> can help you detect this by showing a graph of GPU frequency over time. On Linux, the job is somewhat more difficult: you can run <code class="language-text">nvidia-smi -q -d CLOCK</code> to ask for the GPU frequency, but you must run this repeatedly to see if the clock frequency is changing.</p><p>For those of us on Linux and without datacenter-style monitoring, though, there's an easier way!</p><h3 id="performance">PERFORMANCE</h3><p>Just run <code class="language-text">nvidia-smi -q -d PERFORMANCE</code></p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ nvidia-smi -q -d PERFORMANCE

<span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span>NVSMI <span class="token assign-left variable">LOG</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span>

Driver Version                      <span class="token builtin class-name">:</span> <span class="token number">440.66</span>.08
CUDA Version                        <span class="token builtin class-name">:</span> <span class="token number">10.2</span>

Attached GPUs                       <span class="token builtin class-name">:</span> <span class="token number">1</span>
GPU 00000000:01:00.0
    Performance State               <span class="token builtin class-name">:</span> P2
    Clocks Throttle Reasons
        Idle                        <span class="token builtin class-name">:</span> Not Active
        Applications Clocks Setting <span class="token builtin class-name">:</span> Not Active
        SW Power Cap                <span class="token builtin class-name">:</span> Not Active
        HW Slowdown                 <span class="token builtin class-name">:</span> Not Active
            HW Thermal Slowdown     <span class="token builtin class-name">:</span> Not Active
            HW Power Brake Slowdown <span class="token builtin class-name">:</span> Not Active
        Sync Boost                  <span class="token builtin class-name">:</span> Not Active
        SW Thermal Slowdown         <span class="token builtin class-name">:</span> Active
Display Clock Setting       <span class="token builtin class-name">:</span> Not Active</code></pre></div><p>This is the best list of active throttles I've seen, and when I was investigating, it clearly and consistently showed <code class="language-text">SW Thermal Slowdown</code> - my GPU was too hot. Not hot enough to trigger the emergency brake that is a hardware slowdown, but hot enough to affect performance. Next up was to figure out how to fix it.<em>*</em></p><h2 id="gpu-tuned-air-cooling-on-linux">GPU-tuned air-cooling on Linux</h2><p>It was at this point that I learned something lucky: I had made a dumb mistake in my build and forgotten that the Fractal Node 202 has space for <a href="https://www.fractal-design.com/wp-content/uploads/2019/07/NODE-202-PS.pdf?ref=ghost.justin.palpant.us">two case fans beneath the GPU</a>. These are meant to be static pressure fans, pulling cool air in from outside, with the resulting hot air vented out by the CPU fan. I could add two <a href="https://www.amazon.com/gp/product/B01G5I6MYI/ref=ppx_yo_dt_b_asin_title_o06_s00?ie=UTF8&#x26;psc=1&#x26;ref=ghost.justin.palpant.us">Corsair ML120 Pro Blue 120mm</a> fans as case fans easily enough.</p><h3 id="improving-fan-control">Improving Fan Control</h3><p>My mini-ITX motherboard is the <a href="https://www.gigabyte.com/us/Motherboard/Z390-I-AORUS-PRO-WIFI-rev-10?ref=ghost.justin.palpant.us#kf">Gigabyte Z390 I Aorus Pro Wifi</a>, which has three fan headers and comes with the Smart Fan 5 fan control software in the BIOS. This was sufficient to make sure the fans turned on with default settings, but the control with <a href="https://www.gigabyte.com/mb/am4/cooling?ref=ghost.justin.palpant.us">Smart Fan 5</a> is limited: you can tie any of your fans to the CPU temperature, the PCH temperature, or an ambient temperature sensor somewhat removed from the CPU, and while the available fan curves are highly customizable, they are finicky. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ghost.justin.palpant.us/content/images/2020/04/smartfan5-4.jpg" class="kg-image" alt="Maximizing NVIDIA GPU performance on Linux" loading="lazy" width="771" height="504" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/04/smartfan5-4.jpg 600w, https://ghost.justin.palpant.us/content/images/2020/04/smartfan5-4.jpg 771w" sizes="(min-width: 720px) 720px"/><figcaption><a href="https://www.gigabyte.com/mb/am4/cooling?ref=ghost.justin.palpant.us"><span style="white-space: pre-wrap;">Smart Fan 5</span></a><span style="white-space: pre-wrap;"> supports multiple fans with complex fan curves, but motherboard temperature sensors weren't a good choice to eliminate thermal throttling on the GPU</span></figcaption></figure><p>Unfortunately, tying case fan speed to ambient temperature meant that these fans wouldn't spin up when the GPU was under load; tying it to CPU temperature meant that the fans would rapidly spin up and down even when the GPU was inactive, as CPU temperatures tend to be more variable than the temperatures of other components. Neither solution was sufficient.</p><h3 id="lm-sensors-and-fancontrol">lm-sensors and fancontrol</h3><p>The go-to for fan speed control on Linux is a combination of <a href="https://github.com/lm-sensors/lm-sensors?ref=ghost.justin.palpant.us">lm-sensors</a>, a powerful general-purpose hardware monitoring package, and <a href="http://manpages.ubuntu.com/manpages/bionic/man8/fancontrol.8.html?ref=ghost.justin.palpant.us">fancontrol</a>, a simple but useful script that monitors arbitrary temperature sensors and controls PWM outputs, in an infinite loop. 
On Ubuntu, both can be installed with <code class="language-text">apt</code> and configured:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> <span class="token function">apt</span> <span class="token function">install</span> lm-sensors
$ <span class="token function">sudo</span> sensors-detect
$ <span class="token function">sudo</span> pwmconfig</code></pre></div><p>For many systems this is sufficient to expose the CPU temperature sensors as well as the PWM outputs and sensors which provide fan speed control and feedback.</p><p>However, this doesn't work on this particular Gigabyte motherboard. </p><p>The Gigabyte motherboard uses a temperature sensor which isn't natively supported by the Linux kernel. Fortunately, there was once an enterprising developer who made a kernel module, it87.ko, which supports a large number of sensors of this type. The original maintainer chose to <a href="https://www.phoronix.com/scan.php?page=news_item&#x26;px=IT87-Linux-Driver-Axing&#x26;ref=ghost.justin.palpant.us">stop maintaining the repository</a>, but several forks exist. I chose <a href="https://github.com/hannesha/it87?ref=ghost.justin.palpant.us">hannesha/it87</a>, and installed it as a DKMS module so that it is rebuilt automatically for each future kernel I install.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token builtin class-name">cd</span> ~
$ <span class="token function">git</span> clone https://github.com/hannesha/it87
$ <span class="token builtin class-name">cd</span> it87
$ <span class="token function">make</span>
$ <span class="token function">sudo</span> <span class="token function">make</span> dkms</code></pre></div><figcaption><p><span style="white-space: pre-wrap;">Install it87.ko to add support for the Gigabyte Z390 fan control and sensors</span></p></figcaption></figure><p>To enable an installed module like this, you would typically use <a href="https://linux.die.net/man/8/modprobe?ref=ghost.justin.palpant.us">modprobe</a>, but here there was an issue: this repository is not kept up-to-date with newer motherboard specifications, and so when it attempts to detect the relevant hardware (which happens when the module is loaded), it fails - it is unable to detect the correct device.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> modprobe it87
modprobe: ERROR: could not insert <span class="token string">'it87'</span><span class="token builtin class-name">:</span> No such device</code></pre></div><figcaption><p><span style="white-space: pre-wrap;">it87.ko cannot be loaded by modprobe with default parameters</span></p></figcaption></figure><p>Others have <a href="https://github.com/a1wong/it87/issues/1?ref=ghost.justin.palpant.us">run into this issue on a similar motherboard</a> - the it87 kernel module has an argument, <code class="language-text">force_id</code>, with which you can specify the hardware configuration it should target. Though none of the available configurations is a perfect match for the Z390 (which is why automatic matching fails), some match closely enough that forcing the ID manually results in working access to the sensors. </p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> modprobe it87 <span class="token assign-left variable">force_id</span><span class="token operator">=</span>0x8628
$ <span class="token function">sudo</span> sensors-detect
<span class="token punctuation">..</span>.
Some Super I/O chips contain embedded sensors. We have to <span class="token function">write</span> to
standard I/O ports to probe them. This is usually safe.
Do you want to scan <span class="token keyword">for</span> Super I/O sensors? <span class="token punctuation">(</span>YES/no<span class="token punctuation">)</span>: 
Probing <span class="token keyword">for</span> Super-I/O at 0x2e/0x2f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'<span class="token punctuation">..</span>.                                      Yes
Found unknown chip with ID 0x8688
<span class="token punctuation">..</span>.

$ sensors
<span class="token punctuation">..</span>.
it8628-isa-0a40
Adapter: ISA adapter
in0:          +1.12 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>
in1:          +2.00 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>
in2:          +2.03 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>
in3:          +2.02 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>
in4:          +0.00 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>  ALARM
in5:          +1.06 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>
in6:          +1.21 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +3.06 V<span class="token punctuation">)</span>
3VSB:         +3.38 V  <span class="token punctuation">(</span>min <span class="token operator">=</span>  +0.00 V, max <span class="token operator">=</span>  +6.12 V<span class="token punctuation">)</span>
Vbat:         +3.19 V  
fan1:        <span class="token number">1496</span> RPM  <span class="token punctuation">(</span>min <span class="token operator">=</span>    <span class="token number">0</span> RPM<span class="token punctuation">)</span>
fan2:        <span class="token number">1541</span> RPM  <span class="token punctuation">(</span>min <span class="token operator">=</span>    <span class="token number">0</span> RPM<span class="token punctuation">)</span>
fan3:        <span class="token number">1464</span> RPM  <span class="token punctuation">(</span>min <span class="token operator">=</span>    <span class="token number">0</span> RPM<span class="token punctuation">)</span>
temp1:        +57.0°C  <span class="token punctuation">(</span>low  <span class="token operator">=</span> +127.0°C, high <span class="token operator">=</span> +127.0°C<span class="token punctuation">)</span>  sensor <span class="token operator">=</span> thermistor
temp2:        +64.0°C  <span class="token punctuation">(</span>low  <span class="token operator">=</span> +127.0°C, high <span class="token operator">=</span> +127.0°C<span class="token punctuation">)</span>  sensor <span class="token operator">=</span> thermistor
temp3:        +77.0°C  <span class="token punctuation">(</span>low  <span class="token operator">=</span> +127.0°C, high <span class="token operator">=</span> +127.0°C<span class="token punctuation">)</span>
temp4:         +0.0°C  <span class="token punctuation">(</span>low  <span class="token operator">=</span>  +0.0°C, high <span class="token operator">=</span> +127.0°C<span class="token punctuation">)</span>
temp5:        +65.0°C  <span class="token punctuation">(</span>low  <span class="token operator">=</span>  +0.0°C, high <span class="token operator">=</span> -120.0°C<span class="token punctuation">)</span>
temp6:        +63.0°C  <span class="token punctuation">(</span>low  <span class="token operator">=</span>  +0.0°C, high <span class="token operator">=</span> +127.0°C<span class="token punctuation">)</span>
intrusion0:  OK</code></pre></div><p>And just like that, I could see my fan speeds as well as a number of other sensors, and <code class="language-text">pwmconfig</code> was able to successfully detect the correct fan control PWM outputs.</p><p>To make this permanent, it's necessary to put the new kernel module into <code class="language-text">/etc/modules</code>, with the custom options in a separate conf file in <code class="language-text">/etc/modprobe.d</code>:</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dm-snapshot

<span class="token comment"># Generated by sensors-detect on Sun Jan 21 22:03:04 2018</span>
<span class="token comment"># Chip drivers</span>
coretemp

<span class="token comment"># Added manually, 2020-03-24, see hannesha/it87</span>
it87</code></pre></div><figcaption><p><code spellcheck="false" style="white-space: pre-wrap;"><span>/etc/modules</span></code><span style="white-space: pre-wrap;"> with it87 specified manually, coretemp found by </span><code spellcheck="false" style="white-space: pre-wrap;"><span>sensors-detect</span></code><span style="white-space: pre-wrap;">. Note that adding custom options here will not allow the module to be loaded on boot, and an error will be logged.</span></p></figcaption></figure><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># force kernel to assume the IT87 chip is similar to chip 0x8628, even though it isn't</span>
<span class="token comment"># seems to work on Z390 I Pro Wifi</span>
options it87 <span class="token assign-left variable">force_id</span><span class="token operator">=</span>0x8628</code></pre></div><figcaption><p><code spellcheck="false" style="white-space: pre-wrap;"><span>/etc/modprobe.d/it87.conf</span></code></p></figcaption></figure><h3 id="gpu-temperature-fan-control">GPU temperature fan control</h3><p>Having <code class="language-text">fancontrol</code> control the case fans was great, and easier to modify than leaving fan control in the BIOS, but still didn't solve the original problem: I needed my case fan speed to depend on GPU temperature.</p><p>At this point <a href="https://unix.stackexchange.com/questions/499409/adjust-fan-speed-via-fancontrol-according-to-hard-disk-temperature-hddtemp?ref=ghost.justin.palpant.us">a StackOverflow post about connecting HDD temperatures</a> to <a href="https://unix.stackexchange.com/questions/499409/adjust-fan-speed-via-fancontrol-according-to-hard-disk-temperature-hddtemp?ref=ghost.justin.palpant.us"><code class="language-text">fancontrol</code></a> revealed that <code class="language-text">fancontrol</code> treats temperature sensors as simple files, so while it will by default read from <code class="language-text">/sys/class/hwmon/{sensorpath}</code>, you can also specify an arbitrary file path from <code class="language-text">/</code> as a sensor input in <code class="language-text">/etc/fancontrol</code>. 
This allows you to update a file with an arbitrary temperature and have <code class="language-text">fancontrol</code> use that file's content as if it were a sensor.</p><p>With a quick bash script that uses <code class="language-text">nvidia-smi</code> to read the temperature from multiple GPUs and write those values to files, and a systemd unit to run it as a service, I could create a <code class="language-text">fancontrol</code>-compatible "GPU-temperature sensor":</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token shebang important">#!/bin/bash</span>
<span class="token comment"># Read NVIDIA GPU temperatures and write to a file on a duty cycle</span>

<span class="token assign-left variable">HELPTEXT</span><span class="token operator">=</span><span class="token string">"\
Export GPU temperatures to a directory. Each GPU is written to a file 'gpu_{gpu number}' in the directory.

Usage: export-gpu-temp --loop 2 --output /var/opt/gputemps --gpu 0 --gpu 1
Options:
  -o/--output (required) - path to a directory in which to write GPU temperatures
  -l/--loop (required) - time to sleep between GPU temperature query cycles, in seconds
  --gpu (required, multiple) - GPU number to query; repeat for multiple GPUs
"</span>

<span class="token builtin class-name">set</span> -euo pipefail

<span class="token assign-left variable">GPUS</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token keyword">while</span> <span class="token punctuation">[</span><span class="token punctuation">[</span> <span class="token variable">$#</span> -gt <span class="token number">0</span> <span class="token punctuation">]</span><span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token keyword">do</span>
    <span class="token assign-left variable">key</span><span class="token operator">=</span><span class="token string">"<span class="token variable">$1</span>"</span>

    <span class="token keyword">case</span> <span class="token variable">$key</span> <span class="token keyword">in</span>
        -h<span class="token operator">|</span>--help<span class="token punctuation">)</span>
        <span class="token builtin class-name">echo</span> <span class="token string">"<span class="token variable">$HELPTEXT</span>"</span>
        <span class="token builtin class-name">exit</span> <span class="token number">0</span>
        <span class="token punctuation">;</span><span class="token punctuation">;</span>
        -o<span class="token operator">|</span>--output<span class="token punctuation">)</span>
        <span class="token assign-left variable">dirpath_output</span><span class="token operator">=</span><span class="token variable">$2</span>
        if ! [ -d "$dirpath_output" ] || ! [ -w "$dirpath_output" ]; then
            echo "$dirpath_output is not a writable directory"
            exit 1
        fi
        <span class="token builtin class-name">shift</span>
        <span class="token builtin class-name">shift</span>
        <span class="token punctuation">;</span><span class="token punctuation">;</span>
        -l<span class="token operator">|</span>--loop<span class="token punctuation">)</span>
        <span class="token assign-left variable">loop_time</span><span class="token operator">=</span><span class="token variable">$2</span>
        <span class="token comment"># [[ &#x3C; ]] compares strings, not numbers, so use awk for the float comparison</span>
        if awk "BEGIN { exit !($loop_time &#x3C; 0.1) }"; then
            <span class="token builtin class-name">echo</span> <span class="token string">"loop_time is very small (<span class="token variable">${loop_time}</span>s), this may cause extra load on your GPU!"</span>
        <span class="token keyword">fi</span>
        <span class="token builtin class-name">shift</span>
        <span class="token builtin class-name">shift</span>
        <span class="token punctuation">;</span><span class="token punctuation">;</span>
        --gpu<span class="token punctuation">)</span>
        <span class="token assign-left variable">GPUS</span><span class="token operator">+=</span><span class="token punctuation">(</span><span class="token string">"<span class="token variable">$2</span>"</span><span class="token punctuation">)</span>
        <span class="token builtin class-name">shift</span>
        <span class="token builtin class-name">shift</span>
        <span class="token punctuation">;</span><span class="token punctuation">;</span>
        *<span class="token punctuation">)</span>
        <span class="token builtin class-name">echo</span> <span class="token string">"Unknown option <span class="token variable">$1</span>"</span>
        <span class="token builtin class-name">exit</span> 1
        <span class="token punctuation">;</span><span class="token punctuation">;</span>
    <span class="token keyword">esac</span>
<span class="token keyword">done</span>

<span class="token comment"># Fail fast if a required option was omitted; with set -u an unset variable</span>
<span class="token comment"># would otherwise cause a confusing error at first use</span>
if [[ -z "${dirpath_output:-}" || -z "${loop_time:-}" || ${#GPUS[@]} -eq 0 ]]; then
    echo "$HELPTEXT"
    exit 1
fi

<span class="token builtin class-name">echo</span> <span class="token string">"Querying GPUs: <span class="token variable">${GPUS<span class="token punctuation">[</span>@<span class="token punctuation">]</span>}</span>"</span>

<span class="token keyword">while</span> <span class="token boolean">true</span>
<span class="token keyword">do</span>
    <span class="token keyword">for</span> <span class="token for-or-select variable">gpu_id</span> <span class="token keyword">in</span> <span class="token variable">${GPUS<span class="token punctuation">[</span>@<span class="token punctuation">]</span>}</span>
    <span class="token keyword">do</span>
        <span class="token assign-left variable">gpu_output_path</span><span class="token operator">=</span><span class="token variable">${dirpath_output}</span>/gpu_<span class="token variable">${gpu_id}</span>

        <span class="token keyword">if</span> <span class="token operator">!</span> <span class="token assign-left variable">temp_degrees_c</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>nvidia-smi --query-gpu<span class="token operator">=</span>temperature.gpu --format<span class="token operator">=</span>csv,noheader --id<span class="token operator">=</span>$gpu_id<span class="token variable">)</span></span><span class="token punctuation">;</span> <span class="token keyword">then</span>
            <span class="token builtin class-name">echo</span> <span class="token string">"Failed to fetch GPU <span class="token variable">${gpu_id}</span>"</span>
        <span class="token keyword">else</span>
            <span class="token assign-left variable">temp_millidegrees_c</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$((</span>$temp_degrees_c <span class="token operator">*</span> <span class="token number">1000</span><span class="token variable">))</span></span>
            <span class="token builtin class-name">echo</span> <span class="token string">"<span class="token variable"><span class="token variable">$(</span><span class="token function">date</span> -Iseconds<span class="token variable">)</span></span> GPU <span class="token variable">${gpu_id}</span> has temperature <span class="token variable">${temp_degrees_c}</span>"</span>

            <span class="token builtin class-name">echo</span> <span class="token variable">$temp_millidegrees_c</span> <span class="token operator">></span> <span class="token variable">$gpu_output_path</span>
        <span class="token keyword">fi</span>
    <span class="token keyword">done</span>

    <span class="token builtin class-name">echo</span> <span class="token string">"<span class="token variable"><span class="token variable">$(</span><span class="token function">date</span> -Iseconds<span class="token variable">)</span></span> Sleeping <span class="token variable">${loop_time}</span>"</span>
    <span class="token function">sleep</span> <span class="token variable">$loop_time</span>
<span class="token keyword">done</span>
</code></pre></div><figcaption><p><code spellcheck="false" style="white-space: pre-wrap;"><span>export-gpu-temp</span></code><span style="white-space: pre-wrap;">, a Bash script to write one or multiple GPU temperatures to individual files, to mimic a hwmon sensor</span></p></figcaption></figure><p>Note that <code class="language-text">fancontrol</code> expects temperatures to be provided in millidegrees Celsius, following the <a href="https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface?ref=ghost.justin.palpant.us">hwmon interface</a>, so the output from <code class="language-text">nvidia-smi</code> needed to be multiplied by 1000.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="systemd"><pre class="language-systemd"><code class="language-systemd"><span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">Unit</span><span class="token punctuation">]</span></span>
<span class="token key attr-name">Description</span><span class="token punctuation">=</span><span class="token value attr-value">Export GPU temperatures to a file continuously</span>
<span class="token key attr-name">Documentation</span><span class="token punctuation">=</span>

<span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">Service</span><span class="token punctuation">]</span></span>
<span class="token key attr-name">Type</span><span class="token punctuation">=</span><span class="token value attr-value">simple</span>
<span class="token key attr-name">ExecStart</span><span class="token punctuation">=</span><span class="token value attr-value">/usr/local/bin/export-gpu-temp --gpu 0 --output /var/opt/fancontrol/ --loop 1</span>
<span class="token key attr-name">Restart</span><span class="token punctuation">=</span><span class="token value attr-value">on-failure</span>

<span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">Install</span><span class="token punctuation">]</span></span>
<span class="token key attr-name">WantedBy</span><span class="token punctuation">=</span><span class="token value attr-value">multi-user.target</span>
</code></pre></div><figcaption><p><span style="white-space: pre-wrap;">A systemd Unit to export temperatures from GPU 0 to /var/opt/fancontrol/gpu_0 every second.</span></p></figcaption></figure><p>With that systemd unit up and running, it was a simple matter to modify <code class="language-text">/etc/fancontrol</code> manually to point to the correct "hardware sensor" and establish temperature bounds for the two case fans. I chose to have the case fans shut off when the GPU temperature was below 60°C, and to reach max speed at 80°C. Here <code class="language-text">hwmon3/pwm2</code> and <code class="language-text">hwmon3/pwm3</code> are the two case fans. <code class="language-text">hwmon3/pwm1</code> is the CPU fan, and is tied to <code class="language-text">hwmon2/temp2_input</code>, which is the temperature of the first CPU core.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token assign-left variable">INTERVAL</span><span class="token operator">=</span><span class="token number">1</span>
<span class="token assign-left variable">DEVPATH</span><span class="token operator">=</span>hwmon2<span class="token operator">=</span>devices/platform/coretemp.0 <span class="token assign-left variable">hwmon3</span><span class="token operator">=</span>devices/platform/it87.2624
<span class="token assign-left variable">DEVNAME</span><span class="token operator">=</span>hwmon2<span class="token operator">=</span>coretemp <span class="token assign-left variable">hwmon3</span><span class="token operator">=</span>it8628
<span class="token assign-left variable">FCTEMPS</span><span class="token operator">=</span>hwmon3/pwm3<span class="token operator">=</span>/var/opt/fancontrol/gpu_0 hwmon3/pwm2<span class="token operator">=</span>/var/opt/fancontrol/gpu_0 hwmon3/pwm1<span class="token operator">=</span>hwmon2/temp2_input
<span class="token assign-left variable">FCFANS</span><span class="token operator">=</span>hwmon3/pwm3<span class="token operator">=</span>hwmon3/fan3_input hwmon3/pwm2<span class="token operator">=</span>hwmon3/fan2_input hwmon3/pwm1<span class="token operator">=</span>hwmon3/fan1_input
<span class="token assign-left variable">MINTEMP</span><span class="token operator">=</span><span class="token number">50</span> hwmon3/pwm3<span class="token operator">=</span><span class="token number">60</span> hwmon3/pwm2<span class="token operator">=</span><span class="token number">60</span> hwmon3/pwm1<span class="token operator">=</span><span class="token number">60</span>
<span class="token assign-left variable">MAXTEMP</span><span class="token operator">=</span><span class="token number">50</span> hwmon3/pwm3<span class="token operator">=</span><span class="token number">80</span> hwmon3/pwm2<span class="token operator">=</span><span class="token number">80</span> hwmon3/pwm1<span class="token operator">=</span><span class="token number">95</span>
<span class="token assign-left variable">MINSTART</span><span class="token operator">=</span><span class="token number">20</span> hwmon3/pwm3<span class="token operator">=</span><span class="token number">20</span> hwmon3/pwm2<span class="token operator">=</span><span class="token number">20</span> hwmon3/pwm1<span class="token operator">=</span><span class="token number">56</span>
<span class="token assign-left variable">MINSTOP</span><span class="token operator">=</span><span class="token number">0</span> hwmon3/pwm3<span class="token operator">=</span><span class="token number">0</span> hwmon3/pwm2<span class="token operator">=</span><span class="token number">0</span> hwmon3/pwm1<span class="token operator">=</span><span class="token number">16</span>
<span class="token assign-left variable">MINPWM</span><span class="token operator">=</span><span class="token number">0</span> hwmon3/pwm3<span class="token operator">=</span><span class="token number">0</span> hwmon3/pwm2<span class="token operator">=</span><span class="token number">0</span> hwmon3/pwm1<span class="token operator">=</span><span class="token number">16</span>
<span class="token assign-left variable">MAXPWM</span><span class="token operator">=</span><span class="token number">230</span> hwmon3/pwm3<span class="token operator">=</span><span class="token number">250</span> hwmon3/pwm2<span class="token operator">=</span><span class="token number">250</span> hwmon3/pwm1<span class="token operator">=</span><span class="token number">250</span>
<span class="token assign-left variable">AVERAGE</span><span class="token operator">=</span><span class="token number">5</span></code></pre></div><figcaption><p><span style="white-space: pre-wrap;">Final </span><code spellcheck="false" style="white-space: pre-wrap;"><span>/etc/fancontrol</span></code><span style="white-space: pre-wrap;">. Read more about the available options on the </span><a href="https://linux.die.net/man/8/fancontrol?ref=ghost.justin.palpant.us"><span style="white-space: pre-wrap;">fancontrol man page</span></a></p></figcaption></figure><p>With the <code class="language-text">it87</code> kernel module, <code class="language-text">fancontrol</code>, and this script, I believed I was in a good place: sensible, GPU-aware fan control should resolve the throttling. GPU temperatures were noticeably lower under load, so it was time to check <code class="language-text">-d PERFORMANCE</code> again.</p><h3 id="sw-power-throttle">SW Power Throttle</h3><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">nvidia-smi -q -d PERFORMANCE

<span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span>NVSMI <span class="token assign-left variable">LOG</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span>

Driver Version                      <span class="token builtin class-name">:</span> <span class="token number">440.66</span>.08
CUDA Version                        <span class="token builtin class-name">:</span> <span class="token number">10.2</span>

Attached GPUs                       <span class="token builtin class-name">:</span> <span class="token number">1</span>
GPU 00000000:01:00.0
    Performance State               <span class="token builtin class-name">:</span> P2
    Clocks Throttle Reasons
        Idle                        <span class="token builtin class-name">:</span> Not Active
        Applications Clocks Setting <span class="token builtin class-name">:</span> Not Active
        SW Power Cap                <span class="token builtin class-name">:</span> Active
        HW Slowdown                 <span class="token builtin class-name">:</span> Not Active
            HW Thermal Slowdown     <span class="token builtin class-name">:</span> Not Active
            HW Power Brake Slowdown <span class="token builtin class-name">:</span> Not Active
        Sync Boost                  <span class="token builtin class-name">:</span> Not Active
        SW Thermal Slowdown         <span class="token builtin class-name">:</span> Not Active
        Display Clock Setting       <span class="token builtin class-name">:</span> Not Active</code></pre></div><p>After all that work to fix the cooling problem, one new problem had developed: this GPU has a <a href="https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-super.c3439?ref=ghost.justin.palpant.us">TDP of 250W</a>. At full throttle and when properly cooled, that wasn't enough power. Fortunately, power limit controls are available in <code class="language-text">nvidia-smi</code>. We can check what power range is appropriate for the GPU with <code class="language-text">nvidia-smi -q -d POWER</code>:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">nvidia-smi -q -d POWER
<span class="token punctuation">..</span>.
        Power Limit                 <span class="token builtin class-name">:</span> <span class="token number">250.00</span> W
        Default Power Limit         <span class="token builtin class-name">:</span> <span class="token number">250.00</span> W
        Enforced Power Limit        <span class="token builtin class-name">:</span> <span class="token number">250.00</span> W
        Min Power Limit             <span class="token builtin class-name">:</span> <span class="token number">125.00</span> W
        Max Power Limit             <span class="token builtin class-name">:</span> <span class="token number">292.00</span> W</code></pre></div><p>This shows that even though the reference power limit is 250W, it can be configured as high as 292W and as low as 125W. </p><p>To change the power limit, run <code class="language-text">nvidia-smi -pl $PL_IN_WATTS</code> as a superuser. Note that you may first need to enable persistence mode on the GPU with <code class="language-text">nvidia-smi -pm 1</code>. <a href="https://bitcointalk.org/index.php?topic=2848723.0&#x26;ref=ghost.justin.palpant.us">This great blog post</a> has more details, and also includes a quick introduction to overclocking an NVIDIA GPU on Linux, for the interested.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">sudo</span> nvidia-smi -pl <span class="token number">292</span>
nvidia-smi -q -d POWER
<span class="token punctuation">..</span>.
        Power Limit                 <span class="token builtin class-name">:</span> <span class="token number">292.00</span> W
        Default Power Limit         <span class="token builtin class-name">:</span> <span class="token number">250.00</span> W
        Enforced Power Limit        <span class="token builtin class-name">:</span> <span class="token number">292.00</span> W
        Min Power Limit             <span class="token builtin class-name">:</span> <span class="token number">125.00</span> W
        Max Power Limit             <span class="token builtin class-name">:</span> <span class="token number">292.00</span> W
</code></pre></div><figcaption><p><span style="white-space: pre-wrap;">Modify NVIDIA GPU power limits on Linux with </span><code spellcheck="false" style="white-space: pre-wrap;"><span>nvidia-smi -pl</span></code></p></figcaption></figure><h2 id="results">Results</h2><p>With the maximum power increased, fans installed and properly controlled, the GPU now runs at a comfortable 72-75°C, and the SM clock frequency remained <a href="https://gitlab.palpant.us/grafana/dashboard/snapshot/nIUkAAOLArzoOAPzQuZcYXFwI6FraQs2?ref=ghost.justin.palpant.us">stable at 1890MHz</a> for long intervals.</p>
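<p>One caveat: power limits set with <code class="language-text">nvidia-smi</code> do not survive a reboot. A minimal sketch of a boot-time fix, assuming a systemd system and the 292W value from above (the unit name and the <code class="language-text">nvidia-smi</code> path are my assumptions; adjust for your setup):</p>

```shell
# Sketch only: nvidia-smi settings (persistence mode, power limit) reset at
# boot, so reapply them from a oneshot systemd unit. Unit name, binary path,
# and the 292 W value are assumptions mirroring the commands above.
unit="[Unit]
Description=Raise NVIDIA GPU power limit at boot

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 292

[Install]
WantedBy=multi-user.target"

printf '%s\n' "$unit"
# To install (as root):
#   printf '%s\n' "$unit" > /etc/systemd/system/nvidia-power-limit.service
#   systemctl daemon-reload
#   systemctl enable --now nvidia-power-limit.service
```

<p>A oneshot unit may list multiple <code class="language-text">ExecStart</code> lines, which run in order, so persistence mode is enabled before the limit is raised.</p>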
<!--kg-card-begin: html-->
<iframe title="GPU frequency graph without throttling" src="https://grafana.palpant.us/dashboard-solo/snapshot/nIUkAAOLArzoOAPzQuZcYXFwI6FraQs2?orgId=1&#x26;from=1587398962225&#x26;to=1587438495381&#x26;panelId=6" style="width:100%" height="400" frameborder="0"/>
<!--kg-card-end: html-->
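<p>The throttle-reason check can also be scripted rather than eyeballed from the full <code class="language-text">-q</code> report. A sketch using <code class="language-text">nvidia-smi</code>'s CSV query interface (the field names are listed by <code class="language-text">nvidia-smi --help-query-gpu</code>; the helper name is mine):</p>

```shell
# Sketch: decide whether any throttle reason is active from one CSV line of
# nvidia-smi query output, e.g. "Not Active, Not Active".
not_throttled() {
    # A bare "Active" at the start of a field (not preceded by "Not ")
    # means that throttle reason is currently firing.
    ! printf '%s\n' "$1" | grep -qE '(^|, )Active'
}

# Live usage would query the GPU directly, e.g.:
#   reasons=$(nvidia-smi --query-gpu=clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sw_thermal_slowdown --format=csv,noheader)
# Demo with a captured line instead of live hardware:
not_throttled "Not Active, Not Active" && echo "no throttling"
```

<p>With the full report, the same conclusion is visible at a glance:</p>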
<p><code class="language-text">nvidia-smi</code> no longer indicates any form of throttling is occurring:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">nvidia-smi -q -d PERFORMANCE

<span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span>NVSMI <span class="token assign-left variable">LOG</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span>

Timestamp                           <span class="token builtin class-name">:</span> Sat Apr <span class="token number">25</span> <span class="token number">13</span>:04:02 <span class="token number">2020</span>
Driver Version                      <span class="token builtin class-name">:</span> <span class="token number">440.66</span>.08
CUDA Version                        <span class="token builtin class-name">:</span> <span class="token number">10.2</span>

Attached GPUs                       <span class="token builtin class-name">:</span> <span class="token number">1</span>
GPU 00000000:01:00.0
    Performance State               <span class="token builtin class-name">:</span> P0
    Clocks Throttle Reasons
        Idle                        <span class="token builtin class-name">:</span> Not Active
        Applications Clocks Setting <span class="token builtin class-name">:</span> Not Active
        SW Power Cap                <span class="token builtin class-name">:</span> Not Active
        HW Slowdown                 <span class="token builtin class-name">:</span> Not Active
            HW Thermal Slowdown     <span class="token builtin class-name">:</span> Not Active
            HW Power Brake Slowdown <span class="token builtin class-name">:</span> Not Active
        Sync Boost                  <span class="token builtin class-name">:</span> Not Active
        SW Thermal Slowdown         <span class="token builtin class-name">:</span> Not Active
        Display Clock Setting       <span class="token builtin class-name">:</span> Not Active</code></pre></div><p>But the real test is in the benchmarks. While I somewhat unreliably observed higher <a href="https://foldingathome.org/iamoneinamillion/?ref=ghost.justin.palpant.us">Folding@Home</a> Points Per Day, I turned to a more rigorous benchmark with the Phoronix Test Suite:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ghost.justin.palpant.us/content/images/2020/04/Screenshot-2020-04-21-18.11.38.png" class="kg-image" alt="Maximizing NVIDIA GPU performance on Linux" loading="lazy" width="1980" height="756" srcset="https://ghost.justin.palpant.us/content/images/size/w600/2020/04/Screenshot-2020-04-21-18.11.38.png 600w, https://ghost.justin.palpant.us/content/images/size/w1000/2020/04/Screenshot-2020-04-21-18.11.38.png 1000w, https://ghost.justin.palpant.us/content/images/size/w1600/2020/04/Screenshot-2020-04-21-18.11.38.png 1600w, https://ghost.justin.palpant.us/content/images/2020/04/Screenshot-2020-04-21-18.11.38.png 1980w" sizes="(min-width: 720px) 720px"/><figcaption><span style="white-space: pre-wrap;">Benchmarking with Phoronix Test Suite's </span><a href="https://openbenchmarking.org/test/pts/unigine-heaven-1.6.4?ref=ghost.justin.palpant.us"><span style="white-space: pre-wrap;">pts/unigine-heaven</span></a><span style="white-space: pre-wrap;"> benchmark. Full result </span><a href="https://openbenchmarking.org/result/2003228-JUST-191230635?ref=ghost.justin.palpant.us#r-3ae55c63f1481e2e8194b34a4a304a3d8ad11d0a"><span style="white-space: pre-wrap;">here</span></a><span style="white-space: pre-wrap;">, with my old GTX 1050Ti for reference.</span></figcaption></figure><p>An increase of 25FPS, or ~15%, is nothing to sneeze at! 
It's not huge, but it's approximately the difference between adjacent grades of graphics cards these days, so this felt like getting a free upgrade.</p><h2 id="summary">Summary</h2><p>Check for GPU throttling with <code class="language-text">nvidia-smi -q -d PERFORMANCE --loop-ms=500</code>. If thermal throttling occurs, consider improving cooling with better fans, additional case fans, or, failing that, a liquid-cooling system. If no thermal throttling is happening, don't waste time or money on a complex cooling setup! If you encounter hardware power throttling, you may need to buy a more powerful power supply. If software-defined power throttling is happening, try to change the software-defined power limits by checking the acceptable power range with <code class="language-text">nvidia-smi -q -d POWER</code> and setting the active limits with <code class="language-text">nvidia-smi -pl</code>.</p><p>As an added benefit, I find that it's easy to use the software-defined power limit as a cheap GPU throttle: reducing the power limit to 150W makes the GPU run cool, at the cost of about half the performance.</p><p><em>*The Performance State indicator is also interesting, and you can read more about it in </em><a href="https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__gpupstate.html?ref=ghost.justin.palpant.us"><em>the NVIDIA docs</em></a><em>. According to </em><a href="https://www.reddit.com/r/RenderToken/comments/9w2rd9/how_to_use_maximum_p0_power_state_with_nvidia?ref=ghost.justin.palpant.us"><em>this Reddit post</em></a><em>, P0-P2 power states have identical core clock frequencies, but P2 reduces the memory clock frequency. It also states that all compute other than live graphical rendering will keep the card in the P2 state. 
Since my memory utilization is low, this isn't a problem, but if memory bandwidth or utilization is a concern, check whether the card is being held in a reduced power state.</em></p><p><em>**I have since stably overclocked the SM frequency by +100MHz, and now see constant frequencies at 1995MHz without thermal, power, or other stability issues. The benchmark and plots above, however, show the state of the system and the performance gains without any overclocking.</em></p>]]></content:encoded></item><item><title><![CDATA[Understanding btrfs on Ubuntu - An introduction to btrfs]]></title><description><![CDATA[What I've learned trying btrfs, the next-generation filesystem, on Ubuntu.]]></description><link>https://justin.palpant.us/btrfs-on-ubuntu-part-1/</link><guid isPermaLink="false">Ghost__Post__604dbf89a33ad9000707c3e2</guid><category><![CDATA[btrfs]]></category><category><![CDATA[linux]]></category><category><![CDATA[tech]]></category><dc:creator><![CDATA[Justin Palpant]]></dc:creator><pubDate>Mon, 12 Apr 2021 16:00:00 GMT</pubDate><media:content url="https://justin.palpant.us/static/eeb5700129d2a72695f9db10c91dec66/B-tree.svg" medium="image"/><content:encoded><![CDATA[<img src="https://justin.palpant.us/static/eeb5700129d2a72695f9db10c91dec66/B-tree.svg" alt="Understanding btrfs on Ubuntu - An introduction to btrfs"/><p>With a recent reinstall of Ubuntu 20.04 on my personal computer, I decided to explore a new type of filesystem and find a replacement for how I used the traditional ext4 and LVM - and I settled on <a href="https://en.wikipedia.org/wiki/Btrfs?ref=ghost.justin.palpant.us">btrfs</a>. While I was and am excited about the power of this next-generation filesystem, I've discovered that it doesn't run itself - with powerful features comes complexity.</p><h1 id="background">Background</h1><p>The computer I am testing on has four disks: one 1TB SSD, one 2TB SSD, and two additional 512GB NVMe SSDs. 
The 1TB and 2TB disks are managed by a hardware RAID controller.</p><p>It's a personal computer for gaming and daily use, as well as a place to experiment. It runs a small single-node Kubernetes cluster which I use for automated builds of some of my repos, for running <a href="https://foldingathome.org/?ref=ghost.justin.palpant.us">Folding@Home</a>, and several other services. You can learn more about this machine and my other infrastructure from the <a href="https://gitlab.palpant.us/justin/palpantlab-infra/-/blob/master/README.md?ref=ghost.justin.palpant.us#palpantlab-sfo">README</a> for my homelab's repository.</p><p>During the reinstall, I also chose to dual-boot Windows and Linux, splitting the main disk in two: 200GB for the Ubuntu root (<code class="language-text">/</code>) and 800GB for Windows. The two 512GB drives don't have a hardware RAID controller, but I wanted to use those in RAID1 for the <code class="language-text">/home</code> folder in Ubuntu.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> lsblk
sda           <span class="token number">8</span>:0    <span class="token number">0</span>   <span class="token number">1</span>.8T  <span class="token number">0</span> disk  
└─md126       <span class="token number">9</span>:126  <span class="token number">0</span> <span class="token number">931</span>.5G  <span class="token number">0</span> raid1 
  ├─md126p1 <span class="token number">259</span>:2    <span class="token number">0</span>   100M  <span class="token number">0</span> part  /boot/efi
  ├─md126p2 <span class="token number">259</span>:3    <span class="token number">0</span>    16M  <span class="token number">0</span> part  
  ├─md126p3 <span class="token number">259</span>:4    <span class="token number">0</span> <span class="token number">735</span>.6G  <span class="token number">0</span> part  <span class="token operator">&#x3C;</span>-- Windows 
  ├─md126p4 <span class="token number">259</span>:5    <span class="token number">0</span>   499M  <span class="token number">0</span> part  
  └─md126p5 <span class="token number">259</span>:6    <span class="token number">0</span> <span class="token number">195</span>.3G  <span class="token number">0</span> part  /
sdb           <span class="token number">8</span>:16   <span class="token number">0</span> <span class="token number">931</span>.5G  <span class="token number">0</span> disk  
└─md126       <span class="token number">9</span>:126  <span class="token number">0</span> <span class="token number">931</span>.5G  <span class="token number">0</span> raid1 
  ├─md126p1 <span class="token number">259</span>:2    <span class="token number">0</span>   100M  <span class="token number">0</span> part  /boot/efi
  ├─md126p2 <span class="token number">259</span>:3    <span class="token number">0</span>    16M  <span class="token number">0</span> part  
  ├─md126p3 <span class="token number">259</span>:4    <span class="token number">0</span> <span class="token number">735</span>.6G  <span class="token number">0</span> part  <span class="token operator">&#x3C;</span>-- Windows  
  ├─md126p4 <span class="token number">259</span>:5    <span class="token number">0</span>   499M  <span class="token number">0</span> part  
  └─md126p5 <span class="token number">259</span>:6    <span class="token number">0</span> <span class="token number">195</span>.3G  <span class="token number">0</span> part  /
nvme0n1     <span class="token number">259</span>:0    <span class="token number">0</span> <span class="token number">465</span>.8G  <span class="token number">0</span> disk  /home
nvme1n1     <span class="token number">259</span>:1    <span class="token number">0</span> <span class="token number">465</span>.8G  <span class="token number">0</span> disk  </code></pre></div><figcaption>lsblk output showing RAID disks and additional NVMe disks</figcaption></figure><p>I have also spent significant time in the past, out of curiosity, testing different systems of backup and restore for this machine, including Ubuntu's built-in <a href="https://wiki.gnome.org/Apps/DejaDup?ref=ghost.justin.palpant.us">Deja Dup</a>, <a href="https://duplicacy.com/?ref=ghost.justin.palpant.us">duplicacy</a>, and full-disk backups with <code class="language-text">dd</code> and <code class="language-text">tar</code>, and looked forward to some of the features btrfs provides to make backups easier.</p><p>So, with these goals in mind, what are some of the features btrfs provides to make this happen?</p><h1 id="btrfs-a-next-generation-filesystem">btrfs - a next-generation filesystem</h1><p>Development of btrfs began in 2007; it was merged into the mainline Linux kernel in 2009 and declared stable in 2013. The name has <a href="https://en.wikipedia.org/wiki/Btrfs?ref=ghost.justin.palpant.us">various pronunciations</a>, and is a reference to the filesystem's core data structure: a <a href="https://en.wikipedia.org/wiki/Copy-on-write?ref=ghost.justin.palpant.us">copy-on-write</a> (COW) B-tree.</p><p>Built into btrfs are a number of features you may not expect the filesystem to provide for you. Commonly-desired features that Linux users would install additional software for have been built into the filesystem from the beginning.</p><h3 id="device-management">Device management</h3><p>btrfs gives users some ways to manage physical devices directly, like LVM, and provides some support for software-RAID arrangements, like mdadm. 
Though not meant to compete feature-for-feature, btrfs supports:</p><ul><li>Creating a filesystem with metadata, data, or both in RAID 0, RAID 1, RAID 10, RAID 5 and RAID 6</li><li>Performing day-2 addition of disks to an existing filesystem, and conversion between RAID levels</li><li>Resizing the filesystem to take advantage of added disks, or decreasing the size of the filesystem to account for lost disks</li><li>Balancing data across disks, including when replacing disks due to failure</li></ul><p>btrfs doesn't provide complex volume group and logical device management like LVM - instead, all devices are joined into a common pool, and individual pieces of data are arranged onto the storage pool according to RAID configuration.</p><p>However, though devices are pooled within each btrfs filesystem, it is possible to have multiple <em>filesystems</em> active, and the command <code class="language-text">btrfs device scan</code> searches all devices for distinct btrfs filesystems.</p><p>The Ubuntu installer creates one btrfs filesystem for the root directory. I placed this on a 200GB partition of the hardware-RAID-controlled pair of disks. I moved the home directory to a second filesystem using btrfs' RAID:</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">sudo</span> mkfs.btrfs -m raid1 -d raid1 /dev/nvme0n1 /dev/nvme1n1
<span class="token function">sudo</span> <span class="token function">mount</span> /dev/nvme0n1 /home</code></pre></div><figcaption>Creating a second btrfs filesystem with software RAID1 for /home</figcaption></figure><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> btrfs filesystem show
Label: none  uuid: 3451815e-07c2-4b60-bd43-68fd338aa881
        Total devices <span class="token number">1</span> FS bytes used <span class="token number">172</span>.82GiB
        devid    <span class="token number">1</span> size <span class="token number">195</span>.31GiB used <span class="token number">177</span>.03GiB path /dev/md126p5

Label: none  uuid: af5e3ee6-40c6-4dc0-82f3-5f6a025f842c
        Total devices <span class="token number">2</span> FS bytes used <span class="token number">49</span>.55GiB
        devid    <span class="token number">1</span> size <span class="token number">465</span>.76GiB used <span class="token number">83</span>.03GiB path /dev/nvme0n1
        devid    <span class="token number">2</span> size <span class="token number">465</span>.76GiB used <span class="token number">83</span>.03GiB path /dev/nvme1n1</code></pre></div><h3 id="subvolumes">Subvolumes</h3><p>btrfs allows users to create multiple subvolumes within a filesystem. btrfs subvolumes resemble folders within the filesystem, and can be nested within each other. The mounted filesystem within which you create the subvolume is the subvolume's parent. Mounting a parent subvolume implicitly mounts the child subvolumes at their path. Each subvolume has a UUID and a numeric ID, and is also uniquely identified by its <code class="language-text">name</code>, which is also the path at which it will appear under its parent. Because this <code class="language-text">name</code> is also the subvolume's path, moving a subvolume is the same as renaming it.</p><p>However, subvolumes differ from folders in a number of ways:</p><ul><li>Subvolumes can be individually mounted at another location</li><li>Subvolumes are globally queryable with <code class="language-text">btrfs subvolume list &#x3C;path></code></li></ul><p>All btrfs filesystems have at least one subvolume: this is the root subvolume, and by convention has ID=5 and the hidden name <code class="language-text">FS_TREE</code>. This root subvolume is not included in <code class="language-text">btrfs subvolume list</code>, but in filesystems with additional subvolumes, it will be the ultimate parent of any child subvolumes. The <code class="language-text">top level</code> field of <code class="language-text">btrfs subvolume list</code> indicates the ID of the parent subvolume. 
With <code class="language-text">-a</code>, the <code class="language-text">path</code> field shows the name of the parent as the first part of the path.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> btrfs subvolume list -atq /
ID      gen     <span class="token function">top</span> level       parent_uuid     path
--      ---     ---------       -----------     ----
<span class="token number">256</span>     <span class="token number">336674</span>  <span class="token number">5</span>               -               <span class="token operator">&#x3C;</span>FS_TREE<span class="token operator">></span>/@</code></pre></div><figcaption>Ubuntu subvolumes for the root filesystem on OS install</figcaption></figure><p>Any subvolume within the filesystem can be mounted as the root, not just <code class="language-text">FS_TREE</code>. Ubuntu, in fact, creates a child subvolume with the name <code class="language-text">@</code> and mounts this subvolume to the path <code class="language-text">/</code> instead of mounting <code class="language-text">FS_TREE</code> to that location.</p><p>For more help understanding subvolumes and when to use them, check out the btrfs <a href="https://btrfs.wiki.kernel.org/index.php/SysadminGuide?ref=ghost.justin.palpant.us#Subvolumes">SysadminGuide</a>.</p><h3 id="snapshots">Snapshots</h3><p>Because btrfs is copy-on-write, it supports lightweight <a href="https://btrfs.wiki.kernel.org/index.php/SysadminGuide?ref=ghost.justin.palpant.us#Snapshots">snapshots</a>, which capture the current state and then record only the changes made to the filesystem afterward.</p><p>These snapshots are actually new subvolumes, identical to plain subvolumes except that they are populated with the content of their source at creation. This is done without consuming any space initially because the new subvolume simply references the data without making a new copy. Snapshots live within the filesystem as subvolumes, and are mountable and browsable like any other.</p><p>Read-only snapshots preserve the state of the filesystem at a fixed point in time, while read-write snapshots created from them can be used to restore that state and recover.</p><h3 id="compression">Compression</h3><p>On top of these features, btrfs also supports automatic compression using one of several algorithms: <u>zlib</u>, <u>lzo</u>, and <u>zstd</u>. 
Compression can be activated at mount time: either on files a btrfs-designed heuristic judges compressible, by specifying <code class="language-text">-o compress</code> (though some users do not recommend this approach), or on all files, by specifying <code class="language-text">-o compress-force</code>. You can also use extended attributes to enable or disable compression on individual files using the command <code class="language-text">btrfs property set &#x3C;file> compression ...</code> or <code class="language-text">chattr +c</code>.</p><hr/><blockquote>Credit for the title image to CyHawk - Own work based on [1]., CC BY-SA 3.0, <a href="https://commons.wikimedia.org/w/index.php?curid=11701365&#x26;ref=ghost.justin.palpant.us">https://commons.wikimedia.org/w/index.php?curid=11701365</a></blockquote>]]></content:encoded></item><item><title><![CDATA[Folding@Home on Kubernetes]]></title><description><![CDATA[Donate extra compute resources to investigate COVID-19 using Kubernetes.]]></description><link>https://justin.palpant.us/folding-home-on-kubernetes/</link><guid isPermaLink="false">Ghost__Post__5e707ad1434787000756fd29</guid><category><![CDATA[tech]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[prometheus]]></category><category><![CDATA[grafana]]></category><category><![CDATA[covid-19]]></category><category><![CDATA[coronavirus]]></category><dc:creator><![CDATA[Justin Palpant]]></dc:creator><pubDate>Tue, 17 Mar 2020 16:00:00 GMT</pubDate><media:content url="https://justin.palpant.us/static/c5c9715c185f4c553d124ffe4abd029d/2019-nCoV-CDC-23311-progressive.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://justin.palpant.us/static/c5c9715c185f4c553d124ffe4abd029d/2019-nCoV-CDC-23311-progressive.jpeg" alt="Folding@Home on Kubernetes"/><p><em>Update: I now publish <a href="https://quay.io/repository/jpalpant/fah-client?tab=info&#x26;ref=ghost.justin.palpant.us">jpalpant/fah-client</a> if you are looking for a 
thin, GPU-supporting wrapper around FAHClient until an official one is released, as well as <a href="https://hub.docker.com/r/jpalpant/folding-exporter?ref=ghost.justin.palpant.us">jpalpant/folding-exporter</a>, a Prometheus exporter for tracking F@H PPD.</em></p><p><a href="https://foldingathome.org/?ref=ghost.justin.palpant.us">Folding@Home</a> (F@H or FAH) is an incredible distributed computing project that lets individuals sign up to donate extra compute resources to researchers solving problems that would otherwise only be accessible to those with the most powerful of supercomputers. It uses those resources by assigning each computer small pieces of incredibly complex molecular simulations, and then assembling the results when each computer is finished*. With the power of hundreds of thousands of users, <a href="https://en.wikipedia.org/wiki/Folding@home?ref=ghost.justin.palpant.us#Performance">F@H</a> rivals <a href="https://en.wikipedia.org/wiki/TOP500?ref=ghost.justin.palpant.us#TOP_500">the fastest supercomputers in the world today</a>.</p><p>Having recently acquired a new GPU for my PC, one that sat quiet most of the day, I was happy to learn that F@H was still running strong after all these years. Since my homelab <a href="https://gitlab.palpant.us/justin/palpantlab-infra/-/blob/master/README.md?ref=ghost.justin.palpant.us#palpantlab-sfo">includes a single-node Kubernetes cluster</a> running on that PC, I decided to use it to see if I could run F@H. There are <a href="https://foldingathome.org/start-folding/?ref=ghost.justin.palpant.us">much easier ways to install F@H</a> on your system, but I was happy to have another chance to use my build and deploy system.</p><p>This was a complete coincidence (as far as I remember), but I decided to try this on March 5th, about a week after F@H started to work on simulating COVID-19:</p><!--kg-card-begin: html--><div class="row">
	<blockquote class="twitter-tweet tw-align-center" data-theme="dark"><p lang="en" dir="ltr">Help us in the fight against COVID-19! Download the app at: <a href="https://t.co/andJ4PDzVl?ref=ghost.justin.palpant.us">https://t.co/andJ4PDzVl</a> <a href="https://twitter.com/hashtag/Coronavirus?src=hash&#x26;ref_src=twsrc%5Etfw&#x26;ref=ghost.justin.palpant.us">#Coronavirus</a> <a href="https://twitter.com/hashtag/2019nCov?src=hash&#x26;ref_src=twsrc%5Etfw&#x26;ref=ghost.justin.palpant.us">#2019nCov</a> <a href="https://twitter.com/hashtag/COVID19?src=hash&#x26;ref_src=twsrc%5Etfw&#x26;ref=ghost.justin.palpant.us">#COVID19</a> <a href="https://twitter.com/hashtag/SARSCoV2?src=hash&#x26;ref_src=twsrc%5Etfw&#x26;ref=ghost.justin.palpant.us">#SARSCoV2</a> <a href="https://t.co/BSmiV8phh1?ref=ghost.justin.palpant.us">https://t.co/BSmiV8phh1</a></p>— Folding@home (@foldingathome) <a href="https://twitter.com/foldingathome/status/1233150565347016706?ref_src=twsrc%5Etfw&#x26;ref=ghost.justin.palpant.us">February 27, 2020</a></blockquote>
	<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"/>
</div><!--kg-card-end: html--><h3 id="spinning-up">Spinning up</h3><p>The first stage for most of these small projects has 3 parts:</p><ol><li>Decide if I need to build a custom Docker image or if I can reuse an existing one</li><li>Decide what Kubernetes objects I'll need in the deployment</li><li>Bootstrap the git repo, <a href="https://gitlab.palpant.us/justin/folding?ref=ghost.justin.palpant.us">GitLab project</a>, <a href="https://gitlab.palpant.us/justin/folding/-/blob/master/.gitlab-ci.yml?ref=ghost.justin.palpant.us">build YAML</a>, and "extras" (DNS, manual configs, healthchecks).</li></ol><p>I settled on a custom Docker image after checking out a few that were available - most didn't leave me enough control over initialization, or didn't allow enough customization. </p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="dockerfile"><pre class="language-dockerfile"><code class="language-dockerfile"><span class="token instruction"><span class="token keyword">FROM</span> nvidia/opencl:devel-ubuntu18.04</span>

<span class="token instruction"><span class="token keyword">LABEL</span> maintainer=<span class="token string">"justin@palpant.us"</span></span>

<span class="token instruction"><span class="token keyword">ARG</span> FAH_VERSION_MAJOR=7</span>
<span class="token instruction"><span class="token keyword">ARG</span> FAH_VERSION_MINOR=5</span>
<span class="token instruction"><span class="token keyword">ARG</span> FAH_VERSION_PATCH=1</span>

<span class="token instruction"><span class="token keyword">ENV</span> DEBIAN_FRONTEND=noninteractive</span>

<span class="token instruction"><span class="token keyword">RUN</span> apt-get update &#x26;&#x26; apt-get install --no-install-recommends -y <span class="token operator">\</span>
        ca-certificates wget bzip2 dumb-init &#x26;&#x26;<span class="token operator">\</span>
        wget https://download.foldingathome.org/releases/public/release/fahclient/debian-stable-64bit/v<span class="token variable">${FAH_VERSION_MAJOR}</span>.<span class="token variable">${FAH_VERSION_MINOR}</span>/fahclient_<span class="token variable">${FAH_VERSION_MAJOR}</span>.<span class="token variable">${FAH_VERSION_MINOR}</span>.<span class="token variable">${FAH_VERSION_PATCH}</span>_amd64.deb &#x26;&#x26;<span class="token operator">\</span>
        mkdir -p /etc/fahclient/ &#x26;&#x26;<span class="token operator">\</span>
        touch /etc/fahclient/config.xml &#x26;&#x26;<span class="token operator">\</span>
        dpkg --install *.deb &#x26;&#x26;<span class="token operator">\</span>
        apt-get autoremove -y &#x26;&#x26;<span class="token operator">\</span>
        rm --recursive --verbose --force /tmp/* /var/log/* /var/lib/apt/ &#x26;&#x26;<span class="token operator">\</span>
        mkdir /var/opt/folding</span>

<span class="token instruction"><span class="token keyword">WORKDIR</span> /var/opt/folding</span>

<span class="token instruction"><span class="token keyword">COPY</span> init.sh /init.sh</span>

<span class="token instruction"><span class="token keyword">ENTRYPOINT</span> [ <span class="token string">"/init.sh"</span> ]</span></code></pre></div><figcaption>Customized Dockerfile for FAH, inspired by <a href="https://hub.docker.com/r/johnktims/folding-at-home/?ref=ghost.justin.palpant.us">johnktims/folding-at-home</a></figcaption></figure><p>Because Folding@Home's client comes with a built-in web UI for manual configuration, I wanted to serve that UI over an authenticated, public web page. I typically use <a href="https://github.com/pusher/oauth2_proxy?ref=ghost.justin.palpant.us">pusher/oauth2_proxy</a> for this because it supports Sign-in with Google easily, and decided to use it here as well. I also wanted configuration from the web UI as well as work-in-progress to persist across container restarts. In a cloud environment this would mean using a PersistentVolume, but on this one-node cluster, I just use HostPath mounts. Likewise, the one-node cluster uses a lot of NodePort Services, instead of the Ingress resources that I like to use in the cloud.</p><figure class="kg-card kg-code-card"><div class="kg-card kg-code-card gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> apps/v1
<span class="token key atrule">kind</span><span class="token punctuation">:</span> StatefulSet
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
  <span class="token key atrule">name</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
  <span class="token key atrule">namespace</span><span class="token punctuation">:</span> prod
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
  <span class="token key atrule">serviceName</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
  <span class="token key atrule">selector</span><span class="token punctuation">:</span>
    <span class="token key atrule">matchLabels</span><span class="token punctuation">:</span>
      <span class="token key atrule">app</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
  <span class="token key atrule">replicas</span><span class="token punctuation">:</span> <span class="token number">1</span>
  <span class="token key atrule">template</span><span class="token punctuation">:</span>
    <span class="token key atrule">metadata</span><span class="token punctuation">:</span>
      <span class="token key atrule">labels</span><span class="token punctuation">:</span>
        <span class="token key atrule">app</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
    <span class="token key atrule">spec</span><span class="token punctuation">:</span>
      <span class="token key atrule">imagePullSecrets</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> k8s<span class="token punctuation">-</span>gcr<span class="token punctuation">-</span>read<span class="token punctuation">-</span>only
      <span class="token key atrule">containers</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">args</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>provider=google
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>google<span class="token punctuation">-</span>admin<span class="token punctuation">-</span>email=lab@palpant.us
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>google<span class="token punctuation">-</span>group=folding<span class="token punctuation">-</span>access@palpant.us
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>email<span class="token punctuation">-</span>domain=*
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>google<span class="token punctuation">-</span>service<span class="token punctuation">-</span>account<span class="token punctuation">-</span>json=/sa/palpantlab<span class="token punctuation">-</span>main<span class="token punctuation">-</span>40b0f9caae61.json
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>upstream=http<span class="token punctuation">:</span>//localhost<span class="token punctuation">:</span><span class="token number">7396</span>
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>http<span class="token punctuation">-</span>address=0.0.0.0<span class="token punctuation">:</span><span class="token number">4180</span>
        <span class="token punctuation">-</span> <span class="token punctuation">-</span>redirect<span class="token punctuation">-</span>url=https<span class="token punctuation">:</span>//folding.palpant.us/oauth2/callback
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> OAUTH2_PROXY_CLIENT_ID
          <span class="token key atrule">valueFrom</span><span class="token punctuation">:</span>
            <span class="token key atrule">secretKeyRef</span><span class="token punctuation">:</span>
              <span class="token key atrule">name</span><span class="token punctuation">:</span> transmission<span class="token punctuation">-</span>oauth2<span class="token punctuation">-</span>proxy<span class="token punctuation">-</span>account
              <span class="token key atrule">key</span><span class="token punctuation">:</span> client_id
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> OAUTH2_PROXY_CLIENT_SECRET
          <span class="token key atrule">valueFrom</span><span class="token punctuation">:</span>
            <span class="token key atrule">secretKeyRef</span><span class="token punctuation">:</span>
              <span class="token key atrule">name</span><span class="token punctuation">:</span> transmission<span class="token punctuation">-</span>oauth2<span class="token punctuation">-</span>proxy<span class="token punctuation">-</span>account
              <span class="token key atrule">key</span><span class="token punctuation">:</span> client_secret
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> OAUTH2_PROXY_COOKIE_SECRET
          <span class="token key atrule">valueFrom</span><span class="token punctuation">:</span>
            <span class="token key atrule">secretKeyRef</span><span class="token punctuation">:</span>
              <span class="token key atrule">name</span><span class="token punctuation">:</span> transmission<span class="token punctuation">-</span>oauth2<span class="token punctuation">-</span>proxy<span class="token punctuation">-</span>account
              <span class="token key atrule">key</span><span class="token punctuation">:</span> cookie_secret
        <span class="token key atrule">volumeMounts</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> transmission<span class="token punctuation">-</span>ui<span class="token punctuation">-</span>sa
          <span class="token key atrule">mountPath</span><span class="token punctuation">:</span> /sa
        <span class="token key atrule">image</span><span class="token punctuation">:</span> quay.io/pusher/oauth2_proxy<span class="token punctuation">:</span>v5.0.0
        <span class="token key atrule">imagePullPolicy</span><span class="token punctuation">:</span> Always
        <span class="token key atrule">livenessProbe</span><span class="token punctuation">:</span>
          <span class="token key atrule">httpGet</span><span class="token punctuation">:</span>
            <span class="token key atrule">scheme</span><span class="token punctuation">:</span> HTTP
            <span class="token key atrule">path</span><span class="token punctuation">:</span> /ping
            <span class="token key atrule">port</span><span class="token punctuation">:</span> web
          <span class="token key atrule">initialDelaySeconds</span><span class="token punctuation">:</span> <span class="token number">30</span>
          <span class="token key atrule">timeoutSeconds</span><span class="token punctuation">:</span> <span class="token number">30</span>
        <span class="token key atrule">name</span><span class="token punctuation">:</span> oauth2<span class="token punctuation">-</span>proxy
        <span class="token key atrule">ports</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">containerPort</span><span class="token punctuation">:</span> <span class="token number">4180</span>
          <span class="token key atrule">name</span><span class="token punctuation">:</span> web
      <span class="token punctuation">-</span> <span class="token key atrule">image</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span>CONTAINER_REGISTRY<span class="token punctuation">}</span><span class="token punctuation">:</span>$<span class="token punctuation">{</span>CI_COMMIT_SHA<span class="token punctuation">}</span>
        <span class="token key atrule">name</span><span class="token punctuation">:</span> folding
        <span class="token key atrule">imagePullPolicy</span><span class="token punctuation">:</span> Always
        <span class="token key atrule">ports</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> http
          <span class="token key atrule">containerPort</span><span class="token punctuation">:</span> <span class="token number">7396</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> command
          <span class="token key atrule">containerPort</span><span class="token punctuation">:</span> <span class="token number">36330</span>
        <span class="token key atrule">livenessProbe</span><span class="token punctuation">:</span>
          <span class="token key atrule">httpGet</span><span class="token punctuation">:</span>
            <span class="token key atrule">scheme</span><span class="token punctuation">:</span> HTTP
            <span class="token key atrule">path</span><span class="token punctuation">:</span> /
            <span class="token key atrule">port</span><span class="token punctuation">:</span> http
        <span class="token key atrule">args</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>web<span class="token punctuation">-</span>allow=0/0
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>allow=0/0
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>cpu<span class="token punctuation">-</span>usage=35
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>session<span class="token punctuation">-</span>lifetime=0
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>session<span class="token punctuation">-</span>timeout=0
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>command<span class="token punctuation">-</span>enable=true
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>command<span class="token punctuation">-</span>address=0.0.0.0
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>command<span class="token punctuation">-</span>allow<span class="token punctuation">-</span>no<span class="token punctuation">-</span>pass=0/0
        <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>command<span class="token punctuation">-</span>port=36330
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> NVIDIA_VISIBLE_DEVICES
          <span class="token key atrule">value</span><span class="token punctuation">:</span> <span class="token string">"all"</span>
        <span class="token key atrule">volumeMounts</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home<span class="token punctuation">-</span>data
          <span class="token key atrule">mountPath</span><span class="token punctuation">:</span> /var/opt/folding
      <span class="token key atrule">volumes</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home<span class="token punctuation">-</span>data
        <span class="token key atrule">hostPath</span><span class="token punctuation">:</span>
          <span class="token comment"># directory location on host</span>
          <span class="token key atrule">path</span><span class="token punctuation">:</span> /data/folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
          <span class="token comment"># this field is optional</span>
          <span class="token key atrule">type</span><span class="token punctuation">:</span> Directory
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> transmission<span class="token punctuation">-</span>ui<span class="token punctuation">-</span>sa
        <span class="token key atrule">secret</span><span class="token punctuation">:</span>
          <span class="token key atrule">secretName</span><span class="token punctuation">:</span> transmission<span class="token punctuation">-</span>ui<span class="token punctuation">-</span>sa
<span class="token punctuation">---</span>
<span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> v1
<span class="token key atrule">kind</span><span class="token punctuation">:</span> Service
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
  <span class="token key atrule">name</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
  <span class="token key atrule">namespace</span><span class="token punctuation">:</span> prod
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
  <span class="token key atrule">type</span><span class="token punctuation">:</span> NodePort
  <span class="token key atrule">ports</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">targetPort</span><span class="token punctuation">:</span> web
    <span class="token key atrule">name</span><span class="token punctuation">:</span> web
    <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">9092</span>
  <span class="token punctuation">-</span> <span class="token key atrule">targetPort</span><span class="token punctuation">:</span> command
    <span class="token key atrule">name</span><span class="token punctuation">:</span> command
    <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">9093</span>
  <span class="token key atrule">selector</span><span class="token punctuation">:</span>
    <span class="token key atrule">app</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
<span class="token punctuation">---</span>
<span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> networking.k8s.io/v1
<span class="token key atrule">kind</span><span class="token punctuation">:</span> NetworkPolicy
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
  <span class="token key atrule">name</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home<span class="token punctuation">-</span>ui<span class="token punctuation">-</span>restrict
  <span class="token key atrule">namespace</span><span class="token punctuation">:</span> prod
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
  <span class="token key atrule">podSelector</span><span class="token punctuation">:</span>
    <span class="token key atrule">matchLabels</span><span class="token punctuation">:</span>
      <span class="token key atrule">app</span><span class="token punctuation">:</span> folding<span class="token punctuation">-</span>at<span class="token punctuation">-</span>home
  <span class="token key atrule">policyTypes</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> Ingress
  <span class="token key atrule">ingress</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">from</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">ipBlock</span><span class="token punctuation">:</span>
        <span class="token key atrule">cidr</span><span class="token punctuation">:</span> 192.168.0.10/32
    <span class="token key atrule">ports</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">protocol</span><span class="token punctuation">:</span> TCP
      <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">4180</span>
  <span class="token punctuation">-</span> <span class="token key atrule">from</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">ipBlock</span><span class="token punctuation">:</span>
        <span class="token key atrule">cidr</span><span class="token punctuation">:</span> 192.168.0.0/28
    <span class="token punctuation">-</span> <span class="token key atrule">ipBlock</span><span class="token punctuation">:</span>
        <span class="token key atrule">cidr</span><span class="token punctuation">:</span> 127.0.0.0/28
    <span class="token key atrule">ports</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">protocol</span><span class="token punctuation">:</span> TCP
      <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">36330</span>
</code></pre></div><figcaption>StatefulSet, Service, and NetworkPolicy for FAH on my PC</figcaption></figure><p>A bit of finagling was needed to get this running - managing Secrets and tweaking the Dockerfile - but after a dozen commits or so, I was up and folding at <a href="https://folding.palpant.us/?ref=ghost.justin.palpant.us">folding.palpant.us</a> (private, access-controlled).</p><h3 id="-glass-shattering-"><em>*Glass shattering*</em></h3><p>Everything went swimmingly for a couple of days, churning out a few million Points of folding research and keeping my system comfortably warm with high GPU utilization, when suddenly GPU <a href="https://gitlab.palpant.us/grafana/dashboard/snapshot/hNN4RpA9QKxkzmuYcxuVAOSMIU1vH2mG?ref=ghost.justin.palpant.us">utilization went way down</a>.</p><!--kg-card-begin: html--><iframe title="GPU Utilization Graph" src="https://gitlab.palpant.us/-/grafana/dashboard-solo/snapshot/hNN4RpA9QKxkzmuYcxuVAOSMIU1vH2mG?orgId=1&#x26;from=1583760530497&#x26;to=1584037526495&#x26;var-cluster=palpantlab-prometheus-sfo&#x26;var-hostname=ubuntu-node-01&#x26;var-node=All&#x26;var-maxmount=%2Fhome&#x26;var-env=&#x26;var-name=&#x26;panelId=179" style="width:100%" height="400px" frameborder="0"/><!--kg-card-end: html--><p>The container's logs immediately showed some very obvious failure messages - the GPU core was failing its assigned work units on startup with a cryptic note:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">00:51:18:WU01:FS01:Starting
00:51:18:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/opt/folding/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version <span class="token number">705</span> -lifeline <span class="token number">8</span> -checkpoint <span class="token number">15</span> -gpu-vendor nvidia -opencl-platform <span class="token number">0</span> -opencl-device <span class="token number">0</span> -cuda-device <span class="token number">0</span> -gpu <span class="token number">0</span>
00:51:18:WU01:FS01:Started FahCore on PID <span class="token number">70</span>
00:51:18:WU01:FS01:Core PID:74
00:51:18:WU01:FS01:FahCore 0x22 started

<span class="token punctuation">..</span>.

00:51:18:WU01:FS01:0x22:Project: <span class="token number">11741</span> <span class="token punctuation">(</span>Run <span class="token number">0</span>, Clone <span class="token number">2360</span>, Gen <span class="token number">1</span><span class="token punctuation">)</span>
00:51:18:WU01:FS01:0x22:Unit: 0x000000018ca304f15e67d8cb67bdf2b9
00:51:18:WU01:FS01:0x22:Reading <span class="token function">tar</span> <span class="token function">file</span> core.xml
00:51:18:WU01:FS01:0x22:Reading <span class="token function">tar</span> <span class="token function">file</span> integrator.xml
00:51:18:WU01:FS01:0x22:Reading <span class="token function">tar</span> <span class="token function">file</span> state.xml
00:51:18:WU01:FS01:0x22:Reading <span class="token function">tar</span> <span class="token function">file</span> system.xml
00:51:19:WU01:FS01:0x22:Digital signatures verified
00:51:19:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
00:51:19:WU01:FS01:0x22:Version <span class="token number">0.0</span>.2
00:51:21:WU01:FS01:FahCore returned: INTERRUPTED <span class="token punctuation">(</span><span class="token number">102</span> <span class="token operator">=</span> 0x66<span class="token punctuation">)</span></code></pre></div><p>I checked whether any of the usual suspects were the problem: pod crashes, issues with NVIDIA drivers or OpenCL, <a href="https://lmgtfy.com/?q=INTERRUPTED+(102+%3D+0x66)&#x26;ref=ghost.justin.palpant.us">googling obviously Googleable error messages</a>. Most solutions pointed at a problem with the specific work unit, or a retriable failure, so I deleted the pod, deleted the data in the data directory, deleted everything I could think of to reset - without success.</p><p>And so I ended up where every good software project ends up - trawling the forums and finally <a href="https://foldingforum.org/viewtopic.php?f=74&#x26;t=32073&#x26;start=60&#x26;ref=ghost.justin.palpant.us#p312093">asking for help</a>. The good people on foldingforum.org engaged quickly and tried to help, but I went in knowing that my setup is fairly uncommon and, to be frank, I didn't think it would be worth their effort to get my one oddly configured RTX 2080 online when there was so much other work to do.</p><p>I was able to isolate the issue to my Kubernetes setup by running FAHClient manually on the desktop and then running the Docker image I was using directly with <code class="language-text">docker run</code>. 
But even with that narrow scope, I wasn't able to figure out the problem.</p><h3 id="you-see-but-do-you-observe">You see, but do you observe?</h3><p>The next day, it occurred to me to check my cluster resource usage Grafana dashboard - specifically, <a href="https://gitlab.palpant.us/grafana/dashboard/snapshot/pE5puXjQnq2hh00XZzT770PAeTd38wZn?ref=ghost.justin.palpant.us">the memory usage chart for this Pod</a>.</p><!--kg-card-begin: html--><iframe title="Folding At Home Memory Usage Graph" src="https://gitlab.palpant.us/-/grafana/dashboard-solo/snapshot/pE5puXjQnq2hh00XZzT770PAeTd38wZn?orgId=1&#x26;from=1584029877222&#x26;to=1584049165425&#x26;var-Node=All&#x26;var-namespace=All&#x26;var-pod=folding-at-home-0&#x26;var-cluster=palpantlab-prometheus-sfo&#x26;panelId=25" style="width:100%" height="400px" frameborder="0"/><!--kg-card-end: html--><p>That pattern was immediately familiar, and telling - a container hitting the 500Mi memory limit I had assigned, then crashing. But typically the pod would have been killed with termination reason OOMKilled, and my Alertmanager setup would have notified me with a PodFrequentlyRestarting alert. <em>Was one process being killed and not the pod? Is that possible?</em></p><p>Yes.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/kubernetes/kubernetes/issues/50632?ref=ghost.justin.palpant.us"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Container with multiple processes not terminated when OOM · Issue #50632 · kubernetes/kubernetes</div><div class="kg-bookmark-description">/kind bug What happened: A pod container reached its memory limit. Then the oom-killer killed only one process within the container. 
This container has a uwsgi python server which gave this error i...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="Folding@Home on Kubernetes"/><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">kubernetes</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://avatars1.githubusercontent.com/u/13629408?s=400&#x26;v=4" alt="Folding@Home on Kubernetes"/></div></a></figure><p>That issue pointed me to the <code class="language-text">node_vmstat_oom_kill</code> metric that is exposed if you run <a href="https://github.com/prometheus/node_exporter?ref=ghost.justin.palpant.us">node_exporter</a>, and sure enough, something on my server <a href="https://gitlab.palpant.us/grafana/dashboard/snapshot/lIjyYQnNWGqD5gmFnrNhyexjNhgpm1Pd?ref=ghost.justin.palpant.us">was being OOM killed</a> approximately once per minute.</p><!--kg-card-begin: html--><iframe title="Out-of-memory kill rate graph" src="https://gitlab.palpant.us/-/grafana/dashboard-solo/snapshot/lIjyYQnNWGqD5gmFnrNhyexjNhgpm1Pd?orgId=1&#x26;from=1583752659462&#x26;to=1584078980759&#x26;var-cluster=palpantlab-prometheus-sfo&#x26;var-hostname=ubuntu-node-01&#x26;var-node=All&#x26;var-maxmount=%2Fhome&#x26;var-env=&#x26;var-name=&#x26;panelId=181" style="width:100%" height="400px" frameborder="0"/><!--kg-card-end: html--><h3 id="clean-up">Clean up</h3><p>From this point, the resolution was simple - <a href="https://gitlab.palpant.us/justin/folding/-/commit/5c00a8415015105f5967e97bf7cba510503fdeaf?ref=ghost.justin.palpant.us">remove the offending memory limit</a> and <a href="https://gitlab.palpant.us/justin/folding/pipelines/866?ref=ghost.justin.palpant.us">deploy the change</a>.</p><p>Immediately after the work unit started up, I could see <a href="https://gitlab.palpant.us/grafana/dashboard/snapshot/J5nadbBuuFZJIUuXXJJn1UrvzcW5UsTH?ref=ghost.justin.palpant.us">memory usage</a> 
spike well beyond where the limit had been.</p><!--kg-card-begin: html--><iframe title="Memory usage spike graph" src="https://gitlab.palpant.us/-/grafana/dashboard-solo/snapshot/J5nadbBuuFZJIUuXXJJn1UrvzcW5UsTH?orgId=1&#x26;from=1584012717033&#x26;to=1584140948707&#x26;var-Node=All&#x26;var-namespace=All&#x26;var-pod=folding-at-home-0&#x26;var-cluster=palpantlab-prometheus-sfo&#x26;panelId=25" style="width:100%" height="400px" frameborder="0"/><!--kg-card-end: html--><p>Shortly after that, when I tried to read <code class="language-text">node_vmstat_oom_kill</code> for the nodes in <a href="https://gitlab.palpant.us/justin/palpantlab-infra/-/blob/master/README.md?ref=ghost.justin.palpant.us#palpantlab-gke-west1-b-01">my GKE cluster</a>, I realized that <code class="language-text">node_exporter</code> wasn't running on those nodes, so I <a href="https://gitlab.palpant.us/justin/palpantlab-gitlab/-/commit/2140dbd327190fa2dfb229f2a97bef04430d3ac7?ref=ghost.justin.palpant.us">enabled it</a>.</p><h3 id="takeaways">Takeaways</h3><p>First, observability is important - exit codes and log messages are helpful, but cluster-level monitoring can give detail about a misbehaving process even if you know <em>nothing</em> about that process. FAHClient isn't open source (<a href="https://github.com/FoldingAtHome/fah-web-client/issues/6?ref=ghost.justin.palpant.us#issuecomment-597474010">yet!</a>), so my ability to dive deeper was limited, and knowing what I know now, the INTERRUPTED message and the corresponding code are likely very deep in the stack. 
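</p><p>For illustration - the snapshots above come from my own dashboards, whose exact expressions aren't shown here - the kind of PromQL that surfaces this is a simple function over the vmstat counter, assuming node_exporter's vmstat collector is enabled:</p><div class="kg-card kg-code-card gatsby-highlight" data-language="promql"><pre class="language-promql"><code class="language-promql"># OOM kills per node over the trailing hour, from the /proc/vmstat counter
increase(node_vmstat_oom_kill[1h])

# or, as an alert-friendly rate: kills per second averaged over 5 minutes
rate(node_vmstat_oom_kill[5m])</code></pre></div><p>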
Prometheus and Grafana made that irrelevant.</p><p>Second - <a href="https://foldingathome.org/?ref=ghost.justin.palpant.us">get folding</a>!</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="480" height="270" src="https://www.youtube.com/embed/RGGzMQ2oFrA?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""/><figcaption>F@H promo video by Bowman lab</figcaption></figure><p><em>Title image credit: Alissa Eckert, MS; Dan Higgins, MAM available at <a href="https://phil.cdc.gov/Details.aspx?pid=23311&#x26;ref=ghost.justin.palpant.us">https://phil.cdc.gov/Details.aspx?pid=23311</a></em></p>]]></content:encoded></item></channel></rss>