Update: I now publish jpalpant/fah-client, a thin, GPU-supporting wrapper around FAHClient for anyone who needs one until an official one is released, as well as jpalpant/folding-exporter, a Prometheus exporter for tracking F@H PPD.
Folding@Home (F@H or FAH) is an incredible distributed computing project that lets individuals sign up to donate extra compute resources to researchers solving problems that would otherwise only be accessible to those with the most powerful of supercomputers. It uses those resources by assigning each computer small pieces of incredibly complex molecular simulations, and then assembling the results when each computer is finished*. With the power of hundreds of thousands of users, F@H rivals the fastest supercomputers in the world today.
Having recently acquired a new GPU for my PC, one that sat mostly idle for much of the day, I was happy to learn that F@H was still running strong after all these years. Since my homelab includes a single-node Kubernetes cluster running on that PC, I decided to see if I could run F@H on it. There are much easier ways to install F@H on your system, but I was happy to have another chance to use my build and deploy system.
This was a complete coincidence (as far as I remember), but I decided to try this on March 5th, about a week after F@H started to work on simulating COVID-19:
Help us in the fight against COVID-19! Download the app at: https://t.co/andJ4PDzVl #Coronavirus #2019nCov #COVID19 #SARSCoV2 https://t.co/BSmiV8phh1
— Folding@home (@foldingathome) February 27, 2020
Spinning up
The first stage for most of these small projects has 3 parts:
- Decide if I need to build a custom Docker image or if I can reuse an existing one
- Decide what Kubernetes objects I'll need in the deployment
- Bootstrap the git repo, GitLab project, build YAML (a sketch of which follows below), and "extras" (DNS, manual configs, healthchecks).
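For what it's worth, the "build YAML" here is a GitLab CI pipeline. The sketch below is a hypothetical, minimal version rather than the actual pipeline for this project - the stage layout, images, and manifest path are placeholder assumptions.

```yaml
# .gitlab-ci.yml - hypothetical sketch of a build-and-deploy pipeline for a
# project like this; images, tags, and paths are placeholders.
stages:
  - build
  - deploy

build-image:
  stage: build
  image: docker:19.03
  services:
    - docker:19.03-dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    # Assumes the runner has cluster credentials (e.g. a KUBECONFIG CI variable)
    # and that the Kubernetes manifests live under deploy/.
    - kubectl apply -f deploy/
```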
I settled on a custom Docker image after checking out a few that were available - most didn't leave me enough control over initialization, or didn't allow enough customization.
Because Folding@Home's client comes with a built-in web UI for manual configuration, I wanted to serve that UI on an authenticated, public web page. I typically use pusher/oauth2_proxy for this because it makes Sign-in with Google easy, and decided to use it here as well. I also wanted the configuration set through the web UI, as well as in-progress work, to persist across container restarts. In a cloud environment this would mean using a PersistentVolume, but on this one-node cluster I just use hostPath mounts. Likewise, the one-node cluster uses a lot of NodePort Services instead of the Ingress resources I like to use in the cloud.
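To make that concrete, here is a minimal sketch of how those pieces fit together: a FAHClient container with a GPU (and a 500Mi memory limit that becomes important later), an oauth2_proxy sidecar in front of the web UI, a hostPath volume, and a NodePort Service. The image names, ports, and paths are illustrative assumptions, not my exact manifests.

```yaml
# Hypothetical sketch - image names, ports, and host paths are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: folding
spec:
  replicas: 1
  selector:
    matchLabels:
      app: folding
  template:
    metadata:
      labels:
        app: folding
    spec:
      containers:
        - name: fahclient
          image: jpalpant/fah-client:latest    # placeholder image name
          resources:
            limits:
              nvidia.com/gpu: 1                # schedule onto the GPU
              memory: 500Mi                    # this limit matters later
          volumeMounts:
            - name: fah-data
              mountPath: /var/opt/folding      # config + work units persist here
        - name: oauth2-proxy
          image: quay.io/pusher/oauth2_proxy:latest   # registry path is an assumption
          args:
            - --provider=google
            - --http-address=0.0.0.0:4180
            - --upstream=http://127.0.0.1:7396        # FAHClient's web UI
            # client ID, client secret, and cookie secret come from a Secret
          ports:
            - containerPort: 4180
      volumes:
        - name: fah-data
          hostPath:
            path: /data/folding                # hostPath instead of a PersistentVolume
---
apiVersion: v1
kind: Service
metadata:
  name: folding
spec:
  type: NodePort                               # NodePort instead of an Ingress
  selector:
    app: folding
  ports:
    - port: 4180
      targetPort: 4180
```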
A bit of finagling was needed to get this running, managing Secrets and tweaking the Dockerfile, but after a dozen commits or so, I was up and folding at folding.palpant.us (private, access-controlled).
*Glass shattering*
Everything went swimmingly for a couple of days, churning out a few million Points of folding research and keeping my system comfortably warm at high GPU utilization - until suddenly GPU utilization dropped way down.
I could see some very obvious failure messages in the container's logs immediately - the GPU core was failing its assigned work units on startup with a cryptic note:
00:51:18:WU01:FS01:Starting
00:51:18:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/opt/folding/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 705 -lifeline 8 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
00:51:18:WU01:FS01:Started FahCore on PID 70
00:51:18:WU01:FS01:Core PID:74
00:51:18:WU01:FS01:FahCore 0x22 started
...
00:51:18:WU01:FS01:0x22:Project: 11741 (Run 0, Clone 2360, Gen 1)
00:51:18:WU01:FS01:0x22:Unit: 0x000000018ca304f15e67d8cb67bdf2b9
00:51:18:WU01:FS01:0x22:Reading tar file core.xml
00:51:18:WU01:FS01:0x22:Reading tar file integrator.xml
00:51:18:WU01:FS01:0x22:Reading tar file state.xml
00:51:18:WU01:FS01:0x22:Reading tar file system.xml
00:51:19:WU01:FS01:0x22:Digital signatures verified
00:51:19:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
00:51:19:WU01:FS01:0x22:Version 0.0.2
00:51:21:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
I checked the usual suspects: pod crashes, issues with NVIDIA drivers or OpenCL, and the obviously Googleable error messages. Most answers pointed at problems with the specific work unit or a retriable failure, so I deleted the pod, deleted the contents of the data directory, deleted everything I could think to reset - without success.
And so I ended up where every good software project ends up - trawling the forums and finally asking for help. The good people on foldingforum.org engaged quickly to try to help, but I went in knowing that my setup is fairly uncommon, and, to be frank, I didn't think it was worth their time to get my one oddly configured RTX 2080 online when there was so much other work to do.
I was able to isolate the issue to my Kubernetes setup, specifically by running FAHClient manually on the desktop and then also running the Docker image I was using directly with docker run. But even with that narrow scope, I wasn't able to figure out the problem.
You see, but do you observe?
The next day, I happened to think to check my cluster resource usage Grafana dashboard - specifically, the memory usage chart for this Pod.
That pattern was immediately familiar, and telling - a container hitting the 500Mi memory limit I had assigned, then crashing. But typically the pod would be killed with a termination reason of OOMKilled, and my Alertmanager setup would notify me with a PodFrequentlyRestarting alert. Was a single process being killed, and not the pod? Is that possible?
Yes.
That issue pointed me to look at the node_vmstat_oom_kill metric that is exposed if you run node_exporter, and sure enough, something on my server was being OOM killed approximately once per minute.
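This is also an easy thing to alert on. Below is a hedged sketch of a Prometheus alerting rule for it - the group name, window, and labels are placeholders, and it assumes node_exporter is exposing node_vmstat_oom_kill (which it does on reasonably recent kernels).

```yaml
# Hypothetical alerting rule - names and thresholds are placeholders.
groups:
  - name: oom
    rules:
      - alert: HostOOMKillDetected
        # node_vmstat_oom_kill is a counter, so look at its rate of increase.
        expr: increase(node_vmstat_oom_kill[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "OOM kills detected on {{ $labels.instance }}"
```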
Clean up
From this point, the resolution was simple - remove the offending memory limit, deploy the change.
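In manifest terms, the change amounts to deleting one line from the container spec sketched earlier. This fragment follows the hypothetical names used above rather than my exact manifest:

```yaml
# Container spec fragment after the fix - the 500Mi memory limit is gone
# (shown commented out for contrast); only the GPU limit remains.
- name: fahclient
  image: jpalpant/fah-client:latest
  resources:
    limits:
      nvidia.com/gpu: 1
      # memory: 500Mi   # removed - FahCore_22's real memory usage blows past it
```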
Immediately after the work unit started up, I could see memory usage spike well beyond where the limit had been.
Shortly after that, when I tried to read node_vmstat_oom_kill for the nodes in my GKE cluster, I realized that node_exporter wasn't running on those nodes, so I enabled it.
Takeaways
First, observability is important - exit codes and log messages are helpful, but cluster-level monitoring can give detail when a specific process is misbehaving, even if you know nothing about that process. FAHClient isn't open source (yet!), so my ability to dive deeper was limited, and knowing what I know now, this INTERRUPTED message and the corresponding code are likely very deep in the stack. Prometheus and Grafana made that irrelevant.
Second - get folding!
Title image credit: Alissa Eckert, MS; Dan Higgins, MAM available at https://phil.cdc.gov/Details.aspx?pid=23311