Folding@Home on Kubernetes

Update: I now publish jpalpant/fah-client if you are looking for a thin, GPU-supporting wrapper around FAHClient until an official one is released, as well as jpalpant/folding-exporter, a Prometheus exporter for tracking F@H PPD.

Folding@Home (F@H or FAH) is an incredible distributed computing project that lets individuals donate spare compute resources to researchers working on problems that would otherwise be accessible only to those with the most powerful supercomputers. It assigns each computer a small piece of an incredibly complex molecular simulation, then assembles the results as each computer finishes. With the power of hundreds of thousands of users, F@H rivals the fastest supercomputers in the world today.

Having recently acquired a new GPU for my PC that was sitting very quiet most of the day, I was happy to learn that F@H was still running strong after all these years. Since my homelab includes a single-node Kubernetes cluster running on that PC, I decided to use it to see if I could run F@H. There are much easier ways to install F@H on your system, but I was happy to have another chance to use my build and deploy system.

This was a complete coincidence (as far as I remember), but I decided to try this on March 5th, about a week after F@H started to work on simulating COVID-19:

Help us in the fight against COVID-19! Download the app at: https://t.co/andJ4PDzVl #Coronavirus #2019nCov #COVID19 #SARSCoV2 https://t.co/BSmiV8phh1
— Folding@home (@foldingathome) February 27, 2020

Spinning up

The first stage for most of these small projects has 3 parts:

1. Decide if I need to build a custom Docker image or if I can reuse an existing one.
2. Decide what Kubernetes objects I'll need in the deployment.
3. Bootstrap the git repo, GitLab project, build YAML, and "extras" (DNS, manual configs, healthchecks).

I settled on a custom Docker image after checking out a few that were available: most either didn't leave me enough control over initialization or didn't allow enough customization.

FROM nvidia/opencl:devel-ubuntu18.04

LABEL maintainer="justin@palpant.us"

ARG FAH_VERSION_MAJOR=7
ARG FAH_VERSION_MINOR=5
ARG FAH_VERSION_PATCH=1

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install --no-install-recommends -y \
        ca-certificates wget bzip2 dumb-init &&\
        wget https://download.foldingathome.org/releases/public/release/fahclient/debian-stable-64bit/v${FAH_VERSION_MAJOR}.${FAH_VERSION_MINOR}/fahclient_${FAH_VERSION_MAJOR}.${FAH_VERSION_MINOR}.${FAH_VERSION_PATCH}_amd64.deb &&\
        mkdir -p /etc/fahclient/ &&\
        touch /etc/fahclient/config.xml &&\
        dpkg --install *.deb &&\
        apt-get autoremove -y &&\
        rm --recursive --verbose --force /tmp/* /var/log/* /var/lib/apt/ &&\
        mkdir /var/opt/folding

WORKDIR /var/opt/folding

COPY init.sh /init.sh

ENTRYPOINT [ "/init.sh" ]

Customized Dockerfile for FAH, inspired by johnktims/folding-at-home

Because Folding@Home's client comes with a built-in web UI for manual configuration, I wanted to serve that UI over an authenticated, public web page. I typically use pusher/oauth2_proxy for this because it supports Sign-in with Google easily, and decided to use it here as well. I also wanted configuration from the web UI as well as work-in-progress to persist across container restarts. In a cloud environment this would mean using a PersistentVolume, but on this one-node cluster, I just use HostPath mounts. Likewise, the one-node cluster uses a lot of NodePort Services, instead of the Ingress resources that I like to use in the cloud.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: folding-at-home
  namespace: prod
spec:
  serviceName: folding-at-home
  selector:
    matchLabels:
      app: folding-at-home
  replicas: 1
  template:
    metadata:
      labels:
        app: folding-at-home
    spec:
      imagePullSecrets:
        - name: k8s-gcr-read-only
      containers:
      - args:
        - -provider=google
        - -google-admin-email=lab@palpant.us
        - -google-group=folding-access@palpant.us
        - -email-domain=*
        - -google-service-account-json=/sa/palpantlab-main-40b0f9caae61.json
        - -upstream=http://localhost:7396
        - -http-address=0.0.0.0:4180
        - -redirect-url=https://folding.palpant.us/oauth2/callback
        env:
        - name: OAUTH2_PROXY_CLIENT_ID
          valueFrom:
            secretKeyRef:
              name: transmission-oauth2-proxy-account
              key: client_id
        - name: OAUTH2_PROXY_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: transmission-oauth2-proxy-account
              key: client_secret
        - name: OAUTH2_PROXY_COOKIE_SECRET
          valueFrom:
            secretKeyRef:
              name: transmission-oauth2-proxy-account
              key: cookie_secret
        volumeMounts:
        - name: transmission-ui-sa
          mountPath: /sa
        image: quay.io/pusher/oauth2_proxy:v5.0.0
        imagePullPolicy: Always
        livenessProbe:
          httpGet:
            scheme: HTTP
            path: /ping
            port: web
          initialDelaySeconds: 30
          timeoutSeconds: 30
        name: oauth2-proxy
        ports:
        - containerPort: 4180
          name: web
      - image: ${CONTAINER_REGISTRY}:${CI_COMMIT_SHA}
        name: folding
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 7396
        - name: command
          containerPort: 36330
        livenessProbe:
          httpGet:
            scheme: HTTP
            path: /
            port: http
        args:
        - --web-allow=0/0
        - --allow=0/0
        - --cpu-usage=35
        - --session-lifetime=0
        - --session-timeout=0
        - --command-enable=true
        - --command-address=0.0.0.0
        - --command-allow-no-pass=0/0
        - --command-port=36330
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        volumeMounts:
        - name: folding-at-home-data
          mountPath: /var/opt/folding
      volumes:
      - name: folding-at-home-data
        hostPath:
          # directory location on host
          path: /data/folding-at-home
          # this field is optional
          type: Directory
      - name: transmission-ui-sa
        secret:
          secretName: transmission-ui-sa
---
apiVersion: v1
kind: Service
metadata:
  name: folding-at-home
  namespace: prod
spec:
  type: NodePort
  ports:
  - targetPort: web
    name: web
    port: 9092
  - targetPort: command
    name: command
    port: 9093
  selector:
    app: folding-at-home
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: folding-at-home-ui-restrict
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: folding-at-home
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.0.10/32
    ports:
    - protocol: TCP
      port: 4180
  - from:
    - ipBlock:
        cidr: 192.168.0.0/28
    - ipBlock:
        cidr: 127.0.0.1/28
    ports:
    - protocol: TCP
      port: 36330

StatefulSet, Service, and NetworkPolicy for FAH on my PC
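To sanity-check which clients the NetworkPolicy above admits, the CIDR matching can be modelled with Python's standard ipaddress module. This is a minimal sketch of the matching only, not of how a CNI plugin enforces the policy. The ALLOWED table and the is_allowed helper are hypothetical names, and the blocks are copied from the manifest.

```python
import ipaddress

# CIDR blocks copied from the NetworkPolicy, keyed by the pod port they guard.
ALLOWED = {
    4180: ["192.168.0.10/32"],
    36330: ["192.168.0.0/28", "127.0.0.1/28"],
}


def is_allowed(client_ip: str, port: int) -> bool:
    """Return True if client_ip falls inside any block listed for port."""
    addr = ipaddress.ip_address(client_ip)
    for cidr in ALLOWED.get(port, []):
        # strict=False accepts a prefix with host bits set, such as
        # 127.0.0.1/28, and normalizes it to 127.0.0.0/28.
        if addr in ipaddress.ip_network(cidr, strict=False):
            return True
    return False
```

Under this model, 192.168.0.0/28 covers 192.168.0.0 through 192.168.0.15, so is_allowed("192.168.0.15", 36330) is true while is_allowed("192.168.0.16", 36330) is not.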

A bit of finagling was needed to get this running, managing Secrets and tweaking the Dockerfile, but after a dozen commits or so, I was up and folding at folding.palpant.us (private, access-controlled).

Glass shattering

Everything went swimmingly for a couple of days: the system churned out a few million Points of folding research and stayed comfortably warm, with high GPU utilization. Then, suddenly, GPU utilization dropped way down.

The container's logs immediately showed some very obvious failure messages. On startup, the GPU core was failing the work units it was assigned with a cryptic note:

00:51:18:WU01:FS01:Starting
00:51:18:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/opt/folding/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 705 -lifeline 8 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
00:51:18:WU01:FS01:Started FahCore on PID 70
00:51:18:WU01:FS01:Core PID:74
00:51:18:WU01:FS01:FahCore 0x22 started

...

00:51:18:WU01:FS01:0x22:Project: 11741 (Run 0, Clone 2360, Gen 1)
00:51:18:WU01:FS01:0x22:Unit: 0x000000018ca304f15e67d8cb67bdf2b9
00:51:18:WU01:FS01:0x22:Reading tar file core.xml
00:51:18:WU01:FS01:0x22:Reading tar file integrator.xml
00:51:18:WU01:FS01:0x22:Reading tar file state.xml
00:51:18:WU01:FS01:0x22:Reading tar file system.xml
00:51:19:WU01:FS01:0x22:Digital signatures verified
00:51:19:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
00:51:19:WU01:FS01:0x22:Version 0.0.2
00:51:21:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)

I checked the usual suspects: pod crashes, issues with NVIDIA drivers or OpenCL, and whatever turned up from googling the obviously Googleable error messages. Most solutions pointed at issues with the specific work unit, or at a retriable failure, so I deleted the pod, deleted the data in the data directory, and deleted everything else I could think of to reset it, all without success.
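When the same failure repeats across restarts, it helps to pull just the "FahCore returned" lines out of the log and count them by status. The following is a minimal sketch that assumes the line format matches the excerpt above. The regular expression and the helper names are my own illustration, not part of FAHClient.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Pattern inferred from the log excerpt, e.g.
# 00:51:21:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
RETURN_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)$"
)


class CoreResult(NamedTuple):
    time: str
    work_unit: str
    slot: str
    status: str
    code: int


def parse_core_result(line: str) -> Optional[CoreResult]:
    """Parse one log line; return None if it is not a 'FahCore returned' line."""
    m = RETURN_LINE.match(line.strip())
    if m is None:
        return None
    return CoreResult(m["time"], m["wu"], m["slot"], m["status"], int(m["code"]))


def count_statuses(lines: Iterable[str]) -> Counter:
    """Count how often each return status appears in a stream of log lines."""
    results = (parse_core_result(line) for line in lines)
    return Counter(r.status for r in results if r is not None)
```

Feeding the container's log lines into count_statuses gives a quick view of whether one status, such as INTERRUPTED, dominates.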

And so I ended up where every good software project ends up: trawling the forums and finally asking for help. The good people on foldingforum.org engaged quickly and tried to help, but I went in knowing that my setup is fairly uncommon and, to be frank, I didn't think it would be worth their time to get my one oddly configured RTX 2080 online when there was so much other work to do.

I was able to narrow the issue down to my Kubernetes setup by running FAHClient manually on the desktop and by running the same Docker image directly with docker run, neither of which reproduced the failure. But even with that narrow scope, I wasn't able to figure out the problem.

You see, but do you observe?

The next day, I thought to check my cluster resource usage dashboard in Grafana, specifically the memory usage chart for this Pod.

That pattern was immediately familiar, and telling: a container was hitting the 500Mi memory limit I had assigned and then crashing. But typically the pod would be killed with a termination reason of OOMKilled, and my Alertmanager setup would notify me with a PodFrequentlyRestarting alert. Was one process being killed and not the pod? Is that possible?

Yes.

That issue pointed me to look at the node_vmstat_oom_kill metric that is exposed if you run node_exporter, and sure enough, something on my server was being OOM killed approximately once per minute.
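As far as I know, node_exporter derives node_vmstat_oom_kill from the oom_kill line in /proc/vmstat, a cumulative count of processes killed by the kernel OOM killer. Here is a minimal sketch of reading that counter directly on a node, assuming a kernel recent enough to expose the line. The helper names are illustrative.

```python
def oom_kill_count(vmstat_text: str) -> int:
    """Extract the cumulative oom_kill counter from /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        # Each line of /proc/vmstat is "<name> <value>".
        name, _, value = line.partition(" ")
        if name == "oom_kill":
            return int(value)
    raise KeyError("oom_kill not found; the kernel may not expose this counter")


def oom_kills_between(before: str, after: str) -> int:
    """Number of OOM kills that happened between two /proc/vmstat snapshots."""
    return oom_kill_count(after) - oom_kill_count(before)


def read_vmstat(path: str = "/proc/vmstat") -> str:
    """Read the current snapshot; only meaningful on a Linux host."""
    with open(path) as f:
        return f.read()
```

Taking two read_vmstat() snapshots a minute apart and passing them to oom_kills_between gives the same signal the metric showed: a counter that keeps climbing even though the pod itself never restarts.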

Clean up

From this point, the resolution was simple: remove the offending memory limit and deploy the change.

Immediately after the work unit started up, I could see memory usage spike up well beyond where the limit was placed.
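To compare a usage reading in bytes against a limit written the way Kubernetes writes it, the quantity string first has to be converted. The sketch below handles only plain integer quantities with the common binary and decimal suffixes, not the full Kubernetes quantity grammar. The function names are hypothetical, and the usage figure in the note is made up for illustration.

```python
import re

# Binary (power-of-two) and decimal (power-of-ten) suffixes.
_SCALES = {
    "": 1,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '500Mi' into a number of bytes."""
    m = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity)
    if m is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = m.groups()
    if suffix not in _SCALES:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    return int(number) * _SCALES[suffix]


def exceeds_limit(usage_bytes: int, limit: str) -> bool:
    """True if an observed usage in bytes is above the configured limit."""
    return usage_bytes > memory_bytes(limit)
```

For example, memory_bytes("500Mi") is 524288000 bytes, so a hypothetical reading of 600 * 2**20 bytes would make exceeds_limit return True for the old limit.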

Shortly after that, I tried to read node_vmstat_oom_kill for the nodes in my GKE cluster and realized that node_exporter wasn't running there, so I enabled it.

Takeaways

First, observability is important. Exit codes and log messages are helpful, but cluster-level monitoring can show when a specific process is misbehaving, even if you know nothing about that process. FAHClient isn't open source (yet!), so my ability to dive deeper was limited, and knowing what I know now, the INTERRUPTED message and its corresponding code likely originate very deep in the stack. Prometheus and Grafana made that irrelevant.

Second - get folding!

F@H promo video by Bowman lab

Title image credit: Alissa Eckert, MS; Dan Higgins, MAM available at https://phil.cdc.gov/Details.aspx?pid=23311

Justin Palpant
