For any website, app, or product you support, there are a few dimensions to providing a good experience to your users, like availability, because no one likes error messages, or latency, because interactions should be smooth and quick.
Load testing is a great way to expose bottlenecks, fragility, and performance issues in your application. By adding a large amount of traffic in a controlled manner, you can often spot issues. And it never hurts to be prepared for what might happen if your blog goes viral!
My interest in automated load testing came about because I noticed that when running simple tests, the performance characteristics of my sites (like my GitLab instance, or this blog) became totally unpredictable - different from what they were in the "steady-state" (with little to no traffic).
There are many ways to load test a website - so let's start with the most basic.
Making some HTTP requests
If you want to make sure that the HTTP-serving components of your system are performant, making simple HTTP requests to a public website is easy, and can be done at high scale with minimal resources.
A consumer-grade laptop running a cURL script can easily make hundreds of requests per second. ab, a classic tool from Apache, and the more modern wrk take the basic principle of cURL and add configurable parallelism and (in the basic cases) high QPS, as well as statistics reporting, which can give you an idea of the range of latency and throughput characteristics of a server.
All of these tools can be run with almost no overhead, maxing out a basic webserver while consuming little to no CPU or memory on the load test machine.
However, these tools and tools like them have a weakness - for a webserver that serves a web application that users interact with via a browser, the load induced by a single HTTP request doesn't really mimic what would happen if a large number of users began to visit.
A user visiting a web page often triggers more load-inducing side effects on your servers than one or even several HTTP requests would, such as:
- Server-side rendering of new components
- Server-side caching based on request headers
Simulate the user
To overcome these limitations, you can try to expose your servers to load that, from the server's perspective, appears to be regular user traffic. Can't find a few hundred or thousand people to reload your blog all day? No problem! Fortunately, there are a few ways to automate this.
Services such as flood.io can help with this - they provide a tool to simulate a huge number of users, geographically distributed, interacting with your website in simple ways. For more advanced cases, you can even provide Selenium scripts to execute complex sequences of interactions with a website. I use and will continue to use Flood for occasional high-stress testing.
But Flood and similar tools have some limitations. Importantly, though they scale to large numbers, they are not meant to simulate sustained load - jobs typically run for minutes or hours, but not days or weeks. On top of that, the tools are expensive. Flood offers 500 virtual user-hours per month for free, and then $0.045/virtual user-hour thereafter. While this is great for a few bursts, simulating continuous load of only 10 users would cost upwards of $300/month.
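The cost estimate above is easy to check with a few lines of Go. This is only a sketch of the arithmetic: the function name monthlyCostUSD is invented for the example, and it assumes a 30-day month (720 hours) with all 10 users running around the clock.

```go
package main

import "fmt"

// monthlyCostUSD estimates the monthly bill for continuously running virtual
// users under a "free allowance, then a per-user-hour rate" pricing model.
func monthlyCostUSD(users int, hoursPerMonth, freeUserHours, ratePerUserHour float64) float64 {
	billable := float64(users)*hoursPerMonth - freeUserHours
	if billable < 0 {
		billable = 0 // usage that fits inside the free allowance costs nothing
	}
	return billable * ratePerUserHour
}

func main() {
	// 10 users x 720 hours (assumed 30-day month) = 7,200 user-hours,
	// minus 500 free hours, at $0.045 per user-hour.
	fmt.Printf("$%.2f per month\n", monthlyCostUSD(10, 720, 500, 0.045))
}
```

Under those assumptions the result is about $301.50 per month, in line with the "upwards of $300" figure.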
I thought it would be interesting to try to build something that would provide a tunable way to load web pages and induce the corresponding stress on my monitoring stack (in the absence of a horde of developers constantly refreshing dashboards). My goals were simple and specific: simulate a full page load on an authenticated web page, in an indefinite loop, with some control over the parallelism on the client side.
For years, driving Chrome or other browsers via automation has been a staple of integration testing via frameworks like Selenium. Today, the Chrome DevTools Protocol, a JSON-based API maintained by Google and typically spoken over a WebSocket connection, facilitates this. It allows programmatic control of a Chrome browser instance from another process.
Several libraries wrapping the DevTools Protocol have been made for different languages to allow fine-grained control: Puppeteer for Node.js (maintained by the Chrome development team), github.com/chromedp/chromedp for golang, headless_chrome for Rust. These libraries are at varying levels of maturity and completeness, and if you are interested in building a new solution for driving Chrome generally, they are a great place to start!
Given that I'm more familiar with golang than the other two languages, I wanted to see if I could scratch out a simple binary that met my goals using chromedp.
Courtesy of the power and ease of golang, chromedp, Docker, and Kubernetes, in just a handful of hours I made a tool that:
- Loads web pages continuously in a headless Chrome browser, with URLs specified via file or CLI argument
- Simulates a complete page load, including awaiting networkIdle0 events (meaning no network requests have been made for 500ms), with configurable timeout
- Supports arbitrary HTTP headers, TLS verification using default CAs, or skipping TLS verification (for HTTPS websites with self-signed certificates)
- Offers configurable parallelism via a reusable pool of browser tabs
- Can be (and is!) run on Kubernetes, with jpalpant/chromedp-load-agent published to DockerHub
- Can take screenshots of the page to validate a successful page load
Right now, chromedp-load-agent is untested, the code isn't very well organized (on account of being my first project using spf13/cobra), it's expensive to run, and it's brittle. While functional, as a long-running service it has a lot of limitations. I'm not interested in making it into a library, but if I can, I'd love to improve other aspects:
- Health checks suitable for a long-running server process
- Prometheus metrics for application statistics, like successful or failed page loads
- More utility to screenshots, like a web interface to show the most recent screenshot for each URL
Beyond that, there's a lot of potential for a reusable library that automates this work in Golang, but I think that's not a direction I want to pursue right now.
But the important thing: it works. I set out to add consistent, configurable load to my monitoring system, and this is what QPS looks like now:
So, did it reveal anything interesting? Or was this exercise a waste of time?
What I learned
The first thing that surprised me was how quickly the TCO of my small websites increased under constant load. I run most things on GCP via a GKE cluster, and of course my personal Grafana instance sees very little traffic day-to-day, but I wouldn't have guessed how and how much an increase in traffic would cost.
As with most cloud providers, Google charges for a wide variety of usage SKUs. Some, like CPU and memory, are obvious, while others, like static IP addresses, load balancer rules, cluster management fees, and monthly storage fees, are less obvious but still intuitive. However, the constant page loads suddenly revealed a variety of non-intuitive costs for services that were previously inexpensive. The main culprit turned out to be (drumroll)...
Believe it or not, GCP makes you pay through the nose for log ingestion once you pass a free usage threshold of 50GB. With NGINX logs and traces from various services being emitted on every request, even my small cluster consumed that allotment and rapidly started accruing log charges, at a rate of $0.50/GB. Fortunately, Cloud Logging allows flexible exclusion filters, and I was able to bring that cost back under control.
Beyond logs, I also noticed a sharp spike in charges due to Network Egress (data leaving GCP because the load agent is downloading it), and GCS Class B requests. The latter was interesting to me, and difficult to resolve. I use Thanos (a CNCF project) as part of my monitoring stack, and Thanos serves metrics from Google Cloud Storage. Thanos Store and Thanos Query, the components responsible for handling requests for metrics, offer very little in the way of caching, so every page load required downloading a piece of a GCS object to eventually display to the visitor in a Grafana dashboard.
At a lower level, Thanos Store reports statistics on these operations - the relevant operation was the bucket get_range request (from gcs.go).
That operation makes a storage.*.get request via the GCS JSON API, which is categorized as a Class B operation for billing. GCP charges $0.004 per 10,000 operations for this type of request against a Standard bucket. While that seems small, 100 QPS translates to more than $100/month - a substantial amount compared to other costs on this cluster. I'll write about how I dealt with this another time; in the meantime, watch this amazing talk by Tom Wilkie and read the blog post!
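The Class B arithmetic can be worked through in a few lines of Go. This is a sketch only: classBCostUSD is a name made up for the example, the price is the one quoted above, and it assumes a 30-day month with one billable operation per request at a constant rate.

```go
package main

import "fmt"

// secondsPerMonth assumes a 30-day month.
const secondsPerMonth = 30 * 24 * 60 * 60

// classBCostUSD estimates the monthly cost of a constant request rate,
// given a price per 10,000 operations.
func classBCostUSD(qps, pricePer10k float64) float64 {
	opsPerMonth := qps * secondsPerMonth
	return opsPerMonth / 10000 * pricePer10k
}

func main() {
	// 100 QPS x 2,592,000 s = 259,200,000 operations per month.
	fmt.Printf("$%.2f per month\n", classBCostUSD(100, 0.004))
}
```

Under those assumptions, 100 QPS comes to about $103.68 per month.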
CPU bottlenecks and autoscaling
To my surprise, memory usage for the pods I use stayed relatively stable when the load agent was enabled. However, several pods showed drastic spikes in CPU usage. Notable among these were Grafana, GitLab's Webservice (formerly "Unicorn"), NGINX, and the CloudSQL proxy I use to tunnel my GCP-managed databases for secure access from within the Kubernetes cluster.
Normally, this would be fine - my cluster runs with excess capacity, and CPU is a compressible resource, which means (in part) that it can be over-consumed without Kubernetes needing to terminate any pods or increase cluster capacity. Instead, the pods are throttled - prevented from consuming more CPU than their limits by not scheduling those processes.
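As a rough illustration of what throttling does to a single request, the sketch below models a CPU limit as a quota of run time per scheduling period. It is a simplification with several assumptions that are not from my cluster: a 100ms period (the Linux CFS default), a single-threaded request that starts at a period boundary with its full quota, and nothing else in the container competing for that quota. The function name wallClockMs is invented for the example.

```go
package main

import "fmt"

// wallClockMs estimates how long a single-threaded request takes to finish
// when it needs cpuMs of CPU time but may only run for quotaMs out of every
// periodMs. A quota of zero or less, or one at least as large as the period,
// is treated as "not throttled".
func wallClockMs(cpuMs, quotaMs, periodMs int) int {
	if cpuMs <= 0 {
		return 0
	}
	if quotaMs <= 0 || quotaMs >= periodMs {
		return cpuMs
	}
	// Number of periods whose whole quota is used up before the final one.
	fullPeriods := (cpuMs+quotaMs-1)/quotaMs - 1
	remaining := cpuMs - fullPeriods*quotaMs
	return fullPeriods*periodMs + remaining
}

func main() {
	const periodMs = 100 // assumed scheduling period
	const cpuMs = 80     // CPU time the hypothetical request needs
	for _, quotaMs := range []int{100, 50, 25} {
		fmt.Printf("quota %3dms per %dms: request finishes after %dms\n",
			quotaMs, periodMs, wallClockMs(cpuMs, quotaMs, periodMs))
	}
}
```

In this model, a request needing 80ms of CPU finishes in 80ms with a full core, 130ms with half a core, and 305ms with a quarter of a core. The extra time is spent paused, waiting for the next period.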
Unfortunately, CPU throttling on any process that serves user traffic can lead to poor and inconsistent performance - a process in the midst of handling a request could suddenly be paused, delaying the response and increasing latency by an unpredictable amount - like this:
There are a few ways to get around this. In some cases, the right choice is to simply increase the CPU allocation to your pods to prevent the bottleneck. If large changes in load are infrequent, you can pick an allocation that works for your expected load and leave it, updating it manually when need be.
Sometimes you can't predict what load you need to handle. For those cases, a HorizontalPodAutoscaler can help. This Kubernetes primitive monitors the CPU or memory usage of all pods belonging to a Deployment and, if the usage exceeds a threshold, scales up the Deployment. If your Service is set up in the usual way, requests are automatically load balanced across the new and old Pods once all are available, reducing the CPU needed for each Pod. Scaling up or down is repeated until the CPU usage is within bounds, or the maximum or minimum number of Pods the HPA can use is reached.
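The scaling decision itself is simple arithmetic. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), and the sketch below applies it with the minimum and maximum bounds. It is a simplified model, not the controller's code: the real HPA also applies a tolerance band and stabilization windows, which are left out here, and the function name is just for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the documented HPA ratio rule and clamps the
// result to the configured replica bounds.
func desiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	// 2 pods averaging 180% of requested CPU against a 90% target.
	fmt.Println(desiredReplicas(2, 180, 90, 1, 10)) // prints 4
	// 4 pods averaging 45% against the same target.
	fmt.Println(desiredReplicas(4, 45, 90, 1, 10)) // prints 2
}
```

Doubling the load per pod doubles the replica count, which brings average utilization back to the target once the new pods are serving traffic.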
For GitLab, which I deploy via the GitLab Cloudnative Helm Chart, an HPA for the webservice deployment can be configured via the gitlab.webservice.hpa field, as described in the docs. NGINX ingress similarly offers easy HPA configuration, and any HPA for any deployment can be made with kubectl autoscale as well.
If you are thinking about how to load test your website, application, or product, I hope some of this has been useful information! If you have any feedback or suggestions, or are interested in chromedp-load-agent or similar tools, please get in touch! I'm always learning and looking for input, and happy to chat about infrastructure any time.