I got an NVIDIA RTX 2080 Super a few months ago. It's a great piece of hardware and up for anything I can throw at it, which so far includes Metro Exodus, Half-Life: Alyx, Folding@Home, and more. But out of the box it ran about 15% slower than it does now, even while reporting maximum utilization. With a bit of debugging and a few small changes to the system, I've managed to reclaim that performance. Here's what I learned.
This post focuses on finding and addressing bottlenecks affecting GPU compute, but graphics processing can be slowed by many components: a slow CPU can prevent a GPU from running at maximum speed by failing to feed it work quickly enough, and a machine learning task that requires large amounts of data transfer may be limited elsewhere, such as by GPU memory bandwidth, disk, or network activity. Rule these out first. A good rule of thumb is to check that GPU utilization is reported as nearly 100% while other components are not at their maximums.
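One quick way to sanity-check this is to watch GPU utilization, memory utilization, temperature, and power draw together while the workload runs, alongside your usual CPU and disk monitors. The query below is just one convenient selection of fields:
$ nvidia-smi --query-gpu=utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l 1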
Identifying the potential for more performance
I started investigating my GPU's performance after two observations: the first was that latency-sensitive VR games would sometimes stutter or jerk before becoming smooth again, with brief large spikes in frame latency (going from sub-6ms times up to 15-18ms for brief fractions of a second); the second was that when running at maximum utilization, my GPU temperature was pinned at 86°C with the GPU fans running at full speed.
Now, a bit of frame drop in a demanding game could maybe be expected, new GPU or not. And it's hard to find good information about what qualifies as "high" temperatures for a GPU, and what the effects of running at high temperatures are. Still, 86°C is warm, and since my case is a Fractal Node 202, an extremely compact 10.2L mini-ITX case, cooling was at the top of my mind. I started to learn about what happens to a GPU as it reaches its thermal maximums.
SM Clock Throttling
It turns out that what an NVIDIA GPU does in order to stay cool is reduce the clock frequency of its streaming multiprocessor (SM) units, which contain the CUDA cores, resulting in a decrease in performance proportional to the decrease in frequency for tasks running on those cores. The sign of a throttled GPU is an SM frequency that is uneven - a full-power GPU maintains a stable clock frequency.
Throttling confirmed! The SM Clock plot showed clear signs of throttling - spiking constantly between 1770MHz and 1690MHz, and even dropping to 1650MHz for a sustained window. The reference RTX 2080 Super has a base clock of 1650MHz and a boost clock of 1815MHz, so these would seem to be good speeds, but the instability in the frequency meant something was wrong.
On Windows, third-party programs like GPU-Z can help you detect this by showing a graph of GPU frequency over time. On Linux, the job is somewhat more difficult: you can run nvidia-smi -q -d CLOCK to ask for the GPU frequency, but you must run it repeatedly to see whether the clock frequency is changing.
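For example, you could poll it every second with watch, or use nvidia-smi's built-in loop flag; both simply repeat the same query:
$ watch -n 1 nvidia-smi -q -d CLOCK
$ nvidia-smi -q -d CLOCK -l 1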
For those of us on Linux and without datacenter-style monitoring, though, there's an easier way: just run nvidia-smi -q -d PERFORMANCE.
$ nvidia-smi -q -d PERFORMANCE
==============NVSMI LOG==============
Driver Version : 440.66.08
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
Performance State : P2
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Active
Display Clock Setting : Not Active
This is the best list of active throttles I've seen, and when I was investigating, it clearly and consistently showed SW Thermal Slowdown - my GPU was too hot. Not hot enough to trigger the emergency brake that is a hardware slowdown, but hot enough to affect performance. Next up was to figure out how to fix it.*
GPU-tuned air-cooling on Linux
It was at this point that I learned something lucky: I had made a dumb mistake in my build and forgotten that the Fractal Node 202 has space for two case fans beneath the GPU. These are meant to be static pressure fans, pulling cool air in from outside, with the resulting hot air vented out by the CPU fan. I could add two Corsair ML120 Pro Blue 120mm fans as case fans easily enough.
Improving Fan Control
My mini-ITX motherboard is the Gigabyte Z390 I Aorus Pro Wifi, which has three fan headers and comes with the Smart Fan 5 fan control software in the BIOS. This was sufficient to make sure the fans turned on with default settings, but the control Smart Fan 5 offers is limited - you can tie any of your fans to the CPU temperature, the PCH temperature, or an ambient temperature sensor somewhat removed from the CPU, and the available fan curves are highly customizable, but finicky.
Unfortunately, tying case fan speed to the ambient temperature meant that these fans wouldn't spin up when the GPU was under load; tying it to the CPU temperature meant that the fans would rapidly spin up and down even when the GPU was inactive, as CPU temperatures tend to be more variable than the temperatures of other components. Neither solution was sufficient.
lm-sensors and fancontrol
The go-to for fan speed control on Linux is a combination of lm-sensors, a powerful general-purpose hardware monitoring package, and fancontrol, a simple but useful script that monitors arbitrary temperature sensors and controls PWM outputs in an infinite loop. On Ubuntu, both can be installed with apt and configured:
$ sudo apt install lm-sensors fancontrol
$ sudo sensors-detect
$ sudo pwmconfig
For many systems, this is sufficient to expose the CPU temperature sensors as well as the PWM outputs and fan speed sensors that provide fan control and feedback.
However, this doesn't work on this particular Gigabyte motherboard.
The Gigabyte motherboard uses a temperature sensor chip which isn't natively supported by the Linux kernel. Fortunately, there was once an enterprising developer who made a kernel module, it87.ko, which supports a large number of sensor chips of this type. The original maintainer chose to stop maintaining the repository, but several forks exist. I chose hannesha/it87 and built it as a DKMS module to make sure it keeps being rebuilt for any future kernels I install.
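Roughly, the build went something like this - a sketch only, since the exact make target comes from the fork's own Makefile, so check its README:
$ git clone https://github.com/hannesha/it87
$ cd it87
$ sudo make dkms    # target name may differ between forks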
To enable an installed module like this, you would typically use modprobe, but here there was an issue: this repository is not kept up-to-date with newer motherboard specifications, and so when it attempts to detect the relevant hardware (which happens when the module is loaded), it fails - it is unable to detect the correct device.
Others have run into this issue on a similar motherboard - the it87 kernel module has an argument, force_id, where you can specify the specific hardware configuration it should target. Though none of the available configurations is a perfect match for the Z390 (preventing automatic matching), some do, conveniently, match closely enough that specifying the ID manually results in successful access to the sensors.
$ sudo modprobe it87 force_id=0x8628
$ sudo sensors-detect
...
Some Super I/O chips contain embedded sensors. We have to write to
standard I/O ports to probe them. This is usually safe.
Do you want to scan for Super I/O sensors? (YES/no):
Probing for Super-I/O at 0x2e/0x2f
Trying family `National Semiconductor/ITE'... No
Trying family `SMSC'... No
Trying family `VIA/Winbond/Nuvoton/Fintek'... No
Trying family `ITE'... Yes
Found unknown chip with ID 0x8688
...
$ sensors
...
it8628-isa-0a40
Adapter: ISA adapter
in0: +1.12 V (min = +0.00 V, max = +3.06 V)
in1: +2.00 V (min = +0.00 V, max = +3.06 V)
in2: +2.03 V (min = +0.00 V, max = +3.06 V)
in3: +2.02 V (min = +0.00 V, max = +3.06 V)
in4: +0.00 V (min = +0.00 V, max = +3.06 V) ALARM
in5: +1.06 V (min = +0.00 V, max = +3.06 V)
in6: +1.21 V (min = +0.00 V, max = +3.06 V)
3VSB: +3.38 V (min = +0.00 V, max = +6.12 V)
Vbat: +3.19 V
fan1: 1496 RPM (min = 0 RPM)
fan2: 1541 RPM (min = 0 RPM)
fan3: 1464 RPM (min = 0 RPM)
temp1: +57.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp2: +64.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp3: +77.0°C (low = +127.0°C, high = +127.0°C)
temp4: +0.0°C (low = +0.0°C, high = +127.0°C)
temp5: +65.0°C (low = +0.0°C, high = -120.0°C)
temp6: +63.0°C (low = +0.0°C, high = +127.0°C)
intrusion0: OK
And just like that, I could see my fan speeds as well as a number of other sensors, and pwmconfig was able to successfully detect the correct fan control PWM outputs.
To make this permanent, it's necessary to add the new kernel module to /etc/modules, with the custom options in a separate conf file in /etc/modprobe.d:
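A minimal version of those two files might look like this (the name of the .conf file is my own choice; the force_id value is the one found above):
# /etc/modules - load the it87 module at boot
it87

# /etc/modprobe.d/it87.conf - options applied whenever the module loads
options it87 force_id=0x8628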
GPU temperature fan control
Having fancontrol control the case fans was great, and easier to modify than leaving fan control in the BIOS, but it still didn't solve the original problem: I needed my case fan speed to depend on GPU temperature.
At this point a StackOverflow post about connecting HDD temperatures to fancontrol revealed that fancontrol treats temperature sensors as simple files, so while it will by default read from /sys/class/hwmon/{sensorpath}, you can also specify an arbitrary file path from / as a sensor input in /etc/fancontrol. This allows you to update a file with an arbitrary temperature and have fancontrol use that file's content as if it were a sensor.
With a quick bash script that uses nvidia-smi to read the temperature from each GPU and write those values to files, and a systemd unit to run this as a process, I could create a fancontrol-compatible "GPU-temperature sensor":
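The original script isn't reproduced here, but a minimal sketch of the idea looks something like this (the script name, output directory, and polling interval are illustrative choices of my own):

#!/bin/bash
# gpu-temp-export.sh: poll GPU temperatures with nvidia-smi and write them to
# files that fancontrol can read as if they were hwmon temperature sensors.
OUT_DIR=/var/run/gputemp
mkdir -p "$OUT_DIR"
while true; do
    # nvidia-smi prints one "index, temperature" line per GPU, e.g. "0, 54"
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader |
    while IFS=', ' read -r index temp; do
        # fancontrol follows the hwmon convention: millidegrees Celsius
        echo $((temp * 1000)) > "$OUT_DIR/gpu${index}_temp"
    done
    sleep 2
done

A small systemd unit (again, the name is illustrative) keeps it running:

# /etc/systemd/system/gpu-temp-export.service
[Unit]
Description=Export GPU temperatures for fancontrol

[Service]
ExecStart=/usr/local/bin/gpu-temp-export.sh
Restart=always

[Install]
WantedBy=multi-user.target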
Note that fancontrol expects temperatures to be provided in millidegrees Celsius, following the hwmon interface, so the output from nvidia-smi needs to be multiplied by 1000.
With that systemd unit up and running, it was a simple matter to modify /etc/fancontrol manually to point to the correct "hardware sensor" and establish temperature bounds for the two case fans. I chose to have the case fans shut off when the GPU temperature was below 60°C, and to reach max speed at 80°C. Here hwmon3/pwm2 and hwmon3/pwm3 are the two case fans. hwmon3/pwm1 is the CPU fan, and is tied to hwmon2/temp2_input, which is the temperature of the first CPU core.
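The relevant lines of /etc/fancontrol looked roughly like the excerpt below. Treat it as a sketch: the sensor path matches the illustrative script above, the pwmconfig-generated DEVPATH/DEVNAME lines are omitted, and the fan_input mappings and CPU-fan thresholds are only examples.

# /etc/fancontrol (excerpt)
INTERVAL=10
# Case fans (pwm2, pwm3) follow the GPU "sensor" file; the CPU fan (pwm1) follows the CPU core temperature
FCTEMPS=hwmon3/pwm2=/var/run/gputemp/gpu0_temp hwmon3/pwm3=/var/run/gputemp/gpu0_temp hwmon3/pwm1=hwmon2/temp2_input
FCFANS=hwmon3/pwm1=hwmon3/fan1_input hwmon3/pwm2=hwmon3/fan2_input hwmon3/pwm3=hwmon3/fan3_input
# Case fans are off below 60°C on the GPU and at full speed by 80°C
MINTEMP=hwmon3/pwm2=60 hwmon3/pwm3=60 hwmon3/pwm1=40
MAXTEMP=hwmon3/pwm2=80 hwmon3/pwm3=80 hwmon3/pwm1=75
MINSTART=hwmon3/pwm2=60 hwmon3/pwm3=60 hwmon3/pwm1=60
MINSTOP=hwmon3/pwm2=30 hwmon3/pwm3=30 hwmon3/pwm1=30
MINPWM=hwmon3/pwm2=0 hwmon3/pwm3=0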
With the it87 kernel module, fancontrol, and this script, I believed I was in a good place: sensible fan control should resolve the throttling. GPU temperatures were noticeably lower under load, so it was time to check -d PERFORMANCE again.
SW Power Throttle
$ nvidia-smi -q -d PERFORMANCE
==============NVSMI LOG==============
Driver Version : 440.66.08
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
Performance State : P2
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
After all that work to fix the cooling problem, one new problem had developed: this GPU has a TDP of 250W, and at full throttle, once properly cooled, that wasn't enough power. Fortunately, power limit controls are available in nvidia-smi. We can check what power range is appropriate for the GPU with nvidia-smi -q -d POWER:
$ nvidia-smi -q -d POWER
...
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 292.00 W
This shows that even though the reference power limit is 250W, it can easily be configured as high as 292W or as low as 125W.
To change the power limit, run nvidia-smi -pl $PL_IN_WATTS as a superuser. Note that you may need to enable persistence mode first with nvidia-smi -pm 1 so that the setting sticks. This great blog post has more details, and also includes a quick introduction to overclocking an NVIDIA GPU on Linux, for the interested.
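On this card, raising the limit to the 292W maximum reported above looks like the following (persistence mode keeps the driver, and with it the new limit, loaded between workloads; the value is specific to this GPU, so check your own card's range first):
$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi -pl 292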
Results
With the maximum power increased and the fans installed and properly controlled, the GPU now runs at a comfortable 72-75°C, and the SM clock frequency stays stable at 1890MHz** for long intervals.
nvidia-smi no longer indicates any form of throttling is occurring:
$ nvidia-smi -q -d PERFORMANCE
==============NVSMI LOG==============
Timestamp : Sat Apr 25 13:04:02 2020
Driver Version : 440.66.08
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
But the real test is in the benchmarks. While I somewhat unreliably observed higher Folding@Home Points Per Day, I decided to run a more reliable benchmark with the Phoronix Test Suite.
An increase of 25FPS, or ~15%, is nothing to sneeze at! It's not huge, but it's approximately the difference between adjacent grades of graphics cards these days, so this felt like getting a free upgrade.
Summary
Check for GPU throttling with nvidia-smi -q -d PERFORMANCE --loop-ms=500. If thermal throttling occurs, consider improving cooling with better fans, additional case fans, or, failing that, a liquid-cooling system. If no thermal throttling is happening, don't waste time or money on a complex cooling setup! If you encounter hardware power throttling, you may need to buy a more powerful power supply. If software-defined power throttling is happening, try changing the software-defined power limits: check the acceptable power range with nvidia-smi -q -d POWER and set the active limit with nvidia-smi -pl.
As an added benefit, I find that it's easy to use the software-defined power limit as a cheap GPU throttle: reducing the power limit to 150W makes the GPU run cool, at the cost of about half the performance.
*The Performance State indicator is also interesting, and you can read more about it in the NVIDIA docs. According to this Reddit post, the P0-P2 power states have identical core clock frequencies, but P2 reduces the memory clock frequency. It also states that any compute workload other than live graphical rendering will keep the card in the P2 state. Since my memory utilization is low, this isn't a problem for me, but if memory bandwidth or utilization is a concern, a reduced power state may be worth addressing.
**I have since managed to successfully overclock the SM frequency by +100MHz, stably, and now see constant frequencies at 1995MHz without thermal, power, or other stability issues. But the benchmark and plots show the state of the system and the performance gains without any overclocking.
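For the curious, one way to apply such an offset on Linux is through the proprietary driver's Coolbits option and nvidia-settings. This is a sketch only: it requires a running X session, and the exact attribute names and required Coolbits value can vary by driver version.
$ sudo nvidia-xconfig --cool-bits=28    # enable clock and fan control, then restart X
$ nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=100'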