I got an NVIDIA RTX 2080 Super a few months ago. It's a great piece of hardware and up for anything I can throw at it, which so far includes Metro Exodus, Half-Life: Alyx, Folding@Home, and more. But out of the box, it performed about 15% worse than it does now, even while reporting maximum utilization. With a bit of debugging and a few small changes to the system, I've managed to reclaim that performance. Here's what I learned.

This post focuses on finding and addressing bottlenecks affecting GPU compute, but graphics processing can be slowed by many components: a slow CPU can prevent a GPU from running at full speed by failing to feed it work quickly enough, and a machine learning task that requires large amounts of data transfer may be limited elsewhere, such as by GPU memory bandwidth, disk, or network activity. Rule these out first. A good rule of thumb is to check that GPU utilization is reported as nearly 100% while other components are not at their maximums.
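
A quick way to do that is to watch GPU utilization with nvidia-smi while running top or iostat alongside. For example (these field names come from nvidia-smi --help-query-gpu):

$ nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
Poll GPU core and memory utilization once per second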

Identifying the potential for more performance

I started investigating my GPU's performance after two observations: the first was that latency-sensitive VR games would sometimes stutter or jerk before becoming smooth again, with brief large spikes in frame latency (going from sub-6ms times up to 15-18ms for brief fractions of a second); the second was that when running at maximum utilization, my GPU temperature was pinned at 86°C with the GPU fans running at full speed.

Now, a bit of frame drop in a demanding game could maybe be expected, new GPU or not. And it's hard to find good information about what qualifies as "high" temperatures for a GPU, or what the effects of running hot actually are. Still, 86°C is warm, and since my case is a Fractal Node 202, an extremely compact 10.2L mini-ITX case, cooling was at the top of my mind. I started to learn about what happens to a GPU as it reaches its thermal maximums.
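
If you want to watch your own card's temperature, nvidia-smi reports it directly, and the -l flag polls once per second:

$ nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader -l 1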

SM Clock Throttling

It turns out that what an NVIDIA GPU does to stay cool is reduce the clock frequency of its streaming multiprocessor (SM) units, which contain the CUDA cores; for tasks running on those cores, the performance loss is proportional to the drop in frequency. The telltale sign of a throttled GPU is an uneven SM frequency - a full-power GPU maintains a stable clock frequency.

Throttling confirmed! The SM clock plot showed clear signs of throttling - constantly spiking between 1770MHz and 1690MHz, and even dropping to 1650MHz for a sustained window. The reference RTX 2080 Super has a base clock of 1650MHz and a boost clock of 1815MHz, so these would seem to be good speeds, but the instability in the frequency meant something was wrong.

On Windows, third-party programs like GPU-Z can help you detect this by showing a graph of GPU frequency over time. On Linux, the job is somewhat more difficult: you can run nvidia-smi -q -d CLOCK to ask for the GPU frequency, but you must run it repeatedly to see whether the clock frequency is changing.
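
One way to do that repeated sampling is nvidia-smi's query interface with its built-in loop flag; an unstable sequence of values suggests throttling:

$ nvidia-smi --query-gpu=clocks.sm --format=csv,noheader -l 1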

For those of us on Linux and without datacenter-style monitoring, though, there's an easier way!

PERFORMANCE

Just run nvidia-smi -q -d PERFORMANCE

$ nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Driver Version                      : 440.66.08
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Active
        Display Clock Setting       : Not Active

This is the best list of active throttles I've seen, and when I was investigating, it clearly and consistently showed SW Thermal Slowdown - my GPU was too hot. Not hot enough to trigger the emergency brake that is a hardware slowdown, but hot enough to affect performance. Next up was to figure out how to fix it.*

GPU-tuned air-cooling on Linux

It was at this point that I learned something lucky: I had made a dumb mistake in my build and forgotten that the Fractal Node 202 has space for two case fans beneath the GPU. These are meant to be static-pressure fans, pulling cool air in from outside, with the resulting hot air vented out by the CPU fan. I could easily add two Corsair ML120 Pro Blue 120mm fans as case fans.

Improving Fan Control

My mini-ITX motherboard is the Gigabyte Z390 I Aorus Pro Wifi, which has three fan headers and comes with the Smart Fan 5 fan control software in the BIOS. This was sufficient to make sure the fans turned on with default settings, but the control Smart Fan 5 offers is limited: you can tie any of your fans to the CPU temperature, the PCH temperature, or an ambient temperature sensor somewhat removed from the CPU, and the available fan curves are highly customizable, but finicky.

Smart Fan 5 screenshot showing fan curve
Smart Fan 5 supports multiple fans with complex fan curves, but the motherboard temperature sensors weren't a good basis for eliminating thermal throttling on the GPU

Unfortunately, tying case fan speed to the ambient temperature meant that these fans wouldn't spin up when the GPU was under load; tying it to the CPU temperature meant that the fans would rapidly spin up and down even when the GPU was inactive, as CPU temperatures tend to be more variable than the temperatures of other components. Neither option was sufficient.

lm-sensors and fancontrol

The go-to for fan speed control on Linux is a combination of lm-sensors, a powerful general-purpose hardware monitoring package, and fancontrol, a simple but useful script that monitors arbitrary temperature sensors and drives PWM outputs in an infinite loop. On Ubuntu, both can be installed with apt and then configured:

$ sudo apt install lm-sensors fancontrol
$ sudo sensors-detect
$ sudo pwmconfig

On many systems this is sufficient to expose the CPU temperature sensors, as well as the PWM outputs and fan-speed sensors that provide fan control and feedback.

However, this doesn't work on this particular Gigabyte motherboard.

The Gigabyte motherboard uses a temperature sensor chip which isn't natively supported by the Linux kernel. Fortunately, an enterprising developer once wrote a kernel module, it87.ko, which supports a large number of sensor chips of this type. The original maintainer chose to stop maintaining the repository, but several forks exist. I chose hannesha/it87, and installed it as a DKMS module so that it is rebuilt automatically for future kernels I install.

$ cd ~
$ git clone https://github.com/hannesha/it87
$ cd it87
$ make
$ sudo make dkms
Install it87.ko to add support for the Gigabyte Z390 fan control and sensors

To enable an installed module like this, you would typically use modprobe, but here there was an issue: this repository is not kept up-to-date with newer motherboard specifications, so when the module attempts to detect the relevant hardware at load time, it fails - it is unable to detect the correct device.

$ sudo modprobe it87
modprobe: ERROR: could not insert 'it87': No such device
it87.ko cannot be loaded by modprobe with default parameters

Others have run into this issue on a similar motherboard - the it87 kernel module has an argument, force_id, where you can specify the specific hardware configuration it should target. Though none of the available configurations is a perfect match for the Z390 (preventing automatic matching), some do, conveniently, match closely enough that specifying the ID manually results in successful access to the sensors.

$ sudo modprobe it87 force_id=0x8628
$ sudo sensors-detect
...
Some Super I/O chips contain embedded sensors. We have to write to
standard I/O ports to probe them. This is usually safe.
Do you want to scan for Super I/O sensors? (YES/no): 
Probing for Super-I/O at 0x2e/0x2f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      Yes
Found unknown chip with ID 0x8688
...

$ sensors
...
it8628-isa-0a40
Adapter: ISA adapter
in0:          +1.12 V  (min =  +0.00 V, max =  +3.06 V)
in1:          +2.00 V  (min =  +0.00 V, max =  +3.06 V)
in2:          +2.03 V  (min =  +0.00 V, max =  +3.06 V)
in3:          +2.02 V  (min =  +0.00 V, max =  +3.06 V)
in4:          +0.00 V  (min =  +0.00 V, max =  +3.06 V)  ALARM
in5:          +1.06 V  (min =  +0.00 V, max =  +3.06 V)
in6:          +1.21 V  (min =  +0.00 V, max =  +3.06 V)
3VSB:         +3.38 V  (min =  +0.00 V, max =  +6.12 V)
Vbat:         +3.19 V  
fan1:        1496 RPM  (min =    0 RPM)
fan2:        1541 RPM  (min =    0 RPM)
fan3:        1464 RPM  (min =    0 RPM)
temp1:        +57.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:        +64.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp3:        +77.0°C  (low  = +127.0°C, high = +127.0°C)
temp4:         +0.0°C  (low  =  +0.0°C, high = +127.0°C)
temp5:        +65.0°C  (low  =  +0.0°C, high = -120.0°C)
temp6:        +63.0°C  (low  =  +0.0°C, high = +127.0°C)
intrusion0:  OK

And just like that, I could see my fan speeds as well as a number of other sensors, and pwmconfig was able to successfully detect the correct fan control PWM outputs.

To make this permanent, it's necessary to put the new kernel module into /etc/modules, with the custom options in a separate conf file in /etc/modprobe.d:

dm-snapshot

# Generated by sensors-detect on Sun Jan 21 22:03:04 2018
# Chip drivers
coretemp

# Added manually, 2020-03-24, see hannesha/it87
it87
/etc/modules with it87 specified manually; coretemp was found by sensors-detect. Note that /etc/modules does not accept module options - if you add them here, the module will fail to load on boot and an error will be logged.
# force the it87 module to treat this chip as an IT8628 (0x8628), even though it isn't one
# seems to work on the Z390 I Aorus Pro Wifi
options it87 force_id=0x8628
/etc/modprobe.d/it87.conf

GPU temperature fan control

Having fancontrol control the case fans was great, and easier to modify than leaving fan control in the BIOS, but it still didn't solve the original problem: I needed my case fan speed to depend on the GPU temperature.

At this point a StackOverflow post about feeding HDD temperatures to fancontrol revealed that fancontrol treats temperature sensors as simple files: while it reads from /sys/class/hwmon/{sensorpath} by default, you can also specify an arbitrary absolute file path as a sensor input in /etc/fancontrol. This allows you to update a file with an arbitrary temperature and have fancontrol use that file's contents as if it were a sensor.
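
As a sketch, a single line in /etc/fancontrol can tie a fan's PWM output to such a file instead of a hwmon sensor (this is the same FCTEMPS syntax used in the full config below):

FCTEMPS=hwmon3/pwm2=/var/opt/fancontrol/gpu_0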

With a quick bash script that uses nvidia-smi to read the temperature from one or more GPUs and write those values to files, and a systemd unit to run it as a service, I could create a fancontrol-compatible "GPU temperature sensor":

#!/bin/bash
# Read NVIDIA GPU temperatures and write to a file on a duty cycle

HELPTEXT="\
Export GPU temperatures to a directory. Each GPU is written to a file 'gpu_{gpu number}' in the directory.

Usage: export-gpu-temp --loop 2 --output /var/opt/gputemps --gpu 0 --gpu 1
Options:
  -o/--output (required) - path to a directory in which to write GPU temperatures
  -l/--loop (required) - time to sleep between GPU temperature query cycles, in seconds
  --gpu (required, multiple) - GPU number to query; repeat for multiple GPUs
"

set -euo pipefail

GPUS=()
while [[ $# -gt 0 ]]; do
    key="$1"

    case $key in
        -h|--help)
        echo "$HELPTEXT"
        exit 0
        ;;
        -o|--output)
        dirpath_output=$2
        # -w alone doesn't confirm a directory, so check both
        if ! [ -d "$dirpath_output" ] || ! [ -w "$dirpath_output" ]; then
            echo "$dirpath_output is not a writeable directory"
            exit 1
        fi
        shift
        shift
        ;;
        -l|--loop)
        loop_time=$2
        # [[ a < b ]] is a string comparison; use awk for a numeric one
        if awk -v t="$loop_time" 'BEGIN { exit !(t < 0.1) }'; then
            echo "loop_time is very small (${loop_time}s), this may cause extra load on your GPU!"
        fi
        shift
        shift
        ;;
        --gpu)
        GPUS+=("$2")
        shift
        shift
        ;;
        *)
        echo "Unknown option $1"
        exit 1
        ;;
    esac
done

# fail fast if a required option is missing (the help text marks all as required)
if [[ -z "${dirpath_output:-}" || -z "${loop_time:-}" || ${#GPUS[@]} -eq 0 ]]; then
    echo "$HELPTEXT"
    exit 1
fi

echo "Querying GPUs: ${GPUS[*]}"

while true
do
    for gpu_id in "${GPUS[@]}"
    do
        gpu_output_path=${dirpath_output}/gpu_${gpu_id}

        if ! temp_degrees_c=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader --id="$gpu_id"); then
            echo "Failed to fetch GPU ${gpu_id}"
        else
            # fancontrol expects millidegrees Celsius, per the hwmon interface
            temp_millidegrees_c=$((temp_degrees_c * 1000))
            echo "$(date -Iseconds) GPU ${gpu_id} has temperature ${temp_degrees_c}"

            echo "$temp_millidegrees_c" > "$gpu_output_path"
        fi
    done

    echo "$(date -Iseconds) Sleeping ${loop_time}s"
    sleep "$loop_time"
done
export-gpu-temp, a Bash script to write one or multiple GPU temperatures to individual files, to mimic a hwmon sensor

Note that fancontrol expects temperatures to be provided in millidegrees Celsius, following the hwmon interface, so the output from nvidia-smi needs to be multiplied by 1000.

[Unit]
Description=Export GPU temperatures to a file continuously
Documentation=

[Service]
Type=simple
ExecStart=/usr/local/bin/export-gpu-temp --gpu 0 --output /var/opt/fancontrol/ --loop 1
Restart=on-failure

[Install]
WantedBy=multi-user.target
A systemd Unit to export temperatures from GPU 0 to /var/opt/fancontrol/gpu_0 every second.
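
Wiring this up looks something like the following - the paths and the unit name export-gpu-temp.service are my own choices, not anything standardized. The final cat verifies that a millidegree value is being written (65000 here means 65°C):

$ sudo cp export-gpu-temp /usr/local/bin/
$ sudo cp export-gpu-temp.service /etc/systemd/system/
$ sudo mkdir -p /var/opt/fancontrol
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now export-gpu-temp
$ cat /var/opt/fancontrol/gpu_0
65000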

With that systemd unit up and running, it was a simple matter to modify /etc/fancontrol manually to point to the correct "hardware sensor" and establish temperature bounds for the two case fans. I chose to have the case fans shut off when the GPU temperature was below 60°C, and to reach max speed at 80°C. Here hwmon3/pwm2 and hwmon3/pwm3 are the two case fans. hwmon3/pwm1 is the CPU fan, and is tied to hwmon2/temp2_input, which is the temperature of the first CPU core.

INTERVAL=1
DEVPATH=hwmon2=devices/platform/coretemp.0 hwmon3=devices/platform/it87.2624
DEVNAME=hwmon2=coretemp hwmon3=it8628
FCTEMPS=hwmon3/pwm3=/var/opt/fancontrol/gpu_0 hwmon3/pwm2=/var/opt/fancontrol/gpu_0 hwmon3/pwm1=hwmon2/temp2_input
FCFANS=hwmon3/pwm3=hwmon3/fan3_input hwmon3/pwm2=hwmon3/fan2_input hwmon3/pwm1=hwmon3/fan1_input
MINTEMP=hwmon3/pwm3=60 hwmon3/pwm2=60 hwmon3/pwm1=60
MAXTEMP=hwmon3/pwm3=80 hwmon3/pwm2=80 hwmon3/pwm1=95
MINSTART=hwmon3/pwm3=20 hwmon3/pwm2=20 hwmon3/pwm1=56
MINSTOP=hwmon3/pwm3=0 hwmon3/pwm2=0 hwmon3/pwm1=16
MINPWM=hwmon3/pwm3=0 hwmon3/pwm2=0 hwmon3/pwm1=16
MAXPWM=hwmon3/pwm3=250 hwmon3/pwm2=250 hwmon3/pwm1=250
AVERAGE=5
Final /etc/fancontrol. Read more about the available options on the fancontrol man page

With the it87 kernel module, fancontrol, and this script, I believed I was in a good place: sensible, GPU-aware fan control should resolve the throttling. GPU temperatures were noticeably lower under load, so it was time to check -d PERFORMANCE again.

SW Power Throttle

$ nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Driver Version                      : 440.66.08
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active

After all that work to fix the cooling problem, a new problem had appeared: this GPU has a TDP of 250W, and under full load, properly cooled, that wasn't enough power. Fortunately, power limit controls are available in nvidia-smi. We can check what power range the GPU supports with nvidia-smi -q -d POWER:

$ nvidia-smi -q -d POWER
...
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 292.00 W

This shows that even though the reference power limit is 250W, it can easily be configured as high as 292W and as low as 125W.

To change the power limit, run nvidia-smi -pl $PL_IN_WATTS as a superuser. Note that you may first need to enable persistence mode on the GPU with nvidia-smi -pm 1. This great blog post has more details, and also includes a quick introduction to overclocking an NVIDIA GPU on Linux, for the interested.

$ sudo nvidia-smi -pl 292
$ nvidia-smi -q -d POWER
...
        Power Limit                 : 292.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 292.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 292.00 W
Modify NVIDIA GPU power limits on Linux with nvidia-smi -pl

Results

With the maximum power increased and the fans installed and properly controlled, the GPU now runs at a comfortable 72-75°C, and the SM clock frequency stays stable at 1890MHz** for long intervals.

nvidia-smi no longer indicates any form of throttling is occurring:

$ nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                           : Sat Apr 25 13:04:02 2020
Driver Version                      : 440.66.08
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active

But the real test is in the benchmarks. While I did observe higher Folding@Home Points Per Day, that measure is noisy, so I ran a more reliable benchmark with the Phoronix Test Suite:

Unigine Heaven 4.0 benchmark comparison, +25FPS with fans
Benchmarking with Phoronix Test Suite's pts/unigine-heaven benchmark. Full result here, with my old GTX 1050 Ti included for reference.

An increase of 25FPS, or ~15%, is nothing to sneeze at! It's not huge, but it's approximately the difference between adjacent tiers of graphics cards these days, so this felt like getting a free upgrade.

Summary

Check for GPU throttling with nvidia-smi -q -d PERFORMANCE --loop-ms=500. If thermal throttling occurs, consider improving cooling with better fans, additional case fans, or, failing that, a liquid-cooling system. If no thermal throttling is happening, don't waste time or money on a complex cooling setup! If you encounter hardware power throttling, you may need to buy a more powerful power supply. If software-defined power throttling is happening, try to change the software-defined power limits by checking the acceptable power range with nvidia-smi -q -d POWER and setting the active limits with nvidia-smi -pl.

As an added benefit, I find that it's easy to use the software-defined power limit as a cheap GPU throttle: reducing the power limit to 150W makes the GPU run cool, at the cost of about half the performance.
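
For example, on my card (which allows anywhere from 125W to 292W):

$ sudo nvidia-smi -pl 150    # cool and quiet, at roughly half performance
$ sudo nvidia-smi -pl 250    # restore the default limit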

*The Performance State indicator is also interesting, and you can read more about it in the NVIDIA docs. According to this Reddit post, the P0-P2 power states have identical core clock frequencies, but P2 reduces the memory clock frequency. It also states that any compute workload other than live graphical rendering will keep the card in the P2 state. Since my memory utilization is low, this isn't a problem for me, but if memory bandwidth or utilization is a concern, consider investigating the reduced power state.
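
To keep an eye on the power state alongside the core and memory clocks, nvidia-smi exposes all three as query fields:

$ nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv -l 1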

**I have since managed to successfully overclock the SM frequency by +100MHz, stably, and now see constant frequencies at 1995MHz without thermal, power, or other stability issues. But the benchmark and plots show the state of the system and the performance gains without any overclocking.