First written 2014-03-14.
Updated 2016-07-31.
These are my notes on benchmarking CPU-bound code.
This article is specific to the i7-2600K Sandy Bridge CPU I'm currently (2014) using, but a lot of it will generalize to both earlier and later CPUs.
RDTSCP
The RDTSCP instruction is used to read the Time Stamp Counter. It's an unprivileged instruction, so it doesn't require calling into the kernel "or access to a platform resource," which isn't defined in the Intel manuals. This makes it very fast to read, plus it has a high resolution, and so the Intel manuals recommend it over other clocks in your system like the HPET or the ACPI timer.
How to call RDTSCP from C
#include <stdint.h>

uint64_t rdtscp(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtscp"
                      : /* outputs */ "=a" (lo), "=d" (hi)
                      : /* no inputs */
                      : /* clobbers */ "%rcx");
    return (uint64_t)lo | (((uint64_t)hi) << 32);
}
How RDTSCP works
- The lower 32 bits of the TSC are loaded into eax.
- The upper 32 bits are loaded into edx.
- The CPU ID is loaded into ecx.
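To make the register layout concrete, here is a minimal sketch of a variant that also hands back what landed in ecx. The name rdtscp_aux is made up for illustration and is not part of these notes; it assumes x86-64 with GCC or Clang inline assembly.

```c
#include <stdint.h>

/* Hypothetical helper (name is illustrative): returns the 64-bit TSC
 * and stores the raw ecx value in *aux. */
static inline uint64_t rdtscp_aux(uint32_t *aux)
{
    uint32_t lo, hi, cx;

    /* eax = low 32 bits of the TSC, edx = high 32 bits, ecx = CPU ID.
     * All three are outputs here, so no clobber list is needed. */
    __asm__ volatile ("rdtscp"
                      : "=a" (lo), "=d" (hi), "=c" (cx));

    *aux = cx;
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

Comparing the ecx value from two reads is one way to notice that the thread moved to a different logical CPU between them.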
Some of the finer points:
- Intel CPUs perform out-of-order execution, so RDTSCP waits until all previous instructions finish before it reads the TSC. (the older RDTSC instruction doesn't do this)
- RDTSCP doesn't prevent the instructions after it from starting, and this can throw off your timings. You can work around this by issuing an LFENCE or a CPUID after RDTSCP.
- In Long Mode, the upper halves of rax, rcx and rdx are zeroed out.
- Technically, ecx is set to the IA32_TSC_AUX MSR value for that logical CPU (i.e. hyperthread), but the Linux kernel initializes this to the logical CPU's number, starting from zero (plus 4096 on the second package).
- RDTSCP and RDTSC can be disabled by the kernel (or hypervisor) by setting the TSD bit in CR4, but I suspect this is an exotic security measure that you probably won't run into.
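The LFENCE workaround mentioned above could look roughly like this. It is a sketch, not a vetted benchmarking harness: rdtscp_lfence and busy_work are illustrative names, and the "memory" clobber is my addition to stop the compiler (as opposed to the CPU) from moving memory accesses across the read.

```c
#include <stdint.h>

/* RDTSCP waits for earlier instructions to finish; the LFENCE right after
 * it keeps later instructions from starting until the TSC has been read. */
static inline uint64_t rdtscp_lfence(void)
{
    uint32_t lo, hi;

    __asm__ volatile ("rdtscp\n\t"
                      "lfence"
                      : "=a" (lo), "=d" (hi)
                      : /* no inputs */
                      : "%rcx", "memory");

    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* Hypothetical workload to time: sums 0..n-1. The volatile accumulator
 * stops the compiler from optimizing the loop away. */
static uint64_t busy_work(uint64_t n)
{
    volatile uint64_t sum = 0;

    for (uint64_t i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

Reading rdtscp_lfence() before and after a call to busy_work() and subtracting the two values gives the elapsed TSC ticks for that call.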
What is the Time Stamp Counter
On modern CPUs, the TSC measures wall time.
On older CPUs, the TSC counted internal processor clock cycles, so its frequency changed whenever frequency scaling (e.g. SpeedStep) changed the processor's clock.
Since Intel Nehalem (2008), CPUs have an "invariant TSC" that increments at a constant rate, regardless of TurboBoost and ACPI P-states, C-states and T-states.
Whether your particular CPU has an invariant TSC can be detected using the CPUID instruction. You can even do this on the command line:
grep -c constant_tsc /proc/cpuinfo
Note that what Intel calls "invariant TSC," Linux splits into two bits: constant_tsc if the frequency is constant, and nonstop_tsc if the TSC doesn't stop in C-states.
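For the CPUID route, a minimal sketch follows. It relies on the invariant TSC flag being bit 8 of edx in CPUID leaf 0x80000007, and uses the __get_cpuid helper from the cpuid.h header shipped with GCC and Clang. The function name has_invariant_tsc is made up for illustration.

```c
#include <cpuid.h>

/* Returns 1 if CPUID reports an invariant TSC, 0 otherwise. */
static int has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;

    /* Leaf 0x80000007, edx bit 8: invariant TSC. */
    return (edx >> 8) & 1;
}
```

Note that a hypervisor may hide this bit from guests, so a 0 inside a virtual machine doesn't necessarily describe the underlying hardware.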
An invariant TSC is guaranteed to be monotonic (i.e. it doesn't decrease), and is synchronized among logical CPUs on the same package. According to an anonymous kernel person, there's a master TSC in the CPU package. All logical CPUs read this, and can add a per-logical-core offset that can be written to an MSR. I've heard of TSCs becoming desynchronized between packages, but not within a package.
Modern CPUs will both overclock and underclock themselves
Although the TSC's frequency is constant, the speed of the CPU is not. I have a separate page about benchmarking and turbo which has graphs of what this looks like and how to control for it.
The short version is:
- Keep one core busy:
while :; do :; done
- Turn off TurboBoost:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
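If a benchmark wants to check the turbo setting itself before it starts timing, reading the same sysfs file back is enough. This is a sketch assuming the intel_pstate driver is in use. The function name turbo_disabled is illustrative, and the file simply won't exist under other drivers.

```c
#include <stdio.h>

/* Returns 1 if the file at path holds a non-zero number (turbo disabled),
 * 0 if it holds zero (turbo enabled), and -1 if it can't be read or parsed. */
static int turbo_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;
    int parsed;

    if (f == NULL)
        return -1;

    parsed = (fscanf(f, "%d", &value) == 1);
    fclose(f);

    if (!parsed)
        return -1;
    return value != 0;
}
```

Calling turbo_disabled("/sys/devices/system/cpu/intel_pstate/no_turbo") and refusing to run unless it returns 1 is one way to avoid accidentally benchmarking with TurboBoost on.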
What is the frequency of the TSC
The Intel manual warns that the frequency of the TSC, while constant, is not necessarily the "maximum qualified frequency" of the processor, or the frequency given in the brand string.
e.g. On my system, /proc/cpuinfo says:
model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
/sys/devices/system/cpu/cpu0/cpufreq/ is 1,600,000.
/sys/devices/system/cpu/cpu0/cpufreq/ is 5,900,000.
(these values are documented to be in kHz)
When idle, /sys/devices/system/cpu/cpu0/cpufreq/ displays about 2,000,000.
When the CPU is spinning, it reports 3,799,898.
When idle, cpufreq-aperf reports numbers between 3,000,000 and 4,000,000.
When spinning, it reports all eight (logical) cores pegged to 6,549,000.
If I disable TurboBoost, that number becomes 5,900,000.
When I measure using roughly:
RDTSCP
clock_gettime(CLOCK_MONOTONIC)
usleep()
RDTSCP
clock_gettime(CLOCK_MONOTONIC)
I get about 3,411,000,000 TSC increments per second, regardless of how busy the CPU is.
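The measurement sequence above can be fleshed out as follows. This is my reconstruction of the rough recipe, not the exact program used for the numbers here. The helper names monotonic_ns and tsc_hz are made up, and it assumes Linux on x86-64 with GCC or Clang.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static inline uint64_t rdtscp(void)
{
    uint32_t lo, hi;

    __asm__ volatile ("rdtscp" : "=a" (lo), "=d" (hi) : : "%rcx");
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* CLOCK_MONOTONIC as a single nanosecond count. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Estimate TSC increments per second by sleeping for sleep_us microseconds
 * and comparing the TSC delta against the CLOCK_MONOTONIC delta. */
static double tsc_hz(unsigned int sleep_us)
{
    uint64_t tsc0 = rdtscp();
    uint64_t ns0 = monotonic_ns();

    usleep(sleep_us);

    uint64_t tsc1 = rdtscp();
    uint64_t ns1 = monotonic_ns();

    return (double)(tsc1 - tsc0) * 1e9 / (double)(ns1 - ns0);
}
```

Calling tsc_hz(100000) sleeps for about a tenth of a second. Longer sleeps reduce the relative error from the two clock reads not being simultaneous. At roughly 3,411,000,000 increments per second, one TSC tick is about 0.293 ns.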
I don't know how much the TSC frequency drifts.