First written 2014-03-14.
Updated 2016-07-31.

These are my notes on benchmarking CPU-bound code.

This article is specific to the i7-2600K Sandy Bridge CPU I'm currently (2014) using, but a lot of it will generalize to both earlier and later CPUs.

RDTSCP

The RDTSCP instruction reads the Time Stamp Counter (TSC). It's an unprivileged instruction, so reading it doesn't require calling into the kernel "or access to a platform resource" (a phrase the Intel manuals use but don't define). This makes it very fast to read, and it has a high resolution, so the Intel manuals recommend it over other clocks in your system such as the HPET or the ACPI timer.

How to call RDTSCP from C

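#include <stdint.h>

/* Read the 64-bit Time Stamp Counter using RDTSCP. */
uint64_t rdtscp(void) {
  uint32_t lo, hi;  /* low and high 32 bits, returned in EAX and EDX */
  __asm__ volatile ("rdtscp"
      : /* outputs */ "=a" (lo), "=d" (hi)
      : /* no inputs */
      : /* clobbers */ "%rcx");  /* RDTSCP also writes IA32_TSC_AUX to ECX */
  return (uint64_t)lo | (((uint64_t)hi) << 32);
}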
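As a rough usage sketch, the wrapper can bracket a region of code to count how many TSC ticks it took. The tsc_elapsed() helper and the spin() workload below are hypothetical names made up for illustration, and the wrapper is repeated so the sketch compiles on its own. The result is in TSC ticks, which are not the same thing as core clock cycles once frequency scaling is involved.

```c
#include <stdint.h>

/* Same wrapper as above, repeated so this sketch is self-contained. */
static inline uint64_t rdtscp(void) {
  uint32_t lo, hi;
  __asm__ volatile ("rdtscp" : "=a" (lo), "=d" (hi) : : "%rcx");
  return (uint64_t)lo | (((uint64_t)hi) << 32);
}

/* Hypothetical helper: TSC ticks elapsed across one call to fn(arg). */
static uint64_t tsc_elapsed(void (*fn)(void *), void *arg) {
  uint64_t start = rdtscp();
  fn(arg);
  uint64_t end = rdtscp();
  return end - start;
}

/* Hypothetical workload: spin for *arg iterations. The volatile sink
 * keeps the compiler from deleting the loop. */
static void spin(void *arg) {
  volatile uint64_t sink = 0;
  uint64_t n = *(uint64_t *)arg;
  for (uint64_t i = 0; i < n; i++)
    sink += i;
}
```

Calling tsc_elapsed(spin, &n) with n set to some iteration count returns the tick delta for that run. A single measurement is noisy, so in practice you'd probably repeat it many times and look at the distribution.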

How RDTSCP works

Some of the finer points:

What is the Time Stamp Counter

On modern CPUs, the TSC ticks at a constant rate, so it effectively measures wall time rather than CPU cycles.

On older CPUs, the TSC counted internal processor clock cycles. This means its frequency changed whenever frequency scaling (e.g. SpeedStep) changed the processor's clock speed.

Since Intel Nehalem (2008), CPUs have an "invariant TSC" that increments at a constant rate, regardless of TurboBoost and ACPI P-states, C-states and T-states.

Whether your particular CPU has an invariant TSC can be detected using the CPUID instruction. On Linux, you can even do this from the command line:

grep -c constant_tsc /proc/cpuinfo

Note that what Intel calls "invariant TSC," Linux splits into two bits: constant_tsc if the frequency is constant, and nonstop_tsc if the TSC doesn't stop in C-states.
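The CPUID check itself can be sketched in C. This assumes GCC or Clang on x86, whose <cpuid.h> header provides __get_cpuid(). The invariant TSC flag is bit 8 of EDX in CPUID leaf 0x80000007. The has_invariant_tsc() name is made up for this example.

```c
#include <cpuid.h>    /* GCC/Clang helper for the CPUID instruction */
#include <stdbool.h>

/* Returns true if CPUID reports an invariant TSC
 * (leaf 0x80000007, EDX bit 8). */
static bool has_invariant_tsc(void) {
  unsigned int eax, ebx, ecx, edx;
  /* __get_cpuid() returns 0 if the CPU doesn't support this leaf. */
  if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
    return false;
  return ((edx >> 8) & 1) != 0;
}
```

Note that a virtual machine may hide this bit from the guest, so a false result there doesn't necessarily describe the underlying hardware.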

An invariant TSC is guaranteed to be monotonic (i.e. it doesn't decrease), and is synchronized among logical CPUs on the same package. According to an anonymous kernel person, there's a master TSC in the CPU package. All logical CPUs read this, and can add a per-logical-core offset that can be written to an MSR. I've heard of TSCs becoming desynchronized between packages, but not within a package.
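If you're worried about comparing TSC values read on different logical CPUs, one possible guard is to look at the value RDTSCP leaves in ECX, which is a copy of the IA32_TSC_AUX MSR. The sketch below assumes the OS has loaded a distinct value into that MSR on each logical CPU (as far as I know Linux does this, but treat it as an assumption). The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Like rdtscp(), but also returns the IA32_TSC_AUX value from ECX. */
static inline uint64_t rdtscp_aux(uint32_t *aux) {
  uint32_t lo, hi, c;
  __asm__ volatile ("rdtscp" : "=a" (lo), "=d" (hi), "=c" (c));
  *aux = c;
  return (uint64_t)lo | (((uint64_t)hi) << 32);
}

/* Hypothetical helper: time fn(arg) and store the tick delta in *ticks.
 * Returns false if the aux value differs between the two reads, which
 * suggests the thread ran on a different logical CPU at the end. */
static bool ticks_same_cpu(void (*fn)(void *), void *arg, uint64_t *ticks) {
  uint32_t aux0, aux1;
  uint64_t t0 = rdtscp_aux(&aux0);
  fn(arg);
  uint64_t t1 = rdtscp_aux(&aux1);
  *ticks = t1 - t0;
  return aux0 == aux1;
}

/* Hypothetical empty workload, just for exercising the helper. */
static void nop_work(void *arg) { (void)arg; }
```

Matching aux values don't prove the thread never migrated (it could have moved away and back), so this only catches the obvious case. Pinning the thread to one CPU is the more robust option.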

Modern CPUs will both overclock and underclock themselves

Although the TSC's frequency is constant, the speed of the CPU is not. I have a separate page about benchmarking and turbo which has graphs of what this looks like and how to control for it.

The short version is:

What is the frequency of the TSC

The Intel manual warns that the frequency of the TSC, while constant, is not necessarily the "maximum qualified frequency" of the processor, or the frequency given in the brand string.

e.g. On my system, /proc/cpuinfo says:

model name  : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq is 1,600,000.
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq is 5,900,000.
(these values are documented to be in kHz)

When idle, /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq displays about 2,000,000.
When the CPU is spinning, it reports 3,799,898.

When idle, cpufreq-aperf reports numbers between 3,000,000 and 4,000,000.
When spinning, it reports all eight (logical) cores pegged to 6,549,000.
If I disable TurboBoost, that number becomes 5,900,000.

When I measure using roughly:

RDTSCP
clock_gettime(CLOCK_MONOTONIC)
usleep()
RDTSCP
clock_gettime(CLOCK_MONOTONIC)

I get about 3,411,000,000 TSC increments per second, regardless of how busy the CPU is.
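The measurement outlined above could be fleshed out roughly as follows. This is a sketch rather than the exact program I ran: monotonic_ns() and estimate_tsc_hz() are names invented here, and the sleep length is an arbitrary parameter.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes usleep() and clock_gettime() under strict -std= modes */
#endif
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static inline uint64_t rdtscp(void) {
  uint32_t lo, hi;
  __asm__ volatile ("rdtscp" : "=a" (lo), "=d" (hi) : : "%rcx");
  return (uint64_t)lo | (((uint64_t)hi) << 32);
}

/* CLOCK_MONOTONIC as a single nanosecond count. */
static uint64_t monotonic_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: estimate TSC ticks per second by sleeping for
 * sleep_us microseconds (keep it under 1,000,000 to be safe with usleep)
 * and comparing the TSC delta against the CLOCK_MONOTONIC delta. */
static double estimate_tsc_hz(unsigned int sleep_us) {
  uint64_t tsc0 = rdtscp();
  uint64_t ns0 = monotonic_ns();
  usleep(sleep_us);
  uint64_t tsc1 = rdtscp();
  uint64_t ns1 = monotonic_ns();
  return (double)(tsc1 - tsc0) * 1e9 / (double)(ns1 - ns0);
}
```

The two clocks are read a few nanoseconds apart at each end, so a longer sleep gives a better estimate because that fixed skew becomes a smaller fraction of the interval.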

I don't know how much the TSC frequency drifts.