VDPPS is slow

5 Jan 2014.

VDPPS is the AVX instruction for Dot Product of Packed Single Precision Floating-Point Values.

As of Jan 2014, on my Intel® Core™ i7-2600K CPU @ 3.40GHz (Sandy Bridge), a 3-vector dot product built on VDPPS takes 128.39% +/- 0.04% longer, at 95% confidence, than a non-vectorized implementation. (That is, the VDPPS version has 2.28x the runtime of the alternative.)

VDPPS calculates the dot product of two 128-bit XMM registers, each containing up to four floats. Using a mask, we can exclude some of the components from the sum, e.g. to calculate the dot product of 3-vectors.
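To make the mask concrete, here's a scalar model of what the 128-bit instruction computes. It's a sketch for illustration, not how the hardware works: `dpps_model` is a made-up name, and the pairwise summation order is an assumption based on Intel's published pseudocode.

```cpp
#include <array>
#include <cstdint>

// Scalar model of 128-bit DPPS/VDPPS (illustrative only).
std::array<float, 4> dpps_model(const std::array<float, 4>& a,
                                const std::array<float, 4>& b,
                                std::uint8_t imm8) {
  // High nibble: bit (4 + i) decides whether product i takes part in the sum.
  float p[4];
  for (int i = 0; i < 4; ++i) {
    p[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;
  }
  // Assumed summation order: pairs first, then the two partial sums.
  const float sum = (p[0] + p[1]) + (p[2] + p[3]);
  // Low nibble: bit i decides whether output slot i gets the sum or zero.
  std::array<float, 4> out{};
  for (int i = 0; i < 4; ++i) {
    if (imm8 & (1u << i)) out[i] = sum;
  }
  return out;
}
```

With a mask of 0x71, the inputs {1, 2, 3, 4} and {5, 6, 7, 8} give 38 in slot 0 and zero in the other slots: the fourth component is left out of the sum.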

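The 128-bit form of VDPPS computes the same result as DPPS from SSE4.1; it is the VEX-encoded version, which takes a separate destination operand.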

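There's a 256-bit variant of VDPPS which operates on eight floats, though it computes two independent four-float dot products, one per 128-bit lane, rather than a single eight-element one. There's also a 128-bit VDPPD which operates on two doubles, but there's no instruction for a dot product of four doubles.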

Comparing implementations

Here's a naive C++ implementation of 3-vector dot product:

struct vec3f {
  float x;
  float y;
  float z;
};

float dot3_v1(const vec3f& a, const vec3f& b) {
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

Here's what gcc 4.8.2 generates with -O3 -march=corei7-avx -masm=intel -fverbose-asm -S:

vmovss  xmm1, DWORD PTR [rdi]   # a_2(D)->x, a_2(D)->x
vmovss  xmm0, DWORD PTR [rdi+4] # a_2(D)->y, a_2(D)->y
vmulss  xmm1, xmm1, DWORD PTR [rsi]     # D.7524, a_2(D)->x, b_4(D)->x
vmulss  xmm0, xmm0, DWORD PTR [rsi+4]   # D.7524, a_2(D)->y, b_4(D)->y
vaddss  xmm1, xmm1, xmm0        # D.7524, D.7524, D.7524
vmovss  xmm0, DWORD PTR [rdi+8] # a_2(D)->z, a_2(D)->z
vmulss  xmm0, xmm0, DWORD PTR [rsi+8]   # D.7524, a_2(D)->z, b_4(D)->z
vaddss  xmm0, xmm1, xmm0        # D.7524, D.7524, D.7524
ret

It's using the scalar AVX instructions to do floating point math, which is expected. Vectorized code would use packed single precision instructions like vmulps instead of the scalar single precision vmulss.
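To make the scalar/packed distinction concrete, here's a rough sketch of a packed version built from one multiply plus shuffles, without VDPPS. It isn't part of the benchmark and hasn't been timed here. `dot3_v3` and the 16-byte-aligned `vec4f` are names chosen for illustration, and the scalar fallback exists only so the snippet builds on non-x86 targets.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE: _mm_mul_ps, _mm_shuffle_ps, _mm_add_ss
#endif

// Padded to four floats and aligned so _mm_load_ps is legal.
struct alignas(16) vec4f {
  float x, y, z, w;
};

float dot3_v3(const vec4f& a, const vec4f& b) {
#if defined(__SSE__)
  // One packed multiply: (ax*bx, ay*by, az*bz, aw*bw).
  __m128 p = _mm_mul_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x));
  // Move products 1 and 2 down to slot 0 so scalar adds can reach them.
  __m128 y = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));
  __m128 z = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2));
  // (p0 + p1) + p2; the w product is never added.
  return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(p, y), z));
#else
  return a.x * b.x + a.y * b.y + a.z * b.z;  // portable fallback
#endif
}
```

Built with AVX enabled, gcc would be expected to emit vmulps for the multiply, though the exact output depends on the compiler version and flags.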

Here's how I coaxed gcc into emitting VDPPS:

struct vec4f {
  float x;
  float y;
  float z;
  float w;
};

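// Requires <smmintrin.h> (SSE4.1); a and b must be 16-byte aligned for _mm_load_ps.
float dot3_v2(const vec4f& a, const vec4f& b) {
  __m128 u = _mm_load_ps(&(a.x));
  __m128 v = _mm_load_ps(&(b.x));
  // Mask:
  // 4 high bits: which elements should be summed. (w,z,y,x)
  // 4 low bits: which output slots should contain the result. (3,2,1,0)
  const int mask = 0b01110001;  // 0x71 = 113; must be a compile-time constant
  return _mm_cvtss_f32(_mm_dp_ps(u, v, mask));  // read output slot 0
}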

The resulting assembly is much shorter:

vmovaps xmm1, XMMWORD PTR [rdi]
vdpps   xmm0, xmm1, XMMWORD PTR [rsi], 113
ret

Benchmarking

If you'd like to try it yourself, the source code plus test rig is here: dot.cc
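For a rough idea of what such a rig involves, here's a minimal timing sketch. It is not the contents of dot.cc: `time_pass`, `dot3` and the use of std::chrono are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct vec3f {
  float x, y, z;
};

float dot3(const vec3f& a, const vec3f& b) {
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Times one pass over the inputs and returns the elapsed seconds.
// The running sum is stored through `sink` so the compiler can't drop the loop.
double time_pass(const std::vector<vec3f>& a, const std::vector<vec3f>& b,
                 float* sink) {
  const auto start = std::chrono::steady_clock::now();
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += dot3(a[i], b[i]);
  }
  const auto stop = std::chrono::steady_clock::now();
  *sink = sum;
  return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_pass` many times and printing one sample per line gives a file per implementation, which is the input format ministat reads.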

My ministat results:

  N           Min           Max        Median           Avg        Stddev
 v1   0.009199861   0.009354274   0.009204123   0.009207761 1.5952315e-05
 v2    0.02101862   0.021209983    0.02102264   0.021029907 2.3628586e-05
Difference at 95.0% confidence
  0.0118221 +/- 3.94136e-06
  128.393% +/- 0.0428048%
  (Student's t, pooled s = 2.01592e-05)
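The percentage and the 2.28x figure quoted at the top both follow from the two Avg values. A small sketch of that arithmetic, with `compare_averages` as a made-up helper name:

```cpp
struct Slowdown {
  double diff;     // v2 average minus v1 average, in the benchmark's units
  double percent;  // diff as a percentage of the v1 average
  double ratio;    // v2 average divided by v1 average
};

Slowdown compare_averages(double v1_avg, double v2_avg) {
  Slowdown s;
  s.diff = v2_avg - v1_avg;
  s.percent = 100.0 * s.diff / v1_avg;
  s.ratio = v2_avg / v1_avg;
  return s;
}
```

Feeding in 0.009207761 and 0.021029907 gives a difference of about 0.0118221, which is about 128.39% of the v1 average, and a ratio of about 2.28.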

Benchmarking tip: if your CPU has Turbo Boost like mine, you can dramatically shorten the tail on the distribution of your early measurements by keeping one of the cores busy. An easy way to do that is to open a terminal window and run:

perl -e 'while(1) {}'

It only takes a fraction of a second to take effect.