5 Jan 2014.

`VDPPS` is the `AVX` instruction for Dot Product of Packed Single Precision Floating-Point Values.

As of Jan 2014, on my Intel® Core™ i7-2600K CPU @ 3.40GHz (Sandy Bridge), `VDPPS` is 128.39% +/- 0.04% slower at 95% confidence than a non-vectorized implementation of dot product. (That is, `VDPPS` has 2.28x the runtime of the alternative.)

`VDPPS` calculates the dot product of two 128-bit `XMM` registers, each containing up to four `float`s. Using a mask, we can exclude some of the components from the sum, e.g. to calculate the dot product of 3-vectors. `VDPPS` behaves the same as `DPPS` from `SSE4.1`.

There's a 256-bit variant of `VDPPS` which can operate on eight `float`s, and a 128-bit `VDPPD` which operates on two `double`s, but there's no instruction for a dot product of four `double`s.

## Comparing implementations

Here's a naive C++ implementation of 3-vector dot product:

```cpp
struct vec3f { float x; float y; float z; };

float dot3_v1(const vec3f& a, const vec3f& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

Here's what `gcc 4.8.2` generates with `-O3 -march=corei7-avx -masm=intel -fverbose-asm -S`:

```asm
vmovss  xmm1, DWORD PTR [rdi]          # a_2(D)->x, a_2(D)->x
vmovss  xmm0, DWORD PTR [rdi+4]        # a_2(D)->y, a_2(D)->y
vmulss  xmm1, xmm1, DWORD PTR [rsi]    # D.7524, a_2(D)->x, b_4(D)->x
vmulss  xmm0, xmm0, DWORD PTR [rsi+4]  # D.7524, a_2(D)->y, b_4(D)->y
vaddss  xmm1, xmm1, xmm0               # D.7524, D.7524, D.7524
vmovss  xmm0, DWORD PTR [rdi+8]        # a_2(D)->z, a_2(D)->z
vmulss  xmm0, xmm0, DWORD PTR [rsi+8]  # D.7524, a_2(D)->z, b_4(D)->z
vaddss  xmm0, xmm1, xmm0               # D.7524, D.7524, D.7524
ret
```

It's using the scalar `AVX` instructions to do floating point math, which is expected. Vectorized code would use **packed** single precision instructions like `vmulps` instead of the **scalar** single precision `vmulss`.

Here's how I coaxed `gcc` into emitting `VDPPS`:

```cpp
struct vec4f { float x; float y; float z; float w; };

float dot3_v2(const vec4f& a, const vec4f& b) {
    __v4sf u = _mm_load_ps(&(a.x));
    __v4sf v = _mm_load_ps(&(b.x));
    // Mask:
    //   4 high bits: which elements should be summed. (w,z,y,x)
    //   4 low bits: which output slots should contain the result. (3,2,1,0)
    int mask = 0b01110001;
    return _mm_dp_ps(u, v, mask)[0];
}
```

The resulting assembler is much shorter:

```asm
vmovaps xmm1, XMMWORD PTR [rdi]
vdpps   xmm0, xmm1, XMMWORD PTR [rsi], 113
ret
```

## Benchmarking

If you'd like to try it yourself, the source code plus test rig is here: dot.cc

My `ministat` results:

```
      N          Min          Max       Median          Avg        Stddev
v1       0.009199861  0.009354274  0.009204123  0.009207761  1.5952315e-05
v2       0.02101862   0.021209983  0.02102264   0.021029907  2.3628586e-05
Difference at 95.0% confidence
        0.0118221 +/- 3.94136e-06
        128.393% +/- 0.0428048%
        (Student's t, pooled s = 2.01592e-05)
```

Benchmarking tip: if your CPU has Turbo Boost like mine, you can dramatically shorten the tail on the distribution of your early measurements by keeping one of the cores busy. An easy way to do that is to open a window and run:

`perl -e 'while(1) {}'`

It only takes a fraction of a second to take effect.