5 Jan 2014.

`VDPPS` is the `AVX` instruction for Dot Product of Packed Single Precision Floating-Point Values.

As of Jan 2014, on my Intel® Core™ i7-2600K CPU @ 3.40GHz (Sandy Bridge), `VDPPS` is 128.39% +/- 0.04% slower at 95% confidence than a non-vectorized implementation of dot product. (That is, `VDPPS` has 2.28x the runtime of the alternative.)

`VDPPS` calculates the dot product of two 128-bit `XMM` registers, each containing up to four `float`s. Using a mask, we can exclude some of the components from the sum, e.g. to calculate the dot product of 3-vectors.

`VDPPS` behaves the same as `DPPS` from `SSE4.1`.

There's a 256-bit variant of `VDPPS` which can operate on eight `float`s, and a 128-bit `VDPPD` which operates on two `double`s, but there's no instruction for a dot product of four `double`s.

## Comparing implementations

Here's a naive C++ implementation of 3-vector dot product:

```
struct vec3f {
    float x;
    float y;
    float z;
};

float dot3_v1(const vec3f& a, const vec3f& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

Here's what `gcc 4.8.2` generates with `-O3 -march=corei7-avx -masm=intel -fverbose-asm -S`:

```
vmovss  xmm1, DWORD PTR [rdi]   # a_2(D)->x, a_2(D)->x
vmovss  xmm0, DWORD PTR [rdi+4] # a_2(D)->y, a_2(D)->y
vmulss  xmm1, xmm1, DWORD PTR [rsi]     # D.7524, a_2(D)->x, b_4(D)->x
vmulss  xmm0, xmm0, DWORD PTR [rsi+4]   # D.7524, a_2(D)->y, b_4(D)->y
vaddss  xmm1, xmm1, xmm0        # D.7524, D.7524, D.7524
vmovss  xmm0, DWORD PTR [rdi+8] # a_2(D)->z, a_2(D)->z
vmulss  xmm0, xmm0, DWORD PTR [rsi+8]   # D.7524, a_2(D)->z, b_4(D)->z
vaddss  xmm0, xmm1, xmm0        # D.7524, D.7524, D.7524
ret
```

It's using the scalar `AVX` instructions to do floating point math, which is expected. Vectorized code would use packed single precision instructions like `vmulps` instead of the scalar single precision `vmulss`.

Here's how I coaxed `gcc` into emitting `VDPPS`:

```
#include <x86intrin.h>

struct vec4f {
    float x;
    float y;
    float z;
    float w;
};

float dot3_v2(const vec4f& a, const vec4f& b) {
    // 4 high bits: which elements should be summed. (w,z,y,x)
    // 4 low bits: which output slots should contain the result. (3,2,1,0)
    const int mask = 0x71;
    __m128 va = _mm_load_ps(&a.x); // assumes 16-byte alignment
    __m128 vb = _mm_load_ps(&b.x);
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, mask));
}
```

The resulting assembler is much shorter; note the immediate 113 is the 0x71 mask:

```
vmovaps xmm1, XMMWORD PTR [rdi]
vdpps   xmm0, xmm1, XMMWORD PTR [rsi], 113
ret
```

## Benchmarking

If you'd like to try it yourself, the source code plus test rig is here: dot.cc
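For reference, a minimal timing loop in the spirit of such a rig might look like the sketch below. The iteration count and the `volatile` sink are my own choices, not necessarily what dot.cc does:

```cpp
#include <chrono>

// Sketch: time `iters` calls of f(), returning elapsed seconds.
// The volatile sink keeps the compiler from deleting the work.
template <typename F>
double time_it(F f, int iters) {
    volatile float sink = 0.0f;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        sink += f();
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Run each variant many times and feed the per-run timings to `ministat` for the comparison below.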

My `ministat` results:

```
     N           Min           Max        Median           Avg        Stddev
v1   0.009199861   0.009354274   0.009204123   0.009207761 1.5952315e-05
v2    0.02101862   0.021209983    0.02102264   0.021029907 2.3628586e-05
Difference at 95.0% confidence
        0.0118221 +/- 3.94136e-06
        128.393% +/- 0.0428048%
        (Student's t, pooled s = 2.01592e-05)
```

Benchmarking tip: if your CPU has Turbo Boost like mine, you can dramatically shorten the tail on the distribution of your early measurements by keeping one of the cores busy. An easy way to do that is to open a terminal window and run:

`perl -e 'while(1) {}'`

It only takes a fraction of a second to take effect.