5 Jan 2014.
VDPPS is the AVX instruction for Dot Product of Packed Single Precision Floating-Point Values.
As of Jan 2014, on my Intel® Core™ i7-2600K CPU @ 3.40GHz (Sandy Bridge), VDPPS is 128.39% +/- 0.04% slower at 95% confidence than a non-vectorized implementation of dot product. (That is, VDPPS has 2.28x the runtime of the alternative.)
VDPPS calculates the dot product of two 128-bit XMM registers, each containing up to four floats. Using a mask, we can exclude some of the components from the sum, e.g. to calculate the dot product of 3-vectors.
VDPPS behaves the same as DPPS from SSE4.1.
There's a 256-bit variant of VDPPS which can operate on eight floats, and a 128-bit VDPPD which operates on two doubles, but there's no instruction for a dot product of four doubles.
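Since there's no four-double dot product instruction, one way to get it is a packed multiply followed by horizontal adds. Here's a sketch (my own, unbenchmarked; dot4d is my name, and it assumes 32-byte-aligned inputs):

#include <immintrin.h>

// Dot product of two 4-vectors of doubles, composed from AVX instructions.
double dot4d(const double* a, const double* b) {
    __m256d p = _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b));
    // hadd within each 128-bit lane: (p0+p1, p0+p1, p2+p3, p2+p3)
    __m256d h = _mm256_hadd_pd(p, p);
    // add the low lane to the high lane and extract slot 0
    __m128d lo = _mm256_castpd256_pd128(h);
    __m128d hi = _mm256_extractf128_pd(h, 1);
    return _mm_cvtsd_f64(_mm_add_pd(lo, hi));
}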
Comparing implementations
Here's a naive C implementation of 3-vector dot product:
struct vec3f {
float x;
float y;
float z;
};
float dot3_v1(const vec3f& a, const vec3f& b) {
return a.x * b.x + a.y * b.y + a.z * b.z;
}
Here's what gcc 4.8.2 generates with -O3 -march=corei7-avx -masm=intel -fverbose-asm -S:
vmovss  xmm1, xmm1, DWORD PTR [rdi]      # a_2(D)->x, a_2(D)->x
vmovss  xmm0, xmm0, DWORD PTR [rdi+4]    # a_2(D)->y, a_2(D)->y
vmulss  xmm1, xmm1, DWORD PTR [rsi]      # D.7524, a_2(D)->x, b_4(D)->x
vmulss  xmm0, xmm0, DWORD PTR [rsi+4]    # D.7524, a_2(D)->y, b_4(D)->y
vaddss  xmm1, xmm1, xmm0                 # D.7524, D.7524, D.7524
vmovss  xmm0, xmm0, DWORD PTR [rdi+8]    # a_2(D)->z, a_2(D)->z
vmulss  xmm0, xmm0, DWORD PTR [rsi+8]    # D.7524, a_2(D)->z, b_4(D)->z
vaddss  xmm0, xmm1, xmm0                 # D.7524, D.7524, D.7524
ret
It's using scalar AVX instructions to do the floating-point math, which is expected. Vectorized code would use packed single-precision instructions like vmulps instead of the scalar single-precision vmulss.
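For illustration, here's a sketch of a hand-vectorized version built from packed multiplies and shuffles rather than vdpps (dot3_packed is my name, it's not part of the benchmark below, and it assumes the vectors are padded to four 16-byte-aligned floats):

#include <immintrin.h>

float dot3_packed(const float* a, const float* b) {
    __m128 p = _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b));  // (x, y, z, w) products
    // zero the w product so it can't pollute the sum
    p = _mm_and_ps(p, _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1)));
    __m128 s = _mm_add_ps(p, _mm_movehl_ps(p, p));  // slot 0: x+z, slot 1: y+0
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));     // slot 0: x+y+z
    return _mm_cvtss_f32(s);
}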
Here's how I coaxed gcc into emitting VDPPS:
#include <immintrin.h>

struct vec4f {
float x;
float y;
float z;
float w;
};
float dot3_v2(const vec4f& a, const vec4f& b) {
__v4sf u = _mm_load_ps(&(a.x));
__v4sf v = _mm_load_ps(&(b.x));
// Mask (must be a compile-time constant):
// 4 high bits: which input elements are summed. (w,z,y,x)
// 4 low bits: which output slots receive the result. (3,2,1,0)
const int mask = 0b01110001; // sum x, y, z; result in slot 0
return _mm_dp_ps(u, v, mask)[0];
}
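One caveat: _mm_load_ps requires its operand to be 16-byte aligned, and a plain vec4f only guarantees 4-byte alignment, so this relies on the compiler happening to align the structs. A sketch of a variant that tolerates any alignment (my change, not the version I benchmarked):

float dot3_v2_unaligned(const vec4f& a, const vec4f& b) {
    __m128 u = _mm_loadu_ps(&(a.x)); // vmovups instead of vmovaps
    __m128 v = _mm_loadu_ps(&(b.x));
    return _mm_cvtss_f32(_mm_dp_ps(u, v, 0b01110001));
}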
The resulting assembler is much shorter:
vmovaps xmm1, XMMWORD PTR [rdi]
vdpps   xmm0, xmm1, XMMWORD PTR [rsi], 113
ret
Benchmarking
If you'd like to try it yourself, the source code plus test rig is here: dot.cc
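For reference, this is roughly the shape of the timing loop, printing one sample per line so ministat can consume it (a simplified sketch; the names and constants aren't exactly what's in dot.cc):

#include <chrono>
#include <cstdio>

// Run f() iters times per sample, printing each sample's elapsed seconds
// on its own line.
template <typename F>
void bench(F f, int samples, int iters) {
    for (int s = 0; s < samples; ++s) {
        volatile float sink = 0; // keep the optimizer from discarding the work
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            sink = sink + f();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%.9f\n", std::chrono::duration<double>(t1 - t0).count());
    }
}

// e.g. bench([&] { return dot3_v1(a, b); }, 100, 1000000);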
My ministat results:
    N          Min           Max        Median           Avg         Stddev
v1       0.009199861   0.009354274   0.009204123   0.009207761   1.5952315e-05
v2       0.02101862    0.021209983   0.02102264    0.021029907   2.3628586e-05
Difference at 95.0% confidence
	0.0118221 +/- 3.94136e-06
	128.393% +/- 0.0428048%
	(Student's t, pooled s = 2.01592e-05)
Benchmarking tip: if your CPU has Turbo Boost like mine, you can dramatically shorten the tail on the distribution of your early measurements by keeping one of the cores busy. An easy way to do that is to open another terminal and run:
perl -e 'while(1) {}'
It only takes a fraction of a second to take effect.