5 Jan 2014.
VDPPS is the AVX instruction for Dot Product of Packed Single Precision Floating-Point Values. As of Jan 2014, on my Intel® Core™ i7-2600K CPU @ 3.40GHz (Sandy Bridge), VDPPS is 128.39% +/- 0.04% slower at 95% confidence than a non-vectorized implementation of dot product. (That is, VDPPS has 2.28x the runtime of the alternative.)
VDPPS calculates the dot product of two 128-bit XMM registers, each containing up to four floats. Using a mask, we can exclude some of the components from the sum, e.g. to calculate the dot product of 3-vectors.
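To make the mask concrete, here's a scalar model of what DPPS/VDPPS computes for one pair of registers. This is a sketch of the documented semantics (the names are mine), not production code:

// Scalar model of DPPS/VDPPS for one 128-bit operand pair. imm8's high
// nibble picks which products enter the sum; the low nibble picks which
// output slots receive it (the rest are zeroed).
void dpps_model(const float a[4], const float b[4], unsigned imm8, float out[4]) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (1u << (4 + i)))    // bit 4+i: include a[i]*b[i] in the sum
            sum += a[i] * b[i];
    for (int i = 0; i < 4; ++i)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;  // bit i: write sum or zero
}

With a mask of 0b01110001, only the x, y, and z products enter the sum and the result lands in slot 0, which is exactly the 3-vector case used below.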
VDPPS behaves the same as DPPS from SSE4.1. There's a 256-bit variant of VDPPS which can operate on eight floats, and a 128-bit VDPPD which operates on two doubles, but there's no instruction for a dot product of four doubles.
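One wrinkle with the 256-bit variant: it computes two independent 4-float dot products, one per 128-bit lane, rather than a single 8-element sum. A minimal sketch of building an 8-float dot product on top of it (the function name and the 32-byte alignment assumption are mine):

#include <immintrin.h>

// Sketch: 256-bit VDPPS produces one dot product per 128-bit lane,
// so an 8-float dot product still needs a cross-lane add at the end.
float dot8(const float* a, const float* b) {  // a, b: 32-byte aligned, 8 floats each
    __m256 u = _mm256_load_ps(a);
    __m256 v = _mm256_load_ps(b);
    __m256 d = _mm256_dp_ps(u, v, 0xF1);       // per-lane dot product into slot 0
    __m128 lo = _mm256_castps256_ps128(d);     // lane 0 result
    __m128 hi = _mm256_extractf128_ps(d, 1);   // lane 1 result
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));  // combine the two lanes
}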
Comparing implementations
Here's a naive C++ implementation of 3-vector dot product:
struct vec3f { float x; float y; float z; };

float dot3_v1(const vec3f& a, const vec3f& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
Here's what gcc 4.8.2 generates with -O3 -march=corei7-avx -masm=intel -fverbose-asm -S:
vmovss  xmm1, xmm1, DWORD PTR [rdi]     # a_2(D)->x, a_2(D)->x
vmovss  xmm0, xmm0, DWORD PTR [rdi+4]   # a_2(D)->y, a_2(D)->y
vmulss  xmm1, xmm1, DWORD PTR [rsi]     # D.7524, a_2(D)->x, b_4(D)->x
vmulss  xmm0, xmm0, DWORD PTR [rsi+4]   # D.7524, a_2(D)->y, b_4(D)->y
vaddss  xmm1, xmm1, xmm0                # D.7524, D.7524, D.7524
vmovss  xmm0, DWORD PTR [rdi+8]         # a_2(D)->z, a_2(D)->z
vmulss  xmm0, xmm0, DWORD PTR [rsi+8]   # D.7524, a_2(D)->z, b_4(D)->z
vaddss  xmm0, xmm1, xmm0                # D.7524, D.7524, D.7524
ret
It's using the scalar AVX instructions to do floating point math, which is expected. Vectorized code would use packed single precision instructions like vmulps instead of the scalar single precision vmulss.
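For contrast, here's roughly what such a packed implementation looks like using vmulps plus shuffles. This is my own illustration rather than anything gcc emitted here, and it assumes 16-byte-aligned input:

#include <immintrin.h>

// Packed 4-float dot product via vmulps and shuffles instead of VDPPS.
float dot4_mul_shuffle(const float* a, const float* b) {
    __m128 u = _mm_load_ps(a);
    __m128 v = _mm_load_ps(b);
    __m128 p = _mm_mul_ps(u, v);                     // vmulps: x*x', y*y', z*z', w*w'
    __m128 s = _mm_movehl_ps(p, p);                  // bring the upper pair down
    p = _mm_add_ps(p, s);                            // (x+z, y+w, ...)
    s = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)); // bring y+w into slot 0
    p = _mm_add_ss(p, s);                            // x+z+y+w in slot 0
    return _mm_cvtss_f32(p);
}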
Here's how I coaxed gcc into emitting VDPPS:
struct vec4f { float x; float y; float z; float w; };

float dot3_v2(const vec4f& a, const vec4f& b) {
    __v4sf u = _mm_load_ps(&(a.x));
    __v4sf v = _mm_load_ps(&(b.x));
    // Mask:
    // 4 high bits: which elements should be summed. (w,z,y,x)
    // 4 low bits: which output slots should contain the result. (3,2,1,0)
    int mask = 0b01110001;
    return _mm_dp_ps(u, v, mask)[0];
}
The resulting assembler is much shorter:
vmovaps xmm1, XMMWORD PTR [rdi]
vdpps   xmm0, xmm1, XMMWORD PTR [rsi], 113
ret
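One caveat: vmovaps and _mm_load_ps require 16-byte alignment, and a plain vec4f only guarantees 4-byte alignment, so dot3_v2 can fault if the structs land on an unaligned address. Two possible fixes, sketched below; the alignas struct and the unaligned-load variant are my additions, not part of the benchmark:

#include <immintrin.h>

// Either force the alignment the aligned load expects...
struct alignas(16) vec4f_aligned { float x; float y; float z; float w; };

// ...or use the unaligned load, which is safe for any vec4f.
float dot3_v2_unaligned(const vec4f& a, const vec4f& b) {
    __m128 u = _mm_loadu_ps(&a.x);
    __m128 v = _mm_loadu_ps(&b.x);
    return _mm_cvtss_f32(_mm_dp_ps(u, v, 0b01110001));
}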
Benchmarking
If you'd like to try it yourself, the source code plus test rig is here: dot.cc
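The rig isn't reproduced here, but the timing loop presumably looks something like the following sketch; the names and structure are hypothetical, so see dot.cc for the real thing:

#include <chrono>

// Hypothetical harness sketch, not the actual rig. Times `iters` calls of
// one implementation and returns seconds elapsed, suitable for printing
// once per run and feeding to ministat.
template <typename DotFn, typename Vec>
double time_batch(DotFn dot, const Vec& a, const Vec& b, int iters) {
    volatile float sink = 0.0f;  // defeat dead-code elimination
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        sink = sink + dot(a, b);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}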
My ministat results:

    N  Min          Max          Median       Avg          Stddev
v1     0.009199861  0.009354274  0.009204123  0.009207761  1.5952315e-05
v2     0.02101862   0.021209983  0.02102264   0.021029907  2.3628586e-05
Difference at 95.0% confidence
    0.0118221 +/- 3.94136e-06
    128.393% +/- 0.0428048%
    (Student's t, pooled s = 2.01592e-05)
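(As a sanity check, 0.021029907 / 0.009207761 ≈ 2.28, matching the 2.28x runtime figure above.)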
Benchmarking tip: if your CPU has Turbo Boost like mine, you can dramatically shorten the tail of the distribution of your early measurements by keeping one of the cores busy. An easy way to do that is to open a terminal window and run:
perl -e 'while(1) {}'
It only takes a fraction of a second to take effect.