8 Feb 2015.
Short version: write a naïve kernel. Before you look at the generated assembler, write a wrapper around the kernel.
Longer version
Consider this code:
void add1(float a[4], float b[4]) {
    for (int i = 0; i < 4; ++i) {
        a[i] += b[i];
    }
}
gcc version 4.9.1 invoked as g++ -S -O3 -march=corei7-avx -masm=intel add.cc
produces:
_Z4add1PfS_:
        vmovss  xmm0, DWORD PTR [rdi]        ; xmm0 = a[0]
        vaddss  xmm0, xmm0, DWORD PTR [rsi]  ; xmm0 = xmm0 + b[0]
        vmovss  DWORD PTR [rdi], xmm0        ; a[0] = xmm0
        vmovss  xmm0, DWORD PTR [rdi+4]
        vaddss  xmm0, xmm0, DWORD PTR [rsi+4]
        vmovss  DWORD PTR [rdi+4], xmm0
        vmovss  xmm0, DWORD PTR [rdi+8]
        vaddss  xmm0, xmm0, DWORD PTR [rsi+8]
        vmovss  DWORD PTR [rdi+8], xmm0
        vmovss  xmm0, DWORD PTR [rdi+12]
        vaddss  xmm0, xmm0, DWORD PTR [rsi+12]
        vmovss  DWORD PTR [rdi+12], xmm0
        ret
Decoding the name:
$ c++filt _Z4add1PfS_
add1(float*, float*)
$
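As an aside: if you'd rather demangle from code than shell out to c++filt, GCC and Clang expose the same machinery through cxxabi.h. A minimal sketch of my own, not part of the original post:

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int status = 0;
    // abi::__cxa_demangle returns a malloc'd string on success (status == 0).
    char* name = abi::__cxa_demangle("_Z4add1PfS_", nullptr, nullptr, &status);
    if (status == 0) {
        std::printf("%s\n", name);  // prints: add1(float*, float*)
        std::free(name);
    }
    return 0;
}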
The x86_64 calling convention is that the first two arguments are passed in the rdi and rsi registers, respectively. A float is 32 bits wide.
The compiler unrolls the loop into four copies of:
xmm0 = a[n]
xmm0 = xmm0 + b[n]
a[n] = xmm0
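In C terms, the unrolled body is roughly the following; this is my own sketch of the pattern, not code from the post:

// Roughly what gcc emitted for add1: four independent scalar
// load / add / store triples, one element at a time, no packed
// vector operations.
void add1_unrolled(float a[4], float b[4]) {
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
    a[3] += b[3];
}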
Why didn't it vectorize?
Alignment
I can explicitly tell the compiler to assume that a[] and b[] are aligned to the width of an xmm register (16 bytes):
void add2(float a_[4], float b_[4]) {
    float* a = (float*)__builtin_assume_aligned(a_, 16);
    float* b = (float*)__builtin_assume_aligned(b_, 16);
    for (int i = 0; i < 4; ++i) {
        a[i] += b[i];
    }
}
This produces:
_Z4add2PfS_:
        lea     rax, [rsi+16]   ; rax = b + 16
                                ; cmp sets EFLAGS like sub
        cmp     rdi, rax        ; a - (b + 16)
        jae     .L6             ; jump if above or equal: a >= b + 16
        lea     rax, [rdi+16]   ; rax = a + 16
        cmp     rsi, rax        ; b - (a + 16)
        jb      .L3             ; jump if below: (b < a + 16) && !(a >= b + 16)
.L6:
        vmovaps xmm0, XMMWORD PTR [rdi]
        vaddps  xmm0, xmm0, XMMWORD PTR [rsi]
        vmovaps XMMWORD PTR [rdi], xmm0
        ret
.L3:
        vmovss  xmm0, DWORD PTR [rdi]
        vaddss  xmm0, xmm0, DWORD PTR [rsi]
        vmovss  DWORD PTR [rdi], xmm0
        vmovss  xmm0, DWORD PTR [rdi+4]
        vaddss  xmm0, xmm0, DWORD PTR [rsi+4]
        vmovss  DWORD PTR [rdi+4], xmm0
        vmovss  xmm0, DWORD PTR [rdi+8]
        vaddss  xmm0, xmm0, DWORD PTR [rsi+8]
        vmovss  DWORD PTR [rdi+8], xmm0
        vmovss  xmm0, DWORD PTR [rdi+12]
        vaddss  xmm0, xmm0, DWORD PTR [rsi+12]
        vmovss  DWORD PTR [rdi+12], xmm0
        ret
The start of the function is two comparisons: if a[] and b[] don't overlap, I get the fast path at .L6; otherwise I get the slow path at .L3, which is identical to the code generated for add1() above.
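To see why the runtime check is needed, consider two hypothetical call sites (these callers are my own illustration, not from the post):

void add2(float a_[4], float b_[4]);  // the function defined above

void callers() {
    // Distinct 16-byte-aligned arrays: the overlap check at the top of
    // add2 passes and the vectorized path at .L6 runs.
    alignas(16) float a[4] = {1, 2, 3, 4};
    alignas(16) float b[4] = {5, 6, 7, 8};
    add2(a, b);

    // The same array passed twice still satisfies the alignment promise,
    // but the two ranges overlap, so the branch falls through to the
    // scalar code at .L3.
    add2(a, a);
}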
The fast path is exactly what I want: vmovaps is an aligned load of four floats from a[]:
- vector
- move
- a = the memory location is aligned to a multiple of 16 bytes (as opposed to unaligned)
- p = packed group of four floats (as opposed to scalar, which is one float)
- s = single precision (s = float, d = double)
vaddps is a vectorized add of the four floats in xmm0 against four floats in memory at b[].
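For comparison, the fast path is what you would write by hand with SSE intrinsics. A sketch of my own, assuming both pointers really are 16-byte aligned:

#include <xmmintrin.h>

void add_ps(float* a, float* b) {
    __m128 va = _mm_load_ps(a);           // vmovaps: aligned load of 4 floats
    __m128 vb = _mm_load_ps(b);           // aligned load of the other 4 floats
    _mm_store_ps(a, _mm_add_ps(va, vb));  // vaddps, then vmovaps back to a[]
}

(gcc chose to fold the second load into vaddps's memory operand instead of issuing a separate vmovaps, but the effect is the same.)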
Next, let's get rid of the slow path.
Restrict
I can use restrict to tell the compiler that the inputs don't overlap:
void add3(float* __restrict a_, float* __restrict b_) {
    float* a = (float*)__builtin_assume_aligned(a_, 16);
    float* b = (float*)__builtin_assume_aligned(b_, 16);
    for (int i = 0; i < 4; ++i) {
        a[i] += b[i];
    }
}
This generates just the fast path:
_Z4add3PfS_:
        vmovaps xmm0, XMMWORD PTR [rdi]
        vaddps  xmm0, xmm0, XMMWORD PTR [rsi]
        vmovaps XMMWORD PTR [rdi], xmm0
        ret
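The flip side is that __restrict is a promise made by the caller: add3 is only correct if the arrays really don't overlap. A sketch of the contract, with call sites I made up:

void add3(float* __restrict a_, float* __restrict b_);  // as defined above

void restrict_contract() {
    alignas(16) float x[4] = {1, 2, 3, 4};
    alignas(16) float y[4] = {5, 6, 7, 8};

    add3(x, y);  // fine: distinct, 16-byte-aligned arrays
    add3(x, x);  // undefined behaviour: the arrays overlap, which breaks
                 // the promise __restrict made to the compiler
}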
Closing the loop
Let's go back to add1() but write a wrapper around it:
extern void invalidate_a(float x[4]);
extern void invalidate_b(float x[4]);

void harness() {
    float a[4], b[4];
    invalidate_a(a);
    invalidate_b(b);
    add1(a, b);  // <--- This is the business end.
    invalidate_a(a);
}
The invalidate functions trick the compiler into thinking that meaningful reads and writes are happening to the two arrays. This prevents it from optimizing harness() down to a no-op.
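One way to provide them, if you want the harness to link and run, is to put deliberately opaque definitions in a separate translation unit. This file is my addition, not part of the original add.cc:

// invalidate.cc -- built as a separate translation unit, so the compiler
// optimizing harness() sees only the extern declarations above.
static volatile float sink;

void invalidate_a(float x[4]) {
    for (int i = 0; i < 4; ++i) {
        sink = x[i];   // a read the compiler must assume matters
        x[i] += 1.0f;  // a write the compiler must assume matters
    }
}

void invalidate_b(float x[4]) {
    for (int i = 0; i < 4; ++i) {
        sink = x[i];
        x[i] += 1.0f;
    }
}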
Compiling the harness produces exactly what I wanted:
_Z7harnessv:
.LFB3:
        sub     rsp, 40
        mov     rdi, rsp                          ; address of a[] on the stack
        call    _Z12invalidate_aPf
        lea     rdi, [rsp+16]                     ; address of b[]
        call    _Z12invalidate_bPf
        vmovaps xmm0, XMMWORD PTR [rsp]           ; inlined add1
        mov     rdi, rsp                          ; preparing for the later call to invalidate_a
        vaddps  xmm0, xmm0, XMMWORD PTR [rsp+16]  ; inlined add1
        vmovaps XMMWORD PTR [rsp], xmm0           ; inlined add1
        call    _Z12invalidate_aPf
        add     rsp, 40
        ret
The code generated for add1() on its own has to be conservative about its assumptions. When I add a harness around it, the compiler can choose good memory locations for a[] and b[]. The optimizer is given the alignment and non-overlap information, and produces better code.
My take-away is that I should write a simple kernel, without all the noise of annotations like __restrict, and use a harness around it to see what the generated code looks like.
For your convenience, the code in one file: add.cc.