8 Feb 2015.

Short version: write a naïve kernel. Before you look at the generated assembler, write a wrapper around the kernel.

Longer version

Consider this code:

void add1(float a[4], float b[4]) {
  for (int i = 0; i < 4; ++i) {
    a[i] += b[i];
  }
}

gcc version 4.9.1 invoked as g++ -S -O3 -march=corei7-avx -masm=intel add.cc produces:

_Z4add1PfS_:
  vmovss	xmm0, DWORD PTR [rdi]        ; xmm0 = a[0]
  vaddss	xmm0, xmm0, DWORD PTR [rsi]  ; xmm0 = xmm0 + b[0]
  vmovss	DWORD PTR [rdi], xmm0        ; a[0] = xmm0

  vmovss	xmm0, DWORD PTR [rdi+4]
  vaddss	xmm0, xmm0, DWORD PTR [rsi+4]
  vmovss	DWORD PTR [rdi+4], xmm0
  vmovss	xmm0, DWORD PTR [rdi+8]
  vaddss	xmm0, xmm0, DWORD PTR [rsi+8]
  vmovss	DWORD PTR [rdi+8], xmm0
  vmovss	xmm0, DWORD PTR [rdi+12]
  vaddss	xmm0, xmm0, DWORD PTR [rsi+12]
  vmovss	DWORD PTR [rdi+12], xmm0
  ret

Decoding the name:

$ c++filt _Z4add1PfS_
add1(float*, float*)
$

The x86_64 System V calling convention passes the first two pointer arguments in the rdi and rsi registers, so rdi holds a and rsi holds b.

A float is 32 bits (4 bytes) wide, which is why the offsets step by 4: [rdi+4] is a[1], [rdi+8] is a[2], and so on.

The compiler has unrolled the loop into four copies of the three-instruction load/add/store sequence annotated above.

Why didn't it vectorize?

Alignment

I can explicitly tell the compiler to assume that a[] and b[] are aligned to the width of an xmm register (16 bytes):

void add2(float a_[4], float b_[4]) {
  float* a = (float*)__builtin_assume_aligned(a_, 16);
  float* b = (float*)__builtin_assume_aligned(b_, 16);
  for (int i = 0; i < 4; ++i) {
    a[i] += b[i];
  }
}

This produces:

_Z4add2PfS_:
  lea	rax, [rsi+16]  ; rax = b + 16
  ; cmp sets EFLAGS like sub
  cmp	rdi, rax       ; a - (b + 16)
  jae	.L6            ; jump if above or equal: a >= b + 16

  lea	rax, [rdi+16]  ; rax = a + 16
  cmp	rsi, rax       ; b - (a + 16)
  jb	.L3            ; jump if below: (b < a + 16) && !(a >= b + 16)
.L6:
  vmovaps	xmm0, XMMWORD PTR [rdi]
  vaddps	xmm0, xmm0, XMMWORD PTR [rsi]
  vmovaps	XMMWORD PTR [rdi], xmm0
  ret
.L3:
  vmovss	xmm0, DWORD PTR [rdi]
  vaddss	xmm0, xmm0, DWORD PTR [rsi]
  vmovss	DWORD PTR [rdi], xmm0
  vmovss	xmm0, DWORD PTR [rdi+4]
  vaddss	xmm0, xmm0, DWORD PTR [rsi+4]
  vmovss	DWORD PTR [rdi+4], xmm0
  vmovss	xmm0, DWORD PTR [rdi+8]
  vaddss	xmm0, xmm0, DWORD PTR [rsi+8]
  vmovss	DWORD PTR [rdi+8], xmm0
  vmovss	xmm0, DWORD PTR [rdi+12]
  vaddss	xmm0, xmm0, DWORD PTR [rsi+12]
  vmovss	DWORD PTR [rdi+12], xmm0
  ret

The function starts with two comparisons: if the 16-byte regions at a[] and b[] don't overlap, I get the fast path at .L6; otherwise I get the slow path at .L3, which is identical to the code generated for add1() above.
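That overlap check is there for correctness, not just speed: when the arrays alias, a blind four-wide vector add would read stale values. A minimal sketch of such a call (my hypothetical example, not part of the original code):

#include <cstdio>

// add1() repeated from above so this snippet compiles on its own.
void add1(float a[4], float b[4]) {
  for (int i = 0; i < 4; ++i) {
    a[i] += b[i];
  }
}

int main() {
  float v[5] = {1, 1, 1, 1, 1};
  // Overlapping call: a = v + 1, b = v, so a[i] += b[i] is v[i+1] += v[i].
  add1(v + 1, v);
  // The scalar loop sees each earlier result, so v becomes {1, 2, 3, 4, 5}.
  // A blind 4-wide vector add would read the old v[0..3] up front and
  // produce {1, 2, 2, 2, 2} instead, so the compiler must either prove
  // the arrays don't overlap or check at run time, as add2() does.
  for (int i = 0; i < 5; ++i) printf("%g ", v[i]);
  printf("\n");
}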

The fast path is exactly what I want: vmovaps is an aligned load of four floats from a[], and vaddps is a vectorized add of the four floats in xmm0 against the four floats in memory at b[].
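For reference, the fast path is what a hand-written SSE intrinsics version would look like (my sketch, not part of the original code; add_intrin is a made-up name):

#include <immintrin.h>

void add_intrin(float* a, float* b) {
  __m128 va = _mm_load_ps(a);           // aligned load of a[0..3] (vmovaps)
  __m128 vb = _mm_load_ps(b);           // the compiler can fold this load
                                        // into the add's memory operand
  _mm_store_ps(a, _mm_add_ps(va, vb));  // vaddps, then aligned store to a[]
}

Both _mm_load_ps and _mm_store_ps require 16-byte-aligned pointers, which is exactly the promise __builtin_assume_aligned made above.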

Next, let's get rid of the slow path.

Restrict

I can use GCC's __restrict (C++ has no standard restrict keyword) to tell the compiler that the inputs don't overlap:

void add3(float* __restrict a_, float* __restrict b_) {
  float* a = (float*)__builtin_assume_aligned(a_, 16);
  float* b = (float*)__builtin_assume_aligned(b_, 16);
  for (int i = 0; i < 4; ++i) {
    a[i] += b[i];
  }
}

This generates just the fast path:

_Z4add3PfS_:
  vmovaps	xmm0, XMMWORD PTR [rdi]
  vaddps	xmm0, xmm0, XMMWORD PTR [rsi]
  vmovaps	XMMWORD PTR [rdi], xmm0
  ret
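Both annotations are promises that every caller has to keep. A sketch of a valid call and two invalid ones (my example, not part of the original code):

void caller() {
  alignas(16) float a[4] = {1, 2, 3, 4};    // alignas needs C++11
  alignas(16) float b[4] = {10, 20, 30, 40};
  add3(a, b);        // fine: both pointers are 16-byte aligned and distinct
  // add3(a + 1, b); // breaks the alignment promise: the vmovaps load
  //                 // above would fault at run time
  // add3(a, a);     // breaks the __restrict promise: undefined behaviour
}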

Closing the loop

Let's go back to add1() but write a wrapper around it:

extern void invalidate_a(float x[4]);
extern void invalidate_b(float x[4]);

void harness() {
  float a[4], b[4];
  invalidate_a(a);
  invalidate_b(b);
  add1(a, b);  // <--- This is the business end.
  invalidate_a(a);
}

The invalidate functions trick the compiler into thinking that meaningful reads and writes of the two arrays happen elsewhere, which prevents it from optimizing harness() down to a no-op. They only need to be declared, never defined, because we just compile with -S and never link.

This produces exactly what I wanted:

_Z7harnessv:
.LFB3:
  sub	rsp, 40
  mov	rdi, rsp       ; address of a[] on the stack
  call	_Z12invalidate_aPf
  lea	rdi, [rsp+16]  ; address of b[]
  call	_Z12invalidate_bPf
  vmovaps	xmm0, XMMWORD PTR [rsp]           ; inlined add1
  mov	rdi, rsp  ; preparing for the later call to invalidate_a
  vaddps	xmm0, xmm0, XMMWORD PTR [rsp+16]  ; inlined add1
  vmovaps	XMMWORD PTR [rsp], xmm0           ; inlined add1
  call	_Z12invalidate_aPf
  add	rsp, 40
  ret

Compiled on its own, add1() has to be conservative: it can assume nothing about the alignment of its arguments or about whether they overlap.

When I add a harness around it, the compiler itself chooses the memory locations for a[] and b[], so after inlining add1() the optimizer knows they are suitably aligned and don't overlap, and it produces the vectorized code.
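As an aside, a common alternative to the extern invalidate functions is an empty inline-asm statement that pretends to read and write the array. A GCC/Clang-specific sketch (touch() and harness2() are my names, not part of the original code):

// Tells the optimizer that something it cannot see may read or write
// memory reachable through x.
static void touch(float x[4]) {
  asm volatile("" : : "r"(x) : "memory");
}

void harness2() {
  float a[4], b[4];
  touch(a);      // pretend a[] was initialized elsewhere
  touch(b);
  add1(a, b);    // the code under inspection
  touch(a);      // pretend the result is read, so the add isn't dead
}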

My take-away is that I should write a simple kernel, without all the noise of annotations like __restrict, and use a harness around it to see what the generated code looks like.

For your convenience, here is the code in one file: add.cc.