## Vectorization with CLANG

Adapted from parts of http://locklessinc.com/articles/vectorize/, which was written for GCC. It touches on the use of restrict to avoid array aliasing, on hints that enable vectorization, and on other issues with vectorization in Clang.

# 1 Getting Started

This post is adapted from http://locklessinc.com/articles/vectorize/

# 2 Vectorization

Vectorization is a general optimization technique that can buy you an order-of-magnitude performance increase in some cases. It is also a delicate operation. On the one hand, vectorization is "automatic": when Clang is told to optimize aggressively, it will automatically try to vectorize every loop in your program. On the other hand, very small changes to loop structure can cause Clang to give up and not vectorize at all. Furthermore, a small change that does let your code vectorize may still not yield the expected speedup; occasionally, unvectorized code is even faster than vectorized code.

This recitation is about (1) learning to read assembly code and (2) getting a handle on what Clang is actually doing when it vectorizes code. In Homework 3, you will see the actual performance impact of vectorizing code.

In this recitation, you will play with selected examples of loops that can be auto-vectorized.

## 2.1 Example 1

```c
#include <stdint.h>   /* needed for uint8_t */
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test1(uint8_t *a, uint8_t *b)
{
    int i;

    for (i = 0; i < SIZE; i++)
    {
        a[i] += b[i];
    }
}
```

```
$ make clean; make ASSEMBLE=1 VECTORIZE=1 example1.o
```

You should see the following output, informing you that the loop has been vectorized.

```
test1-4.c:12:3: remark: vectorized loop (vectorization width: 16, interleaved count: 1) [-Rpass=loop-vectorize]
  for (i = 0; i < SIZE; i++) {
```
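These remarks come from Clang's optimization-report machinery. If you want to experiment outside the provided Makefile, you can request the reports directly; a minimal sketch of the invocation, assuming a reasonably recent Clang (the -Rpass-missed and -Rpass-analysis variants explain loops that did *not* vectorize):

```
$ clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
        -Rpass-analysis=loop-vectorize -c example1.c
```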

Now, let's take a look at the assembly code in example1.s:

```
# BB#0:                                 # %entry
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:b <- %RSI
        #DEBUG_VALUE: test:i <- 0
        .loc    1 12 3 prologue_end discriminator 1 # test1-4.c:12:3
        leaq    65535(%rsi), %rax
        cmpq    %rdi, %rax
        jb      .LBB0_2         # taken if %rax is below %rdi (unsigned), i.e. b + 65535 < a
# BB#1:                                 # %entry
        #DEBUG_VALUE: test:b <- %RSI
        #DEBUG_VALUE: test:a <- %RDI
        leaq    65535(%rdi), %rax
        cmpq    %rsi, %rax
        jb      .LBB0_2         # taken if a + 65535 < b
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:b <- %RSI
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_5:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 13 13                 # test1-4.c:13:13
.Ltmp0:
        movb    (%rsi,%rax), %cl
        movb    1(%rsi,%rax), %cl
        movb    2(%rsi,%rax), %cl
        movb    3(%rsi,%rax), %cl
.Ltmp1:
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_5
        jmp     .LBB0_6
        #DEBUG_VALUE: test:b <- %RSI
        #DEBUG_VALUE: test:a <- %RDI
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 13 13 is_stmt 1       # test1-4.c:13:13
.Ltmp2:
        movdqu  (%rsi,%rax), %xmm0
        movdqu  (%rdi,%rax), %xmm1
        movdqu  %xmm1, (%rdi,%rax)
        movdqu  16(%rsi,%rax), %xmm0
        movdqu  16(%rdi,%rax), %xmm1
        movdqu  %xmm1, 16(%rdi,%rax)
        movdqu  32(%rsi,%rax), %xmm0
        movdqu  32(%rdi,%rax), %xmm1
        movdqu  %xmm1, 32(%rdi,%rax)
        movdqu  48(%rsi,%rax), %xmm0
        movdqu  48(%rdi,%rax), %xmm1
        movdqu  %xmm1, 48(%rdi,%rax)
.Ltmp3:
        .loc    1 12 3 is_stmt 1 discriminator 1 # test1-4.c:12:3
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_3
```

This code first checks whether there is a partial overlap between arrays a and b. If there is an overlap, it runs a simple non-vectorized loop. If there is no overlap, it jumps to .LBB0_2 and runs a vectorized version.
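In C terms, the runtime check is morally equivalent to the following (a hand-written sketch, not the literal code Clang generates; note that comparing pointers into different objects is technically undefined in portable C, although the compiler itself is free to do it):

```c
#include <stdint.h>

#define SIZE (1L << 16)

/* The ranges [a, a+SIZE) and [b, b+SIZE) overlap iff each one
   starts no later than the other one ends. */
static int ranges_overlap(const uint8_t *a, const uint8_t *b)
{
    return a <= b + (SIZE - 1) && b <= a + (SIZE - 1);
}
```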

The above can, at best, be called partially vectorized. The problem is that the compiler is constrained by what we tell it about the arrays: if we tell it more, perhaps it can do more optimization. The most obvious thing is to inform the compiler that no overlap is possible. In standard C, this is done by adding the restrict qualifier to the pointers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test(uint8_t * restrict a, uint8_t * restrict b) {
    size_t i;

    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}
```

Now, you should see the following assembly code.

```
# BB#0:                                 # %entry
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:b <- %RSI
        xorl    %eax, %eax
.Ltmp0:
        #DEBUG_VALUE: test:i <- 0
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 13 13 prologue_end    # test1-4.c:13:13
        movdqu  (%rsi,%rax), %xmm0
        movdqu  (%rdi,%rax), %xmm1
        movdqu  %xmm1, (%rdi,%rax)
        movdqu  16(%rsi,%rax), %xmm0
        movdqu  16(%rdi,%rax), %xmm1
        movdqu  %xmm1, 16(%rdi,%rax)
        movdqu  32(%rsi,%rax), %xmm0
        movdqu  32(%rdi,%rax), %xmm1
        movdqu  %xmm1, 32(%rdi,%rax)
        movdqu  48(%rsi,%rax), %xmm0
        movdqu  48(%rdi,%rax), %xmm1
        movdqu  %xmm1, 48(%rdi,%rax)
.Ltmp1:
        .loc    1 12 3 is_stmt 1 discriminator 1 # test1-4.c:12:3
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_1
.Ltmp2:
```

This is better, but it still assumes the data are NOT 16-byte aligned, so the inner loop above cannot assume that both arrays are aligned. If Clang were smart, it could test at run time for the cases where the arrays are either both aligned or both unaligned, and use a fast inner loop when it can, as sketched below. However, it doesn't currently do that.
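The missing dispatch would look roughly like this sketch; add_aligned and add_unaligned are hypothetical names standing in for the two loop variants the compiler would have to generate:

```c
#include <stdint.h>

void add_aligned(uint8_t * restrict a, uint8_t * restrict b);   /* movdqa-style loop */
void add_unaligned(uint8_t * restrict a, uint8_t * restrict b); /* movdqu-style loop */

void test_dispatch(uint8_t * restrict a, uint8_t * restrict b)
{
    /* If both pointers are 16-byte aligned, the fast aligned loop
       is safe; otherwise fall back to the unaligned loop. */
    if ((((uintptr_t)a | (uintptr_t)b) & 15) == 0)
        add_aligned(a, b);
    else
        add_unaligned(a, b);
}
```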

So, in order to get the performance we are looking for, we need to tell Clang that the arrays are aligned. There are a couple of ways to do that. The first is to construct a (non-portable) aligned type and use that in the function interface. The second is to add an intrinsic or two within the function itself. The second option is easier to apply to older code bases, because the functions calling the one being vectorized do not have to be modified. The intrinsic for this is called __builtin_assume_aligned:

```c
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test(uint8_t * restrict a, uint8_t * restrict b) {
    size_t i;

    a = __builtin_assume_aligned(a, 16);
    b = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}
```
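One caution: __builtin_assume_aligned is a promise, not a request. If a caller ever passes a pointer that is not actually 16-byte aligned, the behavior is undefined. A sketch of a caller that upholds the promise, assuming C11's aligned_alloc (any allocator returning 16-byte-aligned memory works):

```c
#include <stdint.h>
#include <stdlib.h>

#define SIZE (1L << 16)

void test(uint8_t * restrict a, uint8_t * restrict b);

int main(void)
{
    /* aligned_alloc requires the size to be a multiple of the
       alignment; SIZE = 65536 is a multiple of 16. */
    uint8_t *a = aligned_alloc(16, SIZE);
    uint8_t *b = aligned_alloc(16, SIZE);
    if (!a || !b)
        return 1;
    test(a, b);
    free(a);
    free(b);
    return 0;
}
```

With the intrinsics in place, the assembly becomes: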

```
# BB#0:                                 # %entry
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:b <- %RSI
        xorl    %eax, %eax
.Ltmp0:
        #DEBUG_VALUE: test:i <- 0
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 17 10 prologue_end    # test1-4.c:17:10
        movdqa  (%rdi,%rax), %xmm0
        movdqa  %xmm0, (%rdi,%rax)
        movdqa  16(%rdi,%rax), %xmm0
        movdqa  %xmm0, 16(%rdi,%rax)
        movdqa  32(%rdi,%rax), %xmm0
        movdqa  %xmm0, 32(%rdi,%rax)
        movdqa  48(%rdi,%rax), %xmm0
        movdqa  %xmm0, 48(%rdi,%rax)
.Ltmp1:
        .loc    1 16 3 discriminator 1  # test1-4.c:16:3
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_1
.Ltmp2:
```

Now, finally, we get the nice tight vectorized code we were looking for: Clang has used packed SSE instructions to add 16 bytes at a time, and it uses aligned moves (movdqa), which it could not do last time. The question now is: given that we understand what we need to tell the compiler, how much more complex can the loop get before auto-vectorization fails?

Next, we try turning on AVX2 instructions. After you rebuild with AVX2=1, you should see the following:

```
# BB#0:                                 # %entry
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:b <- %RSI
        xorl    %eax, %eax
.Ltmp0:
        #DEBUG_VALUE: test:i <- 0
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 17 10 prologue_end    # test1-4.c:17:10
        vmovdqu (%rdi,%rax), %ymm0
        vmovdqu %ymm0, (%rdi,%rax)
        vmovdqu 32(%rdi,%rax), %ymm0
        vmovdqu %ymm0, 32(%rdi,%rax)
        vmovdqu 64(%rdi,%rax), %ymm0
        vmovdqu %ymm0, 64(%rdi,%rax)
        vmovdqu 96(%rdi,%rax), %ymm0
        vmovdqu %ymm0, 96(%rdi,%rax)
.Ltmp1:
        .loc    1 16 3 discriminator 1  # test1-4.c:16:3
        subq    $-128, %rax
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_1
.Ltmp2:
```

Question 1: This code still uses unaligned moves (vmovdqu). Fix the code to make sure it uses aligned moves for the best performance.

## 2.2 Example 2

```c
#include <stdint.h>
#include <stdlib.h>

#define SIZE (1L << 16)

void test8(uint8_t * restrict a, uint8_t * restrict b)
{
    size_t i;

    uint8_t *x = __builtin_assume_aligned(a, 16);
    uint8_t *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        /* max() */
        if (y[i] > x[i]) x[i] = y[i];
    }
}
```

```
$ make clean; make ASSEMBLE=1 VECTORIZE=1 example2.o
```

By default, this is not vectorized at all. What has gone wrong? The problem here is the C memory model. Clang isn't allowed to introduce writes to memory locations that the original program would not have written. If some other thread modified x[i] (for an element where the condition is false) while the routine above was running, an introduced write could stomp on that modification. We need a way to tell Clang that using the vectorized max() instruction is okay. Fortunately, this isn't too hard: we just use an unconditional write:

```c
void test(uint8_t * restrict a, uint8_t * restrict b) {
    size_t i;

    a = __builtin_assume_aligned(a, 16);
    b = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++) {
        /* max() */
        a[i] = (b[i] > a[i]) ? b[i] : a[i];
    }
}
```

Now you should see the vectorized assembly:

```
# BB#0:                                 # %entry
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:b <- %RSI
        xorl    %eax, %eax
.Ltmp0:
        #DEBUG_VALUE: test:i <- 0
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqa  (%rsi,%rax), %xmm0
        pmaxub  (%rdi,%rax), %xmm0
        movdqa  %xmm0, (%rdi,%rax)
        movdqa  16(%rsi,%rax), %xmm0
        pmaxub  16(%rdi,%rax), %xmm0
        movdqa  %xmm0, 16(%rdi,%rax)
        movdqa  32(%rsi,%rax), %xmm0
        pmaxub  32(%rdi,%rax), %xmm0
        movdqa  %xmm0, 32(%rdi,%rax)
        movdqa  48(%rsi,%rax), %xmm0
        pmaxub  48(%rdi,%rax), %xmm0
        movdqa  %xmm0, 48(%rdi,%rax)
.Ltmp1:
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_1
.Ltmp2:
```

This code looks just like the vectorized code for addition, and we didn't have to use any explicit intrinsics to get it: Clang has pattern-matched the compare-and-assign onto the pmaxub instruction we were looking for, which is very encouraging.
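For contrast, here is roughly what we would have had to write with explicit SSE2 intrinsics if the pattern matching had failed (a sketch; test_intrinsics is a hypothetical name, and the loads assume the same 16-byte alignment as before):

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_max_epu8 compiles to pmaxub */

#define SIZE (1L << 16)

void test_intrinsics(uint8_t * restrict a, uint8_t * restrict b)
{
    for (size_t i = 0; i < SIZE; i += 16) {
        __m128i va = _mm_load_si128((const __m128i *)&a[i]);
        __m128i vb = _mm_load_si128((const __m128i *)&b[i]);
        /* Unsigned byte maximum, 16 lanes at once. */
        _mm_store_si128((__m128i *)&a[i], _mm_max_epu8(va, vb));
    }
}
```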

## 2.3 Example 3

Open up example3.c. This code does not vectorize. Inspect the assembly and determine why. Do you think it would be faster if it did vectorize?

```
$ make clean; make ASSEMBLE=1 VECTORIZE=1 example3.o
```

```c
#include <stdint.h>
#include <stdlib.h>

#define SIZE (1L << 16)

void test15(uint8_t * restrict a, uint8_t * restrict b)
{
    size_t i;

    uint8_t *x = __builtin_assume_aligned(a, 16);
    uint8_t *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = y[i + 1];
    }
}
```

## 2.4 Example 4

Take a look at example4.c.

```
$ make clean; make ASSEMBLE=1 VECTORIZE=1 example4.o
```

You should see the following output:

```
.LBB0_1:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 17 7 prologue_end     # test21.c:17:7
.Ltmp1:
        #DEBUG_VALUE: test:y <- %XMM0
.Ltmp2:
.Ltmp3:
        .loc    1 16 26 discriminator 2 # test21.c:16:26
        .loc    1 16 3 is_stmt 0 discriminator 1 # test21.c:16:3
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_1
.Ltmp4:
# BB#2:
```

Is this code vectorized? If not, why not?

It isn't vectorized. What is going on? The problem here is that Clang isn't allowed to reorder the operations we give it. Even though the maximum and addition operations are associative over the real numbers, they are not associative for floating-point numbers. (Consider what happens with signed zeros, for example.)
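A minimal standalone demonstration of why reassociation changes floating-point results:

```c
#include <stdio.h>

int main(void)
{
    float x = 1e8f, y = 1.0f;
    /* The same three terms, grouped differently: 1e8f + 1.0f rounds
       back to 1e8f, because a float carries only about 7 decimal
       digits of precision. */
    printf("%g\n", (x + y) - x); /* prints 0 */
    printf("%g\n", y + (x - x)); /* prints 1 */
    return 0;
}
```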

We need to tell Clang that reordering operations is okay with us. To do this, we add another compile-time flag, -ffast-math. Add -ffast-math to the Makefile and compile the program again.

If we do this, we find that the output for functions 20 and 21 is improved:

```
# BB#0:                                 # %entry
        #DEBUG_VALUE: test:a <- %RDI
        #DEBUG_VALUE: test:x <- %RDI
        #DEBUG_VALUE: test:y <- 0.000000e+00
        xorpd   %xmm0, %xmm0
        xorl    %eax, %eax
.Ltmp0:
        #DEBUG_VALUE: test:i <- 0
        xorpd   %xmm1, %xmm1
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        .loc    1 17 7 prologue_end     # test21.c:17:7
.Ltmp1:
.Ltmp2:
        .loc    1 16 3 discriminator 1  # test21.c:16:3
        cmpq    $65536, %rax            # imm = 0x10000
        jne     .LBB0_1
# BB#2:                                 # %middle.block
        .loc    1 17 7                  # test21.c:17:7
.Ltmp3:
        movapd  %xmm1, %xmm0
        shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]
```