Vectorization with CLANG

Adapted from parts of http://locklessinc.com/articles/vectorize/, which targets GCC. It touches on the use of restrict to avoid array aliasing, rewriting conditionals so that they can be vectorized, and other issues with vectorization in Clang.

1 Getting Started

 

This post is adapted from http://locklessinc.com/articles/vectorize/.

2 Vectorization

Vectorization is a general optimization technique that can buy you an order of magnitude performance increase in some cases. It is also a delicate operation. On the one hand, vectorization is “automatic:” when Clang is told to optimize aggressively, it will automatically try to vectorize every loop in your program. On the other hand, very small changes to loop structure cause Clang to give up and not vectorize at all. Furthermore, these small changes may allow your code to vectorize but not yield the expected speedup. Occasionally, unvectorized code may be faster than vectorized code.

This recitation is about (1) learning to read assembly code and (2) getting a handle on how to interpret what Clang is actually doing when it vectorizes code; in homework 3, you will see the actual performance impact of vectorizing code.

In this recitation, you will play with some selected examples of loops that can be auto-vectorized.

2.1 Example 1

#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test1(uint8_t *a, uint8_t *b)
{
  int i;

  for (i = 0; i < SIZE; i++)
  {
    a[i] += b[i];
  }
}

$ make clean; make ASSEMBLE=1 VECTORIZE=1 example1.o

You should see the following output, informing you that the loop has been vectorized.

test1-4.c:12:3: remark: vectorized loop (vectorization width: 16, interleaved count: 1)

     [-Rpass=loop-vectorize]

 for (i = 0; i < SIZE; i++) {

Now, let’s take a look at the assembly code in example1.s

# BB#0:                                 # %entry

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:b <- %RSI

       #DEBUG_VALUE: test:i <- 0

       .loc    1 12 3 prologue_end discriminator 1 # test1-4.c:12:3

       leaq    65535(%rsi), %rax

       cmpq    %rdi, %rax     

       jb      .LBB0_2                 (jump if %rax is below %rdi, i.e. if b's last element lies before the start of a)

# BB#1:                                 # %entry

       #DEBUG_VALUE: test:b <- %RSI

       #DEBUG_VALUE: test:a <- %RDI

       leaq    65535(%rdi), %rax

       cmpq    %rsi, %rax

       jb      .LBB0_2

# BB#4:                                 # %for.body.preheader

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:b <- %RSI

       xorl    %eax, %eax

       .align  16, 0x90

.LBB0_5:                                # %for.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 13 13                 # test1-4.c:13:13

.Ltmp0:

       movb    (%rsi,%rax), %cl

       addb    %cl, (%rdi,%rax)

       movb    1(%rsi,%rax), %cl

       addb    %cl, 1(%rdi,%rax)

       movb    2(%rsi,%rax), %cl

       addb    %cl, 2(%rdi,%rax)

       movb    3(%rsi,%rax), %cl

       addb    %cl, 3(%rdi,%rax)

.Ltmp1:

       addq    $4, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_5

       jmp     .LBB0_6

.LBB0_2:                                # %vector.body.preheader

       #DEBUG_VALUE: test:b <- %RSI

       #DEBUG_VALUE: test:a <- %RDI

       xorl    %eax, %eax

       .align  16, 0x90

.LBB0_3:                                # %vector.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 13 13 is_stmt 1       # test1-4.c:13:13

.Ltmp2:

       movdqu  (%rsi,%rax), %xmm0

       movdqu  (%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, (%rdi,%rax)

       movdqu  16(%rsi,%rax), %xmm0

       movdqu  16(%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, 16(%rdi,%rax)

       movdqu  32(%rsi,%rax), %xmm0

       movdqu  32(%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, 32(%rdi,%rax)

       movdqu  48(%rsi,%rax), %xmm0

       movdqu  48(%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, 48(%rdi,%rax)

.Ltmp3:

       .loc    1 12 3 is_stmt 1 discriminator 1 # test1-4.c:12:3

       addq    $64, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_3

This code first checks whether arrays a and b partially overlap. If they might overlap, it falls through to a simple non-vectorized loop; if they cannot overlap, it jumps to .LBB0_2 and runs a vectorized version.
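To see why this check is necessary, consider a hypothetical caller that passes overlapping pointers (the buffer and function names below are made up for illustration):

#include <stdint.h>

#define SIZE (1L << 16)

void test1(uint8_t *a, uint8_t *b);   /* the routine above */

void caller_with_overlap(void)
{
  static uint8_t buf[SIZE + 1];

  /* a and b alias with an offset of one byte, so a[i] += b[i] becomes
     buf[i + 1] += buf[i]: each iteration reads a byte that the previous
     iteration wrote.  Loading 16 bytes of b at once before storing to a
     would give a different answer than the scalar loop, which is why the
     compiler must guard the vectorized path with the overlap check. */
  test1(buf + 1, buf);
}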

The above can at best be called partially vectorized. The problem is that the compiler is constrained by what we tell it about the arrays. If we tell it more, then perhaps it can do more optimization. The most obvious thing is to inform the compiler that no overlap is possible. This is done in standard C by using the restrict qualifier for the pointers.

#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test(uint8_t * restrict a, uint8_t * restrict b) {
  size_t i;

  for (i = 0; i < SIZE; i++) {
    a[i] += b[i];
  }
}

Now, you should see the following assembly code.

# BB#0:                                 # %entry

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:b <- %RSI

       xorl    %eax, %eax

.Ltmp0:

       #DEBUG_VALUE: test:i <- 0

       .align  16, 0x90

.LBB0_1:                                # %vector.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 13 13 prologue_end    # test1-4.c:13:13

       movdqu  (%rsi,%rax), %xmm0

       movdqu  (%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, (%rdi,%rax)

       movdqu  16(%rsi,%rax), %xmm0

       movdqu  16(%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, 16(%rdi,%rax)

       movdqu  32(%rsi,%rax), %xmm0

       movdqu  32(%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, 32(%rdi,%rax)

       movdqu  48(%rsi,%rax), %xmm0

       movdqu  48(%rdi,%rax), %xmm1

       paddb   %xmm0, %xmm1

       movdqu  %xmm1, 48(%rdi,%rax)

.Ltmp1:

       .loc    1 12 3 is_stmt 1 discriminator 1 # test1-4.c:12:3

       addq    $64, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_1

.Ltmp2:

This is better, but it still assumes the data are NOT 16-byte aligned (note the unaligned movdqu loads and stores), so the inner loop cannot take advantage of alignment. If Clang were smarter, it could test at run time for the cases where the arrays are either both aligned or both unaligned, and dispatch to a fast inner loop. However, it doesn't do that currently.

So in order to get the performance we are looking for, we need to tell Clang that the arrays are aligned. There are a couple of ways to do that. The first is to construct a (non-portable) aligned type and use that in the function interface. The second is to add an intrinsic or two within the function itself. The second option is easier to adopt in older code bases, because the functions that call the one being vectorized do not have to be modified. The intrinsic for this is called __builtin_assume_aligned:

#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test(uint8_t * restrict a, uint8_t * restrict b) {
  size_t i;

  a = __builtin_assume_aligned(a, 16);
  b = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++) {
    a[i] += b[i];
  }
}

With __builtin_assume_aligned added, you should see the following assembly.

# BB#0:                                 # %entry

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:b <- %RSI

       xorl    %eax, %eax

.Ltmp0:

       #DEBUG_VALUE: test:i <- 0

       .align  16, 0x90

.LBB0_1:                                # %vector.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 17 10 prologue_end    # test1-4.c:17:10

       movdqa  (%rdi,%rax), %xmm0

       paddb   (%rsi,%rax), %xmm0

       movdqa  %xmm0, (%rdi,%rax)

       movdqa  16(%rdi,%rax), %xmm0

       paddb   16(%rsi,%rax), %xmm0

       movdqa  %xmm0, 16(%rdi,%rax)

       movdqa  32(%rdi,%rax), %xmm0

       paddb   32(%rsi,%rax), %xmm0

       movdqa  %xmm0, 32(%rdi,%rax)

       movdqa  48(%rdi,%rax), %xmm0

       paddb   48(%rsi,%rax), %xmm0

       movdqa  %xmm0, 48(%rdi,%rax)

.Ltmp1:

       .loc    1 16 3 discriminator 1  # test1-4.c:16:3

       addq    $64, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_1

.Ltmp2:

Now, finally, we get the nice tight vectorized code we were looking for: Clang uses packed SSE instructions to add 16 bytes at a time, and it can fold one of the loads directly into the paddb instruction, which it could not do with the unaligned data. The question now is: given that we understand what we need to tell the compiler, how much more complex can the loop get before auto-vectorization fails?
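Note that __builtin_assume_aligned is only a promise: the caller must actually supply 16-byte-aligned buffers, or the aligned moves will fault. A minimal sketch of such a caller, assuming C11's aligned_alloc is available (the buffer names are illustrative):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (1L << 16)

void test(uint8_t * restrict a, uint8_t * restrict b);   /* the routine above */

int main(void)
{
  /* aligned_alloc requires the size to be a multiple of the alignment;
     SIZE = 65536 is a multiple of 16, so this is fine. */
  uint8_t *a = aligned_alloc(16, SIZE);
  uint8_t *b = aligned_alloc(16, SIZE);
  if (!a || !b) return 1;

  memset(a, 1, SIZE);
  memset(b, 2, SIZE);

  test(a, b);   /* the alignment promise inside test() now holds */

  free(a);
  free(b);
  return 0;
}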

Next, let's try turning on AVX2 instructions.
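If you are building by hand rather than through the course Makefile, a roughly equivalent invocation (exactly which flags the Makefile passes is an assumption here) would be:

$ clang -O3 -mavx2 -Rpass=loop-vectorize -S example1.c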

After you rebuild with AVX2=1, you should see the following assembly.

# BB#0:                                 # %entry

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:b <- %RSI

       xorl    %eax, %eax

.Ltmp0:

       #DEBUG_VALUE: test:i <- 0

       .align  16, 0x90

.LBB0_1:                                # %vector.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 17 10 prologue_end    # test1-4.c:17:10

       vmovdqu (%rdi,%rax), %ymm0

       vpaddb  (%rsi,%rax), %ymm0, %ymm0

       vmovdqu %ymm0, (%rdi,%rax)

       vmovdqu 32(%rdi,%rax), %ymm0

       vpaddb  32(%rsi,%rax), %ymm0, %ymm0

       vmovdqu %ymm0, 32(%rdi,%rax)

       vmovdqu 64(%rdi,%rax), %ymm0

       vpaddb  64(%rsi,%rax), %ymm0, %ymm0

       vmovdqu %ymm0, 64(%rdi,%rax)

       vmovdqu 96(%rdi,%rax), %ymm0

       vpaddb  96(%rsi,%rax), %ymm0, %ymm0

       vmovdqu %ymm0, 96(%rdi,%rax)

.Ltmp1:

       .loc    1 16 3 discriminator 1  # test1-4.c:16:3

       subq    $-128, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_1

.Ltmp2:

Question 1: This code still uses unaligned moves (vmovdqu). Fix the code to make sure it uses aligned moves for the best performance.

2.2 Example 2

void test8(uint8_t * restrict a, uint8_t * restrict b)
{
  size_t i;

  uint8_t *x = __builtin_assume_aligned(a, 16);
  uint8_t *y = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
  {
    /* max() */
    if (y[i] > x[i]) x[i] = y[i];
  }
}

$ make clean; make ASSEMBLE=1 VECTORIZE=1 example2.o

By default, this loop is not vectorized at all. What has gone wrong? The problem here is the C memory model: Clang isn't allowed to introduce writes to memory locations that the original program might not write. If some other thread modified x[] while this routine ran, those extra writes could stomp on its modifications. We need a way to tell Clang that using the vectorized max() instruction is okay. Fortunately, this isn't too hard; we just use an unconditional write:

void test(uint8_t * restrict a, uint8_t * restrict b) {
  size_t i;

  a = __builtin_assume_aligned(a, 16);
  b = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++) {
    /* max() */
    a[i] = (b[i] > a[i]) ? b[i] : a[i];
  }
}

Now you should see the vectorized assembly:

# BB#0:                                 # %entry

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:b <- %RSI

       xorl    %eax, %eax

.Ltmp0:

       #DEBUG_VALUE: test:i <- 0

       .align  16, 0x90

.LBB0_1:                                # %vector.body

                                       # =>This Inner Loop Header: Depth=1

       movdqa  (%rsi,%rax), %xmm0

       pmaxub  (%rdi,%rax), %xmm0

       movdqa  %xmm0, (%rdi,%rax)

       movdqa  16(%rsi,%rax), %xmm0

       pmaxub  16(%rdi,%rax), %xmm0

       movdqa  %xmm0, 16(%rdi,%rax)

       movdqa  32(%rsi,%rax), %xmm0

       pmaxub  32(%rdi,%rax), %xmm0

       movdqa  %xmm0, 32(%rdi,%rax)

       movdqa  48(%rsi,%rax), %xmm0

       pmaxub  48(%rdi,%rax), %xmm0

       movdqa  %xmm0, 48(%rdi,%rax)

.Ltmp1:

       addq    $64, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_1

.Ltmp2:

This code looks just like the vectorized code for addition, and we didn't have to use any explicit intrinsics to get it. Clang has pattern-matched the conditional and found the pmaxub instruction we were looking for, which is very encouraging.
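For comparison, a hand-written version using explicit SSE2 intrinsics might look roughly like the sketch below (the function name is made up); the point of auto-vectorization is that we never have to write this ourselves:

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

#define SIZE (1L << 16)

void test_intrinsics(uint8_t * restrict a, uint8_t * restrict b)
{
  size_t i;

  for (i = 0; i < SIZE; i += 16) {
    /* Aligned 16-byte loads, unsigned byte max, aligned store --
       comparable to the movdqa/pmaxub/movdqa sequence above. */
    __m128i va = _mm_load_si128((__m128i *)(a + i));
    __m128i vb = _mm_load_si128((__m128i *)(b + i));
    _mm_store_si128((__m128i *)(a + i), _mm_max_epu8(va, vb));
  }
}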

2.3 Example 3

Open up example3.c. The code does not vectorize. Inspect the assembly and determine why this code does not vectorize. Do you think it would be faster if it did vectorize?

$ make clean; make ASSEMBLE=1 VECTORIZE=1 example3.o


void test15(uint8_t * restrict a, uint8_t * restrict b)
{
  size_t i;

  uint8_t *x = __builtin_assume_aligned(a, 16);
  uint8_t *y = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
  {
    x[i] = y[i + 1];
  }
}

2.4 Example 4

Take a look at example4.c
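The source is not reproduced here, but from the debug information in the assembly below you can see that the loop is essentially a floating-point sum reduction, along the lines of this sketch (the exact names and includes in example4.c may differ):

#include <stdint.h>
#include <stdlib.h>

#define SIZE (1L << 16)

double test(double * restrict a)
{
  size_t i;
  double y = 0;
  double *x = __builtin_assume_aligned(a, 16);

  for (i = 0; i < SIZE; i++) {
    y += x[i];
  }
  return y;
}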

$ make clean; make ASSEMBLE=1 VECTORIZE=1 example4.o

And you should see the following output.

.LBB0_1:                                # %for.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 17 7 prologue_end     # test21.c:17:7

       addsd   (%rdi,%rax,8), %xmm0

.Ltmp1:

       #DEBUG_VALUE: test:y <- %XMM0

       addsd   8(%rdi,%rax,8), %xmm0

.Ltmp2:

       addsd   16(%rdi,%rax,8), %xmm0

       addsd   24(%rdi,%rax,8), %xmm0

       addsd   32(%rdi,%rax,8), %xmm0

       addsd   40(%rdi,%rax,8), %xmm0

       addsd   48(%rdi,%rax,8), %xmm0

       addsd   56(%rdi,%rax,8), %xmm0

.Ltmp3:

       .loc    1 16 26 discriminator 2 # test21.c:16:26

       addq    $8, %rax

       .loc    1 16 3 is_stmt 0 discriminator 1 # test21.c:16:3

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_1

.Ltmp4:

# BB#2:

Is this code vectorized? If not, why not?

This isn't vectorized (note the scalar addsd instructions). What is going on? The problem here is that Clang isn't allowed to reorder the operations we give it. Even though addition is associative over the real numbers, it is not associative over floating-point numbers, so regrouping the sums could change the result. (Consider how the rounding of each partial sum depends on the order in which the elements are added.)

We need to tell Clang that reordering operations is okay with us. To do this, we need to add another compile-time flag, “-ffast-math”.

Add the compilation flag -ffast-math to the Makefile and compile the program again.

If we do this, we find that the output for example 4 is much improved:

# BB#0:                                 # %entry

       #DEBUG_VALUE: test:a <- %RDI

       #DEBUG_VALUE: test:x <- %RDI

       #DEBUG_VALUE: test:y <- 0.000000e+00

       xorpd   %xmm0, %xmm0

       xorl    %eax, %eax

.Ltmp0:

       #DEBUG_VALUE: test:i <- 0

       xorpd   %xmm1, %xmm1

       .align  16, 0x90

.LBB0_1:                                # %vector.body

                                       # =>This Inner Loop Header: Depth=1

       .loc    1 17 7 prologue_end     # test21.c:17:7

.Ltmp1:

       addpd   (%rdi,%rax,8), %xmm0

       addpd   16(%rdi,%rax,8), %xmm1

       addpd   32(%rdi,%rax,8), %xmm0

       addpd   48(%rdi,%rax,8), %xmm1

       addpd   64(%rdi,%rax,8), %xmm0

       addpd   80(%rdi,%rax,8), %xmm1

       addpd   96(%rdi,%rax,8), %xmm0

       addpd   112(%rdi,%rax,8), %xmm1

.Ltmp2:

       .loc    1 16 3 discriminator 1  # test21.c:16:3

       addq    $16, %rax

       cmpq    $65536, %rax            # imm = 0x10000

       jne     .LBB0_1

# BB#2:                                 # %middle.block

       .loc    1 17 7                  # test21.c:17:7

.Ltmp3:

       addpd   %xmm0, %xmm1

       movapd  %xmm1, %xmm0

       shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]

       addpd   %xmm1, %xmm0

.Ltmp4:
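Notice that the vector loop keeps two independent accumulator registers, %xmm0 and %xmm1 (each holding two doubles), and the middle block after the loop combines them with a final horizontal add. In scalar terms, the transformation that -ffast-math permits is roughly the following sketch (the function name is made up):

#include <stddef.h>

double sum_regrouped(const double *x, size_t n)
{
  /* Two independent partial sums, mirroring the two accumulator
     registers above.  Regrouping the additions like this is exactly
     the reordering that -ffast-math allows.  n is assumed to be even,
     as SIZE is in these examples. */
  double s0 = 0.0, s1 = 0.0;
  size_t i;

  for (i = 0; i < n; i += 2) {
    s0 += x[i];
    s1 += x[i + 1];
  }
  return s0 + s1;   /* the "middle block": combine the partial sums */
}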

Here are some questions that are useful for checking your understanding:

  1. Show your assembly output of aligned AVX2 instructions for example 1.
  2. Show your assembly output of vectorized example 4 with the fast-math flag.
  3. What does %rax contain in example 1?
  4. For example 1, what is the difference between the aligned code and the unaligned code? Why do you think the aligned code is better than the unaligned code?
  5. What are the differences between AVX2 instructions and regular vectorized instructions? Why do these differences exist?
  6. Why does example 3 not vectorize?