Auto-vectorization With the gcc Compiler

What is it?

One way to improve the performance of loops is to trigger the auto-vectorization passes built into the gcc compiler. Vectorization speeds up your program when the vector operations are simple and executed many times. At the lowest level, gcc exposes built-in functions (intrinsics) that map one-to-one to the vector instructions on the CPU.

Vector instructions are instructions that operate on a one-dimensional array of values (a vector) held in a single register, applying the same operation to every element at once. Because gcc exposes them directly, it is possible to write code using them in a higher-level language such as C/C++ and still have nearly complete control over the output.

Successfully activating auto-vectorization isn’t easy and most often requires code to be rewritten. gcc has traditionally enabled the vectorizer only at -O3 (or with -ftree-vectorize); since GCC 12, -O2 also enables it with a more conservative cost model. To get an idea of how vectorization works, the comparison below shows that without vectorization each operation uses only one 32-bit integer slot of a vector register, leaving the other 3 x 32-bit integer slots unused.

[Figure: 3 x 32-bit unused integers]

[Figure: vectorized integer map]

With vectorization enabled the compiler can take advantage of the extra slots in the register.

How do I enable it?

To demonstrate how to successfully implement auto-vectorization we will create a simple C program that:
  • creates two 1000-element integer arrays
  • fills both arrays with random numbers in the range -1000 to +1000
  • sums both arrays element-by-element into a third array
  • sums the third array and displays the result
  • is compiled so that gcc auto-vectorization is enabled

Confirming a successful auto-vectorization can be a little tricky. Using the gcc compiler flag -O3 will enable the optimizations, but it won’t output any messages to the console.

After a little digging around the gcc docs I came across the following flags:

  • -fopt-info-vec-missed //display information about loops that couldn’t be vectorized
  • -fopt-info-loop-optimized //display successfully auto-vectorized loops

Demo Files

Download the files I used for the following examples: c_loop_autovect_lab.zip

  1. unzip c_loop_autovect_lab.zip
  2. cd c_loop_autovect_lab 
  3. make //build the files
    • make run //execute all binaries and create objdump output
    • make miss //display all the missed optimizations
    • make opt //display all vectorized loops
    • make clean  //cleans up binaries, *.miss , *.opt,  *.txt

Examples

Let’s start with a simple for loop with statically defined arrays and a long data type to store the result.

When we look at the loop_vect_v0.miss file we notice the gcc compiler complaining that there are not enough data references to auto-vectorize the loop.

What if we re-code our program and create a separate for loop to handle the additions?

Success! It looks like the compiler auto-vectorized the second for loop.

The console returns the following messages after compiling:

loop_vect_v2.c:21:1: note: loop vectorized
loop_vect_v2.c:21:1: note: loop turned into non-loop; it never loops.
loop_vect_v2.c:21:1: note: loop with 3 iterations completely unrolled
loop_vect_v2.c:7:5: note: loop turned into non-loop; it never loops

If we compile with the -fopt-info-loop-vec-all flag we can view complete details of all the optimizations.

loop_vect_v1.c:19:1: note: ===== analyze_loop_nest =====
loop_vect_v1.c:19:1: note: === vect_analyze_loop_form ===
loop_vect_v1.c:19:1: note: === get_loop_niters ===
loop_vect_v1.c:19:1: note: === vect_analyze_data_refs ===
loop_vect_v1.c:19:1: note: got vectype for stmt: _9 = array1[i_41];
vector(4) int
loop_vect_v1.c:19:1: note: got vectype for stmt: _10 = array2[i_41];
vector(4) int
loop_vect_v1.c:19:1: note: Cost model analysis:
Vector inside of loop cost: 4
Vector prologue cost: 1
Vector epilogue cost: 2
Scalar iteration cost: 4
Scalar outside cost: 0
Vector outside cost: 3
prologue iterations: 0
epilogue iterations: 0

It looks like the simpler the loop is, the higher the chance of auto-vectorization. In the last example, the compiler confirmed that the arrays were aligned and could completely unroll the loop.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program’s execution speed at the expense of its binary size, which is an approach known as the space–time tradeoff. (Wikipedia)

Reflection

Although gcc’s auto-vectorization can increase runtime performance, it may not be practical for certain applications. Many restrictions and conditions must be met before a loop can be auto-vectorized. For example, gcc needs confirmation that the arrays are aligned. Also, code will most likely have to be rewritten to simplify the loops, and even then auto-vectorization isn’t guaranteed.
