Sunday, 28 February 2016

Vectorization

When programs require crunching of numbers a bottleneck that can always occur is how long it takes. Sometimes these things can't be avoided, especially if the issue is just have a lot of elements to parse through. But there are some tricks that programmers can use to make sure the problem doesn't occur. One of these is Vectorization. Vectorization is the process of rewriting a loop so that instead of processing a single element, it can process multiple elements simultaneously. 

To show this in action I used the following code:


#include <stdio.h>
#include <stdlib.h>
#define N 1000
int main()
{
    int array1[N], array2[N];
    srand(time(NULL));
    int i;
    for (i = 0; i < N; i++) {
            array1[i] = rand() % (1000 + 1 - 0) + 0;
    }  
    for (i = 0; i < N; i++) {
            array2[i] = rand() % (1000 + 1 - 0) + 0;
    }
  
    long int total = 0;
    int array3[N];
    for (i = 0; i < N; i++){
        array3[N] = array1[i] + array2[i];
    }
    for (i = 0; i < N; i++){
        total += array3[N];
    }
    printf("\nThe total is: %d\n", total);
}
   
The code above creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.

I compiled the code with the command line: gcc -O3 -o vect vect.c
The -O3 - allows options for vectorization.

Here is a portion of the objdump of main:


  402388: 91420222  add x2, x0, #0x80, lsl #12
  40238c: 3c407c00  ld1 {v0.2d}, [x0]
  402390: 3cdf7c21  ld1 {v1.2d}, [x1], #16
  402394: 3e61d400  fadd v0.2d, v0.2d, v1.2d
  402398: 4c9f2c00  st1 {v0.2d}, [x0], #16
  40239c: eb02001f  cmp x0, x2

Looking at the objdump, the intrsuctions new to me are st1 and ld1. Looking it up,st1 stores a single element structure to one lane of of one register. ld1 on the other hand loads a single element structure instead. Those two commands seem to help the process of vectorization.

Algorithm Selection

During this lab, the class was tasked with two separate ways of scaling a sound sample. The first was a signed 16 bit integer with a scaling factor expressed as a floating point. The second was with a pre computed lookup table. The class would have to find out which approach is the fastest and for different architectures.

The following code is the cut down version for easier understanding and would not run directly as is:


#define MAX 1550000
#define FACTOR 0.500

int main() {
        int16_t* meth1 = (int16_t*) malloc(MAX * 2);
        int16_t* meth2 = (int16_t*) malloc(MAX * 2);
        int i;
        sample = (int16_t*) malloc(32768 * 2);
        int16_t* sound;
        sound = (int16_t*) malloc(MAX * 2);
        for(i=0;i<MAX;i++){
                sound[i] = rand() % 65536 - 32768;
        }

        for(i=0;i<MAX;i++){
                meth1[i] = method1(sound[i],FACTOR);
        }


        for(i=0;i<32769;i++){
                sample[i] = i * FACTOR;
        }
        for(i=0;i<MAX;i++){
                meth2[i] = method2(sound[i],FACTOR);
        }
        return 0;
}

// Method 1 Calculate result each time
int16_t method1(int16_t s,float f){
        return s*f;
}
// Method 2 Using Lookup Table to get result;
int16_t method2(int16_t s,float f){
        int16_t result;
        if(s >= 0){
                result = sample[s];
        }
        else{
                result = -sample[-s];
        }
        return result;
}
One thing to keep in mind for the code to work is that many of the variables need to be used in some shape or form. Some of the optimizations of the compiler cause certain operations to not waste time calculating if not used in another place. I've removed those parts above in the sample code to reduce verbose.

The results between the x86_64 system and ARMv8 system are as following:


System Scaling Lookup
x86_64 ~0.169ms ~0.166ms
ARMv8 ~0.95ms ~0.44ms

From the results above we can see that the implementation of the problem was best run on the x86_64 system. As it probably has better hardware for faster processing. However, what's interesting to note is that the lookup performed much better on the ARMv8 compared to the scaling. The x86_64 system didn't see as much as a performance difference.