Thursday, 21 April 2016

Providing for the Upstream

I have submitted my work upstream through GitHub.

https://github.com/google/zopfli/issues/104

I thought it would be best practice to first show my findings to the community so they could comment on the pros and cons I might not know about. Unfortunately, even four days after posting, no one has responded yet. I assume the same thing would have happened had I made a pull request, as no new pull requests have been answered either.

For the sake of time though, had the work been accepted, I would have had to go through their guidelines for contributing, which require signing the Google Individual Contributor License Agreement (CLA). The guidelines state the following:

Before you contribute

Before we can use your code, you must sign the Google Individual Contributor License Agreement (CLA), which you can do online. The CLA is necessary mainly because you own the copyright to your changes, even after your contribution becomes part of our codebase, so we need your permission to use and distribute your code. We also need to be sure of various other things—for instance that you'll tell us if you know that your code infringes on other people's patents. You don't have to sign the CLA until after you've submitted your code for review and a member has approved it, but you must do it before we can put your code into our codebase. Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

Code reviews

All submissions, including submissions by project members, require review. We use Github pull requests for this purpose.

The small print

Contributions made by corporations are covered by a different agreement than the one above, the Software Grant and Corporate Contributor License Agreement.

 I followed their suggestion of:

Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

But since they have not answered yet, I have not been able to continue on to signing the CLA.

Sunday, 10 April 2016

The Trials and Tribulations of Optimizing Zopfli

What Didn't Work

When I first went about this, my testing methodology was flawed, and there was research I should have done more thoroughly.

For example, reading more about Zopfli made it clear that it is slow on purpose: that is the entire point of the project. It is meant to be roughly 80x slower than gzip in exchange for output that is up to 8% smaller. So bettering it became a tall order.

Here is a comparison of the compressed sizes between the two:

gzip:     7599324 bytes
zopfli:   7550970 bytes

As you can see, in this case, for the 10 MB file made with the command:

base64 /dev/urandom | head -c 10000000 > 10mb.txt

zopfli only provides about a 0.6% size reduction. With less random text it can probably do better, but we are still dealing with razor-thin margins.

What I first tried was following this guide, but I ended up with even worse performance than the stock makefile build. It would not have been a permanent solution anyway, as I would not be able to use the resulting Windows executable on Linux.

Next, I tried changing the code anywhere I could to squeeze out even a small boost, swapping data types where I thought it could help. The problem was that my changes would not compile correctly. The code is so intricate and precise that it was difficult to pinpoint what to change and where, so I wrote this approach off as a bust.

What Did Work 

So I thought my only remaining option was changing the compiler options around to see what sticks. In the makefile and the documentation, it is suggested that the program be compiled with -O2. I thought this was odd: why not go for the full -O3, along with whatever other options still let it compile correctly?

Here are the results comparing -O2 and -O3 on the xerxes server:

-O2: real    0m29.852s
-O3: real    0m29.847s

As you can see, there is no difference that can really be attributed to the higher optimization level. I ran it many times, with -O2 sometimes faster than -O3 and vice versa. I can only assume the documentation suggests -O2 over -O3 because -O2 compiles quicker, and maybe uses less memory, although I did not test for this.

So again I was stumped, as that meant the optimization levels wouldn't matter, until I remembered that I should try it on the aarchie server! And the results were terrific.

-O2: real    3m56.909s
-O3: real    3m50.971s

I tested multiple times at both levels to make sure it wasn't just a fluke, but the results were always within +/- 1 second of their respective levels. That means switching levels provides roughly a 2% difference in speed (about 237 seconds versus 231). When compression takes this long, that is noticeable once you're working with jobs that could take hours.
  
Testing with larger files on the xerxes server showed the same speed across all optimization levels, unlike on ARM64, where a difference did show up. That means the change is an ARM64-only improvement, which is probably why it was never noticed or included in the makefile.
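If upstream ever wanted to adopt the change, the makefile could select the level per architecture. Here is a minimal GNU make sketch of my own (not from the zopfli makefile; it assumes uname -m reports aarch64 on the ARM64 machines, and note the recipe line must start with a tab):

# pick the optimization level based on the build machine's architecture
ARCH := $(shell uname -m)
ifeq ($(ARCH),aarch64)
OPTLEVEL = -O3        # measurably faster on ARM64 in my tests
else
OPTLEVEL = -O2        # upstream's default; no difference seen on x86_64
endif

zopfli: src/zopfli/*.c
	gcc src/zopfli/*.c $(OPTLEVEL) -W -Wall -Wextra -ansi -pedantic -lm -o zopfli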

What Now?

Before submitting these findings upstream, I'll have to do more research into why this happens. There must be only one or two flags in the -O3 set that help on the ARM64 platform. Once I find them, I could push an ARM64-specific makefile change so others can see the same performance boost.
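A starting point for that research: GCC can print which optimization flags each level enables, so diffing the two lists shows everything -O3 adds on top of -O2 (the exact output varies by GCC version and target):

gcc -O2 -Q --help=optimizers > o2-flags.txt
gcc -O3 -Q --help=optimizers > o3-flags.txt
diff o2-flags.txt o3-flags.txt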

Tuesday, 29 March 2016

Identifying a Possible Optimization for the Zopfli Library

The Zopfli Compression Algorithm is a compression library written in C that produces very small deflate or zlib output, at the cost of being slow. The library can only compress, not decompress.

I chose this library because its community was the largest and most active on GitHub. There are pros and cons to this. On one hand, problems get reported and fixed quickly, since posting an issue on GitHub is quick and easy. On the other hand, because the community is so active, many of the issues about improving the library are already solved and closed.

Tests:

The test data was created as a random text file on the ARM64 platform with the command:

dd if=/dev/urandom of=10mb.txt count=1M bs=10

and transferred over to the x86_64 platform so that the tests ran on the same data. I tested both a 1M and a 10M file on each platform. Here are the results, run with time ./zopfli 10mb.txt:

System     1M                 10M
x86_64     real    0m3.309s   real    0m33.970s
ARMv8      real    0m27.127s  real    4m26.325s

They weren't kidding when they said it was "slow". On the bright side, any change I make, good or bad, will be that much easier to notice.

 

Optimizations Choices:

There are three avenues I can follow, in priority order, for possible ways to optimize the library. First, in deflate.c, the AddBits function carries this TODO:

static void AddBits(unsigned symbol, unsigned length,
                    unsigned char* bp, unsigned char** out, size_t* outsize) {
  /* TODO(lode): make more efficient (add more bits at once). */
  unsigned i;
  for (i = 0; i < length; i++) {
    unsigned bit = (symbol >> i) & 1;
    if (*bp == 0) ZOPFLI_APPEND_DATA(0, out, outsize);
    (*out)[*outsize - 1] |= bit << *bp;
    *bp = (*bp + 1) & 7;
  }
}

Perhaps the loop can be rewritten to add several bits at once, as the TODO suggests — there may be something I can change to help the code along.
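Here is a rough sketch of what "more bits at once" could look like. This is my own untested variant, not upstream code; it assumes the same LSB-first bit order and the ZOPFLI_APPEND_DATA macro, and the name AddBitsFast is mine:

static void AddBitsFast(unsigned symbol, unsigned length,
                        unsigned char* bp, unsigned char** out,
                        size_t* outsize) {
  while (length > 0) {
    unsigned free_bits = 8 - *bp;  /* room left in the current byte */
    unsigned n = length < free_bits ? length : free_bits;
    if (*bp == 0) ZOPFLI_APPEND_DATA(0, out, outsize);
    /* OR the n lowest bits of symbol into the last byte at offset *bp */
    (*out)[*outsize - 1] |=
        (unsigned char)((symbol & ((1u << n) - 1u)) << *bp);
    symbol >>= n;
    length -= n;
    *bp = (*bp + n) & 7;
  }
}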

My second choice afterwards will be trying different flags for the build:

gcc src/zopfli/*.c -O2 -W -Wall -Wextra -ansi -pedantic -lm -o zopfli
As of right now, there may be other flags in the -O3 set that help some edge cases but make things unstable in other areas. This will require patience and a lot of compiling and test running; a sketch of how to automate it follows.
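A small shell loop can take some of the grind out of it. The flags below are just a few of the ones -O3 enables on top of -O2 (a sketch, reusing the 10mb.txt test file from above):

for flag in -funswitch-loops -fipa-cp-clone -ftree-loop-vectorize; do
    gcc src/zopfli/*.c -O2 $flag -W -Wall -Wextra -ansi -pedantic -lm -o zopfli
    echo "$flag:"
    time ./zopfli 10mb.txt
done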

Finally, the least enjoyable avenue would be platform-specific optimization for ARM64. No issues about it have been filed on GitHub, so this may be a platform that has been overlooked.

Monday, 14 March 2016

fipa-pure-const & fdse Compiler Options

fipa-pure-const

The -fipa-pure-const option allows the compiler to "Discover which functions are pure or constant. Enabled by default at -O and higher."

To understand pure functions there are two requirements:
  1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor can it depend on any external input from I/O devices.
  2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.
Examples of pure functions:

  • sin(x), returning the sine of a number x
  • length(s), returning the size of a string s
Impure functions are ones that do not return the same value under the same conditions, such as a function returning the day of the week, or random().

Implication

for (int i = 0; i < fibonacci(100); ++i)

with -fipa-pure-const (and the passes that use its analysis) the loop effectively becomes

for (int i = 0, end = fibonacci(100); i < end; ++i)

since fibonacci() is pure and only needs to be calculated once.
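To make this concrete, here is a small example of my own. GCC's pure attribute marks a function pure by hand; -fipa-pure-const is what discovers the property automatically for eligible functions:

#include <stddef.h>

__attribute__((pure))
static size_t my_strlen(const char* s) {
    size_t n = 0;
    while (s[n] != '\0') n++;   /* reads memory, but has no side effects */
    return n;
}

int count_spaces(const char* s) {
    int count = 0;
    /* because my_strlen is pure, the compiler may hoist the call out
       of the loop instead of re-evaluating it on every iteration */
    for (size_t i = 0; i < my_strlen(s); i++) {
        if (s[i] == ' ') count++;
    }
    return count;
}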

fdse

The -fdse option allows the compiler to "Perform dead store elimination (DSE) on RTL. Enabled by default at -O and higher."

A dead store is a value assigned to a variable that is never read by any subsequent instruction. Dead stores waste processor time and memory.

Implication


void func(int a, int b) {
    int x;
    int i = 300;
    while (i--) {
        x = a + b;   /* dead store: x is never read */
    }
}
 
Removing dead stores can have unintended consequences: a store that deliberately overwrites sensitive data can look dead to the compiler and get eliminated, leaving the data in memory.
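The classic illustration of that hazard (my own example, not from the sources below) is scrubbing a password buffer: the final memset looks like a dead store because the buffer is never read afterwards, so the compiler is allowed to delete it.

#include <string.h>

void handle_password(void) {
    char pw[64];
    /* ... read and use the password ... */

    /* pw is never read after this point, so DSE may treat this
       memset as a dead store and remove it, leaving the secret in
       memory; functions like memset_s exist to prevent exactly this */
    memset(pw, 0, sizeof pw);
}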

Sources: 

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
https://en.wikipedia.org/wiki/Pure_function
https://en.wikipedia.org/wiki/Constant_function
https://en.wikipedia.org/wiki/Dead_store

Sunday, 28 February 2016

Vectorization

When a program has to crunch numbers, how long it takes can become a bottleneck. Sometimes that can't be avoided, especially when there are simply a lot of elements to process. But there are tricks programmers can use to keep the problem at bay. One of these is vectorization: the process of rewriting a loop so that instead of processing a single element per iteration, it processes multiple elements simultaneously.

To show this in action I used the following code:


#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 1000
int main()
{
    int array1[N], array2[N];
    srand(time(NULL));
    int i;
    /* fill both arrays with random values in [0, 1000] */
    for (i = 0; i < N; i++) {
            array1[i] = rand() % 1001;
    }
    for (i = 0; i < N; i++) {
            array2[i] = rand() % 1001;
    }

    long int total = 0;
    int array3[N];
    /* element-wise sum of the two arrays; this loop can be vectorized */
    for (i = 0; i < N; i++){
        array3[i] = array1[i] + array2[i];
    }
    /* accumulate the third array into a single total */
    for (i = 0; i < N; i++){
        total += array3[i];
    }
    printf("\nThe total is: %ld\n", total);
    return 0;
}
   
The code above creates two 1000-element integer arrays and fills them with random numbers, element-wise sums the two arrays into a third array, and finally accumulates the third array into a long int and prints the result.

I compiled the code with the command line: gcc -O3 -o vect vect.c
-O3 enables the auto-vectorization options (-ftree-vectorize is turned on at this level).
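Rather than eyeballing the disassembly, GCC can also report which loops it vectorized (assuming a reasonably recent GCC):

gcc -O3 -fopt-info-vec -o vect vect.c
# prints notes such as: vect.c:20:5: note: loop vectorized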

Here is a portion of the objdump of main:


  402388: 91420222  add x2, x0, #0x80, lsl #12
  40238c: 3c407c00  ld1 {v0.2d}, [x0]
  402390: 3cdf7c21  ld1 {v1.2d}, [x1], #16
  402394: 3e61d400  fadd v0.2d, v0.2d, v1.2d
  402398: 4c9f2c00  st1 {v0.2d}, [x0], #16
  40239c: eb02001f  cmp x0, x2

Looking at the objdump, the instructions new to me are st1 and ld1. Looking them up, ld1 loads elements from memory into a vector register, while st1 stores the elements of a vector register back to memory. Between them, the fadd v0.2d, v0.2d, v1.2d adds two 64-bit lanes in a single instruction — the vectorization in action.

Algorithm Selection

During this lab, the class was tasked with comparing two separate ways of scaling a sound sample. The first multiplied each signed 16-bit integer sample by a scaling factor expressed as a floating-point number. The second used a precomputed lookup table. The class had to find out which approach is fastest, and whether the answer differs across architectures.

The following is a cut-down version of the lab code, tidied so it compiles on its own; the parts that consume the results are removed, as explained below:


#include <stdlib.h>
#include <stdint.h>

#define MAX 1550000
#define FACTOR 0.500

int16_t* sample;   /* lookup table shared with method2() */

// Method 1: calculate the result each time
int16_t method1(int16_t s, float f){
        return s * f;
}
// Method 2: use the precomputed lookup table to get the result
int16_t method2(int16_t s, float f){
        (void)f;   /* the factor is already baked into the table */
        int16_t result;
        if (s >= 0) {
                result = sample[s];
        }
        else {
                result = -sample[-s];
        }
        return result;
}

int main() {
        int16_t* meth1 = (int16_t*) malloc(MAX * sizeof(int16_t));
        int16_t* meth2 = (int16_t*) malloc(MAX * sizeof(int16_t));
        int16_t* sound = (int16_t*) malloc(MAX * sizeof(int16_t));
        int i;

        /* indexes 0..32768 are needed, since -(-32768) == 32768 */
        sample = (int16_t*) malloc(32769 * sizeof(int16_t));

        /* random signed 16-bit samples */
        for (i = 0; i < MAX; i++) {
                sound[i] = rand() % 65536 - 32768;
        }

        /* method 1: scale each sample with a float multiply */
        for (i = 0; i < MAX; i++) {
                meth1[i] = method1(sound[i], FACTOR);
        }

        /* method 2: precompute the table once, then scale by lookup */
        for (i = 0; i < 32769; i++) {
                sample[i] = i * FACTOR;
        }
        for (i = 0; i < MAX; i++) {
                meth2[i] = method2(sound[i], FACTOR);
        }
        return 0;
}
One thing to keep in mind for the timings to be meaningful is that the results have to be used in some shape or form; otherwise the optimizer notices that certain calculations go nowhere and skips them entirely. I've removed those parts from the sample code above to reduce verbosity.
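For example, the full code ends with something along these lines appended to the end of main() — my reconstruction of the idea, with <stdio.h> included, not the lab's exact code:

/* fold both result buffers into one value and print it, so the
   optimizer cannot prove the scaling loops are dead code */
long checksum = 0;
for (i = 0; i < MAX; i++) {
        checksum += meth1[i] + meth2[i];
}
printf("checksum: %ld\n", checksum);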

The results between the x86_64 system and ARMv8 system are as following:


System     Scaling      Lookup
x86_64     ~0.169 ms    ~0.166 ms
ARMv8      ~0.95 ms     ~0.44 ms

From the results above we can see that both implementations ran fastest on the x86_64 system, which presumably has stronger hardware for this workload. What's interesting to note is that on the ARMv8, the lookup performed much better than the scaling, while the x86_64 system showed hardly any difference between the two.


Sunday, 31 January 2016

Compiling a C Program

In this lab, we investigated the source code of a C program against the output of the C compiler. We started off by compiling hello.c as is.
gcc -o hello hello.c -g -O0 -fno-builtin

The options do the following:
  • -g: Enables debug information
  • -O0: Disables GCC optimizations
  • -fno-builtin: Disables the compiler's built-in function optimizations

The resulting binary file was just a few kilobytes in size. In the lab we were to recompile the code with the following changes:
  1. Add the compiler option -static
  2. Remove the compiler option -fno-builtin
  3. Remove the compiler option -g
  4. Add additional arguments to the printf()
  5. Move the printf() call to a separate function named output(), and call that function from main()
  6. Remove the -O0 and add -O3 to the gcc options

 

Add the Compiler Option -static

The -static option causes all of the program's library dependencies to be included in the binary itself: everything is compiled in along with the program's code. This makes library calls resolve faster (no dynamic symbol lookup at run time) but increases the size of the binary considerably. Without the option, the program links against shared libraries dynamically.
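The difference is easy to see by building it both ways (a sketch; exact sizes vary by system):

gcc -o hello-dynamic hello.c -g -O0 -fno-builtin
gcc -o hello-static  hello.c -g -O0 -fno-builtin -static
ls -lh hello-dynamic hello-static   # the static binary is far larger
ldd hello-dynamic                   # lists the shared libraries used
ldd hello-static                    # reports "not a dynamic executable"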

 

Remove the Compiler Option -fno-builtin

This option tells the compiler not to use its built-in optimized handling of standard library functions. Without it, the compiler changes the printf() call into puts(). The difference between the two is that printf() has to scan its format string for conversion specifiers, while puts() simply writes the string (plus a newline) without any of that parsing. Other functions used in the background are also changed, though I am unable to tell why and for what purpose; I assume the replacements are faster and more direct, just like puts() over printf().
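The substitution is easy to spot in the disassembly (a sketch; symbol names may differ slightly between systems):

gcc -o hello hello.c -g -O0
objdump -d hello | grep -E 'call.*(printf|puts)'    # shows a call to puts@plt

gcc -o hello hello.c -g -O0 -fno-builtin
objdump -d hello | grep -E 'call.*(printf|puts)'    # shows a call to printf@plt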

 

Remove the Compiler Option -g

The -g option embeds debugging information in the binary, so that a debugger can map the machine code back to the source. Without it, errors are much less human-readable when stepping through the program. The -g option makes the developer's job of figuring things out easier, but the public binary should not be released with it.

 

Add Additional Arguments to the printf()

When additional arguments are added to printf(), the first few are loaded into registers while the rest are pushed onto the stack, depending on the platform's calling convention. In the x86_64 disassembly below, the format string and the first five numbers go into registers (edi, esi, edx, ecx, r8d, r9d) and the remaining six are stored to the stack.

int main() {
  400500:       55                      push   %rbp
  400501:       48 89 e5                mov    %rsp,%rbp
  400504:       48 83 ec 30             sub    $0x30,%rsp
    printf("Hello World!\n%d%d%d%d%d%d%d%d%d%d%d", 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100);
  400508:       c7 44 24 28 64 00 00    movl   $0x64,0x28(%rsp)
  40050f:       00
  400510:       c7 44 24 20 5a 00 00    movl   $0x5a,0x20(%rsp)
  400517:       00
  400518:       c7 44 24 18 50 00 00    movl   $0x50,0x18(%rsp)
  40051f:       00
  400520:       c7 44 24 10 46 00 00    movl   $0x46,0x10(%rsp)
  400527:       00
  400528:       c7 44 24 08 3c 00 00    movl   $0x3c,0x8(%rsp)
  40052f:       00
  400530:       c7 04 24 32 00 00 00    movl   $0x32,(%rsp)
  400537:       41 b9 28 00 00 00       mov    $0x28,%r9d
  40053d:       41 b8 1e 00 00 00       mov    $0x1e,%r8d
  400543:       b9 14 00 00 00          mov    $0x14,%ecx
  400548:       ba 0a 00 00 00          mov    $0xa,%edx
  40054d:       be 00 00 00 00          mov    $0x0,%esi
  400552:       bf 00 06 40 00          mov    $0x400600,%edi
  400557:       b8 00 00 00 00          mov    $0x0,%eax
  40055c:       e8 7f fe ff ff          callq  4003e0 <printf@plt>
}
  400561:       c9                      leaveq
  400562:       c3                      retq
  400563:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40056a:       00 00 00
  40056d:       0f 1f 00                nopl   (%rax)

 

Move the printf() Call to a Separate Function Named output(), and Call that Function From main()

Moving the function for this is straightforward. All that seems to happen is that the output() function gets its own section of code that main calls into.
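For reference, the modified source presumably looked something like this (my reconstruction):

#include <stdio.h>

void output(void) {
    printf("Hello World!\n");
}

int main(void) {
    output();
    return 0;
}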

 

Remove the -O0 and Add -O3 to the GCC options

Any -OX option enables optimization of the program, with X being the level: -O0 is minimal optimization and -O3 is the most aggressive. -O3 has the potential to break your program, as it goes above and beyond to optimize it. From -O0 to -O3, different options are turned on automatically for the developer, but each can also be enabled individually for finer control and choice.

Here are the important differences in main (-O3 vs -O0):
int main() {
    printf("Hello World!\n");
  400410:       bf a0 05 40 00          mov    $0x4005a0,%edi
  400415:       31 c0                   xor    %eax,%eax
  400417:       e9 c4 ff ff ff          jmpq   4003e0 <printf@plt>
and
int main() {
  400500:       55                      push   %rbp
  400501:       48 89 e5                mov    %rsp,%rbp
    printf("Hello World!\n");
  400504:       bf b0 05 40 00          mov    $0x4005b0,%edi
  400509:       b8 00 00 00 00          mov    $0x0,%eax
  40050e:       e8 cd fe ff ff          callq  4003e0 <printf@plt>
}
  400513:       5d                      pop    %rbp
  400514:       c3                      retq
  400515:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40051c:       00 00 00
  40051f:       90                      nop

As you can see, -O3 accomplishes the same task with fewer instructions than -O0: it skips setting up a stack frame entirely and replaces the call-and-return with a direct jmpq (a tail call) to printf.