Thursday, April 21, 2016

Trying to get my changes upstream

As noted before, my optimization was to change the compiler flag from -O2 to -O3, which increased the speed by 11% on x86_64. During the test phase of this optimization I changed the compiler flags in the Makefiles themselves. If I wanted this change committed upstream, I'd have to find the place where the Makefiles are generated.

That would normally be the configure script. However, I could not find where the flags were set for the Makefiles, which I thought was very strange, because I am quite sure that when you configure and make the program, it compiles with the -O2 flag.

I used grep to find files where -O2 is used, and the only file it found was an instructions file describing how you can manually add -O2 while configuring, not a place where it is set as a default.
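For reference, this is roughly the search I used (my own reconstruction; the exact invocation may have differed):
grep -rn -- "-O2" .
The -- keeps grep from treating -O2 as a command-line option, and -rn searches recursively and prints line numbers.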

Then I tried using grep to find CFLAGS and where it gets defined. What I discovered is that the build uses a pthread helper that figures out the right flags for compilation (https://manned.org/pthread-config/b1cf9868). I quote this from the md5deep sources:
#   This macro figures out how to build C programs using POSIX threads. It
#   sets the PTHREAD_LIBS output variable to the threads library and linker
#   flags, and the PTHREAD_CFLAGS output variable to any special C compiler
#   flags that are needed.
I did not know how to manipulate the pthread configuration so that it would always use -O3. I did read in their instructions that there was an email address in the README file for questions or remarks on the compiler options, but it was not actually in that file, so I could not contact them personally either.
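For what it's worth, an autoconf-generated configure script normally lets you override the compiler flags for your own build on the command line, roughly like this (a sketch of the usual autoconf convention; this may well be what that instructions file describes):
./configure CFLAGS="-O3"
That only changes my local build, though, not the default the project ships with, which is what I wanted to fix.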

On that note, I'm sad to share that I could not get the right flags set in the configure step. This means I could not submit anything to the GitHub project, because I also could not contact the maintainers to ask for help or an explanation.

Sunday, April 10, 2016

Compiling Optimization Betty

After benchmarking x86_64 with the -O3 optimization flag, it's time to test this on AArch64.
Since I can only work from the command line on the server, I needed a command to replace -O2 with -O3 in all the Makefiles. The easiest one I found was the following:
find . -name Makefile | xargs sed -i 's/-O2/-O3/'
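To double-check that the substitution took, a recursive grep over the Makefiles should now turn up -O3 instead of -O2 (a quick sanity check, assuming GNU grep):
grep -rn --include=Makefile -- "-O3" .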
The following benchmarks were done on AArch64, on the Betty server.

With -O2 flag
         10.5 MB file   105 MB file   1.5 GB file
real     0m0.037s       0m0.345s      0m4.792s
user     0m0.028s       0m0.323s      0m4.551s
sys      0m0.001s       0m0.027s      0m0.400s

With -O3 flag
         10.5 MB file   105 MB file   1.5 GB file
real     0m0.036s       0m0.343s      0m4.768s
user     0m0.028s       0m0.323s      0m4.499s
sys      0m0.001s       0m0.027s      0m0.426s

As you can see, -O3 did next to nothing on AArch64. I thought this was very strange and checked the executable to see if it had changed at all. The file did change: it got larger, as expected. Yet there is no gain in speed. Comparing the real times for the 1.5 GB file, it wasn't even 1% faster. So for AArch64 I recommend staying with -O2, because -O3 barely changes the run time and the binary stays smaller.
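For reference, using the real times of the 1.5 GB runs above:
(4.792 - 4.768) / 4.792 ≈ 0.005, so only about 0.5% faster.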

For Betty I'll have to find another optimization opportunity, although I don't yet know what that would be. It would probably mean writing something specifically for AArch64, and that would cost far more time.

Another option is to add the compiler flag -fopt-info-missed to report missed optimizations and see if I can do something about them. Source: https://gcc.gnu.org/onlinedocs/gccint/Dump-examples.html
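A minimal sketch of how that flag could be tried on a single source file (the file name is only an example, and leaving off =missed.txt sends the report to stderr instead):
gcc -O3 -fopt-info-missed=missed.txt -c sha1.c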


Compiling Optimization x86_64

As mentioned in the previous post I am going to take a look at the compiler flags and how I might be able to optimize them.

The current flags used for the C compiler are:
-pthread -g -pg -O2 -MD -D_FORTIFY_SOURCE=2 -Wpointer-arith -Wmissing-declarations -Wmissing-prototypes -Wshadow -Wwrite-strings -Wcast-align -Waggregate-return -Wbad-function-cast -Wcast-qual -Wundef -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmissing-format-attribute -Wmultichar -Wc++-compat -Wmissing-noreturn -funit-at-a-time -Wall -Wstrict-prototypes -Weffc++
As you can see, there are a lot of flags used to compile this program with the C compiler. What caught my eye was that they're using the -O2 flag and not -O3, so I saw this as an optimization opportunity and went straight in to change it. Once I had changed all the flags, I ran another benchmark of the program to see if it had helped at all.

To check that the change took effect, I looked at the executable file. The -O3 optimization made it larger: it went from 2.7 MB to 3.5 MB.

Note: don't forget to take out the -pg flag as well to get a correct benchmark. -pg makes your program slower, so with it you won't get a fair comparison.
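Stripping it out of every Makefile can be done with a find/sed one-liner (a sketch, assuming -pg always appears preceded by a space in the flag lines):
find . -name Makefile | xargs sed -i 's/ -pg//'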

The following benchmarks are for x86_64

With -O2 flag
         10.5 MB file   105 MB file   1.5 GB file
real     0m0.053s       0m0.287s      0m3.458s
user     0m0.021s       0m0.200s      0m2.793s
sys      0m0.003s       0m0.012s      0m0.184s

With -O3 flag
         10.5 MB file   105 MB file   1.5 GB file
real     0m0.045s       0m0.258s      0m3.071s
user     0m0.020s       0m0.197s      0m2.756s
sys      0m0.002s       0m0.014s      0m0.177s

By comparing the real times you can see that the -O3 flag makes the program about 11.2% faster, which is a pretty nice optimization. So I recommend they replace the -O2 flag with -O3.
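That figure comes straight from the real times of the 1.5 GB runs:
(3.458 - 3.071) / 3.458 ≈ 0.112, so roughly 11.2% faster.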

Now it's time to benchmark the software on Betty and hope to see an improvement there as well.


G-Profiling

To enable gprof profiling you need to add the -pg flag to the compiler options. These options can be found in the Makefile; just look for the variables CFLAGS, CPPFLAGS, and LDFLAGS. Append the -pg flag behind these and profiling should be enabled.

At first I couldn't figure out why my software wasn't producing the gmon.out file, which a program compiled with -pg writes when it runs and which gprof then reads. I had added -pg to all the flags in the top-level Makefile, but apparently there were more Makefiles where I needed to change the flags. There were 6 Makefiles in total, so I added -pg to the compiler flags in all of them.
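In hindsight, patching all of them can be done in one go with something like this (a sketch, assuming each Makefile assigns the variables as NAME = ... at the start of a line); at the time I just edited them by hand:
find . -name Makefile | xargs sed -i 's/^\(CFLAGS\|CPPFLAGS\|LDFLAGS\) =/& -pg/'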

Luckily this did generate the gmon.out file so I could take a look at the profiling by running the following command:
gprof md5deep > gprof.txt
This generates a readable file containing the profile. The strange thing was that my gprof output said every function took 0 seconds, even though one of them was called 8 times. Most likely the calls were simply too short to register: gprof samples at a fairly coarse granularity (typically around 0.01 s), so functions that finish faster than that show up as taking no time.

The function that gets called 8 times is hash_final_sha1, which looks like this:
void hash_final_sha1(void *ctx, unsigned char *sum)
{
    sha1_finish((sha1_context *) ctx, sum);
}
Since it's a one-liner, there isn't much I can optimize here, but it does call other functions I can take a look at. I went through the chain of functions calling each other until I found one that actually does work itself instead of just forwarding to another function. I ended up at the following function:
void sha1_update( sha1_context *ctx, const unsigned char *input, size_t ilen )
{
    size_t fill;
    unsigned long left;

    if( ilen <= 0 )
        return;

    /* bytes already buffered from a previous call, and room left in the 64-byte block */
    left = ctx->total[0] & 0x3F;
    fill = 64 - left;

    ctx->total[0] += (unsigned long) ilen;
    ctx->total[0] &= 0xFFFFFFFF;

    if( ctx->total[0] < (unsigned long) ilen )
        ctx->total[1]++;

    /* top up a partially filled internal buffer and hash it as one 64-byte block */
    if( left && ilen >= fill )
    {
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, fill );
        sha1_process( ctx, ctx->buffer );
        input += fill;
        ilen  -= fill;
        left = 0;
    }

    /* hash as many full 64-byte blocks as possible straight from the input */
    while( ilen >= 64 )
    {
        sha1_process( ctx, input );
        input += 64;
        ilen  -= 64;
    }

    /* keep any remaining bytes buffered for the next call */
    if( ilen > 0 )
    {
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, ilen );
    }
}
Looking at it, I could not find an obvious way to optimize this function. Since it took a while to find a suitable function and I couldn't find an optimization in it, I'm going to look into the compiler flags next: they use the -O2 flag, and that should be possible to change to -O3.

Tuesday, March 29, 2016

MD5deep focus area

My first thought for the focus area was implementing a way for aarch64 to be recognized correctly, since I ran into trouble with that architecture. But the latest config files I had to download already fix that issue and are dated 2016-03-24, so that is pretty recent.

Since I'm more familiar with code than all the configuration and make files, I decided to take a better look at the source code.

MD5deep benchmarking betty

Now that I'm done testing and benchmarking on my own x86_64 Fedora system, I need to test md5deep on an AArch64 system. I transferred the files I used in the x86_64 benchmark to the Betty server using scp and sftp. Once the test files were transferred, I moved the tar.gz file of md5deep over as well so I could try to install it on Betty. Here's where I ran into some issues...

The normal way to install md5deep is to unpack it and run the following commands:
sh bootstrap.sh
./configure
make
sudo make install
Sadly it stopped at the configure step: it could not recognize which build to use, since it did not know aarch64. Luckily the error pointed me to a website where I could download the most recent config files, which do include aarch64 (ftp://ftp.gnu.org/pub/gnu/config/). So after some more use of sftp, the configure step finally worked and I could run make and make install.
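Roughly what that fix boiled down to, sketched from memory (config.guess and config.sub are the files hosted on that FTP site; I copied them over with sftp, and md5deep-src below is just a placeholder for the unpacked source directory):
cp config.guess config.sub md5deep-src/
cd md5deep-src
./configure
make
sudo make install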

On to the benchmarking. Just like on x86_64, I ran the md5deep program 100 times on 3 different file sizes. The results are as follows:
         10.5 MB file   105 MB file   1.5 GB file
real     0m0.037s       0m0.345s      0m4.792s
user     0m0.028s       0m0.323s      0m4.551s
sys      0m0.001s       0m0.027s      0m0.400s
If you only have one file md5deep can hash it pretty quickly. Even if you hash entire directory structures you wouldn't have to wait too long.

MD5deep benchmarking x86_64

So far I only have md5deep installed on my own Fedora workstation, where I first made sure I could get it working. Now that it works, I can benchmark it on my laptop; once that is done I will do the same on Betty.

My fedora specifications are:
Memory: 2.9 GiB
Processor: Intel Core i7-3630QM CPU @ 2.40GHz
Machine architecture: x86_64

To benchmark this program I need a few files of different sizes. With the following command I can create a file filled with random data, roughly 100 MB in size.
dd if=/dev/random of=test100m bs=1M count=100 iflag=fullblock
After running this command with different sizes I ended up with a 105 MB test file and a 10.5 MB one (dd's bs=1M counts in MiB, so 100 MiB works out to roughly 105 MB).
The following command runs md5deep on a file and shows how long it took:
time md5deep test100m
To get a presentable benchmark you have to run this command many times and take the average. Instead of manually running it 100 times, I wrote the following loop to do it for me:
time for i in {1..100}; do md5deep test100m; done
The time reported is the total for all 100 runs, so it has to be divided by 100 to get the average time of a single run (for example, a total real time of 28.7 s works out to about 0.287 s per run). The averages I got were the following:
10.5 MB file
real: 0m0.053s
user: 0m0.021s
sys: 0m0.003s
105 MB file
real: 0m0.287s
user: 0m0.200s
sys: 0m0.012s

After seeing that these files still get hashed pretty quickly, I decided to try a bigger file to see how long that would take. I downloaded the Fedora Workstation ISO, which is 1.5 GB, to test with.
1.5 GB file
real: 0m3.458s
user: 0m2.793s
sys: 0m0.184s
It still doesn't take very long to hash a 1.5 GB file, so this program is pretty fast.