February 13, 2008

Take advantage of multiple CPU cores during file compression

Author: Ben Martin

With the number of CPU cores in desktop machines moving from two to four and soon eight, the ability to execute computationally expensive tasks in parallel is becoming more important. The mgzip tool can take advantage of multiple CPU cores during file compression, while pbzip2 uses multiple cores for both compression and decompression.

A file compressed with pbzip2 can be decompressed with the standard bzip2 program, while one compressed with mgzip can be decompressed with the standard gunzip utility. You can find pbzip2 packages for Fedora 8 in the standard repositories and install it using yum install pbzip2. However, mgzip doesn't appear to be packaged for many distributions, so you must install it from source.

When building mgzip you might discover that it fails to compile because the zlib header defines gz_header and that variable is used by mgzip.c to contain the hex values of a valid gzip archive. You can fix this easily by adding a prefix to the gz_header variable and the few references to it in mgzip.c. It doesn't really matter what prefix you use, as long as it makes a valid C identifier that is different from gz_header; for example, mgzip_gz_header.

$ make
gcc -g -O2 -c -o mgzip.o mgzip.c
mgzip.c:40: error: 'gz_header' redeclared as different kind of symbol
/usr/include/zlib.h:124: error: previous declaration of 'gz_header' was here
mgzip.c: In function 'compress_infile_to_outfile':
mgzip.c:530: warning: cast to pointer from integer of different size
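
The rename described above can be scripted. What follows is a minimal sketch, assuming GNU sed and assuming that mgzip.c refers to gz_header only as its own variable. The function name rename_gz_header and the replacement name mgzip_gz_header are illustrative choices, not part of mgzip.

```shell
# Rename mgzip's own gz_header variable so it no longer collides with zlib.h.
# Assumes GNU sed (for \b) and that mgzip.c never uses zlib's gz_header type.
rename_gz_header() {
    # $1: path to the C source file to patch; a backup is kept as "$1.orig"
    sed -i.orig 's/\bgz_header\b/mgzip_gz_header/g' "$1"
}

# Hypothetical usage from the mgzip source directory:
#   rename_gz_header mgzip.c && make
```

The word boundaries keep longer identifiers such as zlib's gz_headerp untouched, and the .orig backup makes it easy to undo the change.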

Running the utilities

Both tools offer command-line switches similar to those of the non-parallel versions that you're already familiar with. Some options are missing; for example, mgzip does not offer the --recursive option that gzip has.

One caveat with pbunzip2 is that it will only use multiple cores if the bzip2 compressed file was created with pbzip2. This is because pbzip2 compresses a file in pieces that can be decompressed in parallel. This means that if you download linux-2.6.23.tar.bz2 from a kernel source mirror, you will only be able to use a single CPU core to decompress it. Since the quoted size increase of using multiple pieces for pbzip2 is very small (less than 0.2%, from pbzip2's manual), it would be nice if the main bzip2 program would default to creating pieces as well in the future to produce bzip2 files that are more friendly to multicore downloaders.
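
To see why pieces help, note that bzip2 accepts several compressed streams joined end to end as one valid file. The sketch below only illustrates that idea and is not pbzip2's actual implementation: the function name compress_in_pieces and the piece size are assumptions, and the pieces are compressed one after another rather than in parallel.

```shell
# Compress a file as independently compressed pieces joined end to end.
# Illustration only: a parallel tool would hand the pieces to separate cores.
compress_in_pieces() {
    # $1: input file, $2: output file, $3: piece size in bytes (for split -b)
    workdir=$(mktemp -d) || return 1
    split -b "$3" "$1" "$workdir/piece."
    : > "$2"
    for piece in "$workdir"/piece.*; do
        [ -e "$piece" ] || continue          # empty input produces no pieces
        bzip2 -9 -c "$piece" >> "$2"         # each piece is its own bzip2 stream
    done
    rm -rf "$workdir"
}

# Hypothetical usage:
#   compress_in_pieces linux-2.6.23.tar linux-2.6.23.tar.bz2 900000
```

The result can be unpacked with the standard tool, for example with bzip2 -dc, and should be byte-for-byte identical to the input. Because each piece is self-contained, a decompressor is free to work on several of them at once.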

Examples of simple use of both tools are shown below. They should be familiar to anyone who has used bzip2 and gzip before. As you can see from the sizes of the compressed files, pbzip2 comes much closer than mgzip does to matching the output size of its non-parallel counterpart. The pbzip2 output is 0.46% larger than the output of bzip2, while the mgzip output at the default level is about 21% larger than the output of gzip.

$ bunzip2 linux-2.6.23.tar.bz2
$ gzip -c linux-2.6.23.tar > linux-2.6.23.tar.gzip
$ mgzip -c linux-2.6.23.tar > linux-2.6.23.tar.mgzip
$ ls -lh
-rw-r----- 1 ben ben 253M 2008-01-19 18:55 linux-2.6.23.tar
-rw-rw-r-- 1 ben ben 56M 2008-01-19 18:57 linux-2.6.23.tar.gzip
-rw-rw-r-- 1 ben ben 67M 2008-01-19 18:57 linux-2.6.23.tar.mgzip

$ gunzip -c linux-2.6.23.tar.mgzip > linux-2.6.23.tar.mgzip-gunzip
$ md5sum linux-2.6.23.tar.mgzip-gunzip linux-2.6.23.tar
853c87de6fe51e57a0b10eb4dbb12113 linux-2.6.23.tar.mgzip-gunzip
853c87de6fe51e57a0b10eb4dbb12113 linux-2.6.23.tar

$ bzip2 -c -k -9 linux-2.6.23.tar > linux-2.6.23.tar.bzip2
$ pbzip2 -c -k -9 linux-2.6.23.tar > linux-2.6.23.tar.pbzip2

$ ls -lh
-rw-r----- 1 ben ben 253M 2008-01-19 18:55 linux-2.6.23.tar
-rw-rw-r-- 1 ben ben 56M 2008-01-19 18:57 linux-2.6.23.tar.gzip
-rw-rw-r-- 1 ben ben 67M 2008-01-19 18:57 linux-2.6.23.tar.mgzip
-rw-rw-r-- 1 ben ben 44M 2008-01-19 19:03 linux-2.6.23.tar.bzip2
-rw-rw-r-- 1 ben ben 44M 2008-01-19 19:01 linux-2.6.23.tar.pbzip2

$ ls -l
-rw-r----- 1 ben ben 264704000 2008-01-19 18:55 linux-2.6.23.tar
-rw-rw-r-- 1 ben ben 45488158 2008-01-19 19:03 linux-2.6.23.tar.bzip2
-rw-rw-r-- 1 ben ben 57928789 2008-01-19 18:57 linux-2.6.23.tar.gzip
-rw-rw-r-- 1 ben ben 69968799 2008-01-19 18:57 linux-2.6.23.tar.mgzip
-rw-rw-r-- 1 ben ben 45695449 2008-01-19 19:01 linux-2.6.23.tar.pbzip2
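
The 0.46% figure quoted earlier follows directly from the byte counts in this listing. As a small worked example, the hypothetical helper size_overhead below (not part of either tool) computes how much larger the parallel tool's output is than the serial tool's output.

```shell
# Percentage by which a parallel tool's output exceeds the serial tool's output.
size_overhead() {
    # $1: size in bytes from the serial tool, $2: size from the parallel tool
    awk -v base="$1" -v par="$2" 'BEGIN { printf "%.2f\n", (par - base) * 100 / base }'
}

size_overhead 45488158 45695449   # bzip2 vs pbzip2: prints 0.46
size_overhead 57928789 69968799   # gzip vs mgzip:   prints 20.78
```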

Because mgzip can read the data to be compressed from stdin, you can pipe an uncompressed tar file to it. A major drawback to the currently available version of pbzip2, however, is that input to the utility cannot come from stdin or a pipe. This means that you need to create a real tar file before you can compress it with pbzip2. Shown below are commands to extract a tarball compressed with pbzip2 using multiple CPU cores and a method of compressing a tar file with pbzip2.

$ pbunzip2 -c /tmp/test/linux-2.6.23.tar.pbzip2 | tar xvf -

$ tar cvf linux-2.6.23.tar linux-2.6.23
$ pbzip2 -9 linux-2.6.23.tar

$ tar cvf - linux-2.6.23 | mgzip > linux-2.6.23.tar.gz

I ran some benchmarks to test the performance gain I could get with these parallel compression tools. I tested them on an Intel Q6600 2.4GHz quad core machine, using the linux-2.6.23.tar kernel source file, which I picked for its availability and because source tar files are likely to be relevant to many Linux.com readers.

Comparisons of the compressed file sizes and times to compress are revealing. At its default compression level, mgzip was substantially faster than gzip, but also produced an output file that was quite a bit larger than gzip's. With -9 compression, mgzip is only twice as fast as gzip on a quad core machine, but the compressed file is much closer in size to what gzip would produce. For both tests, pbzip2 produced an output file that was similar in size to what bzip2 would make. For bzip2 -9 level compression, pbzip2 took reasonable advantage of four cores, requiring only 31% of the time that bzip2 needed.
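
If you want to run a comparison like this on your own hardware, a small timing wrapper is enough. This is a rough sketch rather than the harness used for the figures above: the function name elapsed_seconds is an assumption, it relies on date +%s, and it only resolves whole seconds.

```shell
# Report how many whole seconds a command takes to run.
elapsed_seconds() {
    # "$@": the command to time; its standard output is discarded
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical comparison (requires both tools to be installed):
#   elapsed_seconds bzip2 -9 -c linux-2.6.23.tar
#   elapsed_seconds pbzip2 -9 -c linux-2.6.23.tar
```

Dividing the second result by the first gives the fraction of the serial time that the parallel tool needs on your machine.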

Decompression times are also interesting. Both decompression runs were performed on the output of pbzip2 in order to ensure that multicore decompression was possible.

With mgzip and pbzip2 you can take advantage of all your CPU cores to shorten compression and decompression times. This obviously has the largest impact when you are waiting for an archive to decompress before you can proceed with another task. Using the normal bzip2 you would have effectively wasted three of four cores on a Q6600 quad core machine during (de)compression operations. You might also set up a cron job to recompress bzip2 files downloaded from the Internet to pbzip2 format so that when the time comes to expand one of them later, the work can be spread across your cores.
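
The recompression idea could look something like the sketch below. It is only an outline under stated assumptions: the function name recompress_bz2 is hypothetical, the file is decompressed to a temporary file first because pbzip2 cannot read from a pipe, and the -9 and -c switches are the same ones used in the examples earlier.

```shell
# Recompress one .bz2 file so that it can later be unpacked on several cores.
recompress_bz2() {
    # $1: existing .bz2 file; $2: compressor to use (default: pbzip2)
    packer=${2:-pbzip2}
    case "$1" in *.bz2) ;; *) return 1 ;; esac
    # pbzip2 needs a real file as input, so unpack next to the original first
    plain=$(mktemp "$1.plain.XXXXXX") || return 1
    if bzip2 -dc "$1" > "$plain" && "$packer" -9 -c "$plain" > "$1.new"; then
        mv "$1.new" "$1"      # replace the original only after success
        status=0
    else
        rm -f "$1.new"
        status=1
    fi
    rm -f "$plain"
    return "$status"
}

# Hypothetical use from a script run by cron:
#   for f in "$HOME"/Downloads/*.bz2; do recompress_bz2 "$f"; done
```

The original archive is only replaced once the new one has been written completely, so an interrupted run leaves the download intact.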

