April 23, 2009

Efficient Handling of Bitset Operations in C++, Pt. 2

In Part 1 of this series, we reviewed options for handling sets of bits from C++. In the conclusion, we'll review the pros and cons of BitMagic.

Unfortunately BitMagic does not use the autotools to build it's example programs. The changes to a few files shown below will allow you to build most of the sample programs that come with BitMagic on a 64-bit Linux machine using gcc. Once these changes are in place you should be able to type make rebuild to create most of the sample programs.

// canon-system
echo intel-linux-gnu
exit 0;;
// platforms/intel-linux-gcc.mk
CXX = g++ -Wall
CC = gcc -Wall
// makefile.in

A major downside for BitMaigc is that the compiler intrinsics it uses for its SSE2 bitset implementation are not compatible with gcc. This is not a show stopper because there is an implementation in BitMagic that does not use SSE2 intrinsics. So you end up with the ability to compile code that uses BitMagic with gcc or another compiler and turn on and off the SSE2 intrinsics bitset version depending on if the current compiler supports them. When using gcc, you'll have to rely on the gcc compiler itself to try to optimize the bitset using SIMD instructions. The process of gcc creating SIMD code from C/C++ is called automatic vectorization.

I'll benchmark the three implementations of the same program shown above which all use different implementations of bitsets. These are: std::bitset, boost::dynamic_bitset, and BitMagic. For each implementation I'll compile three binaries, one without compiler optimization, one with -O3 and automatic vectorization enabled (the default for optimization level 3), and the later with -O3 and automatic vectorization turned off.

As I am benchmarking, I'll compile all versions with -mtune=native to make sure that gcc takes advantage of the instructions of the CPU I am using: an Intel Q6600. It can be informative to use the -ftree-vectorizer-verbose=1 option when compiling so that gcc tells you when and where it has vectorized execution.

Shown below is a plot of the std::bitset, boost::dynamic_bitset, and BitMagic for a bitset of 100000 bits, with a density of 50%, using 50, 100, 500 and 1000 thousand iterations of a bit intersection. As you would expect, the performance scales linearly with the number of iterations performed. The first standout is that BitMagic-opt (The -O3 compiled version using BitMagic) is actually slower than the unoptimized std::bitset binary. There is quite a difference between the runtimes of the std and boost binaries when no optimization is used. The two fastest implementations were the boost and std using optimization.


There are some differences in the runtimes of the two optimized binaries with and without automatic vectorization. A graph is less instructive as the differences tend to be smaller, for example, 3.73 seconds to 3.84 seconds for the std implementation with and without auto vectorization respectively. The same figures are 5.67 to 5.51 for boost::dynamic_bitset with and without auto vectorization respectively.

Of the three bit operations, AND, OR and XOR, there are some minor differences in benchmark times depending on which bitset is used. Comparing the optimized, auto vectorized binaries for 1,000,000 iterations, the boost bitset required 5.67 seconds to process logical ANDs with only 5.53 seconds for OR and XOR. Less than 0.03 seconds separates the three operations on a BitMagic bitset. And contrary to the boost bitset, the std bitset was faster for logical ANDs at 3.73 seconds apposed to about 3.81 for the other operations.

Shown below is the performance as the size of the bitset increases. Note that when using the std::bitset, multiple binaries have to be made in order to adjust the size of the bitset. The most interesting thing is that the boost implementation becomes the slowest as the size of the bitset approaches a million bits. Both of the other bitsets remain relatively linear in terms of size to time ratio.


I took two larger data points to see how the boost implementation went from a million on. Shown in the graph below is the same dataset scaling up to bitsets of size 10 million bits. None of the implementations handle this size particularly well or even close to linearly. The best is the std::bitset which goes from 4.7 to 88 seconds when the bitset is increased from 1 to 10 million bits. At least the boost implementation does not continue to skyrocket as much as it did leading up to a million bits.



When using gcc, the boost::dynamic_bitset is a great choice -- you get the flexibility of being able to choose the bitset size at runtime and do not loose a great deal of performance over std::bitset when you want a bitset of less than about a half million bits in size. Unfortunately I was unable to test the explicit SSE2 mode of BitMagic because it does not work against gcc and other compilers for which the intrinsic do work for are extremely expensive if you are purchasing the compiler just to publish benchmark results.