June 15, 2005

64-bit performance in Gentoo Linux

Author: Jem Matzan

Many people wonder if 64-bit GNU/Linux offers any kind of performance increase over 32-bit. We've already covered the theoretical advantages and some of the pitfalls of using a totally 64-bit AMD64 system. Now it's time to measure performance.

I tested Gentoo Linux 2005.0 for x86 and AMD64 because it is customizable enough for benchmarking and can be reduced to only the software and services that I needed for performance testing. Other 64-bit OSes are available and would probably work well, but by using Gentoo I knew that I could compile (or recompile) the entire operating environment with the most appropriate options. Previously I did some performance testing on FreeBSD for AMD64, but used different hardware and test criteria.

I ran benchmark tests covering database performance with MySQL and Super Smack; encryption performance with OpenSSL; 3D rendering performance with Unreal Tournament 2004; and compiler speed with a timed compile of X.org.

My test system hardware was a workstation equipped with:

  • An Athlon 64 4000+ processor
  • MSI K8T Neo2-FIR motherboard
  • Corsair TwinX LL 1024MB set (two tested 512MB modules)
  • Seagate SATA-V 160GB hard drive connected to the VIA SATA RAID chip
  • Albatron Nvidia GeForce FX5700 Ultra3 (128MB DDR3 video RAM)

The software was Gentoo Linux 2005.0 using the Universal ISOs. I performed a stage 3 installation with no USE flags and the compiler options set to -pipe -O2 -fomit-frame-pointer, plus -march=pentium4 for the 32-bit build and -march=k8 for the 64-bit build. I used the Pentium 4 option for the 32-bit build on the Athlon 64 because the processor has the same technologies (SSE, SSE2, MMX) that the Pentium 4 option provides. This could enhance performance in some tests, as the AMD-specific architectures below the K8 do not include SSE2. I also set the MAKEOPTS variable to -j2, which increases processor load by running two make jobs in parallel when compiling.

I disabled CPU frequency scaling in the kernel for both setups. The drivers for all of the system's hardware were built into the kernel, with the exception of the video driver. I installed the Nvidia driver version 1.0.6629-r4 from Portage for both architectures.

I ran all the benchmark tests from the command line except for Unreal Tournament 2004, which required an X server. For testing this, I installed Fluxbox as a low-overhead window manager to work from.

It's important to note that this benchmarking project measures the performance of the software, not the hardware, so the software setup necessarily differs between the two test cases. In a hardware performance comparison, the software in each test case must remain the same, or as similar as possible, to eliminate software variables. When comparing 32-bit and 64-bit performance on the same hardware, the situation is just the opposite: the hardware must be the same and the software must change. The distribution and the software versions may be identical between the two test cases, but the compiler will behave differently and compile in different options and features for each architecture. In some cases, the 64-bit tests will not produce results that come close to the theoretical limits of the hardware, because the AMD64 architecture has been available for only a short time compared with 32-bit x86, which has had more than a dozen years of performance optimization.

OpenSSL speed test

OpenSSL is responsible for the bulk of the Internet's daily data encryption and decryption. It implements many different algorithms for a variety of data types and applications. Some algorithms are more CPU- and memory-intensive than others, and, doubtless, some have been tweaked for better performance on the x86 architecture.

My benchmark command was openssl speed > openssl.txt. You may notice in the configuration options at the top of the results that OpenSSL has been compiled slightly differently for each architecture. This does not invalidate the results, as both tests were run using the default, unadorned configuration that Gentoo Linux provides. Here are the results, with 32-bit listed first:

OpenSSL 0.9.7e 25 Oct 2004
built on: Thu May 26 12:20:08 EST 2005
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: i686-pc-linux-gnu-gcc -fPIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -DL_ENDIAN -DTERMIO -Wall -O2 -march=pentium4 -pipe -fomit-frame-pointer -Wa,--noexecstack -DSHA1_ASM -DMD5_ASM -DRMD160_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
md2 1972.43k 4168.21k 5798.06k 6426.28k 6635.52k
mdc2 5760.49k 6564.10k 6793.22k 6858.07k 6850.25k
md4 19628.92k 68563.37k 196675.16k 368591.19k 495938.22k
md5 17130.05k 59096.75k 169058.56k 314997.76k 425473.37k
hmac(md5) 23604.52k 74008.53k 197261.27k 339661.82k 430830.93k
sha1 17479.47k 56888.85k 145963.09k 239197.18k 294557.01k
rmd160 14445.16k 42099.41k 93096.19k 133608.11k 153138.52k
rc4 191472.19k 213686.31k 218618.20k 221005.82k 221640.02k
des cbc 56039.15k 61279.66k 62657.19k 63102.29k 63255.89k
des ede3 20636.86k 21323.07k 21541.21k 21596.84k 21613.23k
idea cbc 42682.09k 45844.16k 46854.31k 47114.92k 47188.65k
rc2 cbc 21029.18k 21966.81k 22171.82k 22229.67k 22249.47k
rc5-32/12 cbc 177728.25k 199504.81k 204492.71k 206903.30k 206955.50k
blowfish cbc 95399.50k 101648.68k 103123.63k 103719.25k 103890.94k
cast cbc 46513.56k 49863.98k 50752.00k 51018.41k 51079.85k
aes-128 cbc 52871.24k 53593.86k 54562.05k 54757.72k 54790.83k
aes-192 cbc 46108.03k 46761.24k 47343.02k 47491.07k 47529.98k
aes-256 cbc 40442.60k 41259.07k 41572.19k 41811.63k 41842.01k
sign verify sign/s verify/s
rsa 512 bits 0.0005s 0.0000s 2161.0 24897.0
rsa 1024 bits 0.0020s 0.0001s 489.5 9602.4
rsa 2048 bits 0.0114s 0.0003s 87.8 3073.1
rsa 4096 bits 0.0729s 0.0011s 13.7 892.3
sign verify sign/s verify/s
dsa 512 bits 0.0003s 0.0004s 2886.5 2368.8
dsa 1024 bits 0.0010s 0.0012s 1052.0 858.8
dsa 2048 bits 0.0030s 0.0037s 330.6 268.2

And for 64-bit:

OpenSSL 0.9.7e 25 Oct 2004
built on: Fri May 27 11:59:48 EST 2005
options:bn(64,64) md2(int) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(ptr2)
compiler: x86_64-pc-linux-gnu-gcc -fPIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -m64 -DL_ENDIAN -DTERMIO -Wall -DMD32_REG_T=int -march=k8 -O2 -pipe -fomit-frame-pointer
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
md2 1784.76k 3744.92k 5164.63k 5708.12k 5890.05k
mdc2 7256.28k 8077.85k 8304.21k 8364.71k 8385.88k
md4 22411.01k 75244.97k 206877.95k 367137.45k 475886.93k
md5 17413.99k 55832.64k 138771.20k 227383.64k 280029.87k
hmac(md5) 22347.46k 66172.69k 157067.69k 238682.79k 282099.71k
sha1 19861.55k 55092.52k 115175.68k 158165.33k 177750.02k
rmd160 13796.62k 37246.02k 74987.61k 101936.13k 114021.72k
rc4 159219.06k 167012.57k 172110.08k 173409.96k 173869.74k
des cbc 46540.77k 48923.29k 49505.28k 49701.89k 49780.05k
des ede3 18546.21k 18859.88k 18989.23k 19012.95k 19027.29k
idea cbc 47039.89k 49124.67k 50088.02k 50313.90k 50383.53k
rc2 cbc 24591.25k 25378.18k 25574.49k 25616.73k 25630.04k
rc5-32/12 cbc 111133.89k 122278.81k 125113.77k 126182.74k 126509.06k
blowfish cbc 81767.43k 87866.88k 89326.85k 89899.01k 90073.77k
cast cbc 62758.01k 66275.14k 67141.80k 67438.93k 67532.12k
aes-128 cbc 106734.07k 111689.02k 113902.85k 114442.58k 114638.85k
aes-192 cbc 96930.67k 99510.91k 101275.73k 101686.95k 101856.60k
aes-256 cbc 87376.95k 89460.65k 90853.63k 91180.37k 91310.76k
sign verify sign/s verify/s
rsa 512 bits 0.0002s 0.0000s 4714.1 56446.0
rsa 1024 bits 0.0007s 0.0000s 1513.2 24438.7
rsa 2048 bits 0.0036s 0.0001s 280.8 8944.2
rsa 4096 bits 0.0220s 0.0004s 45.6 2794.3
sign verify sign/s verify/s
dsa 512 bits 0.0001s 0.0002s 7805.7 6586.8
dsa 1024 bits 0.0003s 0.0004s 3149.7 2624.9
dsa 2048 bits 0.0009s 0.0012s 1056.6 864.5

32-bit OpenSSL pulls ahead on several of the tests (MD5, SHA-1, RC4, and DES among them), but 64-bit beats it by factors of two to three in AES throughput and in RSA and DSA operations. The first table in each listing measures the throughput of the listed digests and symmetric ciphers at several block sizes. The second and third tables, which may be more important to some, measure how quickly RSA and DSA keys of various sizes can sign and verify.
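
To put rough numbers on that comparison, the sketch below divides the 64-bit figure by the 32-bit figure for a handful of rows copied from the two listings above: the 8192-byte column for the digests and symmetric ciphers, and signatures per second for RSA and DSA. The variable and function names are illustrative only, and the selection of rows is mine; the figures themselves are the ones reported above.

```python
# Throughput in 1000s of bytes/s at the 8192-byte block size,
# copied from the two `openssl speed` listings: (32-bit, 64-bit).
THROUGHPUT_8192 = {
    "md5": (425473.37, 280029.87),
    "sha1": (294557.01, 177750.02),
    "rc4": (221640.02, 173869.74),
    "aes-128 cbc": (54790.83, 114638.85),
    "aes-256 cbc": (41842.01, 91310.76),
}

# Signing operations per second: (32-bit, 64-bit).
SIGNS_PER_SECOND = {
    "rsa 2048 bits": (87.8, 280.8),
    "dsa 1024 bits": (1052.0, 3149.7),
}


def speedup(pair):
    """Return the 64-bit result divided by the 32-bit result."""
    result_32, result_64 = pair
    return result_64 / result_32


for table in (THROUGHPUT_8192, SIGNS_PER_SECOND):
    for name, pair in table.items():
        print(f"{name:14s} {speedup(pair):.2f}x")
```

A ratio above 1.0 means the 64-bit build was faster. For these rows the AES and public-key results come out at roughly two to three times the 32-bit figures, while MD5, SHA-1, and RC4 fall below 1.0.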

Super Smack and MySQL

Super Smack is a database benchmarking utility that works with either MySQL or PostgreSQL. I chose to use MySQL, since it is more common among Web-based applications and has a larger installed base. The MySQL version used for both systems was 12.22 Distrib 4.0.24.

The benchmark command was super-smack -d mysql /usr/share/super-smack/select-key.smack 10 10000 && super-smack -d mysql /usr/share/super-smack/update-select.smack 10 10000 (that's 10 clients with 10,000 queries), and the results are given in queries per second.

                Super Smack 32-bit    Super Smack 64-bit
Select-key      17374.58 q/s          18148.10 q/s
Select_index    9333.48 q/s           9717.08 q/s
Update_index    9333.48 q/s           9717.08 q/s

The test parameters generate 200,000 MyISAM table queries, which is sufficient for testing, and identical to the settings in Tony Bourke's database benchmarking article.

The 64-bit edition was faster, but not by a significant degree.
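
As a quick check on how small the gap is, the following sketch computes the percentage gain of the 64-bit results over the 32-bit results using the figures from the table above. The names are illustrative, not part of Super Smack.

```python
# Queries per second from the Super Smack table above: (32-bit, 64-bit).
RESULTS_QPS = {
    "Select-key": (17374.58, 18148.10),
    "Select_index": (9333.48, 9717.08),
    "Update_index": (9333.48, 9717.08),
}


def percent_gain(qps_32, qps_64):
    """Percentage by which the 64-bit result exceeds the 32-bit result."""
    return (qps_64 - qps_32) / qps_32 * 100.0


for test_name, (qps_32, qps_64) in RESULTS_QPS.items():
    print(f"{test_name:13s} {percent_gain(qps_32, qps_64):+.1f}%")
```

With these figures the gain works out to between 4 and 5 percent on every row.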

Unreal Tournament 2004

Unreal Tournament 2004 comes with both 32-bit and 64-bit binaries, making it an ideal OpenGL 3D rendering and gaming performance benchmark. It works in conjunction with Nvidia's 64-bit Linux driver. The driver version for both 32-bit and 64-bit was the latest stable version at the time of testing, which was 1.0.6629-r4. UT2004 was patched to version 3339, which was also the latest at the time of testing.

I followed Andreas 'GlaDiaC' Schneider's benchmarking guide (scroll down to section 7 for the benchmarking procedure) for the UT2004 tests. I changed the number of bots to 16 to increase the CPU load, raised the test time to 120 seconds to increase the accuracy of the data, and turned the detail settings to their maximums. Screen resolution was 1024x768 and color depth was 24-bit. The results, which are the log files from the tests, are listed in frames per second. The lowest recorded framerate is the first number, the second is the average frame rate, and the third number in the series is the maximum recorded framerate.

UT2004 Build UT2004_Build_[2004-11-11_10.48]
x86 Linux
AuthenticAMD Unknown processor @ 2400 MHz
GeForce FX 5700 Ultra/AGP/SSE2/3DNOW!

ons-primeval?spectatoronly=1?numbots=16?quickstart=1?attractcam=1 -benchmark -seconds=120 -ini=default.ini -exec=../Benchmark/Stuff/botmatchexec.txt

7.255768 / 60.568241 / 141.303101 fps rand[1543912059]
Score = 58.373878

And for 64-bit:

UT2004 Build UT2004_Build_[2004-11-11_10.48]
x86-64 Linux
Unknown processor @ 2400 MHz
GeForce FX 5700 Ultra/AGP/SSE2

ons-primeval?spectatoronly=1?numbots=16?quickstart=1?attractcam=1 -benchmark -seconds=120 -ini=default.ini -exec=../Benchmark/Stuff/botmatchexec.txt

7.701819 / 43.102283 / 93.125053 fps rand[707662726]
Score = 43.058800

The 32-bit version got considerably better framerates, which provides a smoother game experience. But do you notice something missing in the system information block? The 3DNOW! multimedia extensions are not being used on the 64-bit system. I tried recompiling the entire operating environment with the 3dnow USE flag, but it still didn't register in Unreal Tournament. Hopping over to an Opteron system with a different Nvidia card, I found the same results -- no 3DNOW! extensions in the 64-bit version of UT2004.

I don't know if the absence of AMD's multimedia extensions is the cause of the lower framerates, and I don't know what part of the equation is to blame for this, but it's safe to assume that something is not as it should be with the 64-bit software, since 3DNOW! is part of the AMD64 instruction set architecture.
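
For readers who want to compare the two logs numerically, here is a minimal sketch that pulls the minimum, average, and maximum frame rates out of the summary lines shown above and works out how far the 64-bit average falls below the 32-bit one. The regular expression assumes the summary line always has the "min / avg / max fps" shape seen in these two logs; the function and variable names are hypothetical.

```python
import re

# Summary lines copied from the two benchmark logs above.
LOG_32 = "7.255768 / 60.568241 / 141.303101 fps rand[1543912059]"
LOG_64 = "7.701819 / 43.102283 / 93.125053 fps rand[707662726]"

# Matches "min / avg / max fps" at the start of a summary line.
FPS_LINE = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*/\s*([\d.]+)\s+fps")


def parse_fps(line):
    """Return (minimum, average, maximum) frame rates from a summary line."""
    match = FPS_LINE.match(line)
    if match is None:
        raise ValueError(f"not a UT2004 fps summary line: {line!r}")
    return tuple(float(value) for value in match.groups())


min_32, avg_32, max_32 = parse_fps(LOG_32)
min_64, avg_64, max_64 = parse_fps(LOG_64)

# How far the 64-bit average falls below the 32-bit average, in percent.
avg_drop = (avg_32 - avg_64) / avg_32 * 100.0
print(f"average: {avg_32:.1f} fps vs {avg_64:.1f} fps ({avg_drop:.1f}% lower)")
```

On these two runs the 64-bit average is a little under 29 percent lower, while the minimum frame rate is slightly higher on 64-bit.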

X.org compile time

You do a lot of compiling on a Gentoo Linux system, and the same can be said of FreeBSD and other operating systems that are source-based or have a Ports-like infrastructure. To test compiler speed, I ran emerge --fetchonly xorg-x11, which retrieves all of the X.org source code (a total of nine files). When it finished, I ran time emerge xorg-x11 and recorded the compile time. The table lists three figures for each build: real is the elapsed wall-clock time from start to finish; user is the CPU time spent in user mode, which is the compiler and build tools themselves; and system is the CPU time spent in the kernel on behalf of the build.

32-bit                 64-bit
26min 39sec real       32min 16sec real
21min 39sec user       22min 7sec user
4min 10sec system      9min 23sec system

The 32-bit system compiled X.org faster than its 64-bit counterpart. The real time-killer looks like system overhead. Both systems used GCC 3.4.3-r1 and Linux kernel 2.6.11-gentoo-r7. Again, I don't know if any single factor is to blame, or if there are several contributors to the inferior 64-bit performance.
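
The observation about system overhead can be checked with simple subtraction. The sketch below converts the timings from the table above to seconds and compares the extra time the 64-bit build needed in each category. It is arithmetic on the reported figures, not a profile of the build, and the names are illustrative.

```python
# Timings from the table above as (minutes, seconds).
TIMES_32 = {"real": (26, 39), "user": (21, 39), "sys": (4, 10)}
TIMES_64 = {"real": (32, 16), "user": (22, 7), "sys": (9, 23)}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to whole seconds."""
    return minutes * 60 + seconds


def extra_seconds(kind):
    """Seconds the 64-bit build spent beyond the 32-bit build for one row."""
    return to_seconds(*TIMES_64[kind]) - to_seconds(*TIMES_32[kind])


for kind in ("real", "user", "sys"):
    print(f"{kind:4s} +{extra_seconds(kind)} s on 64-bit")

# Size of the extra kernel time relative to the extra wall-clock time.
sys_share = extra_seconds("sys") / extra_seconds("real") * 100.0
print(f"extra sys time is about {sys_share:.0f}% of the extra real time")
```

The 64-bit build took 337 seconds longer in real time, of which only 28 seconds shows up as extra user time; the extra system time is 313 seconds, which is consistent with kernel-side overhead being the main difference.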

According to a benchmark test performed last year, GCC can have a profound effect on the speed of generated code, especially for AMD64 systems. FreeBSD's David O'Brien pointed out in this email regarding a previous benchmarking project that GCC compile time performance is not truly the focus of GCC development -- the performance of the compiled binary is all that matters. In our Gentoo benchmark, we're in a sense testing the speed of the code that GCC compiles. In the X.org compile test, we're also testing the speed of the compiler itself, which is not so much an indication of GCC's quality as it is of the time you will spend compiling programs. In this case, it seems that AMD64 users will endure longer compile times with this version of GCC. This is not to ignore the architecture-specific, hand-coded assembly optimizations that no doubt benefit one or both architectures in the above tests.


While everyday "Internet and email" desktop performance of a 64-bit operating system may not be much different from that of a 32-bit platform, CPU- and memory-intensive applications see significantly enhanced performance. 3D gaming performance could suffer from not-yet-perfect Nvidia drivers (there were two newer "unstable" versions of the Nvidia driver in Portage at the time of this writing) and 64-bit game binaries that are still experimental. 64-bit gaming is, after all, a new thing to PC game developers.

64-bit operating systems may not be practical for simple desktop use at this point, partially because of some of the hassles in setting them up, and partially because they offer little performance increase for most desktop applications. But the advantage of running a Web or email server on a 64-bit system is obvious when you look at the OpenSSL and MySQL results, assuming you use those technologies.

Sometimes the purpose of a benchmarking project is to show which squeaky wheels need the grease. This benchmarking project has shown that there's still a long way to go for AMD64-specific optimizations in the GNU/Linux world.