Performance Analysis in Linux (Continued): When Performance Really Matters

427

By Gabriel Krisman Bertazi, Software Engineer at Collabora.

This blog post is based on the talk I gave at the Open Source Summit North America 2017 in Los Angeles. Let me start by thanking my employer Collabora, for sponsoring my trip to LA.

Last time I wrote about Performance Assessment, I discussed how an apparently naive code snippet can hide major performance drawbacks. In that example, the issue was caused by the randomness of the conditional branch direction, triggered by our unsorted vector, which really confused the Branch Predictor inside the processor.

An important thing to mention before we start, is that performance issues arise in many forms and may have several root causes. While in this series I have focused on processor corner-cases, those are in fact a tiny sample of how thing can go wrong for performance. Many other factors matter, particularly well-thought algorithms and good hardware. Without a well-crafted algorithm, there is no compiler optimization or quick hack that can improve the situation.

In this post, I will show one more example of how easy it is to disrupt performance of a modern CPU, and also run a quick discussion on why performance matters – as well as present a few cases where it shouldn’t matter.

If you have any questions, feel free to start a discussion below in the Comments section and I will do my best to follow-up on your question.

CPU Complexity is continuously rising

Every year, new generations of CPUs and GPUs hit the market carrying an always increasing count of transistors inside their enclosures as show by the graph below, depicting the famous Moore’s law. While the metric is not perfect on itself, it is a fair indication of the steady growth of complexity inside of our integrated circuits.

transistor_count.svg.png
Figure 1: © Wgsimon. Licensed under CC-BY-SA 3.0 unported.
Much of this additional complexity in circuitry comes in the form of specialized hardware logic, whose main goal is to explore common patterns in data and code, in order to maximize a specific performance metric, like execution time or power saving. Mechanisms like Data and Instruction caches, prefetch units, processor pipelines and branch predictors are all examples of such hardware. In fact, multiple levels of data and instruction caches are so important for the performance of a system, that they are usually advertised in high caps when a new processor hits the market.

While all these mechanisms are tailored to provide good performance for the common case of programming and common data patterns, there are always cases where an oblivious programmer can end up hitting the corner case of such mechanisms, and not only write code which is unable to benefit from them, but also code which executes way worse than if there were no optimization mechanism at all.

As a general rule, compilers are increasingly great at detecting and modifying code to benefit from the CPU architecture, but there will always be cases where they won’t be able to detect bad patterns and modify the code. In those cases, there is no replacement for a capable programmer who understands how the machine is designed, and who can adjust the algorithm to benefit from its design.

When does performance really matter?

The first reaction of an inexperienced developer after learning about some of the architectural issues that affect performance, might be to start profiling everything he can get his hands on, to obtain the absolute maximum capability of his expensive new hardware. This approach is not only misleading, but an actual waste of time.

In a city that experiences traffic jams every day, there is little point in buying a faster car instead of taking the public bus. In both scenarios, you are going be stuck in the traffic for hours instead of arriving at your destination earlier. The same happens with your programs. Consider an interactive program that performs a task in background while waiting for user input, there is little point in trying to gain a few cycles by optimizing the task, since the entire system is still limited by the human input, which will always be much, much slower than the machine. In a similar sense, there is little point in trying to speed-up the boot time of a machine that almost never reboots, since the reboot time cost will be payed only rarely, when a restart is required.

In a very similar sense, the speed-up you gain by recompiling every single program in your computer with the fastest compiler optimizations possible for your machine, like some people like to do, is completely irrelevant, considering the fact that the machine will spend most of the time in an idle state, waiting for the next user input.

What actually makes a difference, and should be target of every optimization work, are cases where the workload is so intensive that gaining a few extra cycles very often will result in a real increase of the computing done in the long run. This requires, first off all, that the code being optimized is actually in the critical path of performance, which means that that part of the code is actually what is holding the rest of the system back. If that is not the case, the gain will be minimum and the effort will be wasted.

Moving back to the reboot example, in a virtualization environment, where new VMs or containers boxes need to be spawned very fast and very often to respond to new service requests, it makes a lot of sense to optimize reboot time. In that case, every microsecond saved at boot time matters to reduce to overall response of the system.

The corollary of the Ahmdal’s law states just that. It argues that there is little sense in aggressively optimizing a part of the program that executes only a few times, very quickly, instead of optimizing the part that occupies the largest part of the execution time. In another (famous) words, a gain of 10% of time in code that executes 90% of time is much better for the overall performance than a 90% speed up in code that executes only 10% of the time.

ahmdal.png

Continue reading on Collabora’s blog.

Learn more from Gabriel Krisman Bertazi at Open Source Summit Europe, as he presents “Code Detective: How to Investigate Linux Performance Issues” on Monday, October 23.