uniprof: Transparent Unikernel Performance Profiling and Debugging


Unikernels are small and fast and give Docker a run for its money, while still providing stronger isolation, says Florian Schmidt, a researcher at NEC Europe who has developed uniprof, a unikernel performance profiler that can also be used for debugging. Schmidt explained more in his presentation at the Xen Summit in Budapest in July.

Most developers think that unikernels are hard to create and debug. This is not entirely true: a unikernel is a single linked binary with a single, shared address space, which means you can use gdb. That said, developers do lack tools, such as effective profilers, that would help them create and maintain unikernels.

Enter uniprof

uniprof’s goal is to be a performance profiler that does not require any changes to the unikernel’s code. It also aims for minimal overhead while profiling, which means it can be useful even in production environments.

According to Schmidt, you may think that all you need is a stack profiler, something that would capture stack traces at regular intervals. You could then analyze them to figure out which code paths show up especially often, either because they are functions that take a long time to run, or because they are functions that are hit over and over again. This would point you to potential bottlenecks in your code.

A stack profiler for Xen already exists: xenctx, part of the Xen tool suite, is a generic introspection tool for Xen guests. Since it has an option to print a call stack, you could run it over and over again and have something like a stack profiler. In fact, this was the starting point for uniprof, says Schmidt.

However, this approach presents several problems, not least of which is that xenctx is slow and can take up to 3ms per trace. This may not seem like much, but it adds up. And very high performance is not simply a nice feature; it is a necessity. A profiler interrupts the guest all the time: you have to pause it, create a stack trace, and then unpause it. You cannot grab a stack trace while the guest is running, or you will run into race conditions when the guest modifies the stack while you are reading it. High overhead can also skew the results, because it may change the unikernel's behavior. So you need a low-overhead stack tracer if you are going to use it on production unikernels.
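A minimal sketch of that pause/sample/unpause cycle, assuming libxc is available; capture_stack_trace() is a hypothetical placeholder for the stack walk described in the next section:

```c
/* Sketch of a sampling loop: pause the guest, capture a stack trace,
 * unpause it, and wait until the next sampling point. */
#include <unistd.h>
#include <xenctrl.h>

void capture_stack_trace(xc_interface *xch, uint32_t domid); /* hypothetical */

void sample_domain(uint32_t domid, unsigned long interval_us, unsigned long samples)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);

    for (unsigned long i = 0; i < samples; i++) {
        xc_domain_pause(xch, domid);      /* stop the guest so the stack cannot change */
        capture_stack_trace(xch, domid);  /* read registers and walk the stack */
        xc_domain_unpause(xch, domid);    /* let the guest run again */
        usleep(interval_us);              /* wait for the next sampling interval */
    }

    xc_interface_close(xch);
}
```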

Making it Work

For a profiler to work, you need access to the registers to get the instruction pointer (IP), which tells you where you are in the code. Then you need the frame pointer (FP) to determine the extent of the current stack frame. Fortunately, this is easy: you can get both with the getvcpucontext() hypercall.
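A minimal sketch of that step, assuming an x86_64 guest and using libxc's xc_vcpu_getcontext() wrapper around the hypercall:

```c
/* Sketch: read the instruction pointer and frame pointer of vCPU 0
 * of an x86_64 guest via the getvcpucontext hypercall (libxc wrapper). */
#include <stdint.h>
#include <xenctrl.h>

int read_ip_fp(xc_interface *xch, uint32_t domid, uint64_t *ip, uint64_t *fp)
{
    vcpu_guest_context_any_t ctx;

    if (xc_vcpu_getcontext(xch, domid, 0 /* vCPU 0 */, &ctx) < 0)
        return -1;

    *ip = ctx.x64.user_regs.rip;  /* where the guest is currently executing */
    *fp = ctx.x64.user_regs.rbp;  /* base of the current stack frame */
    return 0;
}
```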

Then you need to access the stack memory to read the return addresses and the next FPs stored there. This is more complicated: you need to read memory belonging to the guest you are profiling and map it into the guest running the profiler. For that, you need address resolution, because the mapping functionality expects machine frame numbers rather than the guest's virtual addresses. This is a complex, multi-step process that finally yields a series of memory addresses.
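A rough sketch of such a frame-pointer walk, assuming an x86_64 guest built with frame pointers, where each frame stores the caller's FP at [fp] and the return address at [fp + 8]. It uses libxc's xc_translate_foreign_address() and xc_map_foreign_range(), the same kind of building blocks xenctx relies on; a real implementation also handles frames that straddle a page boundary:

```c
/* Sketch of a frame-pointer stack walk. Each guest-virtual address must
 * be translated to a machine frame number and mapped before it can be read. */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <xenctrl.h>

#define MAX_DEPTH 64

int walk_stack(xc_interface *xch, uint32_t domid,
               uint64_t ip, uint64_t fp, uint64_t trace[MAX_DEPTH])
{
    int depth = 0;
    trace[depth++] = ip;

    while (fp && depth < MAX_DEPTH) {
        /* Resolve the guest-virtual frame pointer to a machine frame number. */
        unsigned long mfn = xc_translate_foreign_address(xch, domid, 0, fp);
        if (!mfn)
            break;

        /* Map the page holding this frame into the profiler's address space. */
        void *page = xc_map_foreign_range(xch, domid, XC_PAGE_SIZE, PROT_READ, mfn);
        if (!page)
            break;

        uint64_t off = fp & (XC_PAGE_SIZE - 1);
        uint64_t next_fp, ret_addr;
        memcpy(&next_fp, (uint8_t *)page + off, sizeof(next_fp));       /* saved FP */
        memcpy(&ret_addr, (uint8_t *)page + off + 8, sizeof(ret_addr)); /* return address */
        munmap(page, XC_PAGE_SIZE);

        trace[depth++] = ret_addr;
        fp = next_fp;
    }
    return depth;  /* number of addresses captured */
}
```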

Finally, you need to resolve these addresses into function names to see what is actually going on. For that, you need a symbol table, which is again thankfully easy to obtain: all you have to do is extract the symbols from the ELF binary with nm.
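For example, running nm with -n and --defined-only produces lines of the form "<hex address> <type> <name>", already sorted by address. A sketch of loading that output into a table (the file format parsing and buffer sizes are illustrative, not uniprof's actual code):

```c
/* Sketch: load the output of `nm -n --defined-only unikernel.elf`
 * into an address-sorted table of code symbols. */
#include <inttypes.h>
#include <stdio.h>

struct symbol {
    uint64_t addr;
    char     name[128];
};

/* Load up to `max` text symbols from `path`; returns the number read. */
size_t load_symbols(const char *path, struct symbol *syms, size_t max)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;

    size_t n = 0;
    uint64_t addr;
    char type, name[128];

    while (n < max && fscanf(f, "%" SCNx64 " %c %127s", &addr, &type, name) == 3) {
        if (type == 'T' || type == 't') {   /* keep code (text) symbols only */
            syms[n].addr = addr;
            snprintf(syms[n].name, sizeof(syms[n].name), "%s", name);
            n++;
        }
    }

    fclose(f);
    return n;
}
```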

Now that you have all these stack traces, you have to analyze them. One tool you can use is flame graphs, which let you graphically represent the results and see the relative run time of each step in your stack trace.
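The common flame graph tooling consumes stack traces in a "folded" text format: one line per trace, frames separated by semicolons, followed by a count. A small sketch of emitting that format from a trace that has already been resolved to function names (aggregating identical lines into counts happens afterwards):

```c
/* Sketch: print one resolved stack trace in the folded format that
 * flame graph tooling consumes, e.g. "outermost;caller;innermost 1". */
#include <stdio.h>

void print_folded(char *const frames[], int depth)
{
    /* Frames are captured innermost-first; folded stacks are root-first. */
    for (int i = depth - 1; i >= 0; i--)
        printf("%s%s", frames[i], i ? ";" : "");
    printf(" 1\n");   /* count of 1; identical lines are summed later */
}
```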

Performance

Schmidt started his project by modifying xenctx. The original xenctx utility has a huge overhead, much of which can be eliminated by caching memory mappings and virtual-to-machine address translations. This reduces the delay from 3ms per trace to 40µs.
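A sketch of what such a cache might look like, keyed by guest-virtual page number (the size and layout are illustrative, not uniprof's actual data structure):

```c
/* Sketch: a small translation cache so that repeated samples do not redo
 * the expensive virtual-to-machine resolution for the same stack pages. */
#include <stdint.h>
#include <xenctrl.h>

#define CACHE_SIZE 256

struct tlb_entry { uint64_t vpage; unsigned long mfn; int valid; };
static struct tlb_entry cache[CACHE_SIZE];

unsigned long cached_translate(xc_interface *xch, uint32_t domid, uint64_t vaddr)
{
    uint64_t vpage = vaddr >> XC_PAGE_SHIFT;
    struct tlb_entry *e = &cache[vpage % CACHE_SIZE];

    if (!e->valid || e->vpage != vpage) {
        e->mfn   = xc_translate_foreign_address(xch, domid, 0, vaddr);
        e->vpage = vpage;
        e->valid = 1;
    }
    return e->mfn;
}
```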

xenctx uses a linear search to resolve symbols. To make the search faster, you can use a binary search. Alternatively, you can avoid doing the resolution altogether while tracing and do it offline, after the fact. This reduces the overhead further, to 30µs or less.
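A sketch of the binary-search lookup over an address-sorted table like the one built earlier, returning the last symbol that starts at or below the given address:

```c
/* Sketch: resolve an address to the nearest preceding symbol by binary
 * search over an address-sorted table. */
#include <stddef.h>
#include <stdint.h>

struct symbol { uint64_t addr; char name[128]; };  /* same layout as above */

const char *resolve(const struct symbol *syms, size_t n, uint64_t addr)
{
    if (n == 0 || addr < syms[0].addr)
        return "?";

    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        size_t mid = (lo + hi + 1) / 2;   /* round up so the loop terminates */
        if (syms[mid].addr <= addr)
            lo = mid;
        else
            hi = mid - 1;
    }
    return syms[lo].name;   /* last symbol starting at or below addr */
}
```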

Schmidt, however, discovered that, by adding some functionality to xenctx and eliminating other parts, he was fundamentally changing the tool, and it seemed more logical to create a new tool from scratch.

Even the original version of uniprof is about 100 times faster than xenctx. Furthermore, Xen 4.7 introduced new low-level libraries, such as libxencall and libxenforeignmemory. Using these libraries instead of the libxc library used in older versions of Xen reduces latency by a further factor of 3: the original version of uniprof took 35µs for each stack trace, while the version that uses libxencall takes only 12µs.
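For illustration, mapping a single guest page through libxenforeignmemory looks roughly like this (a sketch, assuming Xen 4.7 or newer; error handling is abbreviated):

```c
/* Sketch: map one guest page read-only via libxenforeignmemory,
 * the low-level interface introduced alongside Xen 4.7. */
#include <stdint.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>

void *map_guest_page(xenforeignmemory_handle *fmem, uint32_t domid, xen_pfn_t gfn)
{
    int err = 0;
    void *page = xenforeignmemory_map(fmem, domid, PROT_READ,
                                      1 /* one page */, &gfn, &err);
    if (err && page) {
        xenforeignmemory_unmap(fmem, page, 1);
        return NULL;
    }
    return page;
}

/* Usage outline:
 *   xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
 *   void *p = map_guest_page(fmem, domid, gfn);
 *   ... read the stack frame ...
 *   xenforeignmemory_unmap(fmem, p, 1);
 *   xenforeignmemory_close(fmem);
 */
```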

The latest version of uniprof supports both sets of libraries, just in case you are running an older version of Xen. uniprof also fully supports ARM, something xenctx doesn’t do.

You can watch the complete presentation below: