Gaining eBPF Vision: A New Way to Trace Linux Filesystem Disk Requests
A real-world use case of eBPF tracing to understand file access patterns in the Linux kernel and optimize large applications.
By Gabriel Krisman Bertazi, Software Engineer at Collabora.
When Brendan Gregg gave his Performance Analysis superpowers with Linux BPF talk during the Open Source Summit in Los Angeles last year, he wasn't messing around. Using the new eBPF tracing tools really feels like you gained some x-ray vision powers since, suddenly, opening the program's hood is no longer necessary to see in details how it is behaving internally.
I had the chance of applying these new gained powers last month, when I was asked to develop a small tool to trace when and what parts of each file was being accessed for the first-time during a program initialization. This request came with a few constraints, like the information had to be available independently of the kind of buffered I/O method being used (synchronous, aio, memory-mapping). It also should be trivial to correlate the block data with which files were being accessed and, most importantly, the tracing code should not result in a large performance impact in the observed system.
What is the right level to install our probe?
I started by investigating where exactly we'd want to place our probe:
Tracing at the block layer level
A tracer at the Block Layer would examine requests in terms of disk blocks, which don't directly correlate to filesystem files. A trace at this level gives insight on which areas of the disk are being read or write, and whether they are organized physically in a contiguous fashion. But it doesn't give you a higher level view of the system in a file basis. Other tools already exist to trace block-level accesses, such as the EBF script biosnoop and the traditional blktrace.
Tracing at the filesystem level
A tracer at the filesystem level exposes data in terms of file blocks, which can resolve to one or more blocks of data in the disk. In an example scenario, an Ext4 filesystem with 4Kib file block sizes, one system page likely corresponds to 1 physical block (in 4K disks) or 4 physical blocks in disks with 512 sector size. The tracing at the filesystem level allows us to look at the file in terms of offsets, such that we ignore disk fragmentation. Different installations might fragment the disk differently, and from an application level perspective, we shouldn't really be interested in the disk layout as much as what file blocks of data we need, in case we want to optimize by prefetching them.
Tracing at the Page Cache level
The page cache is a structure that sits in between the VFS/memory-mapping system and the filesystem layer, and is responsible for managing memory sections already read from the disk. By tracing the inclusion and removal pages in this cache we could gain the "First-access" behavior for free. When a new page is first accessed it is brought to the cache, and further accesses won't need to go to the disk. If the page is eventually dropped from the cache because it is no longer needed, a new use will need to reach the disk, and a new access entry will be logged. In our Performance investigation scenario, was the exactly functionality we were looking for.
The probe we implemented traces Page Cache Misses inside the Kernel Page Cache handling functions to identify the first time a block is requested, before the request is even submitted to the disk. Further requests to the same memory area (as long as the data is still mapped) will return a hit in the cache, which we don't care about, nor trace. This prevents our code from interfering with further accesses, severely diminishing the impact on performance our probe could have.
By tracing the page cache, we are also capable of differentiating blocks that were directly requested by the user program from blocks requested by the Read Ahead logic inside the kernel. Knowing which blocks were read ahead is very interesting information for application developers and system administrators, since it allows them to tune their systems or applications to prefetch the blocks they want in a sane manner.